Commit Graph

82876 Commits

Author SHA1 Message Date
Matthew Wilcox (Oracle)
c1401fd18f gfs2: convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()
Add support for large folios and remove some accesses to page->mapping and
page->index.

Link: https://lkml.kernel.org/r/20230612210141.730128-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:30 -07:00
Matthew Wilcox (Oracle)
d0cfcaee0a gfs2: pass a folio to __gfs2_jdata_write_folio()
Remove a couple of folio->page conversions in the callers, and two calls
to compound_head() in the function itself.  Rename it from
__gfs2_jdata_writepage() to __gfs2_jdata_write_folio().

Link: https://lkml.kernel.org/r/20230612210141.730128-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:29 -07:00
Matthew Wilcox (Oracle)
c0ba597db9 gfs2: use a folio inside gfs2_jdata_writepage()
Patch series "gfs2/buffer folio changes for 6.5", v3.

This kind of started off as a gfs2 patch series, then became entwined with
buffer heads once I realised that gfs2 was the only remaining caller of
__block_write_full_page().  For those not in the gfs2 world, the big point
of this series is that block_write_full_page() should now handle large
folios correctly.


This patch (of 14):

Replace a few implicit calls to compound_head() with one explicit one.

Link: https://lkml.kernel.org/r/20230612210141.730128-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230612210141.730128-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:29 -07:00
Liam R. Howlett
65ac132027 userfaultfd: fix regression in userfaultfd_unmap_prep()
Android reported a performance regression in the userfaultfd unmap path. 
A closer inspection on the userfaultfd_unmap_prep() change showed that a
second tree walk would be necessary in the reworked code.

Fix the regression by passing each VMA that will be unmapped through to
the userfaultfd_unmap_prep() function as they are added to the unmap list,
instead of re-walking the tree for the VMA.

Link: https://lkml.kernel.org/r/20230601015402.2819343-1-Liam.Howlett@oracle.com
Fixes: 69dbe6daf1 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:28 -07:00
Ryan Roberts
c33c794828 mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper.  This means that by default, the accesses change from a
C dereference to a READ_ONCE().  This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte.  Arch code
is deliberately not converted, as the arch code knows best.  It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----

// $ make coccicheck \
//          COCCI=ptepget.cocci \
//          SPFLAGS="--include-headers" \
//          MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)

----

Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so.  This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.

Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot.  The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get().  HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined.  Fix by continuing to do a direct dereference
when MMU=n.  This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.

Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:25 -07:00
Hugh Dickins
2b683a4ff6 mm/userfaultfd: retry if pte_offset_map() fails
Instead of worrying whether the pmd is stable, userfaultfd_must_wait()
call pte_offset_map() as before, but go back to try again if that fails.

Risk of endless loop?  It already broke out if pmd_none(), !pmd_present()
or pmd_trans_huge(), and pte_offset_map() would have cleared pmd_bad():
which leaves pmd_devmap().  Presumably pmd_devmap() is inappropriate in a
vma subject to userfaultfd (it would have been mistreated before), but add
a check just to avoid all possibility of endless loop there.

Link: https://lkml.kernel.org/r/54423f-3dff-fd8d-614a-632727cc4cfb@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:15 -07:00
Hugh Dickins
7780d04046 mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails
Simple walk_page_range() users should set ACTION_AGAIN to retry when
pte_offset_map_lock() fails.

No need to check pmd_trans_unstable(): that was precisely to avoid the
possiblity of calling pte_offset_map() on a racily removed or inserted THP
entry, but such cases are now safely handled inside it.  Likewise there is
no need to check pmd_none() or pmd_bad() before calling it.

Link: https://lkml.kernel.org/r/c77d9d10-3aad-e3ce-4896-99e91c7947f3@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: SeongJae Park <sj@kernel.org> for mm/damon part
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:13 -07:00
Hugh Dickins
26e1a0c327 mm: use pmdp_get_lockless() without surplus barrier()
Patch series "mm: allow pte_offset_map[_lock]() to fail", v2.

What is it all about?  Some mmap_lock avoidance i.e.  latency reduction. 
Initially just for the case of collapsing shmem or file pages to THPs; but
likely to be relied upon later in other contexts e.g.  freeing of empty
page tables (but that's not work I'm doing).  mmap_write_lock avoidance
when collapsing to anon THPs?  Perhaps, but again that's not work I've
done: a quick attempt was not as easy as the shmem/file case.

I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but failed to find another way, and have
heard that shmem MADV_COLLAPSE's usefulness is being limited by that
mmap_write_lock it currently requires.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

What is this preparatory series about?

The current mmap locking will not be enough to guard against that tricky
transition between pmd entry pointing to page table, and empty pmd entry,
and pmd entry pointing to huge page: pte_offset_map() will have to
validate the pmd entry for itself, returning NULL if no page table is
there.  What to do about that varies: sometimes nearby error handling
indicates just to skip it; but in many cases an ACTION_AGAIN or "goto
again" is appropriate (and if that risks an infinite loop, then there must
have been an oops, or pfn 0 mistaken for page table, before).

Given the likely extension to freeing empty page tables, I have not
limited this set of changes to a THP config; and it has been easier, and
sets a better example, if each site is given appropriate handling: even
where deeper study might prove that failure could only happen if the pmd
table were corrupted.

Several of the patches are, or include, cleanup on the way; and by the
end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
pte_offset_map_lock() then handle those original races and more.  Most
uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking
its place.


This patch (of 32):

Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
reliable result with PAE (or READ_ONCE as before without PAE); and remove
the unnecessary extra barrier()s which got left behind in its callers.

HOWEVER: Note the small print in linux/pgtable.h, where it was designed
specifically for fast GUP, and depends on interrupts being disabled for
its full guarantee: most callers which have been added (here and before)
do NOT have interrupts disabled, so there is still some need for caution.

Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:12 -07:00
Roberto Sassu
36ce9d76b0 shmem: use ramfs_kill_sb() for kill_sb method of ramfs-based tmpfs
As the ramfs-based tmpfs uses ramfs_init_fs_context() for the
init_fs_context method, which allocates fc->s_fs_info, use ramfs_kill_sb()
to free it and avoid a memory leak.

Link: https://lkml.kernel.org/r/20230607161523.2876433-1-roberto.sassu@huaweicloud.com
Fixes: c3b1b1cbf0 ("ramfs: add support for "mode=" mount option")
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:04 -07:00
Paulo Alcantara
fc1bd51d11 smb: client: fix warning in cifs_match_super()
Fix potential dereference of ERR_PTR @tlink as reported by kernel test
robot

  fs/smb/client/connect.c:2775 cifs_match_super() error: 'tlink'
  dereferencing possible ERR_PTR()

Link: https://lore.kernel.org/all/202306170124.CtQqzf0I-lkp@intel.com/
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-19 18:16:26 -05:00
Shyam Prasad N
dc765027ed cifs: print nosharesock value while dumping mount options
We print most other mount options for a mount when dumping
the mount entries. But miss out the nosharesock value.
This change will print that in addition to the other options.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-19 18:16:26 -05:00
Bharath SM
da787d5b74 SMB3: Do not send lease break acknowledgment if all file handles have been closed
In case if all existing file handles are deferred handles and if all of
them gets closed due to handle lease break then we dont need to send
lease break acknowledgment to server, because last handle close will be
considered as lease break ack.
After closing deferred handels, we check for openfile list of inode,
if its empty then we skip sending lease break ack.

Fixes: 59a556aebc ("SMB3: drop reference to cfile before sending oplock break")
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-19 18:16:26 -05:00
Jeff Layton
cded49ba36 nfs: don't report STATX_BTIME in ->getattr
NFS doesn't properly support reporting the btime in getattr (yet), but
61a968b4f0 mistakenly added it to the request_mask. This causes statx
for STATX_BTIME to report a zeroed out btime instead of properly
clearing the flag.

Cc: stable@vger.kernel.org # v6.3+
Fixes: 61a968b4f0 ("nfs: report the inode version in getattr if requested")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2214134
Reported-by: Boyang Xue <bxue@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 17:16:43 -04:00
Ryusuke Konishi
782e53d0c1 nilfs2: prevent general protection fault in nilfs_clear_dirty_page()
In a syzbot stress test that deliberately causes file system errors on
nilfs2 with a corrupted disk image, it has been reported that
nilfs_clear_dirty_page() called from nilfs_clear_dirty_pages() can cause a
general protection fault.

In nilfs_clear_dirty_pages(), when looking up dirty pages from the page
cache and calling nilfs_clear_dirty_page() for each dirty page/folio
retrieved, the back reference from the argument page to "mapping" may have
been changed to NULL (and possibly others).  It is necessary to check this
after locking the page/folio.

So, fix this issue by not calling nilfs_clear_dirty_page() on a page/folio
after locking it in nilfs_clear_dirty_pages() if the back reference
"mapping" from the page/folio is different from the "mapping" that held
the page/folio just before.

Link: https://lkml.kernel.org/r/20230612021456.3682-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+53369d11851d8f26735c@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000da4f6b05eb9bf593@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 13:19:35 -07:00
Qi Zheng
47a7c01c3e Revert "mm: shrinkers: convert shrinker_rwsem to mutex"
Patch series "revert shrinker_srcu related changes".


This patch (of 7):

This reverts commit cf2e309ebc.

Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless").  The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly.  That's why
unregister_shrinker() has become slower.

After discussion, we will try to use the refcount+RCU method [2] proposed
by Dave Chinner to continue to re-implement the lockless slab shrink.  So
revert the shrinker_mutex back to shrinker_rwsem first.

[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/

Link: https://lkml.kernel.org/r/20230609081518.3039120-1-qi.zheng@linux.dev
Link: https://lkml.kernel.org/r/20230609081518.3039120-2-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 13:19:33 -07:00
Ryusuke Konishi
679bd7ebdd nilfs2: fix buffer corruption due to concurrent device reads
As a result of analysis of a syzbot report, it turned out that in three
cases where nilfs2 allocates block device buffers directly via sb_getblk,
concurrent reads to the device can corrupt the allocated buffers.

Nilfs2 uses sb_getblk for segment summary blocks, that make up a log
header, and the super root block, that is the trailer, and when moving and
writing the second super block after fs resize.

In any of these, since the uptodate flag is not set when storing metadata
to be written in the allocated buffers, the stored metadata will be
overwritten if a device read of the same block occurs concurrently before
the write.  This causes metadata corruption and misbehavior in the log
write itself, causing warnings in nilfs_btree_assign() as reported.

Fix these issues by setting an uptodate flag on the buffer head on the
first or before modifying each buffer obtained with sb_getblk, and
clearing the flag on failure.

When setting the uptodate flag, the lock_buffer/unlock_buffer pair is used
to perform necessary exclusive control, and the buffer is filled to ensure
that uninitialized bytes are not mixed into the data read from others.  As
for buffers for segment summary blocks, they are filled incrementally, so
if the uptodate flag was unset on their allocation, set the flag and zero
fill the buffer once at that point.

Also, regarding the superblock move routine, the starting point of the
memset call to zerofill the block is incorrectly specified, which can
cause a buffer overflow on file systems with block sizes greater than
4KiB.  In addition, if the superblock is moved within a large block, it is
necessary to assume the possibility that the data in the superblock will
be destroyed by zero-filling before copying.  So fix these potential
issues as well.

Link: https://lkml.kernel.org/r/20230609035732.20426-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+31837fe952932efc8fb9@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000030000a05e981f475@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 13:19:33 -07:00
Olga Kornievskaia
c907e72f58 NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
When the client received NFS4ERR_BADSESSION, it schedules recovery
and start the state manager thread which in turn freezes the
session table and does not allow for any new requests to use the
no-longer valid session. However, it is possible that before
the state manager thread runs, a new operation would use the
released slot that received BADSESSION and was therefore not
updated its sequence number. Such re-use of the slot can lead
the application errors.

Fixes: 5c441544f0 ("NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:10:57 -04:00
Qi Zheng
7f7ab33689 NFSv4.2: fix wrong shrinker_id
Currently, the list_lru::shrinker_id corresponding to the nfs4_xattr
shrinkers is wrong:

>>> prog["nfs4_xattr_cache_lru"].shrinker_id
(int)0
>>> prog["nfs4_xattr_entry_lru"].shrinker_id
(int)0
>>> prog["nfs4_xattr_large_entry_lru"].shrinker_id
(int)0
>>> prog["nfs4_xattr_cache_shrinker"].id
(int)18
>>> prog["nfs4_xattr_entry_shrinker"].id
(int)19
>>> prog["nfs4_xattr_large_entry_shrinker"].id
(int)20

This is not what we expect, which will cause these shrinkers
not to be found in shrink_slab_memcg().

We should assign shrinker::id before calling list_lru_init_memcg(),
so that the corresponding list_lru::shrinker_id will be assigned
the correct value like below:

>>> prog["nfs4_xattr_cache_lru"].shrinker_id
(int)16
>>> prog["nfs4_xattr_entry_lru"].shrinker_id
(int)17
>>> prog["nfs4_xattr_large_entry_lru"].shrinker_id
(int)18
>>> prog["nfs4_xattr_cache_shrinker"].id
(int)16
>>> prog["nfs4_xattr_entry_shrinker"].id
(int)17
>>> prog["nfs4_xattr_large_entry_shrinker"].id
(int)18

So just do it.

Fixes: 95ad37f90c ("NFSv4.2: add client side xattr caching.")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:10:45 -04:00
Benjamin Coddington
6ad477a69a NFSv4: Clean up some shutdown loops
If a SEQUENCE call receives -EIO for a shutdown client, it will retry the
RPC call.  Instead of doing that for a shutdown client, just bail out.

Likewise, if the state manager decides to perform recovery for a shutdown
client, it will continuously retry.  As above, just bail out.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:09:25 -04:00
Benjamin Coddington
7d3e26a054 NFS: Cancel all existing RPC tasks when shutdown
Walk existing RPC tasks and cancel them with -EIO when the client is
shutdown.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:08:46 -04:00
Benjamin Coddington
d9615d166c NFS: add sysfs shutdown knob
Within each nfs_server sysfs tree, add an entry named "shutdown".  Writing
1 to this file will set the cl_shutdown bit on the rpc_clnt structs
associated with that mount.  If cl_shutdown is set, the task scheduler
immediately returns -EIO for new tasks.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:08:12 -04:00
Benjamin Coddington
f4057ffd0e NFS: add a sysfs link to the acl rpc_client
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:06:40 -04:00
Benjamin Coddington
d97c058977 NFS: add a sysfs link to the lockd rpc_client
After lockd is started, add a symlink for lockd's rpc_client under
NFS' superblock sysfs.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:06:07 -04:00
Benjamin Coddington
e13b549319 NFS: Add sysfs links to sunrpc clients for nfs_clients
For the general and state management nfs_client under each mount, create
symlinks to their respective rpc_client sysfs entries.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:04:13 -04:00
Benjamin Coddington
1c7251187d NFS: add superblock sysfs entries
Create a sysfs directory for each mount that corresponds to the mount's
nfs_server struct.  As the mount is being constructed, use the name
"server-n", but rename it to the "MAJOR:MINOR" of the mount after assigning
a device_id. The rename approach allows us to populate the mount's directory
with links to the various rpc_client objects during the mount's
construction.  The naming convention (MAJOR:MINOR) can be used to reference
a particular NFS mount's sysfs tree.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:03:48 -04:00
Benjamin Coddington
e96f9268ee NFS: Make all of /sys/fs/nfs network-namespace unique
Expand the NFS network-namespaced sysfs from /sys/fs/nfs/net down one level
into /sys/fs/nfs by moving the "net" kobject onto struct
nfs_netns_client and setting it up during network namespace init.

This prepares the way for superblock kobjects within /sys/fs/nfs that will
only be visible to matching network namespaces.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 14:59:51 -04:00
Benjamin Coddington
943aef2dbc NFS: Open-code the nfs_kset kset_create_and_add()
In preparation to make objects below /sys/fs/nfs namespace aware, we need
to define our own kobj_type for the nfs kset so that we can add the
.child_ns_type member in a following patch.  No functional change here, only
the unrolling of kset_create_and_add().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 14:59:07 -04:00
Benjamin Coddington
d5082ace6c NFS: rename nfs_client_kobj to nfs_net_kobj
Match the variable names to the sysfs structure.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 14:59:07 -04:00
Benjamin Coddington
8b18a2edec NFS: rename nfs_client_kset to nfs_kset
Be brief and match the subsystem name.  There's no need to distinguish this
kset variable from the server.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 14:56:45 -04:00
Filipe Manana
8a4a0b2a3e btrfs: fix race between quota disable and relocation
If we disable quotas while we have a relocation of a metadata block group
that has extents belonging to the quota root, we can cause the relocation
to fail with -ENOENT. This is because relocation builds backref nodes for
extents of the quota root and later needs to walk the backrefs and access
the quota root - however if in between a task disables quotas, it results
in deleting the quota root from the root tree (with btrfs_del_root(),
called from btrfs_quota_disable().

This can be sporadically triggered by test case btrfs/255 from fstests:

  $ ./check btrfs/255
  FSTYP         -- btrfs
  PLATFORM      -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023
  MKFS_OPTIONS  -- /dev/sdc
  MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

  btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg)
  - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad)
      --- tests/btrfs/255.out	2023-03-02 21:47:53.876609426 +0000
      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad	2023-06-16 10:20:39.267563212 +0100
      @@ -1,2 +1,4 @@
       QA output created by 255
      +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory
      +There may be more info in syslog - try dmesg | tail
       Silence is golden
      ...
      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad'  to see the entire diff)
  Ran: btrfs/255
  Failures: btrfs/255
  Failed 1 of 1 tests

To fix this make the quota disable operation take the cleaner mutex, as
relocation of a block group also takes this mutex. This is also what we
do when deleting a subvolume/snapshot, we take the cleaner mutex in the
cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root()
at btrfs_drop_snapshot() while under the protection of the cleaner mutex.

Fixes: bed92eae26 ("Btrfs: qgroup implementation and prototypes")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 20:29:25 +02:00
Filipe Manana
08eb2ad9db btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots
Add a comment to struct btrfs_fs_info::dirty_cowonly_roots to mention
that struct btrfs_fs_info::trans_lock is the lock that protects that
list.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 20:14:44 +02:00
Filipe Manana
babebf023e btrfs: fix race when deleting free space root from the dirty cow roots list
When deleting the free space tree we are deleting the free space root
from the list fs_info->dirty_cowonly_roots without taking the lock that
protects it, which is struct btrfs_fs_info::trans_lock.
This unsynchronized list manipulation may cause chaos if there's another
concurrent manipulation of this list, such as when adding a root to it
with ctree.c:add_root_to_dirty_list().

This can result in all sorts of weird failures caused by a race, such as
the following crash:

  [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
  [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
  [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs]
  [337571.279928] Code: 85 38 06 00 (...)
  [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206
  [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000
  [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070
  [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b
  [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600
  [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48
  [337571.281723] FS:  00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000
  [337571.281950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0
  [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [337571.282874] Call Trace:
  [337571.283101]  <TASK>
  [337571.283327]  ? __die_body+0x1b/0x60
  [337571.283570]  ? die_addr+0x39/0x60
  [337571.283796]  ? exc_general_protection+0x22e/0x430
  [337571.284022]  ? asm_exc_general_protection+0x22/0x30
  [337571.284251]  ? commit_cowonly_roots+0x11f/0x250 [btrfs]
  [337571.284531]  btrfs_commit_transaction+0x42e/0xf90 [btrfs]
  [337571.284803]  ? _raw_spin_unlock+0x15/0x30
  [337571.285031]  ? release_extent_buffer+0x103/0x130 [btrfs]
  [337571.285305]  reset_balance_state+0x152/0x1b0 [btrfs]
  [337571.285578]  btrfs_balance+0xa50/0x11e0 [btrfs]
  [337571.285864]  ? __kmem_cache_alloc_node+0x14a/0x410
  [337571.286086]  btrfs_ioctl+0x249a/0x3320 [btrfs]
  [337571.286358]  ? mod_objcg_state+0xd2/0x360
  [337571.286577]  ? refill_obj_stock+0xb0/0x160
  [337571.286798]  ? seq_release+0x25/0x30
  [337571.287016]  ? __rseq_handle_notify_resume+0x3ba/0x4b0
  [337571.287235]  ? percpu_counter_add_batch+0x2e/0xa0
  [337571.287455]  ? __x64_sys_ioctl+0x88/0xc0
  [337571.287675]  __x64_sys_ioctl+0x88/0xc0
  [337571.287901]  do_syscall_64+0x38/0x90
  [337571.288126]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [337571.288352] RIP: 0033:0x7f478aaffe9b

So fix this by locking struct btrfs_fs_info::trans_lock before deleting
the free space root from that list.

Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 20:14:24 +02:00
Filipe Manana
b31cb5a6eb btrfs: fix race when deleting quota root from the dirty cow roots list
When disabling quotas we are deleting the quota root from the list
fs_info->dirty_cowonly_roots without taking the lock that protects it,
which is struct btrfs_fs_info::trans_lock. This unsynchronized list
manipulation may cause chaos if there's another concurrent manipulation
of this list, such as when adding a root to it with
ctree.c:add_root_to_dirty_list().

This can result in all sorts of weird failures caused by a race, such as
the following crash:

  [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
  [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
  [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs]
  [337571.279928] Code: 85 38 06 00 (...)
  [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206
  [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000
  [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070
  [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b
  [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600
  [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48
  [337571.281723] FS:  00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000
  [337571.281950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0
  [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [337571.282874] Call Trace:
  [337571.283101]  <TASK>
  [337571.283327]  ? __die_body+0x1b/0x60
  [337571.283570]  ? die_addr+0x39/0x60
  [337571.283796]  ? exc_general_protection+0x22e/0x430
  [337571.284022]  ? asm_exc_general_protection+0x22/0x30
  [337571.284251]  ? commit_cowonly_roots+0x11f/0x250 [btrfs]
  [337571.284531]  btrfs_commit_transaction+0x42e/0xf90 [btrfs]
  [337571.284803]  ? _raw_spin_unlock+0x15/0x30
  [337571.285031]  ? release_extent_buffer+0x103/0x130 [btrfs]
  [337571.285305]  reset_balance_state+0x152/0x1b0 [btrfs]
  [337571.285578]  btrfs_balance+0xa50/0x11e0 [btrfs]
  [337571.285864]  ? __kmem_cache_alloc_node+0x14a/0x410
  [337571.286086]  btrfs_ioctl+0x249a/0x3320 [btrfs]
  [337571.286358]  ? mod_objcg_state+0xd2/0x360
  [337571.286577]  ? refill_obj_stock+0xb0/0x160
  [337571.286798]  ? seq_release+0x25/0x30
  [337571.287016]  ? __rseq_handle_notify_resume+0x3ba/0x4b0
  [337571.287235]  ? percpu_counter_add_batch+0x2e/0xa0
  [337571.287455]  ? __x64_sys_ioctl+0x88/0xc0
  [337571.287675]  __x64_sys_ioctl+0x88/0xc0
  [337571.287901]  do_syscall_64+0x38/0x90
  [337571.288126]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [337571.288352] RIP: 0033:0x7f478aaffe9b

So fix this by locking struct btrfs_fs_info::trans_lock before deleting
the quota root from that list.

Fixes: bed92eae26 ("Btrfs: qgroup implementation and prototypes")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 20:07:41 +02:00
Naohiro Aota
6442550027 btrfs: tracepoints: also show actual number of the outstanding extents
The btrfs_inode_mod_outstanding_extents trace event only shows the modified
number to the number of outstanding extents. It would be helpful if we can
see the resulting extent number as well.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 19:47:58 +02:00
Chuck Lever
c8407f2e56 NFS: Add an "xprtsec=" NFS mount option
After some discussion, we decided that controlling transport layer
security policy should be separate from the setting for the user
authentication flavor. To accomplish this, add a new NFS mount
option to select a transport layer security policy for RPC
operations associated with the mount point.

  xprtsec=none     - Transport layer security is forced off.

  xprtsec=tls      - Establish an encryption-only TLS session. If
                     the initial handshake fails, the mount fails.
                     If TLS is not available on a reconnect, drop
                     the connection and try again.

  xprtsec=mtls     - Both sides authenticate and an encrypted
                     session is created. If the initial handshake
                     fails, the mount fails. If TLS is not available
                     on a reconnect, drop the connection and try
                     again.

To support client peer authentication (mtls), the handshake daemon
will have configurable default authentication material (certificate
or pre-shared key). In the future, mount options can be added that
can provide this material on a per-mount basis.

Updates to mount.nfs (to support xprtsec=auto) and nfs(5) will be
sent under separate cover.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:30:17 -04:00
Chuck Lever
6c0a8c5fcf NFS: Have struct nfs_client carry a TLS policy field
The new field is used to match struct nfs_clients that have the same
TLS policy setting.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:29:23 -04:00
Amir Goldstein
bc2473c90f
ovl: enable fsnotify events on underlying real files
Overlayfs creates the real underlying files with fake f_path, whose
f_inode is on the underlying fs and f_path on overlayfs.

Those real files were open with FMODE_NONOTIFY, because fsnotify code was
not prapared to handle fsnotify hooks on files with fake path correctly
and fanotify would report unexpected event->fd with fake overlayfs path,
when the underlying fs was being watched.

Teach fsnotify to handle events on the real files, and do not set real
files to FMODE_NONOTIFY to allow operations on real file (e.g. open,
access, modify, close) to generate async and permission events.

Because fsnotify does not have notifications on address space
operations, we do not need to worry about ->vm_file not reporting
events to a watched overlayfs when users are accessing a mapped
overlayfs file.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230615112229.2143178-6-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19 18:18:04 +02:00
Amir Goldstein
62d53c4a1d
fs: use backing_file container for internal files with "fake" f_path
Overlayfs uses open_with_fake_path() to allocate internal kernel files,
with a "fake" path - whose f_path is not on the same fs as f_inode.

Allocate a container struct backing_file for those internal files, that
is used to hold the "fake" ovl path along with the real path.

backing_file_real_path() can be used to access the stored real path.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230615112229.2143178-5-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19 18:16:38 +02:00
Chuck Lever
9e8ab85a7e NFS: Improvements for fs_context-related tracepoints
Add some missing observability to the fs_context tracepoints
added by commit 33ce83ef0b ("NFS: Replace fs_context-related
dprintk() call sites with tracepoints").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:15:23 -04:00
Amir Goldstein
8a05a8c31d
fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers
Use a common helper init_file() instead of __alloc_file() for
alloc_empty_file*() helpers and improrve the documentation.

This is needed for a follow up patch that allocates a backing_file
container.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230615112229.2143178-4-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19 18:12:04 +02:00
Amir Goldstein
cbb0b9d4bb
fs: use a helper for opening kernel internal files
cachefiles uses kernel_open_tmpfile() to open kernel internal tmpfile
without accounting for nr_files.

cachefiles uses open_with_fake_path() for the same reason without the
need for a fake path.

Fork open_with_fake_path() to kernel_file_open() which only does the
noaccount part and use it in cachefiles.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230615112229.2143178-3-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19 18:11:58 +02:00
Anna Schumaker
86e2e1f6d9 NFSv4.2: SETXATTR should update ctime
Otherwise, `stat` will report a stale value to users.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:10:48 -04:00
Anna Schumaker
64edd55d0f NFSv4.2: Clean up xattr size macros
Fold them into the other NFS v4.2 operations in the right spots and
adjust spacing to keep the same style.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:09:50 -04:00
Amir Goldstein
d56e0ddb8f
fs: rename {vfs,kernel}_tmpfile_open()
Overlayfs and cachefiles use vfs_open_tmpfile() to open a tmpfile
without accounting for nr_files.

Rename this helper to kernel_tmpfile_open() to better reflect this
helper is used for kernel internal users.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230615112229.2143178-2-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19 18:09:09 +02:00
Anna Schumaker
d594097367 NFSv4.2: Clean up nfs4_xdr_dec_*xattr() functions
I add commends above each function to match the style of the other
nfs4_xdr_dec_*() functions. I also remove the unnecessary #ifdef
CONFIG_NFS_V4_2 that was added around this code, since we are already in
a v4.2-only file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:09:07 -04:00
Anna Schumaker
31f1bd8f89 NFSv4.2: Clean up: Move nfs4_xdr_enc_*xattr() functions
They should be in the nfs4_xdr_enc_*() section, and not at the bottom of
the file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:08:40 -04:00
Anna Schumaker
04b4c9fb07 NFSv4.2: Clean up: move decode_*xattr() functions
Move them out of the encode_*() section and into the decode_*() section
where it belongs.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:08:20 -04:00
Anna Schumaker
fd42ba8223 NFSv4.2: Clean up: Move the encode_copy_commit() function
Move the function to be with the other encode_*() functions, instead of
in the middle of the nfs4_xdr_enc_*() section.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:07:54 -04:00
Jeff Layton
c9e561c475 btrfs: update i_version in update_dev_time
When updating the ctime, we also want to update i_version.

This is just something I noticed by inspection. There is probably no way
to test this today unless you can somehow get to this inode via nfsd.
Still, I think it's the right thing to do for consistency's sake.

David Sterba's comment: I don't see anything wrong with setting the
iversion bit, however I also don't see where this would be useful.
Agreed with the consistency, otherwise the time is updated when device
super block is wiped or a device initialized, both are big events so
missing that due to lack of iversion update seems unlikely. I'll add it
to the queue, thanks.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 17:38:50 +02:00
Ben Dooks
e794203e9d btrfs: make btrfs_compressed_bioset static
The 'btrfs_compressed_bioset' struct isn't exported outside of the
fs/btrfs/compression.c file, so make it static to fix the following
sparse warning:

fs/btrfs/compression.c:40:16: warning: symbol 'btrfs_compressed_bioset' was not declared. Should it be static?

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 17:01:44 +02:00
Vishal Moola (Oracle)
819da022dd afs: Fix waiting for writeback then skipping folio
Commit acc8d8588c converted afs_writepages_region() to write back a
folio batch. The function waits for writeback to a folio, but then
proceeds to the rest of the batch without trying to write that folio
again. This patch fixes has it attempt to write the folio again.

[DH: Also remove an 'else' that adding a goto makes redundant]

Fixes: acc8d8588c ("afs: convert afs_writepages_region() to use filemap_get_folios_tag()")
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20230607204120.89416-2-vishal.moola@gmail.com/
2023-06-19 14:30:58 +01:00
Vishal Moola (Oracle)
a2b6f2ab3e afs: Fix dangling folio ref counts in writeback
Commit acc8d8588c converted afs_writepages_region() to write back a
folio batch. If writeback needs rescheduling, the function exits without
dropping the references to the folios in fbatch. This patch fixes that.

[DH: Moved the added line before the _leave()]

Fixes: acc8d8588c ("afs: convert afs_writepages_region() to use filemap_get_folios_tag()")
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20230607204120.89416-1-vishal.moola@gmail.com/
2023-06-19 14:30:48 +01:00
Matt Corallo
160fe8f6fd btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
Callers of `btrfs_reduce_alloc_profile` expect it to return exactly
one allocation profile flag, and failing to do so may ultimately
result in a WARN_ON and remount-ro when allocating new blocks, like
the below transaction abort on 6.1.

`btrfs_reduce_alloc_profile` has two ways of determining the profile,
first it checks if a conversion balance is currently running and
uses the profile we're converting to. If no balance is currently
running, it returns the max-redundancy profile which at least one
block in the selected block group has.

This works by simply checking each known allocation profile bit in
redundancy order. However, `btrfs_reduce_alloc_profile` has not been
updated as new flags have been added - first with the `DUP` profile
and later with the RAID1C34 profiles.

Because of the way it checks, if we have blocks with different
profiles and at least one is known, that profile will be selected.
However, if none are known we may return a flag set with multiple
allocation profiles set.

This is currently only possible when a balance from one of the three
unhandled profiles to another of the unhandled profiles is canceled
after allocating at least one block using the new profile.

In that case, a transaction abort like the below will occur and the
filesystem will need to be mounted with -o skip_balance to get it
mounted rw again (but the balance cannot be resumed without a
similar abort).

  [770.648] ------------[ cut here ]------------
  [770.648] BTRFS: Transaction aborted (error -22)
  [770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
  [770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G        W 6.1.0-0.deb11.7-powerpc64le #1  Debian 6.1.20-2~bpo11+1a~test
  [770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
  [770.648] NIP:  c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
  [770.648] REGS: c000200089afe9a0 TRAP: 0700   Tainted: G        W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
  [770.648] MSR:  9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 28848282  XER: 20040000
  [770.648] CFAR: c000000000135110 IRQMASK: 0
	    GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
	    GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
	    GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
	    GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
	    GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
	    GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
	    GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
	    GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
  [770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
  [770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
  [770.648] Call Trace:
  [770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
  [770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
  [770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
  [770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
  [770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
  [770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
  [770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
  [770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
  [770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
  [770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
  [770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
  [770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
  [770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
  [770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
  [770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
  [770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
  [770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
  [770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
  [770.648] --- interrupt: c00 at 0x7fff94126800
  [770.648] NIP:  00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
  [770.648] REGS: c000200089affe80 TRAP: 0c00   Tainted: G        W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
  [770.648] MSR:  900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE>  CR: 24002848  XER: 00000000
  [770.648] IRQMASK: 0
	    GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
	    GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
	    GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
	    GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
	    GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
	    GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
	    GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
	    GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
  [770.648] NIP [00007fff94126800] 0x7fff94126800
  [770.648] LR [0000000107e0b594] 0x107e0b594
  [770.648] --- interrupt: c00
  [770.648] Instruction dump:
  [770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
  [770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
  [770.648] ---[ end trace 0000000000000000 ]---
  [770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
  [770.648] BTRFS info (device dm-2: state EA): forced readonly
  [770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
  [770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
  [770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
  [770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown

Fixes: 47e6f7423b ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087 ("btrfs: add support for 4-copy replication (raid1c4)")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Matt Corallo <blnxfsl@bluematt.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:40 +02:00
Qu Wenruo
81db6ae842 btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers
Since the scrub rework introduced by commit 2af2aaf982 ("btrfs:
scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
and later commits, scrub only needs one single workqueue,
fs_info::scrub_worker.

That scrub_wr_completion_workers is initially to handle the delay work
after write bios finished.  But the new scrub code goes submit-and-wait
for write bios, thus all the work are done inside the scrub_worker.

The last user of fs_info::scrub_wr_completion_workers is removed in
commit 16f9399349 ("btrfs: scrub: remove the old writeback
infrastructure"), so we can safely remove the workqueue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:40 +02:00
Qu Wenruo
c2bbc0bab0 btrfs: scrub: remove scrub_ctx::csum_list member
Since the rework of scrub introduced by commit 2af2aaf982 ("btrfs:
scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
and later commits, scrub no longer keeps its data checksum inside sctx.

Instead we have scrub_stripe::csums for the checksum of the stripe.
Thus we can remove the unused scrub_ctx::csum_list member.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:40 +02:00
Filipe Manana
6822b3f708 btrfs: do not BUG_ON after failure to migrate space during truncation
During truncation we reserve 2 metadata units when starting a transaction
(reserved space goes to fs_info->trans_block_rsv) and then attempt to
migrate 1 unit (min_size bytes) from fs_info->trans_block_rsv into the
local block reserve. If we ever fail we trigger a BUG_ON(), which should
never happen, because we reserved 2 units. However if we happen to fail
for some reason, we don't need to be so dire and hit a BUG_ON(), we can
just error out the truncation and, since this is highly unexpected,
surround the error check with WARN_ON(), to get an informative stack
trace and tag the branh as 'unlikely'. Also make the 'min_size' variable
const, since it's not supposed to ever change and any accidental change
could possibly make the space migration not so unlikely to fail.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
df9f278239 btrfs: do not BUG_ON on failure to get dir index for new snapshot
During the transaction commit path, at create_pending_snapshot(), there
is no need to BUG_ON() in case we fail to get a dir index for the snapshot
in the parent directory. This should fail very rarely because the parent
inode should be loaded in memory already, with the respective delayed
inode created and the parent inode's index_cnt field already initialized.

However if it fails, it may be -ENOMEM like the comment at
create_pending_snapshot() says or any error returned by
btrfs_search_slot() through btrfs_set_inode_index_count(), which can be
pretty much anything such as -EIO or -EUCLEAN for example. So the comment
is not correct when it says it can only be -ENOMEM.

However doing a BUG_ON() here is overkill, since we can instead abort
the transaction and return the error. Note that any error returned by
create_pending_snapshot() will eventually result in a transaction
abort at cleanup_transaction(), called from btrfs_commit_transaction(),
but we can explicitly abort the transaction at this point instead so that
we get a stack trace to tell us that the call to btrfs_set_inode_index()
failed.

So just abort the transaction and return in case btrfs_set_inode_index()
returned an error at create_pending_snapshot().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
6f3eb72a1f btrfs: send: do not BUG_ON() on unexpected symlink data extent
There's really no need to BUG_ON() if we find a symlink with an extent
that is not inline or is compressed. We can just make send fail with
an error (-EUCLEAN) and log an informative error message, so just do
that instead of BUG_ON().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
fc4026e26b btrfs: do not BUG_ON() when dropping inode items from log root
When dropping inode items from a log tree at drop_inode_items(), we this
BUG_ON() on the result of btrfs_search_slot() because we don't expect an
exact match since having a key with an offset of (u64)-1 is unexpected.
That is generally true, but for dir index keys for example, we can get a
key with such an offset value, even though it's very unlikely and it would
take ages to increase the sequence counter for a dir index up to (u64)-1.
We can deal with an exact match, we just have to delete the key at that
slot, so there is really no need to BUG_ON(), error out or trigger any
warning. So remove the BUG_ON().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
7569141e8f btrfs: replace BUG_ON() at split_item() with proper error handling
There's no need to BUG_ON() at split_item() if the leaf does not have
enough free space for the new item, we can just return -ENOSPC since
the caller can deal with errors from split_item(). Also, as this is a
very unlikely condition to happen, because the caller has invoked
setup_leaf_for_split() before calling split_item(), surround the
condition with a WARN_ON() which makes it easier to notice this
unexpected condition and tags the if branch with 'unlikely' as well.

I've actually once hit this BUG_ON() with some incorrect code changes
I had, which was very inconvenient as it required rebooting the VM.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
751a27615d btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()
At btrfs_del_ptr(), instead of doing a BUG_ON() in case we fail to record
tree mod log operations, do a transaction abort and return the error to
the callers. There's really no need for the BUG_ON() as we can release all
resources in the context of all callers, and we have to abort because other
future tree searches that use the tree mod log (btrfs_search_old_slot())
may get inconsistent results if other operations modify the tree after
that failure and before the tree mod log based search.

This implies btrfs_del_ptr() return an int instead of void, and making all
callers check for returned errors.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
50b5d1fc41 btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()
At insert_ptr(), instead of doing a BUG_ON() in case we fail to record
tree mod log operations, do a transaction abort and return the error to
the callers. There's really no need for the BUG_ON() as we can release all
resources in the context of all callers, and we have to abort because other
future tree searches that use the tree mod log (btrfs_search_old_slot())
may get inconsistent results if other operations modify the tree after
that failure and before the tree mod log based search.

This implies making insert_ptr() return an int instead of void, and making
all callers check for returned errors.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
f61aa7ba08 btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()
At insert_new_root(), instead of doing a BUG_ON() in case we fail to
record the tree mod log operation, just return the error to the callers
after releasing the allocated tree block. At this point we haven't made
any changes to the b+tree, so we haven't left it in an inconsistent state
and therefore have no need to abort the transaction. All we need to do is
to unlock and free the extent buffer we just allocated with the purpose
of making it the new root.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
11d6ae0355 btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()
At push_nodes_for_insert(), instead of doing a BUG_ON() in case we fail to
record tree mod log operations, do a transaction abort and return the
error to the caller. There's really no need for the BUG_ON() as we can
release all resources in this context, and we have to abort because other
future tree searches that use the tree mod log (btrfs_search_old_slot())
may get inconsistent results if other operations modify the tree after
that failure and before the tree mod log based search.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
eced687e22 btrfs: abort transaction at update_ref_for_cow() when ref count is zero
At update_ref_for_cow() we are calling btrfs_handle_fs_error() if we find
that the extent buffer has an unexpected ref count of zero, however we can
simply use btrfs_abort_transaction(), which achieves the same purposes: to
turn the fs to error state, abort the current transaction and turn the fs
to RO mode as well. Besides that, btrfs_abort_transaction() also prints a
stack trace which makes it more useful.

Also, as this is a very unexpected situation, indicating a serious
corruption/inconsistency, tag the if branch as 'unlikely', set the error
code to -EUCLEAN instead of -EROFS, and log an explicit message.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Filipe Manana
725026ed59 btrfs: abort transaction at balance_level() when left child is missing
At balance_level() we are calling btrfs_handle_fs_error() when the middle
child only has 1 item and the left child is missing, however we can simply
use btrfs_abort_transaction(), which achieves the same purposes: to turn
the fs to error state, abort the current transaction and turn the fs to
RO mode. Besides that, btrfs_abort_transaction() also prints a stack trace
which makes it more useful.

Also, as this is a highly unexpected case and it's about a b+tree
inconsistency, change the error code from -EROFS to -EUCLEAN, tag the if
branch as 'unlikely' and log an explicit error message.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Filipe Manana
87b8e9d06e btrfs: avoid unnecessarily setting the fs to RO and error state at balance_level()
At balance_level(), when trying to promote a child node to a root node, if
we fail to read the child we call btrfs_handle_fs_error(), which turns the
fs to RO mode and sets it to error state as well, causing any ongoing
transaction to abort. This however is not necessary because at that point
we have not made any change yet at balance_level(), so any error reading
the child node does not leaves us in any inconsistent state. Therefore we
can just return the error to the caller.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Filipe Manana
daefe4d435 btrfs: rename enospc label to out at balance_level()
At balance_level() we have this 'enospc' label where we jump to in case
we get an error at several places. However that error is certainly not
-ENOSPC in call cases, it can be -EIO or -ENOMEM when reading a child
extent buffer for example, or -ENOMEM when trying to record tree mod log
operations. So to make this less confusing, rename the label to 'out'.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Filipe Manana
39020d8abc btrfs: do not BUG_ON() on tree mod log failure at balance_level()
At balance_level(), instead of doing a BUG_ON() in case we fail to record
tree mod log operations, do a transaction abort and return the error to
the callers. There's really no need for the BUG_ON() as we can release
all resources in this context, and we have to abort because other future
tree searches that use the tree mod log (btrfs_search_old_slot()) may get
inconsistent results if other operations modify the tree after that
failure and before the tree mod log based search.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Filipe Manana
40b0a74938 btrfs: do not BUG_ON() on tree mod log failure at __btrfs_cow_block()
At __btrfs_cow_block(), instead of doing a BUG_ON() in case we fail to
record a tree mod log root insertion operation, do a transaction abort
instead. There's really no need for the BUG_ON(), we can properly
release all resources in this context and turn the filesystem to RO mode
and in an error state instead.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Filipe Manana
8793ed87b3 btrfs: avoid tree mod log ENOMEM failures when we don't need to log
When logging tree mod log operations we start by checking, in a lockless
manner, if we need to log - if we don't, we just return and do nothing,
otherwise we will allocate one or more tree mod log operations and then
check again if we need to log. This second check will take the tree mod
log lock in write mode if we need to log, otherwise it will do nothing
and we just free the allocated memory and return success.

We can improve on this by not returning an error in case the memory
allocations fail, unless the second check tells us that we actually need
to log. That is, if we fail to allocate memory and the second check tells
use that we don't need to log, we can just return success and avoid
returning -ENOMEM to the caller. Currently tree mod log failures are
dealt with either a BUG_ON() or a transaction abort, as tree mod log
operations are logged in code paths that modify a b+tree.

So just avoid failing with -ENOMEM if we fail to allocate a tree mod log
operation unless we actually need to log the operations, that is, if
tree_mod_dont_log() returns true.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Filipe Manana
ede600e497 btrfs: fix extent buffer leak after tree mod log failure at split_node()
At split_node(), if we fail to log the tree mod log copy operation, we
return without unlocking the split extent buffer we just allocated and
without decrementing the reference we own on it. Fix this by unlocking
it and decrementing the ref count before returning.

Fixes: 5de865eebb ("Btrfs: fix tree mod logging")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Filipe Manana
d09c51521f btrfs: add missing error handling when logging operation while COWing extent buffer
When COWing an extent buffer that is not the root node, we need to log in
the tree mod log that we replaced a pointer in the parent node, otherwise
a tree mod log user doing a search on the b+tree can return incorrect
results (that miss something). We are doing the call to
btrfs_tree_mod_log_insert_key() but we totally ignore its return value.

So fix this by adding the missing error handling, resulting in a
transaction abort and freeing the COWed extent buffer.

Fixes: f230475e62 ("Btrfs: put all block modifications into the tree mod log")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Christoph Hellwig
f02c75e630 btrfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Do that for btrfs so that noop_direct_IO can eventually be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Filipe Manana
aadb164bdd btrfs: update documentation for a block group's bg_list member
Currently we are only documenting two uses of the bg_list member of a
block group, but there two more:

1) To track deleted block groups for discard purposes, introduced in
   commit e33e17ee10 ("btrfs: add missing discards when unpinning
   extents with -o discard");

2) To track block groups for automatic reclaim, introduced more recently
   by commit 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")

So document those two other use cases.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Naohiro Aota
7e27180994 btrfs: reinsert BGs failed to reclaim
The reclaim process can temporarily fail. For example, if the space is
getting tight, it fails to make the block group read-only. If there are no
further writes on that block group, the block group will never get back to
the reclaim list, and the BG never gets reclaimed. In a certain workload,
we can leave many such block groups never reclaimed.

So, let's get it back to the list and give it a chance to be reclaimed.

Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Naohiro Aota
93463ff7b5 btrfs: bail out reclaim process if filesystem is read-only
When a filesystem is read-only, we cannot reclaim a block group as it
cannot rewrite the data. Just bail out in that case.

Note that it can drop block groups in this case. As we did
sb_start_write(), read-only filesystem means we got a fatal error and
forced read-only. There is no chance to reclaim them again.

Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Naohiro Aota
a9f189716c btrfs: move out now unused BG from the reclaim list
An unused block group is easy to remove to free up space and should be
reclaimed fast. Such block group can often already be a target of the
reclaim process. As we check list_empty(&bg->bg_list), we keep it in the
reclaim list. That block group is never reclaimed until the file system
is filled e.g. up to 75%.

Instead, we can move unused block group to the unused list and delete it
fast.

Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Naohiro Aota
3ed01616ba btrfs: delete unused BGs while reclaiming BGs
The reclaiming process only starts after the filesystem volumes are
allocated to a certain level (75% by default). Thus, the list of
reclaiming target block groups can build up so huge at the time the
reclaim process kicks in. On a test run, there were over 1000 BGs in the
reclaim list.

As the reclaim involves rewriting the data, it takes really long time to
reclaim the BGs. While the reclaim is running, btrfs_delete_unused_bgs()
won't proceed because the reclaim side is holding
fs_info->reclaim_bgs_lock. As a result, we will have a large number of
unused BGs kept in the unused list. On my test run, I got 1057 unused BGs.

Since deleting a block group is relatively easy and fast work, we can call
btrfs_delete_unused_bgs() while it reclaims BGs, to avoid building up
unused BGs.

Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
0d394cca84 btrfs: use btrfs_finish_ordered_extent to complete buffered writes
Use the btrfs_finish_ordered_extent helper to complete compressed writes
using the bbio->ordered pointer instead of requiring an rbtree lookup
in the otherwise equivalent btrfs_mark_ordered_io_finished called from
btrfs_writepage_endio_finish_ordered.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
b41b6f6937 btrfs: use btrfs_finish_ordered_extent to complete direct writes
Use the btrfs_finish_ordered_extent helper to complete compressed writes
using the bbio->ordered pointer instead of requiring an rbtree lookup
in the otherwise equivalent btrfs_mark_ordered_io_finished called from
btrfs_writepage_endio_finish_ordered.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
7dd4395490 btrfs: use btrfs_finish_ordered_extent to complete compressed writes
Use the btrfs_finish_ordered_extent helper to complete compressed writes
using the bbio->ordered pointer instead of requiring an rbtree lookup
in the otherwise equivalent btrfs_mark_ordered_io_finished called from
btrfs_writepage_endio_finish_ordered.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
4ba8223d3d btrfs: open code end_extent_writepage in end_bio_extent_writepage
This prepares for switching to more efficient ordered_extent processing
and already removes the forth and back conversion from len to end back to
len.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
122e9ede53 btrfs: add a btrfs_finish_ordered_extent helper
Add a helper to complete an ordered_extent without first doing a lookup.
The tracepoint cannot use the ordered_extent class as we also want to
print the range.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
2d6f107ea6 btrfs: factor out a btrfs_queue_ordered_fn helper
Factor out a helper to queue up an ordered_extent completion in a work
queue.  This new helper will later be used complete an ordered_extent
without first doing a lookup.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
53df25869a btrfs: factor out a can_finish_ordered_extent helper
Factor out a helper from btrfs_mark_ordered_io_finished that does the
actual per-ordered_extent work to check if we want to schedule an I/O
completion.  This new helper will later be used complete an
ordered_extent without first doing a lookup.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
c59360f61a btrfs: use bbio->ordered in btrfs_csum_one_bio
Use the ordered_extent pointer in the btrfs_bio instead of looking it
up manually.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
ec63b84d46 btrfs: add an ordered_extent pointer to struct btrfs_bio
Add a pointer to the ordered_extent to the existing union in struct
btrfs_bio, so all code dealing with data write bios can just use a
pointer dereference to retrieve the ordered_extent instead of doing
multiple rbtree lookups per I/O.

The reference to this ordered_extent is dropped at end I/O time,
which implies that an extra one must be acquired when the bio is split.
This also requires moving the btrfs_extract_ordered_extent call into
btrfs_split_bio so that the invariant of always having a valid
ordered_extent reference for the btrfs_bio is kept.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
112397acc3 btrfs: open code btrfs_bio_end_io in btrfs_dio_submit_io
btrfs_dio_submit_io is the only place that uses btrfs_bio_end_io to end a
bio that hasn't been submitted using btrfs_submit_bio yet, and this
invariant will become a problem with upcoming changes to the btrfs bio
layer.  Just open code the assignment of bi_status and the call to
btrfs_dio_end_io.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
fbe960877b btrfs: add a is_data_bbio helper
Add a helper to check for that a btrfs_bio has a valid inode, and that
it is a data inode to key off all the special handling for data path
checksumming.  Note that this uses is_data_inode instead of REQ_META as
REQ_META is only set directly before submission in submit_one_bio and
we'll also want to use this helper for error handling where REQ_META
isn't set yet.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
ebfe4d4eb6 btrfs: remove btrfs_add_ordered_extent
All callers are gone now.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
d611935b5d btrfs: pass an ordered_extent to btrfs_submit_compressed_write
btrfs_submit_compressed_write always operates on a single ordered_extent.
Make that explicit by using btrfs_alloc_ordered_extent in the callers
and passing the ordered_extent to btrfs_submit_compressed_write.  This
will help with storing and ordered_extent pointer in the btrfs_bio in
subsequent patches.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
34bfaf1530 btrfs: pass an ordered_extent to btrfs_reloc_clone_csums
Both callers of btrfs_reloc_clone_csums allocate the ordered_extent that
btrfs_reloc_clone_csums operates on.  Switch them to use
btrfs_alloc_ordered_extent instead of btrfs_add_ordered_extent and
pass the ordered_extent to btrfs_reloc_clone_csums instead of doing an
extra lookup.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
3daea5fda1 btrfs: merge the two calls to btrfs_add_ordered_extent in run_delalloc_nocow
Refactor run_delalloc_nocow a little bit so that there is only a single
call to btrfs_add_ordered_extent instead of two.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
a39da514eb btrfs: limit write bios to a single ordered extent
Currently buffered writeback bios are allowed to span multiple
ordered_extents, although that basically never actually happens since
commit 4a445b7b61 ("btrfs: don't merge pages into bio if their page
offset is not contiguous").

Supporting bios than span ordered_extents complicates the file
checksumming code, and prevents us from adding an ordered_extent pointer
to the btrfs_bio structure.  Use the existing code to limit a bio to
single ordered_extent for zoned device writes for all writes.

This allows to remove the REQ_BTRFS_ONE_ORDERED flags, and the
handling of multiple ordered_extents in btrfs_csum_one_bio.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
c731cd0b6d btrfs: fix file_offset for REQ_BTRFS_ONE_ORDERED bios that get split
If a bio gets split, it needs to have a proper file_offset for checksum
validation and repair to work properly.

Based on feedback from Josef, commit 852eee62d3 ("btrfs: allow
btrfs_submit_bio to split bios") skipped this adjustment for ONE_ORDERED
bios.  But if we actually ever need to split a ONE_ORDERED read bio, this
will lead to a wrong file offset in the repair code.  Right now the only
user of the file_offset is logging of an error message so this is mostly
harmless, but the wrong offset might be more problematic for additional
users in the future.

Fixes: 852eee62d3 ("btrfs: allow btrfs_submit_bio to split bios")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
David Sterba
1a1b0e729d btrfs: add block-group tree to lockdep classes
The block group tree was not present among the lockdep classes. We could
get potentially lockdep warnings but so far none has been seen, also
because block-group-tree is a relatively new feature.

CC: stable@vger.kernel.org # 6.1+
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
7027f87108 btrfs: don't treat zoned writeback as being from an async helper thread
When extent_write_locked_range was originally added, it was only used
writing back compressed pages from an async helper thread.  But it is
now also used for writing back pages on zoned devices, where it is
called directly from the ->writepage context.  In this case we want to
be able to pass on the writeback_control instead of creating a new one,
and more importantly want to use all the normal cgroup interaction
instead of potentially deferring writeback to another helper.

Fixes: 898793d992 ("btrfs: zoned: write out partially allocated region")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
eb34dceace btrfs: only call __extent_writepage_io from extent_write_locked_range
__extent_writepage does a lot of things that make no sense for
extent_write_locked_range, given that extent_write_locked_range itself is
called from __extent_writepage either directly or through a workqueue,
and all this work has already been done in the first invocation and the
pages haven't been unlocked since.  Call __extent_writepage_io directly
instead and open code the logic tracked in
btrfs_bio_ctrl::extent_locked.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
9ecdbee819 btrfs: move writeback_control::nr_to_write update to __extent_writepage
Move the nr_to_write accounting from __extent_writepage_io to
__extent_writepage_io as we'll grow another __extent_writepage_io that
doesn't want this accounting soon.  Also drop the obsolete comment -
decrementing a counter in the on-stack writeback_control data structure
doesn't need the page lock.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
f22b5dcbd7 btrfs: remove non-standard extent handling in __extent_writepage_io
__extent_writepage_io is never called for compressed or inline extents,
or holes.  Remove the not quite working code for them and replace it with
asserts that these cases don't happen.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
a994310aa2 btrfs: remove PAGE_SET_ERROR
Now that the btrfs writeback code has stopped using PageError, using
PAGE_SET_ERROR to just set the per-address_space error flag is confusing.
Open code the mapping_set_error calls in the callers and remove
the PAGE_SET_ERROR flag.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
2b2553f123 btrfs: stop setting PageError in the data I/O path
PageError is not used by the VFS/MM and deprecated because it uses up a
page bit and has no coherent rules.  Instead read errors are usually
propagated by not setting or clearing the uptodate bit, and write errors
are propagated through the address_space.  Btrfs now only sets the flag
and never clears it for data pages, so just remove all places setting it,
and the subpage error bit.

Note that the error propagation for superblock writes that work on the
block device mapping still uses PageError for now, but that will be
addressed in a separate series.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
3e92499e3b btrfs: don't check PageError in __extent_writepage
__extent_writepage currenly sets PageError whenever any error happens,
and the also checks for PageError to decide if to call error handling.
This leads to very unclear responsibility for cleaning up on errors.
In the VM and generic writeback helpers the basic idea is that once
I/O is fired off all error handling responsibility is delegated to the
end I/O handler.  But if that end I/O handler sets the PageError bit,
and the submitter checks it, the bit could in some cases leak into the
submission context for fast enough I/O.

Fix this by simply not checking PageError and just using the local
ret variable to check for submission errors.  This also fundamentally
solves the long problem documented in a comment in __extent_writepage
by never leaking the error bit into the submission context.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
bb7b05fe1b btrfs: rename cow_file_range_async to run_delalloc_compressed
cow_file_range_async is only used for compressed writeback.  Rename it
to run_delalloc_compressed, which also fits in with run_delalloc_nocow
and run_delalloc_zoned.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
973fb26e81 btrfs: don't fail writeback when allocating the compression context fails
If cow_file_range_async fails to allocate the asynchronous writeback
context, it currently returns an error and entirely fails the writeback.
This is not a good idea as a writeback failure is a non-temporary error
condition that will make the file system unusable.  Just fall back to
synchronous uncompressed writeback instead.  This requires us to delay
setting the BTRFS_INODE_HAS_ASYNC_EXTENT flag until we've committed to
the async writeback.

The compression checks INODE_NOCOMPRESS and FORCE_COMPRESS are moved
from cow_file_range_async to the preceding checks btrfs_run_delalloc_range().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
57201dddd6 btrfs: don't check PageError in btrfs_verify_page
btrfs_verify_page is called from the readpage completion handler, which
is only used to read pages, or parts of pages that aren't uptodate yet.
The only case where PageError could be set on a page in btrfs is if we
had a previous writeback error, but in that case we won't called readpage
on it, as it has previously been marked uptodate.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
2c14f0ffdd btrfs: fix fsverify read error handling in end_page_read
Also clear the uptodate bit to make sure the page isn't seen as uptodate
in the page cache if fsverity verification fails.

Fixes: 146054090b ("btrfs: initial fsverity support")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig
ed9ee98ecb btrfs: factor out a btrfs_verify_page helper
Split all the conditionals for the fsverity calls in end_page_read into
a btrfs_verify_page helper to keep the code readable and make additional
refactoring easier.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig
36614a3beb btrfs: fix range_end calculation in extent_write_locked_range
The range_end field in struct writeback_control is inclusive, just like
the end parameter passed to extent_write_locked_range.  Not doing this
could cause extra writeout, which is harmless but suboptimal.

Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Boris Burkov
5cead5422a btrfs: insert tree mod log move in push_node_left
There is a fairly unlikely race condition in tree mod log rewind that
can result in a kernel panic which has the following trace:

  [530.569] BTRFS critical (device sda3): unable to find logical 0 length 4096
  [530.585] BTRFS critical (device sda3): unable to find logical 0 length 4096
  [530.602] BUG: kernel NULL pointer dereference, address: 0000000000000002
  [530.618] #PF: supervisor read access in kernel mode
  [530.629] #PF: error_code(0x0000) - not-present page
  [530.641] PGD 0 P4D 0
  [530.647] Oops: 0000 [#1] SMP
  [530.654] CPU: 30 PID: 398973 Comm: below Kdump: loaded Tainted: G S         O  K   5.12.0-0_fbk13_clang_7455_gb24de3bdb045 #1
  [530.680] Hardware name: Quanta Mono Lake-M.2 SATA 1HY9U9Z001G/Mono Lake-M.2 SATA, BIOS F20_3A15 08/16/2017
  [530.703] RIP: 0010:__btrfs_map_block+0xaa/0xd00
  [530.755] RSP: 0018:ffffc9002c2f7600 EFLAGS: 00010246
  [530.767] RAX: ffffffffffffffea RBX: ffff888292e41000 RCX: f2702d8b8be15100
  [530.784] RDX: ffff88885fda6fb8 RSI: ffff88885fd973c8 RDI: ffff88885fd973c8
  [530.800] RBP: ffff888292e410d0 R08: ffffffff82fd7fd0 R09: 00000000fffeffff
  [530.816] R10: ffffffff82e57fd0 R11: ffffffff82e57d70 R12: 0000000000000000
  [530.832] R13: 0000000000001000 R14: 0000000000001000 R15: ffffc9002c2f76f0
  [530.848] FS:  00007f38d64af000(0000) GS:ffff88885fd80000(0000) knlGS:0000000000000000
  [530.866] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [530.880] CR2: 0000000000000002 CR3: 00000002b6770004 CR4: 00000000003706e0
  [530.896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [530.912] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [530.928] Call Trace:
  [530.934]  ? btrfs_printk+0x13b/0x18c
  [530.943]  ? btrfs_bio_counter_inc_blocked+0x3d/0x130
  [530.955]  btrfs_map_bio+0x75/0x330
  [530.963]  ? kmem_cache_alloc+0x12a/0x2d0
  [530.973]  ? btrfs_submit_metadata_bio+0x63/0x100
  [530.984]  btrfs_submit_metadata_bio+0xa4/0x100
  [530.995]  submit_extent_page+0x30f/0x360
  [531.004]  read_extent_buffer_pages+0x49e/0x6d0
  [531.015]  ? submit_extent_page+0x360/0x360
  [531.025]  btree_read_extent_buffer_pages+0x5f/0x150
  [531.037]  read_tree_block+0x37/0x60
  [531.046]  read_block_for_search+0x18b/0x410
  [531.056]  btrfs_search_old_slot+0x198/0x2f0
  [531.066]  resolve_indirect_ref+0xfe/0x6f0
  [531.076]  ? ulist_alloc+0x31/0x60
  [531.084]  ? kmem_cache_alloc_trace+0x12e/0x2b0
  [531.095]  find_parent_nodes+0x720/0x1830
  [531.105]  ? ulist_alloc+0x10/0x60
  [531.113]  iterate_extent_inodes+0xea/0x370
  [531.123]  ? btrfs_previous_extent_item+0x8f/0x110
  [531.134]  ? btrfs_search_path_in_tree+0x240/0x240
  [531.146]  iterate_inodes_from_logical+0x98/0xd0
  [531.157]  ? btrfs_search_path_in_tree+0x240/0x240
  [531.168]  btrfs_ioctl_logical_to_ino+0xd9/0x180
  [531.179]  btrfs_ioctl+0xe2/0x2eb0

This occurs when logical inode resolution takes a tree mod log sequence
number, and then while backref walking hits a rewind on a busy node
which has the following sequence of tree mod log operations (numbers
filled in from a specific example, but they are somewhat arbitrary)

  REMOVE_WHILE_FREEING slot 532
  REMOVE_WHILE_FREEING slot 531
  REMOVE_WHILE_FREEING slot 530
  ...
  REMOVE_WHILE_FREEING slot 0
  REMOVE slot 455
  REMOVE slot 454
  REMOVE slot 453
  ...
  REMOVE slot 0
  ADD slot 455
  ADD slot 454
  ADD slot 453
  ...
  ADD slot 0
  MOVE src slot 0 -> dst slot 456 nritems 533
  REMOVE slot 455
  REMOVE slot 454
  REMOVE slot 453
  ...
  REMOVE slot 0

When this sequence gets applied via btrfs_tree_mod_log_rewind, it
allocates a fresh rewind eb, and first inserts the correct key info for
the 533 elements, then overwrites the first 456 of them, then decrements
the count by 456 via the add ops, then rewinds the move by doing a
memmove from 456:988->0:532. We have never written anything past 532, so
that memmove writes garbage into the 0:532 range. In practice, this
results in a lot of fully 0 keys. The rewind then puts valid keys into
slots 0:455 with the last removes, but 456:532 are still invalid.

When search_old_slot uses this eb, if it uses one of those invalid
slots, it can then read the extent buffer and issue a bio for offset 0
which ultimately panics looking up extent mappings.

This bad tree mod log sequence gets generated when the node balancing
code happens to do a balance_node_right followed by a push_node_left
while logging in the tree mod log. Illustrated for ebs L and R (left and
right):

	L                 R
  start:
  [XXX|YYY|...]      [ZZZ|...|...]
  balance_node_right:
  [XXX|YYY|...]      [...|ZZZ|...] move Z to make room for Y
  [XXX|...|...]      [YYY|ZZZ|...] copy Y from L to R
  push_node_left:
  [XXX|YYY|...]      [...|ZZZ|...] copy Y from R to L
  [XXX|YYY|...]      [ZZZ|...|...] move Z into emptied space (NOT LOGGED!)

This is because balance_node_right logs a move, but push_node_left
explicitly doesn't. That is because logging the move would remove the
overwritten src < dst range in the right eb, which was already logged
when we called btrfs_tree_mod_log_eb_copy. The correct sequence would
include a move from 456:988 to 0:532 after remove 0:455 and before
removing 0:532. Reversing that sequence would entail creating keys for
0:532, then moving those keys out to 456:988, then creating more keys
for 0:455.

i.e.,

  REMOVE_WHILE_FREEING slot 532
  REMOVE_WHILE_FREEING slot 531
  REMOVE_WHILE_FREEING slot 530
  ...
  REMOVE_WHILE_FREEING slot 0
  MOVE src slot 456 -> dst slot 0 nritems 533
  REMOVE slot 455
  REMOVE slot 454
  REMOVE slot 453
  ...
  REMOVE slot 0
  ADD slot 455
  ADD slot 454
  ADD slot 453
  ...
  ADD slot 0
  MOVE src slot 0 -> dst slot 456 nritems 533
  REMOVE slot 455
  REMOVE slot 454
  REMOVE slot 453
  ...
  REMOVE slot 0

Fix this to log the move but avoid the double remove by putting all the
logging logic in btrfs_tree_mod_log_eb_copy which has enough information
to detect these cases and properly log moves, removes, and adds. Leave
btrfs_tree_mod_log_insert_move to handle insert_ptr and delete_ptr's
tree mod logging.

(Un)fortunately, this is quite difficult to reproduce, and I was only
able to reproduce it by adding sleeps in btrfs_search_old_slot that
would encourage more log rewinding during ino_to_logical ioctls. I was
able to hit the warning in the previous patch in the series without the
fix quite quickly, but not after this patch.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Boris Burkov
95c8e349d8 btrfs: warn on invalid slot in tree mod log rewind
The way that tree mod log tracks the ultimate length of the eb, the
variable 'n', eventually turns up the correct value, but at intermediate
steps during the rewind, n can be inaccurate as a representation of the
end of the eb. For example, it doesn't get updated on move rewinds, and
it does get updated for add/remove in the middle of the eb.

To detect cases with invalid moves, introduce a separate variable called
max_slot which tries to track the maximum valid slot in the rewind eb.
We can then warn if we do a move whose src range goes beyond the max
valid slot.

There is a commented caveat that it is possible to have this value be an
overestimate due to the challenge of properly handling 'add' operations
in the middle of the eb, but in practice it doesn't cause enough of a
problem to throw out the max idea in favor of tracking every valid slot.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
David Sterba
8ab546bb30 btrfs: disable allocation warnings for compression workspaces
The workspaces for compression are typically much larger than a page and
for high zstd levels in the range of megabytes. There's a fallback to
vmalloc but this can still fail (see the report).

Some of the workspaces are preallocated at module load time so we have a
safe fallback, otherwise when a new workspace is needed it's allocated
but if this fails then the process waits. Which means the warning is
only causing noise and we can use the GFP flag to disable it.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217466
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig
8680e58761 btrfs: open code need_full_stripe conditions
need_full_stripe is just a somewhat complicated way to say
"op != BTRFS_MAP_READ".  Just spell that explicit check out, which makes
a lot of the code currently using the helper easier to understand.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig
723b8bb17e btrfs: open code btrfs_map_sblock
btrfs_map_sblock just hard codes three arguments and calls
btrfs_map_sblock.  Remove it as it doesn't provide any real value, but
makes following the btrfs_map_block call chains harder.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig
cd4efd210e btrfs: rename __btrfs_map_block to btrfs_map_block
Now that the old btrfs_map_block is gone, drop the leading underscores
from __btrfs_map_block.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig
d69d7ffc26 btrfs: remove unused btrfs_map_block
There are no users of btrfs_map_block left, so remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig
78a213a05d btrfs: optimize simple reads in btrfsic_map_block
Pass a smap into __btrfs_map_block so that the usual case of a read that
doesn't require parity raid recovery doesn't need an extra memory
allocation for the btrfs_io_context.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
3965a4c793 btrfs: remove unused BTRFS_MAP_DISCARD
BTRFS_MAP_DISCARD is never set, as REQ_OP_DISCARD is never passed to
btrfs_op() only only checked in two ASSERTS.

Remove it and let the catchall WARN_ON in btrfs_op() deal with accidental
REQ_OP_DISCARDs leaked into btrfs_op(). Last use was in a4012f06f1
("btrfs: split discard handling out of btrfs_map_block").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
David Sterba
efcfcbc6a3 btrfs: add xxhash to fast checksum implementations
The implementation of XXHASH is now CPU only but still fast enough to be
considered for the synchronous checksumming, like non-generic crc32c.

A userspace benchmark comparing it to various implementations (patched
hash-speedtest from btrfs-progs):

  Block size:     4096
  Iterations:     1000000
  Implementation: builtin
  Units:          CPU cycles

	NULL-NOP: cycles:     73384294, cycles/i       73
     NULL-MEMCPY: cycles:    228033868, cycles/i      228,    61664.320 MiB/s
      CRC32C-ref: cycles:  24758559416, cycles/i    24758,      567.950 MiB/s
       CRC32C-NI: cycles:   1194350470, cycles/i     1194,    11773.433 MiB/s
  CRC32C-ADLERSW: cycles:   6150186216, cycles/i     6150,     2286.372 MiB/s
  CRC32C-ADLERHW: cycles:    626979180, cycles/i      626,    22427.453 MiB/s
      CRC32C-PCL: cycles:    466746732, cycles/i      466,    30126.699 MiB/s
	  XXHASH: cycles:    860656400, cycles/i      860,    16338.188 MiB/s

Comparing purely software implementation (ref), current outdated
accelerated using crc32q instruction (NI), optimized implementations by
M. Adler (https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775)
and the best one that was taken from kernel using the PCLMULQDQ
instruction (PCL).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
f000bc6fe4 btrfs: pass the new logical address to split_extent_map
split_extent_map splits off the first chunk of an extent map into a new
one.  One of the two users is the zoned I/O completion code that wants to
rewrite the logical block start address right after this split.  Pass in
the logical address to be set in the split off first extent_map as an
argument to avoid an extra extent tree lookup for this case.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
71df088c1c btrfs: defer splitting of ordered extents until I/O completion
The btrfs zoned completion code currently needs an ordered_extent and
extent_map per bio so that it can account for the non-predictable
write location from Zone Append.  To archive that it currently splits
the ordered_extent and extent_map at I/O submission time, and then
records the actual physical address in the ->physical field of the
ordered_extent.

This patch instead switches to record the "original" physical address
that the btrfs allocator assigned in spare space in the btrfs_bio,
and then rewrites the logical address in the btrfs_ordered_sum
structure at I/O completion time.  This allows the ordered extent
completion handler to simply walk the list of ordered csums and
split the ordered extent as needed.  This removes an extra ordered
extent and extent_map lookup and manipulation during the I/O
submission path, and instead batches it in the I/O completion path
where we need to touch these anyway.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
52b1fdca23 btrfs: handle completed ordered extents in btrfs_split_ordered_extent
To delay splitting ordered_extents to I/O completion time we need to be
able to handle fully completed ordered extents in
btrfs_split_ordered_extent.  Besides a bit of accounting this primarily
involved moving over the csums to the split bio for the range that it
covers, which is simple enough because we always have one
btrfs_ordered_sum per bio.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
816f589b8d btrfs: atomically insert the new extent in btrfs_split_ordered_extent
Currently there is a small race window in btrfs_split_ordered_extent,
where the reduced old extent can be looked up on the per-inode rbtree
or the per-root list while the newly split out one isn't visible yet.

Fix this by open coding btrfs_alloc_ordered_extent in
btrfs_split_ordered_extent, and holding the tree lock and
root->ordered_extent_lock over the entire tree and extent manipulation.

Note that this introduces new lock ordering because previously
ordered_extent_lock was never held over the tree lock.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
53d9981ca2 btrfs: split btrfs_alloc_ordered_extent to allocation and insertion helpers
Split two low-level helpers out of btrfs_alloc_ordered_extent to allocate
and insert the logic extent.  The pure alloc helper will be used to
improve btrfs_split_ordered_extent.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
b0307e2864 btrfs: return the new ordered_extent from btrfs_split_ordered_extent
Return the ordered_extent split from the passed in one.  This will be
needed to be able to store an ordered_extent in the btrfs_bio.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
ebdb44a00e btrfs: reorder conditions in btrfs_extract_ordered_extent
There is no good reason for doing one before the other in terms of
failure implications, but doing the extent_map split first will
simplify some upcoming refactoring.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
a6f3e205e4 btrfs: move split_extent_map to extent_map.c
split_extent_map doesn't have anything to do with the other code in
inode.c, so move it to extent_map.c.

This also allows marking replace_extent_mapping static.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
3887653c44 btrfs: record orig_physical only for the original bio
btrfs_submit_dev_bio is also called for clone bios that aren't embedded
into a btrfs_bio structure, but previous commit "btrfs: optimize the
logical to physical mapping for zoned writes" added code to assign
btrfs_bio.orig_physical in it.

This is harmless right now as only the single data profile can be used
on zoned devices, but will blow up when the RAID stripe tree is added.
Move it out into the single I/O specific branch in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Christoph Hellwig
cbfce4c7fb btrfs: optimize the logical to physical mapping for zoned writes
The current code to store the final logical to physical mapping for a
zone append write in the extent tree is rather inefficient.  It first has
to split the ordered extent so that there is one ordered extent per bio,
so that it can look up the ordered extent on I/O completion in
btrfs_record_physical_zoned and store the physical LBA returned by the
block driver in the ordered extent.

btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to
see what physical address the logical address for this bio / ordered
extent is mapped to, and then rewrite it in the extent tree.

To optimize this process, we can store the physical address assigned in
the chunk tree to the original logical address and a pointer to
btrfs_ordered_sum structure the in the btrfs_bio structure, and then use
this information to rewrite the logical address in the btrfs_ordered_sum
structure directly at I/O completion time in btrfs_record_physical_zoned.
btrfs_rewrite_logical_zoned then simply updates the logical address in
the extent tree and the ordered_extent itself.

The code in btrfs_rewrite_logical_zoned now runs for all data I/O
completions in zoned file systems, which is fine as there is no remapping
to do for non-append writes to conventional zones or for relocation, and
the overhead for quickly breaking out of the loop is very low.

Because zoned file systems now need the ordered_sums structure to
record the actual write location returned by zone append, allocate dummy
structures without the csum array for them when the I/O doesn't use
checksums, and free them when completing the ordered_extent.

Note that the btrfs_bio doesn't grow as the new field are places into
a union that is so far not used for data writes and has plenty of space
left in it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Christoph Hellwig
5cfe76f846 btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical
btrfs_ordered_sum::bytendr stores a logical address.  Make that clear by
renaming it to ->logical.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Christoph Hellwig
6e4b2479ab btrfs: mark the len field in struct btrfs_ordered_sum as unsigned
len can't ever be negative, so mark it as an u32 instead of int.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Christoph Hellwig
e9cb93b9fb btrfs: don't call btrfs_record_physical_zoned for failed append
When a zoned append command fails there is no written address reported,
so don't try to record it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Christoph Hellwig
dd8b7b0416 btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED
Add an IS_ENABLED check for CONFIG_BLK_DEV_ZONED in addition to the
run-time check for the zone size.  This will allow to make use of
compiler dead code elimination for code guarded by btrfs_is_zoned, and
for example provide just a dangling prototype for a function instead of
adding a stub.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Filipe Manana
99f09ce309 btrfs: make btrfs_destroy_delayed_refs() return void
btrfs_destroy_delayed_refs() always returns 0 and its single caller does
not check its return value, as it also returns void, and so does the
callers' caller and so on. This is because we are in the transaction abort
path, where we have no way to deal with errors (we are in a critical
situation) and all cleanup of resources works in a best effort fashion.
So make btrfs_destroy_delayed_refs() return void.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Filipe Manana
184533e361 btrfs: remove unnecessary prototype declarations at disk-io.c
We have a few static functions at disk-io.c for which we have a forward
declaration of their prototype, but it's not needed because all those
functions are defined before they are called, so remove them.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Filipe Manana
f1ed785a5b btrfs: use a single switch statement when initializing delayed ref head
At init_delayed_ref_head(), we are using two separate if statements to
check the delayed ref head action, and initializing 'must_insert_reserved'
to false twice, once when the variable is declared and once again in an
else branch.

Make this simpler and more straightforward by having a single switch
statement, also moving the comment about a drop action to the
corresponding switch case to make it more clear and eliminating the
duplicated initialization of 'must_insert_reserved' to false.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Filipe Manana
61c681fef7 btrfs: use bool type for delayed ref head fields that are used as booleans
There's no point in have several fields defined as 1 bit unsigned int in
struct btrfs_delayed_ref_head, we can instead use a bool type, it makes
the code a bit more readable and it doesn't change the structure size.
So switch them to proper booleans.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Filipe Manana
1e6b71c34b btrfs: assert correct lock is held at btrfs_select_ref_head()
The function btrfs_select_ref_head() iterates over the red black tree of
delayed reference heads, which is protected by the spinlock in the delayed
refs root. The function doesn't take the lock, it's taken by its single
caller, btrfs_obtain_ref_head(), because it needs to call that function
and btrfs_delayed_ref_lock() in the same critical section (delimited by
that spinlock). So assert at btrfs_select_ref_head() that we are holding
the expected lock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
Filipe Manana
798f4d95db btrfs: get rid of label and goto at insert_delayed_ref()
At insert_delayed_ref() there's no point of having a label and goto in the
case we were able to insert the delayed ref head. We can just add the code
under label to the if statement's body and return immediately, and also
there is no need to track the return value in a variable, we can just
return a literal true or false value directly. So do those changes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
Filipe Manana
f38462c447 btrfs: make insert_delayed_ref() return a bool instead of an int
insert_delayed_ref() can only return 0 or 1, to indicate if the given
delayed reference was added to the head reference or if it was merged
into an existing delayed ref, respectively. So just make it return a
boolean instead.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
Filipe Manana
293f8197a4 btrfs: use a bool to track qgroup record insertion when adding ref head
We are using an integer as a boolean to track the qgroup record insertion
status when adding a delayed reference head. Since all we need is a
boolean, switch the type from int to bool to make it more obvious.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
Filipe Manana
4d34ad34d7 btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_node
The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node,
as we can check whether a reference is in the tree or not simply by
checking its red black tree node member with RB_EMPTY_NODE(), as when we
remove it from the tree we always call RB_CLEAR_NODE(). So remove that
field and use RB_EMPTY_NODE().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
Filipe Manana
53499d5f6b btrfs: remove unused is_head field from struct btrfs_delayed_ref_node
The 'is_head' field of struct btrfs_delayed_ref_node is no longer after
commit d278850eff ("btrfs: remove delayed_ref_node from ref_head"),
so remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
Filipe Manana
315dd5cc75 btrfs: reorder some members of struct btrfs_delayed_ref_head
Currently struct delayed_ref_head has its 'bytenr' and 'href_node' members
in different cache lines (even on a release, non-debug, kernel). This is
not optimal because when iterating the red black tree of delayed ref heads
for inserting a new delayed ref head (htree_insert()) we have to pull in 2
cache lines of delayed ref heads we find in a patch, one for the tree node
(struct rb_node) and another one for the 'bytenr' field. The same applies
when searching for an existing delayed ref head (find_ref_head()).
On a release (non-debug) kernel, the structure also has two 4 bytes holes,
which makes it 8 bytes longer than necessary. Its current layout is the
following:

  struct btrfs_delayed_ref_head {
          u64                        bytenr;               /*     0     8 */
          u64                        num_bytes;            /*     8     8 */
          refcount_t                 refs;                 /*    16     4 */

          /* XXX 4 bytes hole, try to pack */

          struct mutex               mutex;                /*    24    32 */
          spinlock_t                 lock;                 /*    56     4 */

          /* XXX 4 bytes hole, try to pack */

          /* --- cacheline 1 boundary (64 bytes) --- */
          struct rb_root_cached      ref_tree;             /*    64    16 */
          struct list_head           ref_add_list;         /*    80    16 */
          struct rb_node             href_node __attribute__((__aligned__(8))); /*    96    24 */
          struct btrfs_delayed_extent_op * extent_op;      /*   120     8 */
          /* --- cacheline 2 boundary (128 bytes) --- */
          int                        total_ref_mod;        /*   128     4 */
          int                        ref_mod;              /*   132     4 */
          unsigned int               must_insert_reserved:1; /*   136: 0  4 */
          unsigned int               is_data:1;            /*   136: 1  4 */
          unsigned int               is_system:1;          /*   136: 2  4 */
          unsigned int               processing:1;         /*   136: 3  4 */

          /* size: 144, cachelines: 3, members: 15 */
          /* sum members: 128, holes: 2, sum holes: 8 */
          /* sum bitfield members: 4 bits (0 bytes) */
          /* padding: 4 */
          /* bit_padding: 28 bits */
          /* forced alignments: 1 */
          /* last cacheline: 16 bytes */
  } __attribute__((__aligned__(8)));

This change reorders the 'href_node' and 'refs' members so that we have
the 'href_node' in the same cache line as the 'bytenr' field, while also
eliminating the two holes and reducing the structure size from 144 bytes
down to 136 bytes, so we can now have 30 ref heads per 4K page (on x86_64)
instead of 28. The new structure layout after this change is now:

  struct btrfs_delayed_ref_head {
          u64                        bytenr;               /*     0     8 */
          u64                        num_bytes;            /*     8     8 */
          struct rb_node             href_node __attribute__((__aligned__(8))); /*    16    24 */
          struct mutex               mutex;                /*    40    32 */
          /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
          refcount_t                 refs;                 /*    72     4 */
          spinlock_t                 lock;                 /*    76     4 */
          struct rb_root_cached      ref_tree;             /*    80    16 */
          struct list_head           ref_add_list;         /*    96    16 */
          struct btrfs_delayed_extent_op * extent_op;      /*   112     8 */
          int                        total_ref_mod;        /*   120     4 */
          int                        ref_mod;              /*   124     4 */
          /* --- cacheline 2 boundary (128 bytes) --- */
          unsigned int               must_insert_reserved:1; /*   128: 0  4 */
          unsigned int               is_data:1;            /*   128: 1  4 */
          unsigned int               is_system:1;          /*   128: 2  4 */
          unsigned int               processing:1;         /*   128: 3  4 */

          /* size: 136, cachelines: 3, members: 15 */
          /* padding: 4 */
          /* bit_padding: 28 bits */
          /* forced alignments: 1 */
          /* last cacheline: 8 bytes */
  } __attribute__((__aligned__(8)));

Running the following fs_mark test shows some significant improvement.

  $ cat test.sh
  #!/bin/bash

  # 15G null block device
  DEV=/dev/nullb0
  MNT=/mnt/nullb0
  FILES=100000
  THREADS=$(nproc --all)
  FILE_SIZE=0

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t $THREADS -k"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

Before this change:

FSUse%        Count         Size    Files/sec     App Overhead
    10      1200000            0     112631.3         11928055
    16      2400000            0     189943.8         12140777
    23      3600000            0     150719.2         13178480
    50      4800000            0      99137.3         12504293
    53      6000000            0     111733.9         12670836

                    Total files/sec: 664165.5

After this change:

FSUse%        Count         Size    Files/sec     App Overhead
    10      1200000            0     148589.5         11565889
    16      2400000            0     227743.8         11561596
    23      3600000            0     191590.5         12550755
    30      4800000            0     179812.3         12629610
    53      6000000            0      92471.4         12352383

                    Total files/sec: 840207.5

Measuring the execution times of htree_insert(), in nanoseconds, during
those fs_mark runs:

Before this change:

  Range:  0.000 - 940647.000; Mean: 619.733; Median: 548.000; Stddev: 1834.231
  Percentiles:  90th: 980.000; 95th: 1208.000; 99th: 2090.000
     0.000 -    6.384:       257 |
     6.384 -   26.259:       977 |
    26.259 -   99.635:      4963 |
    99.635 -  370.526:    136800 #############
   370.526 - 1370.603:    566110 #####################################################
  1370.603 - 5062.704:     24945 ##
  5062.704 - 18693.248:      944 |
  18693.248 - 69014.670:     211 |
  69014.670 - 254791.959:     30 |
  254791.959 - 940647.000:     4 |

After this change:

  Range:  0.000 - 299200.000; Mean: 587.754; Median: 542.000; Stddev: 1030.422
  Percentiles:  90th: 918.000; 95th: 1113.000; 99th: 1987.000
     0.000 -    5.585:      163 |
     5.585 -   20.678:      452 |
    20.678 -   70.369:     1806 |
    70.369 -  233.965:    26268 ####
   233.965 -  772.564:   333519 #####################################################
   772.564 - 2545.771:    91820 ###############
  2545.771 - 8383.615:     2238 |
  8383.615 - 27603.280:     170 |
  27603.280 - 90879.297:     68 |
  90879.297 - 299200.000:    12 |

Mean, percentiles, maximum times are all better, as well as a lower
standard deviation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
Qu Wenruo
31dd8c81dd btrfs: use the same uptodate variable for end_bio_extent_readpage()
In function end_bio_extent_readpage() we call
endio_readpage_release_extent() to unlock the extent io tree.

However we pass PageUptodate(page) as @uptodate parameter for it, while
for previous end_page_read() call, we use a dedicated @uptodate local
variable.

This is not a big deal, as even for subpage cases, either the bio only
covers part of the page, then the @uptodate is always false, and the
subpage ranges can still be merged.

But for the sake of consistency, always use @uptodate variable when
possible.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
Qu Wenruo
5a96341927 btrfs: subpage: make alloc_extent_buffer() handle previously uptodate range efficiently
Currently alloc_extent_buffer() would make the extent buffer uptodate if
the corresponding pages are also uptodate.

But this check is only checking PageUptodate, which is fine for regular
cases, but not for subpage cases, as we can have multiple extent buffers
in the same page.

So here we go btrfs_page_test_uptodate() instead.

The old code doesn't cause any problem, but is not efficient, as it
would cause extra metadata read even if the range is already uptodate.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
David Sterba
b831306b3b btrfs: print assertion failure report and stack trace from the same line
Assertions reports are split into two parts, the exact file and location
of the condition and then the stack trace printed from
btrfs_assertfail(). This means all the stack traces report the same line
and this is what's typically reported by various tools, making it harder
to distinguish the reports.

  [403.2467] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4259
  [403.2479] ------------[ cut here ]------------
  [403.2484] kernel BUG at fs/btrfs/messages.c:259!
  [403.2488] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  [403.2493] CPU: 2 PID: 23202 Comm: umount Not tainted 6.2.0-rc4-default+ #67
  [403.2499] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
  [403.2509] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs]
  ...
  [403.2595] Call Trace:
  [403.2598]  <TASK>
  [403.2601]  btrfs_free_block_groups.cold+0x52/0xae [btrfs]
  [403.2608]  close_ctree+0x6c2/0x761 [btrfs]
  [403.2613]  ? __wait_for_common+0x2b8/0x360
  [403.2618]  ? btrfs_cleanup_one_transaction.cold+0x7a/0x7a [btrfs]
  [403.2626]  ? mark_held_locks+0x6b/0x90
  [403.2630]  ? lockdep_hardirqs_on_prepare+0x13d/0x200
  [403.2636]  ? __call_rcu_common.constprop.0+0x1ea/0x3d0
  [403.2642]  ? trace_hardirqs_on+0x2d/0x110
  [403.2646]  ? __call_rcu_common.constprop.0+0x1ea/0x3d0
  [403.2652]  generic_shutdown_super+0xb0/0x1c0
  [403.2657]  kill_anon_super+0x1e/0x40
  [403.2662]  btrfs_kill_super+0x25/0x30 [btrfs]
  [403.2668]  deactivate_locked_super+0x4c/0xc0

By making btrfs_assertfail a macro we'll get the same line number for
the BUG output:

  [63.5736] assertion failed: 0, in fs/btrfs/super.c:1572
  [63.5758] ------------[ cut here ]------------
  [63.5782] kernel BUG at fs/btrfs/super.c:1572!
  [63.5807] invalid opcode: 0000 [#2] PREEMPT SMP KASAN
  [63.5831] CPU: 0 PID: 859 Comm: mount Tainted: G      D            6.3.0-rc7-default+ #2062
  [63.5868] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
  [63.5905] RIP: 0010:btrfs_mount+0x24/0x30 [btrfs]
  [63.5964] RSP: 0018:ffff88800e69fcd8 EFLAGS: 00010246
  [63.5982] RAX: 000000000000002d RBX: ffff888008fc1400 RCX: 0000000000000000
  [63.6004] RDX: 0000000000000000 RSI: ffffffffb90fd868 RDI: ffffffffbcc3ff20
  [63.6026] RBP: ffffffffc081b200 R08: 0000000000000001 R09: ffff88800e69fa27
  [63.6046] R10: ffffed1001cd3f44 R11: 0000000000000001 R12: ffff888005a3c370
  [63.6062] R13: ffffffffc058e830 R14: 0000000000000000 R15: 00000000ffffffff
  [63.6081] FS:  00007f7b3561f800(0000) GS:ffff88806c600000(0000) knlGS:0000000000000000
  [63.6105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [63.6120] CR2: 00007fff83726e10 CR3: 0000000002a9e000 CR4: 00000000000006b0
  [63.6137] Call Trace:
  [63.6143]  <TASK>
  [63.6148]  legacy_get_tree+0x80/0xd0
  [63.6158]  vfs_get_tree+0x43/0x120
  [63.6166]  do_new_mount+0x1f3/0x3d0
  [63.6176]  ? do_add_mount+0x140/0x140
  [63.6187]  ? cap_capable+0xa4/0xe0
  [63.6197]  path_mount+0x223/0xc10

This comes at a cost of bloating the final btrfs.ko module due all the
inlining, as long as assertions are compiled in. This is a must for
debugging builds but this is often enabled on release builds too.

Release build:

   text    data     bss     dec     hex filename
1251676   20317   16088 1288081  13a791 pre/btrfs.ko
1260612   29473   16088 1306173  13ee3d post/btrfs.ko

DELTA: +8936

CC: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:31 +02:00
Qu Wenruo
75258f20fb btrfs: subpage: dump extra subpage bitmaps for debug
There is a bug report that assert_eb_page_uptodate() gets triggered for
free space tree metadata.

Without proper dump for the subpage bitmaps it's much harder to debug.

Thus this patch would dump all the subpage bitmaps (split them into
their own bitmaps) for a easier debugging.

The output would look like this:
(Dumped after a tree block got read from disk)

  page:000000006e34bf49 refcount:4 mapcount:0 mapping:0000000067661ac4 index:0x1d1 pfn:0x110e9
  memcg:ffff0000d7d62000
  aops:btree_aops [btrfs] ino:1
  flags: 0x8000000000002002(referenced|private|zone=2)
  page_type: 0xffffffff()
  raw: 8000000000002002 0000000000000000 dead000000000122 ffff00000188bed0
  raw: 00000000000001d1 ffff0000c7992700 00000004ffffffff ffff0000d7d62000
  page dumped because: btrfs subpage dump
  BTRFS warning (device dm-1): start=30490624 len=16384 page=30474240 bitmaps: uptodate=4-7 error= dirty= writeback= ordered= checked=

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
Tejun Heo
58e814fcac btrfs: use alloc_ordered_workqueue() to create ordered workqueues
BACKGROUND
==========

When multiple work items are queued to a workqueue, their execution order
doesn't match the queueing order. They may get executed in any order and
simultaneously. When fully serialized execution - one by one in the queueing
order - is needed, an ordered workqueue should be used which can be created
with alloc_ordered_workqueue().

However, alloc_ordered_workqueue() was a later addition. Before it, an
ordered workqueue could be obtained by creating an UNBOUND workqueue with
@max_active==1. This originally was an implementation side-effect which was
broken by 4c16bd327c ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered"). Because there were users that depended on the ordered execution,
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
made workqueue allocation path to implicitly promote UNBOUND workqueues w/
@max_active==1 to ordered workqueues.

While this has worked okay, overloading the UNBOUND allocation interface
this way creates other issues. It's difficult to tell whether a given
workqueue actually needs to be ordered and users that legitimately want a
min concurrency level wq unexpectedly gets an ordered one instead. With
planned UNBOUND workqueue updates to improve execution locality and more
prevalence of chiplet designs which can benefit from such improvements, this
isn't a state we wanna be in forever.

This patch series audits all call sites that create an UNBOUND workqueue w/
@max_active==1 and converts them to alloc_ordered_workqueue() as necessary.

BTRFS
=====

* fs_info->scrub_workers initialized in scrub_workers_get() was setting
  @max_active to 1 when @is_dev_replace is set and it seems that the
  workqueue actually needs to be ordered if @is_dev_replace. Update the code
  so that alloc_ordered_workqueue() is used if @is_dev_replace.

* fs_info->discard_ctl.discard_workers initialized in
  btrfs_init_workqueues() was directly using alloc_workqueue() w/
  @max_active==1. Converted to alloc_ordered_workqueue().

* fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in
  btrfs_queue_work() use the btrfs's workqueue wrapper, btrfs_workqueue,
  which are allocated with btrfs_alloc_workqueue().

  btrfs_workqueue implements automatic @max_active adjustment which is
  disabled when the specified max limit is below a certain threshold, so
  calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered
  workqueue whose @max_active won't be changed as the auto-tuning is
  disabled.

  This is rather brittle in that nothing clearly indicates that the two
  workqueues should be ordered or btrfs_alloc_workqueue() must disable
  auto-tuning when @limit_active==1.

  This patch factors out the common btrfs_workqueue init code into
  btrfs_init_workqueue() and add explicit btrfs_alloc_ordered_workqueue().
  The two workqueues are converted to use the new ordered allocation
  interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
1d12680044 btrfs: drop gfp from parameter extent state helpers
Now that all extent state bit helpers effectively take the GFP_NOFS mask
(and GFP_NOWAIT is encoded in the bits) we can remove the parameter.
This reduces stack consumption in many functions and simplifies a lot of
code.

Net effect on module on a release build:

   text    data     bss     dec     hex filename
1250432   20985   16088 1287505  13a551 pre/btrfs.ko
1247074   20985   16088 1284147  139833 post/btrfs.ko

DELTA: -3358

Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
62bc60473a btrfs: pass NOWAIT for set/clear extent bits as another bit
The only flags we now pass to set_extent_bit/__clear_extent_bit are
GFP_NOFS and GFP_NOWAIT (a few functions handling mappings). This
requires an extra parameter to be passed everywhere but is almost always
the same.

Encode the GFP_NOWAIT as an artificial extent bit and extract the
real bits and gfp mask in the lowest level helpers. Now the passed
gfp mask is not actually used and can be removed.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
7dde7a8ab3 btrfs: drop NOFAIL from set_extent_bit allocation masks
The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
(commit f0486c68e4 ("Btrfs: Introduce contexts for metadata
reservation")), without any explanation why it would be needed.

Meanwhile we've updated the semantics of set_extent_bit to handle failed
allocations and do unlock, sleep and retry if needed.  The use of the
NOFAIL flag is also an outlier, we never want any of the set/clear
extent bit helpers to fail, they're used for many critical changes like
extent locking, besides the extent state bit changes.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
0acd32c294 btrfs: open code set_extent_bits
This helper calls set_extent_bit with two more parameters set to default
values, but otherwise it's purpose is not clear.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
e85de967bc btrfs: open code set_extent_bits_nowait
The helper only passes GFP_NOWAIT as gfp flags and is used two times.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
fe1a598c42 btrfs: open code set_extent_dirty
The helper is used a few times, that it's setting the DIRTY extent bit
is still clear.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
eea8686e68 btrfs: open code set_extent_new
The helper is used only once.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
66240ab115 btrfs: open code set_extent_delalloc
The helper is used once in fs code and a few times in the self test
code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
dc5646c15c btrfs: open code set_extent_defrag
The helper is used only once.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Christoph Hellwig
25ac047c9d btrfs: remove a pointless NULL check in btrfs_lookup_fs_root
btrfs_grab_root already checks for a NULL root itself.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Christoph Hellwig
e91909aace btrfs: convert btrfs_get_global_root to use a switch statement
Use a switch statement instead of an endless chain of if statements
to make the code a little cleaner.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Christoph Hellwig
85724171b3 btrfs: fix the btrfs_get_global_root return value
btrfs_grab_root returns either the root or NULL, and the callers of
btrfs_get_global_root expect it to return the same.  But all the more
recently added roots instead return an ERR_PTR, so fix this.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain
d85512d54e btrfs: add and fix comments in btrfs_fs_devices
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain
25984a5ae8 btrfs: consolidate uuid comparisons in btrfs_validate_super
There are three ways the fsid is validated in btrfs_validate_super():

- verify that super_copy::fsid is the same as fs_devices::fsid

- if the metadata_uuid flag is set, verify if super_copy::metadata_uuid
  and fs_devices::metadata_uuid are the same.

- a few lines below, often missed out, verify if dev_item::fsid is the
  same as fs_devices::metadata_uuid.

The function btrfs_validate_super() contains multiple if-statements with
memcmp() to check UUIDs. This patch consolidates them into a single
location.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain
a3c54b0be1 btrfs: simplify how changed fsid and metadata_uuid is checked
We often check if the metadata_uuid is not the same as fsid, and then we
check if the given fsid matches the metadata_uuid. This patch refactors
this logic into function match_fsid_changed and utilize it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain
1a89834500 btrfs: simplify fsid and metadata_uuid comparisons
Refactor the functions find_fsid() and find_fsid_with_metadata_uuid(),
as they currently share a common set of code to compare the fsid and
metadata_uuid. Create a common helper function, match_fsid_fs_devices().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain
413fb1bc1d btrfs: return bool from check_tree_block_fsid instead of int
Simplify the return type of check_tree_block_fsid() from int (1 or 0) to
bool. Its only user is interested in knowing the success or failure.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain
f62c302e6d btrfs: add comment about metadata_uuid in btrfs_fs_devices
Add comment about metadata_uuid in btrfs_fs_devices.
No functional change.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain
c6930d7d11 btrfs: merge calls to alloc_fs_devices in device_list_add
Simplify has_metadata_uuid checks - by localizing the has_metadata_uuid
checked within alloc_fs_devices()'s second argument, it improves the
code readability.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain
19c4c49ca9 btrfs: streamline fsid checks in alloc_fs_devices
We currently have redundant checks for the non-null value of fsid
simplify it.

And, no one is using alloc_fs_devices() with a NULL metadata_uuid
while fsid is not NULL, add an assert() to verify this condition.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Anand Jain
4693893bf8 btrfs: reduce struct btrfs_fs_devices size by moving fsid_change
Pack bool fsid_change and bool seeding with other bool declarations in the
struct btrfs_fs_devices, approximately 6 bytes is saved, depending on
the config.

   before: 512 bytes
   after: 496 bytes

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
46672a44b0 btrfs: merge write_one_subpage_eb into write_one_eb
Most of the code in write_one_subpage_eb and write_one_eb is shared,
so merge the two functions into one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
d7172f52e9 btrfs: use per-buffer locking for extent_buffer reading
Instead of locking and unlocking every page or the extent, just add a
new EXTENT_BUFFER_READING bit that mirrors EXTENT_BUFFER_WRITEBACK
for synchronizing threads trying to read an extent_buffer and to wait
for I/O completion.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
9e2aff90fc btrfs: stop using lock_extent in btrfs_buffer_uptodate
The only other place that locks extents on the btree inode is
read_extent_buffer_subpage while reading in the partial page for a
buffer.  This means locking the extent in btrfs_buffer_uptodate does not
synchronize with anything on non-subpage file systems, and on subpage
file systems it only waits for a parallel read(-ahead) to finish,
which seems to be counter to what the callers actually expect.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
f3d315eb93 btrfs: don't check for uptodate pages in read_extent_buffer_pages
The only place that reads in pages and thus marks them uptodate for
the btree inode is read_extent_buffer_pages.  Which means that either
pages are already uptodate from an old buffer when creating a new
one in alloc_extent_buffer, or they will be updated by ca call
to read_extent_buffer_pages.  This means the checks for uptodate
pages in read_extent_buffer_pages and read_extent_buffer_subpage are
superfluous and can be removed.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
011134f444 btrfs: stop using PageError for extent_buffers
PageError is only used to limit the uptodate check in
assert_eb_page_uptodate.  But we have a much more useful flag indicating
the exact condition we are about with the EXTENT_BUFFER_WRITE_ERR flag,
so use that instead and help the kernel toward eventually removing
PageError.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
113fa05c2f btrfs: remove the io_pages field in struct extent_buffer
No need to track the number of pages under I/O now that each
extent_buffer is read and written using a single bio.  For the
read side we need to grab an extra reference for the duration of
the I/O to prevent eviction, though.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
31d89399da btrfs: remove the extent_buffer lookup in btree block checksumming
The checksumming of btree blocks always operates on the entire
extent_buffer, and because btree blocks are always allocated contiguously
on disk they are never split by btrfs_submit_bio.

Simplify the checksumming code by finding the extent_buffer in the
btrfs_bio private data instead of trying to search through the bio_vec.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
cd88a4fdbf btrfs: use a separate end_io handler for extent_buffer writing
Now that we always use a single bio to write an extent_buffer, the buffer
can be passed to the end_io handler as private data.  This allows
to simplify the metadata write end I/O handler, and merge the subpage
end_io handler into the main one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
b51e6b4bda btrfs: don't use btrfs_bio_ctrl for extent buffer writing
The btrfs_bio_ctrl machinery is overkill for writing extent_buffers
as we always operate on PAGE_SIZE chunks (or one smaller one for the
subpage case) that are contiguous and are guaranteed to fit into a
single bio.  Replace it with open coded btrfs_bio_alloc, __bio_add_page
and btrfs_submit_bio calls.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
81a79b6ae4 btrfs: move page locking from lock_extent_buffer_for_io to write_one_eb
Locking the pages in lock_extent_buffer_for_io only for the non-subpage
case is very confusing.  Move it to write_one_eb to mirror the subpage
case and simplify the code. Now lock_extent_buffer_for_io does not leave
all the pages locked and each is individually locked/unlocked in
write_one_eb.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Christoph Hellwig
50b21d7a06 btrfs: submit a writeback bio per extent_buffer
Stop trying to cluster writes of multiple extent_buffers into a single
bio.  There is no need for that as the blk_plug mechanism used all the
way up in writeback_inodes_wb gives us the same I/O pattern even with
multiple bios.  Removing the clustering simplifies
lock_extent_buffer_for_io a lot and will also allow passing the eb
as private data to the end I/O handler.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
9fdd160160 btrfs: return bool from lock_extent_buffer_for_io
lock_extent_buffer_for_io never returns a negative error value, so switch
the return value to a simple bool.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep noinline_for_stack ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
3d66b4b27d btrfs: do not try to unlock the extent for non-subpage metadata reads
Only subpage metadata reads lock the extent.  Don't try to unlock it and
waste cycles in the extent tree lookup for PAGE_SIZE or larger metadata.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
046b562b20 btrfs: use a separate end_io handler for read_extent_buffer
Now that we always use a single bio to read an extent_buffer, the buffer
can be passed to the end_io handler as private data.  This allows
implementing a much simplified dedicated end I/O handler for metadata
reads.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
e194931076 btrfs: remove the mirror_num argument to btrfs_submit_compressed_read
Given that read recovery for data I/O is handled in the storage layer,
the mirror_num argument to btrfs_submit_compressed_read is always 0,
so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
b78b98e06f btrfs: don't use btrfs_bio_ctrl for extent buffer reading
The btrfs_bio_ctrl machinery is overkill for reading extent_buffers
as we always operate on PAGE_SIZE chunks (or one smaller one for the
subpage case) that are contiguous and are guaranteed to fit into a
single bio.  Replace it with open coded btrfs_bio_alloc, __bio_add_page
and btrfs_submit_bio calls in a helper function shared between
the subpage and node size >= PAGE_SIZE cases.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
e95382834c btrfs: always read the entire extent_buffer
Currently read_extent_buffer_pages skips pages that are already uptodate
when reading in an extent_buffer.  While this reduces the amount of data
read, it increases the number of I/O operations as we now need to do
multiple I/Os when reading an extent buffer with one or more uptodate
pages in the middle of it.  On any modern storage device, be that hard
drives or SSDs this actually decreases I/O performance.  Fortunately
this case is pretty rare as the pages are always initially read together
and then aged the same way.  Besides simplifying the code a bit as-is
this will allow for major simplifications to the I/O completion handler
later on.

Note that the case where all pages are uptodate is still handled by an
optimized fast path that does not read any data from disk.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
d87e6575e9 btrfs: merge verify_parent_transid and btrfs_buffer_uptodate
verify_parent_transid is only called by btrfs_buffer_uptodate, which
confusingly inverts the return value.  Merge the two functions and
reflow the parent_transid so that error handling is in a branch.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
aebcc1596b btrfs: move setting the buffer uptodate out of validate_extent_buffer
Setting the buffer uptodate in a function that is named as a validation
helper is a it confusing.  Move the call from validate_extent_buffer to
the one of its two callers that didn't already have a duplicate call
to set_extent_buffer_uptodate.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
243984b3b9 btrfs: subpage: fix error handling in end_bio_subpage_eb_writepage
Call btrfs_page_clear_uptodate instead of ClearPageUptodate to properly
manage the uptodate bit for the subpage case.

Reported-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Christoph Hellwig
7f26fb1c13 btrfs: mark extent_buffer_under_io static
extent_buffer_under_io is only used in extent_io.c, so mark it static.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:27 +02:00
Qu Wenruo
edc728814f btrfs: trigger orphan inode cleanup during START_SYNC ioctl
There is an internal error report that scrub found an error in an orphan
inode's data.

However there are very limited ways to cleanup such orphan inodes:

- btrfs_start_pre_rw_mount()
  This happens at either mount, or RO->RW switch.
  This is not a viable solution for root fs which may not be unmounted
  or RO mounted.

  Furthermore this doesn't cover every subvolume, it only covers the
  currently cached subvolumes.

- btrfs_lookup_dentry()
  This happens when we first lookup the subvolume dentry.
  But dentry can be cached thus it's not ensured to be triggered every
  time.

- create_snapshot()
  This only happens for the created snapshot, not the source one.

This means if we didn't trigger orphan items cleanup, there is really no
other way to manually trigger it. Add this step to the START_SYNC ioctl.
This is a slight change in the semantics of the ioctl but as sync can be
potentially slow and is usually paired with WAIT_SYNC ioctl.

The errors are not handled because the main point of the ioctl is the
async commit, orphan cleanup is a side effect.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Filipe Manana
618d1d7da5 btrfs: fix comment referring to no longer existing btrfs_clean_tree_block()
There's a comment at btrfs_init_new_buffer() that refers to a function
named btrfs_clean_tree_block(), however the function was renamed to
btrfs_clear_buffer_dirty() in commit 190a83391b ("btrfs: rename
btrfs_clean_tree_block to btrfs_clear_buffer_dirty"). So update the
comment to refer to the current name.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Filipe Manana
59fcf38817 btrfs: change for_rename argument of btrfs_record_unlink_dir() to bool
The for_rename argument of btrfs_record_unlink_dir() is defined as an
integer, but the argument is in fact used as a boolean. So change it to
a boolean to make its use more clear.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Filipe Manana
acfb5a4f11 btrfs: remove pointless label and goto at btrfs_record_unlink_dir()
There's no point of having a label and goto at btrfs_record_unlink_dir()
because the function is trivial and can just return early if we are not
in a rename context. So remove the label and goto and instead return
early if we are not in a rename.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Filipe Manana
1e75ef039d btrfs: update comments at btrfs_record_unlink_dir() to be more clear
Update the comments at btrfs_record_unlink_dir() so that they mention
where new names are logged and where old names are removed. Also, while
at it make the width of the comments closer to 80 columns and capitalize
the sentences and finish them with punctuation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Filipe Manana
d67ba263f4 btrfs: use inode_logged() at btrfs_record_unlink_dir()
At btrfs_record_unlink_dir() we directly check the logged_trans field of
the given inodes to check if they were previously logged in the current
transaction, and if any of them were, then we can avoid setting the field
last_unlink_trans of the directory to the id of the current transaction if
we are in a rename path. Avoiding that can later prevent falling back to
a transaction commit if anyone attempts to log the directory.

However the logged_trans field, store in struct btrfs_inode, is transient,
not persisted in the inode item on its subvolume b+tree, so that means
that if an inode is evicted and then loaded again, its original value is
lost and it's reset to 0. So directly checking the logged_trans field can
lead to some false negative, and that only results in a performance impact
as mentioned before.

Instead of directly checking the logged_trans field of the inodes, use the
inode_logged() helper, which will check in the log tree if an inode was
logged before in case its logged_trans field has a value of 0. This way
we can avoid setting the directory inode's last_unlink_trans and cause
future logging attempts of it to fallback to transaction commits. The
following test script shows one example where this happens without this
patch:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  num_init_files=10000
  num_new_files=10000

  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  mkdir $MNT/testdir
  for ((i = 1; i <= $num_init_files; i++)); do
      echo -n > $MNT/testdir/file_$i
   done

  echo -n > $MNT/testdir/foo

  sync

  # Add some files so that there's more work in the transaction other
  # than just renaming file foo.
  for ((i = 1; i <= $num_new_files; i++)); do
      echo -n > $MNT/testdir/new_file_$i
  done

  # Change the file, fsync it.
  setfattr -n user.x1 -v 123 $MNT/testdir/foo
  xfs_io -c "fsync" $MNT/testdir/foo

  # Now triggger eviction of file foo but no eviction for our test
  # directory, since it is being used by the process below. This will
  # set logged_trans of the file's inode to 0 once it is loaded again.
  (
      cd $MNT/testdir
      while true; do
          :
      done
  ) &
  pid=$!

  echo 2 > /proc/sys/vm/drop_caches

  kill $pid
  wait $pid

  # Move foo out of our testdir. This will set last_unlink_trans
  # of the directory inode to the current transaction, because
  # logged_trans of both the directory and the file are set to 0.
  mv $MNT/testdir/foo $MNT/foo

  # Change file foo again and fsync it.
  # This fsync will result in a transaction commit because the rename
  # above has set last_unlink_trans of the parent directory to the id
  # of the current transaction and because our inode for file foo has
  # last_unlink_trans set to the current transaction, since it was
  # evicted and reloaded and it was previously modified in the current
  # transaction (the xattr addition).
  xfs_io -c "pwrite 0 64K" $MNT/foo
  start=$(date +%s%N)
  xfs_io -c "fsync" $MNT/foo
  end=$(date +%s%N)
  dur=$(( (end - start) / 1000000 ))

  echo "file fsync took: $dur milliseconds"

  umount $MNT

Before this patch:   fsync took 19 milliseconds
After this patch:    fsync took  5 milliseconds

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Filipe Manana
bf1f4fd3fa btrfs: use inode_logged() at need_log_inode()
At need_log_inode() we directly check the ->logged_trans field of the
given inode to check if it was previously logged in the transaction, with
the goal of skipping logging the inode again when it's not necessary.
The ->logged_trans field in not persisted in the inode item or elsewhere,
it's only stored in memory (struct btrfs_inode), so it's transient and
lost once the inode is evicted and then loaded again. Once an inode is
loaded, we are conservative and set ->logged_trans to 0, which may mean
that either the inode was never logged in the current transaction or it
was logged but evicted before being loaded again.

Instead of checking the inode's ->logged_trans field directly, we can
use instead the helper inode_logged(), which will really check if the
inode was logged before in the current transaction in case we have a
->logged_trans field with a value of 0. This will prevent unnecessarily
logging an inode when it's not needed, and in some cases preventing a
transaction commit, in case the logging requires a fallback to a
transaction commit. The following test script shows a scenario where
due to eviction we fallback a transaction commit when trying to fsync
a file that was renamed:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  num_init_files=10000
  num_new_files=10000

  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  mkdir $MNT/testdir
  for ((i = 1; i <= $num_init_files; i++)); do
      echo -n > $MNT/testdir/file_$i
  done

  echo -n > $MNT/testdir/foo

  sync

  # Add some files so that there's more work in the transaction other
  # than just renaming file foo.
  for ((i = 1; i <= $num_new_files; i++)); do
      echo -n > $MNT/testdir/new_file_$i
  done

  # Fsync the directory first.
  xfs_io -c "fsync" $MNT/testdir

  # Rename file foo.
  mv $MNT/testdir/foo $MNT/testdir/bar

  # Now trigger eviction of the test directory's inode.
  # Once loaded again, it will have logged_trans set to 0 and
  # last_unlink_trans set to the current transaction.
  echo 2 > /proc/sys/vm/drop_caches

  # Fsync file bar (ex-foo).
  # Before the patch the fsync would result in a transaction commit
  # because the inode for file bar has last_unlink_trans set to the
  # current transaction, so it will attempt to log the parent directory
  # as well, which will fallback to a full transaction commit because
  # it also has its last_unlink_trans set to the current transaction,
  # due to the inode eviction.
  start=$(date +%s%N)
  xfs_io -c "fsync" $MNT/testdir/bar
  end=$(date +%s%N)
  dur=$(( (end - start) / 1000000 ))

  echo "file fsync took: $dur milliseconds"

  umount $MNT

Before this patch:  fsync took 22 milliseconds
After this patch:   fsync took  8 milliseconds

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Jiapeng Chong
b9cb105e73 btrfs: scrub: remove more unused functions
These functions are defined in the scrub.c file, but last callers were
removed in e9255d6c40 ("btrfs: scrub: remove the old scrub recheck
code").

fs/btrfs/scrub.c:553:20: warning: unused function 'scrub_stripe_index_and_offset'.
fs/btrfs/scrub.c:543:19: warning: unused function 'scrub_nr_raid_mirrors'.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4937
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Qu Wenruo
b7f9945a14 btrfs: handle tree backref walk error properly
[BUG]
Smatch reports the following errors related to commit ("btrfs: output
affected files when relocation fails"):

	fs/btrfs/inode.c:283 print_data_reloc_error()
	error: uninitialized symbol 'ref_level'.

[CAUSE]
That part of code is mostly copied from scrub, but unfortunately scrub
code from the beginning is not doing the error handling properly.

The offending code looks like this:

	do {
		ret = tree_backref_for_extent();
		btrfs_warn_rl();
	} while (ret != 1);

There are several problems involved:

- No error handling
  If that tree_backref_for_extent() failed, we would output the same
  error again and again, never really exit as it requires ret == 1 to
  exit.

- Always do one extra output
  As tree_backref_for_extent() only return > 0 if there is no more
  backref item.
  This means after the last item we hit, we would output an invalid
  error message for ret > 0 case.

[FIX]
Fix the old code by:

- Move @ref_root and @ref_level into the if branch
  And do not initialize them, so we can catch such uninitialized values
  just like what we do in the inode.c

- Explicitly check the return value of tree_backref_for_extent()
  And handle ret < 0 and ret > 0 cases properly.

- No more do {} while () loop
  Instead go while (true) {} loop since we will handle @ret manually.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Christoph Hellwig
f880fe6e0b btrfs: don't hold an extra reference for redirtied buffers
When btrfs_redirty_list_add redirties a buffer, it also acquires
an extra reference that is released on transaction commit.  But
this is not required as buffers that are dirty or under writeback
are never freed (look for calls to extent_buffer_under_io())).

Remove the extra reference and the infrastructure used to drop it
again.

History behind redirty logic:

In the first place, it used releasing_list to hold all the
to-be-released extent buffers, and decided which buffers to re-dirty at
the commit time. Then, in a later version, the behaviour got changed to
re-dirty a necessary buffer and add re-dirtied one to the list in
btrfs_free_tree_block(). In short, the list was there mostly for the
patch series' historical reason.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ add Naohiro's comment regarding history ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Christoph Hellwig
f18cc97845 btrfs: fix dirty_metadata_bytes for redirtied buffers
dirty_metadata_bytes is decremented in both places that clear the dirty
bit in a buffer, but only incremented in btrfs_mark_buffer_dirty, which
means that a buffer that is redirtied using btrfs_redirty_list_add won't
be added to dirty_metadata_bytes, but it will be subtracted when written
out, leading an inconsistency in the counter.

Move the dirty_metadata_bytes from btrfs_mark_buffer_dirty into
set_extent_buffer_dirty to also account for the redirty case, and remove
the now unused set_extent_buffer_dirty return value.

Fixes: d3575156f6 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Johannes Thumshirn
bb5167e619 btrfs: unexport btrfs_run_discard_work and make it static
Mark btrfs_run_discard_work static and move it above its callers.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
016f9d0b74 btrfs: rename del_ptr to btrfs_del_ptr and export it
This exists internal to ctree.c, however btrfs check needs to use it for
some of its operations.  I'd rather not duplicate that code inside of
btrfs check as this is low level and I want to keep this code in one
place, so rename the function to btrfs_del_ptr and export it so that it
can be used inside of btrfs-progs safely.  Add a comment to make sure
this doesn't get removed by a future cleanup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
b3cbfb0dd4 btrfs: add a btrfs_csum_type_size helper
This is needed in btrfs-progs for the tools that convert the checksum
types for file systems and a few other things.  We don't have it in the
kernel as we just want to get the size for the super blocks type.
However I don't want to have to manually add this every time we sync
ctree.c into btrfs-progs, so add the helper in the kernel with a note so
it doesn't get removed by a later cleanup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
a95b7f9360 btrfs: add __KERNEL__ check for btrfs_no_printk
We want to override this in btrfs-progs, so wrap this in the __KERNEL__
check so we can easily sync this to btrfs-progs and have our local
version of btrfs_no_printk do the work.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
f541833c8e btrfs: move split_flags/combine_flags helpers to inode-item.h
These are more related to the inode item flags on disk than the
in-memory btrfs_inode, move the helpers to inode-item.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
2cac5af165 btrfs: move btrfs_verify_level_key into tree-checker.c
This is more a buffer validation helper, move it into the tree-checker
files where it makes more sense.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
c26fa931eb btrfs: add __btrfs_check_node helper
This helper returns a btrfs_tree_block_status for the various errors,
and then btrfs_check_node() will return -EUCLEAN if it gets anything
other than BTRFS_TREE_BLOCK_CLEAN which will be used by the kernel.  In
the future btrfs-progs will use this helper instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
924452c80e btrfs: extend btrfs_leaf_check to return btrfs_tree_block_status
Instead of blanket returning -EUCLEAN for all the failures in
btrfs_check_leaf, use btrfs_tree_block_status and return the appropriate
status for each failure.  Rename the helper to __btrfs_check_leaf and
then make a wrapper of btrfs_check_leaf that will return -EUCLEAN to
non-clean error codes.  This will allow us to have the
__btrfs_check_leaf variant in btrfs-progs while keeping the behavior in
the kernel consistent.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
c8d5421563 btrfs: use btrfs_tree_block_status for leaf item errors
We have a variety of item specific errors that can occur.  For now
simply put these under the umbrella of BTRFS_TREE_BLOCK_INVALID_ITEM,
this can be fleshed out as we need in the future.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
a7b4e6c7aa btrfs: add btrfs_tree_block_status definitions to tree-checker.h
We use this in btrfs-progs to determine if we can fix different types of
corruptions.  We don't care about this in the kernel, however it would
be good to share this code between the kernel and btrfs-progs, so add
the status definitions so we can start converting the tree-checker code
over to using these status flags instead of blanket returning -EUCLEAN.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
85d8a826c7 btrfs: simplify btrfs_check_leaf_* helpers into a single helper
We have two helpers for checking leaves, because we have an extra check
for debugging in btrfs_mark_buffer_dirty(), and at that stage we may
have item data that isn't consistent yet.  However we can handle this
case internally in the helper, if BTRFS_HEADER_FLAG_WRITTEN is set we
know the buffer should be internally consistent, otherwise we need to
skip checking the item data.

Simplify this helper down a single helper and handle the item data
checking logic internally to the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Josef Bacik
4aec05fa5a btrfs: remove level argument from btrfs_set_block_flags
We just pass in btrfs_header_level(eb) for the level, and we're passing
in the eb already, so simply get the level from the eb inside of
btrfs_set_block_flags.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Josef Bacik
54d687c13a btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c
This is completely related to block rsv's, move it out of the free space
cache code and into block-rsv.c.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Qu Wenruo
94ead93e63 btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read
For P/Q stripe scrub, we have quite some duplicated read IO:

- Data stripes read for verification
  This is triggered by the scrub_submit_initial_read() inside
  scrub_raid56_parity_stripe().

- Data stripes read (again) for P/Q stripe verification
  This is triggered by scrub_assemble_read_bios() from scrub_rbio().

  Although we can have hit rbio cache and avoid unnecessary read, the
  chance is very low, as scrub would easily flush the whole rbio cache.

This means, even we're just scrubbing a single P/Q stripe, we would read
the data stripes twice for the best case scenario.  If we need to
recover some data stripes, it would cause more reads on the same data
stripes, again and again.

However before we call raid56_parity_submit_scrub_rbio() we already
have all data stripes repaired and their contents ready to use.
But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
itself still needs to re-read the data stripes.

To avoid such cache miss, this patch would:

- Introduce a new helper, raid56_parity_cache_data_pages()
  This function would grab the pages from an array, and copy the content
  to the rbio, marking all the involved sectors uptodate.

  The page copy is unavoidable because of the cache pages of rbio are all
  self managed, thus can not utilize outside pages without screwing up
  the lifespan.

- Use the repaired data stripes as cache inside
  scrub_raid56_parity_stripe()

By this, we ensure all the data sectors of the scrub rbio are already
uptodate, and no need to read them again from disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
7e5ba55994 btrfs: assert tree lock is held when removing free space entries
Removing a free space entry from an in memory space cache requires having
the corresponding btrfs_free_space_ctl's 'tree_lock' held. We have several
code paths that remove an entry, so add assertions where appropriate to
verify we are holding the lock, as the lock is acquired by some other
function up in the call chain, which makes it easy to miss in the future.

Note: for this to work we need to lock the local btrfs_free_space_ctl at
load_free_space_cache(), which was not being done because it's local,
declared on the stack, so no other task has access to it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
9649bd9a29 btrfs: assert tree lock is held when linking free space
When linking a free space entry, at link_free_space(), the caller should
be holding the spinlock 'tree_lock' of the given btrfs_free_space_ctl
argument, which is necessary for manipulating the red black tree of free
space entries (done by tree_insert_offset(), which already asserts the
lock is held) and for manipulating the 'free_space', 'free_extents',
'discardable_extents' and 'discardable_bytes' counters of the given
struct btrfs_free_space_ctl.

So assert that the spinlock 'tree_lock' of the given btrfs_free_space_ctl
is held by the current task. We have multiple code paths that end up
calling link_free_space(), and all currently take the lock before calling
it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
91de9e978d btrfs: assert tree lock is held when searching for free space entries
When searching for a free space entry by offset, at tree_search_offset(),
we are supposed to have the btrfs_free_space_ctl's 'tree_lock' held, so
assert that. We have multiple callers of tree_search_offset(), and all
currently hold the necessary lock before calling it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
13c2018fcc btrfs: assert proper locks are held at tree_insert_offset()
There are multiple code paths leading to tree_insert_offset(), and each
path takes the necessary locks before tree_insert_offset() is called,
since they do other things that require those locks to be held. This makes
it easy to miss the locking somewhere, so make tree_insert_offset() assert
that the required locks are being held by the calling task.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
0d6bac4d30 btrfs: simplify arguments to tree_insert_offset()
For the in-memory component of space caching (free space cache and free
space tree), three of the arguments passed to tree_insert_offset() can
always be taken from the new free space entry that we are about to add.

So simplify tree_insert_offset() to take the new entry instead of the
'offset', 'node' and 'bitmap' arguments. This will also allow to make
further changes simpler.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
b77433b144 btrfs: use precomputed end offsets at do_trimming()
The are two computations of end offsets at do_trimming() that are not
necessary, as they were previously computed and stored in local const
variables. So just use the variables instead, to make the source code
shorter and easier to read.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
9085f42571 btrfs: avoid searching twice for previous node when merging free space entries
At try_merge_free_space(), avoid calling twice rb_prev() to find the
previous node, as that requires looping through the red black tree, so
store the result of the rb_prev() call and then use it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Filipe Manana
fbb2e654d8 btrfs: avoid extra memory allocation when copying free space cache
At copy_free_space_cache(), we add a new entry to the block group's ctl
before we free the entry from the temporary ctl. Adding a new entry
requires the allocation of a new struct btrfs_free_space, so we can
avoid a temporary extra allocation by freeing the entry from the
temporary ctl before we add a new entry to the main ctl, which possibly
also reduces the chances for a memory allocation failure in case of very
high memory pressure. So just do that.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Tom Rix
12df6a622e btrfs: simplify transid initialization in btrfs_ioctl_wait_sync
A small code simplification, move the default value of transid to its
initialization and remove the else-statement.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
b9a9a85059 btrfs: output affected files when relocation fails
[PROBLEM]
When relocation fails (mostly due to checksum mismatch), we only got
very cryptic error messages like:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

The end user has to decipher the above messages and use various tools to
locate the affected files and find a way to fix the problem (mostly
deleting the file).  This is not an easy work even for experienced
developer, not to mention the end users.

[SCRUB IS DOING BETTER]
By contrast, scrub is providing much better error messages:

  BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
  BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
  BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0

Which provides the affected files directly to the end user.

[IMPROVEMENT]
Instead of the generic data checksum error messages, which is not doing
a good job for data reloc inodes, this patch introduce a scrub like
backref walking based solution.

When a sector fails its checksum for data reloc inode, we go the
following workflow:

- Get the real logical bytenr
  For data reloc inode, the file offset is the offset inside the block
  group.
  Thus the real logical bytenr is @file_off + @block_group->start.

- Do an extent type check
  If it's tree blocks it's much easier to handle, just go through
  all the tree block backref.

- Do a backref walk and inode path resolution for data extents
  This is mostly the same as scrub.
  But unfortunately we can not reuse the same function as the output
  format is different.

Now the new output would be more user friendly:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file)
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
8bfec2e426 btrfs: remove hipri_workers workqueue
Now that btrfs_wq_submit_bio is never called for synchronous I/O,
the hipri_workers workqueue is not used anymore and can be removed.

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
e917ff56c8 btrfs: determine synchronous writers from bio or writeback control
The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and thus information
is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
helper.

Use that information to decide if checksums calculation is offloaded to
a workqueue instead of btrfs_inode::sync_writers field that not only
bloats the inode but also has too wide scope, being inode wide instead
of limited to the actual writeback request.

The sync writes were set in:

- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
  mode
- btrfs_write_marked_extents - write marked extents, writeback with
  WB_SYNC_ALL mode

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
da02361807 btrfs: submit IO synchronously for fast checksum implementations
Most modern hardware supports very fast accelerated crc32c calculation.
If that is supported the CPU overhead of the checksum calculation is
very limited, and offloading the calculation to special worker threads
has a lot of overhead for no gain.

E.g. on an Intel Optane device is actually very much slows down even
1M buffered writes with fio:

Unpatched:

write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

With synchronous CRCs:

write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

With a lot of variation during the unpatched run going down as low as
1100MB/s, while the synchronous CRC version has about the same peak write
speed but much lower dips, and fewer kworkers churning around.
Both tests had fio saturated at 100% CPU.

(thanks to Jens Axboe via Chris Mason for the benchmarking)

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
adbe7e388e btrfs: use SECTOR_SHIFT to convert LBA to physical offset
Using SECTOR_SHIFT to convert LBA to physical address makes it more
readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
29e70be261 btrfs: use SECTOR_SHIFT to convert physical offset to LBA
Use SECTOR_SHIFT while converting a physical address to an LBA, makes
it more readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
eee3b81178 btrfs: improve leaf dump and error handling
Improve the leaf dump behavior by:

- Always dump the leaf first, then the error message

- Output the slot number if possible
  Especially in __btrfs_free_extent() the leaf dump of extent tree can
  be pretty large.
  With an extra slot number it's much easier to locate the problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
6c75a589cb btrfs: print-tree: pass const extent buffer pointer
Since print-tree infrastructure only prints the content of a tree block,
we can make them to accept const extent buffer pointer.

This removes a forced type convert in extent-tree, where we convert a
const extent buffer pointer to regular one, just to avoid compiler
warning.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Naohiro Aota
b5345d6cee btrfs: export bitmap_test_range_all_{set,zero}
bitmap_test_range_all_{set,zero} defined in subpage.c are useful for other
components. Move them to misc.h and use them in zoned.c. Also, as
find_next{,_zero}_bit take/return "unsigned long" instead of "unsigned
int", convert the type to "unsigned long".

While at it, also rewrite the "if (...) return true; else return false;"
pattern and add const to the input bitmap.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
88ad95b055 btrfs: tag as unlikely the key comparison when checking sibling keys
When checking siblings keys, before moving keys from one node/leaf to a
sibling node/leaf, it's very unexpected to have the last key of the left
sibling greater than or equals to the first key of the right sibling, as
that means we have a (serious) corruption that breaks the key ordering
properties of a b+tree. Since this is unexpected, surround the comparison
with the unlikely macro, which helps the compiler generate better code
for the most expected case (no existing b+tree corruption). This is also
what we do for other unexpected cases of invalid key ordering (like at
btrfs_set_item_key_safe()).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
f2db4d5cb4 btrfs: make btrfs_free_device() static
The function btrfs_free_device() is never used outside of volumes.c, so
make it static and remove its prototype declaration at volumes.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Sweet Tea Dorminy
1b53e51a4a btrfs: don't commit transaction for every subvol create
Recently a Meta-internal workload encountered subvolume creation taking
up to 2s each, significantly slower than directory creation. As they
were hoping to be able to use subvolumes instead of directories, and
were looking to create hundreds, this was a significant issue. After
Josef investigated, it turned out to be due to the transaction commit
currently performed at the end of subvolume creation.

This change improves the workload by not doing transaction commit for every
subvolume creation, and merely requiring a transaction commit on fsync.
In the worst case, of doing a subvolume create and fsync in a loop, this
should require an equal amount of time to the current scheme; and in the
best case, the internal workload creating hundreds of subvolumes before
fsyncing is greatly improved.

While it would be nice to be able to use the log tree and use the normal
fsync path, log tree replay can't deal with new subvolume inodes
presently.

It's possible that there's some reason that the transaction commit is
necessary for correctness during subvolume creation; however,
git logs indicate that the commit dates back to the beginning of
subvolume creation, and there are no notes on why it would be necessary.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
f469c8bd90 btrfs: unexport btrfs_prev_leaf()
btrfs_prev_leaf() is not used outside ctree.c, so there's no need to
export it at ctree.h - just make it static at ctree.c and move its
definition above btrfs_search_slot_for_read(), since that function
calls it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Christian Brauner
1784fbc2ed ovl: port to new mount api
We recently ported util-linux to the new mount api. Now the mount(8)
tool will by default use the new mount api. While trying hard to fall
back to the old mount api gracefully there are still cases where we run
into issues that are difficult to handle nicely.

Now with mount(8) and libmount supporting the new mount api I expect an
increase in the number of bug reports and issues we're going to see with
filesystems that don't yet support the new mount api. So it's time we
rectify this.

When ovl_fill_super() fails before setting sb->s_root, we need to cleanup
sb->s_fs_info.  The logic is a bit convoluted but tl;dr: If sget_fc() has
succeeded fc->s_fs_info will have been transferred to sb->s_fs_info.
So by the time ->fill_super()/ovl_fill_super() is called fc->s_fs_info
is NULL consequently fs_context->free() won't call ovl_free_fs().

If we fail before sb->s_root() is set then ->put_super() won't be called
which would call ovl_free_fs(). IOW, if we fail in ->fill_super() before
sb->s_root we have to clean it up.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:01 +03:00
Amir Goldstein
ac519625ed ovl: factor out ovl_parse_options() helper
For parsing a single mount option.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:01 +03:00
Amir Goldstein
af5f2396b6 ovl: store enum redirect_mode in config instead of a string
Do all the logic to set the mode during mount options parsing and
do not keep the option string around.

Use a constant_table to translate from enum redirect mode to string
in preperation for new mount api option parsing.

The mount option "off" is translated to either "follow" or "nofollow",
depending on the "redirect_always_follow" build/module config, so
in effect, there are only three possible redirect modes.

This results in a minor change to the string that is displayed
in show_options() - when redirect_dir is enabled by default and the user
mounts with the option "redirect_dir=off", instead of displaying the mode
"redirect_dir=off" in show_options(), the displayed mode will be either
"redirect_dir=follow" or "redirect_dir=nofollow", depending on the value
of "redirect_always_follow" build/module config.

The displayed mode reflects the effective mode, so mounting overlayfs
again with the dispalyed redirect_dir option will result with the same
effective and displayed mode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:01 +03:00
Amir Goldstein
dcb399de1e ovl: pass ovl_fs to xino helpers
Internal ovl methods should use ovl_fs and not sb as much as
possible.

Use a constant_table to translate from enum xino mode to string
in preperation for new mount api option parsing.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:00 +03:00
Amir Goldstein
367d002d6c ovl: clarify ovl_get_root() semantics
Change the semantics to take a reference on upperdentry instead
of transferrig the reference.

This is needed for upcoming port to new mount api.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:00 +03:00
Amir Goldstein
e4599d4b1a ovl: negate the ofs->share_whiteout boolean
The default common case is that whiteout sharing is enabled.
Change to storing the negated no_shared_whiteout state, so we will not
need to initialize it.

This is the first step towards removing all config and feature
initializations out of ovl_fill_super().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:00 +03:00
Christian Brauner
f723edb8a5 ovl: check type and offset of struct vfsmount in ovl_entry
Porting overlayfs to the new amount api I started experiencing random
crashes that couldn't be explained easily. So after much debugging and
reasoning it became clear that struct ovl_entry requires the point to
struct vfsmount to be the first member and of type struct vfsmount.

During the port I added a new member at the beginning of struct
ovl_entry which broke all over the place in the form of random crashes
and cache corruptions. While there's a comment in ovl_free_fs() to the
effect of "Hack! Reuse ofs->layers as a vfsmount array before freeing
it" there's no such comment on struct ovl_entry which makes this easy to
trip over.

Add a comment and two static asserts for both the offset and the type of
pointer in struct ovl_entry.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:00 +03:00
Amir Goldstein
42dd69ae1a ovl: implement lazy lookup of lowerdata in data-only layers
Defer lookup of lowerdata in the data-only layers to first data access
or before copy up.

We perform lowerdata lookup before copy up even if copy up is metadata
only copy up.  We can further optimize this lookup later if needed.

We do best effort lazy lookup of lowerdata for d_real_inode(), because
this interface does not expect errors.  The only current in-tree caller
of d_real_inode() is trace_uprobe and this caller is likely going to be
followed reading from the file, before placing uprobes on offset within
the file, so lowerdata should be available when setting the uprobe.

Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:14 +03:00
Amir Goldstein
4166564478 ovl: prepare for lazy lookup of lowerdata inode
Make the code handle the case of numlower > 1 and missing lowerdata
dentry gracefully.

Missing lowerdata dentry is an indication for lazy lookup of lowerdata
and in that case the lowerdata_redirect path is stored in ovl_inode.

Following commits will defer lookup and perform the lazy lookup on
access.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:14 +03:00
Amir Goldstein
2b21da9208 ovl: prepare to store lowerdata redirect for lazy lowerdata lookup
Prepare to allow ovl_lookup() to leave the last entry in a non-dir
lowerstack empty to signify lazy lowerdata lookup.

In this case, ovl_lookup() stores the redirect path from metacopy to
lowerdata in ovl_inode, which is going to be used later to perform the
lazy lowerdata lookup.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:14 +03:00
Amir Goldstein
5436ab0a86 ovl: implement lookup in data-only layers
Lookup in data-only layers only for a lower metacopy with an absolute
redirect xattr.

The metacopy xattr is not checked on files found in the data-only layers
and redirect xattr are not followed in the data-only layers.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
37ebf056d6 ovl: introduce data-only lower layers
Introduce the format lowerdir=lower1:lower2::lowerdata1::lowerdata2
where the lower layers on the right of the :: separators are not merged
into the overlayfs merge dirs.

Data-only lower layers are only allowed at the bottom of the stack.

The files in those layers are only meant to be accessible via absolute
redirect from metacopy files in lower layers.  Following changes will
implement lookup in the data layers.

This feature was requested for composefs ostree use case, where the
lower data layer should only be accessiable via absolute redirects
from metacopy inodes.

The lower data layers are not required to a have a unique uuid or any
uuid at all, because they are never used to compose the overlayfs inode
st_ino/st_dev.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
9e88f90524 ovl: remove unneeded goto instructions
There is nothing in the out goto target of ovl_get_layers().

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
ab1eb5ffb7 ovl: deduplicate lowerdata and lowerstack[]
The ovl_inode contains a copy of lowerdata in lowerstack[], so the
lowerdata inode member can be removed.

Use accessors ovl_lowerdata*() to get the lowerdata whereever the member
was accessed directly.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
ac900ed4f2 ovl: deduplicate lowerpath and lowerstack[]
The ovl_inode contains a copy of lowerpath in lowerstack[0], so the
lowerpath member can be removed.

Use accessor ovl_lowerpath() to get the lowerpath whereever the member
was accessed directly.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
0af950f57f ovl: move ovl_entry into ovl_inode
The lower stacks of all the ovl inode aliases should be identical
and there is redundant information in ovl_entry and ovl_inode.

Move lowerstack into ovl_inode and keep only the OVL_E_FLAGS
per overlay dentry.

Following patches will deduplicate redundant ovl_inode fields.

Note that for pure upper and negative dentries, OVL_E(dentry) may be
NULL now, so it is imporatnt to use the ovl_numlower() accessor.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
163db0da35 ovl: factor out ovl_free_entry() and ovl_stack_*() helpers
In preparation for moving lowerstack into ovl_inode.

Note that in ovl_lookup() the temp stack dentry refs are now cloned
into the final ovl_lowerstack instead of being transferred, so cleanup
always needs to call ovl_stack_free(stack).

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
5522c9c7cb ovl: use ovl_numlower() and ovl_lowerstack() accessors
This helps fortify against dereferencing a NULL ovl_entry,
before we move the ovl_entry reference into ovl_inode.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
a6ff2bc0be ovl: use OVL_E() and OVL_E_FLAGS() accessors
Instead of open coded instances, because we are about to split
the two apart.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:12 +03:00
Amir Goldstein
b07d5cc93e ovl: update of dentry revalidate flags after copy up
After copy up, we may need to update d_flags if upper dentry is on a
remote fs and lower dentries are not.

Add helpers to allow incremental update of the revalidate flags.

Fixes: bccece1ead ("ovl: allow remote upper")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:12 +03:00
Zhihao Cheng
f4e19e595c ovl: fix null pointer dereference in ovl_get_acl_rcu()
Following process:
         P1                     P2
 path_openat
  link_path_walk
   may_lookup
    inode_permission(rcu)
     ovl_permission
      acl_permission_check
       check_acl
        get_cached_acl_rcu
	 ovl_get_inode_acl
	  realinode = ovl_inode_real(ovl_inode)
	                      drop_cache
		               __dentry_kill(ovl_dentry)
				iput(ovl_inode)
		                 ovl_destroy_inode(ovl_inode)
		                  dput(oi->__upperdentry)
		                   dentry_kill(upperdentry)
		                    dentry_unlink_inode
				     upperdentry->d_inode = NULL
	    ovl_inode_upper
	     upperdentry = ovl_i_dentry_upper(ovl_inode)
	     d_inode(upperdentry) // returns NULL
	  IS_POSIXACL(realinode) // NULL pointer dereference
, will trigger an null pointer dereference at realinode:
  [  205.472797] BUG: kernel NULL pointer dereference, address:
                 0000000000000028
  [  205.476701] CPU: 2 PID: 2713 Comm: ls Not tainted
                 6.3.0-12064-g2edfa098e750-dirty #1216
  [  205.478754] RIP: 0010:do_ovl_get_acl+0x5d/0x300
  [  205.489584] Call Trace:
  [  205.489812]  <TASK>
  [  205.490014]  ovl_get_inode_acl+0x26/0x30
  [  205.490466]  get_cached_acl_rcu+0x61/0xa0
  [  205.490908]  generic_permission+0x1bf/0x4e0
  [  205.491447]  ovl_permission+0x79/0x1b0
  [  205.491917]  inode_permission+0x15e/0x2c0
  [  205.492425]  link_path_walk+0x115/0x550
  [  205.493311]  path_lookupat.isra.0+0xb2/0x200
  [  205.493803]  filename_lookup+0xda/0x240
  [  205.495747]  vfs_fstatat+0x7b/0xb0

Fetch a reproducer in [Link].

Use the helper ovl_i_path_realinode() to get realinode and then do
non-nullptr checking.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217404
Fixes: 332f606b32 ("ovl: enable RCU'd ->get_acl()")
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Christian Brauner <brauner@kernel.org>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:12 +03:00
Zhihao Cheng
1a73f5b8f0 ovl: fix null pointer dereference in ovl_permission()
Following process:
          P1                     P2
 path_lookupat
  link_path_walk
   inode_permission
    ovl_permission
      ovl_i_path_real(inode, &realpath)
        path->dentry = ovl_i_dentry_upper(inode)
                          drop_cache
			   __dentry_kill(ovl_dentry)
		            iput(ovl_inode)
		             ovl_destroy_inode(ovl_inode)
		              dput(oi->__upperdentry)
		               dentry_kill(upperdentry)
		                dentry_unlink_inode
				 upperdentry->d_inode = NULL
      realinode = d_inode(realpath.dentry) // return NULL
      inode_permission(realinode)
       inode->i_sb  // NULL pointer dereference
, will trigger an null pointer dereference at realinode:
  [  335.664979] BUG: kernel NULL pointer dereference,
                 address: 0000000000000002
  [  335.668032] CPU: 0 PID: 2592 Comm: ls Not tainted 6.3.0
  [  335.669956] RIP: 0010:inode_permission+0x33/0x2c0
  [  335.678939] Call Trace:
  [  335.679165]  <TASK>
  [  335.679371]  ovl_permission+0xde/0x320
  [  335.679723]  inode_permission+0x15e/0x2c0
  [  335.680090]  link_path_walk+0x115/0x550
  [  335.680771]  path_lookupat.isra.0+0xb2/0x200
  [  335.681170]  filename_lookup+0xda/0x240
  [  335.681922]  vfs_statx+0xa6/0x1f0
  [  335.682233]  vfs_fstatat+0x7b/0xb0

Fetch a reproducer in [Link].

Use the helper ovl_i_path_realinode() to get realinode and then do
non-nullptr checking.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217405
Fixes: 4b7791b2e9 ("ovl: handle idmappings in ovl_permission()")
Cc: <stable@vger.kernel.org> # v5.19
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Christian Brauner <brauner@kernel.org>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:12 +03:00
Zhihao Cheng
b2dd05f107 ovl: let helper ovl_i_path_real() return the realinode
Let helper ovl_i_path_real() return the realinode to prepare for
checking non-null realinode in RCU walking path.

[msz] Use d_inode_rcu() since we are depending on the consitency
between dentry and inode being non-NULL in an RCU setting.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Fixes: ffa5723c6d ("ovl: store lower path in ovl_inode")
Cc: <stable@vger.kernel.org> # v5.19
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:00:38 +03:00
Chuck Lever
5e092be741 NFSD: Distinguish per-net namespace initialization
I find the naming of nfsd_init_net() and nfsd_startup_net() to be
confusingly similar. Rename the namespace initialization and tear-
down ops and add comments to distinguish their separate purposes.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-18 12:02:52 -04:00
Jeff Layton
ed9ab7346e nfsd: move init of percpu reply_cache_stats counters back to nfsd_init_net
Commit f5f9d4a314 ("nfsd: move reply cache initialization into nfsd
startup") moved the initialization of the reply cache into nfsd startup,
but didn't account for the stats counters, which can be accessed before
nfsd is ever started. The result can be a NULL pointer dereference when
someone accesses /proc/fs/nfsd/reply_cache_stats while nfsd is still
shut down.

This is a regression and a user-triggerable oops in the right situation:

- non-x86_64 arch
- /proc/fs/nfsd is mounted in the namespace
- nfsd is not started in the namespace
- unprivileged user calls "cat /proc/fs/nfsd/reply_cache_stats"

Although this is easy to trigger on some arches (like aarch64), on
x86_64, calling this_cpu_ptr(NULL) evidently returns a pointer to the
fixed_percpu_data. That struct looks just enough like a newly
initialized percpu var to allow nfsd_reply_cache_stats_show to access
it without Oopsing.

Move the initialization of the per-net+per-cpu reply-cache counters
back into nfsd_init_net, while leaving the rest of the reply cache
allocations to be done at nfsd startup time.

Kudos to Eirik who did most of the legwork to track this down.

Cc: stable@vger.kernel.org # v6.3+
Fixes: f5f9d4a314 ("nfsd: move reply cache initialization into nfsd startup")
Reported-and-tested-by: Eirik Fuller <efuller@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2215429
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-18 12:02:40 -04:00
Joel Granados
2f2665c13a sysctl: replace child with an enumeration
This is part of the effort to remove the empty element at the end of
ctl_table structs. "child" was a deprecated elem in this struct and was
being used to differentiate between two types of ctl_tables: "normal"
and "permanently emtpy".

What changed?:
* Replace "child" with an enumeration that will have two values: the
  default (0) and the permanently empty (1). The latter is left at zero
  so when struct ctl_table is created with kzalloc or in a local
  context, it will have the zero value by default. We document the
  new enum with kdoc.
* Remove the "empty child" check from sysctl_check_table
* Remove count_subheaders function as there is no longer a need to
  calculate how many headers there are for every child
* Remove the recursive call to unregister_sysctl_table as there is no
  need to traverse down the child tree any longer
* Add a new SYSCTL_PERM_EMPTY_DIR binary flag
* Remove the last remanence of child from partport/procfs.c

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-18 02:32:54 -07:00
Joel Granados
94a6490518 sysctl: Remove debugging dump_stack
Remove unneeded dump_stack in __register_sysctl_table

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-18 02:32:54 -07:00
Jingbo Xu
f02615eb6f erofs: use separate xattr parsers for listxattr/getxattr
There's a callback styled xattr parser, i.e. xattr_foreach(), which is
shared among listxattr and getxattr.

Convert it to two separate xattr parsers to serve listxattr and getxattr
for better readability.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-6-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:54 +08:00
Jingbo Xu
4b077b5012 erofs: unify inline/shared xattr iterators for listxattr/getxattr
Make inline_{list,get}xattr() as well as inline_xattr_iter_begin()
unified as erofs_xattr_iter_inline(), and shared_{list,get}xattr()
unified as erofs_xattr_iter_shared().

After these changes, both erofs_xattr_iter_{inline,shared}() return 0 on
success, and negative error on failure.

One thing worth noting is that, the logic of returning it->buffer_ofs
when there's no shared xattrs in shared_listxattr() is moved to
erofs_listxattr() to make the unification possible.  The only difference
is that, semantically the old behavior will return ENOATTR rather than
it->buffer_ofs if ENOATTR encountered when listxattr is parsing upon a
specific shared xattr, while now the new behavior will return
it->buffer_ofs in this case.  This is not an issue, as listxattr upon a
specific xattr won't return ENOATTR.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-5-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:54 +08:00
Jingbo Xu
5a8ffb1975 erofs: make the size of read data stored in buffer_ofs
Since now xattr_iter structures have been unified, make the size of the
read data stored in buffer_ofs.  Don't bother reusing buffer_size for
this use, which may be confusing.

This is in preparation for the following further cleanup.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-4-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:54 +08:00
Jingbo Xu
8e823961de erofs: unify xattr_iter structures
Unify xattr_iter/listxattr_iter/getxattr_iter structures into
erofs_xattr_iter structure.

This is in preparation for the following further cleanup.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-3-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:53 +08:00
Jingbo Xu
eba67eb6de erofs: use absolute position in xattr iterator
Replace blkaddr/ofs with pos in 'struct erofs_xattr_iter'.

After erofs_bread() is introduced to replace raw page cache APIs for
metadata I/Os handling, xattr_iter_fixup() is no longer needed anymore.

In addition, it is also unnecessary to check if the iterated position is
span over the block boundary as absolute offset is used instead of
blkaddr + offset pairs.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:53 +08:00
Gao Xiang
001b8ccd06 erofs: fix compact 4B support for 16k block size
In compact 4B, two adjacent lclusters are packed together as a unit to
form on-disk indexes for effective random access, as below:

(amortized = 4, vcnt = 2)
       _____________________________________________
      |___@_____ encoded bits __________|_ blkaddr _|
      0        .                                    amortized * vcnt = 8
      .             .
      .                  .              amortized * vcnt - 4 = 4
      .                        .
      .____________________________.
      |_type (2 bits)_|_clusterofs_|

Therefore, encoded bits for each pack are 32 bits (4 bytes). IOWs,
since each lcluster can get 16 bits for its type and clusterofs, the
maximum supported lclustersize for compact 4B format is 16k (14 bits).

Fix this to enable compact 4B format for 16k lclusters (blocks), which
is tested on an arm64 server with 16k page size.

Fixes: 152a333a58 ("staging: erofs: add compacted compression indexes support")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230601112341.56960-1-hsiangkao@linux.alibaba.com
2023-06-18 12:10:53 +08:00
Jingbo Xu
9c39ec0cff erofs: convert erofs_read_metabuf() to erofs_bread() for xattr
buf->inode is constant once initialized during erofs_buf's lifetime.
Thus call erofs_init_metabuf() and erofs_bread() separately to avoid
the repetition of assigning buf->inode when iterating xattrs.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230601024347.108469-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:53 +08:00
Gao Xiang
43d86ec936 erofs: use poison pointer to replace the hard-coded address
It's safer and cleaner to replace such hard-coded illegal pointer
with poison pointers.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-7-hsiangkao@linux.alibaba.com
2023-06-18 12:10:53 +08:00
Gao Xiang
7674a42f35 erofs: use struct lockref to replace handcrafted approach
Let's avoid the current handcrafted lockref although `struct lockref`
inclusion usually increases extra 4 bytes with an explicit spinlock if
CONFIG_DEBUG_SPINLOCK is off.

Apart from the size difference, note that the meaning of refcount is
also changed to active users. IOWs, it doesn't take an extra refcount
for XArray tree insertion.

I don't observe any significant performance difference at least on
our cloud compute server but the new one indeed simplifies the
overall codebase a bit.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230529123727.79943-1-hsiangkao@linux.alibaba.com
2023-06-18 12:10:47 +08:00
Chuck Lever
262176798b NFSD: Add an nfsd4_encode_nfstime4() helper
Clean up: de-duplicate some common code.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:06 -04:00
Namjae Jeon
5005bcb421 ksmbd: validate session id and tree id in the compound request
This patch validate session id and tree id in compound request.
If first operation in the compound is SMB2 ECHO request, ksmbd bypass
session and tree validation. So work->sess and work->tcon could be NULL.
If secound request in the compound access work->sess or tcon, It cause
NULL pointer dereferecing error.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21165
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-16 21:04:51 -05:00
Namjae Jeon
5fe7f7b782 ksmbd: fix out-of-bound read in smb2_write
ksmbd_smb2_check_message doesn't validate hdr->NextCommand. If
->NextCommand is bigger than Offset + Length of smb2 write, It will
allow oversized smb2 write length. It will cause OOB read in smb2_write.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21164
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-16 21:04:36 -05:00
Namjae Jeon
40b268d384 ksmbd: add mnt_want_write to ksmbd vfs functions
ksmbd is doing write access using vfs helpers. There are the cases that
mnt_want_write() is not called in vfs helper. This patch add missing
mnt_want_write() to ksmbd vfs functions.

Cc: stable@vger.kernel.org
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-16 21:04:19 -05:00
Namjae Jeon
2b9b8f3b68 ksmbd: validate command payload size
->StructureSize2 indicates command payload size. ksmbd should validate
this size with rfc1002 length before accessing it.
This patch remove unneeded check and add the validation for this.

[    8.912583] BUG: KASAN: slab-out-of-bounds in ksmbd_smb2_check_message+0x12a/0xc50
[    8.913051] Read of size 2 at addr ffff88800ac7d92c by task kworker/0:0/7
...
[    8.914967] Call Trace:
[    8.915126]  <TASK>
[    8.915267]  dump_stack_lvl+0x33/0x50
[    8.915506]  print_report+0xcc/0x620
[    8.916558]  kasan_report+0xae/0xe0
[    8.917080]  kasan_check_range+0x35/0x1b0
[    8.917334]  ksmbd_smb2_check_message+0x12a/0xc50
[    8.917935]  ksmbd_verify_smb_message+0xae/0xd0
[    8.918223]  handle_ksmbd_work+0x192/0x820
[    8.918478]  process_one_work+0x419/0x760
[    8.918727]  worker_thread+0x2a2/0x6f0
[    8.919222]  kthread+0x187/0x1d0
[    8.919723]  ret_from_fork+0x1f/0x30
[    8.919954]  </TASK>

Cc: stable@vger.kernel.org
Reported-by: Chih-Yen Chang <cc85nod@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-16 21:04:03 -05:00
David Howells
ba00b19067 afs: Fix vlserver probe RTT handling
In the same spirit as commit ca57f02295 ("afs: Fix fileserver probe
RTT handling"), don't rule out using a vlserver just because there
haven't been enough packets yet to calculate a real rtt.  Always set the
server's probe rtt from the estimate provided by rxrpc_kernel_get_srtt,
which is capped at 1 second.

This could lead to EDESTADDRREQ errors when accessing a cell for the
first time, even though the vl servers are known and have responded to a
probe.

Fixes: 1d4adfaf65 ("rxrpc: Make rxrpc_kernel_get_srtt() indicate validity")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2023-June/006746.html
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-16 14:43:41 -07:00
Linus Torvalds
4973ca2955 for-6.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSMg4YACgkQxWXV+ddt
 WDvNxg/9G45Lcn3YPYXicbzKcrrz4fpg4gqx9IX226DfJX78iZskl3LN1w+gFcj0
 gAKSC73ZZCGhIqrHOuWIbH5+BRO3FzTB9zr7tfx4H+pFWHs0BgYPqcoBjLTHZ/Pn
 2RYu+F922tGaPW7LZ2LtGlv+8Y4IDtWVe6uRyxSqv3dtF1jcgUfnJk2zJXG5z41R
 h1BSX7mcWUxUXbSJqTzAij7jyvbpnmy1BjsGDRG2G2J/AmvpUBtx1Gc3aKWhD2Up
 vNLQkl4OxbaW1t8CV9u6iGduS5mUAetOXoT2DTr3sSQMeA56Gpues/qb6qQVTbwb
 2cBnwQugZyz39yZkyvvopy6z2rasMmw6V/aPLKTLvPN/P+DYwU+bfcFuNa+LFxz4
 KJqGvZdrwDlhGc80+xjKhly4zLahAt0H+Y1yKjRK2RRx/TsXl4ufVc5hpq9rj8eK
 AoNvoZw9W3/L0juMUfZILhMbD2f7XGbUXlNhIXHCZsOZzuZBqNMNNv9d8b5ncbWE
 q6a5EJXzQzk13kiurVBZJoZokYxsUzEBsKeij4aaP1Rkw8r/62GvEt79Nu8X+67+
 cQyZ6CQ6eZ2PsPx9DtooCbAnH6huIPf9yagn5J2Li6H6VdvOlP6zIi7Tp33AhPdp
 1BMfaNq46l6Gxiu1pnclzSb8abVLb71ZxXNItEK/EkbH/uktaro=
 =NAyd
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two fixes for NOCOW files, a regression fix in scrub and an assertion
  fix:

   - NOCOW fixes:
      - keep length of iomap direct io request in case of a failure
      - properly pass mode of extent reference checking, this can break
        some cases for swapfile

   - fix error value confusion when scrubbing a stripe

   - convert assertion to a proper error handling when loading global
     roots, reported by syzbot"

* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: fix a return value overwrite in scrub_stripe()
  btrfs: do not ASSERT() on duplicated global roots
  btrfs: can_nocow_file_extent should pass down args->strict from callers
  btrfs: fix iomap_begin length for nocow writes
2023-06-16 12:41:56 -07:00
Christoph Hellwig
2e82f6c3bf splice: simplify a conditional in copy_splice_read
Check for -EFAULT instead of wrapping the check in an ret < 0 block.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:08 -06:00
Christoph Hellwig
0b24be4691 splice: don't call file_accessed in copy_splice_read
copy_splice_read calls into ->read_iter to read the data, which already
calls file_accessed.

Fixes: 33b3b04154 ("splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:08 -06:00
David Howells
ca2d49f77c splice, net: Fix splice_to_socket() to handle pipe bufs larger than a page
splice_to_socket() assumes that a pipe_buffer won't hold more than a single
page of data - but this assumption can be violated by skb_splice_bits()
when it splices from a socket into a pipe.

The problem is that splice_to_socket() doesn't advance the pipe_buffer
length and offset when transcribing from the pipe buf into a bio_vec, so if
the buf is >PAGE_SIZE, it keeps repeating the same initial chunk and
doesn't advance the tail index.  It then subtracts this from "remain" and
overcounts the amount of data to be sent.

The cleanup phase then tries to overclean the pipe, hits an unused pipe buf
and a NULL-pointer dereference occurs.

Fix this by not restricting the bio_vec size to PAGE_SIZE and instead
transcribing the entirety of each pipe_buffer into a single bio_vec and
advancing the tail index if remain hasn't hit zero yet.

Large bio_vecs will then be split up by iterator functions such as
iov_iter_extract_pages().

This resulted in a KASAN report looking like:

general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
...
RIP: 0010:pipe_buf_release include/linux/pipe_fs_i.h:203 [inline]
RIP: 0010:splice_to_socket+0xa91/0xe30 fs/splice.c:933

Fixes: 2dc334f1a6 ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()")
Reported-by: syzbot+f9e28a23426ac3b24f20@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/0000000000000900e905fdeb8e39@google.com/
Tested-by: syzbot+f9e28a23426ac3b24f20@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christian Brauner <brauner@kernel.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/1428985.1686737388@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-15 22:50:02 -07:00
Jakub Kicinski
173780ff18 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/mlx5/driver.h
  617f5db1a6 ("RDMA/mlx5: Fix affinity assignment")
  dc13180824 ("net/mlx5: Enable devlink port for embedded cpu VF vports")
https://lore.kernel.org/all/20230613125939.595e50b8@canb.auug.org.au/

tools/testing/selftests/net/mptcp/mptcp_join.sh
  47867f0a7e ("selftests: mptcp: join: skip check if MIB counter not supported")
  425ba80312 ("selftests: mptcp: join: support RM_ADDR for used endpoints or not")
  45b1a1227a ("mptcp: introduces more address related mibs")
  0639fa230a ("selftests: mptcp: add explicit check for new mibs")
https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-15 22:19:41 -07:00
Linus Torvalds
62d8779610 Fix two regressions in ext4, one report by syzkaller[1], and reported by
multiple users (and tracked by regzbot[2]).
 
 [1] https://syzkaller.appspot.com/bug?extid=4acc7d910e617b360859
 [2] https://linux-regtracking.leemhuis.info/regzbot/regression/ZIauBR7YiV3rVAHL@glitch/
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmSLPr8ACgkQ8vlZVpUN
 gaO08wgAlk1HqZub9iJm7ZQRx4qVb4zGYGL8So4IE/971FT8Vq3suJsQlTWpPVBB
 8O6H5bsq5F54wfjmY44aUyBjc1SEJL2zfrEeMxcsxH+1tEoEe8i+FT3BXVr6+Y7k
 qdZYrqmtbYpM2dyNW7LY4UDRQBE6ug+r6Cq1z2Mo7cfwfiBcYYe7lncbl3xJe7D8
 8DZLgjQg48wxvU5EOK2yHMHw0YfJySnmaCo7J7sU5TLZj91IEnlNGwAjYIy1GUBV
 3I+/+U70QR1VtZig7v4q6TI3LDHfegRc8zIIOtM3u1qs2bzLFgyoNURtut2vhF2+
 /XyXxFabmlpG7yp3UZVetCECKVa2Kg==
 =WJ4O
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix two regressions in ext4, one report by syzkaller[1], and reported
  by multiple users (and tracked by regzbot[2])"

[1] https://syzkaller.appspot.com/bug?extid=4acc7d910e617b360859
[2] https://linux-regtracking.leemhuis.info/regzbot/regression/ZIauBR7YiV3rVAHL@glitch/

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: drop the call to ext4_error() from ext4_get_group_info()
  Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"
2023-06-15 15:40:58 -07:00
Linus Torvalds
7a043feb6c eight, mostly small, smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmSKVUsACgkQiiy9cAdy
 T1EwhAv/bazgxHAjxmoZ8hxw21uH/Ufvf2Cvt49yFTjxiaOCoSPLovA5AyB1FlTz
 xsw/Cf6haO+RIYpLQ1KAIg7Qt3zwUu50Io2s/yyaHmQu7d+eNUPISajrJIjYOnqB
 tAVHav0XcvTaPAzyr9luqmBZnLaAeatK5CbrK+gT4xstW/cuTu/F4JUF8fOL++ww
 /IDTvo6UKqEKuPD8dpszXZnJjuj8aIsRIh8CBGI73QrDVExTzhrdmRULWqwSj0Lr
 nHsh8hgUZ25dSe/9jIagPqgvzaizvcyzopJuO8mpQAMDww9o1kBlud9GLdgNGnfM
 bDHHbn3Y9Fu5VaxMJ/n9lozMXT+OB+3ZrrBFP8J0unSidzsEzWKfYjE0gPXGKru8
 YhvjWKdJZjIOEu9QRuD4KK+MOF20cuq3OmvKpoZiEr2niB7oX25B+JddxW2CJD5a
 oAiP1+1/wiVe79KUlLLRe+OTtujUXtMiXPePzHh15dYetg+P4qJCBqWjDPdzRmdr
 MfMhsZpq
 =peSC
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Eight, mostly small, smb3 client fixes:

   - important fix for deferred close oops (race with unmount) found
     with xfstest generic/098 to some servers

   - important reconnect fix

   - fix problem with max_credits mount option

   - two multichannel (interface related) fixes

   - one trivial removal of confusing comment

   - two small debugging improvements (to better spot crediting
     problems)"

* tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: add a warning when the in-flight count goes negative
  cifs: fix lease break oops in xfstest generic/098
  cifs: fix max_credits implementation
  cifs: fix sockaddr comparison in iface_cmp
  smb/client: print "Unknown" instead of bogus link speed value
  cifs: print all credit counters in DebugData
  cifs: fix status checks in cifs_tree_connect
  smb: remove obsolete comment
2023-06-15 15:24:33 -07:00
Jan Kara
c541dce86c fs: Protect reconfiguration of sb read-write from racing writes
The reconfigure / remount code takes a lot of effort to protect
filesystem's reconfiguration code from racing writes on remounting
read-only. However during remounting read-only filesystem to read-write
mode userspace writes can start immediately once we clear SB_RDONLY
flag. This is inconvenient for example for ext4 because we need to do
some writes to the filesystem (such as preparation of quota files)
before we can take userspace writes so we are clearing SB_RDONLY flag
before we are fully ready to accept userpace writes and syzbot has found
a way to exploit this [1]. Also as far as I'm reading the code
the filesystem remount code was protected from racing writes in the
legacy mount path by the mount's MNT_READONLY flag so this is relatively
new problem. It is actually fairly easy to protect remount read-write
from racing writes using sb->s_readonly_remount flag so let's just do
that instead of having to workaround these races in the filesystem code.

[1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/

Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230615113848.8439-1-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-15 16:58:04 +02:00
Miquel Raynal
a91845b9a8 sysfs: Skip empty folders creation
Most sysfs attributes are statically defined, the goal with this design
being to be able to move all the filesystem description into read-only
memory. Anyway, it may be relevant in some cases to populate attributes
at run time. This leads to situation where an attribute may or may not be
present depending on conditions which are not known at compile
time, up to the point where no attribute at all gets added in a folder
which then becomes "sometimes" empty. Problem is, providing an attribute
group with a name and without .[bin_]attrs members will be loudly
refused by the core, leading in most cases to a device registration
failure.

The simple way to support such situation right now is to dynamically
allocate an empty attribute array, which is:
* a (small) waste of space
* a waste of time
* disturbing, to say the least, as an empty sysfs folder will be created
  anyway.

Another (even worse) possibility would be to dynamically overwrite a
member of the attribute_group list, hopefully the last, which is also
supposed to remain in the read-only section.

In order to avoid these hackish situations, while still giving a little
bit of flexibility, we might just check the validity of the .[bin_]attrs
list and, if empty, just skip the attribute group creation instead of
failing. This way, developers will not be tempted to workaround the
core with useless allocations or strange writes on supposedly read-only
structures.

The content of the WARN() message is kept but turned into a debug
message in order to help developers understanding why their sysfs
folders might now silently fail to be created.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Message-ID: <20230614063018.2419043-3-miquel.raynal@bootlin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-15 13:37:53 +02:00
Miquel Raynal
4981e0139f sysfs: Improve readability by following the kernel coding style
The purpose of the if/else block is to select the right sysfs directory
entry to be used for the files creation. At a first look when you have
the file in front of you, it really seems like the "create_files()"
lines right after the block are badly indented and the "else" does not
guard. In practice the code is correct but lacks curly brackets to show
where the big if/else block actually ends. Add these brackets to comply
with the current kernel coding style and to ease the understanding of
the whole logic.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Message-ID: <20230614063018.2419043-2-miquel.raynal@bootlin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-15 13:37:50 +02:00
Andreas Gruenbacher
cad1e15804 gfs2: Rename SDF_{FS_FROZEN => FREEZE_INITIATOR}
Rename the SDF_FS_FROZEN flag to SDF_FREEZE_INITIATOR to indicate more
clearly that the node that has this flag set is the initiator of the
freeze.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com
2023-06-15 09:57:38 +02:00
Andreas Gruenbacher
9e4f09565f gfs2: Reconfiguring frozen filesystem already rejected
Reconfiguring a frozen filesystem is already rejected in
reconfigure_super(), so there is no need to check for that condition
again at the filesystem level.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-15 09:57:38 +02:00
Andreas Gruenbacher
e392edd5d5 gfs2: Rename gfs2_freeze_lock{ => _shared }
Rename gfs2_freeze_lock to gfs2_freeze_lock_shared to make it a bit more
obvious that this function establishes the "thawed" state of the freeze
glock.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-15 09:57:38 +02:00
Andreas Gruenbacher
097cca525a gfs2: Rename the {freeze,thaw}_super callbacks
Rename gfs2_freeze to gfs2_freeze_super and gfs2_unfreeze to
gfs2_thaw_super to match the names of the corresponding super
operations.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-15 09:57:38 +02:00
Andreas Gruenbacher
af1abe1146 gfs2: Rename remaining "transaction" glock references
The transaction glock was repurposed to serve as the new freeze glock
years ago.  Don't refer to it as the transaction glock anymore.

Also, to be more precise, call it the "freeze glock" instead of the
"freeze lock".  Ditto for the journal glock.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-15 09:57:38 +02:00
Jeff Layton
797a1d894d autofs: set ctime as well when mtime changes on a dir
When adding entries to a directory, POSIX generally requires that the
ctime also be updated alongside the mtime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Ian Kent <raven@themaw.net>
Message-Id: <20230612104524.17058-4-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-15 09:22:24 +02:00
Wen Yang
33d8b5d782 eventfd: show the EFD_SEMAPHORE flag in fdinfo
The EFD_SEMAPHORE flag should be displayed in fdinfo,
as different value could affect the behavior of eventfd.

Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dylan Yudaken <dylany@fb.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Eric Biggers <ebiggers@google.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <tencent_05B9CFEFE6B9BC2A9B3A27886A122A7D9205@qq.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-15 09:22:23 +02:00
Fabio M. De Francesco
5c075c5b8f fs/aio: Stop allocating aio rings from HIGHMEM
There is no need to allocate aio rings from HIGHMEM because of very
little memory needed here.

Therefore, use GFP_USER flag in find_or_create_page() and get rid of
kmap*() mappings.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Message-Id: <20230609145937.17610-1-fmdefrancesco@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-15 09:22:23 +02:00
Kemeng Shi
19a043bb1f ext4: try all groups in ext4_mb_new_blocks_simple
ext4_mb_new_blocks_simple ignores the group before goal, so it will fail
if free blocks reside in group before goal. Try all groups to avoid
unexpected failure.
Search finishes either if any free block is found or if no available
blocks are found. Simpliy check "i >= max" to distinguish the above
cases.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
95a4c3c7e0 ext4: remove ext4_block_group and ext4_block_group_offset declaration
For ext4_block_group and ext4_block_group_offset, there are only
declaration without definition. Just remove them.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
1eff590489 ext4: add EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated
ext4_mb_use_preallocated will ignore the demand to alloc goal blocks,
although the EXT4_MB_HINT_GOAL_ONLY is requested.
For group pa, ext4_mb_group_or_file will not set EXT4_MB_HINT_GROUP_ALLOC
if EXT4_MB_HINT_GOAL_ONLY is set. So we will not alloc goal blocks from
group pa if EXT4_MB_HINT_GOAL_ONLY is set.
For inode pa, ext4_mb_pa_goal_check is added to check if free extent in
found inode pa meets goal blocks when EXT4_MB_HINT_GOAL_ONLY is set.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
c3defd99d5 ext4: treat stripe in block unit
Stripe is misused in block unit and in cluster unit in different code
paths. User awared of stripe maybe not awared of bigalloc feature, so
treat stripe only in block unit to fix this.
Besides, it's hard to get stripe aligned blocks (start and length are both
aligned with stripe) if stripe is not aligned with cluster, just disable
stripe and alert user in this case to simpfy the code and avoid
unnecessary work to get stripe aligned blocks which likely to be failed.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
99c515e3a8 ext4: fix wrong unit use in ext4_mb_find_by_goal
We need start in block unit while fe_start is in cluster unit. Use
ext4_grp_offs_to_block helper to convert fe_start to get start in
block unit.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
497885f72d ext4: fix unit mismatch in ext4_mb_new_blocks_simple
The "i" returned from mb_find_next_zero_bit is in cluster unit and we
need offset "block" corresponding to "i" in block unit. Convert "i" to
block unit to fix the unit mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
b3916da0d9 ext4: fix wrong unit use in ext4_mb_normalize_request
NRL_CHECK_SIZE will compare input req and size, so req and size should
be in same unit. Input req "fe_len" is in cluster unit while input
size "(8<<20)>>bsbits" is in block unit. Convert "fe_len" to block
unit to fix the mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Matthew Wilcox
0dea40aa31 ext4: Call fsverity_verify_folio()
Now that fsverity supports working on entire folios, call
fsverity_verify_folio() instead of fsverity_verify_page()

Reported-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230516192713.1070469-1-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Ritesh Harjani
d19500da4b ext4: Make ext4_write_inline_data_end() use folio
ext4_write_inline_data_end() is completely converted to work with folio.
Also all callers of ext4_write_inline_data_end() already works on folio
except ext4_da_write_end(). Mostly for consistency and saving few
instructions maybe, this patch just converts ext4_da_write_end() to work
with folio which makes the last caller of ext4_write_inline_data_end()
also converted to work with folio.
We then make ext4_write_inline_data_end() take folio instead of page.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/1bcea771720ff451a5a59b3f1bcd5fae51cb7ce7.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Ritesh Harjani
80be8c5cc9 ext4: Make mpage_journal_page_buffers use folio
This patch converts mpage_journal_page_buffers() to use folio and also
removes the PAGE_SIZE assumption.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/ebc3ac80e6a54b53327740d010ce684a83512021.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Ritesh Harjani
36c9b45040 ext4: Change remaining tracepoints to use folio
ext4_readpage() is converted to ext4_read_folio() hence change the
related tracepoint from trace_ext4_readpage(page) to
trace_ext4_read_folio(folio). Do the same for
trace_ext4_releasepage(page) to trace_ext4_release_folio(folio)

As a minor bit of optimization to avoid an extra dereferencing,
since both of the above functions already were dereferencing
folio->mapping->host, hence change the tracepoint argument to take
(inode, folio).

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/caba2b3c0147bed4ea7706767dc1d19cd0e29ab0.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Ritesh Harjani
0b956de151 ext4: kill unused function ext4_journalled_write_inline_data
Commit 3f079114bf ("ext4: Convert data=journal writeback to use ext4_writepages()")
Added support for writeback of journalled data into ext4_writepages()
and killed function __ext4_journalled_writepage() which used to call
ext4_journalled_write_inline_data() for inline data.
This function got left over by mistake. Hence kill it's definition as
no one uses it.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/122b2a8d5e0650686f23ed6da26ed9e04105562b.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Fabio M. De Francesco
f451fd97dd ext4: drop the call to ext4_error() from ext4_get_group_info()
A recent patch added a call to ext4_error() which is problematic since
some callers of the ext4_get_group_info() function may be holding a
spinlock, whereas ext4_error() must never be called in atomic context.

This triggered a report from Syzbot: "BUG: sleeping function called from
invalid context in ext4_update_super" (see the link below).

Therefore, drop the call to ext4_error() from ext4_get_group_info(). In
the meantime use eight characters tabs instead of nine characters ones.

Reported-by: syzbot+4acc7d910e617b360859@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/00000000000070575805fdc6cdb2@google.com/
Fixes: 5354b2af34 ("ext4: allow ext4_get_group_info() to fail")
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Link: https://lore.kernel.org/r/20230614100446.14337-1-fmdefrancesco@gmail.com
2023-06-14 22:24:05 -04:00
Kemeng Shi
1948279211 Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"
This reverts commit ad3f09be6c.

The reverted commit was intended to simpfy the code to get group
descriptor block number in non-meta block group by assuming
s_gdb_count is block number used for all non-meta block group descriptors.
However s_gdb_count is block number used for all meta *and* non-meta
group descriptors. So s_gdb_group will be > actual group descriptor block
number used for all non-meta block group which should be "total non-meta
block group" / "group descriptors per block", e.g. s_first_meta_bg.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230613225025.3859522-1-shikemeng@huaweicloud.com
Fixes: ad3f09be6c ("ext4: remove unnecessary check in ext4_bg_num_gdb_nometa")
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-14 22:22:27 -04:00
Jiasheng Jiang
d97038d5ec pstore/ram: Add check for kstrdup
Add check for the return value of kstrdup() and return the error
if it fails in order to avoid NULL pointer dereference.

Fixes: e163fdb3f7 ("pstore/ram: Regularize prz label allocation lifetime")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230614093733.36048-1-jiasheng@iscas.ac.cn
2023-06-14 11:52:10 -07:00
Eric Biggers
74836ecbc5 fsverity: rework fsverity_get_digest() again
Address several issues with the calling convention and documentation of
fsverity_get_digest():

- Make it provide the hash algorithm as either a FS_VERITY_HASH_ALG_*
  value or HASH_ALGO_* value, at the caller's choice, rather than only a
  HASH_ALGO_* value as it did before.  This allows callers to work with
  the fsverity native algorithm numbers if they want to.  HASH_ALGO_* is
  what IMA uses, but other users (e.g. overlayfs) should use
  FS_VERITY_HASH_ALG_* to match fsverity-utils and the fsverity UAPI.

- Make it return the digest size so that it doesn't need to be looked up
  separately.  Use the return value for this, since 0 works nicely for
  the "file doesn't have fsverity enabled" case.  This also makes it
  clear that no other errors are possible.

- Rename the 'digest' parameter to 'raw_digest' and clearly document
  that it is only useful in combination with the algorithm ID.  This
  hopefully clears up a point of confusion.

- Export it to modules, since overlayfs will need it for checking the
  fsverity digests of lowerdata files
  (https://lore.kernel.org/r/dd294a44e8f401e6b5140029d8355f88748cd8fd.1686565330.git.alexl@redhat.com).

Acked-by: Mimi Zohar <zohar@linux.ibm.com> # for the IMA piece
Link: https://lore.kernel.org/r/20230612190047.59755-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-14 10:41:07 -07:00
Qu Wenruo
b50f2d048e btrfs: scrub: fix a return value overwrite in scrub_stripe()
[RETURN VALUE OVERWRITE]
Inside scrub_stripe(), we would submit all the remaining stripes after
iterating all extents.

But since flush_scrub_stripes() can return error, we need to avoid
overwriting the existing @ret if there is any error.

However the existing check is doing the wrong check:

	ret2 = flush_scrub_stripes();
	if (!ret2)
		ret = ret2;

This would overwrite the existing @ret to 0 as long as the final flush
detects no critical errors.

[FIX]
We should check @ret other than @ret2 in that case.

Fixes: 8eb3dd17ea ("btrfs: dev-replace: error out if we have unrepaired metadata error during")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-14 18:30:30 +02:00
Alexander Aring
1696c75f18 fs: dlm: add send ack threshold and append acks to msgs
This patch changes the time when we sending an ack back to tell the
other side it can free some message because it is arrived on the
receiver node, due random reconnects e.g. TCP resets this is handled as
well on application layer to not let DLM run into a deadlock state.

The current handling has the following problems:

1. We end in situations that we only send an ack back message of 16
   bytes out and no other messages. Whereas DLM has logic to combine
   so much messages as it can in one send() socket call. This behaviour
   can be discovered by "trace-cmd start -e dlm_recv" and observing the
   ret field being 16 bytes.

2. When processing of DLM messages will never end because we receive a
   lot of messages, we will not send an ack back as it happens when
   the processing loop ends.

This patch introduces a likely and unlikely threshold case. The likely
case will send an ack back on a transmit path if the threshold is
triggered of amount of processed upper layer protocol. This will solve
issue 1 because it will be send when another normal DLM message will be
sent. It solves issue 2 because it is not part of the processing loop.

There is however a unlikely case, the unlikely case has a bigger
threshold and will be triggered when we only receive messages and do not
sent any message back. This case avoids that the sending node will keep
a lot of message for a long time as we send sometimes ack backs to tell
the sender to finally release messages.

The atomic cmpxchg() is there to provide a atomically ack send with
reset of the upper layer protocol delivery counter.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
d00725cab2 fs: dlm: handle sequence numbers as atomic
Currently seq_next is only be read on the receive side which processed
in an ordered way. The seq_send is being protected by locks. To being
able to read the seq_next value on send side as well we convert it to an
atomic_t value. The atomic_cmpxchg() is probably not necessary, however
the atomic_inc() depends on a if coniditional and this should be handled
in an atomic context.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
75a7d60134 fs: dlm: handle lkb wait count as atomic_t
Currently the lkb_wait_count is locked by the rsb lock and it should be
fine to handle lkb_wait_count as non atomic_t value. However for the
overall process of reducing locking this patch converts it to an
atomic_t value.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
07ee38674a fs: dlm: filter ourself midcomms calls
It makes no sense to call midcomms/lowcomms functionality for the local
node as socket functionality is only required for remote nodes. This
patch filters those calls in the upper layer of lockspace membership
handling instead of doing it in midcomms/lowcomms layer as they should
never be aware of local nodeid.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
70cf2fecf8 fs: dlm: warn about messages from left nodes
This patch warns about messages which are received from nodes who
already left the lockspace resource signaled by the cluster manager.
Before commit 489d8e559c ("fs: dlm: add reliable connection if
reconnect") there was a synchronization issue with the socket
lifetime and the cluster event of leaving a lockspace and other
nodes did not stop of sending messages because the cluster manager has a
pending message to leave the lockspace. The reliable session layer for
dlm use sequence numbers to ensure dlm message were never being dropped.
If this is not corrected synchronized we have a problem, this patch will
use the filter case and turn it into a WARN_ON_ONCE() so we seeing such
issue on the kernel log because it should never happen now.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
5ce9ef30f2 fs: dlm: move dlm_purge_lkb_callbacks to user module
This patch moves the dlm_purge_lkb_callbacks() function from ast to user
dlm module as it is only a function being used by dlm user
implementation. I got be hinted to hold specific locks regarding the
callback handling for dlm_purge_lkb_callbacks() but it was false
positive. It is confusing because ast dlm implementation uses a
different locking behaviour as user locks uses as DLM handles kernel and
user dlm locks differently. To avoid the confusing we move this function
to dlm user implementation.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
d41a1a3db4 fs: dlm: cleanup STOP_IO bitflag set when stop io
There should no difference between setting the CF_IO_STOP flag
before restore_callbacks() to do it before or afterwards. The
restore_callbacks() will be sure that no callback is executed anymore
when the bit wasn't set.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
f8bce79d9d fs: dlm: don't check othercon twice
This patch removes an another check if con->othercon set inside the
branch which already does that.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
cbba21169e fs: dlm: unregister memory at the very last
The dlm modules midcomms, debugfs, lockspace, uses kmem caches. We
ensure that the kmem caches getting deallocated after those modules
exited.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
f68bb23cad fs: dlm: fix missing pending to false
This patch sets the process_dlm_messages_pending boolean to false when
there was no message to process. It is a case which should not happen
but if we are prepared to recover from this situation by setting pending
boolean to false.

Cc: stable@vger.kernel.org
Fixes: dbb751ffab ("fs: dlm: parallelize lowcomms socket handling")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
7a931477bf fs: dlm: clear pending bit when queue was empty
This patch clears the DLM_IFL_CB_PENDING_BIT flag which will be set when
there is callback work queued when there was no callback to dequeue. It
is a buggy case and should never happen, that's why there is a
WARN_ON(). However if the case happens we are prepared to somehow
recover from it.

Cc: stable@vger.kernel.org
Fixes: 61bed0baa4 ("fs: dlm: use a non-static queue for callbacks")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
c6b6d6dcc7 fs: dlm: revert check required context while close
This patch reverts commit 2c3fa6ae4d ("dlm: check required context
while close"). The function dlm_midcomms_close(), which will call later
dlm_lowcomms_close(), is called when the cluster manager tells the node
got fenced which means on midcomms/lowcomms layer to disconnect the node
from the cluster communication. The node can rejoin the cluster later.
This patch was ensuring no new message were able to be triggered when we
are in the close() function context. This was done by checking if the
lockspace has been stopped. However there is a missing check that we
only need to check specific lockspaces where the fenced node is member
of. This is currently complicated because there is no way to easily
check if a node is part of a specific lockspace without stopping the
recovery. For now we just revert this commit as it is just a check to
finding possible leaks of stopping lockspaces before close() is called.

Cc: stable@vger.kernel.org
Fixes: 2c3fa6ae4d ("dlm: check required context while close")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Shyam Prasad N
e4645cc2f1 cifs: add a warning when the in-flight count goes negative
We've seen the in-flight count go into negative with some
internal stress testing in Microsoft.

Adding a WARN when this happens, in hope of understanding
why this happens when it happens.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-14 10:15:05 -05:00
Steve French
c774e6779f cifs: fix lease break oops in xfstest generic/098
umount can race with lease break so need to check if
tcon->ses->server is still valid to send the lease
break response.

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Fixes: 59a556aebc ("SMB3: drop reference to cfile before sending oplock break")
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-14 10:15:01 -05:00
David Howells
d44c404207 block: Fix dio_cleanup() to advance the head index
Fix dio_bio_cleanup() to advance the head index into the list of pages past
the pages it has released, as __blockdev_direct_IO() will call it twice if
do_direct_IO() fails.

The issue was causing:

        WARNING: CPU: 6 PID: 2220 at mm/gup.c:76 try_get_folio

This can be triggered by setting up a clean pair of UDF filesystems on
loopback devices and running the generic/451 xfstest with them as the
scratch and test partitions.  Something like the following:

    fallocate /mnt2/udf_scratch -l 1G
    fallocate /mnt2/udf_test -l 1G
    mknod /dev/lo0 b 7 0
    mknod /dev/lo1 b 7 1
    losetup lo0 /mnt2/udf_scratch
    losetup lo1 /mnt2/udf_test
    mkfs -t udf /dev/lo0
    mkfs -t udf /dev/lo1
    cd xfstests
    ./check generic/451

with xfstests configured by putting the following into local.config:

    export FSTYP=udf
    export DISABLE_UDF_TEST=1
    export TEST_DEV=/dev/lo1
    export TEST_DIR=/xfstest.test
    export SCRATCH_DEV=/dev/lo0
    export SCRATCH_MNT=/xfstest.scratch

Fixes: 1ccf164ec8 ("block: Use iov_iter_extract_pages() and page pinning in direct-io.c")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202306120931.a9606b88-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1193485.1686693279@warthog.procyon.org.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 06:58:18 -06:00
Christoph Hellwig
8812387d05 zonefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Do that for zonefs so that noop_direct_IO can eventually be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2023-06-14 08:51:18 +09:00
Damien Le Moal
16d7fd3cfa zonefs: use iomap for synchronous direct writes
Remove the function zonefs_file_dio_append() that is used to manually
issue REQ_OP_ZONE_APPEND BIOs for processing synchronous direct writes
and use iomap instead.

To preserve the use of zone append operations for synchronous writes,
different struct iomap_dio_ops are defined. For synchronous direct
writes using zone append, zonefs_zone_append_dio_ops is introduced.
The submit_bio operation of this structure is defined as the function
zonefs_file_zone_append_dio_submit_io() which is used to change the BIO
opreation for synchronous direct IO writes to REQ_OP_ZONE_APPEND.

In order to preserve the write location check on completion of zone
append BIOs, the end_io operation is also defined using the function
zonefs_file_zone_append_dio_bio_end_io(). This check now relies on the
zonefs_zone_append_bio structure, allocated together with zone append
BIOs with a dedicated BIO set. This structure include the target inode
of a zone append BIO as well as the target append offset location for
the zone append operation. This is used to perform a check against
bio->bi_iter.bi_sector when the BIO completes, without needing to use
the zone information z_wpoffset field, thus removing the need for
taking the inode truncate mutex.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
2023-06-14 08:51:05 +09:00
Long Li
c3b880acad xfs: fix ag count overflow during growfs
I found a corruption during growfs:

 XFS (loop0): Internal error agbno >= mp->m_sb.sb_agblocks at line 3661 of
   file fs/xfs/libxfs/xfs_alloc.c.  Caller __xfs_free_extent+0x28e/0x3c0
 CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
 Call Trace:
  <TASK>
  dump_stack_lvl+0x50/0x70
  xfs_corruption_error+0x134/0x150
  __xfs_free_extent+0x2c1/0x3c0
  xfs_ag_extend_space+0x291/0x3e0
  xfs_growfs_data+0xd72/0xe90
  xfs_file_ioctl+0x5f9/0x14a0
  __x64_sys_ioctl+0x13e/0x1c0
  do_syscall_64+0x39/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 XFS (loop0): Corruption detected. Unmount and run xfs_repair
 XFS (loop0): Internal error xfs_trans_cancel at line 1097 of file
   fs/xfs/xfs_trans.c.  Caller xfs_growfs_data+0x691/0xe90
 CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
 Call Trace:
  <TASK>
  dump_stack_lvl+0x50/0x70
  xfs_error_report+0x93/0xc0
  xfs_trans_cancel+0x2c0/0x350
  xfs_growfs_data+0x691/0xe90
  xfs_file_ioctl+0x5f9/0x14a0
  __x64_sys_ioctl+0x13e/0x1c0
  do_syscall_64+0x39/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7f2d86706577

The bug can be reproduced with the following sequence:

 # truncate -s  1073741824 xfs_test.img
 # mkfs.xfs -f -b size=1024 -d agcount=4 xfs_test.img
 # truncate -s 2305843009213693952  xfs_test.img
 # mount -o loop xfs_test.img /mnt/test
 # xfs_growfs -D  1125899907891200  /mnt/test

The root cause is that during growfs, user space passed in a large value
of newblcoks to xfs_growfs_data_private(), due to current sb_agblocks is
too small, new AG count will exceed UINT_MAX. Because of AG number type
is unsigned int and it would overflow, that caused nagcount much smaller
than the actual value. During AG extent space, delta blocks in
xfs_resizefs_init_new_ags() will much larger than the actual value due to
incorrect nagcount, even exceed UINT_MAX. This will cause corruption and
be detected in __xfs_free_extent. Fix it by growing the filesystem to up
to the maximally allowed AGs and not return EINVAL when new AG count
overflow.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-13 08:49:20 -07:00
Andreas Gruenbacher
cea44032bc gfs2: retry interrupted internal reads
The iomap-based read operations done by gfs2 for its system files, such
as rindex, may sometimes be interrupted and return -EINTR.   This
confuses some users of gfs2_internal_read().  Fix that by retrying
interrupted reads.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-13 16:51:25 +02:00
Tuo Li
6fa0a72cbb gfs2: Fix possible data races in gfs2_show_options()
Some fields such as gt_logd_secs of the struct gfs2_tune are accessed
without holding the lock gt_spin in gfs2_show_options():

  val = sdp->sd_tune.gt_logd_secs;
  if (val != 30)
    seq_printf(s, ",commit=%d", val);

And thus can cause data races when gfs2_show_options() and other functions
such as gfs2_reconfigure() are concurrently executed:

  spin_lock(&gt->gt_spin);
  gt->gt_logd_secs = newargs->ar_commit;

To fix these possible data races, the lock sdp->sd_tune.gt_spin is
acquired before accessing the fields of gfs2_tune and released after these
accesses.

Further changes by Andreas:

- Don't hold the spin lock over the seq_printf operations.

Reported-by: BassCheck <bass@buaa.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-13 14:34:17 +02:00
Jan Kara
404615d7f1 ext2: Drop fragment support
Ext2 has fields in superblock reserved for subblock allocation support.
However that never landed. Drop the many years dead code.

Reported-by: syzbot+af5e10f73dbff48f70af@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
2023-06-13 12:37:31 +02:00
Christoph Hellwig
b294349993 xfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Do that for xfs so that noop_direct_IO can eventually be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-12 18:09:23 -07:00
Darrick J. Wong
61d7e8274c xfs: drop EXPERIMENTAL tag for large extent counts
This feature has been baking in upstream for ~10mo with no bug reports.
It seems to work fine here, let's get rid of the scary warnings?

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-06-12 18:09:04 -07:00
Darrick J. Wong
06f3ef6e17 xfs: don't deplete the reserve pool when trying to shrink the fs
Every now and then, xfs/168 fails with this logged in dmesg:

Reserve blocks depleted! Consider increasing reserve pool size.
EXPERIMENTAL online shrink feature in use. Use at your own risk!
Per-AG reservation for AG 1 failed.  Filesystem may run out of space.
Per-AG reservation for AG 1 failed.  Filesystem may run out of space.
Error -28 reserving per-AG metadata reserve pool.
Corruption of in-memory data (0x8) detected at xfs_ag_shrink_space+0x23c/0x3b0 [xfs] (fs/xfs/libxfs/xfs_ag.c:1007).  Shutting down filesystem.

It's silly to deplete the reserved blocks pool just to shrink the
filesystem, particularly since the fs goes down after that.

Fixes: fb2fc17201 ("xfs: support shrinking unused space in the last AG")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-12 18:09:04 -07:00
Qu Wenruo
745806fb45 btrfs: do not ASSERT() on duplicated global roots
[BUG]
Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot
mount option on a corrupted fs.

The full report can be found here:
https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0

  BTRFS error (device loop0: state C): failed to load root csum
  assertion failed: !tmp, in fs/btrfs/disk-io.c:1103
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3664!
  invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
  RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663
  RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246
  RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400
  RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
  RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1
  R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000
  R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000
  FS:  0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103
   load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467
   load_global_roots fs/btrfs/disk-io.c:2501 [inline]
   btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline]
   init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939
   open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574
   btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456
   btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824
   legacy_get_tree+0xea/0x180 fs/fs_context.c:610
   vfs_get_tree+0x88/0x270 fs/super.c:1530
   fc_mount fs/namespace.c:1043 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073
   btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884

[CAUSE]
Since the introduction of global roots, we handle
csum/extent/free-space-tree roots as global roots, even if no
extent-tree-v2 feature is enabled.

So for regular csum/extent/fst roots, we load them into
fs_info::global_root_tree rb tree.

And we should not expect any conflicts in that rb tree, thus we have an
ASSERT() inside btrfs_global_root_insert().

But rescue=usebackuproot can break the assumption, as we will try to
load those trees again and again as long as we have bad roots and have
backup roots slot remaining.

So in that case we can have conflicting roots in the rb tree, and
triggering the ASSERT() crash.

[FIX]
We can safely remove that ASSERT(), as the caller will properly put the
offending root.

To make further debugging easier, also add two explicit error messages:

- Error message for conflicting global roots
- Error message when using backup roots slot

Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com
Fixes: abed4aaae4 ("btrfs: track the csum, extent, and free space trees in a rb tree")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 01:21:16 +02:00
Linus Torvalds
fb054096ae 19 hotfixes. 14 are cc:stable and the remainder address issues which were
introduced during this -rc cycle or which were considered inappropriate
 for a backport.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZIdw7QAKCRDdBJ7gKXxA
 jki4AQCygi1UoqVPq4N/NzJbv2GaNDXNmcJIoLvPpp3MYFhucAEAtQNzAYO9z6CT
 iLDMosnuh+1KLTaKNGL5iak3NAxnxQw=
 =mTdI
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "19 hotfixes. 14 are cc:stable and the remainder address issues which
  were introduced during this development cycle or which were considered
  inappropriate for a backport"

* tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  zswap: do not shrink if cgroup may not zswap
  page cache: fix page_cache_next/prev_miss off by one
  ocfs2: check new file size on fallocate call
  mailmap: add entry for John Keeping
  mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()
  epoll: ep_autoremove_wake_function should use list_del_init_careful
  mm/gup_test: fix ioctl fail for compat task
  nilfs2: reject devices with insufficient block count
  ocfs2: fix use-after-free when unmounting read-only filesystem
  lib/test_vmalloc.c: avoid garbage in page array
  nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
  riscv/purgatory: remove PGO flags
  powerpc/purgatory: remove PGO flags
  x86/purgatory: remove PGO flags
  kexec: support purgatories with .text.hot sections
  mm/uffd: allow vma to merge as much as possible
  mm/uffd: fix vma operation where start addr cuts part of vma
  radix-tree: move declarations to header
  nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
2023-06-12 16:14:34 -07:00
Chris Mason
deccae40e4 btrfs: can_nocow_file_extent should pass down args->strict from callers
Commit 619104ba45 ("btrfs: move common NOCOW checks against a file
extent into a helper") changed our call to btrfs_cross_ref_exist() to
always pass false for the 'strict' parameter.  We're passing this down
through the stack so that we can do a full check for cross references
during swapfile activation.

With strict always false, this test fails:

  btrfs subvol create swappy
  chattr +C swappy
  fallocate -l1G swappy/swapfile
  chmod 600 swappy/swapfile
  mkswap swappy/swapfile

  btrfs subvol snap swappy swapsnap
  btrfs subvol del -C swapsnap

  btrfs fi sync /
  sync;sync;sync

  swapon swappy/swapfile

The fix is to just use args->strict, and everyone except swapfile
activation is passing false.

Fixes: 619104ba45 ("btrfs: move common NOCOW checks against a file extent into a helper")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:08 +02:00
Christoph Hellwig
7833b86595 btrfs: fix iomap_begin length for nocow writes
can_nocow_extent can reduce the len passed in, which needs to be
propagated to btrfs_dio_iomap_begin so that iomap does not submit
more data then is mapped.

This problems exists since the btrfs_get_blocks_direct helper was added
in commit c5794e5178 ("btrfs: Factor out write portion of
btrfs_get_blocks_direct"), but the ordered_extent splitting added in
commit b73a6fd1b1 ("btrfs: split partial dio bios before submit")
added a WARN_ON that made a syzkaller test fail.

Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Fixes: c5794e5178 ("btrfs: Factor out write portion of btrfs_get_blocks_direct")
CC: stable@vger.kernel.org # 6.1+
Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:00 +02:00
Chao Yu
5079e1c0c8 f2fs: avoid dead loop in f2fs_issue_checkpoint()
generic/082 reports a bug as below:

__schedule+0x332/0xf60
schedule+0x6f/0xf0
schedule_timeout+0x23b/0x2a0
wait_for_completion+0x8f/0x140
f2fs_issue_checkpoint+0xfe/0x1b0
f2fs_sync_fs+0x9d/0xb0
sync_filesystem+0x87/0xb0
dquot_load_quota_sb+0x41b/0x460
dquot_load_quota_inode+0xa5/0x130
dquot_quota_on+0x4b/0x60
f2fs_quota_on+0xe3/0x1b0
do_quotactl+0x483/0x700
__x64_sys_quotactl+0x15c/0x310
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

The root casue is race case as below:

Thread A			Kworker			IRQ
- write()
: write data to quota.user file

				- writepages
				 - f2fs_submit_page_write
				  - __is_cp_guaranteed return false
				  - inc_page_count(F2FS_WB_DATA)
				 - submit_bio
- quotactl(Q_QUOTAON)
 - f2fs_quota_on
  - dquot_quota_on
   - dquot_load_quota_inode
    - vfs_setup_quota_inode
    : inode->i_flags |= S_NOQUOTA
							- f2fs_write_end_io
							 - __is_cp_guaranteed return true
							 - dec_page_count(F2FS_WB_CP_DATA)
    - dquot_load_quota_sb
     - f2fs_sync_fs
      - f2fs_issue_checkpoint
       - do_checkpoint
        - f2fs_wait_on_all_pages(F2FS_WB_CP_DATA)
        : loop due to F2FS_WB_CP_DATA count is negative

Calling filemap_fdatawrite() and filemap_fdatawait() to keep all data
clean before quota file setup.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:37:36 -07:00
Wu Bo
cadfc2f9f8 f2fs: fix args passed to trace_f2fs_lookup_end
The NULL return of 'd_splice_alias' dosen't mean error. Thus the
successful case will also return NULL, which makes the tracepoint always
print 'err=-ENOENT'.

And the different cases of 'new' & 'err' are list as following:
1) dentry exists: err(0) with new(NULL) --> dentry, err=0
2) dentry exists: err(0) with new(VALID) --> new, err=0
3) dentry exists: err(0) with new(ERR) --> dentry, err=ERR
4) no dentry exists: err(-ENOENT) with new(NULL) --> dentry, err=-ENOENT
5) no dentry exists: err(-ENOENT) with new(VALID) --> new, err=-ENOENT
6) no dentry exists: err(-ENOENT) with new(ERR) --> dentry, err=ERR

Signed-off-by: Wu Bo <bo.wu@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Yangtao Li
38b57833de f2fs: flag as supporting buffered async reads
The f2fs uses generic_file_buffered_read(), which supports buffered async
reads since commit 1a0a7853b9 ("mm: support async buffered reads in
generic_file_buffered_read()").

Let's enable it to match other file-systems. The read performance has been
greatly improved under io_uring:

    167M/s -> 234M/s, Increase ratio by 40%

Test w/:
    ./fio --name=onessd --filename=/data/test/local/io_uring_test
    --size=256M --rw=randread --bs=4k --direct=0 --overwrite=0
    --numjobs=1 --iodepth=1 --time_based=0 --runtime=10
    --ioengine=io_uring --registerfiles --fixedbufs
    --gtod_reduce=1 --group_reporting --sqthread_poll=1

Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Chao Yu
20872584b8 f2fs: fix to drop all dirty meta/node pages during umount()
For cp error case, there will be dirty meta/node pages remained after
f2fs_write_checkpoint() in f2fs_put_super(), drop them explicitly, and
do sanity check on reference count of dirty pages and inflight IOs.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Chunhai Guo
38a4a330c8 f2fs: Detect looped node chain efficiently
find_fsync_dnodes() detect the looped node chain by comparing the loop
counter with free blocks. While it may take tens of seconds to quit when
the free blocks are large enough. We can use Floyd's cycle detection
algorithm to make the detection more efficient.

Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Daejun Park
25f9080576 f2fs: add async reset zone command support
This patch enables submit reset zone command asynchornously. It helps
decrease average latency of write IOs in high utilization scenario by
faster checkpointing.

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Chao Yu
901c12d144 f2fs: flush error flags in workqueue
In IRQ context, it wakes up workqueue to record errors into on-disk
superblock fields rather than in-memory fields.

Fixes: 1aa161e431 ("f2fs: fix scheduling while atomic in decompression path")
Fixes: 95fa90c9e5 ("f2fs: support recording errors into superblock")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
458c15dfbc f2fs: don't reset unchangable mount option in f2fs_remount()
syzbot reports a bug as below:

general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:__lock_acquire+0x69/0x2000 kernel/locking/lockdep.c:4942
Call Trace:
 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5691
 __raw_write_lock include/linux/rwlock_api_smp.h:209 [inline]
 _raw_write_lock+0x2e/0x40 kernel/locking/spinlock.c:300
 __drop_extent_tree+0x3ac/0x660 fs/f2fs/extent_cache.c:1100
 f2fs_drop_extent_tree+0x17/0x30 fs/f2fs/extent_cache.c:1116
 f2fs_insert_range+0x2d5/0x3c0 fs/f2fs/file.c:1664
 f2fs_fallocate+0x4e4/0x6d0 fs/f2fs/file.c:1838
 vfs_fallocate+0x54b/0x6b0 fs/open.c:324
 ksys_fallocate fs/open.c:347 [inline]
 __do_sys_fallocate fs/open.c:355 [inline]
 __se_sys_fallocate fs/open.c:353 [inline]
 __x64_sys_fallocate+0xbd/0x100 fs/open.c:353
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root cause is race condition as below:
- since it tries to remount rw filesystem, so that do_remount won't
call sb_prepare_remount_readonly to block fallocate, there may be race
condition in between remount and fallocate.
- in f2fs_remount(), default_options() will reset mount option to default
one, and then update it based on result of parse_options(), so there is
a hole which race condition can happen.

Thread A			Thread B
- f2fs_fill_super
 - parse_options
  - clear_opt(READ_EXTENT_CACHE)

- f2fs_remount
 - default_options
  - set_opt(READ_EXTENT_CACHE)
				- f2fs_fallocate
				 - f2fs_insert_range
				  - f2fs_drop_extent_tree
				   - __drop_extent_tree
				    - __may_extent_tree
				     - test_opt(READ_EXTENT_CACHE) return true
				    - write_lock(&et->lock) access NULL pointer
 - parse_options
  - clear_opt(READ_EXTENT_CACHE)

Cc: <stable@vger.kernel.org>
Reported-by: syzbot+d015b6c2fbb5c383bf08@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/20230522124203.3838360-1-chao@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
d8189834d4 f2fs: fix to avoid NULL pointer dereference f2fs_write_end_io()
butt3rflyh4ck reports a bug as below:

When a thread always calls F2FS_IOC_RESIZE_FS to resize fs, if resize fs is
failed, f2fs kernel thread would invoke callback function to update f2fs io
info, it would call  f2fs_write_end_io and may trigger null-ptr-deref in
NODE_MAPPING.

general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
RIP: 0010:NODE_MAPPING fs/f2fs/f2fs.h:1972 [inline]
RIP: 0010:f2fs_write_end_io+0x727/0x1050 fs/f2fs/data.c:370
 <TASK>
 bio_endio+0x5af/0x6c0 block/bio.c:1608
 req_bio_endio block/blk-mq.c:761 [inline]
 blk_update_request+0x5cc/0x1690 block/blk-mq.c:906
 blk_mq_end_request+0x59/0x4c0 block/blk-mq.c:1023
 lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1101
 __do_softirq+0x1d4/0x8ef kernel/softirq.c:571
 run_ksoftirqd kernel/softirq.c:939 [inline]
 run_ksoftirqd+0x31/0x60 kernel/softirq.c:931
 smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
 kthread+0x33e/0x440 kernel/kthread.c:379
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

The root cause is below race case can cause leaving dirty metadata
in f2fs after filesystem is remount as ro:

Thread A				Thread B
- f2fs_ioc_resize_fs
 - f2fs_readonly   --- return false
 - f2fs_resize_fs
					- f2fs_remount
					 - write_checkpoint
					 - set f2fs as ro
  - free_segment_range
   - update meta_inode's data

Then, if f2fs_put_super()  fails to write_checkpoint due to readonly
status, and meta_inode's dirty data will be writebacked after node_inode
is put, finally, f2fs_write_end_io will access NULL pointer on
sbi->node_inode.

Thread A				IRQ context
- f2fs_put_super
 - write_checkpoint fails
 - iput(node_inode)
 - node_inode = NULL
 - iput(meta_inode)
  - write_inode_now
   - f2fs_write_meta_page
					- f2fs_write_end_io
					 - NODE_MAPPING(sbi)
					 : access NULL pointer on node_inode

Fixes: b4b10061ef ("f2fs: refactor resize_fs to avoid meta updates in progress")
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Closes: https://lore.kernel.org/r/1684480657-2375-1-git-send-email-yangtiezhu@loongson.cn
Tested-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
bfd4766239 f2fs: clean up w/ sbi->log_sectors_per_block
Use sbi->log_sectors_per_block to clean up below calculated one:

unsigned int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
90b7c4b748 f2fs: fix to set noatime and immutable flag for quota file
We should set noatime bit for quota files, since no one cares about
atime of quota file, and we should set immutalbe bit as well, due to
nobody should write to the file through exported interfaces.

Meanwhile this patch use inode_lock to avoid race condition during
inode->i_flags, f2fs_inode->i_flags update.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
77e820ea73 f2fs: renew value of F2FS_FEATURE_*
Define F2FS_FEATURE_* macro w/ 32-bits value rather than 16-bits value.

No logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Chao Yu
478d7100f4 f2fs: renew value of F2FS_MOUNT_*
Then we can just define newly introduced mount option w/ lasted
free number rather than random free one.

Just cleanup, no logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Chao Yu
f082c6b205 f2fs: fix potential deadlock due to unpaired node_write lock use
If S_NOQUOTA is cleared from inode during data page writeback of quota
file, it may miss to unlock node_write lock, result in potential
deadlock, fix to use the lock in paired.

Kworker					Thread
- writepage
 if (IS_NOQUOTA())
   f2fs_down_read(&sbi->node_write);
					- vfs_cleanup_quota_inode
					 - inode->i_flags &= ~S_NOQUOTA;
 if (IS_NOQUOTA())
   f2fs_up_read(&sbi->node_write);

Fixes: 79963d967b ("f2fs: shrink node_write lock coverage")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Yonggil Song
36ded4c106 f2fs: Fix over-estimating free section during FG GC
There was a bug that finishing FG GC unconditionally because free sections
are over-estimated after checkpoint in FG GC.
This patch initializes sec_freed by every checkpoint in FG GC.

Signed-off-by: Yonggil Song <yonggil.song@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Daeho Jeong
04abeb699d f2fs: close unused open zones while mounting
Zoned UFS allows only 6 open zones at the same time, so we need to take
care of the count of open zones while mounting.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Mickaël Salaün
74ce793bcb
hostfs: Fix ephemeral inodes
hostfs creates a new inode for each opened or created file, which
created useless inode allocations and forbade identifying a host file
with a kernel inode.

Fix this uncommon filesystem behavior by tying kernel inodes to host
file's inode and device IDs.  Even if the host filesystem inodes may be
recycled, this cannot happen while a file referencing it is opened,
which is the case with hostfs.  It should be noted that hostfs inode IDs
may not be unique for the same hostfs superblock because multiple host's
(backed) superblocks may be used.

Delete inodes when dropping them to force backed host's file descriptors
closing.

This enables to entirely remove ARCH_EPHEMERAL_INODES, and then makes
Landlock fully supported by UML.  This is very useful for testing
changes.

These changes also factor out and simplify some helpers thanks to the
new hostfs_inode_update() and the hostfs_iget() revamp: read_name(),
hostfs_create(), hostfs_lookup(), hostfs_mknod(), and
hostfs_fill_sb_common().

A following commit with new Landlock tests check this new hostfs inode
consistency.

Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20230612191430.339153-2-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
2023-06-12 21:26:19 +02:00
Luís Henriques
26a6ffff7d ocfs2: check new file size on fallocate call
When changing a file size with fallocate() the new size isn't being
checked.  In particular, the FSIZE ulimit isn't being checked, which makes
fstest generic/228 fail.  Simply adding a call to inode_newsize_ok() fixes
this issue.

Link: https://lkml.kernel.org/r/20230529152645.32680-1-lhenriques@suse.de
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Mark Fasheh <mark@fasheh.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:52 -07:00
Benjamin Segall
2192bba03d epoll: ep_autoremove_wake_function should use list_del_init_careful
autoremove_wake_function uses list_del_init_careful, so should epoll's
more aggressive variant.  It only doesn't because it was copied from an
older wait.c rather than the most recent.

[bsegall@google.com: add comment]
  Link: https://lkml.kernel.org/r/xm26bki0ulsr.fsf_-_@google.com
Link: https://lkml.kernel.org/r/xm26pm6hvfer.fsf@google.com
Fixes: a16ceb1396 ("epoll: autoremove wakers even more aggressively")
Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:52 -07:00
Ryusuke Konishi
92c5d1b860 nilfs2: reject devices with insufficient block count
The current sanity check for nilfs2 geometry information lacks checks for
the number of segments stored in superblocks, so even for device images
that have been destructively truncated or have an unusually high number of
segments, the mount operation may succeed.

This causes out-of-bounds block I/O on file system block reads or log
writes to the segments, the latter in particular causing
"a_ops->writepages" to repeatedly fail, resulting in sync_inodes_sb() to
hang.

Fix this issue by checking the number of segments stored in the superblock
and avoiding mounting devices that can cause out-of-bounds accesses.  To
eliminate the possibility of overflow when calculating the number of
blocks required for the device from the number of segments, this also adds
a helper function to calculate the upper bound on the number of segments
and inserts a check using it.

Link: https://lkml.kernel.org/r/20230526021332.3431-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+7d50f1e54a12ba3aeae2@syzkaller.appspotmail.com
  Link: https://syzkaller.appspot.com/bug?extid=7d50f1e54a12ba3aeae2
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:51 -07:00
Luís Henriques
50d927880e ocfs2: fix use-after-free when unmounting read-only filesystem
It's trivial to trigger a use-after-free bug in the ocfs2 quotas code using
fstest generic/452.  After a read-only remount, quotas are suspended and
ocfs2_mem_dqinfo is freed through ->ocfs2_local_free_info().  When unmounting
the filesystem, an UAF access to the oinfo will eventually cause a crash.
 
BUG: KASAN: slab-use-after-free in timer_delete+0x54/0xc0
Read of size 8 at addr ffff8880389a8208 by task umount/669
...
Call Trace:
 <TASK>
 ...
 timer_delete+0x54/0xc0
 try_to_grab_pending+0x31/0x230
 __cancel_work_timer+0x6c/0x270
 ocfs2_disable_quotas.isra.0+0x3e/0xf0 [ocfs2]
 ocfs2_dismount_volume+0xdd/0x450 [ocfs2]
 generic_shutdown_super+0xaa/0x280
 kill_block_super+0x46/0x70
 deactivate_locked_super+0x4d/0xb0
 cleanup_mnt+0x135/0x1f0
 ...
 </TASK>

Allocated by task 632:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 __kasan_kmalloc+0x8b/0x90
 ocfs2_local_read_info+0xe3/0x9a0 [ocfs2]
 dquot_load_quota_sb+0x34b/0x680
 dquot_load_quota_inode+0xfe/0x1a0
 ocfs2_enable_quotas+0x190/0x2f0 [ocfs2]
 ocfs2_fill_super+0x14ef/0x2120 [ocfs2]
 mount_bdev+0x1be/0x200
 legacy_get_tree+0x6c/0xb0
 vfs_get_tree+0x3e/0x110
 path_mount+0xa90/0xe10
 __x64_sys_mount+0x16f/0x1a0
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 650:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 kasan_save_free_info+0x2a/0x50
 __kasan_slab_free+0xf9/0x150
 __kmem_cache_free+0x89/0x180
 ocfs2_local_free_info+0x2ba/0x3f0 [ocfs2]
 dquot_disable+0x35f/0xa70
 ocfs2_susp_quotas.isra.0+0x159/0x1a0 [ocfs2]
 ocfs2_remount+0x150/0x580 [ocfs2]
 reconfigure_super+0x1a5/0x3a0
 path_mount+0xc8a/0xe10
 __x64_sys_mount+0x16f/0x1a0
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Link: https://lkml.kernel.org/r/20230522102112.9031-1-lhenriques@suse.de
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:51 -07:00
Ryusuke Konishi
fee5eaecca nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
Syzbot reports that in its stress test for resize ioctl, the log writing
function nilfs_segctor_do_construct hits a WARN_ON in
nilfs_segctor_truncate_segments().

It turned out that there is a problem with the current implementation of
the resize ioctl, which changes the writable range on the device (the
range of allocatable segments) at the end of the resize process.

This order is necessary for file system expansion to avoid corrupting the
superblock at trailing edge.  However, in the case of a file system
shrink, if log writes occur after truncating out-of-bounds trailing
segments and before the resize is complete, segments may be allocated from
the truncated space.

The userspace resize tool was fine as it limits the range of allocatable
segments before performing the resize, but it can run into this issue if
the resize ioctl is called alone.

Fix this issue by changing nilfs_sufile_resize() to update the range of
allocatable segments immediately after successful truncation of segment
space in case of file system shrink.

Link: https://lkml.kernel.org/r/20230524094348.3784-1-konishi.ryusuke@gmail.com
Fixes: 4e33f9eab0 ("nilfs2: implement resize ioctl")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+33494cd0df2ec2931851@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/0000000000005434c405fbbafdc5@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:51 -07:00
Peter Xu
5543d3c43c mm/uffd: allow vma to merge as much as possible
We used to not pass in the pgoff correctly when register/unregister uffd
regions, it caused incorrect behavior on vma merging and can cause
mergeable vmas being separate after ioctls return.

For example, when we have:

  vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)

Then someone unregisters uffd on range (5-9), it should logically become:

  vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)

But with current code we'll have:

  vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)

This patch allows such merge to happen correctly before ioctl returns.

This behavior seems to have existed since the 1st day of uffd.  Since
pgoff for vma_merge() is only used to identify the possibility of vma
merging, meanwhile here what we did was always passing in a pgoff smaller
than what we should, so there should have no other side effect besides not
merging it.  Let's still tentatively copy stable for this, even though I
don't see anything will go wrong besides vma being split (which is mostly
not user visible).

Link: https://lkml.kernel.org/r/20230517190916.3429499-3-peterx@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Peter Xu
270aa01062 mm/uffd: fix vma operation where start addr cuts part of vma
Patch series "mm/uffd: Fix vma merge/split", v2.

This series contains two patches that fix vma merge/split for userfaultfd
on two separate issues.

Patch 1 fixes a regression since 6.1+ due to something we overlooked when
converting to maple tree apis.  The plan is we use patch 1 to replace the
commit "2f628010799e (mm: userfaultfd: avoid passing an invalid range to
vma_merge())" in mm-hostfixes-unstable tree if possible, so as to bring
uffd vma operations back aligned with the rest code again.

Patch 2 fixes a long standing issue that vma can be left unmerged even if
we can for either uffd register or unregister.

Many thanks to Lorenzo on either noticing this issue from the assert
movement patch, looking at this problem, and also provided a reproducer on
the unmerged vma issue [1].

[1] https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e


This patch (of 2):

It seems vma merging with uffd paths is broken with either
register/unregister, where right now we can feed wrong parameters to
vma_merge() and it's found by recent patch which moved asserts upwards in
vma_merge() by Lorenzo Stoakes:

https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/

It's possible that "start" is contained within vma but not clamped to its
start.  We need to convert this into either "cannot merge" case or "can
merge" case 4 which permits subdivision of prev by assigning vma to prev. 
As we loop, each subsequent VMA will be clamped to the start.

This patch will eliminate the report and make sure vma_merge() calls will
become legal again.

One thing to mention is that the "Fixes: 29417d292bd0" below is there only
to help explain where the warning can start to trigger, the real commit to
fix should be 69dbe6daf1.  Commit 29417d292b helps us to identify the
issue, but unfortunately we may want to keep it in Fixes too just to ease
kernel backporters for easier tracking.

Link: https://lkml.kernel.org/r/20230517190916.3429499-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230517190916.3429499-2-peterx@redhat.com
Fixes: 69dbe6daf1 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Closes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Ryusuke Konishi
2f012f2bac nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
A syzbot fault injection test reported that nilfs_btnode_create_block, a
helper function that allocates a new node block for b-trees, causes a
kernel BUG for disk images where the file system block size is smaller
than the page size.

This was due to unexpected flags on the newly allocated buffer head, and
it turned out to be because the buffer flags were not cleared by
nilfs_btnode_abort_change_key() after an error occurred during a b-tree
update operation and the buffer was later reused in that state.

Fix this issue by using nilfs_btnode_delete() to abandon the unused
preallocated buffer in nilfs_btnode_abort_change_key().

Link: https://lkml.kernel.org/r/20230513102428.10223-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+b0a35a5c1f7e846d3b09@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000d1d6c205ebc4d512@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:49 -07:00
Bob Peterson
c8ed1b3593 gfs2: Fix duplicate should_fault_in_pages() call
In gfs2_file_buffered_write(), we currently jump from the second call of
function should_fault_in_pages() to above the first call, so
should_fault_in_pages() is getting called twice in a row, causing it to
accidentally fall back to single-page writes rather than trying the more
efficient multi-page writes first.

Fix that by moving the retry label to the correct place, behind the
first call to should_fault_in_pages().

Fixes: e1fa9ea85c ("gfs2: Stop using glock holder auto-demotion for now")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-12 20:04:35 +02:00
Linus Torvalds
ace9e12da2 for-6.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSHScAACgkQxWXV+ddt
 WDvoLA/8CxGfC9i/zO2odxbV1id8JiubGyi2Q28ANE3ygwRBI2dh7u2TBTv9aKPF
 Bzm6VsafG2OwMuwu08jO3t98+QrxU9vb6YCzCPL4t+8IDLJhwpz6zdH/Lvl3RnyV
 nz+aKHi2vfTRKt1Cf4uB5dVzPM3QVHYi3vidt15Suf2nhKnXimu0FVGXabQfd44z
 cCE4ep8IkLshcrsEOwVQj44isRXztJza3D6P7zPfu0NB5Bue7VJNBI4JoGOAT8UQ
 8c+V1U6EbMARWcdbk4Vm34IoAAxcQW6MNnHG83+ie2OpuKJ9g7oNXMTPL73gntNr
 DtC38Vr8gbpXJFmqOCwD8+9f3jP2pX6LjJT0IR6eGJbCleWd6JPlvnfJ+QHdb/vE
 LblDjH84O0Js+0iPKOSKzglfrKZPYDEnIBUwbZQICj/8+aHPU1Y4eTRcv52bVnpa
 1umdz19Sjh0HjuX4k44E/fLgGnLw+ezxhe6WQ7RdDrnr4+9tXpz0z/ZsatIgl1Pc
 wfS5Y2XBIdzKBIF8FxAEL3xCXd6byOsMMhSRu6J7W8Tgw5dnvKiQLRCK+FIpBRru
 WZ7vrNKz67marmqcIp0Hpoipd5+ib6pAdZs69GAvk4bWvVoLZ0Vuyb3lQr5fg6Vm
 Xn1iwcYoWjlAYrpVW31dlaVCfoewm96qbzNa3XqA87I/6frGFcc=
 =ABpK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A  more fixes and regression fixes:

   - in subpage mode, fix crash when repairing metadata at the end of
     a stripe

   - properly enable async discard when remounting from read-only to
     read-write

   - scrub regression fixes:
      - respect read-only scrub when attempting to do a repair
      - fix reporting of found errors, the stats don't get properly
        accounted after a stripe repair"

* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: also report errors hit during the initial read
  btrfs: scrub: respect the read-only flag during repair
  btrfs: properly enable async discard when switching from RO->RW
  btrfs: subpage: fix a crash in metadata repair path
2023-06-12 10:53:35 -07:00
Dai Ngo
58f5d89400 NFSD: add encoding of op_recall flag for write delegation
Modified nfsd4_encode_open to encode the op_recall flag properly
for OPEN result with write delegation granted.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
2023-06-12 12:16:35 -04:00
NeilBrown
665e89ab7c lockd: drop inappropriate svc_get() from locked_get()
The below-mentioned patch was intended to simplify refcounting on the
svc_serv used by locked.  The goal was to only ever have a single
reference from the single thread.  To that end we dropped a call to
lockd_start_svc() (except when creating thread) which would take a
reference, and dropped the svc_put(serv) that would drop that reference.

Unfortunately we didn't also remove the svc_get() from
lockd_create_svc() in the case where the svc_serv already existed.
So after the patch:
 - on the first call the svc_serv was allocated and the one reference
   was given to the thread, so there are no extra references
 - on subsequent calls svc_get() was called so there is now an extra
   reference.
This is clearly not consistent.

The inconsistency is also clear in the current code in lockd_get()
takes *two* references, one on nlmsvc_serv and one by incrementing
nlmsvc_users.   This clearly does not match lockd_put().

So: drop that svc_get() from lockd_get() (which used to be in
lockd_create_svc().

Reported-by: Ido Schimmel <idosch@idosch.org>
Closes: https://lore.kernel.org/linux-nfs/ZHsI%2FH16VX9kJQX1@shredder/T/#u
Fixes: b73a297204 ("lockd: move lockd_start_svc() call into lockd_create_svc()")
Signed-off-by: NeilBrown <neilb@suse.de>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:34 -04:00
Christoph Hellwig
05bdb99653 block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE.  Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig
81b1fb7d17 fs: remove sb->s_mode
There is no real need to store the open mode in the super_block now.
It is only used by f2fs, which can easily recalculate it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
3f0b3e785e block: add a sb_open_mode helper
Add a helper to return the open flags for blkdev_get_by* for passed in
super block flags instead of open coding the logic in many places.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
2736e8eeb0 block: use the holder as indication for exclusive opens
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder.  Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.

For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>		[btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
2ef789288a btrfs: don't pass a holder for non-exclusive blkdev_get_by_path
Passing a holder to blkdev_get_by_path when FMODE_EXCL isn't set doesn't
make sense, so pass NULL instead and remove the holder argument from the
call chains the only end up in non-FMODE_EXCL blkdev_get_by_path calls.

Exclusive mode for device scanning is not used since commit 50d281fc43
("btrfs: scan device in non-exclusive mode")".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20230608110258.189493-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
7b7b06d55a gfs2: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag"), file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.

Remove .direct_IO from gfs2_aops and set FMODE_CAN_ODIRECT in
gfs2_open_common for regular files that do not use data journalling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-12 12:18:55 +02:00
Amir Goldstein
7b8c9d7bb4 fsnotify: move fsnotify_open() hook into do_dentry_open()
fsnotify_open() hook is called only from high level system calls
context and not called for the very many helpers to open files.

This may makes sense for many of the special file open cases, but it is
inconsistent with fsnotify_close() hook that is called for every last
fput() of on a file object with FMODE_OPENED.

As a result, it is possible to observe ACCESS, MODIFY and CLOSE events
without ever observing an OPEN event.

Fix this inconsistency by replacing all the fsnotify_open() hooks with
a single hook inside do_dentry_open().

If there are special cases that would like to opt-out of the possible
overhead of fsnotify() call in fsnotify_open(), they would probably also
want to avoid the overhead of fsnotify() call in the rest of the fsnotify
hooks, so they should be opening that file with the __FMODE_NONOTIFY flag.

However, in the majority of those cases, the s_fsnotify_connectors
optimization in fsnotify_parent() would be sufficient to avoid the
overhead of fsnotify() call anyway.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>
2023-06-12 10:43:45 +02:00
Shyam Prasad N
5e90aa21eb cifs: fix max_credits implementation
The current implementation of max_credits on the client does
not work because the CreditRequest logic for several commands
does not take max_credits into account.

Still, we can end up asking the server for more credits, depending
on the number of credits in flight. For this, we need to
limit the credits while parsing the responses too.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:49 -05:00
Shyam Prasad N
2991b77409 cifs: fix sockaddr comparison in iface_cmp
iface_cmp used to simply do a memcmp of the two
provided struct sockaddrs. The comparison needs to do more
based on the address family. Similar logic was already
present in cifs_match_ipaddr. Doing something similar now.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:49 -05:00
Enzo Matsumiya
50e63d6db6 smb/client: print "Unknown" instead of bogus link speed value
The virtio driver for Linux guests will not set a link speed to its
paravirtualized NICs.  This will be seen as -1 in the ethernet layer, and
when some servers (e.g. samba) fetches it, it's converted to an unsigned
value (and multiplied by 1000 * 1000), so in client side we end up with:

1)      Speed: 4294967295000000 bps

in DebugData.

This patch introduces a helper that returns a speed string (in Mbps or
Gbps) if interface speed is valid (>= SPEED_10 and <= SPEED_800000), or
"Unknown" otherwise.

The reason to not change the value in iface->speed is because we don't
know the real speed of the HW backing the server NIC, so let's keep
considering these as the fastest NICs available.

Also print "Capabilities: None" when the interface doesn't support any.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:49 -05:00
Shyam Prasad N
2a44b38966 cifs: print all credit counters in DebugData
Output of /proc/fs/cifs/DebugData shows only the per-connection
counter for the number of credits of regular type. i.e. the
credits reserved for echo and oplocks are not displayed.

There have been situations recently where having this info
would have been useful. This change prints the credit counters
of all three types: regular, echo, oplocks.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:48 -05:00
Shyam Prasad N
91f4480c41 cifs: fix status checks in cifs_tree_connect
The ordering of status checks at the beginning of
cifs_tree_connect is wrong. As a result, a tcon
which is good may stay marked as needing reconnect
infinitely.

Fixes: 2f0e4f0342 ("cifs: check only tcon status on tcon related functions")
Cc: stable@vger.kernel.org # 6.3
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:48 -05:00
鑫华
a5998a9ec3 smb: remove obsolete comment
Because do_gettimeofday has been removed and replaced by ktime_get_real_ts64,
So just remove the comment as it's not needed now.

Signed-off-by: 鑫华 <jixianghua@xfusion.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:48 -05:00
Jeff Layton
518f375c15 nfsd: don't provide pre/post-op attrs if fh_getattr fails
nfsd calls fh_getattr to get the latest inode attrs for pre/post-op
info. In the event that fh_getattr fails, it resorts to scraping cached
values out of the inode directly.

Since these attributes are optional, we can just skip providing them
altogether when this happens.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Neil Brown <neilb@suse.de>
2023-06-11 16:37:46 -04:00
Chuck Lever
df56b384de NFSD: Remove nfsd_readv()
nfsd_readv()'s consumers now use nfsd_iter_read().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:46 -04:00
Chuck Lever
703d752155 NFSD: Hoist rq_vec preparation into nfsd_read() [step two]
Now that the preparation of an rq_vec has been removed from the
generic read path, nfsd_splice_read() no longer needs to reset
rq_next_page.

nfsd4_encode_read() calls nfsd_splice_read() directly. As far as I
can ascertain, resetting rq_next_page for NFSv4 splice reads is
unnecessary because rq_next_page is already set correctly.

Moreover, resetting it might even be incorrect if previous
operations in the COMPOUND have already consumed at least a page of
the send buffer. I would expect that the result would be encoding
the READ payload over previously-encoded results.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:46 -04:00
Chuck Lever
507df40ebf NFSD: Hoist rq_vec preparation into nfsd_read()
Accrue the following benefits:

a) Deduplicate this common bit of code.

b) Don't prepare rq_vec for NFSv2 and NFSv3 spliced reads, which
   don't use rq_vec. This is already the case for
   nfsd4_encode_read().

c) Eventually, converting NFSD's read path to use a bvec iterator
   will be simpler.

In the next patch, nfsd_iter_read() will replace nfsd_readv() for
all NFS versions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:45 -04:00
Chuck Lever
ed4a567a17 NFSD: Update rq_next_page between COMPOUND operations
A GETATTR with a large result can advance xdr->page_ptr without
updating rq_next_page. If a splice READ follows that GETATTR in the
COMPOUND, nfsd_splice_actor can start splicing at the wrong page.

I've also seen READLINK and READDIR leave rq_next_page in an
unmodified state.

There are potentially a myriad of combinations like this, so play it
safe: move the rq_next_page update to nfsd4_encode_operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:45 -04:00
Chuck Lever
ba21e20b30 NFSD: Use svcxdr_encode_opaque_pages() in nfsd4_encode_splice_read()
Commit 15b23ef5d3 ("nfsd4: fix corruption of NFSv4 read data")
encountered exactly the same issue: after a splice read, a
filesystem-owned page is left in rq_pages[]; the symptoms are the
same as described there.

If the computed number of pages in nfsd4_encode_splice_read() is not
exactly the same as the actual number of pages that were consumed by
nfsd_splice_actor() (say, because of a bug) then hilarity ensues.

Instead of recomputing the page offset based on the size of the
payload, use rq_next_page, which is already properly updated by
nfsd_splice_actor(), to cause svc_rqst_release_pages() to operate
correctly in every instance.

This is a defensive change since we believe that after commit
27c934dd88 ("nfsd: don't replace page in rq_pages if it's a
continuation of last page") has been applied, there are no known
opportunities for nfsd_splice_actor() to screw up. So I'm not
marking it for stable backport.

Reported-by: Andy Zlotek <andy.zlotek@oracle.com>
Suggested-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:40 -04:00
Linus Torvalds
65d7ca5987 five smb3 server fixes, all also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmSFH04ACgkQiiy9cAdy
 T1HXbAv5AZmcFRMjMTY4Yk/4Csm9wFn11ieMtmtBPfzdHLt5SNnVL+QK69wUJEVb
 Z0Q/mcc6xJvgzda3W9k7jg0G3ZiWty9GJ8ezqlVNLUQdRuRVqnoS5FoF9d68/UUN
 t+AVz2mLb28wIZbx13OXQ4Eb46KgbI4fQN/mZ7c6BrbKHu6CxYg84SyXG8WaBgw3
 FYGpGGnF94O26kbAwa+9ZI51zrNOpHFtjNZtdq/c/uYhpSa9aE4uNDkI841A2v2W
 //l9CxbK8gwXObOFmXlsMOrT+z/4rigL7iWJaR/6fRvcswv6x1Nqd8xvLum9hzDl
 IRvXrshKUP1w7i5eAYUavzIqc11nlGM+EGUhYpkTmED6oxpXufyf5029JH5ZziSR
 0lsQ/1ZCI1/7h5fkZIxI7khwIhOBVitrRj4rylTQzlrTyZZUdR+O2ONlsqmP5+tX
 Iw+z0NHrg/4+Mt2SSMx67QXoT2RU8E0GxtLOk5u6Hq4KJjv8wC5NquA+EVnVovsK
 he0D1ibu
 =JIvi
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc5-smb3-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Five smb3 server fixes, all also for stable:

   - Fix four slab out of bounds warnings: improve checks for protocol
     id, and for small packet length, and for create context parsing,
     and for negotiate context parsing

   - Fix for incorrect dereferencing POSIX ACLs"

* tag '6.4-rc5-smb3-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: validate smb request protocol id
  ksmbd: check the validation of pdu_size in ksmbd_conn_handler_loop
  ksmbd: fix posix_acls and acls dereferencing possible ERR_PTR()
  ksmbd: fix out-of-bound read in parse_lease_state()
  ksmbd: fix out-of-bound read in deassemble_neg_contexts()
2023-06-11 10:07:35 -07:00
Joseph Qi
69fe5c430c ocfs2: cleanup trace events
After commit 6dc8bc0fb3 ("ocfs2: switch to iter_file_splice_write()"),
ocfs2 has switched from ocfs2_file_splice_write() to
iter_file_splice_write(), so cleanup the corresponding trace event as
well.

Link: https://lkml.kernel.org/r/20230528132033.217664-2-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:23 -07:00
Joseph Qi
d32840ad4a ocfs2: correct return value of ocfs2_local_free_info()
Now in ocfs2_local_free_info(), it returns 0 even if it actually fails. 
Though it doesn't cause any real problem since the only caller
dquot_disable() ignores the return value, we'd better return correct as it
is.

Link: https://lkml.kernel.org/r/20230528132033.217664-1-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:22 -07:00
Vincent Whitchurch
e994f5b677 squashfs: cache partial compressed blocks
Before commit 93e72b3c61 ("squashfs: migrate from ll_rw_block
usage to BIO"), compressed blocks read by squashfs were cached in the page
cache, but that is not the case after that commit.  That has lead to
squashfs having to re-read a lot of sectors from disk/flash.

For example, the first sectors of every metadata block need to be read
twice from the disk.  Once partially to read the length, and a second time
to read the block itself.  Also, in linear reads of large files, the last
sectors of one data block are re-read from disk when reading the next data
block, since the compressed blocks are of variable sizes and not aligned
to device blocks.  This extra I/O results in a degrade in read performance
of, for example, ~16% in one scenario on my ARM platform using squashfs
with dm-verity and NAND.

Since the decompressed data is cached in the page cache or squashfs'
internal metadata and fragment caches, caching _all_ compressed pages
would lead to a lot of double caching and is undesirable.  But make the
code cache any disk blocks which were only partially requested, since
these are the ones likely to include data which is needed by other file
system blocks.  This restores read performance in my test scenario.

The compressed block caching is only applied when the disk block size is
equal to the page size, to avoid having to deal with caching sub-page
reads.

[akpm@linux-foundation.org: fs/squashfs/block.c needs linux/pagemap.h]
[vincent.whitchurch@axis.com: fix page update race]
  Link: https://lkml.kernel.org/r/20230526-squashfs-cache-fixup-v1-1-d54a7fa23e7b@axis.com
[vincent.whitchurch@axis.com: fix page indices]
  Link: https://lkml.kernel.org/r/20230526-squashfs-cache-fixup-v1-2-d54a7fa23e7b@axis.com
[akpm@linux-foundation.org: fix layout, per hch]
Link: https://lkml.kernel.org/r/20230510-squashfs-cache-v4-1-3bd394e1ee71@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:14 -07:00
Christoph Hellwig
6b81459c9c squashfs: don't include buffer_head.h
Squashfs has stopped using buffers heads in 93e72b3c61
("squashfs: migrate from ll_rw_block usage to BIO").

Link: https://lkml.kernel.org/r/20230517071622.245151-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:14 -07:00
Azeem Shaikh
9e627588bf procfs: replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.  This read may exceed the
destination size limit.  This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated [1].  In an effort
to remove strlcpy() completely [2], replace strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Link: https://lkml.kernel.org/r/20230510212457.3491385-1-azeemshaikh38@gmail.com
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:13 -07:00
Christoph Hellwig
64d1b4dd82 fuse: use direct_write_fallback
Use the generic direct_write_fallback helper instead of duplicating the
logic.

Link: https://lkml.kernel.org/r/20230601145904.1385409-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:54 -07:00
Christoph Hellwig
596df33d67 fuse: drop redundant arguments to fuse_perform_write
pos is always equal to iocb->ki_pos, and mapping is always equal to
iocb->ki_filp->f_mapping.

Link: https://lkml.kernel.org/r/20230601145904.1385409-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:54 -07:00
Christoph Hellwig
70e986c3b4 fuse: update ki_pos in fuse_perform_write
Both callers of fuse_perform_write need to updated ki_pos, move it into
common code.

Link: https://lkml.kernel.org/r/20230601145904.1385409-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:54 -07:00
Christoph Hellwig
44fff0fa08 fs: factor out a direct_write_fallback helper
Add a helper dealing with handling the syncing of a buffered write
fallback for direct I/O.

Link: https://lkml.kernel.org/r/20230601145904.1385409-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:53 -07:00
Christoph Hellwig
8ee93b4bb6 iomap: use kiocb_write_and_wait and kiocb_invalidate_pages
Use the common helpers for direct I/O page invalidation instead of open
coding the logic.  This leads to a slight reordering of checks in
__iomap_dio_rw to keep the logic straight.

Link: https://lkml.kernel.org/r/20230601145904.1385409-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:53 -07:00
Christoph Hellwig
219580eea1 iomap: update ki_pos in iomap_file_buffered_write
All callers of iomap_file_buffered_write need to updated ki_pos, move it
into common code.

Link: https://lkml.kernel.org/r/20230601145904.1385409-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:53 -07:00
Christoph Hellwig
c402a9a943 filemap: add a kiocb_invalidate_post_direct_write helper
Add a helper to invalidate page cache after a dio write.

Link: https://lkml.kernel.org/r/20230601145904.1385409-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:53 -07:00
Christoph Hellwig
182c25e9c1 filemap: update ki_pos in generic_perform_write
All callers of generic_perform_write need to updated ki_pos, move it into
common code.

Link: https://lkml.kernel.org/r/20230601145904.1385409-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:52 -07:00
Christoph Hellwig
936e114a24 iomap: update ki_pos a little later in iomap_dio_complete
Move the ki_pos update down a bit to prepare for a better common helper
that invalidates pages based of an iocb.

Link: https://lkml.kernel.org/r/20230601145904.1385409-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:52 -07:00
Christoph Hellwig
0d625446d0 backing_dev: remove current->backing_dev_info
Patch series "cleanup the filemap / direct I/O interaction", v4.

This series cleans up some of the generic write helper calling conventions
and the page cache writeback / invalidation for direct I/O.  This is a
spinoff from the no-bufferhead kernel project, for which we'll want to an
use iomap based buffered write path in the block layer.


This patch (of 12):

The last user of current->backing_dev_info disappeared in commit
b9b1335e64 ("remove bdi_congested() and wb_congested() and related
functions").  Remove the field and all assignments to it.

Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:51 -07:00
Lorenzo Stoakes
ca5e863233 mm/gup: remove vmas parameter from get_user_pages_remote()
The only instances of get_user_pages_remote() invocations which used the
vmas parameter were for a single page which can instead simply look up the
VMA directly. In particular:-

- __update_ref_ctr() looked up the VMA but did nothing with it so we simply
  remove it.

- __access_remote_vm() was already using vma_lookup() when the original
  lookup failed so by doing the lookup directly this also de-duplicates the
  code.

We are able to perform these VMA operations as we already hold the
mmap_lock in order to be able to call get_user_pages_remote().

As part of this work we add get_user_page_vma_remote() which abstracts the
VMA lookup, error handling and decrementing the page reference count should
the VMA lookup fail.

This forms part of a broader set of patches intended to eliminate the vmas
parameter altogether.

[akpm@linux-foundation.org: avoid passing NULL to PTR_ERR]
Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64)
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390)
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:26 -07:00
Yuanchu Xie
7bab8dfb12 mm: pagemap: restrict pagewalk to the requested range
The pagewalk in pagemap_read reads one PTE past the end of the requested
range, and stops when the buffer runs out of space. While it produces
the right result, the extra read is unnecessary and less performant.

I timed the following command before and after this patch:
	dd count=100000 if=/proc/self/pagemap of=/dev/null
The results are consistently within 0.001s across 5 runs.

Before:
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 0.0763159 s, 671 MB/s

real    0m0.078s
user    0m0.012s
sys     0m0.065s

After:
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 0.0487928 s, 1.0 GB/s

real    0m0.050s
user    0m0.011s
sys     0m0.039s

Link: https://lkml.kernel.org/r/20230515172608.3558391-1-yuanchu@google.com
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:20 -07:00
Ackerley Tng
adef080382 fs: hugetlbfs: set vma policy only when needed for allocating folio
Calling hugetlb_set_vma_policy() later avoids setting the vma policy
and then dropping it on a page cache hit.

Link: https://lkml.kernel.org/r/20230502235622.3652586-1-ackerleytng@google.com
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:17 -07:00
Yosry Ahmed
2816ea2abf writeback: move wb_over_bg_thresh() call outside lock section
Patch series "cgroup: eliminate atomic rstat flushing", v5.

A previous patch series [1] changed most atomic rstat flushing contexts to
become non-atomic.  This was done to avoid an expensive operation that
scales with # cgroups and # cpus to happen with irqs disabled and
scheduling not permitted.  There were two remaining atomic flushing
contexts after that series.  This series tries to eliminate them as well,
eliminating atomic rstat flushing completely.

The two remaining atomic flushing contexts are:
(a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
(b) mem_cgroup_threshold()->mem_cgroup_usage()

For (a), flushing needs to be atomic as wb_writeback() calls
wb_over_bg_thresh() with a spinlock held.  However, it seems like the call
to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
this series proposes a refactoring that moves the call outside the lock
criticial section and makes the stats flushing in mem_cgroup_wb_stats()
non-atomic.

For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
with irqs disabled.  We only flush the stats when calculating the root
usage, as it is approximated as the sum of some memcg stats (file, anon,
and optionally swap) instead of the conventional page counter.  This
series proposes changing this calculation to use the global stats instead,
eliminating the need for a memcg stat flush.

After these 2 contexts are eliminated, we no longer need
mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic().  We can
remove them and simplify the code.

[1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/


This patch (of 5):

wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
flush, which can be expensive on large systems. Currently,
wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
have to do the rstat flush atomically. On systems with a lot of
cpus and/or cgroups, this can cause us to disable irqs for a long time,
potentially causing problems.

Move the call to wb_over_bg_thresh() outside the lock section in
preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
The list_empty(&wb->work_list) check should be okay outside the lock
section of wb->list_lock as it is protected by a separate lock
(wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is
modifying any of wb->b_* lists the wb->list_lock is protecting.
Also, the loop seems to be already releasing and reacquring the
lock, so this refactoring looks safe.

Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:14 -07:00
Linus Torvalds
7e8c948b3f A fix for a potential data corruption in differential backup and
snapshot-based mirroring scenarios in RBD and a reference counting
 fixup to avoid use-after-free in CephFS, all marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmSDUh4THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi0guB/4l7wOFnFvC+Dz5Y0KKuq2zFGQ64eZM
 hVpKEANsV/py/MTOdCzhW5cNcNj5/g8+1eozGxA8IzckzWf+25ziIn+BNWOO7DK1
 eO1U0wdiFnkXzr3nKSqNqm+hrUupAUd4Rb6644I4FwWKRu1WQydRjmvFVE+gw86O
 eeXujr3IlhhDF/VqO0sekCx9MaFPQaCaoscM3gU04meKAG84jt3oezueOlRqYFTX
 batwJ33wzVtLSh1NJIhC0iBMuBgvnuqQ9R8bHTdSNkR8Ov4V3B4DQGL4lmYnBxbv
 L3fMcz+sdZu3bDptUta4ZgdS4LkxUUUUEK07XeoBhAjZ3qPrMiD/gXay
 =S7Tu
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.4-rc6' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix for a potential data corruption in differential backup and
  snapshot-based mirroring scenarios in RBD and a reference counting
  fixup to avoid use-after-free in CephFS, all marked for stable"

* tag 'ceph-for-6.4-rc6' of https://github.com/ceph/ceph-client:
  ceph: fix use-after-free bug for inodes when flushing capsnaps
  rbd: get snapshot context after exclusive lock is ensured to be held
  rbd: move RBD_OBJ_FLAG_COPYUP_ENABLED flag setting
2023-06-09 10:53:58 -07:00
Linus Torvalds
8fc1c596c2 Fix an ext4 regression which breaks remounting r/w file systems that
have the quota feature enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmSClpwACgkQ8vlZVpUN
 gaME9AgAhKZuMUc9HxcMmXgkrVVTYh18OQVNwb8yI09gVspUkNC43mgwtYLAL6XT
 fr1ifzSDgjKYSAaSXALP//zxo/xGAl1cKfjLz/XqeqU2TlVnQyvUTkO5ywe2YYHo
 UZ2HbbiYXKH1c5PsqcwaiPmcf/sZvrXjvpvP4v4iO/b58UBKG5OIIlP6X/XgArco
 KYeMuFZYOX+jsgP/9+I+cutUEa9paHnMxycmPARdrnem3eA0jZtS7YrLJijxFmbV
 c4AFcbtuaW+ZmMHPsxtYoeQxM7TbL3vA0FghS14MxtdCipLkv5MTPnstaW/jGnST
 IlHLwLADCktbusV8UVJotWfAWcKeGw==
 =lKq5
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fix from Ted Ts'o:
 "Fix an ext4 regression which breaks remounting r/w file systems that
  have the quota feature enabled"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: only check dquot_initialize_needed() when debugging
  Revert "ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled"
2023-06-09 08:23:01 -07:00
David Howells
219d92056b splice, net: Fix SPLICE_F_MORE signalling in splice_direct_to_actor()
splice_direct_to_actor() doesn't manage SPLICE_F_MORE correctly[1] - and,
as a result, it incorrectly signals/fails to signal MSG_MORE when splicing
to a socket.  The problem I'm seeing happens when a short splice occurs
because we got a short read due to hitting the EOF on a file: as the length
read (read_len) is less than the remaining size to be spliced (len),
SPLICE_F_MORE (and thus MSG_MORE) is set.

The issue is that, for the moment, we have no way to know *why* the short
read occurred and so can't make a good decision on whether we *should* keep
MSG_MORE set.

MSG_SENDPAGE_NOTLAST was added to work around this, but that is also set
incorrectly under some circumstances - for example if a short read fills a
single pipe_buffer, but the next read would return more (seqfile can do
this).

This was observed with the multi_chunk_sendfile tests in the tls kselftest
program.  Some of those tests would hang and time out when the last chunk
of file was less than the sendfile request size:

	build/kselftest/net/tls -r tls.12_aes_gcm.multi_chunk_sendfile

This has been observed before[2] and worked around in AF_TLS[3].

Fix this by making splice_direct_to_actor() always signal SPLICE_F_MORE if
we haven't yet hit the requested operation size.  SPLICE_F_MORE remains
signalled if the user passed it in to splice() but otherwise gets cleared
when we've read sufficient data to fulfill the request.

If, however, we get a premature EOF from ->splice_read(), have sent at
least one byte and SPLICE_F_MORE was not set by the caller, ->splice_eof()
will be invoked.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: David Hildenbrand <david@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: linux-mm@kvack.org

Link: https://lore.kernel.org/r/499791.1685485603@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/ [2]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d452d48b9f8b1a7f8152d33ef52cfd7fe1735b0a [3]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 19:40:31 -07:00
David Howells
2bfc668509 splice, net: Add a splice_eof op to file-ops and socket-ops
Add an optional method, ->splice_eof(), to allow splice to indicate the
premature termination of a splice to struct file_operations and struct
proto_ops.

This is called if sendfile() or splice() encounters all of the following
conditions inside splice_direct_to_actor():

 (1) the user did not set SPLICE_F_MORE (splice only), and

 (2) an EOF condition occurred (->splice_read() returned 0), and

 (3) we haven't read enough to fulfill the request (ie. len > 0 still), and

 (4) we have already spliced at least one byte.

A further patch will modify the behaviour of SPLICE_F_MORE to always be
passed to the actor if either the user set it or we haven't yet read
sufficient data to fulfill the request.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: David Hildenbrand <david@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: linux-mm@kvack.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 19:40:30 -07:00
David Howells
2dc334f1a6 splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
Replace generic_splice_sendpage() + splice_from_pipe + pipe_to_sendpage()
with a net-specific handler, splice_to_socket(), that calls sendmsg() with
MSG_SPLICE_PAGES set instead of calling ->sendpage().

MSG_MORE is used to indicate if the sendmsg() is expected to be followed
with more data.

This allows multiple pipe-buffer pages to be passed in a single call in a
BVEC iterator, allowing the processing to be pushed down to a loop in the
protocol driver.  This helps pave the way for passing multipage folios down
too.

Protocols that haven't been converted to handle MSG_SPLICE_PAGES yet should
just ignore it and do a normal sendmsg() for now - although that may be a
bit slower as it may copy everything.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 19:40:30 -07:00
Jakub Kicinski
449f6bc17a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/sch_taprio.c
  d636fc5dd6 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
  dced11ef84 ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")

net/ipv4/sysctl_net_ipv4.c
  e209fee411 ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294")
  ccce324dab ("tcp: make the first N SYN RTO backoffs linear")
https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 11:35:14 -07:00
Linus Torvalds
79b6fad546 xfs: fixes for 6.4-rc5
This update contains:
 - Propagate unlinked inode list corruption back up to log recovery (regression
   fix).
 - improve corruption detection for AGFL entries, AGFL indexes and XEFI extents
   (syzkaller fuzzer oops report).
 - Avoid double perag reference release (regression fix).
 - Improve extent merging detection in scrub (regression fix).
 - Fix a new undefined high bit shift (regression fix).
 - Fix for AGF vs inode cluster buffer deadlock (regression fix).
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmSBcAIUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h1gEg/+IfG2aNR8P4+rqhOJ2yF5fZYtsqS1
 HkOX/N/Q8gBNXMwh3wWoXyBJk7gBhwySwcGvlYXoMZf6+alXGHaTMl8whxmFYaT9
 +aPBWo4lXRec6YHx016ZOjnNkLiWhyxdUvh85IFf0EJm5mK9QqjoX+lmbPc7HDzh
 0nFL66jaxM8W36QhK0srdwwjD3kNgZ2ZRNonlRULOzyTPpFfh985esTrmfmn3Ulx
 xiejw57xdpti9x+Pm5WZjUsW1/gx50hMS+yn/KiIWTQqncIO/OuirZSTrOFtUWTM
 xIfMB9xlkdaSmMCUyx2r2RVWJawXP++aT7nbza2eWJa0WSn5kZmHXugzI+V9zUx7
 M0oakkOXJl2pYakVr7G8JU4djZkQNu41JkuLVf5U7O3yYRWlXzViAqljd3S2C/+i
 pSjG9ram8esd/CAmw/hE6Jvhm6QYS1/D3KQ9Gs6JaptzR8Xjc7t7GEj1T0pMyPem
 iZx80C6fi87k/94hQ+HXalrAyJER9EmcQ25yngKucjgfrO0BrzNLGDus4uY0+IzX
 Y2T6xcSF/Vhd1soaklRuHryF7Vv7ECCIWVUV2pH7GHZwv0LaXvBnC1CUdYXskXy8
 RGlIBL75lBkOfZp0zK/R11sm1qxzXPCayBkZtSglj/RdNaZiFO9uwO139eGl+xP5
 ytjMvitThOGoAfU=
 =vT74
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Dave Chinner:
 "These are a set of regression fixes discovered on recent kernels. I
  was hoping to send this to you a week and half ago, but events out of
  my control delayed finalising the changes until early this week.

  Whilst the diffstat looks large for this stage of the merge window, a
  large chunk of it comes from moving the guts of one function from one
  file to another i.e. it's the same code, it is just run in a different
  context where it is safe to hold a specific lock. Otherwise the
  individual changes are relatively small and straigtht forward.

  Summary:

   - Propagate unlinked inode list corruption back up to log recovery
     (regression fix)

   - improve corruption detection for AGFL entries, AGFL indexes and
     XEFI extents (syzkaller fuzzer oops report)

   - Avoid double perag reference release (regression fix)

   - Improve extent merging detection in scrub (regression fix)

   - Fix a new undefined high bit shift (regression fix)

   - Fix for AGF vs inode cluster buffer deadlock (regression fix)"

* tag 'xfs-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: collect errors from inodegc for unlinked inode recovery
  xfs: validate block number being freed before adding to xefi
  xfs: validity check agbnos on the AGFL
  xfs: fix agf/agfl verification on v4 filesystems
  xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag()
  xfs: fix broken logic when detecting mergeable bmap records
  xfs: Fix undefined behavior of shift into sign bit
  xfs: fix AGF vs inode cluster buffer deadlock
  xfs: defered work could create precommits
  xfs: restore allocation trylock iteration
  xfs: buffer pins need to hold a buffer reference
2023-06-08 08:46:58 -07:00
Theodore Ts'o
dea9d8f764 ext4: only check dquot_initialize_needed() when debugging
ext4_xattr_block_set() relies on its caller to call dquot_initialize()
on the inode.  To assure that this has happened there are WARN_ON
checks.  Unfortunately, this is subject to false positives if there is
an antagonist thread which is flipping the file system at high rates
between r/o and rw.  So only do the check if EXT4_XATTR_DEBUG is
enabled.

Link: https://lore.kernel.org/r/20230608044056.GA1418535@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-08 10:06:40 -04:00
Theodore Ts'o
1b29243933 Revert "ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled"
This reverts commit a44be64bbe.

Link: https://lore.kernel.org/r/653b3359-2005-21b1-039d-c55ca4cffdcc@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-08 09:57:04 -04:00
Christoph Hellwig
4bb218a65a fs: unexport buffer_check_dirty_writeback
buffer_check_dirty_writeback is only used by the block device aops,
remove the export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230608122958.276954-1-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-08 14:39:57 +02:00
Qu Wenruo
79b8ee702c btrfs: scrub: also report errors hit during the initial read
[BUG]
After the recent scrub rework introduced in commit e02ee89baa ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:

  # mkfs.btrfs -f $dev -d DUP
  # mount $dev $mnt
  # xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
  # umount $dev
  # xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
  # mount $dev $mnt
  # btrfs scrub start -BR $mnt
  scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
  Scrub started:    Tue Jun  6 14:56:50 2023
  Status:           finished
  Duration:         0:00:00
  	data_extents_scrubbed: 2
  	tree_extents_scrubbed: 18
  	data_bytes_scrubbed: 131072
  	tree_bytes_scrubbed: 294912
  	read_errors: 0
  	csum_errors: 0 <<< No errors here
  	verify_errors: 0
         [...]
  	uncorrectable_errors: 0
  	unverified_errors: 0
  	corrected_errors: 16		<<< Only corrected errors
  	last_physical: 2723151872

This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.

While on v6.3.x kernels, the report is different:

 	csum_errors: 16			<<<
 	verify_errors: 0
	[...]
 	uncorrectable_errors: 0
 	unverified_errors: 0
 	corrected_errors: 16 <<<

[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.

For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.

Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.

To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.

Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.

Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-08 14:34:01 +02:00
Qu Wenruo
1f2030ff6e btrfs: scrub: respect the read-only flag during repair
[BUG]
With recent scrub rework, the scrub operation no longer respects the
read-only flag passed by "-r" option of "btrfs scrub start" command.

  # mkfs.btrfs -f -d raid1 $dev1 $dev2
  # mount $dev1 $mnt
  # xfs_io -f -d -c "pwrite -b 128K -S 0xaa 0 128k" $mnt/file
  # sync
  # xfs_io -c "pwrite -S 0xff $phy1 64k" $dev1
  # xfs_io -c "pwrite -S 0xff $((phy2 + 65536)) 64k" $dev2
  # mount $dev1 $mnt -o ro
  # btrfs scrub start -BrRd $mnt
  Scrub device $dev1 (id 1) done
  Scrub started:    Tue Jun  6 09:59:14 2023
  Status:           finished
  Duration:         0:00:00
         [...]
  	corrected_errors: 16 <<< Still has corrupted sectors
  	last_physical: 1372585984

  Scrub device $dev2 (id 2) done
  Scrub started:    Tue Jun  6 09:59:14 2023
  Status:           finished
  Duration:         0:00:00
         [...]
  	corrected_errors: 16 <<< Still has corrupted sectors
  	last_physical: 1351614464

  # btrfs scrub start -BrRd $mnt
  Scrub device $dev1 (id 1) done
  Scrub started:    Tue Jun  6 10:00:17 2023
  Status:           finished
  Duration:         0:00:00
         [...]
  	corrected_errors: 0 <<< No more errors
  	last_physical: 1372585984

  Scrub device $dev2 (id 2) done
         [...]
  	corrected_errors: 0 <<< No more errors
  	last_physical: 1372585984

[CAUSE]
In the newly reworked scrub code, repair is always submitted no matter
if we're doing a read-only scrub.

[FIX]
Fix it by skipping the write submission if the scrub is a read-only one.

Unfortunately for the report part, even for a read-only scrub we will
still report it as corrected errors, as we know it's repairable, even we
won't really submit the write.

Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-08 13:56:38 +02:00
David Howells
f5f82cd187 Move netfs_extract_iter_to_sg() to lib/scatterlist.c
Move netfs_extract_iter_to_sg() to lib/scatterlist.c as it's going to be
used by more than just network filesystems (AF_ALG, for example).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-08 13:42:33 +02:00
David Howells
936dc763c5 Wrap lines at 80
Wrap a line at 80 to stop checkpatch complaining.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Simon Horman <simon.horman@corigine.com>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-08 13:42:33 +02:00
David Howells
3b9e9f72ba Fix a couple of spelling mistakes
Fix a couple of spelling mistakes in a comment.

Suggested-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/ZHH2mSRqeL4Gs1ft@corigine.com/
Link: https://lore.kernel.org/r/ZHH1nqZWOGzxlidT@corigine.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-08 13:42:33 +02:00
David Howells
0d7aeb6870 Drop the netfs_ prefix from netfs_extract_iter_to_sg()
Rename netfs_extract_iter_to_sg() and its auxiliary functions to drop the
netfs_ prefix.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-08 13:42:33 +02:00
Xiubo Li
409e873ea3 ceph: fix use-after-free bug for inodes when flushing capsnaps
There is a race between capsnaps flush and removing the inode from
'mdsc->snap_flush_list' list:

   == Thread A ==                     == Thread B ==
ceph_queue_cap_snap()
 -> allocate 'capsnapA'
 ->ihold('&ci->vfs_inode')
 ->add 'capsnapA' to 'ci->i_cap_snaps'
 ->add 'ci' to 'mdsc->snap_flush_list'
    ...
   == Thread C ==
ceph_flush_snaps()
 ->__ceph_flush_snaps()
  ->__send_flush_snap()
                                handle_cap_flushsnap_ack()
                                 ->iput('&ci->vfs_inode')
                                   this also will release 'ci'
                                    ...
				      == Thread D ==
                                ceph_handle_snap()
                                 ->flush_snaps()
                                  ->iterate 'mdsc->snap_flush_list'
                                   ->get the stale 'ci'
 ->remove 'ci' from                ->ihold(&ci->vfs_inode) this
   'mdsc->snap_flush_list'           will WARNING

To fix this we will increase the inode's i_count ref when adding 'ci'
to the 'mdsc->snap_flush_list' list.

[ idryomov: need_put int -> bool ]

Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2209299
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-08 08:56:25 +02:00
Thomas Weißschuh
6217642027 fs: avoid empty option when generating legacy mount string
As each option string fragment is always prepended with a comma it would
happen that the whole string always starts with a comma. This could be
interpreted by filesystem drivers as an empty option and may produce
errors.

For example the NTFS driver from ntfs.ko behaves like this and fails
when mounted via the new API.

Link: https://github.com/util-linux/util-linux/issues/2298
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Fixes: 3e1aeb00e6 ("vfs: Implement a filesystem superblock creation/configuration context")
Cc: stable@vger.kernel.org
Message-Id: <20230607-fs-empty-option-v1-1-20c8dbf4671b@weissschuh.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-07 21:49:55 +02:00
David Howells
a27648c742 afs: Fix setting of mtime when creating a file/dir/symlink
kafs incorrectly passes a zero mtime (ie. 1st Jan 1970) to the server when
creating a file, dir or symlink because the mtime recorded in the
afs_operation struct gets passed to the server by the marshalling routines,
but the afs_mkdir(), afs_create() and afs_symlink() functions don't set it.

This gets masked if a file or directory is subsequently modified.

Fix this by filling in op->mtime before calling the create op.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-07 09:03:12 -07:00
Miklos Szeredi
a9d1c4c6df fuse: revalidate: don't invalidate if interrupted
If the LOOKUP request triggered from fuse_dentry_revalidate() is
interrupted, then the dentry will be invalidated, possibly resulting in
submounts being unmounted.

Reported-by: Xu Rongbo <xurongbo@baidu.com>
Closes: https://lore.kernel.org/all/CAJfpegswN_CJJ6C3RZiaK6rpFmNyWmXfaEpnQUJ42KCwNF5tWw@mail.gmail.com/
Fixes: 9e6268db49 ("[PATCH] FUSE - read-write operations")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 17:49:20 +02:00
Bernd Schubert
3066ff9347 fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT
This is just a safety precaution to avoid checking flags on memory that was
initialized on the user space side.  libfuse zeroes struct fuse_init_out
outarg, but this is not guranteed to be done in all implementations.
Better is to act on flags and to only apply flags2 when FUSE_INIT_EXT is
set.

There is a risk with this change, though - it might break existing user
space libraries, which are already using flags2 without setting
FUSE_INIT_EXT.

The corresponding libfuse patch is here
https://github.com/libfuse/libfuse/pull/662

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Fixes: 53db28933e ("fuse: extend init flags")
Cc: <stable@vger.kernel.org> # v5.17
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 17:49:20 +02:00
zyfjeff
1c3610d30e fuse: remove duplicate check for nodeid
before this check, the nodeid has already been checked once, so
the check here doesn't make an sense, so remove the check for
nodeid here.

            if (err || !outarg->nodeid)
                    goto out_put_forget;

            err = -EIO;
    >>>     if (!outarg->nodeid)
                    goto out_put_forget;

Signed-off-by: zyfjeff <zyfjeff@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 16:26:34 +02:00
Miklos Szeredi
5cadfbd5a1 fuse: add feature flag for expire-only
Add an init flag idicating whether the FUSE_EXPIRE_ONLY flag of
FUSE_NOTIFY_INVAL_ENTRY is effective.

This is needed for backports of this feature, otherwise the server could
just check the protocol version.

Fixes: 4f8d37020e ("fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY")
Cc: <stable@vger.kernel.org> # v6.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 16:26:33 +02:00
Jan Kara
2454ad83b9
fs: Restrict lock_two_nondirectories() to non-directory inodes
Currently lock_two_nondirectories() is skipping any passed directories.
After vfs_rename() uses lock_two_inodes(), all the remaining four users
of this function pass only regular files to it. So drop the somewhat
unusual "skip directory" logic and instead warn if anybody passes
directory to it. This also allows us to use lock_two_inodes() in
lock_two_nondirectories() to concentrate the lock ordering logic in less
places.

Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-6-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-07 09:15:11 +02:00
Chris Mason
981a37bab5 btrfs: properly enable async discard when switching from RO->RW
The async discard uses the BTRFS_FS_DISCARD_RUNNING bit in the fs_info
to force discards off when the filesystem has aborted or we're generally
not able to run discards.  This gets flipped on when we're mounted rw,
and also when we go from ro->rw.

Commit 63a7cb1307 ("btrfs: auto enable discard=async when possible")
enabled async discard by default, and this meant
"mount -o ro /dev/xxx /yyy" had async discards turned on.

Unfortunately, this meant our check in btrfs_remount_cleanup() would see
that discards are already on:

    /* If we toggled discard async */
    if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
	btrfs_test_opt(fs_info, DISCARD_ASYNC))
	    btrfs_discard_resume(fs_info);

So, we'd never call btrfs_discard_resume() when remounting the root
filesystem from ro->rw.

drgn shows this really nicely:

import os
import sys

from drgn.helpers.linux.fs import path_lookup
from drgn import NULL, Object, Type, cast

def btrfs_sb(sb):
    return cast("struct btrfs_fs_info *", sb.s_fs_info)

if len(sys.argv) == 1:
    path = "/"
else:
    path = sys.argv[1]

fs_info = cast("struct btrfs_fs_info *", path_lookup(prog, path).mnt.mnt_sb.s_fs_info)

BTRFS_FS_DISCARD_RUNNING = 1 << prog['BTRFS_FS_DISCARD_RUNNING']
if fs_info.flags & BTRFS_FS_DISCARD_RUNNING:
    print("discard running flag is on")
else:
    print("discard running flag is off")

[root]# mount | grep nvme
/dev/nvme0n1p3 on / type btrfs
(rw,relatime,compress-force=zstd:3,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)

[root]# ./discard_running.drgn
discard running flag is off

[root]# mount -o remount,discard=sync /
[root]# mount -o remount,discard=async /
[root]# ./discard_running.drgn
discard running flag is on

The fix is to call btrfs_discard_resume() when we're going from ro->rw.
It already checks to make sure the async discard flag is on, so it'll do
the right thing.

Fixes: 63a7cb1307 ("btrfs: auto enable discard=async when possible")
CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06 19:44:22 +02:00
Bob Peterson
f9da18cd46 gfs2: Don't remember delete unless it's successful
This patch changes function evict_unlinked_inode so it does not call
gfs2_inode_remember_delete until it gets a good return code from
gfs2_dinode_dealloc.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
17a5934653 gfs2: Update rl_unlinked before releasing rgrp lock
Function gfs2_free_di was changing the rgrp lvb count of unlinked
dinodes after the lock was released. This patch moves it inside the
lock.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
dac0fc31be gfs2: Fix gfs2_qa_get imbalance in gfs2_quota_hold
This patch fixes a case in which function gfs2_quota_hold encounters an
assert error and exits. The lack of gfs2_qa_put causes further problems
when the inode is evicted and the get/put count is non-zero.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
9b620429ec gfs2: ignore rindex_update failure in dinode_dealloc
Before this patch, function gfs2_dinode_dealloc would abort if it got a
bad return code from gfs2_rindex_update(). The problem is that it left the
dinode in the unlinked (not free) state, which meant subsequent fsck
would clean it up and flag an error. That meant some of our QE tests
would fail.

The sole purpose of gfs2_rindex_update(), in this code path, is to read in
any newer rgrps added by gfs2_grow. But since this is a delete operation
it won't actually use any of those new rgrps. It can really only twiddle
the bits from "Unlinked" to "Free" in an existing rgrp. Therefore the
error should not prevent the transition from unlinked to free.

This patch makes gfs2_dinode_dealloc ignore the bad return code and
proceed with freeing the dinode so the QE tests will not be tripped up.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
e4f82bf21f gfs2: fix minor comment typos
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
a9b0f6f4ad gfs2: simplify gdlm_put_lock with out_free label
This patch introduces a new out_free label and consolidates the three
places function gdlm_put_lock freed the glock.  No change in
functionality.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Kirill A. Shutemov
dcdfdd40fa mm: Add support for unaccepted memory
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have runtime cost once the system is booted. The downside is
    very long boot time.

    Accept can be parallelized to multiple CPUs to keep it manageable
    (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
    memory bandwidth and does not scale beyond the point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory accept means latency spikes every time kernel steps
    onto a new memory block. The spikes will go away once workload data
    set size gets stabilized or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize time the
    system experience latency spikes on memory allocation while keeping
    low boot time.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock accepts memory on allocation. It serves early boot memory
    allocations and doesn't limit them to pre-accepted pool of memory.

  - page allocator accepts memory on the first allocation of the page.
    When kernel runs out of accepted memory, it accepts memory until the
    high watermark is reached. It helps to minimize fragmentation.

EFI code will provide two helpers if the platform supports unaccepted
memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks anything within the range
   of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
2023-06-06 16:38:22 +02:00
Linus Torvalds
0bdd0f0bf1 gfs2 fix
- Don't get stuck writing page onto itself under direct I/O.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmR/KcEUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTr3wA//eCaUYWKOiSXbbvNeP1pdawN8zRZO
 alpK/nsujcB4bDdiDUTREYFBfcmBKEIboFz6e02DL8MPp2mc0fZ/Ox9yq/nR7o9x
 9Y1CjWmC1zoUkqw6V8+vbg0m432OlcWglppgywHjvyUiEnyUzBfnxIRH/k0lLYor
 yR7vViSZQhJ4jroSeEVNKsCeZUiY7y5tiLo+bcHYYF7lolab/ZfNxacr1/lSAuww
 WS0frjAeSBneQ5aU2JQ60lbJcJQydfdoS3n0dlyX6qJeVoFCnhQAJeiVSQaYMFpE
 HaYFs/3YGjzAkoWqX5CAzLIfxsIHepdaP4PtITg3xwyMQ1j3X5H5n1FOyJCcbXz7
 s7gq/RXjU3TV/hVPSukVGhwjavB8gMrbQtmpSYHpA99ldeNfIVwps2PSWdoDND/B
 7w+g0T5U+yd7DNz2tL9YML7Anioc0K6y1hVvacuPIgNHJyLQ5XaPYJXUUJFfl0X6
 njcZVmfK56RQRPR7jDp26F4X+Pw+GjahJJq05zCwsmFnP6+pX2gjYNx9LidXBugU
 1L8BlN2IZvc9ShvInZIuQHwTMEXAMjjSd5JtX7iZLnHxeWDfIlo64ZrroiiNbrk6
 pDqG+C6fkYV1h1fzX3bvFZIvFoKERxu9u9TunBIiBQFHQNYWoe2i5zpZNXyT7Soo
 epIqTaWmf8l/3vo=
 =9syc
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.4-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:

 - Don't get stuck writing page onto itself under direct I/O

* tag 'gfs2-v6.4-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Don't get stuck writing page onto itself under direct I/O
2023-06-06 05:49:06 -07:00
Andy Shevchenko
7afb6d8fa8 jbd2: Avoid printing outside the boundary of the buffer
Theoretically possible that "%pg" will take all room for the j_devname
and hence the "-%lu" will go outside the boundary due to unconditional
sprintf() in use. To make this code more robust, replace two sequential
s*printf():s by a single call and then replace forbidden character.
It's possible to do this way, because '/' won't ever be in the result
of "-%lu".

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230605170553.7835-2-andriy.shevchenko@linux.intel.com
2023-06-05 15:31:12 -07:00
Qu Wenruo
917ac77846 btrfs: subpage: fix a crash in metadata repair path
[BUG]
Test case btrfs/027 would crash with subpage (64K page size, 4K
sectorsize) with the following dying messages:

  debug: map_length=16384 length=65536 type=metadata|raid6(0x104)
  assertion failed: map_length >= length, in fs/btrfs/volumes.c:8093
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/messages.c:259!
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Call trace:
   btrfs_assertfail+0x28/0x2c [btrfs]
   btrfs_map_repair_block+0x150/0x2b8 [btrfs]
   btrfs_repair_io_failure+0xd4/0x31c [btrfs]
   btrfs_read_extent_buffer+0x150/0x16c [btrfs]
   read_tree_block+0x38/0xbc [btrfs]
   read_tree_root_path+0xfc/0x1bc [btrfs]
   btrfs_get_root_ref.part.0+0xd4/0x3a8 [btrfs]
   open_ctree+0xa30/0x172c [btrfs]
   btrfs_mount_root+0x3c4/0x4a4 [btrfs]
   legacy_get_tree+0x30/0x60
   vfs_get_tree+0x28/0xec
   vfs_kern_mount.part.0+0x90/0xd4
   vfs_kern_mount+0x14/0x28
   btrfs_mount+0x114/0x418 [btrfs]
   legacy_get_tree+0x30/0x60
   vfs_get_tree+0x28/0xec
   path_mount+0x3e0/0xb64
   __arm64_sys_mount+0x200/0x2d8
   invoke_syscall+0x48/0x114
   el0_svc_common.constprop.0+0x60/0x11c
   do_el0_svc+0x38/0x98
   el0_svc+0x40/0xa8
   el0t_64_sync_handler+0xf4/0x120
   el0t_64_sync+0x190/0x194
  Code: aa0403e2 b0fff060 91010000 959c2024 (d4210000)

[CAUSE]
In btrfs/027 we test RAID6 with missing devices, in this particular
case, we're repairing a metadata at the end of a data stripe.

But at btrfs_repair_io_failure(), we always pass a full PAGE for repair,
and for subpage case this can cross stripe boundary and lead to the
above BUG_ON().

This metadata repair code is always there, since the introduction of
subpage support, but this can trigger BUG_ON() after the bio split
ability at btrfs_map_bio().

[FIX]
Instead of passing the old PAGE_SIZE, we calculate the correct length
based on the eb size and page size for both regular and subpage cases.

CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-05 19:21:57 +02:00
Christoph Hellwig
cf056a4312 init: improve the name_to_dev_t interface
name_to_dev_t has a very misleading name, that doesn't make clear
it should only be used by the early init code, and also has a bad
calling convention that doesn't allow returning different kinds of
errors.  Rename it to early_lookup_bdev to make the use case clear,
and return an errno, where -EINVAL means the string could not be
parsed, and -ENODEV means it the string was valid, but there was
no device found for it.

Also stub out the whole call for !CONFIG_BLOCK as all the non-block
root cases are always covered in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:56:46 -06:00
Christoph Hellwig
dd2e31afba ext4: wire up the ->mark_dead holder operation for log devices
Implement a set of holder_ops that shut down the file system when the
block device used as log device is removed undeneath the file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
f5db130d44 ext4: wire up sops->shutdown
Wire up the shutdown method to shut down the file system when the
underlying block device is marked dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00