Commit Graph

9520 Commits

Author SHA1 Message Date
Darrick J. Wong
7f8a44f372 xfs: fix sb_spino_align checks for large fsblock sizes
For a sparse inodes filesystem, mkfs.xfs computes the values of
sb_spino_align and sb_inoalignmt with the following code:

	int     cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;

	if (cfg->sb_feat.crcs_enabled)
		cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE;

	sbp->sb_spino_align = cluster_size >> cfg->blocklog;
	sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK *
			cfg->inodesize >> cfg->blocklog;

On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results
in cluster_size = 8192 * (512 / 256) = 16384.  As a result,
sb_spino_align and sb_inoalignmt are both set to zero.  Unfortunately,
this trips the new sb_spino_align check that was just added to
xfs_validate_sb_common, and the mkfs fails:

# mkfs.xfs -f -b size=64k, /dev/sda
meta-data=/dev/sda               isize=512    agcount=4, agsize=81136 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0   metadir=0
data     =                       bsize=65536  blocks=324544, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=65536  blocks=5006, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
Discarding blocks...Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: writing AG headers failed, err=22

Prior to commit 59e43f5479 this all worked fine, even if "sparse"
inodes are somewhat meaningless when everything fits in a single
fsblock.  Adjust the checks to handle existing filesystems.
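
For reference, the arithmetic above can be reproduced with a small
userspace sketch (not kernel code); the constants assumed here are the
usual XFS values of 8192 for XFS_INODE_BIG_CLUSTER_SIZE, 256 for
XFS_DINODE_MIN_SIZE and 64 for XFS_INODES_PER_CHUNK, with blocklog = 16
for 64k blocks:

        #include <stdio.h>

        int main(void)
        {
                /* Assumed parameters from the report above: V5 (crcs
                 * enabled), 64k blocks (blocklog = 16), 512-byte inodes. */
                int cluster_size = 8192;        /* XFS_INODE_BIG_CLUSTER_SIZE */
                int inodesize = 512;
                int blocklog = 16;

                cluster_size *= inodesize / 256;        /* 16384 */

                /* Both shifts are by 16 bits, so both results are zero,
                 * which is what trips the new sb_spino_align check. */
                printf("sb_spino_align = %d\n", cluster_size >> blocklog);
                printf("sb_inoalignmt  = %d\n", (64 * inodesize) >> blocklog);
                return 0;
        }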

Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 59e43f5479 ("xfs: sb_spino_align is not verified")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:12 -08:00
Darrick J. Wong
ca378189fd xfs: convert quotacheck to attach dquot buffers
Now that we've converted the dquot logging machinery to attach the dquot
buffer to the li_buf pointer so that the AIL dqflush doesn't have to
allocate or read buffers in a reclaim path, do the same for the
quotacheck code so that the reclaim shrinker dqflush call doesn't have
to do that either.

Cc: <stable@vger.kernel.org> # v6.12
Fixes: 903edea6c5 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:12 -08:00
Darrick J. Wong
acc8f8628c xfs: attach dquot buffer to dquot log item buffer
Ever since 6.12-rc1, I've observed a pile of warnings from the kernel
when running fstests with quotas enabled:

WARNING: CPU: 1 PID: 458580 at mm/page_alloc.c:4221 __alloc_pages_noprof+0xc9c/0xf18
CPU: 1 UID: 0 PID: 458580 Comm: xfsaild/sda3 Tainted: G        W          6.12.0-rc6-djwa #rc6 6ee3e0e531f6457e2d26aa008a3b65ff184b377c
<snip>
Call trace:
 __alloc_pages_noprof+0xc9c/0xf18
 alloc_pages_mpol_noprof+0x94/0x240
 alloc_pages_noprof+0x68/0xf8
 new_slab+0x3e0/0x568
 ___slab_alloc+0x5a0/0xb88
 __slab_alloc.constprop.0+0x7c/0xf8
 __kmalloc_noprof+0x404/0x4d0
 xfs_buf_get_map+0x594/0xde0 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfs_buf_read_map+0x64/0x2e0 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfs_trans_read_buf_map+0x1dc/0x518 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfs_qm_dqflush+0xac/0x468 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfs_qm_dquot_logitem_push+0xe4/0x148 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfsaild+0x3f4/0xde8 [xfs 384cb02810558b4c490343c164e9407332118f88]
 kthread+0x110/0x128
 ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---

This corresponds to the line:

	WARN_ON_ONCE(current->flags & PF_MEMALLOC);

within the NOFAIL checks.  What's happening here is that the XFS AIL is
trying to write a disk quota update back into the filesystem, but for
that it needs to read the ondisk buffer for the dquot.  The buffer is
not in memory anymore, probably because it was evicted.  Regardless, the
buffer cache tries to allocate a new buffer, but those allocations are
NOFAIL.  The AIL thread has marked itself PF_MEMALLOC (aka noreclaim)
since commit 43ff2122e6 ("xfs: on-stack delayed write buffer lists")
presumably because reclaim can push on XFS to push on the AIL.

An easy way to fix this probably would have been to drop the NOFAIL flag
from the xfs_buf allocation and open code a retry loop, but then there's
still the problem that for bs>ps filesystems, the buffer itself could
require up to 64k worth of pages.

Inode items had similar behavior (multi-page cluster buffers that we
don't want to allocate in the AIL) which we solved by making transaction
precommit attach the inode cluster buffers to the dirty log item.  Let's
solve the dquot problem in the same way.

So: Make a real precommit handler to read the dquot buffer and attach it
to the log item; pass it to dqflush in the push method; and have the
iodone function detach the buffer once we've flushed everything.  Add a
state flag to the log item to track when a thread has entered the
precommit -> push mechanism to skip the detaching if it turns out that
the dquot is very busy (as we don't hold the dquot lock between log item
commit and AIL push).

Reading and attaching the dquot buffer in the precommit hook is inspired
by the work done for inode cluster buffers some time ago.

Cc: <stable@vger.kernel.org> # v6.12
Fixes: 903edea6c5 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:11 -08:00
Darrick J. Wong
ec88b41b93 xfs: clean up log item accesses in xfs_qm_dqflush{,_done}
Clean up these functions a little bit before we move on to the real
modifications, and make the variable naming consistent for dquot log
items.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:11 -08:00
Darrick J. Wong
a40fe30868 xfs: separate dquot buffer reads from xfs_dqflush
The first step towards holding the dquot buffer in the li_buf instead of
reading it in the AIL is to separate the part that reads the buffer from
the actual flush code.  There should be no functional changes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:11 -08:00
Darrick J. Wong
07137e925f xfs: don't lose solo dquot update transactions
Quota counter updates are tracked via incore objects which hang off the
xfs_trans object.  These changes are then turned into dirty log items in
xfs_trans_apply_dquot_deltas just prior to committing the log items to
the CIL.

However, updating the incore deltas does not cause XFS_TRANS_DIRTY to be
set on the transaction.  In other words, a pure quota counter update
will be silently discarded if there are no other dirty log items
attached to the transaction.

This is currently not the case anywhere in the filesystem because quota
updates always dirty at least one other metadata item, but a subsequent
bug fix will add dquot log item precommits, so we actually need a dirty
dquot log item prior to xfs_trans_run_precommits.  Also let's not leave
a logic bomb.

Cc: <stable@vger.kernel.org> # v2.6.35
Fixes: 0924378a68 ("xfs: split out iclog writing from xfs_trans_commit()")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:11 -08:00
Darrick J. Wong
3762113b59 xfs: don't lose solo superblock counter update transactions
Superblock counter updates are tracked via per-transaction counters in
the xfs_trans object.  These changes are then turned into dirty log
items in xfs_trans_apply_sb_deltas just prior to committing the log items
to the CIL.

However, updating the per-transaction counter deltas does not cause
XFS_TRANS_DIRTY to be set on the transaction.  In other words, a pure sb
counter update will be silently discarded if there are no other dirty
log items attached to the transaction.

This is currently not the case anywhere in the filesystem because sb
counter updates always dirty at least one other metadata item, but let's
not leave a logic bomb.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:11 -08:00
Darrick J. Wong
a004afdc62 xfs: avoid nested calls to __xfs_trans_commit
Currently, __xfs_trans_commit calls xfs_defer_finish_noroll, which calls
__xfs_trans_commit again on the same transaction.  In other words,
there's a nested function call (albeit with slightly different
arguments) that has caused minor amounts of confusion in the past.
There's no reason to keep this around, since there's only one place
where we actually want the xfs_defer_finish_noroll, and that is in the
top level xfs_trans_commit call.

This also reduces stack usage a little bit.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:11 -08:00
Darrick J. Wong
44d9b07e52 xfs: only run precommits once per transaction object
Committing a transaction tx0 with a defer ops chain of (A, B, C)
creates a chain of transactions that looks like this:

tx0 -> txA -> txB -> txC

Prior to commit cb04211748, __xfs_trans_commit would run precommits
on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C].
Unfortunately, after the finish_noroll loop we forgot to run precommits
on txC.  That was fixed by adding the second precommit call.

Unfortunately, none of us remembered that xfs_defer_finish_noroll
calls __xfs_trans_commit a second time to commit tx0 before finishing
work A in txA and committing that.  In other words, we run precommits
twice on tx0:

xfs_trans_commit(tx0)
    __xfs_trans_commit(tx0, false)
        xfs_trans_run_precommits(tx0)
        xfs_defer_finish_noroll(tx0)
            xfs_trans_roll(tx0)
                txA = xfs_trans_dup(tx0)
                __xfs_trans_commit(tx0, true)
                xfs_trans_run_precommits(tx0)

This currently isn't an issue because the inode item precommit is
idempotent; the iunlink item precommit deletes itself so it can't be
called again; and the buffer/dquot item precommits only check the incore
objects for corruption.  However, it doesn't make sense to run
precommits twice.

Fix this situation by only running precommits after finish_noroll.

Cc: <stable@vger.kernel.org> # v6.4
Fixes: cb04211748 ("xfs: defered work could create precommits")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:10 -08:00
Darrick J. Wong
53b001a21c xfs: unlock inodes when erroring out of xfs_trans_alloc_dir
Debugging a filesystem patch with generic/475 caused the system to hang
after observing the following sequences in dmesg:

 XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5
 XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5
 XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5
 XFS (dm-0): log I/O error -5
 XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311).  Shutting down filesystem.
 XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
 XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c.  Caller xfs_trans_dqresv+0x236/0x440 [xfs]
 XFS (dm-0): Corruption detected. Unmount and run xfs_repair
 XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7

The system is stuck in unmount trying to lock a couple of inodes so that
they can be purged.  The dquot corruption notice above is a clue to what
happened -- a link() call tried to set up a transaction to link a child
into a directory.  Quota reservation for the transaction failed after IO
errors shut down the filesystem, but then we forgot to unlock the inodes
on our way out.  Fix that.

Cc: <stable@vger.kernel.org> # v6.10
Fixes: bd5562111d ("xfs: Hold inode locks in xfs_trans_alloc_dir")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:10 -08:00
Darrick J. Wong
ffc3ea4f3c xfs: fix scrub tracepoints when inode-rooted btrees are involved
Fix minor mistakes in the scrub tracepoints that can manifest when
inode-rooted btrees are enabled.  The existing code worked fine for bmap
btrees, but we should tighten the code up to be less sloppy.

Cc: <stable@vger.kernel.org> # v5.7
Fixes: 92219c292a ("xfs: convert btree cursor inode-private member names")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:10 -08:00
Darrick J. Wong
6d7b4bc1c3 xfs: update btree keys correctly when _insrec splits an inode root block
In commit 2c813ad66a, I partially fixed a bug wherein xfs_btree_insrec
would erroneously try to update the parent's key for a block that had
been split if we decided to insert the new record into the new block.
The solution was to detect this situation and update the in-core key
value that we pass up to the caller so that the caller will (eventually)
add the new block to the parent level of the tree with the correct key.

However, I missed a subtlety about the way inode-rooted btrees work.  If
the full block was a maximally sized inode root block, we'll solve that
fullness by moving the root block's records to a new block, resizing the
root block, and updating the root to point to the new block.  We don't
pass a pointer to the new block to the caller because that work has
already been done.  The new record will /always/ land in the new block,
so in this case we need to use xfs_btree_update_keys to update the keys.

This bug can theoretically manifest itself in the very rare case that we
split a bmbt root block and the new record lands in the very first slot
of the new block, though I've never managed to trigger it in practice.
However, it is very easy to reproduce by running generic/522 with the
realtime rmapbt patchset if rtinherit=1.

Cc: <stable@vger.kernel.org> # v4.8
Fixes: 2c813ad66a ("xfs: support btrees with overlapping intervals for keys")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:10 -08:00
Darrick J. Wong
23bee6f390 xfs: fix error bailout in xfs_rtginode_create
smatch reported that we screwed up the error cleanup in this function.
Fix it.

Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: ae897e0bed ("xfs: support creating per-RTG files in growfs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:10 -08:00
Darrick J. Wong
af9f02457f xfs: fix null bno_hint handling in xfs_rtallocate_rtg
xfs_bmap_rtalloc initializes the bno_hint variable to NULLRTBLOCK (aka
NULLFSBLOCK).  If the allocation request is for a file range that's
adjacent to an existing mapping, it will then change bno_hint to the
blkno hint in the bmalloca structure.

In other words, bno_hint is either a rt block number, or it's all 1s.
Unfortunately, commit ec12f97f1b didn't take the NULLRTBLOCK state
into account, which means that it tries to translate that into a
realtime extent number.  We then end up with an obnoxiously high rtx
number and pointlessly feed that to the near allocator.  This often
fails and falls back to the by-size allocator.  Seeing as we had no
locality hint anyway, this is a waste of time.

Fix the code to detect a lack of bno_hint correctly.  This was detected
by running xfs/009 with metadir enabled and a 28k rt extent size.
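
A quick illustration of why the all-ones sentinel produces an
"obnoxiously high" extent number when it is divided by the extent size
(hypothetical numbers; a 28k extent is 7 blocks with 4k blocks):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t nullrtblock = ~0ULL;   /* "no hint" sentinel */
                uint64_t rtextsize = 7;         /* blocks per rt extent */

                uint64_t rtx_hint = nullrtblock / rtextsize;

                /* Instead of "no hint", the allocator is handed a huge,
                 * meaningless rtx number as its locality target. */
                printf("rtx hint from sentinel: %llu\n",
                       (unsigned long long)rtx_hint);
                return 0;
        }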

Cc: <stable@vger.kernel.org> # v6.12
Fixes: ec12f97f1b ("xfs: make the rtalloc start hint a xfs_rtblock_t")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:09 -08:00
Darrick J. Wong
dc5a052739 xfs: mark metadir repair tempfiles with IRECOVERY
Once in a long while, xfs/566 and xfs/801 report directory corruption in
one of the metadata subdirectories while it's forcibly rebuilding all
filesystem metadata.  I observed the following sequence of events:

1. Initiate a repair of the parent pointers for the /quota/user file.
   This is the secret file containing user quota data.

2. The pptr repair thread creates a temporary file and begins staging
   parent pointers in the ondisk metadata in preparation for an
   exchange-range to commit the new pptr data.

3. At the same time, initiate a repair of the /quota directory itself.

4. The dir repair thread finds the temporary file from (2), scans it for
   parent pointers, and stages a dirent in its own temporary dir in
   preparation to commit the fixed directory.

5. The parent pointer repair completes and frees the temporary file.

6. The dir repair commits the new directory and scans it again.  It
   finds the dirent that points to the old temporary file in (2) and
   marks the directory corrupt.

Oops!  Repair code must never scan the temporary files that other repair
functions create to stage new metadata.  They're not supposed to do
that, but the predicate function xrep_is_tempfile is incorrect because
it assumes that any XFS_DIFLAG2_METADATA file cannot ever be a temporary
file, but xrep_tempfile_adjust_directory_tree creates exactly that.

Fix this by setting the IRECOVERY flag on temporary metadata directory
inodes and using that to correct the predicate.  Repair code is supposed
to erase all the data in temporary files before releasing them, so it's
ok if a thread scans the temporary file after we drop IRECOVERY.

Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: bb6cdd5529 ("xfs: hide metadata inodes from everyone because they are special")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:09 -08:00
Darrick J. Wong
6f4669708a xfs: set XFS_SICK_INO_SYMLINK_ZAPPED explicitly when zapping a symlink
If we need to reset a symlink target to the "durr it's busted" string,
then we clear the zapped flag as well.  However, this should be using
the provided helper so that we don't set the zapped state on an
otherwise ok symlink.

Cc: <stable@vger.kernel.org> # v6.10
Fixes: 2651923d8d ("xfs: online repair of symbolic links")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:09 -08:00
Darrick J. Wong
aa7bfb537e xfs: separate healthy clearing mask during repair
In commit d9041681dd we introduced some XFS_SICK_*ZAPPED flags so
that the inode record repair code could clean up a damaged inode record
enough to iget the inode but still be able to remember that the higher
level repair code needs to be called.  As part of that, we introduced a
xchk_mark_healthy_if_clean helper that is supposed to cause the ZAPPED
state to be removed if that higher level metadata actually checks out.
This was done by setting additional bits in sick_mask hoping that
xchk_update_health will clear all those bits after a healthy scrub.

Unfortunately, that's not quite what sick_mask means -- bits in that
mask are indeed cleared if the metadata is healthy, but they're set if
the metadata is NOT healthy.  fsck is only intended to set the ZAPPED
bits explicitly.

If something else sets the CORRUPT/XCORRUPT state after the
xchk_mark_healthy_if_clean call, we end up marking the metadata zapped.
This can happen if the following sequence happens:

1. Scrub runs, discovers that the metadata is fine but could be
   optimized and calls xchk_mark_healthy_if_clean on a ZAPPED flag.
   That causes the ZAPPED flag to be set in sick_mask because the
   metadata is not CORRUPT or XCORRUPT.

2. Repair runs to optimize the metadata.

3. Some other metadata used for cross-referencing in (1) becomes
   corrupt.

4. Post-repair scrub runs, but this time it sets CORRUPT or XCORRUPT due
   to the events in (3).

5. Now the xchk_health_update sets the ZAPPED flag on the metadata we
   just repaired.  This is not the correct state.

Fix this by moving the "if healthy" mask to a separate field, and only
ever using it to clear the sick state.
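
A toy model of the distinction, with made-up flag names and values
(illustrative only, not the kernel's health tracking code):

        #include <stdio.h>

        int main(void)
        {
                unsigned int SICK_CORRUPT = 0x1;
                unsigned int SICK_ZAPPED  = 0x2;
                unsigned int sick = 0;  /* metadata was just repaired */

                /* Old behaviour: the ZAPPED bit was folded into sick_mask,
                 * so a later CORRUPT result wrongly set it as well. */
                unsigned int old_state = sick | (SICK_CORRUPT | SICK_ZAPPED);

                /* Fixed behaviour: ZAPPED lives in a separate "if healthy"
                 * mask that is only ever used to clear bits. */
                unsigned int new_state = sick | SICK_CORRUPT;

                printf("old=%#x new=%#x\n", old_state, new_state);
                return 0;
        }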

Cc: <stable@vger.kernel.org> # v6.8
Fixes: d9041681dd ("xfs: set inode sick state flags when we zap either ondisk fork")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:09 -08:00
Darrick J. Wong
7ce31f20a0 xfs: don't drop errno values when we fail to ficlone the entire range
Way back when we first implemented FICLONE for XFS, life was simple --
either the entire remapping completed, or something happened and we
had to return an errno explaining what happened.  Neither of those
ioctls supports returning partial results, so it's all or nothing.

Then things got complicated when copy_file_range came along, because it
actually can return the number of bytes copied, so commit 3f68c1f562
tried to make it so that we could return a partial result if the
REMAP_FILE_CAN_SHORTEN flag is set.  This is also how FIDEDUPERANGE can
indicate that the kernel performed a partial deduplication.

Unfortunately, the logic is wrong if an error stops the remapping and
CAN_SHORTEN is not set.  Because those callers cannot return partial
results, it is an error for ->remap_file_range to return a positive
quantity that is less than the @len passed in.  Implementations really
should be returning a negative errno in this case, because that's what
btrfs (which introduced FICLONE{,RANGE}) did.

Therefore, ->remap_range implementations cannot silently drop an errno
that they might have when the number of bytes remapped is less than the
number of bytes requested and CAN_SHORTEN is not set.

Found by running generic/562 on a 64k fsblock filesystem and wondering
why it reported corrupt files.
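
A hedged sketch of the return convention being described, using a
hypothetical helper (the flag value is copied locally for illustration;
the real definition lives in the kernel headers):

        #include <errno.h>
        #include <stdio.h>

        #define REMAP_FILE_CAN_SHORTEN 0x2      /* illustrative local copy */

        /* Hypothetical helper, not the XFS implementation: without
         * CAN_SHORTEN, a short remap must surface the errno rather than
         * return a partial byte count. */
        static long remap_result(long remapped, long len, int error,
                                 unsigned int flags)
        {
                if (remapped < len && error &&
                    !(flags & REMAP_FILE_CAN_SHORTEN))
                        return error;
                return remapped;
        }

        int main(void)
        {
                printf("%ld\n", remap_result(65536, 131072, -EIO, 0));
                printf("%ld\n", remap_result(65536, 131072, -EIO,
                                             REMAP_FILE_CAN_SHORTEN));
                return 0;
        }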

Cc: <stable@vger.kernel.org> # v4.20
Fixes: 3fc9f5e409 ("xfs: remove xfs_reflink_remap_range")
Really-Fixes: 3f68c1f562 ("xfs: support returning partial reflink results")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:09 -08:00
Darrick J. Wong
bd27c7bcdc xfs: return a 64-bit block count from xfs_btree_count_blocks
With the nrext64 feature enabled, it's possible for a data fork to have
2^48 extent mappings.  Even with a 64k fsblock size, that maps out to
a bmbt containing more than 2^32 blocks.  Therefore, this predicate must
return a u64 count to avoid an integer wraparound that will cause scrub
to do the wrong thing.

It's unlikely that any such filesystem currently exists, because the
incore bmbt would consume more than 64GB of kernel memory on its own,
and so far nobody except me has driven a filesystem that far, judging
from the lack of complaints.
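
Rough arithmetic, as a sketch (ignoring the btree block header, so the
per-block record count is approximate):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t nextents = 1ULL << 48;         /* nrext64 maximum */
                uint64_t recs_per_block = 65536 / 16;   /* 64k block, 16-byte recs */
                uint64_t leaf_blocks = nextents / recs_per_block;

                /* The old 32-bit counter wraps around. */
                printf("leaf blocks needed: %llu\n",
                       (unsigned long long)leaf_blocks);
                printf("as a 32-bit count : %u\n", (uint32_t)leaf_blocks);
                return 0;
        }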

Cc: <stable@vger.kernel.org> # v5.19
Fixes: df9ad5cc7a ("xfs: Introduce macros to represent new maximum extent counts for data/attr forks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:09 -08:00
Darrick J. Wong
e1d8602b6c xfs: keep quota directory inode loaded
In the same vein as the previous patch, there's no point in the metapath
scrub setup function doing a lookup on the quota metadir just so it can
validate that lookups work correctly.  Instead, retain the quota
directory inode in memory for the lifetime of the mount so that we can
check this meaningfully.

Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 128a055291 ("xfs: scrub quota file metapaths")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:08 -08:00
Darrick J. Wong
9b72800103 xfs: metapath scrubber should use the already loaded inodes
Don't waste time in xchk_setup_metapath_dqinode doing a second lookup of
the quota inodes, just grab them from the quotainfo structure.  The
whole point of this scrubber is to make sure that the dirents exist, so
it's completely silly to do lookups.

Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 128a055291 ("xfs: scrub quota file metapaths")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:08 -08:00
Darrick J. Wong
a440a28ddb xfs: fix off-by-one error in fsmap's end_daddr usage
In commit ca6448aed4, we created an "end_daddr" variable to fix
fsmap reporting when the end of the range requested falls in the middle
of an unknown (aka free on the rmapbt) region.  Unfortunately, I didn't
notice that the code sets end_daddr to the last sector of the device
but then uses that quantity to compute the length of the synthesized
mapping.

Zizhi Wo later observed that when end_daddr isn't set, we still don't
report the last fsblock on a device because in that case (aka when
info->last is true), the info->high mapping that we pass to
xfs_getfsmap_group_helper has a startblock that points to the last
fsblock.  This is also wrong because the code uses startblock to
compute the length of the synthesized mapping.

Fix the second problem by setting end_daddr unconditionally, and fix the
first problem by setting start_daddr to one past the end of the range to
query.
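
The off-by-one can be seen with plain numbers (hypothetical range, not
the fsmap code itself):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t start = 0;
                uint64_t last_sector = 99;              /* inclusive end (buggy) */
                uint64_t end_excl = last_sector + 1;    /* one past the end (fixed) */

                printf("length from inclusive end: %llu sectors (one short)\n",
                       (unsigned long long)(last_sector - start));
                printf("length from exclusive end: %llu sectors\n",
                       (unsigned long long)(end_excl - start));
                return 0;
        }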

Cc: <stable@vger.kernel.org> # v6.11
Fixes: ca6448aed4 ("xfs: Fix missing interval for missing_owner in xfs fsmap")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reported-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12 17:45:08 -08:00
Josef Bacik
5121711eb8 fs: enable pre-content events on supported file systems
Now that all the code has been added for pre-content events, and the
various file systems that need the page fault hooks for fsnotify have
been updated, add SB_I_ALLOW_HSM to the supported file systems.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/46960dcb2725fa0317895ed66a8409ba1c306a82.1731684329.git.josef@toxicpanda.com
2024-12-11 17:28:41 +01:00
Josef Bacik
7f4796a465 xfs: add pre-content fsnotify hook for DAX faults
xfs has its own handling for DAX faults, so we need to add the
pre-content fsnotify hook for this case. Other faults go through
filemap_fault so they're handled properly there.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/9eccdf59a65b72f0a1a5e2f2b9bff8eda2d4f2d9.1731684329.git.josef@toxicpanda.com
2024-12-11 17:28:41 +01:00
Linus Torvalds
9141c5d389 Bug fixes for 6.13-rc2
* Use xchg() in xlog_cil_insert_pcp_aggregate()
 * Fix ABBA deadlock on a race between mount and log shutdown
 * Fix quota softlimit incoherency on delalloc
 * Fix sparse inode limits on runt AG
 * remove unknown compat feature checks in SB write validation
 * Eliminate a lockdep false positive
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZ072iwAKCRBGdaER5Qtf
 prtCAX4kKOVnDzn2dX4YWpFaPAFvaWbiH0GIIVIiRuQLqzARya/lurNXjfanuotc
 4oJ3JacBgJ1MWYiBX2j95AEHJaes/G3Nm+EsXeqefWWxxQvrCcQt5kdtw1fVY8kz
 5NCMUtxjIQ==
 =FM4s
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

 - Use xchg() in xlog_cil_insert_pcp_aggregate()

 - Fix ABBA deadlock on a race between mount and log shutdown

 - Fix quota softlimit incoherency on delalloc

 - Fix sparse inode limits on runt AG

 - remove unknown compat feature checks in SB write validation

 - Eliminate a lockdep false positive

* tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delay
  xfs: Use xchg() in xlog_cil_insert_pcp_aggregate()
  xfs: prevent mount and log shutdown race
  xfs: delalloc and quota softlimit timers are incoherent
  xfs: fix sparse inode limits on runt AG
  xfs: remove unknown compat feature check in superblock write validation
  xfs: eliminate lockdep false positives in xfs_attr_shortform_list
2024-12-03 10:46:49 -08:00
Christoph Hellwig
cc2dba08cc xfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delay
xfs_bmap_add_extent_hole_delay works entirely on delalloc extents, for
which xfs_bmap_same_rtgroup doesn't make sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-28 12:54:22 +01:00
Uros Bizjak
214093534f xfs: Use xchg() in xlog_cil_insert_pcp_aggregate()
try_cmpxchg() loop with constant "new" value can be substituted
with just xchg() to atomically get and clear the location.

The code on x86_64 improves from:

    1e7f:	48 89 4c 24 10       	mov    %rcx,0x10(%rsp)
    1e84:	48 03 14 c5 00 00 00 	add    0x0(,%rax,8),%rdx
    1e8b:	00
			1e88: R_X86_64_32S	__per_cpu_offset
    1e8c:	8b 02                	mov    (%rdx),%eax
    1e8e:	41 89 c5             	mov    %eax,%r13d
    1e91:	31 c9                	xor    %ecx,%ecx
    1e93:	f0 0f b1 0a          	lock cmpxchg %ecx,(%rdx)
    1e97:	75 f5                	jne    1e8e <xlog_cil_commit+0x84e>
    1e99:	48 8b 4c 24 10       	mov    0x10(%rsp),%rcx
    1e9e:	45 01 e9             	add    %r13d,%r9d

to just:

    1e7f:	48 03 14 cd 00 00 00 	add    0x0(,%rcx,8),%rdx
    1e86:	00
			1e83: R_X86_64_32S	__per_cpu_offset
    1e87:	31 c9                	xor    %ecx,%ecx
    1e89:	87 0a                	xchg   %ecx,(%rdx)
    1e8b:	41 01 cb             	add    %ecx,%r11d

No functional change intended.
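
A userspace analogue of the pattern, using C11 atomics (sketch only; the
kernel code uses try_cmpxchg()/xchg(), not <stdatomic.h>):

        #include <stdatomic.h>
        #include <stdio.h>

        static _Atomic unsigned int counter = 42;

        /* Get-and-clear with a compare-exchange retry loop. */
        static unsigned int take_cmpxchg(void)
        {
                unsigned int old = atomic_load(&counter);

                while (!atomic_compare_exchange_weak(&counter, &old, 0))
                        ;       /* old is refreshed on failure; retry */
                return old;
        }

        /* Equivalent get-and-clear with a single exchange, since the new
         * value is the constant zero. */
        static unsigned int take_xchg(void)
        {
                return atomic_exchange(&counter, 0);
        }

        int main(void)
        {
                printf("cmpxchg loop: %u\n", take_cmpxchg());
                atomic_store(&counter, 42);
                printf("xchg        : %u\n", take_xchg());
                return 0;
        }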

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <elder@riscstar.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-28 12:42:10 +01:00
Linus Torvalds
5c00ff742b - The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection algorithm.
   This leads to improved memory savings.
 
 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
 
 	- "refine mas_mab_cp()"
 	- "Reduce the space to be cleared for maple_big_node"
 	- "maple_tree: simplify mas_push_node()"
 	- "Following cleanup after introduce mas_wr_store_type()"
 	- "refine storing null"
 
 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.
 
 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationalizations and cleanups in the page mapping code.
 
 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of shadow
   entries.
 
 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.
 
 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in the
   hugetlb code.
 
 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page into
   small pages.  Instead we replace the whole thing with a THP.  More
   consistent, cleaner, and potentially saves a large number of pagefaults.
 
 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.
 
 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to do.
 
 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio size
   rather than as individual pages.  A 20% speedup was observed.
 
 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting.
 
 - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt
   removes the long-deprecated memcg v1 charge moving feature.
 
 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleans up some of the mmap() error handling and
   addresses some potential performance issues.
 
 - The series "x86/module: use large ROX pages for text allocations" from
   Mike Rapoport teaches x86 to use large pages for read-only-execute
   module text.
 
 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.
 
 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/.  A slow march towards shrinking
   struct page.
 
 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.
 
 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression.  It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.
 
 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests
   over to the KUnit framework.
 
 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userspace to place fault-generating guard pages within a single
   VMA, rather than requiring that multiple VMAs be created for this.
   Improved efficiencies for userspace memory allocators are expected.
 
 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.
 
 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.
 
 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configure multisize THP from
   the kernel boot command line.
 
 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.
 
 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep is
   enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA
 jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq
 xyyenpibWgUoShU2wZ/Ha8FE5WDINwg=
 =JfWR
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationalizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent, cleaner, and potentially saves a large number of pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcg v1 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleans up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.

 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userspace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configure multisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
2024-11-23 09:58:07 -08:00
Dave Chinner
a858109960 xfs: prevent mount and log shutdown race
I recently had an fstests hang where there were two internal tasks
stuck like so:

[ 6559.010870] task:kworker/24:45   state:D stack:12152 pid:631308 tgid:631308 ppid:2      flags:0x00004000
[ 6559.016984] Workqueue: xfs-buf/dm-2 xfs_buf_ioend_work
[ 6559.020349] Call Trace:
[ 6559.022002]  <TASK>
[ 6559.023426]  __schedule+0x650/0xb10
[ 6559.025734]  schedule+0x6d/0xf0
[ 6559.027835]  schedule_timeout+0x31/0x180
[ 6559.030582]  wait_for_common+0x10c/0x1e0
[ 6559.033495]  wait_for_completion+0x1d/0x30
[ 6559.036463]  __flush_workqueue+0xeb/0x490
[ 6559.039479]  ? mempool_alloc_slab+0x15/0x20
[ 6559.042537]  xlog_cil_force_seq+0xa1/0x2f0
[ 6559.045498]  ? bio_alloc_bioset+0x1d8/0x510
[ 6559.048578]  ? submit_bio_noacct+0x2f2/0x380
[ 6559.051665]  ? xlog_force_shutdown+0x3b/0x170
[ 6559.054819]  xfs_log_force+0x77/0x230
[ 6559.057455]  xlog_force_shutdown+0x3b/0x170
[ 6559.060507]  xfs_do_force_shutdown+0xd4/0x200
[ 6559.063798]  ? xfs_buf_rele+0x1bd/0x580
[ 6559.066541]  xfs_buf_ioend_handle_error+0x163/0x2e0
[ 6559.070099]  xfs_buf_ioend+0x61/0x200
[ 6559.072728]  xfs_buf_ioend_work+0x15/0x20
[ 6559.075706]  process_scheduled_works+0x1d4/0x400
[ 6559.078814]  worker_thread+0x234/0x2e0
[ 6559.081300]  kthread+0x147/0x170
[ 6559.083462]  ? __pfx_worker_thread+0x10/0x10
[ 6559.086295]  ? __pfx_kthread+0x10/0x10
[ 6559.088771]  ret_from_fork+0x3e/0x50
[ 6559.091153]  ? __pfx_kthread+0x10/0x10
[ 6559.093624]  ret_from_fork_asm+0x1a/0x30
[ 6559.096227]  </TASK>

[ 6559.109304] Workqueue: xfs-cil/dm-2 xlog_cil_push_work
[ 6559.112673] Call Trace:
[ 6559.114333]  <TASK>
[ 6559.115760]  __schedule+0x650/0xb10
[ 6559.118084]  schedule+0x6d/0xf0
[ 6559.120175]  schedule_timeout+0x31/0x180
[ 6559.122776]  ? call_rcu+0xee/0x2f0
[ 6559.125034]  __down_common+0xbe/0x1f0
[ 6559.127470]  __down+0x1d/0x30
[ 6559.129458]  down+0x48/0x50
[ 6559.131343]  ? xfs_buf_item_unpin+0x8d/0x380
[ 6559.134213]  xfs_buf_lock+0x3d/0xe0
[ 6559.136544]  xfs_buf_item_unpin+0x8d/0x380
[ 6559.139253]  xlog_cil_committed+0x287/0x520
[ 6559.142019]  ? sched_clock+0x10/0x30
[ 6559.144384]  ? sched_clock_cpu+0x10/0x190
[ 6559.147039]  ? psi_group_change+0x48/0x310
[ 6559.149735]  ? _raw_spin_unlock+0xe/0x30
[ 6559.152340]  ? finish_task_switch+0xbc/0x310
[ 6559.155163]  xlog_cil_process_committed+0x6d/0x90
[ 6559.158265]  xlog_state_shutdown_callbacks+0x53/0x110
[ 6559.161564]  ? xlog_cil_push_work+0xa70/0xaf0
[ 6559.164441]  xlog_state_release_iclog+0xba/0x1b0
[ 6559.167483]  xlog_cil_push_work+0xa70/0xaf0
[ 6559.170260]  process_scheduled_works+0x1d4/0x400
[ 6559.173286]  worker_thread+0x234/0x2e0
[ 6559.175779]  kthread+0x147/0x170
[ 6559.177933]  ? __pfx_worker_thread+0x10/0x10
[ 6559.180748]  ? __pfx_kthread+0x10/0x10
[ 6559.183231]  ret_from_fork+0x3e/0x50
[ 6559.185601]  ? __pfx_kthread+0x10/0x10
[ 6559.188092]  ret_from_fork_asm+0x1a/0x30
[ 6559.190692]  </TASK>

This is an ABBA deadlock where buffer IO completion is triggering a
forced shutdown with the buffer lock held. It is waiting for the CIL
to flush as part of the log force. The CIL flush is blocked doing
shutdown processing of all its objects, trying to unpin a buffer
item. That requires taking the buffer lock....

For the CIL to be doing shutdown processing, the log must be marked
with XLOG_IO_ERROR, but that doesn't happen until after the log
force is issued. Hence for xfs_do_force_shutdown() to be forcing
the log on a shut down log, we must have had a racing
xlog_force_shutdown and xfs_force_shutdown like so:

p0			p1			CIL push

   			<holds buffer lock>
xlog_force_shutdown
  xfs_log_force
   test_and_set_bit(XLOG_IO_ERROR)
   						xlog_state_release_iclog()
						  sees XLOG_IO_ERROR
						  xlog_state_shutdown_callbacks
						    ....
						    xfs_buf_item_unpin
						    xfs_buf_lock
						    <blocks on buffer p1 holds>

   			xfs_force_shutdown
			  xfs_set_shutdown(mp) wins
			    xlog_force_shutdown
			      xfs_log_force
			        <blocks on CIL push>

  xfs_set_shutdown(mp) fails
  <shuts down rest of log>

The deadlock can be mitigated by avoiding the log force on the
second pass through xlog_force_shutdown. Do this by adding another
atomic state bit (XLOG_OP_PENDING_SHUTDOWN) that is set on entry to
xlog_force_shutdown() but doesn't mark the log as shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-22 11:24:51 +01:00
Dave Chinner
c9c293240e xfs: delalloc and quota softlimit timers are incoherent
I've been seeing this failure during xfs/050 recently:

 XFS: Assertion failed: dst->d_spc_timer != 0, file: fs/xfs/xfs_qm_syscalls.c, line: 435
....
 Call Trace:
  <TASK>
  xfs_qm_scall_getquota_fill_qc+0x2a2/0x2b0
  xfs_qm_scall_getquota_next+0x69/0xa0
  xfs_fs_get_nextdqblk+0x62/0xf0
  quota_getnextxquota+0xbf/0x320
  do_quotactl+0x1a1/0x410
  __se_sys_quotactl+0x126/0x310
  __x64_sys_quotactl+0x21/0x30
  x64_sys_call+0x2819/0x2ee0
  do_syscall_64+0x68/0x130
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

It turns out that the _qmount call has silently been failing to
unmount and mount the filesystem, so when the softlimit is pushed
past with a buffered write, it is not getting synced to disk before
the next quota report is being run.

Hence when the quota report runs, we have 300 blocks of delalloc
data on an inode, with a soft limit of 200 blocks. XFS dquots
account delalloc reservations as used space, hence the dquot is over
the soft limit.

However, we don't update the soft limit timers until we do a
transactional update of the dquot. That is, the dquot sits over the
soft limit without a softlimit timer being started until writeback
occurs and the allocation modifies the dquot and we call
xfs_qm_adjust_dqtimers() from xfs_trans_apply_dquot_deltas() in
xfs_trans_commit() context.

This isn't really a problem, except for this debug code in
xfs_qm_scall_getquota_fill_qc():

        if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
                if ((dst->d_space > dst->d_spc_softlimit) &&
                    (dst->d_spc_softlimit > 0)) {
                        ASSERT(dst->d_spc_timer != 0);
                }
....

It asserts that if the used block count is over the soft limit,
it *must* have a soft limit timer running. This is clearly not
the case, because we haven't committed the delalloc space to disk
yet. Hence the soft limit is only exceeded temporarily in memory
(which isn't an issue) and we start the timer the moment we exceed
the soft limit in journalled metadata.

This debug was introduced in:

commit 0d5ad8383061fbc0a9804fbb98218750000fe032
Author: Supriya Wickrematillake <sup@sgi.com>
Date:   Wed May 15 22:44:44 1996 +0000

    initial checkin
    quotactl syscall functions.

The very first quota support commit back in 1996. This is zero-day
debug for Irix and, as it turns out, a zero-day bug in the debug
code because the delalloc code on Irix didn't update the softlimit
timers, either.

IOWs, this issue has been in the code for 28 years.

We obviously don't care if soft limit timers are a bit rubbery when
we have delalloc reservations in memory. Production systems running
quota reports have been exposed to this situation for 28 years and
nobody has noticed it, so the debug code is essentially worthless at
this point in time.

We also have the on-disk dquot verifiers checking that the soft
limit timer is running whenever the dquot is over the soft limit
before we write it to disk and after we read it from disk. These
aren't firing, so it is clear the issue is purely a temporary
in-memory incoherency that I never would have noticed had the test
not silently failed to unmount the filesystem.

Hence I'm simply going to trash this runtime debug because it isn't
useful in the slightest for catching quota bugs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-22 11:24:45 +01:00
Dave Chinner
1332533358 xfs: fix sparse inode limits on runt AG
The runt AG at the end of a filesystem is almost always smaller than
the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno
limit for the inode chunk allocation, we do not take this into
account. This means we can allocate a sparse inode chunk that
overlaps beyond the end of an AG. When we go to allocate an inode
from that sparse chunk, the irec fails validation because the
agbno of the start of the irec is beyond valid limits for the runt
AG.

Prevent this from happening by taking into account the size of the
runt AG when allocating inode chunks. Also convert the various
checks for valid inode chunk agbnos to use xfs_ag_block_count()
so that they will also catch such issues in the future.
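
The geometry problem can be sketched with hypothetical numbers (not real
xfs code):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t dblocks  = 1000000;    /* filesystem size in blocks */
                uint64_t agblocks = 262144;     /* nominal AG size (sb_agblocks) */
                uint64_t agcount  = (dblocks + agblocks - 1) / agblocks;
                uint64_t runt     = dblocks - (agcount - 1) * agblocks;

                /* Clamping inode chunks to the nominal AG size instead of
                 * the runt's real size lets a sparse chunk start beyond
                 * the end of the last AG. */
                printf("runt AG: %llu of %llu blocks\n",
                       (unsigned long long)runt, (unsigned long long)agblocks);
                printf("max_agbno: buggy=%llu fixed=%llu\n",
                       (unsigned long long)(agblocks - 1),
                       (unsigned long long)(runt - 1));
                return 0;
        }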

Fixes: 56d1115c9b ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-22 11:24:40 +01:00
Long Li
652f03db89 xfs: remove unknown compat feature check in superblock write validation
Compat features are new features that older kernels can safely ignore,
allowing read-write mounts without issues. The current sb write validation
implementation returns -EFSCORRUPTED for unknown compat features,
preventing filesystem write operations and contradicting the feature's
definition.

Additionally, if the mounted image is unclean, the log recovery may need
to write to the superblock. Returning an error for unknown compat features
during sb write validation can cause mount failures.

Although XFS currently does not use compat feature flags, this issue
affects current kernels' ability to mount images that may use compat
feature flags in the future.

Since superblock read validation already warns about unknown compat
features, it's unnecessary to repeat this warning during write validation.
Therefore, the relevant code in write validation is being removed.

Fixes: 9e037cb797 ("xfs: check for unknown v5 feature bits in superblock write verifier")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-22 10:20:55 +01:00
Long Li
45f69d091b xfs: eliminate lockdep false positives in xfs_attr_shortform_list
xfs_attr_shortform_list() is only called from a non-transactional context;
it holds the ilock before allocating memory and may therefore get trapped
in memory reclaim. Since commit 204fae32d5f7 ("xfs: clean up remaining
GFP_NOFS users") removed the GFP_NOFS flag, lockdep warnings like [1] are
reported. Eliminate the lockdep false positives by using __GFP_NOLOCKDEP
to allocate memory in xfs_attr_shortform_list().

[1] https://lore.kernel.org/linux-xfs/000000000000e33add0616358204@google.com/

Reported-by: syzbot+4248e91deb3db78358a2@syzkaller.appspotmail.com
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-22 09:52:03 +01:00
Linus Torvalds
2edc8f933d New xfs code for 6.13
* convert perag to use xarrays
 * create a new generic allocation group structure
 * Add metadata inode dir trees
 * Create in-core rt allocation groups
 * Shard the RT section into allocation groups
 * Persist quota options with the new metadata dir tree
 * Enable quota for RT volumes
 * Enable metadata directory trees
 * Some bugfixes
 
 Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZzyNwAAKCRBGdaER5Qtf
 psV3AYCncK/pVhFfKQSFbnCvgPSoAe7N9n0Wt5gmjy0Ill2mbQXVl9ADXkH6a015
 gcGM3t4BgIHLJQndL/Uz+3a0L5IriEb9QkAfzmx8t3vjiRBzBe3WfywEx9Yt7kZe
 xbxEJ2HQpA==
 =3ngC
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Carlos Maiolino:
 "The bulk of this pull request is a major rework that Darrick and
  Christoph have been doing on XFS's real-time volume, coupled with a
  few features to support this rework. It does also includes some bug
  fixes.

   - convert perag to use xarrays

   - create a new generic allocation group structure

   - add metadata inode dir trees

   - create in-core rt allocation groups

   - shard the RT section into allocation groups

   - persist quota options with the new metadata dir tree

   - enable quota for RT volumes

   - enable metadata directory trees

   - some bugfixes"

* tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (146 commits)
  xfs: port ondisk structure checks from xfs/122 to the kernel
  xfs: separate space btree structures in xfs_ondisk.h
  xfs: convert struct typedefs in xfs_ondisk.h
  xfs: enable metadata directory feature
  xfs: enable realtime quota again
  xfs: update sb field checks when metadir is turned on
  xfs: reserve quota for realtime files correctly
  xfs: create quota preallocation watermarks for realtime quota
  xfs: report realtime block quota limits on realtime directories
  xfs: persist quota flags with metadir
  xfs: advertise realtime quota support in the xqm stat files
  xfs: scrub quota file metapaths
  xfs: fix chown with rt quota
  xfs: use metadir for quota inodes
  xfs: refactor xfs_qm_destroy_quotainos
  xfs: use rtgroup busy extent list for FITRIM
  xfs: implement busy extent tracking for rtgroups
  xfs: port the perag discard code to handle generic groups
  xfs: move the min and max group block numbers to xfs_group
  xfs: adjust min_block usage in xfs_verify_agbno
  ...
2024-11-21 09:20:07 -08:00
Linus Torvalds
0f25f0e4ef the bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same
 scope where they'd been created, getting rid of reassignments
 and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
 
 We are getting very close to having the memory safety of that stuff
 trivial to verify.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
 69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
 CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
 =gVH3
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' class updates from Al Viro:
 "The bulk of struct fd memory safety stuff

  Making sure that struct fd instances are destroyed in the same scope
  where they'd been created, getting rid of reassignments and passing
  them by reference, converting to CLASS(fd{,_pos,_raw}).

  We are getting very close to having the memory safety of that stuff
  trivial to verify"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  deal with the last remaing boolean uses of fd_file()
  css_set_fork(): switch to CLASS(fd_raw, ...)
  memcg_write_event_control(): switch to CLASS(fd)
  assorted variants of irqfd setup: convert to CLASS(fd)
  do_pollfd(): convert to CLASS(fd)
  convert do_select()
  convert vfs_dedupe_file_range().
  convert cifs_ioctl_copychunk()
  convert media_request_get_by_fd()
  convert spu_run(2)
  switch spufs_calls_{get,put}() to CLASS() use
  convert cachestat(2)
  convert do_preadv()/do_pwritev()
  fdget(), more trivial conversions
  fdget(), trivial conversions
  privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
  o2hb_region_dev_store(): avoid goto around fdget()/fdput()
  introduce "fd_pos" class, convert fdget_pos() users to it.
  fdget_raw() users: switch to CLASS(fd_raw)
  convert vmsplice() to CLASS(fd)
  ...
2024-11-18 12:24:06 -08:00
Linus Torvalds
241c7ed4d4 vfs-6.13.untorn.writes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcopwAKCRCRxhvAZXjc
 oitWAQD68PGFI6/ES9x+qGsDFEZBH08icuO+a9dyaZXyNRosDgD/ex2zHj6F7IzS
 Ghgb9jiqWQ8l2+PDYfisxa/0jiqCbAk=
 =DmXf
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs untorn write support from Christian Brauner:
 "An atomic write is a write issed with torn-write protection. This
  means for a power failure or any hardware failure all or none of the
  data from the write will be stored, never a mix of old and new data.

  This work is already supported for block devices. If a block device is
  opened with O_DIRECT and the block device supports atomic write, then
  FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block
  device.

  This contains the work to expand atomic write support to filesystems,
  specifically ext4 and XFS. Currently, only support for writing exactly
  one filesystem block atomically is added.

  Since it's now possible to have filesystem block size > page size for
  XFS, it's possible to write 4K+ blocks atomically on x86"
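
A rough userspace sketch of the flow this enables; it assumes uapi headers new
enough to define STATX_WRITE_ATOMIC, the stx_atomic_write_unit_* statx fields
and RWF_ATOMIC, and is an illustration rather than part of this series:

/* Sketch only: query the untorn write unit and issue one RWF_ATOMIC write.
 * The statx mask/fields and RWF_ATOMIC are assumed to exist in the installed
 * headers (they are recent additions). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct statx stx;
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDWR | O_DIRECT)) < 0)
		return 1;
	if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) ||
	    stx.stx_atomic_write_unit_min == 0)
		return 1;	/* no untorn write support here */

	/* exactly one filesystem block is what's supported for now */
	size_t len = stx.stx_atomic_write_unit_min;
	void *buf;
	if (posix_memalign(&buf, len, len))
		return 1;
	memset(buf, 'A', len);

	struct iovec iov = { .iov_base = buf, .iov_len = len };
	/* all-or-nothing: either the whole block reaches disk or none of it */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	printf("wrote %zd bytes (unit min %u, max %u)\n", ret,
	       stx.stx_atomic_write_unit_min, stx.stx_atomic_write_unit_max);
	return ret == (ssize_t)len ? 0 : 1;
}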

* tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iomap: drop an obsolete comment in iomap_dio_bio_iter
  ext4: Do not fallback to buffered-io for DIO atomic write
  ext4: Support setting FMODE_CAN_ATOMIC_WRITE
  ext4: Check for atomic writes support in write iter
  ext4: Add statx support for atomic writes
  xfs: Support setting FMODE_CAN_ATOMIC_WRITE
  xfs: Validate atomic writes
  xfs: Support atomic write for statx
  fs: iomap: Atomic write support
  fs: Export generic_atomic_write_valid()
  block: Add bdev atomic write limits helpers
  fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
  block/fs: Pass an iocb to generic_atomic_write_valid()
2024-11-18 11:30:09 -08:00
Linus Torvalds
6ac81fd55e vfs-6.13.mgtime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcScQAKCRCRxhvAZXjc
 oj+5AP4k822a77wc/3iPFk379naIvQ4dsrgemh0/Pb6ZvzvkFQEAi3vFCfzCDR2x
 SkJF/RwXXKZv6U31QXMRt2Qo6wfBuAc=
 =nVlm
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs multigrain timestamps from Christian Brauner:
 "This is another try at implementing multigrain timestamps. This time
  with significant help from the timekeeping maintainers to reduce the
  performance impact.

  Thomas provided a base branch that contains the required timekeeping
  interfaces for the VFS. It serves as the base for the multi-grain
  timestamp work:

   - Multigrain timestamps allow the kernel to use fine-grained
     timestamps when an inode's attributes are being actively observed
     via ->getattr(). With this support, it's possible for a file to get
     a fine-grained timestamp, and for another file modified after it to
     get a coarse-grained stamp that is earlier than the fine-grained
     time. If this happens then the files can appear to have been
     modified in reverse order, which breaks VFS ordering guarantees.

     To prevent this, a floor value is maintained for multigrain
     timestamps. Whenever a fine-grained timestamp is handed out, record
     it, and when later coarse-grained stamps are handed out, ensure
     they are not earlier than that value. If the coarse-grained
     timestamp is earlier than the fine-grained floor, return the floor
     value instead.

     The timekeeper changes add a static singleton atomic64_t into
     timekeeper.c that is used to keep track of the latest fine-grained
     time ever handed out. This is tracked as a monotonic ktime_t value
     to ensure that it isn't affected by clock jumps. Because it is
     updated at different times than the rest of the timekeeper object,
     the floor value is managed independently of the timekeeper via a
     cmpxchg() operation, and sits on its own cacheline.

     Two new public timekeeper interfaces are added:

      (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the
          later of the coarse-grained clock and the floor time

      (2) ktime_get_real_ts64_mg() gets the fine-grained clock value,
          and tries to swap it into the floor. A timespec64 is filled
          with the result.

   - The VFS has always used coarse-grained timestamps when updating the
     ctime and mtime after a change. This has the benefit of allowing
     filesystems to optimize away a lot of metadata updates, down to around
     1 per jiffy, even when a file is under heavy writes.

     Unfortunately, this has always been an issue when we're exporting
     via NFSv3, which relies on timestamps to validate caches. A lot of
     changes can happen in a jiffy, so timestamps aren't sufficient to
     help the client decide when to invalidate the cache. Even with
     NFSv4, a lot of exported filesystems don't properly support a
     change attribute and are subject to the same problems with
     timestamp granularity. Other applications have similar issues with
     timestamps (e.g. backup applications).

     If we were to always use fine-grained timestamps, that would
     improve the situation, but that becomes rather expensive, as the
     underlying filesystem would have to log a lot more metadata
     updates.

     This adds a way to only use fine-grained timestamps when they are
     being actively queried. Use the (unused) top bit in
     inode->i_ctime_nsec as a flag that indicates whether the current
     timestamps have been queried via stat() or the like. When it's set,
     we allow the kernel to use a fine-grained timestamp iff it's
     necessary to make the ctime show a different value.

     This solves the problem of being able to distinguish the timestamp
     between updates, but introduces a new problem: it's now possible
     for a file being changed to get a fine-grained timestamp. A file
     that is altered just a bit later can then get a coarse-grained one
     that appears older than the earlier fine-grained time. This
     violates timestamp ordering guarantees.

     This is where the earlier-mentioned timekeeping interfaces help (a
     simplified sketch of this floor logic follows this message). A
     global monotonic atomic64_t value is kept that acts as a timestamp
     floor. When we go to stamp a file, we first get the later of the
     current floor value and the current coarse-grained time. If the
     inode ctime hasn't been queried then we just attempt to stamp it
     with that value.

     If it has been queried, then first see whether the current coarse
     time is later than the existing ctime. If it is, then we accept
     that value. If it isn't, then we get a fine-grained time and try to
     swap that into the global floor. Whether that succeeds or fails, we
     take the resulting floor time, convert it to realtime and try to
     swap that into the ctime.

     We take the result of the ctime swap whether it succeeds or fails,
     since either is just as valid.

     Filesystems can opt into this by setting the FS_MGTIME fstype flag.
     Others should be unaffected (other than being subject to the same
     floor value as multigrain filesystems)"
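
A standalone sketch of the floor logic described above, modelled with POSIX
clocks and C11 atomics rather than the kernel's timekeeper/atomic64
interfaces; all names here are illustrative:

/* Coarse stamps are clamped to the latest fine-grained time ever handed out,
 * so a later coarse stamp can never appear older than an earlier fine one. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static _Atomic uint64_t floor_ns;	/* latest fine-grained time handed out */

static uint64_t read_ns(clockid_t id)
{
	struct timespec ts;
	clock_gettime(id, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Coarse stamp, never earlier than the floor. */
static uint64_t coarse_mg(void)
{
	uint64_t coarse = read_ns(CLOCK_MONOTONIC_COARSE);
	uint64_t flr = atomic_load(&floor_ns);
	return coarse > flr ? coarse : flr;
}

/* Fine stamp: read the fine clock and try to swap it into the floor.
 * Whether the cmpxchg wins or loses, the resulting floor is a valid stamp. */
static uint64_t fine_mg(void)
{
	uint64_t flr = atomic_load(&floor_ns);
	uint64_t fine = read_ns(CLOCK_MONOTONIC);
	if (fine > flr && atomic_compare_exchange_strong(&floor_ns, &flr, fine))
		return fine;
	return atomic_load(&floor_ns);
}

int main(void)
{
	uint64_t a = coarse_mg();	/* plain ctime update */
	uint64_t b = fine_mg();		/* update after a stat() queried the ctime */
	uint64_t c = coarse_mg();	/* later update: cannot be earlier than b */
	printf("%llu <= %llu <= %llu\n", (unsigned long long)a,
	       (unsigned long long)b, (unsigned long long)c);
	return 0;
}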

* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: reduce pointer chasing in is_mgtime() test
  tmpfs: add support for multigrain timestamps
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  Documentation: add a new file documenting multigrain timestamps
  fs: add percpu counters for significant multigrain timestamp events
  fs: tracepoints around multigrain timestamp events
  fs: handle delegated timestamps in setattr_copy_mgtime
  timekeeping: Add percpu counter for tracking floor swap events
  timekeeping: Add interfaces for handling timestamps with a floor value
  fs: have setattr_copy handle multigrain timestamps appropriately
  fs: add infrastructure for multigrain timestamps
2024-11-18 09:15:39 -08:00
Carlos Maiolino
5877dc24be xfs: improve ondisk structure checks [v5.5 10/10]
Reorganize xfs_ondisk.h to group the build checks by type, then add a
 bunch of missing checks that were in xfs/122 but not the build system.
 With this, we can get rid of xfs/122.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
 ph8MAP4nWT1zCLAEDpFqVQmJO7wjlS02x5xFTHEzcms30ptsIwEA73ycS74tAmP5
 CGsPxFVzhG8oMkWgrWXnXuOGfpQBPQo=
 =Dx2e
 -----END PGP SIGNATURE-----

Merge tag 'better-ondisk-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: improve ondisk structure checks [v5.5 10/10]

Reorganize xfs_ondisk.h to group the build checks by type, then add a
bunch of missing checks that were in xfs/122 but not the build system.
With this, we can get rid of xfs/122.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 11:03:15 +01:00
Carlos Maiolino
052378aef8 xfs: enable metadir [v5.5 09/10]
Actually enable this very large feature, which adds metadata directory
 trees, allocation groups on the realtime volume, persistent quota
 options, and quota for realtime files.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
 pvjKAP9A1v0QwkBo65ARf1gQb2AZYlebV8o6faSrpc+10vOPxgD/SV3GKhxQrtaT
 gOswA6f9QJN3E82IJ6fR2PZgSke+iQs=
 =xjZL
 -----END PGP SIGNATURE-----

Merge tag 'metadir-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: enable metadir [v5.5 09/10]

Actually enable this very large feature, which adds metadata directory
trees, allocation groups on the realtime volume, persistent quota
options, and quota for realtime files.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 11:02:55 +01:00
Carlos Maiolino
8ca118e17a xfs: enable quota for realtime volumes [v5.5 08/10]
At some point, I realized that I've refactored enough of the quota code
 in XFS that I should evaluate whether or not quota actually works on
 realtime volumes.  It turns out that it nearly works: the only broken
 pieces are chown and delayed allocation, and reporting of project
 quotas in the statvfs output for projinherit+rtinherit directories.
 
 Fix these things and we can have realtime quotas again after 20 years.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
 pkh4AQCtjI73mwU9rhzs2MO5nLNlpg9bgOxute+G4eqGCP02CwEAvg/LpT9yA6qk
 1jM5x8C6xy03yIWTUc+DcMPqoCqJzwc=
 =2QMv
 -----END PGP SIGNATURE-----

Merge tag 'realtime-quotas-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: enable quota for realtime volumes [v5.5 08/10]

At some point, I realized that I've refactored enough of the quota code
in XFS that I should evaluate whether or not quota actually works on
realtime volumes.  It turns out that it nearly works: the only broken
pieces are chown and delayed allocation, and reporting of project
quotas in the statvfs output for projinherit+rtinherit directories.

Fix these things and we can have realtime quotas again after 20 years.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 11:02:25 +01:00
Carlos Maiolino
93c0f79edf xfs: persist quota options with metadir [v5.5 07/10]
Store the quota files in the metadata directory tree instead of the
 superblock.  Since we're introducing a new incompat feature flag, let's
 also make the mount process bring up quotas in whatever state they were
 when the filesystem was last unmounted, instead of requiring sysadmins
 to remember that themselves.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
 pod/AP9+FSHh3UF9ccpBQXMAG0NmhXYJBQ7grnAp4q89ko6fCAD+JyUsqi55zDw6
 KnLSWZgNuO+aaCmMKXX4hEttA0//gAY=
 =0YFz
 -----END PGP SIGNATURE-----

Merge tag 'metadir-quotas-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: persist quota options with metadir [v5.5 07/10]

Store the quota files in the metadata directory tree instead of the
superblock.  Since we're introducing a new incompat feature flag, let's
also make the mount process bring up quotas in whatever state they were
when the filesystem was last unmounted, instead of requiring sysadmins
to remember that themselves.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 11:01:12 +01:00
Carlos Maiolino
b939bcdca3 xfs: shard the realtime section [v5.5 06/10]
Right now, the realtime section uses a single pair of metadata inodes to
 store the free space information.  This presents a scalability problem
 since every thread trying to allocate or free rt extents has to lock
 these files.  Solve this problem by sharding the realtime section into
 separate realtime allocation groups.
 
 While we're at it, define a superblock to be stamped into the start of
 the rt section.  This enables utilities such as blkid to identify block
 devices containing realtime sections, and avoids the situation where
 anything written into block 0 of the realtime extent can be
 misinterpreted as file data.
 
 The best advantage for rtgroups will become evident later when we get to
 adding rmap and reflink to the realtime volume, since the geometry
 constraints are the same for rt groups and AGs.  Hence we can reuse all
 that code directly.
 
 This is a very large patchset, but it catches us up with 20 years of
 technical debt that has accumulated.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
 pqk4AQD31pupAefiZ39TFLz0oA1+Q2WUOoLxH/3Ovqin1GJNPgD9EG04/14fDmRU
 WDUSVfU8JKKJYEXXZnLeJLsvEUL2EQ0=
 =1/oh
 -----END PGP SIGNATURE-----

Merge tag 'realtime-groups-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: shard the realtime section [v5.5 06/10]

Right now, the realtime section uses a single pair of metadata inodes to
store the free space information.  This presents a scalability problem
since every thread trying to allocate or free rt extents has to lock
these files.  Solve this problem by sharding the realtime section into
separate realtime allocation groups.

While we're at it, define a superblock to be stamped into the start of
the rt section.  This enables utilities such as blkid to identify block
devices containing realtime sections, and avoids the situation where
anything written into block 0 of the realtime extent can be
misinterpreted as file data.

The best advantage for rtgroups will become evident later when we get to
adding rmap and reflink to the realtime volume, since the geometry
constraints are the same for rt groups and AGs.  Hence we can reuse all
that code directly.

This is a very large patchset, but it catches us up with 20 years of
technical debt that has accumulated.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 11:00:42 +01:00
Carlos Maiolino
cb288c9fb2 xfs: preparation for realtime allocation groups [v5.5 05/10]
Prepare for realtime groups by adding a few bug fixes and generic code
 that will be necessary.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
 pgmeAP980iZAY49aL85dhr/QNl0G5YmuLOx6UW8DiAALCJxyxAEAnD5ilA7vQz40
 80/cn+Y77fT3LptpAHTM5/FY+42IOgM=
 =AzwL
 -----END PGP SIGNATURE-----

Merge tag 'rtgroups-prep-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: preparation for realtime allocation groups [v5.5 05/10]

Prepare for realtime groups by adding a few bug fixes and generic code
that will be necessary.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 11:00:16 +01:00
Carlos Maiolino
6b3582aca3 xfs: create incore rt allocation groups [v5.5 04/10]
Add in-memory data structures for sharding the realtime volume into
 independent allocation groups.  For existing filesystems, the entire rt
 volume is modelled as having a single large group, with (potentially) a
 number of rt extents exceeding 2^32 blocks, though these are not likely
 to exist because the codebase has been a bit broken for decades.  The
 next series fills in the ondisk format and other supporting structures.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
 ptrrAP41PURivFpHWXqg0sajsIUUezhuAdfg41fJqOop81qWDAEA2CsLf1z0c9/P
 CQS/tlQ3xdwZ0MYZMaw2o0EgSHYjwg8=
 =qVdv
 -----END PGP SIGNATURE-----

Merge tag 'incore-rtgroups-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: create incore rt allocation groups [v5.5 04/10]

Add in-memory data structures for sharding the realtime volume into
independent allocation groups.  For existing filesystems, the entire rt
volume is modelled as having a single large group, with (potentially) a
number of rt extents exceeding 2^32 blocks, though these are not likely
to exist because the codebase has been a bit broken for decades.  The
next series fills in the ondisk format and other supporting structures.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 10:59:34 +01:00
Carlos Maiolino
d7a5b69bf0 xfs: metadata inode directory trees [v5.5 03/10]
This series delivers a new feature -- metadata inode directories.  This
 is a separate directory tree (rooted in the superblock) that contains
 only inodes that contain filesystem metadata.  Different metadata
 objects can be looked up with regular paths.
 
 Start by creating xfs_imeta{dir,file}* functions to mediate access to
 the metadata directory tree.  By the end of this mega series, all
 existing metadata inodes (rt+quota) will use this directory tree instead
 of the superblock.
 
 Next, define the metadir on-disk format, which consists of marking
 inodes with a new iflag that says they're metadata.  This prevents
 bulkstat and friends from ever getting their hands on fs metadata files.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
 ppNiAP9JgPFENv3P0UCJiCDqtWZlNWfz8a9ngAehm4AQMA0P9gD/XgVYNKZRY2Q3
 P+3Sh1TVZ63dcEENlmEFE3myKjeJKAQ=
 =IJLc
 -----END PGP SIGNATURE-----

Merge tag 'metadata-directory-tree-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: metadata inode directory trees [v5.5 03/10]

This series delivers a new feature -- metadata inode directories.  This
is a separate directory tree (rooted in the superblock) that contains
only inodes that contain filesystem metadata.  Different metadata
objects can be looked up with regular paths.

Start by creating xfs_imeta{dir,file}* functions to mediate access to
the metadata directory tree.  By the end of this mega series, all
existing metadata inodes (rt+quota) will use this directory tree instead
of the superblock.

Next, define the metadir on-disk format, which consists of marking
inodes with a new iflag that says they're metadata.  This prevents
bulkstat and friends from ever getting their hands on fs metadata files.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 10:59:05 +01:00
Carlos Maiolino
28cf0d1a34 xfs: create a generic allocation group structure [v5.5 02/10]
Soon we'll be sharding the realtime volume into separate allocation
 groups.  These rt groups will /mostly/ behave the same as the ones on
 the data device, but since rt groups don't have quite the same set of
 struct fields as perags, let's hoist the parts that will be shared by
 both into a common xfs_group object.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR
 pnJDAQCh14f3aSCHslr4XeG1YkT7NZ44AtMjBqEi5GRN26wC0wD+PeaVSRgt+5wy
 nONT/nFqU5pApe1w2pq78SoJ+vLJxAk=
 =la16
 -----END PGP SIGNATURE-----

Merge tag 'generic-groups-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: create a generic allocation group structure [v5.5 02/10]

Soon we'll be sharding the realtime volume into separate allocation
groups.  These rt groups will /mostly/ behave the same as the ones on
the data device, but since rt groups don't have quite the same set of
struct fields as perags, let's hoist the parts that will be shared by
both into a common xfs_group object.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 10:58:27 +01:00
Carlos Maiolino
131ffe5e69 xfs: convert perag to use xarrays [v5.5 01/10]
Convert the xfs_mount perag tree to use an xarray instead of a radix
 tree.  There should be no functional changes here.
 
 With a bit of luck, this should all go splendidly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQcwAKCRBKO3ySh0YR
 pmKsAQDko7YQ/dJZsjDvBJRUEnjrJqK9ukR9Ovc00ak0WXIDYQD/cdm8xzkixHn2
 yfwsuFxIm6k0PPzb9koSCDT/CQS6QAA=
 =WGih
 -----END PGP SIGNATURE-----

Merge tag 'perag-xarray-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge

xfs: convert perag to use xarrays [v5.5 01/10]

Convert the xfs_mount perag tree to use an xarray instead of a radix
tree.  There should be no functional changes here.

With a bit of luck, this should all go splendidly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12 10:57:32 +01:00
Kairui Song
da0c02516c mm/list_lru: simplify the list_lru walk callback function
Isolation no longer takes the global per-node list_lru lock; it uses the
per-cgroup lock instead.  And since this lock is inside the list_lru_one
being walked, there is no longer any need to pass the lock explicitly.

Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:22:26 -08:00
Kairui Song
fb56fdf8b9 mm/list_lru: split the lock to per-cgroup scope
Currently, every list_lru has a per-node lock that protects adding,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node.  This lock contention is heavy
when multiple cgroups modify the same list_lru.

This lock can be split into per-cgroup scope to reduce contention.

To achieve this, we need a stable list_lru_one for every cgroup.  This
commit adds a lock to each list_lru_one and introduces a helper function
lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg.
It then reworks the reparenting process.

Reparenting will switch the list_lru_one instances one by one.  By locking
each instance and marking it dead using the nr_items counter, reparenting
ensures that all items in the corresponding cgroup (on-list or not,
because items have a stable cgroup, see below) will see the list_lru_one
switch synchronously.

Objcg reparent is also moved after list_lru reparent so items will have a
stable mem cgroup until all list_lru_one instances are drained.

The only callers that don't go through the *_obj interfaces are direct
calls to list_lru_{add,del}.  But those are only used by zswap, which is
also based on objcg, so it's fine.

This also changes the behaviour of the isolation function when LRU_RETRY
or LRU_REMOVED_RETRY is returned: because releasing the lock could now
unblock reparenting and free the list_lru_one, the isolation function has
to return without re-locking the lru.
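
A hedged sketch of the locking scheme this describes, with pthread mutexes
standing in for spinlocks and invented names; it is a model of the idea, not
the kernel code.  The benchmark script below then measures the effect of the
split:

/* A model of the idea only: the structure and names are invented. */
#include <pthread.h>

struct lru_one {
	pthread_mutex_t lock;		/* per-cgroup lock, replaces the per-node one */
	long nr_items;			/* < 0 means dead: reparenting moved the items */
	struct lru_one *parent;
};

/* Pin the lru of a cgroup: take its lock, and if reparenting has already
 * marked it dead, retry with the parent.  Everyone who pins afterwards
 * therefore sees the switch synchronously. */
struct lru_one *lock_lru_of_cgroup(struct lru_one *l)
{
	for (;;) {
		pthread_mutex_lock(&l->lock);
		if (l->nr_items >= 0)
			return l;		/* still live: caller may add/del/isolate */
		pthread_mutex_unlock(&l->lock);
		l = l->parent;			/* switched under us, chase the parent */
	}
}

/* Reparent one instance: lock parent then child, splice the item count,
 * and mark the child dead so future lookups fall through to the parent. */
void reparent_one(struct lru_one *parent, struct lru_one *child)
{
	pthread_mutex_lock(&parent->lock);
	pthread_mutex_lock(&child->lock);
	parent->nr_items += child->nr_items;
	child->nr_items = -1;
	pthread_mutex_unlock(&child->lock);
	pthread_mutex_unlock(&parent->lock);
}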

prepare() {
    mkdir /tmp/test-fs
    modprobe brd rd_nr=1 rd_size=33554432
    mkfs.xfs -f /dev/ram0
    mount -t xfs /dev/ram0 /tmp/test-fs
    for i in $(seq 1 512); do
        mkdir "/tmp/test-fs/$i"
        for j in $(seq 1 10240); do
            echo TEST-CONTENT > "/tmp/test-fs/$i/$j"
        done &
    done; wait
}

do_test() {
    read_worker() {
        sleep 1
        tar -cv "$1" &>/dev/null
    }
    read_in_all() {
        cd "/tmp/test-fs" && ls
        for i in $(seq 1 512); do
            (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
            read_worker "$i" &
        done; wait
    }
    for i in $(seq 1 512); do
        mkdir -p "/sys/fs/cgroup/benchmark/$i"
    done
    echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
    echo 512M > /sys/fs/cgroup/benchmark/memory.max
    echo 3 > /proc/sys/vm/drop_caches
    time read_in_all
}

The above script simulates compression of small files in multiple cgroups
under memory pressure. Run prepare(), then run do_test() 6 times:

Before:
real      0m7.762s user      0m11.340s sys       3m11.224s
real      0m8.123s user      0m11.548s sys       3m2.549s
real      0m7.736s user      0m11.515s sys       3m11.171s
real      0m8.539s user      0m11.508s sys       3m7.618s
real      0m7.928s user      0m11.349s sys       3m13.063s
real      0m8.105s user      0m11.128s sys       3m14.313s

After this commit (~15% faster):
real      0m6.953s user      0m11.327s sys       2m42.912s
real      0m7.453s user      0m11.343s sys       2m51.942s
real      0m6.916s user      0m11.269s sys       2m43.957s
real      0m6.894s user      0m11.528s sys       2m45.346s
real      0m6.911s user      0m11.095s sys       2m43.168s
real      0m6.773s user      0m11.518s sys       2m40.774s

Link: https://lkml.kernel.org/r/20241104175257.60853-6-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:22:26 -08:00
Darrick J. Wong
13877bc79d xfs: port ondisk structure checks from xfs/122 to the kernel
Check this with every kernel and userspace build, so we can drop the
nonsense in xfs/122.  Roughly drafted with:

sed -e 's/^offsetof/\tXFS_CHECK_OFFSET/g' \
	-e 's/^sizeof/\tXFS_CHECK_STRUCT_SIZE/g' \
	-e 's/ = \([0-9]*\)/,\t\t\t\1);/g' \
	-e 's/xfs_sb_t/struct xfs_dsb/g' \
	-e 's/),/,/g' \
	-e 's/xfs_\([a-z0-9_]*\)_t,/struct xfs_\1,/g' \
	< tests/xfs/122.out | sort

and then manual fixups.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:47 -08:00
Darrick J. Wong
131a883fff xfs: separate space btree structures in xfs_ondisk.h
Create a separate section for space management btrees so that they're
not mixed in with file structures.  Ignore the dsb stuff sprinkled
around for now, because we'll deal with that in a subsequent patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:47 -08:00
Darrick J. Wong
89b38282d1 xfs: convert struct typedefs in xfs_ondisk.h
Replace xfs_foo_t with struct xfs_foo where appropriate.  The next patch
will import more checks from xfs/122, and it's easier to automate
deduplication if we don't have to reason about typedefs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:47 -08:00
Darrick J. Wong
ea079efd36 xfs: enable metadata directory feature
Enable the metadata directory feature.  With this feature, all metadata
inodes are placed in the metadata directory, and the only inumbers in
the superblock are the roots of the two directory trees.

The RT device is now sharded into a number of rtgroups, where 0 rtgroups
means that no RT extents are supported, and the traditional XFS stub RT
bitmap and summary inodes don't exist.  A single rtgroup gives roughly
identical behavior to the traditional RT setup, but now with checksummed
and self identifying free space metadata.

For quota, the quota options are read from the superblock unless
explicitly overridden via mount options.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:46 -08:00
Darrick J. Wong
edc038f7f3 xfs: enable realtime quota again
Enable quotas for the realtime device.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:46 -08:00
Darrick J. Wong
28d756d4d5 xfs: update sb field checks when metadir is turned on
When metadir is enabled, we want to check the two new rtgroups fields,
and we don't want to check the old inumbers that are now in the metadir.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:46 -08:00
Darrick J. Wong
b7020ba86a xfs: reserve quota for realtime files correctly
Fix xfs_quota_reserve_blkres to reserve rt block quota whenever we're
dealing with a realtime file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:46 -08:00
Darrick J. Wong
5dd70852b0 xfs: create quota preallocation watermarks for realtime quota
Refactor the quota preallocation watermarking code so that it'll work
for realtime quota too.  Convert the do_div calls into div_u64 for
compactness.
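
For reference, the semantic difference between the two helpers, shown as a
standalone approximation (the kernel's do_div() is arch-specific; this macro
only mimics its calling convention):

/* do_div() updates its argument in place to the quotient and returns the
 * remainder; div_u64() is a plain expression returning the quotient. */
#include <stdint.h>
#include <stdio.h>

#define do_div(n, base) ({ uint32_t __rem = (n) % (base); (n) /= (base); __rem; })

static inline uint64_t div_u64(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

int main(void)
{
	uint64_t blks = 1000003;

	/* old style: do_div() mutates its first argument */
	uint64_t n = blks;
	uint32_t rem = do_div(n, 64);
	printf("do_div:  quotient=%llu remainder=%u\n", (unsigned long long)n, rem);

	/* new style: a plain expression, more compact at the call site */
	printf("div_u64: quotient=%llu\n", (unsigned long long)div_u64(blks, 64));
	return 0;
}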

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:46 -08:00
Darrick J. Wong
9a17ebfea9 xfs: report realtime block quota limits on realtime directories
On the data device, calling statvfs on a projinherit directory results
in the block and avail counts being curtailed to the project quota block
limits, if any are set.  Do the same for realtime files or directories,
only using the project quota rt block limits.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:46 -08:00
Darrick J. Wong
d5d9dd5b30 xfs: persist quota flags with metadir
It's annoying that one has to keep reminding XFS about what quota
options it should mount with, since the quota flags recording the
previous state are sitting right there in the primary superblock.  Even
more strangely, there exists a noquota option to disable quotas
completely, so it's odder still that providing no options is the same as
noquota.

Starting with metadir, let's change the behavior so that if the user
does not specify any quota-related mount options at all, the ondisk
quota flags will be used to bring up quota.  In other words, the
filesystem will mount in the same state and with the same functionality
as it had during the last mount.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:45 -08:00
Darrick J. Wong
184c619f55 xfs: advertise realtime quota support in the xqm stat files
Add a fifth column to this (really old) stat file to advertise that the
kernel supports quota for realtime volumes.  This will be used by
fstests to detect kernel support.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:45 -08:00
Darrick J. Wong
128a055291 xfs: scrub quota file metapaths
Enable online fsck for quota file metadata directory paths.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:45 -08:00
Darrick J. Wong
b28564cae1 xfs: fix chown with rt quota
Make chown's quota adjustments work with realtime files.  This is mostly
a matter of calling xfs_inode_count_blocks on a given file to figure out
the number of blocks allocated to the data device and to the realtime
device, and using those quantities to update the quota accounting when
the id changes.  Delayed allocation reservations are moved from the old
dquot's incore reservation to the new dquot's incore reservation.

Note that there was a missing ILOCK bug in xfs_qm_dqusage_adjust that we
must fix before calling xfs_iread_extents.  Prior to 2.6.37 the locking
was correct, but then someone removed the ILOCK as part of a cleanup.
Nobody noticed because nowhere in the git history have we ever supported
rt+quota, so nobody can use this.

I'm leaving git breadcrumbs in case anyone is desperate enough to try to
backport the rtquota code to old kernels.

Not-Cc: <stable@vger.kernel.org> # v2.6.37
Fixes: 52fda11424 ("xfs: simplify xfs_qm_dqusage_adjust")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:45 -08:00
Darrick J. Wong
e80fbe1ad8 xfs: use metadir for quota inodes
Store the quota inodes in the /quota metadata directory if metadir is
enabled.  This enables us to stop using the sb_[ugp]uotino fields in the
superblock.  From this point on, all metadata files will be children of
the metadata directory tree root.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:45 -08:00
Darrick J. Wong
fc23a426ce xfs: refactor xfs_qm_destroy_quotainos
Reuse this function instead of open-coding the logic.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:45 -08:00
Darrick J. Wong
a3315d1130 xfs: use rtgroup busy extent list for FITRIM
For filesystems that have rtgroups and hence use the busy extent list
for freed rt space, use that busy extent list so that FITRIM can issue
discard commands asynchronously without worrying about other callers
accidentally allocating and using space that is being discarded.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:44 -08:00
Darrick J. Wong
7e85fc2394 xfs: implement busy extent tracking for rtgroups
For rtgroups filesystems, track newly freed (rt) space through the log
until the rt EFIs have been committed to disk.  This way we ensure that
space cannot be reused until all traces of the old owner are gone.

As a fringe benefit, we now support -o discard on the realtime device.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:44 -08:00
Darrick J. Wong
0c271d906e xfs: port the perag discard code to handle generic groups
Port xfs_discard_extents and its tracepoints to handle generic groups
instead of just perags.  This is needed to enable busy extent tracking
for rtgroups.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:44 -08:00
Darrick J. Wong
e0b5b97dde xfs: move the min and max group block numbers to xfs_group
Move the min and max agblock numbers to the generic xfs_group structure
so that we can start building validators for extents within an rtgroup.
While we're at it, use check_add_overflow for the extent length
computation because that has much better overflow checking.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:44 -08:00
Darrick J. Wong
ceaa0bd773 xfs: adjust min_block usage in xfs_verify_agbno
There's some weird logic in xfs_verify_agbno -- min_block ought to be
the first agblock number in the AG that can be used by non-static
metadata.  However, we initialize it to the last agblock of the static
metadata, which works due to the <= check, even though this isn't
technically correct.

Change the check to < and set min_block to the next agblock past the
static metadata.  This hasn't been an issue up to now, but we're going
to move these things into the generic group struct, and this will cause
problems with rtgroups, where min_block can be zero for an rtgroup that
doesn't have a rt superblock.

Note that there's no user-visible impact with the old logic, so this
isn't a bug fix.
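
A simplified before/after of the check; the helper names and explicit bounds
below are illustrative, not the kernel function itself:

/* Simplified model of the agbno verifier change described above. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t xfs_agblock_t;

/* Old scheme: min_block holds the LAST block of the static metadata, so a
 * usable agbno must be strictly greater -- hence the <= rejection. */
bool verify_agbno_old(xfs_agblock_t agbno, xfs_agblock_t eoag,
		      xfs_agblock_t last_static)
{
	return agbno < eoag && !(agbno <= last_static);
}

/* New scheme: min_block holds the FIRST usable block past the static
 * metadata, so reject with <.  An rtgroup without an rt superblock can
 * then simply use min_block == 0. */
bool verify_agbno_new(xfs_agblock_t agbno, xfs_agblock_t eoag,
		      xfs_agblock_t min_block)
{
	return agbno < eoag && !(agbno < min_block);
}

int main(void)
{
	/* same answers as before, but the new form also works with min_block == 0 */
	printf("%d %d\n", verify_agbno_old(4, 100, 3), verify_agbno_new(4, 100, 4));
	printf("%d %d\n", verify_agbno_old(3, 100, 3), verify_agbno_new(3, 100, 4));
	return 0;
}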

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:44 -08:00
Darrick J. Wong
7195f240c6 xfs: make xfs_rtblock_t a segmented address like xfs_fsblock_t
Now that we've finished adding allocation groups to the realtime volume,
let's make the file block mapping address (xfs_rtblock_t) a segmented
value just like we do on the data device.  This means that group number
and block number conversions can be done with shifting and masking
instead of integer division.

While in theory we could continue caching the rgno shift value in
m_rgblklog, the fact that we now always use the shift value means that
we have an opportunity to increase the redundancy of the rt geometry by
storing it in the ondisk superblock and adding more sb verifier code.
Extend the superblock to store the rgblklog value.

Now that we have segmented addresses, set the correct values in
m_groups[XG_TYPE_RTG] so that the xfs_group helpers work correctly.
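
Roughly, a segmented rt address then decomposes like this; the struct and
helper names below are made up for illustration:

/* Segmented rt address: high bits select the rtgroup, low bits the block
 * within it.  With the shift available (rgblklog), conversions become a
 * shift and a mask instead of a 64-bit division. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t xfs_rtblock_t;

struct rt_geom {
	unsigned int	rgblklog;	/* log2 of the per-rtgroup address span */
	uint64_t	rgblkmask;	/* (1 << rgblklog) - 1 */
};

static inline uint32_t rtb_to_rgno(const struct rt_geom *g, xfs_rtblock_t rtb)
{
	return rtb >> g->rgblklog;		/* group number */
}

static inline uint64_t rtb_to_rgbno(const struct rt_geom *g, xfs_rtblock_t rtb)
{
	return rtb & g->rgblkmask;		/* block within the group */
}

int main(void)
{
	struct rt_geom g = { .rgblklog = 20, .rgblkmask = (1ull << 20) - 1 };
	xfs_rtblock_t rtb = ((xfs_rtblock_t)3 << g.rgblklog) | 42;

	printf("rtgroup %u, block %llu\n", rtb_to_rgno(&g, rtb),
	       (unsigned long long)rtb_to_rgbno(&g, rtb));
	return 0;
}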

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:44 -08:00
Darrick J. Wong
3f0205ebe7 xfs: create helpers to deal with rounding xfs_filblks_t to rtx boundaries
We're about to segment xfs_rtblock_t addresses, so we must create
type-specific helpers to do rt extent rounding of file mapping block
lengths because the rtb helpers soon will not do the right thing there.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:43 -08:00
Darrick J. Wong
fd7588fa64 xfs: create helpers to deal with rounding xfs_fileoff_t to rtx boundaries
We're about to segment xfs_rtblock_t addresses, so we must create
type-specific helpers to do rt extent rounding of file block offsets
because the rtb helpers soon will not do the right thing there.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:43 -08:00
Darrick J. Wong
ea99122b18 xfs: mask off the rtbitmap and summary inodes when metadir in use
Set the rtbitmap and summary file inumbers to NULLFSINO in the
superblock and make sure they're zeroed whenever we write the superblock
to disk, to mimic mkfs behavior.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:43 -08:00
Darrick J. Wong
a74923333d xfs: scrub metadir paths for rtgroup metadata
Add the code we need to scan the metadata directory paths of rt group
metadata files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:43 -08:00
Darrick J. Wong
1433f8f9ce xfs: repair realtime group superblock
Repair the realtime superblock if it has become out of date with the
primary superblock.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:43 -08:00
Darrick J. Wong
3f1bdf50ab xfs: scrub the realtime group superblock
Enable scrubbing of realtime group superblocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:43 -08:00
Darrick J. Wong
7333c948c2 xfs: don't coalesce file mappings that cross rtgroup boundaries in scrub
The bmbt scrubber will combine file mappings if they are mergeable to
reduce the number of cross-referencing checks.  However, we shouldn't
combine mappings that cross rt group boundaries because that will cause
verifiers to trip incorrectly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:42 -08:00
Christoph Hellwig
d162491c54 xfs: make the RT allocator rtgroup aware
Make the allocator rtgroup aware by either picking a specific group if
there is a hint, or loop over all groups otherwise.  A simple rotor is
provided to pick the placement for initial allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:42 -08:00
Darrick J. Wong
b91afef724 xfs: don't merge ioends across RTGs
Unlike AGs, RTGs don't always have metadata in their first blocks, and
thus we don't get automatic protection from merging I/O completions
across RTG boundaries.  Add code to set the IOMAP_F_BOUNDARY flag for
ioends that start at the first block of a RTG so that they never get
merged into the previous ioend.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:42 -08:00
Darrick J. Wong
44e69c9af1 xfs: use realtime EFI to free extents when rtgroups are enabled
When rmap is enabled, XFS expects a certain order of operations, which
is: 1) remove the file mapping, 2) remove the reverse mapping, and then
3) free the blocks.  When reflink is enabled, XFS replaces (3) with a
deferred refcount decrement operation that can schedule freeing the
blocks if that was the last refcount.

For realtime files, xfs_bmap_del_extent_real tries to do 1 and 3 in the
same transaction, which will break both rmap and reflink unless we
switch it to use realtime EFIs.  Both rmap and reflink depend on the
rtgroups feature, so let's turn on EFIs for all rtgroups filesystems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:42 -08:00
Darrick J. Wong
fc91d9430e xfs: support error injection when freeing rt extents
A handful of fstests expect to be able to test what happens when extent
free intents fail to actually free the extent.  Now that we're
supporting EFIs for realtime extents, add to xfs_rtfree_extent the same
injection point that exists in the regular extent freeing code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:42 -08:00
Darrick J. Wong
4c8900bbf1 xfs: support logging EFIs for realtime extents
Teach the EFI mechanism how to free realtime extents.  We're going to
need this to enforce proper ordering of operations when we enable
realtime rmap.

Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer
ops for rt extents.  This keeps the ondisk artifacts and processing code
completely separate between the rt and non-rt cases.  Hopefully this
will make it easier to debug filesystem problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:42 -08:00
Darrick J. Wong
b57283e1a0 xfs: force swapext to a realtime file to use the file content exchange ioctl
xfs_swap_extent_rmap does not use log items to track the overall
progress of an attempt to swap the extent mappings between two files.
If the system crashes in the middle of swapping a partially written
realtime extent, the mapping will be left in an inconsistent state
wherein a file can point to multiple extents on the rt volume.

The new file range exchange functionality handles this correctly, so all
callers must upgrade to that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:41 -08:00
Darrick J. Wong
e464d8e8bb xfs: store rtgroup information with a bmap intent
Make the bmap intent items take an active reference to the rtgroup
containing the space that is being mapped or unmapped.  We will need
this functionality once we start enabling rmap and reflink on the rt
volume.  Technically speaking we need it even for !rtgroups filesystems
to prevent the (dummy) rtgroup 0 from going away, even though this will
never happen.

As a bonus, we can rework the xfs_bmap_deferred_class tracepoint to use
the xfs_group object to figure out the type and group number, widen the
group block number field to fit 64-bit quantities, and get rid of the
now redundant opdev and rtblock fields.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:41 -08:00
Darrick J. Wong
ee32135148 xfs: grow the realtime section when realtime groups are enabled
Enable growing the rt section when realtime groups are enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:41 -08:00
Darrick J. Wong
a2c2836739 xfs: encode the rtsummary in big endian format
Currently, the ondisk realtime summary file counters are accessed in
units of 32-bit words.  There's no endian translation of the contents of
this file, which means that the Bad Things Happen(tm) if you go from
(say) x86 to powerpc.  Since we have a new feature flag, let's take the
opportunity to enforce an endianness on the file.  Encode the summary
information in big endian format, like most of the rest of the
filesystem.
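
The effect, in miniature, using the userspace htobe32()/be32toh() counterparts
of the kernel's cpu_to_be32()/be32_to_cpu():

/* Before: raw native-endian u32 counters, unreadable after moving the disk
 * between architectures of different endianness.  After: counters are stored
 * big-endian and converted at access time. */
#include <endian.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t incore = 12345;		/* summary count in CPU byte order */

	uint32_t ondisk = htobe32(incore);	/* what now goes to the rtsummary block */
	uint32_t back   = be32toh(ondisk);	/* what the reader converts back */

	printf("incore %u -> ondisk 0x%08x -> incore %u\n", incore, ondisk, back);
	return 0;
}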

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:41 -08:00
Darrick J. Wong
eba42c2c53 xfs: encode the rtbitmap in big endian format
Currently, the ondisk realtime bitmap file is accessed in units of
32-bit words.  There's no endian translation of the contents of this
file, which means that the Bad Things Happen(tm) if you go from (say)
x86 to powerpc.  Since we have a new feature flag, let's take the
opportunity to enforce an endianness on the file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:41 -08:00
Darrick J. Wong
118895aa95 xfs: add block headers to realtime bitmap and summary blocks
Upgrade rtbitmap and rtsummary blocks to have self describing metadata
like most every other thing in XFS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:40 -08:00
Darrick J. Wong
3fa7a6d0c7 xfs: export the geometry of realtime groups to userspace
Create an ioctl so that the kernel can report the status of realtime
groups to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:40 -08:00
Darrick J. Wong
ab7bd650e1 xfs: record rt group metadata errors in the health system
Record the state of per-rtgroup metadata sickness in the rtgroup
structure for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:40 -08:00
Darrick J. Wong
21e62bddf0 xfs: convert sick_map loops to use ARRAY_SIZE
Convert these arrays to use ARRAY_SIZE instead of requiring an empty
sentinel array element at the end.  This saves memory and would have
avoided a bug that worked its way into the next patch.
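
The pattern change, sketched with an invented table type (the kernel's
ARRAY_SIZE macro is equivalent to the one below):

#include <stdio.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

struct sick_entry { int mask; const char *name; };

/* Sentinel style: one wasted element, and forgetting the terminator walks
 * off the end of the array. */
static const struct sick_entry with_sentinel[] = {
	{ 0x1, "sb" }, { 0x2, "agf" }, { 0, NULL },
};

/* ARRAY_SIZE style: no terminator needed, the bound comes from the type. */
static const struct sick_entry no_sentinel[] = {
	{ 0x1, "sb" }, { 0x2, "agf" },
};

int main(void)
{
	for (const struct sick_entry *e = with_sentinel; e->name; e++)
		printf("sentinel walk: %s\n", e->name);

	for (size_t i = 0; i < ARRAY_SIZE(no_sentinel); i++)
		printf("bounded walk:  %s\n", no_sentinel[i].name);
	return 0;
}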

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:40 -08:00
Darrick J. Wong
35537f25d2 xfs: add frextents to the lazysbcounters when rtgroups enabled
Make the free rt extent count a part of the lazy sb counters when the
realtime groups feature is enabled.  This is possible because the patch
to recompute frextents from the rtbitmap during log recovery predates
the code adding rtgroup support, hence we know that the value will
always be correct during runtime.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:40 -08:00
Christoph Hellwig
8458c4944e xfs: add a helper to prevent bmap merges across rtgroup boundaries
Except for the rt superblock, realtime groups do not store any metadata
at the start (or end) of the group.  There is nothing to prevent the
bmap code from merging allocations from multiple groups into a single
bmap record.  Add a helper to check for this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: massage the commit message after pulling this into rtgroups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:40 -08:00
Darrick J. Wong
9bb5127347 xfs: check that rtblock extents do not break rtsupers or rtgroups
Check that rt block pointers do not point to the realtime superblock and
that allocated rt space extents do not cross rtgroup boundaries.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:39 -08:00
Darrick J. Wong
8edde94d64 xfs: export realtime group geometry via XFS_FSOP_GEOM
Export the realtime geometry information so that userspace can query it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:39 -08:00
Darrick J. Wong
76d3be00df xfs: update realtime super every time we update the primary fs super
Every time we update parts of the primary filesystem superblock that are
echoed in the rt superblock, we must update the rt super.  Avoid
changing the log to support logging to the rt device by using ordered
buffers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:39 -08:00
Darrick J. Wong
18618e7100 xfs: check the realtime superblock at mount time
Check the realtime superblock at mount time, to ensure that the label
and uuids actually match the primary superblock on the data device.  If
the rt superblock is good, attach it to the xfs_mount so that the log
can use ordered buffers to keep it in sync with the primary
super on the data device.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:39 -08:00
Darrick J. Wong
96768e9151 xfs: define the format of rt groups
Define the ondisk format of realtime group metadata, and a superblock
for realtime volumes.  rt supers are conditionally enabled by a
predicate function so that they can be disabled if we ever implement
zoned storage support for the realtime volume.

For rt group enabled file systems there is a separate bitmap and summary
file for each group and thus the number of bitmap and summary blocks
needs to be calculated differently.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:39 -08:00
Christoph Hellwig
f220f6da5f xfs: make RT extent numbers relative to the rtgroup
To prepare for adding per-rtgroup bitmap files, make the xfs_rtxnum_t
type encode the RT extent number relative to the rtgroup.  The biggest
part of this is to clearly distinguish between the relative extent number,
which gets masked when converting from a global block number, and length
values, which just have a factor applied to them when converting from
file system blocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:38 -08:00
Darrick J. Wong
dca94251f6 xfs: fix rt device offset calculations for FITRIM
FITRIM on xfs has this bizarro uapi where we flatten all the physically
addressable storage across two block devices into a linear address
space.  In this address space, the realtime device comes immediately
after the data device.  Therefore, the xfs_trim_rtdev_extents has to
convert its input parameters from the linear address space to actual
rtdev block addresses on the realtime volume.

Right now the address space conversion is done in units of rtblocks.
However, a future patchset will convert xfs_rtblock_t to be a segmented
address space (group:blkno) like the data device.  Change the conversion
code to be done in units of daddrs since those will never be segmented.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:38 -08:00
Christoph Hellwig
f8c5a8415f xfs: refactor xfs_rtsummary_blockcount
Make xfs_rtsummary_blockcount take all the required information from
the mount structure and return the number of summary levels from it
as well.  This cleans up many of the callers and prepares for making the
rtsummary files per-rtgroup where they need to look at different values.

This means we recalculate some values in some callers, but as all these
calculations are outside the fast path and cheap, this seems like a
price worth paying.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:38 -08:00
Christoph Hellwig
5a7566c8d6 xfs: refactor xfs_rtbitmap_blockcount
Rename the existing xfs_rtbitmap_blockcount to
xfs_rtbitmap_blockcount_len and add a new xfs_rtbitmap_blockcount wrapper
around it that takes the number of extents from the mount structure.

This will simplify the move to per-rtgroup bitmaps as those will need to
pass in the number of extents per rtgroup instead.
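
The shape of the refactor is a thin wrapper, roughly like this; the types,
fields and block math here are illustrative assumptions, not the kernel's
exact code:

/* The _len variant does the real work for an arbitrary extent count, and the
 * unsuffixed wrapper feeds it the count from the mount so existing callers
 * stay simple.  Per-rtgroup code can later call the _len variant directly. */
#include <stdint.h>
#include <stdio.h>

struct mount_geom {
	uint64_t	rextents;	/* rt extents in the whole volume */
	unsigned int	blockwsize;	/* 32-bit bitmap words per block */
};

uint64_t rtbitmap_blockcount_len(const struct mount_geom *mp, uint64_t rtextents)
{
	uint64_t bits_per_block = (uint64_t)mp->blockwsize * 32;	/* 1 bit per rtx */

	return (rtextents + bits_per_block - 1) / bits_per_block;
}

/* wrapper: same behaviour as before the rename */
uint64_t rtbitmap_blockcount(const struct mount_geom *mp)
{
	return rtbitmap_blockcount_len(mp, mp->rextents);
}

int main(void)
{
	struct mount_geom mp = { .rextents = 1000000, .blockwsize = 1024 };

	printf("%llu bitmap blocks\n", (unsigned long long)rtbitmap_blockcount(&mp));
	return 0;
}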

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:38 -08:00
Christoph Hellwig
bde86b42d2 xfs: factor out a xfs_growfs_check_rtgeom helper
Split the check that the rtsummary fits into the log into a separate
helper, and use xfs_growfs_rt_alloc_fake_mount to calculate the new RT
geometry.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: avoid division for the 0-rtx growfs check]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:38 -08:00
Christoph Hellwig
fc233f1fb0 xfs: use xfs_growfs_rt_alloc_fake_mount in xfs_growfs_rt_alloc_blocks
Use xfs_growfs_rt_alloc_fake_mount instead of manually recalculating
the RT bitmap geometry.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:37 -08:00
Darrick J. Wong
1029f08dc5 xfs: factor out a xfs_growfs_rt_alloc_fake_mount helper
Split the code to set up a fake mount point to calculate new RT
geometry out of xfs_growfs_rt_bmblock so that it can be reused.

Note that this changes the rmblocks calculation method to be based
on the passed in rblocks and extsize and not the explicitly passed
one, but both methods will always lead to the same result.  The new
version just does a little bit more math while being more general.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:37 -08:00
Christoph Hellwig
cb9cd6e56e xfs: calculate RT bitmap and summary blocks based on sb_rextents
Use the on-disk rextents to calculate the bitmap and summary blocks
instead of the calculated one so that we can refactor the helpers for
calculating them.

As the RT bitmap and summary scrubbers already check that sb_rextents
match the block count this does not change coverage of the scrubber.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:37 -08:00
Darrick J. Wong
c1442d22a0 xfs: remove XFS_ILOCK_RT*
Now that we've centralized the realtime metadata locking routines, get
rid of the ILOCK subclasses since we now use explicit lockdep classes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:37 -08:00
Christoph Hellwig
ae897e0bed xfs: support creating per-RTG files in growfs
To support adding new RT groups in growfs, we need to be able to create
the per-RT group files.  Add a new xfs_rtginode_create helper to create
a given per-RTG file.  Most of the code for that is shared, but the
details of the actual file are abstracted out using a new create method
in struct xfs_rtginode_ops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:37 -08:00
Christoph Hellwig
e3088ae2dc xfs: move RT bitmap and summary information to the rtgroup
Move the pointers to the RT bitmap and summary inodes as well as the
summary cache to the rtgroups structure to prepare for having a
separate bitmap and summary inodes for each rtgroup.

Code using the inodes now needs to operate on a rtgroup.  Where easily
possible such code is converted to iterate over all rtgroups, else
rtgroup 0 (the only one that can currently exist) is hardcoded.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:37 -08:00
Christoph Hellwig
c8edf1cbef xfs: split xfs_trim_rtdev_extents
Split xfs_trim_rtdev_extents into two parts to prepare for reusing the
main validation also for RT group aware file systems.

Use the fully featured xfs_daddr_to_rtb helper to convert from a daddr
to an xfs_rtblock_t to prepare for segmented addressing in RT groups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:36 -08:00
Christoph Hellwig
d6d5c90ada xfs: cleanup xfs_getfsmap_rtdev_rtbitmap
Use mp->m_sb.sb_rblocks to calculate the end instead of sb_rextents, which
needs a conversion, use consistent names for xfs_rtblock_t types, and
only calculate them when they are needed.  Remove the pointless
"high" local variable that only has a single user.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:36 -08:00
Christoph Hellwig
9154b5008c xfs: factor out a xfs_growfs_rt_alloc_blocks helper
Split out a helper to allocate or grow the rtbitmap and rtsummary files
in preparation of per-RT group bitmap and summary files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:36 -08:00
Christoph Hellwig
cd8d049082 xfs: add a xfs_qm_unmount_rt helper
RT group enabled file systems fix the bug where we pointlessly attach
quotas to the RT bitmap and summary files.  Split the code to detach the
quotas into a helper, make it conditional and document the differing
behavior for RT group and pre-RT group file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:36 -08:00
Christoph Hellwig
9c3cfb9c96 xfs: add a xfs_bmap_free_rtblocks helper
Split the RT extent freeing logic from xfs_bmap_del_extent_real because
it will become more complicated when adding RT group support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:36 -08:00
Darrick J. Wong
cd5b26f0c0 xfs: add rtgroup-based realtime scrubbing context management
Create a state tracking structure and helpers to initialize the tracking
structure so that we can check metadata records against the realtime
space management metadata.  Right now this is limited to grabbing the
incore rtgroup object, but we'll eventually add to the tracking
structure the ILOCK state and btree cursors.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:36 -08:00
Darrick J. Wong
0d2c636e48 xfs: repair metadata directory file path connectivity
Fix disconnected or incorrect metadata directory paths.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:35 -08:00
Darrick J. Wong
65b1231b8c xfs: support caching rtgroup metadata inodes
Create the necessary per-rtgroup infrastructure that we need to load
metadata inodes into memory.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:35 -08:00
Darrick J. Wong
c29237a65c xfs: add a lockdep class key for rtgroup inodes
Add a dynamic lockdep class key for rtgroup inodes.  This will enable
lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking
order.  Each class can have 8 subclasses, and for now we will only have
2 inodes per group.  This enables rtgroup order and inode order checks
when nesting ILOCKs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:35 -08:00
Darrick J. Wong
0e4875b3fb xfs: define locking primitives for realtime groups
Define helper functions to lock all metadata inodes related to a
realtime group.  There's not much to look at now, but this will become
important when we add per-rtgroup metadata files and online fsck code
for them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:35 -08:00
Darrick J. Wong
87fe4c34a3 xfs: create incore realtime group structures
Create an incore object that will contain information about a realtime
allocation group.  This will eventually enable us to shard the realtime
section in a similar manner to how we shard the data section, but for
now just a single object for the entire RT subvolume is created.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:35 -08:00
Christoph Hellwig
dcfc65befb xfs: clean up xfs_getfsmap_helper arguments
The calling conventions for xfs_getfsmap_helper are confusing -- callers
pass in an rmap record, but they must also supply startblock and
blockcount in daddr units.  This was bolted onto the original fsmap
implementation so that we could report *something* for realtime
volumes, which do not support rmap and hence can draw only from the rt
free space bitmap.  Free space on the rt volume can be more than 2^32
fsblocks long, which means that we can't use the rmap startblock or
blockcount fields.

This is confusing for callers, because they must supply redundant
data, but not all of it is used.  Streamline this by creating a separate
fsmap irec structure that contains exactly the data we need, once.

Note that we actually do need rm_startblock for rmap key comparisons
when we're actually querying an rmap btree, so leave that field but
document why it's there.

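A purely illustrative sketch of what such a trimmed-down irec might carry;
every field name below except rm_startblock (which the message calls out)
is hypothetical:

	struct xfs_fsmap_irec {
		xfs_daddr_t	rec_daddr;	/* extent start, in daddrs */
		xfs_daddr_t	len_daddr;	/* extent length, in daddrs */
		xfs_agblock_t	rm_startblock;	/* kept for rmap key comparisons */
		uint64_t	rm_owner;	/* rmap owner */
		uint64_t	rm_offset;	/* offset within the owner */
		unsigned int	rm_flags;	/* rmap state flags */
	};
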
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:35 -08:00
Darrick J. Wong
87b7c205da xfs: confirm dotdot target before replacing it during a repair
xfs_dir_replace trips an assertion if you tell it to change a dirent to
point to an inumber that it already points at.  Look up the dotdot entry
directly to confirm that we need to make a change.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:34 -08:00
Darrick J. Wong
b3c03efa59 xfs: check metadata directory file path connectivity
Create a new scrubber type that checks that well known metadata
directory paths are connected to the metadata inode that the incore
structures think is in use.  For example, check that "/quota/user" in
the metadata directory tree actually points to
mp->m_quotainfo->qi_uquotaip->i_ino.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:34 -08:00
Darrick J. Wong
9dc31acb01 xfs: move repair temporary files to the metadata directory tree
Due to resource acquisition rules, we have to create the ondisk
temporary files used to stage a filesystem repair before we can acquire
a reference to the inode that we actually want to repair.  Therefore,
we do not know at tempfile creation time whether the tempfile will
belong to the regular directory tree or the metadata directory tree.

This distinction becomes important when the swapext code tries to figure
out the quota accounting of the two files whose mappings are being
swapped.  The swapext code assumes that accounting updates are required
for a file if dqattach attaches dquots.  Metadir files are never
accounted in quota, which means that swapext must not update the quota
accounting when swapping in a repaired directory/xattr/rtbitmap structure.

Prior to the swapext call, therefore, both files must be marked as
METADIR for dqattach so that dqattach will ignore them.  Add support for
a repair tempfile to be switched to the metadir tree and switched back
before being released so that ifree will just free the file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:34 -08:00
Darrick J. Wong
dcde94bdee xfs: check the metadata directory inumber in superblocks
When metadata directories are enabled, make sure that the secondary
superblocks point to the metadata directory.  This isn't strictly
required because the secondaries are only used to recover damaged
filesystems, and the metadir root inumber is fixed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:34 -08:00
Darrick J. Wong
3d2c341111 xfs: scrub metadata directories
Teach online scrub about the metadata directory tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:34 -08:00
Darrick J. Wong
5dab2daa8a xfs: fix di_metatype field of inodes that won't load
Make sure that the di_metatype field is at least set plausibly so that
later scrubbers could set the real type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:34 -08:00
Darrick J. Wong
aec2eb7da8 xfs: adjust parent pointer scrubber for sb-rooted metadata files
Starting with the metadata directory feature, we're allowed to call the
directory and parent pointer scrubbers for every metadata file,
including the ones that are children of the superblock.

For these children, checking the link count against the number of parent
pointers is a bit funny -- there's no such thing as a parent pointer for
a child of the superblock since there's no corresponding dirent.  For
purposes of validating nlink, we pretend that there is a parent pointer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:33 -08:00
Darrick J. Wong
91fb4232be xfs: metadata files can have xattrs if metadir is enabled
If parent pointers are enabled, then metadata files will store parent
pointers in xattrs, just like files in the user visible directory tree.
Therefore, scrub and repair need to handle attr forks for metadata files
on metadir filesystems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:33 -08:00
Darrick J. Wong
13af229ee0 xfs: do not count metadata directory files when doing online quotacheck
Previously, we stated that files in the metadata directory tree are not
counted in the dquot information.  Fix the online quotacheck code to
reflect this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:33 -08:00
Darrick J. Wong
679b098b59 xfs: refactor directory tree root predicates
Metadata directory trees make reasoning about the parent of a file more
difficult.  Traditionally, user files are children of sb_rootino, and
metadata files are "children" of the superblock.  Now, we add a third
possibility -- some metadata files can be children of sb_metadirino, but
the classic ones (rt free space data and quotas) are left alone.

Let's add some helper functions (instead of open-coding the logic
everywhere) to make scrub logic easier to understand.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:33 -08:00
Darrick J. Wong
be42fc1393 xfs: record health problems with the metadata directory
Make a report to the health monitoring subsystem any time we encounter
something in the metadata directory tree that looks like corruption.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:33 -08:00
Darrick J. Wong
61b6bdb30a xfs: adjust xfs_bmap_add_attrfork for metadir
Online repair might use the xfs_bmap_add_attrfork to repair a file in
the metadata directory tree if (say) the metadata file lacks the correct
parent pointers.  In that case, it is not correct to check that the file
is dqattached -- metadata files must not have /any/ dquots attached at
all.  Adjust the assertions appropriately.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:32 -08:00
Darrick J. Wong
cc0cf84aa7 xfs: mark quota inodes as metadata files
When we're creating quota files at mount time, make sure to mark them as
metadir inodes if appropriate.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:32 -08:00
Darrick J. Wong
382e275f0e xfs: don't count metadata directory files to quota
Files in the metadata directory tree are internal to the filesystem.
Don't count the inodes or the blocks they use in the root dquot because
users do not need to know about their resource usage.  This will also
quiet down complaints about dquot usage not matching du output.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:32 -08:00
Darrick J. Wong
df866c538f xfs: allow bulkstat to return metadata directories
Allow the V5 bulkstat ioctl to return information about metadata
directory files so that xfs_scrub can find and scrub them, since they
are otherwise ordinary directories.

(Metadata files of course require per-file scrub code and hence do not
need exposure.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:32 -08:00
Darrick J. Wong
688828d8f8 xfs: advertise metadata directory feature
Advertise the existence of the metadata directory feature; this will be
used by scrub to decide if it needs to scan the metadir too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:32 -08:00
Darrick J. Wong
bb6cdd5529 xfs: hide metadata inodes from everyone because they are special
Metadata inodes are private files and therefore cannot be exposed to
userspace.  This means no bulkstat, no open-by-handle, no linking them
into the directory tree, and no feeding them to LSMs.  As such, we mark
them S_PRIVATE, which stops all that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:32 -08:00
Darrick J. Wong
8651b410ae xfs: disable the agi rotor for metadata inodes
Ideally, we'd put all the metadata inodes in one place if we could, so
that the metadata all stay reasonably close together instead of
spreading out over the disk.  Furthermore, if the log is internal we'd
probably prefer to keep the metadata near the log.  Therefore, disable
AGI rotoring for metadata inode allocations.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:31 -08:00
Darrick J. Wong
5d9b54a4ef xfs: read and write metadata inode directory tree
Plumb in the bits we need to load metadata inodes from a named entry in
a metadir directory, create (or hardlink) inodes into a metadir
directory, create metadir directories, and flag inodes as being metadata
files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:31 -08:00
Darrick J. Wong
7297fd0beb xfs: enforce metadata inode flag
Add checks for the metadata inode flag so that we don't ever leak
metadata inodes out to userspace, and we don't ever try to read a
regular inode as metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:31 -08:00
Darrick J. Wong
c555dd9b8c xfs: load metadata directory root at mount time
Load the metadata directory root inode into memory at mount time and
release it at unmount time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:31 -08:00
Darrick J. Wong
dcf6069143 xfs: iget for metadata inodes
Create a xfs_trans_metafile_iget function for metadata inodes to ensure
that when we try to iget a metadata file, the inode is allocated and its
file mode matches the metadata file type the caller expects.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:31 -08:00
Darrick J. Wong
4f3d4dd1b0 xfs: define the on-disk format for the metadir feature
Define the on-disk layout and feature flags for the metadata inode
directory feature.  Add a xfs_sb_version_hasmetadir for benefit of
xfs_repair, which needs to know where the new end of the superblock
lies.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:31 -08:00
Christoph Hellwig
e5e5cae05b xfs: store a generic group structure in the intents
Replace the pag pointers in the extent free, bmap, rmap and refcount
intent structures with a pointer to the generic group to prepare
for adding intents for realtime groups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:30 -08:00
Darrick J. Wong
ecc8065dfa xfs: standardize EXPERIMENTAL warning generation
Refactor the open-coded warnings about EXPERIMENTAL feature use into a
standard helper before we go adding more experimental features.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:30 -08:00
Darrick J. Wong
4d272929a5 xfs: rename metadata inode predicates
The predicate xfs_internal_inum tells us if an inumber refers to one of
the inodes rooted in the superblock.  Soon we're going to have internal
inodes in a metadata directory tree, so this helper should be renamed
to capture its limited scope.

Ondisk inodes will soon have a flag to indicate that they're metadata
inodes.  Head off some confusion by renaming the xfs_is_metadata_inode
predicate to xfs_is_internal_inode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:30 -08:00
Darrick J. Wong
fdf5703b61 xfs: constify the xfs_inode predicates
Change the xfs_inode predicates to take a const struct xfs_inode pointer
because they do not change the inode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:30 -08:00
Darrick J. Wong
8d939f4bd7 xfs: constify the xfs_sb predicates
Change the xfs_sb predicates to take a const struct xfs_sb pointer
because they do not change the superblock.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:30 -08:00
Christoph Hellwig
ba102a682d xfs: remove xfs_group_intent_hold and xfs_group_intent_rele
Each of them just has a single caller, so fold them into their callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:29 -08:00
Christoph Hellwig
759cc1989a xfs: add group based bno conversion helpers
Add/move the blocks, blklog and blkmask fields to the generic groups
structure so that code can work with AGs and RTGs by just using the
right index into the array.

Then, add convenience helpers to convert block numbers based on the
generic group.  This will allow writing code that doesn't care if it is
used on AGs or the upcoming realtime groups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:29 -08:00
Christoph Hellwig
198febb9fe xfs: store a generic xfs_group pointer in xfs_getfsmap_info
Replace the pag and rtg pointers with a generic group pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:29 -08:00
Christoph Hellwig
77a530e6c4 xfs: add a generic group pointer to the btree cursor
Replace the pag pointers in the type specific union with a generic
xfs_group pointer.  This prepares for adding realtime group support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:29 -08:00
Christoph Hellwig
adbc76aa0f xfs: convert busy extent tracking to the generic group structure
Split busy extent tracking from struct xfs_perag into its own private
structure, which can be pointed to by the generic group structure.

Note that this structure is now dynamically allocated instead of embedded
as the upcoming zone XFS code doesn't need it and will also have an
unusually high number of groups due to hardware constraints.  Dynamically
allocating the structure is a big memory saver for this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:29 -08:00
Christoph Hellwig
0e10cb98f1 xfs: convert extent busy tracepoints to the generic group structure
Prepare for tracking busy RT extents by passing the generic group
structure to the xfs_extent_busy_class tracepoints.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:29 -08:00
Christoph Hellwig
6af1300d47 xfs: return the busy generation from xfs_extent_busy_list_empty
This avoids having to poke into the internals of the busy tracking in
xrep_setup_ag_allocbt.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:28 -08:00
Christoph Hellwig
eb4a84a3c2 xfs: move the online repair rmap hooks to the generic group structure
Prepare for the upcoming realtime groups feature by moving the online
repair rmap hooks to the generic xfs_group structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:28 -08:00
Christoph Hellwig
34cf3a6f39 xfs: move draining of deferred operations to the generic group structure
Prepare for supporting the upcoming realtime groups feature by moving the
deferred operation draining to the generic xfs_group structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:28 -08:00
Christoph Hellwig
2ed27a5464 xfs: mark xfs_perag_intent_{hold,rele} static
These two functions are only used inside of xfs_drain.c, so mark them
static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:28 -08:00
Christoph Hellwig
5c8483cec3 xfs: move metadata health tracking to the generic group structure
Prepare for also tracking the health status of the upcoming realtime
groups by moving the health tracking code to the generic xfs_group
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:28 -08:00
Christoph Hellwig
86437e6abb xfs: switch perag iteration from the for_each macros to a while based iterator
The current for_each_perag* macros are a bit annoying in that they
require the caller to provide both an object and an index iterator, and
also somewhat obfuscate the underlying control flow mechanism.

Switch to open coded while loops using new xfs_perag_next{,_from,_range}
helpers that return the next pag structure to iterate on based on the
previous one or NULL for the loop start.

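A minimal sketch of the new iteration style described above; the exact
argument order of xfs_perag_next is assumed here:

	struct xfs_perag	*pag = NULL;

	while ((pag = xfs_perag_next(mp, pag))) {
		/* work with pag; the helper releases the previous reference */
	}
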
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:28 -08:00
Christoph Hellwig
d66496578b xfs: insert the pag structures into the xarray later
Cleaning up is much easier if a structure can't be looked up yet, so only
insert the pag once it is fully set up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:27 -08:00
Christoph Hellwig
819928770b xfs: add a xfs_group_next_range helper
Add a helper to iterate over all groups, which can be used
as a simple while loop:

	struct xfs_group		*xg = NULL;

	while ((xg = xfs_group_next_range(mp, xg, 0, MAX_GROUP))) {
		...
	}

This will be wrapped by the realtime group code first, and eventually
replace the for_each_rtgroup_from and for_each_rtgroup_range helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:27 -08:00
Christoph Hellwig
201c5fa342 xfs: split xfs_initialize_perag
Factor out a xfs_perag_alloc helper that allocates a single perag
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:27 -08:00
Christoph Hellwig
e9c4d8bfb2 xfs: factor out a generic xfs_group structure
Split the lookup and refcount handling of struct xfs_perag into an
embedded xfs_group structure that can be reused for the upcoming
realtime groups.

It will be extended with more features later.

Note that the xg_type field will only need a single bit even with
realtime group support.  For now it fills a hole, but it might be
worth folding it into another field if we can use this space better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:27 -08:00
Christoph Hellwig
c4ae021bcb xfs: convert remaining trace points to pass pag structures
Convert all tracepoints that take [mp,agno] tuples to take a pag argument
instead so that decoding only happens when tracepoints are enabled and to
clean up the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:27 -08:00
Christoph Hellwig
0a4d79741d xfs: factor out a xfs_iwalk_args helper
Add a helper to share more code between xfs_iwalk and xfs_inobt_walk,
and at the same time do away with the extra flags indirection so that
everyone uses the same names for the same flags when using the common
iwalk code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:27 -08:00
Christoph Hellwig
dc8df7e382 xfs: pass the pag to the xrep_newbt_extent_class tracepoints
This requires moving a few of the callsites a little bit to ensure that
we already have the reference, but allows for the decoding to only happen
when tracing is actually enabled, and cleans up the callsites a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:26 -08:00
Christoph Hellwig
934dde65b2 xfs: pass the pag to the trace_xrep_calc_ag_resblks{,_btsize} trace points
This requires holding the pag refcount a little longer, but allows for the
decoding to only happen when tracing is actually enabled, and cleans up the
callsites a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:26 -08:00
Christoph Hellwig
618a27a94d xfs: pass objects to the xrep_ibt_walk_rmap tracepoint
Pass the perag structure and the irec so that the decoding is only done
when tracing is actually enabled and the call sites look a lot neater,
and remove the pointless class indirection.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:26 -08:00
Christoph Hellwig
1209d360eb xfs: pass the iunlink item to the xfs_iunlink_update_dinode trace point
So that decoding is only done when tracing is actually enabled and the
call site looks a lot neater.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:26 -08:00
Christoph Hellwig
487092ceaa xfs: pass objects to the xfs_irec_merge_{pre,post} trace points
Pass the perag structure and the irec to these tracepoints so that the
decoding is only done when tracing is actually enabled and the call sites
look a lot neater.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:26 -08:00
Christoph Hellwig
835ddb592f xfs: pass a perag structure to the xfs_ag_resv_init_error trace point
And remove the single instance class indirection for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:26 -08:00
Christoph Hellwig
2337ac79e9 xfs: constify pag arguments to trace points
Trace points never modify their arguments.  Mark all the pag objects
passed to trace points as const.  The exception is the xfs_ag_resv_class, which
uses the xfs_perag_resv helper that can't be marked const due to
other users modifying the returned structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:25 -08:00
Christoph Hellwig
3c39444939 xfs: remove the unused xrep_bmap_walk_rmap trace point
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:25 -08:00
Christoph Hellwig
c896fb44f6 xfs: remove the unused trace_xfs_iwalk_ag trace point
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:25 -08:00
Christoph Hellwig
8dcf5e617f xfs: remove the mount field from struct xfs_busy_extents
The mount field is only passed to xfs_extent_busy_clear, which never uses
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:25 -08:00
Christoph Hellwig
4a137e0915 xfs: keep a reference to the pag for busy extents
Processing of busy extents requires the perag structure, so keep the
reference while they are in flight.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:25 -08:00
Christoph Hellwig
b6dc8c6dd2 xfs: pass a pag to xfs_extent_busy_{search,reuse}
Replace the [mp,agno] tuple with the perag structure, which will become
more useful later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:25 -08:00
Christoph Hellwig
6abd82ab6e xfs: add a xfs_agino_to_ino helper
Add a helper to convert an agino to an ino based on a pag structure.

This provides a simpler conversion and better type safety compared to the
existing code that passes the mount structure and the agno separately.

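A hedged sketch of what such a pag-based wrapper might look like, built on
the existing XFS_AGINO_TO_INO macro (placement and exact form assumed):

	static inline xfs_ino_t
	xfs_agino_to_ino(struct xfs_perag *pag, xfs_agino_t agino)
	{
		return XFS_AGINO_TO_INO(pag->pag_mount, pag->pag_agno, agino);
	}
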
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:24 -08:00
Christoph Hellwig
856a920ac2 xfs: add xfs_agbno_to_fsb and xfs_agbno_to_daddr helpers
Add helpers to convert an agbno to a daddr or fsbno based on a pag
structure.

This provides a simpler conversion and better type safety compared to the
existing code that passes the mount structure and the agno separately.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:24 -08:00
Christoph Hellwig
db129fa011 xfs: remove the agno argument to xfs_free_ag_extent
xfs_free_ag_extent already has a pointer to the pag structure through
the agf buffer.  Use that instead of passing the redundant argument,
and do the same for the tracepoint.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:24 -08:00
Christoph Hellwig
67ce5ba575 xfs: pass a pag to xfs_difree_inode_chunk
We'll want to use more than just the agno field in a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:24 -08:00
Christoph Hellwig
9943b45732 xfs: remove the unused pag_active_wq field in struct xfs_perag
pag_active_wq is only woken, but never waited for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:24 -08:00
Christoph Hellwig
4e071d79e4 xfs: remove the unused pagb_count field in struct xfs_perag
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:23 -08:00
Christoph Hellwig
cd8ae42a82 xfs: fix superfluous clearing of info->low in __xfs_getfsmap_datadev
The for_each_perag helpers update the agno passed in for each iteration,
and thus the "if (pag->pag_agno == start_ag)" check will always be true.

Add another variable for the loop iterator so that the field is only
cleared after the first iteration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05 13:38:23 -08:00
Darrick J. Wong
62027820eb xfs: fix simplify extent lookup in xfs_can_free_eofblocks
In commit 11f4c3a53a, we tried to simplify the extent lookup in
xfs_can_free_eofblocks so that it doesn't incur the overhead of all the
extra stuff that xfs_bmapi_read does around the iext lookup.

Unfortunately, this causes regressions on generic/603, xfs/108,
generic/219, xfs/173, generic/694, xfs/052, generic/230, and xfs/441
when always_cow is turned on.  In all cases, the regressions take the
form of alwayscow files consuming rather more space than the golden
output is expecting.  I observed that in all these cases, the excess
space usage was due to CoW fork delalloc reservations that go
beyond EOF.

For alwayscow files we allow posteof delalloc CoW reservations because
all writes go through the CoW fork.  Recall that all extents in the CoW
fork are accounted for via i_delayed_blks, which means that prior to
this patch, we'd invoke xfs_free_eofblocks on first close if anything
was in the CoW fork.  Now we don't do that.

Fix the problem by reverting the removal of the i_delayed_blks check.

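A hedged sketch of the restored check in xfs_can_free_eofblocks (comment
and placement are assumptions, not the verbatim fix):

	/*
	 * CoW fork extents of any kind are accounted in i_delayed_blks, so
	 * any outstanding delalloc means post-EOF blocks may need freeing.
	 */
	if (ip->i_delayed_blks)
		return true;
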
Cc: <stable@vger.kernel.org> # v6.12-rc1
Fixes: 11f4c3a53a ("xfs: simplify extent lookup in xfs_can_free_eofblocks")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05 13:38:23 -08:00
Christoph Hellwig
fe4e0faac9 xfs: remove xfs_page_mkwrite_iomap_ops
Share the regular buffered write iomap_ops with the page fault path
and just check for the IOMAP_FAULT flag to skip delalloc punching.

This keeps the delalloc punching checks in one place, and will make it
easier to convert iomap to an iter model where the begin and end
handlers are merged into a single callback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05 13:52:57 +01:00
Christoph Hellwig
a7fd3327d3 xfs: remove __xfs_filemap_fault
xfs_filemap_huge_fault only ever serves DAX faults, so hard code the
call to xfs_dax_read_fault and open code __xfs_filemap_fault in the
only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05 13:52:57 +01:00
Christoph Hellwig
1eb6fc0447 xfs: split write fault handling out of __xfs_filemap_fault
Only two of the callers of __xfs_filemap_fault ever handle read faults.
Split the write_fault handling out of __xfs_filemap_fault so that all
callers call that directly either conditionally or unconditionally and
only leave the read fault handling in __xfs_filemap_fault.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05 13:52:57 +01:00
Christoph Hellwig
1171de3296 xfs: split the page fault trace event
Split the xfs_filemap_fault trace event into separate ones for read and
write faults and move them into the applicable locations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05 13:52:57 +01:00
Dave Chinner
59e43f5479 xfs: sb_spino_align is not verified
It's just read in from the superblock and used without doing any
validity checks at all on the value.

Fixes: fb4f2b4e5a ("xfs: add sparse inode chunk alignment superblock field")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05 13:51:59 +01:00
Christoph Hellwig
792ef2745d xfs: simplify sector number calculation in xfs_zero_extent
xfs_zero_extent does some really odd gymnastics to calculate the block
layer sector numbers passed to blkdev_issue_zeroout.  This is because it
used to call sb_issue_zeroout and the calculations in that helper got
open coded here in the rather misleadingly named commit 3dc2916107
("dax: use sb_issue_zerout instead of calling dax_clear_sectors").

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05 13:51:59 +01:00
Long Li
8b9b261594 xfs: remove the redundant xfs_alloc_log_agf
There are two invocations of xfs_alloc_log_agf in xfs_alloc_put_freelist.
The AGF does not change between the two calls. Although this does not pose
any practical problems, it seems like a small mistake. Therefore, fix it
by removing the first xfs_alloc_log_agf invocation.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05 13:51:55 +01:00
John Garry
3af5298ce9 xfs: Support setting FMODE_CAN_ATOMIC_WRITE
Set the FMODE_CAN_ATOMIC_WRITE flag if we can do atomic writes to that inode.

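A minimal sketch of what the file open path might do; the predicate name is
an assumption:

	if (xfs_inode_can_atomicwrite(ip))
		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
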
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>	 #On ppc64
2024-11-04 16:22:11 -08:00
John Garry
f096207d32 xfs: Validate atomic writes
Validate that an atomic write adheres to length/offset rules. Currently
we can only write a single FS block.

For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_write_iter(),
FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this,
ATOMICWRITES flags would also need to be set for the inode.

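A hedged sketch of the length/offset rules described above (names and exact
placement are assumptions):

	if (iocb->ki_flags & IOCB_ATOMIC) {
		size_t	len = iov_iter_count(from);

		/* for now: exactly one FS block, naturally aligned */
		if (len != ip->i_mount->m_sb.sb_blocksize ||
		    !IS_ALIGNED(iocb->ki_pos, len))
			return -EINVAL;
	}
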
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-04 16:22:10 -08:00
John Garry
6432c6e723 xfs: Support atomic write for statx
Support providing info on atomic write unit min and max for an inode.

For simplicity, we currently limit the min to the FS block size. As for
the max, we also limit it to the FS block size, as there is no current
method to guarantee extent alignment or granularity for regular files.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-04 16:22:10 -08:00
Al Viro
8152f82010 fdget(), more trivial conversions
all failure exits prior to fdget() leave the scope, all matching fdput()
are immediately followed by leaving the scope.

[xfs_ioc_commit_range() chunk moved here as well]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
0d113fcbc2 simplify xfs_find_handle() a bit
XFS_IOC_FD_TO_HANDLE can grab a reference to copied ->f_path and
let the file go; results in simpler control flow - cleanup is
the same for both "by descriptor" and "by pathname" cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Linus Torvalds
f6a7b4ec74 XFS bug fies for 6.12-rc6
 * fix a syzbot reported crash on filestreams
 * Reduce cpu time spent searching for extents in
   a very fragmented FS
 * Check for delayed allocations before setting extsize
 
 Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZyIMDwAKCRBGdaER5Qtf
 pllxAYCkk+mtDTD5xBfOVGZWO5MMFz8HqYcro5wrSCzgL8HDmW29kXTBYFviGn3R
 3l/H6BEBgOk0EkI5qGOzijpzbsWyJeLzPzZtxQFPD8zFBdxSERCtbpqFDLLvLQWG
 M+TLhUNkPQ==
 =kKX4
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

 - fix a syzbot reported crash on filestreams

 - Reduce cpu time spent searching for extents in a very fragmented FS

 - Check for delayed allocations before setting extsize

* tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: streamline xfs_filestream_pick_ag
  xfs: fix finding a last resort AG in xfs_filestream_pick_ag
  xfs: Reduce unnecessary searches when searching for the best extents
  xfs: Check for delayed allocations before setting extsize
2024-11-02 09:22:16 -10:00
Linus Torvalds
17fa6a5f93 vfs-6.12-rc6.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGVAAKCRCRxhvAZXjc
 oltEAP9r8cWa3Tdv8DzMNWu/jezTUXoW/mX5Qe+c1L6faqj0WQD/dIVtBtG37Tfq
 3Ci9F/GEWjKijtCQ5lwMGUq27jQJ1gk=
 =/0iA
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull iomap fixes from Christian Brauner:
 "Fixes for iomap to prevent data corruption bugs in the fallocate
  unshare range implementation of fsdax and a small cleanup to turn
  iomap_want_unshare_iter() into an inline function"

* tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  iomap: turn iomap_want_unshare_iter into an inline function
  fsdax: dax_unshare_iter needs to copy entire blocks
  fsdax: remove zeroing code from dax_unshare_iter
  iomap: share iomap_unshare_iter predicate code with fsdax
  xfs: don't allocate COW extents when unsharing a hole
2024-11-01 07:45:00 -10:00
Christoph Hellwig
81a1e1c32e xfs: streamline xfs_filestream_pick_ag
Directly return the error from xfs_bmap_longest_free_extent instead
of breaking from the loop and handling it there, and use a done
label to directly jump to the exist when we found a suitable perag
structure to reduce the indentation level and pag/max_pag check
complexity in the tail of the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-30 11:27:18 +01:00
Christoph Hellwig
dc60992ce7 xfs: fix finding a last resort AG in xfs_filestream_pick_ag
When the main loop in xfs_filestream_pick_ag fails to find a suitable
AG it tries to just pick the online AG.  But the loop for that uses
args->pag as loop iterator while the later code expects pag to be
set.  Fix this by reusing the max_pag case for this last resort, and
also add a check for the impossible case of no AG just to make sure that
the uninitialized pag doesn't escape even in theory.

Reported-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Fixes: f8f1ed1ab3 ("xfs: return a referenced perag from filestreams allocator")
Cc: <stable@vger.kernel.org> # v6.3
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-30 11:27:18 +01:00
Chi Zhiling
3ef2268403 xfs: Reduce unnecessary searches when searching for the best extents
Recently, we found that the CPU spent a lot of time in
xfs_alloc_ag_vextent_size when the filesystem has millions of fragmented
spaces.

The reason is that we conducted much extra searching for extents that
could not yield a better result, and these searches would cost a lot of
time when there were millions of extents to search through. Even if we
get the same result length, we don't switch our choice to the new one,
so we can definitely terminate the search early.

Since the result length cannot exceed the found length, when the found
length equals the best result length we already have, we can conclude
the search.

We did a test in that filesystem:
[root@localhost ~]# xfs_db -c freesp /dev/vdb
   from      to extents  blocks    pct
      1       1     215     215   0.01
      2       3  994476 1988952  99.99

Before this patch:
 0)               |  xfs_alloc_ag_vextent_size [xfs]() {
 0) * 15597.94 us |  }

After this patch:
 0)               |  xfs_alloc_ag_vextent_size [xfs]() {
 0)   19.176 us    |  }

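In code terms the early termination described above boils down to something
like this (variable names hypothetical):

	/*
	 * The usable length can never exceed the found length, so once a
	 * candidate matches the best result we already have, stop searching.
	 */
	if (flen == bestrlen)
		break;
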
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-30 11:27:18 +01:00
Ojaswin Mujoo
2a492ff666 xfs: Check for delayed allocations before setting extsize
Extsize should only be allowed to be set on files with no data in them.
For this, we check whether the file has extents but miss checking whether
delayed extents are present. This patch adds that check.

While we are at it, also refactor this check into a helper since
it's used in some other places as well like xfs_inactive() or
xfs_ioctl_setattr_xflags()

**Without the patch (SUCCEEDS)**

$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)

**With the patch (FAILS as expected)**

$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument

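A hedged sketch of such a helper; the name xfs_inode_has_filedata and its
exact form are assumptions:

	/* does this file have any data extents or delalloc reservations? */
	static inline bool
	xfs_inode_has_filedata(const struct xfs_inode *ip)
	{
		return ip->i_df.if_nextents > 0 || ip->i_delayed_blks > 0;
	}
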
Fixes: e94af02a9c ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-30 11:27:18 +01:00
Christoph Hellwig
4a201dcfa1 xfs: update the pag for the last AG at recovery time
Currently log recovery never updates the in-core perag values for the
last allocation group when they were grown by growfs.  This leads to
btree record validation failures for the alloc, ialloc or finobt
trees if a transaction references this new space.

Found by Brian's new growfs recovery stress test.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:19 +02:00
Christoph Hellwig
069cf5e32b xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag
__GFP_RETRY_MAYFAIL increases the likelihood that allocations fail,
which isn't really helpful during log recovery.  Remove the flag and
stick to the default GFP_KERNEL policies.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
b882b0f813 xfs: error out when a superblock buffer update reduces the agcount
XFS currently does not support reducing the agcount, so error out if
a logged sb buffer tries to shrink the agcount.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
6a18765b54 xfs: update the file system geometry after recoverying superblock buffers
Primary superblock buffers that change the file system geometry after a
growfs operation can affect the operation of later CIL checkpoints that
make use of the newly added space and allocation groups.

Apply the changes to the in-memory structures as part of recovery pass 2,
to ensure recovery works fine for such cases.

In the future we should apply the logic to other updates such as features
bits as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
aa67ec6a25 xfs: merge the perag freeing helpers
There is no good reason to have two different routines for freeing perag
structures for the unmount and error cases.  Add two arguments to specify
the range of AGs to free to xfs_free_perag, and use that to replace
xfs_free_unused_perag_range.

The additional RCU grace period for the error case is harmless, and the
extra check for the AG to actually exist is not required now that the
callers pass the exact known allocated range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
82742f8c3f xfs: pass the exact range to initialize to xfs_initialize_perag
Currently only the new agcount is passed to xfs_initialize_perag, which
requires lookups of existing AGs to skip them and complicates error
handling.  Also pass the previous agcount so that the range that
xfs_initialize_perag operates on is exactly defined.  That way the
extra lookups can be avoided, and error handling can clean up the
exact range from the old count to the last added perag structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Darrick J. Wong
af8512c527 xfs: don't fail repairs on metadata files with no attr fork
Fix a minor bug where we fail repairs on metadata files that do not have
attr forks because xrep_metadata_inode_subtype doesn't filter ENOENT.

Cc: stable@vger.kernel.org # v6.8
Fixes: 5a8e07e799 ("xfs: repair the inode core and forks of a metadata inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
f6f91d290c xfs: punch delalloc extents from the COW fork for COW writes
When ->iomap_end is called on a short write to the COW fork it needs to
punch stale delalloc data from the COW fork and not the data fork.

Ensure that IOMAP_F_NEW is set for new COW fork allocations in
xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag
in xfs_buffered_write_delalloc_punch to decide which fork to punch.

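A minimal sketch of the fork selection this describes; the punch helper's
exact signature is assumed:

	int	whichfork = (iomap->flags & IOMAP_F_SHARED) ?
				XFS_COW_FORK : XFS_DATA_FORK;

	xfs_bmap_punch_delalloc_range(ip, whichfork, start_byte, end_byte);
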
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
7d6fe5c586 xfs: set IOMAP_F_SHARED for all COW fork allocations
Change xfs_buffered_write_iomap_begin to always set IOMAP_F_SHARED for
COW fork allocations even if they don't overlap existing data fork
extents, which will allow the iomap_end callback to detect if it has to punch
stale delalloc blocks from the COW fork instead of the data fork.  It
also means we sample the sequence counter for both the data and the COW
fork when writing to the COW fork, which ensures we properly revalidate
when only COW fork changes happens.

This is essentially a revert of commit 72a048c105 ("xfs: only set
IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because
the problem that the commit fixed has now been dealt with in iomap by
only looking at the actual srcmap and not the fallback to the write
iomap.

Note that the direct I/O path was never changed and has always set
IOMAP_F_SHARED for all COW fork allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
c29440ff66 xfs: share more code in xfs_buffered_write_iomap_begin
Introduce a local iomap_flags variable so that the code allocating new
delalloc blocks in the data fork can fall through to the found_imap
label and reuse the code to unlock and fill the iomap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
8fe3b21efa xfs: support the COW fork in xfs_bmap_punch_delalloc_range
xfs_buffered_write_iomap_begin can also create delalloc reservations
that need cleaning up; prepare for that by adding support for the COW
fork in xfs_bmap_punch_delalloc_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
abd7d651ad xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lock
All XFS callers of iomap_zero_range and iomap_file_unshare already hold
invalidate_lock, so we can't take it again in
iomap_file_buffered_write_punch_delalloc.

Use the passed in flags argument to detect if we're called from a zero
or unshare operation and don't take the lock again in this case.

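A hedged sketch of the flag check (surrounding context assumed):

	/* zeroing and unsharing callers already hold the invalidate_lock */
	if (!(flags & (IOMAP_ZERO | IOMAP_UNSHARE)))
		filemap_invalidate_lock(inode->i_mapping);
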
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
acfbac7764 xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
xfs_file_write_zero_eof is the only caller of xfs_zero_range that does
not take XFS_MMAPLOCK_EXCL (aka the invalidate lock).  Currently that
is actually the right thing, as an error in the iomap zeroing code will
also take the invalidate_lock to clean up, but to fix that deadlock we
need a consistent locking pattern first.

The only extra thing that XFS_MMAPLOCK_EXCL will lock out are read
pagefaults, which isn't really needed here, but also not actively
harmful.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
3c399374af xfs: factor out a xfs_file_write_zero_eof helper
Split a helper from xfs_file_write_checks that just deals with the
post-EOF zeroing to keep the code readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
b784951662 iomap: move locking out of iomap_write_delalloc_release
XFS (which currently is the only user of iomap_write_delalloc_release)
already holds invalidate_lock for most zeroing operations.  To be able
to avoid a deadlock it needs to stop taking the lock, but doing so
in iomap would leak XFS locking details into iomap.

To avoid this, require the caller to hold invalidate_lock when calling
iomap_write_delalloc_release instead of taking it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
caf0ea451d iomap: remove iomap_file_buffered_write_punch_delalloc
Currently iomap_file_buffered_write_punch_delalloc can be called from
XFS either with the invalidate lock held or not.  To fix this while
keeping the locking in the file system and not in the iomap library
code, we'll need to lift the locking up into the file system.

To prepare for that, open code iomap_file_buffered_write_punch_delalloc
in the only caller, and instead export iomap_write_delalloc_release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Darrick J. Wong
0fb823f1cf xfs: fix integer overflow in xrep_bmap
The variable declaration in this function predates the merge of the
nrext64 (aka 64-bit extent counters) feature, which means that the
variable declaration type is insufficient to avoid an integer overflow.
Fix that by redeclaring the variable to be xfs_extnum_t.

Coverity-id: 1630958
Fixes: 8f71bede8e ("xfs: repair inode fork block mapping data structures")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-11 12:32:48 +02:00
Christian Brauner
b40508ca5d
Merge patch series "timekeeping/fs: multigrain timestamp redux"
Jeff Layton <jlayton@kernel.org> says:

The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide when to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
as a flag that indicates whether the current timestamps have been
queried via stat() or the like. When it's set, we allow the kernel to
use a fine-grained timestamp iff it's necessary to make the ctime show
a different value.

This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp. A file that is
altered just a bit later can then get a coarse-grained one that appears
older than the earlier fine-grained time. This violates timestamp
ordering guarantees.

To remedy this, keep a global monotonic atomic64_t value that acts as a
timestamp floor.  When we go to stamp a file, we first get the later of
the current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it with
that value.

If it has been queried, then first see whether the current coarse time
is later than the existing ctime. If it is, then we accept that value.
If it isn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time, convert it to realtime and try to swap that into the ctime.

We take the result of the ctime swap whether it succeeds or fails, since
either is just as valid.

Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same floor
value as multigrain filesystems).
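
A compact sketch of that floor logic (the helpers coarse_ns(), fine_ns()
and ctime_ns() are illustrative stand-ins, not the series' actual code):

	static atomic64_t ctime_floor;	/* global floor, monotonic nanoseconds */

	static s64 pick_ctime(struct inode *inode, bool queried)
	{
		s64 floor = atomic64_read(&ctime_floor);
		s64 now = max(coarse_ns(), floor);	/* later of coarse time and floor */

		if (!queried || now > ctime_ns(inode))
			return now;	/* a coarse stamp is already distinguishable */

		/* ctime was observed and the coarse clock hasn't moved: go fine-grained */
		atomic64_try_cmpxchg(&ctime_floor, &floor, fine_ns());
		return atomic64_read(&ctime_floor);	/* mono->realtime conversion omitted */
	}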

* patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org:
  tmpfs: add support for multigrain timestamps
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  Documentation: add a new file documenting multigrain timestamps
  fs: add percpu counters for significant multigrain timestamp events
  fs: tracepoints around multigrain timestamp events
  fs: handle delegated timestamps in setattr_copy_mgtime
  fs: have setattr_copy handle multigrain timestamps appropriately
  fs: add infrastructure for multigrain timestamps

Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10 10:20:57 +02:00
Jeff Layton
1cf7e834a6
xfs: switch to multigrain timestamps
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
should give us better semantics now.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241002-mgtime-v10-9-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10 10:20:52 +02:00
Andrew Kreimer
77bfe1b11e xfs: fix a typo
Fix a typo in comments.

Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-09 10:05:26 +02:00
Brian Foster
4390f019ad xfs: don't free cowblocks from under dirty pagecache on unshare
fallocate unshare mode explicitly breaks extent sharing. When a
command completes, it checks the data fork for any remaining shared
extents to determine whether the reflink inode flag and COW fork
preallocation can be removed. This logic doesn't consider in-core
pagecache and I/O state, however, which means we can unsafely remove
COW fork blocks that are still needed under certain conditions.

For example, consider the following command sequence:

xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \
	-c "pwrite 0 32k" -c "funshare 0 1k" <file>

This allocates a data block at offset 0, shares it, and then
overwrites it with a larger buffered write. The overwrite triggers
COW fork preallocation, 32 blocks by default, which maps the entire
32k write to delalloc in the COW fork. All but the shared block at
offset 0 remains hole mapped in the data fork. The unshare command
redirties and flushes the folio at offset 0, removing the only
shared extent from the inode. Since the inode no longer maps shared
extents, unshare purges the COW fork before the remaining 28k may
have been written back.

This leaves dirty pagecache backed by holes, which writeback quietly
skips, thus leaving clean, non-zeroed pagecache over holes in the
file. To verify, fiemap shows holes in the first 32k of the file and
reads return different data across a remount:

$ xfs_io -c "fiemap -v" <file>
<file>:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   ...
   1: [8..511]:        hole               504
   ...
$ xfs_io -c "pread -v 4k 8" <file>
00001000:  cd cd cd cd cd cd cd cd  ........
$ umount <mnt>; mount <dev> <mnt>
$ xfs_io -c "pread -v 4k 8" <file>
00001000:  00 00 00 00 00 00 00 00  ........

To avoid this problem, make unshare follow the same rules used for
background cowblock scanning and never purge the COW fork for inodes
with dirty pagecache or in-flight I/O.
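
A minimal sketch of the kind of check involved (illustrative; the patch
reuses the existing blockgc helper rather than open-coding this):

	/* keep the COW fork if the file still has dirty pagecache or I/O in flight */
	if (mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK))
		return false;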

Fixes: 46afb0628b ("xfs: only flush the unshared range in xfs_reflink_unshare")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-09 10:05:10 +02:00
Christian Brauner
dad1b6c805
Merge patch series "fsdax/xfs: unshare range fixes for 6.12"
Darrick J. Wong <djwong@kernel.org> says:

This patchset fixes multiple data corruption bugs in the fallocate
unshare range implementation for fsdax.

* patches from https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs:
  fsdax: dax_unshare_iter needs to copy entire blocks
  fsdax: remove zeroing code from dax_unshare_iter
  iomap: share iomap_unshare_iter predicate code with fsdax
  xfs: don't allocate COW extents when unsharing a hole

Link: https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-08 12:27:14 +02:00
Darrick J. Wong
b8c4076db5
xfs: don't allocate COW extents when unsharing a hole
It doesn't make sense to allocate a COW extent when unsharing a hole
because holes cannot be shared.

Fixes: 1f1397b721 ("xfs: don't allocate into the data fork for an unshare request")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/172796813277.1131942.5486112889531210260.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-07 13:51:46 +02:00
Brian Foster
90a71daaf7 xfs: skip background cowblock trims on inodes open for write
The background blockgc scanner runs on a 5m interval by default and
trims preallocation (post-eof and cow fork) from inodes that are
otherwise idle. Idle effectively means that iolock can be acquired
without blocking and that the inode has no dirty pagecache or I/O in
flight.

This simple mechanism and heuristic has worked fairly well for
post-eof speculative preallocations. Support for reflink and COW
fork preallocations came sometime later and plugged into the same
mechanism, with similar heuristics. Some recent testing has shown
that COW fork preallocation may be notably more sensitive to blockgc
processing than post-eof preallocation, however.

For example, consider an 8GB reflinked file with a COW extent size
hint of 1MB. A worst case fully randomized overwrite of this file
results in ~8k extents of an average size of ~1MB. If the same
workload is interrupted a couple times for blockgc processing
(assuming the file goes idle), the resulting extent count explodes
to over 100k extents with an average size <100kB. This is
significantly worse than ideal and essentially defeats the COW
extent size hint mechanism.

While this particular test is instrumented, it reflects a fairly
reasonable pattern in practice where random I/Os might spread out
over a large period of time with varying periods of (in)activity.
For example, consider a cloned disk image file for a VM or container
with long uptime and variable and bursty usage. A background blockgc
scan that races and processes the image file when it happens to be
clean and idle can have a significant effect on the future
fragmentation level of the file, even when still in use.

To help combat this, update the heuristic to skip cowblocks inodes
that are currently opened for write access during non-sync blockgc
scans. This allows COW fork preallocations to persist for as long as
possible unless otherwise needed for functional purposes (i.e. a
sync scan), the file is idle and closed, or the inode is being
evicted from cache. While here, update the comments to help
distinguish performance oriented heuristics from the logic that
exists to maintain functional correctness.
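
A minimal sketch of the heuristic ("sync" standing for a sync scan; the
actual patch folds this into the existing cowblocks checks):

	/* non-sync scans leave COW preallocations alone while the file is open for write */
	if (!sync && inode_is_open_for_write(VFS_I(ip)))
		return false;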

Suggested-by: Darrick Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:12 +02:00
Christoph Hellwig
6aac770598 xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation
variant fails to drop into the lowmode last resort allocator, and
thus can sometimes fail allocations for which the caller has a
transaction block reservation.

Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:12 +02:00
Christoph Hellwig
405ee87c69 xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc
xfs_bmap_exact_minlen_extent_alloc duplicates the args setup in
xfs_bmap_btalloc.  Switch to call it from xfs_bmap_btalloc after
doing the basic setup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
b611fddc04 xfs: don't ifdef around the exact minlen allocations
Exact minlen allocations only exist as an error injection tool for debug
builds.  Currently this is implemented using ifdefs, which means the code
isn't even compiled for non-XFS_DEBUG builds.  Enhance the compile test
coverage by always building the code and using the compiler's dead code
elimination to remove it from the generated binary instead.

The only downside is that the alloc_minlen_only field is unconditionally
added to struct xfs_alloc_args now, but by moving it around and packing
it tightly this doesn't actually increase the size of the structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
865469cd41 xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate
Userdata and metadata allocations end up in the same allocation helpers.
Remove the separate xfs_bmap_alloc_userdata function to make this more
clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
b3f4e84e2f xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
Just like xfs_attr3_leaf_split, xfs_attr_node_try_addname can return
-ENOSPC not only for an actual failure to allocate a disk block, but also
to signal the caller to convert the format of the attr fork.  Use the
magic return value 1 to ask for the conversion here as well.

Note that unlike the similar issue in xfs_attr3_leaf_split, this one was
only found by code review.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
a5f73342ab xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
xfs_attr3_leaf_split propagates the need for an extra btree split as
-ENOSPC to its only caller, but the same return value can also be
returned from xfs_da_grow_inode when it fails to find free space.

Distinguish the two cases by returning 1 for the extra split case instead
of overloading -ENOSPC.
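
Roughly, the caller-side pattern becomes (a sketch, not the exact diff):

	error = xfs_attr3_leaf_split(state, oldblk, newblk);
	if (error < 0)
		return error;		/* a real failure, including genuine -ENOSPC */
	if (error == 1)
		extra_split = true;	/* illustrative: the caller does one more split */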

This can be triggered relatively easily with the pending realtime group
support and a file system with a lot of small zones that use metadata
space on the main device.  In this case about every 5th to 10th run of
xfs/538 runs into the following assert:

	ASSERT(oldblk->magic == XFS_ATTR_LEAF_MAGIC);

in xfs_attr3_leaf_split caused by an allocation failure.  Note that
the allocation failure is caused by another bug that will be fixed
subsequently, but this commit at least sorts out the error handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
346c1d46d4 xfs: return bool from xfs_attr3_leaf_add
xfs_attr3_leaf_add only has two potential return values, indicating
whether the entry could be added or not.  Replace the errno return with a
bool so that a full leaf can't easily be confused with a real ENOSPC.

Remove the return value from the xfs_attr3_leaf_add_work helper entirely,
as it always returns 0.
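
A sketch of the resulting calling convention (illustrative, not the
exact caller):

	/* a false return only means "no room in this leaf", never an I/O error */
	if (!xfs_attr3_leaf_add(bp, args))
		return 1;	/* illustrative: ask the caller to split or convert */
	return 0;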

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
b1c649da15 xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname
xfs_attr_leaf_try_add is only called by xfs_attr_leaf_addname, and
merging the two will simplify a following error handling fix.

To facilitate this, move the remote block state save/restore helpers up in
the file so that they don't need forward declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Uros Bizjak
20195d011c xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()
Use !try_cmpxchg instead of cmpxchg(*ptr, old, new) != old in
xlog_cil_insert_pcp_aggregate().  The x86 CMPXCHG instruction returns
success in the ZF flag, so this change saves a compare after cmpxchg.

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when
cmpxchg fails. There is no need to re-read the value in the loop.

Note that the value from *ptr should be read using READ_ONCE to
prevent the compiler from merging, refetching or reordering the read.
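
In generic terms the conversion looks like this (a sketch of the pattern,
not the exact xlog code):

	/* before: re-reads *ptr and compares the cmpxchg result on every pass */
	do {
		old = *ptr;
		new = old + diff;
	} while (cmpxchg(ptr, old, new) != old);

	/* after: try_cmpxchg() updates 'old' on failure, so read once up front */
	old = READ_ONCE(*ptr);
	do {
		new = old + diff;
	} while (!try_cmpxchg(ptr, &old, new));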

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Yan Zhen
6148b77960 xfs: scrub: convert comma to semicolon
Replace a comma between expression statements by a semicolon.

Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Zhang Zekun
f6225eebd7 xfs: Remove stale declaration in header file
The definition of xfs_attr_use_log_assist() has been removed since
commit d9c61ccb3b ("xfs: move xfs_attr_use_log_assist out of xfs_log.c").
So remove the stale declaration from the header file.

Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Linus Torvalds
f8ffbc365f struct fd layout change (and conversion to accessor helpers)
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ
 63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d
 592+iDgLsema/H/0/CqfqlaNtDNY8Q0=
 =HUl5
 -----END PGP SIGNATURE-----

Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' updates from Al Viro:
 "Just the 'struct fd' layout change, with conversion to accessor
  helpers"

* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add struct fd constructors, get rid of __to_fd()
  struct fd: representation change
  introduce fd_file(), convert all accessors to it.
2024-09-23 09:35:36 -07:00
Linus Torvalds
617a814f14 Along with the usual shower of singleton patches, notable patch series in
this pull request are:
 
 "Align kvrealloc() with krealloc()" from Danilo Krummrich.  Adds
 consistency to the APIs and behaviour of these two core allocation
 functions.  This also simplifies/enables Rustification.
 
 "Some cleanups for shmem" from Baolin Wang.  No functional changes - more
 code reuse, better function naming, logic simplifications.
 
 "mm: some small page fault cleanups" from Josef Bacik.  No functional
 changes - code cleanups only.
 
 "Various memory tiering fixes" from Zi Yan.  A small fix and a little
 cleanup.
 
 "mm/swap: remove boilerplate" from Yu Zhao.  Code cleanups and
 simplifications and .text shrinkage.
 
 "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt.  This
 is a feature; it adds new fields to /proc/vmstat such as
 
     $ grep kstack /proc/vmstat
     kstack_1k 3
     kstack_2k 188
     kstack_4k 11391
     kstack_8k 243
     kstack_16k 0
 
 which tells us that 11391 processes used 4k of stack while none at all
 used 16k.  Useful for some system tuning things, but particularly useful
 for "the dynamic kernel stack project".
 
 "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov.
 Teaches kmemleak to detect leakage of percpu memory.
 
 "mm: memcg: page counters optimizations" from Roman Gushchin.  "3
 independent small optimizations of page counters".
 
 "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David
 Hildenbrand.  Improves PTE/PMD splitlock detection, makes powerpc/8xx work
 correctly by design rather than by accident.
 
 "mm: remove arch_make_page_accessible()" from David Hildenbrand.  Some
 folio conversions which make arch_make_page_accessible() unneeded.
 
 "mm, memcg: cg2 memory{.swap,}.peak write handlers" from David Finkel.
 Cleans up and fixes our handling of the resetting of the cgroup/process
 peak-memory-use detector.
 
 "Make core VMA operations internal and testable" from Lorenzo Stoakes.
 Rationalization and encapsulation of the VMA manipulation APIs.  With a
 view to better enable testing of the VMA functions, even from a
 userspace-only harness.
 
 "mm: zswap: fixes for global shrinker" from Takero Funaki.  Fix issues in
 the zswap global shrinker, resulting in improved performance.
 
 "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao.  Fill in
 some missing info in /proc/zoneinfo.
 
 "mm: replace follow_page() by folio_walk" from David Hildenbrand.  Code
 cleanups and rationalizations (conversion to folio_walk()) resulting in
 the removal of follow_page().
 
 "improving dynamic zswap shrinker protection scheme" from Nhat Pham.  Some
 tuning to improve zswap's dynamic shrinker.  Significant reductions in
 swapin and improvements in performance are shown.
 
 "mm: Fix several issues with unaccepted memory" from Kirill Shutemov.
 Improvements to the new unaccepted memory feature,
 
 "mm/mprotect: Fix dax puds" from Peter Xu.  Implements mprotect on DAX
 PUDs.  This was missing, although nobody seems to have noticed yet.
 
 "Introduce a store type enum for the Maple tree" from Sidhartha Kumar.
 Cleanups and modest performance improvements for the maple tree library
 code.
 
 "memcg: further decouple v1 code from v2" from Shakeel Butt.  Move more
 cgroup v1 remnants away from the v2 memcg code.
 
 "memcg: initiate deprecation of v1 features" from Shakeel Butt.  Adds
 various warnings telling users that memcg v1 features are deprecated.
 
 "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li.
 Greatly improves the success rate of the mTHP swap allocation.
 
 "mm: introduce numa_memblks" from Mike Rapoport.  Moves various disparate
 per-arch implementations of numa_memblk code into generic code.
 
 "mm: batch free swaps for zap_pte_range()" from Barry Song.  Greatly
 improves the performance of munmap() of swap-filled ptes.
 
 "support large folio swap-out and swap-in for shmem" from Baolin Wang.
 With this series we no longer split shmem large folios into single-page
 folios when swapping out shmem.
 
 "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao.  Nice performance
 improvements and code reductions for gigantic folios.
 
 "support shmem mTHP collapse" from Baolin Wang.  Adds support for
 khugepaged's collapsing of shmem mTHP folios.
 
 "mm: Optimize mseal checks" from Pedro Falcato.  Fixes an mprotect()
 performance regression due to the addition of mseal().
 
 "Increase the number of bits available in page_type" from Matthew Wilcox.
 Increases the number of bits available in page_type!
 
 "Simplify the page flags a little" from Matthew Wilcox.  Many legacy page
 flags are now folio flags, so the page-based flags and their
 accessors/mutators can be removed.
 
 "mm: store zero pages to be swapped out in a bitmap" from Usama Arif.  An
 optimization which permits us to avoid writing/reading zero-filled zswap
 pages to backing store.
 
 "Avoid MAP_FIXED gap exposure" from Liam Howlett.  Fixes a race window
 which occurs when a MAP_FIXED operation is occurring during an unrelated
 vma tree walk.
 
 "mm: remove vma_merge()" from Lorenzo Stoakes.  Major rotorooting of the
 vma_merge() functionality, making it cleaner, more testable and better
 tested.
 
 "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.  Minor
 fixups of DAMON selftests and kunit tests.
 
 "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.  Code
 cleanups and folio conversions.
 
 "Shmem mTHP controls and stats improvements" from Ryan Roberts.  Cleanups
 for shmem controls and stats.
 
 "mm: count the number of anonymous THPs per size" from Barry Song.  Expose
 additional anon THP stats to userspace for improved tuning.
 
 "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio
 conversions and removal of now-unused page-based APIs.
 
 "replace per-quota region priorities histogram buffer with per-context
 one" from SeongJae Park.  DAMON histogram rationalization.
 
 "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae
 Park.  DAMON documentation updates.
 
 "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve
 related doc and warn" from Jason Wang: fixes usage of page allocator
 __GFP_NOFAIL and GFP_ATOMIC flags.
 
 "mm: split underused THPs" from Yu Zhao.  Improve THP=always policy - this
 was overprovisioning THPs in sparsely accessed memory areas.
 
 "zram: introduce custom comp backends API" from Sergey Senozhatsky.  Add
 support for zram run-time compression algorithm tuning.
 
 "mm: Care about shadow stack guard gap when getting an unmapped area" from
 Mark Brown.  Fix up the various arch_get_unmapped_area() implementations
 to better respect guard areas.
 
 "Improve mem_cgroup_iter()" from Kinsey Ho.  Improve the reliability of
 mem_cgroup_iter() and various code cleanups.
 
 "mm: Support huge pfnmaps" from Peter Xu.  Extends the usage of huge
 pfnmap support.
 
 "resource: Fix region_intersects() vs add_memory_driver_managed()" from
 Huang Ying.  Fix a bug in region_intersects() for systems with CXL memory.
 
 "mm: hwpoison: two more poison recovery" from Kefeng Wang.  Teaches a
 couple more code paths to correctly recover from the encountering of
 poisoned memory.
 
 "mm: enable large folios swap-in support" from Barry Song.  Support the
 swapin of mTHP memory into appropriately-sized folios, rather than into
 single-page folios.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA
 jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7
 AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo=
 =s0T+
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series
  in this pull request are:

   - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
     consistency to the APIs and behaviour of these two core allocation
     functions. This also simplifies/enables Rustification.

   - "Some cleanups for shmem" from Baolin Wang. No functional changes -
     more code reuse, better function naming, logic simplifications.

   - "mm: some small page fault cleanups" from Josef Bacik. No
     functional changes - code cleanups only.

   - "Various memory tiering fixes" from Zi Yan. A small fix and a
     little cleanup.

   - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
     simplifications and .text shrinkage.

   - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
     Butt. This is a feature; it adds new fields to /proc/vmstat such as

       $ grep kstack /proc/vmstat
       kstack_1k 3
       kstack_2k 188
       kstack_4k 11391
       kstack_8k 243
       kstack_16k 0

     which tells us that 11391 processes used 4k of stack while none at
     all used 16k. Useful for some system tuning things, but
     particularly useful for "the dynamic kernel stack project".

   - "kmemleak: support for percpu memory leak detect" from Pavel
     Tikhomirov. Teaches kmemleak to detect leakage of percpu memory.

   - "mm: memcg: page counters optimizations" from Roman Gushchin. "3
     independent small optimizations of page counters".

   - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
     David Hildenbrand. Improves PTE/PMD splitlock detection, makes
     powerpc/8xx work correctly by design rather than by accident.

   - "mm: remove arch_make_page_accessible()" from David Hildenbrand.
     Some folio conversions which make arch_make_page_accessible()
     unneeded.

   - "mm, memcg: cg2 memory{.swap,}.peak write handlers" from David
     Finkel. Cleans up and fixes our handling of the resetting of the
     cgroup/process peak-memory-use detector.

   - "Make core VMA operations internal and testable" from Lorenzo
     Stoakes. Rationalization and encapsulation of the VMA manipulation
     APIs. With a view to better enable testing of the VMA functions,
     even from a userspace-only harness.

   - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
     issues in the zswap global shrinker, resulting in improved
     performance.

   - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
     in some missing info in /proc/zoneinfo.

   - "mm: replace follow_page() by folio_walk" from David Hildenbrand.
     Code cleanups and rationalizations (conversion to folio_walk())
     resulting in the removal of follow_page().

   - "improving dynamic zswap shrinker protection scheme" from Nhat
     Pham. Some tuning to improve zswap's dynamic shrinker. Significant
     reductions in swapin and improvements in performance are shown.

   - "mm: Fix several issues with unaccepted memory" from Kirill
     Shutemov. Improvements to the new unaccepted memory feature,

   - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
     DAX PUDs. This was missing, although nobody seems to have noticed
     yet.

   - "Introduce a store type enum for the Maple tree" from Sidhartha
     Kumar. Cleanups and modest performance improvements for the maple
     tree library code.

   - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
     more cgroup v1 remnants away from the v2 memcg code.

   - "memcg: initiate deprecation of v1 features" from Shakeel Butt.
     Adds various warnings telling users that memcg v1 features are
     deprecated.

   - "mm: swap: mTHP swap allocator base on swap cluster order" from
     Chris Li. Greatly improves the success rate of the mTHP swap
     allocation.

   - "mm: introduce numa_memblks" from Mike Rapoport. Moves various
     disparate per-arch implementations of numa_memblk code into generic
     code.

   - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
     improves the performance of munmap() of swap-filled ptes.

   - "support large folio swap-out and swap-in for shmem" from Baolin
     Wang. With this series we no longer split shmem large folios into
     single-page folios when swapping out shmem.

   - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
     performance improvements and code reductions for gigantic folios.

   - "support shmem mTHP collapse" from Baolin Wang. Adds support for
     khugepaged's collapsing of shmem mTHP folios.

   - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
     performance regression due to the addition of mseal().

   - "Increase the number of bits available in page_type" from Matthew
     Wilcox. Increases the number of bits available in page_type!

   - "Simplify the page flags a little" from Matthew Wilcox. Many legacy
     page flags are now folio flags, so the page-based flags and their
     accessors/mutators can be removed.

   - "mm: store zero pages to be swapped out in a bitmap" from Usama
     Arif. An optimization which permits us to avoid writing/reading
     zero-filled zswap pages to backing store.

   - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
     window which occurs when a MAP_FIXED operation is occurring during
     an unrelated vma tree walk.

   - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
     the vma_merge() functionality, making it cleaner, more testable and
     better tested.

   - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
     Minor fixups of DAMON selftests and kunit tests.

   - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
     Code cleanups and folio conversions.

   - "Shmem mTHP controls and stats improvements" from Ryan Roberts.
     Cleanups for shmem controls and stats.

   - "mm: count the number of anonymous THPs per size" from Barry Song.
     Expose additional anon THP stats to userspace for improved tuning.

   - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
     folio conversions and removal of now-unused page-based APIs.

   - "replace per-quota region priorities histogram buffer with
     per-context one" from SeongJae Park. DAMON histogram
     rationalization.

   - "Docs/damon: update GitHub repo URLs and maintainer-profile" from
     SeongJae Park. DAMON documentation updates.

   - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
     improve related doc and warn" from Jason Wang: fixes usage of page
     allocator __GFP_NOFAIL and GFP_ATOMIC flags.

   - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
     This was overprovisioning THPs in sparsely accessed memory areas.

   - "zram: introduce custom comp backends API" from Sergey Senozhatsky.
     Add support for zram run-time compression algorithm tuning.

   - "mm: Care about shadow stack guard gap when getting an unmapped
     area" from Mark Brown. Fix up the various arch_get_unmapped_area()
     implementations to better respect guard areas.

   - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
     of mem_cgroup_iter() and various code cleanups.

   - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
     pfnmap support.

   - "resource: Fix region_intersects() vs add_memory_driver_managed()"
     from Huang Ying. Fix a bug in region_intersects() for systems with
     CXL memory.

   - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
     a couple more code paths to correctly recover from the encountering
     of poisoned memory.

   - "mm: enable large folios swap-in support" from Barry Song. Support
     the swapin of mTHP memory into appropriately-sized folios, rather
     than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area->pages[2] into xol_area->page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
2024-09-21 07:29:05 -07:00
Linus Torvalds
171754c380 vfs-6.12.blocksize
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvwAKCRCRxhvAZXjc
 ohg3APwJWQnqFlBddcRl4yrPJ/cgcYSYAOdHb+E+blomSwdxcwEAmwsnLPNQOtw2
 rxKvQfZqhVT437bl7RpPPZrHGxwTng8=
 =6v1r
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs blocksize updates from Christian Brauner:
 "This contains the vfs infrastructure as well as the xfs bits to enable
  support for block sizes (bs) larger than page sizes (ps) plus a few
  fixes to related infrastructure.

  There have been efforts over the last 16 years to enable Large
  Block Sizes (LBS), that is, block sizes in filesystems where bs > page
  size. Through these efforts we have learned that one of the main
  blockers to supporting bs > ps in filesystems has been a way to
  allocate pages that are at least the filesystem block size on the page
  cache where bs > ps.

  Thanks to various previous efforts it is possible to support bs > ps
  in XFS with only a few changes in XFS itself. Most changes are to the
  page cache to support minimum order folio support for the target block
  size on the filesystem.

  A motivation for Large Block Sizes today is to support high-capacity
  (many terabytes) QLC SSDs where the internal Indirection
  Unit (IU) is typically greater than 4k to help reduce DRAM and so in
  turn cost and space. In practice this then allows different
  architectures to use a base page size of 4k while still enabling
  support for block sizes aligned to the larger IUs by relying on high
  order folios on the page cache when needed.

  It also allows taking advantage of the drive's support for atomics
  larger than 4k with buffered IO support in Linux. As described this
  year at LSFMM, supporting large atomics greater than 4k enables
  databases to remove the need to rely on their own journaling, so they
  can disable double buffered writes, which is a feature different cloud
  providers are already enabling through custom storage solutions"

* tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  Documentation: iomap: fix a typo
  iomap: remove the iomap_file_buffered_write_punch_delalloc return value
  iomap: pass the iomap to the punch callback
  iomap: pass flags to iomap_file_buffered_write_punch_delalloc
  iomap: improve shared block detection in iomap_unshare_iter
  iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
  docs:filesystems: fix spelling and grammar mistakes in iomap design page
  filemap: fix htmldoc warning for mapping_align_index()
  iomap: make zero range flush conditional on unwritten mappings
  iomap: fix handling of dirty folios over unwritten extents
  iomap: add a private argument for iomap_file_buffered_write
  iomap: remove set_memor_ro() on zero page
  xfs: enable block size larger than page size support
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: expose block size in stat
  xfs: use kvmalloc for xattr buffers
  iomap: fix iomap_dio_zero() for fs bs > system page size
  filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  mm: split a folio in minimum folio order chunks
  readahead: allocate folios with mapping_min_order in readahead
  ...
2024-09-20 17:53:17 -07:00
Linus Torvalds
8751b21ad9 New code for 6.12:
* Introduce new ioctls to exchange contents of two files.
     The first ioctl does the preparation work to exchange the contents of two
     files while the second ioctl performs the actual exchange if the target
     file has not been changed since a given sampling point.
 
   * Fixes
     - Fix bugs associated with calculating the maximum range of realtime
       extents to scan for free space.
     - Copy keys instead of records when resizing the incore BMBT root block.
     - Do not report FITRIMming more bytes than possibly exist in the
       filesystem.
     - Modify xfs_fs.h to prevent C++ compilation errors.
     - Do not over eagerly free post-EOF speculative preallocation.
     - Ensure st_blocks never goes to zero during COW writes
 
   * Cleanups/refactors
     - Use Xarray to hold per-AG data instead of a Radix tree.
     - Cleanup the following functionality,
       - Realtime bitmap.
       - Inode allocator.
       - Quota.
       - Inode rooted btree code.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZtmy/gAKCRAH7y4RirJu
 9H3GAP9CnoiZu+U/QmNL5T15fgNGs+BQDrUNbmbn3bNlmIZviQEAi3p+50OlT0nP
 lcQ/89NJ6uDFNBiphpkGajlp5vn2BQ0=
 =7wy/
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.12-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Chandan Babu:
 "New code:

   - Introduce new ioctls to exchange contents of two files.

     The first ioctl does the preparation work to exchange the contents
     of two files while the second ioctl performs the actual exchange if
     the target file has not been changed since a given sampling point.

  Fixes:

   - Fix bugs associated with calculating the maximum range of realtime
     extents to scan for free space.

   - Copy keys instead of records when resizing the incore BMBT root
     block.

   - Do not report FITRIMming more bytes than possibly exist in the
     filesystem.

   - Modify xfs_fs.h to prevent C++ compilation errors.

   - Do not over eagerly free post-EOF speculative preallocation.

   - Ensure st_blocks never goes to zero during COW writes

  Cleanups/refactors:

   - Use Xarray to hold per-AG data instead of a Radix tree.

   - Cleanups to:
      - realtime bitmap
      - inode allocator
      - quota
      - inode rooted btree code"

* tag 'xfs-6.12-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (61 commits)
  xfs: ensure st_blocks never goes to zero during COW writes
  xfs: use xas_for_each_marked in xfs_reclaim_inodes_count
  xfs: convert perag lookup to xarray
  xfs: simplify tagged perag iteration
  xfs: move the tagged perag lookup helpers to xfs_icache.c
  xfs: use kfree_rcu_mightsleep to free the perag structures
  xfs: use LIST_HEAD() to simplify code
  xfs: Remove duplicate xfs_trans_priv.h header
  xfs: remove unnecessary check
  xfs: Use xfs set and clear mp state helpers
  xfs: reclaim speculative preallocations for append only files
  xfs: simplify extent lookup in xfs_can_free_eofblocks
  xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks
  xfs: only free posteof blocks on first close
  xfs: don't free post-EOF blocks on read close
  xfs: skip all of xfs_file_release when shut down
  xfs: don't bother returning errors from xfs_file_release
  xfs: refactor f_op->release handling
  xfs: remove the i_mode check in xfs_release
  xfs: standardize the btree maxrecs function parameters
  ...
2024-09-19 07:03:55 +02:00
Linus Torvalds
9ea925c806 Updates for timers and timekeeping:
- Core:
 
 	- Overhaul of posix-timers in preparation of removing the
 	  workaround for periodic timers which have signal delivery
 	  ignored.
 
         - Remove the historical extra jiffie in msleep()
 
 	  msleep() adds an extra jiffie to the timeout value to ensure
 	  minimal sleep time. The timer wheel ensures minimal sleep
 	  time since the large rewrite to a non-cascading wheel, but the
 	  extra jiffie in msleep() remained unnoticed. Remove it.
 
         - Make the timer slack handling correct for realtime tasks.
 
 	  The procfs interface is inconsistent and neither reflects
 	  reality nor conforms to the man page. Show the correct 0 slack
 	  for real time tasks and enforce it at the core level instead of
 	  having inconsistent individual checks in various timer setup
 	  functions.
 
         - The usual set of updates and enhancements all over the place.
 
   - Drivers:
 
         - Allow the ACPI PM timer to be turned off during suspend
 
 	- No new drivers
 
 	- The usual updates and enhancements in various drivers
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn7jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYobqnD/9COlU0nwsulABI/aNIrsh6iYvnCC9v
 14CcNta7Qn+157Wfw9BWOyHdNhR1/fPCXE8jJ71zTyIOeW27HV2JyTtxTwe9ZcdK
 ViHAaj7YcIjcVUEC3StCoRCPnvLslEw4qJA5AOQuDyMivdQn+YVa2c0baJxKaXZt
 xk4HZdMj4NAS0jRKnoZSwtKW/+Oz6rR4GAWrZo+Zs1/8ur3HfqnQfi8lJ1hJtLLW
 V7XDCVRvamVi6Ah3ocYPPp/1P6yeQDA1ge9aMddqaza5STWISXRtSnFMUmYP3rbS
 FaL8TyL+ilfny8pkGB2WlG6nLuSbtvogtdEh1gG1k1RmZt44kAtk8ba/KiWFPBSb
 zK9cjojRMBS71f9G4kmb5F4rnXoLsg1YbD1Nzhz3wq2Cs1Z90dc2QwMren0zoQ1x
 Fn56ueRyAiagBlnrSaKyso/2RvqJTNoSdi3RkpjYeAph0UoDCqvTvKjGAf1mWiw1
 T/1lUWSVqWHnzZbM7XXzzajIN9bl6A7bbqlcAJ2O9vZIDt7273DG+bQym9Vh6Why
 0LTGGERHxzKBsG7WRg+2Gmvv6S18UPKRo8tLtlA758rHlFuPTZCShWrIriwSNl1K
 Hxon+d4BparSnm1h9W/NHPKJA574UbWRCBjdk58IkAj8DxZZY4ORD9SMP+ggkV7G
 F6p9cgoDNP9KFg==
 =jE0N
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Core:

   - Overhaul of posix-timers in preparation of removing the workaround
     for periodic timers which have signal delivery ignored.

   - Remove the historical extra jiffie in msleep()

     msleep() adds an extra jiffie to the timeout value to ensure
     minimal sleep time. The timer wheel ensures minimal sleep time
     since the large rewrite to a non-cascading wheel, but the extra
     jiffie in msleep() remained unnoticed. Remove it.

   - Make the timer slack handling correct for realtime tasks.

     The procfs interface is inconsistent and neither reflects
     reality nor conforms to the man page. Show the correct 0 slack for
     real time tasks and enforce it at the core level instead of having
     inconsistent individual checks in various timer setup functions.

   - The usual set of updates and enhancements all over the place.

  Drivers:

   - Allow the ACPI PM timer to be turned off during suspend

   - No new drivers

   - The usual updates and enhancements in various drivers"

* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  ntp: Make sure RTC is synchronized when time goes backwards
  treewide: Fix wrong singular form of jiffies in comments
  cpu: Use already existing usleep_range()
  timers: Rename next_expiry_recalc() to be unique
  platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
  clocksource/drivers/jcore: Use request_percpu_irq()
  clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
  clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
  clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
  clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
  platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
  clocksource: acpi_pm: Add external callback for suspend/resume
  clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
  dt-bindings: timer: rockchip: Add rk3576 compatible
  timers: Annotate possible non critical data race of next_expiry
  timers: Remove historical extra jiffie for timeout in msleep()
  hrtimer: Use and report correct timerslack values for realtime tasks
  hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
  timers: Add sparse annotation for timer_sync_wait_running().
  signal: Replace BUG_ON()s
  ...
2024-09-17 07:25:37 +02:00
Linus Torvalds
ee25861f26 vfs-6.12.fallocate
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEwAAKCRCRxhvAZXjc
 omD7AQCZuWPXkEGYFD37MJZuRXNEoq7Tuj6yd0O2b5khUpzvyAD+MPuthGiCMPsu
 voPpUP83x7T0D3JsEsCAXtNeVRcIBQI=
 =xTs6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fallocate updates from Christian Brauner:
 "This contains work to try and clean up some of the fallocate mode
  handling. Currently, it confusingly mixes operation modes and an
  optional flag.

  The work here tries to better define operation modes and optional
  flags allowing the core and filesystem code to use switch statements
  to switch on the operation mode"

* tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: refactor xfs_file_fallocate
  xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_space
  xfs: call xfs_flush_unmap_range from xfs_free_file_space
  fs: sort out the fallocate mode vs flag mess
  ext4: remove tracing for FALLOC_FL_NO_HIDE_STALE
  block: remove checks for FALLOC_FL_NO_HIDE_STALE
2024-09-16 09:34:08 +02:00
Thomas Gleixner
2f7eedca6c Merge branch 'linus' into timers/core
To update with the latest fixes.
2024-09-10 13:49:53 +02:00
Christoph Hellwig
4bceb9ba05
iomap: remove the iomap_file_buffered_write_punch_delalloc return value
iomap_file_buffered_write_punch_delalloc can only return errors if either
the ->punch callback returned an error, or if someone changed the API of
mapping_seek_hole_data to return a negative error code that is not
-ENXIO.

As the only instance of ->punch never returns an error, and such an error
would be fatal anyway, remove the entire error propagation and don't
return an error code from iomap_file_buffered_write_punch_delalloc.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240910043949.3481298-6-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-10 11:14:15 +02:00
Christoph Hellwig
492f53758f
iomap: pass the iomap to the punch callback
XFS will need to look at the flags in the iomap structure, so pass it
down all the way to the callback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240910043949.3481298-5-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-10 11:14:15 +02:00
Christoph Hellwig
11596dc3df
iomap: pass flags to iomap_file_buffered_write_punch_delalloc
To fix short write error handling, we'll need to figure out what operation
iomap_file_buffered_write_punch_delalloc is called for.  Pass the flags
argument on to it, and reorder the argument list to match that of
->iomap_end so that the compiler only has to add the new punch argument
to the end of it instead of reshuffling the registers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240910043949.3481298-4-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-10 11:14:14 +02:00
Rik van Riel
e1e4cfd01a mm,tmpfs: consider end of file write in shmem_is_huge
Take the end of a file write into consideration when deciding whether or
not to use huge pages for tmpfs files when the tmpfs filesystem is mounted
with huge=within_size.

This allows large writes that append to the end of a file to automatically
use large pages.

The numbers below are from doing 4MB sequential writes without fallocate
to a 16GB tmpfs file with fio.  The numbers without THP or with
huge=always stay the same, but the
performance with huge=within_size now matches that of huge=always.

huge		before		after
4kB pages	1560 MB/s	1560 MB/s
within_size	1560 MB/s	4720 MB/s
always		4720 MB/s	4720 MB/s

[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20240903111928.7171e60c@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:12 -07:00
Anna-Maria Behnsen
bd7c8ff9fe treewide: Fix wrong singular form of jiffies in comments
There are several comments all over the place, which uses a wrong singular
form of jiffies.

Replace 'jiffie' by 'jiffy'. No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Brian Foster
c5c810b94c iomap: fix handling of dirty folios over unwritten extents
The iomap zero range implementation doesn't properly handle dirty
pagecache over unwritten mappings. It skips such mappings as if they
were pre-zeroed. If some part of an unwritten mapping is dirty in
pagecache from a previous write, the data in cache should be zeroed
as well. Instead, the data is left in cache and creates a stale data
exposure problem if writeback occurs sometime after the zero range.

Most callers are unaffected by this because the higher level
filesystem contexts that call zero range typically perform a filemap
flush of the target range for other reasons. A couple contexts that
don't otherwise need to flush are write file size extension and
truncate in XFS. The former path is currently susceptible to the
stale data exposure problem and the latter performs a flush
specifically to work around it.

This is clearly inconsistent and incomplete. As a first step toward
correcting behavior, lift the XFS workaround to iomap_zero_range()
and unconditionally flush the range before the zero range operation
proceeds. While this appears to be a bit of a big hammer, almost all
users already do this from the calling context, save for the couple of
exceptions noted above. Future patches will optimize or elide this
flush while maintaining functional correctness.
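
Conceptually the lifted workaround amounts to something like this near the
top of iomap_zero_range() (a sketch, not the exact patch):

	/*
	 * Write back any dirty pagecache over the range so that dirty folios
	 * sitting over unwritten extents get zeroed instead of being skipped.
	 */
	ret = filemap_write_and_wait_range(inode->i_mapping, pos, pos + len - 1);
	if (ret)
		return ret;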

Fixes: ae259a9c85 ("fs: introduce iomap infrastructure")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20240830145634.138439-2-bfoster@redhat.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-03 15:01:24 +02:00
Josef Bacik
31754ea6cb iomap: add a private argument for iomap_file_buffered_write
In order to switch fuse over to using iomap for buffered writes we need
to be able to have the struct file for the original write, in case we
have to read in the page to make it uptodate.  Handle this by using the
existing private field in the iomap_iter, and add the argument to
iomap_file_buffered_write.  This will allow us to pass the file in
through the iomap buffered write path, and is flexible for any other
file system's needs.
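
A sketch of the resulting call (fuse_iomap_ops is hypothetical; the
trailing argument is the new private pointer):

	written = iomap_file_buffered_write(iocb, from, &fuse_iomap_ops, file);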

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/7f55c7c32275004ba00cddf862d970e6e633f750.1724755651.git.josef@toxicpanda.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-03 15:01:23 +02:00
Pankaj Raghav
7df7c204c6 xfs: enable block size larger than page size support
The page cache now has the ability to enforce a minimum order when
allocating a folio, which is a prerequisite for adding support for
block size > page size.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20240827-xfs-fix-wformat-bs-gt-ps-v1-1-aec6717609e0@kernel.org # fix folded
Link: https://lore.kernel.org/r/20240822135018.1931258-11-kernel@pankajraghav.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-03 15:00:52 +02:00
Christoph Hellwig
90fa22da6d xfs: ensure st_blocks never goes to zero during COW writes
COW writes remove the amount overwritten either directly for delalloc
reservations, or in deferred transactions that run earlier than the
bmap map transaction that adds the new amount back in.  This means
st_blocks on an inode where all data is overwritten using the COW path
can temporarily show a 0 st_blocks.  This can easily be reproduced with
the pending zoned device support, where all writes use this path and
trip the check in generic/615, but it could also happen on a reflink
file without that.

Fix this by temporarily adding the pending blocks to be mapped to
i_delayed_blks while the item is queued.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:47 +05:30
Christoph Hellwig
866cf1dd3d xfs: use xas_for_each_marked in xfs_reclaim_inodes_count
xfs_reclaim_inodes_count iterates over all AGs to sum up the reclaimable
inode counts.  There is no point in grabbing a reference to them or
unlocking the RCU critical section for each iteration, so switch to the
more efficient xas_for_each_marked iterator.
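
A minimal sketch of the new iteration (the mark and field names are
illustrative, not necessarily those used in the patch):

	XA_STATE(xas, &mp->m_perags, 0);
	struct xfs_perag	*pag;
	long			reclaimable = 0;

	rcu_read_lock();
	xas_for_each_marked(&xas, pag, ULONG_MAX, XFS_PERAG_RECLAIM_MARK)
		reclaimable += pag->pag_ici_reclaimable;
	rcu_read_unlock();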

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:46 +05:30
Christoph Hellwig
32fa4059fe xfs: convert perag lookup to xarray
Convert the perag lookup from the legacy radix tree to the xarray,
which allows for much nicer iteration and bulk lookup semantics.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:46 +05:30
Christoph Hellwig
f9ffd095c8 xfs: simplify tagged perag iteration
Pass the old perag structure to the tagged loop helpers so that they can
grab the old agno before releasing the reference.  This removes the need
to separately track the agno and the iterator macro, and thus also
obsoletes the for_each_perag_tag syntactic sugar.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:44 +05:30
Christoph Hellwig
f48f0a8e00 xfs: move the tagged perag lookup helpers to xfs_icache.c
The tagged perag helpers are only used in xfs_icache.c in the kernel code
and not at all in xfsprogs.  Move them to xfs_icache.c in preparation for
switching to an xarray, for which I have no plan to implement the tagged
lookup functions for userspace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:43 +05:30
Christoph Hellwig
4ef7c6d39d xfs: use kfree_rcu_mightsleep to free the perag structures
Using kfree_rcu_mightsleep is simpler and removes the need for an
rcu_head in the perag structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:43 +05:30
Hongbo Li
70045dafdf xfs: use LIST_HEAD() to simplify code
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD().
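
For example:

	/* before */
	struct list_head	buffer_list;

	INIT_LIST_HEAD(&buffer_list);

	/* after */
	LIST_HEAD(buffer_list);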

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:42 +05:30
Jiapeng Chong
9db384feea xfs: Remove duplicate xfs_trans_priv.h header
./fs/xfs/libxfs/xfs_defer.c: xfs_trans_priv.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9491
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:41 +05:30
Dan Carpenter
fb8b941c75 xfs: remove unnecessary check
We checked that "pip" is non-NULL at the start of the if else statement
so there is no need to check again here.  Delete the check.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:40 +05:30
John Garry
ca57120dfe xfs: Use xfs set and clear mp state helpers
Use the set and clear mp state helpers instead of open-coding.

It is noted that in some instances calls to the atomic operations set_bit()
and clear_bit() are being replaced with test_and_set_bit() and
test_and_clear_bit(), respectively, as there are no specific helpers for
set_bit() and clear_bit() only. However this should be ok, as we are just
ignoring the value returned from those "test" variants.
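
A sketch of what one such conversion looks like (flag and helper names
assumed, following the usual xfs opstate pattern):

	/* open-coded */
	set_bit(XFS_OPSTATE_INODE32, &mp->m_opstate);

	/* helper, implemented with test_and_set_bit(); return value ignored */
	xfs_set_inode32(mp);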

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:39 +05:30
Christoph Hellwig
9372dce08b xfs: reclaim speculative preallocations for append only files
The XFS XFS_DIFLAG_APPEND maps to the VFS S_APPEND flag, which forbids
writes that don't append at the current EOF.

But the commit originally adding XFS_DIFLAG_APPEND support (commit
a23321e766d in the xfs-import repository) also checked it to skip
releasing speculative preallocations, which doesn't make any sense.

Another commit (dd9f438e32 in the xfs-import repository) later extended
that flag to also report these speculative preallocations which should
not exist in getbmap.

Remove these checks, as nothing about XFS_DIFLAG_APPEND implies that
preallocations beyond EOF should exist, but explicitly check for
XFS_DIFLAG_APPEND in xfs_file_release to bypass the algorithm that
discards preallocations on the first close, as append-only files aren't
expected to be written to only once.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:39 +05:30
Christoph Hellwig
11f4c3a53a xfs: simplify extent lookup in xfs_can_free_eofblocks
xfs_can_free_eofblocks just cares if there is an extent beyond EOF.
Replace the call to xfs_bmapi_read with an xfs_iext_lookup_extent call,
as we've already checked earlier that the extents have been read in.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:38 +05:30
Christoph Hellwig
b717089efe xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks
If the XFS_EOFBLOCKS_RELEASED flag is set, we are not going to free the
eofblocks, so don't bother locking the inode or performing the checks in
xfs_can_free_eofblocks.  Also switch to a test_and_set operation once
the iolock has been acquired so that only the caller that sets it actually
frees the post-EOF blocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:38 +05:30
Darrick J. Wong
f1204d9645 xfs: only free posteof blocks on first close
Certain workloads fragment files on XFS very badly, such as a software
package that creates a number of threads, each of which repeatedly runs
the sequence: open a file, perform a synchronous write, and close the
file, which defeats the speculative preallocation mechanism.  We work
around this problem by only deleting posteof blocks the /first/ time a
file is closed to preserve the behavior that unpacking a tarball lays
out files one after the other with no gaps.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[hch: rebased, updated comment, renamed the flag]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:38 +05:30
Dave Chinner
816e3599ca xfs: don't free post-EOF blocks on read close
When we have a workload that does open/read/close in parallel with other
allocation, the file becomes rapidly fragmented. This is due to close()
calling xfs_file_release() and removing the speculative preallocation
beyond EOF.

Add a check for a writable context to xfs_file_release to skip the
post-EOF block freeing (and the similarly pointless flushing on truncate
down).

Before:

Test 1: sync write fragmentation counts

/mnt/scratch/file.0: 919
/mnt/scratch/file.1: 916
/mnt/scratch/file.2: 919
/mnt/scratch/file.3: 920
/mnt/scratch/file.4: 920
/mnt/scratch/file.5: 921
/mnt/scratch/file.6: 916
/mnt/scratch/file.7: 918

After:

Test 1: sync write fragmentation counts

/mnt/scratch/file.0: 24
/mnt/scratch/file.1: 24
/mnt/scratch/file.2: 11
/mnt/scratch/file.3: 24
/mnt/scratch/file.4: 3
/mnt/scratch/file.5: 24
/mnt/scratch/file.6: 24
/mnt/scratch/file.7: 23

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: wordsmithing, fix commit message]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[hch: ported to the new ->release code structure]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:38 +05:30
Christoph Hellwig
c741d79c1a xfs: skip all of xfs_file_release when shut down
There is no point in trying to free post-EOF blocks when the file system
is shut down, as it will just error out ASAP.  Instead return instantly
when xfs_file_release is called on a shut down file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:38 +05:30
Christoph Hellwig
98e44e2bc0 xfs: don't bother returning errors from xfs_file_release
While ->release returns int, the only caller ignores the return value.
As we're only doing cleanup work there isn't much of a point in
returning a value to start with, so just document the situation instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:38 +05:30
Christoph Hellwig
5d3ca62611 xfs: refactor f_op->release handling
Currently f_op->release is split in not very obvious ways.  Fix that by
folding xfs_release into xfs_file_release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:37 +05:30
Christoph Hellwig
6e13dbebd5 xfs: remove the i_mode check in xfs_release
xfs_release is only called from xfs_file_release, which is wired up as
the f_op->release handler for regular files only.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03 10:07:37 +05:30
Pankaj Raghav
cebf9dacd5 xfs: make the calculation generic in xfs_sb_validate_fsb_count()
Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
make the calculation generic so that the page cache count can be
calculated correctly for LBS.
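
A sketch of the idea (the exact form in the patch may differ):

	/* old: the shift underflows when sb_blocklog > PAGE_SHIFT */
	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
		return -EFBIG;

	/* generic: convert blocks to bytes first, then apply the same limit */
	uint64_t	max_bytes;

	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
		return -EFBIG;
	if (max_bytes >> PAGE_SHIFT > ULONG_MAX)
		return -EFBIG;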

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20240822135018.1931258-10-kernel@pankajraghav.com
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-02 16:19:44 +02:00
Pankaj Raghav
79012cfa00 xfs: expose block size in stat
For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.
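
A sketch of the resulting logic in the stat path (assuming the existing
helper that fills in st_blksize):

	/* report the larger of the page size and the fs block size */
	stat->blksize = max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);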

This change is based on a patch originally from Dave Chinner.[1]

[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20240822135018.1931258-9-kernel@pankajraghav.com
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-02 16:19:44 +02:00
Dave Chinner
de631e1a8b xfs: use kvmalloc for xattr buffers
Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:

static inline
struct page *rmqueue(struct zone *preferred_zone,
                        struct zone *zone, unsigned int order,
                        gfp_t gfp_flags, unsigned int alloc_flags,
                        int migratetype)
{
        struct page *page;

        /*
         * We most definitely don't want callers attempting to
         * allocate greater than order-1 page units with __GFP_NOFAIL.
         */
>>>>    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...

Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().
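
Schematically, each call site changes along these lines (variable names
illustrative):

	/* before: may be a high-order __GFP_NOFAIL kmalloc when bs > ps */
	args->value = kmalloc(valuelen, GFP_KERNEL | __GFP_NOFAIL);

	/* after: large requests transparently fall back to vmalloc */
	args->value = kvmalloc(valuelen, GFP_KERNEL | __GFP_NOFAIL);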

This is not an issue that production systems will see, as
filesystems with block size > page size cannot be mounted by the
kernel; Pankaj is developing this functionality right now.

Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea82 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20240822135018.1931258-8-kernel@pankajraghav.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-02 16:19:44 +02:00
Danilo Krummrich
590b9d576c mm: kvmalloc: align kvrealloc() with krealloc()
Besides the obvious (and desired) difference between krealloc() and
kvrealloc(), there is some inconsistency in their function signatures and
behavior:

 - krealloc() frees the memory when the requested size is zero, whereas
   kvrealloc() simply returns a pointer to the existing allocation.

 - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas
   kvrealloc() does not accept a NULL pointer at all and, if passed,
   would fault instead.

 - krealloc() is self-contained, whereas kvrealloc() relies on the caller
   to provide the size of the previous allocation.

Inconsistent behavior throughout allocation APIs is error prone, hence
make kvrealloc() behave like krealloc(), which seems superior in all
mentioned aspects.

Besides that, implementing kvrealloc() by making use of krealloc() and
vrealloc() provides opportunities to grow (and shrink) allocations more
efficiently.  For instance, vrealloc() can be optimized to allocate and
map additional pages to grow the allocation or unmap and free unused pages
to shrink the allocation.
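
In terms of the calling convention, the change looks roughly like this:

	/* old: the caller had to track the previous allocation size */
	buf = kvrealloc(buf, old_size, new_size, GFP_KERNEL);

	/* new: mirrors krealloc(); a NULL input behaves like kvmalloc() */
	buf = kvrealloc(buf, new_size, GFP_KERNEL);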

[dakr@kernel.org: document concurrency restrictions]
  Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org
[dakr@kernel.org: disable KASAN when switching to vmalloc]
  Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org
[dakr@kernel.org: properly document __GFP_ZERO behavior]
  Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org
Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:44 -07:00
Darrick J. Wong
411a71256d xfs: standardize the btree maxrecs function parameters
Standardize the parameters in xfs_{alloc,bm,ino,rmap,refcount}bt_maxrecs
so that we have consistent calling conventions.  This doesn't affect the
kernel that much, but enables us to clean up userspace a bit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:20 -07:00
Darrick J. Wong
79124b3740 xfs: replace shouty XFS_BM{BT,DR} macros
Replace all the shouty bmap btree and bmap disk root macros with actual
functions.

sed \
 -e 's/XFS_BMBT_BLOCK_LEN/xfs_bmbt_block_len/g' \
 -e 's/XFS_BMBT_REC_ADDR/xfs_bmbt_rec_addr/g' \
 -e 's/XFS_BMBT_KEY_ADDR/xfs_bmbt_key_addr/g' \
 -e 's/XFS_BMBT_PTR_ADDR/xfs_bmbt_ptr_addr/g' \
 -e 's/XFS_BMDR_REC_ADDR/xfs_bmdr_rec_addr/g' \
 -e 's/XFS_BMDR_KEY_ADDR/xfs_bmdr_key_addr/g' \
 -e 's/XFS_BMDR_PTR_ADDR/xfs_bmdr_ptr_addr/g' \
 -e 's/XFS_BMAP_BROOT_PTR_ADDR/xfs_bmap_broot_ptr_addr/g' \
 -e 's/XFS_BMAP_BROOT_SPACE_CALC/xfs_bmap_broot_space_calc/g' \
 -e 's/XFS_BMAP_BROOT_SPACE/xfs_bmap_broot_space/g' \
 -e 's/XFS_BMDR_SPACE_CALC/xfs_bmdr_space_calc/g' \
 -e 's/XFS_BMAP_BMDR_SPACE/xfs_bmap_bmdr_space/g' \
 -i $(git ls-files fs/xfs/*.[ch] fs/xfs/libxfs/*.[ch] fs/xfs/scrub/*.[ch])

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:20 -07:00
Darrick J. Wong
de55149b66 xfs: fix a sloppy memory handling bug in xfs_iroot_realloc
While refactoring code, I noticed that when xfs_iroot_realloc tries to
shrink a bmbt root block, it allocates a smaller new block and then
copies "records" and pointers to the new block.  However, bmbt root
blocks cannot ever be leaves, which means that it's not technically
correct to copy records.  We /should/ be copying keys.

Note that this has never resulted in actual memory corruption because
sizeof(bmbt_rec) == (sizeof(bmbt_key) + sizeof(bmbt_ptr)).  However,
this will no longer be true when we start adding realtime rmap stuff,
so fix this now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:20 -07:00
Darrick J. Wong
c460f0f1a2 xfs: fix FITRIM reporting again
Don't report FITRIMming more bytes than possibly exist in the
filesystem.

Fixes: 410e8a18f8 ("xfs: don't bother reporting blocks trimmed via FITRIM")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:20 -07:00
Darrick J. Wong
64dfa18d6e xfs: fix C++ compilation errors in xfs_fs.h
Several people reported C++ compilation errors due to things that C
compilers allow but C++ compilers do not.  Fix both of these problems,
and hope there aren't more of these brown paper bags in 2 months when we
finally get these fixes through the process into a released xfsprogs.

NOTE: I am submitting this bugfix over the objections of a former
maintainer, who insists that we should remove this function from the
published userspace ABI instead of fixing the C++ compilation errors.
No deprecation period, no discussion, just a hard drop of an already
provided and correct C function, which would be in contravention of
Linus' rules.  IOWs, removing ABI that have already shipped in a
released kernel requires a careful deprecation period, so I will let
that maintainer run that process.

Reported-by: kernel@mattwhitlock.name
Reported-by: sam@gentoo.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219203
Fixes: 233f4e12bb ("xfs: add parent pointer ioctls")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:20 -07:00
Darrick J. Wong
2c4162be6c xfs: refactor loading quota inodes in the regular case
Create a helper function to load quota inodes in the case where the
dqtype and the sb quota inode fields correspond.  This is true for
nearly all the iget callsites in the quota code, except for when we're
switching the group and project quota inodes.  We'll need this in
subsequent patches to make the metadir handling less convoluted.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:20 -07:00
Darrick J. Wong
2ca7b9d7b8 xfs: move xfs_ioc_getfsmap out of xfs_ioctl.c
Move this function out of xfs_ioctl.c to reduce the clutter in there,
and make the entire getfsmap implementation self-contained in a single
file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
516f91035c xfs: rearrange xfs_fsmap.c a little bit
The order of the functions in this file has gotten a little confusing
over the years.  Specifically, the two data device implementations
(bnobt and rmapbt) could be adjacent in the source code instead of split
in two by the logdev and rtdev fsmap implementations.  We're about to
add more functionality to this file, so rearrange things now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
33912286cb xfs: replace m_rsumsize with m_rsumblocks
Track the RT summary file size in blocks, just like the RT bitmap
file.  While we have users of both units, blocks are used slightly
more often and this matches the bitmap file for consistency.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
1fc51cf11d xfs: remove xfs_{rtbitmap,rtsummary}_wordcount
xfs_rtbitmap_wordcount and xfs_rtsummary_wordcount are currently unused,
so remove them to simplify refactoring other rtbitmap helpers.  They
can be added back or simply open coded when actually needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
0902819fe6 xfs: add xchk_setup_nothing and xchk_nothing helpers
Add common helpers for no-op scrubbing methods.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
ec12f97f1b xfs: make the rtalloc start hint a xfs_rtblock_t
0 is a valid start RT extent, and with pending changes it will become
both more common and non-unique.  Switch to passing an xfs_rtblock_t instead
so that we can use NULLRTBLOCK to determine if a hint was set or not.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
b2dd85f414 xfs: factor out a xfs_rtallocate_align helper
Split the code to calculate the aligned allocation request from
xfs_bmap_rtalloc into a separate self-contained helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
fd048a1bb3 xfs: rework the rtalloc fallback handling
xfs_rtallocate currently has two fallbacks, when an allocation fails:

 1) drop the requested extent size alignment, if any, and retry
 2) ignore the locality hint

Oddly enough it does those in that order, given that trying a different
location is more in line with what the user asked for, and it does so in
a very unstructured way.

Lift the fallback to try to allocate without the locality hint into
xfs_rtallocate to both perform them in a more sensible order and to
clean up the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
a9f646af43 xfs: factor out a xfs_rtallocate helper
Split out a helper from xfs_rtallocate that performs the actual
allocation.  This keeps the scope of the xfs_rtalloc_args structure
contained, and prepares for rtgroups support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
1e21d1897f xfs: clean up the ISVALID macro in xfs_bmap_adjacent
Turn the ISVALID macro, defined and used inside xfs_bmap_adjacent and
relying on implicit context, into a proper inline function.
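
A sketch of the transformation (the new function name and exact body are
assumptions):

	/* before: the macro silently picked up "mp" and "rt" from the caller */
	/* #define ISVALID(x, y)	... */

	static inline bool
	xfs_bmap_adjacent_valid(struct xfs_mount *mp, bool rt, xfs_fsblock_t x,
				xfs_fsblock_t y)
	{
		if (rt)
			return x < mp->m_sb.sb_rblocks;
		return XFS_FSB_TO_AGNO(mp, x) == XFS_FSB_TO_AGNO(mp, y) &&
		       XFS_FSB_TO_AGNO(mp, x) < mp->m_sb.sb_agcount &&
		       XFS_FSB_TO_AGBNO(mp, x) < mp->m_sb.sb_agblocks;
	}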

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
df8b181f15 xfs: simplify xfs_rtalloc_query_range
There isn't much of a good reason to pass the xfs_rtalloc_rec structures
that describe extents to xfs_rtalloc_query_range as we really just want
a lower and upper bound xfs_rtxnum_t.  Pass the rtxnum directly and
simplify the interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
fa0fc38b25 xfs: remove xfs_rtb_to_rtxrem
Simplify the number of block number conversion helpers by removing
xfs_rtb_to_rtxrem.  Any recent compiler is smart enough to eliminate
the double divisions if using separate xfs_rtb_to_rtx and
xfs_rtb_to_rtxoff calls.
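
That is, callers simply use the two existing conversions and let the
compiler fuse the divide and modulo:

	rtx    = xfs_rtb_to_rtx(mp, rtbno);	/* rtbno / rextsize */
	rtxoff = xfs_rtb_to_rtxoff(mp, rtbno);	/* rtbno % rextsize */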

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
9e9be9840f xfs: fix broken variable-sized allocation detection in xfs_rtallocate_extent_block
This function tries to find a suitable free space extent starting from
a particular rtbitmap block.  Some time ago, I added a clamping function
to prevent the free space scans from running off the end of the bitmap,
but I didn't quite get the logic right.

Let's say there's an allocation request with a minlen of 5 and a maxlen
of 32 and we're scanning the last rtbitmap block.  If we come within 4
rtx of the end of the rt volume, maxlen will get clamped to 4.  If the
next 3 rtx are free, we could have satisfied the allocation, but the
code setting partial besti/bestlen for "minlen < maxlen" will think that
we're doing a non-variable allocation and ignore it.

The root of this problem is overwriting maxlen; I should have stuffed
the results in a different variable, which would not have introduced
this bug.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
74c234bbe5 xfs: reduce excessive clamping of maxlen in xfs_rtallocate_extent_near
The near rt allocator employs two allocation strategies -- first it
tries to allocate at exactly @start.  If that fails, it will pivot back
and forth around that starting point looking for an appropriately sized
free space.

However, I clamped maxlen ages ago to prevent the exact allocation scan
from running off the end of the rt volume.  This, I realize, was
excessive.  If the allocation request is (say) for 32 rtx but the start
position is 5 rtx from the end of the volume, we clamp maxlen to 5.  If
the exact allocation fails, we then pivot back and forth looking for 5
rtx, even though the original intent was to try to get 32 rtx.

If we then find 5 rtx when we could have gotten 32 rtx, we've not done
as well as we could have.  This may be moot if the caller immediately
comes back for more space, but it might not be.  Either way, we can do
better here.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
62c3d24968 xfs: clean up xfs_rtallocate_extent_exact a bit
Before we start doing more surgery on the rt allocator, let's clean up
the exact allocator so that it doesn't change its arguments and uses the
helper introduced in the previous patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
e6a74dcf9b xfs: refactor aligning bestlen to prod
There are two places in xfs_rtalloc.c where we want to make sure that a
count of rt extents is aligned with a particular prod(uct) factor.  In
one spot, we actually use rounddown(), albeit unnecessarily if prod < 2.
In the other case, we open-code this rounding inefficiently by promoting
the 32-bit length value to a 64-bit value and then performing a 64-bit
division to figure out the subtraction.

Refactor this into a single helper that uses the correct types and
division method for the type, and skips the division entirely unless
prod is large enough to make a difference.
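
The shared helper boils down to something like this (name hypothetical):

	static inline xfs_rtxlen_t
	xfs_rtalloc_align_len(xfs_rtxlen_t len, xfs_extlen_t prod)
	{
		/* only pay for the division when prod can change the result */
		if (prod > 1)
			len -= len % prod;
		return len;
	}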

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
e99aa0401e xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block
The loop conditional here is not quite correct because an rtbitmap block
can represent rtextents beyond the end of the rt volume.  There's no way
that it makes sense to scan for free space beyond EOFS, so don't do it.
This overrun has been present since v2.6.0.

Also fix the type of bestlen, which was incorrectly converted.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
cb59233e82 xfs: don't return too-short extents from xfs_rtallocate_extent_block
If xfs_rtallocate_extent_block is asked for a variable-sized allocation,
it will try to return the best-sized free extent, which is apparently
the largest one that it finds starting in this rtbitmap block.  It will
then trim the size of the extent as needed to align it with prod.

However, it misses one thing -- rounding down the best-fit candidate to
the required alignment could make the extent shorter than minlen.  In
the case where minlen > 1, we'd rather the caller relaxed its alignment
requirements and tried again, as the allocator already supports that.

Returning a too-short extent causes xfs_bmapi_write to return
ENOSR if there aren't enough nmaps to handle multiple new allocations,
which can then cause filesystem shutdowns.

I haven't seen this happen on any production systems, but then I don't
think it's very common to set a per-file extent size hint on realtime
files.  I tripped it while working on the rtgroups feature and pounding
on the realtime allocator enthusiastically.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
86a0264ef2 xfs: ensure rtx mask/shift are correct after growfs
When growfs sets an extent size, it doesn't update the m_rtxblklog and
m_rtxblkmask values, which could lead to incorrect usage of them if they
were set before and can't be used for the new extent size.

Add a xfs_mount_sb_set_rextsize helper that updates the two fields, and
also use it when calculating the new RT geometry instead of disabling
the optimization there.
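
The helper essentially recomputes the power-of-two optimization fields
whenever sb_rextsize changes; a sketch (the real code may use existing
log2/mask helpers):

	void
	xfs_mount_sb_set_rextsize(struct xfs_mount *mp, struct xfs_sb *sbp)
	{
		mp->m_rtxblklog = is_power_of_2(sbp->sb_rextsize) ?
					ilog2(sbp->sb_rextsize) : -1;
		mp->m_rtxblkmask = is_power_of_2(sbp->sb_rextsize) ?
					sbp->sb_rextsize - 1 : (uint64_t)-1;
	}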

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
a18a69bbec xfs: use the recalculated transaction reservation in xfs_growfs_rt_bmblock
After going to great lengths to calculate the transaction reservation for
the new geometry, we should also use it to allocate the transaction it
was calculated for.

Fixes: 578bd4ce71 ("xfs: recompute growfsrtfree transaction reservation while growing rt volume")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
0a59e4f3e1 xfs: push transaction join out of xfs_rtbitmap_lock and xfs_rtgroup_lock
To prepare for being able to join an already locked rtbitmap inode to a
transaction, split out separate helpers for joining the transaction from
the locking helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
2a95ffc44b xfs: factor out rtbitmap/summary initialization helpers
Add helpers to libxfs that can be shared by growfs and mkfs for
initializing the rtbitmap and summary, and by passing the optional data
pointer also by repair for rebuilding them.  This will become even more
useful when the rtgroups feature adds a metadata header to each block,
which means even more shared code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: minor documentation and data advance tweaks]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
266e78aec4 xfs: factor out a xfs_last_rt_bmblock helper
Add a helper to calculate the last currently used rt bitmap block to
better structure the growfs code and prepare for future changes to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
7996f10ce6 xfs: factor out a xfs_growfs_rt_bmblock helper
Add a helper to contain the per-rtbitmap block logic in xfs_growfs_rt.

Note that this helper now allocates a new fake mount structure for
each rtbitmap block iteration instead of reusing the memory for an
entire growfs call.  Compared to all the other work done when freeing
the blocks the overhead for this is in the noise and it keeps the code
nicely modular.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
c8e5a0bfe0 xfs: push the calls to xfs_rtallocate_range out to xfs_bmap_rtalloc
Currently the various low-level RT allocator functions call into
xfs_rtallocate_range directly, which ties them into the locking protocol
for the RT bitmap.  As these helpers already return the allocated range,
lift the call to xfs_rtallocate_range into xfs_bmap_rtalloc so that it
happens as high as possible in the stack, which will simplify future
changes to the locking protocol.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
237130564e xfs: cleanup the calling convention for xfs_rtpick_extent
xfs_rtpick_extent never returns an error.  Do away with the error return
and directly return the picked extent instead of doing that through a
call by reference argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
b4781eea68 xfs: add bounds checking to xfs_rt{bitmap,summary}_read_buf
Add a corruption check for passing an invalid block number, which is a
lot easier to understand than the xfs_bmapi_read failure later on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
6d2db12d56 xfs: assert a valid limit in xfs_rtfind_forw
Protect against developers passing stupid limits when refactoring the
RT code once again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
119c65e56b xfs: remove the limit argument to xfs_rtfind_back
All callers pass a 0 limit to xfs_rtfind_back, so remove the argument
and hard code it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
3cb30d5162 xfs: make the RT rsum_cache mandatory
Currently the RT mount code simply ignores an allocation failure for the
rsum_cache.  The code mostly works fine with it, but not having it leads
to nasty corner cases in the growfs code that we don't really handle
well.  Switch to failing the mount if we can't allocate the memory; the
file system would not exactly be useful in such a constrained environment
to start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
6529eef810 xfs: factor out a xfs_validate_rt_geometry helper
Split the RT geometry validation in the early mount code into a
helper that can be reused by repair (from which this code was
apparently originally stolen anyway).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: u64 return value for calc_rbmblocks]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
021d9c107e xfs: remove xfs_validate_rtextents
Replace xfs_validate_rtextents with an open coded check for 0
rtextents.  The name for the function implies it does a lot more
than a zero check, which is more obvious when open coded.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
390b4775d6 xfs: pass the icreate args object to xfs_dialloc
Pass the xfs_icreate_args object to xfs_dialloc since we can extract the
relevant mode (really just the file type) and parent inumber from there.
This simplifies the calling convention in preparation for the next
patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
feb09b727b xfs: match on the global RT inode numbers in xfs_is_metadata_inode
Match the inode number instead of the inode pointers, as the inode
pointers in the superblock will go away soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: port to my tree, make the parameter a const pointer]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
05aba1953f xfs: validate inumber in xfs_iget
Actually use the inumber validator to check the argument passed in here.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2024-09-01 08:58:19 -07:00
Darrick J. Wong
398597c3ef xfs: introduce new file range commit ioctls
This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point.  The start-commit ioctl performs the sampling
of file attributes.

Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE.  This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse grained
granularity and very fast updates could collide with a COMMIT_RANGE.
With the multi-granularity ctime introduced by Jeff Layton, it's now
possible to update ctime such that this does not happen.

It is critical, then, that this patch must not be backported to any
kernel that does not support fine-grained file change timestamps.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01 08:58:19 -07:00
Christoph Hellwig
4acaddf5d1 xfs: refactor xfs_file_fallocate
Refactor xfs_file_fallocate into separate helpers for each mode, two
factored-out helpers for the i_size handling, and a single switch
statement over the supported modes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240827065123.1762168-7-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 16:53:58 +02:00
Christoph Hellwig
72f4d52570 xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_space
Move the xfs_is_always_cow_inode check from the caller into
xfs_alloc_file_space to prepare for refactoring of xfs_file_fallocate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240827065123.1762168-6-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 16:53:58 +02:00
Christoph Hellwig
1df1d3b2dc xfs: call xfs_flush_unmap_range from xfs_free_file_space
Call xfs_flush_unmap_range from xfs_free_file_space so that
xfs_file_fallocate doesn't have to predict which mode will call it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240827065123.1762168-5-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 16:53:58 +02:00
Darrick J. Wong
a24cae8fc1 xfs: reset rootdir extent size hint after growfsrt
If growfsrt is run on a filesystem that doesn't have a rt volume, it's
possible to change the rt extent size.  If the root directory was
previously set up with an inherited extent size hint and rtinherit, it's
possible that the hint is no longer a multiple of the rt extent size.
Although the verifiers don't complain about this, xfs_repair will, so if
we detect this situation, log the root directory to clean it up.  This
is still racy, but it's better than nothing.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-27 18:32:14 +05:30
Darrick J. Wong
16e1fbdce9 xfs: take m_growlock when running growfsrt
Take the grow lock when we're expanding the realtime volume, like we do
for the other growfs calls.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-27 18:32:14 +05:30
Zizhi Wo
ca6448aed4 xfs: Fix missing interval for missing_owner in xfs fsmap
In the fsmap query of xfs, there is a missing interval problem:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt
 EXT: DEV    BLOCK-RANGE           OWNER              FILE-OFFSET      AG AG-OFFSET             TOTAL
   0: 253:16 [0..7]:               static fs metadata                  0  (0..7)                    8
   1: 253:16 [8..23]:              per-AG metadata                     0  (8..23)                  16
   2: 253:16 [24..39]:             inode btree                         0  (24..39)                 16
   3: 253:16 [40..47]:             per-AG metadata                     0  (40..47)                  8
   4: 253:16 [48..55]:             refcount btree                      0  (48..55)                  8
   5: 253:16 [56..103]:            per-AG metadata                     0  (56..103)                48
   6: 253:16 [104..127]:           free space                          0  (104..127)               24
   ......

BUG:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt
[root@fedora ~]#
Normally, we should be able to get [104, 107), but we got nothing.

The problem is caused by shifting. The query for the problem-triggered
scenario is for the missing_owner interval (e.g. freespace in rmapbt/
unknown space in bnobt), which is obtained by subtraction (gap). For this
scenario, the interval is obtained by info->last. However, rec_daddr is
calculated based on the start_block recorded in key[1], which is converted
by calling XFS_BB_TO_FSBT. Then if rec_daddr does not exceed
info->next_daddr, which means keys[1].fmr_physical >> (mp)->m_blkbb_log
<= info->next_daddr, no records will be displayed. In the above example,
104 >> (mp)->m_blkbb_log = 12 and 107 >> (mp)->m_blkbb_log = 12, so the two
are reduced to 0 and the gap is ignored:

 before calculate ----------------> after shifting
 104(st)  107(ed)		      12(st/ed)
  |---------|				  |
  sector size			      block size

Resolve this issue by introducing the "end_daddr" field in
xfs_getfsmap_info. This records |key[1].fmr_physical + key[1].length| at
sector granularity. If the current query is the last, the rec_daddr is set
to end_daddr to prevent missing-interval problems caused by shifting. We
only need to focus on the last query, because xfs devices are internally
aligned to disk block sizes that are powers of two and at least 512 bytes,
so there is no problem with shifting in the previous queries.
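
A sketch of the final-query handling (illustrative; the sentinel check is
an assumption):

	/* use the sector-granular end of the query range for the last gap
	 * record so that sub-block ranges are not rounded away */
	if (info->last && info->end_daddr != XFS_BUF_DADDR_NULL)
		rec_daddr = info->end_daddr;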

After applying this patch, the above problem has been solved:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt
 EXT: DEV    BLOCK-RANGE      OWNER            FILE-OFFSET      AG AG-OFFSET        TOTAL
   0: 253:16 [104..106]:      free space                        0  (104..106)           3

Fixes: e89c041338 ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: limit the range of end_addr correctly]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-27 18:32:14 +05:30
Darrick J. Wong
6b35cc8d92 xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code
Use XFS_BUF_DADDR_NULL (instead of a magic sentinel value) to mean "this
field is null" like the rest of xfs.

Cc: wozizhi@huawei.com
Fixes: e89c041338 ("xfs: implement the GETFSMAP ioctl")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-27 18:32:08 +05:30
Zizhi Wo
68415b349f xfs: Fix the owner setting issue for rmap query in xfs fsmap
I noticed an rmap query bug in xfs_io fsmap:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt
 EXT: DEV    BLOCK-RANGE           OWNER              FILE-OFFSET      AG AG-OFFSET             TOTAL
   0: 253:16 [0..7]:               static fs metadata                  0  (0..7)                    8
   1: 253:16 [8..23]:              per-AG metadata                     0  (8..23)                  16
   2: 253:16 [24..39]:             inode btree                         0  (24..39)                 16
   3: 253:16 [40..47]:             per-AG metadata                     0  (40..47)                  8
   4: 253:16 [48..55]:             refcount btree                      0  (48..55)                  8
   5: 253:16 [56..103]:            per-AG metadata                     0  (56..103)                48
   6: 253:16 [104..127]:           free space                          0  (104..127)               24
   ......

Bug:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt
[root@fedora ~]#
Normally, we should be able to get one record, but we got nothing.

The root cause of this problem lies in the incorrect setting of rm_owner in
the rmap query. In the case of the initial query where the owner is not
set, __xfs_getfsmap_datadev() first sets info->high.rm_owner to ULLONG_MAX.
This is done to prevent any omissions when comparing rmap items. However,
if the current ag is detected to be the last one, the function sets info's
high_irec based on the provided key. If high->rm_owner is not specified, it
should continue to be set to ULLONG_MAX; otherwise, there will be issues
with interval omissions. For example, consider "start" and "end" within the
same block. If high->rm_owner == 0, it will be smaller than the founded
record in rmapbt, resulting in a query with no records. The main call stack
is as follows:

xfs_ioc_getfsmap
  xfs_getfsmap
    xfs_getfsmap_datadev_rmapbt
      __xfs_getfsmap_datadev
        info->high.rm_owner = ULLONG_MAX
        if (pag->pag_agno == end_ag)
	  xfs_fsmap_owner_to_rmap
	    // set info->high.rm_owner = 0 because fmr_owner == -1ULL
	    dest->rm_owner = 0
	// get nothing
	xfs_getfsmap_datadev_rmapbt_query

The problem can be resolved by simply modifying the internal logic of the
xfs_fsmap_owner_to_rmap function.

After applying this patch, the above problem has been solved:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt
 EXT: DEV    BLOCK-RANGE      OWNER              FILE-OFFSET      AG AG-OFFSET        TOTAL
   0: 253:16 [0..7]:          static fs metadata                  0  (0..7)               8

Fixes: e89c041338 ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-26 09:52:27 +05:30
Darrick J. Wong
410e8a18f8 xfs: don't bother reporting blocks trimmed via FITRIM
Don't bother reporting the number of bytes that we "trimmed" because the
underlying storage isn't required to do anything(!) and failed discard
IOs aren't reported to the caller anyway.  It's not like userspace can
use the reported value for anything useful like adjusting the offset
parameter of the next call, and it's not like anyone ever wrote a
manpage about FITRIM's out parameters.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-26 09:52:13 +05:30
Dave Chinner
95179935be xfs: xfs_finobt_count_blocks() walks the wrong btree
As a result of the factoring in commit 14dd46cf31 ("xfs: split
xfs_inobt_init_cursor"), mount started taking a long time on a
user's filesystem.  For Anders, this made mount times regress from
under a second to over 15 minutes for a filesystem with only 30
million inodes in it.

Anders bisected it down to the above commit, but even then the bug
was not obvious. In this commit, over 20 calls to
xfs_inobt_init_cursor() were modified, and some were modified to call
a new function named xfs_finobt_init_cursor().

If that takes you a moment to reread those function names to see
what the rename was, then you have realised why this bug wasn't
spotted during review. And it wasn't spotted on inspection even
after the bisect pointed at this commit - a single missing "f" isn't
the easiest thing for a human eye to notice....

The result is that xfs_finobt_count_blocks() now incorrectly calls
xfs_inobt_init_cursor(), so it is now walking the inobt instead of
the finobt. Hence when there are lots of allocated inodes in a
filesystem, mount takes a -long- time to run because it now walks the
massive allocated inode btrees instead of the small, nearly empty
free inode btrees. It also means all the finobt space reservations
are wrong, so mount could potentially give ENOSPC on kernel
upgrade.

In hindsight, commit 14dd46cf31 should have been two commits - the
first to convert the finobt callers to the new API, the second to
modify the xfs_inobt_init_cursor() API for the inobt callers. That
would have made the bug very obvious during review.

Fixes: 14dd46cf31 ("xfs: split xfs_inobt_init_cursor")
Reported-by: Anders Blomdell <anders.blomdell@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-26 09:52:00 +05:30
Darrick J. Wong
5335affcff xfs: fix folio dirtying for XFILE_ALLOC callers
willy pointed out that folio_mark_dirty is the correct function to use
to mark an xfile folio dirty because it calls out to the mapping's aops
to mark it dirty.  For tmpfs this likely doesn't matter much since it
currently uses noop_dirty_folio, but let's use the abstractions properly.

Reported-by: willy@infradead.org
Fixes: 6907e3c00a ("xfs: add file_{get,put}_folio")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-26 09:51:27 +05:30
Darrick J. Wong
e21fea4ac3 xfs: fix di_onlink checking for V1/V2 inodes
"KjellR" complained on IRC that an old V4 filesystem suddenly stopped
mounting after upgrading from 6.9.11 to 6.10.3, with the following splat
when trying to read the rt bitmap inode:

00000000: 49 4e 80 00 01 02 00 01 00 00 00 00 00 00 00 00  IN..............
00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 00 00 00 00 00 00 00 00 43 d2 a9 da 21 0f d6 30  ........C...!..0
00000030: 43 d2 a9 da 21 0f d6 30 00 00 00 00 00 00 00 00  C...!..0........
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050: 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 00  ................
00000060: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

As Dave Chinner points out, this is a V1 inode with both di_onlink and
di_nlink set to 1 and di_flushiter == 0.  In other words, this inode was
formatted this way by mkfs and hasn't been touched since then.

Back in the old days of xfsprogs 3.2.3, I observed that libxfs_ialloc
would set di_nlink, but if the filesystem didn't have NLINK, it would
then set di_version = 1.  libxfs_iflush_int later sees the V1 inode and
copies the value of di_nlink to di_onlink without zeroing di_onlink.

Eventually this filesystem must have been upgraded to support NLINK
because 6.10 doesn't support !NLINK filesystems, which is how we tripped
over this old behavior.  The filesystem doesn't have a realtime section,
so that's why the rtbitmap inode has never been touched.

Fix this by removing the di_onlink/di_nlink checking for all V1/V2
inodes because this is a muddy mess.  The V3 inode handling code has
always supported NLINK and written di_onlink==0 so keep that check.
The removal of the V1 inode handling code when we dropped support for
!NLINK obscured this old behavior.

Reported-by: kjell.m.randa@gmail.com
Fixes: 40cb8613d6 ("xfs: check unused nlink fields in the ondisk inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-26 09:50:41 +05:30
Darrick J. Wong
8d16762047 xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set
If a file has the S_DAX flag (aka fsdax access mode) set, we cannot
allow users to change the realtime flag unless the datadev and rtdev
both support fsdax access modes.  Even if there are no extents allocated
to the file, the setattr thread could be racing with another thread
that has already started down the write code paths.

Fixes: ba23cba9b3 ("fs: allow per-device dax status checking for filesystems")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-14 21:20:24 +05:30
Darrick J. Wong
04d6dbb553 xfs: revert AIL TASK_KILLABLE threshold
In commit 9adf40249e, we changed the behavior of the AIL thread to
set its own task state to KILLABLE whenever the timeout value is
nonzero.  Unfortunately, this missed the fact that xfsaild_push will
return 50ms (aka a longish sleep) when we reach the push target or the
AIL becomes empty, so xfsaild goes to sleep for a long period of time in
uninterruptible D state.

This results in artificially high load averages because KILLABLE
processes are UNINTERRUPTIBLE, which contributes to load average even
though the AIL is asleep waiting for someone to interrupt it.  It's not
blocked on IOs or anything, but people scrape ps output for processes that look
like they're stuck in D state, so restore the previous threshold.
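
A small sketch of the restored policy (the 20ms cutoff below is illustrative, not a value taken from this commit): short back-offs sleep in KILLABLE state, while long idle sleeps, such as the 50ms timeout mentioned above, go back to interruptible sleep so they do not inflate the load average.

```c
#include <stdio.h>

enum sleep_state { TASK_INTERRUPTIBLE, TASK_KILLABLE };

/*
 * Sketch: a short back-off (the AIL is actively working and will be
 * woken again soon) may sleep KILLABLE; a long idle sleep (push target
 * reached or AIL empty) must be INTERRUPTIBLE so it does not count
 * toward the load average.
 */
static enum sleep_state xfsaild_sleep_state(unsigned int timeout_ms)
{
	if (timeout_ms && timeout_ms <= 20)
		return TASK_KILLABLE;
	return TASK_INTERRUPTIBLE;
}

int main(void)
{
	printf("10ms -> %s\n", xfsaild_sleep_state(10) == TASK_KILLABLE ?
	       "killable" : "interruptible");
	printf("50ms -> %s\n", xfsaild_sleep_state(50) == TASK_KILLABLE ?
	       "killable" : "interruptible");
	return 0;
}
```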

Fixes: 9adf40249e ("xfs: AIL doesn't need manual pushing")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-14 21:19:34 +05:30
Darrick J. Wong
73c34b0b85 xfs: attr forks require attr, not attr2
It turns out that I misunderstood the difference between the attr and
attr2 feature bits.  "attr" means that at some point an attr fork was
created somewhere in the filesystem.  "attr2" means that inodes have
variable-sized forks, but says nothing about whether or not there
actually /are/ attr forks in the system.

If we have an attr fork, we only need to check that attr is set.
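
A toy sketch of the corrected predicate (made-up names; the real check lives in the scrub block-mapping code): the presence of an attr fork only implies the "attr" feature bit, and says nothing about "attr2".

```c
#include <stdbool.h>
#include <stdio.h>

struct fs_features {
	bool attr;	/* an attr fork has existed somewhere in the fs */
	bool attr2;	/* inodes have variable-sized forks */
};

/*
 * Sketch: is it consistent for this inode to have an attr fork?
 * Only the "attr" bit matters; "attr2" does not tell us whether any
 * attr forks actually exist.
 */
static bool attr_fork_consistent(const struct fs_features *f, bool has_attr_fork)
{
	if (!has_attr_fork)
		return true;
	return f->attr;
}

int main(void)
{
	struct fs_features f = { .attr = true, .attr2 = false };

	printf("%d\n", attr_fork_consistent(&f, true));	/* 1: fine without attr2 */
	return 0;
}
```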

Fixes: 99d9d8d05d ("xfs: scrub inode block mappings")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-14 21:19:14 +05:30
Al Viro
88a2f6468d struct fd: representation change
We want the compiler to see that fdput() on empty instance
is a no-op.  The emptiness check is that file reference is NULL,
while fdput() is "fput() if FDPUT_FPUT is present in flags".
The reason why fdput() on empty instance is a no-op is something
compiler can't see - it's that we never generate instances with
NULL file reference combined with non-zero flags.

	It's not that hard to deal with - the real primitives behind
fdget() et al. are returning an unsigned long value, unpacked by (inlined)
__to_fd() into the current struct file * + int.  The lower bits are
used to store flags, while the rest encodes the pointer.  Linus suggested
that keeping this unsigned long around with the extractions done by inlined
accessors should generate sane code and that turns out to be the case.
Namely, turning struct fd into a struct-wrapped unsigned long, with
        fd_empty(f) => unlikely(f.word == 0)
	fd_file(f) => (struct file *)(f.word & ~3)
	fdput(f) => if (f.word & 1) fput(fd_file(f))
ends up with compiler doing the right thing.  The cost is the patch
footprint, of course - we need to switch f.file to fd_file(f) all over
the tree, and it's not doable with simple search and replace; there are
false positives, etc.

	Note that the sole member of that structure is an opaque
unsigned long - all accesses should be done via wrappers and I don't
want to use a name that would invite manual casts to file pointers,
etc.  The value of that member is equal either to (unsigned long)p | flags,
p being an address of some struct file instance, or to 0 for an empty fd.

	For now the new predicate (fd_empty(f)) has no users; all the
existing checks have form (!fd_file(f)).  We will convert to fd_empty()
use later; here we only define it (and tell the compiler that it's
unlikely to return true).

	This commit only deals with representation change; there will
be followups.
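
Spelled out as a self-contained userspace model of the representation described above (a toy, not the kernel's linux/file.h): the struct wraps a single word, the low bits carry the flags, and the accessors are the three expansions listed earlier.

```c
#include <stdio.h>

struct file;				/* opaque for this sketch */

struct fd { unsigned long word; };	/* pointer and flags packed together */

#define FDPUT_FPUT 1UL

static inline int fd_empty(struct fd f)
{
	return f.word == 0;		/* never a flag without a file */
}

static inline struct file *fd_file(struct fd f)
{
	return (struct file *)(f.word & ~3UL);
}

static void fput(struct file *file)
{
	printf("fput(%p)\n", (void *)file);
}

static inline void fdput(struct fd f)
{
	if (f.word & FDPUT_FPUT)	/* provably a no-op for an empty fd */
		fput(fd_file(f));
}

int main(void)
{
	struct file *filp = (struct file *)0x1000;	/* fake, aligned pointer */
	struct fd borrowed = { (unsigned long)filp };			/* no FDPUT_FPUT */
	struct fd owned    = { (unsigned long)filp | FDPUT_FPUT };
	struct fd empty    = { 0 };

	fdput(borrowed);	/* nothing to drop */
	fdput(owned);		/* drops the reference */
	fdput(empty);		/* no-op */
	return fd_empty(empty) ? 0 : 1;
}
```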

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12 22:01:05 -04:00
Al Viro
1da91ea87a introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
	Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
	This commit converts (almost) all of f.file to
fd_file(f).  It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).

	NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).

[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12 22:00:43 -04:00
Chen Ni
7bf888fa26 xfs: convert comma to semicolon
Replace a comma between expression statements by a semicolon.
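
For reference, the kind of slip being fixed (illustrative only): the comma operator silently joins two expression statements into one, which compiles but obscures intent.

```c
#include <stdio.h>

int main(void)
{
	int a = 0, b = 0;

	/* Before: comma operator, both expressions form a single statement. */
	a = 1, b = 2;

	/* After: two ordinary statements separated by a semicolon. */
	a = 1;
	b = 2;

	printf("%d %d\n", a, b);
	return 0;
}
```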

Fixes: 178b48d588 ("xfs: remove the for_each_xbitmap_ helpers")
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-29 09:34:18 +05:30
Chen Ni
8c2263b923 xfs: convert comma to semicolon
Replace a comma between expression statements by a semicolon.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Fixes: 8f4b980ee6 ("xfs: pass the attr value to put_listent when possible")
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-29 09:32:53 +05:30
Julian Sun
af5d92f2fa xfs: remove unused parameter in macro XFS_DQUOT_LOGRES
In the macro definition of XFS_DQUOT_LOGRES, a parameter is accepted,
but it is not used. Hence, it should be removed.
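
Illustration of the shape of the cleanup (the macro body below is invented; the real XFS_DQUOT_LOGRES value is not reproduced here): a parameter the macro body never references can simply be dropped, along with the argument at each call site.

```c
#include <stdio.h>

/* Before: 'mp' is accepted but never used in the expansion. */
#define DQUOT_LOGRES_OLD(mp)	((512 + 128) * 6)

/* After: drop the unused parameter; call sites lose the argument too. */
#define DQUOT_LOGRES_NEW()	((512 + 128) * 6)

int main(void)
{
	printf("%d %d\n", DQUOT_LOGRES_OLD(NULL), DQUOT_LOGRES_NEW());
	return 0;
}
```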

This patch has only been compile-tested, but it should be fine.

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-29 09:29:31 +05:30
Darrick J. Wong
19ebc8f84e xfs: fix file_path handling in tracepoints
Since file_path() takes the output buffer as one of its arguments, we
might as well have it format directly into the tracepoint's char array
instead of wasting stack space.

Fixes: 3934e8ebb7 ("xfs: create a big array data structure")
Fixes: 5076a6040c ("xfs: support in-memory buffer cache targets")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202403290419.HPcyvqZu-lkp@intel.com/
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-29 09:27:23 +05:30
Eric Sandeen
39c1ddb064 xfs: allow SECURE namespace xattrs to use reserved block pool
We got a report from the podman folks that selinux relabels that happen
as part of their process were returning ENOSPC when the filesystem is
completely full. This is because xattr changes reserve about 15 blocks
for the worst case, but the common case is for selinux contexts to be
the sole, in-inode xattr and consume no blocks.

We already allow reserved space consumption for XFS_ATTR_ROOT for things
such as ACLs, and SECURE namespace attributes are not so very different,
so allow them to use the reserved space as well.
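
A condensed sketch of the policy change (simplified stand-in flag names, not the exact xfs_attr arguments): reserved-pool access, previously granted only to ROOT namespace attrs, is extended to the SECURE namespace.

```c
#include <stdbool.h>
#include <stdio.h>

#define ATTR_ROOT	(1u << 0)	/* e.g. ACLs */
#define ATTR_SECURE	(1u << 1)	/* e.g. selinux contexts */

/*
 * Sketch: may this xattr update dip into the reserved block pool when
 * the filesystem is otherwise full?  Before the change only ATTR_ROOT
 * qualified; now ATTR_SECURE does too.
 */
static bool xattr_may_use_reserved_pool(unsigned int attr_filter)
{
	return (attr_filter & (ATTR_ROOT | ATTR_SECURE)) != 0;
}

int main(void)
{
	printf("secure: %d\n", xattr_may_use_reserved_pool(ATTR_SECURE));	/* 1 */
	printf("user:   %d\n", xattr_may_use_reserved_pool(0));		/* 0 */
	return 0;
}
```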

Code-comment-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

V2: Remove local variable, add comment.
V3: Add Dave's preferred comment
V4: Spelling and comment beautification
2024-07-29 09:26:20 +05:30
Darrick J. Wong
80d3d33cdf xfs: fix a memory leak
kmemleak reported that we don't free the parent pointer names here if we
found corruption.

Fixes: 0d29a20fbd ("xfs: scrub parent pointers")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-29 09:25:01 +05:30
Joel Granados
78eb4ea25c sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata, which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax were
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24 20:59:29 +02:00
Linus Torvalds
bf3aa9de7b New code for 6.11:
* Enable FITRIM on the realtime device.
   * Introduce byte-based grant head log reservation tracking instead of
     physical log location tracking.
     This allows the grant head to track a full 64-bit byte space and hence
     overcome the 4GB indexing limit that has been present until now.
   * Fixes
     - xfs_flush_unmap_range() and xfs_prepare_shift() should consider RT extents
       in the flush unmap range.
     - Implement bounds check when traversing log operations during log replay.
     - Prevent out of bounds access when traversing a directory data block.
     - Prevent incorrect ENOSPC when concurrently performing file creation and
       file writes.
     - Fix rtalloc rotoring when delalloc is in use
   * Cleanups
     - Clean up I/O path inode locking helpers and the page fault handler.
     - xfs: hoist inode operations to libxfs in anticipation of the metadata
       inode directory feature, which maintains a directory tree of metadata
       inodes. This will be necessary for further enhancements to the realtime
       feature, subvolume support.
     - Clean up some warts in the extent freeing log intent code.
     - Clean up the refcount and rmap intent code before adding support for
       realtime devices.
     - Provide the correct email address for sysfs ABI documentation.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZo9pkwAKCRAH7y4RirJu
 9EV+AQDBlX2AxTzKPsfb74qKaFgDpTdud8b1U779tijs4a6ZbwD8CvS40NXAjqmq
 R2j3wWQP3rkRxBusnStQ/9El20Q+WAI=
 =BcGP
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.11-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Chandan Babu:
 "Major changes in this release are limited to enabling FITRIM on
  realtime devices and Byte-based grant head log reservation tracking.

  The remaining changes are limited to fixes and cleanups included in
  this pull request.

  Core:

   - Enable FITRIM on the realtime device

   - Introduce byte-based grant head log reservation tracking instead of
     physical log location tracking.

     This allows the grant head to track a full 64-bit byte space and hence
     overcome the 4GB indexing limit that has been present until now

  Fixes:

   - xfs_flush_unmap_range() and xfs_prepare_shift() should consider RT
     extents in the flush unmap range

   - Implement bounds check when traversing log operations during log
     replay

   - Prevent out of bounds access when traversing a directory data block

   - Prevent incorrect ENOSPC when concurrently performing file creation
     and file writes

   - Fix rtalloc rotoring when delalloc is in use

  Cleanups:

   - Clean up I/O path inode locking helpers and the page fault handler

   - xfs: hoist inode operations to libxfs in anticipation of the
     metadata inode directory feature, which maintains a directory tree
     of metadata inodes. This will be necessary for further enhancements
     to the realtime feature, subvolume support

   - Clean up some warts in the extent freeing log intent code

   - Clean up the refcount and rmap intent code before adding support
     for realtime devices

   - Provide the correct email address for sysfs ABI documentation"

* tag 'xfs-6.11-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (80 commits)
  xfs: fix rtalloc rotoring when delalloc is in use
  xfs: get rid of xfs_ag_resv_rmapbt_alloc
  xfs: skip flushing log items during push
  xfs: grant heads track byte counts, not LSNs
  xfs: pass the full grant head to accounting functions
  xfs: track log space pinned by the AIL
  xfs: collapse xlog_state_set_callback in caller
  xfs: l_last_sync_lsn is really AIL state
  xfs: ensure log tail is always up to date
  xfs: background AIL push should target physical space
  xfs: AIL doesn't need manual pushing
  xfs: move and rename xfs_trans_committed_bulk
  xfs: fix the contact address for the sysfs ABI documentation
  xfs: Avoid races with cnt_btree lastrec updates
  xfs: move xfs_refcount_update_defer_add to xfs_refcount_item.c
  xfs: simplify usage of the rcur local variable in xfs_refcount_finish_one
  xfs: don't bother calling xfs_refcount_finish_one_cleanup in xfs_refcount_finish_one
  xfs: reuse xfs_refcount_update_cancel_item
  xfs: add a ci_entry helper
  xfs: remove xfs_trans_set_refcount_flags
  ...
2024-07-17 12:57:48 -07:00
Linus Torvalds
4f5e249ec0 vfs-6.11.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEHLQAKCRCRxhvAZXjc
 ot3sAP9TBUM+vzUcQ5SVcUnSX+y3dhOGYnquORBbRc/Y6AzLMAEAu3TcsvdoaWfy
 6ImUaju6iLqy9cCY3uDlNmJR16E4IgE=
 =Bwpy
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull iomap updates from Christian Brauner:
 "This contains some minor work for the iomap subsystem:

   - Add documentation on the design of iomap and how to port to it

   - Optimize iomap_read_folio()

   - Bring back the change to iomap_write_end() to not increase i_size.

     This is accompanied by a change to xfs to reserve blocks for
     truncating large realtime inodes to avoid exposing stale data when
     iomap_write_end() stops increasing i_size"

* tag 'vfs-6.11.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iomap: don't increase i_size in iomap_write_end()
  xfs: reserve blocks for truncating large realtime inode
  Documentation: the design of iomap and how to port
  iomap: Optimize iomap_read_folio
2024-07-15 13:28:14 -07:00
Linus Torvalds
2aae1d67fd vfs-6.11.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEG2wAKCRCRxhvAZXjc
 ooW/AQDzyY+xNGt4OPMvlyFUHd5RcyiLsMhYrkKc3FaIFjesVgD+PFW5PPW12c0V
 Z4VHg9w1HDDuUn4XvELs7OXZpek7RgU=
 =eDC8
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode / dentry updates from Christian Brauner:
 "This contains smaller performance improvements to inodes and dentries:

  inode:

   - Add rcu based inode lookup variants.

     They avoid one inode hash lock acquire in the common case thereby
     significantly reducing contention. We already support RCU-based
     operations but didn't take advantage of them during inode
     insertion.

     Callers of iget_locked() get the improvement without any code
     changes. Callers that need a custom callback can switch to
     iget5_locked_rcu() as e.g., did btrfs.

     With 20 threads each walking a dedicated 1000 dirs * 1000 files
     directory tree to stat(2) on a 32 core + 24GB ram vm:

        before: 3.54s user 892.30s system 1966% cpu 45.549 total
        after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)

     Long-term we should pick up the effort to introduce more
     fine-grained locking and possibly improve on the currently used
     hash implementation.

   - Start zeroing i_state in inode_init_always() instead of doing it in
     individual filesystems.

     This allows us to remove an unneeded lock acquire in new_inode()
     and not burden individual filesystems with this.

  dcache:

   - Move d_lockref out of the area used by RCU lookup to avoid
     cacheline ping poing because the embedded name is sharing a
     cacheline with d_lockref.

   - Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
     up with 128 bytes in total"

* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: fix dentry size
  vfs: move d_lockref out of the area used by RCU lookup
  bcachefs: remove now spurious i_state initialization
  xfs: remove now spurious i_state initialization in xfs_inode_alloc
  vfs: partially sanitize i_state zeroing on inode creation
  xfs: preserve i_state around inode_init_always in xfs_reinit_inode
  btrfs: use iget5_locked_rcu
  vfs: add rcu-based find_inode variants for iget ops
2024-07-15 11:39:44 -07:00
Christoph Hellwig
2bf6e35354 xfs: fix rtalloc rotoring when delalloc is in use
If we're trying to allocate real space for a delalloc reservation at
offset 0, we should use the rotor to spread files across the rt volume.

Switch the rtalloc to use the XFS_ALLOC_INITIAL_USER_DATA flag that
is set for any write at startoff to make it match the behavior for
the main data device.

Based on a patch from Darrick J. Wong.

Fixes: 6a94b1acda ("xfs: reinstate delalloc for RT inodes (if sb_rextsize == 1)")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-09 09:08:28 +05:30
Long Li
49cdc4e834 xfs: get rid of xfs_ag_resv_rmapbt_alloc
The pag in xfs_ag_resv_rmapbt_alloc() is already held when the struct
xfs_btree_cur is initialized in xfs_rmapbt_init_cursor(), so there is no
need to get pag again.

On the other hand, in xfs_rmapbt_free_block(), the similar function
xfs_ag_resv_rmapbt_free() was removed in commit 92a005448f ("xfs: get
rid of unnecessary xfs_perag_{get,put} pairs"), xfs_ag_resv_rmapbt_alloc()
was left because scrub used it, but now scrub has removed it. Therefore,
we could get rid of xfs_ag_resv_rmapbt_alloc() just like the rmap free
block, make the code cleaner.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 14:36:13 +05:30
Dave Chinner
f3f7ae68a4 xfs: skip flushing log items during push
The AIL pushing code spends a huge amount of time skipping over
items that are already marked as flushing. It is not uncommon to
see hundreds of thousands of items skipped every second due to inode
clustering marking all the inodes in a cluster as flushing when the
first one is flushed.

However, to discover an item is already flushing and should be
skipped we have to call the iop_push() method for it to try to flush
the item. For inodes (where this matters most), we first have to
check that the inode is flushable.

We can optimise this overhead away by tracking whether the log item
is flushing internally. This allows xfsaild_push() to check the log
item directly for flushing state and immediately skip the log item.
Whilst this doesn't remove the CPU cache misses for loading the log
item, it does avoid the overhead of an indirect function call
and the cache misses involved in accessing inode and
backing cluster buffer structures to determine flushing state. When
trying to flush hundreds of thousands of inodes each second, this
CPU overhead saving adds up quickly.
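
A rough model of the optimisation (toy types; names and details are made up rather than taken from the kernel): items already marked flushing are skipped with a cheap flag test instead of an indirect iop_push() call.

```c
#include <stdbool.h>
#include <stdio.h>

enum push_result { PUSH_SUCCESS, PUSH_FLUSHING, PUSH_PINNED };

struct log_item {
	bool flushing;				/* set once a flush is in flight */
	enum push_result (*iop_push)(struct log_item *);
};

static enum push_result push_one(struct log_item *lip)
{
	/*
	 * Cheap skip: if the item is already being flushed (e.g. the rest
	 * of an inode cluster whose first inode was flushed), don't pay
	 * for the indirect call and the cache misses behind it.
	 */
	if (lip->flushing)
		return PUSH_FLUSHING;
	return lip->iop_push(lip);
}

static enum push_result fake_push(struct log_item *lip)
{
	lip->flushing = true;		/* flush submitted, mark the item */
	return PUSH_SUCCESS;
}

int main(void)
{
	struct log_item item = { .flushing = false, .iop_push = fake_push };

	printf("%d\n", push_one(&item));	/* 0: flushed */
	printf("%d\n", push_one(&item));	/* 1: skipped, already flushing */
	return 0;
}
```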

It's so noticeable that the biggest issue with pushing on the AIL on
fast storage becomes the 10ms back-off wait when we hit enough
pinned buffers to break out of the push loop but not enough for the
AIL pushing to be considered stuck. This limits the xfsaild to about
70% total CPU usage, and on fast storage this isn't enough to keep
the storage 100% busy.

The xfsaild will block on IO submission on slow storage and so is
self throttling - it does not need a backoff in the case where we
are really just breaking out of the walk to submit the IO we have
gathered.

Further with no backoff we don't need to gather huge delwri lists to
mitigate the impact of backoffs, so we can submit IO more frequently
and reduce the time log items spend in flushing state by breaking
out of the item push loop once we've gathered enough IO to batch
submission effectively.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:47 +05:30
Dave Chinner
c1220522ef xfs: grant heads track byte counts, not LSNs
The grant heads in the log track the space reserved in the log for
running transactions. They do this by tracking how far ahead of the
tail that the reservation has reached, and the units for doing this
are {cycle,bytes} for the reserve head rather than {cycle,blocks}
which are normally used by LSNs.

This is annoyingly complex because we have to split, crack and
combine these tuples for any calculation we do to determine log
space and targets. This is computationally expensive as well as
difficult to do atomically and locklessly, as well as limiting the
size of the log to 2^32 bytes.

Really, though, all the grant heads are tracking is how much space
is currently available for use in the log. We can track this as a
simply byte count - we just don't care what the actual physical
location in the log the head and tail are at, just how much space we
have remaining before the head and tail overlap.

So, convert the grant heads to track the byte reservations that are
active rather than the current (cycle, offset) tuples. This means an
empty log has zero bytes consumed, and a full log is when the
reservations reach the size of the log minus the space consumed by
the AIL.

This greatly simplifies the accounting and checks for whether there
is space available. We no longer need to crack or combine LSNs to
determine how much space the log has left, nor do we need to look at
the head or tail of the log to determine how close to full we are.
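
As a back-of-the-envelope model of the byte-based accounting (illustrative only, with no locking and no cycle/offset cracking): the log is full when the space pinned by the AIL plus the bytes held by active reservations reaches the log size, and an empty log has zero bytes consumed.

```c
#include <stdint.h>
#include <stdio.h>

struct log_space {
	uint64_t log_size;	/* size of the physical log in bytes */
	uint64_t tail_space;	/* bytes pinned by items in the AIL */
	uint64_t grant_bytes;	/* bytes held by active reservations */
};

/* Bytes still available for new transaction reservations. */
static uint64_t log_space_left(const struct log_space *ls)
{
	uint64_t used = ls->tail_space + ls->grant_bytes;

	return used >= ls->log_size ? 0 : ls->log_size - used;
}

int main(void)
{
	struct log_space ls = {
		.log_size = 64 << 20,	/* 64MB log */
		.tail_space = 16 << 20,
		.grant_bytes = 40 << 20,
	};
	struct log_space empty = { .log_size = 64 << 20 };	/* nothing consumed */

	printf("left: %llu MB\n",
	       (unsigned long long)(log_space_left(&ls) >> 20));
	printf("empty log left: %llu MB\n",
	       (unsigned long long)(log_space_left(&empty) >> 20));
	return 0;
}
```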

There is, however, a complexity that needs to be handled. We know
how much space is being tracked in the AIL now via log->l_tail_space
and the log tickets track active reservations and return the unused
portions to the grant heads when ungranted.  Unfortunately, we don't
track the used portion of the grant, so when we transfer log items
from the CIL to the AIL, the space accounted to the grant heads is
transferred to the log tail space.  Hence when we move the AIL head
forwards on item insert, we have to remove that space from the grant
heads.

We also remove the xlog_verify_grant_tail() debug function as it is
no longer useful. The check it performs has been racy since delayed
logging was introduced, but now it is clearly only detecting false
positives so remove it.

The result of this substantially simpler accounting algorithm is an
increase in sustained transaction rate from ~1.3 million
transactions/s to ~1.9 million transactions/s with no increase in
CPU usage. We also remove the 32 bit space limitation on the grant
heads, which will allow us to increase the journal size beyond 2GB
in future.

Note that this renames the sysfs files exposing the log grant space
now that the values are exported in bytes.  This allows xfstests
to auto-detect the old or new ABI.

[hch: move xlog_grant_sub_space out of line,
      update the xlog_grant_{add,sub}_space prototypes,
      rename the sysfs files to allow auto-detection in xfstests]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:47 +05:30
Dave Chinner
de302cea1e xfs: pass the full grant head to accounting functions
Because we are going to need them soon. API change only, no logic
changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:47 +05:30
Dave Chinner
551bf13ba8 xfs: track log space pinned by the AIL
Currently we track space used in the log by grant heads.
These store the reserved space as a physical log location and
combine both space reserved for future use with space already used in
the log in a single variable. The amount of space consumed in the
log is then calculated as the distance between the log tail and
the grant head.

The problem with tracking the grant head as a physical location
comes from the fact that it tracks both log cycle count and offset
into the log in bytes in a single 64 bit variable. because the cycle
count on disk is a 32 bit number, this also limits the offset into
the log to 32 bits. ANd because that is in bytes, we are limited to
being able to track only 2GB of log space in the grant head.

Hence to support larger physical logs, we need to track used space
differently in the grant head. We no longer use the grant head for
guiding AIL pushing, so the only thing it is now used for is
determining if we've run out of reservation space via the
calculation in xlog_space_left().

What we really need to do is move the grant heads away from tracking
physical space in the log. The issue here is that space consumed in
the log is not directly tracked by the current mechanism - the
space consumed in the log by grant head reservations gets returned
to the free pool by the tail of the log moving forward. i.e. the
space isn't directly tracked or calculated, but the used grant space
gets "freed" as the physical limits of the log are updated without
actually needing to update the grant heads.

Hence to move away from implicit, zero-update log space tracking we
need to explicitly track the amount of physical space the log
actually consumes separately to the in-memory reservations for
operations that will be committed to the journal. Luckily, we
already track the information we need to calculate this in the AIL
itself.

That is, the space currently consumed by the journal is the maximum
LSN that the AIL has seen minus the current log tail. As we update
both of these items dynamically as the head and tail of the log
moves, we always know exactly how much space the journal consumes.

This means that we also know exactly how much space the currently
active reservations require, and exactly how much free space we have
remaining for new reservations to be made. Most importantly, we know
what these spaces are independently of the physical locations of
the head and tail of the log.

Hence by separating out the physical space consumed by the journal,
we can now track reservations in the grant heads purely as a byte
count, and the log can be considered full when the tail space +
reservation space exceeds the size of the log. This means we can use
the full 64 bits of grant head space for reservation space,
completely removing the 32 bit byte count limitation on log size
that they impose.

Hence the first step in this conversion is to track and update the
"log tail space" every time the AIL tail or maximum seen LSN
changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:46 +05:30
Dave Chinner
be5abd323b xfs: collapse xlog_state_set_callback in caller
The function is called from a single place, and it isn't just
setting the iclog state to XLOG_STATE_CALLBACK - it can mark iclogs
clean, which moves them to states after CALLBACK. Hence the function
is now badly named, and should just be folded into the caller where
the iclog completion logic makes a whole lot more sense.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:46 +05:30
Dave Chinner
0dcd5a10d9 xfs: l_last_sync_lsn is really AIL state
The current implementation of xlog_assign_tail_lsn() assumes that
when the AIL is empty, the log tail matches the LSN of the last
written commit record. This is recorded in xlog_state_set_callback()
as log->l_last_sync_lsn when the iclog state changes to
XLOG_STATE_CALLBACK. This change is then immediately followed by
running the callbacks on the iclog which then insert the log items
into the AIL at the "commit lsn" of that checkpoint.

The AIL tracks log items via the start record LSN of the checkpoint,
not the commit record LSN. This is because we can pipeline multiple
checkpoints, and so the start record of checkpoint N+1 can be
written before the commit record of checkpoint N. i.e:

     start N			commit N
	+-------------+------------+----------------+
		  start N+1			commit N+1

The tail of the log cannot be moved to the LSN of commit N when all
the items of that checkpoint are written back, because then the
start record for N+1 is no longer in the active portion of the log
and recovery will fail/corrupt the filesystem.

Hence when all the log items in checkpoint N are written back, the
tail of the log must now only move as far forwards as the start LSN
of checkpoint N+1.

Hence we cannot use the maximum start record LSN the AIL sees as a
replacement for the pointer to the current head of the on-disk log
records. However, we currently only use the l_last_sync_lsn when the
AIL is empty - when there is no start LSN remaining, the tail of the
log moves to the LSN of the last commit record as this is where
recovery needs to start searching for recoverable records. The next
checkpoint will have a start record LSN that is higher than
l_last_sync_lsn, and so everything still works correctly when new
checkpoints are written to an otherwise empty log.

l_last_sync_lsn is an atomic variable because it is currently
updated when an iclog with callbacks attached moves to the CALLBACK
state. While we hold the icloglock at this point, we don't hold the
AIL lock. When we assign the log tail, we hold the AIL lock, not the
icloglock because we have to look up the AIL. Hence it is an atomic
variable so it's not bound to a specific lock context.

However, the iclog callbacks are only used for CIL checkpoints. We
don't use callbacks with unmount record writes, so the
l_last_sync_lsn variable only gets updated when we are processing
CIL checkpoint callbacks. And those callbacks run under AIL lock
contexts, not icloglock context. The CIL checkpoint already knows
what the LSN of the iclog the commit record was written to (obtained
when written into the iclog before submission) and so we can update
the l_last_sync_lsn under the AIL lock in this callback. No other
iclog callbacks will run until the currently executing one
completes, and hence we can update the l_last_sync_lsn under the AIL
lock safely.

This means l_last_sync_lsn can move to the AIL as the "ail_head_lsn"
and it can be used to replace the atomic l_last_sync_lsn in the
iclog code. This makes tracking the log tail belong entirely to the
AIL, rather than being smeared across log, iclog and AIL state and
locking.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:46 +05:30
Dave Chinner
a07776ab81 xfs: ensure log tail is always up to date
Whenever we write an iclog, we call xlog_assign_tail_lsn() to update
the current tail before we write it into the iclog header. This
means we have to take the AIL lock on every iclog write just to
check if the tail of the log has moved.

This doesn't avoid races with log tail updates - the log tail could
move immediately after we assign the tail to the iclog header and
hence by the time the iclog reaches stable storage the tail LSN has
moved forward in memory. Hence the log tail LSN in the iclog header
is really just a point in time snapshot of the current state of the
AIL.

With this in mind, if we simply update the in memory log->l_tail_lsn
every time it changes in the AIL, there is no need to update the in
memory value when we are writing it into an iclog - it will already
be up-to-date in memory and checking the AIL again will not change
this. Hence xlog_state_release_iclog() does not need to check the
AIL to update the tail lsn and can just sample it directly without
needing to take the AIL lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:46 +05:30
Dave Chinner
b50b4c49d8 xfs: background AIL push should target physical space
Currently the AIL attempts to keep 25% of the "log space" free,
where the current used space is tracked by the reserve grant head.
That is, it tracks both physical space used plus the amount reserved
by transactions in progress.

When we start tail pushing, we are trying to make space for new
reservations by writing back older metadata and the log is generally
physically full of dirty metadata, and reservations for modifications
in flight take up whatever space the AIL can physically free up.

Hence we don't really need to take into account the reservation
space that has been used - we just need to keep the log tail moving
as fast as we can to free up space for more reservations to be made.
We know exactly how much physical space the journal is consuming in
the AIL (i.e. max LSN - min LSN) so we can base push thresholds
directly on this state rather than have to look at grant head
reservations to determine how much to physically push out of the
log.
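
A compact sketch of the idea (arbitrary helper name and return semantics; the 25% figure comes from the text above): with the physically consumed space known directly, background pushing only has work to do when the AIL pins more than 75% of the log.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch: how many bytes of AIL-pinned space should background pushing
 * try to write back right now?  Aim to keep 25% of the physical log
 * free, i.e. push whenever the AIL pins more than 75% of the log.
 */
static uint64_t ail_push_amount(uint64_t log_size, uint64_t ail_space)
{
	uint64_t threshold = log_size - log_size / 4;	/* 75% of the log */

	return ail_space > threshold ? ail_space - threshold : 0;
}

int main(void)
{
	uint64_t log_size = 64ULL << 20;	/* 64MB log */

	printf("%llu\n", (unsigned long long)ail_push_amount(log_size, 60ULL << 20));
	printf("%llu\n", (unsigned long long)ail_push_amount(log_size, 10ULL << 20));
	return 0;
}
```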

This also allows code that needs to know if log items in the current
transaction need to be pushed or re-logged to simply sample the
current target - they don't need to calculate the current target
themselves. This avoids the need for any locking when doing such
checks.

Further, moving to a physical target means we don't need "push all
until empty semantics" like were introduced in the previous patch.
We can now test and clear the "push all" as a one-shot command to
set the target to the current head of the AIL. This allows the
xfsaild to maximise the use of log space right up to the point where
conditions indicate that the xfsaild is not keeping up with load and
it needs to work harder, and as soon as those constraints go away
(i.e. external code no longer needs everything pushed) the xfsaild
will return to maintaining the normal 25% free space thresholds.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:46 +05:30
Dave Chinner
9adf40249e xfs: AIL doesn't need manual pushing
We have a mechanism that checks the amount of log space remaining
available every time we make a transaction reservation. If the
amount of space is below a threshold (25% free) we push on the AIL
to tell it to do more work. To do this, we end up calculating the
LSN that the AIL needs to push to on every reservation and updating
the push target for the AIL with that new target LSN.

This is silly and expensive. The AIL is perfectly capable of
calculating the push target itself, and it will always be running
when the AIL contains objects.

What the target does is determine if the AIL needs to do
any work before it goes back to sleep. If we haven't run out of
reservation space or memory (or some other push all trigger), it
will simply go back to sleep for a while if there is more than 25%
of the journal space free without doing anything.

If there are items in the AIL at a lower LSN than the target, it
will try to push up to the target or to the point of getting stuck
before going back to sleep and trying again soon after.

Hence we can modify the AIL to calculate its own 25% push target
before it starts a push using the same reserve grant head based
calculation as is currently used, and remove all the places where we
ask the AIL to push to a new 25% free target. We can also drop the
minimum free space size of 256BBs from the calculation because the
25% of a minimum sized log is *always* going to be larger than
256BBs.

This does still require a manual push in certain circumstances.
These circumstances arise when the AIL is not full, but the
reservation grants consume the entire of the free space in the log.
In this case, we still need to push on the AIL to free up space, so
when we hit this condition (i.e. reservation going to sleep to wait
on log space) we do a single push to tell the AIL it should empty
itself. This will keep the AIL moving as new reservations come in
and want more space, rather than keep queuing them and having to
push the AIL repeatedly.

The reason for using the "push all" when grant space runs out is
that we can run out of grant space when there is more than 25% of
the log free. Small logs are notorious for this, and we have a hack
in the log callback code (xlog_state_set_callback()) where we push
the AIL because the *head* moved, to ensure that we kick the AIL
when we consume space in it because that can push us over the "less
than 25% available" available that starts tail pushing back up
again.

Hence when we run out of grant space and are going to sleep, we have
to consider that the grant space may be consuming almost all the log
space and there is almost nothing in the AIL. In this situation, the
AIL pins the tail and moving the tail forwards is the only way the
grant space will come available, so we have to force the AIL to push
everything to guarantee grant space will eventually be returned.
Hence triggering a "push all" just before sleeping removes all the
nasty corner cases we have in other parts of the code that work
around the "we didn't ask the AIL to push enough to free grant
space" condition that leads to log space hangs...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:46 +05:30
Dave Chinner
613e2fdbbc xfs: move and rename xfs_trans_committed_bulk
Ever since the CIL and delayed logging was introduced,
xfs_trans_committed_bulk() has been a purely CIL checkpoint
completion function and not a transaction commit completion
function. Now that we are adding log specific updates to this
function, it really does not have anything to do with the
transaction subsystem - it is really log and log item level
functionality.

This should be part of the CIL code as it is the callback
that moves log items from the CIL checkpoint to the AIL. Move it
and rename it to xlog_cil_ail_insert().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:46:46 +05:30
Zizhi Wo
94a0333b92 xfs: Avoid races with cnt_btree lastrec updates
A concurrent file creation and a small write could unexpectedly return an
-ENOSPC error, since there is a race window in which the allocator could
read the wrong agf->agf_longest.

Write file process steps:
1) Find the entry that best meets the conditions, then calculate the start
   address and length of the remaining part of the entry after allocation.
2) Delete this entry and update the -current- agf->agf_longest.
3) Insert the remaining unused parts of this entry based on the
   calculations in 1), and update the agf->agf_longest again if necessary.

Create file process steps:
1) Check whether there are free inodes in the inode chunk.
2) If there is no free inode, check whether there has space for creating
   inode chunks, perform the no-lock judgment first.
3) If the judgment succeeds, the judgment is performed again with agf lock
   held. Otherwise, an error is returned directly.

If the write process is in step 2) but not go to 3) yet, the create file
process goes to 2) at this time, it may be mistaken for no space,
resulting in the file system still has space but the file creation fails.

We have sent two different commits to the community in order to fix this
problem[1][2]. Unfortunately, both solutions have flaws. In [2], I
discussed with Dave and Darrick, realized that a better solution to this
problem requires the "last cnt record tracking" to be ripped out of the
generic btree code. And surprisingly, Dave directly provided his fix code.
This patch includes appropriate modifications based on his tmp-code to
address this issue.

The entire fix can be roughly divided into two parts:
1) Delete the code related to lastrec-update in the generic btree code.
2) Place the process of updating longest freespace with cntbt separately
   to the end of the cntbt modifications. Move the cursor to the rightmost
   firstly, and update the longest free extent based on the record.

Note that we can not update the longest with xfs_alloc_get_rec() after
find the longest record, as xfs_verify_agbno() may not pass because
pag->block_count is updated on the outside. Therefore, use
xfs_btree_get_rec() as a replacement.

[1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@huawei.com
[2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@huawei.com

Reported-by: Ye Bin <yebin10@huawei.com>

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-04 12:44:16 +05:30
Darrick J. Wong
783e8a7c9c xfs: move xfs_refcount_update_defer_add to xfs_refcount_item.c
Move the code that adds the incore xfs_refcount_update_item deferred
work data to a transaction to live with the CUI log item code.  This means
that the refcount code no longer has to know about the inner workings of
the CUI log items.

As a consequence, we can get rid of the _{get,put}_group helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:07 -07:00
Darrick J. Wong
e51987a12c xfs: simplify usage of the rcur local variable in xfs_refcount_finish_one
Only update rcur when we know the final *pcur value.

Inspired-by: Christoph Hellwig <hch@lst.de>
[djwong: don't leave the caller with a dangling ref]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:07 -07:00
Darrick J. Wong
bac3f78492 xfs: don't bother calling xfs_refcount_finish_one_cleanup in xfs_refcount_finish_one
In xfs_refcount_finish_one we know the cursor is non-zero when calling
xfs_refcount_finish_one_cleanup and we pass a 0 error variable.  This
means xfs_refcount_finish_one_cleanup is just doing a
xfs_btree_del_cursor.

Open code that and move xfs_refcount_finish_one_cleanup to
fs/xfs/xfs_refcount_item.c.

Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:07 -07:00
Darrick J. Wong
8aef79928b xfs: reuse xfs_refcount_update_cancel_item
Reuse xfs_refcount_update_cancel_item to put the AG/RTG and free the
item in a few places that currently open code the logic.

Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:06 -07:00
Darrick J. Wong
0e9254861f xfs: add a ci_entry helper
Add a helper to translate from the item list head to the
refcount_intent_item structure and use it to shorten assignments and
avoid the need for extra local variables.
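
For flavour, the usual shape of such a helper, shown as a userspace model with a local container_of and made-up field names (the real helper maps a deferred-work list entry back to the refcount intent item):

```c
#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct list_head { struct list_head *next, *prev; };

struct refcount_intent {
	int		ri_startblock;
	struct list_head ri_list;	/* linked into the deferred work list */
};

/* Translate a deferred-work list entry back to its containing intent. */
static inline struct refcount_intent *ci_entry(const struct list_head *e)
{
	return container_of(e, struct refcount_intent, ri_list);
}

int main(void)
{
	struct refcount_intent ri = { .ri_startblock = 42 };

	printf("%d\n", ci_entry(&ri.ri_list)->ri_startblock);
	return 0;
}
```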

Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:06 -07:00
Darrick J. Wong
e69682e5a1 xfs: remove xfs_trans_set_refcount_flags
Remove this single-use helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:06 -07:00
Darrick J. Wong
886f11c797 xfs: clean up refcount log intent item tracepoint callsites
Pass the incore refcount intent structure to the tracepoints instead of
open-coding the argument passing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:06 -07:00
Darrick J. Wong
8fbac2f1a0 xfs: pass btree cursors to refcount btree tracepoints
Prepare the rest of refcount btree tracepoints for use with realtime
reflink by making them take the btree cursor object as a parameter.
This will save us a lot of trouble later on.

Remove the xfs_refcount_recover_extent tracepoint since it's already
covered by other refcount tracepoints.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:05 -07:00
Darrick J. Wong
bb0efb0d0a xfs: create specialized classes for refcount tracepoints
The only user of the "ag" tracepoint event classes is the refcount
btree, so rename them to make that obvious and make them take the btree
cursor to simplify the arguments.  This will save us a lot of trouble
later on.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:05 -07:00
Darrick J. Wong
7cf2663ff1 xfs: give refcount btree cursor error tracepoints their own class
Convert all the refcount tracepoints to use the btree error tracepoint
class.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:05 -07:00
Darrick J. Wong
ea7b0820d9 xfs: move xfs_rmap_update_defer_add to xfs_rmap_item.c
Move the code that adds the incore xfs_rmap_update_item deferred work
data to a transaction to live with the RUI log item code.  This means
that the rmap code no longer has to know about the inner workings of the
RUI log items.

As a consequence, we can get rid of the _{get,put}_group helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:05 -07:00
Christoph Hellwig
905af72610 xfs: simplify usage of the rcur local variable in xfs_rmap_finish_one
Only update rcur when we know the final *pcur value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djwong: don't leave the caller with a dangling ref]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:04 -07:00
Christoph Hellwig
8363b43619 xfs: don't bother calling xfs_rmap_finish_one_cleanup in xfs_rmap_finish_one
In xfs_rmap_finish_one we know the cursor is non-zero when calling
xfs_rmap_finish_one_cleanup and we pass a 0 error variable.  This means
xfs_rmap_finish_one_cleanup is just doing a xfs_btree_del_cursor.

Open code that and move xfs_rmap_finish_one_cleanup to
fs/xfs/xfs_rmap_item.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: minor porting changes]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:04 -07:00
Christoph Hellwig
37f9d1db03 xfs: reuse xfs_rmap_update_cancel_item
Reuse xfs_rmap_update_cancel_item to put the AG/RTG and free the item in
a few places that currently open code the logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:04 -07:00
Christoph Hellwig
f93963779b xfs: add a ri_entry helper
Add a helper to translate from the item list head to the
rmap_intent_item structure and use it to shorten assignments
and avoid the need for extra local variables.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:04 -07:00
Darrick J. Wong
c9099a28c2 xfs: remove xfs_trans_set_rmap_flags
Remove this single-use helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:04 -07:00
Darrick J. Wong
fbe8c7e167 xfs: clean up rmap log intent item tracepoint callsites
Pass the incore rmap structure to the tracepoints instead of open-coding
the argument passing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:03 -07:00
Darrick J. Wong
47492ed124 xfs: pass btree cursors to rmap btree tracepoints
Prepare the rmap btree tracepoints for use with realtime rmap btrees by
making them take the btree cursor object as a parameter.  This will save
us a lot of trouble later on.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:03 -07:00
Darrick J. Wong
71f5a17e52 xfs: give rmap btree cursor error tracepoints their own class
Create a new tracepoint class for btree-related errors, then convert all
the rmap tracepoints to use it.  Also fix the one tracepoint that was
abusing the old class by making it a separate tracepoint.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:03 -07:00
Darrick J. Wong
84a3c1576c xfs: move xfs_extent_free_defer_add to xfs_extfree_item.c
Move the code that adds the incore xfs_extent_free_item deferred work
data to a transaction to live with the EFI log item code.  This means
that the allocator code no longer has to know about the inner workings
of the EFI log items.

As a consequence, we can get rid of the _{get,put}_group helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:03 -07:00
Christoph Hellwig
7272f77c67 xfs: remove xfs_defer_agfl_block
xfs_free_extent_later can handle the extra AGFL special casing with
very little extra logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:02 -07:00
Christoph Hellwig
851a678189 xfs: remove duplicate asserts in xfs_defer_extent_free
The bno/len verification is already done by the calls to
xfs_verify_rtbext / xfs_verify_fsbext, and reporting a corruption error
seem like the better handling than tripping an assert anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:02 -07:00
Christoph Hellwig
81927e6ec6 xfs: factor out a xfs_efd_add_extent helper
Factor out a helper to add an extent to an EFD instead of duplicating
the logic in two places.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:02 -07:00
Christoph Hellwig
61665fae4e xfs: reuse xfs_extent_free_cancel_item
Reuse xfs_extent_free_cancel_item to put the AG/RTG and free the item in
a few places that currently open code the logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:02 -07:00
Christoph Hellwig
649c0c2b86 xfs: add a xefi_entry helper
Add a helper to translate from the item list head to the
xfs_extent_free_item structure and use it to shorten assignments
and avoid the need for extra local variables.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:01 -07:00
Christoph Hellwig
62d597a197 xfs: pass the fsbno to xfs_perag_intent_get
All callers of xfs_perag_intent_get have a fsbno and need boilerplate
code to turn that into an agno.  Just pass the fsbno to
xfs_perag_intent_get and look up the agno there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-07-02 11:37:01 -07:00
Darrick J. Wong
980faece91 xfs: convert "skip_discard" to a proper flags bitset
Convert the boolean to skip discard on free into a proper flags field so
that we can add more flags in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:01 -07:00
Darrick J. Wong
4e0e2c0fe3 xfs: clean up extent free log intent item tracepoint callsites
Pass the incore EFI structure to the tracepoints instead of open-coding
the argument passing.  This cleans up the call sites a bit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:01 -07:00
Darrick J. Wong
ac3a027516 xfs: don't use the incore struct xfs_sb for offsets into struct xfs_dsb
Currently, the XFS_SB_CRC_OFF macro uses the incore superblock struct
(xfs_sb) to compute the address of sb_crc within the ondisk superblock
struct (xfs_dsb).  This is a landmine if we ever change the layout of
the incore superblock (as we're about to do), so redefine the macro
to use xfs_dsb to compute the layout of xfs_dsb.
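
The pitfall and the fix, sketched with toy structs (the real structures are xfs_sb and xfs_dsb, and the exact macro body is not reproduced here): an ondisk offset must be computed from the ondisk layout, otherwise reordering the incore struct silently changes the offset.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Incore superblock: layout may change whenever it is convenient. */
struct sb_incore {
	uint64_t extra_incore_state;	/* imagine a newly added field */
	uint32_t sb_magicnum;
	uint32_t sb_crc;
};

/* Ondisk superblock: layout is fixed by the on-disk format. */
struct sb_ondisk {
	uint32_t sb_magicnum;
	uint32_t sb_crc;
};

/* Wrong: tied to the incore layout, breaks if incore fields move. */
#define SB_CRC_OFF_BAD	offsetof(struct sb_incore, sb_crc)

/* Right: computed from the ondisk structure itself. */
#define SB_CRC_OFF	offsetof(struct sb_ondisk, sb_crc)

int main(void)
{
	printf("incore-based offset: %zu\n", SB_CRC_OFF_BAD);	/* 12 here */
	printf("ondisk-based offset: %zu\n", SB_CRC_OFF);	/* 4 */
	return 0;
}
```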

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:00 -07:00
Darrick J. Wong
47d4d5961f xfs: get rid of trivial rename helpers
Get rid of the largely pointless xfs_cross_rename and xfs_finish_rename
now that we've refactored its parent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:00 -07:00
Darrick J. Wong
62bbf50bea xfs: move dirent update hooks to xfs_dir2.c
Move the directory entry update hook code to xfs_dir2 so that it is
mostly consolidated with the higher level directory functions.  Retain
the exports so that online fsck can still send notifications through the
hooks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:37:00 -07:00
Darrick J. Wong
28d0d81344 xfs: create libxfs helper to rename two directory entries
Create a new libxfs function to rename two directory entries.  The
upcoming metadata directory feature will need this to replace a metadata
inode directory entry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:59 -07:00
Darrick J. Wong
a55712b35c xfs: create libxfs helper to exchange two directory entries
Create a new libxfs function to exchange two directory entries.
The upcoming metadata directory feature will need this to replace a
metadata inode directory entry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:59 -07:00
Darrick J. Wong
90636e4531 xfs: create libxfs helper to remove an existing inode/name from a directory
Create a new libxfs function to remove a (name, inode) entry from a
directory.  The upcoming metadata directory feature will need this to
create a metadata directory tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:59 -07:00
Darrick J. Wong
1964435d19 xfs: hoist inode free function to libxfs
Create a libxfs helper function that marks an inode free on disk.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:59 -07:00
Darrick J. Wong
c1f0bad423 xfs: create libxfs helper to link an existing inode into a directory
Create a new libxfs function to link an existing inode into a directory.
The upcoming metadata directory feature will need this to create a
metadata directory tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:59 -07:00
Darrick J. Wong
1fa2e81957 xfs: create libxfs helper to link a new inode into a directory
Create a new libxfs function to link a newly created inode into a
directory.  The upcoming metadata directory feature will need this to
create a metadata directory tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:58 -07:00
Darrick J. Wong
b11b11e3b7 xfs: separate the icreate logic around INIT_XATTRS
INIT_XATTRS is overloaded here -- it's set during the creat process when
we think that we're immediately going to set some ACL xattrs to save
time.  However, it's also used by the parent pointers code to enable the
attr fork in preparation to receive ppptr xattrs.  This results in
xfs_has_parent() branches scattered around the codebase to turn on
INIT_XATTRS.

Linkable files are created far more commonly than unlinkable temporary
files or directory tree roots, so we should centralize this logic in
xfs_inode_init.  For the three callers that don't want parent pointers
(online repair tempfiles, unlinkable tempfiles, rootdir creation) we
provide an UNLINKABLE flag to skip attr fork initialization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:58 -07:00
Darrick J. Wong
a9e583d34f xfs: hoist xfs_{bump,drop}link to libxfs
Move xfs_bumplink and xfs_droplink to libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:58 -07:00
Darrick J. Wong
b8a6107921 xfs: hoist xfs_iunlink to libxfs
Move xfs_iunlink and xfs_iunlink_remove to libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:58 -07:00
Darrick J. Wong
c0223b8d66 xfs: wrap inode creation dqalloc calls
Create a helper that calls dqalloc to allocate and grab a reference to
dquots for the user, group, and project ids listed in an icreate
structure.  This simplifies the creat-related dqalloc callsites
scattered around the code base.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:57 -07:00
Darrick J. Wong
dfaf884233 xfs: push xfs_icreate_args creation out of xfs_create*
Move the initialization of the xfs_icreate_args structure out of
xfs_create and xfs_create_tempfile into their callers so that we can set
the new inode's attributes in one place and pass that through instead of
open coding the collection of attributes all over the code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:57 -07:00
Darrick J. Wong
e9d2b35bb9 xfs: hoist new inode initialization functions to libxfs
Move all the code that initializes a new inode's attributes from the
icreate_args structure and the parent directory into libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:57 -07:00
Darrick J. Wong
38fd3d6a95 xfs: split new inode creation into two pieces
There are two parts to initializing a newly allocated inode: setting up
the incore structures, and initializing the new inode core based on the
parent inode and the current user's environment.  The initialization
code is not specific to the kernel, so we would like to share that with
userspace by hoisting it to libxfs.  Therefore, split xfs_icreate into
separate functions to prepare for the next few patches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:57 -07:00
Darrick J. Wong
a7b12718cb xfs: use xfs_trans_ichgtime to set times when allocating inode
Use xfs_trans_ichgtime to set the inode times when allocating an inode,
instead of open-coding them here.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:56 -07:00
Darrick J. Wong
3d1dfb6df9 xfs: implement atime updates in xfs_trans_ichgtime
Enable xfs_trans_ichgtime to change the inode access time so that we can
use this function to set inode times when allocating inodes instead of
open-coding it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:56 -07:00
Darrick J. Wong
ba4b39fe4c xfs: pack icreate initialization parameters into a separate structure
Callers that want to create an inode currently pass all possible file
attribute values for the new inode into xfs_init_new_inode as ten
separate parameters.  This causes two code maintenance issues: first, we
have large multi-line call sites which programmers must read carefully
to make sure they did not accidentally invert a value.  Second, all
three file id parameters must be passed separately to the quota
functions; any discrepancy results in quota count errors.

Clean this up by creating a new icreate_args structure to hold all this
information, some helpers to initialize them properly, and make the
callers pass this structure through to the creation function, whose name
we shorten to xfs_icreate.  This eliminates the issues, enables us to
keep the inode init code in sync with userspace via libxfs, and is
needed for future metadata directory tree management.

(A subsequent cleanup will also fix the quota alloc calls.)
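
As a rough sketch of the shape of such a structure (the field names below
are illustrative assumptions, not the exact kernel layout):

	struct xfs_icreate_args_example {	/* illustrative only */
		struct xfs_inode	*pip;	/* parent directory inode */
		kuid_t			uid;	/* file ids, carried together so */
		kgid_t			gid;	/* quota and inode init can never */
		prid_t			prid;	/* disagree about them */
		umode_t			mode;	/* file type and permission bits */
		unsigned int		flags;	/* e.g. "skip attr fork init" */
	};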

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:56 -07:00
Darrick J. Wong
fcea5b35f3 xfs: hoist project id get/set functions to libxfs
Move the project id get and set functions into libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:56 -07:00
Darrick J. Wong
b7c477be39 xfs: hoist inode flag conversion functions to libxfs
Hoist the inode flag conversion functions into libxfs so that we can
keep them in sync.  Do this by creating a new xfs_inode_util.c file in
libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:55 -07:00
Darrick J. Wong
acdddbe168 xfs: hoist extent size helpers to libxfs
Move the extent size helpers to xfs_bmap.c in libxfs since they're used
there already.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:55 -07:00
Darrick J. Wong
d76e137057 xfs: move inode copy-on-write predicates to xfs_inode.[ch]
Move these inode predicate functions to xfs_inode.[ch] since they're not
reflink functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:55 -07:00
Darrick J. Wong
24a4e1cb32 xfs: use consistent uid/gid when grabbing dquots for inodes
I noticed that callers of xfs_qm_vop_dqalloc use the following code to
compute the anticipated uid of the new file:

	mapped_fsuid(idmap, &init_user_ns);

whereas the VFS uses a slightly different computation for actually
assigning i_uid:

	mapped_fsuid(idmap, i_user_ns(inode));

Technically, these are not the same things.  According to Christian
Brauner, the only time that inode->i_sb->s_user_ns != &init_user_ns is
when the filesystem was mounted in a new mount namespace by an
unprivileged user.  XFS does not allow this, which is why we've never
seen bug reports about quotas being incorrect or the uid checks in
xfs_qm_vop_create_dqattach tripping debug assertions.

However, this /is/ a logic bomb, so let's make the code consistent.

Link: https://lore.kernel.org/linux-fsdevel/20240617-weitblick-gefertigt-4a41f37119fa@brauner/
Fixes: c14329d39f ("fs: port fs{g,u}id helpers to mnt_idmap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:55 -07:00
Darrick J. Wong
150bb10a28 xfs: verify buffer, inode, and dquot items every tx commit
generic/388 has an annoying tendency to fail like this during log
recovery:

XFS (sda4): Unmounting Filesystem 435fe39b-82b6-46ef-be56-819499585130
XFS (sda4): Mounting V5 Filesystem 435fe39b-82b6-46ef-be56-819499585130
XFS (sda4): Starting recovery (logdev: internal)
00000000: 49 4e 81 b6 03 02 00 00 00 00 00 07 00 00 00 07  IN..............
00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 10  ................
00000020: 35 9a 8b c1 3e 6e 81 00 35 9a 8b c1 3f dc b7 00  5...>n..5...?...
00000030: 35 9a 8b c1 3f dc b7 00 00 00 00 00 00 3c 86 4f  5...?........<.O
00000040: 00 00 00 00 00 00 02 f3 00 00 00 00 00 00 00 00  ................
00000050: 00 00 1f 01 00 00 00 00 00 00 00 02 b2 74 c9 0b  .............t..
00000060: ff ff ff ff d7 45 73 10 00 00 00 00 00 00 00 2d  .....Es........-
00000070: 00 00 07 92 00 01 fe 30 00 00 00 00 00 00 00 1a  .......0........
00000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000090: 35 9a 8b c1 3b 55 0c 00 00 00 00 00 04 27 b2 d1  5...;U.......'..
000000a0: 43 5f e3 9b 82 b6 46 ef be 56 81 94 99 58 51 30  C_....F..V...XQ0
XFS (sda4): Internal error Bad dinode after recovery at line 539 of file fs/xfs/xfs_inode_item_recover.c.  Caller xlog_recover_items_pass2+0x4e/0xc0 [xfs]
CPU: 0 PID: 2189311 Comm: mount Not tainted 6.9.0-rc4-djwx #rc4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x4f/0x60
 xfs_corruption_error+0x90/0xa0
 xlog_recover_inode_commit_pass2+0x5f1/0xb00
 xlog_recover_items_pass2+0x4e/0xc0
 xlog_recover_commit_trans+0x2db/0x350
 xlog_recovery_process_trans+0xab/0xe0
 xlog_recover_process_data+0xa7/0x130
 xlog_do_recovery_pass+0x398/0x840
 xlog_do_log_recovery+0x62/0xc0
 xlog_do_recover+0x34/0x1d0
 xlog_recover+0xe9/0x1a0
 xfs_log_mount+0xff/0x260
 xfs_mountfs+0x5d9/0xb60
 xfs_fs_fill_super+0x76b/0xa30
 get_tree_bdev+0x124/0x1d0
 vfs_get_tree+0x17/0xa0
 path_mount+0x72b/0xa90
 __x64_sys_mount+0x112/0x150
 do_syscall_64+0x49/0x100
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
XFS (sda4): Corruption detected. Unmount and run xfs_repair
XFS (sda4): Metadata corruption detected at xfs_dinode_verify.part.0+0x739/0x920 [xfs], inode 0x427b2d1
XFS (sda4): Filesystem has been shut down due to log error (0x2).
XFS (sda4): Please unmount the filesystem and rectify the problem(s).
XFS (sda4): log mount/recovery failed: error -117
XFS (sda4): log mount failed

Here, inode log item recovery failed the dinode verifier after replaying
the contents of the inode log item into the ondisk inode.
Looking back into what the kernel was doing at the time of the fs
shutdown, a thread was in the middle of running a series of
transactions, each of which committed changes to the inode.

At some point in the middle of that chain, an invalid (at least
according to the verifier) change was committed.  Had the filesystem not
shut down in the middle of the chain, a subsequent transaction would
have corrected the invalid state and nobody would have noticed.  But
that's not what happened here.  Instead, the invalid inode state was
committed to the ondisk log, so log recovery tripped over it.

The actual defect here was an overzealous inode verifier, which was
fixed in a separate patch.  This patch adds some transaction precommit
functions for CONFIG_XFS_DEBUG=y mode so that we can detect these kinds
of transient errors at transaction commit time, where it's much easier
to find the root cause.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-07-02 11:36:54 -07:00
Darrick J. Wong
3ba3ab1f67 xfs: enable FITRIM on the realtime device
Implement FITRIM for the realtime device by pretending that it's
"space" immediately after the data device.  We have to hold the
rtbitmap ILOCK while the discard operations are ongoing because there's
no busy extent tracking for the rt volume to prevent reallocations.

Cc: Konst Mayer <cdlscpmv@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
Wenchao Hao
a330cae8a7 xfs: Remove header files which are included more than once
Following warning is reported, so remove these duplicated header
including:

./fs/xfs/libxfs/xfs_trans_resv.c: xfs_da_format.h is included more than once.
./fs/xfs/scrub/quota_repair.c: xfs_format.h is included more than once.
./fs/xfs/xfs_handle.c: xfs_da_btree.h is included more than once.
./fs/xfs/xfs_qm_bhv.c: xfs_mount.h is included more than once.
./fs/xfs/xfs_trace.c: xfs_bmap.h is included more than once.

This is just a code cleanup; no logic is changed.

Signed-off-by: Wenchao Hao <haowenchao22@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
Christoph Hellwig
4818fd60db xfs: fold xfs_ilock_for_write_fault into xfs_write_fault
Now that the page fault handler has been refactored, the only caller
of xfs_ilock_for_write_fault is simple enough and calls it
unconditionally.  Fold the logic and expand the comments explaining it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
Christoph Hellwig
4e82fa11fb xfs: always take XFS_MMAPLOCK shared in xfs_dax_read_fault
After the previous refactoring, xfs_dax_fault is now never used for write
faults, so don't bother with the xfs_ilock_for_write_fault logic to
protect against writes when remapping is in progress.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
Christoph Hellwig
6a39ec1d39 xfs: refactor __xfs_filemap_fault
Split the write fault and DAX fault handling into separate helpers
so that the main fault handler is easier to follow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
Christoph Hellwig
9092b1de35 xfs: simplify xfs_dax_fault
Replace the separate stub with an IS_ENABLED check, and move the call to
dax_finish_sync_fault into xfs_dax_fault instead of leaving it in the
caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
Christoph Hellwig
29bc0dd0a2 xfs: cleanup xfs_ilock_iocb_for_write
Move the relock path out of the straight-line code and add a comment
explaining why it exists.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
Christoph Hellwig
8626b67acf xfs: move the dio write relocking out of xfs_ilock_for_iomap
About half of xfs_ilock_for_iomap deals with a special case for direct
I/O writes to COW files that need to take the ilock exclusively.  Move
this code into the one caller that cares and simplify
xfs_ilock_for_iomap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
lei lu
0c7fcdb6d0 xfs: don't walk off the end of a directory data block
This adds sanity checks for xfs_dir2_data_unused and xfs_dir2_data_entry
to make sure they don't stray beyond the valid memory region. Before the
patch, the loop simply checks that the start offset of the dup and dep is
within the range. So in a crafted image, if the last entry is an
xfs_dir2_data_unused, we can change dup->length to dup->length-1 and leave
1 byte of space. In the next traversal, this space will be considered as a
dup or dep, and we may encounter an out-of-bounds read when accessing the
fixed members.

In the patch, we make sure that the remaining bytes are large enough to
hold an unused entry before accessing xfs_dir2_data_unused, and that
xfs_dir2_data_unused is XFS_DIR2_DATA_ALIGN-byte aligned. We also make
sure that the remaining bytes are large enough to hold a dirent with a
single-byte name before accessing xfs_dir2_data_entry.
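
A minimal sketch of the kind of bounds check described above (the loop is
simplified and illustrative; the real walk also honors the block header
and tail):

	char *ptr  = start;				/* first byte of entries */
	char *endp = start + mp->m_dir_geo->blksize;	/* end of the data block */

	while (ptr < endp) {
		struct xfs_dir2_data_unused *dup = (struct xfs_dir2_data_unused *)ptr;
		struct xfs_dir2_data_entry  *dep = (struct xfs_dir2_data_entry *)ptr;

		/* room for a whole unused entry before touching its fields? */
		if (ptr + sizeof(*dup) > endp)
			return __this_address;
		if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) {
			ptr += be16_to_cpu(dup->length);
			continue;
		}
		/* room for a dirent with at least a single-byte name? */
		if (ptr + xfs_dir2_data_entsize(mp, 1) > endp)
			return __this_address;
		ptr += xfs_dir2_data_entsize(mp, dep->namelen);
	}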

Signed-off-by: lei lu <llfamsec@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
lei lu
fb63435b7c xfs: add bounds checking to xlog_recover_process_data
There is a lack of verification of the space occupied by the fixed members
of xlog_op_header in xlog_recover_process_data().

We can create a crafted image to trigger an out of bounds read by
following these steps:
    1) Mount an image of xfs, and do some file operations to leave records
    2) Before umounting, copy the image for subsequent steps to simulate
       abnormal exit. Because umount will ensure that tail_blk and
       head_blk are the same, which will result in the inability to enter
       xlog_recover_process_data
    3) Write a tool to parse and modify the copied image in step 2
    4) Make the end of the xlog_op_header entries only 1 byte away from
       xlog_rec_header->h_size
    5) xlog_rec_header->h_num_logops++
    6) Modify xlog_rec_header->h_crc

Fix:
Add a check to make sure there is sufficient space to access fixed members
of xlog_op_header.
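
A minimal sketch of the added space check (illustrative; dp walks the
record payload and end marks its end, and both names are assumptions):

	while (dp < end) {
		struct xlog_op_header *ohead = (struct xlog_op_header *)dp;

		/* enough bytes left for the fixed members of the op header? */
		if (dp + sizeof(*ohead) > end)
			return -EFSCORRUPTED;
		dp += sizeof(*ohead) + be32_to_cpu(ohead->oh_len);
	}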

Signed-off-by: lei lu <llfamsec@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
John Garry
f23660f059 xfs: Fix xfs_prepare_shift() range for RT
The RT extent range must be considered in the xfs_flush_unmap_range() call
to stabilize the boundary.

This code change is originally from Dave Chinner.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
John Garry
d3b689d7c7 xfs: Fix xfs_flush_unmap_range() range for RT
Currently xfs_flush_unmap_range() does unmap for a full RT extent range,
which we also want to ensure is clean and idle.

This code change is originally from Dave Chinner.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
Gao Xiang
d40c2865bd xfs: avoid redundant AGFL buffer invalidation
Currently AGFL blocks can be filled from the following three sources:
 - allocbt free blocks, as in xfs_allocbt_free_block();
 - rmapbt free blocks, as in xfs_rmapbt_free_block();
 - refilled from freespace btrees, as in xfs_alloc_fix_freelist().

Originally, allocbt free blocks would be marked as stale only when they
were put back in the general free space pool; as Dave mentioned on IRC, "we
don't stale AGF metadata btree blocks when they are returned to the
AGFL .. but once they get put back in the general free space pool, we
have to make sure the buffers are marked stale as the next user of
those blocks might be user data...."

However, after commit ca250b1b3d ("xfs: invalidate allocbt blocks
moved to the free list") and commit edfd9dd549 ("xfs: move buffer
invalidation to xfs_btree_free_block"), even allocbt / bmapbt free
blocks will be invalidated immediately since they may fail to pass
V5 format validation on writeback, even though writeback to free space
would be safe.

IOWs, IMHO there is currently no real difference between free blocks in
the AGFL freespace pool and in the general free space pool.  So let's
avoid extra redundant AGFL buffer invalidation, since otherwise we're
currently facing unnecessary xfs_log_force() due to xfs_trans_binval()
again on buffers that were already marked stale, as shown below:

[  333.507469] Call Trace:
[  333.507862]  xfs_buf_find+0x371/0x6a0       <- xfs_buf_lock
[  333.508451]  xfs_buf_get_map+0x3f/0x230
[  333.509062]  xfs_trans_get_buf_map+0x11a/0x280
[  333.509751]  xfs_free_agfl_block+0xa1/0xd0
[  333.510403]  xfs_agfl_free_finish_item+0x16e/0x1d0
[  333.511157]  xfs_defer_finish_noroll+0x1ef/0x5c0
[  333.511871]  xfs_defer_finish+0xc/0xa0
[  333.512471]  xfs_itruncate_extents_flags+0x18a/0x5e0
[  333.513253]  xfs_inactive_truncate+0xb8/0x130
[  333.513930]  xfs_inactive+0x223/0x270

xfs_log_force() can take tens of milliseconds with the AGF buffer locked.
That is an unnecessarily long latency, especially on our PMEM devices with
FSDAX enabled, while fsops like xfs_reflink_find_shared() are stuck at the
same time on the same AGF lock.  Removing the double invalidation of the
AGFL blocks does not make this issue go away completely, but this patch
fixes it for our workloads in practice, and it should also hold up under
code analysis.

Note that I'm not sure whether I need to remove another redundant
invalidation in xfs_alloc_ag_vextent_small(), since it's unrelated to our
workloads.  fstests also pass with this patch.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01 09:32:29 +05:30
Darrick J. Wong
673cd885bb xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs
xfs_init_new_inode ignores the init_xattrs parameter for filesystems
that do not have ATTR enabled.  As a result, the first init_xattrs file
to be created by the kernel will not have an attr fork created to store
acls.  Storing that first acl will add ATTR to the superblock flags, so
subsequent files will be created with attr forks.  The overhead of this
is so small that chances are that nobody has noticed this behavior.

However, this is disastrous on a filesystem with parent pointers because
it requires that a new linkable file /must/ have a pre-existing attr
fork, and the parent pointers code uses init_xattrs to create that fork.
The preproduction version of mkfs.xfs used to set this, but the V5 sb
verifier only requires ATTR2, not ATTR.  There is no guard for
filesystems with (PARENT && !ATTR).

It turns out that I misunderstood the two flags -- ATTR means that we at
some point created an attr fork to store xattrs in a file; ATTR2
apparently means only that inodes have dynamic fork offsets or that the
filesystem was mounted with the "attr2" option.

Fixes: 2442ee15bb ("xfs: eager inode attr fork init needs attr feature awareness")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:25 +05:30
Darrick J. Wong
dc5e1cbae2 xfs: fix direction in XFS_IOC_EXCHANGE_RANGE
The kernel reads userspace's buffer but does not write it back.
Therefore this is really an _IOW ioctl.  Change this before 6.10 final
releases.
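
For reference, _IOW is named from userspace's point of view: userspace
writes the buffer and the kernel only reads it.  A hedged illustration with
made-up names (the real command number and structure live in the XFS uapi
headers):

	/* illustrative only -- not the real XFS definition */
	struct example_exchange_range { __u64 fd; __u64 offset; __u64 length; };
	#define EXAMPLE_IOC_EXCHANGE_RANGE	_IOW('X', 0x7f, struct example_exchange_range)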

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:25 +05:30
Darrick J. Wong
1ec9307fc0 xfs: allow unlinked symlinks and dirs with zero size
For a very very long time, inode inactivation has set the inode size to
zero before unmapping the extents associated with the data fork.
Unfortunately, commit 3c6f46eacd changed the inode verifier to
prohibit zero-length symlinks and directories.  If an inode happens to
get logged in this state and the system crashes before freeing the
inode, log recovery will also fail on the broken inode.

Therefore, allow zero-size symlinks and directories as long as the link
count is zero; nobody will be able to open these files by handle so
there isn't any risk of data exposure.

Fixes: 3c6f46eacd ("xfs: sanity check directory inode di_size")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:25 +05:30
Darrick J. Wong
288e1f693f xfs: restrict when we try to align cow fork delalloc to cowextsz hints
xfs/205 produces the following failure when always_cow is enabled:

  --- a/tests/xfs/205.out	2024-02-28 16:20:24.437887970 -0800
  +++ b/tests/xfs/205.out.bad	2024-06-03 21:13:40.584000000 -0700
  @@ -1,4 +1,5 @@
   QA output created by 205
   *** one file
  +   !!! disk full (expected)
   *** one file, a few bytes at a time
   *** done

This is the result of overly aggressive attempts to align cow fork
delalloc reservations to the CoW extent size hint.  Looking at the trace
data, we're trying to append a single fsblock to the "fred" file.
Trying to create a speculative post-eof reservation fails because
there's not enough space.

We then set @prealloc_blocks to zero and try again, but the cowextsz
alignment code triggers, which expands our request for a 1-fsblock
reservation into a 39-block reservation.  There's not enough space for
that, so the whole write fails with ENOSPC even though there's
sufficient space in the filesystem to allocate the single block that we
need to land the write.

There are two things wrong here -- first, we shouldn't be attempting
speculative preallocations beyond what was requested when we're low on
space.  Second, if we've already computed a posteof preallocation, we
shouldn't bother trying to align that to the cowextsize hint.

Fix both of these problems by adding a flag that only enables the
expansion of the delalloc reservation to the cowextsize if we're doing a
non-extending write, and only if we're not doing an ENOSPC retry.  This
requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc.

I probably should have caught this six years ago when 6ca30729c2 was
being reviewed, but oh well.  Update the comments to reflect what the
code does now.

Fixes: 6ca30729c2 ("xfs: bmap code cleanup")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:24 +05:30
Christoph Hellwig
610b29161b xfs: fix freeing speculative preallocations for preallocated files
xfs_can_free_eofblocks returns false for files that have persistent
preallocations unless the force flag is passed and there are delayed
blocks.  This means it won't free delalloc reservations for files
with persistent preallocations unless the force flag is set, and it
will also free the persistent preallocations if the force flag is
set and the file happens to have delayed allocations.

Both of these are bad, so do away with the force flag and always free
only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC
or APPEND flags set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:17 +05:30
Zhang Yi
d048945150
xfs: reserve blocks for truncating large realtime inode
When doing an unaligned truncate down of a big realtime file,
xfs_truncate_page() only zeroes out the tail EOF block; __xfs_bunmapi()
should split the tail written extent and convert the part beyond the EOF
block to unwritten, but it can't work as expected now since the block
reservation is zero in xfs_setattr_size().  This could expose stale data
just after commit '943bc0882ceb ("iomap: don't increase i_size if it's not
a write operation")'.

If we truncate file that contains a large enough written extent:

     |<    rxext    >|<    rtext    >|
  ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
        ^ (new EOF)      ^ old EOF

Since we only zero out the tail of the EOF block, and
xfs_itruncate_extents()->..->__xfs_bunmapi() only unmaps whole aligned
extents, it becomes this state:

     |<    rxext    >|
  ...WWWzWWWWWWWWWWWWW
        ^ new EOF

Then if we do an extending write like this, the blocks in the previous
tail extent become stale:

     |<    rxext    >|
  ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW
        ^ old EOF               ^ append start  ^ new EOF

Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for the big realtime
inode.
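
A sketch in the spirit of the fix (the call site is xfs_setattr_size();
treat the exact condition and reservation as assumptions):

	unsigned int	resblks = 0;

	/*
	 * A big realtime file may need __xfs_bunmapi to split and convert
	 * the tail extent, so reserve blocks instead of passing 0.
	 */
	if (XFS_IS_REALTIME_INODE(ip))
		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks, 0, 0, &tp);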

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240618142112.1315279-2-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-19 15:58:28 +02:00
Dave Chinner
348a1983cf xfs: fix unlink vs cluster buffer instantiation race
Luis has been reporting an assert failure when freeing an inode
cluster during inode inactivation for a while. The assert looks
like:

 XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
 ------------[ cut here ]------------
 kernel BUG at fs/xfs/xfs_message.c:102!
 Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
 CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
 RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
 RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
 RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
 RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
 R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
 R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
 FS:  0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  <TASK>
 xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
 xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
 xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
 xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
 __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
 xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
 xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
 xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
 process_one_work (kernel/workqueue.c:3231)
 worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
 kthread (kernel/kthread.c:389)
 ret_from_fork (arch/x86/kernel/process.c:147)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
  </TASK>

It occurs when the inode precommit handler attempts to look up the inode
cluster buffer to attach the inode for writeback.

The trail of logic that I can reconstruct is as follows.

	1. the inode is clean when inodegc runs, so it is not
	   attached to a cluster buffer when precommit runs.

	2. #1 implies the inode cluster buffer may be clean and not
	   pinned by dirty inodes when inodegc runs.

	3. #2 implies that the inode cluster buffer can be reclaimed
	   by memory pressure at any time.

	4. The assert failure implies that the cluster buffer was
	   attached to the transaction, but not marked done. It had
	   been accessed earlier in the transaction, but not marked
	   done.

	5. #4 implies the cluster buffer has been invalidated (i.e.
	   marked stale).

	6. #5 implies that the inode cluster buffer was instantiated
	   uninitialised in the transaction in xfs_ifree_cluster(),
	   which only instantiates the buffers to invalidate them
	   and never marks them as done.

Given factors 1-3, this issue is highly dependent on timing and
environmental factors. Hence the issue can be very difficult to
reproduce in some situations, but highly reliable in others. Luis
has an environment where it can be reproduced easily by g/531 but,
OTOH, I've reproduced it only once in ~2000 cycles of g/531.

I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
on the cluster buffers, even though they may not be initialised. The
reasons why I think this is safe are:

	1. A buffer cache lookup hit on a XBF_STALE buffer will
	   clear the XBF_DONE flag. Hence all future users of the
	   buffer know they have to re-initialise the contents
	   before use and mark it done themselves.

	2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
	   means the buffer remains locked until the journal commit
	   completes and the buffer is unpinned. Hence once marked
	   XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
	   context that can access the freed buffer is the currently
	   running transaction.

	3. #2 implies that future buffer lookups in the currently
	   running transaction will hit the transaction match code
	   and not the buffer cache. Hence XBF_STALE and
	   XFS_BLI_STALE will not be cleared unless the transaction
	   initialises and logs the buffer with valid contents
	   again. At which point, the buffer will be marked
	   XBF_DONE again, so having XBF_DONE already set on the
	   stale buffer is a moot point.

	4. #2 also implies that any concurrent access to that
	   cluster buffer will block waiting on the buffer lock
	   until the inode cluster has been fully freed and is no
	   longer an active inode cluster buffer.

	5. #4 + #1 means that any future user of the disk range of
	   that buffer will always see the range of disk blocks
	   covered by the cluster buffer as not done, and hence must
	   initialise the contents themselves.

	6. Setting XBF_DONE in xfs_ifree_cluster() then means the
	   unlinked inode precommit code will see a XBF_DONE buffer
	   from the transaction match as it expects. It can then
	   attach the stale but newly dirtied inode to the stale
	   but newly dirtied cluster buffer without unexpected
	   failures. The stale buffer will then sail through the
	   journal and do the right thing with the attached stale
	   inode during unpin.

Hence the fix is just one line of extra code. The explanation of
why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
complex....
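
In code, the one-liner amounts to something like the following in
xfs_ifree_cluster(), right where the cluster buffer is invalidated (a
sketch, not the verbatim hunk):

	/*
	 * The buffer was instantiated uninitialised, but mark it done anyway
	 * so a later lookup within this transaction sees XBF_DONE as expected.
	 */
	bp->b_flags |= XBF_DONE;
	xfs_trans_binval(tp, bp);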

Fixes: 82842fee6e ("xfs: fix AGF vs inode cluster buffer deadlock")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-17 11:17:09 +05:30
Mateusz Guzik
e9dae2fb99 xfs: remove now spurious i_state initialization in xfs_inode_alloc
inode_init_always started setting the field to 0.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240611120626.513952-4-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-13 13:40:44 +02:00
Mateusz Guzik
ddd4cd4824 xfs: preserve i_state around inode_init_always in xfs_reinit_inode
This is in preparation for the routine starting to zero the field.

De facto coded by Dave Chinner, see:
https://lore.kernel.org/linux-fsdevel/ZmgtaGglOL33Wkzr@dread.disaster.area/

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240611120626.513952-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-12 14:13:26 +02:00
Wengang Wang
58f880711f xfs: make sure sb_fdblocks is non-negative
A user with a completely full filesystem experienced an unexpected
shutdown when the filesystem tried to write the superblock during
runtime.
The kernel shows the following dmesg:

[    8.176281] XFS (dm-4): Metadata corruption detected at xfs_sb_write_verify+0x60/0x120 [xfs], xfs_sb block 0x0
[    8.177417] XFS (dm-4): Unmount and run xfs_repair
[    8.178016] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
[    8.178703] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00  XFSB............
[    8.179487] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    8.180312] 00000020: cf 12 dc 89 ca 26 45 29 92 e6 e3 8d 3b b8 a2 c3  .....&E)....;...
[    8.181150] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80  ................
[    8.182003] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
[    8.182004] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00  .....d..........
[    8.182004] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00  ..d.............
[    8.182005] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19  ................
[    8.182008] XFS (dm-4): Corruption of in-memory data detected.  Shutting down filesystem
[    8.182010] XFS (dm-4): Please unmount the filesystem and rectify the problem(s)

When xfs_log_sb writes the superblock to disk, sb_fdblocks is fetched from
m_fdblocks without any lock. m_fdblocks can go through a positive ->
negative -> positive transition when the FS approaches fullness (see
xfs_mod_fdblocks), so there is a chance that sb_fdblocks is negative, and
because sb_fdblocks is an unsigned long long, it then reads as a huge
value. sb_fdblocks being bigger than sb_dblocks is a problem during log
recovery; xfs_validate_sb_write() complains.

Fix:
As sb_fdblocks will be recalculated during mount when lazysbcount is
enabled, we just need to make xfs_validate_sb_write() happy -- make sure
sb_fdblocks is not negative. This patch also takes care of the other
percpu counters in xfs_log_sb.
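
One way to do that (a sketch in the spirit of the fix; the actual hunk in
xfs_log_sb() may differ) is to use the clamped percpu sum:

	/*
	 * Never let a transiently negative m_fdblocks reach the ondisk
	 * superblock, where it would read back as a huge unsigned value.
	 */
	mp->m_sb.sb_fdblocks = percpu_counter_sum_positive(&mp->m_fdblocks);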

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-10 11:38:12 +05:30
Ritesh Harjani (IBM)
b0c6bcd58d xfs: Add cond_resched to block unmap range and reflink remap path
An async dio write to a sparse file can generate a lot of extents,
and when we unlink this file (using rm), the kernel can be busy unmapping
and freeing those extents as part of transaction processing.

Similarly xfs reflink remapping path can also iterate over a million
extent entries in xfs_reflink_remap_blocks().

Since we can busy-loop in these two functions, let's add cond_resched()
calls to avoid softlockup messages like the ones below (a sketch of the
pattern follows the traces):

watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S  L   6.9.0-rc5-0-default #1
Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
Call Trace:
  xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
  xfs_alloc_compute_aligned+0x5c/0x144
  xfs_alloc_ag_vextent_size+0x238/0x8d4
  xfs_alloc_fix_freelist+0x540/0x694
  xfs_free_extent_fix_freelist+0x84/0xe0
  __xfs_free_extent+0x74/0x1ec
  xfs_extent_free_finish_item+0xcc/0x214
  xfs_defer_finish_one+0x194/0x388
  xfs_defer_finish_noroll+0x1b4/0x5c8
  xfs_defer_finish+0x2c/0xc4
  xfs_bunmapi_range+0xa4/0x100
  xfs_itruncate_extents_flags+0x1b8/0x2f4
  xfs_inactive_truncate+0xe0/0x124
  xfs_inactive+0x30c/0x3e0
  xfs_inodegc_worker+0x140/0x234
  process_scheduled_works+0x240/0x57c
  worker_thread+0x198/0x468
  kthread+0x138/0x140
  start_kernel_thread+0x14/0x18

run fstests generic/175 at 2024-02-02 04:40:21
[   C17] watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
 watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
 CPU: 17 PID: 7679 Comm: xfs_io Kdump: loaded Tainted: G X 6.4.0
 NIP [c008000005e3ec94] xfs_rmapbt_diff_two_keys+0x54/0xe0 [xfs]
 LR [c008000005e08798] xfs_btree_get_leaf_keys+0x110/0x1e0 [xfs]
 Call Trace:
  0xc000000014107c00 (unreliable)
  __xfs_btree_updkeys+0x8c/0x2c0 [xfs]
  xfs_btree_update_keys+0x150/0x170 [xfs]
  xfs_btree_lshift+0x534/0x660 [xfs]
  xfs_btree_make_block_unfull+0x19c/0x240 [xfs]
  xfs_btree_insrec+0x4e4/0x630 [xfs]
  xfs_btree_insert+0x104/0x2d0 [xfs]
  xfs_rmap_insert+0xc4/0x260 [xfs]
  xfs_rmap_map_shared+0x228/0x630 [xfs]
  xfs_rmap_finish_one+0x2d4/0x350 [xfs]
  xfs_rmap_update_finish_item+0x44/0xc0 [xfs]
  xfs_defer_finish_noroll+0x2e4/0x740 [xfs]
  __xfs_trans_commit+0x1f4/0x400 [xfs]
  xfs_reflink_remap_extent+0x2d8/0x650 [xfs]
  xfs_reflink_remap_blocks+0x154/0x320 [xfs]
  xfs_file_remap_range+0x138/0x3a0 [xfs]
  do_clone_file_range+0x11c/0x2f0
  vfs_clone_file_range+0x60/0x1c0
  ioctl_file_clone+0x78/0x140
  sys_ioctl+0x934/0x1270
  system_call_exception+0x158/0x320
  system_call_vectored_common+0x15c/0x2ec
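
The pattern being added is simply a cond_resched() per iteration of the
long-running extent walk; a sketch with a hypothetical helper name:

	while (len > 0) {
		/* unmap_or_remap_one_extent() is a hypothetical stand-in */
		error = unmap_or_remap_one_extent(ip, &len);
		if (error)
			break;
		cond_resched();		/* let the scheduler run; avoids soft lockups */
	}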

Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Disha Goel <disgoel@linux.ibm.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-27 20:50:35 +05:30
Darrick J. Wong
95b19e2f4e xfs: don't open-code u64_to_user_ptr
Don't open-code what the kernel already provides.
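
For example, a u64 handed in through an ioctl structure can be converted
with the existing helper instead of an open-coded cast (variable names are
illustrative):

	void __user *uptr = u64_to_user_ptr(arg64);	/* not (void __user *)(uintptr_t)arg64 */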

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-27 15:55:52 +05:30
Darrick J. Wong
38de567906 xfs: allow symlinks with short remote targets
An internal user complained about log recovery failing on a symlink
("Bad dinode after recovery") with the following (excerpted) format:

core.magic = 0x494e
core.mode = 0120777
core.version = 3
core.format = 2 (extents)
core.nlinkv2 = 1
core.nextents = 1
core.size = 297
core.nblocks = 1
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
0:[0,12,1,0]

This is a symbolic link with a 297-byte target stored in a disk block,
which is to say this is a symlink with a remote target.  The forkoff is
0, which is to say that there's 512 - 176 == 336 bytes in the inode core
to store the data fork.

Eventually, testing of generic/388 failed with the same inode corruption
message during inode recovery.  In writing a debugging patch to call
xfs_dinode_verify on dirty inode log items when we're committing
transactions, I observed that xfs/298 can reproduce the problem quite
quickly.

xfs/298 creates a symbolic link, adds some extended attributes, then
deletes them all.  The test failure occurs when the final removexattr
also deletes the attr fork because that does not convert the remote
symlink back into a shortform symlink.  That is how we trip this test.
The only reason why xfs/298 only triggers with the debug patch added is
that it deletes the symlink, so the final iflush shows the inode as
free.

I wrote a quick fstest to emulate the behavior of xfs/298, except that
it leaves the symlinks on the filesystem after inducing the "corrupt"
state.  Kernels going back at least as far as 4.18 have written out
symlink inodes in this manner and prior to 1eb70f54c4 they did not
object to reading them back in.

Because we've been writing out inodes this way for quite some time, the
only way to fix this is to relax the check for symbolic links.
Directories don't have this problem because di_size is bumped to
blocksize during the sf->data conversion.

Fixes: 1eb70f54c4 ("xfs: validate inode fork size against fork format")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-27 15:55:52 +05:30
Darrick J. Wong
97835e6866 xfs: fix xfs_init_attr_trans not handling explicit operation codes
When we were converting the attr code to use an explicit operation code
instead of keying off of attr->value being null, we forgot to change the
code that initializes the transaction reservation.  Split the function
into two helpers that handle the !remove and remove cases, then fix both
callsites to handle this correctly.

Fixes: c27411d4c6 ("xfs: make attr removal an explicit operation")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-27 15:55:52 +05:30
Darrick J. Wong
2b3f004d3d xfs: drop xfarray sortinfo folio on error
Chandan Babu reports the following livelock in xfs/708:

 run fstests xfs/708 at 2024-05-04 15:35:29
 XFS (loop16): EXPERIMENTAL online scrub feature in use. Use at your own risk!
 XFS (loop5): Mounting V5 Filesystem e96086f0-a2f9-4424-a1d5-c75d53d823be
 XFS (loop5): Ending clean mount
 XFS (loop5): Quotacheck needed: Please wait.
 XFS (loop5): Quotacheck: Done.
 XFS (loop5): EXPERIMENTAL online scrub feature in use. Use at your own risk!
 INFO: task xfs_io:143725 blocked for more than 122 seconds.
       Not tainted 6.9.0-rc4+ #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:xfs_io          state:D stack:0     pid:143725 tgid:143725 ppid:117661 flags:0x00004006
 Call Trace:
  <TASK>
  __schedule+0x69c/0x17a0
  schedule+0x74/0x1b0
  io_schedule+0xc4/0x140
  folio_wait_bit_common+0x254/0x650
  shmem_undo_range+0x9d5/0xb40
  shmem_evict_inode+0x322/0x8f0
  evict+0x24e/0x560
  __dentry_kill+0x17d/0x4d0
  dput+0x263/0x430
  __fput+0x2fc/0xaa0
  task_work_run+0x132/0x210
  get_signal+0x1a8/0x1910
  arch_do_signal_or_restart+0x7b/0x2f0
  syscall_exit_to_user_mode+0x1c2/0x200
  do_syscall_64+0x72/0x170
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The shmem code is trying to drop all the folios attached to a shmem
file and gets stuck on a locked folio after a bnobt repair.  It looks
like the process has a signal pending, so I started looking for places
where we lock an xfile folio and then deal with a fatal signal.

I found a bug in xfarray_sort_scan via code inspection.  This function
is called to set up the scanning phase of a quicksort operation, which
may involve grabbing a locked xfile folio.  If we exit the function with
an error code, the caller does not call xfarray_sort_scan_done to put
the xfile folio.  If _sort_scan returns an error code while si->folio is
set, we leak the reference and never unlock the folio.

Therefore, change xfarray_sort to call _scan_done on exit.  This is safe
to call multiple times because it sets si->folio to NULL and ignores a
NULL si->folio.  Also change _sort_scan to use an intermediate variable
so that we never pollute si->folio with an errptr.

Fixes: 232ea05277 ("xfs: enable sorting of xfile-backed arrays")
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-27 15:55:52 +05:30
John Garry
b33874fb7f xfs: Stop using __maybe_unused in xfs_alloc.c
In both xfs_alloc_cur_finish() and xfs_alloc_ag_vextent_exact(), local
variable @afg is tagged as __maybe_unused. Otherwise an unused variable
warning would be generated for when building with W=1 and CONFIG_XFS_DEBUG
unset. In both cases, the variable is unused as it is only referenced in
an ASSERT() call, which is compiled out (in this config).

It is generally poor programming style to use __maybe_unused for
variables.

The ASSERT() call is to verify that agbno of the end of the extent is
within bounds for both functions. @afg is used as an intermediate variable
to find the AG length.

However xfs_verify_agbext() already exists to verify a valid extent range.
The arguments for calling xfs_verify_agbext() are already available, so use
that instead.

An advantage of using xfs_verify_agbext() is that it verifies that both the
start and the end of the extent are within the bounds of the AG and
catches overflows.
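
In other words, the assert can verify the whole extent directly; a sketch
assuming the usual xfs_alloc_arg fields:

	ASSERT(xfs_verify_agbext(args->pag, args->agbno, args->len));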

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-27 15:54:24 +05:30
John Garry
d7ba701da6 xfs: Clear W=1 warning in xfs_iwalk_run_callbacks()
For CONFIG_XFS_DEBUG unset, xfs_iwalk_run_callbacks() generates the
following warning for when building with W=1:

fs/xfs/xfs_iwalk.c: In function ‘xfs_iwalk_run_callbacks’:
fs/xfs/xfs_iwalk.c:354:42: error: variable ‘irec’ set but not used [-Werror=unused-but-set-variable]
  354 |         struct xfs_inobt_rec_incore     *irec;
      |                                          ^~~~
cc1: all warnings being treated as errors

Drop @irec, as it is only an intermediate variable.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-27 15:54:24 +05:30
Steven Rostedt (Google)
2c92ca849f tracing/treewide: Remove second parameter of __assign_str()
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.

This means that with:

  __string(field, mystring)

Which used to be assigned with __assign_str(field, mystring), no longer
needs the second parameter, as it is unused. With this, __assign_str()
will now only take a single parameter.

There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:

  git grep -l __assign_str | while read a ; do
      sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
      mv /tmp/test-file $a;
  done

I then searched for __assign_str() that did not end with ';' as those
were multi-line assignments that the sed script above would fail to catch.

Note, the same updates will need to be done for:

  __assign_str_len()
  __assign_rel_str()
  __assign_rel_str_len()

I tested this with both an allmodconfig and an allyesconfig (build only for both).

[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/

Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22 20:14:47 -04:00
Linus Torvalds
5ad8b6ad9a getting rid of bogus set_blocksize() uses, switching it
to struct file * and verifying that caller has device
 opened exclusively.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwkfQAKCRBZ7Krx/gZQ
 62C3AQDW5vuXNx2+KDPma5YStjFpPLC0xtSyAS5D3YANjtyRFgD/TOcCarq7rvBt
 KubxHVFsfW+eu6ASeaoMRB83w5OIzwk=
 =Liix
 -----END PGP SIGNATURE-----

Merge tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs blocksize updates from Al Viro:
 "This gets rid of bogus set_blocksize() uses, switches it over
  to be based on a 'struct file *' and verifies that the caller
  has the device opened exclusively"

* tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make set_blocksize() fail unless block device is opened exclusive
  set_blocksize(): switch to passing struct file *
  btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens
  swsusp: don't bother with setting block size
  zram: don't bother with reopening - just use O_EXCL for open
  swapon(2): open swap with O_EXCL
  swapon(2)/swapoff(2): don't bother with block size
  pktcdvd: sort set_blocksize() calls out
  bcache_register(): don't bother with set_blocksize()
2024-05-21 08:34:51 -07:00
Linus Torvalds
119d1b8a5d New code for 6.10:
* Introduce Parent Pointer extended attribute for inodes.
 
   * Online Repair
     - Implement atomic file content exchanges i.e. exchange ranges of bytes
       between two files atomically.
     - Create temporary files to repair file-based metadata. This uses atomic
       file content exchange facility to swap file fork mappings between the
       temporary file and the metadata inode.
 
     - Allow callers of directory/xattr code to set an explicit owner number to
       be written into the header fields of any new blocks that are created.
       This is required to avoid walking every block of the new structure and
       modify their ownership during online repair.
     - Repair
       - Extended attributes
       - Inode unlinked state
       - Directories
       - Symbolic links
       - AGI's unlinked inode list.
       - Parent pointers.
     - Move Orphan files to lost and found directory.
     - Fixes for Inode repair functionality.
     - Introduce a new sub-AG FITRIM implementation to reduce the duration for
       which the AGF lock is held.
     - Updates for the design documentation.
     - Use Parent Pointers to assist in checking directories, parent pointers,
       extended attributes, and link counts.
 
   * Bring back delalloc support for realtime devices which have an extent size
     that is equal to filesystem's block size.
 
   * Improve performance of log incompat feature handling.
 
   * Fixes
     - Prevent userspace from reading invalid file data due to incorrect
       updating of the file size when performing a non-atomic clone operation.
     - Minor fixes to online repair.
     - Fix confusing return values from xfs_bmapi_write().
     - Fix an out of bounds access due to incorrect h_size during log recovery.
     - Defer upgrading the extent counters in xfs_reflink_end_cow_extent() until
       we know we are going to modify the extent mapping.
     - Remove racy access to if_bytes check in xfs_reflink_end_cow_extent().
     - Fix sparse warnings.
 
   * Cleanups
     - Hold inode locks on all files involved in a rename until the completion
       of the operation. This is in preparation for the parent pointers patchset
       where parent pointers are applied in a separate chained update from the
       actual directory update.
     - Compile out v4 support when disabled.
     - Cleanup xfs_extent_busy_clear().
     - Remove unused flags and fields from struct xfs_da_args.
     - Remove definitions of unused functions.
     - Improve extended attribute validation.
     - Add higher level directory operations helpers to remove duplication of
       code.
     - Cleanup quota (un)reservation interfaces.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZjZC0wAKCRAH7y4RirJu
 9HsCAPoCQvmPefDv56aMb5JEQNpv9dPz2Djj14hqLytQs5P/twD+LF5NhJgQNDUo
 Lwnb0tmkAhmG9Y4CCiN1FwSj1rq59gE=
 =2hXB
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Chandan Babu:
 "Online repair feature continues to be expanded. Also, we now support
  delayed allocation for realtime devices which have an extent size that
  is equal to filesystem's block size.

  New code:

   - Introduce Parent Pointer extended attribute for inodes

   - Bring back delalloc support for realtime devices which have an
     extent size that is equal to filesystem's block size

   - Improve performance of log incompat feature handling

  Online Repair:

   - Implement atomic file content exchanges i.e. exchange ranges of
     bytes between two files atomically

   - Create temporary files to repair file-based metadata. This uses
     atomic file content exchange facility to swap file fork mappings
     between the temporary file and the metadata inode

   - Allow callers of directory/xattr code to set an explicit owner
     number to be written into the header fields of any new blocks that
     are created. This is required to avoid walking every block of the
     new structure and modify their ownership during online repair

   - Repair more data structures:
       - Extended attributes
       - Inode unlinked state
       - Directories
       - Symbolic links
       - AGI's unlinked inode list
       - Parent pointers

   - Move Orphan files to lost and found directory

   - Fixes for Inode repair functionality

   - Introduce a new sub-AG FITRIM implementation to reduce the duration
     for which the AGF lock is held

   - Updates for the design documentation

   - Use Parent Pointers to assist in checking directories, parent
     pointers, extended attributes, and link counts

  Fixes:

   - Prevent userspace from reading invalid file data due to incorrect
     updating of the file size when performing a non-atomic clone operation

   - Minor fixes to online repair

   - Fix confusing return values from xfs_bmapi_write()

   - Fix an out of bounds access due to incorrect h_size during log
     recovery

   - Defer upgrading the extent counters in xfs_reflink_end_cow_extent()
     until we know we are going to modify the extent mapping

   - Remove racy access to if_bytes check in
     xfs_reflink_end_cow_extent()

   - Fix sparse warnings

  Cleanups:

   - Hold inode locks on all files involved in a rename until the
     completion of the operation. This is in preparation for the parent
     pointers patchset where parent pointers are applied in a separate
     chained update from the actual directory update

   - Compile out v4 support when disabled

   - Cleanup xfs_extent_busy_clear()

   - Remove unused flags and fields from struct xfs_da_args

   - Remove definitions of unused functions

   - Improve extended attribute validation

   - Add higher level directory operations helpers to remove duplication
     of code

   - Cleanup quota (un)reservation interfaces"

* tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (221 commits)
  xfs: simplify iext overflow checking and upgrade
  xfs: remove a racy if_bytes check in xfs_reflink_end_cow_extent
  xfs: upgrade the extent counters in xfs_reflink_end_cow_extent later
  xfs: xfs_quota_unreserve_blkres can't fail
  xfs: consolidate the xfs_quota_reserve_blkres definitions
  xfs: clean up buffer allocation in xlog_do_recovery_pass
  xfs: fix log recovery buffer allocation for the legacy h_size fixup
  xfs: widen flags argument to the xfs_iflags_* helpers
  xfs: minor cleanups of xfs_attr3_rmt_blocks
  xfs: create a helper to compute the blockcount of a max sized remote value
  xfs: turn XFS_ATTR3_RMT_BUF_SPACE into a function
  xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c
  xfs: do not allocate the entire delalloc extent in xfs_bmapi_write
  xfs: fix xfs_bmap_add_extent_delay_real for partial conversions
  xfs: remove the xfs_iext_peek_prev_extent call in xfs_bmapi_allocate
  xfs: pass the actual offset and len to allocate to xfs_bmapi_allocate
  xfs: don't open code XFS_FILBLKS_MIN in xfs_bmapi_write
  xfs: lift a xfs_valid_startblock into xfs_bmapi_allocate
  xfs: remove the unusued tmp_logflags variable in xfs_bmapi_allocate
  xfs: fix error returns from xfs_bmapi_write
  ...
2024-05-20 12:55:12 -07:00
Linus Torvalds
ff9a79307f Kbuild updates for v6.10
- Avoid 'constexpr', which is a keyword in C23
 
  - Allow 'dtbs_check' and 'dt_compatible_check' to run independently of
    'dt_binding_check'
 
  - Fix weak references to avoid GOT entries in position-independent
    code generation
 
  - Convert the last use of 'optional' property in arch/sh/Kconfig
 
  - Remove support for the 'optional' property in Kconfig
 
  - Remove support for Clang's ThinLTO caching, which does not work with
    the .incbin directive
 
  - Change the semantics of $(src) so it always points to the source
    directory, which fixes Makefile inconsistencies between upstream and
    downstream
 
  - Fix 'make tar-pkg' for RISC-V to produce a consistent package
 
  - Provide reasonable default coverage for objtool, sanitizers, and
    profilers
 
  - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
 
  - Remove the last use of tristate choice in drivers/rapidio/Kconfig
 
  - Various cleanups and fixes in Kconfig
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmZFlGcVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8voQALC8NtFpduWVfLRj2Qg6Ll/xf1vX
 2igcTJEOFHkeqXLGoT8dTDKLEipUBUvKyguPq66CGwVTe2g6zy/nUSXeVtFrUsIa
 msLTi8FqhqUo5lodNvGMRf8qqmuqcvnXoiQwIocF92jtsFy14bhiFY+n4HfcFNjj
 GOKwqBZYQUwY/VVb090efc7RfS9c7uwABJSBelSoxg3AGZriwjGy7Pw5aSKGgVYi
 inqL1eR6qwPP6z7CgQWM99soP+zwybFZmnQrsD9SniRBI4rtAat8Ih5jQFaSUFUQ
 lk2w0NQBRFN88/uR2IJ2GWuIlQ74WeJ+QnCqVuQ59tV5zw90wqSmLzngfPD057Dv
 JjNuhk0UyXVtpIg3lRtd4810ppNSTe33b9OM4O2H846W/crju5oDRNDHcflUXcwm
 Rmn5ho1rb5QVzDVejJbgwidnUInSgJ9PZcvXQ/RJVZPhpgsBzAY9pQexG1G3hviw
 y9UDrt6KP6bF9tHjmolmtdIes9Pj0c4dN6/Rdj4HS4hIQ/GDar0tnwvOvtfUctNL
 orJlBsA6GeMmDVXKkR0ytOCWRYqWWbyt8g70RVKQJfuHX7/hGyAQPaQ2/u4mQhC2
 aevYfbNJMj0VDfGz81HDBKFtkc5n+Ite8l157dHEl2LEabkOkRdNVcn7SNbOvZmd
 ZCSnZ31h7woGfNho
 =D5B/
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Avoid 'constexpr', which is a keyword in C23

 - Allow 'dtbs_check' and 'dt_compatible_check' to run independently of
   'dt_binding_check'

 - Fix weak references to avoid GOT entries in position-independent code
   generation

 - Convert the last use of 'optional' property in arch/sh/Kconfig

 - Remove support for the 'optional' property in Kconfig

 - Remove support for Clang's ThinLTO caching, which does not work with
   the .incbin directive

 - Change the semantics of $(src) so it always points to the source
   directory, which fixes Makefile inconsistencies between upstream and
   downstream

 - Fix 'make tar-pkg' for RISC-V to produce a consistent package

 - Provide reasonable default coverage for objtool, sanitizers, and
   profilers

 - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.

 - Remove the last use of tristate choice in drivers/rapidio/Kconfig

 - Various cleanups and fixes in Kconfig

* tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits)
  kconfig: use sym_get_choice_menu() in sym_check_prop()
  rapidio: remove choice for enumeration
  kconfig: lxdialog: remove initialization with A_NORMAL
  kconfig: m/nconf: merge two item_add_str() calls
  kconfig: m/nconf: remove dead code to display value of bool choice
  kconfig: m/nconf: remove dead code to display children of choice members
  kconfig: gconf: show checkbox for choice correctly
  kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal
  Makefile: remove redundant tool coverage variables
  kbuild: provide reasonable defaults for tool coverage
  modules: Drop the .export_symbol section from the final modules
  kconfig: use menu_list_for_each_sym() in sym_check_choice_deps()
  kconfig: use sym_get_choice_menu() in conf_write_defconfig()
  kconfig: add sym_get_choice_menu() helper
  kconfig: turn defaults and additional prompt for choice members into error
  kconfig: turn missing prompt for choice members into error
  kconfig: turn conf_choice() into void function
  kconfig: use linked list in sym_set_changed()
  kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED
  kconfig: gconf: remove debug code
  ...
2024-05-18 12:39:20 -07:00
Linus Torvalds
1b0aabcc9a vfs-6.10.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
 orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
 0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
 =SVsU
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
     means we now have six free FMODE_* bits in total (but bit #6
     already got used for FMODE_WRITE_RESTRICTED)

   - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)

   - Add fd_raw cleanup class so we can make use of automatic cleanup
     provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well

   - Optimize seq_puts()

   - Simplify __seq_puts()

   - Add new anon_inode_getfile_fmode() api to allow specifying f_mode
     instead of open-coding it in multiple places

   - Annotate struct file_handle with __counted_by() and use
     struct_size()

   - Warn in get_file() when f_count resurrection from zero is
     attempted (epoll/drm discussion)

   - Folio-sophize aio

   - Export the subvolume id in statx() for both btrfs and bcachefs

   - Relax linkat(AT_EMPTY_PATH) requirements

   - Add an F_DUPFD_QUERY fcntl() that allows comparing two file
     descriptors for dup*() equality, replacing kcmp() (see the hedged
     sketch after this list)
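
   As a quick illustration of F_DUPFD_QUERY, the hedged sketch below
   compares descriptors from userspace. It is not part of the pull
   request itself; the fallback value for F_DUPFD_QUERY is an assumption
   (F_LINUX_SPECIFIC_BASE + 3) that should be verified against the v6.10
   uapi fcntl.h, and error handling is trimmed.

       #define _GNU_SOURCE
       #include <fcntl.h>
       #include <stdio.h>
       #include <unistd.h>

       #ifndef F_DUPFD_QUERY
       #define F_DUPFD_QUERY 1027   /* assumed: F_LINUX_SPECIFIC_BASE + 3 */
       #endif

       int main(void)
       {
               int fd1 = open("/tmp/dupfd-demo", O_CREAT | O_RDWR, 0600);
               int fd2 = dup(fd1);                        /* same open file description */
               int fd3 = open("/tmp/dupfd-demo", O_RDWR); /* a separate description */

               /* returns 1 if both fds share an open file description, 0 if not */
               printf("fd1 vs fd2: %d\n", fcntl(fd1, F_DUPFD_QUERY, fd2));
               printf("fd1 vs fd3: %d\n", fcntl(fd1, F_DUPFD_QUERY, fd3));

               close(fd3);
               close(fd2);
               close(fd1);
               return 0;
       }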

  Cleanups:

   - Compile out swapfile inode checks when swap isn't enabled

   - Use (1 << n) notation for FMODE_* bitshifts for clarity

   - Remove redundant variable assignment in fs/direct-io

   - Cleanup uses of strncpy in orangefs

   - Speed up and cleanup writeback

   - Move fsparam_string_empty() helper into header since it's currently
     open-coded in multiple places

   - Add kernel-doc comments to proc_create_net_data_write()

   - Don't needlessly read dentry->d_flags twice

  Fixes:

   - Fix out-of-range warning in nilfs2

   - Fix ecryptfs overflow due to wrong encryption packet size
     calculation

   - Fix overly long line in xfs file_operations (follow-up to FMODE_*
     cleanup)

   - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
     to FMODE_* cleanup)

   - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
     cleanup)

   - Fix stable offset api to prevent endless loops

   - Fix afs file server rotations

   - Prevent xattr node from overflowing the eraseblock in jffs2

   - Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
     operation instead of .open() operation since this caused userspace
     regressions"

* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  afs: Fix fileserver rotation getting stuck
  selftests: add F_DUPDFD_QUERY selftests
  fcntl: add F_DUPFD_QUERY fcntl()
  file: add fd_raw cleanup class
  fs: WARN when f_count resurrection is attempted
  seq_file: Simplify __seq_puts()
  seq_file: Optimize seq_puts()
  proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
  fs: Create anon_inode_getfile_fmode()
  xfs: don't call xfs_file_open from xfs_dir_open
  xfs: drop fop_flags for directories
  xfs: fix overly long line in the file_operations
  shmem: Fix shmem_rename2()
  libfs: Add simple_offset_rename() API
  libfs: Fix simple_offset_rename_exchange()
  jffs2: prevent xattr node from overflowing the eraseblock
  vfs, swap: compile out IS_SWAPFILE() on swapless configs
  vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
  fs/direct-io: remove redundant assignment to variable retval
  fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
  ...
2024-05-13 11:40:06 -07:00
Masahiro Yamada
b1992c3772 kbuild: use $(src) instead of $(srctree)/$(src) for source directory
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for
checked-in source files. It is merely a convention without any functional
difference. In fact, $(obj) and $(src) are exactly the same, as defined
in scripts/Makefile.build:

    src := $(obj)

When the kernel is built in a separate output directory, $(src) does
not accurately reflect the source directory location. While Kbuild
resolves this discrepancy by specifying VPATH=$(srctree) to search for
source files, it does not cover all cases. For example, when adding a
header search path for local headers, -I$(srctree)/$(src) is typically
passed to the compiler.

This introduces an inconsistency between upstream and downstream Makefiles
because the latter use $(src) instead of $(srctree)/$(src).

To address this inconsistency, this commit changes the semantics of
$(src) so that it always points to the directory in the source tree.

Going forward, the variables used in Makefiles will have the following
meanings:

  $(obj)     - directory in the object tree
  $(src)     - directory in the source tree  (changed by this commit)
  $(objtree) - the top of the kernel object tree
  $(srctree) - the top of the kernel source tree

Consequently, uses of $(srctree)/$(src) in upstream Makefiles need to be
replaced with $(src).

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-05-10 04:34:52 +09:00