linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-09-03 01:28:04 +00:00

Author	SHA1	Message	Date
Darrick J. Wong	7f8a44f372	xfs: fix sb_spino_align checks for large fsblock sizes For a sparse inodes filesystem, mkfs.xfs computes the values of sb_spino_align and sb_inoalignmt with the following code: int cluster_size = XFS_INODE_BIG_CLUSTER_SIZE; if (cfg->sb_feat.crcs_enabled) cluster_size = cfg->inodesize / XFS_DINODE_MIN_SIZE; sbp->sb_spino_align = cluster_size >> cfg->blocklog; sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK cfg->inodesize >> cfg->blocklog; On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results in cluster_size = 8192 * (512 / 256) = 16384. As a result, sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately, this trips the new sb_spino_align check that was just added to xfs_validate_sb_common, and the mkfs fails: # mkfs.xfs -f -b size=64k, /dev/sda meta-data=/dev/sda isize=512 agcount=4, agsize=81136 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 = exchange=0 metadir=0 data = bsize=65536 blocks=324544, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=65536 ascii-ci=0, ftype=1, parent=0 log =internal log bsize=65536 blocks=5006, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=65536 blocks=0, rtextents=0 = rgcount=0 rgsize=0 extents Discarding blocks...Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: Releasing dirty buffer to free list! found dirty buffer (bulk) on free list! Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: writing AG headers failed, err=22 Prior to commit `59e43f5479` this all worked fine, even if "sparse" inodes are somewhat meaningless when everything fits in a single fsblock. Adjust the checks to handle existing filesystems. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `59e43f5479` ("xfs: sb_spino_align is not verified") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:12 -08:00
Darrick J. Wong	ca378189fd	xfs: convert quotacheck to attach dquot buffers Now that we've converted the dquot logging machinery to attach the dquot buffer to the li_buf pointer so that the AIL dqflush doesn't have to allocate or read buffers in a reclaim path, do the same for the quotacheck code so that the reclaim shrinker dqflush call doesn't have to do that either. Cc: <stable@vger.kernel.org> # v6.12 Fixes: `903edea6c5` ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:12 -08:00
Darrick J. Wong	acc8f8628c	xfs: attach dquot buffer to dquot log item buffer Ever since 6.12-rc1, I've observed a pile of warnings from the kernel when running fstests with quotas enabled: WARNING: CPU: 1 PID: 458580 at mm/page_alloc.c:4221 __alloc_pages_noprof+0xc9c/0xf18 CPU: 1 UID: 0 PID: 458580 Comm: xfsaild/sda3 Tainted: G W 6.12.0-rc6-djwa #rc6 6ee3e0e531f6457e2d26aa008a3b65ff184b377c <snip> Call trace: __alloc_pages_noprof+0xc9c/0xf18 alloc_pages_mpol_noprof+0x94/0x240 alloc_pages_noprof+0x68/0xf8 new_slab+0x3e0/0x568 ___slab_alloc+0x5a0/0xb88 __slab_alloc.constprop.0+0x7c/0xf8 __kmalloc_noprof+0x404/0x4d0 xfs_buf_get_map+0x594/0xde0 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_buf_read_map+0x64/0x2e0 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_trans_read_buf_map+0x1dc/0x518 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_qm_dqflush+0xac/0x468 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_qm_dquot_logitem_push+0xe4/0x148 [xfs 384cb02810558b4c490343c164e9407332118f88] xfsaild+0x3f4/0xde8 [xfs 384cb02810558b4c490343c164e9407332118f88] kthread+0x110/0x128 ret_from_fork+0x10/0x20 ---[ end trace 0000000000000000 ]--- This corresponds to the line: WARN_ON_ONCE(current->flags & PF_MEMALLOC); within the NOFAIL checks. What's happening here is that the XFS AIL is trying to write a disk quota update back into the filesystem, but for that it needs to read the ondisk buffer for the dquot. The buffer is not in memory anymore, probably because it was evicted. Regardless, the buffer cache tries to allocate a new buffer, but those allocations are NOFAIL. The AIL thread has marked itself PF_MEMALLOC (aka noreclaim) since commit `43ff2122e6` ("xfs: on-stack delayed write buffer lists") presumably because reclaim can push on XFS to push on the AIL. An easy way to fix this probably would have been to drop the NOFAIL flag from the xfs_buf allocation and open code a retry loop, but then there's still the problem that for bs>ps filesystems, the buffer itself could require up to 64k worth of pages. Inode items had similar behavior (multi-page cluster buffers that we don't want to allocate in the AIL) which we solved by making transaction precommit attach the inode cluster buffers to the dirty log item. Let's solve the dquot problem in the same way. So: Make a real precommit handler to read the dquot buffer and attach it to the log item; pass it to dqflush in the push method; and have the iodone function detach the buffer once we've flushed everything. Add a state flag to the log item to track when a thread has entered the precommit -> push mechanism to skip the detaching if it turns out that the dquot is very busy, as we don't hold the dquot lock between log item commit and AIL push). Reading and attaching the dquot buffer in the precommit hook is inspired by the work done for inode cluster buffers some time ago. Cc: <stable@vger.kernel.org> # v6.12 Fixes: `903edea6c5` ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	ec88b41b93	xfs: clean up log item accesses in xfs_qm_dqflush{,_done} Clean up these functions a little bit before we move on to the real modifications, and make the variable naming consistent for dquot log items. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	a40fe30868	xfs: separate dquot buffer reads from xfs_dqflush The first step towards holding the dquot buffer in the li_buf instead of reading it in the AIL is to separate the part that reads the buffer from the actual flush code. There should be no functional changes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	07137e925f	xfs: don't lose solo dquot update transactions Quota counter updates are tracked via incore objects which hang off the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_dquot_deltas just prior to commiting the log items to the CIL. However, updating the incore deltas do not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure quota counter update will be silently discarded if there are no other dirty log items attached to the transaction. This is currently not the case anywhere in the filesystem because quota updates always dirty at least one other metadata item, but a subsequent bug fix will add dquot log item precommits, so we actually need a dirty dquot log item prior to xfs_trans_run_precommits. Also let's not leave a logic bomb. Cc: <stable@vger.kernel.org> # v2.6.35 Fixes: `0924378a68` ("xfs: split out iclog writing from xfs_trans_commit()") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	3762113b59	xfs: don't lose solo superblock counter update transactions Superblock counter updates are tracked via per-transaction counters in the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_sb_deltas just prior to commiting the log items to the CIL. However, updating the per-transaction counter deltas do not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb counter update will be silently discarded if there are no other dirty log items attached to the transaction. This is currently not the case anywhere in the filesystem because sb counter updates always dirty at least one other metadata item, but let's not leave a logic bomb. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	a004afdc62	xfs: avoid nested calls to __xfs_trans_commit Currently, __xfs_trans_commit calls xfs_defer_finish_noroll, which calls __xfs_trans_commit again on the same transaction. In other words, there's a nested function call (albeit with slightly different arguments) that has caused minor amounts of confusion in the past. There's no reason to keep this around, since there's only one place where we actually want the xfs_defer_finish_noroll, and that is in the top level xfs_trans_commit call. This also reduces stack usage a little bit. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	44d9b07e52	xfs: only run precommits once per transaction object Committing a transaction tx0 with a defer ops chain of (A, B, C) creates a chain of transactions that looks like this: tx0 -> txA -> txB -> txC Prior to commit `cb04211748`, __xfs_trans_commit would run precommits on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C]. Unfortunately, after the finish_noroll loop we forgot to run precommits on txC. That was fixed by adding the second precommit call. Unfortunately, none of us remembered that xfs_defer_finish_noroll calls __xfs_trans_commit a second time to commit tx0 before finishing work A in txA and committing that. In other words, we run precommits twice on tx0: xfs_trans_commit(tx0) __xfs_trans_commit(tx0, false) xfs_trans_run_precommits(tx0) xfs_defer_finish_noroll(tx0) xfs_trans_roll(tx0) txA = xfs_trans_dup(tx0) __xfs_trans_commit(tx0, true) xfs_trans_run_precommits(tx0) This currently isn't an issue because the inode item precommit is idempotent; the iunlink item precommit deletes itself so it can't be called again; and the buffer/dquot item precommits only check the incore objects for corruption. However, it doesn't make sense to run precommits twice. Fix this situation by only running precommits after finish_noroll. Cc: <stable@vger.kernel.org> # v6.4 Fixes: `cb04211748` ("xfs: defered work could create precommits") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00
Darrick J. Wong	53b001a21c	xfs: unlock inodes when erroring out of xfs_trans_alloc_dir Debugging a filesystem patch with generic/475 caused the system to hang after observing the following sequences in dmesg: XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5 XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5 XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5 XFS (dm-0): log I/O error -5 XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem. XFS (dm-0): Please unmount the filesystem and rectify the problem(s) XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c. Caller xfs_trans_dqresv+0x236/0x440 [xfs] XFS (dm-0): Corruption detected. Unmount and run xfs_repair XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7 The system is stuck in unmount trying to lock a couple of inodes so that they can be purged. The dquot corruption notice above is a clue to what happened -- a link() call tried to set up a transaction to link a child into a directory. Quota reservation for the transaction failed after IO errors shut down the filesystem, but then we forgot to unlock the inodes on our way out. Fix that. Cc: <stable@vger.kernel.org> # v6.10 Fixes: `bd5562111d` ("xfs: Hold inode locks in xfs_trans_alloc_dir") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00
Darrick J. Wong	ffc3ea4f3c	xfs: fix scrub tracepoints when inode-rooted btrees are involved Fix a minor mistakes in the scrub tracepoints that can manifest when inode-rooted btrees are enabled. The existing code worked fine for bmap btrees, but we should tighten the code up to be less sloppy. Cc: <stable@vger.kernel.org> # v5.7 Fixes: `92219c292a` ("xfs: convert btree cursor inode-private member names") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00
Darrick J. Wong	6d7b4bc1c3	xfs: update btree keys correctly when _insrec splits an inode root block In commit `2c813ad66a`, I partially fixed a bug wherein xfs_btree_insrec would erroneously try to update the parent's key for a block that had been split if we decided to insert the new record into the new block. The solution was to detect this situation and update the in-core key value that we pass up to the caller so that the caller will (eventually) add the new block to the parent level of the tree with the correct key. However, I missed a subtlety about the way inode-rooted btrees work. If the full block was a maximally sized inode root block, we'll solve that fullness by moving the root block's records to a new block, resizing the root block, and updating the root to point to the new block. We don't pass a pointer to the new block to the caller because that work has already been done. The new record will /always/ land in the new block, so in this case we need to use xfs_btree_update_keys to update the keys. This bug can theoretically manifest itself in the very rare case that we split a bmbt root block and the new record lands in the very first slot of the new block, though I've never managed to trigger it in practice. However, it is very easy to reproduce by running generic/522 with the realtime rmapbt patchset if rtinherit=1. Cc: <stable@vger.kernel.org> # v4.8 Fixes: `2c813ad66a` ("xfs: support btrees with overlapping intervals for keys") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00
Darrick J. Wong	23bee6f390	xfs: fix error bailout in xfs_rtginode_create smatch reported that we screwed up the error cleanup in this function. Fix it. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `ae897e0bed` ("xfs: support creating per-RTG files in growfs") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00
Darrick J. Wong	af9f02457f	xfs: fix null bno_hint handling in xfs_rtallocate_rtg xfs_bmap_rtalloc initializes the bno_hint variable to NULLRTBLOCK (aka NULLFSBLOCK). If the allocation request is for a file range that's adjacent to an existing mapping, it will then change bno_hint to the blkno hint in the bmalloca structure. In other words, bno_hint is either a rt block number, or it's all 1s. Unfortunately, commit `ec12f97f1b` didn't take the NULLRTBLOCK state into account, which means that it tries to translate that into a realtime extent number. We then end up with an obnoxiously high rtx number and pointlessly feed that to the near allocator. This often fails and falls back to the by-size allocator. Seeing as we had no locality hint anyway, this is a waste of time. Fix the code to detect a lack of bno_hint correctly. This was detected by running xfs/009 with metadir enabled and a 28k rt extent size. Cc: <stable@vger.kernel.org> # v6.12 Fixes: `ec12f97f1b` ("xfs: make the rtalloc start hint a xfs_rtblock_t") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:09 -08:00
Darrick J. Wong	dc5a052739	xfs: mark metadir repair tempfiles with IRECOVERY Once in a long while, xfs/566 and xfs/801 report directory corruption in one of the metadata subdirectories while it's forcibly rebuilding all filesystem metadata. I observed the following sequence of events: 1. Initiate a repair of the parent pointers for the /quota/user file. This is the secret file containing user quota data. 2. The pptr repair thread creates a temporary file and begins staging parent pointers in the ondisk metadata in preparation for an exchange-range to commit the new pptr data. 3. At the same time, initiate a repair of the /quota directory itself. 4. The dir repair thread finds the temporary file from (2), scans it for parent pointers, and stages a dirent in its own temporary dir in preparation to commit the fixed directory. 5. The parent pointer repair completes and frees the temporary file. 6. The dir repair commits the new directory and scans it again. It finds the dirent that points to the old temporary file in (2) and marks the directory corrupt. Oops! Repair code must never scan the temporary files that other repair functions create to stage new metadata. They're not supposed to do that, but the predicate function xrep_is_tempfile is incorrect because it assumes that any XFS_DIFLAG2_METADATA file cannot ever be a temporary file, but xrep_tempfile_adjust_directory_tree creates exactly that. Fix this by setting the IRECOVERY flag on temporary metadata directory inodes and using that to correct the predicate. Repair code is supposed to erase all the data in temporary files before releasing them, so it's ok if a thread scans the temporary file after we drop IRECOVERY. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `bb6cdd5529` ("xfs: hide metadata inodes from everyone because they are special") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:09 -08:00
Darrick J. Wong	6f4669708a	xfs: set XFS_SICK_INO_SYMLINK_ZAPPED explicitly when zapping a symlink If we need to reset a symlink target to the "durr it's busted" string, then we clear the zapped flag as well. However, this should be using the provided helper so that we don't set the zapped state on an otherwise ok symlink. Cc: <stable@vger.kernel.org> # v6.10 Fixes: `2651923d8d` ("xfs: online repair of symbolic links") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:09 -08:00
Darrick J. Wong	aa7bfb537e	xfs: separate healthy clearing mask during repair In commit `d9041681dd` we introduced some XFS_SICK_*ZAPPED flags so that the inode record repair code could clean up a damaged inode record enough to iget the inode but still be able to remember that the higher level repair code needs to be called. As part of that, we introduced a xchk_mark_healthy_if_clean helper that is supposed to cause the ZAPPED state to be removed if that higher level metadata actually checks out. This was done by setting additional bits in sick_mask hoping that xchk_update_health will clear all those bits after a healthy scrub. Unfortunately, that's not quite what sick_mask means -- bits in that mask are indeed cleared if the metadata is healthy, but they're set if the metadata is NOT healthy. fsck is only intended to set the ZAPPED bits explicitly. If something else sets the CORRUPT/XCORRUPT state after the xchk_mark_healthy_if_clean call, we end up marking the metadata zapped. This can happen if the following sequence happens: 1. Scrub runs, discovers that the metadata is fine but could be optimized and calls xchk_mark_healthy_if_clean on a ZAPPED flag. That causes the ZAPPED flag to be set in sick_mask because the metadata is not CORRUPT or XCORRUPT. 2. Repair runs to optimize the metadata. 3. Some other metadata used for cross-referencing in (1) becomes corrupt. 4. Post-repair scrub runs, but this time it sets CORRUPT or XCORRUPT due to the events in (3). 5. Now the xchk_health_update sets the ZAPPED flag on the metadata we just repaired. This is not the correct state. Fix this by moving the "if healthy" mask to a separate field, and only ever using it to clear the sick state. Cc: <stable@vger.kernel.org> # v6.8 Fixes: `d9041681dd` ("xfs: set inode sick state flags when we zap either ondisk fork") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:09 -08:00
Darrick J. Wong	7ce31f20a0	xfs: don't drop errno values when we fail to ficlone the entire range Way back when we first implemented FICLONE for XFS, life was simple -- either the the entire remapping completed, or something happened and we had to return an errno explaining what happened. Neither of those ioctls support returning partial results, so it's all or nothing. Then things got complicated when copy_file_range came along, because it actually can return the number of bytes copied, so commit `3f68c1f562` tried to make it so that we could return a partial result if the REMAP_FILE_CAN_SHORTEN flag is set. This is also how FIDEDUPERANGE can indicate that the kernel performed a partial deduplication. Unfortunately, the logic is wrong if an error stops the remapping and CAN_SHORTEN is not set. Because those callers cannot return partial results, it is an error for ->remap_file_range to return a positive quantity that is less than the @len passed in. Implementations really should be returning a negative errno in this case, because that's what btrfs (which introduced FICLONE{,RANGE}) did. Therefore, ->remap_range implementations cannot silently drop an errno that they might have when the number of bytes remapped is less than the number of bytes requested and CAN_SHORTEN is not set. Found by running generic/562 on a 64k fsblock filesystem and wondering why it reported corrupt files. Cc: <stable@vger.kernel.org> # v4.20 Fixes: `3fc9f5e409` ("xfs: remove xfs_reflink_remap_range") Really-Fixes: `3f68c1f562` ("xfs: support returning partial reflink results") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:09 -08:00
Darrick J. Wong	bd27c7bcdc	xfs: return a 64-bit block count from xfs_btree_count_blocks With the nrext64 feature enabled, it's possible for a data fork to have 2^48 extent mappings. Even with a 64k fsblock size, that maps out to a bmbt containing more than 2^32 blocks. Therefore, this predicate must return a u64 count to avoid an integer wraparound that will cause scrub to do the wrong thing. It's unlikely that any such filesystem currently exists, because the incore bmbt would consume more than 64GB of kernel memory on its own, and so far nobody except me has driven a filesystem that far, judging from the lack of complaints. Cc: <stable@vger.kernel.org> # v5.19 Fixes: `df9ad5cc7a` ("xfs: Introduce macros to represent new maximum extent counts for data/attr forks") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:09 -08:00
Darrick J. Wong	e1d8602b6c	xfs: keep quota directory inode loaded In the same vein as the previous patch, there's no point in the metapath scrub setup function doing a lookup on the quota metadir just so it can validate that lookups work correctly. Instead, retain the quota directory inode in memory for the lifetime of the mount so that we can check this meaningfully. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `128a055291` ("xfs: scrub quota file metapaths") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:08 -08:00
Darrick J. Wong	9b72800103	xfs: metapath scrubber should use the already loaded inodes Don't waste time in xchk_setup_metapath_dqinode doing a second lookup of the quota inodes, just grab them from the quotainfo structure. The whole point of this scrubber is to make sure that the dirents exist, so it's completely silly to do lookups. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `128a055291` ("xfs: scrub quota file metapaths") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:08 -08:00
Darrick J. Wong	a440a28ddb	xfs: fix off-by-one error in fsmap's end_daddr usage In commit `ca6448aed4`, we created an "end_daddr" variable to fix fsmap reporting when the end of the range requested falls in the middle of an unknown (aka free on the rmapbt) region. Unfortunately, I didn't notice that the the code sets end_daddr to the last sector of the device but then uses that quantity to compute the length of the synthesized mapping. Zizhi Wo later observed that when end_daddr isn't set, we still don't report the last fsblock on a device because in that case (aka when info->last is true), the info->high mapping that we pass to xfs_getfsmap_group_helper has a startblock that points to the last fsblock. This is also wrong because the code uses startblock to compute the length of the synthesized mapping. Fix the second problem by setting end_daddr unconditionally, and fix the first problem by setting start_daddr to one past the end of the range to query. Cc: <stable@vger.kernel.org> # v6.11 Fixes: `ca6448aed4` ("xfs: Fix missing interval for missing_owner in xfs fsmap") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reported-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:08 -08:00
Josef Bacik	5121711eb8	fs: enable pre-content events on supported file systems Now that all the code has been added for pre-content events, and the various file systems that need the page fault hooks for fsnotify have been updated, add SB_I_ALLOW_HSM to the supported file systems. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/46960dcb2725fa0317895ed66a8409ba1c306a82.1731684329.git.josef@toxicpanda.com	2024-12-11 17:28:41 +01:00
Josef Bacik	7f4796a465	xfs: add pre-content fsnotify hook for DAX faults xfs has it's own handling for DAX faults, so we need to add the pre-content fsnotify hook for this case. Other faults go through filemap_fault so they're handled properly there. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/9eccdf59a65b72f0a1a5e2f2b9bff8eda2d4f2d9.1731684329.git.josef@toxicpanda.com	2024-12-11 17:28:41 +01:00
Linus Torvalds	9141c5d389	Bug fixes for 6.13-rc2 * Use xchg() in xlog_cil_insert_pcp_aggregate() * Fix ABBA deadlock on a race between mount and log shutdown * Fix quota softlimit incoherency on delalloc * Fix sparse inode limits on runt AG * remove unknown compat feature checks in SB write valdation * Eliminate a lockdep false positive -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZ072iwAKCRBGdaER5Qtf prtCAX4kKOVnDzn2dX4YWpFaPAFvaWbiH0GIIVIiRuQLqzARya/lurNXjfanuotc 4oJ3JacBgJ1MWYiBX2j95AEHJaes/G3Nm+EsXeqefWWxxQvrCcQt5kdtw1fVY8kz 5NCMUtxjIQ== =FM4s -----END PGP SIGNATURE----- Merge tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Carlos Maiolino: - Use xchg() in xlog_cil_insert_pcp_aggregate() - Fix ABBA deadlock on a race between mount and log shutdown - Fix quota softlimit incoherency on delalloc - Fix sparse inode limits on runt AG - remove unknown compat feature checks in SB write valdation - Eliminate a lockdep false positive * tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delay xfs: Use xchg() in xlog_cil_insert_pcp_aggregate() xfs: prevent mount and log shutdown race xfs: delalloc and quota softlimit timers are incoherent xfs: fix sparse inode limits on runt AG xfs: remove unknown compat feature check in superblock write validation xfs: eliminate lockdep false positives in xfs_attr_shortform_list	2024-12-03 10:46:49 -08:00
Christoph Hellwig	cc2dba08cc	xfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delay xfs_bmap_add_extent_hole_delay works entirely on delalloc extents, for which xfs_bmap_same_rtgroup doesn't make sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-28 12:54:22 +01:00
Uros Bizjak	214093534f	xfs: Use xchg() in xlog_cil_insert_pcp_aggregate() try_cmpxchg() loop with constant "new" value can be substituted with just xchg() to atomically get and clear the location. The code on x86_64 improves from: 1e7f: 48 89 4c 24 10 mov %rcx,0x10(%rsp) 1e84: 48 03 14 c5 00 00 00 add 0x0(,%rax,8),%rdx 1e8b: 00 1e88: R_X86_64_32S __per_cpu_offset 1e8c: 8b 02 mov (%rdx),%eax 1e8e: 41 89 c5 mov %eax,%r13d 1e91: 31 c9 xor %ecx,%ecx 1e93: f0 0f b1 0a lock cmpxchg %ecx,(%rdx) 1e97: 75 f5 jne 1e8e <xlog_cil_commit+0x84e> 1e99: 48 8b 4c 24 10 mov 0x10(%rsp),%rcx 1e9e: 45 01 e9 add %r13d,%r9d to just: 1e7f: 48 03 14 cd 00 00 00 add 0x0(,%rcx,8),%rdx 1e86: 00 1e83: R_X86_64_32S __per_cpu_offset 1e87: 31 c9 xor %ecx,%ecx 1e89: 87 0a xchg %ecx,(%rdx) 1e8b: 41 01 cb add %ecx,%r11d No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <elder@riscstar.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-28 12:42:10 +01:00
Linus Torvalds	5c00ff742b	- The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro\|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq xyyenpibWgUoShU2wZ/Ha8FE5WDINwg= =JfWR -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro\|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...	2024-11-23 09:58:07 -08:00
Dave Chinner	a858109960	xfs: prevent mount and log shutdown race I recently had an fstests hang where there were two internal tasks stuck like so: [ 6559.010870] task:kworker/24:45 state:D stack:12152 pid:631308 tgid:631308 ppid:2 flags:0x00004000 [ 6559.016984] Workqueue: xfs-buf/dm-2 xfs_buf_ioend_work [ 6559.020349] Call Trace: [ 6559.022002] <TASK> [ 6559.023426] __schedule+0x650/0xb10 [ 6559.025734] schedule+0x6d/0xf0 [ 6559.027835] schedule_timeout+0x31/0x180 [ 6559.030582] wait_for_common+0x10c/0x1e0 [ 6559.033495] wait_for_completion+0x1d/0x30 [ 6559.036463] __flush_workqueue+0xeb/0x490 [ 6559.039479] ? mempool_alloc_slab+0x15/0x20 [ 6559.042537] xlog_cil_force_seq+0xa1/0x2f0 [ 6559.045498] ? bio_alloc_bioset+0x1d8/0x510 [ 6559.048578] ? submit_bio_noacct+0x2f2/0x380 [ 6559.051665] ? xlog_force_shutdown+0x3b/0x170 [ 6559.054819] xfs_log_force+0x77/0x230 [ 6559.057455] xlog_force_shutdown+0x3b/0x170 [ 6559.060507] xfs_do_force_shutdown+0xd4/0x200 [ 6559.063798] ? xfs_buf_rele+0x1bd/0x580 [ 6559.066541] xfs_buf_ioend_handle_error+0x163/0x2e0 [ 6559.070099] xfs_buf_ioend+0x61/0x200 [ 6559.072728] xfs_buf_ioend_work+0x15/0x20 [ 6559.075706] process_scheduled_works+0x1d4/0x400 [ 6559.078814] worker_thread+0x234/0x2e0 [ 6559.081300] kthread+0x147/0x170 [ 6559.083462] ? __pfx_worker_thread+0x10/0x10 [ 6559.086295] ? __pfx_kthread+0x10/0x10 [ 6559.088771] ret_from_fork+0x3e/0x50 [ 6559.091153] ? __pfx_kthread+0x10/0x10 [ 6559.093624] ret_from_fork_asm+0x1a/0x30 [ 6559.096227] </TASK> [ 6559.109304] Workqueue: xfs-cil/dm-2 xlog_cil_push_work [ 6559.112673] Call Trace: [ 6559.114333] <TASK> [ 6559.115760] __schedule+0x650/0xb10 [ 6559.118084] schedule+0x6d/0xf0 [ 6559.120175] schedule_timeout+0x31/0x180 [ 6559.122776] ? call_rcu+0xee/0x2f0 [ 6559.125034] __down_common+0xbe/0x1f0 [ 6559.127470] __down+0x1d/0x30 [ 6559.129458] down+0x48/0x50 [ 6559.131343] ? xfs_buf_item_unpin+0x8d/0x380 [ 6559.134213] xfs_buf_lock+0x3d/0xe0 [ 6559.136544] xfs_buf_item_unpin+0x8d/0x380 [ 6559.139253] xlog_cil_committed+0x287/0x520 [ 6559.142019] ? sched_clock+0x10/0x30 [ 6559.144384] ? sched_clock_cpu+0x10/0x190 [ 6559.147039] ? psi_group_change+0x48/0x310 [ 6559.149735] ? _raw_spin_unlock+0xe/0x30 [ 6559.152340] ? finish_task_switch+0xbc/0x310 [ 6559.155163] xlog_cil_process_committed+0x6d/0x90 [ 6559.158265] xlog_state_shutdown_callbacks+0x53/0x110 [ 6559.161564] ? xlog_cil_push_work+0xa70/0xaf0 [ 6559.164441] xlog_state_release_iclog+0xba/0x1b0 [ 6559.167483] xlog_cil_push_work+0xa70/0xaf0 [ 6559.170260] process_scheduled_works+0x1d4/0x400 [ 6559.173286] worker_thread+0x234/0x2e0 [ 6559.175779] kthread+0x147/0x170 [ 6559.177933] ? __pfx_worker_thread+0x10/0x10 [ 6559.180748] ? __pfx_kthread+0x10/0x10 [ 6559.183231] ret_from_fork+0x3e/0x50 [ 6559.185601] ? __pfx_kthread+0x10/0x10 [ 6559.188092] ret_from_fork_asm+0x1a/0x30 [ 6559.190692] </TASK> This is an ABBA deadlock where buffer IO completion is triggering a forced shutdown with the buffer lock held. It is waiting for the CIL to flush as part of the log force. The CIL flush is blocked doing shutdown processing of all it's objects, trying to unpin a buffer item. That requires taking the buffer lock.... For the CIL to be doing shutdown processing, the log must be marked with XLOG_IO_ERROR, but that doesn't happen until after the log force is issued. Hence for xfs_do_force_shutdown() to be forcing the log on a shut down log, we must have had a racing xlog_force_shutdown and xfs_force_shutdown like so: p0 p1 CIL push <holds buffer lock> xlog_force_shutdown xfs_log_force test_and_set_bit(XLOG_IO_ERROR) xlog_state_release_iclog() sees XLOG_IO_ERROR xlog_state_shutdown_callbacks .... xfs_buf_item_unpin xfs_buf_lock <blocks on buffer p1 holds> xfs_force_shutdown xfs_set_shutdown(mp) wins xlog_force_shutdown xfs_log_force <blocks on CIL push> xfs_set_shutdown(mp) fails <shuts down rest of log> The deadlock can be mitigated by avoiding the log force on the second pass through xlog_force_shutdown. Do this by adding another atomic state bit (XLOG_OP_PENDING_SHUTDOWN) that is set on entry to xlog_force_shutdown() but doesn't mark the log as shutdown. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-22 11:24:51 +01:00
Dave Chinner	c9c293240e	xfs: delalloc and quota softlimit timers are incoherent I've been seeing this failure on during xfs/050 recently: XFS: Assertion failed: dst->d_spc_timer != 0, file: fs/xfs/xfs_qm_syscalls.c, line: 435 .... Call Trace: <TASK> xfs_qm_scall_getquota_fill_qc+0x2a2/0x2b0 xfs_qm_scall_getquota_next+0x69/0xa0 xfs_fs_get_nextdqblk+0x62/0xf0 quota_getnextxquota+0xbf/0x320 do_quotactl+0x1a1/0x410 __se_sys_quotactl+0x126/0x310 __x64_sys_quotactl+0x21/0x30 x64_sys_call+0x2819/0x2ee0 do_syscall_64+0x68/0x130 entry_SYSCALL_64_after_hwframe+0x76/0x7e It turns out that the _qmount call has silently been failing to unmount and mount the filesystem, so when the softlimit is pushed past with a buffered write, it is not getting synced to disk before the next quota report is being run. Hence when the quota report runs, we have 300 blocks of delalloc data on an inode, with a soft limit of 200 blocks. XFS dquots account delalloc reservations as used space, hence the dquot is over the soft limit. However, we don't update the soft limit timers until we do a transactional update of the dquot. That is, the dquot sits over the soft limit without a softlimit timer being started until writeback occurs and the allocation modifies the dquot and we call xfs_qm_adjust_dqtimers() from xfs_trans_apply_dquot_deltas() in xfs_trans_commit() context. This isn't really a problem, except for this debug code in xfs_qm_scall_getquota_fill_qc(): if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) { if ((dst->d_space > dst->d_spc_softlimit) && (dst->d_spc_softlimit > 0)) { ASSERT(dst->d_spc_timer != 0); } .... It asserts taht if the used block count is over the soft limit, it must have a soft limit timer running. This is clearly not the case, because we haven't committed the delalloc space to disk yet. Hence the soft limit is only exceeded temporarily in memory (which isn't an issue) and we start the timer the moment we exceed the soft limit in journalled metadata. This debug was introduced in: commit 0d5ad8383061fbc0a9804fbb98218750000fe032 Author: Supriya Wickrematillake <sup@sgi.com> Date: Wed May 15 22:44:44 1996 +0000 initial checkin quotactl syscall functions. The very first quota support commit back in 1996. This is zero-day debug for Irix and, as it turns out, a zero-day bug in the debug code because the delalloc code on Irix didn't update the softlimit timers, either. IOWs, this issue has been in the code for 28 years. We obviously don't care if soft limit timers are a bit rubbery when we have delalloc reservations in memory. Production systems running quota reports have been exposed to this situation for 28 years and nobody has noticed it, so the debug code is essentially worthless at this point in time. We also have the on-disk dquot verifiers checking that the soft limit timer is running whenever the dquot is over the soft limit before we write it to disk and after we read it from disk. These aren't firing, so it is clear the issue is purely a temporary in-memory incoherency that I never would have noticed had the test not silently failed to unmount the filesystem. Hence I'm simply going to trash this runtime debug because it isn't useful in the slightest for catching quota bugs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-22 11:24:45 +01:00
Dave Chinner	1332533358	xfs: fix sparse inode limits on runt AG The runt AG at the end of a filesystem is almost always smaller than the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno limit for the inode chunk allocation, we do not take this into account. This means we can allocate a sparse inode chunk that overlaps beyond the end of an AG. When we go to allocate an inode from that sparse chunk, the irec fails validation because the agbno of the start of the irec is beyond valid limits for the runt AG. Prevent this from happening by taking into account the size of the runt AG when allocating inode chunks. Also convert the various checks for valid inode chunk agbnos to use xfs_ag_block_count() so that they will also catch such issues in the future. Fixes: `56d1115c9b` ("xfs: allocate sparse inode chunks on full chunk allocation failure") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-22 11:24:40 +01:00
Long Li	652f03db89	xfs: remove unknown compat feature check in superblock write validation Compat features are new features that older kernels can safely ignore, allowing read-write mounts without issues. The current sb write validation implementation returns -EFSCORRUPTED for unknown compat features, preventing filesystem write operations and contradicting the feature's definition. Additionally, if the mounted image is unclean, the log recovery may need to write to the superblock. Returning an error for unknown compat features during sb write validation can cause mount failures. Although XFS currently does not use compat feature flags, this issue affects current kernels' ability to mount images that may use compat feature flags in the future. Since superblock read validation already warns about unknown compat features, it's unnecessary to repeat this warning during write validation. Therefore, the relevant code in write validation is being removed. Fixes: `9e037cb797` ("xfs: check for unknown v5 feature bits in superblock write verifier") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-22 10:20:55 +01:00
Long Li	45f69d091b	xfs: eliminate lockdep false positives in xfs_attr_shortform_list xfs_attr_shortform_list() only called from a non-transactional context, it hold ilock before alloc memory and maybe trapped in memory reclaim. Since commit 204fae32d5f7("xfs: clean up remaining GFP_NOFS users") removed GFP_NOFS flag, lockdep warning will be report as [1]. Eliminate lockdep false positives by use __GFP_NOLOCKDEP to alloc memory in xfs_attr_shortform_list(). [1] https://lore.kernel.org/linux-xfs/000000000000e33add0616358204@google.com/ Reported-by: syzbot+4248e91deb3db78358a2@syzkaller.appspotmail.com Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-22 09:52:03 +01:00
Linus Torvalds	2edc8f933d	New xfs code for 6.13 * convert perag to use xarrays * create a new generic allocation group structure * Add metadata inode dir trees * Create in-core rt allocation groups * Shard the RT section into allocation groups * Persist quota options with the enw metadata dir tree * Enable quota for RT volumes * Enable metadata directory trees * Some bugfixes Signed-off-by: Carlos Maiolino <cem@kernel.org> -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZzyNwAAKCRBGdaER5Qtf psV3AYCncK/pVhFfKQSFbnCvgPSoAe7N9n0Wt5gmjy0Ill2mbQXVl9ADXkH6a015 gcGM3t4BgIHLJQndL/Uz+3a0L5IriEb9QkAfzmx8t3vjiRBzBe3WfywEx9Yt7kZe xbxEJ2HQpA== =3ngC -----END PGP SIGNATURE----- Merge tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Carlos Maiolino: "The bulk of this pull request is a major rework that Darrick and Christoph have been doing on XFS's real-time volume, coupled with a few features to support this rework. It does also includes some bug fixes. - convert perag to use xarrays - create a new generic allocation group structure - add metadata inode dir trees - create in-core rt allocation groups - shard the RT section into allocation groups - persist quota options with the enw metadata dir tree - enable quota for RT volumes - enable metadata directory trees - some bugfixes" * tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (146 commits) xfs: port ondisk structure checks from xfs/122 to the kernel xfs: separate space btree structures in xfs_ondisk.h xfs: convert struct typedefs in xfs_ondisk.h xfs: enable metadata directory feature xfs: enable realtime quota again xfs: update sb field checks when metadir is turned on xfs: reserve quota for realtime files correctly xfs: create quota preallocation watermarks for realtime quota xfs: report realtime block quota limits on realtime directories xfs: persist quota flags with metadir xfs: advertise realtime quota support in the xqm stat files xfs: scrub quota file metapaths xfs: fix chown with rt quota xfs: use metadir for quota inodes xfs: refactor xfs_qm_destroy_quotainos xfs: use rtgroup busy extent list for FITRIM xfs: implement busy extent tracking for rtgroups xfs: port the perag discard code to handle generic groups xfs: move the min and max group block numbers to xfs_group xfs: adjust min_block usage in xfs_verify_agbno ...	2024-11-21 09:20:07 -08:00
Linus Torvalds	0f25f0e4ef	the bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ 69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9 CN18glYmD3wRyU6Bwl4vGODouSJvDgA= =gVH3 -----END PGP SIGNATURE----- Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...	2024-11-18 12:24:06 -08:00
Linus Torvalds	241c7ed4d4	vfs-6.13.untorn.writes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcopwAKCRCRxhvAZXjc oitWAQD68PGFI6/ES9x+qGsDFEZBH08icuO+a9dyaZXyNRosDgD/ex2zHj6F7IzS Ghgb9jiqWQ8l2+PDYfisxa/0jiqCbAk= =DmXf -----END PGP SIGNATURE----- Merge tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs untorn write support from Christian Brauner: "An atomic write is a write issed with torn-write protection. This means for a power failure or any hardware failure all or none of the data from the write will be stored, never a mix of old and new data. This work is already supported for block devices. If a block device is opened with O_DIRECT and the block device supports atomic write, then FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block device. This contains the work to expand atomic write support to filesystems, specifically ext4 and XFS. Currently, only support for writing exactly one filesystem block atomically is added. Since it's now possible to have filesystem block size > page size for XFS, it's possible to write 4K+ blocks atomically on x86" * tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: drop an obsolete comment in iomap_dio_bio_iter ext4: Do not fallback to buffered-io for DIO atomic write ext4: Support setting FMODE_CAN_ATOMIC_WRITE ext4: Check for atomic writes support in write iter ext4: Add statx support for atomic writes xfs: Support setting FMODE_CAN_ATOMIC_WRITE xfs: Validate atomic writes xfs: Support atomic write for statx fs: iomap: Atomic write support fs: Export generic_atomic_write_valid() block: Add bdev atomic write limits helpers fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid() block/fs: Pass an iocb to generic_atomic_write_valid()	2024-11-18 11:30:09 -08:00
Linus Torvalds	6ac81fd55e	vfs-6.13.mgtime -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcScQAKCRCRxhvAZXjc oj+5AP4k822a77wc/3iPFk379naIvQ4dsrgemh0/Pb6ZvzvkFQEAi3vFCfzCDR2x SkJF/RwXXKZv6U31QXMRt2Qo6wfBuAc= =nVlm -----END PGP SIGNATURE----- Merge tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs multigrain timestamps from Christian Brauner: "This is another try at implementing multigrain timestamps. This time with significant help from the timekeeping maintainers to reduce the performance impact. Thomas provided a base branch that contains the required timekeeping interfaces for the VFS. It serves as the base for the multi-grain timestamp work: - Multigrain timestamps allow the kernel to use fine-grained timestamps when an inode's attributes is being actively observed via ->getattr(). With this support, it's possible for a file to get a fine-grained timestamp, and another modified after it to get a coarse-grained stamp that is earlier than the fine-grained time. If this happens then the files can appear to have been modified in reverse order, which breaks VFS ordering guarantees. To prevent this, a floor value is maintained for multigrain timestamps. Whenever a fine-grained timestamp is handed out, record it, and when later coarse-grained stamps are handed out, ensure they are not earlier than that value. If the coarse-grained timestamp is earlier than the fine-grained floor, return the floor value instead. The timekeeper changes add a static singleton atomic64_t into timekeeper.c that is used to keep track of the latest fine-grained time ever handed out. This is tracked as a monotonic ktime_t value to ensure that it isn't affected by clock jumps. Because it is updated at different times than the rest of the timekeeper object, the floor value is managed independently of the timekeeper via a cmpxchg() operation, and sits on its own cacheline. Two new public timekeeper interfaces are added: (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the coarse-grained clock and the floor time (2) ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries to swap it into the floor. A timespec64 is filled with the result. - The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This adds a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. This is where the earlier mentioned timkeeping interfaces help. A global monotonic atomic64_t value is kept that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems)" * tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: reduce pointer chasing in is_mgtime() test tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime timekeeping: Add percpu counter for tracking floor swap events timekeeping: Add interfaces for handling timestamps with a floor value fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps	2024-11-18 09:15:39 -08:00
Carlos Maiolino	5877dc24be	xfs: improve ondisk structure checks [v5.5 10/10] Reorganize xfs_ondisk.h to group the build checks by type, then add a bunch of missing checks that were in xfs/122 but not the build system. With this, we can get rid of xfs/122. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR ph8MAP4nWT1zCLAEDpFqVQmJO7wjlS02x5xFTHEzcms30ptsIwEA73ycS74tAmP5 CGsPxFVzhG8oMkWgrWXnXuOGfpQBPQo= =Dx2e -----END PGP SIGNATURE----- Merge tag 'better-ondisk-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: improve ondisk structure checks [v5.5 10/10] Reorganize xfs_ondisk.h to group the build checks by type, then add a bunch of missing checks that were in xfs/122 but not the build system. With this, we can get rid of xfs/122. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 11:03:15 +01:00
Carlos Maiolino	052378aef8	xfs: enable metadir [v5.5 09/10] Actually enable this very large feature, which adds metadata directory trees, allocation groups on the realtime volume, persistent quota options, and quota for realtime files. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR pvjKAP9A1v0QwkBo65ARf1gQb2AZYlebV8o6faSrpc+10vOPxgD/SV3GKhxQrtaT gOswA6f9QJN3E82IJ6fR2PZgSke+iQs= =xjZL -----END PGP SIGNATURE----- Merge tag 'metadir-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: enable metadir [v5.5 09/10] Actually enable this very large feature, which adds metadata directory trees, allocation groups on the realtime volume, persistent quota options, and quota for realtime files. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 11:02:55 +01:00
Carlos Maiolino	8ca118e17a	xfs: enable quota for realtime volumes [v5.5 08/10] At some point, I realized that I've refactored enough of the quota code in XFS that I should evaluate whether or not quota actually works on realtime volumes. It turns out that it nearly works: the only broken pieces are chown and delayed allocation, and reporting of project quotas in the statvfs output for projinherit+rtinherit directories. Fix these things and we can have realtime quotas again after 20 years. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR pkh4AQCtjI73mwU9rhzs2MO5nLNlpg9bgOxute+G4eqGCP02CwEAvg/LpT9yA6qk 1jM5x8C6xy03yIWTUc+DcMPqoCqJzwc= =2QMv -----END PGP SIGNATURE----- Merge tag 'realtime-quotas-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: enable quota for realtime volumes [v5.5 08/10] At some point, I realized that I've refactored enough of the quota code in XFS that I should evaluate whether or not quota actually works on realtime volumes. It turns out that it nearly works: the only broken pieces are chown and delayed allocation, and reporting of project quotas in the statvfs output for projinherit+rtinherit directories. Fix these things and we can have realtime quotas again after 20 years. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 11:02:25 +01:00
Carlos Maiolino	93c0f79edf	xfs: persist quota options with metadir [v5.5 07/10] Store the quota files in the metadata directory tree instead of the superblock. Since we're introducing a new incompat feature flag, let's also make the mount process bring up quotas in whatever state they were when the filesystem was last unmounted, instead of requiring sysadmins to remember that themselves. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR pod/AP9+FSHh3UF9ccpBQXMAG0NmhXYJBQ7grnAp4q89ko6fCAD+JyUsqi55zDw6 KnLSWZgNuO+aaCmMKXX4hEttA0//gAY= =0YFz -----END PGP SIGNATURE----- Merge tag 'metadir-quotas-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: persist quota options with metadir [v5.5 07/10] Store the quota files in the metadata directory tree instead of the superblock. Since we're introducing a new incompat feature flag, let's also make the mount process bring up quotas in whatever state they were when the filesystem was last unmounted, instead of requiring sysadmins to remember that themselves. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 11:01:12 +01:00
Carlos Maiolino	b939bcdca3	xfs: shard the realtime section [v5.5 06/10] Right now, the realtime section uses a single pair of metadata inodes to store the free space information. This presents a scalability problem since every thread trying to allocate or free rt extents have to lock these files. Solve this problem by sharding the realtime section into separate realtime allocation groups. While we're at it, define a superblock to be stamped into the start of the rt section. This enables utilities such as blkid to identify block devices containing realtime sections, and avoids the situation where anything written into block 0 of the realtime extent can be misinterpreted as file data. The best advantage for rtgroups will become evident later when we get to adding rmap and reflink to the realtime volume, since the geometry constraints are the same for rt groups and AGs. Hence we can reuse all that code directly. This is a very large patchset, but it catches us up with 20 years of technical debt that have accumulated. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR pqk4AQD31pupAefiZ39TFLz0oA1+Q2WUOoLxH/3Ovqin1GJNPgD9EG04/14fDmRU WDUSVfU8JKKJYEXXZnLeJLsvEUL2EQ0= =1/oh -----END PGP SIGNATURE----- Merge tag 'realtime-groups-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: shard the realtime section [v5.5 06/10] Right now, the realtime section uses a single pair of metadata inodes to store the free space information. This presents a scalability problem since every thread trying to allocate or free rt extents have to lock these files. Solve this problem by sharding the realtime section into separate realtime allocation groups. While we're at it, define a superblock to be stamped into the start of the rt section. This enables utilities such as blkid to identify block devices containing realtime sections, and avoids the situation where anything written into block 0 of the realtime extent can be misinterpreted as file data. The best advantage for rtgroups will become evident later when we get to adding rmap and reflink to the realtime volume, since the geometry constraints are the same for rt groups and AGs. Hence we can reuse all that code directly. This is a very large patchset, but it catches us up with 20 years of technical debt that have accumulated. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 11:00:42 +01:00
Carlos Maiolino	cb288c9fb2	xfs: preparation for realtime allocation groups [v5.5 05/10] Prepare for realtime groups by adding a few bug fixes and generic code that will be necessary. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR pgmeAP980iZAY49aL85dhr/QNl0G5YmuLOx6UW8DiAALCJxyxAEAnD5ilA7vQz40 80/cn+Y77fT3LptpAHTM5/FY+42IOgM= =AzwL -----END PGP SIGNATURE----- Merge tag 'rtgroups-prep-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: preparation for realtime allocation groups [v5.5 05/10] Prepare for realtime groups by adding a few bug fixes and generic code that will be necessary. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 11:00:16 +01:00
Carlos Maiolino	6b3582aca3	xfs: create incore rt allocation groups [v5.5 04/10] Add in-memory data structures for sharding the realtime volume into independent allocation groups. For existing filesystems, the entire rt volume is modelled as having a single large group, with (potentially) a number of rt extents exceeding 2^32 blocks, though these are not likely to exist because the codebase has been a bit broken for decades. The next series fills in the ondisk format and other supporting structures. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR ptrrAP41PURivFpHWXqg0sajsIUUezhuAdfg41fJqOop81qWDAEA2CsLf1z0c9/P CQS/tlQ3xdwZ0MYZMaw2o0EgSHYjwg8= =qVdv -----END PGP SIGNATURE----- Merge tag 'incore-rtgroups-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: create incore rt allocation groups [v5.5 04/10] Add in-memory data structures for sharding the realtime volume into independent allocation groups. For existing filesystems, the entire rt volume is modelled as having a single large group, with (potentially) a number of rt extents exceeding 2^32 blocks, though these are not likely to exist because the codebase has been a bit broken for decades. The next series fills in the ondisk format and other supporting structures. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 10:59:34 +01:00
Carlos Maiolino	d7a5b69bf0	xfs: metadata inode directory trees [v5.5 03/10] This series delivers a new feature -- metadata inode directories. This is a separate directory tree (rooted in the superblock) that contains only inodes that contain filesystem metadata. Different metadata objects can be looked up with regular paths. Start by creating xfs_imeta{dir,file}* functions to mediate access to the metadata directory tree. By the end of this mega series, all existing metadata inodes (rt+quota) will use this directory tree instead of the superblock. Next, define the metadir on-disk format, which consists of marking inodes with a new iflag that says they're metadata. This prevents bulkstat and friends from ever getting their hands on fs metadata files. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR ppNiAP9JgPFENv3P0UCJiCDqtWZlNWfz8a9ngAehm4AQMA0P9gD/XgVYNKZRY2Q3 P+3Sh1TVZ63dcEENlmEFE3myKjeJKAQ= =IJLc -----END PGP SIGNATURE----- Merge tag 'metadata-directory-tree-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: metadata inode directory trees [v5.5 03/10] This series delivers a new feature -- metadata inode directories. This is a separate directory tree (rooted in the superblock) that contains only inodes that contain filesystem metadata. Different metadata objects can be looked up with regular paths. Start by creating xfs_imeta{dir,file}* functions to mediate access to the metadata directory tree. By the end of this mega series, all existing metadata inodes (rt+quota) will use this directory tree instead of the superblock. Next, define the metadir on-disk format, which consists of marking inodes with a new iflag that says they're metadata. This prevents bulkstat and friends from ever getting their hands on fs metadata files. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 10:59:05 +01:00
Carlos Maiolino	28cf0d1a34	xfs: create a generic allocation group structure [v5.5 02/10] Soon we'll be sharding the realtime volume into separate allocation groups. These rt groups will /mostly/ behave the same as the ones on the data device, but since rt groups don't have quite the same set of struct fields as perags, let's hoist the parts that will be shared by both into a common xfs_group object. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQdAAKCRBKO3ySh0YR pnJDAQCh14f3aSCHslr4XeG1YkT7NZ44AtMjBqEi5GRN26wC0wD+PeaVSRgt+5wy nONT/nFqU5pApe1w2pq78SoJ+vLJxAk= =la16 -----END PGP SIGNATURE----- Merge tag 'generic-groups-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: create a generic allocation group structure [v5.5 02/10] Soon we'll be sharding the realtime volume into separate allocation groups. These rt groups will /mostly/ behave the same as the ones on the data device, but since rt groups don't have quite the same set of struct fields as perags, let's hoist the parts that will be shared by both into a common xfs_group object. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 10:58:27 +01:00
Carlos Maiolino	131ffe5e69	xfs: convert perag to use xarrays [v5.5 01/10] Convert the xfs_mount perag tree to use an xarray instead of a radix tree. There should be no functional changes here. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZyqQcwAKCRBKO3ySh0YR pmKsAQDko7YQ/dJZsjDvBJRUEnjrJqK9ukR9Ovc00ak0WXIDYQD/cdm8xzkixHn2 yfwsuFxIm6k0PPzb9koSCDT/CQS6QAA= =WGih -----END PGP SIGNATURE----- Merge tag 'perag-xarray-6.13_2024-11-05' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: convert perag to use xarrays [v5.5 01/10] Convert the xfs_mount perag tree to use an xarray instead of a radix tree. There should be no functional changes here. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-12 10:57:32 +01:00
Kairui Song	da0c02516c	mm/list_lru: simplify the list_lru walk callback function Now isolation no longer takes the list_lru global node lock, only use the per-cgroup lock instead. And this lock is inside the list_lru_one being walked, no longer needed to pass the lock explicitly. Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-11 17:22:26 -08:00
Kairui Song	fb56fdf8b9	mm/list_lru: split the lock to per-cgroup scope Currently, every list_lru has a per-node lock that protects adding, deletion, isolation, and reparenting of all list_lru_one instances belonging to this list_lru on this node. This lock contention is heavy when multiple cgroups modify the same list_lru. This lock can be split into per-cgroup scope to reduce contention. To achieve this, we need a stable list_lru_one for every cgroup. This commit adds a lock to each list_lru_one and introduced a helper function lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg. Then reworked the reparenting process. Reparenting will switch the list_lru_one instances one by one. By locking each instance and marking it dead using the nr_items counter, reparenting ensures that all items in the corresponding cgroup (on-list or not, because items have a stable cgroup, see below) will see the list_lru_one switch synchronously. Objcg reparent is also moved after list_lru reparent so items will have a stable mem cgroup until all list_lru_one instances are drained. The only caller that doesn't work the *_obj interfaces are direct calls to list_lru_{add,del}. But it's only used by zswap and that's also based on objcg, so it's fine. This also changes the bahaviour of the isolation function when LRU_RETRY or LRU_REMOVED_RETRY is returned, because now releasing the lock could unblock reparenting and free the list_lru_one, isolation function will have to return withoug re-lock the lru. prepare() { mkdir /tmp/test-fs modprobe brd rd_nr=1 rd_size=33554432 mkfs.xfs -f /dev/ram0 mount -t xfs /dev/ram0 /tmp/test-fs for i in $(seq 1 512); do mkdir "/tmp/test-fs/$i" for j in $(seq 1 10240); do echo TEST-CONTENT > "/tmp/test-fs/$i/$j" done & done; wait } do_test() { read_worker() { sleep 1 tar -cv "$1" &>/dev/null } read_in_all() { cd "/tmp/test-fs" && ls for i in $(seq 1 512); do (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs" read_worker "$i" & done; wait } for i in $(seq 1 512); do mkdir -p "/sys/fs/cgroup/benchmark/$i" done echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control echo 512M > /sys/fs/cgroup/benchmark/memory.max echo 3 > /proc/sys/vm/drop_caches time read_in_all } Above script simulates compression of small files in multiple cgroups with memory pressure. Run prepare() then do_test for 6 times: Before: real 0m7.762s user 0m11.340s sys 3m11.224s real 0m8.123s user 0m11.548s sys 3m2.549s real 0m7.736s user 0m11.515s sys 3m11.171s real 0m8.539s user 0m11.508s sys 3m7.618s real 0m7.928s user 0m11.349s sys 3m13.063s real 0m8.105s user 0m11.128s sys 3m14.313s After this commit (about ~15% faster): real 0m6.953s user 0m11.327s sys 2m42.912s real 0m7.453s user 0m11.343s sys 2m51.942s real 0m6.916s user 0m11.269s sys 2m43.957s real 0m6.894s user 0m11.528s sys 2m45.346s real 0m6.911s user 0m11.095s sys 2m43.168s real 0m6.773s user 0m11.518s sys 2m40.774s Link: https://lkml.kernel.org/r/20241104175257.60853-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-11 17:22:26 -08:00
Darrick J. Wong	13877bc79d	xfs: port ondisk structure checks from xfs/122 to the kernel Check this with every kernel and userspace build, so we can drop the nonsense in xfs/122. Roughly drafted with: sed -e 's/^offsetof/\tXFS_CHECK_OFFSET/g' \ -e 's/^sizeof/\tXFS_CHECK_STRUCT_SIZE/g' \ -e 's/ = $[0-9]$/,\t\t\t\1);/g' \ -e 's/xfs_sb_t/struct xfs_dsb/g' \ -e 's/),/,/g' \ -e 's/xfs_$[a-z0-9_]$_t,/struct xfs_\1,/g' \ < tests/xfs/122.out \| sort and then manual fixups. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:47 -08:00
Darrick J. Wong	131a883fff	xfs: separate space btree structures in xfs_ondisk.h Create a separate section for space management btrees so that they're not mixed in with file structures. Ignore the dsb stuff sprinkled around for now, because we'll deal with that in a subsequent patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:47 -08:00
Darrick J. Wong	89b38282d1	xfs: convert struct typedefs in xfs_ondisk.h Replace xfs_foo_t with struct xfs_foo where appropriate. The next patch will import more checks from xfs/122, and it's easier to automate deduplication if we don't have to reason about typedefs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:47 -08:00
Darrick J. Wong	ea079efd36	xfs: enable metadata directory feature Enable the metadata directory feature. With this feature, all metadata inodes are placed in the metadata directory, and the only inumbers in the superblock are the roots of the two directory trees. The RT device is now sharded into a number of rtgroups, where 0 rtgroups mean that no RT extents are supported, and the traditional XFS stub RT bitmap and summary inodes don't exist. A single rtgroup gives roughly identical behavior to the traditional RT setup, but now with checksummed and self identifying free space metadata. For quota, the quota options are read from the superblock unless explicitly overridden via mount options. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:46 -08:00
Darrick J. Wong	edc038f7f3	xfs: enable realtime quota again Enable quotas for the realtime device. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:46 -08:00
Darrick J. Wong	28d756d4d5	xfs: update sb field checks when metadir is turned on When metadir is enabled, we want to check the two new rtgroups fields, and we don't want to check the old inumbers that are now in the metadir. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:46 -08:00
Darrick J. Wong	b7020ba86a	xfs: reserve quota for realtime files correctly Fix xfs_quota_reserve_blkres to reserve rt block quota whenever we're dealing with a realtime file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:46 -08:00
Darrick J. Wong	5dd70852b0	xfs: create quota preallocation watermarks for realtime quota Refactor the quota preallocation watermarking code so that it'll work for realtime quota too. Convert the do_div calls into div_u64 for compactness. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:46 -08:00
Darrick J. Wong	9a17ebfea9	xfs: report realtime block quota limits on realtime directories On the data device, calling statvfs on a projinherit directory results in the block and avail counts being curtailed to the project quota block limits, if any are set. Do the same for realtime files or directories, only use the project quota rt block limits. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:46 -08:00
Darrick J. Wong	d5d9dd5b30	xfs: persist quota flags with metadir It's annoying that one has to keep reminding XFS about what quota options it should mount with, since the quota flags recording the previous state are sitting right there in the primary superblock. Even more strangely, there exists a noquota option to disable quotas completely, so it's odder still that providing no options is the same as noquota. Starting with metadir, let's change the behavior so that if the user does not specify any quota-related mount options at all, the ondisk quota flags will be used to bring up quota. In other words, the filesystem will mount in the same state and with the same functionality as it had during the last mount. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:45 -08:00
Darrick J. Wong	184c619f55	xfs: advertise realtime quota support in the xqm stat files Add a fifth column to this (really old) stat file to advertise that the kernel supports quota for realtime volumes. This will be used by fstests to detect kernel support. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:45 -08:00
Darrick J. Wong	128a055291	xfs: scrub quota file metapaths Enable online fsck for quota file metadata directory paths. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:45 -08:00
Darrick J. Wong	b28564cae1	xfs: fix chown with rt quota Make chown's quota adjustments work with realtime files. This is mostly a matter of calling xfs_inode_count_blocks on a given file to figure out the number of blocks allocated to the data device and to the realtime device, and using those quantities to update the quota accounting when the id changes. Delayed allocation reservations are moved from the old dquot's incore reservation to the new dquot's incore reservation. Note that there was a missing ILOCK bug in xfs_qm_dqusage_adjust that we must fix before calling xfs_iread_extents. Prior to 2.6.37 the locking was correct, but then someone removed the ILOCK as part of a cleanup. Nobody noticed because nowhere in the git history have we ever supported rt+quota so nobody can use this. I'm leaving git breadcrumbs in case anyone is desperate enough to try to backport the rtquota code to old kernels. Not-Cc: <stable@vger.kernel.org> # v2.6.37 Fixes: `52fda11424` ("xfs: simplify xfs_qm_dqusage_adjust") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:45 -08:00
Darrick J. Wong	e80fbe1ad8	xfs: use metadir for quota inodes Store the quota inodes in the /quota metadata directory if metadir is enabled. This enables us to stop using the sb_[ugp]uotino fields in the superblock. From this point on, all metadata files will be children of the metadata directory tree root. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:45 -08:00
Darrick J. Wong	fc23a426ce	xfs: refactor xfs_qm_destroy_quotainos Reuse this function instead of open-coding the logic. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:45 -08:00
Darrick J. Wong	a3315d1130	xfs: use rtgroup busy extent list for FITRIM For filesystems that have rtgroups and hence use the busy extent list for freed rt space, use that busy extent list so that FITRIM can issue discard commands asynchronously without worrying about other callers accidentally allocating and using space that is being discarded. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:44 -08:00
Darrick J. Wong	7e85fc2394	xfs: implement busy extent tracking for rtgroups For rtgroups filesystems, track newly freed (rt) space through the log until the rt EFIs have been committed to disk. This way we ensure that space cannot be reused until all traces of the old owner are gone. As a fringe benefit, we now support -o discard on the realtime device. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:44 -08:00
Darrick J. Wong	0c271d906e	xfs: port the perag discard code to handle generic groups Port xfs_discard_extents and its tracepoints to handle generic groups instead of just perags. This is needed to enable busy extent tracking for rtgroups. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:44 -08:00
Darrick J. Wong	e0b5b97dde	xfs: move the min and max group block numbers to xfs_group Move the min and max agblock numbers to the generic xfs_group structure so that we can start building validators for extents within an rtgroup. While we're at it, use check_add_overflow for the extent length computation because that has much better overflow checking. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:44 -08:00
Darrick J. Wong	ceaa0bd773	xfs: adjust min_block usage in xfs_verify_agbno There's some weird logic in xfs_verify_agbno -- min_block ought to be the first agblock number in the AG that can be used by non-static metadata. However, we initialize it to the last agblock of the static metadata, which works due to the <= check, even though this isn't technically correct. Change the check to < and set min_block to the next agblock past the static metadata. This hasn't been an issue up to now, but we're going to move these things into the generic group struct, and this will cause problems with rtgroups, where min_block can be zero for an rtgroup that doesn't have a rt superblock. Note that there's no user-visible impact with the old logic, so this isn't a bug fix. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:44 -08:00
Darrick J. Wong	7195f240c6	xfs: make xfs_rtblock_t a segmented address like xfs_fsblock_t Now that we've finished adding allocation groups to the realtime volume, let's make the file block mapping address (xfs_rtblock_t) a segmented value just like we do on the data device. This means that group number and block number conversions can be done with shifting and masking instead of integer division. While in theory we could continue caching the rgno shift value in m_rgblklog, the fact that we now always use the shift value means that we have an opportunity to increase the redundancy of the rt geometry by storing it in the ondisk superblock and adding more sb verifier code. Extend the sueprblock to store the rgblklog value. Now that we have segmented addresses, set the correct values in m_groups[XG_TYPE_RTG] so that the xfs_group helpers work correctly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:44 -08:00
Darrick J. Wong	3f0205ebe7	xfs: create helpers to deal with rounding xfs_filblks_t to rtx boundaries We're about to segment xfs_rtblock_t addresses, so we must create type-specific helpers to do rt extent rounding of file mapping block lengths because the rtb helpers soon will not do the right thing there. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:43 -08:00
Darrick J. Wong	fd7588fa64	xfs: create helpers to deal with rounding xfs_fileoff_t to rtx boundaries We're about to segment xfs_rtblock_t addresses, so we must create type-specific helpers to do rt extent rounding of file block offsets because the rtb helpers soon will not do the right thing there. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:43 -08:00
Darrick J. Wong	ea99122b18	xfs: mask off the rtbitmap and summary inodes when metadir in use Set the rtbitmap and summary file inumbers to NULLFSINO in the superblock and make sure they're zeroed whenever we write the superblock to disk, to mimic mkfs behavior. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:43 -08:00
Darrick J. Wong	a74923333d	xfs: scrub metadir paths for rtgroup metadata Add the code we need to scan the metadata directory paths of rt group metadata files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:43 -08:00
Darrick J. Wong	1433f8f9ce	xfs: repair realtime group superblock Repair the realtime superblock if it has become out of date with the primary superblock. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:43 -08:00
Darrick J. Wong	3f1bdf50ab	xfs: scrub the realtime group superblock Enable scrubbing of realtime group superblocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:43 -08:00
Darrick J. Wong	7333c948c2	xfs: don't coalesce file mappings that cross rtgroup boundaries in scrub The bmbt scrubber will combine file mappings if they are mergeable to reduce the number of cross-referencing checks. However, we shouldn't combine mappings that cross rt group boundaries because that will cause verifiers to trip incorrectly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:42 -08:00
Christoph Hellwig	d162491c54	xfs: make the RT allocator rtgroup aware Make the allocator rtgroup aware by either picking a specific group if there is a hint, or loop over all groups otherwise. A simple rotor is provided to pick the placement for initial allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:42 -08:00
Darrick J. Wong	b91afef724	xfs: don't merge ioends across RTGs Unlike AGs, RTGs don't always have metadata in their first blocks, and thus we don't get automatic protection from merging I/O completions across RTG boundaries. Add code to set the IOMAP_F_BOUNDARY flag for ioends that start at the first block of a RTG so that they never get merged into the previous ioend. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:42 -08:00
Darrick J. Wong	44e69c9af1	xfs: use realtime EFI to free extents when rtgroups are enabled When rmap is enabled, XFS expects a certain order of operations, which is: 1) remove the file mapping, 2) remove the reverse mapping, and then 3) free the blocks. When reflink is enabled, XFS replaces (3) with a deferred refcount decrement operation that can schedule freeing the blocks if that was the last refcount. For realtime files, xfs_bmap_del_extent_real tries to do 1 and 3 in the same transaction, which will break both rmap and reflink unless we switch it to use realtime EFIs. Both rmap and reflink depend on the rtgroups feature, so let's turn on EFIs for all rtgroups filesystems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:42 -08:00
Darrick J. Wong	fc91d9430e	xfs: support error injection when freeing rt extents A handful of fstests expect to be able to test what happens when extent free intents fail to actually free the extent. Now that we're supporting EFIs for realtime extents, add to xfs_rtfree_extent the same injection point that exists in the regular extent freeing code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:42 -08:00
Darrick J. Wong	4c8900bbf1	xfs: support logging EFIs for realtime extents Teach the EFI mechanism how to free realtime extents. We're going to need this to enforce proper ordering of operations when we enable realtime rmap. Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer ops for rt extents. This keeps the ondisk artifacts and processing code completely separate between the rt and non-rt cases. Hopefully this will make it easier to debug filesystem problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:42 -08:00
Darrick J. Wong	b57283e1a0	xfs: force swapext to a realtime file to use the file content exchange ioctl xfs_swap_extent_rmap does not use log items to track the overall progress of an attempt to swap the extent mappings between two files. If the system crashes in the middle of swapping a partially written realtime extent, the mapping will be left in an inconsistent state wherein a file can point to multiple extents on the rt volume. The new file range exchange functionality handles this correctly, so all callers must upgrade to that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:41 -08:00
Darrick J. Wong	e464d8e8bb	xfs: store rtgroup information with a bmap intent Make the bmap intent items take an active reference to the rtgroup containing the space that is being mapped or unmapped. We will need this functionality once we start enabling rmap and reflink on the rt volume. Technically speaking we need it even for !rtgroups filesystems to prevent the (dummy) rtgroup 0 from going away, even though this will never happen. As a bonus, we can rework the xfs_bmap_deferred_class tracepoint to use the xfs_group object to figure out the type and group number, widen the group block number field to fit 64-bit quantities, and get rid of the now redundant opdev and rtblock fields. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:41 -08:00
Darrick J. Wong	ee32135148	xfs: grow the realtime section when realtime groups are enabled Enable growing the rt section when realtime groups are enabled. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:41 -08:00
Darrick J. Wong	a2c2836739	xfs: encode the rtsummary in big endian format Currently, the ondisk realtime summary file counters are accessed in units of 32-bit words. There's no endian translation of the contents of this file, which means that the Bad Things Happen(tm) if you go from (say) x86 to powerpc. Since we have a new feature flag, let's take the opportunity to enforce an endianness on the file. Encode the summary information in big endian format, like most of the rest of the filesystem. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:41 -08:00
Darrick J. Wong	eba42c2c53	xfs: encode the rtbitmap in big endian format Currently, the ondisk realtime bitmap file is accessed in units of 32-bit words. There's no endian translation of the contents of this file, which means that the Bad Things Happen(tm) if you go from (say) x86 to powerpc. Since we have a new feature flag, let's take the opportunity to enforce an endianness on the file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:41 -08:00
Darrick J. Wong	118895aa95	xfs: add block headers to realtime bitmap and summary blocks Upgrade rtbitmap and rtsummary blocks to have self describing metadata like most every other thing in XFS. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:40 -08:00
Darrick J. Wong	3fa7a6d0c7	xfs: export the geometry of realtime groups to userspace Create an ioctl so that the kernel can report the status of realtime groups to userspace. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:40 -08:00
Darrick J. Wong	ab7bd650e1	xfs: record rt group metadata errors in the health system Record the state of per-rtgroup metadata sickness in the rtgroup structure for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:40 -08:00
Darrick J. Wong	21e62bddf0	xfs: convert sick_map loops to use ARRAY_SIZE Convert these arrays to use ARRAY_SIZE insteead of requiring an empty sentinel array element at the end. This saves memory and would have avoided a bug that worked its way into the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:40 -08:00
Darrick J. Wong	35537f25d2	xfs: add frextents to the lazysbcounters when rtgroups enabled Make the free rt extent count a part of the lazy sb counters when the realtime groups feature is enabled. This is possible because the patch to recompute frextents from the rtbitmap during log recovery predates the code adding rtgroup support, hence we know that the value will always be correct during runtime. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:40 -08:00
Christoph Hellwig	8458c4944e	xfs: add a helper to prevent bmap merges across rtgroup boundaries Except for the rt superblock, realtime groups do not store any metadata at the start (or end) of the group. There is nothing to prevent the bmap code from merging allocations from multiple groups into a single bmap record. Add a helper to check for this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: massage the commit message after pulling this into rtgroups] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:40 -08:00
Darrick J. Wong	9bb5127347	xfs: check that rtblock extents do not break rtsupers or rtgroups Check that rt block pointers do not point to the realtime superblock and that allocated rt space extents do not cross rtgroup boundaries. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Darrick J. Wong	8edde94d64	xfs: export realtime group geometry via XFS_FSOP_GEOM Export the realtime geometry information so that userspace can query it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Darrick J. Wong	76d3be00df	xfs: update realtime super every time we update the primary fs super Every time we update parts of the primary filesystem superblock that are echoed in the rt superblock, we must update the rt super. Avoid changing the log to support logging to the rt device by using ordered buffers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Darrick J. Wong	18618e7100	xfs: check the realtime superblock at mount time Check the realtime superblock at mount time, to ensure that the label and uuids actually match the primary superblock on the data device. If the rt superblock is good, attach it to the xfs_mount so that the log can use ordered buffers to keep this primary in sync with the primary super on the data device. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Darrick J. Wong	96768e9151	xfs: define the format of rt groups Define the ondisk format of realtime group metadata, and a superblock for realtime volumes. rt supers are conditionally enabled by a predicate function so that they can be disabled if we ever implement zoned storage support for the realtime volume. For rt group enabled file systems there is a separate bitmap and summary file for each group and thus the number of bitmap and summary blocks needs to be calculated differently. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:39 -08:00
Christoph Hellwig	f220f6da5f	xfs: make RT extent numbers relative to the rtgroup To prepare for adding per-rtgroup bitmap files, make the xfs_rtxnum_t type encode the RT extent number relative to the rtgroup. The biggest part of this to clearly distinguish between the relative extent number that gets masked when converting from a global block number and length values that just have a factor applied to them when converting from file system blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:38 -08:00
Darrick J. Wong	dca94251f6	xfs: fix rt device offset calculations for FITRIM FITRIM on xfs has this bizarro uapi where we flatten all the physically addressable storage across two block devices into a linear address space. In this address space, the realtime device comes immediately after the data device. Therefore, the xfs_trim_rtdev_extents has to convert its input parameters from the linear address space to actual rtdev block addresses on the realtime volume. Right now the address space conversion is done in units of rtblocks. However, a future patchset will convert xfs_rtblock_t to be a segmented address space (group:blkno) like the data device. Change the conversion code to be done in units of daddrs since those will never be segmented. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:38 -08:00
Christoph Hellwig	f8c5a8415f	xfs: refactor xfs_rtsummary_blockcount Make xfs_rtsummary_blockcount take all the required information from the mount structure and return the number of summary levels from it as well. This cleans up many of the callers and prepares for making the rtsummary files per-rtgroup where they need to look at different value. This means we recalculate some values in some callers, but as all these calculations are outside the fast path and cheap, which seems like a price worth paying. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:38 -08:00
Christoph Hellwig	5a7566c8d6	xfs: refactor xfs_rtbitmap_blockcount Rename the existing xfs_rtbitmap_blockcount to xfs_rtbitmap_blockcount_len and add a new xfs_rtbitmap_blockcount wrapper around it that takes the number of extents from the mount structure. This will simplify the move to per-rtgroup bitmaps as those will need to pass in the number of extents per rtgroup instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:38 -08:00
Christoph Hellwig	bde86b42d2	xfs: factor out a xfs_growfs_check_rtgeom helper Split the check that the rtsummary fits into the log into a separate helper, and use xfs_growfs_rt_alloc_fake_mount to calculate the new RT geometry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: avoid division for the 0-rtx growfs check] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:38 -08:00
Christoph Hellwig	fc233f1fb0	xfs: use xfs_growfs_rt_alloc_fake_mount in xfs_growfs_rt_alloc_blocks Use xfs_growfs_rt_alloc_fake_mount instead of manually recalculating the RT bitmap geometry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Darrick J. Wong	1029f08dc5	xfs: factor out a xfs_growfs_rt_alloc_fake_mount helper Split the code to set up a fake mount point to calculate new RT geometry out of xfs_growfs_rt_bmblock so that it can be reused. Note that this changes the rmblocks calculation method to be based on the passed in rblocks and extsize and not the explicitly passed one, but both methods will always lead to the same result. The new version just does a little bit more math while being more general. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Christoph Hellwig	cb9cd6e56e	xfs: calculate RT bitmap and summary blocks based on sb_rextents Use the on-disk rextents to calculate the bitmap and summary blocks instead of the calculated one so that we can refactor the helpers for calculating them. As the RT bitmap and summary scrubbers already check that sb_rextents match the block count this does not change coverage of the scrubber. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Darrick J. Wong	c1442d22a0	xfs: remove XFS_ILOCK_RT* Now that we've centralized the realtime metadata locking routines, get rid of the ILOCK subclasses since we now use explicit lockdep classes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:37 -08:00
Christoph Hellwig	ae897e0bed	xfs: support creating per-RTG files in growfs To support adding new RT groups in growfs, we need to be able to create the per-RT group files. Add a new xfs_rtginode_create helper to create a given per-RTG file. Most of the code for that is shared, but the details of the actual file are abstracted out using a new create method in struct xfs_rtginode_ops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Christoph Hellwig	e3088ae2dc	xfs: move RT bitmap and summary information to the rtgroup Move the pointers to the RT bitmap and summary inodes as well as the summary cache to the rtgroups structure to prepare for having a separate bitmap and summary inodes for each rtgroup. Code using the inodes now needs to operate on a rtgroup. Where easily possible such code is converted to iterate over all rtgroups, else rtgroup 0 (the only one that can currently exist) is hardcoded. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:37 -08:00
Christoph Hellwig	c8edf1cbef	xfs: split xfs_trim_rtdev_extents Split xfs_trim_rtdev_extents into two parts to prepare for reusing the main validation also for RT group aware file systems. Use the fully features xfs_daddr_to_rtb helper to convert from a daddr to a xfs_rtblock_t to prepare for segmented addressing in RT groups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Christoph Hellwig	d6d5c90ada	xfs: cleanup xfs_getfsmap_rtdev_rtbitmap Use mp->m_sb.sb_rblocks to calculate the end instead of sb_rextents that needs a conversion, use consistent names to xfs_rtblock_t types, and only calculated them by the time they are needed. Remove the pointless "high" local variable that only has a single user. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Christoph Hellwig	9154b5008c	xfs: factor out a xfs_growfs_rt_alloc_blocks helper Split out a helper to allocate or grow the rtbitmap and rtsummary files in preparation of per-RT group bitmap and summary files. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Christoph Hellwig	cd8d049082	xfs: add a xfs_qm_unmount_rt helper RT group enabled file systems fix the bug where we pointlessly attach quotas to the RT bitmap and summary files. Split the code to detach the quotas into a helper, make it conditional and document the differing behavior for RT group and pre-RT group file systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Christoph Hellwig	9c3cfb9c96	xfs: add a xfs_bmap_free_rtblocks helper Split the RT extent freeing logic from xfs_bmap_del_extent_real because it will become more complicated when adding RT group. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:36 -08:00
Darrick J. Wong	cd5b26f0c0	xfs: add rtgroup-based realtime scrubbing context management Create a state tracking structure and helpers to initialize the tracking structure so that we can check metadata records against the realtime space management metadata. Right now this is limited to grabbing the incore rtgroup object, but we'll eventually add to the tracking structure the ILOCK state and btree cursors. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:36 -08:00
Darrick J. Wong	0d2c636e48	xfs: repair metadata directory file path connectivity Fix disconnected or incorrect metadata directory paths. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	65b1231b8c	xfs: support caching rtgroup metadata inodes Create the necessary per-rtgroup infrastructure that we need to load metadata inodes into memory. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	c29237a65c	xfs: add a lockdep class key for rtgroup inodes Add a dynamic lockdep class key for rtgroup inodes. This will enable lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking order. Each class can have 8 subclasses, and for now we will only have 2 inodes per group. This enables rtgroup order and inode order checks when nesting ILOCKs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	0e4875b3fb	xfs: define locking primitives for realtime groups Define helper functions to lock all metadata inodes related to a realtime group. There's not much to look at now, but this will become important when we add per-rtgroup metadata files and online fsck code for them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	87fe4c34a3	xfs: create incore realtime group structures Create an incore object that will contain information about a realtime allocation group. This will eventually enable us to shard the realtime section in a similar manner to how we shard the data section, but for now just a single object for the entire RT subvolume is created. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:35 -08:00
Christoph Hellwig	dcfc65befb	xfs: clean up xfs_getfsmap_helper arguments The calling conventions for xfs_getfsmap_helper are confusing -- callers pass in an rmap record, but they must also supply startblock and blockcount in daddr units. This was bolted onto the original fsmap implementation so that we could report something for realtime volumes, which do not support rmap and hence can draw only from the rt free space bitmap. Free space on the rt volume can be more than 2^32 fsblocks long, which means that we can't use the rmap startblock or blockcount fields. This is confusing for callers, because they must supplying redundant data, but not all of it is used. Streamline this by creating a separate fsmap irec structure that contains exactly the data we need, once. Note that we actually do need rm_startblock for rmap key comparisons when we're actually querying an rmap btree, so leave that field but document why it's there. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:35 -08:00
Darrick J. Wong	87b7c205da	xfs: confirm dotdot target before replacing it during a repair xfs_dir_replace trips an assertion if you tell it to change a dirent to point to an inumber that it already points at. Look up the dotdot entry directly to confirm that we need to make a change. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	b3c03efa59	xfs: check metadata directory file path connectivity Create a new scrubber type that checks that well known metadata directory paths are connected to the metadata inode that the incore structures think is in use. For example, check that "/quota/user" in the metadata directory tree actually points to mp->m_quotainfo->qi_uquotaip->i_ino. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	9dc31acb01	xfs: move repair temporary files to the metadata directory tree Due to resource acquisition rules, we have to create the ondisk temporary files used to stage a filesystem repair before we can acquire a reference to the inode that we actually want to repair. Therefore, we do not know at tempfile creation time whether the tempfile will belong to the regular directory tree or the metadata directory tree. This distinction becomes important when the swapext code tries to figure out the quota accounting of the two files whose mappings are being swapped. The swapext code assumes that accounting updates are required for a file if dqattach attaches dquots. Metadir files are never accounted in quota, which means that swapext must not update the quota accounting when swapping in a repaired directory/xattr/rtbitmap structure. Prior to the swapext call, therefore, both files must be marked as METADIR for dqattach so that dqattach will ignore them. Add support for a repair tempfile to be switched to the metadir tree and switched back before being released so that ifree will just free the file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	dcde94bdee	xfs: check the metadata directory inumber in superblocks When metadata directories are enabled, make sure that the secondary superblocks point to the metadata directory. This isn't strictly required because the secondaries are only used to recover damaged filesystems, and the metadir root inumber is fixed. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	3d2c341111	xfs: scrub metadata directories Teach online scrub about the metadata directory tree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	5dab2daa8a	xfs: fix di_metatype field of inodes that won't load Make sure that the di_metatype field is at least set plausibly so that later scrubbers could set the real type. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:34 -08:00
Darrick J. Wong	aec2eb7da8	xfs: adjust parent pointer scrubber for sb-rooted metadata files Starting with the metadata directory feature, we're allowed to call the directory and parent pointer scrubbers for every metadata file, including the ones that are children of the superblock. For these children, checking the link count against the number of parent pointers is a bit funny -- there's no such thing as a parent pointer for a child of the superblock since there's no corresponding dirent. For purposes of validating nlink, we pretend that there is a parent pointer. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	91fb4232be	xfs: metadata files can have xattrs if metadir is enabled If parent pointers are enabled, then metadata files will store parent pointers in xattrs, just like files in the user visible directory tree. Therefore, scrub and repair need to handle attr forks for metadata files on metadir filesystems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	13af229ee0	xfs: do not count metadata directory files when doing online quotacheck Previously, we stated that files in the metadata directory tree are not counted in the dquot information. Fix the online quotacheck code to reflect this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	679b098b59	xfs: refactor directory tree root predicates Metadata directory trees make reasoning about the parent of a file more difficult. Traditionally, user files are children of sb_rootino, and metadata files are "children" of the superblock. Now, we add a third possibility -- some metadata files can be children of sb_metadirino, but the classic ones (rt free space data and quotas) are left alone. Let's add some helper functions (instead of open-coding the logic everywhere) to make scrub logic easier to understand. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	be42fc1393	xfs: record health problems with the metadata directory Make a report to the health monitoring subsystem any time we encounter something in the metadata directory tree that looks like corruption. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:33 -08:00
Darrick J. Wong	61b6bdb30a	xfs: adjust xfs_bmap_add_attrfork for metadir Online repair might use the xfs_bmap_add_attrfork to repair a file in the metadata directory tree if (say) the metadata file lacks the correct parent pointers. In that case, it is not correct to check that the file is dqattached -- metadata files must be not have /any/ dquot attached at all. Adjust the assertions appropriately. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	cc0cf84aa7	xfs: mark quota inodes as metadata files When we're creating quota files at mount time, make sure to mark them as metadir inodes if appropriate. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	382e275f0e	xfs: don't count metadata directory files to quota Files in the metadata directory tree are internal to the filesystem. Don't count the inodes or the blocks they use in the root dquot because users do not need to know about their resource usage. This will also quiet down complaints about dquot usage not matching du output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	df866c538f	xfs: allow bulkstat to return metadata directories Allow the V5 bulkstat ioctl to return information about metadata directory files so that xfs_scrub can find and scrub them, since they are otherwise ordinary directories. (Metadata files of course require per-file scrub code and hence do not need exposure.) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	688828d8f8	xfs: advertise metadata directory feature Advertise the existence of the metadata directory feature; this will be used by scrub to decide if it needs to scan the metadir too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	bb6cdd5529	xfs: hide metadata inodes from everyone because they are special Metadata inodes are private files and therefore cannot be exposed to userspace. This means no bulkstat, no open-by-handle, no linking them into the directory tree, and no feeding them to LSMs. As such, we mark them S_PRIVATE, which stops all that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:32 -08:00
Darrick J. Wong	8651b410ae	xfs: disable the agi rotor for metadata inodes Ideally, we'd put all the metadata inodes in one place if we could, so that the metadata all stay reasonably close together instead of spreading out over the disk. Furthermore, if the log is internal we'd probably prefer to keep the metadata near the log. Therefore, disable AGI rotoring for metadata inode allocations. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	5d9b54a4ef	xfs: read and write metadata inode directory tree Plumb in the bits we need to load metadata inodes from a named entry in a metadir directory, create (or hardlink) inodes into a metadir directory, create metadir directories, and flag inodes as being metadata files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	7297fd0beb	xfs: enforce metadata inode flag Add checks for the metadata inode flag so that we don't ever leak metadata inodes out to userspace, and we don't ever try to read a regular inode as metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	c555dd9b8c	xfs: load metadata directory root at mount time Load the metadata directory root inode into memory at mount time and release it at unmount time. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	dcf6069143	xfs: iget for metadata inodes Create a xfs_trans_metafile_iget function for metadata inodes to ensure that when we try to iget a metadata file, the inode is allocated and its file mode matches the metadata file type the caller expects. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Darrick J. Wong	4f3d4dd1b0	xfs: define the on-disk format for the metadir feature Define the on-disk layout and feature flags for the metadata inode directory feature. Add a xfs_sb_version_hasmetadir for benefit of xfs_repair, which needs to know where the new end of the superblock lies. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:31 -08:00
Christoph Hellwig	e5e5cae05b	xfs: store a generic group structure in the intents Replace the pag pointers in the extent free, bmap, rmap and refcount intent structures with a pointer to the generic group to prepare for adding intents for realtime groups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:30 -08:00
Darrick J. Wong	ecc8065dfa	xfs: standardize EXPERIMENTAL warning generation Refactor the open-coded warnings about EXPERIMENTAL feature use into a standard helper before we go adding more experimental features. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:30 -08:00
Darrick J. Wong	4d272929a5	xfs: rename metadata inode predicates The predicate xfs_internal_inum tells us if an inumber refers to one of the inodes rooted in the superblock. Soon we're going to have internal inodes in a metadata directory tree, so this helper should be renamed to capture its limited scope. Ondisk inodes will soon have a flag to indicate that they're metadata inodes. Head off some confusion by renaming the xfs_is_metadata_inode predicate to xfs_is_internal_inode. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:30 -08:00
Darrick J. Wong	fdf5703b61	xfs: constify the xfs_inode predicates Change the xfs_inode predicates to take a const struct xfs_inode pointer because they do not change the inode. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:30 -08:00
Darrick J. Wong	8d939f4bd7	xfs: constify the xfs_sb predicates Change the xfs_sb predicates to take a const struct xfs_sb pointer because they do not change the superblock. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:30 -08:00
Christoph Hellwig	ba102a682d	xfs: remove xfs_group_intent_hold and xfs_group_intent_rele Each of them just has a single caller, so fold them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	759cc1989a	xfs: add group based bno conversion helpers Add/move the blocks, blklog and blkmask fields to the generic groups structure so that code can work with AGs and RTGs by just using the right index into the array. Then, add convenience helpers to convert block numbers based on the generic group. This will allow writing code that doesn't care if it is used on AGs or the upcoming realtime groups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	198febb9fe	xfs: store a generic xfs_group pointer in xfs_getfsmap_info Replace the pag and rtg pointers with a generic group pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	77a530e6c4	xfs: add a generic group pointer to the btree cursor Replace the pag pointers in the type specific union with a generic xfs_group pointer. This prepares for adding realtime group support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	adbc76aa0f	xfs: convert busy extent tracking to the generic group structure Split busy extent tracking from struct xfs_perag into its own private structure, which can be pointed to by the generic group structure. Note that this structure is now dynamically allocated instead of embedded as the upcoming zone XFS code doesn't need it and will also have an unusually high number of groups due to hardware constraints. Dynamically allocating the structure this is a big memory saver for this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	0e10cb98f1	xfs: convert extent busy tracepoints to the generic group structure Prepare for tracking busy RT extents by passing the generic group structure to the xfs_extent_busy_class tracepoints. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:29 -08:00
Christoph Hellwig	6af1300d47	xfs: return the busy generation from xfs_extent_busy_list_empty This avoid having to poke into the internals of the busy tracking in xrep_setup_ag_allocbt. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	eb4a84a3c2	xfs: move the online repair rmap hooks to the generic group structure Prepare for the upcoming realtime groups feature by moving the online repair rmap hooks to based to the generic xfs_group structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	34cf3a6f39	xfs: move draining of deferred operations to the generic group structure Prepare supporting the upcoming realtime groups feature by moving the deferred operation draining to the generic xfs_group structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	2ed27a5464	xfs: mark xfs_perag_intent_{hold,rele} static These two functions are only used inside of xfs_drain.c, so mark them static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	5c8483cec3	xfs: move metadata health tracking to the generic group structure Prepare for also tracking the health status of the upcoming realtime groups by moving the health tracking code to the generic xfs_group structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	86437e6abb	xfs: switch perag iteration from the for_each macros to a while based iterator The current for_each_perag* macros are a bit annoying in that they require the caller to both provide an object and an index iterator, and also somewhat obsfucate the underlying control flow mechanism. Switch to open coded while loops using new xfs_perag_next{,_from,_range} helpers that return the next pag structure to iterate on based on the previous one or NULL for the loop start. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:28 -08:00
Christoph Hellwig	d66496578b	xfs: insert the pag structures into the xarray later Cleaning up is much easier if a structure can't be looked up yet, so only insert the pag once it is fully set up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	819928770b	xfs: add a xfs_group_next_range helper Add a helper to iterate over iterate over all groups, which can be used as a simple while loop: struct xfs_group *xg = NULL; while ((xg = xfs_group_next_range(mp, xg, 0, MAX_GROUP))) { ... } This will be wrapped by the realtime group code first, and eventually replace the for_each_rtgroup_from and for_each_rtgroup_range helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	201c5fa342	xfs: split xfs_initialize_perag Factor out a xfs_perag_alloc helper that allocates a single perag structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	e9c4d8bfb2	xfs: factor out a generic xfs_group structure Split the lookup and refcount handling of struct xfs_perag into an embedded xfs_group structure that can be reused for the upcoming realtime groups. It will be extended with more features later. Note that he xg_type field will only need a single bit even with realtime group support. For now it fills a hole, but it might be worth to fold it into another field if we can use this space better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	c4ae021bcb	xfs: convert remaining trace points to pass pag structures Convert all tracepoints that take [mp,agno] tuples to take a pag argument instead so that decoding only happens when tracepoints are enabled and to clean up the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	0a4d79741d	xfs: factor out a xfs_iwalk_args helper Add a helper to share more code between xfs_iwalk and xfs_inobt_walk, and at the same time do away with the extra flags indirect so that everyone use the same names for the same flags when using the common iwalk code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:27 -08:00
Christoph Hellwig	dc8df7e382	xfs: pass the pag to the xrep_newbt_extent_class tracepoints This requires moving a few of the callsites a little bit to ensure that we already have the reference, but allows for the decoding to only happen when tracing is actually enabled, and cleans up the callsites a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	934dde65b2	xfs: pass the pag to the trace_xrep_calc_ag_resblks{,_btsize} trace points This requires holding the pag refcount a little longer, but allows for the decoding to only happen when tracing is actually enabled, and cleans up the callsites a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	618a27a94d	xfs: pass objects to the xrep_ibt_walk_rmap tracepoint Pass the perag structure and the irec so that the decoding is only done when tracing is actually enabled and the call sites look a lot neater, and remove the pointless class indirection. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	1209d360eb	xfs: pass the iunlink item to the xfs_iunlink_update_dinode trace point So that decoding is only done when tracing is actually enabled and the call site look a lot neater. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	487092ceaa	xfs: pass objects to the xfs_irec_merge_{pre,post} trace points Pass the perag structure and the irec to these tracepoints so that the decoding is only done when tracing is actually enabled and the call sites look a lot neater. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	835ddb592f	xfs: pass a perag structure to the xfs_ag_resv_init_error trace point And remove the single instance class indirection for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:26 -08:00
Christoph Hellwig	2337ac79e9	xfs: constify pag arguments to trace points Trace points never modify their arguments. Mark all the pag objects passed to trace points. The exception is the xfs_ag_resv_class, which uses the xfs_perag_resv helper that can't be marked const due to other users modifying the returned structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	3c39444939	xfs: remove the unused xrep_bmap_walk_rmap trace point Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	c896fb44f6	xfs: remove the unused trace_xfs_iwalk_ag trace point Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	8dcf5e617f	xfs: remove the mount field from struct xfs_busy_extents The mount field is only passed to xfs_extent_busy_clear, which never uses it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	4a137e0915	xfs: keep a reference to the pag for busy extents Processing of busy extents requires the perag structure, so keep the reference while they are in flight. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	b6dc8c6dd2	xfs: pass a pag to xfs_extent_busy_{search,reuse} Replace the [mp,agno] tuple with the perag structure, which will become more useful later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:25 -08:00
Christoph Hellwig	6abd82ab6e	xfs: add a xfs_agino_to_ino helper Add a helpers to convert an agino to an ino based on a pag structure. This provides a simpler conversion and better type safety compared to the existing code that passes the mount structure and the agno separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	856a920ac2	xfs: add xfs_agbno_to_fsb and xfs_agbno_to_daddr helpers Add helpers to convert an agbno to a daddr or fsbno based on a pag structure. This provides a simpler conversion and better type safety compared to the existing code that passes the mount structure and the agno separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	db129fa011	xfs: remove the agno argument to xfs_free_ag_extent xfs_free_ag_extent already has a pointer to the pag structure through the agf buffer. Use that instead of passing the redundant argument, and do the same for the tracepoint. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	67ce5ba575	xfs: pass a pag to xfs_difree_inode_chunk We'll want to use more than just the agno field in a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	9943b45732	xfs: remove the unused pag_active_wq field in struct xfs_perag pag_active_wq is only woken, but never waited for. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:24 -08:00
Christoph Hellwig	4e071d79e4	xfs: remove the unused pagb_count field in struct xfs_perag Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:23 -08:00
Christoph Hellwig	cd8ae42a82	xfs: fix superfluous clearing of info->low in __xfs_getfsmap_datadev The for_each_perag helpers update the agno passed in for each iteration, and thus the "if (pag->pag_agno == start_ag)" check will always be true. Add another variable for the loop iterator so that the field is only cleared after the first iteration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-05 13:38:23 -08:00
Darrick J. Wong	62027820eb	xfs: fix simplify extent lookup in xfs_can_free_eofblocks In commit `11f4c3a53a`, we tried to simplify the extent lookup in xfs_can_free_eofblocks so that it doesn't incur the overhead of all the extra stuff that xfs_bmapi_read does around the iext lookup. Unfortunately, this causes regressions on generic/603, xfs/108, generic/219, xfs/173, generic/694, xfs/052, generic/230, and xfs/441 when always_cow is turned on. In all cases, the regressions take the form of alwayscow files consuming rather more space than the golden output is expecting. I observed that in all these cases, the cause of the excess space usage was due to CoW fork delalloc reservations that go beyond EOF. For alwayscow files we allow posteof delalloc CoW reservations because all writes go through the CoW fork. Recall that all extents in the CoW fork are accounted for via i_delayed_blks, which means that prior to this patch, we'd invoke xfs_free_eofblocks on first close if anything was in the CoW fork. Now we don't do that. Fix the problem by reverting the removal of the i_delayed_blks check. Cc: <stable@vger.kernel.org> # v6.12-rc1 Fixes: `11f4c3a53a` ("xfs: simplify extent lookup in xfs_can_free_eofblocks") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-05 13:38:23 -08:00
Christoph Hellwig	fe4e0faac9	xfs: remove xfs_page_mkwrite_iomap_ops Shared the regular buffered write iomap_ops with the page fault path and just check for the IOMAP_FAULT flag to skip delalloc punching. This keeps the delalloc punching checks in one place, and will make it easier to convert iomap to an iter model where the begin and end handlers are merged into a single callback. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:52:57 +01:00
Christoph Hellwig	a7fd3327d3	xfs: remove __xfs_filemap_fault xfs_filemap_huge_fault only ever serves DAX faults, so hard code the call to xfs_dax_read_fault and open code __xfs_filemap_fault in the only remaining caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:52:57 +01:00
Christoph Hellwig	1eb6fc0447	xfs: split write fault handling out of __xfs_filemap_fault Only two of the callers of __xfs_filemap_fault every handle read faults. Split the write_fault handling out of __xfs_filemap_fault so that all callers call that directly either conditionally or unconditionally and only leave the read fault handling in __xfs_filemap_fault. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:52:57 +01:00
Christoph Hellwig	1171de3296	xfs: split the page fault trace event Split the xfs_filemap_fault trace event into separate ones for read and write faults and move them into the applicable locations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:52:57 +01:00
Dave Chinner	59e43f5479	xfs: sb_spino_align is not verified It's just read in from the superblock and used without doing any validity checks at all on the value. Fixes: `fb4f2b4e5a` ("xfs: add sparse inode chunk alignment superblock field") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:51:59 +01:00
Christoph Hellwig	792ef2745d	xfs: simplify sector number calculation in xfs_zero_extent xfs_zero_extent does some really odd gymnstics to calculate the block layer sectors numbers passed to blkdev_issue_zeroout. This is because it used to call sb_issue_zeroout and the calculations in that helper got open coded here in the rather misleadingly named commit `3dc2916107` ("dax: use sb_issue_zerout instead of calling dax_clear_sectors"). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:51:59 +01:00
Long Li	8b9b261594	xfs: remove the redundant xfs_alloc_log_agf There are two invocations of xfs_alloc_log_agf in xfs_alloc_put_freelist. The AGF does not change between the two calls. Although this does not pose any practical problems, it seems like a small mistake. Therefore, fix it by removing the first xfs_alloc_log_agf invocation. Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-11-05 13:51:55 +01:00
John Garry	3af5298ce9	xfs: Support setting FMODE_CAN_ATOMIC_WRITE Set FMODE_CAN_ATOMIC_WRITE flag if we can atomic write for that inode. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: John Garry <john.g.garry@oracle.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> #On ppc64	2024-11-04 16:22:11 -08:00
John Garry	f096207d32	xfs: Validate atomic writes Validate that an atomic write adheres to length/offset rules. Currently we can only write a single FS block. For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_write_iter(), FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this, ATOMICWRITES flags would also need to be set for the inode. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-04 16:22:10 -08:00
John Garry	6432c6e723	xfs: Support atomic write for statx Support providing info on atomic write unit min and max for an inode. For simplicity, currently we limit the min at the FS block size. As for max, we limit also at FS block size, as there is no current method to guarantee extent alignment or granularity for regular files. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-11-04 16:22:10 -08:00
Al Viro	8152f82010	fdget(), more trivial conversions all failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. [xfs_ioc_commit_range() chunk moved here as well] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Al Viro	0d113fcbc2	simplify xfs_find_handle() a bit XFS_IOC_FD_TO_HANDLE can grab a reference to copied ->f_path and let the file go; results in simpler control flow - cleanup is the same for both "by descriptor" and "by pathname" cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Linus Torvalds	f6a7b4ec74	XFS bug fies for 6.12-rc6 * fix a sysbot reported crash on filestreams * Reduce cpu time spent searching for extents in a very fragmented FS * Check for delayed allocations before setting extsize Signed-off-by: Carlos Maiolino <cem@kernel.org> -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZyIMDwAKCRBGdaER5Qtf pllxAYCkk+mtDTD5xBfOVGZWO5MMFz8HqYcro5wrSCzgL8HDmW29kXTBYFviGn3R 3l/H6BEBgOk0EkI5qGOzijpzbsWyJeLzPzZtxQFPD8zFBdxSERCtbpqFDLLvLQWG M+TLhUNkPQ== =kKX4 -----END PGP SIGNATURE----- Merge tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Carlos Maiolino: - fix a sysbot reported crash on filestreams - Reduce cpu time spent searching for extents in a very fragmented FS - Check for delayed allocations before setting extsize * tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: streamline xfs_filestream_pick_ag xfs: fix finding a last resort AG in xfs_filestream_pick_ag xfs: Reduce unnecessary searches when searching for the best extents xfs: Check for delayed allocations before setting extsize	2024-11-02 09:22:16 -10:00
Linus Torvalds	17fa6a5f93	vfs-6.12-rc6.iomap -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGVAAKCRCRxhvAZXjc oltEAP9r8cWa3Tdv8DzMNWu/jezTUXoW/mX5Qe+c1L6faqj0WQD/dIVtBtG37Tfq 3Ci9F/GEWjKijtCQ5lwMGUq27jQJ1gk= =/0iA -----END PGP SIGNATURE----- Merge tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull iomap fixes from Christian Brauner: "Fixes for iomap to prevent data corruption bugs in the fallocate unshare range implementation of fsdax and a small cleanup to turn iomap_want_unshare_iter() into an inline function" * tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iomap: turn iomap_want_unshare_iter into an inline function fsdax: dax_unshare_iter needs to copy entire blocks fsdax: remove zeroing code from dax_unshare_iter iomap: share iomap_unshare_iter predicate code with fsdax xfs: don't allocate COW extents when unsharing a hole	2024-11-01 07:45:00 -10:00
Christoph Hellwig	81a1e1c32e	xfs: streamline xfs_filestream_pick_ag Directly return the error from xfs_bmap_longest_free_extent instead of breaking from the loop and handling it there, and use a done label to directly jump to the exist when we found a suitable perag structure to reduce the indentation level and pag/max_pag check complexity in the tail of the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-30 11:27:18 +01:00
Christoph Hellwig	dc60992ce7	xfs: fix finding a last resort AG in xfs_filestream_pick_ag When the main loop in xfs_filestream_pick_ag fails to find a suitable AG it tries to just pick the online AG. But the loop for that uses args->pag as loop iterator while the later code expects pag to be set. Fix this by reusing the max_pag case for this last resort, and also add a check for impossible case of no AG just to make sure that the uninitialized pag doesn't even escape in theory. Reported-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com Fixes: `f8f1ed1ab3` ("xfs: return a referenced perag from filestreams allocator") Cc: <stable@vger.kernel.org> # v6.3 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-30 11:27:18 +01:00
Chi Zhiling	3ef2268403	xfs: Reduce unnecessary searches when searching for the best extents Recently, we found that the CPU spent a lot of time in xfs_alloc_ag_vextent_size when the filesystem has millions of fragmented spaces. The reason is that we conducted much extra searching for extents that could not yield a better result, and these searches would cost a lot of time when there were millions of extents to search through. Even if we get the same result length, we don't switch our choice to the new one, so we can definitely terminate the search early. Since the result length cannot exceed the found length, when the found length equals the best result length we already have, we can conclude the search. We did a test in that filesystem: [root@localhost ~]# xfs_db -c freesp /dev/vdb from to extents blocks pct 1 1 215 215 0.01 2 3 994476 1988952 99.99 Before this patch: 0) \| xfs_alloc_ag_vextent_size [xfs]() { 0) * 15597.94 us \| } After this patch: 0) \| xfs_alloc_ag_vextent_size [xfs]() { 0) 19.176 us \| } Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-30 11:27:18 +01:00
Ojaswin Mujoo	2a492ff666	xfs: Check for delayed allocations before setting extsize Extsize should only be allowed to be set on files with no data in it. For this, we check if the files have extents but miss to check if delayed extents are present. This patch adds that check. While we are at it, also refactor this check into a helper since it's used in some other places as well like xfs_inactive() or xfs_ioctl_setattr_xflags() Without the patch (SUCCEEDS) $ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536' wrote 1024/1024 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec) With the patch (FAILS as expected) $ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536' wrote 1024/1024 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec) xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument Fixes: `e94af02a9c` ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-30 11:27:18 +01:00
Christoph Hellwig	4a201dcfa1	xfs: update the pag for the last AG at recovery time Currently log recovery never updates the in-core perag values for the last allocation group when they were grown by growfs. This leads to btree record validation failures for the alloc, ialloc or finotbt trees if a transaction references this new space. Found by Brian's new growfs recovery stress test. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:19 +02:00
Christoph Hellwig	069cf5e32b	xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag __GFP_RETRY_MAYFAIL increases the likelyhood of allocations to fail, which isn't really helpful during log recovery. Remove the flag and stick to the default GFP_KERNEL policies. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Christoph Hellwig	b882b0f813	xfs: error out when a superblock buffer update reduces the agcount XFS currently does not support reducing the agcount, so error out if a logged sb buffer tries to shrink the agcount. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Christoph Hellwig	6a18765b54	xfs: update the file system geometry after recoverying superblock buffers Primary superblock buffers that change the file system geometry after a growfs operation can affect the operation of later CIL checkpoints that make use of the newly added space and allocation groups. Apply the changes to the in-memory structures as part of recovery pass 2, to ensure recovery works fine for such cases. In the future we should apply the logic to other updates such as features bits as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Christoph Hellwig	aa67ec6a25	xfs: merge the perag freeing helpers There is no good reason to have two different routines for freeing perag structures for the unmount and error cases. Add two arguments to specify the range of AGs to free to xfs_free_perag, and use that to replace xfs_free_unused_perag_range. The addition RCU grace period for the error case is harmless, and the extra check for the AG to actually exist is not required now that the callers pass the exact known allocated range. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Christoph Hellwig	82742f8c3f	xfs: pass the exact range to initialize to xfs_initialize_perag Currently only the new agcount is passed to xfs_initialize_perag, which requires lookups of existing AGs to skip them and complicates error handling. Also pass the previous agcount so that the range that xfs_initialize_perag operates on is exactly defined. That way the extra lookups can be avoided, and error handling can clean up the exact range from the old count to the last added perag structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Darrick J. Wong	af8512c527	xfs: don't fail repairs on metadata files with no attr fork Fix a minor bug where we fail repairs on metadata files that do not have attr forks because xrep_metadata_inode_subtype doesn't filter ENOENT. Cc: stable@vger.kernel.org # v6.8 Fixes: `5a8e07e799` ("xfs: repair the inode core and forks of a metadata inode") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-22 13:37:18 +02:00
Christoph Hellwig	f6f91d290c	xfs: punch delalloc extents from the COW fork for COW writes When ->iomap_end is called on a short write to the COW fork it needs to punch stale delalloc data from the COW fork and not the data fork. Ensure that IOMAP_F_NEW is set for new COW fork allocations in xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag in xfs_buffered_write_delalloc_punch to decide which fork to punch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	7d6fe5c586	xfs: set IOMAP_F_SHARED for all COW fork allocations Change to always set xfs_buffered_write_iomap_begin for COW fork allocations even if they don't overlap existing data fork extents, which will allow the iomap_end callback to detect if it has to punch stale delalloc blocks from the COW fork instead of the data fork. It also means we sample the sequence counter for both the data and the COW fork when writing to the COW fork, which ensures we properly revalidate when only COW fork changes happens. This is essentially a revert of commit `72a048c105` ("xfs: only set IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because the problem that the commit fixed has now been dealt with in iomap by only looking at the actual srcmap and not the fallback to the write iomap. Note that the direct I/O path was never changed and has always set IOMAP_F_SHARED for all COW fork allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	c29440ff66	xfs: share more code in xfs_buffered_write_iomap_begin Introduce a local iomap_flags variable so that the code allocating new delalloc blocks in the data fork can fall through to the found_imap label and reuse the code to unlock and fill the iomap. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	8fe3b21efa	xfs: support the COW fork in xfs_bmap_punch_delalloc_range xfs_buffered_write_iomap_begin can also create delallocate reservations that need cleaning up, prepare for that by adding support for the COW fork in xfs_bmap_punch_delalloc_range. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	abd7d651ad	xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lock All XFS callers of iomap_zero_range and iomap_file_unshare already hold invalidate_lock, so we can't take it again in iomap_file_buffered_write_punch_delalloc. Use the passed in flags argument to detect if we're called from a zero or unshare operation and don't take the lock again in this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	acfbac7764	xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof xfs_file_write_zero_eof is the only caller of xfs_zero_range that does not take XFS_MMAPLOCK_EXCL (aka the invalidate lock). Currently that is actually the right thing, as an error in the iomap zeroing code will also take the invalidate_lock to clean up, but to fix that deadlock we need a consistent locking pattern first. The only extra thing that XFS_MMAPLOCK_EXCL will lock out are read pagefaults, which isn't really needed here, but also not actively harmful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	3c399374af	xfs: factor out a xfs_file_write_zero_eof helper Split a helper from xfs_file_write_checks that just deal with the post-EOF zeroing to keep the code readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	b784951662	iomap: move locking out of iomap_write_delalloc_release XFS (which currently is the only user of iomap_write_delalloc_release) already holds invalidate_lock for most zeroing operations. To be able to avoid a deadlock it needs to stop taking the lock, but doing so in iomap would leak XFS locking details into iomap. To avoid this require the caller to hold invalidate_lock when calling iomap_write_delalloc_release instead of taking it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Christoph Hellwig	caf0ea451d	iomap: remove iomap_file_buffered_write_punch_delalloc Currently iomap_file_buffered_write_punch_delalloc can be called from XFS either with the invalidate lock held or not. To fix this while keeping the locking in the file system and not the iomap library code we'll need to life the locking up into the file system. To prepare for that, open code iomap_file_buffered_write_punch_delalloc in the only caller, and instead export iomap_write_delalloc_release. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-15 11:37:42 +02:00
Darrick J. Wong	0fb823f1cf	xfs: fix integer overflow in xrep_bmap The variable declaration in this function predates the merge of the nrext64 (aka 64-bit extent counters) feature, which means that the variable declaration type is insufficient to avoid an integer overflow. Fix that by redeclaring the variable to be xfs_extnum_t. Coverity-id: 1630958 Fixes: `8f71bede8e` ("xfs: repair inode fork block mapping data structures") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-11 12:32:48 +02:00
Christian Brauner	b40508ca5d	Merge patch series "timekeeping/fs: multigrain timestamp redux" Jeff Layton <jlayton@kernel.org> says: The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. To remedy this, keep a global monotonic atomic64_t value that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems). * patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org: tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:57 +02:00
Jeff Layton	1cf7e834a6	xfs: switch to multigrain timestamps Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. Also, anytime the mtime changes, the ctime must also change, and those are now the only two options for xfs_trans_ichgtime. Have that function unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is always set. Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime should give us better semantics now. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-9-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:52 +02:00
Andrew Kreimer	77bfe1b11e	xfs: fix a typo Fix a typo in comments. Signed-off-by: Andrew Kreimer <algonell@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-09 10:05:26 +02:00
Brian Foster	4390f019ad	xfs: don't free cowblocks from under dirty pagecache on unshare fallocate unshare mode explicitly breaks extent sharing. When a command completes, it checks the data fork for any remaining shared extents to determine whether the reflink inode flag and COW fork preallocation can be removed. This logic doesn't consider in-core pagecache and I/O state, however, which means we can unsafely remove COW fork blocks that are still needed under certain conditions. For example, consider the following command sequence: xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \ -c "pwrite 0 32k" -c "funshare 0 1k" <file> This allocates a data block at offset 0, shares it, and then overwrites it with a larger buffered write. The overwrite triggers COW fork preallocation, 32 blocks by default, which maps the entire 32k write to delalloc in the COW fork. All but the shared block at offset 0 remains hole mapped in the data fork. The unshare command redirties and flushes the folio at offset 0, removing the only shared extent from the inode. Since the inode no longer maps shared extents, unshare purges the COW fork before the remaining 28k may have written back. This leaves dirty pagecache backed by holes, which writeback quietly skips, thus leaving clean, non-zeroed pagecache over holes in the file. To verify, fiemap shows holes in the first 32k of the file and reads return different data across a remount: $ xfs_io -c "fiemap -v" <file> <file>: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS ... 1: [8..511]: hole 504 ... $ xfs_io -c "pread -v 4k 8" <file> 00001000: cd cd cd cd cd cd cd cd ........ $ umount <mnt>; mount <dev> <mnt> $ xfs_io -c "pread -v 4k 8" <file> 00001000: 00 00 00 00 00 00 00 00 ........ To avoid this problem, make unshare follow the same rules used for background cowblock scanning and never purge the COW fork for inodes with dirty pagecache or in-flight I/O. Fixes: `46afb0628b` ("xfs: only flush the unshared range in xfs_reflink_unshare") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-09 10:05:10 +02:00
Christian Brauner	dad1b6c805	Merge patch series "fsdax/xfs: unshare range fixes for 6.12" Darrick J. Wong <djwong@kernel.org> says: This patchset fixes multiple data corruption bugs in the fallocate unshare range implementation for fsdax. * patches from https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs: fsdax: dax_unshare_iter needs to copy entire blocks fsdax: remove zeroing code from dax_unshare_iter iomap: share iomap_unshare_iter predicate code with fsdax xfs: don't allocate COW extents when unsharing a hole Link: https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-08 12:27:14 +02:00
Darrick J. Wong	b8c4076db5	xfs: don't allocate COW extents when unsharing a hole It doesn't make sense to allocate a COW extent when unsharing a hole because holes cannot be shared. Fixes: `1f1397b721` ("xfs: don't allocate into the data fork for an unshare request") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/172796813277.1131942.5486112889531210260.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-07 13:51:46 +02:00
Brian Foster	90a71daaf7	xfs: skip background cowblock trims on inodes open for write The background blockgc scanner runs on a 5m interval by default and trims preallocation (post-eof and cow fork) from inodes that are otherwise idle. Idle effectively means that iolock can be acquired without blocking and that the inode has no dirty pagecache or I/O in flight. This simple mechanism and heuristic has worked fairly well for post-eof speculative preallocations. Support for reflink and COW fork preallocations came sometime later and plugged into the same mechanism, with similar heuristics. Some recent testing has shown that COW fork preallocation may be notably more sensitive to blockgc processing than post-eof preallocation, however. For example, consider an 8GB reflinked file with a COW extent size hint of 1MB. A worst case fully randomized overwrite of this file results in ~8k extents of an average size of ~1MB. If the same workload is interrupted a couple times for blockgc processing (assuming the file goes idle), the resulting extent count explodes to over 100k extents with an average size <100kB. This is significantly worse than ideal and essentially defeats the COW extent size hint mechanism. While this particular test is instrumented, it reflects a fairly reasonable pattern in practice where random I/Os might spread out over a large period of time with varying periods of (in)activity. For example, consider a cloned disk image file for a VM or container with long uptime and variable and bursty usage. A background blockgc scan that races and processes the image file when it happens to be clean and idle can have a significant effect on the future fragmentation level of the file, even when still in use. To help combat this, update the heuristic to skip cowblocks inodes that are currently opened for write access during non-sync blockgc scans. This allows COW fork preallocations to persist for as long as possible unless otherwise needed for functional purposes (i.e. a sync scan), the file is idle and closed, or the inode is being evicted from cache. While here, update the comments to help distinguish performance oriented heuristics from the logic that exists to maintain functional correctness. Suggested-by: Darrick Wong <djwong@kernel.org> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:12 +02:00
Christoph Hellwig	6aac770598	xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation variant fails to drop into the lowmode last resort allocator, and thus can sometimes fail allocations for which the caller has a transaction block reservation. Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:12 +02:00
Christoph Hellwig	405ee87c69	xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc xfs_bmap_exact_minlen_extent_alloc duplicates the args setup in xfs_bmap_btalloc. Switch to call it from xfs_bmap_btalloc after doing the basic setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Christoph Hellwig	b611fddc04	xfs: don't ifdef around the exact minlen allocations Exact minlen allocations only exist as an error injection tool for debug builds. Currently this is implemented using ifdefs, which means the code isn't even compiled for non-XFS_DEBUG builds. Enhance the compile test coverage by always building the code and use the compilers' dead code elimination to remove it from the generated binary instead. The only downside is that the alloc_minlen_only field is unconditionally added to struct xfs_alloc_args now, but by moving it around and packing it tightly this doesn't actually increase the size of the structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Christoph Hellwig	865469cd41	xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate Userdata and metadata allocations end up in the same allocation helpers. Remove the separate xfs_bmap_alloc_userdata function to make this more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Christoph Hellwig	b3f4e84e2f	xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname Just like xfs_attr3_leaf_split, xfs_attr_node_try_addname can return -ENOSPC both for an actual failure to allocate a disk block, but also to signal the caller to convert the format of the attr fork. Use magic 1 to ask for the conversion here as well. Note that unlike the similar issue in xfs_attr3_leaf_split, this one was only found by code review. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Christoph Hellwig	a5f73342ab	xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split xfs_attr3_leaf_split propagates the need for an extra btree split as -ENOSPC to it's only caller, but the same return value can also be returned from xfs_da_grow_inode when it fails to find free space. Distinguish the two cases by returning 1 for the extra split case instead of overloading -ENOSPC. This can be triggered relatively easily with the pending realtime group support and a file system with a lot of small zones that use metadata space on the main device. In this case every about 5-10th run of xfs/538 runs into the following assert: ASSERT(oldblk->magic == XFS_ATTR_LEAF_MAGIC); in xfs_attr3_leaf_split caused by an allocation failure. Note that the allocation failure is caused by another bug that will be fixed subsequently, but this commit at least sorts out the error handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Christoph Hellwig	346c1d46d4	xfs: return bool from xfs_attr3_leaf_add xfs_attr3_leaf_add only has two potential return values, indicating if the entry could be added or not. Replace the errno return with a bool so that ENOSPC from it can't easily be confused with a real ENOSPC. Remove the return value from the xfs_attr3_leaf_add_work helper entirely, as it always return 0. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Christoph Hellwig	b1c649da15	xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname xfs_attr_leaf_try_add is only called by xfs_attr_leaf_addname, and merging the two will simplify a following error handling fix. To facilitate this move the remote block state save/restore helpers up in the file so that they don't need forward declarations now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Uros Bizjak	20195d011c	xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate() Use !try_cmpxchg instead of cmpxchg (ptr, old, new) != old in xlog_cil_insert_pcp_aggregate(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. Also, try_cmpxchg implicitly assigns old ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Yan Zhen	6148b77960	xfs: scrub: convert comma to semicolon Replace a comma between expression statements by a semicolon. Signed-off-by: Yan Zhen <yanzhen@vivo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Zhang Zekun	f6225eebd7	xfs: Remove empty declartion in header file The definition of xfs_attr_use_log_assist() has been removed since commit `d9c61ccb3b` ("xfs: move xfs_attr_use_log_assist out of xfs_log.c"). So, Remove the empty declartion in header files. Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-10-07 08:00:11 +02:00
Al Viro	5f60d5f6bb	move asm/unaligned.h to linux/unaligned.h asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h	2024-10-02 17:23:23 -04:00
Linus Torvalds	f8ffbc365f	struct fd layout change (and conversion to accessor helpers) -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ 63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d 592+iDgLsema/H/0/CqfqlaNtDNY8Q0= =HUl5 -----END PGP SIGNATURE----- Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.	2024-09-23 09:35:36 -07:00
Linus Torvalds	617a814f14	ALong with the usual shower of singleton patches, notable patch series in this pull request are: "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. "mm: split underused THPs" from Yu Zhao. Improve THP=always policy - this was overprovisioning THPs in sparsely accessed memory areas. "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7 AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo= =s0T+ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...	2024-09-21 07:29:05 -07:00
Linus Torvalds	171754c380	vfs-6.12.blocksize -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvwAKCRCRxhvAZXjc ohg3APwJWQnqFlBddcRl4yrPJ/cgcYSYAOdHb+E+blomSwdxcwEAmwsnLPNQOtw2 rxKvQfZqhVT437bl7RpPPZrHGxwTng8= =6v1r -----END PGP SIGNATURE----- Merge tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs blocksize updates from Christian Brauner: "This contains the vfs infrastructure as well as the xfs bits to enable support for block sizes (bs) larger than page sizes (ps) plus a few fixes to related infrastructure. There has been efforts over the last 16 years to enable enable Large Block Sizes (LBS), that is block sizes in filesystems where bs > page size. Through these efforts we have learned that one of the main blockers to supporting bs > ps in filesystems has been a way to allocate pages that are at least the filesystem block size on the page cache where bs > ps. Thanks to various previous efforts it is possible to support bs > ps in XFS with only a few changes in XFS itself. Most changes are to the page cache to support minimum order folio support for the target block size on the filesystem. A motivation for Large Block Sizes today is to support high-capacity (large amount of Terabytes) QLC SSDs where the internal Indirection Unit (IU) are typically greater than 4k to help reduce DRAM and so in turn cost and space. In practice this then allows different architectures to use a base page size of 4k while still enabling support for block sizes aligned to the larger IUs by relying on high order folios on the page cache when needed. It also allows to take advantage of the drive's support for atomics larger than 4k with buffered IO support in Linux. As described this year at LSFMM, supporting large atomics greater than 4k enables databases to remove the need to rely on their own journaling, so they can disable double buffered writes, which is a feature different cloud providers are already enabling through custom storage solutions" * tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (22 commits) Documentation: iomap: fix a typo iomap: remove the iomap_file_buffered_write_punch_delalloc return value iomap: pass the iomap to the punch callback iomap: pass flags to iomap_file_buffered_write_punch_delalloc iomap: improve shared block detection in iomap_unshare_iter iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release docs:filesystems: fix spelling and grammar mistakes in iomap design page filemap: fix htmldoc warning for mapping_align_index() iomap: make zero range flush conditional on unwritten mappings iomap: fix handling of dirty folios over unwritten extents iomap: add a private argument for iomap_file_buffered_write iomap: remove set_memor_ro() on zero page xfs: enable block size larger than page size support xfs: make the calculation generic in xfs_sb_validate_fsb_count() xfs: expose block size in stat xfs: use kvmalloc for xattr buffers iomap: fix iomap_dio_zero() for fs bs > system page size filemap: cap PTE range to be created to allowed zero fill in folio_map_range() mm: split a folio in minimum folio order chunks readahead: allocate folios with mapping_min_order in readahead ...	2024-09-20 17:53:17 -07:00
Linus Torvalds	8751b21ad9	New code for 6.12: * Introduce new ioctls to exchange contents of two files. The first ioctl does the preparation work to exchange the contents of two files while the second ioctl performs the actual exchange if the target file has not been changed since a given sampling point. * Fixes - Fix bugs associated with calculating the maximum range of realtime extents to scan for free space. - Copy keys instead of records when resizing the incore BMBT root block. - Do not report FITRIMming more bytes than possibly exist in the filesystem. - Modify xfs_fs.h to prevent C++ compilation errors. - Do not over eagerly free post-EOF speculative preallocation. - Ensure st_blocks never goes to zero during COW writes * Cleanups/refactors - Use Xarray to hold per-AG data instead of a Radix tree. - Cleanup the following functionality, - Realtime bitmap. - Inode allocator. - Quota. - Inode rooted btree code. Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZtmy/gAKCRAH7y4RirJu 9H3GAP9CnoiZu+U/QmNL5T15fgNGs+BQDrUNbmbn3bNlmIZviQEAi3p+50OlT0nP lcQ/89NJ6uDFNBiphpkGajlp5vn2BQ0= =7wy/ -----END PGP SIGNATURE----- Merge tag 'xfs-6.12-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Chandan Babu: "New code: - Introduce new ioctls to exchange contents of two files. The first ioctl does the preparation work to exchange the contents of two files while the second ioctl performs the actual exchange if the target file has not been changed since a given sampling point. Fixes: - Fix bugs associated with calculating the maximum range of realtime extents to scan for free space. - Copy keys instead of records when resizing the incore BMBT root block. - Do not report FITRIMming more bytes than possibly exist in the filesystem. - Modify xfs_fs.h to prevent C++ compilation errors. - Do not over eagerly free post-EOF speculative preallocation. - Ensure st_blocks never goes to zero during COW writes Cleanups/refactors: - Use Xarray to hold per-AG data instead of a Radix tree. - Cleanups to: - realtime bitmap - inode allocator - quota - inode rooted btree code" * tag 'xfs-6.12-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (61 commits) xfs: ensure st_blocks never goes to zero during COW writes xfs: use xas_for_each_marked in xfs_reclaim_inodes_count xfs: convert perag lookup to xarray xfs: simplify tagged perag iteration xfs: move the tagged perag lookup helpers to xfs_icache.c xfs: use kfree_rcu_mightsleep to free the perag structures xfs: use LIST_HEAD() to simplify code xfs: Remove duplicate xfs_trans_priv.h header xfs: remove unnecessary check xfs: Use xfs set and clear mp state helpers xfs: reclaim speculative preallocations for append only files xfs: simplify extent lookup in xfs_can_free_eofblocks xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks xfs: only free posteof blocks on first close xfs: don't free post-EOF blocks on read close xfs: skip all of xfs_file_release when shut down xfs: don't bother returning errors from xfs_file_release xfs: refactor f_op->release handling xfs: remove the i_mode check in xfs_release xfs: standardize the btree maxrecs function parameters ...	2024-09-19 07:03:55 +02:00
Linus Torvalds	9ea925c806	Updates for timers and timekeeping: - Core: - Overhaul of posix-timers in preparation of removing the workaround for periodic timers which have signal delivery ignored. - Remove the historical extra jiffie in msleep() msleep() adds an extra jiffie to the timeout value to ensure minimal sleep time. The timer wheel ensures minimal sleep time since the large rewrite to a non-cascading wheel, but the extra jiffie in msleep() remained unnoticed. Remove it. - Make the timer slack handling correct for realtime tasks. The procfs interface is inconsistent and does neither reflect reality nor conforms to the man page. Show the correct 0 slack for real time tasks and enforce it at the core level instead of having inconsistent individual checks in various timer setup functions. - The usual set of updates and enhancements all over the place. - Drivers: - Allow the ACPI PM timer to be turned off during suspend - No new drivers - The usual updates and enhancements in various drivers -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn7jQTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYobqnD/9COlU0nwsulABI/aNIrsh6iYvnCC9v 14CcNta7Qn+157Wfw9BWOyHdNhR1/fPCXE8jJ71zTyIOeW27HV2JyTtxTwe9ZcdK ViHAaj7YcIjcVUEC3StCoRCPnvLslEw4qJA5AOQuDyMivdQn+YVa2c0baJxKaXZt xk4HZdMj4NAS0jRKnoZSwtKW/+Oz6rR4GAWrZo+Zs1/8ur3HfqnQfi8lJ1hJtLLW V7XDCVRvamVi6Ah3ocYPPp/1P6yeQDA1ge9aMddqaza5STWISXRtSnFMUmYP3rbS FaL8TyL+ilfny8pkGB2WlG6nLuSbtvogtdEh1gG1k1RmZt44kAtk8ba/KiWFPBSb zK9cjojRMBS71f9G4kmb5F4rnXoLsg1YbD1Nzhz3wq2Cs1Z90dc2QwMren0zoQ1x Fn56ueRyAiagBlnrSaKyso/2RvqJTNoSdi3RkpjYeAph0UoDCqvTvKjGAf1mWiw1 T/1lUWSVqWHnzZbM7XXzzajIN9bl6A7bbqlcAJ2O9vZIDt7273DG+bQym9Vh6Why 0LTGGERHxzKBsG7WRg+2Gmvv6S18UPKRo8tLtlA758rHlFuPTZCShWrIriwSNl1K Hxon+d4BparSnm1h9W/NHPKJA574UbWRCBjdk58IkAj8DxZZY4ORD9SMP+ggkV7G F6p9cgoDNP9KFg== =jE0N -----END PGP SIGNATURE----- Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Core: - Overhaul of posix-timers in preparation of removing the workaround for periodic timers which have signal delivery ignored. - Remove the historical extra jiffie in msleep() msleep() adds an extra jiffie to the timeout value to ensure minimal sleep time. The timer wheel ensures minimal sleep time since the large rewrite to a non-cascading wheel, but the extra jiffie in msleep() remained unnoticed. Remove it. - Make the timer slack handling correct for realtime tasks. The procfs interface is inconsistent and does neither reflect reality nor conforms to the man page. Show the correct 0 slack for real time tasks and enforce it at the core level instead of having inconsistent individual checks in various timer setup functions. - The usual set of updates and enhancements all over the place. Drivers: - Allow the ACPI PM timer to be turned off during suspend - No new drivers - The usual updates and enhancements in various drivers" * tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits) ntp: Make sure RTC is synchronized when time goes backwards treewide: Fix wrong singular form of jiffies in comments cpu: Use already existing usleep_range() timers: Rename next_expiry_recalc() to be unique platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function clocksource/drivers/jcore: Use request_percpu_irq() clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init() clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended clocksource: acpi_pm: Add external callback for suspend/resume clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped() dt-bindings: timer: rockchip: Add rk3576 compatible timers: Annotate possible non critical data race of next_expiry timers: Remove historical extra jiffie for timeout in msleep() hrtimer: Use and report correct timerslack values for realtime tasks hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse. timers: Add sparse annotation for timer_sync_wait_running(). signal: Replace BUG_ON()s ...	2024-09-17 07:25:37 +02:00
Linus Torvalds	ee25861f26	vfs-6.12.fallocate -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEwAAKCRCRxhvAZXjc omD7AQCZuWPXkEGYFD37MJZuRXNEoq7Tuj6yd0O2b5khUpzvyAD+MPuthGiCMPsu voPpUP83x7T0D3JsEsCAXtNeVRcIBQI= =xTs6 -----END PGP SIGNATURE----- Merge tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fallocate updates from Christian Brauner: "This contains work to try and cleanup some the fallocate mode handling. Currently, it confusingly mixes operation modes and an optional flag. The work here tries to better define operation modes and optional flags allowing the core and filesystem code to use switch statements to switch on the operation mode" * tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: refactor xfs_file_fallocate xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_space xfs: call xfs_flush_unmap_range from xfs_free_file_space fs: sort out the fallocate mode vs flag mess ext4: remove tracing for FALLOC_FL_NO_HIDE_STALE block: remove checks for FALLOC_FL_NO_HIDE_STALE	2024-09-16 09:34:08 +02:00
Thomas Gleixner	2f7eedca6c	Merge branch 'linus' into timers/core To update with the latest fixes.	2024-09-10 13:49:53 +02:00
Christoph Hellwig	4bceb9ba05	iomap: remove the iomap_file_buffered_write_punch_delalloc return value iomap_file_buffered_write_punch_delalloc can only return errors if either the ->punch callback returned an error, or if someone changed the API of mapping_seek_hole_data to return a negative error code that is not -ENXIO. As the only instance of ->punch never returns an error, an such an error would be fatal anyway remove the entire error propagation and don't return an error code from iomap_file_buffered_write_punch_delalloc. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240910043949.3481298-6-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-10 11:14:15 +02:00
Christoph Hellwig	492f53758f	iomap: pass the iomap to the punch callback XFS will need to look at the flags in the iomap structure, so pass it down all the way to the callback. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240910043949.3481298-5-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-10 11:14:15 +02:00
Christoph Hellwig	11596dc3df	iomap: pass flags to iomap_file_buffered_write_punch_delalloc To fix short write error handling, We'll need to figure out what operation iomap_file_buffered_write_punch_delalloc is called for. Pass the flags argument on to it, and reorder the argument list to match that of ->iomap_end so that the compiler only has to add the new punch argument to the end of it instead of reshuffling the registers. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240910043949.3481298-4-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-10 11:14:14 +02:00
Rik van Riel	e1e4cfd01a	mm,tmpfs: consider end of file write in shmem_is_huge Take the end of a file write into consideration when deciding whether or not to use huge pages for tmpfs files when the tmpfs filesystem is mounted with huge=within_size This allows large writes that append to the end of a file to automatically use large pages. Doing 4MB sequential writes without fallocate to a 16GB tmpfs file with fio. The numbers without THP or with huge=always stay the same, but the performance with huge=within_size now matches that of huge=always. huge before after 4kB pages 1560 MB/s 1560 MB/s within_size 1560 MB/s 4720 MB/s always: 4720 MB/s 4720 MB/s [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20240903111928.7171e60c@imladris.surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-09 16:39:12 -07:00
Anna-Maria Behnsen	bd7c8ff9fe	treewide: Fix wrong singular form of jiffies in comments There are several comments all over the place, which uses a wrong singular form of jiffies. Replace 'jiffie' by 'jiffy'. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de	2024-09-08 20:47:40 +02:00
Brian Foster	c5c810b94c	iomap: fix handling of dirty folios over unwritten extents The iomap zero range implementation doesn't properly handle dirty pagecache over unwritten mappings. It skips such mappings as if they were pre-zeroed. If some part of an unwritten mapping is dirty in pagecache from a previous write, the data in cache should be zeroed as well. Instead, the data is left in cache and creates a stale data exposure problem if writeback occurs sometime after the zero range. Most callers are unaffected by this because the higher level filesystem contexts that call zero range typically perform a filemap flush of the target range for other reasons. A couple contexts that don't otherwise need to flush are write file size extension and truncate in XFS. The former path is currently susceptible to the stale data exposure problem and the latter performs a flush specifically to work around it. This is clearly inconsistent and incomplete. As a first step toward correcting behavior, lift the XFS workaround to iomap_zero_range() and unconditionally flush the range before the zero range operation proceeds. While this appears to be a bit of a big hammer, most all users already do this from calling context save for the couple of exceptions noted above. Future patches will optimize or elide this flush while maintaining functional correctness. Fixes: `ae259a9c85` ("fs: introduce iomap infrastructure") Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20240830145634.138439-2-bfoster@redhat.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-03 15:01:24 +02:00
Josef Bacik	31754ea6cb	iomap: add a private argument for iomap_file_buffered_write In order to switch fuse over to using iomap for buffered writes we need to be able to have the struct file for the original write, in case we have to read in the page to make it uptodate. Handle this by using the existing private field in the iomap_iter, and add the argument to iomap_file_buffered_write. This will allow us to pass the file in through the iomap buffered write path, and is flexible for any other file systems needs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/7f55c7c32275004ba00cddf862d970e6e633f750.1724755651.git.josef@toxicpanda.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-03 15:01:23 +02:00
Pankaj Raghav	7df7c204c6	xfs: enable block size larger than page size support Page cache now has the ability to have a minimum order when allocating a folio which is a prerequisite to add support for block size > page size. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20240827-xfs-fix-wformat-bs-gt-ps-v1-1-aec6717609e0@kernel.org # fix folded Link: https://lore.kernel.org/r/20240822135018.1931258-11-kernel@pankajraghav.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-03 15:00:52 +02:00
Christoph Hellwig	90fa22da6d	xfs: ensure st_blocks never goes to zero during COW writes COW writes remove the amount overwritten either directly for delalloc reservations, or in earlier deferred transactions than adding the new amount back in the bmap map transaction. This means st_blocks on an inode where all data is overwritten using the COW path can temporarily show a 0 st_blocks. This can easily be reproduced with the pending zoned device support where all writes use this path and trips the check in generic/615, but could also happen on a reflink file without that. Fix this by temporarily add the pending blocks to be mapped to i_delayed_blks while the item is queued. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:47 +05:30
Christoph Hellwig	866cf1dd3d	xfs: use xas_for_each_marked in xfs_reclaim_inodes_count xfs_reclaim_inodes_count iterates over all AGs to sum up the reclaimable inodes counts. There is no point in grabbing a reference to the them or unlock the RCU critical section for each iteration, so switch to the more efficient xas_for_each_marked iterator. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:46 +05:30
Christoph Hellwig	32fa4059fe	xfs: convert perag lookup to xarray Convert the perag lookup from the legacy radix tree to the xarray, which allows for much nicer iteration and bulk lookup semantics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:46 +05:30
Christoph Hellwig	f9ffd095c8	xfs: simplify tagged perag iteration Pass the old perag structure to the tagged loop helpers so that they can grab the old agno before releasing the reference. This removes the need to separately track the agno and the iterator macro, and thus also obsoletes the for_each_perag_tag syntactic sugar. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:44 +05:30
Christoph Hellwig	f48f0a8e00	xfs: move the tagged perag lookup helpers to xfs_icache.c The tagged perag helpers are only used in xfs_icache.c in the kernel code and not at all in xfsprogs. Move them to xfs_icache.c in preparation for switching to an xarray, for which I have no plan to implement the tagged lookup functions for userspace. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:43 +05:30
Christoph Hellwig	4ef7c6d39d	xfs: use kfree_rcu_mightsleep to free the perag structures Using the kfree_rcu_mightsleep is simpler and removes the need for a rcu_head in the perag structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:43 +05:30
Hongbo Li	70045dafdf	xfs: use LIST_HEAD() to simplify code list_head can be initialized automatically with LIST_HEAD() instead of calling INIT_LIST_HEAD(). Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:42 +05:30
Jiapeng Chong	9db384feea	xfs: Remove duplicate xfs_trans_priv.h header ./fs/xfs/libxfs/xfs_defer.c: xfs_trans_priv.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9491 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:41 +05:30
Dan Carpenter	fb8b941c75	xfs: remove unnecessary check We checked that "pip" is non-NULL at the start of the if else statement so there is no need to check again here. Delete the check. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:40 +05:30
John Garry	ca57120dfe	xfs: Use xfs set and clear mp state helpers Use the set and clear mp state helpers instead of open-coding. It is noted that in some instances calls to atomic operation set_bit() and clear_bit() are being replaced with test_and_set_bit() and test_and_clear_bit(), respectively, as there is no specific helpers for set_bit() and clear_bit() only. However should be ok, as we are just ignoring the returned value from those "test" variants. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:39 +05:30
Christoph Hellwig	9372dce08b	xfs: reclaim speculative preallocations for append only files The XFS XFS_DIFLAG_APPEND maps to the VFS S_APPEND flag, which forbids writes that don't append at the current EOF. But the commit originally adding XFS_DIFLAG_APPEND support (commit a23321e766d in xfs xfs-import repository) also checked it to skip releasing speculative preallocations, which doesn't make any sense. Another commit (`dd9f438e32` in the xfs-import repository) later extended that flag to also report these speculation preallocations which should not exist in getbmap. Remove these checks as nothing XFS_DIFLAG_APPEND implies that preallocations beyond EOF should exist, but explicitly check for XFS_DIFLAG_APPEND in xfs_file_release to bypass the algorithm that discard preallocations on the first close as append only files aren't expected to be written to only once. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:39 +05:30
Christoph Hellwig	11f4c3a53a	xfs: simplify extent lookup in xfs_can_free_eofblocks xfs_can_free_eofblocks just cares if there is an extent beyond EOF. Replace the call to xfs_bmapi_read with a xfs_iext_lookup_extent as we've already checked that extents are read in earlier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:38 +05:30
Christoph Hellwig	b717089efe	xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks If the XFS_EOFBLOCKS_RELEASED flag is set, we are not going to free the eofblocks, so don't bother locking the inode or performing the checks in xfs_can_free_eofblocks. Also switch to a test_and_set operation once the iolock has been acquire so that only the caller that sets it actually frees the post-EOF blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:38 +05:30
Darrick J. Wong	f1204d9645	xfs: only free posteof blocks on first close Certain workloads fragment files on XFS very badly, such as a software package that creates a number of threads, each of which repeatedly run the sequence: open a file, perform a synchronous write, and close the file, which defeats the speculative preallocation mechanism. We work around this problem by only deleting posteof blocks the /first/ time a file is closed to preserve the behavior that unpacking a tarball lays out files one after the other with no gaps. Signed-off-by: Darrick J. Wong <djwong@kernel.org> [hch: rebased, updated comment, renamed the flag] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:38 +05:30
Dave Chinner	816e3599ca	xfs: don't free post-EOF blocks on read close When we have a workload that does open/read/close in parallel with other allocation, the file becomes rapidly fragmented. This is due to close() calling xfs_file_release() and removing the speculative preallocation beyond EOF. Add a check for a writable context to xfs_file_release to skip the post-EOF block freeing (an the similarly pointless flushing on truncate down). Before: Test 1: sync write fragmentation counts /mnt/scratch/file.0: 919 /mnt/scratch/file.1: 916 /mnt/scratch/file.2: 919 /mnt/scratch/file.3: 920 /mnt/scratch/file.4: 920 /mnt/scratch/file.5: 921 /mnt/scratch/file.6: 916 /mnt/scratch/file.7: 918 After: Test 1: sync write fragmentation counts /mnt/scratch/file.0: 24 /mnt/scratch/file.1: 24 /mnt/scratch/file.2: 11 /mnt/scratch/file.3: 24 /mnt/scratch/file.4: 3 /mnt/scratch/file.5: 24 /mnt/scratch/file.6: 24 /mnt/scratch/file.7: 23 Signed-off-by: Dave Chinner <dchinner@redhat.com> [darrick: wordsmithing, fix commit message] Signed-off-by: Darrick J. Wong <djwong@kernel.org> [hch: ported to the new ->release code structure] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:38 +05:30
Christoph Hellwig	c741d79c1a	xfs: skip all of xfs_file_release when shut down There is no point in trying to free post-EOF blocks when the file system is shutdown, as it will just error out ASAP. Instead return instantly when xfs_file_release is called on a shut down file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:38 +05:30
Christoph Hellwig	98e44e2bc0	xfs: don't bother returning errors from xfs_file_release While ->release returns int, the only caller ignores the return value. As we're only doing cleanup work there isn't much of a point in return a value to start with, so just document the situation instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:38 +05:30
Christoph Hellwig	5d3ca62611	xfs: refactor f_op->release handling Currently f_op->release is split in not very obvious ways. Fix that by folding xfs_release into xfs_file_release. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:37 +05:30
Christoph Hellwig	6e13dbebd5	xfs: remove the i_mode check in xfs_release xfs_release is only called from xfs_file_release, which is wired up as the f_op->release handler for regular files only. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-09-03 10:07:37 +05:30
Pankaj Raghav	cebf9dacd5	xfs: make the calculation generic in xfs_sb_validate_fsb_count() Instead of assuming that PAGE_SHIFT is always higher than the blocklog, make the calculation generic so that page cache count can be calculated correctly for LBS. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240822135018.1931258-10-kernel@pankajraghav.com Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-02 16:19:44 +02:00
Pankaj Raghav	79012cfa00	xfs: expose block size in stat For block size larger than page size, the unit of efficient IO is the block size, not the page size. Leaving stat() to report PAGE_SIZE as the block size causes test programs like fsx to issue illegal ranges for operations that require block size alignment (e.g. fallocate() insert range). Hence update the preferred IO size to reflect the block size in this case. This change is based on a patch originally from Dave Chinner.[1] [1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/ Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20240822135018.1931258-9-kernel@pankajraghav.com Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-02 16:19:44 +02:00
Dave Chinner	de631e1a8b	xfs: use kvmalloc for xattr buffers Pankaj Raghav reported that when filesystem block size is larger than page size, the xattr code can use kmalloc() for high order allocations. This triggers a useless warning in the allocator as it is a __GFP_NOFAIL allocation here: static inline struct page rmqueue(struct zone preferred_zone, struct zone zone, unsigned int order, gfp_t gfp_flags, unsigned int alloc_flags, int migratetype) { struct page page; /* * We most definitely don't want callers attempting to * allocate greater than order-1 page units with __GFP_NOFAIL. */ >>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); ... Fix this by changing all these call sites to use kvmalloc(), which will strip the NOFAIL from the kmalloc attempt and if that fails will do a __GFP_NOFAIL vmalloc(). This is not an issue that productions systems will see as filesystems with block size > page size cannot be mounted by the kernel; Pankaj is developing this functionality right now. Reported-by: Pankaj Raghav <kernel@pankajraghav.com> Fixes: `f078d4ea82` ("xfs: convert kmem_alloc() to kmalloc()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20240822135018.1931258-8-kernel@pankajraghav.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-02 16:19:44 +02:00
Danilo Krummrich	590b9d576c	mm: kvmalloc: align kvrealloc() with krealloc() Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior: - krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation. - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead. - krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation. Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects. Besides that, implementing kvrealloc() by making use of krealloc() and vrealloc() provides oppertunities to grow (and shrink) allocations more efficiently. For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation. [dakr@kernel.org: document concurrency restrictions] Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org [dakr@kernel.org: disable KASAN when switching to vmalloc] Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org [dakr@kernel.org: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Christian König <christian.koenig@amd.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-09-01 20:25:44 -07:00
Darrick J. Wong	411a71256d	xfs: standardize the btree maxrecs function parameters Standardize the parameters in xfs_{alloc,bm,ino,rmap,refcount}bt_maxrecs so that we have consistent calling conventions. This doesn't affect the kernel that much, but enables us to clean up userspace a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:20 -07:00
Darrick J. Wong	79124b3740	xfs: replace shouty XFS_BM{BT,DR} macros Replace all the shouty bmap btree and bmap disk root macros with actual functions. sed \ -e 's/XFS_BMBT_BLOCK_LEN/xfs_bmbt_block_len/g' \ -e 's/XFS_BMBT_REC_ADDR/xfs_bmbt_rec_addr/g' \ -e 's/XFS_BMBT_KEY_ADDR/xfs_bmbt_key_addr/g' \ -e 's/XFS_BMBT_PTR_ADDR/xfs_bmbt_ptr_addr/g' \ -e 's/XFS_BMDR_REC_ADDR/xfs_bmdr_rec_addr/g' \ -e 's/XFS_BMDR_KEY_ADDR/xfs_bmdr_key_addr/g' \ -e 's/XFS_BMDR_PTR_ADDR/xfs_bmdr_ptr_addr/g' \ -e 's/XFS_BMAP_BROOT_PTR_ADDR/xfs_bmap_broot_ptr_addr/g' \ -e 's/XFS_BMAP_BROOT_SPACE_CALC/xfs_bmap_broot_space_calc/g' \ -e 's/XFS_BMAP_BROOT_SPACE/xfs_bmap_broot_space/g' \ -e 's/XFS_BMDR_SPACE_CALC/xfs_bmdr_space_calc/g' \ -e 's/XFS_BMAP_BMDR_SPACE/xfs_bmap_bmdr_space/g' \ -i $(git ls-files fs/xfs/.[ch] fs/xfs/libxfs/.[ch] fs/xfs/scrub/*.[ch]) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:20 -07:00
Darrick J. Wong	de55149b66	xfs: fix a sloppy memory handling bug in xfs_iroot_realloc While refactoring code, I noticed that when xfs_iroot_realloc tries to shrink a bmbt root block, it allocates a smaller new block and then copies "records" and pointers to the new block. However, bmbt root blocks cannot ever be leaves, which means that it's not technically correct to copy records. We /should/ be copying keys. Note that this has never resulted in actual memory corruption because sizeof(bmbt_rec) == (sizeof(bmbt_key) + sizeof(bmbt_ptr)). However, this will no longer be true when we start adding realtime rmap stuff, so fix this now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:20 -07:00
Darrick J. Wong	c460f0f1a2	xfs: fix FITRIM reporting again Don't report FITRIMming more bytes than possibly exist in the filesystem. Fixes: `410e8a18f8` ("xfs: don't bother reporting blocks trimmed via FITRIM") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:20 -07:00
Darrick J. Wong	64dfa18d6e	xfs: fix C++ compilation errors in xfs_fs.h Several people reported C++ compilation errors due to things that C compilers allow but C++ compilers do not. Fix both of these problems, and hope there aren't more of these brown paper bags in 2 months when we finally get these fixes through the process into a released xfsprogs. NOTE: I am submitting this bugfix over the objections of a former maintainer, who insists that we should remove this function from the published userspace ABI instead of fixing the C++ compilation errors. No deprecation period, no discussion, just a hard drop of an already provided and correct C function, which would be in contravention of Linus' rules. IOWs, removing ABI that have already shipped in a released kernel requires a careful deprecation period, so I will let that maintainer run that process. Reported-by: kernel@mattwhitlock.name Reported-by: sam@gentoo.org Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219203 Fixes: `233f4e12bb` ("xfs: add parent pointer ioctls") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:20 -07:00
Darrick J. Wong	2c4162be6c	xfs: refactor loading quota inodes in the regular case Create a helper function to load quota inodes in the case where the dqtype and the sb quota inode fields correspond. This is true for nearly all the iget callsites in the quota code, except for when we're switching the group and project quota inodes. We'll need this in subsequent patches to make the metadir handling less convoluted. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:20 -07:00
Darrick J. Wong	2ca7b9d7b8	xfs: move xfs_ioc_getfsmap out of xfs_ioctl.c Move this function out of xfs_ioctl.c to reduce the clutter in there, and make the entire getfsmap implementation self-contained in a single file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	516f91035c	xfs: rearrange xfs_fsmap.c a little bit The order of the functions in this file has gotten a little confusing over the years. Specifically, the two data device implementations (bnobt and rmapbt) could be adjacent in the source code instead of split in two by the logdev and rtdev fsmap implementations. We're about to add more functionality to this file, so rearrange things now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	33912286cb	xfs: replace m_rsumsize with m_rsumblocks Track the RT summary file size in blocks, just like the RT bitmap file. While we have users of both units, blocks are used slightly more often and this matches the bitmap file for consistency. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	1fc51cf11d	xfs: remove xfs_{rtbitmap,rtsummary}_wordcount xfs_rtbitmap_wordcount and xfs_rtsummary_wordcount are currently unused, so remove them to simplify refactoring other rtbitmap helpers. They can be added back or simply open coded when actually needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	0902819fe6	xfs: add xchk_setup_nothing and xchk_nothing helpers Add common helpers for no-op scrubbing methods. Signed-off-by: Darrick J. Wong <djwong@kernel.org> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	ec12f97f1b	xfs: make the rtalloc start hint a xfs_rtblock_t 0 is a valid start RT extent, and with pending changes it will become both more common and non-unique. Switch to pass a xfs_rtblock_t instead so that we can use NULLRTBLOCK to determine if a hint was set or not. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	b2dd85f414	xfs: factor out a xfs_rtallocate_align helper Split the code to calculate the aligned allocation request from xfs_bmap_rtalloc into a separate self-contained helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	fd048a1bb3	xfs: rework the rtalloc fallback handling xfs_rtallocate currently has two fallbacks, when an allocation fails: 1) drop the requested extent size alignment, if any, and retry 2) ignore the locality hint Oddly enough it does those in order, as trying a different location is more in line with what the user asked for, and does it in a very unstructured way. Lift the fallback to try to allocate without the locality hint into xfs_rtallocate to both perform them in a more sensible order and to clean up the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	a9f646af43	xfs: factor out a xfs_rtallocate helper Split out a helper from xfs_rtallocate that performs the actual allocation. This keeps the scope of the xfs_rtalloc_args structure contained, and prepares for rtgroups support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	1e21d1897f	xfs: clean up the ISVALID macro in xfs_bmap_adjacent Turn the ISVALID macro defined and used inside in xfs_bmap_adjacent that relies on implict context into a proper inline function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	df8b181f15	xfs: simplify xfs_rtalloc_query_range There isn't much of a good reason to pass the xfs_rtalloc_rec structures that describe extents to xfs_rtalloc_query_range as we really just want a lower and upper bound xfs_rtxnum_t. Pass the rtxnum directly and simply the interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	fa0fc38b25	xfs: remove xfs_rtb_to_rtxrem Simplify the number of block number conversion helpers by removing xfs_rtb_to_rtxrem. Any recent compiler is smart enough to eliminate the double divisions if using separate xfs_rtb_to_rtx and xfs_rtb_to_rtxoff calls. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	9e9be9840f	xfs: fix broken variable-sized allocation detection in xfs_rtallocate_extent_block This function tries to find a suitable free space extent starting from a particular rtbitmap block. Some time ago, I added a clamping function to prevent the free space scans from running off the end of the bitmap, but I didn't quite get the logic right. Let's say there's an allocation request with a minlen of 5 and a maxlen of 32 and we're scanning the last rtbitmap block. If we come within 4 rtx of the end of the rt volume, maxlen will get clamped to 4. If the next 3 rtx are free, we could have satisfied the allocation, but the code setting partial besti/bestlen for "minlen < maxlen" will think that we're doing a non-variable allocation and ignore it. The root of this problem is overwriting maxlen; I should have stuffed the results in a different variable, which would not have introduced this bug. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	74c234bbe5	xfs: reduce excessive clamping of maxlen in xfs_rtallocate_extent_near The near rt allocator employs two allocation strategies -- first it tries to allocate at exactly @start. If that fails, it will pivot back and forth around that starting point looking for an appropriately sized free space. However, I clamped maxlen ages ago to prevent the exact allocation scan from running off the end of the rt volume. This, I realize, was excessive. If the allocation request is (say) for 32 rtx but the start position is 5 rtx from the end of the volume, we clamp maxlen to 5. If the exact allocation fails, we then pivot back and forth looking for 5 rtx, even though the original intent was to try to get 32 rtx. If we then find 5 rtx when we could have gotten 32 rtx, we've not done as well as we could have. This may be moot if the caller immediately comes back for more space, but it might not be. Either way, we can do better here. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	62c3d24968	xfs: clean up xfs_rtallocate_extent_exact a bit Before we start doing more surgery on the rt allocator, let's clean up the exact allocator so that it doesn't change its arguments and uses the helper introduced in the previous patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	e6a74dcf9b	xfs: refactor aligning bestlen to prod There are two places in xfs_rtalloc.c where we want to make sure that a count of rt extents is aligned with a particular prod(uct) factor. In one spot, we actually use rounddown(), albeit unnecessarily if prod < 2. In the other case, we open-code this rounding inefficiently by promoting the 32-bit length value to a 64-bit value and then performing a 64-bit division to figure out the subtraction. Refactor this into a single helper that uses the correct types and division method for the type, and skips the division entirely unless prod is large enough to make a difference. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	e99aa0401e	xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block The loop conditional here is not quite correct because an rtbitmap block can represent rtextents beyond the end of the rt volume. There's no way that it makes sense to scan for free space beyond EOFS, so don't do it. This overrun has been present since v2.6.0. Also fix the type of bestlen, which was incorrectly converted. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	cb59233e82	xfs: don't return too-short extents from xfs_rtallocate_extent_block If xfs_rtallocate_extent_block is asked for a variable-sized allocation, it will try to return the best-sized free extent, which is apparently the largest one that it finds starting in this rtbitmap block. It will then trim the size of the extent as needed to align it with prod. However, it misses one thing -- rounding down the best-fit candidate to the required alignment could make the extent shorter than minlen. In the case where minlen > 1, we'd rather the caller relaxed its alignment requirements and tried again, as the allocator already supports that. Returning a too-short extent that causes xfs_bmapi_write to return ENOSR if there aren't enough nmaps to handle multiple new allocations, which can then cause filesystem shutdowns. I haven't seen this happen on any production systems, but then I don't think it's very common to set a per-file extent size hint on realtime files. I tripped it while working on the rtgroups feature and pounding on the realtime allocator enthusiastically. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	86a0264ef2	xfs: ensure rtx mask/shift are correct after growfs When growfs sets an extent size, it doesn't updated the m_rtxblklog and m_rtxblkmask values, which could lead to incorrect usage of them if they were set before and can't be used for the new extent size. Add a xfs_mount_sb_set_rextsize helper that updates the two fields, and also use it when calculating the new RT geometry instead of disabling the optimization there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	a18a69bbec	xfs: use the recalculated transaction reservation in xfs_growfs_rt_bmblock After going great length to calculate the transaction reservation for the new geometry, we should also use it to allocate the transaction it was calculated for. Fixes: `578bd4ce71` ("xfs: recompute growfsrtfree transaction reservation while growing rt volume") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	0a59e4f3e1	xfs: push transaction join out of xfs_rtbitmap_lock and xfs_rtgroup_lock To prepare for being able to join an already locked rtbitmap inode to a transaction split out separate helpers for joining the transaction from the locking helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	2a95ffc44b	xfs: factor out rtbitmap/summary initialization helpers Add helpers to libxfs that can be shared by growfs and mkfs for initializing the rtbitmap and summary, and by passing the optional data pointer also by repair for rebuilding them. This will become even more useful when the rtgroups feature adds a metadata header to each block, which means even more shared code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: minor documentation and data advance tweaks] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	266e78aec4	xfs: factor out a xfs_last_rt_bmblock helper Add helper to calculate the last currently used rt bitmap block to better structure the growfs code and prepare for future changes to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	7996f10ce6	xfs: factor out a xfs_growfs_rt_bmblock helper Add a helper to contain the per-rtbitmap block logic in xfs_growfs_rt. Note that this helper now allocates a new fake mount structure for each rtbitmap block iteration instead of reusing the memory for an entire growfs call. Compared to all the other work done when freeing the blocks the overhead for this is in the noise and it keeps the code nicely modular. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	c8e5a0bfe0	xfs: push the calls to xfs_rtallocate_range out to xfs_bmap_rtalloc Currently the various low-level RT allocator functions call into xfs_rtallocate_range directly, which ties them into the locking protocol for the RT bitmap. As these helpers already return the allocated range, lift the call to xfs_rtallocate_range into xfs_bmap_rtalloc so that it happens as high as possible in the stack, which will simplify future changes to the locking protocol. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	237130564e	xfs: cleanup the calling convention for xfs_rtpick_extent xfs_rtpick_extent never returns an error. Do away with the error return and directly return the picked extent instead of doing that through a call by reference argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	b4781eea68	xfs: add bounds checking to xfs_rt{bitmap,summary}_read_buf Add a corruption check for passing an invalid block number, which is a lot easier to understand than the xfs_bmapi_read failure later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	6d2db12d56	xfs: assert a valid limit in xfs_rtfind_forw Protect against developers passing stupid limits when refactoring the RT code once again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	119c65e56b	xfs: remove the limit argument to xfs_rtfind_back All callers pass a 0 limit to xfs_rtfind_back, so remove the argument and hard code it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	3cb30d5162	xfs: make the RT rsum_cache mandatory Currently the RT mount code simply ignores an allocation failure for the rsum_cache. The code mostly works fine with it, but not having it leads to nasty corner cases in the growfs code that we don't really handle well. Switch to failing the mount if we can't allocate the memory, the file system would not exactly be useful in such a constrained environment to start with. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	6529eef810	xfs: factor out a xfs_validate_rt_geometry helper Split the RT geometry validation in the early mount code into a helper than can be reused by repair (from which this code was apparently originally stolen anyway). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: u64 return value for calc_rbmblocks] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	021d9c107e	xfs: remove xfs_validate_rtextents Replace xfs_validate_rtextents with an open coded check for 0 rtextents. The name for the function implies it does a lot more than a zero check, which is more obvious when open coded. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	390b4775d6	xfs: pass the icreate args object to xfs_dialloc Pass the xfs_icreate_args object to xfs_dialloc since we can extract the relevant mode (really just the file type) and parent inumber from there. This simplifies the calling convention in preparation for the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	feb09b727b	xfs: match on the global RT inode numbers in xfs_is_metadata_inode Match the inode number instead of the inode pointers, as the inode pointers in the superblock will go away soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: port to my tree, make the parameter a const pointer] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	05aba1953f	xfs: validate inumber in xfs_iget Actually use the inumber validator to check the argument passed in here. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	398597c3ef	xfs: introduce new file range commit ioctls This patch introduces two more new ioctls to manage atomic updates to file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE does, but with the additional requirement that file2 cannot have changed since some sampling point. The start-commit ioctl performs the sampling of file attributes. Note: This patch currently samples i_ctime during START_COMMIT and checks that it hasn't changed during COMMIT_RANGE. This isn't entirely safe in kernels prior to 6.12 because ctime only had coarse grained granularity and very fast updates could collide with a COMMIT_RANGE. With the multi-granularity ctime introduced by Jeff Layton, it's now possible to update ctime such that this does not happen. It is critical, then, that this patch must not be backported to any kernel that does not support fine-grained file change timestamps. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Christoph Hellwig	4acaddf5d1	xfs: refactor xfs_file_fallocate Refactor xfs_file_fallocate into separate helpers for each mode, two factors for i_size handling and a single switch statement over the supported modes. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-7-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-28 16:53:58 +02:00
Christoph Hellwig	72f4d52570	xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_space Move the xfs_is_always_cow_inode check from the caller into xfs_alloc_file_space to prepare for refactoring of xfs_file_fallocate. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-6-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-28 16:53:58 +02:00
Christoph Hellwig	1df1d3b2dc	xfs: call xfs_flush_unmap_range from xfs_free_file_space Call xfs_flush_unmap_range from xfs_free_file_space so that xfs_file_fallocate doesn't have to predict which mode will call it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-5-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-28 16:53:58 +02:00
Darrick J. Wong	a24cae8fc1	xfs: reset rootdir extent size hint after growfsrt If growfsrt is run on a filesystem that doesn't have a rt volume, it's possible to change the rt extent size. If the root directory was previously set up with an inherited extent size hint and rtinherit, it's possible that the hint is no longer a multiple of the rt extent size. Although the verifiers don't complain about this, xfs_repair will, so if we detect this situation, log the root directory to clean it up. This is still racy, but it's better than nothing. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-27 18:32:14 +05:30
Darrick J. Wong	16e1fbdce9	xfs: take m_growlock when running growfsrt Take the grow lock when we're expanding the realtime volume, like we do for the other growfs calls. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-27 18:32:14 +05:30
Zizhi Wo	ca6448aed4	xfs: Fix missing interval for missing_owner in xfs fsmap In the fsmap query of xfs, there is an interval missing problem: [root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16 2: 253:16 [24..39]: inode btree 0 (24..39) 16 3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8 4: 253:16 [48..55]: refcount btree 0 (48..55) 8 5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48 6: 253:16 [104..127]: free space 0 (104..127) 24 ...... BUG: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt [root@fedora ~]# Normally, we should be able to get [104, 107), but we got nothing. The problem is caused by shifting. The query for the problem-triggered scenario is for the missing_owner interval (e.g. freespace in rmapbt/ unknown space in bnobt), which is obtained by subtraction (gap). For this scenario, the interval is obtained by info->last. However, rec_daddr is calculated based on the start_block recorded in key[1], which is converted by calling XFS_BB_TO_FSBT. Then if rec_daddr does not exceed info->next_daddr, which means keys[1].fmr_physical >> (mp)->m_blkbb_log <= info->next_daddr, no records will be displayed. In the above example, 104 >> (mp)->m_blkbb_log = 12 and 107 >> (mp)->m_blkbb_log = 12, so the two are reduced to 0 and the gap is ignored: before calculate ----------------> after shifting 104(st) 107(ed) 12(st/ed) \|---------\| \| sector size block size Resolve this issue by introducing the "end_daddr" field in xfs_getfsmap_info. This records \|key[1].fmr_physical + key[1].length\| at the granularity of sector. If the current query is the last, the rec_daddr is end_daddr to prevent missing interval problems caused by shifting. We only need to focus on the last query, because xfs disks are internally aligned with disk blocksize that are powers of two and minimum 512, so there is no problem with shifting in previous queries. After applying this patch, the above problem have been solved: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [104..106]: free space 0 (104..106) 3 Fixes: `e89c041338` ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: limit the range of end_addr correctly] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-27 18:32:14 +05:30
Darrick J. Wong	6b35cc8d92	xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code Use XFS_BUF_DADDR_NULL (instead of a magic sentinel value) to mean "this field is null" like the rest of xfs. Cc: wozizhi@huawei.com Fixes: `e89c041338` ("xfs: implement the GETFSMAP ioctl") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-27 18:32:08 +05:30
Zizhi Wo	68415b349f	xfs: Fix the owner setting issue for rmap query in xfs fsmap I notice a rmap query bug in xfs_io fsmap: [root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16 2: 253:16 [24..39]: inode btree 0 (24..39) 16 3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8 4: 253:16 [48..55]: refcount btree 0 (48..55) 8 5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48 6: 253:16 [104..127]: free space 0 (104..127) 24 ...... Bug: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt [root@fedora ~]# Normally, we should be able to get one record, but we got nothing. The root cause of this problem lies in the incorrect setting of rm_owner in the rmap query. In the case of the initial query where the owner is not set, __xfs_getfsmap_datadev() first sets info->high.rm_owner to ULLONG_MAX. This is done to prevent any omissions when comparing rmap items. However, if the current ag is detected to be the last one, the function sets info's high_irec based on the provided key. If high->rm_owner is not specified, it should continue to be set to ULLONG_MAX; otherwise, there will be issues with interval omissions. For example, consider "start" and "end" within the same block. If high->rm_owner == 0, it will be smaller than the founded record in rmapbt, resulting in a query with no records. The main call stack is as follows: xfs_ioc_getfsmap xfs_getfsmap xfs_getfsmap_datadev_rmapbt __xfs_getfsmap_datadev info->high.rm_owner = ULLONG_MAX if (pag->pag_agno == end_ag) xfs_fsmap_owner_to_rmap // set info->high.rm_owner = 0 because fmr_owner == -1ULL dest->rm_owner = 0 // get nothing xfs_getfsmap_datadev_rmapbt_query The problem can be resolved by simply modify the xfs_fsmap_owner_to_rmap function internal logic to achieve. After applying this patch, the above problem have been solved: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 Fixes: `e89c041338` ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:52:27 +05:30
Darrick J. Wong	410e8a18f8	xfs: don't bother reporting blocks trimmed via FITRIM Don't bother reporting the number of bytes that we "trimmed" because the underlying storage isn't required to do anything(!) and failed discard IOs aren't reported to the caller anyway. It's not like userspace can use the reported value for anything useful like adjusting the offset parameter of the next call, and it's not like anyone ever wrote a manpage about FITRIM's out parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:52:13 +05:30
Dave Chinner	95179935be	xfs: xfs_finobt_count_blocks() walks the wrong btree As a result of the factoring in commit `14dd46cf31` ("xfs: split xfs_inobt_init_cursor"), mount started taking a long time on a user's filesystem. For Anders, this made mount times regress from under a second to over 15 minutes for a filesystem with only 30 million inodes in it. Anders bisected it down to the above commit, but even then the bug was not obvious. In this commit, over 20 calls to xfs_inobt_init_cursor() were modified, and some we modified to call a new function named xfs_finobt_init_cursor(). If that takes you a moment to reread those function names to see what the rename was, then you have realised why this bug wasn't spotted during review. And it wasn't spotted on inspection even after the bisect pointed at this commit - a single missing "f" isn't the easiest thing for a human eye to notice.... The result is that xfs_finobt_count_blocks() now incorrectly calls xfs_inobt_init_cursor() so it is now walking the inobt instead of the finobt. Hence when there are lots of allocated inodes in a filesystem, mount takes a -long- time run because it now walks a massive allocated inode btrees instead of the small, nearly empty free inode btrees. It also means all the finobt space reservations are wrong, so mount could potentially given ENOSPC on kernel upgrade. In hindsight, commit `14dd46cf31` should have been two commits - the first to convert the finobt callers to the new API, the second to modify the xfs_inobt_init_cursor() API for the inobt callers. That would have made the bug very obvious during review. Fixes: `14dd46cf31` ("xfs: split xfs_inobt_init_cursor") Reported-by: Anders Blomdell <anders.blomdell@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:52:00 +05:30
Darrick J. Wong	5335affcff	xfs: fix folio dirtying for XFILE_ALLOC callers willy pointed out that folio_mark_dirty is the correct function to use to mark an xfile folio dirty because it calls out to the mapping's aops to mark it dirty. For tmpfs this likely doesn't matter much since it currently uses nop_dirty_folio, but let's use the abstractions properly. Reported-by: willy@infradead.org Fixes: `6907e3c00a` ("xfs: add file_{get,put}_folio") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:51:27 +05:30
Darrick J. Wong	e21fea4ac3	xfs: fix di_onlink checking for V1/V2 inodes "KjellR" complained on IRC that an old V4 filesystem suddenly stopped mounting after upgrading from 6.9.11 to 6.10.3, with the following splat when trying to read the rt bitmap inode: 00000000: 49 4e 80 00 01 02 00 01 00 00 00 00 00 00 00 00 IN.............. 00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 00 00 00 00 00 00 00 00 43 d2 a9 da 21 0f d6 30 ........C...!..0 00000030: 43 d2 a9 da 21 0f d6 30 00 00 00 00 00 00 00 00 C...!..0........ 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 00 ................ 00000060: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ As Dave Chinner points out, this is a V1 inode with both di_onlink and di_nlink set to 1 and di_flushiter == 0. In other words, this inode was formatted this way by mkfs and hasn't been touched since then. Back in the old days of xfsprogs 3.2.3, I observed that libxfs_ialloc would set di_nlink, but if the filesystem didn't have NLINK, it would then set di_version = 1. libxfs_iflush_int later sees the V1 inode and copies the value of di_nlink to di_onlink without zeroing di_onlink. Eventually this filesystem must have been upgraded to support NLINK because 6.10 doesn't support !NLINK filesystems, which is how we tripped over this old behavior. The filesystem doesn't have a realtime section, so that's why the rtbitmap inode has never been touched. Fix this by removing the di_onlink/di_nlink checking for all V1/V2 inodes because this is a muddy mess. The V3 inode handling code has always supported NLINK and written di_onlink==0 so keep that check. The removal of the V1 inode handling code when we dropped support for !NLINK obscured this old behavior. Reported-by: kjell.m.randa@gmail.com Fixes: `40cb8613d6` ("xfs: check unused nlink fields in the ondisk inode") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:50:41 +05:30
Darrick J. Wong	8d16762047	xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set If a file has the S_DAX flag (aka fsdax access mode) set, we cannot allow users to change the realtime flag unless the datadev and rtdev both support fsdax access modes. Even if there are no extents allocated to the file, the setattr thread could be racing with another thread that has already started down the write code paths. Fixes: `ba23cba9b3` ("fs: allow per-device dax status checking for filesystems") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-14 21:20:24 +05:30
Darrick J. Wong	04d6dbb553	xfs: revert AIL TASK_KILLABLE threshold In commit `9adf40249e`, we changed the behavior of the AIL thread to set its own task state to KILLABLE whenever the timeout value is nonzero. Unfortunately, this missed the fact that xfsaild_push will return 50ms (aka a longish sleep) when we reach the push target or the AIL becomes empty, so xfsaild goes to sleep for a long period of time in uninterruptible D state. This results in artificially high load averages because KILLABLE processes are UNINTERRUPTIBLE, which contributes to load average even though the AIL is asleep waiting for someone to interrupt it. It's not blocked on IOs or anything, but people scrap ps for processes that look like they're stuck in D state, so restore the previous threshold. Fixes: `9adf40249e` ("xfs: AIL doesn't need manual pushing") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-14 21:19:34 +05:30
Darrick J. Wong	73c34b0b85	xfs: attr forks require attr, not attr2 It turns out that I misunderstood the difference between the attr and attr2 feature bits. "attr" means that at some point an attr fork was created somewhere in the filesystem. "attr2" means that inodes have variable-sized forks, but says nothing about whether or not there actually /are/ attr forks in the system. If we have an attr fork, we only need to check that attr is set. Fixes: `99d9d8d05d` ("xfs: scrub inode block mappings") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-14 21:19:14 +05:30
Al Viro	88a2f6468d	struct fd: representation change We want the compiler to see that fdput() on empty instance is a no-op. The emptiness check is that file reference is NULL, while fdput() is "fput() if FDPUT_FPUT is present in flags". The reason why fdput() on empty instance is a no-op is something compiler can't see - it's that we never generate instances with NULL file reference combined with non-zero flags. It's not that hard to deal with - the real primitives behind fdget() et.al. are returning an unsigned long value, unpacked by (inlined) __to_fd() into the current struct file * + int. The lower bits are used to store flags, while the rest encodes the pointer. Linus suggested that keeping this unsigned long around with the extractions done by inlined accessors should generate a sane code and that turns out to be the case. Namely, turning struct fd into a struct-wrapped unsinged long, with fd_empty(f) => unlikely(f.word == 0) fd_file(f) => (struct file *)(f.word & ~3) fdput(f) => if (f.word & 1) fput(fd_file(f)) ends up with compiler doing the right thing. The cost is the patch footprint, of course - we need to switch f.file to fd_file(f) all over the tree, and it's not doable with simple search and replace; there are false positives, etc. Note that the sole member of that structure is an opaque unsigned long - all accesses should be done via wrappers and I don't want to use a name that would invite manual casts to file pointers, etc. The value of that member is equal either to (unsigned long)p \| flags, p being an address of some struct file instance, or to 0 for an empty fd. For now the new predicate (fd_empty(f)) has no users; all the existing checks have form (!fd_file(f)). We will convert to fd_empty() use later; here we only define it (and tell the compiler that it's unlikely to return true). This commit only deals with representation change; there will be followups. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-08-12 22:01:05 -04:00
Al Viro	1da91ea87a	introduce fd_file(), convert all accessors to it. For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-08-12 22:00:43 -04:00
Chen Ni	7bf888fa26	xfs: convert comma to semicolon Replace a comma between expression statements by a semicolon. Fixes: `178b48d588` ("xfs: remove the for_each_xbitmap_ helpers") Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:34:18 +05:30
Chen Ni	8c2263b923	xfs: convert comma to semicolon Replace a comma between expression statements by a semicolon. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Fixes: `8f4b980ee6` ("xfs: pass the attr value to put_listent when possible") Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:32:53 +05:30
Julian Sun	af5d92f2fa	xfs: remove unused parameter in macro XFS_DQUOT_LOGRES In the macro definition of XFS_DQUOT_LOGRES, a parameter is accepted, but it is not used. Hence, it should be removed. This patch has only passed compilation test, but it should be fine. Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:29:31 +05:30
Darrick J. Wong	19ebc8f84e	xfs: fix file_path handling in tracepoints Since file_path() takes the output buffer as one of its arguments, we might as well have it format directly into the tracepoint's char array instead of wasting stack space. Fixes: `3934e8ebb7` ("xfs: create a big array data structure") Fixes: `5076a6040c` ("xfs: support in-memory buffer cache targets") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202403290419.HPcyvqZu-lkp@intel.com/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:27:23 +05:30
Eric Sandeen	39c1ddb064	xfs: allow SECURE namespace xattrs to use reserved block pool We got a report from the podman folks that selinux relabels that happen as part of their process were returning ENOSPC when the filesystem is completely full. This is because xattr changes reserve about 15 blocks for the worst case, but the common case is for selinux contexts to be the sole, in-inode xattr and consume no blocks. We already allow reserved space consumption for XFS_ATTR_ROOT for things such as ACLs, and SECURE namespace attributes are not so very different, so allow them to use the reserved space as well. Code-comment-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> V2: Remove local variable, add comment. V3: Add Dave's preferred comment V4: Spelling and comment beautification	2024-07-29 09:26:20 +05:30
Darrick J. Wong	80d3d33cdf	xfs: fix a memory leak kmemleak reported that we don't free the parent pointer names here if we found corruption. Fixes: `0d29a20fbd` ("xfs: scrub parent pointers") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:25:01 +05:30
Joel Granados	78eb4ea25c	sysctl: treewide: constify the ctl_table argument of proc_handlers const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script: ``` virtual patch @r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer\|interval)_handler\|sched_(rt\|rr)_handler\|rds_tcp_skbuf_handler\|proc_sctp_do_(hmac_alg\|rto_min\|rto_max\|udp_port\|alpha_beta\|auth\|probe_interval)"; @@ int func( - struct ctl_table ctl + const struct ctl_table ctl ,int write, void buffer, size_t lenp, loff_t ppos); @r2@ identifier func, ctl, write, buffer, lenp, ppos; @@ int func( - struct ctl_table ctl + const struct ctl_table ctl ,int write, void buffer, size_t lenp, loff_t ppos) { ... } @r3@ identifier func; @@ int func( - struct ctl_table * + const struct ctl_table * ,int , void , size_t , loff_t ); @r4@ identifier func, ctl; @@ int func( - struct ctl_table ctl + const struct ctl_table ctl ,int , void , size_t , loff_t ); @r5@ identifier func, write, buffer, lenp, ppos; @@ int func( - struct ctl_table * + const struct ctl_table * ,int write, void buffer, size_t lenp, loff_t ppos); ``` Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted. * The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration. Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-07-24 20:59:29 +02:00
Linus Torvalds	bf3aa9de7b	New code for 6.11: * Enable FITRIM on the realtime device. * Introduce byte-based grant head log reservation tracking instead of physical log location tracking. This allows grant head to track a full 64 bit bytes space and hence overcome the limit of 4GB indexing that has been present until now. * Fixes - xfs_flush_unmap_range() and xfs_prepare_shift() should consider RT extents in the flush unmap range. - Implement bounds check when traversing log operations during log replay. - Prevent out of bounds access when traversing a directory data block. - Prevent incorrect ENOSPC when concurrently performing file creation and file writes. - Fix rtalloc rotoring when delalloc is in use * Cleanups - Clean up I/O path inode locking helpers and the page fault handler. - xfs: hoist inode operations to libxfs in anticipation of the metadata inode directory feature, which maintains a directory tree of metadata inodes. This will be necessary for further enhancements to the realtime feature, subvolume support. - Clean up some warts in the extent freeing log intent code. - Clean up the refcount and rmap intent code before adding support for realtime devices. - Provide the correct email address for sysfs ABI documentation. Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZo9pkwAKCRAH7y4RirJu 9EV+AQDBlX2AxTzKPsfb74qKaFgDpTdud8b1U779tijs4a6ZbwD8CvS40NXAjqmq R2j3wWQP3rkRxBusnStQ/9El20Q+WAI= =BcGP -----END PGP SIGNATURE----- Merge tag 'xfs-6.11-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Chandan Babu: "Major changes in this release are limited to enabling FITRIM on realtime devices and Byte-based grant head log reservation tracking. The remaining changes are limited to fixes and cleanups included in this pull request. Core: - Enable FITRIM on the realtime device - Introduce byte-based grant head log reservation tracking instead of physical log location tracking. This allows grant head to track a full 64 bit bytes space and hence overcome the limit of 4GB indexing that has been present until now Fixes: - xfs_flush_unmap_range() and xfs_prepare_shift() should consider RT extents in the flush unmap range - Implement bounds check when traversing log operations during log replay - Prevent out of bounds access when traversing a directory data block - Prevent incorrect ENOSPC when concurrently performing file creation and file writes - Fix rtalloc rotoring when delalloc is in use Cleanups: - Clean up I/O path inode locking helpers and the page fault handler - xfs: hoist inode operations to libxfs in anticipation of the metadata inode directory feature, which maintains a directory tree of metadata inodes. This will be necessary for further enhancements to the realtime feature, subvolume support - Clean up some warts in the extent freeing log intent code - Clean up the refcount and rmap intent code before adding support for realtime devices - Provide the correct email address for sysfs ABI documentation" * tag 'xfs-6.11-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (80 commits) xfs: fix rtalloc rotoring when delalloc is in use xfs: get rid of xfs_ag_resv_rmapbt_alloc xfs: skip flushing log items during push xfs: grant heads track byte counts, not LSNs xfs: pass the full grant head to accounting functions xfs: track log space pinned by the AIL xfs: collapse xlog_state_set_callback in caller xfs: l_last_sync_lsn is really AIL state xfs: ensure log tail is always up to date xfs: background AIL push should target physical space xfs: AIL doesn't need manual pushing xfs: move and rename xfs_trans_committed_bulk xfs: fix the contact address for the sysfs ABI documentation xfs: Avoid races with cnt_btree lastrec updates xfs: move xfs_refcount_update_defer_add to xfs_refcount_item.c xfs: simplify usage of the rcur local variable in xfs_refcount_finish_one xfs: don't bother calling xfs_refcount_finish_one_cleanup in xfs_refcount_finish_one xfs: reuse xfs_refcount_update_cancel_item xfs: add a ci_entry helper xfs: remove xfs_trans_set_refcount_flags ...	2024-07-17 12:57:48 -07:00
Linus Torvalds	4f5e249ec0	vfs-6.11.iomap -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEHLQAKCRCRxhvAZXjc ot3sAP9TBUM+vzUcQ5SVcUnSX+y3dhOGYnquORBbRc/Y6AzLMAEAu3TcsvdoaWfy 6ImUaju6iLqy9cCY3uDlNmJR16E4IgE= =Bwpy -----END PGP SIGNATURE----- Merge tag 'vfs-6.11.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull iomap updates from Christian Brauner: "This contains some minor work for the iomap subsystem: - Add documentation on the design of iomap and how to port to it - Optimize iomap_read_folio() - Bring back the change to iomap_write_end() to no increase i_size. This is accompanied by a change to xfs to reserve blocks for truncating large realtime inodes to avoid exposing stale data when iomap_write_end() stops increasing i_size" * tag 'vfs-6.11.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: don't increase i_size in iomap_write_end() xfs: reserve blocks for truncating large realtime inode Documentation: the design of iomap and how to port iomap: Optimize iomap_read_folio	2024-07-15 13:28:14 -07:00
Linus Torvalds	2aae1d67fd	vfs-6.11.inode -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEG2wAKCRCRxhvAZXjc ooW/AQDzyY+xNGt4OPMvlyFUHd5RcyiLsMhYrkKc3FaIFjesVgD+PFW5PPW12c0V Z4VHg9w1HDDuUn4XvELs7OXZpek7RgU= =eDC8 -----END PGP SIGNATURE----- Merge tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode / dentry updates from Christian Brauner: "This contains smaller performance improvements to inodes and dentries: inode: - Add rcu based inode lookup variants. They avoid one inode hash lock acquire in the common case thereby significantly reducing contention. We already support RCU-based operations but didn't take advantage of them during inode insertion. Callers of iget_locked() get the improvement without any code changes. Callers that need a custom callback can switch to iget5_locked_rcu() as e.g., did btrfs. With 20 threads each walking a dedicated 1000 dirs * 1000 files directory tree to stat(2) on a 32 core + 24GB ram vm: before: 3.54s user 892.30s system 1966% cpu 45.549 total after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%) Long-term we should pick up the effort to introduce more fine-grained locking and possibly improve on the currently used hash implementation. - Start zeroing i_state in inode_init_always() instead of doing it in individual filesystems. This allows us to remove an unneeded lock acquire in new_inode() and not burden individual filesystems with this. dcache: - Move d_lockref out of the area used by RCU lookup to avoid cacheline ping poing because the embedded name is sharing a cacheline with d_lockref. - Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end up with 128 bytes in total" * tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: fix dentry size vfs: move d_lockref out of the area used by RCU lookup bcachefs: remove now spurious i_state initialization xfs: remove now spurious i_state initialization in xfs_inode_alloc vfs: partially sanitize i_state zeroing on inode creation xfs: preserve i_state around inode_init_always in xfs_reinit_inode btrfs: use iget5_locked_rcu vfs: add rcu-based find_inode variants for iget ops	2024-07-15 11:39:44 -07:00
Christoph Hellwig	2bf6e35354	xfs: fix rtalloc rotoring when delalloc is in use If we're trying to allocate real space for a delalloc reservation at offset 0, we should use the rotor to spread files across the rt volume. Switch the rtalloc to use the XFS_ALLOC_INITIAL_USER_DATA flag that is set for any write at startoff to make it match the behavior for the main data device. Based on a patch from Darrick J. Wong. Fixes: `6a94b1acda` ("xfs: reinstate delalloc for RT inodes (if sb_rextsize == 1)") Reported-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-09 09:08:28 +05:30
Long Li	49cdc4e834	xfs: get rid of xfs_ag_resv_rmapbt_alloc The pag in xfs_ag_resv_rmapbt_alloc() is already held when the struct xfs_btree_cur is initialized in xfs_rmapbt_init_cursor(), so there is no need to get pag again. On the other hand, in xfs_rmapbt_free_block(), the similar function xfs_ag_resv_rmapbt_free() was removed in commit `92a005448f` ("xfs: get rid of unnecessary xfs_perag_{get,put} pairs"), xfs_ag_resv_rmapbt_alloc() was left because scrub used it, but now scrub has removed it. Therefore, we could get rid of xfs_ag_resv_rmapbt_alloc() just like the rmap free block, make the code cleaner. Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 14:36:13 +05:30
Dave Chinner	f3f7ae68a4	xfs: skip flushing log items during push The AIL pushing code spends a huge amount of time skipping over items that are already marked as flushing. It is not uncommon to see hundreds of thousands of items skipped every second due to inode clustering marking all the inodes in a cluster as flushing when the first one is flushed. However, to discover an item is already flushing and should be skipped we have to call the iop_push() method for it to try to flush the item. For inodes (where this matters most), we have to first check that inode is flushable first. We can optimise this overhead away by tracking whether the log item is flushing internally. This allows xfsaild_push() to check the log item directly for flushing state and immediately skip the log item. Whilst this doesn't remove the CPU cache misses for loading the log item, it does avoid the overhead of an indirect function call and the cache misses involved in accessing inode and backing cluster buffer structures to determine flushing state. When trying to flush hundreds of thousands of inodes each second, this CPU overhead saving adds up quickly. It's so noticeable that the biggest issue with pushing on the AIL on fast storage becomes the 10ms back-off wait when we hit enough pinned buffers to break out of the push loop but not enough for the AIL pushing to be considered stuck. This limits the xfsaild to about 70% total CPU usage, and on fast storage this isn't enough to keep the storage 100% busy. The xfsaild will block on IO submission on slow storage and so is self throttling - it does not need a backoff in the case where we are really just breaking out of the walk to submit the IO we have gathered. Further with no backoff we don't need to gather huge delwri lists to mitigate the impact of backoffs, so we can submit IO more frequently and reduce the time log items spend in flushing state by breaking out of the item push loop once we've gathered enough IO to batch submission effectively. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:47 +05:30
Dave Chinner	c1220522ef	xfs: grant heads track byte counts, not LSNs The grant heads in the log track the space reserved in the log for running transactions. They do this by tracking how far ahead of the tail that the reservation has reached, and the units for doing this are {cycle,bytes} for the reserve head rather than {cycle,blocks} which are normal used by LSNs. This is annoyingly complex because we have to split, crack and combined these tuples for any calculation we do to determine log space and targets. This is computationally expensive as well as difficult to do atomically and locklessly, as well as limiting the size of the log to 2^32 bytes. Really, though, all the grant heads are tracking is how much space is currently available for use in the log. We can track this as a simply byte count - we just don't care what the actual physical location in the log the head and tail are at, just how much space we have remaining before the head and tail overlap. So, convert the grant heads to track the byte reservations that are active rather than the current (cycle, offset) tuples. This means an empty log has zero bytes consumed, and a full log is when the reservations reach the size of the log minus the space consumed by the AIL. This greatly simplifies the accounting and checks for whether there is space available. We no longer need to crack or combine LSNs to determine how much space the log has left, nor do we need to look at the head or tail of the log to determine how close to full we are. There is, however, a complexity that needs to be handled. We know how much space is being tracked in the AIL now via log->l_tail_space and the log tickets track active reservations and return the unused portions to the grant heads when ungranted. Unfortunately, we don't track the used portion of the grant, so when we transfer log items from the CIL to the AIL, the space accounted to the grant heads is transferred to the log tail space. Hence when we move the AIL head forwards on item insert, we have to remove that space from the grant heads. We also remove the xlog_verify_grant_tail() debug function as it is no longer useful. The check it performs has been racy since delayed logging was introduced, but now it is clearly only detecting false positives so remove it. The result of this substantially simpler accounting algorithm is an increase in sustained transaction rate from ~1.3 million transactions/s to ~1.9 million transactions/s with no increase in CPU usage. We also remove the 32 bit space limitation on the grant heads, which will allow us to increase the journal size beyond 2GB in future. Note that this renames the sysfs files exposing the log grant space now that the values are exported in bytes. This allows xfstests to auto-detect the old or new ABI. [hch: move xlog_grant_sub_space out of line, update the xlog_grant_{add,sub}_space prototypes, rename the sysfs files to allow auto-detection in xfstests] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:47 +05:30
Dave Chinner	de302cea1e	xfs: pass the full grant head to accounting functions Because we are going to need them soon. API change only, no logic changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:47 +05:30
Dave Chinner	551bf13ba8	xfs: track log space pinned by the AIL Currently we track space used in the log by grant heads. These store the reserved space as a physical log location and combine both space reserved for future use with space already used in the log in a single variable. The amount of space consumed in the log is then calculated as the distance between the log tail and the grant head. The problem with tracking the grant head as a physical location comes from the fact that it tracks both log cycle count and offset into the log in bytes in a single 64 bit variable. because the cycle count on disk is a 32 bit number, this also limits the offset into the log to 32 bits. ANd because that is in bytes, we are limited to being able to track only 2GB of log space in the grant head. Hence to support larger physical logs, we need to track used space differently in the grant head. We no longer use the grant head for guiding AIL pushing, so the only thing it is now used for is determining if we've run out of reservation space via the calculation in xlog_space_left(). What we really need to do is move the grant heads away from tracking physical space in the log. The issue here is that space consumed in the log is not directly tracked by the current mechanism - the space consumed in the log by grant head reservations gets returned to the free pool by the tail of the log moving forward. i.e. the space isn't directly tracked or calculated, but the used grant space gets "freed" as the physical limits of the log are updated without actually needing to update the grant heads. Hence to move away from implicit, zero-update log space tracking we need to explicitly track the amount of physical space the log actually consumes separately to the in-memory reservations for operations that will be committed to the journal. Luckily, we already track the information we need to calculate this in the AIL itself. That is, the space currently consumed by the journal is the maximum LSN that the AIL has seen minus the current log tail. As we update both of these items dynamically as the head and tail of the log moves, we always know exactly how much space the journal consumes. This means that we also know exactly how much space the currently active reservations require, and exactly how much free space we have remaining for new reservations to be made. Most importantly, we know what these spaces are indepedently of the physical locations of the head and tail of the log. Hence by separating out the physical space consumed by the journal, we can now track reservations in the grant heads purely as a byte count, and the log can be considered full when the tail space + reservation space exceeds the size of the log. This means we can use the full 64 bits of grant head space for reservation space, completely removing the 32 bit byte count limitation on log size that they impose. Hence the first step in this conversion is to track and update the "log tail space" every time the AIL tail or maximum seen LSN changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:46 +05:30
Dave Chinner	be5abd323b	xfs: collapse xlog_state_set_callback in caller The function is called from a single place, and it isn't just setting the iclog state to XLOG_STATE_CALLBACK - it can mark iclogs clean, which moves them to states after CALLBACK. Hence the function is now badly named, and should just be folded into the caller where the iclog completion logic makes a whole lot more sense. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:46 +05:30
Dave Chinner	0dcd5a10d9	xfs: l_last_sync_lsn is really AIL state The current implementation of xlog_assign_tail_lsn() assumes that when the AIL is empty, the log tail matches the LSN of the last written commit record. This is recorded in xlog_state_set_callback() as log->l_last_sync_lsn when the iclog state changes to XLOG_STATE_CALLBACK. This change is then immediately followed by running the callbacks on the iclog which then insert the log items into the AIL at the "commit lsn" of that checkpoint. The AIL tracks log items via the start record LSN of the checkpoint, not the commit record LSN. This is because we can pipeline multiple checkpoints, and so the start record of checkpoint N+1 can be written before the commit record of checkpoint N. i.e: start N commit N +-------------+------------+----------------+ start N+1 commit N+1 The tail of the log cannot be moved to the LSN of commit N when all the items of that checkpoint are written back, because then the start record for N+1 is no longer in the active portion of the log and recovery will fail/corrupt the filesystem. Hence when all the log items in checkpoint N are written back, the tail of the log most now only move as far forwards as the start LSN of checkpoint N+1. Hence we cannot use the maximum start record LSN the AIL sees as a replacement the pointer to the current head of the on-disk log records. However, we currently only use the l_last_sync_lsn when the AIL is empty - when there is no start LSN remaining, the tail of the log moves to the LSN of the last commit record as this is where recovery needs to start searching for recoverable records. THe next checkpoint will have a start record LSN that is higher than l_last_sync_lsn, and so everything still works correctly when new checkpoints are written to an otherwise empty log. l_last_sync_lsn is an atomic variable because it is currently updated when an iclog with callbacks attached moves to the CALLBACK state. While we hold the icloglock at this point, we don't hold the AIL lock. When we assign the log tail, we hold the AIL lock, not the icloglock because we have to look up the AIL. Hence it is an atomic variable so it's not bound to a specific lock context. However, the iclog callbacks are only used for CIL checkpoints. We don't use callbacks with unmount record writes, so the l_last_sync_lsn variable only gets updated when we are processing CIL checkpoint callbacks. And those callbacks run under AIL lock contexts, not icloglock context. The CIL checkpoint already knows what the LSN of the iclog the commit record was written to (obtained when written into the iclog before submission) and so we can update the l_last_sync_lsn under the AIL lock in this callback. No other iclog callbacks will run until the currently executing one completes, and hence we can update the l_last_sync_lsn under the AIL lock safely. This means l_last_sync_lsn can move to the AIL as the "ail_head_lsn" and it can be used to replace the atomic l_last_sync_lsn in the iclog code. This makes tracking the log tail belong entirely to the AIL, rather than being smeared across log, iclog and AIL state and locking. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:46 +05:30
Dave Chinner	a07776ab81	xfs: ensure log tail is always up to date Whenever we write an iclog, we call xlog_assign_tail_lsn() to update the current tail before we write it into the iclog header. This means we have to take the AIL lock on every iclog write just to check if the tail of the log has moved. This doesn't avoid races with log tail updates - the log tail could move immediately after we assign the tail to the iclog header and hence by the time the iclog reaches stable storage the tail LSN has moved forward in memory. Hence the log tail LSN in the iclog header is really just a point in time snapshot of the current state of the AIL. With this in mind, if we simply update the in memory log->l_tail_lsn every time it changes in the AIL, there is no need to update the in memory value when we are writing it into an iclog - it will already be up-to-date in memory and checking the AIL again will not change this. Hence xlog_state_release_iclog() does not need to check the AIL to update the tail lsn and can just sample it directly without needing to take the AIL lock. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:46 +05:30
Dave Chinner	b50b4c49d8	xfs: background AIL push should target physical space Currently the AIL attempts to keep 25% of the "log space" free, where the current used space is tracked by the reserve grant head. That is, it tracks both physical space used plus the amount reserved by transactions in progress. When we start tail pushing, we are trying to make space for new reservations by writing back older metadata and the log is generally physically full of dirty metadata, and reservations for modifications in flight take up whatever space the AIL can physically free up. Hence we don't really need to take into account the reservation space that has been used - we just need to keep the log tail moving as fast as we can to free up space for more reservations to be made. We know exactly how much physical space the journal is consuming in the AIL (i.e. max LSN - min LSN) so we can base push thresholds directly on this state rather than have to look at grant head reservations to determine how much to physically push out of the log. This also allows code that needs to know if log items in the current transaction need to be pushed or re-logged to simply sample the current target - they don't need to calculate the current target themselves. This avoids the need for any locking when doing such checks. Further, moving to a physical target means we don't need "push all until empty semantics" like were introduced in the previous patch. We can now test and clear the "push all" as a one-shot command to set the target to the current head of the AIL. This allows the xfsaild to maximise the use of log space right up to the point where conditions indicate that the xfsaild is not keeping up with load and it needs to work harder, and as soon as those constraints go away (i.e. external code no longer needs everything pushed) the xfsaild will return to maintaining the normal 25% free space thresholds. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:46 +05:30
Dave Chinner	9adf40249e	xfs: AIL doesn't need manual pushing We have a mechanism that checks the amount of log space remaining available every time we make a transaction reservation. If the amount of space is below a threshold (25% free) we push on the AIL to tell it to do more work. To do this, we end up calculating the LSN that the AIL needs to push to on every reservation and updating the push target for the AIL with that new target LSN. This is silly and expensive. The AIL is perfectly capable of calculating the push target itself, and it will always be running when the AIL contains objects. What the target does is determine if the AIL needs to do any work before it goes back to sleep. If we haven't run out of reservation space or memory (or some other push all trigger), it will simply go back to sleep for a while if there is more than 25% of the journal space free without doing anything. If there are items in the AIL at a lower LSN than the target, it will try to push up to the target or to the point of getting stuck before going back to sleep and trying again soon after.` Hence we can modify the AIL to calculate it's own 25% push target before it starts a push using the same reserve grant head based calculation as is currently used, and remove all the places where we ask the AIL to push to a new 25% free target. We can also drop the minimum free space size of 256BBs from the calculation because the 25% of a minimum sized log is always going to be larger than 256BBs. This does still require a manual push in certain circumstances. These circumstances arise when the AIL is not full, but the reservation grants consume the entire of the free space in the log. In this case, we still need to push on the AIL to free up space, so when we hit this condition (i.e. reservation going to sleep to wait on log space) we do a single push to tell the AIL it should empty itself. This will keep the AIL moving as new reservations come in and want more space, rather than keep queuing them and having to push the AIL repeatedly. The reason for using the "push all" when grant space runs out is that we can run out of grant space when there is more than 25% of the log free. Small logs are notorious for this, and we have a hack in the log callback code (xlog_state_set_callback()) where we push the AIL because the head* moved) to ensure that we kick the AIL when we consume space in it because that can push us over the "less than 25% available" available that starts tail pushing back up again. Hence when we run out of grant space and are going to sleep, we have to consider that the grant space may be consuming almost all the log space and there is almost nothing in the AIL. In this situation, the AIL pins the tail and moving the tail forwards is the only way the grant space will come available, so we have to force the AIL to push everything to guarantee grant space will eventually be returned. Hence triggering a "push all" just before sleeping removes all the nasty corner cases we have in other parts of the code that work around the "we didn't ask the AIL to push enough to free grant space" condition that leads to log space hangs... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:46 +05:30
Dave Chinner	613e2fdbbc	xfs: move and rename xfs_trans_committed_bulk Ever since the CIL and delayed logging was introduced, xfs_trans_committed_bulk() has been a purely CIL checkpoint completion function and not a transaction commit completion function. Now that we are adding log specific updates to this function, it really does not have anything to do with the transaction subsystem - it is really log and log item level functionality. This should be part of the CIL code as it is the callback that moves log items from the CIL checkpoint to the AIL. Move it and rename it to xlog_cil_ail_insert(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:46 +05:30
Zizhi Wo	94a0333b92	xfs: Avoid races with cnt_btree lastrec updates A concurrent file creation and little writing could unexpectedly return -ENOSPC error since there is a race window that the allocator could get the wrong agf->agf_longest. Write file process steps: 1) Find the entry that best meets the conditions, then calculate the start address and length of the remaining part of the entry after allocation. 2) Delete this entry and update the -current- agf->agf_longest. 3) Insert the remaining unused parts of this entry based on the calculations in 1), and update the agf->agf_longest again if necessary. Create file process steps: 1) Check whether there are free inodes in the inode chunk. 2) If there is no free inode, check whether there has space for creating inode chunks, perform the no-lock judgment first. 3) If the judgment succeeds, the judgment is performed again with agf lock held. Otherwire, an error is returned directly. If the write process is in step 2) but not go to 3) yet, the create file process goes to 2) at this time, it may be mistaken for no space, resulting in the file system still has space but the file creation fails. We have sent two different commits to the community in order to fix this problem[1][2]. Unfortunately, both solutions have flaws. In [2], I discussed with Dave and Darrick, realized that a better solution to this problem requires the "last cnt record tracking" to be ripped out of the generic btree code. And surprisingly, Dave directly provided his fix code. This patch includes appropriate modifications based on his tmp-code to address this issue. The entire fix can be roughly divided into two parts: 1) Delete the code related to lastrec-update in the generic btree code. 2) Place the process of updating longest freespace with cntbt separately to the end of the cntbt modifications. Move the cursor to the rightmost firstly, and update the longest free extent based on the record. Note that we can not update the longest with xfs_alloc_get_rec() after find the longest record, as xfs_verify_agbno() may not pass because pag->block_count is updated on the outside. Therefore, use xfs_btree_get_rec() as a replacement. [1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@huawei.com [2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@huawei.com Reported by: Ye Bin <yebin10@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:44:16 +05:30
Darrick J. Wong	783e8a7c9c	xfs: move xfs_refcount_update_defer_add to xfs_refcount_item.c Move the code that adds the incore xfs_refcount_update_item deferred work data to a transaction live with the CUI log item code. This means that the refcount code no longer has to know about the inner workings of the CUI log items. As a consequence, we can get rid of the _{get,put}_group helpers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:07 -07:00
Darrick J. Wong	e51987a12c	xfs: simplify usage of the rcur local variable in xfs_refcount_finish_one Only update rcur when we know the final *pcur value. Inspired-by: Christoph Hellwig <hch@lst.de> [djwong: don't leave the caller with a dangling ref] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:07 -07:00
Darrick J. Wong	bac3f78492	xfs: don't bother calling xfs_refcount_finish_one_cleanup in xfs_refcount_finish_one In xfs_refcount_finish_one we know the cursor is non-zero when calling xfs_refcount_finish_one_cleanup and we pass a 0 error variable. This means xfs_refcount_finish_one_cleanup is just doing a xfs_btree_del_cursor. Open code that and move xfs_refcount_finish_one_cleanup to fs/xfs/xfs_refcount_item.c. Inspired-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:07 -07:00
Darrick J. Wong	8aef79928b	xfs: reuse xfs_refcount_update_cancel_item Reuse xfs_refcount_update_cancel_item to put the AG/RTG and free the item in a few places that currently open code the logic. Inspired-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:06 -07:00
Darrick J. Wong	0e9254861f	xfs: add a ci_entry helper Add a helper to translate from the item list head to the refcount_intent_item structure and use it so shorten assignments and avoid the need for extra local variables. Inspired-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:06 -07:00
Darrick J. Wong	e69682e5a1	xfs: remove xfs_trans_set_refcount_flags Remove this single-use helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:06 -07:00
Darrick J. Wong	886f11c797	xfs: clean up refcount log intent item tracepoint callsites Pass the incore refcount intent structure to the tracepoints instead of open-coding the argument passing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:06 -07:00
Darrick J. Wong	8fbac2f1a0	xfs: pass btree cursors to refcount btree tracepoints Prepare the rest of refcount btree tracepoints for use with realtime reflink by making them take the btree cursor object as a parameter. This will save us a lot of trouble later on. Remove the xfs_refcount_recover_extent tracepoint since it's already covered by other refcount tracepoints. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:05 -07:00
Darrick J. Wong	bb0efb0d0a	xfs: create specialized classes for refcount tracepoints The only user of the "ag" tracepoint event classes is the refcount btree, so rename them to make that obvious and make them take the btree cursor to simplify the arguments. This will save us a lot of trouble later on. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:05 -07:00
Darrick J. Wong	7cf2663ff1	xfs: give refcount btree cursor error tracepoints their own class Convert all the refcount tracepoints to use the btree error tracepoint class. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:05 -07:00
Darrick J. Wong	ea7b0820d9	xfs: move xfs_rmap_update_defer_add to xfs_rmap_item.c Move the code that adds the incore xfs_rmap_update_item deferred work data to a transaction to live with the RUI log item code. This means that the rmap code no longer has to know about the inner workings of the RUI log items. As a consequence, we can get rid of the _{get,put}_group helpers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:05 -07:00
Christoph Hellwig	905af72610	xfs: simplify usage of the rcur local variable in xfs_rmap_finish_one Only update rcur when we know the final *pcur value. Signed-off-by: Christoph Hellwig <hch@lst.de> [djwong: don't leave the caller with a dangling ref] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:04 -07:00
Christoph Hellwig	8363b43619	xfs: don't bother calling xfs_rmap_finish_one_cleanup in xfs_rmap_finish_one In xfs_rmap_finish_one we known the cursor is non-zero when calling xfs_rmap_finish_one_cleanup and we pass a 0 error variable. This means xfs_rmap_finish_one_cleanup is just doing a xfs_btree_del_cursor. Open code that and move xfs_rmap_finish_one_cleanup to fs/xfs/xfs_rmap_item.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: minor porting changes] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:04 -07:00
Christoph Hellwig	37f9d1db03	xfs: reuse xfs_rmap_update_cancel_item Reuse xfs_rmap_update_cancel_item to put the AG/RTG and free the item in a few places that currently open code the logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:04 -07:00
Christoph Hellwig	f93963779b	xfs: add a ri_entry helper Add a helper to translate from the item list head to the rmap_intent_item structure and use it so shorten assignments and avoid the need for extra local variables. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:04 -07:00
Darrick J. Wong	c9099a28c2	xfs: remove xfs_trans_set_rmap_flags Remove this single-use helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:04 -07:00
Darrick J. Wong	fbe8c7e167	xfs: clean up rmap log intent item tracepoint callsites Pass the incore rmap structure to the tracepoints instead of open-coding the argument passing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:03 -07:00
Darrick J. Wong	47492ed124	xfs: pass btree cursors to rmap btree tracepoints Prepare the rmap btree tracepoints for use with realtime rmap btrees by making them take the btree cursor object as a parameter. This will save us a lot of trouble later on. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:03 -07:00
Darrick J. Wong	71f5a17e52	xfs: give rmap btree cursor error tracepoints their own class Create a new tracepoint class for btree-related errors, then convert all the rmap tracepoints to use it. Also fix the one tracepoint that was abusing the old class by making it a separate tracepoint. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:03 -07:00
Darrick J. Wong	84a3c1576c	xfs: move xfs_extent_free_defer_add to xfs_extfree_item.c Move the code that adds the incore xfs_extent_free_item deferred work data to a transaction to live with the EFI log item code. This means that the allocator code no longer has to know about the inner workings of the EFI log items. As a consequence, we can get rid of the _{get,put}_group helpers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:03 -07:00
Christoph Hellwig	7272f77c67	xfs: remove xfs_defer_agfl_block xfs_free_extent_later can handle the extra AGFL special casing with very little extra logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:02 -07:00
Christoph Hellwig	851a678189	xfs: remove duplicate asserts in xfs_defer_extent_free The bno/len verification is already done by the calls to xfs_verify_rtbext / xfs_verify_fsbext, and reporting a corruption error seem like the better handling than tripping an assert anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:02 -07:00
Christoph Hellwig	81927e6ec6	xfs: factor out a xfs_efd_add_extent helper Factor out a helper to add an extent to and EFD instead of duplicating the logic in two places. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:02 -07:00
Christoph Hellwig	61665fae4e	xfs: reuse xfs_extent_free_cancel_item Reuse xfs_extent_free_cancel_item to put the AG/RTG and free the item in a few places that currently open code the logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:02 -07:00
Christoph Hellwig	649c0c2b86	xfs: add a xefi_entry helper Add a helper to translate from the item list head to the xfs_extent_free_item structure and use it so shorten assignments and avoid the need for extra local variables. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:01 -07:00
Christoph Hellwig	62d597a197	xfs: pass the fsbno to xfs_perag_intent_get All callers of xfs_perag_intent_get have a fsbno and need boilerplate code to turn that into an agno. Just pass the fsbno to xfs_perag_intent_get and look up the agno there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-07-02 11:37:01 -07:00
Darrick J. Wong	980faece91	xfs: convert "skip_discard" to a proper flags bitset Convert the boolean to skip discard on free into a proper flags field so that we can add more flags in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:01 -07:00
Darrick J. Wong	4e0e2c0fe3	xfs: clean up extent free log intent item tracepoint callsites Pass the incore EFI structure to the tracepoints instead of open-coding the argument passing. This cleans up the call sites a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:01 -07:00
Darrick J. Wong	ac3a027516	xfs: don't use the incore struct xfs_sb for offsets into struct xfs_dsb Currently, the XFS_SB_CRC_OFF macro uses the incore superblock struct (xfs_sb) to compute the address of sb_crc within the ondisk superblock struct (xfs_dsb). This is a landmine if we ever change the layout of the incore superblock (as we're about to do), so redefine the macro to use xfs_dsb to compute the layout of xfs_dsb. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:00 -07:00
Darrick J. Wong	47d4d5961f	xfs: get rid of trivial rename helpers Get rid of the largely pointless xfs_cross_rename and xfs_finish_rename now that we've refactored its parent. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:00 -07:00
Darrick J. Wong	62bbf50bea	xfs: move dirent update hooks to xfs_dir2.c Move the directory entry update hook code to xfs_dir2 so that it is mostly consolidated with the higher level directory functions. Retain the exports so that online fsck can still send notifications through the hooks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:00 -07:00
Darrick J. Wong	28d0d81344	xfs: create libxfs helper to rename two directory entries Create a new libxfs function to rename two directory entries. The upcoming metadata directory feature will need this to replace a metadata inode directory entry. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:59 -07:00
Darrick J. Wong	a55712b35c	xfs: create libxfs helper to exchange two directory entries Create a new libxfs function to exchange two directory entries. The upcoming metadata directory feature will need this to replace a metadata inode directory entry. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:59 -07:00
Darrick J. Wong	90636e4531	xfs: create libxfs helper to remove an existing inode/name from a directory Create a new libxfs function to remove a (name, inode) entry from a directory. The upcoming metadata directory feature will need this to create a metadata directory tree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:59 -07:00
Darrick J. Wong	1964435d19	xfs: hoist inode free function to libxfs Create a libxfs helper function that marks an inode free on disk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:59 -07:00
Darrick J. Wong	c1f0bad423	xfs: create libxfs helper to link an existing inode into a directory Create a new libxfs function to link an existing inode into a directory. The upcoming metadata directory feature will need this to create a metadata directory tree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:59 -07:00
Darrick J. Wong	1fa2e81957	xfs: create libxfs helper to link a new inode into a directory Create a new libxfs function to link a newly created inode into a directory. The upcoming metadata directory feature will need this to create a metadata directory tree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:58 -07:00
Darrick J. Wong	b11b11e3b7	xfs: separate the icreate logic around INIT_XATTRS INIT_XATTRS is overloaded here -- it's set during the creat process when we think that we're immediately going to set some ACL xattrs to save time. However, it's also used by the parent pointers code to enable the attr fork in preparation to receive ppptr xattrs. This results in xfs_has_parent() branches scattered around the codebase to turn on INIT_XATTRS. Linkable files are created far more commonly than unlinkable temporary files or directory tree roots, so we should centralize this logic in xfs_inode_init. For the three callers that don't want parent pointers (online repiar tempfiles, unlinkable tempfiles, rootdir creation) we provide an UNLINKABLE flag to skip attr fork initialization. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:58 -07:00
Darrick J. Wong	a9e583d34f	xfs: hoist xfs_{bump,drop}link to libxfs Move xfs_bumplink and xfs_droplink to libxfs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:58 -07:00
Darrick J. Wong	b8a6107921	xfs: hoist xfs_iunlink to libxfs Move xfs_iunlink and xfs_iunlink_remove to libxfs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:58 -07:00
Darrick J. Wong	c0223b8d66	xfs: wrap inode creation dqalloc calls Create a helper that calls dqalloc to allocate and grab a reference to dquots for the user, group, and project ids listed in an icreate structure. This simplifies the creat-related dqalloc callsites scattered around the code base. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:57 -07:00
Darrick J. Wong	dfaf884233	xfs: push xfs_icreate_args creation out of xfs_create* Move the initialization of the xfs_icreate_args structure out of xfs_create and xfs_create_tempfile into their callers so that we can set the new inode's attributes in one place and pass that through instead of open coding the collection of attributes all over the code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:57 -07:00
Darrick J. Wong	e9d2b35bb9	xfs: hoist new inode initialization functions to libxfs Move all the code that initializes a new inode's attributes from the icreate_args structure and the parent directory into libxfs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:57 -07:00
Darrick J. Wong	38fd3d6a95	xfs: split new inode creation into two pieces There are two parts to initializing a newly allocated inode: setting up the incore structures, and initializing the new inode core based on the parent inode and the current user's environment. The initialization code is not specific to the kernel, so we would like to share that with userspace by hoisting it to libxfs. Therefore, split xfs_icreate into separate functions to prepare for the next few patches. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:57 -07:00
Darrick J. Wong	a7b12718cb	xfs: use xfs_trans_ichgtime to set times when allocating inode Use xfs_trans_ichgtime to set the inode times when allocating an inode, instead of open-coding them here. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:56 -07:00
Darrick J. Wong	3d1dfb6df9	xfs: implement atime updates in xfs_trans_ichgtime Enable xfs_trans_ichgtime to change the inode access time so that we can use this function to set inode times when allocating inodes instead of open-coding it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:56 -07:00
Darrick J. Wong	ba4b39fe4c	xfs: pack icreate initialization parameters into a separate structure Callers that want to create an inode currently pass all possible file attribute values for the new inode into xfs_init_new_inode as ten separate parameters. This causes two code maintenance issues: first, we have large multi-line call sites which programmers must read carefully to make sure they did not accidentally invert a value. Second, all three file id parameters must be passed separately to the quota functions; any discrepancy results in quota count errors. Clean this up by creating a new icreate_args structure to hold all this information, some helpers to initialize them properly, and make the callers pass this structure through to the creation function, whose name we shorten to xfs_icreate. This eliminates the issues, enables us to keep the inode init code in sync with userspace via libxfs, and is needed for future metadata directory tree management. (A subsequent cleanup will also fix the quota alloc calls.) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:56 -07:00
Darrick J. Wong	fcea5b35f3	xfs: hoist project id get/set functions to libxfs Move the project id get and set functions into libxfs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:56 -07:00
Darrick J. Wong	b7c477be39	xfs: hoist inode flag conversion functions to libxfs Hoist the inode flag conversion functions into libxfs so that we can keep them in sync. Do this by creating a new xfs_inode_util.c file in libxfs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:55 -07:00
Darrick J. Wong	acdddbe168	xfs: hoist extent size helpers to libxfs Move the extent size helpers to xfs_bmap.c in libxfs since they're used there already. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:55 -07:00
Darrick J. Wong	d76e137057	xfs: move inode copy-on-write predicates to xfs_inode.[ch] Move these inode predicate functions to xfs_inode.[ch] since they're not reflink functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:55 -07:00
Darrick J. Wong	24a4e1cb32	xfs: use consistent uid/gid when grabbing dquots for inodes I noticed that callers of xfs_qm_vop_dqalloc use the following code to compute the anticipated uid of the new file: mapped_fsuid(idmap, &init_user_ns); whereas the VFS uses a slightly different computation for actually assigning i_uid: mapped_fsuid(idmap, i_user_ns(inode)); Technically, these are not the same things. According to Christian Brauner, the only time that inode->i_sb->s_user_ns != &init_user_ns is when the filesystem was mounted in a new mount namespace by an unpriviledged user. XFS does not allow this, which is why we've never seen bug reports about quotas being incorrect or the uid checks in xfs_qm_vop_create_dqattach tripping debug assertions. However, this /is/ a logic bomb, so let's make the code consistent. Link: https://lore.kernel.org/linux-fsdevel/20240617-weitblick-gefertigt-4a41f37119fa@brauner/ Fixes: `c14329d39f` ("fs: port fs{g,u}id helpers to mnt_idmap") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:55 -07:00
Darrick J. Wong	150bb10a28	xfs: verify buffer, inode, and dquot items every tx commit generic/388 has an annoying tendency to fail like this during log recovery: XFS (sda4): Unmounting Filesystem 435fe39b-82b6-46ef-be56-819499585130 XFS (sda4): Mounting V5 Filesystem 435fe39b-82b6-46ef-be56-819499585130 XFS (sda4): Starting recovery (logdev: internal) 00000000: 49 4e 81 b6 03 02 00 00 00 00 00 07 00 00 00 07 IN.............. 00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 10 ................ 00000020: 35 9a 8b c1 3e 6e 81 00 35 9a 8b c1 3f dc b7 00 5...>n..5...?... 00000030: 35 9a 8b c1 3f dc b7 00 00 00 00 00 00 3c 86 4f 5...?........<.O 00000040: 00 00 00 00 00 00 02 f3 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 1f 01 00 00 00 00 00 00 00 02 b2 74 c9 0b .............t.. 00000060: ff ff ff ff d7 45 73 10 00 00 00 00 00 00 00 2d .....Es........- 00000070: 00 00 07 92 00 01 fe 30 00 00 00 00 00 00 00 1a .......0........ 00000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000090: 35 9a 8b c1 3b 55 0c 00 00 00 00 00 04 27 b2 d1 5...;U.......'.. 000000a0: 43 5f e3 9b 82 b6 46 ef be 56 81 94 99 58 51 30 C_....F..V...XQ0 XFS (sda4): Internal error Bad dinode after recovery at line 539 of file fs/xfs/xfs_inode_item_recover.c. Caller xlog_recover_items_pass2+0x4e/0xc0 [xfs] CPU: 0 PID: 2189311 Comm: mount Not tainted 6.9.0-rc4-djwx #rc4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4f/0x60 xfs_corruption_error+0x90/0xa0 xlog_recover_inode_commit_pass2+0x5f1/0xb00 xlog_recover_items_pass2+0x4e/0xc0 xlog_recover_commit_trans+0x2db/0x350 xlog_recovery_process_trans+0xab/0xe0 xlog_recover_process_data+0xa7/0x130 xlog_do_recovery_pass+0x398/0x840 xlog_do_log_recovery+0x62/0xc0 xlog_do_recover+0x34/0x1d0 xlog_recover+0xe9/0x1a0 xfs_log_mount+0xff/0x260 xfs_mountfs+0x5d9/0xb60 xfs_fs_fill_super+0x76b/0xa30 get_tree_bdev+0x124/0x1d0 vfs_get_tree+0x17/0xa0 path_mount+0x72b/0xa90 __x64_sys_mount+0x112/0x150 do_syscall_64+0x49/0x100 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> XFS (sda4): Corruption detected. Unmount and run xfs_repair XFS (sda4): Metadata corruption detected at xfs_dinode_verify.part.0+0x739/0x920 [xfs], inode 0x427b2d1 XFS (sda4): Filesystem has been shut down due to log error (0x2). XFS (sda4): Please unmount the filesystem and rectify the problem(s). XFS (sda4): log mount/recovery failed: error -117 XFS (sda4): log mount failed This inode log item recovery failing the dinode verifier after replaying the contents of the inode log item into the ondisk inode. Looking back into what the kernel was doing at the time of the fs shutdown, a thread was in the middle of running a series of transactions, each of which committed changes to the inode. At some point in the middle of that chain, an invalid (at least according to the verifier) change was committed. Had the filesystem not shut down in the middle of the chain, a subsequent transaction would have corrected the invalid state and nobody would have noticed. But that's not what happened here. Instead, the invalid inode state was committed to the ondisk log, so log recovery tripped over it. The actual defect here was an overzealous inode verifier, which was fixed in a separate patch. This patch adds some transaction precommit functions for CONFIG_XFS_DEBUG=y mode so that we can detect these kinds of transient errors at transaction commit time, where it's much easier to find the root cause. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:36:54 -07:00
Darrick J. Wong	3ba3ab1f67	xfs: enable FITRIM on the realtime device Implement FITRIM for the realtime device by pretending that it's "space" immediately after the data device. We have to hold the rtbitmap ILOCK while the discard operations are ongoing because there's no busy extent tracking for the rt volume to prevent reallocations. Cc: Konst Mayer <cdlscpmv@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Wenchao Hao	a330cae8a7	xfs: Remove header files which are included more than once Following warning is reported, so remove these duplicated header including: ./fs/xfs/libxfs/xfs_trans_resv.c: xfs_da_format.h is included more than once. ./fs/xfs/scrub/quota_repair.c: xfs_format.h is included more than once. ./fs/xfs/xfs_handle.c: xfs_da_btree.h is included more than once. ./fs/xfs/xfs_qm_bhv.c: xfs_mount.h is included more than once. ./fs/xfs/xfs_trace.c: xfs_bmap.h is included more than once. This is just a clean code, no logic changed. Signed-off-by: Wenchao Hao <haowenchao22@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Christoph Hellwig	4818fd60db	xfs: fold xfs_ilock_for_write_fault into xfs_write_fault Now that the page fault handler has been refactored, the only caller of xfs_ilock_for_write_fault is simple enough and calls it unconditionally. Fold the logic and expand the comments explaining it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Christoph Hellwig	4e82fa11fb	xfs: always take XFS_MMAPLOCK shared in xfs_dax_read_fault After the previous refactoring, xfs_dax_fault is now never used for write faults, so don't bother with the xfs_ilock_for_write_fault logic to protect against writes when remapping is in progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Christoph Hellwig	6a39ec1d39	xfs: refactor __xfs_filemap_fault Split the write fault and DAX fault handling into separate helpers so that the main fault handler is easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Christoph Hellwig	9092b1de35	xfs: simplify xfs_dax_fault Replace the separate stub with an IS_ENABLED check, and take the call to dax_finish_sync_fault into xfs_dax_fault instead of leaving it in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Christoph Hellwig	29bc0dd0a2	xfs: cleanup xfs_ilock_iocb_for_write Move the relock path out of the straight line and add a comment explaining why it exists. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Christoph Hellwig	8626b67acf	xfs: move the dio write relocking out of xfs_ilock_for_iomap About half of xfs_ilock_for_iomap deals with a special case for direct I/O writes to COW files that need to take the ilock exclusively. Move this code into the one callers that cares and simplify xfs_ilock_for_iomap. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
lei lu	0c7fcdb6d0	xfs: don't walk off the end of a directory data block This adds sanity checks for xfs_dir2_data_unused and xfs_dir2_data_entry to make sure don't stray beyond valid memory region. Before patching, the loop simply checks that the start offset of the dup and dep is within the range. So in a crafted image, if last entry is xfs_dir2_data_unused, we can change dup->length to dup->length-1 and leave 1 byte of space. In the next traversal, this space will be considered as dup or dep. We may encounter an out of bound read when accessing the fixed members. In the patch, we make sure that the remaining bytes large enough to hold an unused entry before accessing xfs_dir2_data_unused and xfs_dir2_data_unused is XFS_DIR2_DATA_ALIGN byte aligned. We also make sure that the remaining bytes large enough to hold a dirent with a single-byte name before accessing xfs_dir2_data_entry. Signed-off-by: lei lu <llfamsec@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
lei lu	fb63435b7c	xfs: add bounds checking to xlog_recover_process_data There is a lack of verification of the space occupied by fixed members of xlog_op_header in the xlog_recover_process_data. We can create a crafted image to trigger an out of bounds read by following these steps: 1) Mount an image of xfs, and do some file operations to leave records 2) Before umounting, copy the image for subsequent steps to simulate abnormal exit. Because umount will ensure that tail_blk and head_blk are the same, which will result in the inability to enter xlog_recover_process_data 3) Write a tool to parse and modify the copied image in step 2 4) Make the end of the xlog_op_header entries only 1 byte away from xlog_rec_header->h_size 5) xlog_rec_header->h_num_logops++ 6) Modify xlog_rec_header->h_crc Fix: Add a check to make sure there is sufficient space to access fixed members of xlog_op_header. Signed-off-by: lei lu <llfamsec@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
John Garry	f23660f059	xfs: Fix xfs_prepare_shift() range for RT The RT extent range must be considered in the xfs_flush_unmap_range() call to stabilize the boundary. This code change is originally from Dave Chinner. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
John Garry	d3b689d7c7	xfs: Fix xfs_flush_unmap_range() range for RT Currently xfs_flush_unmap_range() does unmap for a full RT extent range, which we also want to ensure is clean and idle. This code change is originally from Dave Chinner. Reviewed-by: Christoph Hellwig <hch@lst.de>4 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Gao Xiang	d40c2865bd	xfs: avoid redundant AGFL buffer invalidation Currently AGFL blocks can be filled from the following three sources: - allocbt free blocks, as in xfs_allocbt_free_block(); - rmapbt free blocks, as in xfs_rmapbt_free_block(); - refilled from freespace btrees, as in xfs_alloc_fix_freelist(). Originally, allocbt free blocks would be marked as stale only when they put back in the general free space pool as Dave mentioned on IRC, "we don't stale AGF metadata btree blocks when they are returned to the AGFL .. but once they get put back in the general free space pool, we have to make sure the buffers are marked stale as the next user of those blocks might be user data...." However, after commit `ca250b1b3d` ("xfs: invalidate allocbt blocks moved to the free list") and commit `edfd9dd549` ("xfs: move buffer invalidation to xfs_btree_free_block"), even allocbt / bmapbt free blocks will be invalidated immediately since they may fail to pass V5 format validation on writeback even writeback to free space would be safe. IOWs, IMHO currently there is actually no difference of free blocks between AGFL freespace pool and the general free space pool. So let's avoid extra redundant AGFL buffer invalidation, since otherwise we're currently facing unnecessary xfs_log_force() due to xfs_trans_binval() again on buffers already marked as stale before as below: [ 333.507469] Call Trace: [ 333.507862] xfs_buf_find+0x371/0x6a0 <- xfs_buf_lock [ 333.508451] xfs_buf_get_map+0x3f/0x230 [ 333.509062] xfs_trans_get_buf_map+0x11a/0x280 [ 333.509751] xfs_free_agfl_block+0xa1/0xd0 [ 333.510403] xfs_agfl_free_finish_item+0x16e/0x1d0 [ 333.511157] xfs_defer_finish_noroll+0x1ef/0x5c0 [ 333.511871] xfs_defer_finish+0xc/0xa0 [ 333.512471] xfs_itruncate_extents_flags+0x18a/0x5e0 [ 333.513253] xfs_inactive_truncate+0xb8/0x130 [ 333.513930] xfs_inactive+0x223/0x270 xfs_log_force() will take tens of milliseconds with AGF buffer locked. It becomes an unnecessary long latency especially on our PMEM devices with FSDAX enabled and fsops like xfs_reflink_find_shared() at the same time are stuck due to the same AGF lock. Removing the double invalidation on the AGFL blocks does not make this issue go away, but this patch fixes for our workloads in reality and it should also work by the code analysis. Note that I'm not sure I need to remove another redundant one in xfs_alloc_ag_vextent_small() since it's unrelated to our workloads. Also fstests are passed with this patch. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Darrick J. Wong	673cd885bb	xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs xfs_init_new_inode ignores the init_xattrs parameter for filesystems that do not have ATTR enabled. As a result, the first init_xattrs file to be created by the kernel will not have an attr fork created to store acls. Storing that first acl will add ATTR to the superblock flags, so subsequent files will be created with attr forks. The overhead of this is so small that chances are that nobody has noticed this behavior. However, this is disastrous on a filesystem with parent pointers because it requires that a new linkable file /must/ have a pre-existing attr fork, and the parent pointers code uses init_xattrs to create that fork. The preproduction version of mkfs.xfs used to set this, but the V5 sb verifier only requires ATTR2, not ATTR. There is no guard for filesystems with (PARENT && !ATTR). It turns out that I misunderstood the two flags -- ATTR means that we at some point created an attr fork to store xattrs in a file; ATTR2 apparently means only that inodes have dynamic fork offsets or that the filesystem was mounted with the "attr2" option. Fixes: `2442ee15bb` ("xfs: eager inode attr fork init needs attr feature awareness") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-06-26 14:29:25 +05:30
Darrick J. Wong	dc5e1cbae2	xfs: fix direction in XFS_IOC_EXCHANGE_RANGE The kernel reads userspace's buffer but does not write it back. Therefore this is really an _IOW ioctl. Change this before 6.10 final releases. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-06-26 14:29:25 +05:30
Darrick J. Wong	1ec9307fc0	xfs: allow unlinked symlinks and dirs with zero size For a very very long time, inode inactivation has set the inode size to zero before unmapping the extents associated with the data fork. Unfortunately, commit `3c6f46eacd` changed the inode verifier to prohibit zero-length symlinks and directories. If an inode happens to get logged in this state and the system crashes before freeing the inode, log recovery will also fail on the broken inode. Therefore, allow zero-size symlinks and directories as long as the link count is zero; nobody will be able to open these files by handle so there isn't any risk of data exposure. Fixes: `3c6f46eacd` ("xfs: sanity check directory inode di_size") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-06-26 14:29:25 +05:30
Darrick J. Wong	288e1f693f	xfs: restrict when we try to align cow fork delalloc to cowextsz hints xfs/205 produces the following failure when always_cow is enabled: --- a/tests/xfs/205.out 2024-02-28 16:20:24.437887970 -0800 +++ b/tests/xfs/205.out.bad 2024-06-03 21:13:40.584000000 -0700 @@ -1,4 +1,5 @@ QA output created by 205 * one file + !!! disk full (expected) * one file, a few bytes at a time *** done This is the result of overly aggressive attempts to align cow fork delalloc reservations to the CoW extent size hint. Looking at the trace data, we're trying to append a single fsblock to the "fred" file. Trying to create a speculative post-eof reservation fails because there's not enough space. We then set @prealloc_blocks to zero and try again, but the cowextsz alignment code triggers, which expands our request for a 1-fsblock reservation into a 39-block reservation. There's not enough space for that, so the whole write fails with ENOSPC even though there's sufficient space in the filesystem to allocate the single block that we need to land the write. There are two things wrong here -- first, we shouldn't be attempting speculative preallocations beyond what was requested when we're low on space. Second, if we've already computed a posteof preallocation, we shouldn't bother trying to align that to the cowextsize hint. Fix both of these problems by adding a flag that only enables the expansion of the delalloc reservation to the cowextsize if we're doing a non-extending write, and only if we're not doing an ENOSPC retry. This requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc. I probably should have caught this six years ago when `6ca30729c2` was being reviewed, but oh well. Update the comments to reflect what the code does now. Fixes: `6ca30729c2` ("xfs: bmap code cleanup") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-06-26 14:29:24 +05:30
Christoph Hellwig	610b29161b	xfs: fix freeing speculative preallocations for preallocated files xfs_can_free_eofblocks returns false for files that have persistent preallocations unless the force flag is passed and there are delayed blocks. This means it won't free delalloc reservations for files with persistent preallocations unless the force flag is set, and it will also free the persistent preallocations if the force flag is set and the file happens to have delayed allocations. Both of these are bad, so do away with the force flag and always free only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC or APPEND flags set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-06-26 14:29:17 +05:30
Zhang Yi	d048945150	xfs: reserve blocks for truncating large realtime inode When unaligned truncate down a big realtime file, xfs_truncate_page() only zeros out the tail EOF block, __xfs_bunmapi() should split the tail written extent and convert the later one that beyond EOF block to unwritten, but it couldn't work as expected now since the reserved block is zero in xfs_setattr_size(), this could expose stale data just after commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")'. If we truncate file that contains a large enough written extent: \|< rxext >\|< rtext >\| ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW ^ (new EOF) ^ old EOF Since we only zeros out the tail of the EOF block, and xfs_itruncate_extents()->..->__xfs_bunmapi() unmap the whole ailgned extents, it becomes this state: \|< rxext >\| ...WWWzWWWWWWWWWWWWW ^ new EOF Then if we do an extending write like this, the blocks in the previous tail extent becomes stale: \|< rxext >\| ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW ^ old EOF ^ append start ^ new EOF Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for big realtime inode. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240618142112.1315279-2-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-06-19 15:58:28 +02:00
Dave Chinner	348a1983cf	xfs: fix unlink vs cluster buffer instantiation race Luis has been reporting an assert failure when freeing an inode cluster during inode inactivation for a while. The assert looks like: XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:102! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs] RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7 RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0 RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000 R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80 FS: 0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs process_one_work (kernel/workqueue.c:3231) worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2)) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> And occurs when the the inode precommit handlers is attempt to look up the inode cluster buffer to attach the inode for writeback. The trail of logic that I can reconstruct is as follows. 1. the inode is clean when inodegc runs, so it is not attached to a cluster buffer when precommit runs. 2. #1 implies the inode cluster buffer may be clean and not pinned by dirty inodes when inodegc runs. 3. #2 implies that the inode cluster buffer can be reclaimed by memory pressure at any time. 4. The assert failure implies that the cluster buffer was attached to the transaction, but not marked done. It had been accessed earlier in the transaction, but not marked done. 5. #4 implies the cluster buffer has been invalidated (i.e. marked stale). 6. #5 implies that the inode cluster buffer was instantiated uninitialised in the transaction in xfs_ifree_cluster(), which only instantiates the buffers to invalidate them and never marks them as done. Given factors 1-3, this issue is highly dependent on timing and environmental factors. Hence the issue can be very difficult to reproduce in some situations, but highly reliable in others. Luis has an environment where it can be reproduced easily by g/531 but, OTOH, I've reproduced it only once in ~2000 cycles of g/531. I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag on the cluster buffers, even though they may not be initialised. The reasons why I think this is safe are: 1. A buffer cache lookup hit on a XBF_STALE buffer will clear the XBF_DONE flag. Hence all future users of the buffer know they have to re-initialise the contents before use and mark it done themselves. 2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which means the buffer remains locked until the journal commit completes and the buffer is unpinned. Hence once marked XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only context that can access the freed buffer is the currently running transaction. 3. #2 implies that future buffer lookups in the currently running transaction will hit the transaction match code and not the buffer cache. Hence XBF_STALE and XFS_BLI_STALE will not be cleared unless the transaction initialises and logs the buffer with valid contents again. At which point, the buffer will be marked marked XBF_DONE again, so having XBF_DONE already set on the stale buffer is a moot point. 4. #2 also implies that any concurrent access to that cluster buffer will block waiting on the buffer lock until the inode cluster has been fully freed and is no longer an active inode cluster buffer. 5. #4 + #1 means that any future user of the disk range of that buffer will always see the range of disk blocks covered by the cluster buffer as not done, and hence must initialise the contents themselves. 6. Setting XBF_DONE in xfs_ifree_cluster() then means the unlinked inode precommit code will see a XBF_DONE buffer from the transaction match as it expects. It can then attach the stale but newly dirtied inode to the stale but newly dirtied cluster buffer without unexpected failures. The stale buffer will then sail through the journal and do the right thing with the attached stale inode during unpin. Hence the fix is just one line of extra code. The explanation of why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and complex.... Fixes: `82842fee6e` ("xfs: fix AGF vs inode cluster buffer deadlock") Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-06-17 11:17:09 +05:30
Mateusz Guzik	e9dae2fb99	xfs: remove now spurious i_state initialization in xfs_inode_alloc inode_init_always started setting the field to 0. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240611120626.513952-4-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-06-13 13:40:44 +02:00
Mateusz Guzik	ddd4cd4824	xfs: preserve i_state around inode_init_always in xfs_reinit_inode This is in preparation for the routine starting to zero the field. De facto coded by Dave Chinner, see: https://lore.kernel.org/linux-fsdevel/ZmgtaGglOL33Wkzr@dread.disaster.area/ Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240611120626.513952-2-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-06-12 14:13:26 +02:00
Wengang Wang	58f880711f	xfs: make sure sb_fdblocks is non-negative A user with a completely full filesystem experienced an unexpected shutdown when the filesystem tried to write the superblock during runtime. kernel shows the following dmesg: [ 8.176281] XFS (dm-4): Metadata corruption detected at xfs_sb_write_verify+0x60/0x120 [xfs], xfs_sb block 0x0 [ 8.177417] XFS (dm-4): Unmount and run xfs_repair [ 8.178016] XFS (dm-4): First 128 bytes of corrupted metadata buffer: [ 8.178703] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00 XFSB............ [ 8.179487] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 8.180312] 00000020: cf 12 dc 89 ca 26 45 29 92 e6 e3 8d 3b b8 a2 c3 .....&E)....;... [ 8.181150] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80 ................ [ 8.182003] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................ [ 8.182004] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00 .....d.......... [ 8.182004] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00 ..d............. [ 8.182005] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19 ................ [ 8.182008] XFS (dm-4): Corruption of in-memory data detected. Shutting down filesystem [ 8.182010] XFS (dm-4): Please unmount the filesystem and rectify the problem(s) When xfs_log_sb writes super block to disk, b_fdblocks is fetched from m_fdblocks without any lock. As m_fdblocks can experience a positive -> negative -> positive changing when the FS reaches fullness (see xfs_mod_fdblocks). So there is a chance that sb_fdblocks is negative, and because sb_fdblocks is type of unsigned long long, it reads super big. And sb_fdblocks being bigger than sb_dblocks is a problem during log recovery, xfs_validate_sb_write() complains. Fix: As sb_fdblocks will be re-calculated during mount when lazysbcount is enabled, We just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks is not nenative. This patch also takes care of other percpu counters in xfs_log_sb. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-06-10 11:38:12 +05:30
Ritesh Harjani (IBM)	b0c6bcd58d	xfs: Add cond_resched to block unmap range and reflink remap path An async dio write to a sparse file can generate a lot of extents and when we unlink this file (using rm), the kernel can be busy in umapping and freeing those extents as part of transaction processing. Similarly xfs reflink remapping path can also iterate over a million extent entries in xfs_reflink_remap_blocks(). Since we can busy loop in these two functions, so let's add cond_resched() to avoid softlockup messages like these. watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435] CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S L 6.9.0-rc5-0-default #1 Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290 LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290 Call Trace: xfs_alloc_get_rec+0x54/0x1b0 (unreliable) xfs_alloc_compute_aligned+0x5c/0x144 xfs_alloc_ag_vextent_size+0x238/0x8d4 xfs_alloc_fix_freelist+0x540/0x694 xfs_free_extent_fix_freelist+0x84/0xe0 __xfs_free_extent+0x74/0x1ec xfs_extent_free_finish_item+0xcc/0x214 xfs_defer_finish_one+0x194/0x388 xfs_defer_finish_noroll+0x1b4/0x5c8 xfs_defer_finish+0x2c/0xc4 xfs_bunmapi_range+0xa4/0x100 xfs_itruncate_extents_flags+0x1b8/0x2f4 xfs_inactive_truncate+0xe0/0x124 xfs_inactive+0x30c/0x3e0 xfs_inodegc_worker+0x140/0x234 process_scheduled_works+0x240/0x57c worker_thread+0x198/0x468 kthread+0x138/0x140 start_kernel_thread+0x14/0x18 run fstests generic/175 at 2024-02-02 04:40:21 [ C17] watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679] watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679] CPU: 17 PID: 7679 Comm: xfs_io Kdump: loaded Tainted: G X 6.4.0 NIP [c008000005e3ec94] xfs_rmapbt_diff_two_keys+0x54/0xe0 [xfs] LR [c008000005e08798] xfs_btree_get_leaf_keys+0x110/0x1e0 [xfs] Call Trace: 0xc000000014107c00 (unreliable) __xfs_btree_updkeys+0x8c/0x2c0 [xfs] xfs_btree_update_keys+0x150/0x170 [xfs] xfs_btree_lshift+0x534/0x660 [xfs] xfs_btree_make_block_unfull+0x19c/0x240 [xfs] xfs_btree_insrec+0x4e4/0x630 [xfs] xfs_btree_insert+0x104/0x2d0 [xfs] xfs_rmap_insert+0xc4/0x260 [xfs] xfs_rmap_map_shared+0x228/0x630 [xfs] xfs_rmap_finish_one+0x2d4/0x350 [xfs] xfs_rmap_update_finish_item+0x44/0xc0 [xfs] xfs_defer_finish_noroll+0x2e4/0x740 [xfs] __xfs_trans_commit+0x1f4/0x400 [xfs] xfs_reflink_remap_extent+0x2d8/0x650 [xfs] xfs_reflink_remap_blocks+0x154/0x320 [xfs] xfs_file_remap_range+0x138/0x3a0 [xfs] do_clone_file_range+0x11c/0x2f0 vfs_clone_file_range+0x60/0x1c0 ioctl_file_clone+0x78/0x140 sys_ioctl+0x934/0x1270 system_call_exception+0x158/0x320 system_call_vectored_common+0x15c/0x2ec Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Disha Goel<disgoel@linux.ibm.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-27 20:50:35 +05:30
Darrick J. Wong	95b19e2f4e	xfs: don't open-code u64_to_user_ptr Don't open-code what the kernel already provides. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-27 15:55:52 +05:30
Darrick J. Wong	38de567906	xfs: allow symlinks with short remote targets An internal user complained about log recovery failing on a symlink ("Bad dinode after recovery") with the following (excerpted) format: core.magic = 0x494e core.mode = 0120777 core.version = 3 core.format = 2 (extents) core.nlinkv2 = 1 core.nextents = 1 core.size = 297 core.nblocks = 1 core.naextents = 0 core.forkoff = 0 core.aformat = 2 (extents) u3.bmx[0] = [startoff,startblock,blockcount,extentflag] 0:[0,12,1,0] This is a symbolic link with a 297-byte target stored in a disk block, which is to say this is a symlink with a remote target. The forkoff is 0, which is to say that there's 512 - 176 == 336 bytes in the inode core to store the data fork. Eventually, testing of generic/388 failed with the same inode corruption message during inode recovery. In writing a debugging patch to call xfs_dinode_verify on dirty inode log items when we're committing transactions, I observed that xfs/298 can reproduce the problem quite quickly. xfs/298 creates a symbolic link, adds some extended attributes, then deletes them all. The test failure occurs when the final removexattr also deletes the attr fork because that does not convert the remote symlink back into a shortform symlink. That is how we trip this test. The only reason why xfs/298 only triggers with the debug patch added is that it deletes the symlink, so the final iflush shows the inode as free. I wrote a quick fstest to emulate the behavior of xfs/298, except that it leaves the symlinks on the filesystem after inducing the "corrupt" state. Kernels going back at least as far as 4.18 have written out symlink inodes in this manner and prior to `1eb70f54c4` they did not object to reading them back in. Because we've been writing out inodes this way for quite some time, the only way to fix this is to relax the check for symbolic links. Directories don't have this problem because di_size is bumped to blocksize during the sf->data conversion. Fixes: `1eb70f54c4` ("xfs: validate inode fork size against fork format") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-27 15:55:52 +05:30
Darrick J. Wong	97835e6866	xfs: fix xfs_init_attr_trans not handling explicit operation codes When we were converting the attr code to use an explicit operation code instead of keying off of attr->value being null, we forgot to change the code that initializes the transaction reservation. Split the function into two helpers that handle the !remove and remove cases, then fix both callsites to handle this correctly. Fixes: `c27411d4c6` ("xfs: make attr removal an explicit operation") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-27 15:55:52 +05:30
Darrick J. Wong	2b3f004d3d	xfs: drop xfarray sortinfo folio on error Chandan Babu reports the following livelock in xfs/708: run fstests xfs/708 at 2024-05-04 15:35:29 XFS (loop16): EXPERIMENTAL online scrub feature in use. Use at your own risk! XFS (loop5): Mounting V5 Filesystem e96086f0-a2f9-4424-a1d5-c75d53d823be XFS (loop5): Ending clean mount XFS (loop5): Quotacheck needed: Please wait. XFS (loop5): Quotacheck: Done. XFS (loop5): EXPERIMENTAL online scrub feature in use. Use at your own risk! INFO: task xfs_io:143725 blocked for more than 122 seconds. Not tainted 6.9.0-rc4+ #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:xfs_io state:D stack:0 pid:143725 tgid:143725 ppid:117661 flags:0x00004006 Call Trace: <TASK> __schedule+0x69c/0x17a0 schedule+0x74/0x1b0 io_schedule+0xc4/0x140 folio_wait_bit_common+0x254/0x650 shmem_undo_range+0x9d5/0xb40 shmem_evict_inode+0x322/0x8f0 evict+0x24e/0x560 __dentry_kill+0x17d/0x4d0 dput+0x263/0x430 __fput+0x2fc/0xaa0 task_work_run+0x132/0x210 get_signal+0x1a8/0x1910 arch_do_signal_or_restart+0x7b/0x2f0 syscall_exit_to_user_mode+0x1c2/0x200 do_syscall_64+0x72/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e The shmem code is trying to drop all the folios attached to a shmem file and gets stuck on a locked folio after a bnobt repair. It looks like the process has a signal pending, so I started looking for places where we lock an xfile folio and then deal with a fatal signal. I found a bug in xfarray_sort_scan via code inspection. This function is called to set up the scanning phase of a quicksort operation, which may involve grabbing a locked xfile folio. If we exit the function with an error code, the caller does not call xfarray_sort_scan_done to put the xfile folio. If _sort_scan returns an error code while si->folio is set, we leak the reference and never unlock the folio. Therefore, change xfarray_sort to call _scan_done on exit. This is safe to call multiple times because it sets si->folio to NULL and ignores a NULL si->folio. Also change _sort_scan to use an intermediate variable so that we never pollute si->folio with an errptr. Fixes: `232ea05277` ("xfs: enable sorting of xfile-backed arrays") Reported-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-27 15:55:52 +05:30
John Garry	b33874fb7f	xfs: Stop using __maybe_unused in xfs_alloc.c In both xfs_alloc_cur_finish() and xfs_alloc_ag_vextent_exact(), local variable @afg is tagged as __maybe_unused. Otherwise an unused variable warning would be generated for when building with W=1 and CONFIG_XFS_DEBUG unset. In both cases, the variable is unused as it is only referenced in an ASSERT() call, which is compiled out (in this config). It is generally a poor programming style to use __maybe_unused for variables. The ASSERT() call is to verify that agbno of the end of the extent is within bounds for both functions. @afg is used as an intermediate variable to find the AG length. However xfs_verify_agbext() already exists to verify a valid extent range. The arguments for calling xfs_verify_agbext() are already available, so use that instead. An advantage of using xfs_verify_agbext() is that it verifies that both the start and the end of the extent are within the bounds of the AG and catches overflows. Suggested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-27 15:54:24 +05:30
John Garry	d7ba701da6	xfs: Clear W=1 warning in xfs_iwalk_run_callbacks() For CONFIG_XFS_DEBUG unset, xfs_iwalk_run_callbacks() generates the following warning for when building with W=1: fs/xfs/xfs_iwalk.c: In function ‘xfs_iwalk_run_callbacks’: fs/xfs/xfs_iwalk.c:354:42: error: variable ‘irec’ set but not used [-Werror=unused-but-set-variable] 354 \| struct xfs_inobt_rec_incore *irec; \| ^~~~ cc1: all warnings being treated as errors Drop @irec, as it is only an intermediate variable. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-27 15:54:24 +05:30
Steven Rostedt (Google)	2c92ca849f	tracing/treewide: Remove second parameter of __assign_str() With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str \| while read a ; do sed -e 's/$__assign_str([^,][^ ,]$ ,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>	2024-05-22 20:14:47 -04:00
Linus Torvalds	5ad8b6ad9a	getting rid of bogus set_blocksize() uses, switching it to struct file * and verifying that caller has device opened exclusively. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwkfQAKCRBZ7Krx/gZQ 62C3AQDW5vuXNx2+KDPma5YStjFpPLC0xtSyAS5D3YANjtyRFgD/TOcCarq7rvBt KubxHVFsfW+eu6ASeaoMRB83w5OIzwk= =Liix -----END PGP SIGNATURE----- Merge tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file ' and verifies that the caller has the device opened exclusively" tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()	2024-05-21 08:34:51 -07:00
Linus Torvalds	119d1b8a5d	New code for 6.10: * Introduce Parent Pointer extended attribute for inodes. * Online Repair - Implement atomic file content exchanges i.e. exchange ranges of bytes between two files atomically. - Create temporary files to repair file-based metadata. This uses atomic file content exchange facility to swap file fork mappings between the temporary file and the metadata inode. - Allow callers of directory/xattr code to set an explicit owner number to be written into the header fields of any new blocks that are created. This is required to avoid walking every block of the new structure and modify their ownership during online repair. - Repair - Extended attributes - Inode unlinked state - Directories - Symbolic links - AGI's unlinked inode list. - Parent pointers. - Move Orphan files to lost and found directory. - Fixes for Inode repair functionality. - Introduce a new sub-AG FITRIM implementation to reduce the duration for which the AGF lock is held. - Updates for the design documentation. - Use Parent Pointers to assist in checking directories, parent pointers, extended attributes, and link counts. * Bring back delalloc support for realtime devices which have an extent size that is equal to filesystem's block size. * Improve performance of log incompat feature handling. * Fixes - Prevent userspace from reading invalid file data due to incorrect. updation of file size when performing a non-atomic clone operation. - Minor fixes to online repair. - Fix confusing return values from xfs_bmapi_write(). - Fix an out of bounds access due to incorrect h_size during log recovery. - Defer upgrading the extent counters in xfs_reflink_end_cow_extent() until we know we are going to modify the extent mapping. - Remove racy access to if_bytes check in xfs_reflink_end_cow_extent(). - Fix sparse warnings. * Cleanups - Hold inode locks on all files involved in a rename until the completion of the operation. This is in preparation for the parent pointers patchset where parent pointers are applied in a separate chained update from the actual directory update. - Compile out v4 support when disabled. - Cleanup xfs_extent_busy_clear(). - Remove unused flags and fields from struct xfs_da_args. - Remove definitions of unused functions. - Improve extended attribute validation. - Add higher level directory operations helpers to remove duplication of code. - Cleanup quota (un)reservation interfaces. Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZjZC0wAKCRAH7y4RirJu 9HsCAPoCQvmPefDv56aMb5JEQNpv9dPz2Djj14hqLytQs5P/twD+LF5NhJgQNDUo Lwnb0tmkAhmG9Y4CCiN1FwSj1rq59gE= =2hXB -----END PGP SIGNATURE----- Merge tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Chandan Babu: "Online repair feature continues to be expanded. Also, we now support delayed allocation for realtime devices which have an extent size that is equal to filesystem's block size. New code: - Introduce Parent Pointer extended attribute for inodes - Bring back delalloc support for realtime devices which have an extent size that is equal to filesystem's block size - Improve performance of log incompat feature handling Online Repair: - Implement atomic file content exchanges i.e. exchange ranges of bytes between two files atomically - Create temporary files to repair file-based metadata. This uses atomic file content exchange facility to swap file fork mappings between the temporary file and the metadata inode - Allow callers of directory/xattr code to set an explicit owner number to be written into the header fields of any new blocks that are created. This is required to avoid walking every block of the new structure and modify their ownership during online repair - Repair more data structures: - Extended attributes - Inode unlinked state - Directories - Symbolic links - AGI's unlinked inode list - Parent pointers - Move Orphan files to lost and found directory - Fixes for Inode repair functionality - Introduce a new sub-AG FITRIM implementation to reduce the duration for which the AGF lock is held - Updates for the design documentation - Use Parent Pointers to assist in checking directories, parent pointers, extended attributes, and link counts Fixes: - Prevent userspace from reading invalid file data due to incorrect. updation of file size when performing a non-atomic clone operation - Minor fixes to online repair - Fix confusing return values from xfs_bmapi_write() - Fix an out of bounds access due to incorrect h_size during log recovery - Defer upgrading the extent counters in xfs_reflink_end_cow_extent() until we know we are going to modify the extent mapping - Remove racy access to if_bytes check in xfs_reflink_end_cow_extent() - Fix sparse warnings Cleanups: - Hold inode locks on all files involved in a rename until the completion of the operation. This is in preparation for the parent pointers patchset where parent pointers are applied in a separate chained update from the actual directory update - Compile out v4 support when disabled - Cleanup xfs_extent_busy_clear() - Remove unused flags and fields from struct xfs_da_args - Remove definitions of unused functions - Improve extended attribute validation - Add higher level directory operations helpers to remove duplication of code - Cleanup quota (un)reservation interfaces" * tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (221 commits) xfs: simplify iext overflow checking and upgrade xfs: remove a racy if_bytes check in xfs_reflink_end_cow_extent xfs: upgrade the extent counters in xfs_reflink_end_cow_extent later xfs: xfs_quota_unreserve_blkres can't fail xfs: consolidate the xfs_quota_reserve_blkres definitions xfs: clean up buffer allocation in xlog_do_recovery_pass xfs: fix log recovery buffer allocation for the legacy h_size fixup xfs: widen flags argument to the xfs_iflags_* helpers xfs: minor cleanups of xfs_attr3_rmt_blocks xfs: create a helper to compute the blockcount of a max sized remote value xfs: turn XFS_ATTR3_RMT_BUF_SPACE into a function xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c xfs: do not allocate the entire delalloc extent in xfs_bmapi_write xfs: fix xfs_bmap_add_extent_delay_real for partial conversions xfs: remove the xfs_iext_peek_prev_extent call in xfs_bmapi_allocate xfs: pass the actual offset and len to allocate to xfs_bmapi_allocate xfs: don't open code XFS_FILBLKS_MIN in xfs_bmapi_write xfs: lift a xfs_valid_startblock into xfs_bmapi_allocate xfs: remove the unusued tmp_logflags variable in xfs_bmapi_allocate xfs: fix error returns from xfs_bmapi_write ...	2024-05-20 12:55:12 -07:00
Linus Torvalds	ff9a79307f	Kbuild updates for v6.10 - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig -----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmZFlGcVHG1hc2FoaXJv eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8voQALC8NtFpduWVfLRj2Qg6Ll/xf1vX 2igcTJEOFHkeqXLGoT8dTDKLEipUBUvKyguPq66CGwVTe2g6zy/nUSXeVtFrUsIa msLTi8FqhqUo5lodNvGMRf8qqmuqcvnXoiQwIocF92jtsFy14bhiFY+n4HfcFNjj GOKwqBZYQUwY/VVb090efc7RfS9c7uwABJSBelSoxg3AGZriwjGy7Pw5aSKGgVYi inqL1eR6qwPP6z7CgQWM99soP+zwybFZmnQrsD9SniRBI4rtAat8Ih5jQFaSUFUQ lk2w0NQBRFN88/uR2IJ2GWuIlQ74WeJ+QnCqVuQ59tV5zw90wqSmLzngfPD057Dv JjNuhk0UyXVtpIg3lRtd4810ppNSTe33b9OM4O2H846W/crju5oDRNDHcflUXcwm Rmn5ho1rb5QVzDVejJbgwidnUInSgJ9PZcvXQ/RJVZPhpgsBzAY9pQexG1G3hviw y9UDrt6KP6bF9tHjmolmtdIes9Pj0c4dN6/Rdj4HS4hIQ/GDar0tnwvOvtfUctNL orJlBsA6GeMmDVXKkR0ytOCWRYqWWbyt8g70RVKQJfuHX7/hGyAQPaQ2/u4mQhC2 aevYfbNJMj0VDfGz81HDBKFtkc5n+Ite8l157dHEl2LEabkOkRdNVcn7SNbOvZmd ZCSnZ31h7woGfNho =D5B/ -----END PGP SIGNATURE----- Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig * tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits) kconfig: use sym_get_choice_menu() in sym_check_prop() rapidio: remove choice for enumeration kconfig: lxdialog: remove initialization with A_NORMAL kconfig: m/nconf: merge two item_add_str() calls kconfig: m/nconf: remove dead code to display value of bool choice kconfig: m/nconf: remove dead code to display children of choice members kconfig: gconf: show checkbox for choice correctly kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal Makefile: remove redundant tool coverage variables kbuild: provide reasonable defaults for tool coverage modules: Drop the .export_symbol section from the final modules kconfig: use menu_list_for_each_sym() in sym_check_choice_deps() kconfig: use sym_get_choice_menu() in conf_write_defconfig() kconfig: add sym_get_choice_menu() helper kconfig: turn defaults and additional prompt for choice members into error kconfig: turn missing prompt for choice members into error kconfig: turn conf_choice() into void function kconfig: use linked list in sym_set_changed() kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED kconfig: gconf: remove debug code ...	2024-05-18 12:39:20 -07:00
Linus Torvalds	1b0aabcc9a	vfs-6.10.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L 0+DNepg4R8PZOHH371eSSsLNRCUCkAs= =SVsU -----END PGP SIGNATURE----- Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED) - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup) - Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well - Optimize seq_puts() - Simplify __seq_puts() - Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places - Annotate struct file_handle with __counted_by() and use struct_size() - Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion) - Folio-sophize aio - Export the subvolume id in statx() for both btrfs and bcachefs - Relax linkat(AT_EMPTY_PATH) requirements - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup() equality replacing kcmp() Cleanups: - Compile out swapfile inode checks when swap isn't enabled - Use (1 << n) notation for FMODE_ bitshifts for clarity - Remove redundant variable assignment in fs/direct-io - Cleanup uses of strncpy in orangefs - Speed up and cleanup writeback - Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places - Add kernel-doc comments to proc_create_net_data_write() - Don't needlessly read dentry->d_flags twice Fixes: - Fix out-of-range warning in nilfs2 - Fix ecryptfs overflow due to wrong encryption packet size calculation - Fix overly long line in xfs file_operations (follow-up to FMODE_* cleanup) - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup) - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup) - Fix stable offset api to prevent endless loops - Fix afs file server rotations - Prevent xattr node from overflowing the eraseblock in jffs2 - Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions" * tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) afs: Fix fileserver rotation getting stuck selftests: add F_DUPDFD_QUERY selftests fcntl: add F_DUPFD_QUERY fcntl() file: add fd_raw cleanup class fs: WARN when f_count resurrection is attempted seq_file: Simplify __seq_puts() seq_file: Optimize seq_puts() proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation fs: Create anon_inode_getfile_fmode() xfs: don't call xfs_file_open from xfs_dir_open xfs: drop fop_flags for directories xfs: fix overly long line in the file_operations shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() jffs2: prevent xattr node from overflowing the eraseblock vfs, swap: compile out IS_SWAPFILE() on swapless configs vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements fs/direct-io: remove redundant assignment to variable retval fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading ...	2024-05-13 11:40:06 -07:00
Masahiro Yamada	b1992c3772	kbuild: use $(src) instead of $(srctree)/$(src) for source directory Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for checked-in source files. It is merely a convention without any functional difference. In fact, $(obj) and $(src) are exactly the same, as defined in scripts/Makefile.build: src := $(obj) When the kernel is built in a separate output directory, $(src) does not accurately reflect the source directory location. While Kbuild resolves this discrepancy by specifying VPATH=$(srctree) to search for source files, it does not cover all cases. For example, when adding a header search path for local headers, -I$(srctree)/$(src) is typically passed to the compiler. This introduces inconsistency between upstream and downstream Makefiles because $(src) is used instead of $(srctree)/$(src) for the latter. To address this inconsistency, this commit changes the semantics of $(src) so that it always points to the directory in the source tree. Going forward, the variables used in Makefiles will have the following meanings: $(obj) - directory in the object tree $(src) - directory in the source tree (changed by this commit) $(objtree) - the top of the kernel object tree $(srctree) - the top of the kernel source tree Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced with $(src). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>	2024-05-10 04:34:52 +09:00

... 7 8 9 10 11 ...

9520 Commits