Commit Graph

9520 Commits

Author SHA1 Message Date
Wengang Wang
09dab6ce02 xfs: free up mp->m_free[0].count in error case
In xfs_init_percpu_counters(), the memory for mp->m_free[0].count wasn't
freed in the error case. Free it in this patch.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Fixes: 712bae9663 ("xfs: generalize the freespace and reserved blocks handling")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14 15:37:49 +02:00
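As an illustration, a minimal sketch of the unwind this fix implies (the
XC_FREE_NR bound and the m_free layout are assumed from the surrounding
code, not quoted from the patch):

	/* destroy every free counter initialised so far, m_free[0] included */
	static void xfs_destroy_free_counters(struct xfs_mount *mp)
	{
		int i;

		for (i = 0; i < XC_FREE_NR; i++)
			percpu_counter_destroy(&mp->m_free[i].count);
	}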
Christoph Hellwig
1c7161ef01 xfs: remove the EXPERIMENTAL warning for pNFS
The pNFS layout support has been around for 10 years without major
issues, drop the EXPERIMENTAL warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14 12:42:12 +02:00
Darrick J. Wong
ca43b74ac3 xfs: remove some EXPERIMENTAL warnings
Online fsck was finished a year ago, in Linux 6.10.  The exchange-range
syscall and parent pointers were merged in the same cycle.  None of
these have encountered any serious errors in the year that they've been
in the kernel (or the many, many years they've been under development), so
let's drop the shouty warnings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14 12:42:05 +02:00
Carlos Maiolino
6e7d71b3a0 Merge branch 'atomic_writes-6.16' into xfs-6.16-merge
Required update due to conflict with patch:
	xfs: stop using set_blocksize

 Conflicts:
	fs/xfs/xfs_buf.c

Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14 12:38:53 +02:00
Zizhi Wo
c0a5c40847 xfs: Remove deprecated xfs_bufd sysctl parameters
Commit 64af7a6ea5 ("xfs: remove deprecated sysctls") removed the
deprecated xfsbufd-related sysctl interface, but forgot to delete the
corresponding parameters: "xfs_buf_timer" and "xfs_buf_age".

This patch removes those parameters and makes no other changes.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14 12:35:24 +02:00
Darrick J. Wong
ea31bdece2 xfs: stop using set_blocksize
XFS has its own buffer cache for metadata that uses submit_bio, which
means that it no longer uses the block device pagecache for anything.
Create a more lightweight helper that runs the blocksize checks and
flushes dirty data, and use that instead.  No more truncating the
pagecache because XFS does not use it or care about it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14 12:22:37 +02:00
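A minimal sketch of what such a lightweight helper could look like, with
hypothetical names and checks (validate the block size, flush dirty
data, leave the pagecache alone):

	static int xfs_flush_and_validate_bdev(struct block_device *bdev,
			unsigned int blocksize)
	{
		/* the sanity checks set_blocksize() used to do for us */
		if (blocksize < bdev_logical_block_size(bdev) ||
		    !is_power_of_2(blocksize))
			return -EINVAL;

		/* flush dirty pagecache data, but never truncate it */
		return sync_blockdev(bdev);
	}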
Fedor Pchelkin
f3def8270c sort.h: hoist cmp_int() into generic header file
Deduplicate the same functionality implemented in several places by
moving the cmp_int() helper macro into linux/sort.h.

The macro performs a three-way comparison of its arguments, which is
mostly useful in different sorting strategies and algorithms.

Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Coly Li <colyli@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11 17:54:12 -07:00
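For illustration, cmp_int() is a difference of two boolean comparisons,
so it yields -1, 0 or 1 without the overflow risk of plain subtraction;
the sort() comparator below is a hypothetical example, not part of the
patch:

	#include <linux/sort.h>

	static int cmp_u64(const void *a, const void *b)
	{
		/* cmp_int(l, r) evaluates to -1, 0 or 1 */
		return cmp_int(*(const u64 *)a, *(const u64 *)b);
	}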
Christian Brauner
1af3331764
super: add filesystem freezing helpers for suspend and hibernate
Allow the power subsystem to support filesystem freeze for
suspend and hibernate.

For some kernel subsystems it is paramount that they are guaranteed that
they are the owner of the freeze to avoid any risk of deadlocks. This is
the case for the power subsystem. Enable it to recognize whether it did
actually freeze the filesystem.

If userspace has 10 filesystems and suspend/hibernate manages to freeze 5
and then fails on the 6th for whatever odd reason (current or future),
then power needs to undo the freeze of the first 5 filesystems. It can't
just walk the list again because, while it's unlikely that a new
filesystem got added in the meantime, it still cannot tell which
filesystems the power subsystem actually managed to get a freeze
reference count on that needs to be dropped during thaw.

There are various ways out of this ugliness. For example, record the
filesystems the power subsystem managed to freeze on a temporary list in
the callbacks and then walk that list backwards during thaw to undo the
freezing; or make sure that the power subsystem exclusively freezes the
things it can freeze and marks such filesystems as being owned by power
for the duration of the suspend or resume cycle. I opted for the latter
as that seemed the cleaner thing to do, even if it means more code
changes.

If hibernation races with filesystem freezing (e.g. DM reconfiguration),
then hibernation need not freeze a filesystem because it's already
frozen, but userspace may then thaw the filesystem before hibernation
actually happens.

If the race happens the other way around, DM reconfiguration may
unexpectedly fail with EBUSY.

So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
cannot be undone by any of the other concurrent freezers.

Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 12:41:02 +02:00
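A hedged sketch of how the power subsystem might take such an exclusive
freeze (FREEZE_EXCL is named by the commit; the exact freeze_super()
calling convention here is an assumption):

	/* freeze one superblock as the exclusive owner; concurrent
	 * freezers may nest, but cannot undo this freeze on thaw */
	static int power_freeze_sb(struct super_block *sb)
	{
		return freeze_super(sb, FREEZE_HOLDER_KERNEL | FREEZE_EXCL);
	}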
Carlos Maiolino
4abb9052a7 large atomic writes for xfs [v12.1]
Merge tag 'atomic-writes-6.16_2025-05-07' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into atomic_writes

large atomic writes for xfs [v12.1]

Currently atomic write support for xfs is limited to writing a single
block as we have no way to guarantee alignment and that the write covers
a single extent.

This series introduces a software-based method for issuing atomic
writes.

The software-based method is used as a fallback for when attempting to
issue an atomic write over misaligned or multiple extents.

For xfs, this support is based on reflink CoW support.

The basic idea of this CoW method is to alloc a range in the CoW fork,
write the data, and atomically update the mapping.

Initial MySQL performance testing has shown this method to perform OK.
However, there we are only using 16K atomic writes (and a 4K block size),
so typically - and thankfully - this software fallback method won't be
used often.

For other FSes which want large atomic writes and don't support CoW, I
think that they can follow the example in [0].

Catherine is currently working on further xfstests for this feature,
which we hope to share soon.

About 17/17, maybe it can be omitted as there is no strong demand to have
it included.

Based on bfecc4091e (xfs/next-rc, xfs/for-next) xfs: allow ro mounts
if rtdev or logdev are read-only

[0] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/

Differences to v12:
- add more review tags

Differences to v11:
- split "xfs: ignore ..." patch
- inline sync_blockdev() in xfs_alloc_buftarg() (Christoph)
- fix xfs_calc_rtgroup_awu_max() for 0 block count (Darrick)
- Add RB tag from Christoph (thanks!)

Differences to v10:
- add "xfs: only call xfs_setsize_buftarg once ..." by Darrick
- symbol renames in "xfs: ignore HW which cannot..." by Darrick

Differences to v9:
- rework "ignore HW which cannot .." patch by Darrick
- Ensure power-of-2 max always for unit min/max when no HW support

With a bit of luck, this should all go splendidly.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
2025-05-09 09:15:39 +02:00
Darrick J. Wong
4528b90527 xfs: allow sysadmins to specify a maximum atomic write limit at mount time
Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write.  If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.

The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size).  We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.

The actual software atomic write max is still computed based on
tr_atomic_ioend, the same way it has been for the past few commits.  Note also
that xfs_calc_atomic_write_log_geometry is non-static because mkfs will
need that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:33 -07:00
John Garry
9dffc58f23 xfs: update atomic write limits
Update the limits returned from xfs_get_atomic_write_{min,max,max_opt}().

No reflink support always means no CoW-based atomic writes.

For updating xfs_get_atomic_write_min(), we support the blocksize only,
and that depends on HW or reflink support.

For updating xfs_get_atomic_write_max(), with no reflink we are limited
to the blocksize, but only if there is HW support. Otherwise we are
limited to the combined limit in mp->m_atomic_write_unit_max.

For updating xfs_get_atomic_write_max_opt(), ultimately we are limited
by the bdev atomic write limit. If xfs_get_atomic_write_max() does not
report > 1x blocksize, then just continue to report 0 as before.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: update comments in the helper functions]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:32 -07:00
John Garry
0c438dcc31 xfs: add xfs_calc_atomic_write_unit_max()
Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.

The limit of a CoW-based atomic write will be the limit of the number of
log items which can fit into a single transaction.

In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.

Function xfs_atomic_write_logitems() is added to find the limit on the
number of log items which can fit in a single transaction.

Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism.  In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.

[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:32 -07:00
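The "greatest power-of-two factor" has a neat closed form: the lowest
set bit of agsize is the largest power of two that divides it evenly. A
sketch with a hypothetical helper name:

	static inline xfs_agblock_t max_pow2_factor(xfs_agblock_t agsize)
	{
		/* e.g. 0b10110100 -> 0b100, the largest 2^k dividing agsize */
		return agsize & -agsize;
	}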
John Garry
9baeac3ab1 xfs: add xfs_file_dio_write_atomic()
Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.

Now HW offload will not be required to support atomic writes and is
an optional feature.

CoW-based atomic writes will be supported with out-of-place writes and
atomic extent remapping.

Either mode of operation may be used for an atomic write, depending on the
circumstances.

The preferred method is HW offload as it will be faster. If HW offload is
not available then we always use the CoW-based method.  If HW offload is
available but not possible to use, then again we use the CoW-based method.

Even when available, HW offload is not possible when the write length
exceeds the HW offload limit, the write spans multiple extents, disk
blocks are unaligned, etc.

Apart from the write exceeding the HW offload limit, the other
conditions for HW offload usage can only be detected in the iomap
handling for the write. As such, we use a fallback method to issue the
write if we detect in the ->iomap_begin() handler that HW offload is not
possible. The special code -ENOPROTOOPT is returned from ->iomap_begin()
to signal that HW offload is not possible.

In other words, atomic writes are supported on any filesystem that can
perform out of place write remapping atomically (i.e. reflink) up to
some fairly large size.  If the conditions are right (a single correctly
aligned overwrite mapping) then the filesystem will use any available
hardware support to avoid the filesystem metadata updates.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:32 -07:00
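A compact sketch of the resulting dispatch; both helper names are
hypothetical, and -ENOPROTOOPT is the fallback signal described above:

	static ssize_t xfs_dio_write_atomic_sketch(struct kiocb *iocb,
			struct iov_iter *from)
	{
		/* prefer HW offload: no extent remapping needed */
		ssize_t ret = xfs_atomic_write_hw(iocb, from);

		/* ->iomap_begin() decided HW offload cannot work here */
		if (ret == -ENOPROTOOPT)
			ret = xfs_atomic_write_cow(iocb, from);
		return ret;
	}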
John Garry
b1e09178b7 xfs: commit CoW-based atomic writes atomically
When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Note that there is a limit on the number of log intent items which can
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which means only 16 possible extent unmaps, which is quite
small.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:32 -07:00
John Garry
11ab31909d xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
When large atomic writes (> 1x FS block) are supported, there will be
various occasions when HW offload may not be possible.

Such instances include:
- unaligned extent mapping wrt write length
- extent mappings which do not cover the full write, e.g. the write spans
  sparse or mixed-mapping extents
- the write length is greater than HW offload can support
- no hardware support at all

In those cases, we need to fall back to the CoW-based atomic write mode.
For this, report the special code -ENOPROTOOPT to inform the caller that
the HW offload-based method is not possible.

In addition to the occasions mentioned, if the write covers an
unallocated range, we again judge that we need to rely on the CoW-based
method when we would need to allocate anything more than 1x block. This
is because if we allocate fewer blocks than are required for the write,
then again the HW offload-based method would not be possible. So we are
taking a pessimistic approach to writes covering unallocated space.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: various cleanups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:32 -07:00
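Sketched as code, the checks might look as follows; every condition
helper is a hypothetical placeholder for one of the cases listed above:

	/* any condition ruling out HW offload punts to the CoW method */
	if (wlen > xfs_inode_hw_awu_max(ip) ||		/* beyond HW limit */
	    !imap_covers_write(&imap, offset_fsb, end_fsb) || /* sparse/mixed */
	    !imap_write_aligned(&imap, wlen))		/* unaligned mapping */
		return -ENOPROTOOPT;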
John Garry
bd1d2c21d5 xfs: add xfs_atomic_write_cow_iomap_begin()
For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork
support.

Add ->iomap_begin() callback xfs_atomic_write_cow_iomap_begin() to create
staging mappings in the CoW fork for atomic write updates.

The general steps in the function are as follows:
- find extent mapping in the CoW fork for the FS block range being written
	- if part or full extent is found, proceed to process found extent
	- if no extent found, map in new blocks to the CoW fork
- convert unwritten blocks in extent if required
- update iomap extent mapping and return

The bulk of this function is quite similar to the processing in
xfs_reflink_allocate_cow(), where we try to find an extent mapping; if
none exists, then allocate a new extent in the CoW fork, convert unwritten
blocks, and return a mapping.

Performance testing has shown the XFS_ILOCK_EXCL locking to be quite
a bottleneck, so this is an area which could be optimised in future.

Christoph Hellwig contributed almost all of the code in
xfs_atomic_write_cow_iomap_begin().

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add a new xfs_can_sw_atomic_write to convey intent better]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:31 -07:00
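The listed steps condensed into a sketch; every helper below is
hypothetical shorthand for the operations described, not the real call
chain:

	error = cow_fork_lookup(ip, offset_fsb, count_fsb, &cmap);
	if (error == -ENOENT)				/* no extent found */
		error = cow_fork_alloc(ip, offset_fsb, count_fsb, &cmap);
	if (!error && cmap.br_state == XFS_EXT_UNWRITTEN)
		error = cow_fork_convert(ip, &cmap);	/* convert unwritten */
	if (!error)
		error = cow_fill_iomap(ip, iomap, &cmap); /* return mapping */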
John Garry
0ea88ed47b xfs: refine atomic write size check in xfs_file_write_iter()
Currently the size of an atomic write allowed is fixed at the blocksize.

To start to lift this restriction, partly refactor
xfs_report_atomic_write() into helpers -
xfs_get_atomic_write_{min,max}() - and use those helpers to find the
per-inode atomic write limits and check according to those.

Also add xfs_get_atomic_write_max_opt() to return the optimal limit, and
just return 0 since large atomics aren't supported yet.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:31 -07:00
John Garry
514df14fae xfs: refactor xfs_reflink_end_cow_extent()
Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.

This refactoring will be used in future when it is required to commit
a range of extents as a single transaction, similar to how it was done
pre-commit d6f215f359.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:31 -07:00
John Garry
6baf4cc47a xfs: allow block allocator to take an alignment hint
Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.

This will be useful for atomic writes to ensure that we are not
allocated extents which are unsuitable for atomic writes.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:31 -07:00
Darrick J. Wong
85bf2dfa3f xfs: ignore HW which cannot atomic write a single block
Currently only HW which can write at least 1x block is supported.

For supporting atomic writes > 1x block, a CoW-based method will also be
used, and this will not be restricted to using HW which can write >= 1x
block.

However for deciding if HW-based atomic writes can be used, we need to
start adding checks for write length < HW min, which complicates the
code.  Indeed, a statx field similar to unit_max_opt should also be
added for this minimum, which is undesirable.

HW which can only write > 1x blocks would be uncommon and quite weird,
so let's just not support it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-07 14:25:31 -07:00
Darrick J. Wong
805f898812 xfs: add helpers to compute transaction reservation for finishing intent items
In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions.  These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:30 -07:00
Darrick J. Wong
6d1bdc7391 xfs: add helpers to compute log item overhead
Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log.  These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:30 -07:00
Darrick J. Wong
13c7c54bd0 xfs: separate out setting buftarg atomic writes limits
Separate out setting buftarg atomic writes limits into a dedicated
function, xfs_configure_buftarg_atomic_writes(), to keep the specific
functionality self-contained.

For naming consistency, rename xfs_setsize_buftarg() ->
xfs_configure_buftarg().

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: separate out from patch "xfs: ignore HW which ..."]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-07 14:25:30 -07:00
John Garry
5af9f55084 xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomic_write()
In future we will want to be able to check if specifically HW offload-based
atomic writes are possible, so rename xfs_inode_can_atomicwrite() ->
xfs_inode_can_hw_atomicwrite().

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: add an underscore to be consistent with everything else]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:30 -07:00
Darrick J. Wong
84270a1a30 xfs: only call xfs_setsize_buftarg once per buffer target
It's silly to call xfs_setsize_buftarg from xfs_alloc_buftarg with the
block device LBA size because we don't need to ask the block layer to
validate a geometry number that it provided us.  Instead, set the
preliminary bt_meta_sector* fields to the LBA size in preparation for
reading the primary super.

However, we still want to flush and invalidate the pagecache for all
three block devices before we start reading metadata from those devices,
so call sync_blockdev() per bdev in xfs_alloc_buftarg().

This will enable a subsequent patch to validate hw atomic write geometry
against the filesystem geometry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
[jpg: call sync_blockdev() from xfs_alloc_buftarg()]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-07 14:25:30 -07:00
John Garry
5d894321c4 fs: add atomic write unit max opt to statx
XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.

Specifically a new method of operation based in FS atomic extent remapping
will be supported in addition to the current HW offload-based method.

The FS method will generally perform appreciably slower than the
HW-offload method. However, the FS method will typically be able to
contribute to achieving a larger atomic write unit max limit.

XFS will support a hybrid mode, where the HW offload method will be used
when possible, i.e. HW offload is used when the length of the write is
supported, and at other times FS-based atomic writes will be used.

As such, there is an atomic write length at which the user may experience
appreciably slower performance.

Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.

When zero, it means that there is no such performance boundary.

The masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field.
This is OK for older kernels which don't support this new field, as they
would already report 0 in this field (from the zeroing in cp_statx()).
Furthermore, those older kernels don't support large atomic writes -
apart from block fops, but there the performance is consistent for
atomic writes in the range [unit min, unit max].

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:30 -07:00
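From userspace the new field is read via statx() with the mask named
above; a minimal example, assuming kernel headers new enough to define
stx_atomic_write_unit_max_opt:

	#include <fcntl.h>
	#include <linux/stat.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		struct statx stx = { 0 };

		if (argc < 2)
			return 1;
		if (syscall(SYS_statx, AT_FDCWD, argv[1], 0,
			    STATX_WRITE_ATOMIC, &stx))
			return 1;
		if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)
			/* 0 means no performance boundary below unit max */
			printf("unit_max_opt: %u\n",
			       stx.stx_atomic_write_unit_max_opt);
		return 0;
	}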
Christoph Hellwig
5ced480d48 xfs: simplify building the bio in xlog_write_iclog
Use the bio_add_virt_nofail and bio_add_vmalloc helpers to abstract
away the details of the memory allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Christoph Hellwig
d486bbecc9 xfs: simplify xfs_rw_bdev
Delegate to bdev_rw_virt when operating on non-vmalloc memory and use
bio_add_vmalloc_chunk to insulate xfs from the details of adding vmalloc
memory to a bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Christoph Hellwig
9dccf2aa6e xfs: simplify xfs_buf_submit_bio
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it and use bio_add_vmalloc
to insulate xfs from the details of adding vmalloc memory to a bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
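The conversion pattern, roughly (bio_add_virt_nofail is the helper named
in the commit; the wrapper itself is illustrative):

	static void add_kernel_buffer(struct bio *bio, void *data,
			unsigned int len)
	{
		/* was: __bio_add_page(bio, virt_to_page(data), len,
		 *		       offset_in_page(data)); */
		bio_add_virt_nofail(bio, data, len);
	}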
Dave Chinner
23be716b1c xfs: don't assume perags are initialised when trimming AGs
When running fstrim immediately after mounting a V4 filesystem,
the fstrim fails to trim all the free space in the filesystem. It
only trims the first extent in the by-size free space tree in each
AG and then returns. If a second fstrim is then run, it runs
correctly and the entire free space in the filesystem is iterated
and discarded correctly.

The problem lies in the setup of the trim cursor - it assumes that
pag->pagf_longest is valid without either reading the AGF first or
checking if xfs_perag_initialised_agf(pag) is true or not.

As a result, when a filesystem is mounted without reading the AGF
(e.g. a clean mount on a v4 filesystem) and the first operation is a
fstrim call, pag->pagf_longest is zero and so the free extent search
starts at the wrong end of the by-size btree and exits after
discarding the first record in the tree.

Fix this by deferring the initialisation of tcur->count to after
we have locked the AGF and guaranteed that the perag is properly
initialised. We trigger this on tcur->count == 0 after locking the
AGF, as this will only occur on the first call to
xfs_trim_gather_extents() for each AG. If we need to iterate,
tcur->count will be set to the length of the record we need to
restart at, so we can use this to ensure we only sample a valid
pag->pagf_longest value for the iteration.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Fixes: 89cfa89960 ("xfs: reduce AGF hold times during fstrim operations")
Cc: <stable@vger.kernel.org> # v6.6
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-05 09:39:19 +02:00
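The shape of the fix, sketched (tcur->count == 0 marks the first
xfs_trim_gather_extents() call for the AG, as described above):

	error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
	if (error)
		return error;

	/* only sample pagf_longest once the AGF is locked and the
	 * perag is guaranteed to be initialised */
	if (tcur->count == 0)
		tcur->count = pag->pagf_longest;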
Hans Holmberg
bfecc4091e xfs: allow ro mounts if rtdev or logdev are read-only
Allow read-only mounts on rtdevs and logdevs that are marked as
read-only and make sure those mounts can't be remounted read-write.

Use the sb_open_mode helper to make sure that we don't try to open
devices with write access enabled for read-only mounts.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-30 20:53:52 +02:00
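The effect of sb_open_mode is roughly the following; this is a sketch
using the generic block-device open API, not necessarily the exact call
site:

	/* SB_RDONLY in sb_flags yields BLK_OPEN_READ only, so a ro
	 * mount never requests write access on rtdev/logdev */
	bdev_file = bdev_file_open_by_path(path, sb_open_mode(sb_flags),
					   sb, &fs_holder_ops);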
Darrick J. Wong
5088aad3d3 xfs: stop using set_blocksize
XFS has its own buffer cache for metadata that uses submit_bio, which
means that it no longer uses the block device pagecache for anything.
Create a more lightweight helper that runs the blocksize checks and
flushes dirty data, and use that instead.  No more truncating the
pagecache because XFS does not use it or care about it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-28 11:34:14 +02:00
Carlos Maiolino
bd7c193319 XFS: fix zoned gc threshold math for 32-bit arches
xfs_zoned_need_gc makes use of mult_frac() to calculate the threshold
for triggering the zoned garbage collector but, as it turns out,
mult_frac() doesn't work properly with 64-bit data types, and this
caused build failures on some 32-bit architectures.

Fix this by essentially open coding mult_frac() in a 64-bit friendly
way.

Notice we don't need to bother with counters underflow here because
xfs_estimate_freecounter() will always return a positive value, as it
leverages percpu_counter_read_positive to read such counters.

Fixes: 845abeb1f0 ("xfs: add tunable threshold parameter for triggering zone GC")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202504181233.F7D9Atra-lkp@intel.com/
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-22 16:03:14 +02:00
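The gist of the open-coding, as a sketch of the idea rather than the
exact patch: divide before multiplying, and use the div_u64 helpers so
32-bit architectures never see a native 64-bit division.

	static inline u64 mult_frac64(u64 x, u32 num, u32 den)
	{
		u32 rem;
		u64 quot = div_u64_rem(x, den, &rem);

		/* x * num / den without overflowing the multiplication */
		return quot * num + div_u64((u64)rem * num, den);
	}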
Darrick J. Wong
c6f1401b1d xfs: fix fsmap for internal zoned devices
Filesystems with an internal zoned rt section use xfs_rtblock_t values
that are relative to the start of the data device.  When fsmap reports
on internal rt sections, it reports the space used by the data section
as "OWN_FS".

Unfortunately, the logic for resuming a query isn't quite right, so
xfs/273 fails because it stress-tests GETFSMAP with a single-record
buffer.  If we enter the "report fake space as OWN_FS" block with a
nonzero key[0].fmr_length, we should add that to key[0].fmr_physical
and recheck if we still need to emit the fake record.  We should /not/
just return 0 from the whole function because that prevents all rmap
record iteration.

If we don't enter that block, the resumption is still wrong.
keys[*].fmr_physical is a reflection of what we copied out to userspace
on a previous query, which means that it already accounts for rgstart.
It is not correct to add rtstart_daddr when computing start_rtb or
end_rtb, so stop that.

While we're at it, add a xfs_has_zoned to make it clear that this is a
zoned filesystem thing.

Fixes: e50ec7fac8 ("xfs: enable fsmap reporting for internal RT devices")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-16 12:56:10 +02:00
Zhang Xianwei
1c406526bd xfs: Fix spelling mistake "drity" -> "dirty"
There is a spelling mistake in fs/xfs/xfs_log.c. Fix it.

Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-16 10:43:23 +02:00
Darrick J. Wong
a37b3b9c3c xfs: compute buffer address correctly in xmbuf_map_backing_mem
Prior to commit e614a00117, xmbuf_map_backing_mem relied on
folio_file_page to return the base page for the xmbuf's loff_t in the
xfile, and set b_addr to the page_address of that base page.

Now that folio_file_page has been removed from xmbuf_map_backing_mem, we
always set b_addr to the folio_address of the folio.  This is correct
for the situation where the folio size matches the buffer size, but it's
totally wrong if tmpfs uses large folios.  We need to use
offset_in_folio here.

Found via xfs/801, which demonstrated evidence of corruption of an
in-memory rmap btree block right after initializing an adjacent block.

Fixes: e614a00117 ("xfs: cleanup mapping tmpfs folios into the buffer cache")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14 11:22:52 +02:00
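The essence of the fix is a one-liner, shown here as a sketch:

	/* address the buffer within a possibly-large folio instead of
	 * assuming it starts at the folio base */
	bp->b_addr = folio_address(folio) + offset_in_folio(folio, pos);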
Hans Holmberg
845abeb1f0 xfs: add tunable threshold parameter for triggering zone GC
Presently we start garbage collection late - when we start running
out of free zones to backfill max_open_zones. This is a reasonable
default as it minimizes write amplification. The longer we wait,
the more blocks are invalidated and reclaim cost less in terms
of blocks to relocate.

Starting this late however introduces a risk of GC being outcompeted
by user writes. If GC can't keep up, user writes will be forced to
wait for free zones with high tail latencies as a result.

This is not a problem under normal circumstances, but if fragmentation
is bad and user write pressure is high (multiple full-throttle
writers) we will "bottom out" of free zones.

To mitigate this, introduce a zonegc_low_space tunable that lets the
user specify a percentage of the unused space that GC should keep
available for writing. A high value will reclaim more of
the space occupied by unused blocks, creating a larger buffer against
write bursts.

This comes at a cost as write amplification is increased. To
illustrate this using a sample workload, setting zonegc_low_space to
60% avoids high (500ms) max latencies while increasing write
amplification by 15%.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14 10:41:33 +02:00
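A hedged sketch of the threshold test; the names are hypothetical and
the actual computation in the patch may differ:

	/* start GC once the space GC could reclaim, scaled by the
	 * tunable percentage, exceeds what is left for writing */
	static bool zonegc_wanted(u64 free, u64 reclaimable, u32 low_space_pct)
	{
		return free < div_u64(reclaimable * low_space_pct, 100);
	}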
Christoph Hellwig
a1a56f541a xfs: mark xfs_buf_free as might_sleep()
xfs_buf_free can call vunmap, which can sleep.  The vunmap path is an
unlikely one, so add might_sleep to ensure calling xfs_buf_free from
atomic context gets caught more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14 10:24:30 +02:00
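The annotation itself is tiny; the point is that might_sleep() complains
about atomic-context callers even when the sleeping vunmap() path is not
actually taken (sketch):

	void xfs_buf_free(struct xfs_buf *bp)
	{
		might_sleep();	/* vunmap() below may sleep */

		/* ... unmap (possibly via vunmap) and free the buffer ... */
	}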
Christoph Hellwig
b73e05281c xfs: remove the leftover xfs_{set,clear}_li_failed infrastructure
Marking a log item as failed kept a buffer reference around for
resubmission of inode and dquot items.

For inode items commit 298f7bec50 ("xfs: pin inode backing buffer to
the inode log item") started pinning the inode item buffers
unconditionally and removed the need for this.  Later commit acc8f8628c
("xfs: attach dquot buffer to dquot log item buffer") did the same for
dquot items but didn't fully clean up the xfs_clear_li_failed side
for them.  Stop adding the extra pin for dquot items and remove the
helpers.

This happens to fix a call to xfs_buf_free with the AIL lock held,
which would be incorrect for the unlikely case freeing the buffer
ends up calling vfree.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14 10:24:30 +02:00
NeilBrown
06c567403a
Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
try_lookup_noperm() and d_hash_and_lookup() are nearly identical.  The
former does some validation of the name where the latter doesn't.
Outside of the VFS that validation is likely valuable, and having only
one exported function for this task is certainly a good idea.

So make d_hash_and_lookup() local to VFS files and change all other
callers to try_lookup_noperm().  Note that the arguments are swapped.

Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250319031545.2999807-6-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08 11:24:41 +02:00
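The conversion outside the VFS then looks like this; note the swapped
arguments, as the commit points out:

	/* was: dentry = d_hash_and_lookup(parent, &name); */
	dentry = try_lookup_noperm(&name, parent);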
NeilBrown
fa6fe07d15
VFS: rename lookup_one_len family to lookup_noperm and remove permission check
The lookup_one_len family of functions is (now) only used internally by
a filesystem on itself either
- in a context where permission checking is irrelevant such as by a
  virtual filesystem populating itself, or xfs accessing its ORPHANAGE
  or dquota accessing the quota file; or
- in a context where a permission check (MAY_EXEC on the parent) has just
  been performed, such as a network filesystem finding a "silly-rename"
  file in the same directory.  This is also the context after the
  _parentat() functions where currently lookup_one_qstr_excl() is used.

So the permission check is pointless.

The name "one_len" is unhelpful in understanding the purpose of these
functions and should be changed.  Most of the callers pass the len as
"strlen()" so using a qstr and QSTR() can simplify the code.

This patch renames these functions (including lookup_positive_unlocked(),
which is part of the family despite the name) to have a name based on
"lookup_noperm".  They are changed to receive a 'struct qstr' instead
of separate name and len.  In a few cases the use of QSTR() results in a
new call to strlen().

try_lookup_noperm() takes a pointer to a qstr instead of the whole
qstr.  This is consistent with d_hash_and_lookup() (which is nearly
identical) and useful for lookup_noperm_unlocked().

The new lookup_noperm_common() doesn't take a qstr yet.  That will be
tidied up in a subsequent patch.

Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-5-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08 11:24:36 +02:00
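A typical before/after for callers that passed strlen() explicitly
(sketch):

	/* was: dentry = lookup_one_len(name, parent, strlen(name)); */
	dentry = lookup_noperm(&QSTR(name), parent);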
Eric Biggers
b261d22220 lib/crc: remove CONFIG_LIBCRC32C
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly.  Then remove
LIBCRC32C.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-04 11:31:42 -07:00
Linus Torvalds
d6b02199cd

Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - The series "powerpc/crash: use generic crashkernel reservation" from
   Sourabh Jain changes powerpc's kexec code to use more of the generic
   layers.

 - The series "get_maintainer: report subsystem status separately" from
   Vlastimil Babka makes some long-requested improvements to the
   get_maintainer output.

 - The series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizes the refcounting in the
   ucount code.

 - The series "reboot: support runtime configuration of emergency
   hw_protection action" from Ahmad Fatoum improves the ability for a
   driver to perform an emergency system shutdown or reboot.

 - The series "Converge on using secs_to_jiffies() part two" from Easwar
   Hariharan performs further migrations from msecs_to_jiffies() to
   secs_to_jiffies().

 - The series "lib/interval_tree: add some test cases and cleanup" from
   Wei Yang permits more userspace testing of kernel library code, adds
   some more tests and performs some cleanups.

 - The series "hung_task: Dump the blocking task stacktrace" from Masami
   Hiramatsu arranges for the hung_task detector to dump the stack of
   the blocking task and not just that of the blocked task.

 - The series "resource: Split and use DEFINE_RES*() macros" from Andy
   Shevchenko provides some cleanups to the resource definition macros.

 - Plus the usual shower of singleton patches - please see the
   individual changelogs for details.

* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  mailmap: consolidate email addresses of Alexander Sverdlin
  fs/procfs: fix the comment above proc_pid_wchan()
  relay: use kasprintf() instead of fixed buffer formatting
  resource: replace open coded variant of DEFINE_RES()
  resource: replace open coded variants of DEFINE_RES_*_NAMED()
  resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
  resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
  samples: add hung_task detector mutex blocking sample
  hung_task: show the blocker task if the task is hung on mutex
  kexec_core: accept unaccepted kexec segments' destination addresses
  watchdog/perf: optimize bytes copied and remove manual NUL-termination
  lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
  lib/interval_tree: skip the check before go to the right subtree
  lib/interval_tree: add test case for span iteration
  lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
  lib/rbtree: add random seed
  lib/rbtree: split tests
  lib/rbtree: enable userland test suite for rbtree related data structure
  checkpatch: describe --min-conf-desc-length
  scripts/gdb/symbols: determine KASLR offset on s390
  ...
2025-04-01 10:06:52 -07:00
Linus Torvalds
eb0ece1602

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively minor cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found when using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes things better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DAMON/DAMOS to filter by the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while running our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatory cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallelization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documentation
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permitting the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatory refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Linus Torvalds
c148bc7535 XFS - new code for 6.15
Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCZ+Eg9QAKCRBcsMJ8RxYu
 Y707AYCssAqTYkwPm937oACNbNpL3d2Q/3kP6ku+LmEZM+1HlD2K9cwsdqEWBcWw
 oPA4ClwBgKP2dnn66oaaFSxEWMj/1evpbzAqptSKBJys83Ge7PFGiFG4Tyk7AUvl
 kAc1FcIYrQ==
 =9SdI
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.15-merge' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Carlos Maiolino:

 - XFS zoned allocator: Enables XFS to support zoned devices using its
   real-time allocator

 - Use folios/vmalloc for buffer cache backing memory

 - Some code cleanups and bug fixes

* tag 'xfs-6.15-merge' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (70 commits)
  xfs: remove the flags argument to xfs_buf_get_uncached
  xfs: remove the flags argument to xfs_buf_read_uncached
  xfs: remove xfs_buf_free_maps
  xfs: remove xfs_buf_get_maps
  xfs: call xfs_buf_alloc_backing_mem from _xfs_buf_alloc
  xfs: remove unnecessary NULL check before kvfree()
  xfs: don't wake zone space waiters without m_zone_info
  xfs: don't increment m_generation for all errors in xfs_growfs_data
  xfs: fix a missing unlock in xfs_growfs_data
  xfs: Remove duplicate xfs_rtbitmap.h header
  xfs: trigger zone GC when out of available rt blocks
  xfs: trace what memory backs a buffer
  xfs: cleanup mapping tmpfs folios into the buffer cache
  xfs: use vmalloc instead of vm_map_area for buffer backing memory
  xfs: buffer items don't straddle pages anymore
  xfs: kill XBF_UNMAPPED
  xfs: convert buffer cache to use high order folios
  xfs: remove the kmalloc to page allocator fallback
  xfs: refactor backing memory allocations for buffers
  xfs: remove xfs_buf_is_vmapped
  ...
2025-03-27 13:07:00 -07:00
Linus Torvalds
e41170cc5e vfs-6.15-rc1.pagesize
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc
 ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ
 jKpRSRct2eTbyxdYiGydHQGm5F5sLg4=
 =0FyQ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pagesize updates from Christian Brauner:
 "This enables block sizes greater than the page size for block devices.

  With this we can start supporting block devices with logical block
  sizes larger than 4k.

  It also allows lifting the device cache sector size support to 64k.
  This allows filesystems which can use larger sector sizes up to 64k to
  ensure that the filesystem will not generate writes that are smaller
  than the specified sector size"

* tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
  bdev: use bdev_io_min() for statx block size
  block/bdev: lift block size restrictions to 64k
  block/bdev: enable large folio support for large logical block sizes
  fs/buffer fs/mpage: remove large folio restriction
  fs/mpage: use blocks_per_folio instead of blocks_per_page
  fs/mpage: avoid negative shift for large blocksize
  fs/buffer: remove batching from async read
  fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-03-24 12:01:29 -07:00
Linus Torvalds
26d8e43079 vfs-6.15-rc1.async.dir
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rNwAKCRCRxhvAZXjc
 onBJAP9Z8Ywmlb5KQ1E3HvDmkwyY6yOSyZ9/CmbzrkCJ8ywYkQD/d9/xt0EP/O/q
 N8YtzXArHWt7u0YbcVpy9WK3F72BdwU=
 =VJgY
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs async dir updates from Christian Brauner:
 "This contains cleanups that fell out of the work from async directory
  handling:

   - Change kern_path_locked() and user_path_locked_at() to never return
     a negative dentry. This simplifies the usability of these helpers
     in various places

   - Drop d_exact_alias() from the remaining place in NFS where it is
     still used. This also allows us to drop the d_exact_alias() helper
     completely

   - Drop an unnecessary call to fh_update() from nfsd_create_locked()

   - Change i_op->mkdir() to return a struct dentry

     Change vfs_mkdir() to return a dentry provided by the filesystems
     which is hashed and positive. This allows us to reduce the number
     of cases where the resulting dentry is not positive to very few
     cases. The code in these places becomes simpler and easier to
     understand.

   - Repack DENTRY_* and LOOKUP_* flags"

* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: fix inline emphasis warning
  VFS: Change vfs_mkdir() to return the dentry.
  nfs: change mkdir inode_operation to return alternate dentry if needed.
  fuse: return correct dentry for ->mkdir
  ceph: return the correct dentry on mkdir
  hostfs: store inode in dentry after mkdir if possible.
  Change inode_operations.mkdir to return struct dentry *
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()
  VFS: add common error checks to lookup_one_qstr_excl()
  VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  VFS: repack LOOKUP_ bit flags.
  VFS: repack DENTRY_ flags.
2025-03-24 10:47:14 -07:00
Linus Torvalds
0ec0d4ecdd vfs-6.15-rc1.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rGwAKCRCRxhvAZXjc
 okVnAP9VgaYjWGzaeep/dLzWtu7C/Cg5Swl1P84Vj+SJ+hFPEAD/auzWTV0D0Ko5
 5GLyUsLZehfeVDOSRqmiyt1po8iVsQo=
 =ANks
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs iomap updates from Christian Brauner:

 - Allow the filesystem to submit the writeback bios.

    - Allow the filesystem to track completions on a per-bio basis
      instead of the entire I/O.

    - Change writeback_ops so that ->submit_bio can be done by the
      filesystem.

    - A new ANON_WRITE flag for writes that don't have a block number
   assigned to them at the iomap level, leaving the filesystem to do
      that work in the submission handler.

 - Incremental iterator advance

   The folio_batch support for zero range, where the filesystem provides
   a batch of folios to process that might not be logically contiguous,
   requires more flexibility than the current offset-based iteration
   offers.

   Update all iomap operations to advance the iterator within the
   operation and thus remove the need to advance from the core iomap
   iterator.

 - Make buffered writes work with RWF_DONTCACHE

   If RWF_DONTCACHE is set for a write, mark the folios being written as
   uncached. On writeback completion the pages will be dropped.

 - Introduce infrastructure for large atomic writes

   This will eventually be used by xfs and ext4.

* tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
  iomap: rework IOMAP atomic flags
  iomap: comment on atomic write checks in iomap_dio_bio_iter()
  iomap: inline iomap_dio_bio_opflags()
  iomap: fix inline data on buffered read
  iomap: Lift blocksize restriction on atomic writes
  iomap: Support SW-based atomic writes
  iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  xfs: flag as supporting FOP_DONTCACHE
  iomap: make buffered writes work with RWF_DONTCACHE
  iomap: introduce a full map advance helper
  iomap: rename iomap_iter processed field to status
  iomap: remove unnecessary advance from iomap_iter()
  dax: advance the iomap_iter on pte and pmd faults
  dax: advance the iomap_iter on dedupe range
  dax: advance the iomap_iter on unshare range
  dax: advance the iomap_iter on zero range
  dax: push advance down into dax_iomap_iter() for read and write
  dax: advance the iomap_iter in the read/write path
  iomap: convert misc simple ops to incremental advance
  iomap: advance the iter on direct I/O
  ...
2025-03-24 10:19:31 -07:00
John Garry
370a6de765
iomap: rework IOMAP atomic flags
Flag IOMAP_ATOMIC_SW is not really required. The idea of having this flag
is that the FS ->iomap_begin callback could check if this flag is set to
decide whether to do a SW (FS-based) atomic write. But the FS can set
which ->iomap_begin callback it wants when deciding to do a FS-based
atomic write.

Furthermore, it was thought that IOMAP_ATOMIC_HW is not a proper name, as
the block driver can use SW-methods to emulate an atomic write. So change
back to IOMAP_ATOMIC.

The ->iomap_begin callback still needs to indicate to the iomap core that
REQ_ATOMIC needs to be set, so add IOMAP_F_ATOMIC_BIO for that.

These changes were suggested by Christoph Hellwig and Dave Chinner.
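
As a rough sketch of the resulting contract (the callback below is
invented for illustration; only IOMAP_ATOMIC and IOMAP_F_ATOMIC_BIO come
from this change):

static int example_iomap_begin(struct inode *inode, loff_t pos,
		loff_t length, unsigned flags, struct iomap *iomap,
		struct iomap *srcmap)
{
	/* ... normal extent lookup elided ... */

	/* The write will go out as a hardware atomic bio, so ask the
	 * iomap core to set REQ_ATOMIC on it. */
	if (flags & IOMAP_ATOMIC)
		iomap->flags |= IOMAP_F_ATOMIC_BIO;
	return 0;
}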

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250320120250.4087011-4-john.g.garry@oracle.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20 15:16:03 +01:00
Christoph Hellwig
b3f8f2903b xfs: remove the flags argument to xfs_buf_get_uncached
No caller passes flags to xfs_buf_get_uncached, which makes sense
given that the flags apply to behavior not used for uncached buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 14:47:45 +01:00
Christoph Hellwig
8d54b48fef xfs: remove the flags argument to xfs_buf_read_uncached
No caller passes flags to xfs_buf_read_uncached, which makes sense
given that the flags apply to behavior not used for uncached buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 14:47:45 +01:00
Christoph Hellwig
44e1f90b16 xfs: remove xfs_buf_free_maps
xfs_buf_free_maps only has a single caller, so open code it there.  Stop
zeroing the b_maps pointer as the buffer is freed in the next line.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 14:47:45 +01:00
Christoph Hellwig
5abea7094b xfs: remove xfs_buf_get_maps
xfs_buf_get_maps has a single caller, and can just be open coded there.
When doing that, stop handling the allocation failure as we always pass
__GFP_NOFAIL to the slab allocator, and use the proper kcalloc helper for
array allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 14:47:45 +01:00
Christoph Hellwig
1ec1207722 xfs: call xfs_buf_alloc_backing_mem from _xfs_buf_alloc
We never allocate a buffer without backing memory.  Simplify the call
chain by calling xfs_buf_alloc_backing_mem from _xfs_buf_alloc.  To
avoid a forward declaration, move _xfs_buf_alloc down a bit in the
file.

Also drop the pointless _-prefix from _xfs_buf_alloc.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 14:47:45 +01:00
Chen Ni
34ba1fcd74 xfs: remove unnecessary NULL check before kvfree()
Remove unnecessary NULL check before kvfree() reported by
Coccinelle/coccicheck and the semantic patch at
scripts/coccinelle/free/ifnullfree.cocci.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 14:24:13 +01:00
Carlos Maiolino
c3a60b673a Merge branch 'xfs-6.15-folios_vmalloc' into XFS-for-linus-6.15-merge
Merge buffer cache conversion to folios and vmalloc

Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 14:10:30 +01:00
Carlos Maiolino
8e6415460f Merge branch 'xfs-6.15-zoned_devices' into XFS-for-linus-6.15-merge
Merge Zoned allocator for XFS.

Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 14:06:04 +01:00
Darrick J. Wong
f56f73ebf8 xfs: don't wake zone space waiters without m_zone_info
xfs_zoned_wake_all checks SB_ACTIVE to make sure it does the right thing
when a shutdown happens during unmount, but it fails to account for the
log recovery special case that sets SB_ACTIVE temporarily.  Add a NULL
check to cover both cases.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[hch: added a commit log and comment]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 13:06:00 +01:00
Christoph Hellwig
9ec3f7977a xfs: don't increment m_generation for all errors in xfs_growfs_data
xfs_growfs_data needs to increment m_generation as soon as the primary
superblock has been updated.  As the update of the secondary superblocks
was part of xfs_growfs_data_private, that meant the increment had to be
done unconditionally once that function was called.  Later, commit 83a7f86e39
("xfs: separate secondary sb update in growfs") split the secondary
superblock update into a separate helper, so now the increment on error
can be limited to failed calls to xfs_update_secondary_sbs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 13:06:00 +01:00
Christoph Hellwig
beba948713 xfs: fix a missing unlock in xfs_growfs_data
The newly added check for the internal RT device needs to unlock
m_growlock just like all the other error cases.

Fixes: bdc03eb5f9 ("xfs: allow internal RT devices for zoned mode")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18 13:06:00 +01:00
Alistair Popple
0e2f80afcf fs/dax: ensure all pages are idle prior to filesystem unmount
File systems call dax_break_mapping() prior to reallocating file system
blocks to ensure the page is not undergoing any DMA or other accesses. 
Generally this is needed when a file is truncated to ensure that if a
block is reallocated nothing is writing to it.  However filesystems
currently don't call this when an FS DAX inode is evicted.

This can cause problems when the file system is unmounted as a page can
continue to be undergoing DMA or other remote access after unmount.  This
means if the file system is remounted any truncate or other operation
which requires the underlying file system block to be freed will not wait
for the remote access to complete.  Therefore a busy block may be
reallocated to a new file leading to corruption.

Link: https://lkml.kernel.org/r/2d3cf575bbd095084993154be2f0aa7442e5cd28.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Dan Wiliams <dan.j.williams@intel.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:06:38 -07:00
Alistair Popple
bde708f1a6 fs/dax: always remove DAX page-cache entries when breaking layouts
Prior to any truncation operations file systems call dax_break_mapping()
to ensure pages in the range are not undergoing DMA.  Later DAX
page-cache entries will be removed by truncate_folio_batch_exceptionals()
in the generic page-cache code.

However this makes it possible for folios to be removed from the
page-cache even though they are still DMA busy if the file-system hasn't
called dax_break_mapping().  It also means they can never be waited on in
future because FS DAX will lose track of them once the page-cache entry
has been deleted.

Instead it is better to delete the FS DAX entry when the file-system calls
dax_break_mapping() as part of its truncate operation.  This ensures only
idle pages can be removed from the FS DAX page-cache and makes it easy to
detect if a file-system hasn't called dax_break_mapping() prior to a
truncate operation.

Link: https://lkml.kernel.org/r/3be6115eaaa8d28fee37fcba3287be4f226a7d24.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:06:37 -07:00
Alistair Popple
d5b3afea22 fs/dax: create a common implementation to break DAX layouts
Prior to freeing a block, file systems supporting FS DAX must check that
the associated pages are both unmapped from user-space and not undergoing
DMA or other access from e.g. get_user_pages().  This is achieved by
unmapping the file range and scanning the FS DAX page-cache to see if any
pages within the mapping have an elevated refcount.

This is done using dax_layout_busy_page_range(), which returns a page to
wait on until its refcount becomes idle.  Rather than open-code this,
introduce a common implementation to both unmap and wait for the page to
become idle.
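
A pseudocode sketch of that unmap-and-wait loop, reusing the existing
dax_layout_busy_page_range() helper; the function name and the wait
primitive below are illustrative, not the merged API:

static int example_dax_break_layout(struct inode *inode, loff_t start,
		loff_t end)
{
	struct page *page;
	int error = 0;

	do {
		/* Tear down any user-space mappings over the range... */
		unmap_mapping_pages(inode->i_mapping, start >> PAGE_SHIFT,
				(end - start) >> PAGE_SHIFT, false);

		/* ...then look for pages still pinned by DMA/GUP users. */
		page = dax_layout_busy_page_range(inode->i_mapping,
				start, end);
		if (!page)
			break;

		/* Sleep until the busy page's refcount drops to idle. */
		error = ___wait_var_event(page, page_ref_count(page) == 1,
				TASK_INTERRUPTIBLE, 0, 0, schedule());
	} while (!error);

	return error;
}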

Link: https://lkml.kernel.org/r/c4d381e41fc618296cee2820403c166d80599d5c.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:06:37 -07:00
Alistair Popple
e6fa3963a3 fs/dax: refactor wait for dax idle page
An FS DAX page is considered idle when its refcount drops to one.  This is
currently open-coded in all file systems supporting FS DAX.  Move the idle
detection to a common function to make future changes easier.
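
A minimal sketch of what such a common helper could look like, given the
idle condition described above; the name is made up for illustration:

/* An FS DAX page is idle once only the page-cache reference remains. */
static inline bool example_dax_page_is_idle(struct page *page)
{
	return page && page_ref_count(page) == 1;
}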

Link: https://lkml.kernel.org/r/c2c9d269110b90224eeb1dc661ffbc1d82aa20c9.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:06:37 -07:00
Easwar Hariharan
a877cd2a64 xfs: convert timeouts to secs_to_jiffies()
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() to avoid the
multiplication

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies
+secs_to_jiffies
(E
- * \( 1000 \| MSEC_PER_SEC \)
)
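
Concretely, the transformation turns code of this shape (the timeout
assignment is a made-up example) into the simpler form:

/* Before: a 30-second timeout expressed through milliseconds. */
timeout = msecs_to_jiffies(30 * MSEC_PER_SEC);

/* After: the same interval, no multiplication needed. */
timeout = secs_to_jiffies(30);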

Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-9-a43967e36c88@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Dongsheng Yang <dongsheng.yang@easystack.cn>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Frank Li <frank.li@nxp.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 12:16:58 -07:00
Linus Torvalds
eb88e6bfbc \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmfVSgwACgkQnJ2qBz9k
 QNkhqgf9GbV/DkH+NWKSAkTUkTUj/ibwkpc7KIFVvBEoNQOAqDntpIqDCGKVvywA
 hrX0/SxDn34RPSrQSzVVqlH/SGHkL1xgWTOwHUmo2DFEk7k03mfOlGEiV3I85Ugz
 ZaPdIfGwXFCf7/bXXJV2zkFPPDQjqkTHSzzBf3Az/HXzTvz7ZZ1QexbXTAlplTuN
 4PWkIOZBOAccSLlbqfSygEQ8nuTcR6mjExDXLYwIispDXa6Th28SFMF7nlk5D3G0
 /O4YrPmVozb6U/0fmoMr+x3Knu2tz/GDVxP8w4mI9k79FYsJ1UmuywnhZuuniXZF
 A1r1HE+DBP6Dd24SVThOQLuQvpnw6Q==
 =L6AL
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify reverts from Jan Kara:
 "Syzbot has found out that fsnotify HSM events generated on page fault
  can be generated while we already hold freeze protection for the
  filesystem (when you do buffered write from a buffer which is mmapped
  file on the same filesystem) which violates expectations for HSM
  events and could lead to deadlocks of HSM clients with filesystem
  freezing.

  Since it's quite late in the cycle we've decided to revert changes
  implementing HSM events on page fault for now and instead just
  generate one event for the whole range on mmap(2) so that HSM client
  can fetch the data at that moment"

* tag 'fsnotify_for_v6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  Revert "fanotify: disable readahead if we have pre-content watches"
  Revert "mm: don't allow huge faults for files with pre content watches"
  Revert "fsnotify: generate pre-content permission event on page fault"
  Revert "xfs: add pre-content fsnotify hook for DAX faults"
  Revert "ext4: add pre-content fsnotify hook for DAX faults"
  fsnotify: add pre-content hooks on mmap()
2025-03-15 08:32:16 -10:00
Matthew Wilcox (Oracle)
ca3ac4bf4d xfs: Use abs_diff instead of XFS_ABSDIFF
We have a central definition for this function since 2023, used by
a number of different parts of the kernel.
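
For reference, abs_diff() from linux/math.h computes the absolute
difference of two same-typed values, so the conversion is a plain
substitution; the wrapper below is invented for illustration:

#include <linux/math.h>

/* |a - b| without underflow for unsigned types; was XFS_ABSDIFF(a, b). */
static inline unsigned long long example_blkno_distance(unsigned long long a,
		unsigned long long b)
{
	return abs_diff(a, b);
}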

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-14 13:40:17 +01:00
Amir Goldstein
27773ce177 Revert "xfs: add pre-content fsnotify hook for DAX faults"
This reverts commit 7f4796a465.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250312073852.2123409-4-amir73il@gmail.com
2025-03-13 16:30:40 +01:00
Jiapeng Chong
fcb255537b xfs: Remove duplicate xfs_rtbitmap.h header
./fs/xfs/libxfs/xfs_sb.c: xfs_rtbitmap.h is included more than once.

Fixes: 2167eaabe2 ("xfs: define the zoned on-disk format")
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=19446
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-12 10:00:43 +01:00
Hans Holmberg
b7bc85480b xfs: trigger zone GC when out of available rt blocks
We periodically check the available rt blocks when filling up zones
and start GC if needed, but we may run completely out in between
filling zones, so start GC (unless already running) if we can't reserve
writable space.

This should only happen as a corner case in setups with very few
backing zones.

Fixes: 080d01c41d ("xfs: implement zoned garbage collection")
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-11 12:52:27 +01:00
Christoph Hellwig
89ce287c83 xfs: trace what memory backs a buffer
Add three trace points for the different backing memory allocators for
buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Christoph Hellwig
e614a00117 xfs: cleanup mapping tmpfs folios into the buffer cache
Directly assign b_addr based on the tmpfs folios without a detour
through pages, reuse the folio_put path used for non-tmpfs buffers
and replace all references to pages in comments with folios.

Partially based on a patch from Dave Chinner <dchinner@redhat.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Christoph Hellwig
e2874632a6 xfs: use vmalloc instead of vm_map_area for buffer backing memory
The fallback buffer allocation path currently open codes a suboptimal
version of vmalloc to allocate pages that are then mapped into
vmalloc space.  Switch to using vmalloc instead, which uses all the
optimizations in the common vmalloc code, and removes the need to
track the backing pages in the xfs_buf structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Dave Chinner
fd87851680 xfs: buffer items don't straddle pages anymore
Unmapped buffers don't exist anymore, so the page straddling
detection and slow path code can go away now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Christoph Hellwig
a2f790b285 xfs: kill XBF_UNMAPPED
Unmapped buffer access is a pain, so kill it. The switch to large
folios means we rarely pay a vmap penalty for large buffers,
so this functionality is largely unnecessary now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Christoph Hellwig
94c78cfa3b xfs: convert buffer cache to use high order folios
Now that we have the buffer cache using the folio API, we can extend
the use of folios to allocate high order folios for multi-page
buffers rather than an array of single pages that are then vmapped
into a contiguous range.

This creates a new type of single folio buffers that can have arbitrary
order in addition to the existing multi-folio buffers made up of many
single page folios that get vmapped.  The single folio is for now
stashed into the existing b_pages array, but that will go away entirely
later in the series and remove the temporary page vs folio typing issues
that only work because the two structures currently can be used largely
interchangeably.

The code that allocates buffers will optimistically attempt a high
order folio allocation as a fast path if the buffer size is a power
of two and thus fits into a folio. If this high order allocation
fails, then we fall back to the existing multi-folio allocation
code. This now forms the slow allocation path, and hopefully will be
largely unused in normal conditions except for buffers with size
that are not a power of two like larger remote xattrs.
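
A sketch of that fast-path shape, with invented names and none of the
real error handling; the actual code lives in fs/xfs/xfs_buf.c:

static struct folio *example_buf_alloc_folio(size_t size, gfp_t gfp)
{
	/* Only power-of-two buffer sizes fit a single folio cleanly. */
	if (!is_power_of_2(size))
		return NULL;		/* caller takes the slow path */

	/* One high-order folio covers the whole buffer: no vmap needed. */
	return folio_alloc(gfp, get_order(size));
}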

This should improve performance of large buffer operations (e.g.
large directory block sizes) as we should now mostly avoid the
expense of vmapping large buffers (and the vmap lock contention that
can occur) as well as avoid the runtime pressure that frequently
accessing kernel vmapped pages put on the TLBs.

Based on a patch from Dave Chinner <dchinner@redhat.com>, but mutilated
beyond recognition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Christoph Hellwig
4ef3982831 xfs: remove the kmalloc to page allocator fallback
Since commit 59bb47985c ("mm, sl[aou]b: guarantee natural alignment
for kmalloc(power-of-two)", kmalloc and friends guarantee that power of
two sized allocations are naturally aligned.  Limit our use of kmalloc
for buffers to these power of two sizes and remove the fallback to
the page allocator for this case, but keep a check in addition to
trusting the slab allocator to get the alignment right.

Also refactor the kmalloc path to reuse various calculations for the
size and gfp flags.
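
Sketched in isolation, the new invariant looks roughly like this (names
are invented and the real helper also derives the gfp flags):

static void *example_buf_alloc_kmem(size_t size, gfp_t gfp)
{
	void *addr;

	/* Natural alignment is only guaranteed for power-of-two sizes. */
	if (!is_power_of_2(size))
		return NULL;		/* use the vmalloc/folio path */

	addr = kmalloc(size, gfp);
	/* Keep a check rather than blindly trusting the allocator. */
	if (addr && WARN_ON_ONCE(!IS_ALIGNED((unsigned long)addr, size))) {
		kfree(addr);
		return NULL;
	}
	return addr;
}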

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Christoph Hellwig
50a524e0ef xfs: refactor backing memory allocations for buffers
Lift handling of shmem and slab backed buffers into xfs_buf_alloc_pages
and rename the result to xfs_buf_alloc_backing_mem.  This shares more
code and ensures uncached buffers can also use slab, which slightly
reduces the memory usage of growfs on 512 byte sector size file systems,
but more importantly means the allocation invariants are the same for
cached and uncached buffers.  Document these new invariants with a big
fat comment mostly stolen from a patch by Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Christoph Hellwig
48a325a4ee xfs: remove xfs_buf_is_vmapped
No need to look at the page count if we can simply call is_vmalloc_addr
on bp->b_addr.  This prepares for eventually removing the b_page_count
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Christoph Hellwig
51e1099315 xfs: remove xfs_buf.b_offset
b_offset is only set for slab backed buffers and always set to
offset_in_page(bp->b_addr), which can be done just as easily in the only
user of b_offset.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Christoph Hellwig
f2a3717a74 xfs: add a fast path to xfs_buf_zero when b_addr is set
No need to walk the page list if bp->b_addr is valid.  That also means
b_offset doesn't need to be taken into account in the unmapped loop as
b_offset is only set for kmem backed buffers which are always mapped.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Dave Chinner
69659e46b7 xfs: unmapped buffer item size straddling mismatch
We never log large contiguous regions of unmapped buffers, so this
bug is never triggered by the current code. However, the slowpath
for formatting buffer straddling regions is broken.

That is, the size and shape of the log vector calculated across a
straddle does not match how the formatting code formats a straddle.
This results in a log vector with an uninitialised iovec and this
causes a crash when xlog_write_full() goes to copy the iovec into
the journal.

Whilst touching this code, don't bother checking mapped or single
folio buffers for discontiguous regions because they don't have
them. This significantly reduces the overhead of this check when
logging large buffers as calling xfs_buf_offset() is not free and
it occurs a *lot* in those cases.

Fixes: 929f8b0deb ("xfs: optimise xfs_buf_item_size/format for contiguous regions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 14:29:44 +01:00
Carlos Maiolino
32f6987f93 Merge branch 'xfs-6.15-merge' into for-next
XFS code for 6.15 to be merged into linux-next

Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 10:35:39 +01:00
Matthew Wilcox (Oracle)
5d138b6fb4 xfs: Use abs_diff instead of XFS_ABSDIFF
We have a central definition for this function since 2023, used by
a number of different parts of the kernel.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 10:28:34 +01:00
Luis Chamberlain
a64e5a5960
bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
The commit titled "block/bdev: lift block size restrictions to 64k"
lifted the block layer's max supported block size to 64k inside the
helper blk_validate_block_size() now that we support large folios.
However in lifting the block size we also removed the silly use
cases many filesystems have to use sb_set_blocksize() to *verify*
that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was
used to check the block size <= PAGE_SIZE since historically we've
always supported userspace to create for example 64k block size
filesystems even on 4k page size systems, but what we didn't allow
was mounting them. Older filesystems have been using the check with
sb_set_blocksize() for years.

While we could argue that such checks should be filesystem specific,
there are many more users of sb_set_blocksize() than LBS-enabled
filesystems upstream, so just do the easier thing and bring back the
PAGE_SIZE check for sb_set_blocksize() users, skipping it only for LBS
enabled filesystems.

This will ensure that tests such as generic/466, when run in a loop
against say ext4, won't actually try to mount a filesystem with a block
size larger than your filesystem supports given your PAGE_SIZE, and in
the worst case crash.
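
The restored guard amounts to roughly the following sketch, assuming the
FS_LBS fs_flags bit is what marks an LBS-enabled filesystem:

/* Would sb_set_blocksize() accept this size for @sb?  (sketch) */
static bool example_blocksize_allowed(struct super_block *sb, int size)
{
	/* Non-LBS filesystems keep the historical PAGE_SIZE ceiling. */
	if (!(sb->s_type->fs_flags & FS_LBS) && size > PAGE_SIZE)
		return false;
	return true;
}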

Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-07 12:56:05 +01:00
Christian Brauner
1743d385e7
Merge branch 'vfs-6.15.shared.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Bring in iomap changes that xfs relies on.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-06 10:59:18 +01:00
NeilBrown
c54b386969
VFS: Change vfs_mkdir() to return the dentry.
vfs_mkdir() does not guarantee to leave the child dentry hashed or make
it positive on success, and in many such cases the filesystem had to use
a different dentry which it can now return.

This patch changes vfs_mkdir() to return the dentry provided by the
filesystems which is hashed and positive when provided.  This reduces
the number of cases where the resulting dentry is not positive to a
handful which don't deserve extra efforts.

The only callers of vfs_mkdir() which are interested in the resulting
inode are in-kernel filesystem clients: cachefiles, nfsd, smb/server.
The only filesystems that don't reliably provide the inode are:
- kernfs, tracefs which these clients are unlikely to be interested in
- cifs in some configurations would need to do a lookup to find the
  created inode, but doesn't.  cifs cannot be exported via NFS, is
  unlikely to be used by cachefiles, and smb/server only has a soft
  requirement for the inode, so this is unlikely to be a problem in
  practice.
- hostfs, nfs, cifs may need to do a lookup (rarely for NFS) and it is
  possible for a race to make that lookup fail.  Actual failure
  is unlikely, and provided callers handle negative dentries gracefully
  they will fail safe.

So this patch removes the lookup code in nfsd and smb/server and adjusts
them to fail safe if a negative dentry is provided:
- cache-files already fails safe by restarting the task from the
  top - it still does with this change, though it no longer calls
  cachefiles_put_directory() as that will crash if the dentry is
  negative.
- nfsd reports "Server-fault" which it what it used to do if the lookup
  failed. This will never happen on any file-systems that it can actually
  export, so this is of no consequence.  I removed the fh_update()
  call as that is not needed and out-of-place.  A subsequent
  nfsd_create_setattr() call will call fh_update() when needed.
- smb/server only wants the inode to call ksmbd_smb_inherit_owner()
  which updates ->i_uid (without calling notify_change() or similar)
  which can be safely skipped on cifs (I hope).

If a different dentry is returned, the first one is put.  If necessary
the fact that it is new can be determined by comparing pointers.  A new
dentry will certainly have a new pointer (as the old is put after the
new is obtained).
Similarly if an error is returned (via ERR_PTR()) the original dentry is
put.
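
For in-kernel callers the new convention looks roughly like the sketch
below, assuming the post-change prototype returns a struct dentry *:

static int example_make_subdir(struct mnt_idmap *idmap,
		struct dentry *parent, struct dentry *child, umode_t mode)
{
	struct dentry *de;

	de = vfs_mkdir(idmap, d_inode(parent), child, mode);
	if (IS_ERR(de))
		return PTR_ERR(de);	/* child has already been put */

	/* de may differ from child; if so, child was put and de is the
	 * hashed (and almost always positive) dentry to use from here. */
	dput(de);
	return 0;
}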

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-7-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 11:52:50 +01:00
Christoph Hellwig
9c477912b2 xfs: export max_open_zones in sysfs
Add a zoned group with an attribute for the maximum number of open zones.
This allows querying the open zones for data placement tests, or also
for placement aware applications that are in control of the entire
file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:10 -07:00
Christoph Hellwig
243f40d0c7 xfs: contain more sysfs code in xfs_sysfs.c
Extend the error sysfs initialization helper to include the neighbouring
attributes as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:10 -07:00
Hans Holmberg
5443041b9c xfs: export zone stats in /proc/*/mountstats
Add to /proc/*/mountstats the per-zone life time hint and the used block
distribution for fully written zones, grouping reclaimable zones into
fixed-percentage buckets spanning 0..9%, 10..19%, and so on, with full
zones counted as 100% used, along with a few statistics about the zone
allocator and open and reclaimable zones.

This gives good insight into data fragmentation and data placement
success rate.
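
The bucketing itself is simple arithmetic; a sketch with invented names
(the real accounting lives in the XFS zoned allocator code):

/* Buckets 0..9 cover 0..9%, 10..19%, ..., 90..99% used; bucket 10
 * counts fully written (100% used) zones. */
static unsigned int example_zone_usage_bucket(u64 used, u64 capacity)
{
	u64 pct = div64_u64(used * 100, capacity);

	return pct >= 100 ? 10 : (unsigned int)pct / 10;
}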

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:10 -07:00
Christoph Hellwig
099bf44f9c xfs: wire up the show_stats super operation
The show_stats option allows a file system to dump plain text statistics
on a per-mount basis into /proc/*/mountstats.  Wire up a no-op version
which will grow useful information for zoned file systems later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:10 -07:00
Hans Holmberg
64d0361114 xfs: support write life time based data placement
Add a file write life time data placement allocation scheme that aims to
minimize fragmentation and thereby do two things:

 a) separate file data to different zones when possible.
 b) colocate file data of similar life times when feasible.

To get best results, average file sizes should align with the zone
capacity that is reported through the XFS_IOC_FSGEOMETRY ioctl.

This improvement in data placement efficiency reduces the number of
blocks requiring relocation by GC, and thus decreases overall write
amplification.  The impact on performance varies depending on how full
the file system is.

For RocksDB using leveled compaction, the lifetime hints can improve
throughput for overwrite workloads at 80% file system utilization by
~10%, but for lower file system utilization there won't be as much
benefit in application performance as there is less need for garbage
collection to start with.

Lifetime hints can be disabled using the nolifetime mount option.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:10 -07:00
Christoph Hellwig
7452a6daf9 xfs: add a max_open_zones mount option
Allow limiting the number of open zones used below that exported by the
device.  This is required to tune the number of write streams when zoned
RT devices are used on conventional devices, and can be useful on zoned
devices that support a very large number of open zones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:09 -07:00
Christoph Hellwig
97c69ba1c0 xfs: support zone gaps
Zoned devices can have gaps between the usable capacity of a zone and its
end in the LBA/daddr address space.  In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us.  In this case the sparse FSB/RTB address space maps 1:1
to the device address space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:09 -07:00
Christoph Hellwig
be458049ff xfs: enable the zoned RT device feature
Enable the zoned RT device directory feature.  With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks.  This perfectly maps to zoned devices, but can also be
used on conventional block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:09 -07:00
Christoph Hellwig
ad35e362bf xfs: disable rt quotas for zoned file systems
They'll need a little more work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:09 -07:00
Christoph Hellwig
af4f88330d xfs: disable reflink for zoned file systems
While the zoned on-disk format supports reflinks, the GC code currently
always unshares reflinks when moving blocks to new zones, thus making the
feature unusable.  Disable reflinks until the GC code is refcount aware.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:09 -07:00
Christoph Hellwig
e50ec7fac8 xfs: enable fsmap reporting for internal RT devices
File systems with internal RT devices are a bit odd in that we need
to report AGs and RGs.  To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.

The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:08 -07:00
Christoph Hellwig
14d355dcec xfs: support xrep_require_rtext_inuse on zoned file systems
Space usage is tracked by the rmap, which already is separately
cross-referenced.  But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:08 -07:00
Christoph Hellwig
48b9ac6819 xfs: support xchk_xref_is_used_rt_space on zoned file systems
Space usage is tracked by the rmap, which already is separately
cross-referenced.  But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:08 -07:00
Christoph Hellwig
1cf4554e7b xfs: allow COW forks on zoned file systems in xchk_bmap
Zoned file systems can have COW forks even without reflinks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:08 -07:00
Christoph Hellwig
01b71e64bb xfs: support growfs on zoned file systems
Replace the inner loop growing one RT bitmap block at a time with
one just modifying the superblock counters for growing an entire
zone (aka RTG).  The big restriction is that, just like at mkfs time, only
an RT extent size of a single FSB is allowed, and the file system
capacity needs to be aligned to the zone size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:08 -07:00
Christoph Hellwig
55ef6e7a40 xfs: hide reserved RT blocks from statfs
File systems with a zoned RT device have a large number of reserved
blocks that are required for garbage collection, and which can't be
filled with user data.  Exclude them from the available blocks reported
through stat(v)fs.
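
A hedged sketch of the adjustment, with hypothetical names; the point is
simply that the GC reserve is subtracted before reporting availability:

static unsigned long long user_available_rtblocks(unsigned long long free,
                                                  unsigned long long gc_reserve)
{
        /* never report the garbage collection reserve as available */
        return free > gc_reserve ? free - gc_reserve : 0;
}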

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:08 -07:00
Christoph Hellwig
859b692711 xfs: wire up zoned block freeing in xfs_rtextent_free_finish_item
Make xfs_rtextent_free_finish_item call into the zoned allocator to free
blocks on zoned RT devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:07 -07:00
Christoph Hellwig
2e23834058 xfs: implement direct writes to zoned RT devices
Direct writes to zoned RT devices are extremely simple.  After taking the
block reservation before acquiring the iolock, the iomap direct I/O code
calls into ->iomap_begin, which will return a "fake" iomap for the entire
requested range.  The actual block allocation is then done from the
submit_io handler using code shared with the buffered I/O path.

The iomap_dio_ops set the bio_set to the (iomap) ioend one and initialize
the embedded ioend, which allows reusing the existing ioend based buffered
I/O completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:07 -07:00
Christoph Hellwig
058dd70c65 xfs: implement buffered writes to zoned RT devices
Implement buffered writes including page faults and block zeroing for
zoned RT devices.  Buffered writes to zoned RT devices are split into
three phases:

 1) a reservation for the worst case data block usage is taken before
    acquiring the iolock (see the sketch after this list).  When there
    are enough free blocks but not enough available ones, garbage
    collection is kicked off to free the space before continuing with
    the write.  If there isn't enough freeable space, the block
    reservation is reduced and a short write will happen as expected
    by normal Linux write semantics.
 2) with the iolock held, the generic iomap buffered write code is
    called, which through the iomap_begin operation usually just inserts
    delalloc extents for the range in a single iteration.  Only for
    overwrites of existing data that are not block aligned, or for
    zeroing operations, is the existing extent mapping read to fill out
    the srcmap and to figure out if zeroing is required.
 3) the ->map_blocks callback to the generic iomap writeback code
    calls into the zoned space allocator to actually allocate on-disk
    space for the range before kicking off the writeback.
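
The sketch referenced in phase 1, with every type and helper
hypothetical; it only illustrates the reserve-or-shrink logic
described above:

#include <stdbool.h>

struct zoned_mount;
bool try_take_available(struct zoned_mount *mp, long long blocks);
long long freeable_blocks(struct zoned_mount *mp);
void kick_gc_and_wait(struct zoned_mount *mp);

/* Returns the number of blocks actually reserved, possibly fewer than asked. */
static long long reserve_for_write(struct zoned_mount *mp, long long want)
{
        while (!try_take_available(mp, want)) {
                if (freeable_blocks(mp) >= want) {
                        kick_gc_and_wait(mp);   /* GC can make enough available */
                        continue;
                }
                want = freeable_blocks(mp);     /* reduced reservation: short write */
                if (!want)
                        return 0;               /* -ENOSPC territory in real code */
        }
        return want;
}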

Note that because all writes are out of place, truncate or hole punches
that are not aligned to block size boundaries need to allocate space.
For block zeroing from truncate, ->setattr is called with the iolock
(aka i_rwsem) already held, so a hacky deviation from the above
scheme is needed.  In this case the space reservation is taken with
the iolock held, but is required not to block and can dip into the
reserved block pool.  This can lead to -ENOSPC when truncating a
file, which is unfortunate.  But fixing the calling conventions in
the VFS is probably much easier with code requiring it already in
mainline.

Similarly, because all writes are out of place, the zoned allocator can't
support unwritten extents and thus the FALLOC_FL_ALLOCATE_RANGE mode
of fallocate.  Other fallocate modes that would normally reserve space,
but don't need to in order to provide proper semantics, do work but do
not reserve space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:07 -07:00
Christoph Hellwig
080d01c41d xfs: implement zoned garbage collection
RT groups on a zoned file system need to be completely empty before their
space can be reused.  This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.

Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied.  To find empty zones a simple set of 10 buckets
based on the amount of space used in each zone is used.  To empty zones,
the rmap is walked to find the owners and the data is read and then
written to the new place.

To automatically defragment files the rmap records are sorted by inode
and logical offset.  This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O.  Instead the I/O completion handler checks
that the mapping hasn't changed from the one recorded at the start of
the GC cycle and doesn't update the mapping if it changed.
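
A hedged sketch of that completion-time check; the extent structure and
the remap helper are invented for illustration:

struct gc_extent { unsigned long long off, len, daddr; };

int remap_file_extent(void *inode, const struct gc_extent *new_loc);

/* Only install the relocated copy if the mapping is still the recorded one. */
static int gc_finish(void *inode, const struct gc_extent *recorded,
                     const struct gc_extent *current_map,
                     const struct gc_extent *new_loc)
{
        if (recorded->off != current_map->off ||
            recorded->len != current_map->len ||
            recorded->daddr != current_map->daddr)
                return 0;       /* mapping changed during GC; discard the copy */
        return remap_file_extent(inode, new_loc);
}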

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:07 -07:00
Christoph Hellwig
0bb2193056 xfs: add support for zoned space reservations
For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers.  This means waiting for garbage collection with the iolock held
can deadlock.

To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection.  The new helpers try to take these available blocks, and
if there aren't enough available they wake and wait for GC.  This is
done using a list of on-stack reservations to ensure fairness.
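
A hedged sketch of taking blocks from the available pool, using the
kernel atomic API but with illustrative naming; the queueing and waking
of the on-stack reservations is left out:

#include <linux/atomic.h>

static bool zoned_try_take_available(atomic64_t *rtavailable, s64 count)
{
        s64 old = atomic64_read(rtavailable);

        do {
                if (old < count)
                        return false;   /* caller queues a reservation and waits for GC */
        } while (!atomic64_try_cmpxchg(rtavailable, &old, old - count));

        return true;
}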

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:17:07 -07:00
Christoph Hellwig
4e4d520755 xfs: add the zoned space allocator
For zoned RT devices space is always allocated at the write pointer, that
is right after the last written block and only recorded on I/O completion.

The actual allocation algorithm is very simple and just involves
picking a good zone - preferably the one used for the last write to the
inode.  As the number of zones that can be written at the same time is
usually limited by the hardware, selecting a zone is done as late as
possible, from the iomap dio and buffered writeback bio submission
helpers just before submitting the bio.

Given that the writers already took a reservation before acquiring the
iolock, space will always be readily available if an open zone slot is
available.  A new structure is used to track these open zones, and
pointed to by the xfs_rtgroup.  Because zoned file systems don't have
a rsum cache the space for that pointer can be reused.

Allocations are only recorded at I/O completion time.  The scheme used
for that is very similar to the reflink COW end I/O path.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:56 -07:00
Christoph Hellwig
720c2d5834 xfs: parse and validate hardware zone information
Add support to validate and parse reported hardware zone state.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:46 -07:00
Christoph Hellwig
0cb53d773b xfs: skip zoned RT inodes in xfs_inodegc_want_queue_rt_file
The zoned allocator never performs speculative preallocations, so don't
bother queueing up zoned inodes here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:46 -07:00
Christoph Hellwig
fc04408c47 xfs: don't call xfs_can_free_eofblocks from ->release for zoned inodes
Zoned file systems require out of place writes and thus can't support
post-EOF speculative preallocations.  Avoid the pointless ilock critical
section to find out that none can be freed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:46 -07:00
Christoph Hellwig
f044dda351 xfs: disable FITRIM for zoned RT devices
The zoned allocator unconditionally issues zone resets or discards after
emptying an entire zone, so supporting FITRIM for a zoned RT device is
not useful.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:46 -07:00
Christoph Hellwig
1d319ac6fe xfs: disable sb_frextents for zoned file systems
Zoned file systems not only don't use the global frextents counter, but
for them the in-memory percpu counter also includes reservations taken
before even allocating delalloc extent records, so it will never match
the per-zone used information.  Disable all updates and verification of
the sb counter for zoned file systems as it isn't useful for them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:45 -07:00
Christoph Hellwig
1fd8159e7c xfs: export zoned geometry via XFS_FSOP_GEOM
Export the zoned geometry information so that userspace can query it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:45 -07:00
Christoph Hellwig
bdc03eb5f9 xfs: allow internal RT devices for zoned mode
Allow creating an RT subvolume on the same device as the main data
device.  This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:45 -07:00
Christoph Hellwig
2167eaabe2 xfs: define the zoned on-disk format
Zoned file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data).  There are a few minor but important changes, which are indicated
by a new incompat flag:

1) there are no bitmap and summary inodes, thus the
   /rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the
   sb_rbmblocks superblock field must be cleared to zero.

2) there is a new superblock field that specifies the start of an
   internal RT section.  This allows supporting SMR HDDs that have randomly
   writable space at the beginning, which is used for the XFS data device
   (which really is the metadata device for this configuration), directly
   followed by a RT device on the same block device.  While something
   similar could be achieved using dm-linear just having a single device
   directly consumed by XFS makes handling the file systems a lot easier.

3) Another superblock field that tracks the amount of reserved space (or
   overprovisioning) that is never used for user capacity, but allows GC
   to run more smoothly.

4) an overlay of the cowextsize field for the rtrmap inode so that we
   can persistently track the total amount of rtblocks currently used in
   a RT group.  There is no data structure other than the rmap that
   tracks used space in an RT group, and this counter is used to decide
   when a RT group has been entirely emptied, and to select one that
   is relatively empty if garbage collection needs to be performed.
   While this counter could be tracked entirely in memory and rebuilt
   from the rmap at mount time, that would lead to very long mount times
   with the large number of RT groups implied by the number of hardware
   zones especially on SMR hard drives with 256MB zone sizes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:45 -07:00
Christoph Hellwig
aacde95a37 xfs: add a xfs_rtrmap_highest_rgbno helper
Add a helper to find the last offset mapped in the rtrmap.  This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:45 -07:00
Christoph Hellwig
f42c652434 xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay
The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation.  To support that pass the
flags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:44 -07:00
Christoph Hellwig
6fff175279 xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write
For always COW inodes we must also check the alignment of each individual
iovec segment, as they could end up in different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block.  The existing always_cow sysctl based code doesn't catch this
because nothing enforces that blocks aren't rewritten, but for zoned XFS
on sequential write required zones this is a hard error.
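
A hedged userspace-style sketch of the per-segment check; the helper
name and exact policy are illustrative:

#include <sys/uio.h>

/* Every segment must start and end on a block boundary for out of place writes. */
static int segments_block_aligned(const struct iovec *iov, int nr,
                                  unsigned int blocksize)
{
        for (int i = 0; i < nr; i++) {
                if ((unsigned long)iov[i].iov_base & (blocksize - 1))
                        return 0;
                if (iov[i].iov_len & (blocksize - 1))
                        return 0;
        }
        return 1;
}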

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:44 -07:00
Christoph Hellwig
8ae4c8cec0 xfs: skip always_cow inodes in xfs_reflink_trim_around_shared
xfs_reflink_trim_around_shared tries to find shared blocks in the
refcount btree.  Always_cow inodes don't have that tree, so don't
bother.

For the existing always_cow code this is a minor optimization.  For
the upcoming zoned code that can do COW without the rtreflink code it
avoids triggering a NULL pointer dereference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:44 -07:00
Christoph Hellwig
7c879c8275 xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c
Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs.  Move it to xfs_iomap.c,
closer to its two callers.

Note that the rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:44 -07:00
Christoph Hellwig
012482b330 xfs: add a rtg_blocks helper
Shortcut dereferencing the xg_block_count field in the generic group
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:44 -07:00
Christoph Hellwig
a581de0d61 xfs: factor out a xfs_rt_check_size helper
Add a helper to check that the last block of a RT device is readable
to share the code between mount and growfs.  This also adds the mount
time overflow check to growfs and improves the error messages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:44 -07:00
Christoph Hellwig
272e20bb24 xfs: reduce metafile reservations
There is no point in reserving more space than actually available
on the data device for the worst case scenario that is unlikely to
happen.  Reserve at most 1/4th of the data device blocks, which is
still a heuristic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:43 -07:00
Christoph Hellwig
1df8d75030 xfs: make metabtree reservations global
Currently each metabtree inode has its own space reservation to ensure
it can be expanded to the maximum size, mirroring what is done for the
AG-based btrees.  But unlike the AG-based btrees the metabtree inodes
aren't restricted to allocating from a single AG but can use free space
from the entire file system.  And unlike AG-based btrees, where the
required reservation shrinks with the available free space because of
this, the metabtree reservations for the rtrmap and rtreflink trees are
not bound in any way by the data device free space as they track RT
extent allocations.  This is not very efficient as it requires a large
number of blocks to be set aside that can't be used at all by other
btrees.

Switch to a model that uses a global pool instead in preparation for
reducing the amount of reserved space, which now also removes the
overloading of the i_nblocks field for metabtree inodes, which would
create problems if metabtree inodes ever had a big enough xattr fork
to require xattr blocks outside the inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:43 -07:00
Christoph Hellwig
c0bd736d33 xfs: fixup the metabtree reservation in xrep_reap_metadir_fsblocks
All callers of xrep_reap_metadir_fsblocks need to fix up the metabtree
reservation, otherwise they'd leave the reservations in an incoherent
state.  Move the call to xrep_reset_metafile_resv into
xrep_reap_metadir_fsblocks so it is always taken care of, and remove the
now superfluous helper functions in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:43 -07:00
Christoph Hellwig
a0760cca8e xfs: trace in-memory freecounter reservations
Add two tracepoints: one for when the freecounter dips into the
reserved pool, and one for when it is entirely out of space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:43 -07:00
Christoph Hellwig
c8c4e8bc69 xfs: support reserved blocks for the rt extent counter
The zoned space allocator will need reserved RT extents for garbage
collection and zeroing of partial blocks.  Move the resblks related
fields into the freecounter array so that they can be used for all
counters.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:43 -07:00
Christoph Hellwig
712bae9663 xfs: generalize the freespace and reserved blocks handling
xfs_{add,dec}_freecounter already handles the block and RT extent
percpu counters, but it currently hardcodes the passed in counter.

Add a freecounter abstraction that uses an enum to designate the counter
and add wrappers that hide the actual percpu_counters.  This will allow
expanding the reserved block handling to the RT extent counter in the
next step, and also prepares for adding yet another such counter that
can share the code.  Both these additions will be needed for the zoned
allocator.

Also switch the flooring of the frextents counter to 0 in statfs for the
rtinherit case to a manual min_t call to match the handling of the
fdblocks counter for normal file systems.
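
A hedged sketch of the abstraction's shape; the enum values match the
description, while the wrapper and struct layout are simplified
illustrations of the real percpu counter handling:

enum free_counter {
        FC_BLOCKS,      /* free data device blocks */
        FC_RTEXTENTS,   /* free realtime extents */
        FC_NR,
};

struct free_counters {
        long long count[FC_NR]; /* stand-in for the real percpu counters */
};

static long long freecounter_read(const struct free_counters *fc,
                                  enum free_counter ctr)
{
        return fc->count[ctr];
}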

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:16:37 -07:00
Christoph Hellwig
cc3d2f55c4 xfs: reflow xfs_dec_freecounter
Let the successful allocation be the main path through the function
with exception handling in branches to make the code easier to
follow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 08:15:48 -07:00
Carlos Maiolino
0a1fd78080 Merge branch 'vfs-6.15.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs into xfs-6.15-merge 2025-03-03 13:01:06 +01:00
NeilBrown
88d5baf690
Change inode_operations.mkdir to return struct dentry *
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g.  on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns.  For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.

This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes.  Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.

This lookup-after-mkdir requires that the directory remains locked to
avoid races.  Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.

To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are (sketched below):
  NULL - the directory was created and no other dentry was used
  ERR_PTR() - an error occurred
  non-NULL - this other dentry was spliced in
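
A minimal sketch of the new convention for a filesystem that always uses
the dentry passed in; example_create_dir is a hypothetical helper:

#include <linux/fs.h>
#include <linux/err.h>

int example_create_dir(struct inode *dir, struct dentry *dentry, umode_t mode);

static struct dentry *example_mkdir(struct mnt_idmap *idmap,
                                    struct inode *dir,
                                    struct dentry *dentry, umode_t mode)
{
        int err = example_create_dir(dir, dentry, mode);

        if (err)
                return ERR_PTR(err);    /* an error occurred */
        return NULL;                    /* created; no other dentry was used */
}

A filesystem that splices in a different dentry would instead return the
result of d_splice_alias() here.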

This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations.  Subsequent patches will make
further changes to some file-systems to return a correct dentry.

Not all filesystems reliably result in a positive hashed dentry:

- NFS, cifs, hostfs will sometimes need to perform a lookup of
  the name to get inode information.  Races could result in this
  returning something different. Note that this lookup is
  non-atomic which is what we are trying to avoid.  Placing the
  lookup in filesystem code means it only happens when the filesystem
  has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
  operation ensures that lookup will be called to correctly populate
  the dentry.  This could be fixed but I don't think it is important
  to any of the users of vfs_mkdir() which look at the dentry.

The recommendation to use
    d_drop();d_splice_alias()
is ugly but fits with current practice.  A planned future patch will
change this.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 20:00:17 +01:00
Jens Axboe
974c5e6139
xfs: flag as supporting FOP_DONTCACHE
Read side was already fully supported, and with the write side
appropriately punted to the worker queue, all that's needed now is
setting FOP_DONTCACHE in the file_operations structure to enable full
support for read and write uncached IO.
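
A hedged sketch of the hookup; the surrounding methods are placeholders,
only the flag itself is from this change:

static const struct file_operations example_file_operations = {
        /* ... the usual read_iter/write_iter/etc. methods ... */
        .fop_flags      = FOP_DONTCACHE,        /* uncached read and write IO */
};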

This provides similar benefits to using RWF_DONTCACHE with reads. Testing
buffered writes on 32 files:

writing bs 65536, uncached 0
  1s: 196035MB/sec
  2s: 132308MB/sec
  3s: 132438MB/sec
  4s: 116528MB/sec
  5s: 103898MB/sec
  6s: 108893MB/sec
  7s: 99678MB/sec
  8s: 106545MB/sec
  9s: 106826MB/sec
 10s: 101544MB/sec
 11s: 111044MB/sec
 12s: 124257MB/sec
 13s: 116031MB/sec
 14s: 114540MB/sec
 15s: 115011MB/sec
 16s: 115260MB/sec
 17s: 116068MB/sec
 18s: 116096MB/sec

where it's quite obvious where the page cache filled, and performance
dropped to about half of where it started, settling in at around
115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
reclaim pages.

Running the same test with uncached buffered writes:

writing bs 65536, uncached 1
  1s: 198974MB/sec
  2s: 189618MB/sec
  3s: 193601MB/sec
  4s: 188582MB/sec
  5s: 193487MB/sec
  6s: 188341MB/sec
  7s: 194325MB/sec
  8s: 188114MB/sec
  9s: 192740MB/sec
 10s: 189206MB/sec
 11s: 193442MB/sec
 12s: 189659MB/sec
 13s: 191732MB/sec
 14s: 190701MB/sec
 15s: 191789MB/sec
 16s: 191259MB/sec
 17s: 190613MB/sec
 18s: 191951MB/sec

and the behavior is fully predictable, performing the same throughout
even after the page cache would otherwise have fully filled with dirty
data. It's also about 65% faster, and using half the CPU of the system
compared to the normal buffered write.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250204184047.356762-3-axboe@kernel.dk
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 11:27:54 +01:00
Jens Axboe
d47c670061
xfs: flag as supporting FOP_DONTCACHE
Read side was already fully supported, and with the write side
appropriately punted to the worker queue, all that's needed now is
setting FOP_DONTCACHE in the file_operations structure to enable full
support for read and write uncached IO.

This provides similar benefits to using RWF_DONTCACHE with reads. Testing
buffered writes on 32 files:

writing bs 65536, uncached 0
  1s: 196035MB/sec
  2s: 132308MB/sec
  3s: 132438MB/sec
  4s: 116528MB/sec
  5s: 103898MB/sec
  6s: 108893MB/sec
  7s: 99678MB/sec
  8s: 106545MB/sec
  9s: 106826MB/sec
 10s: 101544MB/sec
 11s: 111044MB/sec
 12s: 124257MB/sec
 13s: 116031MB/sec
 14s: 114540MB/sec
 15s: 115011MB/sec
 16s: 115260MB/sec
 17s: 116068MB/sec
 18s: 116096MB/sec

where it's quite obvious where the page cache filled, and performance
dropped to about half of where it started, settling in at around
115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
reclaim pages.

Running the same test with uncached buffered writes:

writing bs 65536, uncached 1
  1s: 198974MB/sec
  2s: 189618MB/sec
  3s: 193601MB/sec
  4s: 188582MB/sec
  5s: 193487MB/sec
  6s: 188341MB/sec
  7s: 194325MB/sec
  8s: 188114MB/sec
  9s: 192740MB/sec
 10s: 189206MB/sec
 11s: 193442MB/sec
 12s: 189659MB/sec
 13s: 191732MB/sec
 14s: 190701MB/sec
 15s: 191789MB/sec
 16s: 191259MB/sec
 17s: 190613MB/sec
 18s: 191951MB/sec

and the behavior is fully predictable, performing the same throughout
even after the page cache would otherwise have fully filled with dirty
data. It's also about 65% faster, and using half the CPU of the system
compared to the normal buffered write.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250204184047.356762-3-axboe@kernel.dk
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 11:21:00 +01:00
Christoph Hellwig
9b47d37496 xfs: remove the XBF_STALE check from xfs_buf_rele_cached
xfs_buf_stale already sets b_lru_ref to 0, and thus prevents the buffer
from moving to the LRU.  Remove the duplicate check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-25 13:05:59 +01:00
Christoph Hellwig
0d1120b9bb xfs: remove most in-flight buffer accounting
The buffer cache keeps a bt_io_count per-CPU counter to track all
in-flight I/O, which is used to ensure no I/O is in flight when
unmounting the file system.

For most I/O we already keep track of inflight I/O at higher levels:

 - for synchronous I/O (xfs_buf_read/xfs_bwrite/xfs_buf_delwri_submit),
   the caller has a reference and waits for I/O completions using
   xfs_buf_iowait
 - for xfs_buf_delwri_submit_nowait the only caller (AIL writeback)
   tracks the log items that the buffer attached to

This leaves only xfs_buf_readahead_map as a submitter of
asynchronous I/O that is not tracked by anything else.  Replace the
bt_io_count per-cpu counter with a more specific bt_readahead_count
counter that only tracks readahead I/O.  This allows simply incrementing
it when submitting readahead I/O and decrementing it when it completes,
and thus simplifies xfs_buf_rele and removes the need for the
XBF_NO_IOACCT flag and the XFS_BSTATE_IN_FLIGHT buffer state.
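
A hedged sketch of the narrower accounting; the wrappers are
illustrative, only the counter name follows the description above:

#include <linux/percpu_counter.h>

static void readahead_submit(struct percpu_counter *bt_readahead_count)
{
        percpu_counter_inc(bt_readahead_count);
        /* ... submit the asynchronous read ... */
}

/* Called from I/O completion; unmount waits for the counter to drain. */
static void readahead_done(struct percpu_counter *bt_readahead_count)
{
        percpu_counter_dec(bt_readahead_count);
}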

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-25 13:05:59 +01:00
Christoph Hellwig
efc5f7a9f3 xfs: decouple buffer readahead from the normal buffer read path
xfs_buf_readahead_map is the only caller of xfs_buf_read_map and thus
_xfs_buf_read that is not synchronous.  Split it from xfs_buf_read_map
so that the asynchronous path is self-contained and the now purely
synchronous xfs_buf_read_map / _xfs_buf_read implementation can be
simplified.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-25 13:05:59 +01:00
Christoph Hellwig
4b90de5bc0 xfs: reduce context switches for synchronous buffered I/O
Currently all metadata I/O completions happen in the m_buf_workqueue
workqueue.  But for synchronous I/O (i.e. all buffer reads) there is no
need for that, as there always is a caller in process context that is
waiting for the I/O.  Factor out the guts of xfs_buf_ioend into a
separate helper and call it from xfs_buf_iowait to avoid an extra
context switch to the workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-25 13:05:59 +01:00
Christoph Hellwig
2d873efd17 xfs: flush inodegc before swapon
Fix the brand new xfstest that tries to swapon on a recently unshared
file and use the chance to document the other bit of magic in this
function.

The big comment is taken from a mailing list post by Dave Chinner.

Fixes: 5e672cd69f ("xfs: introduce xfs_inodegc_push()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14 09:40:35 +01:00
Christoph Hellwig
3cd6a8056f xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate
Match the method name and the naming convention of the other
address_space operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14 09:40:35 +01:00
Carlos Maiolino
9f0902091c xfs: Do not allow norecovery mount with quotacheck
Mounting a filesystem that requires a quota state change will generate a
transaction.

We already check for a read-only device; we should do that for
norecovery too.

A quotacheck on a norecovery mount, and with the right log size, will cause
the mount process to hang on:

[<0>] xlog_grant_head_wait+0x5d/0x2a0 [xfs]
[<0>] xlog_grant_head_check+0x112/0x180 [xfs]
[<0>] xfs_log_reserve+0xe3/0x260 [xfs]
[<0>] xfs_trans_reserve+0x179/0x250 [xfs]
[<0>] xfs_trans_alloc+0x101/0x260 [xfs]
[<0>] xfs_sync_sb+0x3f/0x80 [xfs]
[<0>] xfs_qm_mount_quotas+0xe3/0x2f0 [xfs]
[<0>] xfs_mountfs+0x7ad/0xc20 [xfs]
[<0>] xfs_fs_fill_super+0x762/0xa50 [xfs]
[<0>] get_tree_bdev_flags+0x131/0x1d0
[<0>] vfs_get_tree+0x26/0xd0
[<0>] vfs_cmd_create+0x59/0xe0
[<0>] __do_sys_fsconfig+0x4e3/0x6b0
[<0>] do_syscall_64+0x82/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

This is caused by a transaction running with a bogusly initialized log
head/tail.

I initially hit this while running generic/050, with random log
sizes, but I managed to reproduce it reliably here with the steps
below:

mkfs.xfs -f -lsize=1025M -f -b size=4096 -m crc=1,reflink=1,rmapbt=1, -i
sparse=1 /dev/vdb2 > /dev/null
mount -o usrquota,grpquota,prjquota /dev/vdb2 /mnt
xfs_io -x -c 'shutdown -f' /mnt
umount /mnt
mount -o ro,norecovery,usrquota,grpquota,prjquota  /dev/vdb2 /mnt

The last mount hangs.

As we add yet another validation for quota state changes, also add a
new helper named xfs_qm_validate_state_change(), factoring the quota
state checks out of xfs_qm_newmount() to reduce clutter within it.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14 09:40:35 +01:00
Lukas Herbolt
9e00163c31 xfs: do not check NEEDSREPAIR if ro,norecovery mount.
If there is corruption on the filesystem and xfs_repair
fails to repair it, the last resort for getting at the data
is a norecovery,ro mount.  But if NEEDSREPAIR is
set the filesystem cannot be mounted, and the flag must be
cleared manually using xfs_db to get access to what is
left of the corrupted fs.

Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14 09:40:35 +01:00
Darrick J. Wong
6e33017c32 xfs: fix data fork format filtering during inode repair
Coverity noticed that xrep_dinode_bad_metabt_fork never runs because
XFS_DINODE_FMT_META_BTREE is always filtered out in the mode selection
switch of xrep_dinode_check_dfork.

Metadata btrees are allowed only in the data forks of regular files, so
add this case explicitly.  I guess this got fubard during a refactoring
prior to 6.13 and I didn't notice until now. :/

Coverity-id: 1617714
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14 09:40:24 +01:00
Darrick J. Wong
66314e9a57 xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n
I received a report from the release engineering side of the house that
xfs_scrub without the -n flag (aka fix it mode) would try to fix a
broken filesystem even on a kernel that doesn't have online repair built
into it:

 # xfs_scrub -dTvn /mnt/test
 EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
 Phase 1: Find filesystem geometry.
 /mnt/test: using 1 threads to scrub.
 Phase 1: Memory used: 132k/0k (108k/25k), time:  0.00/ 0.00/ 0.00s
 <snip>
 Phase 4: Repair filesystem.
 <snip>
 Info: /mnt/test/some/victimdir directory entries: Attempting repair. (repair.c line 351)
 Corruption: /mnt/test/some/victimdir directory entries: Repair unsuccessful; offline repair required. (repair.c line 204)

Source: https://blogs.oracle.com/linux/post/xfs-online-filesystem-repair

It is strange that xfs_scrub doesn't refuse to run, because the kernel
is supposed to return EOPNOTSUPP if we actually needed to run a repair,
and xfs_io's repair subcommand will perror that.  And yet:

 # xfs_io -x -c 'repair probe' /mnt/test
 #

The first problem is commit dcb660f922 (4.15) which should have had
xchk_probe set the CORRUPT OFLAG so that any of the repair machinery
will get called at all.

It turns out that some refactoring that happened in the 6.6-6.8 era
broke the operation of this corner case.  What we *really* want to
happen is that all the predicates that would steer xfs_scrub_metadata()
towards calling xrep_attempt() should function the same way that they do
when repair is compiled in; and then xrep_attempt gets to return the
fatal EOPNOTSUPP error code that causes the probe to fail.

Instead, commit 8336a64eb7 (6.6) started the failwhale swimming by
hoisting OFLAG checking logic into a helper whose non-repair stub always
returns false, causing scrub to return "repair not needed" when in fact
the repair is not supported.  Prior to that commit, the oflag checking
that was open-coded in scrub.c worked correctly.

Similarly, in commit 4bdfd7d157 (6.8) we hoisted the IFLAG_REPAIR
and ALREADY_FIXED logic into a helper whose non-repair stub always
returns false, so we never enter the if test body that would have called
xrep_attempt, let alone fail to decode the OFLAGs correctly.

The final insult (yes, we're doing The Naked Gun now) is commit
48a72f6086 (6.8) in which we hoisted the "are we going to try a
repair?" predicate into yet another function with a non-repair stub
always returns false.

Fix xchk_probe to trigger xrep_probe if repair is enabled, or return
EOPNOTSUPP directly if it is not.  For all the other scrub types, we
need to fix the header predicates so that the ->repair functions (which
are all xrep_notsupported) get called to return EOPNOTSUPP.  Commit
48a72 is tagged here because the scrub code prior to LTS 6.12 is
incomplete and not worth patching.

Reported-by: David Flynn <david.flynn@oracle.com>
Cc: <stable@vger.kernel.org> # v6.8
Fixes: 8336a64eb7 ("xfs: don't complain about unfixed metadata when repairs were injected")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14 09:37:25 +01:00
Christoph Hellwig
ddd402bbbf
iomap: pass private data to iomap_truncate_page
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250206064035.2323428-12-hch@lst.de
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-06 13:02:15 +01:00
Christoph Hellwig
c6d1b8d154
iomap: pass private data to iomap_zero_range
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
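
A hedged caller-side sketch, assuming the private pointer is appended as
the final argument and surfaced to the methods via the iterator; the
context structure is hypothetical:

#include <linux/iomap.h>

struct zero_ctx {
        long long reserved_blocks;      /* consumed by iomap_begin/iomap_end */
};

static int zero_range_with_ctx(struct inode *inode, loff_t pos, loff_t len,
                               const struct iomap_ops *ops, long long resv)
{
        struct zero_ctx ctx = { .reserved_blocks = resv };
        bool did_zero = false;

        return iomap_zero_range(inode, pos, len, &did_zero, ops, &ctx);
}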

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250206064035.2323428-11-hch@lst.de
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-06 13:02:15 +01:00
Christoph Hellwig
02b39c4655
iomap: pass private data to iomap_page_mkwrite
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250206064035.2323428-10-hch@lst.de
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-06 13:02:15 +01:00
Christoph Hellwig
7102733306
iomap: simplify io_flags and io_type in struct iomap_ioend
The ioend fields for distinct types of I/O are a bit complicated.
Consolidate them into a single io_flag field with its own flags
decoupled from the iomap flags.  This also prepares for adding a new
flag that is unrelated to both of the iomap namespaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250206064035.2323428-3-hch@lst.de
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-06 13:02:13 +01:00
Christoph Hellwig
c50105933f
iomap: allow the file system to submit the writeback bios
Change ->prepare_ioend to ->submit_ioend and require file systems that
implement it to submit the bio.  This is needed for file systems that
do their own work on the bios before submitting them to the block layer
like btrfs or zoned xfs.  To make this easier also pass the writeback
context to the method.
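
A hedged sketch of a ->submit_ioend implementation as described; exact
field names may differ, so treat it as illustrative only:

#include <linux/iomap.h>
#include <linux/bio.h>

static int example_submit_ioend(struct iomap_writepage_ctx *wpc, int status)
{
        if (status)
                return status;  /* writeback failed earlier; nothing to submit */

        /* file system specific processing of the ioend's bio goes here */
        submit_bio(&wpc->ioend->io_bio);
        return 0;
}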

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250206064035.2323428-2-hch@lst.de
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-06 13:02:13 +01:00