mirror_ubuntu-kernels/fs/xfs
Dave Chinner ebb7fb1557 xfs, iomap: limit individual ioend chain lengths in writeback
Trond Myklebust reported soft lockups in XFS IO completion such as
this:

 watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
 CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
 Workqueue: xfs-conv/md127 xfs_end_io [xfs]
 RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
 Call Trace:
  wake_up_page_bit+0x8a/0x110
  iomap_finish_ioend+0xd7/0x1c0
  iomap_finish_ioends+0x7f/0xb0
  xfs_end_ioend+0x6b/0x100 [xfs]
  xfs_end_io+0xb9/0xe0 [xfs]
  process_one_work+0x1a7/0x360
  worker_thread+0x1fa/0x390
  kthread+0x116/0x130
  ret_from_fork+0x35/0x40

Ioends are processed as an atomic completion unit when all the
chained bios in the ioend have completed their IO. Logically
contiguous ioends can also be merged and completed as a single,
larger unit.  Both of these things can be problematic as both the
bio chains per ioend and the size of the merged ioends processed as
a single completion are both unbound.

If we have a large sequential dirty region in the page cache,
write_cache_pages() will keep feeding us sequential pages and we
will keep mapping them into ioends and bios until we get a dirty
page at a non-sequential file offset. These large sequential runs
can will result in bio and ioend chaining to optimise the io
patterns. The pages iunder writeback are pinned within these chains
until the submission chaining is broken, allowing the entire chain
to be completed. This can result in huge chains being processed
in IO completion context.

We get deep bio chaining if we have large contiguous physical
extents. We will keep adding pages to the current bio until it is
full, then we'll chain a new bio to keep adding pages for writeback.
Hence we can build bio chains that map millions of pages and tens of
gigabytes of RAM if the page cache contains big enough contiguous
dirty file regions. This long bio chain pins those pages until the
final bio in the chain completes and the ioend can iterate all the
chained bios and complete them.

OTOH, if we have a physically fragmented file, we end up submitting
one ioend per physical fragment that each have a small bio or bio
chain attached to them. We do not chain these at IO submission time,
but instead we chain them at completion time based on file
offset via iomap_ioend_try_merge(). Hence we can end up with unbound
ioend chains being built via completion merging.

XFS can then do COW remapping or unwritten extent conversion on that
merged chain, which involves walking an extent fragment at a time
and running a transaction to modify the physical extent information.
IOWs, we merge all the discontiguous ioends together into a
contiguous file range, only to then process them individually as
discontiguous extents.

This extent manipulation is computationally expensive and can run in
a tight loop, so merging logically contiguous but physically
discontigous ioends gains us nothing except for hiding the fact the
fact we broke the ioends up into individual physical extents at
submission and then need to loop over those individual physical
extents at completion.

Hence we need to have mechanisms to limit ioend sizes and
to break up completion processing of large merged ioend chains:

1. bio chains per ioend need to be bound in length. Pure overwrites
go straight to iomap_finish_ioend() in softirq context with the
exact bio chain attached to the ioend by submission. Hence the only
way to prevent long holdoffs here is to bound ioend submission
sizes because we can't reschedule in softirq context.

2. iomap_finish_ioends() has to handle unbound merged ioend chains
correctly. This relies on any one call to iomap_finish_ioend() being
bound in runtime so that cond_resched() can be issued regularly as
the long ioend chain is processed. i.e. this relies on mechanism #1
to limit individual ioend sizes to work correctly.

3. filesystems have to loop over the merged ioends to process
physical extent manipulations. This means they can loop internally,
and so we break merging at physical extent boundaries so the
filesystem can easily insert reschedule points between individual
extent manipulations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-26 09:19:20 -08:00
..
libxfs Withdraw the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl definitions. 2022-01-21 08:51:48 +02:00
scrub xfs: fix online fsck handling of v5 feature bits on secondary supers 2022-01-12 09:45:21 -08:00
Kconfig xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n 2020-10-16 15:34:28 -07:00
kmem.c mm: introduce memalloc_retry_wait() 2022-01-15 16:30:29 +02:00
kmem.h xfs: remove kmem_zone typedef 2021-10-22 16:00:31 -07:00
Makefile xfs: refactor log recovery item sorting into a generic dispatch structure 2020-05-08 08:49:58 -07:00
mrlock.h
xfs_acl.c overlayfs update for 5.15 2021-09-02 09:21:27 -07:00
xfs_acl.h vfs: add rcu argument to ->get_acl() callback 2021-08-18 22:08:24 +02:00
xfs_aops.c xfs, iomap: limit individual ioend chain lengths in writeback 2022-01-26 09:19:20 -08:00
xfs_aops.h
xfs_attr_inactive.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_attr_list.c xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown 2021-08-19 10:07:13 -07:00
xfs_bio_io.c xfs: async blkdev cache flush 2021-06-21 10:05:51 -07:00
xfs_bmap_item.c xfs: create slab caches for frequently-used deferred items 2021-10-22 16:04:36 -07:00
xfs_bmap_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_bmap_util.c Remove the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl families. 2022-01-21 08:47:25 +02:00
xfs_bmap_util.h xfs: kill the XFS_IOC_{ALLOC,FREE}SP* ioctls 2022-01-17 09:16:41 -08:00
xfs_buf_item_recover.c xfs: check sb_meta_uuid for dabuf buffer recovery 2021-12-21 09:49:41 -08:00
xfs_buf_item.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_buf_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_buf.c Merge branch 'akpm' (patches from Andrew) 2022-01-15 20:37:06 +02:00
xfs_buf.h dax: return the partition offset from fs_dax_get_by_bdev 2021-12-04 08:58:54 -08:00
xfs_dir2_readdir.c xfs: take the ILOCK when readdir inspects directory mapping data 2022-01-11 15:11:04 -08:00
xfs_discard.c xfs: convert mount flags to features 2021-08-19 10:07:12 -07:00
xfs_discard.h
xfs_dquot_item_recover.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_dquot_item.c xfs: remove support for disabling quota accounting on a mounted file system 2021-08-06 11:05:36 -07:00
xfs_dquot_item.h xfs: remove support for disabling quota accounting on a mounted file system 2021-08-06 11:05:36 -07:00
xfs_dquot.c xfs: hold quota inode ILOCK_EXCL until the end of dqalloc 2022-01-06 10:43:30 -08:00
xfs_dquot.h xfs: queue inactivation immediately when quota is nearing enforcement 2021-08-09 10:52:18 -07:00
xfs_error.c xfs: sysfs: use default_groups in kobj_type 2022-01-06 10:43:30 -08:00
xfs_error.h xfs: add trace point for fs shutdown 2021-08-18 18:46:00 -07:00
xfs_export.c xfs: convert remaining mount flags to state flags 2021-08-19 10:07:13 -07:00
xfs_export.h
xfs_extent_busy.c xfs: pass perags through to the busy extent code 2021-06-02 10:48:24 +10:00
xfs_extent_busy.h xfs: pass perags through to the busy extent code 2021-06-02 10:48:24 +10:00
xfs_extfree_item.c xfs: reduce the size of struct xfs_extent_free_item 2021-10-22 16:04:36 -07:00
xfs_extfree_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_file.c Remove the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl families. 2022-01-21 08:47:25 +02:00
xfs_filestream.c xfs: convert remaining mount flags to state flags 2021-08-19 10:07:13 -07:00
xfs_filestream.h xfs: convert mount flags to features 2021-08-19 10:07:12 -07:00
xfs_fsmap.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_fsmap.h xfs: fix deadlock and streamline xfs_getfsmap performance 2020-10-07 08:40:29 -07:00
xfs_fsops.c xfs: convert remaining mount flags to state flags 2021-08-19 10:07:13 -07:00
xfs_fsops.h xfs: get rid of xfs_growfs_{data,log}_t 2021-02-03 09:18:50 -08:00
xfs_globals.c xfs: consolidate the eofblocks and cowblocks workers 2021-02-03 09:18:49 -08:00
xfs_health.c xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown 2021-08-19 10:07:13 -07:00
xfs_icache.c New code for 5.17: 2022-01-22 11:04:27 +02:00
xfs_icache.h xfs: throttle inode inactivation queuing on memory reclaim 2021-08-09 11:13:17 -07:00
xfs_icreate_item.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_icreate_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_inode_item_recover.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_inode_item.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_inode_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_inode.c fs.idmapped.v5.17 2022-01-11 14:26:55 -08:00
xfs_inode.h xfs: remove xfs_inew_wait 2021-11-24 10:06:02 -08:00
xfs_ioctl32.c xfs: kill the XFS_IOC_{ALLOC,FREE}SP* ioctls 2022-01-17 09:16:41 -08:00
xfs_ioctl32.h xfs: remove unused xfs_ioctl32.h declarations 2022-01-18 10:18:36 -08:00
xfs_ioctl.c xfs: remove the XFS_IOC_{ALLOC,FREE}SP* definitions 2022-01-17 09:17:11 -08:00
xfs_ioctl.h xfs: kill the XFS_IOC_{ALLOC,FREE}SP* ioctls 2022-01-17 09:16:41 -08:00
xfs_iomap.c fsdax: shift partition offset handling into the file systems 2021-12-04 08:58:54 -08:00
xfs_iomap.h iomap: add a IOMAP_DAX flag 2021-12-04 08:58:53 -08:00
xfs_iops.c dax + libnvdimm for v5.17 2022-01-12 15:46:11 -08:00
xfs_iops.h xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_itable.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_itable.h xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_iwalk.c xfs: avoid buffer deadlocks when walking fs inodes 2021-08-09 11:13:16 -07:00
xfs_iwalk.h
xfs_linux.h fs: move mapping helpers 2021-12-03 18:50:17 +01:00
xfs_log_cil.c xfs: reduce kvmalloc overhead for CIL shadow buffers 2022-01-06 10:43:30 -08:00
xfs_log_priv.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_log_recover.c xfs: Remove redundant assignment of mp 2022-01-06 10:43:30 -08:00
xfs_log.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_log.h xfs: AIL needs asynchronous CIL forcing 2021-08-16 12:09:30 -07:00
xfs_message.c xfs: refactor ratelimited buffer error messages into helper 2020-05-07 08:27:46 -07:00
xfs_message.h once: implement DO_ONCE_LITE for non-fast-path "do once" functionality 2021-06-28 15:54:57 -07:00
xfs_mount.c xfs: only run COW extent recovery when there are no live extents 2021-12-21 09:49:41 -08:00
xfs_mount.h xfs: compute maximum AG btree height for critical reservation calculation 2021-10-19 11:45:15 -07:00
xfs_mru_cache.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_mru_cache.h
xfs_ondisk.h xfs: rename struct xfs_legacy_ictimestamp 2021-04-22 18:29:25 -07:00
xfs_pnfs.c iomap: add a IOMAP_DAX flag 2021-12-04 08:58:53 -08:00
xfs_pnfs.h
xfs_pwork.c xfs: increase the default parallelism levels of pwork clients 2021-02-03 09:18:49 -08:00
xfs_pwork.h xfs: increase the default parallelism levels of pwork clients 2021-02-03 09:18:49 -08:00
xfs_qm_bhv.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_qm_syscalls.c xfs: fix quotaoff mutex usage now that we don't support disabling it 2021-12-21 09:49:41 -08:00
xfs_qm.c xfs: remove the xfs_dqblk_t typedef 2021-10-14 09:19:33 -07:00
xfs_qm.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_quota.h xfs: queue inactivation immediately when quota is nearing enforcement 2021-08-09 10:52:18 -07:00
xfs_quotaops.c xfs: remove the active vs running quota differentiation 2021-08-06 11:05:37 -07:00
xfs_refcount_item.c xfs: create slab caches for frequently-used deferred items 2021-10-22 16:04:36 -07:00
xfs_refcount_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_reflink.c dax + libnvdimm for v5.17 2022-01-12 15:46:11 -08:00
xfs_reflink.h xfs: convert xfs_sb_version_has checks to use mount features 2021-08-19 10:07:14 -07:00
xfs_rmap_item.c xfs: create slab caches for frequently-used deferred items 2021-10-22 16:04:36 -07:00
xfs_rmap_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_rtalloc.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_rtalloc.h xfs: make the record pointer passed to query_range functions const 2021-08-18 18:46:01 -07:00
xfs_stats.c xfs: periodically relog deferred intent items 2020-10-07 08:40:28 -07:00
xfs_stats.h xfs: periodically relog deferred intent items 2020-10-07 08:40:28 -07:00
xfs_super.c dax + libnvdimm for v5.17 2022-01-12 15:46:11 -08:00
xfs_super.h xfs: remove xfs_blkdev_issue_flush 2021-06-21 10:05:46 -07:00
xfs_symlink.c New code for 5.17: 2022-01-11 15:01:50 -08:00
xfs_symlink.h xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_sysctl.c xfs: restore speculative_cow_prealloc_lifetime sysctl 2021-02-24 10:16:08 -08:00
xfs_sysctl.h xfs: consolidate the eofblocks and cowblocks workers 2021-02-03 09:18:49 -08:00
xfs_sysfs.c xfs: sysfs: use default_groups in kobj_type 2022-01-06 10:43:30 -08:00
xfs_sysfs.h xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init 2020-08-07 11:50:17 -07:00
xfs_trace.c xfs: add trace point for fs shutdown 2021-08-18 18:46:00 -07:00
xfs_trace.h xfs: prepare xfs_btree_cur for dynamic cursor heights 2021-10-19 11:45:14 -07:00
xfs_trans_ail.c xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown 2021-08-19 10:07:13 -07:00
xfs_trans_buf.c xfs: introduce xfs_buf_daddr() 2021-08-19 10:07:14 -07:00
xfs_trans_dquot.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_trans_priv.h xfs: refactor adding recovered intent items to the log 2020-05-08 08:50:00 -07:00
xfs_trans.c xfs: shut down filesystem if we xfs_trans_cancel with deferred work items 2021-12-21 09:49:41 -08:00
xfs_trans.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_xattr.c xfs: prevent metadata files from being inactivated 2021-03-25 16:47:50 -07:00
xfs.h