- Various minor fixes in online scrub.
- Prevent metadata files from being automatically inactivated.
- Validate btree heights by the computed per-btree limits.
- Don't warn about remounting with deprecated mount options.
- Initialize attr forks at create time if we suspect we're going to need
to store them.
- Reduce memory reallocation workouts in the logging code.
- Fix some theoretical math calculation errors in logged buffers that
span multiple discontig memory ranges but contiguous ondisk regions.
- Speedups in dirty buffer bitmap handling.
- Make type verifier functions more inline-happy to reduce overhead.
- Reduce debug overhead in directory checking code.
- Many many typo fixes.
- Begin to handle the permanent loss of the very end of a filesystem.
- Fold struct xfs_icdinode into xfs_inode.
- Deprecate the long defunct BMV_IF_NO_DMAPI_READ from the bmapx ioctl.
- Remove a broken directory block format check from online scrub.
- Fix a bug where we could produce an unnecessarily tall data fork btree
when creating an attr fork.
- Fix scrub and readonly remounts racing.
- Fix a writeback ioend log deadlock problem by dropping the behavior
where we could preallocate a setfilesize transaction.
- Fix some bugs in the new extent count checking code.
- Fix some bugs in the attr fork preallocation code.
- Refactor if_flags out of the incore inode fork data structure.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmB6MFUACgkQ+H93GTRK
tOvigBAAlpzBUXnZVo+U18u0tSHnq5c1zbXYcf5GPhQv9w3n3TlPi3YhK2vgEXlI
TULwsdU+an30oqWkQiVrwQjKPVaTWeWE3K0sA2MlYX9L2CwPPde4x5hwhyppfQxq
mQyu0suWp480ao7vToXAgZ751OdZRtGu8sRQ7rVQ/FVf9K4R8EqpZMEynNry25f+
hpK235hpf4IUC9E1A4pE2hNBSr/LGPIyu1t5sZsfazcNmtpKcauy5R5b8Pdnzo2/
WFa6PoeE8SRIp4OxZY/c/4QUI5cRubJGyoB+kbl0hg69uYIJO+pc+R69yrQPD9Z+
JDW/FktH+Zz4pstFsC+qnSvhRaF2DvXpvXrIldonQ2Z2ByVqbs3r6HzKySlWQ+QE
jU717HApWl/ADI/kVD2IuQnrbU+Q8Ue8thzgQeEpTRWsea2HzPMofNi5FImU2ulw
g4V7PleQWJ6AsLhcpfA46Y+CUAtjTD1Tvj67JpXuWJ+MFTB4hRm3U7zgCtV/0c3T
wBBUybQjDoVA6DDr6CP/9ki1k0BO3wKJGlZMR0bkEsuxXdFNTvHEz5lmueYT/Wxc
D91+oRbna9NpEeIVFGo6lhMIu2t0iYssFdgQKyn1jXrpGXKvOklP8zDjRdPnnQmz
plT2ajlXPIjc6KjOTP2mbVqKs059LuJoYV7gIWwM7CgtFsMIrd8=
=oRKe
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"The notable user-visible addition this cycle is ability to remove
space from the last AG in a filesystem. This is the first of many
changes needed for full-fledged support for shrinking a filesystem.
Still needed are (a) the ability to reorganize files and metadata away
from the end of the fs; (b) the ability to remove entire allocation
groups; (c) shrink support for realtime volumes; and (d) thorough
testing of (a-c).
There are a number of performance improvements in this code drop: Dave
streamlined various parts of the buffer logging code and reduced the
cost of various debugging checks, and added the ability to pre-create
the xattr structures while creating files. Brian eliminated
transaction reservations that were being held across writeback (thus
reducing livelock potential.
Other random pieces: Pavel fixed the repetitve warnings about
deprecated mount options, I fixed online fsck to behave itself when a
readonly remount comes in during scrub, and refactored various other
parts of that code, Christoph contributed a lot of refactoring this
cycle. The xfs_icdinode structure has been absorbed into the (incore)
xfs_inode structure, and the format and flags handling around
xfs_inode_fork structures has been simplified. Chandan provided a
number of fixes for extent count overflow related problems that have
been shaken out by debugging knobs added during 5.12.
Summary:
- Various minor fixes in online scrub.
- Prevent metadata files from being automatically inactivated.
- Validate btree heights by the computed per-btree limits.
- Don't warn about remounting with deprecated mount options.
- Initialize attr forks at create time if we suspect we're going to
need to store them.
- Reduce memory reallocation workouts in the logging code.
- Fix some theoretical math calculation errors in logged buffers that
span multiple discontig memory ranges but contiguous ondisk
regions.
- Speedups in dirty buffer bitmap handling.
- Make type verifier functions more inline-happy to reduce overhead.
- Reduce debug overhead in directory checking code.
- Many many typo fixes.
- Begin to handle the permanent loss of the very end of a filesystem.
- Fold struct xfs_icdinode into xfs_inode.
- Deprecate the long defunct BMV_IF_NO_DMAPI_READ from the bmapx
ioctl.
- Remove a broken directory block format check from online scrub.
- Fix a bug where we could produce an unnecessarily tall data fork
btree when creating an attr fork.
- Fix scrub and readonly remounts racing.
- Fix a writeback ioend log deadlock problem by dropping the behavior
where we could preallocate a setfilesize transaction.
- Fix some bugs in the new extent count checking code.
- Fix some bugs in the attr fork preallocation code.
- Refactor if_flags out of the incore inode fork data structure"
* tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (77 commits)
xfs: remove xfs_quiesce_attr declaration
xfs: remove XFS_IFEXTENTS
xfs: remove XFS_IFINLINE
xfs: remove XFS_IFBROOT
xfs: only look at the fork format in xfs_idestroy_fork
xfs: simplify xfs_attr_remove_args
xfs: rename and simplify xfs_bmap_one_block
xfs: move the XFS_IFEXTENTS check into xfs_iread_extents
xfs: drop unnecessary setfilesize helper
xfs: drop unused ioend private merge and setfilesize code
xfs: open code ioend needs workqueue helper
xfs: drop submit side trans alloc for append ioends
xfs: fix return of uninitialized value in variable error
xfs: get rid of the ip parameter to xchk_setup_*
xfs: fix scrub and remount-ro protection when running scrub
xfs: move the check for post-EOF mappings into xfs_can_free_eofblocks
xfs: move the xfs_can_free_eofblocks call under the IOLOCK
xfs: precalculate default inode attribute offset
xfs: default attr fork size does not handle device inodes
xfs: inode fork allocation depends on XFS_IFEXTENT flag
...
The in-memory XFS_IFEXTENTS is now only used to check if an inode with
extents still needs the extents to be read into memory before doing
operations that need the extent map. Add a new xfs_need_iread_extents
helper that returns true for btree format forks that do not have any
entries in the in-memory extent btree, and use that instead of checking
the XFS_IFEXTENTS flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Just check for an inline format fork instead of the using the equivalent
in-memory XFS_IFINLINE flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Just check for a btree format fork instead of the using the equivalent
in-memory XFS_IFBROOT flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Stop using the XFS_IFEXTENTS flag, and instead switch on the fork format
in xfs_idestroy_fork to decide how to cleanup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Directly return from the subfunctions and avoid the error variable. Also
remove the not really needed dp local variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
xfs_bmap_one_block is only called for the attribute fork. Move it to
xfs_attr.c, drop the unused whichfork argument and code only executed for
the data fork and rename the result to xfs_attr_is_leaf.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Move the XFS_IFEXTENTS check from the callers into xfs_iread_extents to
simplify the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
A previous commit removed a call to xfs_attr3_leaf_read that
assigned an error return code to variable error. We now have
a few early error return paths to label 'out' that return
error if error is set; however error now is uninitialized
so potentially garbage is being returned. Fix this by setting
error to zero to restore the original behaviour where error
was zero at the label 'restart'.
Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: 07120f1abd ("xfs: Add xfs_has_attr and subroutines")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Default attr fork offset is based on inode size, so is a fixed
geometry parameter of the inode. Move it to the xfs_ino_geometry
structure and stop calculating it on every call to
xfs_default_attroffset().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Device inodes have a non-default data fork size of 8 bytes
as checked/enforced by xfs_repair. xfs_default_attroffset() doesn't
handle this, so lets do a minor refactor so it does.
Fixes: e6a688c332 ("xfs: initialise attr fork on inode create")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
The incore data fork of an inode stores the bmap btree root node as 'struct
xfs_btree_block'. However, the ondisk version of the inode stores the bmap
btree root node as a 'struct xfs_bmdr_block'.
xfs_bmap_add_attrfork_btree() checks if the btree root node fits inside the
data fork of the inode. However, it incorrectly uses 'struct xfs_btree_block'
to compute the size of the bmap btree root node. Since size of 'struct
xfs_btree_block' is larger than that of 'struct xfs_bmdr_block',
xfs_bmap_add_attrfork_btree() could end up unnecessarily demoting the current
root node as the child of newly allocated root node.
This commit optimizes space usage by modifying xfs_bmap_add_attrfork_btree()
to use 'struct xfs_bmdr_block' to check if the bmap btree root node fits
inside the data fork of the inode.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Use of the flag has had no effect since kernel commit 288699feca
("xfs: drop dmapi hooks"), which removed all dmapi related code, so
deprecate it.
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Move the crtime field from struct xfs_icdinode into stuct xfs_inode and
remove the now entirely unused struct xfs_icdinode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the flags2
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the flags
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the
forkoff field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The i_cowextsize field is only used for v3 inodes, and the i_flushiter
field is only used for v1/v2 inodes. Use a union to pack the inode a
littler better after adding a few missing guards around their usage.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the
flushiter field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the
cowextsize field into the containing xfs_inode structure. Also
switch to use the xfs_extlen_t instead of a uint32_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the extsize
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the nblocks
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the on-disk
size field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the projid
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The legacy DMAPI fields were never set by upstream Linux XFS, and have no
way to be read using the kernel APIs. So instead of bloating the in-core
inode for them just copy them from the on-disk inode into the log when
logging the inode. The only caveat is that we need to make sure to zero
the fields for newly read or deleted inodes, which is solved using a new
flag in the inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Split looking up the dinode from xfs_imap_to_bp, which can be
significantly simplified as a result.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
xfs/538 can cause the following call trace to be printed when executing on a
multi-block directory configuration,
WARNING: CPU: 1 PID: 2578 at fs/xfs/libxfs/xfs_bmap.c:717 xfs_bmap_extents_to_btree+0x520/0x5d0
Call Trace:
? xfs_buf_rele+0x4f/0x450
xfs_bmap_add_extent_hole_real+0x747/0x960
xfs_bmapi_allocate+0x39a/0x440
xfs_bmapi_write+0x507/0x9e0
xfs_da_grow_inode_int+0x1cd/0x330
? up+0x12/0x60
xfs_dir2_grow_inode+0x62/0x110
? xfs_trans_log_inode+0x234/0x2d0
xfs_dir2_sf_to_block+0x103/0x940
? xfs_dir2_sf_check+0x8c/0x210
? xfs_da_compname+0x19/0x30
? xfs_dir2_sf_lookup+0xd0/0x3d0
xfs_dir2_sf_addname+0x10d/0x910
xfs_dir_createname+0x1ad/0x210
xfs_create+0x404/0x620
xfs_generic_create+0x24c/0x320
path_openat+0xda6/0x1030
do_filp_open+0x88/0x130
? kmem_cache_alloc+0x50/0x210
? __cond_resched+0x16/0x40
? kmem_cache_alloc+0x50/0x210
do_sys_openat2+0x97/0x150
__x64_sys_creat+0x49/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
This occurs because xfs_bmap_exact_minlen_extent_alloc() initializes
xfs_alloc_arg->total to xfs_bmalloca->minlen. In the context of
xfs_bmap_exact_minlen_extent_alloc(), xfs_bmalloca->minlen has a value of 1
and hence the space allocator could choose an AG which has less than
xfs_bmalloca->total number of free blocks available. As the transaction
proceeds, one of the future space allocation requests could fail due to
non-availability of free blocks in the AG that was originally chosen.
This commit fixes the bug by assigning xfs_alloc_arg->total to the value of
xfs_bmalloca->total.
Fixes: 3015196746 ("xfs: Introduce error injection to allocate only minlen size extents for files")
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
per-AG resv failure after fixing up freespace is hard to test in an
effective way, so directly add an error injection path to observe
such error handling path works as expected.
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
This patch introduces a helper to shrink unused space in the last AG
by fixing up the freespace btree.
Also make sure that the per-AG reservation works under the new AG
size. If such per-AG reservation or extent allocation fails, roll
the transaction so the new transaction could cancel without any side
effects.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
On debug kernels, we call xfs_dir3_leaf_check_int() multiple times
on every directory modification. The robust hash ordering checks it
does on every entry in the leaf on every call results in a massive
CPU overhead which slows down debug kernels by a large amount.
We use xfs_dir3_leaf_check_int() for the verifiers as well, so we
can't just gut the function to reduce overhead. What we can do,
however, is reduce the work it does when it is called from the
debug interfaces, just leaving the high level checks in place and
leaving the robust validation to the verifiers. This means the debug
checks will catch gross errors, but subtle bugs might not be caught
until a verifier is run.
It is easy enough to restore the existing debug behaviour if the
developer needs it (just change a call parameter in the debug code),
but overwise the overhead makes testing large directory block sizes
on debug kernels very slow.
Profile at an unlink rate of ~80k file/s on a 64k block size
filesystem before the patch:
40.30% [kernel] [k] xfs_dir3_leaf_check_int
10.98% [kernel] [k] __xfs_dir3_data_check
8.10% [kernel] [k] xfs_verify_dir_ino
4.42% [kernel] [k] memcpy
2.22% [kernel] [k] xfs_dir2_data_get_ftype
1.52% [kernel] [k] do_raw_spin_lock
Profile after, at an unlink rate of ~125k files/s (+50% improvement)
has largely dropped the leaf verification debug overhead out of the
profile.
16.53% [kernel] [k] __xfs_dir3_data_check
12.53% [kernel] [k] xfs_verify_dir_ino
7.97% [kernel] [k] memcpy
3.36% [kernel] [k] xfs_dir2_data_get_ftype
2.86% [kernel] [k] __pv_queued_spin_lock_slowpath
Create shows a similar change in profile and a +25% improvement in
performance.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
We call xfs_dir_ino_validate() for every dir entry in a directory
when doing validity checking of the directory. It calls
xfs_verify_dir_ino() then emits a corruption report if bad or does
error injection if good. It is extremely costly:
43.27% [kernel] [k] xfs_dir3_leaf_check_int
10.28% [kernel] [k] __xfs_dir3_data_check
6.61% [kernel] [k] xfs_verify_dir_ino
4.16% [kernel] [k] xfs_errortag_test
4.00% [kernel] [k] memcpy
3.48% [kernel] [k] xfs_dir_ino_validate
7% of the cpu usage in this directory traversal workload is
xfs_dir_ino_validate() doing absolutely nothing.
We don't need error injection to simulate a bad inode numbers in the
directory structure because we can do that by fuzzing the structure
on disk.
And we don't need a corruption report, because the
__xfs_dir3_data_check() will emit one if the inode number is bad.
So just call xfs_verify_dir_ino() directly here, and get rid of all
this unnecessary overhead:
40.30% [kernel] [k] xfs_dir3_leaf_check_int
10.98% [kernel] [k] __xfs_dir3_data_check
8.10% [kernel] [k] xfs_verify_dir_ino
4.42% [kernel] [k] memcpy
2.22% [kernel] [k] xfs_dir2_data_get_ftype
1.52% [kernel] [k] do_raw_spin_lock
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
From a concurrent rm -rf workload:
41.04% [kernel] [k] xfs_dir3_leaf_check_int
9.85% [kernel] [k] __xfs_dir3_data_check
5.60% [kernel] [k] xfs_verify_ino
5.32% [kernel] [k] xfs_agino_range
4.21% [kernel] [k] memcpy
3.06% [kernel] [k] xfs_errortag_test
2.57% [kernel] [k] xfs_dir_ino_validate
1.66% [kernel] [k] xfs_dir2_data_get_ftype
1.17% [kernel] [k] do_raw_spin_lock
1.11% [kernel] [k] xfs_verify_dir_ino
0.84% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
0.83% [kernel] [k] xfs_buf_find
0.64% [kernel] [k] xfs_log_commit_cil
THere's an awful lot of overhead in just range checking inode
numbers in that, but each inode number check is not a lot of code.
The total is a bit over 14.5% of the CPU time is spent validating
inode numbers.
The problem is that they deeply nested global scope functions so the
overhead here is all in function call marshalling.
text data bss dec hex filename
2077 0 0 2077 81d fs/xfs/libxfs/xfs_types.o.orig
2197 0 0 2197 895 fs/xfs/libxfs/xfs_types.o
There's a small increase in binary size by inlining all the local
nested calls in the verifier functions, but the same workload now
profiles as:
40.69% [kernel] [k] xfs_dir3_leaf_check_int
10.52% [kernel] [k] __xfs_dir3_data_check
6.68% [kernel] [k] xfs_verify_dir_ino
4.22% [kernel] [k] xfs_errortag_test
4.15% [kernel] [k] memcpy
3.53% [kernel] [k] xfs_dir_ino_validate
1.87% [kernel] [k] xfs_dir2_data_get_ftype
1.37% [kernel] [k] do_raw_spin_lock
0.98% [kernel] [k] xfs_buf_find
0.94% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
0.73% [kernel] [k] xfs_log_commit_cil
Now we only spend just over 10% of the time validing inode numbers
for the same workload. Hence a few "inline" keyworks is good enough
to reduce the validation overhead by 30%...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When we allocate a new inode, we often need to add an attribute to
the inode as part of the create. This can happen as a result of
needing to add default ACLs or security labels before the inode is
made visible to userspace.
This is highly inefficient right now. We do the create transaction
to allocate the inode, then we do an "add attr fork" transaction to
modify the just created empty inode to set the inode fork offset to
allow attributes to be stored, then we go and do the attribute
creation.
This means 3 transactions instead of 1 to allocate an inode, and
this greatly increases the load on the CIL commit code, resulting in
excessive contention on the CIL spin locks and performance
degradation:
18.99% [kernel] [k] __pv_queued_spin_lock_slowpath
3.57% [kernel] [k] do_raw_spin_lock
2.51% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
2.48% [kernel] [k] memcpy
2.34% [kernel] [k] xfs_log_commit_cil
The typical profile resulting from running fsmark on a selinux enabled
filesytem is adds this overhead to the create path:
- 15.30% xfs_init_security
- 15.23% security_inode_init_security
- 13.05% xfs_initxattrs
- 12.94% xfs_attr_set
- 6.75% xfs_bmap_add_attrfork
- 5.51% xfs_trans_commit
- 5.48% __xfs_trans_commit
- 5.35% xfs_log_commit_cil
- 3.86% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.70% xfs_trans_alloc
0.52% xfs_trans_reserve
- 5.41% xfs_attr_set_args
- 5.39% xfs_attr_set_shortform.constprop.0
- 4.46% xfs_trans_commit
- 4.46% __xfs_trans_commit
- 4.33% xfs_log_commit_cil
- 2.74% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
0.60% xfs_inode_item_format
0.90% xfs_attr_try_sf_addname
- 1.99% selinux_inode_init_security
- 1.02% security_sid_to_context_force
- 1.00% security_sid_to_context_core
- 0.92% sidtab_entry_to_string
- 0.90% sidtab_sid2str_get
0.59% sidtab_sid2str_put.part.0
- 0.82% selinux_determine_inode_label
- 0.77% security_transition_sid
0.70% security_compute_sid.part.0
And fsmark creation rate performance drops by ~25%. The key point to
note here is that half the additional overhead comes from adding the
attribute fork to the newly created inode. That's crazy, considering
we can do this same thing at inode create time with a couple of
lines of code and no extra overhead.
So, if we know we are going to add an attribute immediately after
creating the inode, let's just initialise the attribute fork inside
the create transaction and chop that whole chunk of code out of
the create fast path. This completely removes the performance
drop caused by enabling SELinux, and the profile looks like:
- 8.99% xfs_init_security
- 9.00% security_inode_init_security
- 6.43% xfs_initxattrs
- 6.37% xfs_attr_set
- 5.45% xfs_attr_set_args
- 5.42% xfs_attr_set_shortform.constprop.0
- 4.51% xfs_trans_commit
- 4.54% __xfs_trans_commit
- 4.59% xfs_log_commit_cil
- 2.67% _raw_spin_lock
- 3.28% do_raw_spin_lock
3.08% __pv_queued_spin_lock_slowpath
0.66% xfs_inode_item_format
- 0.90% xfs_attr_try_sf_addname
- 0.60% xfs_trans_alloc
- 2.35% selinux_inode_init_security
- 1.25% security_sid_to_context_force
- 1.21% security_sid_to_context_core
- 1.19% sidtab_entry_to_string
- 1.20% sidtab_sid2str_get
- 0.86% sidtab_sid2str_put.part.0
- 0.62% _raw_spin_lock_irqsave
- 0.77% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.84% selinux_determine_inode_label
- 0.83% security_transition_sid
0.86% security_compute_sid.part.0
Which indicates the XFS overhead of creating the selinux xattr has
been halved. This doesn't fix the CIL lock contention problem, just
means it's not a limiting factor for this workload. Lock contention
in the security subsystems is going to be an issue soon, though...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: fix compilation error when CONFIG_SECURITY=n]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Files containing metadata (quota records, rt bitmap and summary info)
are fully managed by the filesystem, which means that all resource
cleanup must be explicit, not automatic. This means that they should
never be subjected automatic to post-eof truncation, nor should they be
freed automatically even if the link count drops to zero.
In other words, xfs_inactive() should leave these files alone. Add the
necessary predicate functions to make this happen. This adds a second
layer of prevention for the kinds of fs corruption that was fixed by
commit f4c32e87de. If we ever decide to support removing metadata
files, we should make all those metadata updates explicit.
Rearrange the order of #includes to fix compiler errors, since
xfs_mount.h is supposed to be included before xfs_inode.h
Followup-to: f4c32e87de ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use the AG btree height limits that we precomputed into the xfs_mount to
validate the AG headers instead of using XFS_BTREE_MAXLEVELS.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Because the iomap code using PF_MEMALLOC_NOFS to detect transaction
recursion in XFS is just wrong. Remove it from the iomap code and
replace it with XFS specific internal checks using
current->journal_info instead.
[djwong: This change also realigns the lifetime of NOFS flag changes to
match the incore transaction, instead of the inconsistent scheme we have
now.]
Fixes: 9070733b4e ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The assert in xfs_btree_del_cursor() checks that the bmapbt block
allocation field has been handled correctly before the cursor is
freed. This field is used for accurate calculation of indirect block
reservation requirements (for delayed allocations), for example.
generic/019 reproduces a scenario where this assert fails because
the filesystem has shutdown while in the middle of a bmbt record
insertion. This occurs after a bmbt block has been allocated via the
cursor but before the higher level bmap function (i.e.
xfs_bmap_add_extent_hole_real()) completes and resets the field.
Update the assert to accommodate the transient state if the
filesystem has shutdown. While here, clean up the indentation and
comments in the function.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
As xfs supports the feature of inode btree block counters now, expose
this feature flag in xfs geometry, for userspace can check if the
inobtcnt is enabled or not.
Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Make it so that we can reserve rt blocks with the xfs_trans_alloc_inode
wrapper function, then convert a few more callsites.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create a new helper xfs_trans_alloc_inode that allocates a transaction,
locks and joins an inode to it, and then reserves the appropriate amount
of quota against that transction. Then replace all the open-coded
idioms with a single call to this helper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Modify xfs_trans_reserve_quota_nblks so that we can reserve data and
realtime blocks from the dquot at the same time. This change has the
theoretical side effect that for allocations to realtime files we will
reserve from the dquot both the number of rtblocks being allocated and
the number of bmbt blocks that might be needed to add the mapping.
However, since the mount code disables quota if it finds a realtime
device, this should not result in any behavior changes.
Now that we've moved the inode creation callers away from using the
_nblks function, we can repurpose the (now unused) ninos argument for
realtime blocks, so make that change. This also replaces the flags
argument with a boolean parameter to force the reservation since we
don't need to distinguish between data and rt quota reservations any
more, and the only flag being passed in was FORCE_RES.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create a couple of convenience wrappers for creating and deleting quota
block reservations against future changes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Convert a few xfs_trans_*reserve* callsites that are open-coding other
convenience functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
With both CONFIG_XFS_DEBUG and CONFIG_XFS_WARN disabled, the only reference to
local variable "error" in xfs_bmap_compute_alignments() gets eliminated during
pre-processing stage of the compilation process. This causes the compiler to
generate a "set but not used" warning.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
This commit adds XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT error tag which
helps userspace test programs to get xfs_bmap_btalloc() to always
allocate minlen sized extents.
This is required for test programs which need a guarantee that minlen
extents allocated for a file do not get merged with their existing
neighbours in the inode's BMBT. "Inode fork extent overflow check" for
Directories, Xattrs and extension of realtime inodes need this since the
file offset at which the extents are being allocated cannot be
explicitly controlled from userspace.
One way to use this error tag is to,
1. Consume all of the free space by sequentially writing to a file.
2. Punch alternate blocks of the file. This causes CNTBT to contain
sufficient number of one block sized extent records.
3. Inject XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT error tag.
After step 3, xfs_bmap_btalloc() will issue space allocation
requests for minlen sized extents only.
ENOSPC error code is returned to userspace when there aren't any "one
block sized" extents left in any of the AGs.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This commit moves over the code in xfs_bmap_btalloc() which is
responsible for processing an allocated extent to a new function. Apart
from xfs_bmap_btalloc(), the new function will be invoked by another
function introduced in a future commit.
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This commit moves over the code which computes stripe alignment and
extent size hint alignment into a separate function. Apart from
xfs_bmap_btalloc(), the new function will be used by another function
introduced in a future commit.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The check for verifying if the allocated extent is from an AG whose
index is greater than or equal to that of tp->t_firstblock is already
done a couple of statements earlier in the same function. Hence this
commit removes the redundant assert statement.
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>