Split the lookup and refcount handling of struct xfs_perag into an
embedded xfs_group structure that can be reused for the upcoming
realtime groups.
It will be extended with more features later.
Note that the xg_type field will only need a single bit even with
realtime group support. For now it fills a hole, but it might be
worth folding it into another field if we can put this space to better
use.
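Roughly, the embedding pattern looks like the sketch below. The field
and helper names (xg_gno, xg_type, xg_ref, to_perag) are illustrative
and the example is written as standalone C; the real kernel layout may
differ:

	#include <stddef.h>
	#include <stdint.h>

	/* A single bit would suffice even with realtime group support. */
	enum xg_type { XG_TYPE_AG, XG_TYPE_RTG };

	/* Shared lookup/refcount state, embedded in each group flavor. */
	struct xfs_group_sketch {
		uint32_t	xg_gno;		/* group index used for lookup */
		enum xg_type	xg_type;	/* currently fills a hole */
		int		xg_ref;		/* refcount; the real code uses atomics */
	};

	/* AG-specific state wraps the shared part. */
	struct xfs_perag_sketch {
		struct xfs_group_sketch	pag_group;
		/* ... allocation-group-only fields ... */
	};

	/* Recover the containing perag from the embedded group. */
	static inline struct xfs_perag_sketch *to_perag(struct xfs_group_sketch *xg)
	{
		return (struct xfs_perag_sketch *)
			((char *)xg - offsetof(struct xfs_perag_sketch, pag_group));
	}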
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Back when I wrote commit a03297a0ca, I had thought that we'd be doing
users a favor by only marking inodes dontcache at the end of a scrub
operation, and only if there's only one reference to that inode. This
was more or less true back when I_DONTCACHE was an XFS iflag and the
only thing it did was change the outcome of xfs_fs_drop_inode to 1.
Note: If there are dentries pointing to the inode when scrub finishes,
the inode will have positive i_count and stay around in cache until
dentry reclaim.
But now we have d_mark_dontcache, which causes the inode *and* the
dentries attached to it to be marked I_DONTCACHE, which means that we
drop the dentries ASAP, which drops the inode ASAP.
This is bad if scrub found problems with the inode, because now it can
be scheduled for inactivation, which can cause inodegc to trip on it and
shut down the filesystem.
Even if the inode isn't bad, this is still suboptimal because phases 3-7
each initiate inode scans. Dropping the inode immediately during phase
3 is silly because phase 5 will reload it and drop it immediately, etc.
It's fine to mark the inodes dontcache, but if there have been accesses
to the file that set up dentries, we should keep them.
I validated this by setting up ftrace to capture xfs_iget_recycle*
tracepoints and ran xfs/285 for 30 seconds. With current djwong-wtf I
saw ~30,000 recycle events. I then dropped the d_mark_dontcache calls
and set XFS_IGET_DONTCACHE, and the recycle events dropped to ~5,000 per
30 seconds.
Therefore, grab the inode with XFS_IGET_DONTCACHE, which only has the
effect of setting I_DONTCACHE for cache misses. Remove the
d_mark_dontcache call that can happen in xchk_irele.
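Roughly speaking, the change looks like this (the call sites shown are
an illustration, not the actual scrub code):

	/* Before: at irele time, evict the dentries and the inode ASAP. */
	d_mark_dontcache(VFS_I(ip));

	/* After: only a cache miss picks up I_DONTCACHE, at lookup time. */
	error = xfs_iget(mp, tp, inum, XFS_IGET_DONTCACHE, 0, &ip);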
Fixes: a03297a0ca ("xfs: manage inode DONTCACHE status at irele time")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Teach the online directory repair code to scan the filesystem so that we
can set the dotdot entry when we're rebuilding a directory. This
involves dropping ILOCK on the directory that we're repairing, which
means that the VFS can sneak in and tell us to update dotdot at any
time. Deal with these races by using a dirent hook to absorb dotdot
updates, and be careful not to check the scan results until after we've
retaken the ILOCK.
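In rough outline, the repair follows this pattern (sketch only; the
scan and commit steps are paraphrased, not the actual repair code):

	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL);	/* the scan may sleep */

	/*
	 * Walk the filesystem looking for this directory's parent.  While
	 * the ILOCK is dropped, the dirent hook records any dotdot updates
	 * that race with us.
	 */

	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);

	/*
	 * Only now is it safe to combine the scan result with whatever the
	 * hook absorbed and commit the new dotdot entry.
	 */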
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While reviewing the next patch which fixes an ABBA deadlock between the
AGI and a directory ILOCK, someone asked a question about why we're
holding the AGI in the first place. The reason for that is to quiesce
the inode structures for that AG while we do a repair.
I then realized that the xrep_dinode_findmode invokes xchk_iscan_iter,
which walks the inobts (and hence the AGIs) to find all the inodes.
This itself is also an ABBA vector, since the damaged inode could be in
AG 5, whose AGI we hold while we scan AG 0 for directories. Taking
AGI 0 after AGI 5 is not an allowed locking order.
To address this, modify the iscan to allow trylock of the AGI buffer
using the flags argument to xfs_ialloc_read_agi that the previous patch
added.
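In outline, the scanner can now back off instead of blocking (sketch
only; the flag name and the error handling are assumptions, the real
interface is whatever the previous patch added):

	/* Try to read the AGI without blocking on its buffer lock. */
	error = xfs_ialloc_read_agi(pag, tp, TRYLOCK_FLAG /* hypothetical */,
			&agi_bp);
	if (error == -EAGAIN) {
		/* AGI is busy; release what we hold, back off, retry. */
	}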
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Allow callers to pass buffer lookup flags to xfs_read_agi and
xfs_ialloc_read_agi. This will be used in the next patch to fix a
deadlock in the online fsck inode scanner.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Split xfs_inobt_init_cursor into separate routines for the inobt and
finobt to prepare for the removal of the xfs_btnum global enumeration
of btree types.
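Conceptually the split looks like this (argument lists are
illustrative; only the shape of the change matters here):

	/* Before: one constructor keyed off the global btree type enum. */
	cur = xfs_inobt_init_cursor(pag, tp, agbp, XFS_BTNUM_FINO);

	/* After: a dedicated constructor per btree, no xfs_btnum needed. */
	cur = xfs_inobt_init_cursor(pag, tp, agbp);
	cur = xfs_finobt_init_cursor(pag, tp, agbp);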
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Repair might encounter an inode with a totally garbage i_mode. To fix
this problem, we have to figure out if the file was a regular file, a
directory, or a special file. One way to figure this out is to check if
there are any directories with entries pointing down to the busted file.
This patch recovers the file mode by scanning every directory entry on
the filesystem to see if there are any that point to the busted file.
If the ftypes of all such dirents are consistent, the mode is recovered
from the ftype. If no dirents are found, the file becomes a regular
file. In all cases, ACLs are canceled and the file is made accessible
only by root.
A previous patch attempted to guess the mode by reading the beginning of
the file data. This was rejected by Christoph on the grounds that we
cannot trust user-controlled data blocks. Users do not have direct
control over the ondisk contents of directory entries, so this method
should be much safer.
If all the dirents have the same ftype, then we can translate that back
into an S_IFMT flag and fix the file. If not, reset the mode to
S_IFREG.
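A standalone sketch of the ftype-to-mode translation (the enum mirrors
the ondisk XFS_DIR3_FT_* encoding but is redefined here so the example
stands alone; treat it as an illustration, not the actual repair code):

	#include <sys/stat.h>

	enum dir_ftype {
		FT_UNKNOWN = 0, FT_REG_FILE, FT_DIR, FT_CHRDEV, FT_BLKDEV,
		FT_FIFO, FT_SOCK, FT_SYMLINK,
	};

	/* Map a dirent ftype back to an S_IFMT value; unknown -> regular file. */
	static mode_t ftype_to_mode(enum dir_ftype ftype)
	{
		switch (ftype) {
		case FT_DIR:		return S_IFDIR;
		case FT_CHRDEV:		return S_IFCHR;
		case FT_BLKDEV:		return S_IFBLK;
		case FT_FIFO:		return S_IFIFO;
		case FT_SOCK:		return S_IFSOCK;
		case FT_SYMLINK:	return S_IFLNK;
		case FT_REG_FILE:
		default:		return S_IFREG;
		}
	}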
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The inode scanner tries to reduce contention on the AGI header buffer
lock by grabbing references to consecutive allocated inodes. Batching
stops as soon as we encounter an unallocated inode. This is unfortunate
because in the worst case performance collapses to the old "one at a
time" behavior if every other inode is free.
This is correct behavior, but we could do better. Unallocated inodes by
definition have nothing to scan, which means the iscan can ignore them
as long as someone ensures that the scan data will reflect another
thread allocating the inode and adding interesting metadata to that
inode. That mechanism is, of course, the live update hooks.
Therefore, extend the batching mechanism to track unallocated inodes
adjacent to the scan cursor. The _want_live_update predicate can tell
the caller's live update hook to incorporate all live updates to what
the scanner thinks is an unallocated inode if (after dropping the AGI)
some other thread allocates one of those inodes and begins using it.
Note that we cannot just copy the ir_free bitmap into the scan cursor
because the batching stops if iget says the inode is in an intermediate
state (e.g. on the inactivation list) and cannot be igrabbed.
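A standalone sketch of the extended predicate (the structure fields
here, visited_ino, batch_startino and skipped_mask, are hypothetical
names chosen to illustrate the idea, not the actual iscan layout):

	#include <stdbool.h>
	#include <stdint.h>

	struct iscan_sketch {
		uint64_t	visited_ino;	/* everything <= this was visited */
		uint64_t	batch_startino;	/* first inode of the current batch */
		uint64_t	skipped_mask;	/* bit i set: batch_startino + i was
						 * unallocated when the batch was built */
	};

	/*
	 * Should a live update to @ino be folded into the scan data?  Yes if
	 * the scanner already visited it, or if it sat in the current batch
	 * as an unallocated inode that another thread may since have
	 * allocated and started using.
	 */
	static bool want_live_update(const struct iscan_sketch *iscan, uint64_t ino)
	{
		if (ino <= iscan->visited_ino)
			return true;
		if (ino >= iscan->batch_startino &&
		    ino < iscan->batch_startino + 64)
			return (iscan->skipped_mask >>
					(ino - iscan->batch_startino)) & 1;
		return false;
	}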
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
After observing xfs_scrub taking forever to rebuild parent pointers on a
pptrs enabled filesystem, I decided to profile what the system was
doing. It turns out that when there are a lot of threads trying to scan
the filesystem, most of our time is spent contending on AGI buffer
locks. Given that we're walking the inobt records anyway, we can often
tell ahead of time when there's a bunch of (up to 64) consecutive inodes
that we could grab all at once.
Do this to amortize the cost of taking the AGI lock across as many
inodes as we possibly can. On the author's system this seems to improve
parallel throughput from barely one and a half cores to slightly
sublinear scaling. The obvious antipattern here, of course, is a
freemask with every other bit set (e.g. all 0xA's).
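A standalone sketch of the batching decision (ir_free follows the
inobt convention of one bit per inode in a 64-inode chunk, with set
bits meaning free; the helper name is made up):

	#include <stdint.h>

	/*
	 * Count how many consecutive allocated inodes start the chunk, so they
	 * can all be grabbed under a single AGI lock cycle.  With the 0xAAAA...
	 * "every other inode free" pattern the run length is always 1, which is
	 * the antipattern mentioned above.
	 */
	static unsigned int consecutive_allocated(uint64_t ir_free)
	{
		unsigned int	ret = 0;

		while (ret < 64 && !(ir_free & ((uint64_t)1 << ret)))
			ret++;
		return ret;
	}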
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Online directory and parent repairs on parent-pointer equipped
filesystems have shown that starting a large number of parallel iscans
causes a lot of AGI buffer contention. Try to reduce this by making
iscans wrap around the end of the filesystem and by using a rotor to
stagger where each scanner begins. Surprisingly, this boosts
CPU utilization (on the author's test machines) from effectively
single-threaded to 160%. Not great, but see the next patch.
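A minimal sketch of the staggering scheme (the rotor and helper names
are illustrative; the real counter lives wherever the scan code keeps
shared state):

	#include <stdint.h>

	static unsigned int iscan_rotor;	/* the real code bumps this atomically */

	/* Spread concurrent scanners across AGs instead of piling onto AG 0. */
	static uint32_t iscan_start_agno(uint32_t agcount)
	{
		return iscan_rotor++ % agcount;
	}

	/* Each scan still visits every AG, wrapping at the end of the fs. */
	static void iscan_walk_all_ags(uint32_t agcount)
	{
		uint32_t start = iscan_start_agno(agcount);

		for (uint32_t i = 0; i < agcount; i++) {
			uint32_t agno = (start + i) % agcount;
			/* ... scan all the inodes in AG agno ... */
			(void)agno;
		}
	}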
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This patch implements a live file scanner for online fsck functions that
require the ability to walk a filesystem to gather metadata records and
stay informed about metadata changes to files that have already been
visited.
The iscan structure consists of two inode number cursors: one to track
which inode we want to visit next, and a second one to track which
inodes have already been visited. This second cursor is key to
capturing live updates to files previously scanned while the main thread
continues scanning -- any inode greater than this value hasn't been
scanned and can go on its way; any other update must be incorporated
into the collected data. It is critical for the scanning thread to
hold exclusive access to the inode until after marking the inode
visited.
This new code is a separate patch from the patchsets adding callers for
the sake of enabling the author to move patches around his tree with
ease. The intended usage model for this code is roughly:
	xchk_iscan_start(iscan, 0, 0);
	while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
		xfs_ilock(ip, ...);
		/* capture inode metadata */
		xchk_iscan_mark_visited(iscan, ip);
		xfs_iunlock(ip, ...);
		xfs_irele(ip);
	}
	xchk_iscan_stop(iscan);
	if (error)
		return error;
Hook functions for live updates can then do:
	if (xchk_iscan_want_live_update(...))
		/* update the captured inode metadata */
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>