Currently d_move(from, to) does the following:
* name/parent of from <- old name/parent of to, from hashed there
* to is unhashed
* name of to is preserved
* if from used to be detached, to gets detached
* if from used to be attached, parent of to <- old parent of from.
That's both user-visibly bogus and complicates reasoning a lot.
Much saner semantics would be
* name/parent of from <- name/parent of to, from hashed there.
* to is unhashed
* name/parent of to is unchanged.
The price, of course, is that old parent of from might lose a reference.
However,
* all potentially cross-directory callers of d_move() have both
parents pinned directly; typically, dentries themselves are grabbed
only after we have grabbed and locked both parents. IOW, the decrement
of old parent's refcount in case of d_move() won't reach zero.
* __d_move() from d_splice_alias() is done to detached alias.
No refcount decrements in that case
* __d_move() from __d_unalias() *can* get the refcount to zero.
So let's grab a reference to alias' old parent before calling __d_unalias()
and dput() it after we'd dropped rename_lock.
That does make d_splice_alias() potentially blocking. However, it has
no callers in non-sleepable contexts (and the case where we'd grown
that dget/dput pair is _very_ rare, so performance is not an issue).
Another thing that needs adjustment is unlocking in the end of __d_move();
folded it in. And cleaned the remnants of bogus ordering from the
"lock them in the beginning" counterpart - it's never been right and
now (well, for 7 years now) we have that thing always serialized on
rename_lock anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
shrink_dentry_list() holds dentry->d_lock and needs to acquire
dentry->d_inode->i_lock. This cannot be done with a spin_lock()
operation because it's the reverse of the regular lock order.
To avoid ABBA deadlocks it is done with a trylock loop.
Trylock loops are problematic in two scenarios:
1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
preemptible. As a consequence the i_lock holder can be preempted
by a higher priority task. If that task executes the trylock loop
it will do so forever and live lock.
2) In virtual machines trylock loops are problematic as well. The
VCPU on which the i_lock holder runs can be scheduled out and a
task on a different VCPU can loop for a whole time slice. In the
worst case this can lead to starvation. Commits 47be61845c
("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45
("shrink_dentry_list(): take parent's d_lock earlier") are
addressing exactly those symptoms.
Avoid the trylock loop by using dentry_kill(). When pruning ancestors,
the same code applies that is used to kill a dentry in dput(). This
also has the benefit that the locking order is now the same. First
the inode is locked, then the parent.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case when trylock in there fails, deal with it directly in
dentry_kill(). Note that in cases when we drop and retake
->d_lock, we need to recheck whether to retain the dentry.
Another thing is that dropping/retaking ->d_lock might have
ended up with negative dentry turning into positive; that,
of course, can happen only once...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case of trylock failure don't re-add to the list - drop the locks
and carefully get them in the right order. For shrink_dentry_list(),
somebody having grabbed a reference to dentry means that we can
kick it off-list, so if we find dentry being modified under us we
don't need to play silly buggers with retries anyway - off the list
it is.
The locking logics taken out into a helper of its own; lock_parent()
is no longer used for dentries that can be killed under us.
[fix from Eric Biggers folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ext4 isn't validating the sizes of xattrs where the value of the xattr
is stored in an external inode. This is problematic because
->e_value_size is a u32, but ext4_xattr_get() returns an int. A very
large size is misinterpreted as an error code, which ext4_get_acl()
translates into a bogus ERR_PTR() for which IS_ERR() returns false,
causing a crash.
Fix this by validating that all xattrs are <= INT_MAX bytes.
This issue has been assigned CVE-2018-1095.
https://bugzilla.kernel.org/show_bug.cgi?id=199185https://bugzilla.redhat.com/show_bug.cgi?id=1560793
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
This patch spits out the time taken by the various steps in the
journal recover process. Previously, the journal recovery time
didn't account for finding the journal head in the log which takes
up a significant portion of time.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Today if we run xfs_fsr and crash[1], log replay can fail because
the recovery code tries to instantiate the donor inode from
disk to replay the swapext, but it's been deleted and we get
verifier failures when we try to read the inode off disk with
i_mode == 0.
This fixes both sides: We don't log the swapext change if the
inode has been deleted, and we don't try to recover it either.
[1] or if systemd doesn't cleanly unmount root, as it is wont
to do ...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of zeroing out fallocated blocks in gfs2_iomap_alloc, zero them
out in fallocate_chunk, much higher up the call stack. This gets rid of
gfs2's abuse of the IOMAP_ZERO flag as well as the gfs2 specific zeronew
buffer flag. I can't think of a reason why zeroing out the blocks in
gfs2_iomap_alloc would have any benefits: there is no additional locking
at that level that would add protection to the newly allocated blocks.
While at it, change fallocate over from gs2_block_map to gfs2_iomap_begin.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Reserve an F2FS feature flag and inode flag for fs-verity. This is an
in-development feature that is planned be discussed at LSF/MM 2018 [1].
It will provide file-based integrity and authenticity for read-only
files. Most code will be in a filesystem-independent module, with
smaller changes needed to individual filesystems that opt-in to
supporting the feature. An early prototype supporting F2FS is available
[2]. Reserving the F2FS on-disk bits for fs-verity will prevent users
of the prototype from conflicting with other new F2FS features.
Note that we're reserving the inode flag in f2fs_inode.i_advise, which
isn't really appropriate since it's not a hint or advice. But
->i_advise is already being used to hold the 'encrypt' flag; and F2FS's
->i_flags uses the generic FS_* values, so it seems ->i_flags can't be
used for an F2FS-specific flag without additional work to remove the
assumption that ->i_flags uses the generic flags namespace.
[1] https://marc.info/?l=linux-fsdevel&m=151690752225644
[2] https://git.kernel.org/pub/scm/linux/kernel/git/mhalcrow/linux.git/log/?h=fs-verity-dev
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
And use it in a few more places rather than opencoding the values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as
in mark dirty to be written out by fdatasync as well. So dirtying
for both flags makes no sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as
in mark dirty to be written out by fdatasync as well. So dirtying
for both flags makes no sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as
in mark dirty to be written out by fdatasync as well. So dirtying
for both flags makes no sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_dentry_open is where we do the actual open of the file, so this is
where we should do our O_DIRECT sanity check to cover all potential
callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch add a segment type check in IPU, in
case of something wrong with blkadd in dnode.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since f2fs_inode_info is allocated with flag GFP_F2FS_ZERO, so we do not
need to initialize zero value for its member any more.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Nat entry set is used only in checkpoint(), and during checkpoint() we
won't flush new nat entry with unallocated address, so we don't need to
add new nat entry into nat set, then nat_entry_set::entry_cnt can
indicate actual entry count we need to flush in checkpoint().
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In rxrpc and afs, use the debug_ids that are monotonically allocated to
various objects as they're allocated rather than pointers as kernel
pointers are now hashed making them less useful. Further, the debug ids
aren't reused anywhere nearly as quickly.
In addition, allow kernel services that use rxrpc, such as afs, to take
numbers from the rxrpc counter, assign them to their own call struct and
pass them in to rxrpc for both client and service calls so that the trace
lines for each will have the same ID tag.
Signed-off-by: David Howells <dhowells@redhat.com>
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes spelling typos found in printk.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
These pernet_operations create and destroy per-net pipe
and dentry, and they seem safe to be marked as async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another cache. Also they create
and destroy directory. So, they also look safe to be async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of the generic data structures embedded in xfs_mount are
dynamically initialized immediately after mp is allocated. A few
fields are left out and initialized during the xfs_mountfs()
sequence, after mp has been attached to the superblock.
To clean this up and help prevent premature access of associated
fields, refactor xfs_mount allocation and all dependent init calls
into a new helper. This self-documents that all low level data
structures (i.e., locks, trees, etc.) should be initialized before
xfs_mount is attached to the superblock.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A lot of Kconfig symbols have architecture specific dependencies.
In those cases that depend on architectures we have already removed,
they can be omitted.
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
In the last step of scrub_handle_error_block, we try to combine good
copies on all possible mirrors, this works fine for raid1 and raid10,
but not for raid56 as it's doing parity rebuild.
If parity rebuild doesn't get back with correct data which matches its
checksum, in case of replace we'd rather write what is stored in the
source device than the data calculuated from parity.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
async_missing_raid56() is identical to async_read_rebuild().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously, btrfs_inode_by_name() returned 0 which left caller to check
objectid of location even location if the type was invalid.
Let btrfs_inode_by_name() return -EUCLEAN if a corrupted location of a
dir entry is found. Removal of label out_err also simplifies the
function.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ drop unlikely ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function btrfs_close_extra_devices() is about freeing
extra devids which once it may have belonged to this filesystem.
So rename it and add the comment. The _devid suffix is
appropriate as this function won't handle devices which are
outside of the filesytem being mounted.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This argument is always set to the root of the inode, which is also
passed. So let's get a reference inside the function and simplify
the arg list.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
According to tlv_put()'s prototype, data and attrlen needs to be
exchanged in the macro, but seems all callers are already aware of
this misorder and are therefore not affected.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that nothing uses the root arg of btrfs_log_dentry_safe it can be
safely removed. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_log_inode_parent is called from 2 places (btrfs_log_dentry_safe
and btrfs_log_new_name) both of which pass inode->root as the root
argument and the inode itself. Remove the redundant root argument and
get a reference to the root directly from the inode, also remove
redundant root != inode->root check from the same function. No
functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always sets keep_locks to 1 and saves the old value of
keep_locks which is restored at the end. So there is no way it can be
called without keep_locks being set. Remove comment imposing redundant
requirement on callers.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The xattr_handler::get prototype returns int, use it. The only ssize_t
exception is the per-inode listxattr handler.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extern for functions does not make any difference, there are only a few
so let's remove them before it's too late.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When send finishes processing an inode representing a regular file, it
always issues a truncate operation for that file, even if its size did
not change or the last write sets the file size correctly. In the most
common cases, the issued write operations set the file to correct size
(either full or incremental sends) or the file size did not change (for
incremental sends), so the only case where a truncate operation is needed
is when a file size becomes smaller in the send snapshot when compared
to the parent snapshot.
By not issuing unnecessary truncate operations we reduce the stream size
and save time in the receiver. Currently truncating a file to the same
size triggers writeback of its last page (if it's dirty) and waits for it
to complete (only if the file size is not aligned with the filesystem's
sector size). This is being fixed by another patch and is independent of
this change (that patch's title is "Btrfs: skip writeback of last page
when truncating file to same size").
The following script was used to measure time spent by a receiver without
this change applied, with this change applied, and without this change and
with the truncate fix applied (the fix to not make it start and wait for
writeback to complete).
$ cat test_send.sh
#!/bin/bash
SRC_DEV=/dev/sdc
DST_DEV=/dev/sdd
SRC_MNT=/mnt/sdc
DST_MNT=/mnt/sdd
mkfs.btrfs -f $SRC_DEV >/dev/null
mkfs.btrfs -f $DST_DEV >/dev/null
mount $SRC_DEV $SRC_MNT
mount $DST_DEV $DST_MNT
echo "Creating source filesystem"
for ((t = 0; t < 10; t++)); do
(
for ((i = 1; i <= 20000; i++)); do
xfs_io -f -c "pwrite -S 0xab 0 5000" \
$SRC_MNT/file_$i > /dev/null
done
) &
worker_pids[$t]=$!
done
wait ${worker_pids[@]}
echo "Creating and sending snapshot"
btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
/usr/bin/time -f "send took %e seconds" \
btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
/usr/bin/time -f "receive took %e seconds" \
btrfs receive -f $SRC_MNT/send_file $DST_MNT
umount $SRC_MNT
umount $DST_MNT
The results, which are averages for 5 runs for each case, were the
following:
* Without this change
average receive time was 26.49 seconds
standard deviation of 2.53 seconds
* Without this change and with the truncate fix
average receive time was 12.51 seconds
standard deviation of 0.32 seconds
* With this change and without the truncate fix
average receive time was 10.02 seconds
standard deviation of 1.11 seconds
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we truncate a file to the same size and that size is not aligned
with the sector size, we end up triggering writeback (and wait for it to
complete) of the last page. This is unncessary as we can not have delayed
allocation beyond the inode's i_size and the goal of truncating a file
to its own size is to discard prealloc extents (allocated via the
fallocate(2) system call). Besides the unnecessary IO start and wait, it
also breaks the oppurtunity for larger contiguous extents on disk, as
before the last dirty page there might be other dirty pages.
This scenario is probably not very common in general, however it is
common for btrfs receive implementations because currently the send
stream always issues a truncate operation for each processed inode as
the last operation for that inode (this truncate operation is not
always needed and the send implementation will be addressed to avoid
them).
So improve this by not starting and waiting for writeback of the inode's
last page when we are truncating to exactly the same size.
The following script was used to quickly measure the time a receive
operation takes:
$ cat test_send.sh
#!/bin/bash
SRC_DEV=/dev/sdc
DST_DEV=/dev/sdd
SRC_MNT=/mnt/sdc
DST_MNT=/mnt/sdd
mkfs.btrfs -f $SRC_DEV >/dev/null
mkfs.btrfs -f $DST_DEV >/dev/null
mount $SRC_DEV $SRC_MNT
mount $DST_DEV $DST_MNT
echo "Creating source filesystem"
for ((t = 0; t < 10; t++)); do
(
for ((i = 1; i <= 20000; i++)); do
xfs_io -f -c "pwrite -S 0xab 0 5000" \
$SRC_MNT/file_$i > /dev/null
done
) &
worker_pids[$t]=$!
done
wait ${worker_pids[@]}
echo "Creating and sending snapshot"
btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
/usr/bin/time -f "send took %e seconds" \
btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
/usr/bin/time -f "receive took %e seconds" \
btrfs receive -f $SRC_MNT/send_file $DST_MNT
umount $SRC_MNT
umount $DST_MNT
The results for 5 runs were the following:
* Without this change
average receive time was 26.49 seconds
standard deviation of 2.53 seconds
* With this change
average receive time was 12.51 seconds
standard deviation of 0.32 seconds
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doens't make sense to process prealloc extents as pages will be
filled with zero when reading prealloc extents.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have btrfs_fs_info::data_chunk_allocations and
btrfs_fs_info::metadata_ratio declared as unsigned which would be
unsinged int and kernel style prefers unsigned int over bare unsigned.
So this patch changes them to u32.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using any kind of memory barriers around atomic operations which have
a return value is redundant, since those operations themselves are
fully ordered. atomic_t.txt states:
- RMW operations that have a return value are fully ordered;
Fully ordered primitives are ordered against everything prior and
everything subsequent. Therefore a fully ordered primitive is like
having an smp_mb() before and an smp_mb() after the primitive.
Given this let's replace the extra memory barriers with comments.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the same function we just ran btrfs_alloc_device() which means the
btrfs_device::resized_list is sure to be empty and we are protected
with the btrfs_fs_info::volume_mutex.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.
Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:
- printf wrappers, error messages
- exit helpers
Signed-off-by: David Sterba <dsterba@suse.com>
Recently, the __init annotations have been added. There's unfortunatelly
only one case where we can add __exit, because most of the cleanup
helpers are also called from the __init phase.
As the __exit annotated functions get discarded completely for a
built-in code, we'd miss them from the init phase.
Signed-off-by: David Sterba <dsterba@suse.com>
We aren't verifying the parameter passed to the subvolid mount option,
so we won't report and fail the mount if a junk value is specified for
example, -o subvolid=abc.
This patch verifies the subvolid option with match_u64.
Up to now the memparse function accepts the K/M/G/ suffixes, that are
usually meant for size values and do not make sense for a subvolume it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Fstests generic/475 provides a way to fail metadata reads while
checking if checksum exists for the inode inside run_delalloc_nocow(),
and csum_exist_in_range() interprets error (-EIO) as inode having
checksum and makes its caller enter the cow path.
In case of free space inode, this ends up with a warning in
cow_file_range().
The same problem applies to btrfs_cross_ref_exist() since it may also
read metadata in between.
With this, run_delalloc_nocow() bails out when errors occur at the two
places.
cc: <stable@vger.kernel.org> v2.6.28+
Fixes: 17d217fe97 ("Btrfs: fix nodatasum handling in balancing code")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The custom crc32 init code was introduced in
14a958e678 ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.
Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:
1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.
Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.
The modinfo confirms that now all the module dependencies are there:
before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].
Refactor this function a little:
1) Rename to btrfs_bg_flags_to_raid_index()
Double underline prefix is normally for internal functions, while the
function is used by both extent-tree and volumes.
Although the name is a little longer, but it should explain its usage
quite well.
2) Move it to volumes.h and make it static inline
Just several if-else branches, really no need to define it as a normal
function.
This also makes later code re-use between kernel and btrfs-progs
easier.
3) Remove function get_block_group_index()
Really no need to do such a simple thing as an exported function.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inspecting the error message with real corruption, the "root=%llu"
always shows "1" (root tree), instead of the correct owner.
The problem is that we are getting @root from page->mapping->host, which
points the same btree inode, so we will always get the same root.
This makes the root owner output meaningless, and harder to port
tree-checker to btrfs-progs.
So get rid of the false and meaningless @root parameter and replace it
with @fs_info.
To get the owner, we can only rely on btrfs_header_owner() now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
subtle bugs around merge_extent_mapping.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_run_qgroups is doing a bit too much. Not only is it
responsible for synchronizing in-memory state of qgroups to disk but
it also contains code to trigger the initial qgroup rescan when
quota is enabled initially. This condition is detected by checking that
BTRFS_FS_QUOTA_ENABLED is not set and BTRFS_FS_QUOTA_ENABLING is set.
Nothing really requires from the code to be structured (and scattered)
the way it is so let's streamline things. First move the quota rescan
code into btrfs_quota_enable, where its invocation is closer to the
use. This also makes the FS_QUOTA_ENABLING flag redundant so let's
remove it as well.
This has been tested with a full xfstest run with qgroups enabled on
the scratch device of every xfstest and no regressions were observed.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
load_free_space_tree calls either function load_free_space_bitmaps or
load_free_space_extents. And either of those two will lead to call
btrfs_next_item. So in function load_free_space_tree, use READA_FORWARD
to read forward ahead.
This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
populate_free_space_tree calls function btrfs_search_slot_for_read with
parameter int find_higher = 1, it means that, if no exact match is
found, then use the next higher item. So in function
populate_free_space_tree, use READA_FORWARD to read forward ahead.
This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
delayed_iput_count wa supposed to be used to implement, well, delayed
iput. The idea is that we keep accumulating the number of iputs we do
until eventually the inode is deleted. Turns out we never really
switched the delayed_iput_count from 0 to 1, hence all conditional
code relying on the value of that member being different than 0 was
never executed. This, as it turns out, didn't cause any problem due
to the simple fact that the generic inode's i_count member was always
used to count the number of iputs. So let's just remove the unused
member and all unused code. This patch essentially provides no
functional changes. While at it, also add proper documentation for
btrfs_add_delayed_iput
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
The behavior of btrfs_delalloc_reserve_metadata depends on whether
the inode we are allocating for is the freespace inode or not. As it
stands if we are the free node we set 'flush' and 'delalloc_lock'
variable to certain values. Subsequently we check the values of those
vars and act accordingly. Instead, simplify things by having 1 if
which checks whether we are the freespace inode or not and do any
specific operation in either branches of that if. This makes the code
a bit easier to understand, as an added bonus it also shrinks the
compiled size:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-17 (-17)
Function old new delta
btrfs_delalloc_reserve_metadata 1876 1859 -17
Total: Before=85966, After=85949, chg -0.02%
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add opened device to the tail of dev_alloc_list instead of head, so that
it maintains the same order as dev_list.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By maintaining the device list sorted lets us reproduce the problems
related to missing chunk in the degraded mode much more consistent. So
fix this by sorting the devices by devid within the kernel. So that we
know which device is assigned to the struct fs_info::latest_bdev when
all the devices are having and same SB generation.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's not necessary to hold ->orphan_lock when checking inode's runtime
flags.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of manually fiddling with the state of the task
(RUNNING->INTERRUPTIBLE->RUNNING) again just use schedule_timeout_interruptible
which adjusts the task state as needed. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even though btrfs_start_dirty_block_groups is fairly in the beginning of
btrfs_commit_transaction outside of the critical section defined by the
transaction states it can only be run by a single comitter. In other
words it defines its own critical section thanks to the
BTRFS_TRANS_DIRTY_BG run flag and ro_block_group_mutex. However, its
error handling is outside of this critical section which is a bit
counter-intuitive. So move the error handling righ after the function is
executed and let the sole runner of dirty block groups handle the return
value. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bio_add_page() can fail for logical reasons as from the bio_add_page()
comments:
/*
* This will only fail if either bio->bi_vcnt == bio->bi_max_vecs or
* it's a cloned bio.
*/
Here we have just allocated the bio, so both of those failures can't
occur. So drop the check. We can also drop the error stats for write
error.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enospc_debug makes extent allocator print more debug messages,
however for chunk allocation, there is no debug message for enospc_debug
at all.
This patch will add message for the following parts of chunk allocator:
1) No rw device at all
Quite rare, but at least output one message for this case.
2) Not enough space for some device
This debug message is quite handy for unbalanced disks with stripe
based profiles (RAID0/10/5/6).
3) Not enough free devices
This debug message should tell us if current chunk allocator is
working correctly under minimal device requirements.
Although in most cases, we will hit other ENOSPC before we even hit a
chunk allocator ENOSPC, but such debug info won't help.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use ASSERT to report logical error in cow_file_range(), also move it a
bit closer to when the num_bytes is derived.
The extent start could be (u64)-1 in some cases, the assert should catch
that we do not accidentally pass it to cow_file_range.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch deletes local variable disk_num_bytes as its value
is same as num_bytes in the function cow_file_range().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit [1] removed the need to use btrfs_async_submit_limit(), so
delete it.
[1]
commit 736cd52e0c
Btrfs: remove nr_async_submits and async_submit_draining
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by btrfs at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when enospc_debug mount option is turned on we do not print
any debug info in case metadata reservation failures happen. Fix this
by adding the necessary hook in reserve_metadata_bytes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reason why io_bgs can be modified without holding any lock is
non-obvious. Document it and reference that documentation from the
respective call sites.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
list_first_entry is essentially a wrapper over cotnainer_of. The latter
can never return null even if it's working on inconsistent list since it
will either crash or return some offset in the wrong struct.
Additionally, for the dirty_bgs list the iteration is done under
dirty_bgs_lock which ensures consistency of the list.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For debugging or administration purposes, we would want to know if and
when the user cancels the replace, to complement the existing messages
when dev-replace starts or finishes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, fold fix for RCU warning from Nikolay ]
Signed-off-by: David Sterba <dsterba@suse.com>
The replace target device can be missing when mounted with -o degraded,
but we wont allocate a missing btrfs_device to it. So check the device
before accessing.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
Call Trace:
btrfs_dev_replace_cancel+0x15f/0x180 [btrfs]
btrfs_ioctl+0x2216/0x2590 [btrfs]
do_vfs_ioctl+0x625/0x650
SyS_ioctl+0x4e/0x80
do_syscall_64+0x5d/0x160
entry_SYSCALL64_slow_path+0x25/0x25
This patch has been moved in front of patch "btrfs: log, when replace,
is canceled by the user" that could reproduce the crash if the system
reboots inside btrfs_dev_replace_start before the
btrfs_dev_replace_finishing call.
$ mkfs /dev/sda
$ mount /dev/sda mnt
$ btrfs replace start /dev/sda /dev/sdb
<insert reboot>
$ mount po degraded /dev/sdb mnt
<crash>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ added reproducer description from mail ]
Signed-off-by: David Sterba <dsterba@suse.com>
The options alloc_start and subvolrootid are deprecated, comment them in
the tokens list. And leave them as it is. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the commit mount option is unsigned so manage it as %u for token
verifications, instead of %d.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As check_int_print_mask mount option is unsigned so manage it as %u for
token verifications, instead of %d.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As metadata_ratio mount option is unsinged so manage it as %u for token
verifications, instead of %d.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mount option thread_pool is always unsigned. Manage it that way all
around.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_buffer_uptodate() is a trivial wrapper around test_bit() and
nothing else. So make it static and inline, save on code space and call
indirection.
Before:
text data bss dec hex filename
1131257 82898 18992 1233147 12d0fb fs/btrfs/btrfs.ko
After:
text data bss dec hex filename
1131090 82898 18992 1232980 12d054 fs/btrfs/btrfs.ko
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass btrfs_trans_handle which contains a reference to the
fs_info so use that. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the btrfs_transaction which references fs_info so no
need to pass the later as an argument. Also use the opportunity to
shorten transaction->trans. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the trans handle which has a reference to fs_info to
create_pending_snapshot so we can refer to it directly. Doing this
obviates the need to pass the fs_info to create_pending_snapshots as
well. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have the fs_info from the passed transaction so use it
directly. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only thing the passed root is used for is:
1. get a reference to the fs_info and to
2. call trace_btrfs_transaction_commit.
We can achieve 1) by simply referring to the fs_info from passed trans
object. As far as 2) is concerned cleanup_transaction is called from
only one place and the 'root' argument passed is the one from the trans
handle. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass a transaction handle which refrences the fs_info so
we can grab it from there. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the transaction handle which has a reference to the
fs_info. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the transaction which has a reference to the fs_info,
so use that. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the transaction handle, which contains a refrence to
the fs_info so grab it from there. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction so no point in passing
it as a function argument. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaciton so no point in
passing it as function argument. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All current callers of this function just get a reference to the
trans->fs_info member and pass it as the second argument. Collapse this
into the function itself. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_write_and_wait_transaction is essentially a wrapper of
btrfs_write_and_wait_marked_extents with the addition of calling
clear_btree_io_tree. Having the code split doesn't really bring any
benefit. Open code the later into the former and add proper
documentation header.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function is only ever used in __btrfs_end_transaction and
btrfs_commit_transaction so there is no need to export it via header.
Let's move it closer to where it's used, make it static and remove it
from the header. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_init_dev_replace_tgtdev_for_resume() initializes replace
target device in a few simple steps, so do it at the parent function.
Moreover, there isn't any other caller so just open code it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current u64 return from btrfs_dev_replace_cancel() was probably done
to match the btrfs_ioctl_dev_replace_args::result. However as our
actual return value fits in int, and it further gets typecast to u64,
so just return int.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove __ which is for the special functions.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_dev_replace_cancel() calls __btrfs_dev_replace_cancel() for the
actual cancel so just open code it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the function uses a hardcoded value for the checksum size of
a sector. This is fine, given that we currently support only a single
algorithm, whose checksum is 4 bytes == sizeof(u32). Despite not
having other algorithms, btrfs' design supports using a different
algorithm whith different space requirements. To future-proof the code
query the size of the currently used algorithm from the in-memory copy
of the super block. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a missing void parameter to function btrfs_test_extent_map, fixes
sparse warning:
warning: non-ANSI function declaration of function 'btrfs_test_extent_map'
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The check for a non-zero ret is redundant as the goto will jump to
the very next statement anyway. Remove this extraneous code.
Detected by CoverityScan, CID#1463784 ("Identical code for different
branches")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 0e8c36a9fd ("Btrfs: fix lots of orphan inodes when the space
is not enough") changed the way transaction reservation is made in
btrfs_evict_node and as a result this function became unused. This has
been the status quo for 5 years in which time no one noticed, so I'd
say it's safe to assume it's unlikely it will ever be used again.
Historical note: there were more attempts to remove the function, the
reasoning was missing and only based on some static analysis tool
reports. Other reason for rejection was that there seemed to be
connection to BTRFS_RESERVE_FLUSH_LIMIT and that would need to be
removeed to. This was not correct so removing the function is all we can
do.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ add the note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Presently, failing a primary super block write but succeeding in at
least one super block write in general will appear to users as if
nothing important went wrong. However, upon unmounting and re-mounting,
the file system will be in a rolled back state. This was discovered
with a BCC program that uses bpf_override_return() to fail super block
writes.
This patch outputs an error clarifying that the primary super block
write has failed, so users can expect potentially erroneous behaviour.
It also forces wait_dev_supers() to return an error to its caller if
the primary super block write fails.
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This prints out eb->bflags since it contains some useful information,
e.g. whether eb is dirty.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some functions definitions have either the initial open brace and/or
the closing brace outside of column 1.
Move those braces to column 1.
This allows various function analyzers like gnu complexity to work
properly for these modified functions.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
"mark_unwritten" in comment and "unwritten" in the function arguments
is mismatched.
Signed-off-by: zhenwei.pi <zhenwei.pi@youruncloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Code cleanup. Instead of writing an internal static function, use the
available generic_writepages().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We can only get into the branch if CRCs are enabled, so there's no
need to check inside the branch for CRCs being enabled....
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We recently came across a V4 filesystem causing memory corruption
due to a newly allocated inode being setup twice and being added to
the superblock inode list twice. From code inspection, the only way
this could happen is if a newly allocated inode was not marked as
free on disk (i.e. di_mode wasn't zero).
Running the metadump on an upstream debug kernel fails during inode
allocation like so:
XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inod=
e.c, line: 838
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:114!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/0=
1/2014
RIP: 0010:assfail+0x28/0x30
RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
FS: 00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000=
000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
Call Trace:
xfs_ialloc+0x383/0x570
xfs_dir_ialloc+0x6a/0x2a0
xfs_create+0x412/0x670
xfs_generic_create+0x1f7/0x2c0
? capable_wrt_inode_uidgid+0x3f/0x50
vfs_mkdir+0xfb/0x1b0
SyS_mkdir+0xcf/0xf0
do_syscall_64+0x73/0x1a0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
Extracting the inode number we crashed on from an event trace and
looking at it with xfs_db:
xfs_db> inode 184452204
xfs_db> p
core.magic = 0x494e
core.mode = 0100644
core.version = 2
core.format = 2 (extents)
core.nlinkv2 = 1
core.onlink = 0
.....
Confirms that it is not a free inode on disk. xfs_repair
also trips over this inode:
.....
zero length extent (off = 0, fsbno = 0) in ino 184452204
correcting nextents for inode 184452204
bad attribute fork in inode 184452204, would clear attr fork
bad nblocks 1 for inode 184452204, would reset to 0
bad anextents 1 for inode 184452204, would reset to 0
imap claims in-use inode 184452204 is free, would correct imap
would have cleared inode 184452204
.....
disconnected inode 184452204, would move to lost+found
And so we have a situation where the directory structure and the
inobt thinks the inode is free, but the inode on disk thinks it is
still in use. Where this corruption came from is not possible to
diagnose, but we can detect it and prevent the kernel from oopsing
on lookup. The reproducer now results in:
$ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
mkdir: cannot create directory =E2=80=98/mnt/scratch/00=E2=80=99: File ex=
ists
mkdir: cannot create directory =E2=80=98/mnt/scratch/01=E2=80=99: File ex=
ists
mkdir: cannot create directory =E2=80=98/mnt/scratch/03=E2=80=99: Structu=
re needs cleaning
mkdir: cannot create directory =E2=80=98/mnt/scratch/04=E2=80=99: Input/o=
utput error
mkdir: cannot create directory =E2=80=98/mnt/scratch/05=E2=80=99: Input/o=
utput error
....
And this corruption shutdown:
[ 54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not=
marked free on disk
[ 54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 =
of file fs/xfs/xfs_trans.c. Caller xfs_create+0x425/0x670
[ 54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #=
443
[ 54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO=
S 1.10.2-1 04/01/2014
[ 54.852859] Call Trace:
[ 54.853531] dump_stack+0x85/0xc5
[ 54.854385] xfs_trans_cancel+0x197/0x1c0
[ 54.855421] xfs_create+0x425/0x670
[ 54.856314] xfs_generic_create+0x1f7/0x2c0
[ 54.857390] ? capable_wrt_inode_uidgid+0x3f/0x50
[ 54.858586] vfs_mkdir+0xfb/0x1b0
[ 54.859458] SyS_mkdir+0xcf/0xf0
[ 54.860254] do_syscall_64+0x73/0x1a0
[ 54.861193] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 54.862492] RIP: 0033:0x7fb73bddf547
[ 54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000=
000000000053
[ 54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73=
bddf547
[ 54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffda=
a55449a
[ 54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a=
8670dd0
[ 54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 000000000=
00001ff
[ 54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffda=
a553500
[ 54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1=
024 of file fs/xfs/xfs_trans.c. Return address = ffffffff814cd050
[ 54.882790] XFS (loop0): Corruption of in-memory data detected. Shutt=
ing down filesystem
[ 54.884597] XFS (loop0): Please umount the filesystem and rectify the =
problem(s)
Note that this crash is only possible on v4 filesystemsi or v5
filesystems mounted with the ikeep mount option. For all other V5
filesystems, this problem cannot occur because we don't read inodes
we are allocating from disk - we simply overwrite them with the new
inode information.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xfs_scrub_iallocbt_xref_rmap_inodes we're checking inodes against
rmap records, so we should use xfs_scrub_btree_xref_set_corrupt if we
encounter discrepancies here so that we know that it's a cross
referencing error, not necessarily a corruption in the inobt itself.
The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.
It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt". The same reasoning applies to "xfs: record inode
buf errors as a xref error in inobt scrubber".
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
If a directory's parent inode pointer doesn't point to an inode, the
directory should be flagged as corrupt. Enable IGET_UNTRUSTED here so
that _iget will return -EINVAL if the inobt does not confirm that the
inode is present and allocated and we can flag the directory corruption.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we're verifying inode buffers, sanity-check the unlinked pointer.
We don't want to run the risk of trying to purge something that's
obviously broken.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Extent size hint validation is used by scrub to decide if there's an
error, and it will be used by repair to decide to remove the hint.
Since these use the same validation functions, move them to libxfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
During the inode btree scrubs we try to confirm the freemask bits
against the inode records. If the inode buffer read fails, this is a
cross-referencing error, not a corruption of the inode btree itself.
Use the xref_process_error call here. Found via core.version middlebit
fuzz in xfs/415.
The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.
It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt".
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Now that we no longer do raw inode buffer scrubbing, the bp parameter is
no longer used anywhere we're dealing with an inode, so remove it and
all the useless NULL parameters that go with it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The inode scrubber tries to _iget the inode prior to running checks.
If that _iget call fails with corruption errors that's an automatic
fail, regardless of whether it was the inode buffer read verifier,
the ifork verifier, or the ifork formatter that errored out.
Therefore, get rid of the raw mode scrub code because it's not needed.
Found by trying to fix some test failures in xfs/379 and xfs/415.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we're scanning an extent mapping inode fork, ensure that every rmap
record for this ifork has a corresponding bmbt record too. This
(mostly) provides the ability to cross-reference rmap records with bmap
data. The rmap scrubber cannot do the xref on its own because that
requires taking an ilock with the agf lock held, which violates our
locking order rules (inode, then agf).
Note that we only do this for forks that are in btree format due to the
increased complexity; or forks that should have data but suspiciously
have zero extents because the inode could have just had its iforks
zapped by the inode repair code and now we need to reclaim the old
extents.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When the inode buffer verifier encounters an error, it's much more
helpful to print a buffer from the offending inode instead of just the
start of the inode chunk buffer.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Refactor some of the inode verifier failure logging call sites to use
the new xfs_inode_verifier_error method which dumps the offending buffer
as well as the code location of the failed check. This trims the
output, makes it clearer to the admin that repair must be run, and gives
the developers more details to work from.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Refactor the bmap validator into a more complete helper that looks for
extents that run off the end of the device, overflow into the next AG,
or have invalid flag states.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In xfs_dir2_data_use_free, we examine on-disk metadata and ASSERT if
it doesn't make sense. Since a carefully crafted fuzzed image can cause
the kernel to crash after blowing a bunch of assertions, let's move
those checks into a validator function and rig everything up to return
EFSCORRUPTED to userspace. Found by lastbit fuzzing ltail.bestcount via
xfs/391.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.
This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.
Instead, avoid unnecessarily complex detection logic and and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.
Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.
This allows kernels that include the packing fix commit 96f859d52b
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead split out a __xfs_log_fore_lsn helper that gets called again
with the already_slept flag set to true in case we had to sleep.
This prepares for aio_fsync support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the the smallest possible loop as preable to find the correct iclog
buffer, and then use gotos for unwinding to straighten the code.
Also fix the top of function comment while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When punching a hole or truncating an inode down to a given size, also
check if the truncate point / start of the hole is within the range we
have metadata for. Otherwise, we can end up freeing blocks that
shouldn't be freed, corrupting the inode, or crashing the machine when
trying to punch a hole into the void.
When growing an inode via truncate, we set the new size but we don't
allocate additional levels of indirect blocks and grow the inode height.
When shrinking that inode again, the new size may still point beyond the
end of the inode's metadata.
Fixes xfstest generic/476.
Debugged-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Fun set of conflict resolutions here...
For the mac80211 stuff, these were fortunately just parallel
adds. Trivially resolved.
In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.
In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.
The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.
The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:
====================
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Files on FUSE can change at any point in time without IMA being able
to detect it. The file data read for the file signature verification
could be totally different from what is subsequently read, making the
signature verification useless.
FUSE can be mounted by unprivileged users either today with fusermount
installed with setuid, or soon with the upcoming patches to allow FUSE
mounts in a non-init user namespace.
This patch sets the SB_I_IMA_UNVERIFIABLE_SIGNATURE flag and when
appropriate sets the SB_I_UNTRUSTED_MOUNTER flag.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Cc: Alban Crequy <alban@kinvolk.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Merge misc fixes from Andrew Morton:
"13 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, thp: do not cause memcg oom for thp
mm/vmscan: wake up flushers for legacy cgroups too
Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
mm/thp: do not wait for lock_page() in deferred_split_scan()
mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
x86/mm: implement free pmd/pte page interfaces
mm/vmalloc: add interfaces to free unmapped page table
h8300: remove extraneous __BIG_ENDIAN definition
hugetlbfs: check for pgoff value overflow
lockdep: fix fs_reclaim warning
MAINTAINERS: update Mark Fasheh's e-mail
mm/mempolicy.c: avoid use uninitialized preferred_node
A vma with vm_pgoff large enough to overflow a loff_t type when
converted to a byte offset can be passed via the remap_file_pages system
call. The hugetlbfs mmap routine uses the byte offset to calculate
reservations and file size.
A sequence such as:
mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);
will result in the following when task exits/file closed,
kernel BUG at mm/hugetlb.c:749!
Call Trace:
hugetlbfs_evict_inode+0x2f/0x40
evict+0xcb/0x190
__dentry_kill+0xcb/0x150
__fput+0x164/0x1e0
task_work_run+0x84/0xa0
exit_to_usermode_loop+0x7d/0x80
do_syscall_64+0x18b/0x190
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The overflowed pgoff value causes hugetlbfs to try to set up a mapping
with a negative range (end < start) that leaves invalid state which
causes the BUG.
The previous overflow fix to this code was incomplete and did not take
the remap_file_pages system call into account.
[mike.kravetz@oracle.com: v3]
Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
[akpm@linux-foundation.org: include mmdebug.h]
[akpm@linux-foundation.org: fix -ve left shift count on sh]
Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
Fixes: 045c7a3f53 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Nic Losby <blurbdust@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlqvCPYeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGOaAH/171cgZGFEXSONxK
3O1AAv61wN5K/ISMt6mnelWR6fZg195FarOx0Rnq7Ot8OWuVa8CGcyT4vX4Z7nb9
SVMQKNMPCVQE4WCDOv6S0njChmRC0BxBoVJtTN9fhywdYgX1KcaTS/drMRHACF5n
rB9eouMQScfMzKGAW08gp5NvEGJ6W1SLX7La3/u0751dYisdJSP7+vFZNxUrGXEA
yIPOQjFu0Tfo8GXz/BwC678RZVzVLN0sE6+/vM7zNnoDlsRVkdDIVMo3UiVqm/NK
B37/TlZz8CYoapoKnRRB5giXnSPDSXtsikbGy3mcy0u5imGe+ZgdjrdYSaLk31cR
NVZY08k=
=pu3X
-----END PGP SIGNATURE-----
Merge tag 'v4.16-rc6' into next-general
Merge to Linux 4.16-rc6 at the request of Jarkko, for his TPM updates.
Pull networking fixes from David Miller:
1) Always validate XFRM esn replay attribute, from Florian Westphal.
2) Fix RCU read lock imbalance in xfrm_get_tos(), from Xin Long.
3) Don't try to get firmware dump if not loaded in iwlwifi, from Shaul
Triebitz.
4) Fix BPF helpers to deal with SCTP GSO SKBs properly, from Daniel
Axtens.
5) Fix some interrupt handling issues in e1000e driver, from Benjamin
Poitier.
6) Use strlcpy() in several ethtool get_strings methods, from Florian
Fainelli.
7) Fix rhlist dup insertion, from Paul Blakey.
8) Fix SKB leak in netem packet scheduler, from Alexey Kodanev.
9) Fix driver unload crash when link is up in smsc911x, from Jeremy
Linton.
10) Purge out invalid socket types in l2tp_tunnel_create(), from Eric
Dumazet.
11) Need to purge the write queue when TCP connections are aborted,
otherwise userspace using MSG_ZEROCOPY can't close the fd. From
Soheil Hassas Yeganeh.
12) Fix double free in error path of team driver, from Arkadi
Sharshevsky.
13) Filter fixes for hv_netvsc driver, from Stephen Hemminger.
14) Fix non-linear packet access in ipv6 ndisc code, from Lorenzo
Bianconi.
15) Properly filter out unsupported feature flags in macvlan driver,
from Shannon Nelson.
16) Don't request loading the diag module for a protocol if the protocol
itself is not even registered. From Xin Long.
17) If datagram connect fails in ipv6, make sure the socket state is
consistent afterwards. From Paolo Abeni.
18) Use after free in qed driver, from Dan Carpenter.
19) If received ipv4 PMTU is less than the min pmtu, lock the mtu in the
entry. From Sabrina Dubroca.
20) Fix sleep in atomic in tg3 driver, from Jonathan Toppins.
21) Fix vlan in vlan untagging in some situations, from Toshiaki Makita.
22) Fix double SKB free in genlmsg_mcast(). From Nicolas Dichtel.
23) Fix NULL derefs in error paths of tcf_*_init(), from Davide Caratti.
24) Unbalanced PM runtime calls in FEC driver, from Florian Fainelli.
25) Memory leak in gemini driver, from Igor Pylypiv.
26) IDR leaks in error paths of tcf_*_init() functions, from Davide
Caratti.
27) Need to use GFP_ATOMIC in seg6_build_state(), from David Lebrun.
28) Missing dev_put() in error path of macsec_newlink(), from Dan
Carpenter.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (201 commits)
macsec: missing dev_put() on error in macsec_newlink()
net: dsa: Fix functional dsa-loop dependency on FIXED_PHY
hv_netvsc: common detach logic
hv_netvsc: change GPAD teardown order on older versions
hv_netvsc: use RCU to fix concurrent rx and queue changes
hv_netvsc: disable NAPI before channel close
net/ipv6: Handle onlink flag with multipath routes
ppp: avoid loop in xmit recursion detection code
ipv6: sr: fix NULL pointer dereference when setting encap source address
ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
net: aquantia: driver version bump
net: aquantia: Implement pci shutdown callback
net: aquantia: Allow live mac address changes
net: aquantia: Add tx clean budget and valid budget handling logic
net: aquantia: Change inefficient wait loop on fw data reads
net: aquantia: Fix a regression with reset on old firmware
net: aquantia: Fix hardware reset when SPI may rarely hangup
s390/qeth: on channel error, reject further cmd requests
s390/qeth: lock read device while queueing next buffer
s390/qeth: when thread completes, wake up all waiters
...
If mount is auto-probing for filesystem type, it will try various
filesystems in order, with the MS_SILENT flag set. We get
that flag as the silent arg to ext4_fill_super.
If we're probing (silent==1) then don't complain about feature
incompatibilities that are found if it looks like it's actually
a different valid extN type - failed probes should be silent
in this case.
If the on-disk features are unknown even to ext4, then complain.
Reported-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
Tested-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Commit 16c5468859 ("ext4: Allow parallel DIO reads") reworked the way
locking happens around parallel dio reads. This resulted in obviating
the need for EXT4_STATE_DIOREAD_LOCK flag and accompanying logic.
Currently this amounts to dead code so let's remove it. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
ext4_iomap_begin() has a bug where offset returned in the iomap
structure will be truncated to unsigned long size. On 64-bit
architectures this is fine but on 32-bit architectures obviously not.
Not many places actually use the offset stored in the iomap structure
but one of visible failures is in SEEK_HOLE / SEEK_DATA implementation.
If we create a file like:
dd if=/dev/urandom of=file bs=1k seek=8m count=1
then
lseek64("file", 0x100000000ULL, SEEK_DATA)
wrongly returns 0x100000000 on unfixed kernel while it should return
0x200000000. Avoid the overflow by proper type cast.
Fixes: 545052e9e3 ("ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org # v4.15
Currently in ext4 direct write path, we update i_disksize only when
new eof is greater than i_size, and don't update it even when new
eof is greater than i_disksize but less than i_size. This doesn't
work well with delalloc buffer write, which updates i_size and
i_disksize only when delalloc blocks are resolved (at writeback
time), the i_disksize from direct write can be lost if a previous
buffer write succeeded at write time but failed at writeback time,
then results in corrupted ondisk inode size.
Consider this case, first buffer write 4k data to a new file at
offset 16k with delayed allocation, then direct write 4k data to the
same file at offset 4k before delalloc blocks are resolved, which
doesn't update i_disksize because it writes within i_size(20k), but
the extent tree metadata has been committed in journal. Then
writeback of the delalloc blocks fails (due to device error etc.),
and i_size/i_disksize from buffer write can't be written to disk
(still zero). A subsequent umount/mount cycle recovers journal and
writes extent tree metadata from direct write to disk, but with
i_disksize being zero.
Fix it by updating i_disksize too in direct write path when new eof
is greater than i_disksize but less than i_size, so i_disksize is
always consistent with direct write.
This fixes occasional i_size corruption in fstests generic/475.
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
i_disksize update should be protected by i_data_sem, by either taking
the lock explicitly or by using ext4_update_i_disksize() helper. But the
i_disksize updates in ext4_direct_IO_write() are not protected at all,
which may be racing with i_disksize updates in writeback path in
delalloc buffer write path.
This is found by code inspection, and I didn't hit any i_disksize
corruption due to this bug. Thanks to Jan Kara for catching this bug and
suggesting the fix!
Reported-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Audit link denied events for symlinks had duplicate PATH records rather
than just updating the existing PATH record. Update the symlink's PATH
record with the current dentry and inode information.
See: https://github.com/linux-audit/audit-kernel/issues/21
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
In commit 45b578fe4c
("audit: link denied should not directly generate PATH record")
the need for the struct path *link parameter was removed.
Remove the now useless struct path argument.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJasXu4AAoJECebzXlCjuG+fNcP/3QYZwLdIYe0pZSIn/9V6sW1
PwmxhTkBOEkeP4RTArI/4spZk3mrTqv7SC+ACYKbkD910uDC8cY3ElcwKxi82NL7
9M4gNGnscKJtUye2LhoamZZLOLDIzr302pEv0IIh3v11sAu0rK1K0bqHSfI1ga/q
r3S1sRtad+UktVH+jJ8MNE4NBfXC/xTqPz6Z+wd30CPmJKaA7Oqle7Wd3479zj0c
A7l+EtleFo/bctHwY7yp+lVs1Yqr3ShS+qu+YAknBTnGco9JJFJsZM6x33DFMZC4
EvFPO0gh4PuN/SkgT4zBvBlSXQjuqUWTU98ohqP7ugTtd0dXEIpR4DLLwWScA2KV
z+YIJIKUh8Ryh8KI+J50Ed9WseVlmACHOBTJjLW7KnSYtey3+mgh5kp6Vcd9IxHc
Gju7YgDK5FOV2JykaHQ/qCaSyL6ao1firapSrtZK4N27LpJ05FreK6qnLJtHmiwU
pqVHyfrH4kfho1T0Nr/bw4bnugBKZVXBbBQLKp2twgJhYkUkP+vxV8QMX3h531hB
jsQV9OIVPkdTjr5L0BmQMJ4GugM1Zr4mIunY6Mw0z5ax1JeBUelWeQccqL1YCSGL
TZr+0N9yOuzQHX+e+mZ8vFYqfUTRF/Xfkn1+ZV3ZQU/yY8oBkDbQddaLoo1hjLUx
2dpR3xIhzyDkP59jrK4M
=oGAf
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fix from Bruce Fields:
"Just one fix for an occasional panic from Jeff Layton"
* tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux:
nfsd: remove blocked locks on client teardown
Currently we only take one vfs-level delegation (lease) for each file,
no matter how many clients hold delegations on that file.
Let's instead keep a one-to-one mapping between NFSv4 delegations and
VFS delegations. This turns out to be simpler.
There is still a many-to-one mapping of NFS opens to NFS files, and the
delegations on one file are all associated with one struct file. The
VFS can still distinguish between these delegations since we're setting
fl_owner to the struct nfs4_delegation now, not to the shared file.
I'm replacing at least one complicated function wholesale, which I don't
like to do, but I haven't figured out how to do this more incrementally.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Pull some duplicated code into a common helper.
This changes the order in destroy_delegation a little, but it looks to
me like that shouldn't matter.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
For now this makes no difference, as for files having delegations,
there's a one-to-one relationship between an nfs4_file and its
nfs4_delegation.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Every single caller gets the file out of the delegation, so let's do
that once in nfs4_put_deleg_lease.
Plus we'll need it there for other reasons.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
fi_delegees is basically just a reference count on users of
fi_deleg_file, which is cleared when fi_delegees goes to zero. The
fi_deleg_file check here is redundant. Also add an assertion to make
sure we don't have unbalanced puts.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The sysfs_create_link_nowarn() is going to be used in phylib framework in
subsequent patch which can be built as module. Hence, export
sysfs_create_link_nowarn() to avoid build errors.
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Fixes: a399546049 ("net: phy: Relax error checking on sysfs_create_link()")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On x86_64, it's 1152 bytes, so we can avoid wasting 896 bytes each.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In a traditional NFS deployment using auth_unix, the clients are trusted
to correctly report the credentials of their logged-in users. The
server assumes that only root on client machines is allowed to send
requests from low-numbered ports, so it can use the originating port
number to distinguish "real" NFS clients from NFS clients run by
ordinary users, to prevent ordinary users from spoofing credentials.
The originating port number on a gss-authenticated request is less
important. The authentication ties the request to a user, and we take
it as proof that that user authorized the request. The low port number
check no longer adds much.
So, don't enforce low port numbers in the auth_gss case.
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The variables nlm_ntf_refcnt and nlm_ntf_wq are local to the source and
do not need to be in global scope, so make them static.
Cleans up sparse warnings:
fs/lockd/svc.c:60:10: warning: symbol 'nlm_ntf_refcnt' was not declared.
Should it be static?
fs/lockd/svc.c:61:1: warning: symbol 'nlm_ntf_wq' was not declared.
Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We already send it for v4.1, but RFC7530 also notes that the stateid in
the close reply is bogus.
Always send the special close stateid, even in v4.0 responses. No client
should put any meaning on it whatsoever. For now, we continue to
increment the stateid value, though that might not be necessary either.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We had some reports of panics in nfsd4_lm_notify, and that showed a
nfs4_lockowner that had outlived its so_client.
Ensure that we walk any leftover lockowners after tearing down all of
the stateids, and remove any blocked locks that they hold.
With this change, we also don't need to walk the nbl_lru on nfsd_net
shutdown, as that will happen naturally when we tear down the clients.
Fixes: 76d348fadf (nfsd: have nfsd4_lock use blocking locks for v4.1+ locks)
Reported-by: Frank Sorenson <fsorenso@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org # 4.9
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This patch check blkaddr more accuratly before issue a
write or read bio.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are 2 distinct freezing mechanisms - one operates on block
devices and another one directly on super blocks. Both end up with the
same result, but thaw of only one of these does not thaw the other.
In particular fsfreeze --freeze uses the ioctl variant going to the
super block. Since prior to this patch emergency thaw was not doing
a relevant thaw, filesystems frozen with this method remained
unaffected.
The patch is a hack which adds blind unfreezing.
In order to keep the super block write-locked the whole time the code
is shuffled around and the newly introduced __iterate_supers is
employed.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I noticed that offsetof(struct filename, iname) is actually 28 on 64
bit platforms, so we always pass an unaligned pointer to
strncpy_from_user. This is mostly a problem for those 64 bit platforms
without HAVE_EFFICIENT_UNALIGNED_ACCESS, but even on x86_64, unaligned
accesses carry a penalty.
A user-space microbenchmark doing nothing but strncpy_from_user from the
same (aligned) source string runs about 5% faster when the destination
is aligned. That number increases to 20% when the string is long
enough (~32 bytes) that we cross a cache line boundary - that's for
example the case for about half the files a "git status" in a kernel
tree ends up stat'ing.
This won't make any real-life workloads 5%, or even 1%, faster, but path
lookup is common enough that cutting even a few cycles should be
worthwhile. So ensure we always pass an aligned destination pointer to
strncpy_from_user. Instead of explicit padding, simply swap the refcnt
and aname members, as suggested by Al Viro.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The LSM check should happen after the file has been confirmed to be
unchanging. Without this, we could have a race between the Time of Check
(the call to security_kernel_read_file() which could read the file and
make access policy decisions) and the Time of Use (starting with
kernel_read_file()'s reading of the file contents). In theory, file
contents could change between the two.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
This patch introduces a new mount option `test_dummy_encryption'
to allow fscrypt to create a fake fscrypt context. This is used
by xfstests.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a new feature, F2FS_FEATURE_LOST_FOUND, which
is set by mkfs. mkfs creates a directory named lost+found, which saves
unreachable files. If fsck finds a file which has no parent, or its
parent is removed by fsck, the file will be placed under lost+found
directory by fsck.
lost+found directory could not be encrypted. As a result, the root
directory cannot be encrypted too. So if LOST_FOUND feature is enabled,
let's avoid to encrypt root directory.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, we will leave the kernel with locks still held when the gc_range
is invalid. This patch fixes the bug.
Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For example, in arm64, free_nid_bitmap should be aligned to word size in order
to use bit operations.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When setting COLD_BIT_SHIFT flag in node block, we only need to call
set_cold_node() in new_node_page() and recover_inode_page() during
node page initialization. So remove unneeded set_cold_node() in other
places.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds nowait aio support[1].
Return EAGAIN if any of the following checks fail for direct I/O:
- i_rwsem is not lockable
- Blocks are not allocated at the write location
And xfstests generic/471 is passed.
[1]: 6be96d "Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT"
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch merges miscellaneous mount options into struct f2fs_mount_info,
After this patch, once we add new mount option, we don't need to worry
about recovery of it in remount_fs(), since we will recover the
f2fs_sb_info.mount_opt including all options.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, we enable node SSR by default, and mixed
different types of node segment to do SSR more intensively.
Although reuse warm node is not allowed, warm node chain
will be destroyed by errors introduced by other types
node chain. So we'd better forbid reusing all types
of node to keep warm node chain.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit "0a007b97aad6"(f2fs: recover directory operations by fsync)
fixed xfstest generic/342 case, but it also increased the written
data and caused the performance degradation. In most cases, there's
no need to do so heavy fsync actually.
So we introduce new mount option "fsync_mode={posix,strict}" to
control the policy of fsync. "fsync_mode=posix" is set by default,
and means that f2fs uses a light fsync, which follows POSIX semantics.
And "fsync_mode=strict" means that it's a heavy fsync, which behaves
in line with xfs, ext4 and btrfs, where generic/342 will pass, but
the performance will regress.
Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to restore old mount option once we encounter failure
in ->remount_fs.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_readonly to wrap sb_rdonly for cleanup, and spread it in
all places.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This fixes CAP_SYS_RESOURCE denial of selinux when using resgid, since it
seems selinux reports it at the first place, but mostly we don't need to
check this condition first.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqrzY4ACgkQxWXV+ddt
WDtV4BAAiiv5XECNB3lIvcoFjst8E7xIJNeKKzFZh8HNm+L94zP00uqrpHkwQSUv
tk9TGSYgbJJZMhw6I1rwYSKsoTkvP6NVMQZjpkJktWto9NXObKPvJzjKMdB+GYQR
TWGLBaHsMhruMTOewhdVsxA5od5+LravygHaLmTDQmho8jZSSEe0qqpqlakRI9sQ
+VZv1PDEI68IDMKMjOT7de7Ssq9c99BpGNXxvi5fy5kijyfgrs/BUfHBNV5r66MS
y0Yw1Ab+nlLC7cE5Ql24/snDojcLoSl7f5eYt7ib+DAmsJucYQzTfUetFcrYf8IC
EAmFs5Br6pbopi5M1IOxGZABNbr0qMS+zLvoF52cyka32nrmK/IMWl1s8bW0xfva
gxkjouzJDOTMq28asSdyu73i9yFyLB+tEHpojECljo+K5ASzvDad34xqVRlxY6vN
UJm4d76OqTfAyTRo7sQRWfLz/Cb1W+/v0mPotPWHWVyF0bUNsBgX5YKJhma3pU/z
rHwhI/fqg0NTXBcNUMCosEwS2u5vnjIfiOHkDTl1Dv747S/jM0AaNEvpjDo1cOuk
lsl4xKoDkQ6RpBCeFxRCW0wk1Nl/Su6RQ217xTtND+aIa4c6G7A9X8QAjj2XNAuK
CktJJD7wHYSBTu1sQ+u2R/c2QEgkAxABr4NpmWTPCBNlIKEjd4A=
=bXFe
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"There's an important revert in this pull request that needs to go to
stable as it causes a corruption on big endian machines.
The other fix is for FIEMAP incorrectly reporting shared extents
before a sync and one fix for a crash in raid56.
So far we got only one report about the BE corruption, the stable
kernels were out for like a week, so hopefully the scope of the damage
is low"
* tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: use proper endianness accessors for super_copy"
btrfs: add missing initialization in btrfs_check_shared
btrfs: Fix NULL pointer exception in find_bio_stripe
This reverts commit 3c181c12c4.
The offending patch was merged in 4.16-rc4 and was promptly applied to
stable kernels 4.14.25 and 4.15.8.
The patch causes a corruption in several superblock items on big-endian
machines because of messed up endianity conversions. The damage is
manually repairable. A filesystem cannot be mounted again after it has
been unmounted once.
We do a full revert and not a fixup so stable can pick that patch ASAP.
Fixes: 3c181c12c4 ("btrfs: use proper endianness accessors for super_copy")
Link: https://lkml.kernel.org/r/1521139304@msgid.manchmal.in-ulm.de
CC: stable@vger.kernel.org # 4.14+
Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Hardwall is a tile specific feature, and with the removal of the
tile architecture, this has become dead code, so let's remove it.
Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Pull vfs fixes from Al Viro:
- backport-friendly part of lock_parent() race fix
- a fix for an assumption in the heurisic used by path_connected() that
is not true on NFS
- livelock fixes for d_alloc_parallel()
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Teach path_connected to handle nfs filesystems with multiple roots.
fs: dcache: Use READ_ONCE when accessing i_dir_seq
fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
On nfsv2 and nfsv3 the nfs server can export subsets of the same
filesystem and report the same filesystem identifier, so that the nfs
client can know they are the same filesystem. The subsets can be from
disjoint directory trees. The nfsv2 and nfsv3 filesystems provides no
way to find the common root of all directory trees exported form the
server with the same filesystem identifier.
The practical result is that in struct super s_root for nfs s_root is
not necessarily the root of the filesystem. The nfs mount code sets
s_root to the root of the first subset of the nfs filesystem that the
kernel mounts.
This effects the dcache invalidation code in generic_shutdown_super
currently called shrunk_dcache_for_umount and that code for years
has gone through an additional list of dentries that might be dentry
trees that need to be freed to accomodate nfs.
When I wrote path_connected I did not realize nfs was so special, and
it's hueristic for avoiding calling is_subdir can fail.
The practical case where this fails is when there is a move of a
directory from the subtree exposed by one nfs mount to the subtree
exposed by another nfs mount. This move can happen either locally or
remotely. With the remote case requiring that the move directory be cached
before the move and that after the move someone walks the path
to where the move directory now exists and in so doing causes the
already cached directory to be moved in the dcache through the magic
of d_splice_alias.
If someone whose working directory is in the move directory or a
subdirectory and now starts calling .. from the initial mount of nfs
(where s_root == mnt_root), then path_connected as a heuristic will
not bother with the is_subdir check. As s_root really is not the root
of the nfs filesystem this heuristic is wrong, and the path may
actually not be connected and path_connected can fail.
The is_subdir function might be cheap enough that we can call it
unconditionally. Verifying that will take some benchmarking and
the result may not be the same on all kernels this fix needs
to be backported to. So I am avoiding that for now.
Filesystems with snapshots such as nilfs and btrfs do something
similar. But as the directory tree of the snapshots are disjoint
from one another and from the main directory tree rename won't move
things between them and this problem will not occur.
Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 397d425dc2 ("vfs: Test for and handle paths that are unreachable from their mnt_root")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use xfs_iext_prev_extent to skip to the previous extent instead of
opencoding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Simplify the control flow a bit in preparation for O_ATOMIC-related
changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This helper doesn't add any real value over just calling iomap_zero_range
directly, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we convert COW preallocations from unwritten to real on every
call this function needs to be called with the ilock held exclusively.
Fortunately we already do that, but update the assert to match.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no reason to get a mapping bigger than what we were asked for.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
i_cnextents does not include delayed allocated extents, so switch
to the inode fork size check that we already use in other places
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
->fail_addr and ->addr can be updated no matter the result of
parent->_erase(), we just need to remove the code doing the same thing
in mtd_erase_callback() to avoid adjusting those fields twice.
Note that this can be done because all MTD users have been converted to
not pass an erase_info->callback() and are thus only taking the
->addr_fail and ->addr fields into account after part_erase() has
returned.
While we're at it, get rid of the erase_info->mtd field which was only
needed to let mtd_erase_callback() get the partition device back.
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
None of the mtd->_erase() implementations work in an asynchronous manner,
so let's simplify MTD users that call mtd_erase(). All they need to do
is check the value returned by mtd_erase() and assume that != 0 means
failure.
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
The new crypto API use causes some problems with Kconfig dependencies,
including this link error:
fs/pstore/platform.o: In function `pstore_register':
platform.c:(.text+0x248): undefined reference to `crypto_has_alg'
platform.c:(.text+0x2a0): undefined reference to `crypto_alloc_base'
fs/pstore/platform.o: In function `pstore_unregister':
platform.c:(.text+0x498): undefined reference to `crypto_destroy_tfm'
crypto/lz4hc.o: In function `lz4hc_sdecompress':
lz4hc.c:(.text+0x1a): undefined reference to `LZ4_decompress_safe'
crypto/lz4hc.o: In function `lz4hc_decompress_crypto':
lz4hc.c:(.text+0x5a): undefined reference to `LZ4_decompress_safe'
crypto/lz4hc.o: In function `lz4hc_scompress':
lz4hc.c:(.text+0xaa): undefined reference to `LZ4_compress_HC'
crypto/lz4hc.o: In function `lz4hc_mod_init':
lz4hc.c:(.init.text+0xf): undefined reference to `crypto_register_alg'
lz4hc.c:(.init.text+0x1f): undefined reference to `crypto_register_scomp'
lz4hc.c:(.init.text+0x2f): undefined reference to `crypto_unregister_alg'
The problem is that with CONFIG_CRYPTO=m, we must not 'select CRYPTO_LZ4'
from a bool symbol, or call crypto API functions from a built-in
module.
This turns the sub-options into 'tristate' ones so the dependencies
are honored, and makes the pstore itself select the crypto core
if necessary.
Fixes: cb3bee0369 ("pstore: Use crypto compress API")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
register_blkdev() and __register_chrdev_region() treat the major
number as an unsigned int. So print it the same way to avoid
absurd error statements such as:
"... major requested (-1) is greater than the maximum (511) ..."
(and also fix off-by-one bugs in the error prints).
While at it, also update the comment describing register_blkdev().
Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CHRDEV_MAJOR_DYN_END and CHRDEV_MAJOR_DYN_EXT_END are valid major
numbers. So fix the loop iteration to include them in the search for
free major numbers.
While at it, also remove a redundant if condition ("cd->major != i"),
as it will never be true.
Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the gfs2_iomap_end tracepoint, log the physical block address, just
as in the gfs2_bmap tracepoint.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch addresses an issue that causes fiemap to falsely
report a shared extent. The test case is as follows:
xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
sync
xfs_io -c "fiemap -v" /media/scratch/file5
which gives the resulting output:
wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec)
/media/scratch/file5:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 24576..24703 128 0x2001
/media/scratch/file5:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 24576..24703 128 0x1
This is because btrfs_check_shared calls find_parent_nodes
repeatedly in a loop, passing a share_check struct to report
the count of shared extent. But btrfs_check_shared does not
re-initialize the count value to zero for subsequent calls
from the loop, resulting in a false share count value. This
is a regressive behavior from 4.13.
With proper re-initialization the test result is as follows:
wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec)
/media/scratch/file5:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 24576..24703 128 0x1
/media/scratch/file5:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 24576..24703 128 0x1
which corrects the regression.
Fixes: 3ec4d3238a ("btrfs: allow backref search checks for shared extents")
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
[ add text from cover letter to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
While converting ioctx index from a list to a table, db446a08c2
("aio: convert the ioctx list to table lookup v3") missed tagging
kioctx_table->table[] as an array of RCU pointers and using the
appropriate RCU accessors. This introduces a small window in the
lookup path where init and access may race.
Mark kioctx_table->table[] with __rcu and use the approriate RCU
accessors when using the field.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: db446a08c2 ("aio: convert the ioctx list to table lookup v3")
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # v3.12+
While fixing refcounting, e34ecee2ae ("aio: Fix a trinity splat")
incorrectly removed explicit RCU grace period before freeing kioctx.
The intention seems to be depending on the internal RCU grace periods
of percpu_ref; however, percpu_ref uses a different flavor of RCU,
sched-RCU. This can lead to kioctx being freed while RCU read
protected dereferences are still in progress.
Fix it by updating free_ioctx() to go through call_rcu() explicitly.
v2: Comment added to explain double bouncing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: e34ecee2ae ("aio: Fix a trinity splat")
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # v3.13+
Streamline the conditionals so that it is more obvious which specific case
form the top of the function comments is being handled. Use gotos only
for early returns.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Switch to a single interface for flushing the log to a specific LSN, which
gives consistent trace point coverage and a less confusing interface.
The was only a single user of the previous xfs_log_force_lsn function,
which now also passes a NULL log_flushed argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Switch to a single interface for flushing the whole log, which gives
consistent trace point coverage, and removes the unused log_flushed
argument for the previous _xfs_log_force callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The function now does something, and that something is central to our
inode logging scheme.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Most libcs will still look at /dev/ptmx when opening the master fd of a pty
device. When /dev/ptmx is a bind-mount of /dev/pts/ptmx and the TIOCGPTPEER
ioctl() is used to safely retrieve a file descriptor for the slave side of
the pty based on the master fd, the /proc/self/fd/{0,1,2} symlinks will
point to /. A very simply reproducer for this issue presupposing a libc
that uses TIOCGPTPEER in its openpty() implementation is:
unshare --mount
mount --bind /dev/pts/ptmx /dev/ptmx
chmod 666 /dev/ptmx
script
ls -al /proc/self/fd/0
Having bind-mounts of /dev/pts/ptmx to /dev/ptmx not working correctly is a
regression. In addition, it is also a fairly common scenario in containers
employing user namespaces.
The reason for the current failure is that the kernel tries to verify the
useability of the devpts filesystem without resolving the /dev/ptmx
bind-mount first. This will lead it to detect that the dentry is escaping
its bind-mount. The reason is that while the devpts filesystem mounted at
/dev/pts has the devtmpfs mounted at /dev as its parent mount:
21 -- -- / /dev
-- 21 -- / /dev/pts
devtmpfs and devpts are on different devices
-- -- 0:6 / /dev
-- -- 0:20 / /dev/pts
This has the consequence that the pathname of the parent directory of the
devpts filesystem mount at /dev/pts is /. So if /dev/ptmx is a bind-mount
of /dev/pts/ptmx then the /dev/ptmx bind-mount and the devpts mount at
/dev/pts will end up being located on the same device which is recorded in
the superblock of their vfsmount. This means the parent directory of the
/dev/ptmx bind-mount will be /ptmx:
-- -- ---- /ptmx /dev/ptmx
Without the bind-mount resolution patch the kernel will now perform the
bind-mount escape check directly on /dev/ptmx. The function responsible for
this is devpts_ptmx_path() which calls pts_path() which in turn calls
path_parent_directory(). Based on the above explanation,
path_parent_directory() will yield / as the parent directory for the
/dev/ptmx bind-mount and not the expected /dev. Thus, the kernel detects
that /dev/ptmx is escaping its bind-mount and will set /proc/<pid>/fd/<nr>
to /.
This patch changes the logic to first resolve any bind-mounts. After the
bind-mounts have been resolved (i.e. we have traced it back to the
associated devpts mount) devpts_ptmx_path() can be called. In order to
guarantee correct path generation for the slave file descriptor the kernel
now requires that a pts directory is found in the parent directory of the
ptmx bind-mount. This implies that when doing bind-mounts the ptmx
bind-mount and the devpts mount should have a common parent directory. A
valid example is:
mount -t devpts devpts /dev/pts
mount --bind /dev/pts/ptmx /dev/ptmx
an invalid example is:
mount -t devpts devpts /dev/pts
mount --bind /dev/pts/ptmx /ptmx
This allows us to support:
- calling open on ptmx devices located inside non-standard devpts mounts:
mount -t devpts devpts /mnt
master = open("/mnt/ptmx", ...);
slave = ioctl(master, TIOCGPTPEER, ...);
- calling open on ptmx devices located outside the devpts mount with a
common ancestor directory:
mount -t devpts devpts /dev/pts
mount --bind /dev/pts/ptmx /dev/ptmx
master = open("/dev/ptmx", ...);
slave = ioctl(master, TIOCGPTPEER, ...);
while failing on ptmx devices located outside the devpts mount without a
common ancestor directory:
mount -t devpts devpts /dev/pts
mount --bind /dev/pts/ptmx /ptmx
master = open("/ptmx", ...);
slave = ioctl(master, TIOCGPTPEER, ...);
in which case save path generation cannot be guaranteed.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Eric Biederman <ebiederm@xmission.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hoist the check whether we have already found a suitable devpts filesystem
out of devpts_ptmx_path() in preparation for the devpts bind-mount
resolution patch. This is a non-functional change.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch supports to recognize hot file extension in f2fs, so that we
can allocate proper hot segment location for its data, which can lead to
better hot/cold seperation in filesystem.
In addition, we changes a bit on query/add/del operation method for
extension_list sysfs entry as below:
- Query: cat /sys/fs/f2fs/<disk>/extension_list
- Add: echo 'extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '!extension' > /sys/fs/f2fs/<disk>/extension_list
- Add: echo '[h/c]extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '[h/c]!extension' > /sys/fs/f2fs/<disk>/extension_list
- [h] means add/del hot file extension
- [c] means add/del cold file extension
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sqlite user Background GC
- move_data_block
: move page #1
- f2fs_is_atomic_file
- f2fs_ioc_start_atomic_write
- f2fs_ioc_commit_atomic_write
- commit_inmem_pages
: commit page #1 & set node #2 dirty
- f2fs_submit_page_write
- f2fs_update_data_blkaddr
- set_page_dirty
: set node #2 dirty
- f2fs_do_sync_file
- fsync_node_pages
: commit node #1 & node #2, then sudden power-cut
In a race case, we may check FI_ATOMIC_FILE flag before starting atomic
write flow, then we will commit meta data before data with reversed
order, after a sudden pow-cut, database transaction will be inconsistent.
So we'd better to exclude gc/atomic_write to each other by using lock
instead of flag checking.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, f2fs conducts GC on 8GB range only based on slow cost-benefit.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch avoids to skip discard commands when user sets gc_urgent mode.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs is running on top of very small devices, it's worth to avoid abusing
free LBAs. In order to achieve that, this patch introduces some parameter
tuning.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds an mount option, "alloc_mode=%s" having two options, "default"
and "reuse".
In "alloc_mode=reuse" case, f2fs starts to allocate segments from 0'th segment
all the time to reassign segments. It'd be useful for small-sized eMMC parts.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Jayashree Mohan reported:
A simple workload to reproduce this would be :
1. create foo
2. Write (8K - 16K) // foo size = 16K now
3. fsync()
4. falloc zero_range , keep_size (4202496 - 4210688) // foo size must be 16K
5. fdatasync()
Crash now
On recovery, we see that the file size is 4210688 and not 16K, which
violates the semantics of keep_size flag. We have a test case to
reproduce this using CrashMonkey on 4.15 kernel. Try this out by
simply running :
./c_harness -f /dev/sda -d /dev/cow_ram0 -t f2fs -e 102400 -P -v
tests/generic_468_zero.so
The root cause is that we miss to set KEEP_SIZE bit correctly in zero_range
when zeroing block cross EOF with FALLOC_FL_KEEP_SIZE, let's fix this
missing case.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_super_block.encrypt_pw_salt can be udpated and persisted
concurrently, result in getting different pwsalt in separated
threads, so let's introduce sb_lock to exclude concurrent
accessers.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pointer p is initialized with a value that is never read and is later
re-assigned a new value, hence the initialization is redundant and can
be removed.
Cleans up clang warning:
fs/f2fs/extent_cache.c:463:19: warning: Value stored to 'p' during
its initialization is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we attempt to flush the whole cp pack in a single bio,
however, when suddenly powering off at this time, we could get into
an extreme scenario that cp pack 1 page and cp pack 2 page are updated
and latest, but payload or current summaries are still partially
outdated. (see reliable write in the UFS specification)
This patch submits the whole cp pack except cp pack 2 page at first,
and then writes the cp pack 2 page with an extra independent
bio with pre-io barrier.
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces F2FS_FEATURE_FUNCS to clean up the definitions of
different f2fs_sb_has_xxx functions.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes redundant check of page type when submit bio to
make the logic more clear.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is no checksum in node block now, so bit-transition from hardware
can make node_footer.next_blkaddr being corrupted w/o any detection,
result in node chain becoming looped one.
For this condition, during recovery, in order to avoid running into dead
loop, let's detect it and just skip out.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is to detect dquot_initialize errors early from evict_inode
for orphan inodes.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add the 'whint_mode' mount option that controls which write
hints are passed down to block layer. There are "off" and
"user-based" mode. The default mode is "off".
1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.
2) whint_mode=user-based. F2FS tries to pass down hints given
by users.
User F2FS Block
---- ---- -----
META WRITE_LIFE_NOT_SET
HOT_NODE "
WARM_NODE "
COLD_NODE "
ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
extension list " "
-- buffered io
WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE " "
WRITE_LIFE_MEDIUM " "
WRITE_LIFE_LONG " "
-- direct io
WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE " WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG " WRITE_LIFE_LONG
Many thanks to Chao Yu and Jaegeuk Kim for comments to
implement this patch.
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid build warning]
[Chao Yu: fix to restore whint_mode in ->remount_fs]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Once CP_TRIMMED_FLAG is set, after a reboot, we will never issue discard
before LBA becomes invalid again, fix it by clearing the flag in
checkpoint without CP_TRIMMED reason.
Fixes: 1f43e2ad7b ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we will store all nat version bitmap in checkpoint pack block,
so our total node entry number has a limitation which caused total node
number can not exceed (3900 * 8) block * 455 node/block = 14196000. So
that once user wants to create more nodes in large size image, it becomes
a bottleneck, that's unreasonable.
This patch detects the new layout of nat/sit version bitmap in image in
order to enable supporting large nat bitmap.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If noextent_cache mount option is on, we will never initialize extent tree
in inode, but still we're going to access it in f2fs_drop_extent_tree,
result in kernel panic as below:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: _raw_write_lock+0xc/0x30
Call Trace:
? f2fs_drop_extent_tree+0x41/0x70 [f2fs]
f2fs_fallocate+0x5a0/0xdd0 [f2fs]
? common_file_perm+0x47/0xc0
? apparmor_file_permission+0x1a/0x20
vfs_fallocate+0x15b/0x290
SyS_fallocate+0x44/0x70
do_syscall_64+0x6e/0x160
entry_SYSCALL64_slow_path+0x25/0x25
This patch fixes to check extent cache status before using in
f2fs_drop_extent_tree.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch limits to enable inline_xattr_size mount option only if
both extra_attr and flexible_inline_xattr feature is on in current
image.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 7a20b8a61e ("f2fs: allocate node
and hot data in the beginning of partition") introduces another mount
option, heap, to reset it back. But it does not do anything for heap
mode, so fix it.
Cc: stable@vger.kernel.org
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
sb_getblk does not guarantee the buffer head is uptodate. If bh is not
uptodate, the data (may be used as boot code) in area before
F2FS_SUPER_OFFSET may get corrupted when super block is committed.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previous dentry page uses highmem, which will cause panic in platforms
using highmem (such as arm), since the address space of dentry pages
from highmem directly goes into the decryption path via the function
fscrypt_fname_disk_to_usr. But sg_init_one assumes the address is not
from highmem, and then cause panic since it doesn't call kmap_high but
kunmap_high is triggered at the end. To fix this problem in a simple
way, this patch avoids to put dentry page in pagecache into highmem.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hightlights include the following stable fixes:
- NFS: Fix an incorrect type in struct nfs_direct_req
- pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
- NFS: Fix unstable write completion
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaprb9AAoJEGcL54qWCgDysK0P/Rwyc7fhcs4CvxETQT0LhbbB
P6XDAb5pBeU+ca8ZWzqlpe1uKc0z5ykuTLuxLt2QvSb9+6h+8T1gxOa5SibehnwY
6UizLDgHMuY2V4wqm4XLHiZX7J0Evdo4rgNCp7qq2/TinBPYKQsqIz34+s9BZDti
6LG6WiVhOtkhs6s/sRbAn+FfMbSiQn54iQIgFPeO+3zEzaaLzUZqlVHO5yjxVnxh
6oYGkEPVzP09URUO/HJ1VTUZJnso1/axEcH9qhuJzZ+pANCpuBjSMoFzzE6oI2mD
dfD8UJZjwqLb7AsFhgSQUrDXzaI0Xg7ccrImtCp4QLlTjja3TlOYMYbK9kDRHC1/
kT93cSFSIeSGpqLVSdNyiz7pCh2Cps2U0JTbhwE0R1NEskWQPzSwY73jEwiA/bex
JOZBxtjVZt8TgOKHxPO65rvQwaq3T31KrQSDLqv3JReI0epJeqng0h+pwiZluNAs
TFjE1aP0l567A3qQnPrHrGZ4IIs8XYobZ+MBxf9x5PaRWnGoPUkhHXMP2j8JCnwa
ofPBR4Mx5WhnsB7Z5vsYetHtmD99QbJ0+WH9ObilUFt6gUgXoXCrvexIUq+LD3DP
HIIqfbSorAIlQxtCwEz7m85B7VEKByaCWj8OiFt5VVAreEhiazmsncsyh8R5fa88
wJDxRe3QmNu7BEJ9NJKi
=IQGX
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.16-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Hightlights include the following stable fixes:
- NFS: Fix an incorrect type in struct nfs_direct_req
- pNFS: Prevent the layout header refcount going to zero in
pnfs_roc()
- NFS: Fix unstable write completion"
* tag 'nfs-for-4.16-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix unstable write completion
pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
NFS: Fix an incorrect type in struct nfs_direct_req
This flag was added by fe0f07d08e ("direct-io: only inc/deci
inode->i_dio_count for file systems") as means to optimise the atomic
modificaiton of the variable for blockdevices. However with the advent
of 542ff7bf18 ("block: new direct I/O implementation") it became
unused. So let's remove it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This flag was added by 6039257378 ("direct-io: add flag to allow aio
writes beyond i_size") to support XFS. However, with the rework of
XFS' DIO's path to use iomap in acdda3aae1 ("xfs: use iomap_dio_rw")
it became redundant. So let's remove it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A subsequent patch will modify dentry_kill() to call lock_parent().
Move the dentry_kill() implementation "as is" below lock_parent()
first. This will help simplify the review of the subsequent patch
with dentry_kill() changes.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 0d98439ea3 ("vfs: use lockred "dead" flag to mark unrecoverably
dead dentries") removed the `ref' parameter in dentry_kill() but its
documentation remained. Remove it.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The rmapbt perag metadata reservation reserves blocks for the
reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
the agfl and perag accounting is updated as blocks are allocated
from the allocation btrees, the reservation actually accounts blocks
as they are allocated to (or freed from) the agfl rather than the
rmapbt itself.
While this works for blocks that are eventually used for the rmapbt,
not all agfl blocks are destined for the rmapbt. Blocks that are
allocated to the agfl (and thus "reserved" for the rmapbt) but then
used by another structure leads to a growing inconsistency over time
between the runtime tracking of rmapbt usage vs. actual rmapbt
usage. Since the runtime tracking thinks all agfl blocks are rmapbt
blocks, it essentially believes that less future reservation is
required to satisfy the rmapbt than what is actually necessary.
The inconsistency is rectified across mount cycles because the perag
reservation is initialized based on the actual rmapbt usage at mount
time. The problem, however, is that the excessive drain of the
reservation at runtime opens a window to allocate blocks for other
purposes that might be required for the rmapbt on a subsequent
mount. This problem can be demonstrated by a simple test that runs
an allocation workload to consume agfl blocks over time and then
observe the difference in the agfl reservation requirement across an
unmount/mount cycle:
mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
...
... : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
As the above tracepoints show, the reservation requirement reduces
from 3194 blocks to 2956 blocks as the workload runs. Without any
other changes in the filesystem, the same reservation requirement
jumps from 2956 to 3052 blocks over a umount/mount cycle.
To address this divergence, update the RMAPBT reservation to account
blocks used for the rmapbt only rather than all blocks filled into
the agfl. This patch makes several high-level changes toward that
end:
1.) Reintroduce an AGFL reservation type to serve as an accounting
no-op for blocks allocated to (or freed from) the AGFL.
2.) Invoke RMAPBT usage accounting from the actual rmapbt block
allocation path rather than the AGFL allocation path.
The first change is required because agfl blocks are considered free
blocks throughout their lifetime. The perag reservation subsystem is
invoked unconditionally by the allocation subsystem, so we need a
way to tell the perag subsystem (via the allocation subsystem) to
not make any accounting changes for blocks filled into the AGFL.
The second change causes the in-core RMAPBT reservation usage
accounting to remain consistent with the on-disk state at all times
and eliminates the risk of leaving the rmapbt reservation
underfilled.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The AGFL perag reservation type accounts all allocations that feed
into (or are released from) the allocation group free list (agfl).
The purpose of the reservation is to support worst case conditions
for the reverse mapping btree (rmapbt). As such, the agfl
reservation usage accounting only considers rmapbt usage when the
in-core counters are initialized at mount time.
This implementation inconsistency leads to divergence of the in-core
and on-disk usage accounting over time. In preparation to resolve
this inconsistency and adjust the AGFL reservation into an rmapbt
specific reservation, rename the AGFL reservation type and
associated accounting fields to something more rmapbt-specific. Also
fix up a couple tracepoints that incorrectly use the AGFL
reservation type to pass the agfl state of the associated extent
where the raw reservation type is expected.
Note that this patch does not change perag reservation behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The extent swap mechanism requires a unique implementation for
rmapbt enabled filesystems. Because the rmapbt tracks extent owner
information, extent swap must individually unmap and remap each
extent between the two inodes.
The rmapbt extent swap transaction block reservation currently
accounts for the worst case bmapbt block and rmapbt block
consumption based on the extent count of each inode. There is a
corner case that exists due to the extent swap implementation that
is not covered by this reservation, however.
If one of the associated inodes is just over the max extent count
used for extent format inodes (i.e., the inode is in btree format by
a single extent), the unmap/remap cycle of the extent swap can
bounce the inode between extent and btree format multiple times,
almost as many times as there are extents in the inode (if the
opposing inode happens to have one less, for example). Each back and
forth cycle involves a block free and allocation, which isn't a
problem except for that the initial transaction reservation must
account for the total number of block allocations performed by the
chain of deferred operations. If not, a block reservation overrun
occurs and the filesystem shuts down.
Update the rmapbt extent swap block reservation to check for this
situation and add some block reservation slop to ensure the entire
operation succeeds. We'd never likely require reservation for both
inodes as fsr wouldn't defrag the file in that case, but the
additional reservation is constrained by the data fork size so be
cautious and check for both.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The ->t_blk_res_used field tracks how many blocks have been used in
the current transaction. This should never exceed the block
reservation (->t_blk_res) for a particular transaction. We currently
assert this condition in the transaction block accounting code, but
otherwise take no additional action should this situation occur.
The overrun generally has no effect if space ends up being available
and the associated transaction commits. If the transaction is
duplicated, however, the current block usage is used to determine
the remaining block reservation to be transferred to the new
transaction. If usage exceeds reservation, this calculation
underflows and creates a transaction with an invalid and excessive
reservation. When the second transaction commits, the release of
unused blocks corrupts the in-core free space counters. With lazy
superblock accounting enabled, this inconsistency eventually
trickles to the on-disk superblock and corrupts the filesystem.
Replace the transaction block usage accounting assert with an
explicit overrun check. If the transaction overruns the reservation,
shutdown the filesystem immediately to prevent corruption. Add a new
assert to xfs_trans_dup() to catch any callers that might induce
this invalid state in the future.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This is a simple rename, except that xa_ail becomes ail_head.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Noticed when looking at why cycling 600k inodes/s through the inode
cache was taking a total of 8% cpu in memset() during inode
initialisation. There is no need to zero the inode.i_data structure
twice.
This increases single threaded bulkstat throughput from ~200,000
inodes/s to ~220,000 inodes/s, so we save a substantial amount of
CPU time per inode init by doing this.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The AGFL size calculation is about to get more complex, so lets turn
the macro into a function first and remove the macro.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: forward port to newer kernel, simplify the helper]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There's no point in allocating a transaction and locking the inode in
preparation to clear cow blocks if there actually are any cow fork
extents. Therefore, move the xfs_reflink_cancel_cow_range hunk to
xfs_inactive and check the cow ifp first. This makes inode reclamation
run faster.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Yet another round of playing whack-a-mole with directory code that
asserts on corrupt on-disk metadata when it really should be returning
-EFSCORRUPTED instead of ASSERTing. Found by a xfs/391 crash while
lastbit fuzzing of ltail.bestcount.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In xfs_qm_dqalloc, we join the locked quota inode to the transaction we
use to allocate blocks. If the allocation or mapping fails, we're not
allowed to unlock the inode because the transaction code is in charge of
unlocking it for us. Therefore, remove the iunlock call to avoid
blowing asserts about unbalanced locking + mount hang.
Found by corrupting the AGF and allocating space in the filesystem
(quotacheck) immediately after mount. The upcoming agfl wrapping fixup
test will trigger this scenario.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Due to an inverted logic mistake in xfs_buftarg_isolate()
the xfs_buffers with zero b_lru_ref will take another trip
around LRU, while isolating buffers with non-zero b_lru_ref.
Additionally those isolated buffers end up right back on the LRU
once they are released, because b_lru_ref remains elevated.
Fix that circuitous route by leaving them on the LRU
as originally intended.
Signed-off-by: Vratislav Bendel <vbendel@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
while holding pages locked for writeback in the ->writepages path.
The memory allocation is allowed to wait on pages under writeback,
and so can wait on pages that are tagged as writeback by the
caller.
This affects both pre-IO submission and post-IO submission paths.
Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
xfs_iomap_write_unwritten() already does the right thing, but the
others don't. Fix them.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Fixes: 281627df3e ("xfs: log file size updates at I/O completion time")
Fixes: 43caeb187d ("xfs: move mappings from cow fork to data fork after copy-write)"
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the VFS dirty inode tracking for lazytime inodes only, and just
log them in ->dirty_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
__mark_inode_dirty already takes care of that, and for the XFS lazytime
implementation we need to know that ->dirty_inode was called because
I_DIRTY_TIME was set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The memcpy is guarded by a check which is performed a right before we
call xfs_log_dinode_to_disk. At this point we are sure this check will
always be false otherwise we would have errored out. So let's remove
this dead weight.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove unused legacy btree traces from IRIX era.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dmevmask structure member is a dmapi leftover; it's
set here and there but never actually used. Remove it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When using large directory blocks, we regularly see memory
allocations of >64k being made for the shadow log vector buffer.
When we are under memory pressure, kmalloc() may not be able to find
contiguous memory chunks large enough to satisfy these allocations
easily, and if memory is fragmented we can potentially stall here.
TO avoid this problem, switch the log vector buffer allocation to
use kmem_alloc_large(). This will allow failed allocations to fall
back to vmalloc and so remove the dependency on large contiguous
regions of memory being available. This should prevent slowdowns
and potential stalls when memory is low and/or fragmented.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In the pstore compression part, we use zlib/lzo/lz4/lz4hc/842
compression algorithm API to implement pstore compression backends. But
there are many repeat codes in these implementations. This patch uses
crypto compress API to simplify these codes.
1) rewrite allocate_buf_for_compression, free_buf_for_compression,
pstore_compress, pstore_decompress functions using crypto compress API.
2) drop compress, decompress, allocate, free functions in pstore_zbackend,
and add zbufsize function to get each different compress buffer size.
3) use late_initcall to call ramoops_init later, to make sure the crypto
subsystem has already initialized.
4) use 'unsigned int' type instead of 'size_t' in pstore_compress,
pstore_decompress functions' length arguments.
5) rename 'zlib' to 'deflate' to follow the crypto API's name convention.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
[kees: tweaked error messages on allocation failures and Kconfig help]
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull overlayfs fixes from Miklos Szeredi:
"This fixes a corner case for NFS exporting (introduced in this cycle)
as well as fixing miscellaneous bugs"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: update Kconfig texts
ovl: redirect_dir=nofollow should not follow redirect for opaque lower
ovl: fix ptr_ret.cocci warnings
ovl: check ERR_PTR() return value from ovl_lookup_real()
ovl: check lower ancestry on encode of lower dir file handle
ovl: hash non-dir by lower inode for fsnotify
- Fix some iomap locking problems
- Don't allocate cow blocks when we're zeroing file data
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJamePeAAoJEPh/dxk0SrTryPgP/0sh8xcAFUHR0ad7D74nSp3o
1gjBi6Yl+i+hn5SZ/gkKPa5YFjTMaefcTV8rpZm3uPYbTf75TTscjKDDczFNEqoh
CDKOeyfOchVSjhRI5ExElBh5ToxDgh+HMzzlmxk+olfWA+BXjJj2J2iLh871v9Ym
hQl7xOYUDPu1xZz+jLhDLKG2Gd/R97ez2KZM43gKMAUsQ7eUbz6CtV73cFT+tNfP
jjIe3xp6ohbJYJalMThbNcI2mOOssnCM8BHUDBJib7N4CXucJYTMpDAcGvNoFKul
+K2Icyip1/6CS4LkWDpP29ayiH+sTbZk8/ipioe27hXXHGqKG84HQOBefRZmo4o0
CC64SoomuxQX6V7qL6hf0Sz82QQbl+ToSliZ97xXP5o84/OxhggZGqVdykTR2+DB
4POaJzRPmYJKPO4Q0yT2worB4oepAUrridJnaQE0y9F80xnLTvJxpD53atLCHulj
Ht9NlvnvqndDYaYjQwvucOmdR9vCos3OjZCaXnYgL+EpbgZUvMeQoxuyBUBTdQeQ
CNGq3EDssI75gD67sWeyFc+gOST259XofANw3mvlu3KS2/huviORJkDYdtzi2/LL
gkCN9OW9CmXOaAiylE8dSAu98SAR+c83wmahMeBA2mq5vYrQU4rIDGYZi/Mulne4
tD7eW5nlvps9Fsvf58fd
=T9nM
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix some iomap locking problems
- Don't allocate cow blocks when we're zeroing file data
* tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't block on the ilock for RWF_NOWAIT
xfs: don't start out with the exclusive ilock for direct I/O
xfs: don't allocate COW blocks for zeroing holes or unwritten extents
Remove Variable Length Array from fs/reiserfs/reiserfs.h. EMPTY_DIR_SIZE
is used as an array size and as it is using strlen() it need not be
evaluated at compile time. Change it's definition to use sizeof() to
force evaluation of array length at compile time.
Signed-off-by: Kyle Spiers <kyle@spiers.me>
Signed-off-by: Jan Kara <jack@suse.cz>
We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to
ensure that all outstanding COMMIT requests to the inode in question are
complete. Currently we may exit early from both nfs_commit_inode() and
nfs_write_inode() even if there are COMMIT requests in flight, or unstable
writes on the commit list.
In order to get the right semantics w.r.t. sync_inode(), we don't need
to have nfs_commit_inode() reset the inode dirty flags when called from
nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that
nfs_write_inode() leaves them in the right state if there are outstanding
commits, or stable pages.
Reported-by: Scott Mayhew <smayhew@redhat.com>
Fixes: dc4fd9ab01 ("nfs: don't wait on commit in nfs_commit_inode()...")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Ensure that we hold a reference to the layout header when processing
the pNFS return-on-close so that the refcount value does not inadvertently
go to zero.
Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.10+
Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
The start offset needs to be of type loff_t.
Fixed: 5fadeb47dc ("nfs: count DIO good bytes correctly with mirroring")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Before this patch, GFS2 was setting the PageChecked flag for ordered
write pages. This is unnecessary. The ext3 file system only does it
for jdata, and it's only used in jdata circumstances. It only muddies
the already murky waters of writing pages in the aops.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Function gfs2_remove_from_ail is only ever used from log.c, so there
is no reason to declare it extern. This patch removes the extern and
declares it static.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Mark the source inode dirty during a rename instead of just updating the
underlying buffer head. Otherwise, fsync may find the inode clean and
will then skip flushing the journal. A subsequent power failure will
cause the rename to be lost. This happens in command sequences like:
xfs_io -f -c 'pwrite 0 4096' -c 'fsync' foo
mv foo bar
xfs_io -c 'fsync' bar
# power failure
Fixes xfstests generic/322, generic/376.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The chunk size of allocations in __gfs2_fallocate is calculated
incorrectly. The size can collapse, causing __gfs2_fallocate to
allocate one block at a time, which is very inefficient. This needs
fixing in two places:
In gfs2_quota_lock_check, always set ap->allowed to UINT_MAX to indicate
that there is no quota limit. This fixes callers that rely on
ap->allowed to be set even when quotas are off.
In __gfs2_fallocate, reset max_blks to UINT_MAX in each iteration of the
loop to make sure that allocation limits from one resource group won't
spill over into another resource group.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Instead of using a stack VLA for the parity workspace, preallocate a
memory region. The preallocation is done to keep from needing to perform
allocations during crash dump writing, etc. This also fixes a missed
release of librs on free.
Signed-off-by: Kees Cook <keescook@chromium.org>
To allow for easier build test coverage and run-time testing, this allows
multiple compression algorithms to be built into pstore. Still only one
is supported to operate at a time (which can be selected at build time
or at boot time, similar to how LSMs are selected).
Signed-off-by: Kees Cook <keescook@chromium.org>
It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for
block_map"' introduced another bug in gfs2_iomap_begin that can cause
gfs2_block_map to set bh->b_size of an actual buffer to 0. This can
lead to arbitrary incorrect behavior including crashes or disk
corruption. Revert the incorrect part of that commit.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add some hints about overlayfs kernel config options.
Enabling NFS export by default is especially recommended against, as it
incurs a performance penalty even if the filesystem is not actually
exported.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Instead of casting, make sure we don't end up with giant values and just
perform regular assignments with unsigned int instead of re-cast size_t.
Signed-off-by: Kees Cook <keescook@chromium.org>
Currently, pstore has supported three compression algorithms: zlib,
lzo and lz4. This patch added two more compression algorithms: lz4hc
and 842.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
[kees: tweaked Kconfig help text slightly]
Signed-off-by: Kees Cook <keescook@chromium.org>
All of the conflicts were cases of overlapping changes.
In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqbuK8ACgkQxWXV+ddt
WDsiVQ//eE8Axfw4qWOyHHjozoAIu9kifvOFQIhvKviLiXHJNrD/vBI6YwD1hyD1
rbbLilMsEl1OD1Sq3AeOUMNSSl5qUFEB+CA8vg9GznFTNRobTkL+p96Zt5xlRDu3
lsFFV93tED+dK4D/eSGP+xYbknA8hIk/2gWPkd6hpYKyh2QdsPBYDqCnaEXvd79P
DIP/cAjIfzqQn0FTiZ9wbaES+LPO+NoZgQRC2w9McYQ5CEMc+oAChEmPJRwpPoKy
YdhuwoULniRNHVnVOIVfi4w9jkSPSz7YIgLeRFli/WGBYGcKeHTMFkMa12KdpuUC
JUclOogJ5ZMbFV3C0W8XEJ7Vb9ltIevrH8MgfKP/3BScuZzbJZ+n5KkH2Gf7vcpe
w5cZGOsKDz+35fDCdTmsFpDK9kpGzHq48JlRifOjARbdyqNwVq4emxOeQlO1ygzq
Y9H5UeMpp/FDAm6g/bV8ezCXzwuwUDV9CwAJBD+WCiZhD2nX85FfIp1kfF2zLcUg
Y8irqV6A/J/0BFkF7Iu9AuzTxOdo6zMLkcHEKb+sL7yGxZ2248v5gZxDoyV5iNWR
WY4Y2GaMemZGu6+NyyLFAJzlFyACD5fSy8vT7KD6upCzwmlAtaJ3VULfTrLyi/uM
SNZAKf3WzKBVe5h52GNGRCi5OLTteH6Yqz5m5/h8r74mBO00IrA=
=bF96
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- when NR_CPUS is large, a SRCU structure can significantly inflate
size of the main filesystem structure that would not be possible to
allocate by kmalloc, so the kvalloc fallback is used
- improved error handling
- fix endiannes when printing some filesystem attributes via sysfs,
this is could happen when a filesystem is moved between different
endianity hosts
- send fixes: the NO_HOLE mode should not send a write operation for a
file hole
- fix log replay for for special files followed by file hardlinks
- fix log replay failure after unlink and link combination
- fix max chunk size calculation for DUP allocation
* tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix log replay failure after unlink and link combination
Btrfs: fix log replay failure after linking special file and fsync
Btrfs: send, fix issuing write op when processing hole in no data mode
btrfs: use proper endianness accessors for super_copy
btrfs: alloc_chunk: fix DUP stripe size handling
btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
btrfs: handle failure of add_pending_csums
btrfs: use kvzalloc to allocate btrfs_fs_info
delayed and three error path memory leak fixups from Chengguang.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJamV3qAAoJEEp/3jgCEfOLv6kIAI0RXShR20dz9M3GnL2bDws7
KFkC/aNwbviXMAih3q7Q7ZiwXv6BI/uUK0YIYG/tWZoNHucYGS8ZOFDikdiIMcJR
2VP8Yas1/NvnLlh0ut45s2imvbXinkHcjWOLqgJJmJ2cEP9Ar/mbwz//TgfVMdNm
TSHEm2LPRge9jK3Ab/ZiatExGH6dx5hRJvssSauF+f2iN+q4nKj2aBqle+FKaNWT
Iv0ONi2f7aXnaXRvum4Y1gnKWXzKSFd8Y2Cr5R3tfTrKBv8nmQz54eRsMnnuxK/7
uV6Ujch4LUXBI1etXLqTwRdZQIcw3Li5PYxGrO8fl8mQfJau9yv5REw/rbgB5FU=
=f8gA
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A cap handling fix from Zhi that ensures that metadata writeback isn't
delayed and three error path memory leak fixups from Chengguang"
* tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client:
ceph: fix potential memory leak in init_caches()
ceph: fix dentry leak when failing to init debugfs
libceph, ceph: avoid memory leak when specifying same option several times
ceph: flush dirty caps of unlinked inode ASAP
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJamXImAAoJEPfTWPspceCmEP4P/3kkm0JIXtbNZFMb1JZtsjwE
t4OUVEDj4jjRfmZfUVkajPnczM4MSPiXm43PbcOi4NF53mv8k76jyIPhlZREzYzq
MBknibvpqyiWpbii9tBRrR6FGDR/N51//ya9vdPaYBcBssTg6Aqtt4BE5oPfo011
PleGROe1jtrBUNBy2dMy4sHb/MvZ0vRuNPxMsD8Agy5UiVeItAelY/lDn1Hw41BY
O+muE5bw6+yKqB9vGXhV3O4WRh8BofJi1YdzbwbbIzH40ZZK5VTDQc5o19/CFEZ/
uZ8BStOFEWA0LNuarME5fknWcogiedEtszweyiWBbVZo4VqCsfxPoaRCibY/Wg5F
a0UNJ4iSzglhfSMoHJlhvlCAMCyubFSeMSdJjrrpIcyBrziJXpcEXcUnWI43yi4P
FoM8zUni22XnfLWxIdTjVkMRytjtqTLcXOHXdP5N/ESa80jBq3Q76TLmzIKW+kK5
sAre+hgr52NdgovP/NSxsdvsckAolWNe40JI8wLbwNo+lMHr0ckzOG+sAdz1iPRK
iVL0CAlby4A94Wcu+OHCwfY7B9lBrMuMfHsesEM6x1cxgAhd3YNfEJ8g2QolCUEV
KmZizXbV9nnmJfegVC06SgM+D7AR26dwsBG2aoibShuvdxX6jMdUHygyu5DCJdg/
JS+q71jmxb/r1TWe/62r
=AMhV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180302' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A collection of fixes for this series. This is a little larger than
usual at this time, but that's mainly because I was out on vacation
last week. Nothing in here is major in any way, it's just two weeks of
fixes. This contains:
- NVMe pull from Keith, with a set of fixes from the usual suspects.
- mq-deadline zone unlock fix from Damien, fixing an issue with the
SMR zone locking added for 4.16.
- two bcache fixes sent in by Michael, with changes from Coly and
Tang.
- comment typo fix from Eric for blktrace.
- return-value error handling fix for nbd, from Gustavo.
- fix a direct-io case where we don't defer to a completion handler,
making us sleep from IRQ device completion. From Jan.
- a small series from Jan fixing up holes around handling of bdev
references.
- small set of regression fixes from Jiufei, mostly fixing problems
around the gendisk pointer -> partition index change.
- regression fix from Ming, fixing a boundary issue with the discard
page cache invalidation.
- two-patch series from Ming, fixing both a core blk-mq-sched and
kyber issue around token freeing on a requeue condition"
* tag 'for-linus-20180302' of git://git.kernel.dk/linux-block: (24 commits)
block: fix a typo
block: display the correct diskname for bio
block: fix the count of PGPGOUT for WRITE_SAME
mq-deadline: Make sure to always unlock zones
nvmet: fix PSDT field check in command format
nvme-multipath: fix sysfs dangerously created links
nbd: fix return value in error handling path
bcache: fix kcrashes with fio in RAID5 backend dev
bcache: correct flash only vols (check all uuids)
blktrace_api.h: fix comment for struct blk_user_trace_setup
blockdev: Avoid two active bdev inodes for one device
genhd: Fix BUG in blkdev_open()
genhd: Fix use after free in __blkdev_get()
genhd: Add helper put_disk_and_module()
genhd: Rename get_disk() to get_disk_and_module()
genhd: Fix leaked module reference for NVME devices
direct-io: Fix sleep in atomic due to sync AIO
nvme-pci: Fix nvme queue cleanup if IRQ setup fails
block: kyber: fix domain token leak during requeue
blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch
...
When specifiying iocharset multiple times in a mount or once/multiple in
a remount, current option parsing may cause inaccurate refcount of nls
module. Also, in the failure cleanup of option parsing, the condition
of calling unload_nls is not sufficient.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Change return code to -ENOMEM from -EINVAL when failing memory
allocation in fill_super(), meanwhile delete redundant initial
assignment of variable err.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If logical volume integrity descriptor contains non-closed integrity
type when mounting the volume, there are high chances that the volume is
not consistent (device was detached before the filesystem was
unmounted). Don't touch integrity type of such volume so that fsck can
recognize it and check such filesystem.
Reported-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix xfs_file_iomap_begin to trylock the ilock if IOMAP_NOWAIT is passed,
so that we don't block io_submit callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no reason to take the ilock exclusively at the start of
xfs_file_iomap_begin for direct I/O, given that it will be demoted
just before calling xfs_iomap_write_direct anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The iomap zeroing interface is smart enough to skip zeroing holes or
unwritten extents. Don't subvert this logic for reflink files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is lack of cache destroy operation for ceph_file_cachep
when failing from fscache register.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If we have a file with 2 (or more) hard links in the same directory,
remove one of the hard links, create a new file (or link an existing file)
in the same directory with the name of the removed hard link, and then
finally fsync the new file, we end up with a log that fails to replay,
causing a mount failure.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/foo
$ ln /mnt/testdir/foo /mnt/testdir/bar
$ sync
$ unlink /mnt/testdir/bar
$ touch /mnt/testdir/bar
$ xfs_io -c "fsync" /mnt/testdir/bar
<power failure>
$ mount /dev/sdb /mnt
mount: mount(2) failed: /mnt: No such file or directory
When replaying the log, for that example, we also see the following in
dmesg/syslog:
[71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
[71813.674204] ------------[ cut here ]------------
[71813.675694] BTRFS: Transaction aborted (error -2)
[71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
[71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G W 4.15.0-rc9-btrfs-next-56+ #1
[71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
[71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
[71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
[71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
[71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
[71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
[71813.679669] FS: 00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[71813.679669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
[71813.679669] Call Trace:
[71813.679669] btrfs_unlink_inode+0x17/0x41 [btrfs]
[71813.679669] drop_one_dir_item+0xfa/0x131 [btrfs]
[71813.679669] add_inode_ref+0x71e/0x851 [btrfs]
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] ? replay_one_buffer+0x53/0x53a [btrfs]
[71813.679669] replay_one_buffer+0x4a4/0x53a [btrfs]
[71813.679669] ? rcu_read_unlock+0x3a/0x57
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] walk_up_log_tree+0x101/0x1d2 [btrfs]
[71813.679669] walk_log_tree+0xad/0x188 [btrfs]
[71813.679669] btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
[71813.679669] ? replay_one_extent+0x544/0x544 [btrfs]
[71813.679669] open_ctree+0x1cf6/0x2209 [btrfs]
[71813.679669] btrfs_mount_root+0x368/0x482 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] btrfs_mount+0x13e/0x772 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] do_mount+0x6e5/0x973
[71813.679669] ? memdup_user+0x3e/0x5c
[71813.679669] SyS_mount+0x72/0x98
[71813.679669] entry_SYSCALL_64_fastpath+0x1e/0x8b
[71813.679669] RIP: 0033:0x7f7cedf150ba
[71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
[71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
[71813.679669] ---[ end trace 83bd473fc5b4663b ]---
[71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
[71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
[71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
[71814.128078] BTRFS error (device dm-0): open_ctree failed
This happens because the log has inode reference items for both inode 258
(the first file we created) and inode 259 (the second file created), and
when processing the reference item for inode 258, we replace the
corresponding item in the subvolume tree (which has two names, "foo" and
"bar") witht he one in the log (which only has one name, "foo") without
removing the corresponding dir index keys from the parent directory.
Later, when processing the inode reference item for inode 259, which has
a name of "bar" associated to it, we notice that dir index entries exist
for that name and for a different inode, so we attempt to unlink that
name, which fails because the inode reference item for inode 258 no longer
has the name "bar" associated to it, making a call to btrfs_unlink_inode()
fail with a -ENOENT error.
Fix this by unlinking all the names in an inode reference item from a
subvolume tree that are not present in the inode reference item found in
the log tree, before overwriting it with the item from the log tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If in the same transaction we rename a special file (fifo, character/block
device or symbolic link), create a hard link for it having its old name
then sync the log, we will end up with a log that can not be replayed and
at when attempting to replay it, an EEXIST error is returned and mounting
the filesystem fails. Example scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ mkfifo /mnt/testdir/foo
# Make sure everything done so far is durably persisted.
$ sync
# Create some unrelated file and fsync it, this is just to create a log
# tree. The file must be in the same directory as our special file.
$ touch /mnt/testdir/f1
$ xfs_io -c "fsync" /mnt/testdir/f1
# Rename our special file and then create a hard link with its old name.
$ mv /mnt/testdir/foo /mnt/testdir/bar
$ ln /mnt/testdir/bar /mnt/testdir/foo
# Create some other unrelated file and fsync it, this is just to persist
# the log tree which was modified by the previous rename and link
# operations. Alternatively we could have modified file f1 and fsync it.
$ touch /mnt/f2
$ xfs_io -c "fsync" /mnt/f2
<power failure>
$ mount /dev/sdc /mnt
mount: mount /dev/sdc on /mnt failed: File exists
This happens because when both the log tree and the subvolume's tree have
an entry in the directory "testdir" with the same name, that is, there
is one key (258 INODE_REF 257) in the subvolume tree and another one in
the log tree (where 258 is the inode number of our special file and 257
is the inode for directory "testdir"). Only the data of those two keys
differs, in the subvolume tree the index field for inode reference has
a value of 3 while the log tree it has a value of 5. Because the same key
exists in both trees, but have different index, the log replay fails with
an -EEXIST error when attempting to replay the inode reference from the
log tree.
Fix this by setting the last_unlink_trans field of the inode (our special
file) to the current transaction id when a hard link is created, as this
forces logging the parent directory inode, solving the conflict at log
replay time.
A new generic test case for fstests was also submitted.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send of a filesystem with the no-holes feature
enabled, we end up issuing a write operation when using the no data mode
send flag, instead of issuing an update extent operation. Fix this by
issuing the update extent operation instead.
Trivial reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdc
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdc /mnt/sdc
$ mount /dev/sdd /mnt/sdd
$ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1
$ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2
$ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
$ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
| btrfs receive -vv /mnt/sdd
Before this change the output of the second receive command is:
receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
utimes
write foobar, offset 8192, len 8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...
After this change it is:
receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
utimes
update_extent foobar: offset=8192, len=8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info::super_copy is a byte copy of the on-disk structure and all
members must use the accessor macros/functions to obtain the right
value. This was missing in update_super_roots and in sysfs readers.
Moving between opposite endianness hosts will report bogus numbers in
sysfs, and mount may fail as the root will not be restored correctly. If
the filesystem is always used on a same endian host, this will not be a
problem.
Fix this by using the btrfs_set_super...() functions to set
fs_info::super_copy values, and for the sysfs, use the cached
fs_info::nodesize/sectorsize values.
CC: stable@vger.kernel.org
Fixes: df93589a17 ("btrfs: export more from FS_INFO to sysfs")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
In case of using DUP, we search for enough unallocated disk space on a
device to hold two stripes.
The devices_info[ndevs-1].max_avail that holds the amount of unallocated
space found is directly assigned to stripe_size, while it's actually
twice the stripe size.
Later on in the code, an unconditional division of stripe_size by
dev_stripes corrects the value, but in the meantime there's a check to
see if the stripe_size does not exceed max_chunk_size. Since during this
check stripe_size is twice the amount as intended, the check will reduce
the stripe_size to max_chunk_size if the actual correct to be used
stripe_size is more than half the amount of max_chunk_size.
The unconditional division later tries to correct stripe_size, but will
actually make sure we can't allocate more than half the max_chunk_size.
Fix this by moving the division by dev_stripes before the max chunk size
check, so it always contains the right value, instead of putting a duct
tape division in further on to get it fixed again.
Since in all other cases than DUP, dev_stripes is 1, this change only
affects DUP.
Other attempts in the past were made to fix this:
* 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
to fix the same problem, but still resulted in part of the code acting
on a wrongly doubled stripe_size value.
* 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
broke this fix again.
The real problem was already introduced with the rest of the code in
73c5de0051.
The user visible result however will be that the max chunk size for DUP
will suddenly double, while it's actually acting according to the limits
in the code again like it was 5 years ago.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Essentially duplicate the error handling from the above block which
handles the !PageUptodate(page) case and additionally clear
EXTENT_BOUNDARY.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
add_pending_csums was added as part of the new data=ordered
implementation in e6dcd2dc9c ("Btrfs: New data=ordered
implementation"). Even back then it called the btrfs_csum_file_blocks
which can fail but it never bothered handling the failure. In ENOMEM
situation this could lead to the filesystem failing to write the
checksums for a particular extent and not detect this. On read this
could lead to the filesystem erroring out due to crc mismatch. Fix it by
propagating failure from add_pending_csums and handling them.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
kernels built with NR_CPUS=8192, this can result in kmalloc failures
that prevent mounting.
There is work in progress to try to resolve this for every user of
srcu_struct but using kvzalloc will work around the failures until
that is complete.
As an example with NR_CPUS=512 on x86_64: the overall size of
subvol_srcu is 3460 bytes, fs_info is 6496.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Fix some compiler warnings
- Fix block rservations for transactions created during log recovery
- Fix resource leaks when respecifying mount options
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJalEvWAAoJEPh/dxk0SrTrdDcQAKgYcpg8Ip6uKNqc38hm79l+
RXjHQFbpgKiojBzjgI8+lPNsVBEhSSh65rXB6nEVSS/TFS0ONScRbNKrcH9Selkn
cH1RsuhKk1NvKWaLMMFJWMTZK5Z6cHLtJU2szqnPdCsv/2EqdS6NylyYSFhljtSl
xbD8vffjnJ1HU9ijZsSoZkiu0DO1yoXYu7EUQkWKPRSb/el+qYIMKSwFC8qQdxrp
KGPyYH1CENm2jarKXAgTgqmUmrdJ9ikLHLT0sXQPZ9AbOjOoGlZ9Bn0KmvggQFmq
TyE0je7L2EWrDRJ6R+lcFKC0fHDgo7ec8Iz/CJlOiExSebbNZgtsn0bbPMfq2Rnz
8IMYPAV+NBY8RQWumgBN2aOjyjV9EUd+TkeJh5aIubyFOE2GtEmHjlR4p0bG9os9
yOZJv+5JDF09oN1dLDf/xpEwXsHho6KHDYtVqbKhBWfQiw84sAlW9/NwfQOEugJS
6RXN3LaExSvpFSc9qcGqsrdGvEuMcNLo+XtTwz9g8DNR0Ztp1bFE64xh4KYlNsx5
QQj256Hx56R7vZ2/DC73MiT/hOSgfPpqnwOZP+Yc7I3DeO65DLwdCt9m2c0f1Odn
6xi3ZPXuq+QutJEp3iAfj5XXVmwSVgcub8EmtVmyK2fttiK3+SXbZdvo6JqyuEMX
ZxwBaZkNuYWikHhj5iAp
=A2Bo
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.16-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- fix some compiler warnings
- fix block reservations for transactions created during log recovery
- fix resource leaks when respecifying mount options
* tag 'xfs-4.16-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix potential memory leak in mount option parsing
xfs: reserve blocks for refcount / rmap log item recovery
xfs: use memset to initialize xfs_scrub_agfl_info
These pernet_operations make pretty simple actions
like variable initialization on init, debug checks
on exit, and so on, and they obviously are able
to be executed in parallel with any others:
vrf_net_ops
lockd_net_ops
grace_net_ops
xfrm6_tunnel_net_ops
kcm_net_ops
tcf_net_ops
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations just create and destroy /proc entries
and net_generic()->cb_ident_idr IDR. So, we are able to mark
them async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently if notification event is lost due to event allocation failing
we ENOMEM, we just silently continue (except for fanotify permission
events where we deny the access). This is undesirable as userspace has
no way of knowing whether the notifications it got are complete or not.
Treat lost events due to ENOMEM the same way as lost events due to queue
overflow so that userspace knows something bad happened and it likely
needs to rescan the filesystem.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fanotify queues of unlimited length do not expect events can be lost.
Since these queues are used for system auditing and other security
related tasks, loosing events can even have security implications.
Currently, since the allocation is small (32-bytes), it cannot fail
however when we start accounting events in memcgs, allocation can start
failing. So avoid loosing events due to failure to allocate memory by
making event allocation use __GFP_NOFAIL.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently when UDF filesystem is recorded without uid / gid (ids are set
to -1), we will assign INVALID_[UG]ID to vfs inode unless user uses uid=
and gid= mount options. In such case filesystem could not be modified in
any way as VFS refuses to modify files with invalid ids (even by root).
This is confusing to users and not very useful default since such media
mode is generally used for removable media. Use overflow[ug]id instead
so that at least root can modify the filesystem.
Reported-by: Steve Kenton <skenton@ou.edu>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Current code relies on the fact that invalid uid/gid as defined by UDF
2.60 3.3.3.1 and 3.3.3.2 coincides with invalid uid/gid as used by the
user namespaces implementation. Since this is only lucky coincidence,
clean this up to avoid future surprises in case user namespaces
implementation changes. Also this is more robust in presence of valid
(from UDF point of view) uids / gids which do not map into current user
namespace.
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently newly created files belong to current user despite
uid=<number> / gid=<number> mount options. This is confusing to users
(as owner of the file will change after remount / eviction from cache)
and also inconsistent with e.g. FAT with the same mount option. So apply
uid=<number> and gid=<number> also to newly created inodes and similarly
as FAT disallow to change owner of the file in this case.
Reported-by: Steve Kenton <skenton@ou.edu>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently uid=ignore and gid=ignore make no sense without uid=<number>
and gid=<number> respectively as they result in all files having invalid
uid / gid which then doesn't allow even root to modify files and thus
causes confusion. And since commit ca76d2d803 "UDF: fix UID and GID
mount option ignorance" (from over 10 years ago) uid=<number> overrides
all uids on disk as uid=ignore does. So just silently ignore uid=ignore
mount option.
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Current handling of Partition Descriptors in Volume Descriptor Sequence
is buggy in several ways. Firstly, it does not take descriptor sequence
numbers into account at all, thus any volume making serious use of them
would be unmountable. Secondly, it does not handle Volume Descriptor
Pointers or Volume Descriptor Sequence without Terminating Descriptor.
Fix these problems by properly remembering all Partition Descriptors in
the Volume Descriptor Sequence and their sequence numbers. This is made
more complicated by the fact that we don't know number of partitions in
advance and sequence numbers have to be tracked on per-partition basis.
Reported-by: Pali Rohár <pali.rohar@gmail.com>
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When scanning Volume Descriptor Sequence, several descriptors have
exactly the same handling. Unify it.
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When specifying string type mount option (e.g., logdev)
several times in a mount, current option parsing may
cause memory leak. Hence, call kfree for previous one
in this case.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When blkdev_open() races with device removal and creation it can happen
that unhashed bdev inode gets associated with newly created gendisk
like:
CPU0 CPU1
blkdev_open()
bdev = bd_acquire()
del_gendisk()
bdev_unhash_inode(bdev);
remove device
create new device with the same number
__blkdev_get()
disk = get_gendisk()
- gets reference to gendisk of the new device
Now another blkdev_open() will not find original 'bdev' as it got
unhashed, create a new one and associate it with the same 'disk' at
which point problems start as we have two independent page caches for
one device.
Fix the problem by verifying that the bdev inode didn't get unhashed
before we acquired gendisk reference. That way we make sure gendisk can
get associated only with visible bdev inodes.
Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When two blkdev_open() calls race with device removal and recreation,
__blkdev_get() can use looked up gendisk after it is freed:
CPU0 CPU1 CPU2
del_gendisk(disk);
bdev_unhash_inode(inode);
blkdev_open() blkdev_open()
bdev = bd_acquire(inode);
- creates and returns new inode
bdev = bd_acquire(inode);
- returns the same inode
__blkdev_get(devt) __blkdev_get(devt)
disk = get_gendisk(devt);
- got structure of device going away
<finish device removal>
<new device gets
created under the same
device number>
disk = get_gendisk(devt);
- got new device structure
if (!bdev->bd_openers) {
does the first open
}
if (!bdev->bd_openers)
- false
} else {
put_disk_and_module(disk)
- remember this was old device - this was last ref and disk is
now freed
}
disk_unblock_events(disk); -> oops
Fix the problem by making sure we drop reference to disk in
__blkdev_get() only after we are really done with it.
Reported-by: Hou Tao <houtao1@huawei.com>
Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit e864f39569 "fs: add RWF_DSYNC aand RWF_SYNC" added additional
way for direct IO to become synchronous and thus trigger fsync from the
IO completion handler. Then commit 9830f4be15 "fs: Use RWF_* flags for
AIO operations" allowed these flags to be set for AIO as well. However
that commit forgot to update the condition checking whether the IO
completion handling should be defered to a workqueue and thus AIO DIO
with RWF_[D]SYNC set will call fsync() from IRQ context resulting in
sleep in atomic.
Fix the problem by checking directly iocb flags (the same way as it is
done in dio_complete()) instead of checking all conditions that could
lead to IO being synchronous.
CC: Christoph Hellwig <hch@lst.de>
CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
CC: stable@vger.kernel.org
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 9830f4be15
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
redirect_dir=nofollow should not follow a redirect. But in a specific
configuration it can still follow it. For example try this.
$ mkdir -p lower0 lower1/foo upper work merged
$ touch lower1/foo/lower-file.txt
$ setfattr -n "trusted.overlay.opaque" -v "y" lower1/foo
$ mount -t overlay -o lowerdir=lower1:lower0,workdir=work,upperdir=upper,redirect_dir=on none merged
$ cd merged
$ mv foo foo-renamed
$ umount merged
# mount again. This time with redirect_dir=nofollow
$ mount -t overlay -o lowerdir=lower1:lower0,workdir=work,upperdir=upper,redirect_dir=nofollow none merged
$ ls merged/foo-renamed/
# This lists lower-file.txt, while it should not have.
Basically, we are doing redirect check after we check for d.stop. And
if this is not last lower, and we find an opaque lower, d.stop will be
set.
ovl_lookup_single()
if (!d->last && ovl_is_opaquedir(this)) {
d->stop = d->opaque = true;
goto out;
}
To fix this, first check redirect is allowed. And after that check if
d.stop has been set or not.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: 438c84c2f0 ("ovl: don't follow redirects if redirect_dir=off")
Cc: <stable@vger.kernel.org> #v4.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When failing from ceph_fs_debugfs_init() in ceph_real_mount(),
there is lack of dput of root_dentry and it causes slab errors,
so change the calling order of ceph_fs_debugfs_init() and
open_root_dentry() and do some cleanups to avoid this issue.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When parsing string option, in order to avoid memory leak we need to
carefully free it first in case of specifying same option several times.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
fs/overlayfs/export.c:459:10-16: WARNING: PTR_ERR_OR_ZERO can be used
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
Generated by: scripts/coccinelle/api/ptr_ret.cocci
Fixes: 4b91c30a5a ("ovl: lookup connected ancestor of dir in inode cache")
CC: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Hightlights include:
- Fix a broken cast in nfs4_callback_recallany()
- Fix an Oops during NFSv4 migration events
- make struct nlmclnt_fl_close_lock_ops static
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJakuNIAAoJEGcL54qWCgDykYYQAITHRrWP7tQ6aSpZxW5+Un5z
6K3RRbfxFjHWVyaePCBzRMOtPTA/puqO0ggx9H+2D4+u2GeXhFl7FMdIuLueGKrc
rh0wzB6+KiHvqK8NT3g4c2VzZbGJ8IWB6jlNaA3ZyHRJcO+Oi3rQhYBNZpVqP6ny
M1C3yXQTUtA13aOLjeThoAKIJyknwdZcsiMTptJslvSsQ9PL0w6m6jZKrVHu6Rc+
Hg12FFptaKien/gj2IUYJb6Z2Mz3arJu1Y7cm1P/zH/NBs37ynMUsrb9AvPbzvRm
PvPRT4ugNOlTgDTaIT2JHwP2bhlp2JF+Tdzq7WYE3ek2CEPUD49jv07MlgTHxy/w
+tcp/322ZCxvKLjHTeEWqGn4T0TZ1TdPrd4dIJsjox9Ffy72Z3rOjvLKt5UrRV0i
8IhiE3/ruHFpB75Yfi7ABEIH8aEwwmchQTf5bth0ZKoZdaEPHmy3xnJkxa+wlD1V
Hp6KoqMNlDXEFx2Ih/SD6j50MFKszq6+cjUk2D8iclLnelXhu9iddFQ+PFNfsxVZ
WSo4AWZoPbtbLnn9Ez9dkdsJILKv86LbEvYxLX6/LnxLzzX70E34tbRQa+drVeR3
Na6czRpld85juKWgiFkzNx+zD4TaBAxUMbQH2Gbwngiz5RoT6SnkkdkAK3EW7nhR
pIJgZGiq/NG3NLiCxIgs
=l2jv
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- fix a broken cast in nfs4_callback_recallany()
- fix an Oops during NFSv4 migration events
- make struct nlmclnt_fl_close_lock_ops static
* tag 'nfs-for-4.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: make struct nlmclnt_fl_close_lock_ops static
nfs: system crashes after NFS4ERR_MOVED recovery
NFSv4: Fix broken cast in nfs4_callback_recallany()
i_dir_seq is subject to concurrent modification by a cmpxchg or
store-release operation, so ensure that the relaxed access in
d_alloc_parallel uses READ_ONCE.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If d_alloc_parallel runs concurrently with __d_add, it is possible for
d_alloc_parallel to continuously retry whilst i_dir_seq has been
incremented to an odd value by __d_add:
CPU0:
__d_add
n = start_dir_add(dir);
cmpxchg(&dir->i_dir_seq, n, n + 1) == n
CPU1:
d_alloc_parallel
retry:
seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
hlist_bl_lock(b);
bit_spin_lock(0, (unsigned long *)b); // Always succeeds
CPU0:
__d_lookup_done(dentry)
hlist_bl_lock
bit_spin_lock(0, (unsigned long *)b); // Never succeeds
CPU1:
if (unlikely(parent->d_inode->i_dir_seq != seq)) {
hlist_bl_unlock(b);
goto retry;
}
Since the simple bit_spin_lock used to implement hlist_bl_lock does not
provide any fairness guarantees, then CPU1 can starve CPU0 of the lock
and prevent it from reaching end_dir_add(dir), therefore CPU1 cannot
exit its retry loop because the sequence number always has the bottom
bit set.
This patch resolves the livelock by not taking hlist_bl_lock in
d_alloc_parallel if the sequence counter is odd, since any subsequent
masked comparison with i_dir_seq will fail anyway.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Naresh Madhusudana <naresh.madhusudana@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case when dentry passed to lock_parent() is protected from freeing only
by the fact that it's on a shrink list and trylock of parent fails, we
could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
between unlocking dentry and locking presumed parent. We need to recheck
that dentry is alive once we lock both it and parent *and* postpone
rcu_read_unlock() until after that point. Otherwise we could return
a pointer to struct dentry that already is rcu-scheduled for freeing, with
->d_lock held on it; caller's subsequent attempt to unlock it can end
up with memory corruption.
Cc: stable@vger.kernel.org # 3.12+, counting backports
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull siginfo fix from Eric Biederman:
"This fixes a build error that only shows up on blackfin"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
fs/signalfd: fix build error for BUS_MCEERR_AR
During log recovery, the per-AG reservations aren't yet set up, so log
recovery has to reserve enough blocks to handle all possible btree
splits.
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Apparently different gcc versions have competing and
incompatible notions of how to initialize at declaration,
so just give up and fall back to the time-tested memset().
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Fix build error in fs/signalfd.c by using same method that is used in
kernel/signal.c: separate blocks for different signal si_code values.
./fs/signalfd.c: error: 'BUS_MCEERR_AR' undeclared (first use in this function)
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Each read from a file in efivarfs results in two calls to EFI
(one to get the file size, another to get the actual data).
On X86 these EFI calls result in broadcast system management
interrupts (SMI) which affect performance of the whole system.
A malicious user can loop performing reads from efivarfs bringing
the system to its knees.
Linus suggested per-user rate limit to solve this.
So we add a ratelimit structure to "user_struct" and initialize
it for the root user for no limit. When allocating user_struct for
other users we set the limit to 100 per second. This could be used
for other places that want to limit the rate of some detrimental
user action.
In efivarfs if the limit is exceeded when reading, we take an
interruptible nap for 50ms and check the rate limit again.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The structure nlmclnt_fl_close_lock_ops s local to the source and does
not need to be in global scope, so make it static.
Cleans up sparse warning:
fs/nfs/nfs3proc.c:876:33: warning: symbol 'nlmclnt_fl_close_lock_ops' was not
declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs4_update_server unconditionally releases the nfs_client for the
source server. If migration fails, this can cause the source server's
nfs_client struct to be left with a low reference count, resulting in
use-after-free. Also, adjust reference count handling for ELOOP.
NFS: state manager: migration failed on NFSv4 server nfsvmu10 with error 6
WARNING: CPU: 16 PID: 17960 at fs/nfs/client.c:281 nfs_put_client+0xfa/0x110 [nfs]()
nfs_put_client+0xfa/0x110 [nfs]
nfs4_run_state_manager+0x30/0x40 [nfsv4]
kthread+0xd8/0xf0
BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
nfs4_xdr_enc_write+0x6b/0x160 [nfsv4]
rpcauth_wrap_req+0xac/0xf0 [sunrpc]
call_transmit+0x18c/0x2c0 [sunrpc]
__rpc_execute+0xa6/0x490 [sunrpc]
rpc_async_schedule+0x15/0x20 [sunrpc]
process_one_work+0x160/0x470
worker_thread+0x112/0x540
? rescuer_thread+0x3f0/0x3f0
kthread+0xd8/0xf0
This bug was introduced by 32e62b7c ("NFS: Add nfs4_update_server"),
but the fix applies cleanly to 52442f9b ("NFS4: Avoid migration loops")
Reported-by: Helen Chao <helen.chao@oracle.com>
Fixes: 52442f9b11 ("NFS4: Avoid migration loops")
Signed-off-by: Bill Baker <bill.baker@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When reading the inode or block allocation bitmap, if the bitmap needs
to be initialized, do not update the checksum in the block group
descriptor. That's because we're not set up to journal those changes.
Instead, just set the verified bit on the bitmap block, so that it's
not necessary to validate the checksum.
When a block or inode allocation actually happens, at that point the
checksum will be calculated, and update of the bg descriptor block
will be properly journalled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
This updates the jbd2 superblock unnecessarily, and on an abort we
shouldn't truncate the log.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Previously the jbd2 layer assumed that a file system check would be
required after a journal abort. In the case of the deliberate file
system shutdown, this should not be necessary. Allow the jbd2 layer
to distinguish between these two cases by using the ESHUTDOWN errno.
Also add proper locking to __journal_abort_soft().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The msleep() when processing EXT4_GOING_FLAGS_NOLOGFLUSH was a hack to
avoid some races (that are now fixed), but in fact it introduced its
own race.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The ext4 forced shutdown flag needs to prevent new handles from being
started, but it needs to allow existing handles to complete. So the
forced shutdown flag should not force ext4_journal_get_write_access to
fail.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
There were two error messages emitted by jbd2, one for a bad checksum
for a jbd2 descriptor block, and one for a bad checksum for a jbd2
data block. Change the data block checksum error so that the two can
be disambiguated.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqG8poACgkQxWXV+ddt
WDuHSA//eC+69XpHwohI6pcPQ7Jbr9UCj1L/Gt0U96YSzijGW4Hv3OQEWLIRBu4c
nZbzQYtUunpguLYwfXgUUgXRHBTo2Y5bXZNmF2MtL7JcPOLhLh4h/IcGY7eRd2Vq
qvv2bqr3yAcQo7s6z5U/D8ulohzHQTxG7Jaq/BkVxQqhvu+vdu/9T8ikAWnmSTjw
lONu8soR5QO7tewxz23Cguw/t1bWe1aMXG9Ykd4avyhQHtgzNE+l82i4DYUhK2CM
x8M5/CxnDLPe73IJuA2INCUtpPvR4Qufi5Nz6EN3BrJNCGBkmg18sPIvWlH6LsVh
bsm4Lwz/piq+hkDq2GG+Z79uiGAfCVUWAsnm7yYHwpVyMvwHKlfrcVSAuRZixw5E
/NZ0JEkEOtvzpv4inZFYbAgD+oKfvYvwj9BW5BXfu2aH6hJBImfAeMSd1aHB3uZI
kGgy52k2v2P3WKQOFUbmW417P05DvvGmRvRmU+tSFpB+lXAZqRzoiVIuFm0xwhf1
1SmnYgnSYzPmzIRXAMsSYQeK/8NXDdMZMutaw/AYwX+QBEdIAErf6MWcjI6XZRyG
g8Gr8JcpwSa+H5/LKN5uswfXxfSAsqVHnZhbOVrjyGX0wyR4KJg3ag3KsHd9SCxb
LDEjPSYEDn9yfmw6pK2Q6J26FGYiKpuUXaNiYVNymGe6162IiBM=
=VeA/
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We have a few assorted fixes, some of them show up during fstests so I
gave them more testing"
* tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
Btrfs: fix null pointer dereference when replacing missing device
btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
Btrfs: fix unexpected -EEXIST when creating new inode
Btrfs: fix use-after-free on root->orphan_block_rsv
Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly
Btrfs: fix extent state leak from tree log
Btrfs: fix crash due to not cleaning up tree log block's dirty bits
Btrfs: fix deadlock in run_delalloc_nocow
This change relaxes copy up on encode of merge dir with lower layer > 1
and handles the case of encoding a merge dir with lower layer 1, where an
ancestor is a non-indexed merge dir. In that case, decode of the lower
file handle will not have been possible if the non-indexed ancestor is
redirected before or after encode.
Before encoding a non-upper directory file handle from real layer N, we
need to check if it will be possible to reconnect an overlay dentry from
the real lower decoded dentry. This is done by following the overlay
ancestry up to a "layer N connected" ancestor and verifying that all
parents along the way are "layer N connectable". If an ancestor that is
NOT "layer N connectable" is found, we need to copy up an ancestor, which
is "layer N connectable", thus making that ancestor "layer N connected".
For example:
layer 1: /a
layer 2: /a/b/c
The overlay dentry /a is NOT "layer 2 connectable", because if dir /a is
copied up and renamed, upper dir /a will be indexed by lower dir /a from
layer 1. The dir /a from layer 2 will never be indexed, so the algorithm
in ovl_lookup_real_ancestor() (*) will not be able to lookup a connected
overlay dentry from the connected lower dentry /a/b/c.
To avoid this problem on decode time, we need to copy up an ancestor of
/a/b/c, which is "layer 2 connectable", on encode time. That ancestor is
/a/b. After copy up (and index) of /a/b, it will become "layer 2 connected"
and when the time comes to decode the file handle from lower dentry /a/b/c,
ovl_lookup_real_ancestor() will find the indexed ancestor /a/b and decoding
a connected overlay dentry will be accomplished.
(*) the algorithm in ovl_lookup_real_ancestor() can be improved to lookup
an entry /a in the lower layers above layer N and find the indexed dir /a
from layer 1. If that improvement is made, then the check for "layer N
connected" will need to verify there are no redirects in lower layers above
layer N. In the example above, /a will be "layer 2 connectable". However,
if layer 2 dir /a is a target of a layer 1 redirect, then /a will NOT be
"layer 2 connectable":
layer 1: /A (redirect = /a)
layer 2: /a/b/c
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 31747eda41 ("ovl: hash directory inodes for fsnotify")
fixed an issue of inotify watch on directory that stops getting
events after dropping dentry caches.
A similar issue exists for non-dir non-upper files, for example:
$ mkdir -p lower upper work merged
$ touch lower/foo
$ mount -t overlay -o
lowerdir=lower,workdir=work,upperdir=upper none merged
$ inotifywait merged/foo &
$ echo 2 > /proc/sys/vm/drop_caches
$ cat merged/foo
inotifywait doesn't get the OPEN event, because ovl_lookup() called
from 'cat' allocates a new overlay inode and does not reuse the
watched inode.
Fix this by hashing non-dir overlay inodes by lower real inode in
the following cases that were not hashed before this change:
- A non-upper overlay mount
- A lower non-hardlink when index=off
A helper ovl_hash_bylower() was added to put all the logic and
documentation about which real inode an overlay inode is hashed by
into one place.
The issue dates back to initial version of overlayfs, but this
patch depends on ovl_inode code that was introduced in kernel v4.13.
Cc: <stable@vger.kernel.org> #v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Convert index definitions from defines to enum. It is a shorter
description and easier to modify. Also remove VDS_POS_VOL_DESC_PTR since
it is unused.
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
According to ECMA-167 3/8.4.2 a volume descriptor sequence can be
terminated also by an unrecorded block within the extent of volume
descriptor sequence. Currently we errored out in such case making such
volumes unmountable. Handle that case by treating any invalid block as a
block terminating the sequence.
Reported-by: Pali Rohár <pali.rohar@gmail.com>
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
According to ECMA-167 3/8.4.2 Volume Descriptor Pointer is terminating
current extent of Volume Descriptor Sequence. Also according to ECMA-167
3/8.4.3 Volume Descriptor Sequence Number is not significant for Volume
Descriptor Pointers. Simplify the handling of Volume Descriptor Pointers
to take this into account.
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We pass one block beyond end of volume descriptor sequence into
process_sequence() as 'lastblock' instead of the last block of the
sequence. When the sequence is not terminated with TD descriptor, this
could lead to false errors due to invalid blocks in volume descriptor
sequence and thus unmountable volumes.
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This function will be used to obtain net of tun device.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 fixes from Ingo Molnar:
"Misc fixes all across the map:
- /proc/kcore vsyscall related fixes
- LTO fix
- build warning fix
- CPU hotplug fix
- Kconfig NR_CPUS cleanups
- cpu_has() cleanups/robustification
- .gitignore fix
- memory-failure unmapping fix
- UV platform fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm, mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages
x86/error_inject: Make just_return_func() globally visible
x86/platform/UV: Fix GAM Range Table entries less than 1GB
x86/build: Add arch/x86/tools/insn_decoder_test to .gitignore
x86/smpboot: Fix uncore_pci_remove() indexing bug when hot-removing a physical CPU
x86/mm/kcore: Add vsyscall page to /proc/kcore conditionally
vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page
x86/Kconfig: Further simplify the NR_CPUS config
x86/Kconfig: Simplify NR_CPUS config
x86/MCE: Fix build warning introduced by "x86: do not use print_symbol()"
x86/cpufeature: Update _static_cpu_has() to use all named variables
x86/cpufeature: Reindent _static_cpu_has()
This tag is meant for pulling a patch called gfs2: Fixes to
"Implement iomap for block_map". The patch fixes some
regressions we recently discovered in commit 3974320ca6.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJahFiJAAoJENeLYdPf93o7JKYH/irlIZM7NPHhiOcot1lXG6HL
x1fV9u6Rjw7QimctgM6ks1lu/R7hamNvOCAPz7TFXIo0grWes2qOcZa7tdWqkpZK
TGmSIv+NfrI9NzB3PwleImClfHR8SOgIh/ZlvHQWu9JvKkPlZ3Ik0mZCXbzUFn0I
Q5ebe+yvaaGeU3QUzsdBgTWuYRE0uQfIylyTz7f8wc9PDp2zB2l01CCCbat/VEWe
Jy1HlXSiQsmR0N5ypm5d3AszXJ0zbHfjQzKpNACP59WrRjnKvxsBan7En5pQBFnP
lhLWClqxgtXlvmSb4Takw+Cu9aS2zCYizQ8eqecX5FKQp1Vufoxs48EqRnq55IY=
=vJqP
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.16.rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Bob Peterson:
"Fix regressions in the gfs2 iomap for block_map implementation we
recently discovered in commit 3974320ca6"
* tag 'gfs2-4.16.rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fixes to "Implement iomap for block_map"
Watch descriptor is id of the watch created by inotify_add_watch().
It is allocated in inotify_add_to_idr(), and takes the numbers
starting from 1. Every new inotify watch obtains next available
number (usually, old + 1), as served by idr_alloc_cyclic().
CRIU (Checkpoint/Restore In Userspace) project supports inotify
files, and restores watched descriptors with the same numbers,
they had before dump. Since there was no kernel support, we
had to use cycle to add a watch with specific descriptor id:
while (1) {
int wd;
wd = inotify_add_watch(inotify_fd, path, mask);
if (wd < 0) {
break;
} else if (wd == desired_wd_id) {
ret = 0;
break;
}
inotify_rm_watch(inotify_fd, wd);
}
(You may find the actual code at the below link:
https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)
The cycle is suboptiomal and very expensive, but since there is no better
kernel support, it was the only way to restore that. Happily, we had met
mostly descriptors with small id, and this approach had worked somehow.
But recent time containers with inotify with big watch descriptors
begun to come, and this way stopped to work at all. When descriptor id
is something about 0x34d71d6, the restoring process spins in busy loop
for a long time, and the restore hungs and delay of migration from node
to node could easily be watched.
This patch aims to solve this problem. It introduces new ioctl
INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created
watch descriptor from userspace. It simply calls idr_set_cursor() primitive
to populate idr::idr_next, so that next idr_alloc_cyclic() allocation
will return this id, if it is not occupied. This is the way which is
used to restore some other resources from userspace. For example,
/proc/sys/kernel/ns_last_pid works the same for task pids.
The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system
may exclude it.
v2: Use INT_MAX instead of custom definition of max id,
as IDR subsystem guarantees id is between 0 and INT_MAX.
CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <willy@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
It turns out that commit 3974320ca6 "Implement iomap for block_map"
introduced a few bugs that trigger occasional failures with xfstest
generic/476:
In gfs2_iomap_begin, we jump to do_alloc when we determine that we are
beyond the end of the allocated metadata (height > ip->i_height).
There, we can end up calling hole_size with a metapath that doesn't
match the current metadata tree, which doesn't make sense. After
untangling the code at do_alloc, fix this by checking if the block we
are looking for is within the range of allocated metadata.
In addition, add a BUG() in case gfs2_iomap_begin is accidentally called
for reading stuffed files: this is handled separately. Make sure we
don't truncate iomap->length for reads beyond the end of the file; in
that case, the entire range counts as a hole.
Finally, revert to taking a bitmap write lock when doing allocations.
It's unclear why that change didn't lead to any failures during testing.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch starts to convert pernet_subsys, registered
before initcalls.
proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit()
{un,}register pernet net->proc_net and ->proc_net_stat.
Constructors and destructors of another pernet_operations
are not interested in foreign net's proc_net and proc_net_stat.
Proc filesystem privitives are synchronized on proc_subdir_lock.
So, proc_net_ns_ops methods are able to be executed
in parallel with methods of any other pernet operations.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit:
df04abfd18 ("fs/proc/kcore.c: Add bounce buffer for ktext data")
... introduced a bounce buffer to work around CONFIG_HARDENED_USERCOPY=y.
However, accessing the vsyscall user page will cause an SMAP fault.
Replace memcpy() with copy_from_user() to fix this bug works, but adding
a common way to handle this sort of user page may be useful for future.
Currently, only vsyscall page requires KCORE_USER.
Signed-off-by: Jia Zhang <zhang.jia@linux.alibaba.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1518446694-21124-2-git-send-email-zhang.jia@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Changes since v1:
Added changes in these files:
drivers/infiniband/hw/usnic/usnic_transport.c
drivers/staging/lustre/lnet/lnet/lib-socket.c
drivers/target/iscsi/iscsi_target_login.c
drivers/vhost/net.c
fs/dlm/lowcomms.c
fs/ocfs2/cluster/tcp.c
security/tomoyo/network.c
Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.
"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.
None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.
This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.
Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.
Userspace API is not changed.
text data bss dec hex filename
30108430 2633624 873672 33615726 200ef6e vmlinux.before.o
30108109 2633612 873672 33615393 200ee21 vmlinux.o
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more poll annotation updates from Al Viro:
"This is preparation to solving the problems you've mentioned in the
original poll series.
After this series, the kernel is ready for running
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
as a for bulk search-and-replace.
After that, the kernel is ready to apply the patch to unify
{de,}mangle_poll(), and then get rid of kernel-side POLL... uses
entirely, and we should be all done with that stuff.
Basically, that's what you suggested wrt KPOLL..., except that we can
use EPOLL... instead - they already are arch-independent (and equal to
what is currently kernel-side POLL...).
After the preparations (in this series) switch to returning EPOLL...
from ->poll() instances is completely mechanical and kernel-side
POLL... can go away. The last step (killing kernel-side POLL... and
unifying {de,}mangle_poll() has to be done after the
search-and-replace job, since we need userland-side POLL... for
unified {de,}mangle_poll(), thus the cherry-pick at the last step.
After that we will have:
- POLL{IN,OUT,...} *not* in __poll_t, so any stray instances of
->poll() still using those will be caught by sparse.
- eventpoll.c and select.c warning-free wrt __poll_t
- no more kernel-side definitions of POLL... - userland ones are
visible through the entire kernel (and used pretty much only for
mangle/demangle)
- same behavior as after the first series (i.e. sparc et.al. epoll(2)
working correctly)"
* 'work.poll2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
annotate ep_scan_ready_list()
ep_send_events_proc(): return result via esed->res
preparation to switching ->poll() to returning EPOLL...
add EPOLLNVAL, annotate EPOLL... and event_poll->event
use linux/poll.h instead of asm/poll.h
xen: fix poll misannotation
smc: missing poll annotations
Pull misc vfs fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
seq_file: fix incomplete reset on read from zero offset
kernfs: fix regression in kernfs_fop_write caused by wrong type
-----BEGIN PGP SIGNATURE-----
iQGcBAABAgAGBQJafSqsAAoJEIosvXAHck9RHmEMAJyzkwc503WOl9/ZyagcaDli
4mJEplVgxL6ZcgmaPZrZ1qaZvHd0JWq5bDbPeuuNv+wyqIu14DYVHivaORswfI7y
Q0p0gslWf+hyS637CcmBajgEZbgAZIAkUktC+KPa7lZcUFDEvgYwHnQNuK3yvhBR
zRrWeiumWn4l25ahc8GBA5nZ7tDM5xkLpv8DfI0ycCbm5E+Bqnf23m13hTMT7Mt3
4hBc6iEdi+/IcRkwf5BHEO94hNeWSb4oERLIWxXXkZ3XTSlYtJteV/pdIoJfhHnr
Th453VUwPfkRVVw3h4feZaIKM6kGPStGg1435+6lBpgTWQgNImd/Kcg3d181U3rs
/+iORX2KLwwl6orVQnX5IBiUpnB2+ePpRGjMAGedIPSztMVInGInxxT1UZQtMCIg
fJ6PQ1eH/OlY7WiY16+3YBYvtWPPqJc98P7gyfDocne7ZoT0XkoQ+2YejaNzI2Sz
8Qkw6Y8gLSQ8tC2duV14evlLmynbB1qRL9n99iD06w==
=Ps35
-----END PGP SIGNATURE-----
Merge tag '4.16-minor-rc-SMB3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"There are a couple additional security fixes that are still being
tested that are not in this set."
* tag '4.16-minor-rc-SMB3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
Add missing structs and defines from recent SMB3.1.1 documentation
address lock imbalance warnings in smbdirect.c
cifs: silence compiler warnings showing up with gcc-8.0.0
Add some missing debug fields in server and tcon structs
But it's also a fairly small update this time around. Some cleanup,
RDMA fixes, overlayfs fixes, and a fix for an NFSv4 state bug.
The bigger deal for nfsd this time around is Jeff Layton's
already-merged i_version patches. This series has a minor conflict with
that one, and the resolution should be obvious. (Stephen Rothwell has
been carrying it in linux-next for what it's worth.)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJafNVvAAoJECebzXlCjuG+yZUP/2SctFtkW638z9frLcIVt5M6
x5hluw5jtFrVqq/KoMwi7rVaMzhdvcgwwfaLciqrPCOmcMKlOqiWslyCV0wZVCZS
jabkOeinKVAyPTlESesNyArWKBWaB8QaYDwbkQ5Y76U9Ma5gwSghS1wc8vrNduZY
2StieESOiOs9LljXf5SqCC5nN9s7gs4qtCK7aZ3JIt4661Lh39LqyO5zxLnc78eL
USnJKHjTSreY2Vd1/TdNWyZhiim43wdrB+jpy6IoocTqyhYalkCz1iYdJn1arqtP
iIddPpczKxkHekFVj7/Kfa+ATFtdXIpivOBhhOT0oY8HukTd58bh/oUMrFt4BSuP
MQst0R9h1sanBE18XBPlXuIK51sm3AjjOGaQycl/Mzes+dMRgIP/KspAcnwwXHqG
gyZsF3VzliFTc9s0SyiAz2AxNTUnjd+LV3E0DUeivURa6V3pc+sFlQzi8PRxRaep
0gmhYcZsfwdDKZ/kbQyQdSWN48NxOLFke4fYjmoUtoyILa0NAHEqafeJkR5EiRTm
tZsL9H/3THEGWygYlXGGBo/J4w5jE3uL/8KkfeuZefzSo0Ujqu0pBALMTnGFLKRx
Mpw7JEqfUwqIVZ0Qh6q9yIcjr89qWv96UpBqRRIkFX5zOPN7B1BH8C89g8qy3Hyt
gm/5BTw4FPE0uAM9Nhsd
=icEX
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd update from Bruce Fields:
"A fairly small update this time around. Some cleanup, RDMA fixes,
overlayfs fixes, and a fix for an NFSv4 state bug.
The bigger deal for nfsd this time around was Jeff Layton's
already-merged i_version patches"
* tag 'nfsd-4.16' of git://linux-nfs.org/~bfields/linux:
svcrdma: Fix Read chunk round-up
NFSD: hide unused svcxdr_dupstr()
nfsd: store stat times in fill_pre_wcc() instead of inode times
nfsd: encode stat->mtime for getattr instead of inode->i_mtime
nfsd: return RESOURCE not GARBAGE_ARGS on too many ops
nfsd4: don't set lock stateid's sc_type to CLOSED
nfsd: Detect unhashed stids in nfsd4_verify_open_stid()
sunrpc: remove dead code in svc_sock_setbufsize
svcrdma: Post Receives in the Receive completion handler
nfsd4: permit layoutget of executable-only files
lockd: convert nlm_rqst.a_count from atomic_t to refcount_t
lockd: convert nlm_lockowner.count from atomic_t to refcount_t
lockd: convert nsm_handle.sm_count from atomic_t to refcount_t
1. don't pass garbage return codes back up the call chain (Mike Marshall)
2. fix stale inode test (Martin Brandenburg)
3. fix off-by-one errors (Xiongfeng Wang)
Also: add Martin as a reviewer in the Maintainers file.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaejneAAoJEM9EDqnrzg2+XhoQAIDF112mOwLwqDPmr4ty0g6/
gBcoHOrRFlYWPlS5aubjoZ3jFX2fAeNuHzYS4LIuqVKUdsC+oTKQ2URJ7KKpvLiK
6zOaz2Y4GLns2sa1ZUKli6nEBbPi6uwoF54FNbwt3b+97wpmJwlnXm9ztyt5REKA
zOHvLgJAcfGNZEJ7gyB1zjwllu4JeD0A4MoN4vJCtkKLAaNClywu4+V0jwZB+SSN
8QjDXNqkcD31ahWhQ/CaU4zXlxOOV+4ZR7/p5IKT693hEhV+ikTvmXy8g0+bksxj
L+FHmQMTO+GqCS5FxuBQd3v1IP5FkoHEmAwvr3C5aMlRAaVJ9eVVIZaC9CpOJBRB
S/CiaG2Mw8vx8VGOm8O93Z+xDi9tCYP8x4i7b5r62h0T9wSyHJSkSIUd6VIkCV9Q
c92bX/N3wHBvCPT+RC898plni5HsFpzs3vSs8hiaAICgp64sC8pIqVlZOAdMtJd8
RL4la/Fited/T+3BpaCTkmnvNk8Ktax7wHYsCt4gSyHN8WRvkzowgC5kV6S30Qlh
zfoXG0K50FcU8T5r3i8slvUHmsiyYxYwJIk/z1iDgXI7y4IIR6FGDxQmw5TxgNS7
+veTo6FCxon6QshtpAOeELCau7qNXhtlDdGqqm4+gDfMWoCn0Jem/LzdA2gPXCOr
iCDwHLiu6WXt7ZHTrgln
=xrih
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Mostly cleanups, but three bug fixes:
- don't pass garbage return codes back up the call chain (Mike
Marshall)
- fix stale inode test (Martin Brandenburg)
- fix off-by-one errors (Xiongfeng Wang)
Also add Martin as a reviewer in the Maintainers file"
* tag 'for-linus-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: reverse sense of is-inode-stale test in d_revalidate
orangefs: simplify orangefs_inode_is_stale
Orangefs: don't propogate whacky error codes
orangefs: use correct string length
orangefs: make orangefs_make_bad_inode static
orangefs: remove ORANGEFS_KERNEL_DEBUG
orangefs: remove gossip_ldebug and gossip_lerr
orangefs: make orangefs_client_debug_init static
MAINTAINERS: update orangefs list and add myself as reviewer
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWnx0Mvu3V2unywtrAQI3ng//Xdv2rxVjv4znzekb/EkE9QIakH3ET3wt
hBewQjaGkOWhZKgyE7DnhCMh7y6OrX/oVNtjPU8H7EEHDHVs+nyoGoDu282jlppr
qO7yMbxZwDtpja7O9hVtIViFZSqlEey/RCq1KKRUl/HDmyyOmAvOZHCpyowUqcYD
KqJs9Z2/onkP43rwmoKIQPEeKHxRfAs6pTiAG7fUPYC4d6aSskiN5K65N0g4dx4F
G6pDC/mIJWx2qeeI//CzSxnqhzWAhkozOs9UtvquSrIoNcYMSOQRHGne50n7OqkK
rZCttm4gSlrEU11cPDNExjKU4z8UM3tmVdudntC8wbng5PFCHTR7JB5nZu1bEjqw
TpIjb302QnUefzu1AGge03ZnysqDKKBAxKKwD1gYBHaj7Y2CrqP4lo+6QA4ePYTv
qD7nRZCiQ8rF3PJOYJ7xe944Jziktf6PhnOXyxOSNCv3IT90YD7meOR3MldMjny/
hM2ahYqfWXjLAjH20Q+B8z7ab9GDdVsBTl06w/ZX+RMrg5CNdDaYe0nfG/tS7H3A
oD7xIjUwWjqxMBqtXNUe/3GAOnU+ilEiKjq8gmNkBSjRlpO6SMxi02jOp66HwnRs
tD5qG3Bn2F3hdvEtwcKcS0cVWX511lLF5vkhlBhSbs/XkS+BXULr3vDsl5XclwAw
/07q8HsHlnM=
=fSB4
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20180208' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs updates from David Howells:
"Four fixes:
- add a missing put
- two fixes to reset the address iteration cursor correctly
- fix setting up the fileserver iteration cursor.
Two cleanups:
- remove some dead code
- rearrange a function to be more logically laid out
And one new feature:
- Support AFS dynamic root.
With this one should be able to do, say:
mkdir /afs
mount -t afs none /afs -o dyn
to create a dynamic root and then, provided you have keyutils
installed, do:
ls /afs/grand.central.org
and:
ls /afs/umich.edu
to list the root volumes of both those organisations' AFS cells
without requiring any other setup (the kernel upcall to a program
in the keyutils package to do DNS access as does NFS)"
* tag 'afs-next-20180208' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Support the AFS dynamic root
afs: Rearrange afs_select_fileserver() a little
afs: Remove unused code
afs: Fix server list handling
afs: Need to clear responded flag in addr cursor
afs: Fix missing cursor clearance
afs: Add missing afs_put_cell()
big ticket items slated for the next merge window.
On the CephFS side we have a large number of cap handling improvements,
a fix for our long-standing abuse of ->journal_info in ceph_readpages()
and yet another dentry pointer management patch.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJafGqnAAoJEEp/3jgCEfOLjNcH/R6G/xyytDMfxaN+D8DBqCPF
IaQM7RtgYJeRzDIXYYCkDEBPYqLcD2fjHLzFotFNLcgLdeUcSOyfg7NuCOWWq7o2
t4z6Ekyish3GWZLUmlSdPcToQ+xIlMRshU8ZmzCHTCzx8XjO+CAnCADp5dh8OKZx
mCpRX16sXdc6ozE1hsGKIkUoNrkdj8d3+HseZ2Uxb/4FZBNgH3cmmg7c5y6M+sp6
wT4NEES3baqq2v5cVfw7T+d4MNgRm4/JC1aBy1JBkQlmVFNGteQTT7yzo0X1AfJ+
+kcR10ddg0gD4WGYhL+iZlQCfwyMp7vouHQbgTOgt+rDCitjDy5r1BAamtxnZjM=
=ctaD
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.16-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"Things have been very quiet on the rbd side, as work continues on the
big ticket items slated for the next merge window.
On the CephFS side we have a large number of cap handling
improvements, a fix for our long-standing abuse of ->journal_info in
ceph_readpages() and yet another dentry pointer management patch"
* tag 'ceph-for-4.16-rc1' of git://github.com/ceph/ceph-client:
ceph: improving efficiency of syncfs
libceph: check kstrndup() return value
ceph: try to allocate enough memory for reserved caps
ceph: fix race of queuing delayed caps
ceph: delete unreachable code in ceph_check_caps()
ceph: limit rate of cap import/export error messages
ceph: fix incorrect snaprealm when adding caps
ceph: fix un-balanced fsc->writeback_count update
ceph: track read contexts in ceph_file_info
ceph: avoid dereferencing invalid pointer during cached readdir
ceph: use atomic_t for ceph_inode_info::i_shared_gen
ceph: cleanup traceless reply handling for rename
ceph: voluntarily drop Fx cap for readdir request
ceph: properly drop caps for setattr request
ceph: voluntarily drop Lx cap for link/rename requests
ceph: voluntarily drop Ax cap for requests that create new inode
rbd: whitelist RBD_FEATURE_OPERATIONS feature bit
rbd: don't NULL out ->obj_request in rbd_img_obj_parent_read_full()
rbd: use kmem_cache_zalloc() in rbd_img_request_create()
rbd: obj_request->completion is unused
Commit b9f5fb1800 ("cramfs: fix MTD dependency") did what it says.
Since commit 9059a3493e ("kconfig: fix relational operators for bool
and tristate symbols") it is possible to do it slightly better though.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is now only one caller left for svcxdr_dupstr() and this is inside
of an #ifdef, so we can get a warning when the option is disabled:
fs/nfsd/nfs4xdr.c:241:1: error: 'svcxdr_dupstr' defined but not used [-Werror=unused-function]
This changes the remaining caller to use a nicer IS_ENABLED() check,
which lets the compiler drop the unused code silently.
Fixes: e40d99e6183e ("NFSD: Clean up symlink argument XDR decoders")
Suggested-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The time values in stat and inode may differ for overlayfs and stat time
values are the correct ones to use. This is also consistent with the fact
that fill_post_wcc() also stores stat time values.
This means introducing a stat call that could fail, where previously we
were just copying values out of the inode. To be conservative about
changing behavior, we fall back to copying values out of the inode in
the error case. It might be better just to clear fh_pre_saved (though
note the BUG_ON in set_change_info).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The values of stat->mtime and inode->i_mtime may differ for overlayfs
and stat->mtime is the correct value to use when encoding getattr.
This is also consistent with the fact that other attr times are also
encoded from stat values.
Both callers of lease_get_mtime() already have the value of stat->mtime,
so the only needed change is that lease_get_mtime() will not overwrite
this value with inode->i_mtime in case the inode does not have an
exclusive lease.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A client that sends more than a hundred ops in a single compound
currently gets an rpc-level GARBAGE_ARGS error.
It would be more helpful to return NFS4ERR_RESOURCE, since that gives
the client a better idea how to recover (for example by splitting up the
compound into smaller compounds).
This is all a bit academic since we've never actually seen a reason for
clients to send such long compounds, but we may as well fix it.
While we're there, just use NFSD4_MAX_OPS_PER_COMPOUND == 16, the
constant we already use in the 4.1 case, instead of hard-coding 100.
Chances anyone actually uses even 16 ops per compound are small enough
that I think there's a neglible risk or any regression.
This fixes pynfs test COMP6.
Reported-by: "Lu, Xinyu" <luxy.fnst@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJae0mSAAoJEAAOaEEZVoIVs98P+wSbwfgLeyTufmrRYrD9kxfh
EQXfuvnJqPzRHLJIUXfwzTN3IV9RZ1434ci31lZvQE3PKrgb90QuBLiR6OIKULef
UqpYRmjsg7BfFBdAnyUR8xSmmeN94PjXQk7tG+YQn096HJVZ6cG5qCA8RjJ9dFoq
2haDcOfDU+3e8mbtrrF4doP6jGrVwV+okqRsshFBclQv62Kk3m7L5AjQINyZpTM5
ZKX5JIMOAmlJcHsz/2J1qLAIRQKsvEUbRLV43bzp3E03PuVFPhig3dVtpGPUe+Yi
OW0JX49hIoTCrQ4KZk6uweLG7ZpaSoppXggEi2ERNCUkCf3nhejLlScfye+yLx7f
sItgPkOYU0VVF70Y72XH1DbOekZr/XCLZdEEUNCS/P68hnyK0gBNC9zPGetlxMMi
wjjQ9Qe45vD2JFlrvhHrdUdCnxnE05zC9ckBrmM94uRwIfDR0WVgo6pfebfRkAJd
Wp4/PfbaySY7vk4oyaXlNxcDIH2NvWwYkioI/K9rRGbB2KjTdXonQojBy+rT0LeS
f3mufyZYyCxdwu3Wf8WO36H23L+4fseMthKIIPA0aL4wasB9LgD8gDnkyKx28DT4
S32tdK4UALC8SAVsPr+vSaMVzKOZmuNHac+XB2i+5lHl8G/n4M2a+JFTeR4CnKJ/
9LsBEBL5Oj7ZXL7lfFIO
=iEKM
-----END PGP SIGNATURE-----
Merge tag 'iversion-v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull inode->i_version cleanup from Jeff Layton:
"Goffredo went ahead and sent a patch to rename this function, and
reverse its sense, as we discussed last week.
The patch is very straightforward and I figure it's probably best to
go ahead and merge this to get the API as settled as possible"
* tag 'iversion-v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
iversion: Rename make inode_cmp_iversion{+raw} to inode_eq_iversion{+raw}
Pull UDF and ext2 fixlets from Jan Kara:
"A UDF fix and an ext2 cleanup"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: drop unneeded newline
udf: Sanitize nanoseconds for time stamps
The last two updates to MS-SMB2 protocol documentation added various
flags and structs (especially relating to SMB3.1.1 tree connect).
Add missing defines and structs to smb2pdu.h
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Although at least one of these was an overly strict sparse warning
in the new smbdirect code, it is cleaner to fix - so no warnings.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
This bug was fixed before, but came up again with the latest
compiler in another function:
fs/cifs/cifssmb.c: In function 'CIFSSMBSetEA':
fs/cifs/cifssmb.c:6362:3: error: 'strncpy' offset 8 is out of the bounds [0, 4] [-Werror=array-bounds]
strncpy(parm_data->list[0].name, ea_name, name_len);
Let's apply the same fix that was used for the other instances.
Fixes: b2a3ad9ca5 ("cifs: silence compiler warnings showing up with gcc-4.7.0")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steve French <smfrench@gmail.com>
Allow dumping out debug information on dialect, signing, unix extensions
and encryption
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
The pipe buffer limits are accessed without any locking, and may be
changed at any time by the sysctl handlers. In theory this could cause
problems for expressions like the following:
pipe_user_pages_hard && user_bufs > pipe_user_pages_hard
... since the assembly code might reference the 'pipe_user_pages_hard'
memory location multiple times, and if the admin removes the limit by
setting it to 0, there is a very brief window where processes could
incorrectly observe the limit to be exceeded.
Fix this by loading the limits with READ_ONCE() prior to use.
Link: http://lkml.kernel.org/r/20180111052902.14409-8-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
round_pipe_size() calculates the number of pages the requested size
corresponds to, then rounds the page count up to the next power of 2.
However, it also rounds everything < PAGE_SIZE up to PAGE_SIZE.
Therefore, there's no need to actually translate the size into a page
count; we just need to round the size up to the next power of 2.
We do need to verify the size isn't greater than (1 << 31), since on
32-bit systems roundup_pow_of_two() would be undefined in that case. But
that can just be combined with the UINT_MAX check which we need anyway
now.
Finally, update pipe_set_size() to not redundantly check the return value
of round_pipe_size() for the "invalid size" case twice.
Link: http://lkml.kernel.org/r/20180111052902.14409-7-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A pipe's size is represented as an 'unsigned int'. As expected, writing a
value greater than UINT_MAX to /proc/sys/fs/pipe-max-size fails with
EINVAL. However, the F_SETPIPE_SZ fcntl silently truncates such values to
32 bits, rather than failing with EINVAL as expected. (It *does* fail
with EINVAL for values above (1 << 31) but <= UINT_MAX.)
Fix this by moving the check against UINT_MAX into round_pipe_size() which
is called in both cases.
Link: http://lkml.kernel.org/r/20180111052902.14409-6-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With pipe-user-pages-hard set to 'N', users were actually only allowed up
to 'N - 1' buffers; and likewise for pipe-user-pages-soft.
Fix this to allow up to 'N' buffers, as would be expected.
Link: http://lkml.kernel.org/r/20180111052902.14409-5-ebiggers3@gmail.com
Fixes: b0b91d18e2 ("pipe: fix limit checking in pipe_set_size()")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Willy Tarreau <w@1wt.eu>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pipe-user-pages-hard and pipe-user-pages-soft are only supposed to apply
to unprivileged users, as documented in both Documentation/sysctl/fs.txt
and the pipe(7) man page.
However, the capabilities are actually only checked when increasing a
pipe's size using F_SETPIPE_SZ, not when creating a new pipe. Therefore,
if pipe-user-pages-hard has been set, the root user can run into it and be
unable to create pipes. Similarly, if pipe-user-pages-soft has been set,
the root user can run into it and have their pipes limited to 1 page each.
Fix this by allowing the privileged override in both cases.
Link: http://lkml.kernel.org/r/20180111052902.14409-4-ebiggers3@gmail.com
Fixes: 759c01142a ("pipe: limit the per-user amount of pages allocated in pipes")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pipe_proc_fn() is no longer needed, as it only calls through to
proc_dopipe_max_size(). Just put proc_dopipe_max_size() in the ctl_table
entry directly, and remove the unneeded EXPORT_SYMBOL() and the ENOSYS
stub for it.
(The reason the ENOSYS stub isn't needed is that the pipe-max-size
ctl_table entry is located directly in 'kern_table' rather than being
registered separately. Therefore, the entry is already only defined when
the kernel is built with sysctl support.)
Link: http://lkml.kernel.org/r/20180111052902.14409-3-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "pipe: buffer limits fixes and cleanups", v2.
This series simplifies the sysctl handler for pipe-max-size and fixes
another set of bugs related to the pipe buffer limits:
- The root user wasn't allowed to exceed the limits when creating new
pipes.
- There was an off-by-one error when checking the limits, so a limit of
N was actually treated as N - 1.
- F_SETPIPE_SZ accepted values over UINT_MAX.
- Reading the pipe buffer limits could be racy.
This patch (of 7):
Before validating the given value against pipe_min_size,
do_proc_dopipe_max_size_conv() calls round_pipe_size(), which rounds the
value up to pipe_min_size. Therefore, the second check against
pipe_min_size is redundant. Remove it.
Link: http://lkml.kernel.org/r/20180111052902.14409-2-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7994e6f725 ("vfs: Move waiting for inode writeback from
end_writeback() to evict_inode()") removed inode_sync_wait() from
end_writeback() and commit dbd5768f87 ("vfs: Rename end_writeback() to
clear_inode()") renamed end_writeback() to clear_inode().
After these patches there is no sleeping operation in clear_inode().
So, remove might_sleep() from it.
Link: http://lkml.kernel.org/r/20171108004354.40308-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When creating a file inside a directory that has the setgid flag set, give
the new file the group ID of the parent, and also the setgid flag if it is
a directory itself.
Link: http://lkml.kernel.org/r/20171204192705.GA6101@debian.home
Signed-off-by: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The superblock and segment timestamps are used only internally in nilfs2
and can be read out using sysfs.
Since we are using the old 'get_seconds()' interface and store the data
as timestamps, the behavior differs slightly between 64-bit and 32-bit
kernels, the latter will show incorrect timestamps after 2038 in sysfs,
and presumably fail completely in 2106 as comparisons go wrong.
This changes nilfs2 to use time64_t with ktime_get_real_seconds() to
handle timestamps, making the behavior consistent and correct on both
32-bit and 64-bit machines.
The on-disk format already uses 64-bit timestamps, so nothing changes
there.
Link: http://lkml.kernel.org/r/20180122211050.1286441-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If vm.max_map_count bumped above 2^26 (67+ mil) and system has enough RAM
to allocate all the VMAs (~12.8 GB on Fedora 27 with 200-byte VMAs), then
it should be possible to overflow 32-bit "size", pass paranoia check,
allocate very little vmalloc space and oops while writing into vmalloc
guard page...
But I didn't test this, only coredump of regular process.
Link: http://lkml.kernel.org/r/20180112203427.GA9109@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A single character (line break) should be put into a sequence. Thus use
the corresponding function "seq_putc".
This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/04fb69fe-d820-9141-820f-07e9a48f4635@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rearrange args for smaller code.
lookup revolves around memcmp() which gets len 3rd arg, so propagate
length as 3rd arg.
readdir and lookup add additional arg to VFS ->readdir and ->lookup, so
better add it to the end.
Space savings on x86_64:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-18 (-18)
Function old new delta
proc_readdir 22 13 -9
proc_lookup 18 9 -9
proc_match() is smaller if not inlined, I promise!
Link: http://lkml.kernel.org/r/20180104175958.GB5204@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
use_pde() is used at every open/read/write/... of every random /proc
file. Negative refcount happens only if PDE is being deleted by module
(read: never). So it gets "likely".
unuse_pde() gets "unlikely" for the same reason.
close_pdeo() gets unlikely as the completion is filled only if there is a
race between PDE removal and close() (read: never ever).
It even saves code on x86_64 defconfig:
add/remove: 0/0 grow/shrink: 1/2 up/down: 2/-20 (-18)
Function old new delta
close_pdeo 183 185 +2
proc_reg_get_unmapped_area 119 111 -8
proc_reg_poll 85 73 -12
Link: http://lkml.kernel.org/r/20180104175657.GA5204@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/self inode numbers, value of proc_inode_cache and st_nlink of
/proc/$TGID are fixed constants.
Link: http://lkml.kernel.org/r/20180103184707.GA31849@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct proc_dir_entry became bit messy over years:
* move 16-bit ->mode_t before namelen to get rid of padding
* make ->in_use first field: it seems to be most used resulting in
smaller code on x86_64 (defconfig):
add/remove: 0/0 grow/shrink: 7/13 up/down: 24/-67 (-43)
Function old new delta
proc_readdir_de 451 455 +4
proc_get_inode 282 286 +4
pde_put 65 69 +4
remove_proc_subtree 294 297 +3
remove_proc_entry 297 300 +3
proc_register 295 298 +3
proc_notify_change 94 97 +3
unuse_pde 27 26 -1
proc_reg_write 89 85 -4
proc_reg_unlocked_ioctl 85 81 -4
proc_reg_read 89 85 -4
proc_reg_llseek 87 83 -4
proc_reg_get_unmapped_area 123 119 -4
proc_entry_rundown 139 135 -4
proc_reg_poll 91 85 -6
proc_reg_mmap 79 73 -6
proc_get_link 55 49 -6
proc_reg_release 108 101 -7
proc_reg_open 298 291 -7
close_pdeo 228 218 -10
* move writeable fields together to a first cacheline (on x86_64),
those include
* ->in_use: reference count, taken every open/read/write/close etc
* ->count: reference count, taken at readdir on every entry
* ->pde_openers: tracks (nearly) every open, dirtied
* ->pde_unload_lock: spinlock protecting ->pde_openers
* ->proc_iops, ->proc_fops, ->data: writeonce fields,
used right together with previous group.
* other rarely written fields go into 1st/2nd and 2nd/3rd cacheline on
32-bit and 64-bit respectively.
Additionally on 32-bit, ->subdir, ->subdir_node, ->namelen, ->name go
fully into 2nd cacheline, separated from writeable fields. They are all
used during lookup.
Link: http://lkml.kernel.org/r/20171220215914.GA7877@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit df04abfd18 ("fs/proc/kcore.c: Add bounce buffer for ktext
data") added a bounce buffer to avoid hardened usercopy checks. Copying
to the bounce buffer was implemented with a simple memcpy() assuming
that it is always valid to read from kernel memory iff the
kern_addr_valid() check passed.
A simple, but pointless, test case like "dd if=/proc/kcore of=/dev/null"
now can easily crash the kernel, since the former execption handling on
invalid kernel addresses now doesn't work anymore.
Also adding a kern_addr_valid() implementation wouldn't help here. Most
architectures simply return 1 here, while a couple implemented a page
table walk to figure out if something is mapped at the address in
question.
With DEBUG_PAGEALLOC active mappings are established and removed all the
time, so that relying on the result of kern_addr_valid() before
executing the memcpy() also doesn't work.
Therefore simply use probe_kernel_read() to copy to the bounce buffer.
This also allows to simplify read_kcore().
At least on s390 this fixes the observed crashes and doesn't introduce
warnings that were removed with df04abfd18 ("fs/proc/kcore.c: Add
bounce buffer for ktext data"), even though the generic
probe_kernel_read() implementation uses uaccess functions.
While looking into this I'm also wondering if kern_addr_valid() could be
completely removed...(?)
Link: http://lkml.kernel.org/r/20171202132739.99971-1-heiko.carstens@de.ibm.com
Fixes: df04abfd18 ("fs/proc/kcore.c: Add bounce buffer for ktext data")
Fixes: f5509cc18d ("mm: Hardened usercopy")
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dentry name can be evaluated later, right before calling into VFS.
Also, spend less time under ->mmap_sem.
Link: http://lkml.kernel.org/r/20171110163034.GA2534@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Iterators aren't necessary as you can just grab the first entry and delete
it until no entries left.
Link: http://lkml.kernel.org/r/20171121191121.GA20757@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current code does:
if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
However sscanf() is broken garbage.
It silently accepts whitespace between format specifiers
(did you know that?).
It silently accepts valid strings which result in integer overflow.
Do not use sscanf() for any even remotely reliable parsing code.
OK
# readlink '/proc/1/map_files/55a23af39000-55a23b05b000'
/lib/systemd/systemd
broken
# readlink '/proc/1/map_files/ 55a23af39000-55a23b05b000'
/lib/systemd/systemd
broken
# readlink '/proc/1/map_files/55a23af39000-55a23b05b000 '
/lib/systemd/systemd
very broken
# readlink '/proc/1/map_files/1000000000000000055a23af39000-55a23b05b000'
/lib/systemd/systemd
Andrei said:
: This patch breaks criu. It was a bug in criu. And this bug is on a minor
: path, which works when memfd_create() isn't available. It is a reason why
: I ask to not backport this patch to stable kernels.
:
: In CRIU this bug can be triggered, only if this patch will be backported
: to a kernel which version is lower than v3.16.
Link: http://lkml.kernel.org/r/20171120212706.GA14325@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
READ_ONCE and WRITE_ONCE are useless when there is only one read/write
is being made.
Link: http://lkml.kernel.org/r/20171120204033.GA9446@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PROC_NUMBUF is 13 which is enough for "negative int + \n + \0".
However PIDs and TGIDs are never negative and newline is not a concern,
so use just 10 per integer.
Link: http://lkml.kernel.org/r/20171120203005.GA27743@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a dentry is deleted, then a dentry is recreated with the same handle
but a different type (i.e. it was a file and now it's a symlink), then
its a different inode. The check was backwards, so d_revalidate would
not have noticed.
Due to the design of the OrangeFS server, this is rather unlikely.
It's also possible for the dentry to be deleted and recreated with the
same type. This would be undetectable. It's a bit of a ship of
Theseus.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Check whether this is a new inode at location of call.
Raises the question of what to do with an unknown inode type. Old code
would've marked the inode bad and returned ESTALE.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When we get an error return code from userspace (the client-core)
we check to make sure it is a valid code.
This patch maps the whacky return code to -EINVAL instead of
propagating garbage back up the call chain potentially resulting
in a hard-to-find train-wreck.
The client-core doesn't have any business returning whacky return
codes, but if it does, we don't want the kernel to crash as a result.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
gcc-8 reports
fs/orangefs/dcache.c: In function 'orangefs_d_revalidate':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 256 equals destination size [-Wstringop-truncation]
fs/orangefs/namei.c: In function 'orangefs_rename':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 256 equals destination size [-Wstringop-truncation]
fs/orangefs/super.c: In function 'orangefs_mount':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 256 equals destination size [-Wstringop-truncation]
We need one less byte or call strlcpy() to make it a nul-terminated
string.
Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
It wasn't possible to enable it, and it would've had very little effect.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
gossip_ldebug is unused.
gossip_lerr is used in two places. The messages are unique so line
numbers are unnecessary.
Also remove support for compiling gossip messages out. It wasn't
possible to enable it anyway.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJacX62AAoJEAhfPr2O5OEVjKYP/R3v+c8ztiHzaeibcZZ8IFNl
58E0Y0yGa8OpoGJx9uqtEOamQmZoHhACfId7joIp/Jv38bgWAdbxOmk3Y4FDCFqG
1bRrpnnmvlfabiMMfLpURLqKhf7rJMtErZkrnmmqg9P/lEMohaZUJAsgBZNfJM8l
fZeacSnCSpzlxVcUb9Bf4vWhLk39R+xFzvFrwzbVUIHf3bDVpf4S4kNorMkhSZSF
HaISYXqVMhpKca7CngVKytbfacUStUY01cXcjdMuB/sD7ySwdtKogbPMvrOSaexz
G/8MB+sGT1JKUgIlh6Qv8hX805KuxBgfP19XSOH46nNU8KbYegdGhN5QXlokwI1m
dAOiozkU93r5yBZl6QzkN3uwXe492PoLgczifg97pzAJP0BfWeFStkYqlugLTwwC
Slmr7g3FZVJajbPl6WyioAGW7xfqBF7ftScZOHYxmhy41CWCGKJctmsJOjncyz5O
GInEIP3KR4CgjR+iM1LoKvE+OvVo4kRc7hrcUsjQNsbfBn6xiixjwH+5M+UVvezA
6UQpmtWGg4pX1djb8j8f6mKF8KZM12Pp3jb4Rl1cLsytN5BOBKaMEKdV3rgL+19P
Yo0x/1wK/unkI20Om71vYyQ0nXVF9j7Tpeij5u0M57TeTVYCwloQgHmrcvQJdo8+
Pqw5XEUiDpAIjvKp0XGh
=H9AS
-----END PGP SIGNATURE-----
Merge tag 'media/v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- videobuf2 was moved to a media/common dir, as it is now used by the
DVB subsystem too
- Digital TV core memory mapped support interface
- new sensor driver: ov7740
- several improvements at ddbridge driver
- new V4L2 driver: IPU3 CIO2 CSI-2 receiver unit, found on some Intel
SoCs
- new tuner driver: tda18250
- finally got rid of all LIRC staging drivers
- as we don't have old lirc drivers anymore, restruct the lirc device
code
- add support for UVC metadata
- add a new staging driver for NVIDIA Tegra Video Decoder Engine
- DVB kAPI headers moved to include/media
- synchronize the kAPI and uAPI for the DVB subsystem, removing the gap
for non-legacy APIs
- reduce the kAPI gap for V4L2
- lots of other driver enhancements, cleanups, etc.
* tag 'media/v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (407 commits)
media: v4l2-compat-ioctl32.c: make ctrl_is_pointer work for subdevs
media: v4l2-compat-ioctl32.c: refactor compat ioctl32 logic
media: v4l2-compat-ioctl32.c: don't copy back the result for certain errors
media: v4l2-compat-ioctl32.c: drop pr_info for unknown buffer type
media: v4l2-compat-ioctl32.c: copy clip list in put_v4l2_window32
media: v4l2-compat-ioctl32.c: fix ctrl_is_pointer
media: v4l2-compat-ioctl32.c: copy m.userptr in put_v4l2_plane32
media: v4l2-compat-ioctl32.c: avoid sizeof(type)
media: v4l2-compat-ioctl32.c: move 'helper' functions to __get/put_v4l2_format32
media: v4l2-compat-ioctl32.c: fix the indentation
media: v4l2-compat-ioctl32.c: add missing VIDIOC_PREPARE_BUF
media: v4l2-ioctl.c: don't copy back the result for -ENOTTY
media: v4l2-ioctl.c: use check_fmt for enum/g/s/try_fmt
media: vivid: fix module load error when enabling fb and no_error_inj=1
media: dvb_demux: improve debug messages
media: dvb_demux: Better handle discontinuity errors
media: cxusb, dib0700: ignore XC2028_I2C_FLUSH
media: ts2020: avoid integer overflows on 32 bit machines
media: i2c: ov7740: use gpio/consumer.h instead of gpio.h
media: entity: Add a nop variant of media_entity_cleanup
...
* Require struct page by default for filesystem DAX to remove a number of
surprising failure cases. This includes failures with direct I/O, gdb and
fork(2).
* Add support for the new Platform Capabilities Structure added to the NFIT in
ACPI 6.2a. This new table tells us whether the platform supports flushing
of CPU and memory controller caches on unexpected power loss events.
* Revamp vmem_altmap and dev_pagemap handling to clean up code and better
support future future PCI P2P uses.
* Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has become
out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL spec, and
instead rely on the generic ND_CMD_CALL approach used by the two other IOCTL
families, NVDIMM_FAMILY_{HPE,MSFT}.
* Enhance nfit_test so we can test some of the new things added in version 1.6
of the DSM specification. This includes testing firmware download and
simulating the Last Shutdown State (LSS) status.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaeOg0AAoJEJ/BjXdf9fLBAFoQAI/IgcgJ2h9lfEpgjBRTC44t
2p8dxwT1Ofw3Y1aR/tI8nYRXjRtAGuP4UIeRVnb1CL/N7PagJyoMGU+6hmzg+ptY
c7cEDvw6nZOhrFwXx/xn7R53sYG8zH+UE6+jTR/PP/G4mQJfFCg4iF9R72Y7z0n7
aurf82Kz137NPUy6dNr4V9bmPMJWAaOci9WOj5SKddR5ZSNbjoxylTwQRvre5y4r
7HQTScEkirABOdSf1JoXTSUXCH/RC9UFFXR03ScHstGb1HjCj3KdcicVc50Q++Ub
qsEudhE6i44PEW1Hh4Qkg6hjHMEa8qHP+ShBuRuVaUmlghYTQn66niJAYLZilwdz
EVjE7vR+toHA5g3YCalEmYVutUEhIDkh/xfpd7vM6ZorUGJy95a2elEJs2fHBffC
gEhnCip7FROPcK5RDNUM8hBgnG/q5wwWPQMKY+6rKDZQx3mXssCrKp2Vlx7kBwMG
rpblkEpYjPonbLEHxsSU8yTg9Uq55ciIWgnOToffcjZvjbihi8WUVlHcwHUMPf/o
DWElg+4qmG0Sdd4S2NeAGwTl1Ewrf2RrtUGMjHtH4OUFs1wo6ZmfrxFzzMfoZ1Od
ko/s65v4uwtTzECh2o+XQaNsReR5YETXxmA40N/Jpo7/7twABIoZ/ASvj/3ZBYj+
sie+u2rTod8/gQWSfHpJ
=MIMX
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Ross Zwisler:
- Require struct page by default for filesystem DAX to remove a number
of surprising failure cases. This includes failures with direct I/O,
gdb and fork(2).
- Add support for the new Platform Capabilities Structure added to the
NFIT in ACPI 6.2a. This new table tells us whether the platform
supports flushing of CPU and memory controller caches on unexpected
power loss events.
- Revamp vmem_altmap and dev_pagemap handling to clean up code and
better support future future PCI P2P uses.
- Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
spec, and instead rely on the generic ND_CMD_CALL approach used by
the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.
- Enhance nfit_test so we can test some of the new things added in
version 1.6 of the DSM specification. This includes testing firmware
download and simulating the Last Shutdown State (LSS) status.
* tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
acpi, nfit: fix register dimm error handling
libnvdimm, namespace: make min namespace size 4K
tools/testing/nvdimm: force nfit_test to depend on instrumented modules
libnvdimm/nfit_test: adding support for unit testing enable LSS status
libnvdimm/nfit_test: add firmware download emulation
nfit-test: Add platform cap support from ACPI 6.2a to test
libnvdimm: expose platform persistence attribute for nd_region
acpi: nfit: add persistent memory control flag for nd_region
acpi: nfit: Add support for detect platform CPU cache flush on power loss
device-dax: Fix trailing semicolon
libnvdimm, btt: fix uninitialized err_lock
dax: require 'struct page' by default for filesystem dax
ext2: auto disable dax instead of failing mount
ext4: auto disable dax instead of failing mount
mm, dax: introduce pfn_t_special()
mm: Fix devm_memremap_pages() collision handling
mm: Fix memory size alignment in devm_memremap_pages_release()
memremap: merge find_dev_pagemap into get_dev_pagemap
memremap: change devm_memremap_pages interface to use struct dev_pagemap
...
Support the AFS dynamic root which is a pseudo-volume that doesn't connect
to any server resource, but rather is just a root directory that
dynamically creates mountpoint directories where the name of such a
directory is the name of the cell.
Such a mount can be created thus:
mount -t afs none /afs -o dyn
Dynamic root superblocks aren't shared except by bind mounts and
propagation. Cell root volumes can then be mounted by referring to them by
name, e.g.:
ls /afs/grand.central.org/
ls /afs/.grand.central.org/
The kernel will upcall to consult the DNS if the address wasn't supplied
directly.
Signed-off-by: David Howells <dhowells@redhat.com>
Rearrange afs_select_fileserver() a little to put the use_server chunk
before the next_server chunk so that with the removal of a couple of gotos
the main path through the function is all one sequence.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix server list handling in the following ways:
(1) In afs_alloc_volume(), remove duplicate server list build code. This
was already done by afs_alloc_server_list() which afs_alloc_volume()
previously called. This just results in twice as many VL RPCs.
(2) In afs_deliver_vl_get_entry_by_name_u(), use the number of server
records indicated by ->nServers in the UVLDB record returned by the
VL.GetEntryByNameU RPC call rather than scanning all NMAXNSERVERS
slots. Unused slots may contain garbage.
(3) In afs_alloc_server_list(), don't stop converting a UVLDB record into
a server list just because we can't look up one of the servers. Just
skip that server and go on to the next. If we can't look up any of
the servers then we'll fail at the end.
Without this patch, an attempt to view the umich.edu root cell using
something like "ls /afs/umich.edu" on a dynamic root (future patch) mount
or an autocell mount will result in ENOMEDIUM. The failure is due to kafs
not stopping after nServers'worth of records have been read, but then
trying to access a server with a garbage UUID and getting an error, which
aborts the server list build.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Reported-by: Jonathan Billings <jsbillings@jsbillings.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: stable@vger.kernel.org
In afs_select_fileserver(), we need to clear the ->responded flag in the
address list when reusing it. We should also clear it in
afs_select_current_fileserver().
To this end, just memset() the object before initialising it.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: stable@vger.kernel.org
afs_select_fileserver() ends the address cursor it is using in the case in
which we get some sort of network error and run out of addresses to iterate
through, before it jumps to try the next server. This also needs to be
done when the server aborts with some sort of error that means we should
try the next server.
Fix this by:
(1) Move the iterate_address afs_end_cursor() call to the next_server
case.
(2) End the cursor in the failed case.
(3) Make afs_end_cursor() clear the ->begun flag and ->addr pointer in the
address cursor.
(4) Make afs_end_cursor() able to be called on an already cleared cursor.
Without this, something like the following oops may occur:
AFS: Assertion failed
18446612134397189888 == 0 is false
0xffff88007c279f00 == 0x0 is false
------------[ cut here ]------------
kernel BUG at fs/afs/rotate.c:360!
RIP: 0010:afs_select_fileserver+0x79b/0xa30 [kafs]
Call Trace:
afs_statfs+0xcc/0x180 [kafs]
? p9_client_statfs+0x9e/0x110 [9pnet]
? _cond_resched+0x19/0x40
statfs_by_dentry+0x6d/0x90
vfs_statfs+0x1b/0xc0
user_statfs+0x4b/0x80
SYSC_statfs+0x15/0x30
SyS_statfs+0xe/0x10
entry_SYSCALL_64_fastpath+0x20/0x83
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: stable@vger.kernel.org
afs_alloc_volume() needs to release the cell ref it obtained in the case of
an error. Fix this by adding an afs_put_cell() call into the error path.
This can triggered when a lookup for a cell in a dynamic root or an
autocell mount returns an error whilst trying to look up the server (such
as ENOMEDIUM). This results in an assertion failure oops when the module
is unloaded due to outstanding refs on a cell record.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: stable@vger.kernel.org
There's no point I can see to
stp->st_stid.sc_type = NFS4_CLOSED_STID;
given release_lock_stateid immediately sets sc_type to 0.
That set of sc_type to 0 should be enough to prevent it being used where
we don't want it to be; NFS4_CLOSED_STID should only be needed for
actual open stateid's that are actually closed.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The state of the stid is guaranteed by 2 locks:
- The nfs4_client 'cl_lock' spinlock
- The nfs4_ol_stateid 'st_mutex' mutex
so it is quite possible for the stid to be unhashed after lookup,
but before calling nfsd4_lock_ol_stateid(). So we do need to check
for a zero value for 'sc_type' in nfsd4_verify_open_stid().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Checuk Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Fixes: 659aefb68e "nfsd: Ensure we don't recognise lock stateids..."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
- Print scrub build status in the xfs build info.
- Explicitly call out the remaining two scenarios where we don't
support
reflink and never have.
- Remove EXPERIMENTAL tag from reverse mapping btree!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJaeJjKAAoJEPh/dxk0SrTrZP8P/RT0bcKc1PkmonX6rZBYa9OB
Mz5X7TpVRsXtZPtGSNM3IBIubjIVEZ/f3s5CZefN08uV8s+AFBjEAdHmeAiGtT/X
qakQyvsBJ3mEyVsMyzuI7eu4TU3/5Xad7kSp9TFPnXfW8z09Z4GygyGVJPRqpKRQ
liFzh8BIVgS/IFcpTL+6wKEHdAHEuyz6u/78ylgCtLMuiNiMY1mYv/+U2f7dEV3u
yiRY4oHGQfOiw1aXy3EO2WUdSKcAQwIJIEsLOllYQRe3f5W2milflFCJF9RoEEuE
OLmur4PBwFWpTfLVl1BqGa6rr/nhaY1y7Lyy3mVrmv0QiHlnNM/BQ5UKICZJdx5O
8Ai4ZyaJ5Q/nQxA6USOBHSlkeexMOH82i7gJCCfPtYqW1l0QjStLcoTYjWXa/0u9
ULEkdnocNm/HSCIGocFrd6dzOKR8TxJDVh3DxIFo8VjTj/XI57+ePfbZT7J+0vuB
elhKcho87xKHeF1RQfsVdgh+518GGAXp5zZjAJ3P/6GpxuB9sa+ShEEtR7OzSf0K
sfkXw3P/tH9ladBxWvMC6Gx0tSUSUTAUeYSbfOC1wRio7iI7sf8Gl8SkU65y4RdE
ZhQp8M4i2+vt9JS/E/mbAVxKIn1iF7L9ZiWlycJXyuqFf7bv1uBXG+tTE7lM7nJA
YjSmXBWN5j6kxQeUR0NE
=U54J
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull more xfs updates from Darrick Wong:
"As promised, here's a (much smaller) second pull request for the
second week of the merge cycle. This time around we have a couple
patches shutting off unsupported fs configurations, and a couple of
cleanups.
Last, we turn off EXPERIMENTAL for the reverse mapping btree, since
the primary downstream user of that information (online fsck) is now
upstream and I haven't seen any major failures in a few kernel
releases.
Summary:
- Print scrub build status in the xfs build info.
- Explicitly call out the remaining two scenarios where we don't
support reflink and never have.
- Remove EXPERIMENTAL tag from reverse mapping btree!"
* tag 'xfs-4.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: remove experimental tag for reverse mapping
xfs: don't allow reflink + realtime filesystems
xfs: don't allow DAX on reflink filesystems
xfs: add scrub to XFS_BUILD_OPTIONS
xfs: fix u32 type usage in sb validation function
Pull overlayfs updates from Miklos Szeredi:
"This work from Amir adds NFS export capability to overlayfs. NFS
exporting an overlay filesystem is a challange because we want to keep
track of any copy-up of a file or directory between encoding the file
handle and decoding it.
This is achieved by indexing copied up objects by lower layer file
handle. The index is already used for hard links, this patchset
extends the use to NFS file handle decoding"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (51 commits)
ovl: check ERR_PTR() return value from ovl_encode_fh()
ovl: fix regression in fsnotify of overlay merge dir
ovl: wire up NFS export operations
ovl: lookup indexed ancestor of lower dir
ovl: lookup connected ancestor of dir in inode cache
ovl: hash non-indexed dir by upper inode for NFS export
ovl: decode pure lower dir file handles
ovl: decode indexed dir file handles
ovl: decode lower file handles of unlinked but open files
ovl: decode indexed non-dir file handles
ovl: decode lower non-dir file handles
ovl: encode lower file handles
ovl: copy up before encoding non-connectable dir file handle
ovl: encode non-indexed upper file handles
ovl: decode connected upper dir file handles
ovl: decode pure upper file handles
ovl: encode pure upper file handles
ovl: document NFS export
vfs: factor out helpers d_instantiate_anon() and d_alloc_anon()
ovl: store 'has_upper' and 'opaque' as bit flags
...
Commit 4fde46f0cc ("Btrfs: free the stale device") introduced
btrfs_free_stale_device which iterates the device lists for all
registered btrfs filesystems and deletes those devices which aren't
mounted. In a btrfs_devices structure has only 1 device attached to it
and it is unused then btrfs_free_stale_devices will proceed to also free
the btrfs_fs_devices struct itself. Currently this leads to a use after
free since list_for_each_entry will try to perform a check on the
already freed memory to see if it has to terminate the loop.
The fix is to use 'break' when we know we are freeing the current
fs_devs.
Fixes: 4fde46f0cc ("Btrfs: free the stale device")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Another fix for an issue reported by 0-day robot.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 8ed5eec9d6 ("ovl: encode pure upper file handles")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
A re-factoring patch in NFS export series has passed the wrong argument
to ovl_get_inode() causing a regression in the very recent fix to
fsnotify of overlay merge dir.
The regression has caused merge directory inodes to be hashed by upper
instead of lower real inode, when NFS export and directory indexing is
disabled. That caused an inotify watch to become obsolete after directory
copy up and drop caches.
LTP test inotify07 was improved to catch this regression.
The regression also caused multiple redirect dirs to same origin not to
be detected on lookup with NFS export disabled. An xfstest was added to
cover this case.
Fixes: 0aceb53e73 ("ovl: do not pass overlay dentry to ovl_get_inode()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlp2R3AACgkQ8vlZVpUN
gaOIdAgApEdlFR2Gf93z2hMj5HxVL5rjkuPJVtVkKu0eH2HMQJyxNmjymrRfuFmM
8W1CrEvVKi5Aj6r8q4KHIdVV247Ya0SVEhLwKM0LX4CvlZUXmwgCmZ/MPDTXA1eq
C4vPVuJAuSNGNVYDlDs3+NiMHINGNVnBVQQFSPBP9P+iNWPD7o486712qaF8maVn
RbfbQ2rWtOIRdlAOD1U5WqgQku59lOsmHk2pc0+X4LHCZFpMoaO80JVjENPAw+BF
daRt6TX+WljMyx6DRIaszqau876CJhe/tqlZcCLOkpXZP0jJS13yodp26dVQmjCh
w8YdiY7uHK2D+S/8eyj7h7DIwzu3vg==
=ZjQP
-----END PGP SIGNATURE-----
Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
"Refactor support for encrypted symlinks to move common code to fscrypt"
Ted also points out about the merge:
"This makes the f2fs symlink code use the fscrypt_encrypt_symlink()
from the fscrypt tree. This will end up dropping the kzalloc() ->
f2fs_kzalloc() change, which means the fscrypt-specific allocation
won't get tested by f2fs's kmalloc error injection system; which is
fine"
* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: (26 commits)
fscrypt: fix build with pre-4.6 gcc versions
fscrypt: remove 'ci' parameter from fscrypt_put_encryption_info()
fscrypt: document symlink length restriction
fscrypt: fix up fscrypt_fname_encrypted_size() for internal use
fscrypt: define fscrypt_fname_alloc_buffer() to be for presented names
fscrypt: calculate NUL-padding length in one place only
fscrypt: move fscrypt_symlink_data to fscrypt_private.h
fscrypt: remove fscrypt_fname_usr_to_disk()
ubifs: switch to fscrypt_get_symlink()
ubifs: switch to fscrypt ->symlink() helper functions
ubifs: free the encrypted symlink target
f2fs: switch to fscrypt_get_symlink()
f2fs: switch to fscrypt ->symlink() helper functions
ext4: switch to fscrypt_get_symlink()
ext4: switch to fscrypt ->symlink() helper functions
fscrypt: new helper function - fscrypt_get_symlink()
fscrypt: new helper functions for ->symlink()
fscrypt: trim down fscrypt.h includes
fscrypt: move fscrypt_is_dot_dotdot() to fs/crypto/fname.c
fscrypt: move fscrypt_valid_enc_modes() to fscrypt_private.h
...
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass all
hardened usercopy checks since these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over the
next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJabvleAAoJEIly9N/cbcAmO1kQAJnjVPutnLSbnUteZxtsv7W4
43Cggvokfxr6l08Yh3hUowNxZVKjhF9uwMVgRRg9Nl5WdYCN+vCQbHz+ZdzGJXKq
cGqdKWgexMKX+aBdNDrK7BphUeD46sH7JWR+a/lDV/BgPxBCm9i5ZZCgXbPP89AZ
NpLBji7gz49wMsnm/x135xtNlZ3dG0oKETzi7MiR+NtKtUGvoIszSKy5JdPZ4m8q
9fnXmHqmwM6uQFuzDJPt1o+D1fusTuYnjI7EgyrJRRhQ+BB3qEFZApXnKNDRS9Dm
uB7jtcwefJCjlZVCf2+PWTOEifH2WFZXLPFlC8f44jK6iRW2Nc+wVRisJ3vSNBG1
gaRUe/FSge68eyfQj5OFiwM/2099MNkKdZ0fSOjEBeubQpiFChjgWgcOXa5Bhlrr
C4CIhFV2qg/tOuHDAF+Q5S96oZkaTy5qcEEwhBSW15ySDUaRWFSrtboNt6ZVOhug
d8JJvDCQWoNu1IQozcbv6xW/Rk7miy8c0INZ4q33YUvIZpH862+vgDWfTJ73Zy9H
jR/8eG6t3kFHKS1vWdKZzOX1bEcnd02CGElFnFYUEewKoV7ZeeLsYX7zodyUAKyi
Yp5CImsDbWWTsptBg6h9nt2TseXTxYCt2bbmpJcqzsqSCUwOQNQ4/YpuzLeG0ihc
JgOmUnQNJWCTwUUw5AS1
=tzmJ
-----END PGP SIGNATURE-----
Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy whitelisting from Kees Cook:
"Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs.
To further restrict what memory is available for copying, this creates
a way to whitelist specific areas of a given slab cache object for
copying to/from userspace, allowing much finer granularity of access
control.
Slab caches that are never exposed to userspace can declare no
whitelist for their objects, thereby keeping them unavailable to
userspace via dynamic copy operations. (Note, an implicit form of
whitelisting is the use of constant sizes in usercopy operations and
get_user()/put_user(); these bypass all hardened usercopy checks since
these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over
the next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage"
* tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
lkdtm: Update usercopy tests for whitelisting
usercopy: Restrict non-usercopy caches to size 0
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
kvm: whitelist struct kvm_vcpu_arch
arm: Implement thread_struct whitelist for hardened usercopy
arm64: Implement thread_struct whitelist for hardened usercopy
x86: Implement thread_struct whitelist for hardened usercopy
fork: Provide usercopy whitelisting for task_struct
fork: Define usercopy region in thread_stack slab caches
fork: Define usercopy region in mm_struct slab caches
net: Restrict unwhitelisted proto caches to size 0
sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
sctp: Define usercopy region in SCTP proto slab cache
caif: Define usercopy region in caif proto slab cache
ip: Define usercopy region in IP proto slab cache
net: Define usercopy region in struct proto slab cache
scsi: Define usercopy region in scsi_sense_cache slab cache
cifs: Define usercopy region in cifs_request slab cache
vxfs: Define usercopy region in vxfs_inode slab cache
ufs: Define usercopy region in ufs_inode_cache slab cache
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlp16xMACgkQ8vlZVpUN
gaP1IAf8C48AKVnqy6ftFphzV1CdeGHDwJLL63lChs97fNr1mxo5TZE/6vdYB55j
k7C7huQ582cEiGWQJ0U4/+En0hF85zkAk5mTfnSao5BqxLr9ANsAocwBUNBXdFSp
B7IyMo4Dct7NCkwfmKLPRcEqZ49vwyv99TqM/9wUkgUStkTjPT7bhHgarB6VPbhp
BxoXVnFYgU0sZN0y71IBt8ngWqCK6j7fjw3gsl37oEenG3/h3SO0H9ih1FrysX8S
VOwwLJq6vfAgEwQvZACnBwWKDYsZpH7akNp9WGeDMByo28t514RNRjIi0mvLHEZa
h72I8Sb3bwHO9MJNvHFe/0b1Say4vw==
=dxAX
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Only miscellaneous cleanups and bug fixes for ext4 this cycle"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: create ext4_kset dynamically
ext4: create ext4_feat kobject dynamically
ext4: release kobject/kset even when init/register fail
ext4: fix incorrect indentation of if statement
ext4: correct documentation for grpid mount option
ext4: use 'sbi' instead of 'EXT4_SB(sb)'
ext4: save error to disk in __ext4_grp_locked_error()
jbd2: fix sphinx kernel-doc build warnings
ext4: fix a race in the ext4 shutdown path
mbcache: make sure c_entry_count is not decremented past zero
ext4: no need flush workqueue before destroying it
ext4: fixed alignment and minor code cleanup in ext4.h
ext4: fix ENOSPC handling in DAX page fault handler
dax: pass detailed error code from dax_iomap_fault()
mbcache: revert "fs/mbcache.c: make count_objects() more robust"
mbcache: initialize entry->e_referenced in mb_cache_entry_create()
ext4: fix up remaining files with SPDX cleanups
merged in this time. Both are regressions:
1. The first fixes another kernel build dependency problem.
2. The second fixes a performance regression in glock dumps.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJadIS1AAoJENeLYdPf93o7i24H/3orp2uf/0EQFRB3WF7vxuhB
aFyymb35V5+pkoSOqBRpV8plQR3oNxeQX1uo+a08n5UzW7VHQBApS5m5to5w03dI
MRZvDUs84weKwjUm+ndhqOgjoUZuTIQ6+/A6bRDu+24AftqwNE5vHrTBvDdZ94zN
WxCy847aHd21TQ7nKIsLVp7wlllmRuxp1D+VEc7Vmn18eNrGp4TDavP5lq/4YR92
Zsj1AfhJK1GuAY9AJGMT3ZiFL6Mdg9oj7qSyJ2HjT7q/QJE+odwI8uUPs4HKpiko
VPBPhTrfgDE2nD4gAYIR41Aog8s8JnLgGK+0P7CqVxB37rq89BSYvApaHQE8yTg=
=4Ha2
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.16.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 fixes from Bob Peterson:
"Andreas Gruenbacher wrote two additional patches that we would like
merged in this time. Both are regressions:
- fix another kernel build dependency problem
- fix a performance regression in glock dumps"
* tag 'gfs2-4.16.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Glock dump performance regression fix
gfs2: Fix the crc32c dependency
Until v4.14, this warning was very infrequent:
WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0
Modules linked in: [...]
CPU: 3 PID: 18172 Comm: bees Tainted: G D W L 4.11.9-zb64+ #1
Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014
Call Trace:
dump_stack+0x85/0xc2
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
find_parent_nodes+0xc41/0x14e0
__btrfs_find_all_roots+0xad/0x120
? extent_same_check_offsets+0x70/0x70
iterate_extent_inodes+0x168/0x300
iterate_inodes_from_logical+0x87/0xb0
? iterate_inodes_from_logical+0x87/0xb0
? extent_same_check_offsets+0x70/0x70
btrfs_ioctl+0x8ac/0x2820
? lock_acquire+0xc2/0x200
do_vfs_ioctl+0x91/0x700
? __fget+0x112/0x200
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc6
? trace_hardirqs_off_caller+0x1f/0x140
Starting with v4.14 (specifically 86d5f99442 ("btrfs: convert prelimary
reference tracking to use rbtrees")) the WARN_ON occurs three orders of
magnitude more frequently--almost once per second while running workloads
like bees.
Replace the WARN_ON() with a comment rationale for its removal.
The rationale is paraphrased from an explanation by Edmund Nadolski
<enadolski@suse.de> on the linux-btrfs mailing list.
Fixes: 8da6d5815c ("Btrfs: added btrfs_find_all_roots()")
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Running generic/019 with qgroups on the scratch device enabled is almost
guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed
to trigger only on -ENOMEM, in reality, however, it's possible to get
-EIO from btrfs_qgroup_trace_extent_post. This function just finds the
roots of the extent being tracked and sets the qrecord->old_roots list.
If this operation fails nothing critical happens except the quota
accounting can be considered wrong. In such case just set the
INCONSISTENT flag for the quota and print a warning, rather than killing
off the system. Additionally, it's possible to trigger a BUG_ON in
btrfs_truncate_inode_items as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ error message adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
The highest objectid, which is assigned to new inode, is decided at
the time of initializing fs roots. However, in cases where log replay
gets processed, the btree which fs root owns might be changed, so we
have to search it again for the highest objectid, otherwise creating
new inode would end up with -EEXIST.
cc: <stable@vger.kernel.org> v4.4-rc6+
Fixes: f32e48e925 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This regression is introduced in
commit 3d48d9810d ("btrfs: Handle uninitialised inode eviction").
There are two problems,
a) it is ->destroy_inode() that does the final free on inode, not
->evict_inode(),
b) clear_inode() must be called before ->evict_inode() returns.
This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
in evict() because I_CLEAR is set in clear_inode().
Fixes: commit 3d48d9810d ("btrfs: Handle uninitialised inode eviction")
Cc: <stable@vger.kernel.org> # v4.7-rc6+
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's possible that btrfs_sync_log() bails out after one of the two
btrfs_write_marked_extents() which convert extent state's state bit into
EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY
and EXTENT_NEW are searched by free_log_tree() so that those extent states
with EXTENT_NEED_WAIT lead to memory leak.
cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In cases that the whole fs flips into readonly status due to failures in
critical sections, then log tree's blocks are still dirty, and this leads
to a crash during umount time, the crash is about use-after-free,
umount
-> close_ctree
-> stop workers
-> iput(btree_inode)
-> iput_final
-> write_inode_now
-> ...
-> queue job on stop'd workers
cc: <stable@vger.kernel.org> v3.12+
Fixes: 681ae50917 ("Btrfs: cleanup reserved space when freeing tree log on error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@cur_offset is not set back to what it should be (@cow_start) if
btrfs_next_leaf() returns something wrong, and the range [cow_start,
cur_offset) remains locked forever.
cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reverse mapping has had a while to soak, so remove the experimental tag.
Now that we've landed space metadata cross-referencing in scrub, the
feature actually has a purpose.
Reject rmap filesystems with an rt device until the code to support it
is actually implemented.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
We don't support realtime filesystems with reflink either, so fail
those mounts.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Now that reflink is no longer experimental, reject attempts to mount
with DAX until that whole mess gets sorted out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Advertise this config option along with the others.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull printk updates from Petr Mladek:
- Add a console_msg_format command line option:
The value "default" keeps the old "[time stamp] text\n" format. The
value "syslog" allows to see the syslog-like "<log
level>[timestamp] text" format.
This feature was requested by people doing regression tests, for
example, 0day robot. They want to have both filtered and full logs
at hands.
- Reduce the risk of softlockup:
Pass the console owner in a busy loop.
This is a new approach to the old problem. It was first proposed by
Steven Rostedt on Kernel Summit 2017. It marks a context in which
the console_lock owner calls console drivers and could not sleep.
On the other side, printk() callers could detect this state and use
a busy wait instead of a simple console_trylock(). Finally, the
console_lock owner checks if there is a busy waiter at the end of
the special context and eventually passes the console_lock to the
waiter.
The hand-off works surprisingly well and helps in many situations.
Well, there is still a possibility of the softlockup, for example,
when the flood of messages stops and the last owner still has too
much to flush.
There is increasing number of people having problems with
printk-related softlockups. We might eventually need to get better
solution. Anyway, this looks like a good start and promising
direction.
- Do not allow to schedule in console_unlock() called from printk():
This reverts an older controversial commit. The reschedule helped
to avoid softlockups. But it also slowed down the console output.
This patch is obsoleted by the new console waiter logic described
above. In fact, the reschedule made the hand-off less effective.
- Deprecate "%pf" and "%pF" format specifier:
It was needed on ia64, ppc64 and parisc64 to dereference function
descriptors and show the real function address. It is done
transparently by "%ps" and "pS" format specifier now.
Sergey Senozhatsky found that all the function descriptors were in
a special elf section and could be easily detected.
- Remove printk_symbol() API:
It has been obsoleted by "%pS" format specifier, and this change
helped to remove few continuous lines and a less intuitive old API.
- Remove redundant memsets:
Sergey removed unnecessary memset when processing printk.devkmsg
command line option.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (27 commits)
printk: drop redundant devkmsg_log_str memsets
printk: Never set console_may_schedule in console_trylock()
printk: Hide console waiter logic into helpers
printk: Add console owner and waiter logic to load balance console writes
kallsyms: remove print_symbol() function
checkpatch: add pF/pf deprecation warning
symbol lookup: introduce dereference_symbol_descriptor()
parisc64: Add .opd based function descriptor dereference
powerpc64: Add .opd based function descriptor dereference
ia64: Add .opd based function descriptor dereference
sections: split dereference_function_descriptor()
openrisc: Fix conflicting types for _exext and _stext
lib: do not use print_symbol()
irq debug: do not use print_symbol()
sysfs: do not use print_symbol()
drivers: do not use print_symbol()
x86: do not use print_symbol()
unicore32: do not use print_symbol()
sh: do not use print_symbol()
mn10300: do not use print_symbol()
...
The only place that has any business including asm/poll.h
is linux/poll.h. Fortunately, asm/poll.h had only been
included in 3 places beyond that one, and all of them
are trivial to switch to using linux/poll.h.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge KASAN word-at-a-time fixups from Andrey Ryabinin.
The word-at-a-time optimizations have caused headaches for KASAN, since
the whole point is that we access byte streams in bigger chunks, and
KASAN can be unhappy about the potential extra access at the end of the
string.
We used to have a horrible hack in dcache, and then people got
complaints from the strscpy() case. This fixes it all up properly, by
adding an explicit helper for the "access byte stream one word at a
time" case.
* emailed patches from Andrey Ryabinin <aryabinin@virtuozzo.com>:
fs: dcache: Revert "manually unpoison dname after allocation to shut up kasan's reports"
fs/dcache: Use read_word_at_a_time() in dentry_string_cmp()
lib/strscpy: Shut up KASAN false-positives in strscpy()
compiler.h: Add read_word_at_a_time() function.
compiler.h, kasan: Avoid duplicating __read_once_size_nocheck()
This reverts commit df4c0e36f1.
It's no longer needed since dentry_string_cmp() now uses
read_word_at_a_time() to avoid kasan's reports.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dentry_string_cmp() performs the word-at-a-time reads from 'cs' and may
read slightly more than it was requested in kmallac(). Normally this
would make KASAN to report out-of-bounds access, but this was
workarounded by commit df4c0e36f1 ("fs: dcache: manually unpoison
dname after allocation to shut up kasan's reports").
This workaround is not perfect, since it allows out-of-bounds access to
dentry's name for all the code, not just in dentry_string_cmp().
So it would be better to use read_word_at_a_time() instead and revert
commit df4c0e36f1.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Restore an optimization removed in commit 7f19449553 "Fix debugfs glocks
dump": keep the glock hash table iterator active while the glock dump
file is held open. This avoids having to rescan the hash table from the
start for each read, with quadratically rising runtime.
In addition, use rhastable_walk_peek for resuming a glock dump at the
current position: when a glock doesn't fit in the provided buffer
anymore, the next read must revisit the same glock.
Finally, also restart the dump from the first entry when we notice that
the hash table has been resized in gfs2_glock_seq_start.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Depend on LIBCRC32C which uses the crypto API to select the appropriate
crc32c implementation. With the CRYPTO and CRYPTO_CRC32C dependencies,
gfs2 would still need to use the crypto API directly like ext4 and btrfs
do, which isn't necessary.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Here is the set of "big" driver core patches for 4.16-rc1.
The majority of the work here is in the firmware subsystem, with reworks
to try to attempt to make the code easier to handle in the long run, but
no functional change. There's also some tree-wide sysfs attribute
fixups with lots of acks from the various subsystem maintainers, as well
as a handful of other normal fixes and changes.
And finally, some license cleanups for the driver core and sysfs code.
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWnLvPw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynNzACgkzjPoBytJWbpWFt6SR6L33/u4kEAnRFvVCGL
s6ygQPQhZIjKk2Lxa2hC
=Zihy
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of "big" driver core patches for 4.16-rc1.
The majority of the work here is in the firmware subsystem, with
reworks to try to attempt to make the code easier to handle in the
long run, but no functional change. There's also some tree-wide sysfs
attribute fixups with lots of acks from the various subsystem
maintainers, as well as a handful of other normal fixes and changes.
And finally, some license cleanups for the driver core and sysfs code.
All have been in linux-next for a while with no reported issues"
* tag 'driver-core-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (48 commits)
device property: Define type of PROPERTY_ENRTY_*() macros
device property: Reuse property_entry_free_data()
device property: Move property_entry_free_data() upper
firmware: Fix up docs referring to FIRMWARE_IN_KERNEL
firmware: Drop FIRMWARE_IN_KERNEL Kconfig option
USB: serial: keyspan: Drop firmware Kconfig options
sysfs: remove DEBUG defines
sysfs: use SPDX identifiers
drivers: base: add coredump driver ops
sysfs: add attribute specification for /sysfs/devices/.../coredump
test_firmware: fix missing unlock on error in config_num_requests_store()
test_firmware: make local symbol test_fw_config static
sysfs: turn WARN() into pr_warn()
firmware: Fix a typo in fallback-mechanisms.rst
treewide: Use DEVICE_ATTR_WO
treewide: Use DEVICE_ATTR_RO
treewide: Use DEVICE_ATTR_RW
sysfs.h: Use octal permissions
component: add debugfs support
bus: simple-pm-bus: convert bool SIMPLE_PM_BUS to tristate
...
Here is the big Staging and IIO driver patches for 4.16-rc1.
There is the normal amount of new IIO drivers added, like all releases.
The networking IPX and the ncpfs filesystem are moved into the staging
tree, as they are on their way out of the kernel due to lack of use
anymore.
The visorbus subsystem finall has started moving out of the staging tree
to the "real" part of the kernel, and the most and fsl-mc codebases are
almost ready to move out, that will probably happen for 4.17-rc1 if all
goes well.
Other than that, there is a bunch of license header cleanups in the
tree, along with the normal amount of coding style churn that we all
know and love for this codebase. I also got frustrated at the
Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
huge chunks of it that were never even being used.
Full details of everything is in the shortlog.
All of these patches have been in linux-next for a while with no
reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWnLxoA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yk4vgCgjeMlwhtar65DIticIRj626EFxiQAnjGmH8Kd
d9Xz2Piq8X47uSsC/6AE
=xxMT
-----END PGP SIGNATURE-----
Merge tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO updates from Greg KH:
"Here is the big Staging and IIO driver patches for 4.16-rc1.
There is the normal amount of new IIO drivers added, like all
releases.
The networking IPX and the ncpfs filesystem are moved into the staging
tree, as they are on their way out of the kernel due to lack of use
anymore.
The visorbus subsystem finall has started moving out of the staging
tree to the "real" part of the kernel, and the most and fsl-mc
codebases are almost ready to move out, that will probably happen for
4.17-rc1 if all goes well.
Other than that, there is a bunch of license header cleanups in the
tree, along with the normal amount of coding style churn that we all
know and love for this codebase. I also got frustrated at the
Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
huge chunks of it that were never even being used.
Full details of everything is in the shortlog.
All of these patches have been in linux-next for a while with no
reported issues"
* tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (627 commits)
staging: rtlwifi: remove redundant initialization of 'cfg_cmd'
staging: rtl8723bs: remove a couple of redundant initializations
staging: comedi: reformat lines to 80 chars or less
staging: lustre: separate a connection destroy from free struct kib_conn
Staging: rtl8723bs: Use !x instead of NULL comparison
Staging: rtl8723bs: Remove dead code
Staging: rtl8723bs: Change names to conform to the kernel code
staging: ccree: Fix missing blank line after declaration
staging: rtl8188eu: remove redundant initialization of 'pwrcfgcmd'
staging: rtlwifi: remove unused RTLHALMAC_ST and RTLPHYDM_ST
staging: fbtft: remove unused FB_TFT_SSD1325 kconfig
staging: comedi: dt2811: remove redundant initialization of 'ns'
staging: wilc1000: fix alignments to match open parenthesis
staging: wilc1000: removed unnecessary defined enums typedef
staging: wilc1000: remove unnecessary use of parentheses
staging: rtl8192u: remove redundant initialization of 'timeout'
staging: sm750fb: fix CamelCase for dispSet var
staging: lustre: lnet/selftest: fix compile error on UP build
staging: rtl8723bs: hal_com_phycfg: Remove unneeded semicolons
staging: rts5208: Fix "seg_no" calculation in reset_ms_card()
...
gcc versions prior to 4.6 require an extra level of braces when using a
designated initializer for a member in an anonymous struct or union.
This caused a compile error with the 'struct qstr' initialization in
__fscrypt_encrypt_symlink().
Fix it by using QSTR_INIT().
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Fixes: 76e81d6d50 ("fscrypt: new helper functions for ->symlink()")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The function inode_cmp_iversion{+raw} is counter-intuitive, because it
returns true when the counters are different and false when these are equal.
Rename it to inode_eq_iversion{+raw}, which will returns true when
the counters are equal and false otherwise.
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Don't use u32, use uint32_t, because this won't work in xfsprogs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
documentation, errseq documentation, kernel-doc support for nested
structure definitions, the removal of lots of crufty kernel-doc support for
unused formats, SPDX tag documentation, the beginnings of a manual for
subsystem maintainers, and lots of fixes and updates.
As usual, some of the changesets reach outside of Documentation/ to effect
kerneldoc comment fixes. It also adds the new LICENSES directory, of which
Thomas promises I do not need to be the maintainer.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJab11TAAoJEI3ONVYwIuV6i1UP/1LgGPHW9Ygq5qaLFbReZd/u
Mx/orrhHX0PdkbCCE+CbL8Vm1m4UKFDTBdlpk3s542zxeeG0ZBXuTnvq4Kyk+cTN
p4/vsIEzk/Ih13/glGE5MlV+EjiEK+8hK69TIUj7bAyuHmpzofjRz9/1M6RLDGDC
HY6UI58AXG0yOQWMWCGRMYpQAFUGij2equ7Doe1ugXRq14dx7V4RsOhI140iRk7t
bquAq1rS2fXniiuPFmLBUe4dWW28isVa/Vl/aXcaWQDKMyT0OLhjOMW36wWKqtPi
WdVCpHv1NLZNyZZr9S3kvfOwW+BUqpEzfVwssyBLW4h0tsnIx0U0HVhSTY8/TvFZ
QD9yCSana4LB/e5CHXIX5lBHbjHxf+rETXqVV4MgwDaMvM3mCo4X6WUTJDmZADo6
vQISEKeb4su5uWAbc9T9xwRSLhZnFVdJ/QuYdNQ5+EpFJYLhzQ9eBvEz6JstSIXL
p9ASBiPNY3ulpVZ8q0JOHJRBhq5mHJH6Dy8achzbILy2l/ZI4b8lJ53mw9II04cp
puF96E6HpvuZ8Tgjjrg9U3ZdxXNrUgc/tjk2ZDkyTglk1XF2jKSq2tiNSZ3oLrJm
XqJPnpCeyJM5UDvwkIBzgC41WEHwe8uvoNbUnc4X7UJSZegFzcSLQXf5qaprHS5k
XeQ7sbd+S+jzVVjFi0W5
=Z15Z
-----END PGP SIGNATURE-----
Merge tag 'docs-4.16' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"Documentation updates for 4.16.
New stuff includes refcount_t documentation, errseq documentation,
kernel-doc support for nested structure definitions, the removal of
lots of crufty kernel-doc support for unused formats, SPDX tag
documentation, the beginnings of a manual for subsystem maintainers,
and lots of fixes and updates.
As usual, some of the changesets reach outside of Documentation/ to
effect kerneldoc comment fixes. It also adds the new LICENSES
directory, of which Thomas promises I do not need to be the
maintainer"
* tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
linux-next: docs-rst: Fix typos in kfigure.py
linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
Documentation: Fix misconversion of #if
docs: add index entry for networking/msg_zerocopy
Documentation: security/credentials.rst: explain need to sort group_list
LICENSES: Add MPL-1.1 license
LICENSES: Add the GPL 1.0 license
LICENSES: Add Linux syscall note exception
LICENSES: Add the MIT license
LICENSES: Add the BSD-3-clause "Clear" license
LICENSES: Add the BSD 3-clause "New" or "Revised" License
LICENSES: Add the BSD 2-clause "Simplified" license
LICENSES: Add the LGPL-2.1 license
LICENSES: Add the LGPL 2.0 license
LICENSES: Add the GPL 2.0 license
Documentation: Add license-rules.rst to describe how to properly identify file licenses
scripts: kernel_doc: better handle show warnings logic
fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
doc: md: Fix a file name to md-fault.c in fault-injection.txt
errseq: Add to documentation tree
...
Pull dcache updates from Al Viro:
"Neil Brown's d_move()/d_path() race fix"
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: close race between getcwd() and d_move()
Merge updates from Andrew Morton:
- misc fixes
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
mm: remove PG_highmem description
tools, vm: new option to specify kpageflags file
mm/swap.c: make functions and their kernel-doc agree
mm, memory_hotplug: fix memmap initialization
mm: correct comments regarding do_fault_around()
mm: numa: do not trap faults on shared data section pages.
hugetlb, mbind: fall back to default policy if vma is NULL
hugetlb, mempolicy: fix the mbind hugetlb migration
mm, hugetlb: further simplify hugetlb allocation API
mm, hugetlb: get rid of surplus page accounting tricks
mm, hugetlb: do not rely on overcommit limit during migration
mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
mm, hugetlb: unify core page allocation accounting and initialization
mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
mm/memcontrol.c: make local symbol static
mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
mm/compaction.c: fix comment for try_to_compact_pages()
mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
zsmalloc: use U suffix for negative literals being shifted
...
Nothing actually calls userfaultfd_file_create() besides the
userfaultfd() system call itself. So simplify things by folding it into
the system call and using anon_inode_getfd() instead of
anon_inode_getfile(). Do the same in resolve_userfault_fork() as well.
This removes over 50 lines with no change in functionality.
Link: http://lkml.kernel.org/r/20171229212403.22800-1-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implements memfd sealing, similar to shmem:
- WRITE: deny fallocate(PUNCH_HOLE). mmap() write is denied in
memfd_add_seals(). write() doesn't exist for hugetlbfs.
- SHRINK: added similar check as shmem_setattr()
- GROW: added similar check as shmem_setattr() & shmem_fallocate()
Except write() operation that doesn't exist with hugetlbfs, that should
make sealing as close as it can be to shmem support.
Link: http://lkml.kernel.org/r/20171107122800.25517-5-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlbfs inode information will need to be accessed by code in
mm/shmem.c for file sealing operations. Move inode information
definition from .c file to header for needed access.
Link: http://lkml.kernel.org/r/20171107122800.25517-4-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Those functions are called for memfd files, backed by shmem or hugetlb
(the next patches will handle hugetlb).
Link: http://lkml.kernel.org/r/20171107122800.25517-3-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the modifed pmdp_invalidate() that returns the previous value of pmd
to transfer dirty and accessed bits.
Link: http://lkml.kernel.org/r/20171213105756.69879-12-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Daney <david.daney@cavium.com>
Cc: David Miller <davem@davemloft.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several users of unmap_mapping_range() would prefer to express their
range in pages rather than bytes. Unfortuately, on a 32-bit kernel, you
have to remember to cast your page number to a 64-bit type before
shifting it, and four places in the current tree didn't remember to do
that. That's a sign of a bad interface.
Conveniently, unmap_mapping_range() actually converts from bytes into
pages, so hoist the guts of unmap_mapping_range() into a new function
unmap_mapping_pages() and convert the callers which want to use pages.
Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If THP migration is enabled, for a VMA handled by userfaultfd, consider
the following situation,
do_page_fault()
__do_huge_pmd_anonymous_page()
handle_userfault()
userfault_msg()
/* a huge page is allocated and mapped at fault address */
/* the huge page is under migration, leaves migration entry
in page table */
userfaultfd_must_wait()
/* return true because !pmd_present() */
/* may wait in loop until fatal signal */
That is, it may be possible for userfaultfd_must_wait() encounters a PMD
entry which is !pmd_none() && !pmd_present(). In the current
implementation, we will wait for such PMD entries, which may cause
unnecessary waiting, and potential soft lockup.
This is fixed via avoiding to wait when !pmd_none() && !pmd_present(),
only wait when pmd_none().
This may be not a problem in practice, because userfaultfd_must_wait()
is always called with mm->mmap_sem read-locked. mremap() will
write-lock mm->mmap_sem. And UFFDIO_COPY doesn't support to copy THP
mapping. But the change introduced still makes the code more correct,
and makes the PMD and PTE code more consistent.
Link: http://lkml.kernel.org/r/20171207011752.3292-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.UK>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If start_code / end_code pointers are screwed then "VmExe" could be
bigger than total executable virtual memory and "VmLib" becomes
negative:
VmExe: 294320 kB
VmLib: 18446744073709327564 kB
VmExe and VmLib documented as text segment and shared library code size.
Now their sum will be always equal to mm->exec_vm which sums size of
executable and not writable and not stack areas.
I've seen this for huge (>2Gb) statically linked binary which has whole
world inside. For it start_code .. end_code range also covers one of
rodata sections. Probably this is bug in customized linker, elf loader
or both.
Anyway CONFIG_CHECKPOINT_RESTORE allows to change these pointers, thus
we cannot trust them without validation.
Link: http://lkml.kernel.org/r/150728955451.743749.11276392315459539583.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should not reuse the dirty bh in jbd2 directly due to the following
situation:
1. When removing extent rec, we will dirty the bhs of extent rec and
truncate log at the same time, and hand them over to jbd2.
2. The bhs are submitted to jbd2 area successfully.
3. The write-back thread of device help flush the bhs to disk but
encounter write error due to abnormal storage link.
4. After a while the storage link become normal. Truncate log flush
worker triggered by the next space reclaiming found the dirty bh of
truncate log and clear its 'BH_Write_EIO' and then set it uptodate in
__ocfs2_journal_access():
ocfs2_truncate_log_worker
ocfs2_flush_truncate_log
__ocfs2_flush_truncate_log
ocfs2_replay_truncate_records
ocfs2_journal_access_di
__ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
extent rec is still in error state, and unfortunately nobody will
take care of it.
6. At last the space of extent rec was not reduced, but truncate log
flush worker have given it back to globalalloc. That will cause
duplicate cluster problem which could be identified by fsck.ocfs2.
Sadly we can hardly revert this but set fs read-only in case of ruining
atomicity and consistency of space reclaim.
Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com
Fixes: acf8fdbe6a ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should unlock bh_stat if bg->bg_free_bits_count > bg->bg_bits
Link: http://lkml.kernel.org/r/1516843095-23680-1-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Return EAGAIN if any of the following checks fail for direct I/O:
- Cannot get the related locks immediately
- Blocks are not allocated at the write location, it will trigger block
allocation and block IO operations.
[ghe@suse.com: v4]
Link: http://lkml.kernel.org/r/1516007283-29932-4-git-send-email-ghe@suse.com
[ghe@suse.com: v2]
Link: http://lkml.kernel.org/r/1511944612-9629-4-git-send-email-ghe@suse.com
Link: http://lkml.kernel.org/r/1511775987-841-4-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add ocfs2_overwrite_io function, which is used to judge if overwrite
allocated blocks, otherwise, the write will bring extra block allocation
overhead.
[ghe@suse.com: v3]
Link: http://lkml.kernel.org/r/1514455665-16325-3-git-send-email-ghe@suse.com
[ghe@suse.com: v2]
Link: http://lkml.kernel.org/r/1511944612-9629-3-git-send-email-ghe@suse.com
Link: http://lkml.kernel.org/r/1511775987-841-3-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "ocfs2: add nowait aio support", v4.
VFS layer has introduced the non-blocking aio flag IOCB_NOWAIT, which
tells the kernel to bail out if an AIO request will block for reasons
such as file allocations, or writeback triggering, or would block while
allocating requests while performing direct I/O.
Subsequently, pwritev2/preadv2 also can leverage this part of kernel
code. So far, ext4/xfs/btrfs have supported this feature. Add the
related code for the ocfs2 file system.
This patch (of 3):
Add ocfs2_try_rw_lock and ocfs2_try_inode_lock functions, which will be
used in non-blocking IO scenarios.
[ghe@suse.com: v2]
Link: http://lkml.kernel.org/r/1511944612-9629-2-git-send-email-ghe@suse.com
Link: http://lkml.kernel.org/r/1511775987-841-2-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Acked-by: alex chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2 supports trimming the underlying disk via the fstrim command. But
there is a problem, ocfs2 is a shared disk cluster file system, if the
user configures a scheduled fstrim job on each file system node, this
will trigger multiple nodes trimming a shared disk simultaneously, which
is very wasteful for CPU and IO consumption. This also might negatively
affect the lifetime of poor-quality SSD devices.
So we introduce a trimfs dlm lock to communicate with each other in this
case, which will make only one fstrim command to do the trimming on a
shared disk among the cluster. The fstrim commands from the other nodes
should wait for the first fstrim to finish and return success directly,
to avoid running the same trim on the shared disk again.
Link: http://lkml.kernel.org/r/1513228484-2084-2-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new dlm lock resource, which will be used to communicate
during fstrimming of an ocfs2 device from cluster nodes.
Link: http://lkml.kernel.org/r/1513228484-2084-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A crash issue was reported by John Lightsey with a call trace as follows:
ocfs2_split_extent+0x1ad3/0x1b40 [ocfs2]
ocfs2_change_extent_flag+0x33a/0x470 [ocfs2]
ocfs2_mark_extent_written+0x172/0x220 [ocfs2]
ocfs2_dio_end_io+0x62d/0x910 [ocfs2]
dio_complete+0x19a/0x1a0
do_blockdev_direct_IO+0x19dd/0x1eb0
__blockdev_direct_IO+0x43/0x50
ocfs2_direct_IO+0x8f/0xa0 [ocfs2]
generic_file_direct_write+0xb2/0x170
__generic_file_write_iter+0xc3/0x1b0
ocfs2_file_write_iter+0x4bb/0xca0 [ocfs2]
__vfs_write+0xae/0xf0
vfs_write+0xb8/0x1b0
SyS_write+0x4f/0xb0
system_call_fastpath+0x16/0x75
The BUG code told that extent tree wants to grow but no metadata was
reserved ahead of time. From my investigation into this issue, the root
cause it that although enough metadata is not reserved, there should be
enough for following use. Rightmost extent is merged into its left one
due to a certain times of marking extent written. Because during
marking extent written, we got many physically continuous extents. At
last, an empty extent showed up and the rightmost path is removed from
extent tree.
Add a new mechanism to reuse extent block cached in dealloc which were
just unlinked from extent tree to solve this crash issue.
Criteria is that during marking extents *written*, if extent rotation
and merging results in unlinking extent with growing extent tree later
without any metadata reserved ahead of time, try to reuse those extents
in dealloc in which deleted extents are cached.
Also, this patch addresses the issue John reported that ::dw_zero_count
is not calculated properly.
After applying this patch, the issue John reported was gone. Thanks for
the reproducer provided by John. And this patch has passed
ocfs2-test(29 cases) suite running by New H3C Group.
[ge.changwei@h3c.com: fix static checker warnning]
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F29196AE@H3CMLB12-EX.srv.huawei-3com.com
[akpm@linux-foundation.org: brelse(NULL) is legal]
Link: http://lkml.kernel.org/r/1515479070-32653-2-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reported-by: John Lightsey <john@nixnuts.net>
Tested-by: John Lightsey <john@nixnuts.net>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current code assume that ::w_unwritten_list always has only one item on.
This is not right and hard to get understood. So improve how to count
unwritten item.
Link: http://lkml.kernel.org/r/1515479070-32653-1-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reported-by: John Lightsey <john@nixnuts.net>
Tested-by: John Lightsey <john@nixnuts.net>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The race between *set_acl and *get_acl will cause getting incomplete
xattr data as below:
processA processB
ocfs2_set_acl
ocfs2_xattr_set
__ocfs2_xattr_set_handle
ocfs2_get_acl_nolock
ocfs2_xattr_get_nolock:
processB may get incomplete xattr data if processA hasn't set_acl done.
So we should use 'ip_xattr_sem' to protect getting extended attribute in
ocfs2_get_acl_nolock(), as other processes could be changing it
concurrently.
Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some stack variables are no longer used but still assigned. Trim them.
Link: http://lkml.kernel.org/r/1516105069-12643-1-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need catch the errno returned by ocfs2_xattr_get_nolock() and assign
it to 'ret' for printing and noticing upper callers.
Link: http://lkml.kernel.org/r/5A571CAF.8050709@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Acked-by: Gang He <ghe@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If metadata is corrupted such as 'invalid inode block', we will get
failed by calling 'mount()' and then set filesystem readonly as below:
ocfs2_mount
ocfs2_initialize_super
ocfs2_init_global_system_inodes
ocfs2_iget
ocfs2_read_locked_inode
ocfs2_validate_inode_block
ocfs2_error
ocfs2_handle_error
ocfs2_set_ro_flag(osb, 0); // set readonly
In this situation we need return -EROFS to 'mount.ocfs2', so that user
can fix it by fsck. And then mount again. In addition, 'mount.ocfs2'
should be updated correspondingly as it only return 1 for all errno.
And I will post a patch for 'mount.ocfs2' too.
Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stack variable fe is no longer used, so trim it to save some CPU cycles
and stack space.
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F1F5A8DD@H3CMLB14-EX.srv.huawei-3com.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the OCFS2_XATTR_ROOT_SIZE macro improves the readability of the
code.
Link: http://lkml.kernel.org/r/5A2E2488.70301@huawei.com
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When some nodes of cluster face with TCP connection fault, ocfs2 will
pick up a quorum to continue to work and other nodes will be fenced by
resetting host.
In order to decide which node should be fenced, ocfs2 leverages
o2quo_state::qs_holds. If that variable is reduced to zero, then a try
to decide if fence local node is performed. However, under a specific
scenario that local node is not disconnected from others at the same
time, above method has a problem to reduce ::qs_holds to zero.
Because, o2net 90s idle timer corresponding to different nodes is
triggered one after another.
node 2 node 3
90s idle timer elapses
clear ::qs_conn_bm
set hold
40s is passed
90 idle timer elapses
clear ::qs_conn_bm
set hold
still up timer elapses
clear hold (NOT to zero )
90s idle timer elapses AGAIN
still up timer elapses.
clear hold
still up timer elapses
To solve this issue, a node which has already be evicted from
::qs_conn_bm can't set hold again and again invoked from idle timer.
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F1F3F93B@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: Yang Zhang <zhang.yangB@h3c.com>
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an obvious error message, due to mismatched cluster names between
on-disk and in the current cluster. We can meet this case during OCFS2
cluster migration.
If we can give the user an obvious tip for why they can not mount the
file system after migration, they can quickly fix this mismatch problem.
Second, also move printing ocfs2_fill_super() errno to the front of
ocfs2_dismount_volume(), since ocfs2_dismount_volume() will also print
its own message.
I looked through all the code of OCFS2 (include o2cb); there is not any
place which returns this error. In fact, the function calling path
ocfs2_fill_super -> ocfs2_mount_volume -> ocfs2_dlm_init ->
dlm_new_lockspace is a very specific one. We can use this errno to give
the user a more clear tip, since this case is a little common during
cluster migration, but the customer can quickly get the failure cause if
there is a error printed. Also, I think it is not possible to add this
errno in the o2cb path during ocfs2_dlm_init(), since the o2cb code has
been stable for a long time.
We only print this error tip when the user uses pcmk stack, since using
the o2cb stack the user will not meet this error.
[ghe@suse.com: v2]
Link: http://lkml.kernel.org/r/1495419305-3780-1-git-send-email-ghe@suse.com
Link: http://lkml.kernel.org/r/1495089336-19312-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>