We want to enable cross-filesystem copy_file_range functionality
where possible, so push the "same superblock only" checks down to
the individual filesystem callouts so they can make their own
decisions about cross-superblock copy offload and fallack to
generic_copy_file_range() for cross-superblock copy.
[Amir] We do not call ->remap_file_range() in case the files are not
on the same sb and do not call ->copy_file_range() in case the files
do not belong to the same filesystem driver.
This changes behavior of the copy_file_range(2) syscall, which will
now allow cross filesystem in-kernel copy. CIFS already supports
cross-superblock copy, between two shares to the same server. This
functionality will now be available via the copy_file_range(2) syscall.
Cc: Steve French <stfrench@microsoft.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we have generic_copy_file_range(), remove it as a fallback
case when offloads fail. This puts the responsibility for executing
fallbacks on the filesystems that implement ->copy_file_range and
allows us to add operational validity checks to
generic_copy_file_range().
Rework vfs_copy_file_range() to call a new do_copy_file_range()
helper to execute the copying callout, and move calls to
generic_file_copy_range() into filesystem methods where they
currently return failures.
[Amir] overlayfs is not responsible of executing the fallback.
It is the responsibility of the underlying filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Another round of SPDX header file fixes for 5.2-rc4
These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
added, based on the text in the files. We are slowly chipping away at
the 700+ different ways people tried to write the license text. All of
these were reviewed on the spdx mailing list by a number of different
people.
We now have over 60% of the kernel files covered with SPDX tags:
$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
Files checked: 64533
Files with SPDX: 40392
Files with errors: 0
I think the majority of the "easy" fixups are now done, it's now the
start of the longer-tail of crazy variants to wade through.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXPuGTg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykBvQCg2SG+HmDH+tlwKLT/q7jZcLMPQigAoMpt9Uuy
sxVEiFZo8ZU9v1IoRb1I
=qU++
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull yet more SPDX updates from Greg KH:
"Another round of SPDX header file fixes for 5.2-rc4
These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
added, based on the text in the files. We are slowly chipping away at
the 700+ different ways people tried to write the license text. All of
these were reviewed on the spdx mailing list by a number of different
people.
We now have over 60% of the kernel files covered with SPDX tags:
$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
Files checked: 64533
Files with SPDX: 40392
Files with errors: 0
I think the majority of the "easy" fixups are now done, it's now the
start of the longer-tail of crazy variants to wade through"
* tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits)
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429
...
Based on 1 normalized pattern(s):
this file is released under the gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 68 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The fuse_writeback_range() helper flushes dirty data to the userspace
filesystem.
When the function returns, the WRITE requests for the data in the given
range have all been completed. This is not equivalent to fsync() on the
given range, since the userspace filesystem may not yet have the data on
stable storage.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Prior to sending COPY_FILE_RANGE to userspace filesystem, we must flush all
dirty pages in both the source and destination files.
This patch adds the missing flush of the source file.
Tested on libfuse-3.5.0 with:
libfuse/example/passthrough_ll /mnt/fuse/ -o writeback
libfuse/test/test_syscalls /mnt/fuse/tmp/test
Fixes: 88bc7d5097 ("fuse: add support for copy_file_range()")
Cc: <stable@vger.kernel.org> # v4.20
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In the FOPEN_DIRECT_IO case the write path doesn't call file_remove_privs()
and that means setuid bit is not cleared if unpriviliged user writes to a
file with setuid bit set.
pjdfstest chmod test 12.t tests this and fails.
Fix this by adding a flag to the FUSE_WRITE message that requests clearing
privileges on the given file. This needs
This better than just calling fuse_remove_privs(), because the attributes
may not be up to date, so in that case a write may miss clearing the
privileges.
Test case:
$ passthrough_ll /mnt/pasthrough-mnt -o default_permissions,allow_other,cache=never
$ mkdir /mnt/pasthrough-mnt/testdir
$ cd /mnt/pasthrough-mnt/testdir
$ prove -rv pjdfstests/tests/chmod/12.t
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Do the proper cleanup in case the size check fails.
Tested with xfstests:generic/228
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 0cbade024b ("fuse: honor RLIMIT_FSIZE in fuse_file_fallocate")
Cc: Liu Bo <bo.liu@linux.alibaba.com>
Cc: <stable@vger.kernel.org> # v3.5
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXNpuRwAKCRDh3BK/laaZ
PMq/AP9kLvB97JU2GbzIJq6wOjDV8whPE/a2Knx0fajvW3AEOAD+NQwdZLmVNql7
DkkY8lZ7fVut3TMj8jHhpIbv4P1R1AE=
=qX6f
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
"Add more caching controls for userspace filesystems to use, as well as
bug fixes and cleanups"
* tag 'fuse-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: clean up fuse_alloc_inode
fuse: Add ioctl flag for x32 compat ioctl
fuse: Convert fusectl to use the new mount API
fuse: fix changelog entry for protocol 7.9
fuse: fix changelog entry for protocol 7.12
fuse: document fuse_fsync_in.fsync_flags
fuse: Add FOPEN_STREAM to use stream_open()
fuse: require /dev/fuse reads to have enough buffer capacity
fuse: retrieve: cap requested size to negotiated max_write
fuse: allow filesystems to have precise control over data cache
fuse: convert printk -> pr_*
fuse: honor RLIMIT_FSIZE in fuse_file_fallocate
fuse: fix writepages on 32bit
This patch cleans up fuse_alloc_inode function, just simply the code, no
logic change.
Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull vfs inode freeing updates from Al Viro:
"Introduction of separate method for RCU-delayed part of
->destroy_inode() (if any).
Pretty much as posted, except that destroy_inode() stashes
->free_inode into the victim (anon-unioned with ->i_fops) before
scheduling i_callback() and the last two patches (sockfs conversion
and folding struct socket_wq into struct socket) are excluded - that
pair should go through netdev once davem reopens his tree"
* 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
orangefs: make use of ->free_inode()
shmem: make use of ->free_inode()
hugetlb: make use of ->free_inode()
overlayfs: make use of ->free_inode()
jfs: switch to ->free_inode()
fuse: switch to ->free_inode()
ext4: make use of ->free_inode()
ecryptfs: make use of ->free_inode()
ceph: use ->free_inode()
btrfs: use ->free_inode()
afs: switch to use of ->free_inode()
dax: make use of ->free_inode()
ntfs: switch to ->free_inode()
securityfs: switch to ->free_inode()
apparmor: switch to ->free_inode()
rpcpipe: switch to ->free_inode()
bpf: switch to ->free_inode()
mqueue: switch to ->free_inode()
ufs: switch to ->free_inode()
coda: switch to ->free_inode()
...
fuse_destroy_inode() is gone - sanity checks that need the stack
trace of the caller get moved into ->evict_inode(), the rest joins
the RCU-delayed part which becomes ->free_inode().
While we are at it, don't just pass the address of what happens
to be the first member of structure to kmem_cache_free() -
get_fuse_inode() is there for purpose and it gives the proper
container_of() use. No behaviour change, but verifying correctness
is easier that way.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, a CUSE server running on a 64-bit kernel can tell when an ioctl
request comes from a process running a 32-bit ABI, but cannot tell whether
the requesting process is using legacy IA32 emulation or x32 ABI. In
particular, the server does not know the size of the client process's
`time_t` type.
For 64-bit kernels, the `FUSE_IOCTL_COMPAT` and `FUSE_IOCTL_32BIT` flags
are currently set in the ioctl input request (`struct fuse_ioctl_in` member
`flags`) for a 32-bit requesting process. This patch defines a new flag
`FUSE_IOCTL_COMPAT_X32` and sets it if the 32-bit requesting process is
using the x32 ABI. This allows the server process to distinguish between
requests coming from client processes using IA32 emulation or the x32 ABI
and so infer the size of the client process's `time_t` type and any other
IA32/x32 differences.
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Convert the fusectl filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.txt for more information.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The FUSE_FSYNC_DATASYNC flag was introduced by commit b6aeadeda2
("[PATCH] FUSE - file operations") as a magic number. No new values have
been added to fsync_flags since.
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Starting from commit 9c225f2655 ("vfs: atomic f_pos accesses as per
POSIX") files opened even via nonseekable_open gate read and write via lock
and do not allow them to be run simultaneously. This can create read vs
write deadlock if a filesystem is trying to implement a socket-like file
which is intended to be simultaneously used for both read and write from
filesystem client. See commit 10dce8af34 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") for details and e.g. commit 581d21a2d0 ("xenbus: fix deadlock
on writes to /proc/xen/xenbus") for a similar deadlock example on
/proc/xen/xenbus.
To avoid such deadlock it was tempting to adjust fuse_finish_open to use
stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags,
but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and write
handlers
https://codesearch.debian.net/search?q=-%3Enonseekable+%3Dhttps://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
so if we would do such a change it will break a real user.
Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the
opened handler is having stream-like semantics; does not use file position
and thus the kernel is free to issue simultaneous read and write request on
opened file handle.
This patch together with stream_open() should be added to stable kernels
starting from v3.14+. This will allow to patch OSSPD and other FUSE
filesystems that provide stream-like files to return FOPEN_STREAM |
FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all
kernel versions. This should work because fuse_finish_open ignores unknown
open flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel that
is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE
is sufficient to implement streams without read vs write deadlock.
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
A FUSE filesystem server queues /dev/fuse sys_read calls to get
filesystem requests to handle. It does not know in advance what would be
that request as it can be anything that client issues - LOOKUP, READ,
WRITE, ... Many requests are short and retrieve data from the
filesystem. However WRITE and NOTIFY_REPLY write data into filesystem.
Before getting into operation phase, FUSE filesystem server and kernel
client negotiate what should be the maximum write size the client will
ever issue. After negotiation the contract in between server/client is
that the filesystem server then should queue /dev/fuse sys_read calls with
enough buffer capacity to receive any client request - WRITE in
particular, while FUSE client should not, in particular, send WRITE
requests with > negotiated max_write payload. FUSE client in kernel and
libfuse historically reserve 4K for request header. This way the
contract is that filesystem server should queue sys_reads with
4K+max_write buffer.
If the filesystem server does not follow this contract, what can happen
is that fuse_dev_do_read will see that request size is > buffer size,
and then it will return EIO to client who issued the request but won't
indicate in any way that there is a problem to filesystem server.
This can be hard to diagnose because for some requests, e.g. for
NOTIFY_REPLY which mimics WRITE, there is no client thread that is
waiting for request completion and that EIO goes nowhere, while on
filesystem server side things look like the kernel is not replying back
after successful NOTIFY_RETRIEVE request made by the server.
We can make the problem easy to diagnose if we indicate via error return to
filesystem server when it is violating the contract. This should not
practically cause problems because if a filesystem server is using shorter
buffer, writes to it were already very likely to cause EIO, and if the
filesystem is read-only it should be too following FUSE_MIN_READ_BUFFER
minimum buffer size.
Please see [1] for context where the problem of stuck filesystem was hit
for real (because kernel client was incorrectly sending more than
max_write data with NOTIFY_REPLY; see also previous patch), how the
situation was traced and for more involving patch that did not make it
into the tree.
[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FUSE filesystem server and kernel client negotiate during initialization
phase, what should be the maximum write size the client will ever issue.
Correspondingly the filesystem server then queues sys_read calls to read
requests with buffer capacity large enough to carry request header + that
max_write bytes. A filesystem server is free to set its max_write in
anywhere in the range between [1*page, fc->max_pages*page]. In particular
go-fuse[2] sets max_write by default as 64K, wheres default fc->max_pages
corresponds to 128K. Libfuse also allows users to configure max_write, but
by default presets it to possible maximum.
If max_write is < fc->max_pages*page, and in NOTIFY_RETRIEVE handler we
allow to retrieve more than max_write bytes, corresponding prepared
NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
filesystem server, in full correspondence with server/client contract, will
be only queuing sys_read with ~max_write buffer capacity, and
fuse_dev_do_read throws away requests that cannot fit into server request
buffer. In turn the filesystem server could get stuck waiting indefinitely
for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is
understood by clients as that NOTIFY_REPLY was queued and will be sent
back.
Cap requested size to negotiate max_write to avoid the problem. This
aligns with the way NOTIFY_RETRIEVE handler works, which already
unconditionally caps requested retrieve size to fuse_conn->max_pages. This
way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than
was originally requested.
Please see [1] for context where the problem of stuck filesystem was hit
for real, how the situation was traced and for more involving patch that
did not make it into the tree.
[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
[2] https://github.com/hanwen/go-fuse
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On networked filesystems file data can be changed externally. FUSE
provides notification messages for filesystem to inform kernel that
metadata or data region of a file needs to be invalidated in local page
cache. That provides the basis for filesystem implementations to invalidate
kernel cache explicitly based on observed filesystem-specific events.
FUSE has also "automatic" invalidation mode(*) when the kernel
automatically invalidates data cache of a file if it sees mtime change. It
also automatically invalidates whole data cache of a file if it sees file
size being changed.
The automatic mode has corresponding capability - FUSE_AUTO_INVAL_DATA.
However, due to probably historical reason, that capability controls only
whether mtime change should be resulting in automatic invalidation or
not. A change in file size always results in invalidating whole data cache
of a file irregardless of whether FUSE_AUTO_INVAL_DATA was negotiated(+).
The filesystem I write[1] represents data arrays stored in networked
database as local files suitable for mmap. It is read-only filesystem -
changes to data are committed externally via database interfaces and the
filesystem only glues data into contiguous file streams suitable for mmap
and traditional array processing. The files are big - starting from
hundreds gigabytes and more. The files change regularly, and frequently by
data being appended to their end. The size of files thus changes
frequently.
If a file was accessed locally and some part of its data got into page
cache, we want that data to stay cached unless there is memory pressure, or
unless corresponding part of the file was actually changed. However current
FUSE behaviour - when it sees file size change - is to invalidate the whole
file. The data cache of the file is thus completely lost even on small size
change, and despite that the filesystem server is careful to accurately
translate database changes into FUSE invalidation messages to kernel.
Let's fix it: if a filesystem, through new FUSE_EXPLICIT_INVAL_DATA
capability, indicates to kernel that it is fully responsible for data cache
invalidation, then the kernel won't invalidate files data cache on size
change and only truncate that cache to new size in case the size decreased.
(*) see 72d0d248ca "fuse: add FUSE_AUTO_INVAL_DATA init flag",
eed2179efe "fuse: invalidate inode mapping if mtime changes"
(+) in writeback mode the kernel does not invalidate data cache on file
size change, but neither it allows the filesystem to set the size due to
external event (see 8373200b12 "fuse: Trust kernel i_size only")
[1] https://lab.nexedi.com/kirr/wendelin.core/blob/a50f1d9f/wcfs/wcfs.go#L20
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Functions, like pr_err, are a more modern variant of printing compared to
printk. They could be used to denoise sources by using needed level in
the print function name, and by automatically inserting per-driver /
function / ... print prefix as defined by pr_fmt macro. pr_* are also
said to be used in Documentation/process/coding-style.rst and more
recent code - for example overlayfs - uses them instead of printk.
Convert CUSE and FUSE to use the new pr_* functions.
CUSE output stays completely unchanged, while FUSE output is amended a
bit for "trying to steal weird page" warning - the second line now comes
also with "fuse:" prefix. I hope it is ok.
Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fstests generic/228 reported this failure that fuse fallocate does not
honor what 'ulimit -f' has set.
This adds the necessary inode_newsize_ok() check.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Fixes: 05ba1f0823 ("fuse: add FALLOCATE operation")
Cc: <stable@vger.kernel.org> # v3.5
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Writepage requests were cropped to i_size & 0xffffffff, which meant that
mmaped writes to any file larger than 4G might be silently discarded.
Fix by storing the file size in a properly sized variable (loff_t instead
of size_t).
Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
Fixes: 6eaf4782eb ("fuse: writepages: crop secondary requests")
Cc: <stable@vger.kernel.org> # v3.13
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Merge page ref overflow branch.
Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).
Admittedly it's not exactly easy. To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers. Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).
Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication. So let's just do that.
* branch page-refs:
fs: prevent page refcount overflow in pipe_buf_get
mm: prevent get_user_pages() from overflowing page refcount
mm: add 'try_get_page()' helper function
mm: make page ref count overflow check tighter and more explicit
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers converted to handle a failure.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXIdqOwAKCRDh3BK/laaZ
PFRlAP0RZr7vDfGcZTXGApcIr63YDjzi8Gg1/Jhd0jrzLbKcdAD+P0d6bupWWwOl
yGjVxY9LkXNJiTI2Q+Equ7AgMYvDcQk=
=Lvcr
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
"Scalability and performance improvements, as well as minor bug fixes
and cleanups"
* tag 'fuse-update-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
fuse: cache readdir calls if filesystem opts out of opendir
fuse: support clients that don't implement 'opendir'
fuse: lift bad inode checks into callers
fuse: multiplex cached/direct_io file operations
fuse add copy_file_range to direct io fops
fuse: use iov_iter based generic splice helpers
fuse: Switch to using async direct IO for FOPEN_DIRECT_IO
fuse: use atomic64_t for khctr
fuse: clean up aborted
fuse: Protect ff->reserved_req via corresponding fi->lock
fuse: Protect fi->nlookup with fi->lock
fuse: Introduce fi->lock to protect write related fields
fuse: Convert fc->attr_version into atomic64_t
fuse: Add fuse_inode argument to fuse_prepare_release()
fuse: Verify userspace asks to requeue interrupt that we really sent
fuse: Do some refactoring in fuse_dev_do_write()
fuse: Wake up req->waitq of only if not background
fuse: Optimize request_end() by not taking fiq->waitq.lock
fuse: Kill fasync only if interrupt is queued in queue_interrupt()
fuse: Remove stale comment in end_requests()
...
All users of VM_MAX_READAHEAD actually convert it to kbytes and then to
pages. Define the macro explicitly as (SZ_128K / PAGE_SIZE). This
simplifies the expression in every filesystem. Also rename the macro to
VM_READAHEAD_PAGES to properly convey its meaning. Finally remove unused
VM_MIN_READAHEAD
[akpm@linux-foundation.org: fix fs/io_uring.c, per Stephen]
Link: http://lkml.kernel.org/r/20181221144053.24318-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a filesystem returns ENOSYS from opendir and thus opts out of
opendir and releasedir requests, it almost certainly would also like
readdir results cached. Default open_flags to FOPEN_KEEP_CACHE and
FOPEN_CACHE_DIR in that case.
With this patch, I've measured recursive directory enumeration across
large FUSE mounts to be faster than native mounts.
Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allow filesystems to return ENOSYS from opendir, preventing the kernel from
sending opendir and releasedir messages in the future. This avoids
userspace transitions when filesystems don't need to keep track of state
per directory handle.
A new capability flag, FUSE_NO_OPENDIR_SUPPORT, parallels
FUSE_NO_OPEN_SUPPORT, indicating the new semantics for returning ENOSYS
from opendir.
Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Bad inode checks were done done in various places, and move them into
fuse_file_{read|write}_iter().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is cleanup, as well as allowing switching between I/O modes while the
file is open in the future.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Nothing preventing copy_file_range to work on files opened with
FOPEN_DIRECT_IO.
Fixes: 88bc7d5097 ("fuse: add support for copy_file_range()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The default splice implementation is grossly inefficient and the iter based
ones work just fine, so use those instead. I've measured an 8x speedup for
splice write (with len = 128k).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Switch to using the async directo IO code path in fuse_direct_read_iter()
and fuse_direct_write_iter(). This is especially important in connection
with loop devices with direct IO enabled as loop assumes async direct io is
actually async.
Signed-off-by: Martin Raiber <martin@urbackup.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The only caller that needs fc->aborted set is fuse_conn_abort_write().
Setting fc->aborted is now racy (fuse_abort_conn() may already be in
progress or finished) but there's no reason to care.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is rather natural action after previous patches, and it just decreases
load of fc->lock.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This continues previous patch and introduces the same protection for
nlookup field.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
To minimize contention of fc->lock, this patch introduces a new spinlock
for protection fuse_inode metadata:
fuse_inode:
writectr
writepages
write_files
queued_writes
attr_version
inode:
i_size
i_nlink
i_mtime
i_ctime
Also, it protects the fields changed in fuse_change_attributes_common()
(too many to list).
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch makes fc->attr_version of atomic64_t type, so fc->lock won't be
needed to read or modify it anymore.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Here is preparation for next patches, which introduce new fi->lock for
protection of ff->write_entry linked into fi->write_files.
This patch just passes new argument to the function.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When queue_interrupt() is called from fuse_dev_do_write(), it came from
userspace directly. Userspace may pass any request id, even the request's
we have not interrupted (or even background's request). This patch adds
sanity check to make kernel safe against that.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently, we wait on req->waitq in request_wait_answer() function only,
and it's never used for background requests. Since wake_up() is not a
light-weight macros, instead of this, it unfolds in really called function,
which makes locking operations taking some cpu cycles, let's avoid its call
for the case we definitely know it's completely useless.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We take global fiq->waitq.lock every time, when we are in this function,
but interrupted requests are just small subset of all requests. This patch
optimizes request_end() and makes it to take the lock when it's really
needed.
queue_interrupt() needs small change for that. After req is linked to
interrupt list, we do smp_mb() and check for FR_FINISHED again. In case of
FR_FINISHED bit has appeared, we remove req and leave the function:
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We should sent signal only in case of interrupt is really queued. Not a
real problem, but this makes the code clearer and intuitive.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
It looks like we can optimize page replacement and avoid copying by simple
updating the request's page.
[SzM: swap with new request's tmp page to avoid use after free.]
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Auxiliary requests chained on req->misc.write.next may be leaked on
truncate. Free these as well if the parent request was truncated off.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Don't reuse the queued request, even if it only contains a single page.
This is needed because previous locking changes (spliting out
fiq->waitq.lock from fc->lock) broke the assumption that request will
remain in FR_PENDING at least until the new page contents are copied.
This fix removes a slight optimization for a rare corner case, so we really
shoudln't care.
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: fd22d62ed0 ("fuse: no fc->lock for iqueue parts")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Restructure the function to better separate the locked and the unlocked
parts. Use the "old_req" local variable to mean only the queued request,
and not any auxiliary requests added onto its misc.write.next list. These
changes are in preparation for the following patch.
Also turn BUG_ON instances into WARN_ON and add a header comment explaining
what the function does.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Call this from fuse_range_is_writeback() and fuse_writepage_in_flight().
Turn a BUG_ON() into a WARN_ON() in the process.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
NR_WRITEBACK_TEMP is accounted on the temporary page in the request, not
the page cache page.
Fixes: 8b284dc472 ("fuse: writepages: handle same page rewrites")
Cc: <stable@vger.kernel.org> # v3.13
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Some of the pipe_buf_release() handlers seem to assume that the pipe is
locked - in particular, anon_pipe_buf_release() accesses pipe->tmp_page
without taking any extra locks. From a glance through the callers of
pipe_buf_release(), it looks like FUSE is the only one that calls
pipe_buf_release() without having the pipe locked.
This bug should only lead to a memory leak, nothing terrible.
Fixes: dd3bb14f44 ("fuse: support splice() writing to fuse device")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
totalram_pages and totalhigh_pages are made static inline function.
Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When FUSE_OPEN returns ENOSYS, the no_open bit is set on the connection.
Because the FUSE_RELEASE and FUSE_RELEASEDIR paths share code, this
incorrectly caused the FUSE_RELEASEDIR request to be dropped and never sent
to userspace.
Pass an isdir bool to distinguish between FUSE_RELEASE and FUSE_RELEASEDIR
inside of fuse_file_put.
Fixes: 7678ac5061 ("fuse: support clients that don't implement 'open'")
Cc: <stable@vger.kernel.org> # v3.14
Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_invalidate_attr() now sets fi->inval_mask instead of fi->i_time, hence
we need to check the inval mask in fuse_permission() as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 2f1e81965f ("fuse: allow fine grained attr cache invaldation")
Commit ab2257e994 ("fuse: reduce size of struct fuse_inode") moved parts
of fields related to writeback on regular file and to directory caching
into a union. However fuse_fsync_common() called from fuse_dir_fsync()
touches some writeback related fields, resulting in a crash.
Move writeback related parts from fuse_fsync_common() to fuse_fysnc().
Reported-by: Brett Girton <btgirton@gmail.com>
Tested-by: Brett Girton <btgirton@gmail.com>
Fixes: ab2257e994 ("fuse: reduce size of struct fuse_inode")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
make_bad_inode() sets inode->i_mode to S_IFREG if I/O error is detected
in fuse_do_getattr()/fuse_do_setattr(). If the inode is not a regular
file, write_files and queued_writes in fuse_inode are not initialized
and have NULL or invalid pointers written by other members in a union.
So, list_empty() returns false in fuse_destroy_inode(). Add
is_bad_inode() to check if make_bad_inode() was called.
Reported-by: syzbot+b9c89b84423073226299@syzkaller.appspotmail.com
Fixes: ab2257e994 ("fuse: reduce size of struct fuse_inode")
Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In async IO blocking case the additional reference to the io is taken for
it to survive fuse_aio_complete(). In non blocking case this additional
reference is not needed, however we still reference io to figure out
whether to wait for completion or not. This is wrong and will lead to
use-after-free. Fix it by storing blocking information in separate
variable.
This was spotted by KASAN when running generic/208 fstest.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 744742d692 ("fuse: Add reference counting for fuse_io_priv")
Cc: <stable@vger.kernel.org> # v4.6
In current fuse_drop_waiting() implementation it's possible that
fuse_wait_aborted() will not be woken up in the unlikely case that
fuse_abort_conn() + fuse_wait_aborted() runs in between checking
fc->connected and calling atomic_dec(&fc->num_waiting).
Do the atomic_dec_and_test() unconditionally, which also provides the
necessary barrier against reordering with the fc->connected check.
The explicit smp_mb() in fuse_wait_aborted() is not actually needed, since
the spin_unlock() in fuse_abort_conn() provides the necessary RELEASE
barrier after resetting fc->connected. However, this is not a performance
sensitive path, and adding the explicit barrier makes it easier to
document.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: b8f95e5d13 ("fuse: umount should wait for all requests")
Cc: <stable@vger.kernel.org> #v4.19
fuse_request_send_notify_reply() may fail if the connection was reset for
some reason (e.g. fs was unmounted). Don't leak request reference in this
case. Besides leaking memory, this resulted in fc->num_waiting not being
decremented and hence fuse_wait_aborted() left in a hanging and unkillable
state.
Fixes: 2d45ba381a ("fuse: add retrieve request")
Fixes: b8f95e5d13 ("fuse: umount should wait for all requests")
Reported-and-tested-by: syzbot+6339eda9cb4ebbc4c37b@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> #v2.6.36
Pull AFS updates from Al Viro:
"AFS series, with some iov_iter bits included"
* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
missing bits of "iov_iter: Separate type from direction and use accessor functions"
afs: Probe multiple fileservers simultaneously
afs: Fix callback handling
afs: Eliminate the address pointer from the address list cursor
afs: Allow dumping of server cursor on operation failure
afs: Implement YFS support in the fs client
afs: Expand data structure fields to support YFS
afs: Get the target vnode in afs_rmdir() and get a callback on it
afs: Calc callback expiry in op reply delivery
afs: Fix FS.FetchStatus delivery from updating wrong vnode
afs: Implement the YFS cache manager service
afs: Remove callback details from afs_callback_break struct
afs: Commit the status on a new file/dir/symlink
afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
afs: Don't invoke the server to read data beyond EOF
afs: Add a couple of tracepoints to log I/O errors
afs: Handle EIO from delivery function
afs: Fix TTL on VL server and address lists
afs: Implement VL server rotation
afs: Improve FS server rotation error handling
...
Use accessor functions to access an iterator's type and direction. This
allows for the possibility of using some other method of determining the
type of iterator than if-chains with bitwise-AND conditions.
Signed-off-by: David Howells <dhowells@redhat.com>
FUSE file reads are cached in the page cache, but symlink reads are
not. This patch enables FUSE READLINK operations to be cached which
can improve performance of some FUSE workloads.
In particular, I'm working on a FUSE filesystem for access to source
code and discovered that about a 10% improvement to build times is
achieved with this patch (there are a lot of symlinks in the source
tree).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
After sending a synchronous READ request from __fuse_direct_read() we only
need to invalidate atime; none of the other attributes should be changed by
a read().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If 'auto_inval_data' mode is active, then fuse_file_read_iter() will call
fuse_update_attributes(), which will check the attribute validity and send
a GETATTR request if some of the attributes are no longer valid. The page
cache is then invalidated if the size or mtime have changed.
Then, if a READ request was sent and reply received (which is the case if
the data wasn't cached yet, or if the file is opened for O_DIRECT), the
atime attribute is invalidated.
This will result in the next read() also triggering a GETATTR, ...
This can be fixed by only sending GETATTR if the mode or size are invalid,
we don't need to do a refresh if only atime is invalid.
More generally, none of the callers of fuse_update_attributes() need an
up-to-date atime value, so for now just remove STATX_ATIME from the request
mask when attributes are updated for internal use.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch adds the infrastructure for more fine grained attribute
invalidation. Currently only 'atime' is invalidated separately.
The use of this infrastructure is extended to the statx(2) interface, which
for now means that if only 'atime' is invalid and STATX_ATIME is not
specified in the mask argument, then no GETATTR request will be generated.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Writeback caching currently allocates requests with the maximum number of
possible pages, while the actual number of pages per request depends on a
couple of factors that cannot be determined when the request is allocated
(whether page is already under writeback, whether page is contiguous with
previous pages already added to a request).
This patch allows such requests to start with no page allocation (all pages
inline) and grow the page array on demand.
If the max_pages tunable remains the default value, then this will mean
just one allocation that is the same size as before. If the tunable is
larger, then this adds at most 3 additional memory allocations (which is
generously compensated by the improved performance from the larger
request).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.
Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
- https://lkml.org/lkml/2012/7/5/136
We've encountered performance degradation and fixed it on a big and complex
virtual environment.
Environment to reproduce degradation and improvement:
1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c
2. patch UM fuse with configurable max_pages parameter. The patch will be
provided latter.
3. run test script and perform test on tmpfs
fuse_test()
{
cd /tmp
mkdir -p fusemnt
passthrough_fh -o max_pages=$1 /tmp/fusemnt
grep fuse /proc/self/mounts
dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
count=1K bs=1M 2>&1 | grep -v records
rm fusemnt/tmp/tmp
killall passthrough_fh
}
Test results:
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s
Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.
Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.
Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When allocating page array for a request the array for the page pointers
and the array for page descriptors are allocated by two separate kmalloc()
calls. Merge these into one allocation.
Also instead of initializing the request and the page arrays with memset(),
use the zeroing allocation variants.
Reserved requests never carry pages (page array size is zero). Make that
explicit by initializing the page array pointers to NULL and make sure the
assumption remains true by adding a WARN_ON().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Do this by grouping fields used for cached writes and putting them into a
union with fileds used for cached readdir (with obviously no overlap, since
we don't have hybrid objects).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use the internal iversion counter to make sure modifications of the
directory through this filesystem are not missed by the mtime check (due to
mtime granularity).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Store the modification time of the directory in the cache, obtained before
starting to fill the cache.
When reading the cache, verify that the directory hasn't changed, by
checking if current modification time is the same as the one stored in the
cache.
This only needs to be done when the current file position is at the
beginning of the directory, as mandated by POSIX.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allow the cache to be invalidated when page(s) have gone missing. In this
case increment the version of the cache and reset to an empty state.
Add a version number to the directory stream in struct fuse_file as well,
indicating the version of the cache it's supposed to be reading. If the
cache version doesn't match the stream's version, then reset the stream to
the beginning of the cache.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The cache is only used if it's completed, not while it's still being
filled; this constraint could be lifted later, if it turns out to be
useful.
Introduce state in struct fuse_file that indicates the position within the
cache. After a seek, reset the position to the beginning of the cache and
search the cache for the current position. If the current position is not
found in the cache, then fall back to uncached readdir.
It can also happen that page(s) disappear from the cache, in which case we
must also fall back to uncached readdir.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch just adds the cache filling functions, which are invoked if
FOPEN_CACHE_DIR flag is set in the OPENDIR reply.
Cache reading and cache invalidation are added by subsequent patches.
The directory cache uses the page cache. Directory entries are packed into
a page in the same format as in the READDIR reply. A page only contains
whole entries, the space at the end of the page is cleared. The page is
locked while being modified.
Multiple parallel readdirs on the same directory can fill the cache; the
only constraint is that continuity must be maintained (d_off of last entry
points to position of current entry).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We noticed the performance bottleneck in FUSE running our Virtuozzo storage
over rdma. On some types of workload we observe 20% of times spent in
request_find() in profiler. This function is iterating over long requests
list, and it scales bad.
The patch introduces hash table to reduce the number of iterations, we do
in this function. Hash generating algorithm is taken from hash_add()
function, while 256 lines table is used to store pending requests. This
fixes problem and improves the performance.
Reported-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This field is not needed after the previous patch, since we can easily
convert request ID to interrupt request ID and vice versa.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Using of two unconnected IDs req->in.h.unique and req->intr_unique does not
allow to link requests to a hash table. We need can't use none of them as a
key to calculate hash.
This patch changes the algorithm of allocation of IDs for a request. Plain
requests obtain even ID, while interrupt requests are encoded in the low
bit. So, in next patches we will be able to use the rest of ID bits to
calculate hash, and the hash will be the same for plain and interrupt
requests.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently, we take fc->lock there only to check for fc->connected.
But this flag is changed only on connection abort, which is very
rare operation.
So allow checking fc->connected under just fc->bg_lock and use this lock
(as well as fc->lock) when resetting fc->connected.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
To reduce contention of fc->lock, this patch introduces bg_lock for
protection of fields related to background queue. These are:
max_background, congestion_threshold, num_background, active_background,
bg_queue and blocked.
This allows next patch to make async reads not requiring fc->lock, so async
reads and writes will have better performance executed in parallel.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Functions sequences like request_end()->flush_bg_queue() require that
max_background and congestion_threshold are constant during their
execution. Otherwise, checks like
if (fc->num_background == fc->max_background)
made in different time may behave not like expected.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since they are of unsigned int type, it's allowed to read them
unlocked during reporting to userspace. Let's underline this fact
with READ_ONCE() macroses.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This cleanup patch makes the function to use the primitive
instead of direct dereferencing.
Also, move fiq dereferencing out of cycle, since it's
always constant.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There are several FUSE filesystems that can implement server-side copy
or other efficient copy/duplication/clone methods. The copy_file_range()
syscall is the standard interface that users have access to while not
depending on external libraries that bypass FUSE.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Using waitqueue_active() is racy. Make sure we issue a wake_up()
unconditionally after storing into fc->blocked. After that it's okay to
optimize with waitqueue_active() since the first wake up provides the
necessary barrier for all waiters, not the just the woken one.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3c18ef8117 ("fuse: optimize wake_up")
Cc: <stable@vger.kernel.org> # v3.10
Otherwise fuse_dev_do_write() could come in and finish off the request, and
the set_bit(FR_SENT, ...) could trigger the WARN_ON(test_bit(FR_SENT, ...))
in request_end().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reported-by: syzbot+ef054c4d3f64cd7f7cec@syzkaller.appspotmai
Fixes: 46c34a348b ("fuse: no fc->lock for pqueue parts")
Cc: <stable@vger.kernel.org> # v4.2
This contains various bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3xvGwAKCRDh3BK/laaZ
PKECAP9qUpdtQ5RaIL/y9OGZzJLSZbBZuK3LGNY2u2B3EfrSjgEAvhkhXyOQgvVi
kgYLNszbg/C+w8U4Xc5GWB6cjNm6rwE=
=GJI7
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
"Various bug fixes and cleanups"
* tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: reduce allocation size for splice_write
fuse: use kvmalloc to allocate array of pipe_buffer structs.
fuse: convert last timespec use to timespec64
fs: fuse: Adding new return type vm_fault_t
fuse: simplify fuse_abort_conn()
fuse: Add missed unlock_page() to fuse_readpages_fill()
fuse: Don't access pipe->buffers without pipe_lock()
fuse: fix initial parallel dirops
fuse: Fix oops at process_init_reply()
fuse: umount should wait for all requests
fuse: fix unlocked access to processing queue
fuse: fix double request_end()
Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.
This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.
- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
Pull vfs icache updates from Al Viro:
- NFS mkdir/open_by_handle race fix
- analogous solution for FUSE, replacing the one currently in mainline
- new primitive to be used when discarding halfway set up inodes on
failed object creation; gives sane warranties re icache lookups not
returning such doomed by still not freed inodes. A bunch of
filesystems switched to that animal.
- Miklos' fix for last cycle regression in iget5_locked(); -stable will
need a slightly different variant, unfortunately.
- misc bits and pieces around things icache-related (in adfs and jfs).
* 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
jfs: don't bother with make_bad_inode() in ialloc()
adfs: don't put inodes into icache
new helper: inode_fake_hash()
vfs: don't evict uninitialized inode
jfs: switch to discard_new_inode()
ext2: make sure that partially set up inodes won't be returned by ext2_iget()
udf: switch to discard_new_inode()
ufs: switch to discard_new_inode()
btrfs: switch to discard_new_inode()
new primitive: discard_new_inode()
kill d_instantiate_no_diralias()
nfs_instantiate(): prevent multiple aliases for directory inode
The only user is fuse_create_new_entry(), and there it's used to
mitigate the same mkdir/open-by-handle race as in nfs_mkdir().
The same solution applies - unhash the mkdir argument, then
call d_splice_alias() and if that returns a reference to preexisting
alias, dput() and report success. ->mkdir() argument left unhashed
negative with the preexisting alias moved in the right place is just
fine from the ->mkdir() callers point of view.
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The 'bufs' array contains 'pipe->buffers' elements, but the
fuse_dev_splice_write() uses only 'pipe->nrbufs' elements.
So reduce the allocation size to 'pipe->nrbufs' elements.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The amount of pipe->buffers is basically controlled by userspace by
fcntl(... F_SETPIPE_SZ ...) so it could be large. High order allocations
could be slow (if memory is heavily fragmented) or may fail if the order
is larger than PAGE_ALLOC_COSTLY_ORDER.
Since the 'bufs' doesn't need to be physically contiguous, use
the kvmalloc_array() to allocate memory. If high order
page isn't available, the kvamalloc*() will fallback to 0-order.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
All of fuse uses 64-bit timestamps with the exception of the
fuse_change_attributes(), so let's convert this one as well.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the function
returns a VM_FAULT value rather than an errno. Once all instances are
converted, vm_fault_t will become a distinct type.
commit 1c8f422059 ("mm: change return type to vm_fault_t")
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The above error path returns with page unlocked, so this place seems also
to behave the same.
Fixes: f8dbdf8182 ("fuse: rework fuse_readpages()")
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_dev_splice_write() reads pipe->buffers to determine the size of
'bufs' array before taking the pipe_lock(). This is not safe as
another thread might change the 'pipe->buffers' between the allocation
and taking the pipe_lock(). So we end up with too small 'bufs' array.
Move the bufs allocations inside pipe_lock()/pipe_unlock() to fix this.
Fixes: dd3bb14f44 ("fuse: support splice() writing to fuse device")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org> # v2.6.35
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If parallel dirops are enabled in FUSE_INIT reply, then first operation may
leave fi->mutex held.
Reported-by: syzbot <syzbot+3f7b29af1baa9d0a55be@syzkaller.appspotmail.com>
Fixes: 5c672ab3f0 ("fuse: serialize dirops by default")
Cc: <stable@vger.kernel.org> # v4.7
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
syzbot is hitting NULL pointer dereference at process_init_reply().
This is because deactivate_locked_super() is called before response for
initial request is processed.
Fix this by aborting and waiting for all requests (including FUSE_INIT)
before resetting fc->sb.
Original patch by Tetsuo Handa <penguin-kernel@I-love.SKAURA.ne.jp>.
Reported-by: syzbot <syzbot+b62f08f4d5857755e3bc@syzkaller.appspotmail.com>
Fixes: e27c9d3877 ("fuse: fuse: add time_gran to INIT_OUT")
Cc: <stable@vger.kernel.org> # v3.19
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_abort_conn() does not guarantee that all async requests have actually
finished aborting (i.e. their ->end() function is called). This could
actually result in still used inodes after umount.
Add a helper to wait until all requests are fully done. This is done by
looking at the "num_waiting" counter. When this counter drops to zero, we
can be sure that no more requests are outstanding.
Fixes: 0d8e84b043 ("fuse: simplify request abort")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_dev_release() assumes that it's the only one referencing the
fpq->processing list, but that's not true, since fuse_abort_conn() can be
doing the same without any serialization between the two.
Fixes: c3696046be ("fuse: separate pqueue for clones")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Refcounting of request is broken when fuse_abort_conn() is called and
request is on the fpq->io list:
- ref is taken too late
- then it is not dropped
Fixes: 0d8e84b043 ("fuse: simplify request abort")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The cost is the the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.
__task_pid_nr_ns has been updated to take advantage of this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
__gfs2_lookup(), gfs2_create_inode(), nfs_finish_open() and fuse_create_open()
don't need 'opened' anymore. Get rid of that argument in those.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Parallel to FILE_CREATED, goes into ->f_mode instead of *opened.
NFS is a bit of a wart here - it doesn't have file at the point
where FILE_CREATED used to be set, so we need to propagate it
there (for now). IMA is another one (here and everywhere)...
Note that this needs do_dentry_open() to leave old bits in ->f_mode
alone - we want it to preserve FMODE_CREATED if it had been already
set (no other bit can be there).
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
There were no conflicts between this and the contents of linux-next
until just before the merge window, when we saw multiple problems:
- A minor conflict with my own y2038 fixes, which I could address
by adding another patch on top here.
- One semantic conflict with late changes to the NFS tree. I addressed
this by merging Deepa's original branch on top of the changes that
now got merged into mainline and making sure the merge commit includes
the necessary changes as produced by coccinelle.
- A trivial conflict against the removal of staging/lustre.
- Multiple conflicts against the VFS changes in the overlayfs tree.
These are still part of linux-next, but apparently this is no longer
intended for 4.18 [1], so I am ignoring that part.
As Deepa writes:
The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions.
Thomas Gleixner adds:
I think there is no point to drag that out for the next merge window.
The whole thing needs to be done in one go for the core changes which
means that you're going to play that catchup game forever. Let's get
over with it towards the end of the merge window.
[1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
A3HkjIBrKW5AgQDxfgvm
=CZX2
-----END PGP SIGNATURE-----
Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
The most interesting part of this update is user namespace support, mostly
done by Eric Biederman. This enables safe unprivileged fuse mounts within
a user namespace.
There are also a couple of fixes for bugs found by syzbot and miscellaneous
fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCWxanJAAKCRDh3BK/laaZ
PJjDAP4r6f4kL/5DZxK7JSnSue8BHESGD1LCMVgL57e9WmZukgD/cOtaO85ie3lh
DWuhX5xGZVMMX4frIGLfBn8ogSS+egw=
=3luD
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
"The most interesting part of this update is user namespace support,
mostly done by Eric Biederman. This enables safe unprivileged fuse
mounts within a user namespace.
There are also a couple of fixes for bugs found by syzbot and
miscellaneous fixes and cleanups"
* tag 'fuse-update-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: don't keep dead fuse_conn at fuse_fill_super().
fuse: fix control dir setup and teardown
fuse: fix congested state leak on aborted connections
fuse: Allow fully unprivileged mounts
fuse: Ensure posix acls are translated outside of init_user_ns
fuse: add writeback documentation
fuse: honor AT_STATX_FORCE_SYNC
fuse: honor AT_STATX_DONT_SYNC
fuse: Restrict allow_other to the superblock's namespace or a descendant
fuse: Support fuse filesystems outside of init_user_ns
fuse: Fail all requests with invalid uids or gids
fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
fuse: return -ECONNABORTED on /dev/fuse read after abort
fuse: atomic_o_trunc should truncate pagecache
syzbot is reporting use-after-free at fuse_kill_sb_blk() [1].
Since sb->s_fs_info field is not cleared after fc was released by
fuse_conn_put() when initialization failed, fuse_kill_sb_blk() finds
already released fc and tries to hold the lock. Fix this by clearing
sb->s_fs_info field after calling fuse_conn_put().
[1] https://syzkaller.appspot.com/bug?id=a07a680ed0a9290585ca424546860464dd9658db
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+ec3986119086fe4eec97@syzkaller.appspotmail.com>
Fixes: 3b463ae0c6 ("fuse: invalidation reverse calls")
Cc: John Muir <john@jmuir.com>
Cc: Csaba Henk <csaba@gluster.com>
Cc: Anand Avati <avati@redhat.com>
Cc: <stable@vger.kernel.org> # v2.6.31
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
syzbot is reporting NULL pointer dereference at fuse_ctl_remove_conn() [1].
Since fc->ctl_ndents is incremented by fuse_ctl_add_conn() when new_inode()
failed, fuse_ctl_remove_conn() reaches an inode-less dentry and tries to
clear d_inode(dentry)->i_private field.
Fix by only adding the dentry to the array after being fully set up.
When tearing down the control directory, do d_invalidate() on it to get rid
of any mounts that might have been added.
[1] https://syzkaller.appspot.com/bug?id=f396d863067238959c91c0b7cfc10b163638cac6
Reported-by: syzbot <syzbot+32c236387d66c4516827@syzkaller.appspotmail.com>
Fixes: bafa96541b ("[PATCH] fuse: add control filesystem")
Cc: <stable@vger.kernel.org> # v2.6.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If a connection gets aborted while congested, FUSE can leave
nr_wb_congested[] stuck until reboot causing wait_iff_congested() to
wait spuriously which can lead to severe performance degradation.
The leak is caused by gating congestion state clearing with
fc->connected test in request_end(). This was added way back in 2009
by 26c3679101 ("fuse: destroy bdi on umount"). While the commit
description doesn't explain why the test was added, it most likely was
to avoid dereferencing bdi after it got destroyed.
Since then, bdi lifetime rules have changed many times and now we're
always guaranteed to have access to the bdi while the superblock is
alive (fc->sb).
Drop fc->connected conditional to avoid leaking congestion states.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joshua Miller <joshmiller@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # v2.6.29+
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Now that the fuse and the vfs work is complete. Allow the fuse filesystem
to be mounted by the root user in a user namespace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Ensure the translation happens by failing to read or write
posix acls when the filesystem has not indicated it supports
posix acls.
This ensures that modern cached posix acl support is available
and used when dealing with posix acls. This is important
because only that path has the code to convernt the uids and
gids in posix acls into the user namespace of a fuse filesystem.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Files on FUSE can change at any point in time without IMA being able
to detect it. The file data read for the file signature verification
could be totally different from what is subsequently read, making the
signature verification useless.
FUSE can be mounted by unprivileged users either today with fusermount
installed with setuid, or soon with the upcoming patches to allow FUSE
mounts in a non-init user namespace.
This patch sets the SB_I_IMA_UNVERIFIABLE_SIGNATURE flag and when
appropriate sets the SB_I_UNTRUSTED_MOUNTER flag.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Cc: Alban Crequy <alban@kinvolk.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
The description of this flag says "Don't sync attributes with the server".
In other words: always use the attributes cached in the kernel and don't
send network or local messages to refresh the attributes.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed for a mount
done with user namespace root permissions. In such cases allow_other should
not allow users outside the userns to access the mount as doing so would
give the unprivileged user the ability to manipulate processes it would
otherwise be unable to manipulate. Restrict allow_other to apply to users
in the same userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a module.
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In order to support mounts from namespaces other than init_user_ns, fuse
must translate uids and gids to/from the userns of the process servicing
requests on /dev/fuse. This patch does that, with a couple of restrictions
on the namespace:
- The userns for the fuse connection is fixed to the namespace
from which /dev/fuse is opened.
- The namespace must be the same as s_user_ns.
These restrictions simplify the implementation by avoiding the need to pass
around userns references and by allowing fuse to rely on the checks in
setattr_prepare for ownership changes. Either restriction could be relaxed
in the future if needed.
For cuse the userns used is the opener of /dev/cuse. Semantically the cuse
support does not appear safe for unprivileged users. Practically the
permissions on /dev/cuse only make it accessible to the global root user.
If something slips through the cracks in a user namespace the only users
who will be able to use the cuse device are those users mapped into the
user namespace.
Translation in the posix acl is updated to use the uuser namespace of the
filesystem. Avoiding cases which might bypass this translation is handled
in a following change.
This change is stronlgy based on a similar change from Seth Forshee and
Dongsu Park.
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Upon a cursory examinination the uid and gid of a fuse request are
necessary for correct operation. Failing a fuse request where those
values are not reliable seems a straight forward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.
In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid. But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.
Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.
[SzM] Don't zero the context for the nofail case, just keep using the
munging version (makes sense for debugging and doesn't hurt).
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist. The process have been
killed or may have fired an asynchronous request and exited.
If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns)" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid. Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.
The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read. That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow. So that
is not desirable.
The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages so
removing this code should not matter.
Getting the translation to a server running outside of the pid namespace of
a container can still be achieved by playing setns games at mount time. It
is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.
Fixes: 5d6d3a301c ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently the userspace has no way of knowing whether the fuse
connection ended because of umount or abort via sysfs. It makes it hard
for filesystems to free the mountpoint after abort without worrying
about removing some new mount.
The patch fixes it by returning different errors when userspace reads
from /dev/fuse (-ENODEV for umount and -ECONNABORTED for abort).
Add a new capability flag FUSE_ABORT_ERROR. If set and the connection is
gone because of sysfs abort, reading from the device will return
-ECONNABORTED.
Signed-off-by: Szymon Lukasz <noh4hss@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fuse has an "atomic_o_trunc" mode, where userspace filesystem uses the
O_TRUNC flag in the OPEN request to truncate the file atomically with the
open.
In this mode there's no need to send a SETATTR request to userspace after
the open, so fuse_do_setattr() checks this mode and returns. But this
misses the important step of truncating the pagecache.
Add the missing parts of truncation to the ATTR_OPEN branch.
Reported-by: Chad Austin <chadaustin@fb.com>
Fixes: 6ff958edbf ("fuse: add atomic open+truncate support")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.
Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.
Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.
The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.
As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"
* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2 updates
- almost all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
memory hotplug: fix comments when adding section
mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
mm: simplify nodemask printing
mm,oom_reaper: remove pointless kthread_run() error check
mm/page_ext.c: check if page_ext is not prepared
writeback: remove unused function parameter
mm: do not rely on preempt_count in print_vma_addr
mm, sparse: do not swamp log with huge vmemmap allocation failures
mm/hmm: remove redundant variable align_end
mm/list_lru.c: mark expected switch fall-through
mm/shmem.c: mark expected switch fall-through
mm/page_alloc.c: broken deferred calculation
mm: don't warn about allocations which stall for too long
fs: fuse: account fuse_inode slab memory as reclaimable
mm, page_alloc: fix potential false positive in __zone_watermark_ok
mm: mlock: remove lru_add_drain_all()
mm, sysctl: make NUMA stats configurable
shmem: convert shmem_init_inodecache() to void
Unify migrate_pages and move_pages access checks
mm, pagevec: rename pagevec drained field
...
Fuse inodes are currently included in the unreclaimable slab counts -
SUnreclaim in /proc/meminfo, slab_unreclaimable in /proc/vmstat and the
per-cgroup memory.stat. But they are reclaimable just like other
filesystems' inodes, and /proc/sys/vm/drop_caches frees them easily.
Mark the slab cache reclaimable.
Link: http://lkml.kernel.org/r/20171102202727.12539-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All callers of release_pages claim the pages being released are cache
hot. As no one cares about the hotness of pages being released to the
allocator, just ditch the parameter.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Summary of modules changes for the 4.15 merge window:
- Treewide module_param_call() cleanup, fix up set/get function
prototype mismatches, from Kees Cook
- Minor code cleanups
Signed-off-by: Jessica Yu <jeyu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJaDCyzAAoJEMBFfjjOO8FyaYQP/AwHBy6XmwwVlWDP4BqIF6hL
Vhy3ccVLYEORvePv68tWSRPUz5n6+1Ebqanmwtkw6i8l+KwxY2SfkZql09cARc33
2iBE4bHF98iWQmnJbF6me80fedY9n5bZJNMQKEF9VozJWwTMOTQFTCfmyJRDBmk9
iidQj6M3idbSUOYIJjvc40VGx5NyQWSr+FFfqsz1rU5iLGRGEvA3I2/CDT0oTuV6
D4MmFxzE2Tv/vIMa2GzKJ1LGScuUfSjf93Lq9Kk0cG36qWao8l930CaXyVdE9WJv
bkUzpf3QYv/rDX6QbAGA0cada13zd+dfBr8YhchclEAfJ+GDLjMEDu04NEmI6KUT
5lP0Xw0xYNZQI7bkdxDMhsj5jaz/HJpXCjPCtZBnSEKiL4OPXVMe+pBHoCJ2/yFN
6M716XpWYgUviUOdiE+chczB5p3z4FA6u2ykaM4Tlk0btZuHGxjcSWwvcIdlPmjm
kY4AfDV6K0bfEBVguWPJicvrkx44atqT5nWbbPhDwTSavtsuRJLb3GCsHedx7K8h
ZO47lCQFAWCtrycK1HYw+oupNC3hYWQ0SR42XRdGhL1bq26C+1sei1QhfqSgA9PQ
7CwWH4UTOL9fhtrzSqZngYOh9sjQNFNefqQHcecNzcEjK2vjrgQZvRNWZKHSwaFs
fbGX8juZWP4ypbK+irTB
=c8vb
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
"Summary of modules changes for the 4.15 merge window:
- treewide module_param_call() cleanup, fix up set/get function
prototype mismatches, from Kees Cook
- minor code cleanups"
* tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: Do not paper over type mismatches in module_param_call()
treewide: Fix function prototypes for module_param_call()
module: Prepare to convert all module_param_call() prototypes
kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
Several function prototypes for the set/get functions defined by
module_param_call() have a slightly wrong argument types. This fixes
those in an effort to clean up the calls when running under type-enforced
compiler instrumentation for CFI. This is the result of running the
following semantic patch:
@match_module_param_call_function@
declarer name module_param_call;
identifier _name, _set_func, _get_func;
expression _arg, _mode;
@@
module_param_call(_name, _set_func, _get_func, _arg, _mode);
@fix_set_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._set_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _set_func(
-_val_type _val
+const char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
@fix_get_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._get_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _get_func(
-_val_type _val
+char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
Two additional by-hand changes are included for places where the above
Coccinelle script didn't notice them:
drivers/platform/x86/thinkpad_acpi.c
fs/lockd/svc.c
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Marios Titas running a Haskell program noticed a problem with fuse's
readdirplus: when it is interrupted by a signal, it skips one directory
entry.
The reason is that fuse erronously updates ctx->pos after a failed
dir_emit().
The issue originates from the patch adding readdirplus support.
Reported-by: Jakob Unterwurzacher <jakobunt@gmail.com>
Tested-by: Marios Titas <redneb@gmx.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 0b05b18381 ("fuse: implement NFS-like readdirplus support")
Cc: <stable@vger.kernel.org> # v3.9
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[AV: in addition to the fix in previous commit]
Signed-off-by: Matthew Garrett <mjg59@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull fuse updates from Miklos Szeredi:
"This fixes a regression (spotted by the Sandstorm.io folks) in the pid
namespace handling introduced in 4.12.
There's also a fix for honoring sync/dsync flags for pwritev2()"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: getattr cleanup
fuse: honor iocb sync flags on write
fuse: allow server to run in different pid_ns
The refreshed argument isn't used by any caller, get rid of it.
Use a helper for just updating the inode (no need to fill in a kstat).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If the IOCB_DSYNC flag is set a sync is not being performed by
fuse_file_write_iter.
Honor IOCB_DSYNC/IOCB_SYNC by setting O_DYSNC/O_SYNC respectively in the
flags filed of the write request.
We don't need to sync data or metadata, since fuse_perform_write() does
write-through and the filesystem is responsible for updating file times.
Original patch by Vitaly Zolotusky.
Reported-by: Nate Clark <nate@neworld.us>
Cc: Vitaly Zolotusky <vitaly@unitc.com>.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 0b6e9ea041 ("fuse: Add support for pid namespaces") broke
Sandstorm.io development tools, which have been sending FUSE file
descriptors across PID namespace boundaries since early 2014.
The above patch added a check that prevented I/O on the fuse device file
descriptor if the pid namespace of the reader/writer was different from the
pid namespace of the mounter. With this change passing the device file
descriptor to a different pid namespace simply doesn't work. The check was
added because pids are transferred to/from the fuse userspace server in the
namespace registered at mount time.
To fix this regression, remove the checks and do the following:
1) the pid in the request header (the pid of the task that initiated the
filesystem operation) is translated to the reader's pid namespace. If a
mapping doesn't exist for this pid, then a zero pid is used. Note: even if
a mapping would exist between the initiator task's pid namespace and the
reader's pid namespace the pid will be zero if either mapping from
initator's to mounter's namespace or mapping from mounter's to reader's
namespace doesn't exist.
2) The lk.pid value in setlk/setlkw requests and getlk reply is left alone.
Userspace should not interpret this value anyway. Also allow the
setlk/setlkw operations if the pid of the task cannot be represented in the
mounter's namespace (pid being zero in that case).
Reported-by: Kenton Varda <kenton@sandstorm.io>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 0b6e9ea041 ("fuse: Add support for pid namespaces")
Cc: <stable@vger.kernel.org> # v4.12+
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZrTy3AAoJEAAOaEEZVoIVaucP/ApBAj2S5wzvlV1u6l8E6ae7
ZeEEZfcWwzRYlKjZAkTWqj9XvGpDGO5gLq4wsZK2edFAq++/MJF8ZVtN4tdZ1kUZ
DUvRodtVOrT08Kp9wZXGT7JOFrf6U/6gMcR6p0MuWnHndeKYvlpcFi9NPT4EC9/z
Zm9V7gtlPdSOha7eaSjUS0+vLERkxqXLBW3Av9QUOBP/lbI3lqIroGKeHDYnVdya
2P/k5EcRRJMyJP6TqyYxmmJl+UWjJFMLvnlUDBslHnD/u3mIUhw3JLHYBjn5dZRE
Xjq56IDPoXDUvzlBhtn/Uqyx+/wtwsNsylpmKv6K5G1JfdeuSsPVsCey+A1cqV64
LpE5896wf9TmnmI9LNyh6vDn925xPSGBiF45UEp5f9aO7jXeY0MaEZ8g+ENqFIDK
v4gtZdS9FhYHV+/l4qEwYMKrqSbwKEs1r1FT+f4wnABby1ojfdA57ZPlp5PV2Vjp
szTp88Zkb7cMvZwEnWwxWofcJNmgS7uNahvnQF3IJ4ITsioEkuyYR3K4ZQMaaaV9
wCp6G0FhXZaK3OI7o9WiDwaO2elp9Hxc8bnqKpiBbHZkY0NLh7/++5VxpeNbTHFy
AGijQiiKGNNyYqNj93wq9jpVdMNjB0pXrHRxfav8v7MtQ+WfbEoAENF4T7hN7iXn
UuF6eSWEC5O1UCRUk1A+
=LLY3
-----END PGP SIGNATURE-----
Merge tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull writeback error handling updates from Jeff Layton:
"This pile continues the work from last cycle on better tracking
writeback errors. In v4.13 we added some basic errseq_t infrastructure
and converted a few filesystems to use it.
This set continues refining that infrastructure, adds documentation,
and converts most of the other filesystems to use it. The main
exception at this point is the NFS client"
* tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
ecryptfs: convert to file_write_and_wait in ->fsync
mm: remove optimizations based on i_size in mapping writeback waits
fs: convert a pile of fsync routines to errseq_t based reporting
gfs2: convert to errseq_t based writeback error reporting for fsync
fs: convert sync_file_range to use errseq_t based error-tracking
mm: add file_fdatawait_range and file_write_and_wait
fuse: convert to errseq_t based error tracking for fsync
mm: consolidate dax / non-dax checks for writeback
Documentation: add some docs for errseq_t
errseq: rename __errseq_set to errseq_set
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZrTzyAAoJEAAOaEEZVoIVj8wP/0sOxG+7vEEpe4uj2W52aq9T
Y39/ZLfRTLm9SqgH61lkN+IyUsvDx+IP1ws2LBhp0IDRD9m40wdILhHZRWXJcRW2
ApEfmXF+rxnZZ6725ixX9w4Ylab2ZeGmKbzaG4wIjxfddftewZkJvFQcb1LZDfWq
1N0SF4KWoWN6t26Du5CHmYSj/Sz6YGrWGhF22u3mNfkGL+MmuKbz+kB3W+0q2NUF
ZjkOIH9WcRiXgSlcHPBLre2EKHqHaNgb0s4Iofd3ZEe50v1NwY/vBMefxuwRdgKS
kpLhIKIYMawrHn2rpV0jm12qdgCYj+t2kbVIUBDn3unBP2zYA0e/oo5HNIrroVlk
Q6aGwmW0LN60rpd5qcRuNS1p1h2id2HpxEe98dsski6T8CVnj/nvu7EIxmWM02cf
g2HeOd7bnl3+uu7SwSTkOVb6G7Kjn+Xufiz/n11mK6fl2jvOyWZZmDqhhjWAYJ8r
t5mQVWJdEV12+6+A1WSv9DeS3TUgdYPCF8dzDtF+JVn3WEmxYHywH36Y3hKKz+BA
gFEhnHvlyaVvpXCr8Y5BqNSfEfvZe/YUnmVReHpgBU/U4GJ17iQYk/g2vfmPLmsN
IZ2OGCrDUc/LfdWc4llRyQBvlGT1KujaT0tbN7xnuWcS2qWdsfX4jDtDUH9E6pvK
TB6Sw4Ike0ixamG8N8q/
=VPMU
-----END PGP SIGNATURE-----
Merge tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"This pile just has a few file locking fixes from Ben Coddington. There
are a couple of cleanup patches + an attempt to bring sanity to the
l_pid value that is reported back to userland on an F_GETLK request.
After a few gyrations, he came up with a way for filesystems to
communicate to the VFS layer code whether the pid should be translated
according to the namespace or presented as-is to userland"
* tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: restore a warn for leaked locks on close
fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
fs/locks: Use allocation rather than the stack in fcntl_getlk()
Pull fuse fixes from Miklos Szeredi:
"Fix a few bugs in fuse"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: set mapping error in writepage_locked when it fails
fuse: Dont call set_page_dirty_lock() for ITER_BVEC pages for async_dio
fuse: initialize the flock flag in fuse_file on allocation
This ensures that we see errors on fsync when writeback fails.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 8fba54aebb ("fuse: direct-io: don't dirty ITER_BVEC pages") fixes
the ITER_BVEC page deadlock for direct io in fuse by checking in
fuse_direct_io(), whether the page is a bvec page or not, before locking
it. However, this check is missed when the "async_dio" mount option is
enabled. In this case, set_page_dirty_lock() is called from the req->end
callback in request_end(), when the fuse thread is returning from userspace
to respond to the read request. This will cause the same deadlock because
the bvec condition is not checked in this path.
Here is the stack of the deadlocked thread, while returning from userspace:
[13706.656686] INFO: task glusterfs:3006 blocked for more than 120 seconds.
[13706.657808] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[13706.658788] glusterfs D ffffffff816c80f0 0 3006 1
0x00000080
[13706.658797] ffff8800d6713a58 0000000000000086 ffff8800d9ad7000
ffff8800d9ad5400
[13706.658799] ffff88011ffd5cc0 ffff8800d6710008 ffff88011fd176c0
7fffffffffffffff
[13706.658801] 0000000000000002 ffffffff816c80f0 ffff8800d6713a78
ffffffff816c790e
[13706.658803] Call Trace:
[13706.658809] [<ffffffff816c80f0>] ? bit_wait_io_timeout+0x80/0x80
[13706.658811] [<ffffffff816c790e>] schedule+0x3e/0x90
[13706.658813] [<ffffffff816ca7e5>] schedule_timeout+0x1b5/0x210
[13706.658816] [<ffffffff81073ffb>] ? gup_pud_range+0x1db/0x1f0
[13706.658817] [<ffffffff810668fe>] ? kvm_clock_read+0x1e/0x20
[13706.658819] [<ffffffff81066909>] ? kvm_clock_get_cycles+0x9/0x10
[13706.658822] [<ffffffff810f5792>] ? ktime_get+0x52/0xc0
[13706.658824] [<ffffffff816c6f04>] io_schedule_timeout+0xa4/0x110
[13706.658826] [<ffffffff816c8126>] bit_wait_io+0x36/0x50
[13706.658828] [<ffffffff816c7d06>] __wait_on_bit_lock+0x76/0xb0
[13706.658831] [<ffffffffa0545636>] ? lock_request+0x46/0x70 [fuse]
[13706.658834] [<ffffffff8118800a>] __lock_page+0xaa/0xb0
[13706.658836] [<ffffffff810c8500>] ? wake_atomic_t_function+0x40/0x40
[13706.658838] [<ffffffff81194d08>] set_page_dirty_lock+0x58/0x60
[13706.658841] [<ffffffffa054d968>] fuse_release_user_pages+0x58/0x70 [fuse]
[13706.658844] [<ffffffffa0551430>] ? fuse_aio_complete+0x190/0x190 [fuse]
[13706.658847] [<ffffffffa0551459>] fuse_aio_complete_req+0x29/0x90 [fuse]
[13706.658849] [<ffffffffa05471e9>] request_end+0xd9/0x190 [fuse]
[13706.658852] [<ffffffffa0549126>] fuse_dev_do_write+0x336/0x490 [fuse]
[13706.658854] [<ffffffffa054963e>] fuse_dev_write+0x6e/0xa0 [fuse]
[13706.658857] [<ffffffff812a9ef3>] ? security_file_permission+0x23/0x90
[13706.658859] [<ffffffff81205300>] do_iter_readv_writev+0x60/0x90
[13706.658862] [<ffffffffa05495d0>] ? fuse_dev_splice_write+0x350/0x350
[fuse]
[13706.658863] [<ffffffff812062a1>] do_readv_writev+0x171/0x1f0
[13706.658866] [<ffffffff810b3d00>] ? try_to_wake_up+0x210/0x210
[13706.658868] [<ffffffff81206361>] vfs_writev+0x41/0x50
[13706.658870] [<ffffffff81206496>] SyS_writev+0x56/0xf0
[13706.658872] [<ffffffff810257a1>] ? syscall_trace_leave+0xf1/0x160
[13706.658874] [<ffffffff816cbb2e>] system_call_fastpath+0x12/0x71
Fix this by making should_dirty a fuse_io_priv parameter that can be
checked in fuse_aio_complete_req().
Reported-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since commit c69899a17c "NFSv4: Update of VFS byte range lock must be
atomic with the stateid update", NFSv4 has been inserting locks in rpciod
worker context. The result is that the file_lock's fl_nspid is the
kworker's pid instead of the original userspace pid.
The fl_nspid is only used to represent the namespaced virtual pid number
when displaying locks or returning from F_GETLK. There's no reason to set
it for every inserted lock, since we can usually just look it up from
fl_pid. So, instead of looking up and holding struct pid for every lock,
let's just look up the virtual pid number from fl_pid when it is needed.
That means we can remove fl_nspid entirely.
The translaton and presentation of fl_pid should handle the following four
cases:
1 - F_GETLK on a remote file with a remote lock:
In this case, the filesystem should determine the l_pid to return here.
Filesystems should indicate that the fl_pid represents a non-local pid
value that should not be translated by returning an fl_pid <= 0.
2 - F_GETLK on a local file with a remote lock:
This should be the l_pid of the lock manager process, and translated.
3 - F_GETLK on a remote file with a local lock, and
4 - F_GETLK on a local file with a local lock:
These should be the translated l_pid of the local locking process.
Fuse was already doing the correct thing by translating the pid into the
caller's namespace. With this change we must update fuse to translate
to init's pid namespace, so that the locks API can then translate from
init's pid namespace into the pid namespace of the caller.
With this change, the locks API will expect that if a filesystem returns
a remote pid as opposed to a local pid for F_GETLK, that remote pid will
be <= 0. This signifies that the pid is remote, and the locks API will
forego translating that pid into the pid namespace of the local calling
process.
Finally, we convert remote filesystems to present remote pids using
negative numbers. Have lustre, 9p, ceph, cifs, and dlm negate the remote
pid returned for F_GETLK lock requests.
Since local pids will never be larger than PID_MAX_LIMIT (which is
currently defined as <= 4 million), but pid_t is an unsigned int, we
should have plenty of room to represent remote pids with negative
numbers if we assume that remote pid numbers are similarly limited.
If this is not the case, then we run the risk of having a remote pid
returned for which there is also a corresponding local pid. This is a
problem we have now, but this patch should reduce the chances of that
occurring, while also returning those remote pid numbers, for whatever
that may be worth.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Before the patch, the flock flag could remain uninitialized for the
lifespan of the fuse_file allocation. Unless set to true in
fuse_file_flock(), it would remain in an indeterminate state until read in
an if statement in fuse_release_common(). This could consequently lead to
taking an unexpected branch in the code.
The bug was discovered by a runtime instrumentation designed to detect use
of uninitialized memory in the kernel.
Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Fixes: 37fb3a30b4 ("fuse: fix flock")
Cc: <stable@vger.kernel.org> # v3.1+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull block fixes from Jens Axboe:
"A small collection of fixes that should go into this cycle.
- a pull request from Christoph for NVMe, which ended up being
manually applied to avoid pulling in newer bits in master. Mostly
fibre channel fixes from James, but also a few fixes from Jon and
Vijay
- a pull request from Konrad, with just a single fix for xen-blkback
from Gustavo.
- a fuseblk bdi fix from Jan, fixing a regression in this series with
the dynamic backing devices.
- a blktrace fix from Shaohua, replacing sscanf() with kstrtoull().
- a request leak fix for drbd from Lars, fixing a regression in the
last series with the kref changes. This will go to stable as well"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvmet: release the sq ref on rdma read errors
nvmet-fc: remove target cpu scheduling flag
nvme-fc: stop queues on error detection
nvme-fc: require target or discovery role for fc-nvme targets
nvme-fc: correct port role bits
nvme: unmap CMB and remove sysfs file in reset path
blktrace: fix integer parse
fuseblk: Fix warning in super_setup_bdi_name()
block: xen-blkback: add null check to avoid null pointer dereference
drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()
Commit 5f7f7543f5 "fuse: Convert to separately allocated bdi" didn't
properly handle fuseblk filesystem. When fuse_bdi_init() is called for
that filesystem type, sb->s_bdi is already initialized (by
set_bdev_super()) to point to block device's bdi and consequently
super_setup_bdi_name() complains about this fact when reseting bdi to
the private one.
Fix the problem by properly dropping bdi reference in fuse_bdi_init()
before creating a private bdi in super_setup_bdi_name().
Fixes: 5f7f7543f5 ("fuse: Convert to separately allocated bdi")
Reported-by: Rakesh Pandit <rakesh@tuxera.com>
Tested-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Highlights include:
Stable bugfixes:
- Fix use after free in write error path
- Use GFP_NOIO for two allocations in writeback
- Fix a hang in OPEN related to server reboot
- Check the result of nfs4_pnfs_ds_connect
- Fix an rcu lock leak
Features:
- Removal of the unmaintained and unused OSD pNFS layout
- Cleanup and removal of lots of unnecessary dprintk()s
- Cleanup and removal of some memory failure paths now that
GFP_NOFS is guaranteed to never fail.
- Remove the v3-only data server limitation on pNFS/flexfiles
Bugfixes:
- RPC/RDMA connection handling bugfixes
- Copy offload: fixes to ensure the copied data is COMMITed to disk.
- Readdir: switch back to using the ->iterate VFS interface
- File locking fixes from Ben Coddington
- Various use-after-free and deadlock issues in pNFS
- Write path bugfixes
-----BEGIN PGP SIGNATURE-----
iQIbBAABAgAGBQJZE0KiAAoJEGcL54qWCgDy/moP93wZ+cGnN5sC+nsirqj4eUiM
BQKKonweNQIoYRwp5B9jLTxsUMIxRasV5W3BbqEm4PUtBYXfqQ7SfLv7RboKbd4M
RJB9PS+sjx3Fxf65mhveKziwUFLvQCQ3+we0TpUga6+7SBiGlgPKBfisk7frC0nt
BbYBuGaWXMPxO0BnR8adNwqiGINPDSzB+8sgjiT8zkZLm4lrew2eV7TDvwVOguD+
S2vLPGhg1F9wu8aG731MgiSNaeCgsBP6I5D29fTTD7z1DCNMQXOoHcX8k4KwwIDB
sHRR0tVBsg+1B7WdH4y41GQ03rn3o2DHeJB5cdYGaEu4lx7CecCzt0o0dfAkNizT
5LxbQxIHPNYMeZmP2T0oD41zQyfjKqrdRSPnXi3dPD98NwaM1Lqv+Kzb/eXzupXp
vJ7859PQCa3KjQ1IFhwdXTmh53J1c8SzEDpzz7WX0R0saRyxeIJsm30MmdPqKu7Z
notjsXxrTmjIhC+0vFLey1kejFDh+b0gT6UIwoMdx39VL9AM6DVL7HsrU1kEwCdf
f8otaLcm0WoUaseF+cMtfRNGEqCMxPywwz7mEKlGiVZgyAM8VfzH+s5j6/u6ncwS
ASwRclwwPAZN97rzl0exZxuaRwFZd7oFT1zrviPWvv+0SUPuy258J6QpolUSavgi
Qh7f3QR65K+QX9QbO1g=
=7Nm2
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable bugfixes:
- Fix use after free in write error path
- Use GFP_NOIO for two allocations in writeback
- Fix a hang in OPEN related to server reboot
- Check the result of nfs4_pnfs_ds_connect
- Fix an rcu lock leak
Features:
- Removal of the unmaintained and unused OSD pNFS layout
- Cleanup and removal of lots of unnecessary dprintk()s
- Cleanup and removal of some memory failure paths now that GFP_NOFS
is guaranteed to never fail.
- Remove the v3-only data server limitation on pNFS/flexfiles
Bugfixes:
- RPC/RDMA connection handling bugfixes
- Copy offload: fixes to ensure the copied data is COMMITed to disk.
- Readdir: switch back to using the ->iterate VFS interface
- File locking fixes from Ben Coddington
- Various use-after-free and deadlock issues in pNFS
- Write path bugfixes"
* tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (89 commits)
pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled
NFSv4.1: Work around a Linux server bug...
NFS append COMMIT after synchronous COPY
NFSv4: Fix exclusive create attributes encoding
NFSv4: Fix an rcu lock leak
nfs: use kmap/kunmap directly
NFS: always treat the invocation of nfs_getattr as cache hit when noac is on
Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew
NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
pNFS: Fix a typo in pnfs_generic_alloc_ds_commits
pNFS: Fix a deadlock when coalescing writes and returning the layout
pNFS: Don't clear the layout return info if there are segments to return
pNFS: Ensure we commit the layout if it has been invalidated
pNFS: Don't send COMMITs to the DSes if the server invalidated our layout
pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path
pNFS: Ensure we check layout validity before marking it for return
NFS4.1 handle interrupted slot reuse from ERR_DELAY
NFSv4: check return value of xdr_inline_decode
nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()
...
Pull fuse updates from Miklos Szeredi:
"Support for pid namespaces from Seth and refcount_t work from Elena"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: Add support for pid namespaces
fuse: convert fuse_conn.count from atomic_t to refcount_t
fuse: convert fuse_req.count from atomic_t to refcount_t
fuse: convert fuse_file.count from atomic_t to refcount_t
Pull misc vfs updates from Al Viro:
"Assorted bits and pieces from various people. No common topic in this
pile, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/affs: add rename exchange
fs/affs: add rename2 to prepare multiple methods
Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
fs: don't set *REFERENCED on single use objects
fs: compat: Remove warning from COMPATIBLE_IOCTL
remove pointless extern of atime_need_update_rcu()
fs: completely ignore unknown open flags
fs: add a VALID_OPEN_FLAGS
fs: remove _submit_bh()
fs: constify tree_descr arrays passed to simple_fill_super()
fs: drop duplicate header percpu-rwsem.h
fs/affs: bugfix: Write files greater than page size on OFS
fs/affs: bugfix: enable writes on OFS disks
fs/affs: remove node generation check
fs/affs: import amigaffs.h
fs/affs: bugfix: make symbolic links work again
simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory. Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Set FL_CLOSE in fl_flags as in locks_remove_posix() when clearing locks.
NFS will check for this flag to ensure an unlock is sent in a following
patch.
Fuse handles flock and posix locks differently for FL_CLOSE, and so
requires a fixup to retain the existing behavior for flock.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
It is not needed anymore since bdi is initialized whenever superblock
exists.
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
When the userspace process servicing fuse requests is running in
a pid namespace then pids passed via the fuse fd are not being
translated into that process' namespace. Translation is necessary
for the pid to be useful to that process.
Since no use case currently exists for changing namespaces all
translations can be done relative to the pid namespace in use
when fuse_conn_init() is called. For fuse this translates to
mount time, and for cuse this is when /dev/cuse is opened. IO for
this connection from another namespace will return errors.
Requests from processes whose pid cannot be translated into the
target namespace will have a value of 0 for in.h.pid.
File locking changes based on previous work done by Eric
Biederman.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull vfs 'statx()' update from Al Viro.
This adds the new extended stat() interface that internally subsumes our
previous stat interfaces, and allows user mode to specify in more detail
what kind of information it wants.
It also allows for some explicit synchronization information to be
passed to the filesystem, which can be relevant for network filesystems:
is the cached value ok, or do you need open/close consistency, or what?
From David Howells.
Andreas Dilger points out that the first version of the extended statx
interface was posted June 29, 2010:
https://www.spinics.net/lists/linux-fsdevel/msg33831.html
* 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were to ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabal might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};
struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};
The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.
Remove the vma parameter to simplify things.
[arnd@arndb.de: fix ARM build]
Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct fuse_file is stored in file->private_data. Make this always be a
counting reference for consistency.
This also allows fuse_sync_release() to call fuse_file_put() instead of
partially duplicating its functionality.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_file_put() was missing the "force" flag for the RELEASE request when
sending synchronously (fuseblk).
If this flag is not set, then a sync request may be interrupted before it
is dequeued by the userspace filesystem. In this case the OPEN won't be
balanced with a RELEASE.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 5a18ec176c ("fuse: fix hang of single threaded fuseblk filesystem")
Cc: <stable@vger.kernel.org> # v2.6.38+
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
There is a potential race between fuse_dev_do_write()
and request_wait_answer() contexts as shown below:
TASK 1:
__fuse_request_send():
|--spin_lock(&fiq->waitq.lock);
|--queue_request();
|--spin_unlock(&fiq->waitq.lock);
|--request_wait_answer():
|--if (test_bit(FR_SENT, &req->flags))
<gets pre-empted after it is validated true>
TASK 2:
fuse_dev_do_write():
|--clears bit FR_SENT,
|--request_end():
|--sets bit FR_FINISHED
|--spin_lock(&fiq->waitq.lock);
|--list_del_init(&req->intr_entry);
|--spin_unlock(&fiq->waitq.lock);
|--fuse_put_request();
|--queue_interrupt();
<request gets queued to interrupts list>
|--wake_up_locked(&fiq->waitq);
|--wait_event_freezable();
<as FR_FINISHED is set, it returns and then
the caller frees this request>
Now, the next fuse_dev_do_read(), see interrupts list is not empty
and then calls fuse_read_interrupt() which tries to access the request
which is already free'd and gets the below crash:
[11432.401266] Unable to handle kernel paging request at virtual address
6b6b6b6b6b6b6b6b
...
[11432.418518] Kernel BUG at ffffff80083720e0
[11432.456168] PC is at __list_del_entry+0x6c/0xc4
[11432.463573] LR is at fuse_dev_do_read+0x1ac/0x474
...
[11432.679999] [<ffffff80083720e0>] __list_del_entry+0x6c/0xc4
[11432.687794] [<ffffff80082c65e0>] fuse_dev_do_read+0x1ac/0x474
[11432.693180] [<ffffff80082c6b14>] fuse_dev_read+0x6c/0x78
[11432.699082] [<ffffff80081d5638>] __vfs_read+0xc0/0xe8
[11432.704459] [<ffffff80081d5efc>] vfs_read+0x90/0x108
[11432.709406] [<ffffff80081d67f0>] SyS_read+0x58/0x94
As FR_FINISHED bit is set before deleting the intr_entry with input
queue lock in request completion path, do the testing of this flag and
queueing atomically with the same lock in queue_interrupt().
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: fd22d62ed0 ("fuse: no fc->lock for iqueue parts")
Cc: <stable@vger.kernel.org> # 4.2+
Since we need to change the implementation, stop exposing internals.
Provide KREF_INIT() to allow static initialization of struct kref.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit bcb6f6d2b9 ("fuse: use timespec64") introduced clamped nsec values
in time_to_jiffies but used the max of nsec and NSEC_PER_SEC - 1 instead of
the min. Because of this, dentries would stay in the cache longer than
requested and go stale in scenarios that relied on their timely eviction.
Fixes: bcb6f6d2b9 ("fuse: use timespec64")
Signed-off-by: David Sheets <dsheets@docker.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.9
fuse_abort_conn() moves requests from pending list to a temporary list
before canceling them. This operation races with request_wait_answer()
which also tries to remove the request after it gets a fatal signal. It
checks FR_PENDING flag to determine whether the request is still in the
pending list.
Make fuse_abort_conn() clear FR_PENDING flag so that request_wait_answer()
does not remove the request from temporary list.
This bug causes an Oops when trying to delete an already deleted list entry
in end_requests().
Fixes: ee314a870e ("fuse: abort: no fc->lock needed for request ending")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.2+
Pull partial readlink cleanups from Miklos Szeredi.
This is the uncontroversial part of the readlink cleanup patch-set that
simplifies the default readlink handling.
Miklos and Al are still discussing the rest of the series.
* git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
vfs: make generic_readlink() static
vfs: remove ".readlink = generic_readlink" assignments
vfs: default to generic_readlink()
vfs: replace calling i_op->readlink with vfs_readlink()
proc/self: use generic_readlink
ecryptfs: use vfs_get_link()
bad_inode: add missing i_op initializers
If .readlink == NULL implies generic_readlink().
Generated by:
to_del="\.readlink.*=.*generic_readlink"
for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Basically, the pjdfstests set the ownership of a file to 06555, and then
chowns it (as root) to a new uid/gid. Prior to commit a09f99edde ("fuse:
fix killing s[ug]id in setattr"), fuse would send down a setattr with both
the uid/gid change and a new mode. Now, it just sends down the uid/gid
change.
Technically this is NOTABUG, since POSIX doesn't _require_ that we clear
these bits for a privileged process, but Linux (wisely) has done that and I
think we don't want to change that behavior here.
This is caused by the use of should_remove_suid(), which will always return
0 when the process has CAP_FSETID.
In fact we really don't need to be calling should_remove_suid() at all,
since we've already been indicated that we should remove the suid, we just
don't want to use a (very) stale mode for that.
This patch should fix the above as well as simplify the logic.
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: a09f99edde ("fuse: fix killing s[ug]id in setattr")
Cc: <stable@vger.kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
If pos is at the beginning of a page and copied is zero then page is not
zeroed but is marked uptodate.
Fix by skipping everything except unlock/put of page if zero bytes were
copied.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 6b12c1b37e ("fuse: Implement write_begin/write_end callbacks")
Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
Pull vfs xattr updates from Al Viro:
"xattr stuff from Andreas
This completes the switch to xattr_handler ->get()/->set() from
->getxattr/->setxattr/->removexattr"
* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Remove {get,set,remove}xattr inode operations
xattr: Stop calling {get,set,remove}xattr inode operations
vfs: Check for the IOP_XATTR flag in listxattr
xattr: Add __vfs_{get,set,remove}xattr helpers
libfs: Use IOP_XATTR flag for empty directory handling
vfs: Use IOP_XATTR flag for bad-inode handling
vfs: Add IOP_XATTR inode operations flag
vfs: Move xattr_resolve_name to the front of fs/xattr.c
ecryptfs: Switch to generic xattr handlers
sockfs: Get rid of getxattr iop
sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
kernfs: Switch to generic xattr handlers
hfs: Switch to generic xattr handlers
jffs2: Remove jffs2_{get,set,remove}xattr macros
xattr: Remove unnecessary NULL attribute name check
Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.
There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...
These inode operations are no longer used; remove them.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull VFS splice updates from Al Viro:
"There's a bunch of branches this cycle, both mine and from other folks
and I'd rather send pull requests separately.
This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
(and introduction of such). Gets rid of a lot of code in fs/splice.c
and elsewhere; there will be followups, but these are for the next
cycle... Some pipe/splice-related cleanups from Miklos in the same
branch as well"
* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
pipe: fix comment in pipe_buf_operations
pipe: add pipe_buf_steal() helper
pipe: add pipe_buf_confirm() helper
pipe: add pipe_buf_release() helper
pipe: add pipe_buf_get() helper
relay: simplify relay_file_read()
switch default_file_splice_read() to use of pipe-backed iov_iter
switch generic_file_splice_read() to use of ->read_iter()
new iov_iter flavour: pipe-backed
fuse_dev_splice_read(): switch to add_to_pipe()
skb_splice_bits(): get rid of callback
new helper: add_to_pipe()
splice: lift pipe_lock out of splice_to_pipe()
splice: switch get_iovec_page_array() to iov_iter
splice_to_pipe(): don't open-code wakeup_pipe_readers()
consistent treatment of EFAULT on O_DIRECT read/write
* splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
* ->splice_read() instances do the same
* vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
arrange for waiting, looping, etc. themselves.
That should make pipe_lock the outermost one.
Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
and do_splice() are quite ugly _and_ userland code can be easily broken
by changing those. It's not even "no more than the maximal capacity of
this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe,
leave instead of waiting".
Considering how poorly these rules are documented, let's try "wait for some
space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe
and if we run into overflow, we are done".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In preparation for posix acl support, rework fuse to use xattr handlers and
the generic setxattr/getxattr/listxattr callbacks. Split the xattr code
out into it's own file, and promote symbols to module-global scope as
needed.
Functionally these changes have no impact, as fuse still uses a single
handler for all xattrs which uses the old callbacks.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Only two flags: "default_permissions" and "allow_other". All other flags
are handled via bitfields. So convert these two as well. They don't
change during the lifetime of the filesystem, so this is quite safe.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Make sure userspace filesystem is returning a well formed list of xattr
names (zero or more nonzero length, null terminated strings).
[Michael Theall: only verify in the nonzero size case]
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Store in memory pointed to by ->d_fsdata. Use ->d_init() to allocate the
storage. Need to use RCU freeing because the data is used in RCU lookup
mode.
We could cast ->d_fsdata directly on 64bit archs, but I don't think this is
worth the extra complexity.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a new INIT flag, FUSE_POSIX_ACL, for negotiating ACL support with
userspace. When it is set in the INIT response, ACL support will be
enabled. ACL support also implies "default_permissions".
When ACL support is enabled, the kernel will cache and have responsibility
for enforcing ACLs. ACL xattrs will be passed to userspace, which is
responsible for updating the ACLs in the filesystem, keeping the file mode
in sync, and inheritance of default ACLs when new filesystem nodes are
created.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Only userspace filesystem can do the killing of suid/sgid without races.
So introduce an INIT flag and negotiate support for this.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fuse allowed VFS to set mode in setattr in order to clear suid/sgid on
chown and truncate, and (since writeback_cache) write. The problem with
this is that it'll potentially restore a stale mode.
The poper fix would be to let the filesystems do the suid/sgid clearing on
the relevant operations. Possibly some are already doing it but there's no
way we can detect this.
So fix this by refreshing and recalculating the mode. Do this only if
ATTR_KILL_S[UG]ID is set to not destroy performance for writes. This is
still racy but the size of the window is reduced.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Without "default_permissions" the userspace filesystem's lookup operation
needs to perform the check for search permission on the directory.
If directory does not allow search for everyone (this is quite rare) then
userspace filesystem has to set entry timeout to zero to make sure
permissions are always performed.
Changing the mode bits of the directory should also invalidate the
(previously cached) dentry to make sure the next lookup will have a chance
of updating the timeout, if needed.
Reported-by: Jean-Pierre André <jean-pierre.andre@wanadoo.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.
Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.
CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.
Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
To avoid clearing of capabilities or security related extended
attributes too early, inode_change_ok() will need to take dentry instead
of inode. Propagate it down to fuse_do_setattr().
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
When reading from a loop device backed by a fuse file it deadlocks on
lock_page().
This is because the page is already locked by the read() operation done on
the loop device. In this case we don't want to either lock the page or
dirty it.
So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
Reported-by: Sheng Yang <sheng@yasker.org>
Fixes: aa4d86163e ("block: loop: switch to VFS ITER_BVEC")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.1+
Reviewed-by: Sheng Yang <sheng@yasker.org>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Tested-by: Sheng Yang <sheng@yasker.org>
Tested-by: Ashish Samant <ashish.samant@oracle.com>
Pull qstr constification updates from Al Viro:
"Fairly self-contained bunch - surprising lot of places passes struct
qstr * as an argument when const struct qstr * would suffice; it
complicates analysis for no good reason.
I'd prefer to feed that separately from the assorted fixes (those are
in #for-linus and with somewhat trickier topology)"
* 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
qstr: constify instances in adfs
qstr: constify instances in lustre
qstr: constify instances in f2fs
qstr: constify instances in ext2
qstr: constify instances in vfat
qstr: constify instances in procfs
qstr: constify instances in fuse
qstr constify instances in fs/dcache.c
qstr: constify instances in nfs
qstr: constify instances in ocfs2
qstr: constify instances in autofs4
qstr: constify instances in hfs
qstr: constify instances in hfsplus
qstr: constify instances in logfs
qstr: constify dentry_init_security
Pull fuse updates from Miklos Szeredi:
"This fixes error propagation from writeback to fsync/close for
writeback cache mode as well as adding a missing capability flag to
the INIT message. The rest are cleanups.
(The commits are recent but all the code actually sat in -next for a
while now. The recommits are due to conflict avoidance and the
addition of Cc: stable@...)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use filemap_check_errors()
mm: export filemap_check_errors() to modules
fuse: fix wrong assignment of ->flags in fuse_send_init()
fuse: fuse_flush must check mapping->flags for errors
fuse: fsync() did not return IO errors
fuse: don't mess with blocking signals
new helper: wait_event_killable_exclusive()
fuse: improve aio directIO write performance for size extending writes
FUSE_HAS_IOCTL_DIR should be assigned to ->flags, it may be a typo.
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 69fe05c90e ("fuse: add missing INIT flags")
Cc: <stable@vger.kernel.org>
fuse_flush() calls write_inode_now() that triggers writeback, but actual
writeback will happen later, on fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set error bit in mapping->flags. So, we have to
check mapping->flags after fuse_sync_writes().
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4d99ff8f12 ("fuse: Turn writeback cache on")
Cc: <stable@vger.kernel.org> # v3.15+
Due to implementation of fuse writeback filemap_write_and_wait_range() does
not catch errors. We have to do this directly after fuse_sync_writes()
Signed-off-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4d99ff8f12 ("fuse: Turn writeback cache on")
Cc: <stable@vger.kernel.org> # v3.15+
Merge more updates from Andrew Morton:
"The rest of MM"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits)
mm, compaction: simplify contended compaction handling
mm, compaction: introduce direct compaction priority
mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
mm, page_alloc: make THP-specific decisions more generic
mm, page_alloc: restructure direct compaction handling in slowpath
mm, page_alloc: don't retry initial attempt in slowpath
mm, page_alloc: set alloc_flags only once in slowpath
lib/stackdepot.c: use __GFP_NOWARN for stack allocations
mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
mm, kasan: account for object redzone in SLUB's nearest_obj()
mm: fix use-after-free if memory allocation failed in vma_adjust()
zsmalloc: Delete an unnecessary check before the function call "iput"
mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
mm: optimize copy_page_to/from_iter_iovec
mm: add cond_resched() to generic_swapfile_activate()
Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
mm: hwpoison: remove incorrect comments
make __section_nr() more efficient
...
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone. This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted. Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.
[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes the vfs dentry hashing to mix in the parent pointer at the
_beginning_ of the hash, rather than at the end.
That actually improves both the hash and the code generation, because we
can move more of the computation to the "static" part of the dcache
setup, and do less at lookup runtime.
It turns out that a lot of other hash users also really wanted to mix in
a base pointer as a 'salt' for the hash, and so the slightly extended
interface ends up working well for other cases too.
Users that want a string hash that is purely about the string pass in a
'salt' pointer of NULL.
* merge branch 'salted-string-hash':
fs/dcache.c: Save one 32-bit multiply in dcache lookup
vfs: make the string hashes salt the hash
->atomic_open() can be given an in-lookup dentry *or* a negative one
found in dcache. Use d_in_lookup() to tell one from another, rather
than d_unhashed().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
While sending the blocking directIO in fuse, the write request is broken
into sub-requests, each of default size 128k and all the requests are sent
in non-blocking background mode if async_dio mode is supported by libfuse.
The process which issue the write wait for the completion of all the
sub-requests. Sending multiple requests parallely gives a chance to perform
parallel writes in the user space fuse implementation if it is
multi-threaded and hence improves the performance.
When there is a size extending aio dio write, we switch to blocking mode so
that we can properly update the size of the file after completion of the
writes. However, in this situation all the sub-requests are sent in
serialized manner where the next request is sent only after receiving the
reply of the current request. Hence the multi-threaded user space
implementation is not utilized properly.
This patch changes the size extending aio dio behavior to exactly follow
blocking dio. For multi threaded fuse implementation having 10 threads and
using buffer size of 64MB to perform async directIO, we are getting double
the speed.
Signed-off-by: Ashish Sangwan <ashishsangwan2@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Negotiate with userspace filesystems whether they support parallel readdir
and lookup. Disable parallelism by default for fear of breaking fuse
filesystems.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 9902af79c0 ("parallel lookups: actual switch to rwsem")
Fixes: d9b3dbdcfd ("fuse: switch to ->iterate_shared()")
We always mixed in the parent pointer into the dentry name hash, but we
did it late at lookup time. It turns out that we can simplify that
lookup-time action by salting the hash with the parent pointer early
instead of late.
A few other users of our string hashes also wanted to mix in their own
pointers into the hash, and those are updated to use the same mechanism.
Hash users that don't have any particular initial salt can just use the
NULL pointer as a no-salt.
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
we'd hashed the new dentry and attached it to inode, we need ->setxattr()
instances getting the inode as an explicit argument rather than obtaining
it from dentry.
Similar change for ->getxattr() had been done in commit ce23e64. Unlike
->getxattr() (which is used by both selinux and smack instances of
->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
it got missed back then.
Reported-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs cleanups from Al Viro:
"More cleanups from Christoph"
* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nfsd: use RWF_SYNC
fs: add RWF_DSYNC aand RWF_SYNC
ceph: use generic_write_sync
fs: simplify the generic_write_sync prototype
fs: add IOCB_SYNC and IOCB_DSYNC
direct-io: remove the offset argument to dio_complete
direct-io: eliminate the offset argument to ->direct_IO
xfs: eliminate the pos variable in xfs_file_dio_aio_write
filemap: remove the pos argument to generic_file_direct_write
filemap: remove pos variables in generic_file_read_iter
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually
work, so eliminate the superflous argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fuse_get_user_pages() should return error or 0. Otherwise fuse_direct_io
read will not return 0 to indicate that read has completed.
Fixes: 742f992708 ("fuse: return patrial success from fuse_direct_io()")
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a user calls writev/readv in direct io mode with partially valid data
in the iovec array such that any vector other than the first one in the
array contains invalid data, we currently return the error for the invalid
iovec.
Instead, we should return the number of bytes already written/read and not
the error as we do in the non direct_io case.
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The 'reqs' member of fuse_io_priv serves two purposes. First is to track
the number of oustanding async requests to the server and to signal that
the io request is completed. The second is to be a reference count on the
structure to know when it can be freed.
For sync io requests these purposes can be at odds. fuse_direct_IO() wants
to block until the request is done, and since the signal is sent when
'reqs' reaches 0 it cannot keep a reference to the object. Yet it needs to
use the object after the userspace server has completed processing
requests. This leads to some handshaking and special casing that it
needlessly complicated and responsible for at least one race condition.
It's much cleaner and safer to maintain a separate reference count for the
object lifecycle and to let 'reqs' just be a count of outstanding requests
to the userspace server. Then we can know for sure when it is safe to free
the object without any handshaking or special cases.
The catch here is that most of the time these objects are stack allocated
and should not be freed. Initializing these objects with a single reference
that is never released prevents accidental attempts to free the objects.
Fixes: 9d5722b777 ("fuse: handle synchronous iocbs internally")
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There's a race in fuse_direct_IO(), whereby is_sync_kiocb() is called on an
iocb that could have been freed if async io has already completed. The fix
in this case is simple and obvious: cache the result before starting io.
It was discovered by KASan:
kernel: ==================================================================
kernel: BUG: KASan: use after free in fuse_direct_IO+0xb1a/0xcc0 at addr ffff88036c414390
Signed-off-by: Robert Doebbelin <robert@quobyte.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: bcba24ccdc ("fuse: enable asynchronous processing direct IO")
Cc: <stable@vger.kernel.org> # 3.10+
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull fuse updates from Miklos Szeredi:
"This adds SEEK_HOLE and SEEK_DATA support in lseek"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add support for SEEK_HOLE and SEEK_DATA in lseek
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:
- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.
The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs RCU symlink updates from Al Viro:
"Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
even if the symlink is not an embedded one.
No changes since the mailbomb on Jan 1"
* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->get_link() to delayed_call, kill ->put_link()
kill free_page_put_link()
teach nfs_get_link() to work in RCU mode
teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
teach shmem_get_link() to work in RCU mode
teach page_get_link() to work in RCU mode
replace ->follow_link() with new method that could stay in RCU mode
don't put symlink bodies in pagecache into highmem
namei: page_getlink() and page_follow_link_light() are the same thing
ufs: get rid of ->setattr() for symlinks
udf: don't duplicate page_symlink_inode_operations
logfs: don't duplicate page_symlink_inode_operations
switch befs long symlinks to page_symlink_operations
new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.
It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A useful performance improvement for accessing virtual machine images
via FUSE mount.
See https://bugzilla.redhat.com/show_bug.cgi?id=1220173 for a use-case
for glusterFS.
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
I got a report about unkillable task eating CPU. Further
investigation shows, that the problem is in the fuse_fill_write_pages()
function. If iov's first segment has zero length, we get an infinite
loop, because we never reach iov_iter_advance() call.
Fix this by calling iov_iter_advance() before repeating an attempt to
copy data from userspace.
A similar problem is described in 124d3b7041 ("fix writev regression:
pan hanging unkillable and un-straceable"). If zero-length segmend
is followed by segment with invalid address,
iov_iter_fault_in_readable() checks only first segment (zero-length),
iov_iter_copy_from_user_atomic() skips it, fails at second and
returns zero -> goto again without skipping zero-length segment.
Patch calls iov_iter_advance() before goto again: we'll skip zero-length
segment at second iteraction and iov_iter_fault_in_readable() will detect
invalid address.
Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
description.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: ea9b9907b8 ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org>
The problem is that fuse_dev_alloc() acquires an extra reference to cc.fc,
and the original ref count is never dropped.
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: cc080e9e9b ("fuse: introduce per-instance fuse_dev structure")
Cc: <stable@vger.kernel.org> # v4.2+
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct
locks API function, use the check within locks_lock_inode_wait(). This
allows for some later cleanup.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
fuse_dev_ioctl() performed fuse_get_dev() on a user-supplied fd,
leading to a type confusion issue. Fix it by checking file->f_op.
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"
[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...
Pull user namespace updates from Eric Biederman:
"Long ago and far away when user namespaces where young it was realized
that allowing fresh mounts of proc and sysfs with only user namespace
permissions could violate the basic rule that only root gets to decide
if proc or sysfs should be mounted at all.
Some hacks were put in place to reduce the worst of the damage could
be done, and the common sense rule was adopted that fresh mounts of
proc and sysfs should allow no more than bind mounts of proc and
sysfs. Unfortunately that rule has not been fully enforced.
There are two kinds of gaps in that enforcement. Only filesystems
mounted on empty directories of proc and sysfs should be ignored but
the test for empty directories was insufficient. So in my tree
directories on proc, sysctl and sysfs that will always be empty are
created specially. Every other technique is imperfect as an ordinary
directory can have entries added even after a readdir returns and
shows that the directory is empty. Special creation of directories
for mount points makes the code in the kernel a smidge clearer about
it's purpose. I asked container developers from the various container
projects to help test this and no holes were found in the set of mount
points on proc and sysfs that are created specially.
This set of changes also starts enforcing the mount flags of fresh
mounts of proc and sysfs are consistent with the existing mount of
proc and sysfs. I expected this to be the boring part of the work but
unfortunately unprivileged userspace winds up mounting fresh copies of
proc and sysfs with noexec and nosuid clear when root set those flags
on the previous mount of proc and sysfs. So for now only the atime,
read-only and nodev attributes which userspace happens to keep
consistent are enforced. Dealing with the noexec and nosuid
attributes remains for another time.
This set of changes also addresses an issue with how open file
descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of
/proc/<pid>/fd has been triggering a WARN_ON that has not been
meaningful since it was added (as all of the code in the kernel was
converted) and is not now actively wrong.
There is also a short list of issues that have not been fixed yet that
I will mention briefly.
It is possible to rename a directory from below to above a bind mount.
At which point any directory pointers below the renamed directory can
be walked up to the root directory of the filesystem. With user
namespaces enabled a bind mount of the bind mount can be created
allowing the user to pick a directory whose children they can rename
to outside of the bind mount. This is challenging to fix and doubly
so because all obvious solutions must touch code that is in the
performance part of pathname resolution.
As mentioned above there is also a question of how to ensure that
developers by accident or with purpose do not introduce exectuable
files on sysfs and proc and in doing so introduce security regressions
in the current userspace that will not be immediately obvious and as
such are likely to require breaking userspace in painful ways once
they are recognized"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
vfs: Remove incorrect debugging WARN in prepend_path
mnt: Update fs_fully_visible to test for permanently empty directories
sysfs: Create mountpoints with sysfs_create_mount_point
sysfs: Add support for permanently empty directories to serve as mount points.
kernfs: Add support for always empty directories.
proc: Allow creating permanently empty directories that serve as mount points
sysctl: Allow creating permanently empty directories that serve as mountpoints.
fs: Add helper functions for permanently empty directories.
vfs: Ignore unlocked mounts in fs_fully_visible
mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
mnt: Refactor the logic for mounting sysfs and proc in a user namespace
Pull fuse updates from Miklos Szeredi:
"This is the start of improving fuse scalability.
An input queue and a processing queue is split out from the monolithic
fuse connection, each of those having their own spinlock. The end of
the patchset adds the ability to clone a fuse connection. This means,
that instead of having to read/write requests/answers on a single fuse
device fd, the fuse daemon can have multiple distinct file descriptors
open. Each of those can be used to receive requests and send answers,
currently the only constraint is that a request must be answered on
the same fd as it was read from.
This can be extended further to allow binding a device clone to a
specific CPU or NUMA node.
Based on a patchset by Srinivas Eeda and Ashish Samant. Thanks to
Ashish for the review of this series"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (40 commits)
fuse: update MAINTAINERS entry
fuse: separate pqueue for clones
fuse: introduce per-instance fuse_dev structure
fuse: device fd clone
fuse: abort: no fc->lock needed for request ending
fuse: no fc->lock for pqueue parts
fuse: no fc->lock in request_end()
fuse: cleanup request_end()
fuse: request_end(): do once
fuse: add req flag for private list
fuse: pqueue locking
fuse: abort: group pqueue accesses
fuse: cleanup fuse_dev_do_read()
fuse: move list_del_init() from request_end() into callers
fuse: duplicate ->connected in pqueue
fuse: separate out processing queue
fuse: simplify request_wait()
fuse: no fc->lock for iqueue parts
fuse: allow interrupt queuing without fc->lock
fuse: iqueue locking
...
This allows for better documentation in the code and
it allows for a simpler and fully correct version of
fs_fully_visible to be written.
The mount points converted and their filesystems are:
/sys/hypervisor/s390/ s390_hypfs
/sys/kernel/config/ configfs
/sys/kernel/debug/ debugfs
/sys/firmware/efi/efivars/ efivarfs
/sys/fs/fuse/connections/ fusectl
/sys/fs/pstore/ pstore
/sys/kernel/tracing/ tracefs
/sys/fs/cgroup/ cgroup
/sys/kernel/security/ securityfs
/sys/fs/selinux/ selinuxfs
/sys/fs/smackfs/ smackfs
Cc: stable@vger.kernel.org
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Make each fuse device clone refer to a separate processing queue. The only
constraint on userspace code is that the request answer must be written to
the same device clone as it was read off.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Allow fuse device clones to refer to be distinguished. This patch just
adds the infrastructure by associating a separate "struct fuse_dev" with
each clone.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Allow an open fuse device to be "cloned". Userspace can create a clone by:
newfd = open("/dev/fuse", O_RDWR)
ioctl(newfd, FUSE_DEV_IOC_CLONE, &oldfd);
At this point newfd will refer to the same fuse connection as oldfd.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
In fuse_abort_conn() when all requests are on private lists we no longer
need fc->lock protection.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
No longer need to call request_end() with the connection lock held. We
still protect the background counters and queue with fc->lock, so acquire
it if necessary.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Now that we atomically test having already done everything we no longer
need other protection.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
When the connection is aborted it is possible that request_end() will be
called twice. Use atomic test and set to do the actual ending only once.
test_and_set_bit() also provides the necessary barrier semantics so no
explicit smp_wmb() is necessary.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
When an unlocked request is aborted, it is moved from fpq->io to a private
list. Then, after unlocking fpq->lock, the private list is processed and
the requests are finished off.
To protect the private list, we need to mark the request with a flag, so if
in the meantime the request is unlocked the list is not corrupted.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Add a fpq->lock for protecting members of struct fuse_pqueue and FR_LOCKED
request flag.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
- locked list_add() + list_del_init() cancel out
- common handling of case when request is ended here in the read phase
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
This will allow checking ->connected just with the processing queue lock.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
This is just two fields: fc->io and fc->processing.
This patch just rearranges the fields, no functional change.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
wait_event_interruptible_exclusive_locked() will do everything
request_wait() does, so replace it.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Interrupt is only queued after the request has been sent to userspace.
This is either done in request_wait_answer() or fuse_dev_do_read()
depending on which state the request is in at the time of the interrupt.
If it's not yet sent, then queuing the interrupt is postponed until the
request is read. Otherwise (the request has already been read and is
waiting for an answer) the interrupt is queued immedidately.
We want to call queue_interrupt() without fc->lock protection, in which
case there can be a race between the two functions:
- neither of them queue the interrupt (thinking the other one has already
done it).
- both of them queue the interrupt
The first one is prevented by adding memory barriers, the second is
prevented by checking (under fiq->waitq.lock) if the interrupt has already
been queued.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Use fiq->waitq.lock for protecting members of struct fuse_iqueue and
FR_PENDING request flag, previously protected by fc->lock.
Following patches will remove fc->lock protection from these members.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
This will allow checking ->connected just with the input queue lock.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
The input queue contains normal requests (fc->pending), forgets
(fc->forget_*) and interrupts (fc->interrupts). There's also fc->waitq and
fc->fasync for waking up the readers of the fuse device when a request is
available.
The fc->reqctr is also moved to the input queue (assigned to the request
when the request is added to the input queue.
This patch just rearranges the fields, no functional change.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Use flags for representing the state in fuse_req. This is needed since
req->list will be protected by different locks in different states, hence
we'll want the state itself to be split into distinct bits, each protected
with the relevant lock in that state.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
FUSE_REQ_INIT is actually the same state as FUSE_REQ_PENDING and
FUSE_REQ_READING and FUSE_REQ_WRITING can be merged into a common
FUSE_REQ_IO state.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Only hold fc->lock over sections of request_wait_answer() that actually
need it. If wait_event_interruptible() returns zero, it means that the
request finished. Need to add memory barriers, though, to make sure that
all relevant data in the request is synchronized.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Since it's a 64bit counter, it's never gonna wrap around. Remove code
dealing with that possibility.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Splice fc->pending and fc->processing lists into a common kill list while
holding fc->lock.
By the time we release fc->lock, pending and processing lists are empty and
the io list contains only locked requests.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Finer grained locking will mean there's no single lock to protect
modification of bitfileds in fuse_req.
So move to using bitops. Can use the non-atomic variants for those which
happen while the request definitely has only one reference.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
- don't end the request while req->locked is true
- make unlock_request() return an error if the connection was aborted
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
fuse_abort_conn() does all the work done by fuse_dev_release() and more.
"More" consists of:
end_io_requests(fc);
wake_up_all(&fc->waitq);
kill_fasync(&fc->fasync, SIGIO, POLL_IN);
All of which should be no-op (WARN_ON's added).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
And the same with fuse_request_send_nowait_locked().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
fc->conn_error is set once in FUSE_INIT reply and never cleared. Check it
in request allocation, there's no sense in doing all the preparation if
sending will surely fail.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Move accounting of fc->num_waiting to the point where the request actually
starts waiting. This is earlier than the current queue_request() for
background requests, since they might be waiting on the fc->bg_queue before
being queued on fc->pending.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Reset req->waiting in fuse_put_request(). This is needed for correct
accounting in fc->num_waiting for reserved requests.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
request_end() expects fc->num_background and fc->active_background to have
been incremented, which is not the case in fuse_request_send_nowait()
failure path. So instead just call the ->end() callback (which is actually
set by all callers).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
fc->release is called from fuse_conn_put() which was used in the error
cleanup before fc->release was initialized.
[Jeremiah Mahler <jmmahler@gmail.com>: assign fc->release after calling
fuse_conn_init(fc) instead of before.]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fixes: a325f9b922 ("fuse: update fuse_conn_init() and separate out fuse_conn_kill()")
Cc: <stable@vger.kernel.org> #v2.6.31+
Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.
This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade, finally there's a solution and code to go with it.
Also see last weeks writeup on LWN:
http://lwn.net/Articles/648292/"
* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...
file_remove_suid() is a misnomer since it removes also file capabilities
stored in xattrs and sets S_NOSEC flag. Also should_remove_suid() tells
something else than whether file_remove_suid() call is necessary which
leads to bugs.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bdi->bdi_stat[] into wb.
* enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
enums is changed from BDI_ to WB_.
* BDI_STAT_BATCH() -> WB_STAT_BATCH()
* [__]{add|inc|dec|sum}_wb_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...)
* bdi_stat[_error]() -> wb_stat[_error]()
* bdi_writeout_inc() -> wb_writeout_inc()
* stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
frees stat.
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
introducing no behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks. Stored pointer
is ignored in all cases except the last one.
Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().
b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is. In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success. Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
fs/ext4/file.c
already done by caller. We used to call __fuse_direct_write(), which
called generic_write_checks(); now the former got expanded, bringing
the latter to the surface. It used to be called all along and calling
it from there had been wrong all along...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The misc subsystem (which is used for /dev/fuse) initializes private_data to
point to the misc device when a driver has registered a custom open file
operation, and initializes it to NULL when a custom open file operation has
*not* been provided.
This subtle quirk is confusing, to the point where kernel code registers
*empty* file open operations to have private_data point to the misc device
structure. And it leads to bugs, where the addition or removal of a custom open
file operation surprisingly changes the initial contents of a file's
private_data structure.
So to simplify things in the misc subsystem, a patch [1] has been proposed to
*always* set the private_data to point to the misc device, instead of only
doing this when a custom open file operation has been registered.
But before this patch can be applied we need to modify drivers that make the
assumption that a misc device file's private_data is initialized to NULL
because they didn't register a custom open file operation, so they don't rely
on this assumption anymore. FUSE uses private_data to store the fuse_conn and
errors out if this is not initialized to NULL at mount time.
Hence, we now set a file's private_data to NULL explicitly, to be independent
of whatever value the misc subsystem initializes it to by default.
[1] https://lkml.org/lkml/2014/12/4/939
Reported-by: Giedrius Statkevicius <giedriuswork@gmail.com>
Reported-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Most callers in the kernel want to perform synchronous file I/O, but
still have to bloat the stack with a full struct kiocb. Split out
the parts needed in filesystem code from those in the aio code, and
only allocate those needed to pass down argument on the stack. The
aio code embedds the generic iocb in the one it allocates and can
easily get back to it by using container_of.
Also add a ->ki_complete method to struct kiocb, this is used to call
into the aio code and thus removes the dependency on aio for filesystems
impementing asynchronous operations. It will also allow other callers
to substitute their own completion callback.
We also add a new ->ki_flags field to work around the nasty layering
violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add
eventfd notification about ffs events").
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Based on a patch from Maxim Patlasov <MPatlasov@parallels.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Regular pipe buffers' ->steal method (generic_pipe_buf_steal()) doesn't set
PG_uptodate.
Don't warn on this condition, just set the uptodate flag.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
fuse_try_move_page() is not prepared for replacing pages that have already
been read.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
Convert the following where appropriate:
(1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).
(2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).
(3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry). This is actually more
complicated than it appears as some calls should be converted to
d_can_lookup() instead. The difference is whether the directory in
question is a real dir with a ->lookup op or whether it's a fake dir with
a ->d_automount op.
In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).
Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer. In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.
However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.
There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE. Strictly, this was
intended for special directory entry types that don't have attached inodes.
The following perl+coccinelle script was used:
use strict;
my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
print "No matches\n";
exit(0);
}
my @cocci = (
'@@',
'expression E;',
'@@',
'',
'- S_ISLNK(E->d_inode->i_mode)',
'+ d_is_symlink(E)',
'',
'@@',
'expression E;',
'@@',
'',
'- S_ISDIR(E->d_inode->i_mode)',
'+ d_is_dir(E)',
'',
'@@',
'expression E;',
'@@',
'',
'- S_ISREG(E->d_inode->i_mode)',
'+ d_is_reg(E)' );
my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);
foreach my $file (@callers) {
chomp $file;
print "Processing ", $file, "\n";
system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
die "spatch failed";
}
[AV: overlayfs parts skipped]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull backing device changes from Jens Axboe:
"This contains a cleanup of how the backing device is handled, in
preparation for a rework of the life time rules. In this part, the
most important change is to split the unrelated nommu mmap flags from
it, but also removing a backing_dev_info pointer from the
address_space (and inode), and a cleanup of other various minor bits.
Christoph did all the work here, I just fixed an oops with pages that
have a swap backing. Arnd fixed a missing export, and Oleg killed the
lustre backing_dev_info from staging. Last patch was from Al,
unexporting parts that are now no longer needed outside"
* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
Make super_blocks and sb_lock static
mtd: export new mtd_mmap_capabilities
fs: make inode_to_bdi() handle NULL inode
staging/lustre/llite: get rid of backing_dev_info
fs: remove default_backing_dev_info
fs: don't reassign dirty inodes to default_backing_dev_info
nfs: don't call bdi_unregister
ceph: remove call to bdi_unregister
fs: remove mapping->backing_dev_info
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
nilfs2: set up s_bdi like the generic mount_bdev code
block_dev: get bdev inode bdi directly from the block device
block_dev: only write bdev inode on close
fs: introduce f_op->mmap_capabilities for nommu mmap support
fs: kill BDI_CAP_SWAP_BACKED
fs: deduplicate noop_backing_dev_info
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case. Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Theoretically we need to order setting of various fields in fc with
fc->initialized.
No known bug reports related to this yet.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Analysis from Marc:
"Commit 7078187a79 ("fuse: introduce fuse_simple_request() helper")
from the above pull request triggers some EIO errors for me in some tests
that rely on fuse
Looking at the code changes and a bit of debugging info I think there's a
general problem here that fuse_get_req checks and possibly waits for
fc->initialized, and this was always called first. But this commit
changes the ordering and in many places fc->minor is now possibly used
before fuse_get_req, and we can't be sure that fc has been initialized.
In my case fuse_lookup_init sets req->out.args[0].size to the wrong size
because fc->minor at that point is still 0, leading to the EIO error."
Fix by moving the compat adjustments into fuse_simple_request() to after
fuse_get_req().
This is also more readable than the original, since now compatibility is
handled in a single function instead of cluttering each operation.
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fixes: 7078187a79 ("fuse: introduce fuse_simple_request() helper")
Pull fuse update from Miklos Szeredi:
"The first part makes sure we don't hold up umount with pending async
requests. In addition to being a cleanup, this is a small behavioral
change (for the better) and unlikely to break anything.
The second part prepares for a cleanup of the fuse device I/O code by
adding a helper for simple request submission, with some savings in
line numbers already realized"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use file_inode() in fuse_file_fallocate()
fuse: introduce fuse_simple_request() helper
fuse: reduce max out args
fuse: hold inode instead of path after release
fuse: flush requests on umount
fuse: don't wake up reserved req in fuse_conn_kill()
The following pattern is repeated many times:
req = fuse_get_req_nopages(fc);
/* Initialize req->(in|out).args */
fuse_request_send(fc, req);
err = req->out.h.error;
fuse_put_request(req);
Create a new replacement helper:
/* Initialize args */
err = fuse_simple_request(fc, &args);
In addition to reducing the code size, this will ease moving from the
complex arg-based to a simpler page-based I/O on the fuse device.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
path_put() in release could trigger a DESTROY request in fuseblk. The
possible deadlock was worked around by doing the path_put() with
schedule_work().
This complexity isn't needed if we just hold the inode instead of the path.
Since we now flush all requests before destroying the super block we can be
sure that all held inodes will be dropped.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Use fuse_abort_conn() instead of fuse_conn_kill() in fuse_put_super().
This flushes and aborts requests still on any queues. But since we've
already reset fc->connected, those requests would not be useful anyway and
would be flushed when the fuse device is closed.
Next patches will rely on requests being flushed before the superblock is
destroyed.
Use fuse_abort_conn() in cuse_process_init_reply() too, since it makes no
difference there, and we can get rid of fuse_conn_kill().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Waking up reserved_req_waitq from fuse_conn_kill() doesn't make sense since
we aren't chaging ff->reserved_req here, which is what this waitqueue
signals.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Now that d_invalidate can no longer fail, stop returning a useless
return code. For the few callers that checked the return code update
remove the handling of d_invalidate failure.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that check_submounts_and_drop can not fail and is called from
d_invalidate there is no longer a need to call check_submounts_and_drom
from filesystem d_revalidate methods so remove it.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The third argument of fuse_get_user_pages() "nbytesp" refers to the number of
bytes a caller asked to pack into fuse request. This value may be lesser
than capacity of fuse request or iov_iter. So fuse_get_user_pages() must
ensure that *nbytesp won't grow.
Now, when helper iov_iter_get_pages() performs all hard work of extracting
pages from iov_iter, it can be done by passing properly calculated
"maxsize" to the helper.
The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
this capability, so pass LONG_MAX as the maxsize argument here.
Fixes: c9c37e2e63 ("fuse: switch to iov_iter_get_pages()")
Reported-by: Werner Baumann <werner.baumann@onlinehome.de>
Tested-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig suggests:
1) make vfs_rename call ->rename2 if it exists instead of ->rename
2) switch all filesystems that you're adding NOREPLACE support for to
use ->rename2
3) see how many ->rename instances we'll have left after a few
iterations of 2.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Here some additional changes to set a capability flag so that clients can
detect when it's appropriate to return -ENOSYS from open.
This amends the following commit introduced in 3.14:
7678ac5061 fuse: support clients that don't implement 'open'
However we can only add the flag to 3.15 and later since there was no
protocol version update in 3.14.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org> # v3.15+
Pull fuse fixes from Miklos Szeredi:
"This contains miscellaneous fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: replace count*size kzalloc by kcalloc
fuse: release temporary page if fuse_writepage_locked() failed
fuse: restructure ->rename2()
fuse: avoid scheduling while atomic
fuse: handle large user and group ID
fuse: inode: drop cast
fuse: ignore entry-timeout on LOOKUP_REVAL
fuse: timeout comparison fix
Make ->rename2() universal, i.e. able to handle zero flags. This is to
make future change of the API easier.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
As reported by Richard Sharpe, an attempt to use fuse_notify_inval_entry()
triggers complains about scheduling while atomic:
BUG: scheduling while atomic: fuse.hf/13976/0x10000001
This happens because fuse_notify_inval_entry() attempts to allocate memory
with GFP_KERNEL, holding "struct fuse_copy_state" mapped by kmap_atomic().
Introduced by commit 58bda1da4b "fuse/dev: use atomic maps"
Fix by moving the map/unmap to just cover the actual memcpy operation.
Original patch from Maxim Patlasov <mpatlasov@parallels.com>
Reported-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org> # v3.15+
If the number in "user_id=N" or "group_id=N" mount options was larger than
INT_MAX then fuse returned EINVAL.
Fix this to handle all valid uid/gid values.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
This patch removes the cast on data of type void * as it is not needed.
The following Coccinelle semantic patch was used for making the change:
@r@
expression x;
void* e;
type T;
identifier f;
@@
(
*((T *)e)
|
((T *)x)[...]
|
((T *)x)->f
|
- (T *)
e
)
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The following test case demonstrates the bug:
sh# mount -t glusterfs localhost:meta-test /mnt/one
sh# mount -t glusterfs localhost:meta-test /mnt/two
sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
bash: /mnt/one/file: Stale file handle
sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file
On the second open() on /mnt/one, FUSE would have used the old
nodeid (file handle) trying to re-open it. Gluster is returning
-ESTALE. The ESTALE propagates back to namei.c:filename_lookup()
where lookup is re-attempted with LOOKUP_REVAL. The right
behavior now, would be for FUSE to ignore the entry-timeout and
and do the up-call revalidation. Instead FUSE is ignoring
LOOKUP_REVAL, succeeding the revalidation (because entry-timeout
has not passed), and open() is again retried on the old file
handle and finally the ESTALE is going back to the application.
Fix: if revalidation is happening with LOOKUP_REVAL, then ignore
entry-timeout and always do the up-call.
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
As suggested by checkpatch.pl, use time_before64() instead of direct
comparison of jiffies64 values.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org>
Pull vfs updates from Al Viro:
"This the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.
The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.
The primary APIs of concern in this care are the following and are used
by most filesystems.
find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin
All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.
Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.
The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.
The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.
The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.
async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)
The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.
samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cold is a bool, make it one. Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this is
preferred.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the fl_owner isn't set for flock locks. Some filesystems use
byte-range locks to simulate flock locks and there is a common idiom in
those that does:
fl->fl_owner = (fl_owner_t)filp;
fl->fl_start = 0;
fl->fl_end = OFFSET_MAX;
Since flock locks are generally "owned" by the open file description,
move this into the common flock lock setup code. The fl_start and fl_end
fields are already set appropriately, so remove the unneeded setting of
that in flock ops in those filesystems as well.
Finally, the lease code also sets the fl_owner as if they were owned by
the process and not the open file description. This is incorrect as
leases have the same ownership semantics as flock locks. Set them the
same way. The lease code doesn't actually use the fl_owner value for
anything, so this is more for consistency's sake than a bugfix.
Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (Staging portion)
Acked-by: J. Bruce Fields <bfields@fieldses.org>
New variant of iov_iter - ITER_BVEC in iter->type, backed with
bio_vec array instead of iovec one. Primitives taught to deal
with such beasts, __swap_write() switched to using that kind
of iov_iter.
Note that bio_vec is just a <page, offset, length> triple - there's
nothing block-specific about it. I've left the definition where it
was, but took it from under ifdef CONFIG_BLOCK.
Next target: ->splice_write()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now It Can Be Done(tm) - we don't need to do iov_shorten() in
generic_file_direct_write() anymore, now that all ->direct_IO()
instances are converted to proper iov_iter methods and honour
iter->count and iter->iov_offset properly.
Get rid of count/ocount arguments of generic_file_direct_write(),
while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
counts the pages covered by iov_iter, up to given limit.
do_block_direct_io() and fuse_iter_npages() switched to
it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For now, just use the same thing we pass to ->direct_IO() - it's all
iovec-based at the moment. Pass it explicitly to iov_iter_init() and
account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
uses.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
all callers of ->aio_read() and ->aio_write() have iov/nr_segs already
checked - generic_segment_checks() done after that is just an odd way
to spell iov_length().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The patch addresses two use-cases when the flag may be safely cleared:
1. fuse_do_setattr() is called with ATTR_CTIME flag set in attr->ia_valid.
In this case attr->ia_ctime bears actual value. In-kernel fuse must send it
to the userspace server and then assign the value to inode->i_ctime.
2. fuse_do_setattr() is called with ATTR_SIZE flag set in attr->ia_valid,
whereas ATTR_CTIME is not set (truncate(2)).
In this case in-kernel fuse must sent "now" to the userspace server and then
assign the value to inode->i_ctime.
In both cases we could clear I_DIRTY_SYNC, but that needs more thought.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Let the kernel maintain i_ctime locally: update i_ctime explicitly on
truncate, fallocate, open(O_TRUNC), setxattr, removexattr, link, rename,
unlink.
The inode flag I_DIRTY_SYNC serves as indication that local i_ctime should
be flushed to the server eventually. The patch sets the flag and updates
i_ctime in course of operations listed above.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch extends fuse_setattr_in, and extends the flush procedure
(fuse_flush_times()) called on ->write_inode() to send the ctime as well as
mtime.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Allow userspace fs to specify time granularity.
This is needed because with writeback_cache mode the kernel is responsible
for generating mtime and ctime, but if the underlying filesystem doesn't
support nanosecond granularity then the cache will contain a different
value from the one stored on the filesystem resulting in a change of times
after a cache flush.
Make the default granularity 1s.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
...and flush mtime from this. This allows us to use the kernel
infrastructure for writing out dirty metadata (mtime at this point, but
ctime in the next patches and also maybe atime).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Don't need to start I/O twice (once without i_mutex and one within).
Also make sure that even if the userspace filesystem doesn't support FSYNC
we do all the steps other than sending the message.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
In case of fc->atomic_o_trunc is set, fuse does nothing in
fuse_do_setattr() while handling open(O_TRUNC). Hence, i_mtime must be
updated explicitly in fuse_finish_open(). The patch also adds extra locking
encompassing open(O_TRUNC) operation to avoid races between the truncation
and updating i_mtime.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Handling truncate(2), VFS doesn't set ATTR_MTIME bit in iattr structure;
only ATTR_SIZE bit is set. In-kernel fuse must handle the case by setting
mtime fields of struct fuse_setattr_in to "now" and set FATTR_MTIME bit
even though ATTR_MTIME was not set.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
When inode is in I_NEW state, inode->i_mode is not initialized yet. Do not
use it before fuse_init_inode() is called.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.
Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.
This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.
It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a staging driver; fix included. Greg KH said he'd take the patch
but hadn't as the merge window opened, so it's included here
to avoid breaking build.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTQMH9AAoJENkgDmzRrbjxo4UP/jwlenP44v+RFpo/dn8Z8E2n
SREQscU5ZZKvuyFD6kUdvOz8YC/nTrJvXoVkMUF05GVbuvb8/8UPtT9ECVemd0rW
xNy4aFfv9rbrqRLBLpLK9LAgTuhwlbTgGxgL78zRn3hWmf1hBZWCY+cEvKM8l/+9
oEQdORL0sUpZh7iryAeGqbOrXT4gqJEvSLOFwiYTSo6ryzWIilmdXSUAh6s8MIEX
PR1+oH9J8B6J29lcXKMf8/sDI1EBUeSLdBmMCuN5Y7xpYxsQLroVx94kPbdBY+XK
ZRoYuUGSUJfGRZY46cFKApIGeF07z1DGoyXghbSWEQrI+23TMUmrKUg47LSukE4Y
yCUf8HAtqIA3gVc9GKDdSp/2UpkAhTTv5ogKgnIzs1InWtOIBdDRSVUQXDosFEXw
6ZZe1pQs2zfXyXxO4j0Wq36K4RgI0aqOVw+dcC+w5BidjVylgnYRV0PSDd72tid7
bIfnjDbUBo+o4LanPNGYK474KyO7AslgTE50w6zwbJzgdwCQ36hCpKqScBZzm60a
42LrgTVoIHHWAL1tDzWL/LzWflZGdJAezzNje0/f2Q3bGMiNHWoljAvUphkTZ7qt
E8+jWqmM+riH3e8Y5wKpO1BKt7NGHISEy//bUlnqTwisjIzVILZ6VjfugQ1AI+0x
llTXPBotFvfvXqxunBg7
=yzUO
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
"Nothing major: the stricter permissions checking for sysfs broke a
staging driver; fix included. Greg KH said he'd take the patch but
hadn't as the merge window opened, so it's included here to avoid
breaking build"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
staging: fix up speakup kobject mode
Use 'E' instead of 'X' for unsigned module taint flag.
VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.
kallsyms: fix percpu vars on x86-64 with relocation.
kallsyms: generalize address range checking
module: LLVMLinux: Remove unused function warning from __param_check macro
Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
module: remove MODULE_GENERIC_TABLE
module: allow multiple calls to MODULE_DEVICE_TABLE() per module
module: use pr_cont
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
+165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
DrPKlXwIgva7zq5/S0Vr
=4s1Z
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
ext4: fix premature freeing of partial clusters split across leaf blocks
ext4: remove unneeded test of ret variable
ext4: fix comment typo
ext4: make ext4_block_zero_page_range static
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
ext4: optimize Hurd tests when reading/writing inodes
ext4: kill i_version support for Hurd-castrated file systems
ext4: each filesystem creates and uses its own mb_cache
fs/mbcache.c: doucple the locking of local from global data
fs/mbcache.c: change block and index hash chain to hlist_bl_node
ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
ext4: refactor ext4_fallocate code
ext4: Update inode i_size after the preallocation
ext4: fix partial cluster handling for bigalloc file systems
ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
ext4: only call sync_filesystm() when remounting read-only
fs: push sync_filesystem() down to the file system's remount_fs()
jbd2: improve error messages for inconsistent journal heads
jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
jbd2: minimize region locked by j_list_lock in journal_get_create_access()
...
Pull fuse update from Miklos Szeredi:
"This series adds cached writeback support to fuse, improving write
throughput"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix "uninitialized variable" warning
fuse: Turn writeback cache on
fuse: Fix O_DIRECT operations vs cached writeback misorder
fuse: fuse_flush() should wait on writeback
fuse: Implement write_begin/write_end callbacks
fuse: restructure fuse_readpage()
fuse: Flush files on wb close
fuse: Trust kernel i_mtime only
fuse: Trust kernel i_size only
fuse: Connection bit for enabling writeback
fuse: Prepare to handle short reads
fuse: Linking file to inode helper
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page. As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently. At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.
Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate. Reclaim will check
for this flag before installing shadow pages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following warning:
In file included from include/linux/fs.h:16:0,
from fs/fuse/fuse_i.h:13,
from fs/fuse/file.c:9:
fs/fuse/file.c: In function 'fuse_file_poll':
include/linux/rbtree.h:82:28: warning: 'parent' may be used
uninitialized in this function [-Wmaybe-uninitialized]
fs/fuse/file.c:2592:27: note: 'parent' was declared here
Signed-off-by: Rajat Jain <rajatxjain@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Introduce a bit kernel and userspace exchange between each-other on
the init stage and turn writeback on if the userspace want this and
mount option 'allow_wbcache' is present (controlled by fusermount).
Also add each writable file into per-inode write list and call the
generic_file_aio_write to make use of the Linux page cache engine.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The problem is:
1. write cached data to a file
2. read directly from the same file (via another fd)
The 2nd operation may read stale data, i.e. the one that was in a file
before the 1st op. Problem is in how fuse manages writeback.
When direct op occurs the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits w/o waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operation.
Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The aim of .flush fop is to hint file-system that flushing its state or caches
or any other important data to reliable storage would be desirable now.
fuse_flush() passes this hint by sending FUSE_FLUSH request to userspace.
However, dirty pages and pages under writeback may be not visible to userspace
yet if we won't ensure it explicitly.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The .write_begin and .write_end are requiered to use generic routines
(generic_file_aio_write --> ... --> generic_perform_write) for buffered
writes.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Move the code filling and sending read request to a separate function. Future
patches will use it for .write_begin -- partial modification of a page
requires reading the page from the storage very similarly to what fuse_readpage
does.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.
filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays actual close until all fuse wb is
completed.
In case of "write cache" turned off, the flush is ensured by fuse_vma_close().
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Let the kernel maintain i_mtime locally:
- clear S_NOCMTIME
- implement i_op->update_time()
- flush mtime on fsync and last close
- update i_mtime explicitly on truncate and fallocate
Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime
should be flushed to the server eventually.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Make fuse think that when writeback is on the inode's i_size is always
up-to-date and not update it with the value received from the userspace.
This is done because the page cache code may update i_size without letting
the FS know.
This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.
fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Off (0) by default. Will be used in the next patches and will be turned
on at the very end.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
A helper which gets called when read reports less bytes than was requested.
See patch "trust kernel i_size only" for details.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
When writeback is ON every writeable file should be in per-inode write list,
not only mmap-ed ones. Thus introduce a helper for this linkage.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Summary of http://lkml.org/lkml/2014/3/14/363 :
Ted: module_param(queue_depth, int, 444)
Joe: 0444!
Rusty: User perms >= group perms >= other perms?
Joe: CLASS_ATTR, DEVICE_ATTR, SENSOR_ATTR and SENSOR_ATTR_2?
Side effect of stricter permissions means removing the unnecessary
S_IFREG from several callers.
Note that the BUILD_BUG_ON_ZERO((perm) & 2) test was removed: a fair
number of drivers fail this test, so that will be the debate for a
future patch.
Suggested-by: Joe Perches <joe@perches.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> for drivers/pci/slot.c
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Pull vfs updates from Al Viro:
"Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
assorted cleanups and fixes all over the place...
There will be another pile later this week"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
__dentry_path() fixes
vfs: Remove second variable named error in __dentry_path
vfs: Is mounted should be testing mnt_ns for NULL or error.
Fix race when checking i_size on direct i/o read
hfsplus: remove can_set_xattr
nfsd: use get_acl and ->set_acl
fs: remove generic_acl
nfs: use generic posix ACL infrastructure for v3 Posix ACLs
gfs2: use generic posix ACL infrastructure
jfs: use generic posix ACL infrastructure
xfs: use generic posix ACL infrastructure
reiserfs: use generic posix ACL infrastructure
ocfs2: use generic posix ACL infrastructure
jffs2: use generic posix ACL infrastructure
hfsplus: use generic posix ACL infrastructure
f2fs: use generic posix ACL infrastructure
ext2/3/4: use generic posix ACL infrastructure
btrfs: use generic posix ACL infrastructure
fs: make posix_acl_create more useful
fs: make posix_acl_chmod more useful
...
So far I've had one ACK for this, and no other comments. So I think it
is probably time to send this via some suitable tree. I'm guessing that
the vfs tree would be the most appropriate route, but not sure that
there is one at the moment (don't see anything recent at kernel.org)
so in that case I think -mm is the "back up plan". Al, please let me
know if you will take this?
Steve.
---------------------
Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.
Removing the i_size test from the direct i/o read path at vfs level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:
a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue
b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side.
c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.
The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.
There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
its still an overall improvement.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
open/release operations require userspace transitions to keep track
of the open count and to perform any FS-specific setup. However,
for some purely read-only FSs which don't need to perform any setup
at open/release time, we can avoid the performance overhead of
calling into userspace for open/release calls.
This patch adds the necessary support to the fuse kernel modules to prevent
open/release operations from hitting in userspace. When the client returns
ENOSYS, we avoid sending the subsequent release to userspace, and also
remember this so that future opens also don't trigger a userspace
operation.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Various read operations (e.g. readlink, readdir) invalidate the cached
attrs for atime changes. This patch adds a new function
'fuse_invalidate_atime', which checks for a read-only super block and
avoids the attr invalidation in that case.
Signed-off-by: Andrew Gallagher <andrewjcg@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
As noticed by Coverity the "num != 0" condition never triggers. Instead it
should check for a complete page.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Having this struct in module memory could Oops when if the module is
unloaded while the buffer still persists in a pipe.
Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal
merge them into nosteal_pipe_buf_ops (this is the same as
default_pipe_buf_ops except stealing the page from the buffer is not
allowed).
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
Pull vfs updates from Al Viro:
"All kinds of stuff this time around; some more notable parts:
- RCU'd vfsmounts handling
- new primitives for coredump handling
- files_lock is gone
- Bruce's delegations handling series
- exportfs fixes
plus misc stuff all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
ecryptfs: ->f_op is never NULL
locks: break delegations on any attribute modification
locks: break delegations on link
locks: break delegations on rename
locks: helper functions for delegation breaking
locks: break delegations on unlink
namei: minor vfs_unlink cleanup
locks: implement delegations
locks: introduce new FL_DELEG lock flag
vfs: take i_mutex on renamed file
vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
vfs: don't use PARENT/CHILD lock classes for non-directories
vfs: pull ext4's double-i_mutex-locking into common code
exportfs: fix quadratic behavior in filehandle lookup
exportfs: better variable name
exportfs: move most of reconnect_path to helper function
exportfs: eliminate unused "noprogress" counter
exportfs: stop retrying once we race with rename/remove
exportfs: clear DISCONNECTED on all parents sooner
exportfs: more detailed comment for path_reconnect
...
All async fuse requests must be supplied with extra reference to a fuse
file. This is necessary to ensure that the fuse file is not released until
all in-flight requests are completed. Fuse secondary writeback requests
must obey this rule as well.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
BDI_WRITTEN counter is used to estimate bdi bandwidth. It must be
incremented every time as bdi ends page writeback. No matter whether it
was fulfilled by actual write or by discarding the request (e.g. due to
shrunk i_size).
Note that even before writepages patches, the case "Got truncated off
completely" was handled in fuse_send_writepage() by calling
fuse_writepage_finish() which updated BDI_WRITTEN unconditionally.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If writeback happens while fuse is in FUSE_NOWRITE condition, the request
will be queued but not processed immediately (see fuse_flush_writepages()).
Until FUSE_NOWRITE becomes relaxed, more writebacks can happen. They will
be queued as "secondary" requests to that first ("primary") request.
Existing implementation crops only primary request. This is not correct
because a subsequent extending write(2) may increase i_size and then
secondary requests won't be cropped properly. The result would be stale
data written to the server to a file offset where zeros must be.
Similar problem may happen if secondary requests are attached to an
in-flight request that was already cropped.
The patch solves the issue by cropping all secondary requests in
fuse_writepage_end(). Thanks to Miklos for idea.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse_writepage_in_flight() returns false if it fails to find request with
given index in fi->writepages. Then the caller proceeds with populating
data->orig_pages[] and incrementing req->num_pages. Hence,
fuse_writepage_in_flight() must revert changes it made in request before
returning false.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
...which just returns -EBUSY if a directory alias would be created.
This is to be used by fuse mkdir to make sure that a buggy or malicious
userspace filesystem doesn't do anything nasty. Previously fuse used a
private mutex for this purpose, which can now go away.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This allows udev (or more recently systemd-tmpfiles) to create /dev/cuse on
boot, in the same way as /dev/fuse is currently created, and the corresponding
module to be loaded on first access.
The corresponding functionalty was introduced for fuse in commit 578454f.
Signed-off-by: Tom Gundersen <teg@jklm.no>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If ->writepage() tries to write back a page whose copy is still in flight,
then just skip by calling redirty_page_for_writepage().
This is OK, since now ->writepage() should never be called for data
integrity sync.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
As Maxim Patlasov pointed out, it's possible to get a dirty page while it's
copy is still under writeback, despite fuse_page_mkwrite() doing its thing
(direct IO).
This could result in two concurrent write request for the same offset, with
data corruption if they get mixed up.
To prevent this, fuse needs to check and delay such writes. This
implementation does this by:
1. check if page is still under writeout, if so create a new, single page
secondary request for it
2. chain this secondary request onto the in-flight request
2/a. if a seconday request for the same offset was already chained to the
in-flight request, then just copy the contents of the page and discard
the new secondary request. This makes sure that for each page will
have at most two requests associated with it
3. when the in-flight request finished, send off all secondary requests
chained onto it
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Checking against tmp-page indexes is not very useful, and results in one
(or rarely two) page requests. Which is not much of an improvement...
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch fixes a race between ftruncate(2), mmap-ed write and write(2):
1) An user makes a page dirty via mmap-ed write.
2) The user performs shrinking truncate(2) intended to purge the page.
3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
writeback. fuse_writepages_fill attaches a new page to FUSE_WRITE request,
then releases the original page by end_page_writeback and unlock it.
4) fuse_do_setattr completes and successfully returns. Since now, i_mutex
is free.
5) Ordinary write(2) extends i_size back to cover the page. Note that
fuse_send_write_pages do wait for fuse writeback, but for another
page->index.
6) fuse_writepages_fill attaches more pages to the request (if any), then
fuse_writepages_send is eventually called. It is supposed to crop
inarg->size of the request, but it doesn't because i_size has already been
extended back.
Moving end_page_writeback behind fuse_writepages_send guarantees that
__fuse_release_nowrite (called from fuse_do_setattr) will crop inarg->size
of the request before write(2) gets the chance to extend i_size.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The .writepages one is required to make each writeback request carry more than
one page on it. The patch enables optimized behaviour unconditionally,
i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.
[SzM: simplify, add comments]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Don't bug if there's no writable files found for page writeback. If ever
this is triggered, a WARN_ON helps debugging it much better then a BUG_ON.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Lock the page in fuse_page_mkwrite() to protect against a race with
fuse_writepage() where the page is redirtied before the actual writeback
begins.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The .writepages callback will issue writeback requests with more than one
page aboard. Make existing end/check code be aware of this.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Doing dput(parent) is not valid in RCU walk mode. In RCU mode it would
probably be okay to update the parent flags, but it's actually not
necessary most of the time...
So only set the FUSE_I_ADVISE_RDPLUS flag on the parent when the entry was
recently initialized by READDIRPLUS.
This is achieved by setting FUSE_I_INIT_RDPLUS on entries added by
READDIRPLUS and only dropping out of RCU mode if this flag is set.
FUSE_I_INIT_RDPLUS is cleared once the FUSE_I_ADVISE_RDPLUS flag is set in
the parent.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
If revalidate finds an invalid dentry in RCU walk mode, let the VFS deal
with it instead of calling check_submounts_and_drop() which is not prepared
for being called from RCU walk.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
A former patch introducing FUSE_I_SIZE_UNSTABLE flag provided detailed
description of races between ftruncate and anyone who can extend i_size:
> 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size
> changed (due to our own fuse_do_setattr()) and is going to call
> truncate_pagecache() for some 'new_size' it believes valid right now. But by
> the time that particular truncate_pagecache() is called ...
> 2. fuse_do_setattr() returns (either having called truncate_pagecache() or
> not -- it doesn't matter).
> 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
> 4. mmap-ed write makes a page in the extended region dirty.
This patch adds necessary bits to fuse_file_fallocate() to protect from that
race.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):
1) An user makes a page dirty via mmap-ed write.
2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
and <offset, size> covering the page.
3) Before truncate_pagecache_range call from fuse_file_fallocate,
the page goes to write-back. The page is fully processed by fuse_writepage
(including end_page_writeback on the page), but fuse_flush_writepages did
nothing because fi->writectr < 0.
4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
by calling fuse_release_nowrite. The latter triggers processing queued
write-back request which will write stale data to the hole soon.
Changed in v2 (thanks to Brian for suggestion):
- Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
we can end up in returning -ENOTSUPP while user data is already punched
from page cache. Use filemap_write_and_wait_range() instead.
Changed in v3 (thanks to Miklos for suggestion):
- fuse_wait_on_writeback() is prone to livelocks; use fuse_set_nowrite()
instead. So far as we need a dirty-page barrier only, fuse_sync_writes()
should be enough.
- rebased to for-linus branch of fuse.git
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression"). Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The feature prevents mistrusted filesystems (ie: FUSE mounts created by
unprivileged users) to grow a large number of dirty pages before
throttling. For such filesystems balance_dirty_pages always check bdi
counters against bdi limits. I.e. even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks. The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.
The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
filesystem may set the flag when it initializes its BDI.
The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
writeback). The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately. A fuse request queued
for real processing bears a copy of original page. Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.
To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled grow of amount
of RAM consumed by temporary pages allocated by kernel fuse to process
writeback".
The problem was very easy to reproduce. There is a trivial example
filesystem implementation in fuse userspace distribution: fusexmp_fh.c. I
added "sleep(1);" to the write methods, then recompiled and mounted it.
Then created a huge file on the mount point and run a simple program which
mmap-ed the file to a memory region, then wrote a data to the region. An
hour later I observed almost all RAM consumed by fuse writeback. Since
then some unrelated changes in kernel fuse made it more difficult to
reproduce, but it is still possible now.
Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real world (FUSE) users. This is write-through
page cache policy FUSE currently uses. I.e. handling write(2), kernel
fuse populates page cache and flushes user data to the server
synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
("writeback cache policy") solve the problem, but they also make resolving
NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying
a huge file to a fuse mount would result in memory starvation. Miklos,
the maintainer of FUSE, believes strictlimit feature the way to go.
And eventually putting FUSE topics aside, there is one more use-case for
strictlimit feature. Using a slow USB stick (mass storage) in a machine
with huge amount of RAM installed is a well-known pain. Let's make simple
computations. Assuming 64GB of RAM installed, existing implementation of
balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
dirty (freerun == 15% of total RAM). So, the command "cp 9GB_file
/media/my-usb-storage/" may return in a few seconds, but subsequent
"umount /media/my-usb-storage/" will take more than two hours if effective
throughput of the storage is, to say, 1MB/sec.
After inclusion of strictlimit feature, it will be trivial to add a knob
(e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand.
Manually or via udev rule. May be I'm wrong, but it seems to be quite a
natural desire to limit the amount of dirty memory for some devices we are
not fully trust (in the sense of sustainable throughput).
[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull fuse bugfixes from Miklos Szeredi:
"Just a bunch of bugfixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use list_for_each_entry() for list traversing
fuse: readdir: check for slash in names
fuse: hotfix truncate_pagecache() issue
fuse: invalidate inode attributes on xattr modification
fuse: postpone end_page_writeback() in fuse_writepage_locked()
Drop a subtree when we find that it has moved or been delated. This can be
done as long as there are no submounts under this location.
If the directory was moved and we come across the same directory in a
future lookup it will be reconnected by d_materialise_unique().
Signed-off-by: Anand Avati <avati@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On errors unrelated to the filesystem's state (ENOMEM, ENOTCONN) return the
error itself from ->d_revalidate() insted of returning zero (invalid).
Also make a common label for invalidating the dentry. This will be used by
the next patch.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use d_materialise_unique() instead of d_splice_alias(). This allows dentry
subtrees to be moved to a new place if there moved, even if something is
referencing a dentry in the subtree (open fd, cwd, etc..).
This will also allow us to drop a subtree if it is found to be replaced by
something else. In this case the disconnected subtree can later be
reconnected to its new location.
d_materialise_unique() ensures that a directory entry only ever has one
alias. We keep fc->inst_mutex around the calls for d_materialise_unique()
on directories to prevent a race with mkdir "stealing" the inode.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Userspace can add names containing a slash character to the directory
listing. Don't allow this as it could cause all sorts of trouble.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
The way how fuse calls truncate_pagecache() from fuse_change_attributes()
is completely wrong. Because, w/o i_mutex held, we never sure whether
'oldsize' and 'attr->size' are valid by the time of execution of
truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we
released fc->lock in the middle of fuse_change_attributes(), we completely
loose control of actions which may happen with given inode until we reach
truncate_pagecache. The list of potentially dangerous actions includes
mmap-ed reads and writes, ftruncate(2) and write(2) extending file size.
The typical outcome of doing truncate_pagecache() with outdated arguments
is data corruption from user point of view. This is (in some sense)
acceptable in cases when the issue is triggered by a change of the file on
the server (i.e. externally wrt fuse operation), but it is absolutely
intolerable in scenarios when a single fuse client modifies a file without
any external intervention. A real life case I discovered by fsx-linux
looked like this:
1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP
to the server synchronously, then calls fuse_change_attributes(). The
latter updates i_size, releases fc->lock, but before comparing oldsize vs
attr->size..
3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
updating attributes and i_size, but now oldsize is equal to
outarg.attr.size because i_size has just been updated (step 2). Hence,
fuse_do_setattr() returns w/o calling truncate_pagecache().
4. As soon as ftruncate(2) completes, the user extends file size by
write(2) making a hole in the middle of file, then reads data from the hole
either by read(2) or mmap-ed read. The user expects to get zero data from
the hole, but gets stale data because truncate_pagecache() is not executed
yet.
The scenario above illustrates one side of the problem: not truncating the
page cache even though we should. Another side corresponds to truncating
page cache too late, when the state of inode changed significantly.
Theoretically, the following is possible:
1. As in the previous scenario fuse_dentry_revalidate() discovered that
i_size changed (due to our own fuse_do_setattr()) and is going to call
truncate_pagecache() for some 'new_size' it believes valid right now. But
by the time that particular truncate_pagecache() is called ...
2. fuse_do_setattr() returns (either having called truncate_pagecache() or
not -- it doesn't matter).
3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
4. mmap-ed write makes a page in the extended region dirty.
The result will be the lost of data user wrote on the fourth step.
The patch is a hotfix resolving the issue in a simplistic way: let's skip
dangerous i_size update and truncate_pagecache if an operation changing
file size is in progress. This simplistic approach looks correct for the
cases w/o external changes. And to handle them properly, more sophisticated
and intrusive techniques (e.g. NFS-like one) would be required. I'd like to
postpone it until the issue is well discussed on the mailing list(s).
Changed in v2:
- improved patch description to cover both sides of the issue.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
Calls like setxattr and removexattr result in updation of ctime.
Therefore invalidate inode attributes to force a refresh.
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
The patch fixes a race between ftruncate(2), mmap-ed write and write(2):
1) An user makes a page dirty via mmap-ed write.
2) The user performs shrinking truncate(2) intended to purge the page.
3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
writeback. fuse_writepage_locked fills FUSE_WRITE request and releases
the original page by end_page_writeback.
4) fuse_do_setattr() completes and successfully returns. Since now, i_mutex
is free.
5) Ordinary write(2) extends i_size back to cover the page. Note that
fuse_send_write_pages do wait for fuse writeback, but for another
page->index.
6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request.
fuse_send_writepage is supposed to crop inarg->size of the request,
but it doesn't because i_size has already been extended back.
Moving end_page_writeback to the end of fuse_writepage_locked fixes the
race because now the fact that truncate_pagecache is successfully returned
infers that fuse_writepage_locked has already called end_page_writeback.
And this, in turn, infers that fuse_flush_writepages has already called
fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2)
could not extend it because of i_mutex held by ftruncate(2).
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
The dev_attrs field of struct class is going away soon, dev_groups
should be used instead. This converts the cuse class code to use the
correct field.
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fuse does instantiation slightly differently from NFS/CIFS which use
d_materialise_unique().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
Add sanity checks before adding or updating an entry with data received
from readdirplus.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
In case d_lookup() returns a dentry with d_inode == NULL, the dentry is not
returned with dput(). This results in triggering a BUG() in
shrink_dcache_for_umount_subtree():
BUG: Dentry ...{i=0,n=...} still in use (1) [unmount of fuse fuse]
[SzM: need to d_drop() as well]
Reported-by: Justin Clift <jclift@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Brian Foster <bfoster@redhat.com>
Tested-by: Niels de Vos <ndevos@redhat.com>
CC: stable@vger.kernel.org
The global variable num_physpages is scheduled to be removed, so use
totalram_pages instead of num_physpages at runtime.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull second set of VFS changes from Al Viro:
"Assorted f_pos race fixes, making do_splice_direct() safe to call with
i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
->d_hash/->d_compare calling conventions changes from Linus, misc
stuff all over the place."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
Document ->tmpfile()
ext4: ->tmpfile() support
vfs: export lseek_execute() to modules
lseek_execute() doesn't need an inode passed to it
block_dev: switch to fixed_size_llseek()
cpqphp_sysfs: switch to fixed_size_llseek()
tile-srom: switch to fixed_size_llseek()
proc_powerpc: switch to fixed_size_llseek()
ubi/cdev: switch to fixed_size_llseek()
pci/proc: switch to fixed_size_llseek()
isapnp: switch to fixed_size_llseek()
lpfc: switch to fixed_size_llseek()
locks: give the blocked_hash its own spinlock
locks: add a new "lm_owner_key" lock operation
locks: turn the blocked_list into a hashtable
locks: convert fl_link to a hlist_node
locks: avoid taking global lock if possible when waking up blocked waiters
locks: protect most of the file_lock handling with i_lock
locks: encapsulate the fl_link list handling
locks: make "added" in __posix_lock_file a bool
...
Pull VFS patches (part 1) from Al Viro:
"The major change in this pile is ->readdir() replacement with
->iterate(), dealing with ->f_pos races in ->readdir() instances for
good.
There's a lot more, but I'd prefer to split the pull request into
several stages and this is the first obvious cutoff point."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (67 commits)
[readdir] constify ->actor
[readdir] ->readdir() is gone
[readdir] convert ecryptfs
[readdir] convert coda
[readdir] convert ocfs2
[readdir] convert fatfs
[readdir] convert xfs
[readdir] convert btrfs
[readdir] convert hostfs
[readdir] convert afs
[readdir] convert ncpfs
[readdir] convert hfsplus
[readdir] convert hfs
[readdir] convert befs
[readdir] convert cifs
[readdir] convert freevxfs
[readdir] convert fuse
[readdir] convert hpfs
reiserfs: switch reiserfs_readdir_dentry to inode
reiserfs: is_privroot_deh() needs only directory inode, actually
...
Changing size of a file on server and local update (fuse_write_update_size)
should be always protected by inode->i_mutex. Otherwise a race like this is
possible:
1. Process 'A' calls fallocate(2) to extend file (~FALLOC_FL_KEEP_SIZE).
fuse_file_fallocate() sends FUSE_FALLOCATE request to the server.
2. Process 'B' calls ftruncate(2) shrinking the file. fuse_do_setattr()
sends shrinking FUSE_SETATTR request to the server and updates local i_size
by i_size_write(inode, outarg.attr.size).
3. Process 'A' resumes execution of fuse_file_fallocate() and calls
fuse_write_update_size(inode, offset + length). But 'offset + length' was
obsoleted by ftruncate from previous step.
Changed in v2 (thanks Brian and Anand for suggestions):
- made relation between mutex_lock() and fuse_set_nowrite(inode) more
explicit and clear.
- updated patch description to use ftruncate(2) in example
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The bug was introduced with async_dio feature: trying to optimize short reads,
we cut number-of-bytes-to-read to i_size boundary. Hence the following example:
truncate --size=300 /mnt/file
dd if=/mnt/file of=/dev/null iflag=direct
led to FUSE_READ request of 300 bytes size. This turned out to be problem
for userspace fuse implementations who rely on assumption that kernel fuse
does not change alignment of request from client FS.
The patch turns off the optimization if async_dio is disabled. And, if it's
enabled, the patch fixes adjustment of number-of-bytes-to-read to preserve
alignment.
Note, that we cannot throw out short read optimization entirely because
otherwise a direct read of a huge size issued on a tiny file would generate
a huge amount of fuse requests and most of them would be ACKed by userspace
with zero bytes read.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If request submission fails for an async request (i.e.,
get_user_pages() returns -ERESTARTSYS), we currently skip the
-EIOCBQUEUED return and drop into wait_for_sync_kiocb() forever.
Avoid this by always returning -EIOCBQUEUED for async requests. If
an error occurs, the error is passed into fuse_aio_complete(),
returned via aio_complete() and thus propagated to userspace via
io_getevents().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fix bug introduced by commit 4582a4ab2a "FUSE: Adapt readdirplus to application
usage patterns".
We need to check for a positive dentry; negative dentries are not added by
readdirplus. Secondly we need to advise the use of readdirplus on the *parent*,
otherwise the whole thing is useless. Thirdly all this is only relevant if
"readdirplus_auto" mode is selected by the filesystem.
We advise the use of readdirplus only if the dentry was still valid. If we had
to redo the lookup then there was no use in doing the -plus version.
Reported-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Feng Shuo <steve.shuo.feng@gmail.com>
CC: stable@vger.kernel.org
An fallocate request without FALLOC_FL_KEEP_SIZE set can extend the
size of a file. Update the inode size after a successful fallocate.
Also invalidate the inode attributes after a successful fallocate
to ensure we pick up the latest attribute values (i.e., i_blocks).
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse supports hole punch via the fallocate() FALLOC_FL_PUNCH_HOLE
interface. When a hole punch is passed through, the page cache
is not cleared and thus allows reading stale data from the cache.
This is easily demonstrable (using FOPEN_KEEP_CACHE) by reading a
smallish random data file into cache, punching a hole and creating
a copy of the file. Drop caches or remount and observe that the
original file no longer matches the file copied after the hole
punch. The original file contains a zeroed range and the latter
file contains stale data.
Protect against writepage requests in progress and punch out the
associated page cache range after a successful client fs hole
punch.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Commit 8b41e671 introduced explicit background checking for fuse_req
structures with BUG_ON() checks for the appropriate type of request in
in the associated send functions. Commit bcba24cc introduced the ability
to send dio requests as background requests but does not update the
request allocation based on the type of I/O request. As a result, a
BUG_ON() triggers in the fuse_request_send_background() background path if
an async I/O is sent.
Allocate a request based on the async state of the fuse_io_priv to avoid
the BUG.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Merge more incoming from Andrew Morton:
- Various fixes which were stalled or which I picked up recently
- A large rotorooting of the AIO code. Allegedly to improve
performance but I don't really have good performance numbers (I might
have lost the email) and I can't raise Kent today. I held this out
of 3.9 and we could give it another cycle if it's all too late/scary.
I ended up taking only the first two thirds of the AIO rotorooting. I
left the percpu parts and the batch completion for later. - Linus
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
aio: don't include aio.h in sched.h
aio: kill ki_retry
aio: kill ki_key
aio: give shared kioctx fields their own cachelines
aio: kill struct aio_ring_info
aio: kill batch allocation
aio: change reqs_active to include unreaped completions
aio: use cancellation list lazily
aio: use flush_dcache_page()
aio: make aio_read_evt() more efficient, convert to hrtimers
wait: add wait_event_hrtimeout()
aio: refcounting cleanup
aio: make aio_put_req() lockless
aio: do fget() after aio_get_req()
aio: dprintk() -> pr_debug()
aio: move private stuff out of aio.h
aio: add kiocb_cancel()
aio: kill return value of aio_complete()
char: add aio_{read,write} to /dev/{null,zero}
aio: remove retry-based AIO
...
Pull fuse updates from Miklos Szeredi:
"This contains two patchsets from Maxim Patlasov.
The first reworks the request throttling so that only async requests
are throttled. Wakeup of waiting async requests is also optimized.
The second series adds support for async processing of direct IO which
optimizes direct IO and enables the use of the AIO userspace
interface."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add flag to turn on async direct IO
fuse: truncate file if async dio failed
fuse: optimize short direct reads
fuse: enable asynchronous processing direct IO
fuse: make fuse_direct_io() aware about AIO
fuse: add support of async IO
fuse: move fuse_release_user_pages() up
fuse: optimize wake_up
fuse: implement exclusive wakeup for blocked_waitq
fuse: skip blocking on allocations of synchronous requests
fuse: add flag fc->initialized
fuse: make request allocations for background processing explicit
Without async DIO write requests to a single file were always serialized.
With async DIO that's no longer the case.
So don't turn on async DIO by default for fear of breaking backward
compatibility.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch improves error handling in fuse_direct_IO(): if we successfully
submitted several fuse requests on behalf of synchronous direct write
extending file and some of them failed, let's try to do our best to clean-up.
Changed in v2: reuse fuse_do_setattr(). Thanks to Brian for suggestion.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If user requested direct read beyond EOF, we can skip sending fuse requests
for positions beyond EOF because userspace would ACK them with zero bytes read
anyway. We can trust to i_size in fuse_direct_IO for such cases because it's
called from fuse_file_aio_read() and the latter updates fuse attributes
including i_size.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
In case of synchronous DIO request (i.e. read(2) or write(2) for a file
opened with O_DIRECT), the patch submits fuse requests asynchronously, but
waits for their completions before return from fuse_direct_IO().
In case of asynchronous DIO request (i.e. libaio io_submit() or a file opened
with O_DIRECT), the patch submits fuse requests asynchronously and return
-EIOCBQUEUED immediately.
The only special case is async DIO extending file. Here the patch falls back
to old behaviour because we can't return -EIOCBQUEUED and update i_size later,
without i_mutex hold. And we have no method to wait on real async I/O
requests.
The patch also clean __fuse_direct_write() up: it's better to update i_size
in its callers. Thanks Brian for suggestion.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch implements passing "struct fuse_io_priv *io" down the stack up to
fuse_send_read/write where it is used to submit request asynchronously.
io->async==0 designates synchronous processing.
Non-trivial part of the patch is changes in fuse_direct_io(): resources
like fuse requests and user pages cannot be released immediately in async
case.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch implements a framework to process an IO request asynchronously. The
idea is to associate several fuse requests with a single kiocb by means of
fuse_io_priv structure. The structure plays the same role for FUSE as 'struct
dio' for direct-io.c.
The framework is supposed to be used like this:
- someone (who wants to process an IO asynchronously) allocates fuse_io_priv
and initializes it setting 'async' field to non-zero value.
- as soon as fuse request is filled, it can be submitted (in non-blocking way)
by fuse_async_req_send()
- when all submitted requests are ACKed by userspace, io->reqs drops to zero
triggering aio_complete()
In case of IO initiated by libaio, aio_complete() will finish processing the
same way as in case of dio_complete() calling aio_complete(). But the
framework may be also used for internal FUSE use when initial IO request
was synchronous (from user perspective), but it's beneficial to process it
asynchronously. Then the caller should wait on kiocb explicitly and
aio_complete() will wake the caller up.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse_release_user_pages() will be indirectly used by fuse_send_read/write
in future patches.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch solves thundering herd problem. So far as previous patches ensured
that only allocations for background may block, it's safe to wake up one
waiter. Whoever it is, it will wake up another one in request_end() afterwards.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
A task may have at most one synchronous request allocated. So these
requests need not be otherwise limited.
The patch re-works fuse_get_req() to follow this idea.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Existing flag fc->blocked is used to suspend request allocation both in case
of many background request submitted and period of time before init_reply
arrives from userspace. Next patch will skip blocking allocations of
synchronous request (disregarding fc->blocked). This is mostly OK, but
we still need to suspend allocations if init_reply is not arrived yet. The
patch introduces flag fc->initialized which will serve this purpose.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
There are two types of processing requests in FUSE: synchronous (via
fuse_request_send()) and asynchronous (via adding to fc->bg_queue).
Fortunately, the type of processing is always known in advance, at the time
of request allocation. This preparatory patch utilizes this fact making
fuse_get_req() aware about the type. Next patches will use it.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
it's used only as a flag to distinguish normal pipes/FIFOs from the
internal per-task one used by file-to-file splice. And pipe->files
would work just as well for that purpose...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.
A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.
Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.
Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.
This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.
This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.
After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regards to the users permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
This patch is a follow up on below patch:
[PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
commit: 216b6cbdcb
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For some filesystems (e.g. GlusterFS), the cost of performing a
normal readdir and readdirplus are identical. Since adaptively
using readdirplus has no benefit for those systems, give
users/filesystems the option to control adaptive readdirplus use.
v2 of this patch incorporates Miklos's suggestion to simplify the code,
as well as improving consistency of macro names and documentation.
Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
commit 626cf23660 "poll: add poll_requested_events()..." enabled us to send the
requested events to the filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
drop_nlink() warns if nlink is already zero. This is triggerable by a buggy
userspace filesystem. The cure, I think, is worse than the disease so disable
the warning.
Reported-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The all pointers within fuse_req must point to valid memory once
fuse_force_forget() returns.
This bug appeared in "fuse: implement NFS-like readdirplus support"
and was never in any official Linux release.
I tested the fuse_force_forget() code path by injecting to fake -ENOMEM and
verified the FORGET operation was called properly in userspace.
Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Use the same adaptive readdirplus mechanism as NFS:
http://permalink.gmane.org/gmane.linux.nfs/49299
If the user space implementation wants to disable readdirplus
temporarily, it could just return ENOTSUPP. Then kernel will
recall it with readdir.
Signed-off-by: Feng Shuo <steve.shuo.feng@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Commit c69e8d9c0 added rcu lock to fuse/dir.c It was assuming
that 'task' is some other process but in fact this parameter always
equals to 'current'. Inline this parameter to make it more readable
and remove RCU lock as it is not needed when access current process
credentials.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
__fuse_direct_io() allocates fuse-requests by calling fuse_get_req(fc, n). The
patch calculates 'n' based on iov[] array. This is useful because allocating
FUSE_MAX_PAGES_PER_REQ page pointers and descriptors for each fuse request
would be waste of memory in case of iov-s of smaller size.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Let fuse_get_user_pages() pack as many iov-s to a single fuse_req as
possible. This is very beneficial in case of iov[] consisting of many
iov-s of relatively small sizes (e.g. PAGE_SIZE).
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch makes preliminary work for the next patch optimizing scatter-gather
direct IO. The idea is to allow fuse_get_user_pages() to pack as many iov-s
to each fuse request as possible. So, here we only rework all related
call-paths to carry iov[] from fuse_direct_IO() to fuse_get_user_pages().
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Previously, anyone who set flag 'argpages' only filled req->pages[] and set
per-request page_offset. This patch re-works all cases where argpages=1 to
fill req->page_descs[] properly.
Having req->page_descs[] filled properly allows to re-work fuse_copy_pages()
to copy page fragments described by req->page_descs[]. This will be useful
for next patches optimizing direct_IO.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The ability to save page pointers along with lengths and offsets in fuse_req
will be useful to cover several iovec-s with a single fuse_req.
Per-request page_offset is removed because anybody who need it can use
req->page_descs[0].offset instead.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse_do_ioctl() already calculates the number of pages it's going to use. It is
stored in 'num_pages' variable. So the patch simply uses it for allocating
fuse_req.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch allocates as many page pointers in fuse_req as needed to cover
interval [pos .. pos+len-1]. Inline helper fuse_wr_pages() is introduced
to hide this cumbersome arithmetic.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch uses 'nr_pages' argument of fuse_readpages() as heuristics for the
number of page pointers to allocate.
This can be improved further by taking in consideration fc->max_read and gaps
between page indices, but it's not clear whether it's worthy or not.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch reworks fuse_retrieve() to allocate only so many page pointers
as needed. The core part of the patch is the following calculation:
num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
(thanks Miklos for formula). All other changes are mostly shuffling lines.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch categorizes all fuse_get_req() invocations into two categories:
- fuse_get_req_nopages(fc) - when caller doesn't care about req->pages
- fuse_get_req(fc, n) - when caller need n page pointers (n > 0)
Adding fuse_get_req_nopages() helps to avoid numerous fuse_get_req(fc, 0)
scattered over code. Now it's clear from the first glance when a caller need
fuse_req with page pointers.
The patch doesn't make any logic changes. In multi-page case, it silly
allocates array of FUSE_MAX_PAGES_PER_REQ page pointers. This will be amended
by future patches.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch removes inline array of FUSE_MAX_PAGES_PER_REQ page pointers from
fuse_req. Instead of that, req->pages may now point either to small inline
array or to an array allocated dynamically.
This essentially means that all callers of fuse_request_alloc[_nofs] should
pass the number of pages needed explicitly.
The patch doesn't make any logic changes.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This patch implements readdirplus support in FUSE, similar to NFS.
The payload returned in the readdirplus call contains
'fuse_entry_out' structure thereby providing all the necessary inputs
for 'faking' a lookup() operation on the spot.
If the dentry and inode already existed (for e.g. in a re-run of ls -l)
then just the inode attributes timeout and dentry timeout are refreshed.
With a simple client->network->server implementation of a FUSE based
filesystem, the following performance observations were made:
Test: Performing a filesystem crawl over 20,000 files with
sh# time ls -lR /mnt
Without readdirplus:
Run 1: 18.1s
Run 2: 16.0s
Run 3: 16.2s
With readdirplus:
Run 1: 4.1s
Run 2: 3.8s
Run 3: 3.8s
The performance improvement is significant as it avoided 20,000 upcalls
calls (lookup). Cache consistency is no worse than what already is.
Signed-off-by: Anand V. Avati <avati@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The variables mapping,index are initialized but never used
otherwise, so remove the unused variables.
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fix the following sparse warning:
fs/fuse/file.c:2249:6: warning: symbol 'fuse_file_fallocate' was not declared. Should it be static?
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Given that CUSE depends on FUSE, it only makes sense to move its
Kconfig entry into the FUSE Kconfig file. Also, add a few grammatical
and semantic touchups.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fix the following compiler warnings:
fs/fuse/cuse.c: In function 'cuse_process_init_reply':
fs/fuse/cuse.c:288:24: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/fuse/cuse.c:272:14: note: 'val' was declared here
fs/fuse/cuse.c:284:10: warning: 'key' may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/fuse/cuse.c:272:8: note: 'key' was declared here
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Sysfs doesn't allow two devices with the same name, but we register a
sysfs entry for each cuse device without checking for name collisions.
This extends the registration to first check whether the name was already
registered.
To avoid race-conditions between the name-check and linking the device, we
need to protect the whole registration with a mutex.
Signed-off-by: David Herrmann <dh.herrmann@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
We need to check for name-collisions during cuse-device registration. To
avoid race-conditions, this needs to be protected during the whole device
registration. Therefore, replace the spinlocks by mutexes first so we can
safely extend the locked regions to include more expensive or sleeping
code paths.
Signed-off-by: David Herrmann <dh.herrmann@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Merge misc patches from Andrew Morton:
"Incoming:
- lots of misc stuff
- backlight tree updates
- lib/ updates
- Oleg's percpu-rwsem changes
- checkpatch
- rtc
- aoe
- more checkpoint/restart support
I still have a pile of MM stuff pending - Pekka should be merging
later today after which that is good to go. A number of other things
are twiddling thumbs awaiting maintainer merges."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
scatterlist: don't BUG when we can trivially return a proper error.
docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
fs, fanotify: add @mflags field to fanotify output
docs: add documentation about /proc/<pid>/fdinfo/<fd> output
fs, notify: add procfs fdinfo helper
fs, exportfs: add exportfs_encode_inode_fh() helper
fs, exportfs: escape nil dereference if no s_export_op present
fs, epoll: add procfs fdinfo helper
fs, eventfd: add procfs fdinfo helper
procfs: add ability to plug in auxiliary fdinfo providers
tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
breakpoint selftests: print failure status instead of cause make error
kcmp selftests: print fail status instead of cause make error
kcmp selftests: make run_tests fix
mem-hotplug selftests: print failure status instead of cause make error
cpu-hotplug selftests: print failure status instead of cause make error
mqueue selftests: print failure status instead of cause make error
vm selftests: print failure status instead of cause make error
ubifs: use prandom_bytes
mtd: nandsim: use prandom_bytes
...
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data.
The connection between between a fuse filesystem and a fuse daemon is
established when a fuse filesystem is mounted and provided with a file
descriptor the fuse daemon created by opening /dev/fuse.
For now restrict the communication of uids and gids between the fuse
filesystem and the fuse daemon to the initial user namespace. Enforce
this by verifying the file descriptor passed to the mount of fuse was
opened in the initial user namespace. Ensuring the mount happens in
the initial user namespace is not necessary as mounts from non-initial
user namespaces are not yet allowed.
In fuse_req_init_context convert the currrent fsuid and fsgid into the
initial user namespace for the request that will be sent to the fuse
daemon.
In fuse_fill_attr convert the uid and gid passed from the fuse daemon
from the initial user namespace into kuids and kgids.
In iattr_to_fattr called from fuse_setattr convert kuids and kgids
into the uids and gids in the initial user namespace before passing
them to the fuse filesystem.
In fuse_change_attributes_common called from fuse_dentry_revalidate,
fuse_permission, fuse_geattr, and fuse_setattr, and fuse_iget convert
the uid and gid from the fuse daemon into a kuid and a kgid to store
on the fuse inode.
By default fuse mounts are restricted to task whose uid, suid, and
euid matches the fuse user_id and whose gid, sgid, and egid matches
the fuse group id. Convert the user_id and group_id mount options
into kuids and kgids at mount time, and use uid_eq and gid_eq to
compare the in fuse_allow_task.
Cc: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().
Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.
Now device drivers can implement this method and obtain nonlinear vma support.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.
Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In some cases fuse_retrieve() would return a short byte count if offset was
non-zero. The data returned was correct, though.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
gcc 4.6.3 complains about uninitialized variables in fs/fuse/control.c:
CC fs/fuse/control.o
fs/fuse/control.c: In function 'fuse_conn_congestion_threshold_write':
fs/fuse/control.c:165:29: warning: 'val' may be used uninitialized in this function [-Wuninitialized]
fs/fuse/control.c: In function 'fuse_conn_max_background_write':
fs/fuse/control.c:128:23: warning: 'val' may be used uninitialized in this function [-Wuninitialized]
fuse_conn_limit_write() will always return non-zero unless the &val
is modified, so the warning is misleading. Let the compiler know
about it by marking 'val' with 'uninitialized_var'.
Signed-off-by: Daniel Mack <zonque@gmail.com>
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Luca Risolia reported that a CUSE daemon will continue to run even if
initialization of the emulated device failes for some reason (e.g. the device
number is already registered by another driver).
This patch disconnects the fuse device on error, which will make the userspace
CUSE daemon exit, albeit without indication about what the problem was.
Reported-by: Luca Risolia <luca.risolia@studio.unibo.it>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse_conn_kill() removed fc->entry, called fuse_ctl_remove_conn() and
fuse_bdi_destroy(). None of which is appropriate for cuse cleanup.
The fuse_ctl_remove_conn() decrements the nlink on the control filesystem, which
is totally bogus. The others are harmless but unnecessary.
So move these out from fuse_conn_kill() to fuse_put_super() where they belong.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Pull vfs fixes from Miklos Szeredi.
This mainly fixes some confusion about whether the open 'mode' variable
passed around should contain the full file type (S_IFREG etc)
information or just the permission mode. In particular, the lack of
proper file type information had confused fuse.
* 'vfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
vfs: fix propagation of atomic_open create error on negative dentry
fuse: check create mode in atomic open
vfs: pass right create mode to may_o_create()
vfs: atomic_open(): fix create mode usage
vfs: canonicalize create mode in build_open_flags()
Verify that the VFS is passing us a complete create mode with the S_IFREG to
atomic open.
Reported-by: Steve <steveamigauk@yahoo.co.uk>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Commit 7572777eef attempted to verify that
the total iovec from the client doesn't overflow iov_length() but it
only checked the first element. The iovec could still overflow by
starting with a small element. The obvious fix is to check all the
elements.
The overflow case doesn't look dangerous to the kernel as the copy is
limited by the length after the overflow. This fix restores the
intention of returning an error instead of successfully copying less
than the iovec represented.
I found this by code inspection. I built it but don't have a test case.
I'm cc:ing stable because the initial commit did as well.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: <stable@vger.kernel.org> [2.6.37+]
Convert check in fuse_file_aio_write() to using new freeze protection.
CC: fuse-devel@lists.sourceforge.net
CC: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add missing flags that userspace derived from the protocol version number. This
makes the protocol more flexible.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
A fuse-based network filesystem might allow for the inode
and/or file data to change unexpectedly. A local client
that opens and repeatedly reads a file might never pick
up on such changes and indefinitely return stale data.
Always invoke fuse_update_attributes() in the read path
to cause an attr revalidation when the attributes expire.
This leads to a page cache invalidation if necessary and
ensures fuse issues new read requests to the fuse client.
The original logic (reval only on reads beyond EOF) is
preserved unless the client specifies FUSE_AUTO_INVAL_DATA
on init.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
We currently invalidate the inode address space mapping
if the file size changes unexpectedly. In the case of a
fuse network filesystem, a portion of a file could be
overwritten remotely without changing the file size.
Compare the old mtime as well to detect this condition
and invalidate the mapping if the file has been updated.
The original logic (to ignore changes in mtime) is
preserved unless the client specifies FUSE_AUTO_INVAL_DATA
on init.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just pass struct file *. Methods are happier that way...
There's no need to return struct file * from finish_open() now,
so let it return int. Next: saner prototypes for parts in
namei.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change of calling conventions:
old new
NULL 1
file 0
ERR_PTR(-ve) -ve
Caller *knows* that struct file *; no need to return it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and let finish_open() report having opened the file via that sucker.
Next step: don't modify od->filp at all.
[AV: FILE_CREATE was already used by cifs; Miklos' fix folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add an ->atomic_open implementation which replaces the atomic open+create
operation implemented via ->create. No functionality is changed.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs has to make sure we have space to allocate new blocks in order to modify
the inode, so updating time can fail. We've gotten around this by having our
own file_update_time but this is kind of a pain, and Christoph has indicated he
would like to make xfs do something different with atime updates. So introduce
->update_time, where we will deal with i_version an a/m/c time updates and
indicate which changes need to be made. The normal version just does what it
has always done, updates the time and marks the inode dirty, and then
filesystems can choose to do something different.
I've gone through all of the users of file_update_time and made them check for
errors with the exception of the fault code since it's complicated and I wasn't
quite sure what to do there, also Jan is going to be pushing the file time
updates into page_mkwrite for those who have it so that should satisfy btrfs and
make it not a big deal to check the file_update_time() return code in the
generic fault path. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
pass inode + parent's inode or NULL instead of dentry + bool saying
whether we want the parent or not.
NOTE: that needs ceph fix folded in.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr
uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP
+AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB
KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc
ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS
tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV
4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F
Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv
5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63
BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN
NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF
/9c4zxxekjHZnn2oooEa
=oLXf
-----END PGP SIGNATURE-----
Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Pull writeback tree from Wu Fengguang:
"Mainly from Jan Kara to avoid iput() in the flusher threads."
* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Avoid iput() from flusher thread
vfs: Rename end_writeback() to clear_inode()
vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
writeback: Refactor writeback_single_inode()
writeback: Remove wb->list_lock from writeback_single_inode()
writeback: Separate inode requeueing after writeback
writeback: Move I_DIRTY_PAGES handling
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
writeback: Move clearing of I_SYNC into inode_sync_complete()
writeback: initialize global_dirty_limit
fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
mm: page-writeback.c: local functions should not be exposed globally
Don't use inode->i_blkbits which might be stale, instead calculate the blksize
information from the freshly obtained attributes.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Now we store attr->ino at inode->i_ino, return attr->ino at the
first time and then return inode->i_ino if the attribute timeout
isn't expired. That's wrong on 32 bit platforms because attr->ino
is 64 bit and inode->i_ino is 32 bit in this case.
Fix this by saving 64 bit ino in fuse_inode structure and returning
it every time we call getattr. Also squash attr->ino into inode->i_ino
explicitly.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
fallocate filesystem operation preallocates media space for the given file.
If fallocate returns success then any subsequent write to the given range
never fails with 'not enough space' error.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This patch replaces the code for getting an number from a
userspace buffer by a simple call to kstroul_from_user.
This makes it easier to read and less error prone.
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Pull fuse updates from Miklos Szeredi.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use flexible array in fuse.h
fuse: allow nanosecond granularity
fuse: O_DIRECT support for files
fuse: fix nlink after unlink
Derrik Pates reports that an utimensat with a NULL argument results in the
current time being sent from the kernel with 1 second granularity.
Reported-by: Derrik Pates <demon@now.ai>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Pull vfs pile 1 from Al Viro:
"This is _not_ all; in particular, Miklos' and Jan's stuff is not there
yet."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
ext4: initialization of ext4_li_mtx needs to be done earlier
debugfs-related mode_t whack-a-mole
hfsplus: add an ioctl to bless files
hfsplus: change finder_info to u32
hfsplus: initialise userflags
qnx4: new helper - try_extent()
qnx4: get rid of qnx4_bread/qnx4_getblk
take removal of PF_FORKNOEXEC to flush_old_exec()
trim includes in inode.c
um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
um: embed ->stub_pages[] into mmu_context
gadgetfs: list_for_each_safe() misuse
ocfs2: fix leaks on failure exits in module_init
ecryptfs: make register_filesystem() the last potential failure exit
ntfs: forgets to unregister sysctls on register_filesystem() failure
logfs: missing cleanup on register_filesystem() failure
jfs: mising cleanup on register_filesystem() failure
make configfs_pin_fs() return root dentry on success
configfs: configfs_create_dir() has parent dentry in dentry->d_parent
configfs: sanitize configfs_create()
...
Implement ->direct_IO() method in aops. The ->direct_IO() method combines
the existing fuse_direct_read/fuse_direct_write methods to implement
O_DIRECT functionality.
Reaching ->direct_IO() in the read path via generic_file_aio_read ensures
proper synchronization with page cache with its existing framework.
Reaching ->direct_IO() in the write path via fuse_file_aio_write is made
to come via generic_file_direct_write() which makes it play nice with
the page cache w.r.t other mmap pages etc.
On files marked 'direct_io' by the filesystem server, IO always follows
the fuse_direct_read/write path. There is no effect of fcntl(O_DIRECT)
and it always succeeds.
On files not marked with 'direct_io' by the filesystem server, the IO
path depends on O_DIRECT flag by the application. This can be passed
at the time of open() as well as via fcntl().
Note that asynchronous O_DIRECT iocb jobs are completed synchronously
always (this has been the case with FUSE even before this patch)
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Anand Avati reports that the following sequence of system calls fail on a fuse
filesystem:
create("filename") => 0
link("filename", "linkname") => 0
unlink("filename") => 0
link("linkname", "filename") => -ENOENT ### BUG ###
vfs_link() fails with ENOENT if i_nlink is zero, this is done to prevent
resurrecting already deleted files.
Fuse clears i_nlink on unlink even if there are other links pointing to the
file.
Reported-by: Anand Avati <avati@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
FUSE: Notifying the kernel of deletion.
fuse: support ioctl on directories
fuse: Use kcalloc instead of kzalloc to allocate array
fuse: llseek optimize SEEK_CUR and SEEK_SET
vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allows a FUSE file-system to tell the kernel when a file or directory is
deleted. If the specified dentry has the specified inode number, the kernel will
unhash it.
The current 'fuse_notify_inval_entry' does not cause the kernel to clean up
directories that are in use properly, and as a result the users of those
directories see incorrect semantics from the file-system. The error condition
seen when 'fuse_notify_inval_entry' is used to notify of a deleted directory is
avoided when 'fuse_notify_delete' is used instead.
The following scenario demonstrates the difference:
1. User A chdirs into 'testdir' and starts reading 'testfile'.
2. User B rm -rf 'testdir'.
3. User B creates 'testdir'.
4. User C chdirs into 'testdir'.
If you run the above within the same machine on any file-system (including fuse
file-systems), there is no problem: user C is able to chdir into the new
testdir. The old testdir is removed from the dentry tree, but still open by user
A.
If operations 2 and 3 are performed via the network such that the fuse
file-system uses one of the notify functions to tell the kernel that the nodes
are gone, then the following error occurs for user C while user A holds the
original directory open:
muirj@empacher:~> ls /test/testdir
ls: cannot access /test/testdir: No such file or directory
The issue here is that the kernel still has a dentry for testdir, and so it is
requesting the attributes for the old directory, while the file-system is
responding that the directory no longer exists.
If on the other hand, if the file-system can notify the kernel that the
directory is deleted using the new 'fuse_notify_delete' function, then the above
ls will find the new directory as expected.
Signed-off-by: John Muir <john@jmuir.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Multiplexing filesystems may want to support ioctls on the underlying
files and directores (e.g. FS_IOC_{GET,SET}FLAGS).
Ioctl support on directories was missing so add it now.
Reported-by: Antonio SJ Musumeci <bile@landofbile.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The advantage of kcalloc is, that will prevent integer overflows which could
result from the multiplication of number of elements and size and it is also
a bit nicer to read.
The semantic patch that makes this change is available
in https://lkml.org/lkml/2011/11/25/107
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Use generic_file_llseek() instead of open coding the seek function.
i_mutex protection is only necessary for SEEK_END (and SEEK_HOLE, SEEK_DATA), so
move SEEK_CUR and SEEK_SET out from under i_mutex.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fix race between lseek(fd, 0, SEEK_CUR) and read/write. This was fixed in
generic code by commit 5b6f1eb97d (vfs: lseek(fd, 0, SEEK_CUR) race condition).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The test in fuse_file_llseek() "not SEEK_CUR or not SEEK_SET" always evaluates
to true.
This was introduced in 3.1 by commit 06222e49 (fs: handle SEEK_HOLE/SEEK_DATA
properly in all fs's that define their own llseek) and changed the behavior of
SEEK_CUR and SEEK_SET to always retrieve the file attributes. This is a
performance regression.
Fix the test so that it makes sense.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
CC: Josef Bacik <josef@redhat.com>
CC: Al Viro <viro@zeniv.linux.org.uk>
Fix two bugs in fuse_retrieve():
- retrieving more than one page would yield repeated instances of the
first page
- if more than FUSE_MAX_PAGES_PER_REQ pages were requested than the
request page array would overflow
fuse_retrieve() was added in 2.6.36 and these bugs had been there since the
beginning.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Some files were using the complete module.h infrastructure without
actually including the header at all. Fix them up in advance so
once the implicit presence is removed, we won't get failures like this:
CC [M] fs/nfsd/nfssvc.o
fs/nfsd/nfssvc.c: In function 'nfsd_create_serv':
fs/nfsd/nfssvc.c:335: error: 'THIS_MODULE' undeclared (first use in this function)
fs/nfsd/nfssvc.c:335: error: (Each undeclared identifier is reported only once
fs/nfsd/nfssvc.c:335: error: for each function it appears in.)
fs/nfsd/nfssvc.c: In function 'nfsd':
fs/nfsd/nfssvc.c:555: error: implicit declaration of function 'module_put_and_exit'
make[3]: *** [fs/nfsd/nfssvc.o] Error 1
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Commit 37fb3a30b4 ("fuse: fix flock") added in 3.1-rc4 caused flock() to
fail with ENOSYS with the kernel ABI version 7.16 or earlier.
Fix by falling back to testing FUSE_POSIX_LOCKS for ABI versions 7.16
and earlier.
Reported-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message
fuse: mark pages accessed when written to
fuse: delete dead .write_begin and .write_end aops
fuse: fix flock
fuse: fix non-ANSI void function notation
FUSE_NOTIFY_INVAL_ENTRY didn't check the length of the write so the
message processing could overrun and result in a "kernel BUG at
fs/fuse/dev.c:629!"
Reported-by: Han-Wen Nienhuys <hanwenn@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
As fuse does not use the page cache library functions when userspace
writes to a file, it did not benefit from 'c8236db mm: mark page
accessed before we write_end()' that made sure pages are properly
marked accessed when written to.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Ever since 'ea9b990 fuse: implement perform_write', the .write_begin
and .write_end aops have been dead code.
Their task - acquiring a page from the page cache, sending out a write
request and releasing the page again - is now done batch-wise to
maximize the number of pages send per userspace request.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Commit a9ff4f87 "fuse: support BSD locking semantics" overlooked a
number of issues with supporing flock locks over existing POSIX
locking infrastructure:
- it's not backward compatible, passing flock(2) calls to userspace
unconditionally (if userspace sets FUSE_POSIX_LOCKS)
- it doesn't cater for the fact that flock locks are automatically
unlocked on file release
- it doesn't take into account the fact that flock exclusive locks
(write locks) don't need an fd opened for write.
The last one invalidates the original premise of the patch that flock
locks can be emulated with POSIX locks.
This patch fixes the first two issues. The last one needs to be fixed
in userspace if the filesystem assumed that a write lock will happen
only on a file operned for write (as in the case of the current fuse
library).
Reported-by: Sebastian Pipping <webmaster@hartwork.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* 'for-3.1' of git://linux-nfs.org/~bfields/linux:
nfsd: don't break lease on CLAIM_DELEGATE_CUR
locks: rename lock-manager ops
nfsd4: update nfsv4.1 implementation notes
nfsd: turn on reply cache for NFSv4
nfsd4: call nfsd4_release_compoundargs from pc_release
nfsd41: Deny new lock before RECLAIM_COMPLETE done
fs: locks: remove init_once
nfsd41: check the size of request
nfsd41: error out when client sets maxreq_sz or maxresp_sz too small
nfsd4: fix file leak on open_downgrade
nfsd4: remember to put RW access on stateid destruction
NFSD: Added TEST_STATEID operation
NFSD: added FREE_STATEID operation
svcrpc: fix list-corrupting race on nfsd shutdown
rpc: allow autoloading of gss mechanisms
svcauth_unix.c: quiet sparse noise
svcsock.c: include sunrpc.h to quiet sparse noise
nfsd: Remove deprecated nfsctl system call and related code.
NFSD: allow OP_DESTROY_CLIENTID to be only op in COMPOUND
Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the properly due-dilligence is done. For example
in NFS/CIFS we need to make sure the file size is update properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Both the filesystem and the lock manager can associate operations with a
lock. Confusingly, one of them (fl_release_private) actually has the
same name in both operation structures.
It would save some confusion to give the lock-manager ops different
names.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Caching "we have already removed suid/caps" was overenthusiastic as merged.
On network filesystems we might have had suid/caps set on another client,
silently picked by this client on revalidate, all of that *without* clearing
the S_NOSEC flag.
AFAICS, the only reasonably sane way to deal with that is
* new superblock flag; unless set, S_NOSEC is not going to be set.
* local block filesystems set it in their ->mount() (more accurately,
mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
local block ones clear it)
* if any network filesystem (or a cluster one) wants to use S_NOSEC,
it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
inode attribute changes are picked from other clients.
It's not an earth-shattering hole (anybody that can set suid on another client
will almost certainly be able to write to the file before doing that anyway),
but it's a bug that needs fixing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix void function parameter list sparse warning:
fs/fuse/inode.c:74:44: warning: non-ANSI function declaration of function 'fuse_alloc_forget'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fuse has no problems with references to unlinked directories.
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: fuse-devel@lists.sourceforge.net
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (25 commits)
cifs: remove unnecessary dentry_unhash on rmdir/rename_dir
ocfs2: remove unnecessary dentry_unhash on rmdir/rename_dir
exofs: remove unnecessary dentry_unhash on rmdir/rename_dir
nfs: remove unnecessary dentry_unhash on rmdir/rename_dir
ext2: remove unnecessary dentry_unhash on rmdir/rename_dir
ext3: remove unnecessary dentry_unhash on rmdir/rename_dir
ext4: remove unnecessary dentry_unhash on rmdir/rename_dir
btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir
ceph: remove unnecessary dentry_unhash calls
vfs: clean up vfs_rename_other
vfs: clean up vfs_rename_dir
vfs: clean up vfs_rmdir
vfs: fix vfs_rename_dir for FS_RENAME_DOES_D_MOVE filesystems
libfs: drop unneeded dentry_unhash
vfs: update dentry_unhash() comment
vfs: push dentry_unhash on rename_dir into file systems
vfs: push dentry_unhash on rmdir into file systems
vfs: remove dget() from dentry_unhash()
vfs: dentry_unhash immediately prior to rmdir
vfs: Block mmapped writes while the fs is frozen
...
Only a few file systems need this. Start by pushing it down into each
rename method (except gfs2 and xfs) so that it can be dealt with on a
per-fs basis.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only a few file systems need this. Start by pushing it down into each
fs rmdir method (except gfs2 and xfs) so it can be dealt with on a per-fs
basis.
This does not change behavior for any in-tree file systems.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some cases (e.g. ecryptfs) can call ->dentry_revalidate with NULL
nameidata.
https://bugzilla.kernel.org/show_bug.cgi?id=34732
Tyler Hicks pointed out that this bug was introduced by commit
e7c0a16786 "fuse: make fuse_dentry_revalidate() RCU aware"
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
This function basically does:
remove_from_page_cache(old);
page_cache_release(old);
add_to_page_cache_locked(new);
Except it does this atomically, so there's no possibility for the "add" to
fail because of a race.
If memory cgroups are enabled, then the memory cgroup charge is also moved
from the old page to the new.
This function is currently used by fuse to move pages into the page cache
on read, instead of copying the page contents.
[minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only bail out of fuse_dentry_revalidate() on LOOKUP_RCU when blocking
is actually necessary.
CC: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Only bail out of fuse_permission() on IPERM_FLAG_RCU when blocking is
actually necessary.
CC: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If a fuse dev connection is broken, wake up any
processes that are blocking, in a poll system call,
on one of the files in the now defunct filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reduce the size of struct fuse_request by removing cuse_init_out from
the request structure and allocating it dinamically instead.
CC: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
The exportfs encode handle function should return the minimum required
handle size. This helps user to find out the handle size by passing 0
handle size in the first step and then redoing to the call again with
the returned handle size value.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Commit e1181ee6 "vfs: pass struct file to do_truncate on O_TRUNC
opens" broke the behavior of open(O_TRUNC|O_RDONLY) in fuse. Fuse
assumed that when called from open, a truncate() will be done, not an
ftruncate().
Fix by restoring the old behavior, based on the ATTR_OPEN flag.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Single threaded NTFS-3G could get stuck if a delayed RELEASE reply
triggered a DESTROY request via path_put().
Fix this by
a) making RELEASE requests synchronous, whenever possible, on fuseblk
filesystems
b) if not possible (triggered by an asynchronous read/write) then do
the path_put() in a separate thread with schedule_work().
Reported-by: Oliver Neukum <oneukum@suse.de>
Cc: stable@kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix ioctl ABI
fuse: allow batching of FORGET requests
fuse: separate queue for FORGET requests
fuse: ioctl cleanup
Fix up trivial conflict in fs/fuse/inode.c due to RCU lookup having done
the RCU-freeing of the inode in fuse_destroy_inode().
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
In kernel ABI version 7.16 and later FUSE_IOCTL_RETRY reply from a
unrestricted IOCTL request shall return with an array of 'struct
fuse_ioctl_iovec' instead of 'struct iovec'. This fixes the ABI
ambiguity of 32bit vs. 64bit.
Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
Terje Malmedal reports that a fuse filesystem with 32 million inodes
on a machine with lots of memory can take up to 30 minutes to process
FORGET requests when all those inodes are evicted from the icache.
To solve this, create a BATCH_FORGET request that allows up to about
8000 FORGET requests to be sent in a single message.
This request is only sent if userspace supports interface version 7.16
or later, otherwise fall back to sending individual FORGET messages.
Reported-by: Terje Malmedal <terje.malmedal@usit.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Terje Malmedal reports that a fuse filesystem with 32 million inodes
on a machine with lots of memory can go unresponsive for up to 30
minutes when all those inodes are evicted from the icache.
The reason is that FORGET messages, sent when the inode is evicted,
are queued up together with regular filesystem requests, and while the
huge queue of FORGET messages are processed no other filesystem
operation can proceed.
Since a full fuse request structure is allocated for each inode, these
take up quite a bit of memory as well.
To solve these issues, create a slim 'fuse_forget_link' structure
containing just the minimum of information required to send the FORGET
request and chain these on a separate queue.
When userspace is asking for a request make sure that FORGET and
non-FORGET requests are selected fairly: for each 8 non-FORGET allow
16 FORGET requests. This will make sure FORGETs do not pile up, yet
other requests are also allowed to proceed while the queued FORGETs
are processed.
Reported-by: Terje Malmedal <terje.malmedal@usit.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Verify that the total length of the iovec returned in FUSE_IOCTL_RETRY
doesn't overflow iov_length().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org> [2.6.31+]
If a 32bit CUSE server is run on 64bit this results in EIO being
returned to the caller.
The reason is that FUSE_IOCTL_RETRY reply was defined to use 'struct
iovec', which is different on 32bit and 64bit archs.
Work around this by looking at the size of the reply to determine
which struct was used. This is only needed if CONFIG_COMPAT is
defined.
A more permanent fix for the interface will be to use the same struct
on both 32bit and 64bit.
Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org> [2.6.31+]
The attribute cache for a file was not being cleared when a file is opened
with O_TRUNC.
If the filesystem's open operation truncates the file ("atomic_o_trunc"
feature flag is set) then the kernel should invalidate the cached st_mtime
and st_ctime attributes.
Also i_size should be explicitly be set to zero as it is used sometimes
without refreshing the cache.
Signed-off-by: Ken Sumrall <ksumrall@android.com>
Cc: Anfei <anfei.zhou@gmail.com>
Cc: "Anand V. Avati" <avati@gluster.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace iterated page_cache_release() with release_pages(), which is
faster and shorter.
Needs release_pages() to be exported to modules.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
Commit 7909b1c640 ("fuse: don't use atomic kmap") removed KM_USER0 usage
from fuse/dev.c. Switch KM_USER1 uses to KM_USER0 for clarity. Also
replace open coded clear_highpage().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After all that's what they are intended for.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
fs/fuse/dev.c:1357: warning: ‘total_len’ may be used uninitialized in this
function
Initialize total_len to zero, else its value will be undefined.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Sparse doesn't understand lock annotations of the form
__releases(&foo->lock). Change them to __releases(foo->lock). Same
for __acquires().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
David Bartly reported that fuse can hang in fuse_get_req_nofail() when
the connection to the filesystem server is no longer active.
If bg_queue is not empty then flush_bg_queue() called from
request_end() can put more requests on to the pending queue. If this
happens while ending requests on the processing queue then those
background requests will be queued to the pending list and never
ended.
Another problem is that fuse_dev_release() didn't wake up processes
sleeping on blocked_waitq.
Solve this by:
a) flushing the background queue before calling end_requests() on the
pending and processing queues
b) setting blocked = 0 and waking up processes waiting on
blocked_waitq()
Thanks to David for an excellent bug report.
Reported-by: David Bartley <andareed@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
no need for list_for_each_entry_safe()/resetting with superblock list
Fix sget() race with failing mount
vfs: don't hold s_umount over close_bdev_exclusive() call
sysv: do not mark superblock dirty on remount
sysv: do not mark superblock dirty on mount
btrfs: remove junk sb_dirt change
BFS: clean up the superblock usage
AFFS: wait for sb synchronization when needed
AFFS: clean up dirty flag usage
cifs: truncate fallout
mbcache: fix shrinker function return value
mbcache: Remove unused features
add f_flags to struct statfs(64)
pass a struct path to vfs_statfs
update VFS documentation for method changes.
All filesystems that need invalidate_inode_buffers() are doing that explicitly
convert remaining ->clear_inode() to ->evict_inode()
Make ->drop_inode() just return whether inode needs to be dropped
fs/inode.c:clear_inode() is gone
fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
...
Fix up trivial conflicts in fs/nilfs2/super.c
Make sure we check the truncate constraints early on in ->setattr by adding
those checks to inode_change_ok. Also clean up and document inode_change_ok
to make this obvious.
As a fallout we don't have to call inode_newsize_ok from simple_setsize and
simplify it down to a truncate_setsize which doesn't return an error. This
simplifies a lot of setattr implementations and means we use truncate_setsize
almost everywhere. Get rid of fat_setsize now that it's trivial and mark
ext2_setsize static to make the calling convention obvious.
Keep the inode_newsize_ok in vmtruncate for now as all callers need an
audit for its removal anyway.
Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
needs a deeper audit, but that is left for later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure we call inode_change_ok before doing any changes in ->setattr,
and make sure to call it even if our fs wants to ignore normal UNIX
permissions, but use the ATTR_FORCE to skip those.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently MAY_ACCESS means that filesystems must check the permissions
right then and not rely on cached results or the results of future
operations on the object. This can be because of a call to sys_access() or
because of a call to chdir() which needs to check search without relying on
any future operations inside that dir. I plan to use MAY_ACCESS for other
purposes in the security system, so I split the MAY_ACCESS and the
MAY_CHDIR cases.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Userspace filesystem can request data to be retrieved from the inode's
mapping. This request is synchronous and the retrieved data is queued
as a new request. If the write to the fuse device returns an error
then the retrieve request was not completed and a reply will not be
sent.
Only present pages are returned in the retrieve reply. Retrieving
stops when it finds a non-present page and only data prior to that is
returned.
This request doesn't change the dirty state of pages.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Userspace filesystem can request data to be stored in the inode's
mapping. This request is synchronous and has no reply. If the write
to the fuse device returns an error then the store request was not
fully completed (but may have updated some pages).
If the stored data overflows the current file size, then the size is
extended, similarly to a write(2) on the filesystem.
Pages which have been completely stored are marked uptodate.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Don't use atomic kmap for mapping userspace buffers in device
read/write/splice.
This is necessary because the next patch (adding store notify)
requires that caller of fuse_copy_page() may sleep between
invocations. The simplest way to ensure this is to change the atomic
kmaps to non-atomic ones.
Thankfully architectures where kmap() is not a no-op are going out of
fashion, so we can ignore the (probably negligible) performance impact
of this change.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
mm: export generic_pipe_buf_*() to modules
fuse: support splice() reading from fuse device
fuse: allow splice to move pages
mm: export remove_from_page_cache() to modules
mm: export lru_cache_add_*() to modules
fuse: support splice() writing to fuse device
fuse: get page reference for readpages
fuse: use get_user_pages_fast()
fuse: remove unneeded variable
This adds:
alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.
Ideally all these modules would be compiled-in, but distros seems too
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.
The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
$ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
# Device nodes to trigger on-demand module loading.
microcode cpu/microcode c10:184
fuse fuse c10:229
ppp_generic ppp c108:0
tun net/tun c10:200
dm_mod mapper/control c10:235
Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
$ /sbin/udevd --debug
...
static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
static_dev_create_from_modules: mknod '/dev/fuse' c10:229
static_dev_create_from_modules: mknod '/dev/ppp' c108:0
static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.
Note:
The devname aliases must be limited to the *common* and *single*instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless if they are ever used.
This facility is to hide the mess distros are creating with too modualized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Allow userspace filesystem implementation to use splice() to read from
the fuse device.
The userspace filesystem can now transfer data coming from a WRITE
request to an arbitrary file descriptor (regular file, block device or
socket) without having to go through a userspace buffer.
The semantics of using splice() to read messages are:
1) with a single splice() call move the whole message from the fuse
device to a temporary pipe
2) read the header from the pipe and determine the message type
3a) if message is a WRITE then splice data from pipe to destination
3b) else read rest of message to userspace buffer
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
When splicing buffers to the fuse device with SPLICE_F_MOVE, try to
move pages from the pipe buffer into the page cache. This allows
populating the fuse filesystem's cache without ever touching the page
contents, i.e. zero copy read capability.
The following steps are performed when trying to move a page into the
page cache:
- buf->ops->confirm() to make sure the new page is uptodate
- buf->ops->steal() to try to remove the new page from it's previous place
- remove_from_page_cache() on the old page
- add_to_page_cache_locked() on the new page
If any of the above steps fail (non fatally) then the code falls back
to copying the page. In particular ->steal() will fail if there are
external references (other than the page cache and the pipe buffer) to
the page.
Also since the remove_from_page_cache() + add_to_page_cache_locked()
are non-atomic it is possible that the page cache is repopulated in
between the two and add_to_page_cache_locked() will fail. This could
be fixed by creating a new atomic replace_page_cache_page() function.
fuse_readpages_end() needed to be reworked so it works even if
page->mapping is NULL for some or all pages which can happen if the
add_to_page_cache_locked() failed.
A number of sanity checks were added to make sure the stolen pages
don't have weird flags set, etc... These could be moved into generic
splice/steal code.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Allow userspace filesystem implementation to use splice() to write to
the fuse device. The semantics of using splice() are:
1) buffer the message header and data in a temporary pipe
2) with a *single* splice() call move the message from the temporary pipe
to the fuse device
The READ reply message has the most interesting use for this, since
now the data from an arbitrary file descriptor (which could be a
regular file, a block device or a socket) can be tranferred into the
fuse device without having to go through a userspace buffer. It will
also allow zero copy moving of pages.
One caveat is that the protocol on the fuse device requires the length
of the whole message to be written into the header. But the length of
the data transferred into the temporary pipe may not be known in
advance. The current library implementation works around this by
using vmplice to write the header and modifying the header after
splicing the data into the pipe (error handling omitted):
struct fuse_out_header out;
iov.iov_base = &out;
iov.iov_len = sizeof(struct fuse_out_header);
vmsplice(pip[1], &iov, 1, 0);
len = splice(input_fd, input_offset, pip[1], NULL, len, 0);
/* retrospectively modify the header: */
out.len = len + sizeof(struct fuse_out_header);
splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags);
This works since vmsplice only saves a pointer to the data, it does
not copy the data itself.
Since pipes are currently limited to 16 pages and messages need to be
spliced atomically, the length of the data is limited to 15 pages (or
60kB for 4k pages).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acquire a page ref on pages in ->readpages() and release them when the
read has finished. Not acquiring a reference didn't seem to cause any
trouble since the page is locked and will not be kicked out of the
page cache during the read.
However the following patches will want to remove the page from the
cache so a separate ref is needed. Making the reference in req->pages
explicit also makes the code easier to understand.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Replace uses of get_user_pages() with get_user_pages_fast(). It looks
nicer and should be faster in most cases.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
"map" isn't needed any more after: 0bd87182d3 "fuse: fix kunmap in
fuse_ioctl_copy_user"
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
In particular, several occurances of funny versions of 'success',
'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
'beginning', 'desirable', 'separate' and 'necessary' are fixed.
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
gcc 4.4 warns about:
fs/fuse/dev.c: In function ‘fuse_notify_inval_entry’:
fs/fuse/dev.c:925: warning: the frame size of 1060 bytes is larger than 1024 bytes
The problem is we declare two structures and a large array on the stack,
I move the array alway from the stack and allocate memory for it dynamically.
Signed-off-by: Fang Wenqi <antonf@turbolinux.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The cache alias problem will happen if the changes of user shared mapping
is not flushed before copying, then user and kernel mapping may be mapped
into two different cache line, it is impossible to guarantee the coherence
after iov_iter_copy_from_user_atomic. So the right steps should be:
flush_dcache_page(page);
kmap_atomic(page);
write to page;
kunmap_atomic(page);
flush_dcache_page(page);
More precisely, we might create two new APIs flush_dcache_user_page and
flush_dcache_kern_page to replace the two flush_dcache_page accordingly.
Here is a snippet tested on omap2430 with VIPT cache, and I think it is
not ARM-specific:
int val = 0x11111111;
fd = open("abc", O_RDWR);
addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
*(addr+0) = 0x44444444;
tmp = *(addr+0);
*(addr+1) = 0x77777777;
write(fd, &val, sizeof(int));
close(fd);
The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777.
Signed-off-by: Anfei <anfei.zhou@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-arch@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment in fuse_open about O_DIRECT:
"VFS checks this, but only _after_ ->open()"
also holds for fuse_create, however, the same kind of check was missing there.
As an impact of this bug, open(newfile, O_RDWR|O_CREAT|O_DIRECT) fails, but a
stub newfile will remain if the fuse server handled the implied FUSE_CREATE
request appropriately.
Other impact: in the above situation ima_file_free() will complain to open/free
imbalance if CONFIG_IMA is set.
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Harshavardhana <harsha@gluster.com>
Cc: stable@kernel.org
Invalidate the target's attributes, which may have changed (such as
nlink, change time) so that they are refreshed on the next getattr().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Looks like another victim of the confusing kmap() vs kmap_atomic() API
differences.
Reported-by: Todor Gyumyushev <yodor1@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
fuse_direct_io() has a loop where requests are allocated in each
iteration. if allocation fails, the loop is broken out and follows
into an unconditional fuse_put_request() on that invalid pointer.
Signed-off-by: Anand V. Avati <avati@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@kernel.org
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update some fs code to make use of new helper functions introduced
in the previous patch. Should be no significant change in behaviour
(except CIFS now calls send_sig under i_lock, via inode_newsize_ok).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: linux-nfs@vger.kernel.org
Cc: Trond.Myklebust@netapp.com
Cc: linux-cifs-client@lists.samba.org
Cc: sfrench@samba.org
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add fusectl interface to max_background
fuse: limit user-specified values of max background requests
fuse: use drop_nlink() instead of direct nlink manipulation
fuse: document protocol version negotiation
fuse: make the number of max background requests and congestion threshold tunable
We do this automatically in get_sb_bdev() from the set_bdev_super()
callback. Filesystems that have their own private backing_dev_info
must assign that in ->fill_super().
Note that ->s_bdi assignment is required for proper writeback!
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Make the max_background and congestion_threshold parameters of a FUSE
mount tunable at runtime by adding the respective knobs to its directory
within the fusectl filesystem.
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
An untrusted user could DoS the system if s/he were allowed to accumulate an
arbitrary number of pending background requests by setting the above limits
to extremely high values in INIT. This patch excludes this possibility by
imposing global upper limits on the possible values of per-mount "max
background requests" and "congestion threshold" parameters for unprivileged
FUSE filesystems.
These global limits are implemented as module parameters.
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
drop_nlink() is the API function to decrease the link count of an inode.
However, at a place the control filesystem used the decrement operator
on i_nlink directly. Fix this.
Cc: Anand Avati <avati@gluster.com>
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This reverts commit 097041e576.
Trond had a better fix, which is the parent of this one ("Fix compile
error due to congestion_wait() changes")
Requested-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When building v2.6.31-rc2-344-g69ca06c, the following build errors are
found due to missing includes:
CC [M] fs/fuse/dev.o
fs/fuse/dev.c: In function ‘request_end’:
fs/fuse/dev.c:289: error: ‘BLK_RW_SYNC’ undeclared (first use in this function)
...
fs/nfs/write.c: In function ‘nfs_set_page_writeback’:
fs/nfs/write.c:207: error: ‘BLK_RW_ASYNC’ undeclared (first use in this function)
Signed-off-by: Larry Finger@lwfinger.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The practical values for these limits depend on the design of the
filesystem server so let userspace set them at initialization time.
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add notification messages that allow the filesystem to invalidate VFS
caches.
Two notifications are added:
1) inode invalidation
- invalidate cached attributes
- invalidate a range of pages in the page cache (this is optional)
2) dentry invalidation
- try to invalidate a subtree in the dentry cache
Care must be taken while accessing the 'struct super_block' for the
mount, as it can go away while an invalidation is in progress. To
prevent this, introduce a rw-semaphore, that is taken for read during
the invalidation and taken for write in the ->kill_sb callback.
Cc: Csaba Henk <csaba@gluster.com>
Cc: Anand Avati <avati@zresearch.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This patch lets filesystems handle masking the file mode on creation.
This is needed if filesystem is using ACLs.
- The CREATE, MKDIR and MKNOD requests are extended with a "umask"
parameter.
- A new FUSE_DONT_MASK flag is added to the INIT request/reply. With
this the filesystem may request that the create mode is not masked.
CC: Jean-Pierre André <jean-pierre.andre@wanadoo.fr>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
On 64 bit systems -- where sizeof(ssize_t) > sizeof(int) -- the following test
exposes a bug due to a non-careful return of an int or unsigned value:
implement a FUSE filesystem which sends an unsolicited notification to
the kernel with invalid opcode. The respective write to /dev/fuse
will return (1 << 32) - EINVAL with errno == 0 instead of -1 with
errno == EINVAL.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
* 'cuse' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
CUSE: implement CUSE - Character device in Userspace
fuse: export symbols to be used by CUSE
fuse: update fuse_conn_init() and separate out fuse_conn_kill()
fuse: don't use inode in fuse_file_poll
fuse: don't use inode in fuse_do_ioctl() helper
fuse: don't use inode in fuse_sync_release()
fuse: create fuse_do_open() helper for CUSE
fuse: clean up args in fuse_finish_open() and fuse_release_fill()
fuse: don't use inode in helpers called by fuse_direct_io()
fuse: add members to struct fuse_file
fuse: prepare fuse_direct_io() for CUSE
fuse: clean up fuse_write_fill()
fuse: use struct path in release structure
fuse: misc cleanups
CUSE enables implementing character devices in userspace. With recent
additions of ioctl and poll support, FUSE already has most of what's
necessary to implement character devices. All CUSE has to do is
bonding all those components - FUSE, chardev and the driver model -
nicely.
When client opens /dev/cuse, kernel starts conversation with
CUSE_INIT. The client tells CUSE which device it wants to create. As
the previous patch made fuse_file usable without associated
fuse_inode, CUSE doesn't create super block or inodes. It attaches
fuse_file to cdev file->private_data during open and set ff->fi to
NULL. The rest of the operation is almost identical to FUSE direct IO
case.
Each CUSE device has a corresponding directory /sys/class/cuse/DEVNAME
(which is symlink to /sys/devices/virtual/class/DEVNAME if
SYSFS_DEPRECATED is turned off) which hosts "waiting" and "abort"
among other things. Those two files have the same meaning as the FUSE
control files.
The only notable lacking feature compared to in-kernel implementation
is mmap support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Update fuse_conn_init() such that it doesn't take @sb and move bdi
registration into a separate function. Also separate out
fuse_conn_kill() from fuse_put_super().
These will be used to implement cuse.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Use ff->fc and ff->nodeid instead of file->f_dentry->d_inode in the
fuse_file_poll() implementation.
This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Create a helper for sending an IOCTL request that doesn't use a struct
inode.
This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Make fuse_sync_release() a generic helper function that doesn't need a
struct inode pointer. This makes it suitable for use by CUSE.
Change return value of fuse_release_common() from int to void.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Move setting ff->fh, ff->nodeid and file->private_data outside
fuse_finish_open(). Add ->open_flags member to struct fuse_file.
This simplifies the argument passing to fuse_finish_open() and
fuse_release_fill(), and paves the way for creating an open helper
that doesn't need an inode pointer.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Use ff->fc and ff->nodeid instead of passing down the inode.
This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add new members ->fc and ->nodeid to struct fuse_file. This will aid
in converting functions for use by CUSE, where the inode is not owned
by a fuse filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Move code operating on the inode out from fuse_direct_io().
This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Move out code from fuse_write_fill() which is not common to all
callers. Remove two function arguments which become unnecessary.
Also remove unnecessary memset(), the request is already initialized
to zero.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Destroy bdi on error in fuse_fill_super().
This was an omission from commit 26c3679101
"fuse: destroy bdi on umount", which moved the bdi_destroy() call from
fuse_conn_put() to fuse_put_super().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
* fuse_file_alloc() was structured in weird way. The success path was
split between else block and code following the block. Restructure
the code such that it's easier to read and modify.
* Unindent success path of fuse_release_common() to ease future
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
MAP_PRIVATE mmap could return stale data from the cache for
"direct_io" files. Fix this by flushing the cache on mmap.
Found with a slightly modified fsx-linux.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fix the following warning:
fs/fuse/file.c: In function 'fuse_direct_io':
fs/fuse/file.c:1002: warning: passing argument 3 of 'fuse_get_user_pages' from incompatible pointer type
This was introduced by commit f4975c67 "fuse: allow kernel to access
"direct_io" files".
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Allow MAP_PRIVATE mmaps of "direct_io" files. This is necessary for
execute support.
MAP_SHARED mappings require some sort of coherency between the
underlying file and the mapping. With "direct_io" it is difficult to
provide this, so for the moment just disallow shared (read-write and
read-only) mappings altogether.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Allow the kernel read and write on "direct_io" files. This is
necessary for nfs export and execute support.
The implementation is simple: if an access from the kernel is
detected, don't perform get_user_pages(), just use the kernel address
provided by the requester to copy from/to the userspace filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags. There should be no functional change.
This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg. virtual_address to the
driver, which might be important in some special cases).
This is required for a subsequent fix. And will also make it easier to
merge page_mkwrite() with fault() in future.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This bug was found with smatch (http://repo.or.cz/w/smatch.git/). If
we return directly the inode->i_mutex lock doesn't get released.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
* 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc: (36 commits)
fs/Kconfig: move 9p out
fs/Kconfig: move afs out
fs/Kconfig: move coda out
fs/Kconfig: move the rest of ncpfs out
fs/Kconfig: move smbfs out
fs/Kconfig: move sunrpc out
fs/Kconfig: move nfsd out
fs/Kconfig: move nfs out
fs/Kconfig: move ufs out
fs/Kconfig: move sysv out
fs/Kconfig: move romfs out
fs/Kconfig: move qnx4 out
fs/Kconfig: move hpfs out
fs/Kconfig: move omfs out
fs/Kconfig: move minix out
fs/Kconfig: move vxfs out
fs/Kconfig: move squashfs out
fs/Kconfig: move cramfs out
fs/Kconfig: move efs out
fs/Kconfig: move bfs out
...
Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup().
This is not a big issue because fuse_notify_poll_wakeup() should be
atomic, but it's cleaner this way, and later uses of notification will
need to be able to finish the copying before performing some actions.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If a fuse filesystem is unmounted but the device file descriptor
remains open and a new mount reuses the old device number, then the
mount fails with EEXIST and the following warning is printed in the
kernel log:
WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
sysfs: duplicate filename '0:15' can not be created
The cause is that the bdi belonging to the fuse filesystem was
destoryed only after the device file was released. Fix this by
calling bdi_destroy() from fuse_put_super() instead.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Fix the leaking file reference if allocation or initialization of
fuse_conn failed.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
ff is set to NULL and then dereferenced on line 65. Compile tested only.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: clean up annotations of fc->lock
fuse: fix sparse warning in ioctl
fuse: update interface version
fuse: add fuse_conn->release()
fuse: separate out fuse_conn_init() from new_conn()
fuse: add fuse_ prefix to several functions
fuse: implement poll support
fuse: implement unsolicited notification
fuse: add file kernel handle
fuse: implement ioctl support
fuse: don't let fuse_req->end() put the base reference
fuse: move FUSE_MINOR to miscdevice.h
fuse: style fixes
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Makes the existing annotations match the more common one per line style
and adds a few missing annotations.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add fuse_conn->release() so that fuse_conn can be embedded in other
structures.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Separate out fuse_conn_init() from new_conn() and while at it
initialize fuse_conn->entry during conn initialization.
This will be used by CUSE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add fuse_ prefix to request_send*() and get_root_inode() as some of
those functions will be exported for CUSE. With or without CUSE
export, having the function names scoped is a good idea for
debuggability.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Implement poll support. Polled files are indexed using kh in a RB
tree rooted at fuse_conn->polled_files.
Client should send FUSE_NOTIFY_POLL notification once after processing
FUSE_POLL which has FUSE_POLL_SCHEDULE_NOTIFY set. Sending
notification unconditionally after the latest poll or everytime file
content might have changed is inefficient but won't cause malfunction.
fuse_file_poll() can sleep and requires patches from the following
thread which allows f_op->poll() to sleep.
http://thread.gmane.org/gmane.linux.kernel/726176
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Clients always used to write only in response to read requests. To
implement poll efficiently, clients should be able to issue
unsolicited notifications. This patch implements basic notification
support.
Zero fuse_out_header.unique is now accepted and considered unsolicited
notification and the error field contains notification code. This
patch doesn't implement any actual notification.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The file handle, fuse_file->fh, is opaque value supplied by userland
FUSE server and uniqueness is not guaranteed. Add file kernel handle,
fuse_file->kh, which is allocated by the kernel on file allocation and
guaranteed to be unique.
This will be used by poll to match notification to the respective file
but can be used for other purposes where unique file handle is
necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Generic ioctl support is tricky to implement because only the ioctl
implementation itself knows which memory regions need to be read
and/or written. To support this, fuse client can request retry of
ioctl specifying memory regions to read and write. Deep copying
(nested pointers) can be implemented by retrying multiple times
resolving one depth of dereference at a time.
For security and cleanliness considerations, ioctl implementation has
restricted mode where the kernel determines data transfer directions
and sizes using the _IOC_*() macros on the ioctl command. In this
mode, retry is not allowed.
For all FUSE servers, restricted mode is enforced. Unrestricted ioctl
will be used by CUSE.
Plese read the comment on top of fs/fuse/file.c::fuse_file_do_ioctl()
for more information.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse_req->end() was supposed to be put the base reference but there's
no reason why it should. It only makes things more complex. Move it
out of ->end() and make it the responsibility of request_end().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>