Commit Graph

474 Commits

Author SHA1 Message Date
Tony Solomonik
5f0d594c60 Add do_ftruncate that truncates a struct file
do_sys_ftruncate receives a file descriptor, fgets the struct file, and
finally actually truncates the file.

do_ftruncate allows for passing in a file directly, with the caller
already holding a reference to it.

Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20240202121724.17461-2-tony.solomonik@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09 08:57:56 -07:00
Jeff Layton
a69ce85ec9
filelock: split common fields into struct file_lock_core
In a future patch, we're going to split file leases into their own
structure. Since a lot of the underlying machinery uses the same fields
move those into a new file_lock_core, and embed that inside struct
file_lock.

For now, add some macros to ensure that we can continue to build while
the conversion is in progress.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:38 +01:00
Linus Torvalds
bb93c5ed45 vfs-6.8.rw
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzXQAKCRCRxhvAZXjc
 ogOtAQDpqUp1zY4dV/dZisCJ5xarZTsSZ1AvgmcxZBtS0NhbdgEAshWvYGA9ryS/
 ChL5jjtjjZDLhRA//reoFHTQIrdp2w8=
 =bF+R
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs rw updates from Christian Brauner:
 "This contains updates from Amir for read-write backing file helpers
  for stacking filesystems such as overlayfs:

   - Fanotify is currently in the process of introducing pre content
     events. Roughly, a new permission event will be added indicating
     that it is safe to write to the file being accessed. These events
     are used by hierarchical storage managers to e.g., fill the content
     of files on first access.

     During that work we noticed that our current permission checking is
     inconsistent in rw_verify_area() and remap_verify_area().
     Especially in the splice code permission checking is done multiple
     times. For example, one time for the whole range and then again for
     partial ranges inside the iterator.

     In addition, we mostly do permission checking before we call
     file_start_write() except for a few places where we call it after.
     For pre-content events we need such permission checking to be done
     before file_start_write(). So this is a nice reason to clean this
     all up.

     After this series, all permission checking is done before
     file_start_write().

     As part of this cleanup we also massaged the splice code a bit. We
     got rid of a few helpers because we are alredy drowning in special
     read-write helpers. We also cleaned up the return types for splice
     helpers.

   - Introduce generic read-write helpers for backing files. This lifts
     some overlayfs code to common code so it can be used by the FUSE
     passthrough work coming in over the next cycles. Make Amir and
     Miklos the maintainers for this new subsystem of the vfs"

* tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  fs: fix __sb_write_started() kerneldoc formatting
  fs: factor out backing_file_mmap() helper
  fs: factor out backing_file_splice_{read,write}() helpers
  fs: factor out backing_file_{read,write}_iter() helpers
  fs: prepare for stackable filesystems backing file helpers
  fsnotify: optionally pass access range in file permission hooks
  fsnotify: assert that file_start_write() is not held in permission hooks
  fsnotify: split fsnotify_perm() into two hooks
  fs: use splice_copy_file_range() inline helper
  splice: return type ssize_t from all helpers
  fs: use do_splice_direct() for nfsd/ksmbd server-side-copy
  fs: move file_start_write() into direct_splice_actor()
  fs: fork splice_file_range() from do_splice_direct()
  fs: create {sb,file}_write_not_started() helpers
  fs: create file_write_started() helper
  fs: create __sb_write_started() helper
  fs: move kiocb_start_write() into vfs_iocb_iter_write()
  fs: move permission hook out of do_iter_read()
  fs: move permission hook out of do_iter_write()
  fs: move file_start_write() into vfs_iter_write()
  ...
2024-01-08 11:11:51 -08:00
Linus Torvalds
c604110e66 vfs-6.8.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
 ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
 KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
 =0bRl
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Add Jan Kara as VFS reviewer

   - Show correct device and inode numbers in proc/<pid>/maps for vma
     files on stacked filesystems. This is now easily doable thanks to
     the backing file work from the last cycles. This comes with
     selftests

  Cleanups:

   - Remove a redundant might_sleep() from wait_on_inode()

   - Initialize pointer with NULL, not 0

   - Clarify comment on access_override_creds()

   - Rework and simplify eventfd_signal() and eventfd_signal_mask()
     helpers

   - Process aio completions in batches to avoid needless wakeups

   - Completely decouple struct mnt_idmap from namespaces. We now only
     keep the actual idmapping around and don't stash references to
     namespaces

   - Reformat maintainer entries to indicate that a given subsystem
     belongs to fs/

   - Simplify fput() for files that were never opened

   - Get rid of various pointless file helpers

   - Rename various file helpers

   - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
     last cycle

   - Make relatime_need_update() return bool

   - Use GFP_KERNEL instead of GFP_USER when allocating superblocks

   - Replace deprecated ida_simple_*() calls with their current ida_*()
     counterparts

  Fixes:

   - Fix comments on user namespace id mapping helpers. They aren't
     kernel doc comments so they shouldn't be using /**

   - s/Retuns/Returns/g in various places

   - Add missing parameter documentation on can_move_mount_beneath()

   - Rename i_mapping->private_data to i_mapping->i_private_data

   - Fix a false-positive lockdep warning in pipe_write() for watch
     queues

   - Improve __fget_files_rcu() code generation to improve performance

   - Only notify writer that pipe resizing has finished after setting
     pipe->max_usage otherwise writers are never notified that the pipe
     has been resized and hang

   - Fix some kernel docs in hfsplus

   - s/passs/pass/g in various places

   - Fix kernel docs in ntfs

   - Fix kcalloc() arguments order reported by gcc 14

   - Fix uninitialized value in reiserfs"

* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  reiserfs: fix uninit-value in comp_keys
  watch_queue: fix kcalloc() arguments order
  ntfs: dir.c: fix kernel-doc function parameter warnings
  fs: fix doc comment typo fs tree wide
  selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
  fs/proc: show correct device and inode numbers in /proc/pid/maps
  eventfd: Remove usage of the deprecated ida_simple_xx() API
  fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
  fs/hfsplus: wrapper.c: fix kernel-doc warnings
  fs: add Jan Kara as reviewer
  fs/inode: Make relatime_need_update return bool
  pipe: wakeup wr_wait after setting max_usage
  file: remove __receive_fd()
  file: stop exposing receive_fd_user()
  fs: replace f_rcuhead with f_task_work
  file: remove pointless wrapper
  file: s/close_fd_get_file()/file_close_fd()/g
  Improve __fget_files_rcu() code generation (and thus __fget_light())
  file: massage cleanup of files that failed to open
  fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
  ...
2024-01-08 10:26:08 -08:00
Amir Goldstein
f91a704f71 fs: prepare for stackable filesystems backing file helpers
In preparation for factoring out some backing file io helpers from
overlayfs, move backing_file_open() into a new file fs/backing-file.c
and header.

Add a MAINTAINERS entry for stackable filesystems and add a Kconfig
FS_STACK which stackable filesystems need to select.

For now, the backing_file struct, the backing_file alloc/free functions
and the backing_file_real_path() accessor remain internal to file_table.c.
We may change that in the future.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-12-23 16:35:08 +02:00
Christian Brauner
2137e15642
Merge branch 'vfs.file'
Bring in the changes to the file infrastructure for this cycle. Mostly
cleanups and some performance tweaks.

* file: remove __receive_fd()
* file: stop exposing receive_fd_user()
* fs: replace f_rcuhead with f_task_work
* file: remove pointless wrapper
* file: s/close_fd_get_file()/file_close_fd()/g
* Improve __fget_files_rcu() code generation (and thus __fget_light())
* file: massage cleanup of files that failed to open

Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-21 13:21:52 +01:00
Jens Axboe
ae1914174a cred: get rid of CONFIG_DEBUG_CREDENTIALS
This code is rarely (never?) enabled by distros, and it hasn't caught
anything in decades. Let's kill off this legacy debug code.

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-15 14:19:48 -08:00
Amir Goldstein
d9e5d31084
fsnotify: optionally pass access range in file permission hooks
In preparation for pre-content permission events with file access range,
move fsnotify_file_perm() hook out of security_file_permission() and into
the callers.

Callers that have the access range information call the new hook
fsnotify_file_area_perm() with the access range.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231212094440.250945-6-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-12 16:20:02 +01:00
Christian Brauner
a88c955fcf
file: s/close_fd_get_file()/file_close_fd()/g
That really shouldn't have "get" in there as that implies we're bumping
the reference count which we don't do at all. We used to but not anmore.
Now we're just closing the fd and pick that file from the fdtable
without bumping the reference count. Update the wrong documentation
while at it.

Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-1-e73ca6f4ea83@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-12 14:24:13 +01:00
Bagas Sanjaya
d218569004 fs: Clarify "non-RCY" in access_override_creds() comment
The term is originally intended as a joke that stands for "non-racy".
This trips new contributors who mistake it for RCU typo [1].

Replace the term with more-explicit wording.

Link: https://lore.kernel.org/r/20231030-debatten-nachrangig-f58abcdac530@brauner/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20231031114728.41485-1-bagasdotme@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 15:00:01 +01:00
Amir Goldstein
def3ae83da
fs: store real path instead of fake path in backing file f_path
A backing file struct stores two path's, one "real" path that is referring
to f_inode and one "fake" path, which should be displayed to users in
/proc/<pid>/maps.

There is a lot more potential code that needs to know the "real" path, then
code that needs to know the "fake" path.

Instead of code having to request the "real" path with file_real_path(),
store the "real" path in f_path and require code that needs to know the
"fake" path request it with file_user_path().
Replace the file_real_path() helper with a simple const accessor f_path().

After this change, file_dentry() is not expected to observe any files
with overlayfs f_path and real f_inode, so the call to ->d_real() should
not be needed.  Leave the ->d_real() call for now and add an assertion
in ovl_d_real() to catch if we made wrong assumptions.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegtt48eXhhjDFA1ojcHPNKj3Go6joryCPtEFAKpocyBsnw@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231009153712.1566422-4-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19 11:03:15 +02:00
Amir Goldstein
83bc1d2941
fs: get mnt_writers count for an open backing file's real path
A writeable mapped backing file can perform writes to the real inode.
Therefore, the real path mount must be kept writable so long as the
writable map exists.

This may not be strictly needed for ovelrayfs private upper mount,
but it is correct to take the mnt_writers count in the vfs helper.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231009153712.1566422-2-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19 11:03:15 +02:00
Amir Goldstein
3e15dcf77b
fs: rename __mnt_{want,drop}_write*() helpers
Before exporting these helpers to modules, make their names more
meaningful.

The names mnt_{get,put)_write_access*() were chosen, because they rhyme
with the inode {get,put)_write_access() helpers, which have a very close
meaning for the inode object.

Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230817-anfechtbar-ruhelosigkeit-8c6cca8443fc@brauner/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230908132900.2983519-2-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-11 15:05:50 +02:00
Linus Torvalds
475d4df827 v6.6-vfs.fchmodat2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXT7QAKCRCRxhvAZXjc
 ort3AP0VIK/oJk5skgjpinQrCfvtVz0XOtawuBtn0f1weIfb6AD9Hg1rqOKnQD5z
 dkvn3xaEr3gPOVzqU5SvFwVoCM0cMwA=
 =24Ha
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fchmodat2 system call from Christian Brauner:
 "This adds the fchmodat2() system call. It is a revised version of the
  fchmodat() system call, adding a missing flag argument. Support for
  both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH are included.

  Adding this system call revision has been a longstanding request but
  so far has always fallen through the cracks. While the kernel
  implementation of fchmodat() does not have a flag argument the libc
  provided POSIX-compliant fchmodat(3) version does. Both glibc and musl
  have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW
  (see [1] and [2]).

  The workaround is brittle because it relies not just on O_PATH and
  O_NOFOLLOW semantics and procfs magic links but also on our rather
  inconsistent symlink semantics.

  This gives userspace a proper fchmodat2() system call that libcs can
  use to properly implement fchmodat(3) and allows them to get rid of
  their hacks. In this case it will immediately benefit them as the
  current workaround is already defunct because of aformentioned
  inconsistencies.

  In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use
  AT_EMPTY_PATH with fchmodat2(). This is already possible with
  fchownat() so there's no reason to not also support it for
  fchmodat2().

  The implementation is simple and comes with selftests. Implementation
  of the system call and wiring up the system call are done as separate
  patches even though they could arguably be one patch. But in case
  there are merge conflicts from other system call additions it can be
  beneficial to have separate patches"

Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2]

* tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests: fchmodat2: remove duplicate unneeded defines
  fchmodat2: add support for AT_EMPTY_PATH
  selftests: Add fchmodat2 selftest
  arch: Register fchmodat2, usually as syscall 452
  fs: Add fchmodat2()
  Non-functional cleanup of a "__user * filename"
2023-08-28 11:25:27 -07:00
Linus Torvalds
de16588a77 v6.6-vfs.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTxQAKCRCRxhvAZXjc
 okaVAP94WAlItvDRt/z2Wtzf0+RqPZeTXEdGTxua8+RxqCyYIQD+OO5nRfKQPHlV
 AqqGJMKItQMSMIYgB5ftqVhNWZfnHgM=
 =pSEW
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual filesystems.

  Features:

   - Block mode changes on symlinks and rectify our broken semantics

   - Report file modifications via fsnotify() for splice

   - Allow specifying an explicit timeout for the "rootwait" kernel
     command line option. This allows to timeout and reboot instead of
     always waiting indefinitely for the root device to show up

   - Use synchronous fput for the close system call

  Cleanups:

   - Get rid of open-coded lockdep workarounds for async io submitters
     and replace it all with a single consolidated helper

   - Simplify epoll allocation helper

   - Convert simple_write_begin and simple_write_end to use a folio

   - Convert page_cache_pipe_buf_confirm() to use a folio

   - Simplify __range_close to avoid pointless locking

   - Disable per-cpu buffer head cache for isolated cpus

   - Port ecryptfs to kmap_local_page() api

   - Remove redundant initialization of pointer buf in pipe code

   - Unexport the d_genocide() function which is only used within core
     vfs

   - Replace printk(KERN_ERR) and WARN_ON() with WARN()

  Fixes:

   - Fix various kernel-doc issues

   - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE

   - Fix a mainly theoretical issue in devpts

   - Check the return value of __getblk() in reiserfs

   - Fix a racy assert in i_readcount_dec

   - Fix integer conversion issues in various functions

   - Fix LSM security context handling during automounts that prevented
     NFS superblock sharing"

* tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  cachefiles: use kiocb_{start,end}_write() helpers
  ovl: use kiocb_{start,end}_write() helpers
  aio: use kiocb_{start,end}_write() helpers
  io_uring: use kiocb_{start,end}_write() helpers
  fs: create kiocb_{start,end}_write() helpers
  fs: add kerneldoc to file_{start,end}_write() helpers
  io_uring: rename kiocb_end_write() local helper
  splice: Convert page_cache_pipe_buf_confirm() to use a folio
  libfs: Convert simple_write_begin and simple_write_end to use a folio
  fs/dcache: Replace printk and WARN_ON by WARN
  fs/pipe: remove redundant initialization of pointer buf
  fs: Fix kernel-doc warnings
  devpts: Fix kernel-doc warnings
  doc: idmappings: fix an error and rephrase a paragraph
  init: Add support for rootwait timeout parameter
  vfs: fix up the assert in i_readcount_dec
  fs: Fix one kernel-doc comment
  docs: filesystems: idmappings: clarify from where idmappings are taken
  fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
  vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing
  ...
2023-08-28 10:17:14 -07:00
Matthew Wilcox (Oracle)
35931eb394 fs: Fix kernel-doc warnings
These have a variety of causes and a corresponding variety of solutions.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Message-Id: <20230818200824.2720007-1-willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-19 12:12:12 +02:00
Linus Torvalds
021a160abf fs: use __fput_sync in close(2)
close(2) is a special case which guarantees a shallow kernel stack,
making delegation to task_work machinery unnecessary. Said delegation is
problematic as it involves atomic ops and interrupt masking trips, none
of which are cheap on x86-64. Forcing close(2) to do it looks like an
oversight in the original work.

Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
-> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
thread returning to userspace land in resume_user_mode_work(), where
rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
the thread (and it is by default with contemporary glibc).

Sample result when benchmarking open1_processes -t 1 from will-it-scale
(that's an open + close loop) + tmpfs on /tmp, running on the Sapphire
Rapid CPU (ops/s):
stock+RSEQ:     1329857
stock-RSEQ:     1421667 (+7%)
patched:        1523521 (+14.5% / +7%) (with / without rseq)

Patched result is the same regardless of rseq as the codepath is avoided.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-08 19:36:51 +02:00
Aleksa Sarai
a0fc452a5d
open: make RESOLVE_CACHED correctly test for O_TMPFILE
O_TMPFILE is actually __O_TMPFILE|O_DIRECTORY. This means that the old
fast-path check for RESOLVE_CACHED would reject all users passing
O_DIRECTORY with -EAGAIN, when in fact the intended test was to check
for __O_TMPFILE.

Cc: stable@vger.kernel.org # v5.12+
Fixes: 99668f6180 ("fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230806-resolve_cached-o_tmpfile-v1-1-7ba16308465e@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-06 15:08:35 +02:00
Aleksa Sarai
5daeb41a6f
fchmodat2: add support for AT_EMPTY_PATH
This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.

The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):

> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.

However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-28 16:43:05 +02:00
Alexey Gladkov
09da082b07
fs: Add fchmodat2()
On the userspace side fchmodat(3) is implemented as a wrapper
function which implements the POSIX-specified interface. This
interface differs from the underlying kernel system call, which does not
have a flags argument. Most implementations require procfs [1][2].

There doesn't appear to be a good userspace workaround for this issue
but the implementation in the kernel is pretty straight-forward.

The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag,
unlike existing fchmodat.

[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35
[2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28

Co-developed-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Message-Id: <f2a846ef495943c5d101011eebcf01179d0c7b61.1689092120.git.legion@kernel.org>
[brauner: pre reviews, do flag conversion in do_fchmodat() directly]
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-27 12:24:09 +02:00
Linus Torvalds
18c9901d74 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmScTY8ACgkQnJ2qBz9k
 QNnfwwgAhZtow1klDLH6qCnWMufB6AwT7VfAHyA3fvyTYMjUKb0sGHkpuh8hqVOb
 Lzb4YB+jSWV8XnMFn/4gFJQU/nAv8bMPavghMGpr5VNjQi7WkxYF/GB6O1I5NOHK
 EnJjDExgdxXDJZORaaXLVJWrtzJuDFgdiSeIwJECFa0MdTHNgPy3XOl+PPxnYQ/V
 xyHyP5ImGgd5O4iy3PFDQBGgOXIMrBX8IMce+qLQNYIvjSIUgmdnIkoUCvsQiisp
 LyKI2LxqAqnpA4h4Ow6hOZDw2VlPT0vDwFVUfFIZMIqs5YgaSbWa1Z6cs37MigAn
 fgUyRVx2y8A2Lwla7rwLaUEToRVADw==
 =ZdcG
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:

 - Support for fanotify events returning file handles for filesystems
   not exportable via NFS

 - Improved error handling exportfs functions

 - Add missing FS_OPEN events when unusual open helpers are used

* tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: move fsnotify_open() hook into do_dentry_open()
  exportfs: check for error return value from exportfs_encode_*()
  fanotify: support reporting non-decodeable file handles
  exportfs: allow exporting non-decodeable file handles to userspace
  exportfs: add explicit flag to request non-decodeable file handles
  exportfs: change connectable argument to bit flags
2023-06-29 13:31:44 -07:00
Linus Torvalds
1f2300a738 v6.5/vfs.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4WgAKCRCRxhvAZXjc
 oofvAQDs9RJwQUyWHJmQA+tWz5cUE5DviVWCwwul5dQRRCqgaQEA2OIO0gPFaVoq
 1OYOeLyUjl/cpS8e3u4uJtw34jttdQA=
 =AwcR
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs file handling updates from Christian Brauner:
 "This contains Amir's work to fix a long-standing problem where an
  unprivileged overlayfs mount can be used to avoid fanotify permission
  events that were requested for an inode or superblock on the
  underlying filesystem.

  Some background about files opened in overlayfs. If a file is opened
  in overlayfs @file->f_path will refer to a "fake" path. What this
  means is that while @file->f_inode will refer to inode of the
  underlying layer, @file->f_path refers to an overlayfs
  {dentry,vfsmount} pair. The reasons for doing this are out of scope
  here but it is the reason why the vfs has been providing the
  open_with_fake_path() helper for overlayfs for very long time now. So
  nothing new here.

  This is for sure not very elegant and everyone including the overlayfs
  maintainers agree. Improving this significantly would involve more
  fragile and potentially rather invasive changes.

  In various codepaths access to the path of the underlying filesystem
  is needed for such hybrid file. The best example is fsnotify where
  this becomes security relevant. Passing the overlayfs
  @file->f_path->dentry will cause fsnotify to skip generating fsnotify
  events registered on the underlying inode or superblock.

  To fix this we extend the vfs provided open_with_fake_path() concept
  for overlayfs to create a backing file container that holds the real
  path and to expose a helper that can be used by relevant callers to
  get access to the path of the underlying filesystem through the new
  file_real_path() helper. This pattern is similar to what we do in
  d_real() and d_real_inode().

  The first beneficiary is fsnotify and fixes the security sensitive
  problem mentioned above.

  There's a couple of nice cleanups included as well.

  Over time, the old open_with_fake_path() helper added specifically for
  overlayfs a long time ago started to get used in other places such as
  cachefiles. Even though cachefiles have nothing to do with hybrid
  files.

  The only reason cachefiles used that concept was that files opened
  with open_with_fake_path() aren't charged against the caller's open
  file limit by raising FMODE_NOACCOUNT. It's just mere coincidence that
  both overlayfs and cachefiles need to ensure to not overcharge the
  caller for their internal open calls.

  So this work disentangles FMODE_NOACCOUNT use cases and backing file
  use-cases by adding the FMODE_BACKING flag which indicates that the
  file can be used to retrieve the backing file of another filesystem.
  (Fyi, Jens will be sending you a really nice cleanup from Christoph
  that gets rid of 3 FMODE_* flags otherwise this would be the last
  fmode_t bit we'd be using.)

  So now overlayfs becomes the sole user of the renamed
  open_with_fake_path() helper which is now named backing_file_open().
  For internal kernel users such as cachefiles that are only interested
  in FMODE_NOACCOUNT but not in FMODE_BACKING we add a new
  kernel_file_open() helper which opens a file without being charged
  against the caller's open file limit. All new helpers are properly
  documented and clearly annotated to mention their special uses.

  We also rename vfs_tmpfile_open() to kernel_tmpfile_open() to clearly
  distinguish it from vfs_tmpfile() and align it the other kernel_*()
  internal helpers"

* tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ovl: enable fsnotify events on underlying real files
  fs: use backing_file container for internal files with "fake" f_path
  fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers
  fs: use a helper for opening kernel internal files
  fs: rename {vfs,kernel}_tmpfile_open()
2023-06-26 10:14:36 -07:00
Amir Goldstein
62d53c4a1d
fs: use backing_file container for internal files with "fake" f_path
Overlayfs uses open_with_fake_path() to allocate internal kernel files,
with a "fake" path - whose f_path is not on the same fs as f_inode.

Allocate a container struct backing_file for those internal files, that
is used to hold the "fake" ovl path along with the real path.

backing_file_real_path() can be used to access the stored real path.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230615112229.2143178-5-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19 18:16:38 +02:00
Amir Goldstein
cbb0b9d4bb
fs: use a helper for opening kernel internal files
cachefiles uses kernel_open_tmpfile() to open kernel internal tmpfile
without accounting for nr_files.

cachefiles uses open_with_fake_path() for the same reason without the
need for a fake path.

Fork open_with_fake_path() to kernel_file_open() which only does the
noaccount part and use it in cachefiles.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230615112229.2143178-3-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19 18:11:58 +02:00
Amir Goldstein
7b8c9d7bb4 fsnotify: move fsnotify_open() hook into do_dentry_open()
fsnotify_open() hook is called only from high level system calls
context and not called for the very many helpers to open files.

This may makes sense for many of the special file open cases, but it is
inconsistent with fsnotify_close() hook that is called for every last
fput() of on a file object with FMODE_OPENED.

As a result, it is possible to observe ACCESS, MODIFY and CLOSE events
without ever observing an OPEN event.

Fix this inconsistency by replacing all the fsnotify_open() hooks with
a single hook inside do_dentry_open().

If there are special cases that would like to opt-out of the possible
overhead of fsnotify() call in fsnotify_open(), they would probably also
want to avoid the overhead of fsnotify() call in the rest of the fsnotify
hooks, so they should be opening that file with the __FMODE_NONOTIFY flag.

However, in the majority of those cases, the s_fsnotify_connectors
optimization in fsnotify_parent() would be sufficient to avoid the
overhead of fsnotify() call anyway.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>
2023-06-12 10:43:45 +02:00
Min-Hua Chen
cedd0bdc16 fs: fix incorrect fmode_t casts
Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY to fixes
the following sparce warnings:
fs/overlayfs/file.c:48:37: sparse: warning: restricted fmode_t degrades to integer
fs/overlayfs/file.c:128:13: sparse: warning: restricted fmode_t degrades to integer
fs/open.c:1159:21: sparse: warning: restricted fmode_t degrades to integer

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Message-Id: <20230502232210.119063-1-minhuadotchen@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15 13:14:01 +02:00
Anuradha Weeraman
55650b2fdd fs/open.c: Fix W=1 kernel doc warnings
fs/open.c: In functions 'setattr_vfsuid' and 'setattr_vfsgid':
 warning: Function parameter or member 'attr' not described
 - Fix warning by removing kernel-doc for these as they are static
   inline functions and not required to be exposed via kernel-doc.

fs/open.c:
 warning: Excess function parameter 'opened' description in 'finish_open'
 warning: Excess function parameter 'cred' description in 'vfs_open'
 - Fix by removing the parameters from the kernel-doc as they are no
   longer required by the function.

Signed-off-by: Anuradha Weeraman <anuradha@debian.org>
Message-Id: <20230506182928.384105-1-anuradha@debian.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15 12:39:07 +02:00
Christian Brauner
43b4506326
open: return EINVAL for O_DIRECTORY | O_CREAT
After a couple of years and multiple LTS releases we received a report
that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7.

On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
had the following semantics:

(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
    * d doesn't exist:                create regular file
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    EISDIR

(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
    * d doesn't exist:                create regular file
    * d exists and is a regular file: EEXIST
    * d exists and is a directory:    EEXIST

(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
    * d doesn't exist:                ENOENT
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    open directory

On kernels since to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
have the following semantics:

(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
    * d doesn't exist:                ENOTDIR (create regular file)
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    EISDIR

(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
    * d doesn't exist:                ENOTDIR (create regular file)
    * d exists and is a regular file: EEXIST
    * d exists and is a directory:    EEXIST

(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
    * d doesn't exist:                ENOENT
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    open directory

This is a fairly substantial semantic change that userspace didn't
notice until Pedro took the time to deliberately figure out corner
cases. Since no one noticed this breakage we can somewhat safely assume
that O_DIRECTORY | O_CREAT combinations are likely unused.

The v5.7 breakage is especially weird because while ENOTDIR is returned
indicating failure a regular file is actually created. This doesn't make
a lot of sense.

Time was spent finding potential users of this combination. Searching on
codesearch.debian.net showed that codebases often express semantical
expectations about O_DIRECTORY | O_CREAT which are completely contrary
to what our code has done and currently does.

The expectation often is that this particular combination would create
and open a directory. This suggests users who tried to use that
combination would stumble upon the counterintuitive behavior no matter
if pre-v5.7 or post v5.7 and quickly realize neither semantics give them
what they want. For some examples see the code examples in [1] to [3]
and the discussion in [4].

There are various ways to address this issue. The lazy/simple option
would be to restore the pre-v5.7 behavior and to just live with that bug
forever. But since there's a real chance that the O_DIRECTORY | O_CREAT
quirk isn't relied upon we should try to get away with murder(ing bad
semantics) first. If we need to Frankenstein pre-v5.7 behavior later so
be it.

So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT
combinations. In addition to cleaning up the old bug this also opens up
the possiblity to make that flag combination do something more intuitive
in the future.

Starting with this commit the following semantics apply:

(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
    * d doesn't exist:                EINVAL
    * d exists and is a regular file: EINVAL
    * d exists and is a directory:    EINVAL

(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
    * d doesn't exist:                EINVAL
    * d exists and is a regular file: EINVAL
    * d exists and is a directory:    EINVAL

(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
    * d doesn't exist:                ENOENT
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    open directory

One additional note, O_TMPFILE is implemented as:

    #define __O_TMPFILE    020000000
    #define O_TMPFILE      (__O_TMPFILE | O_DIRECTORY)
    #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)

For older kernels it was important to return an explicit error when
O_TMPFILE wasn't supported. So O_TMPFILE requires that O_DIRECTORY is
raised alongside __O_TMPFILE. It also enforced that O_CREAT wasn't
specified. Since O_DIRECTORY | O_CREAT could be used to create a regular
allowing that combination together with __O_TMPFILE would've meant that
false positives were possible, i.e., that a regular file was created
instead of a O_TMPFILE. This could've been used to trick userspace into
thinking it operated on a O_TMPFILE when it wasn't.

Now that we block O_DIRECTORY | O_CREAT completely the check for O_CREAT
in the __O_TMPFILE branch via if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
can be dropped. Instead we can simply check verify that O_DIRECTORY is
raised via if (!(flags & O_DIRECTORY)) and explain this in two comments.

As Aleksa pointed out O_PATH is unaffected by this change since it
always returned EINVAL if O_CREAT was specified - with or without
O_DIRECTORY.

Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com
Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1]
Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2]
Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3]
Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4]
Reported-by: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-03-22 11:06:55 +01:00
Mateusz Guzik
981ee95cc1 vfs: avoid duplicating creds in faccessat if possible
access(2) remains commonly used, for example on exec:
access("/etc/ld.so.preload", R_OK)

or when running gcc: strace -c gcc empty.c

  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
    0.00    0.000000           0        42        26 access

It falls down to do_faccessat without the AT_EACCESS flag, which in turn
results in allocation of new creds in order to modify fsuid/fsgid and
caps.  This is a very expensive process single-threaded and most notably
multi-threaded, with numerous structures getting refed and unrefed on
imminent new cred destruction.

Turns out for typical consumers the resulting creds would be identical
and this can be checked upfront, avoiding the hard work.

An access benchmark plugged into will-it-scale running on Cascade Lake
shows:

    test     proc     before       after
    access1     1    1310582     2908735    (+121%) # distinct files
    access1    24    4716491    63822173   (+1353%) # distinct files
    access2    24    2378041     5370335    (+125%) # same file

The above benchmarks are not integrated into will-it-scale, but can be
found in a pull request:

  https://github.com/antonblanchard/will-it-scale/pull/36/files

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-27 16:39:19 -08:00
Linus Torvalds
ea5aac6fae fs.v6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5SogAKCRCRxhvAZXjc
 orVwAP4jJ1dPZYx1xHip9TfB5fv5xHz3euhvWns6qGJdVYoHzwEAhVxgYUpqWdXX
 L/+VKRFFujYxsSXP4BbS3xDPUJeQFAI=
 =ccK2
 -----END PGP SIGNATURE-----

Merge tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs hardening update from Christian Brauner:
 "Jan pointed out that during shutdown both filp_close() and super block
  destruction will use basic printk logging when bugs are detected. This
  causes issues in a few scenarios:

   - Tools like syzkaller cannot figure out that the logged message
     indicates a bug.

   - Users that explicitly opt in to have the kernel bug on data
     corruption by selecting CONFIG_BUG_ON_DATA_CORRUPTION should see
     the kernel crash when they did actually select that option.

   - When there are busy inodes after the superblock is shut down later
     access to such a busy inodes walks through freed memory. It would
     be better to cleanly crash instead.

  All of this can be addressed by using the already existing
  CHECK_DATA_CORRUPTION() macro in these places when kernel bugs are
  detected. Its logging improvement is useful for all users.

  Otherwise this only has a meaningful behavioral effect when users do
  select CONFIG_BUG_ON_DATA_CORRUPTION which means this is backward
  compatible for regular users"

* tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected
2023-02-20 12:03:55 -08:00
Linus Torvalds
05e6295f7b fs.idmapped.v6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
 orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
 Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
 =+BG5
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs idmapping updates from Christian Brauner:

 - Last cycle we introduced the dedicated struct mnt_idmap type for
   mount idmapping and the required infrastucture in 256c8aed2b ("fs:
   introduce dedicated idmap type for mounts"). As promised in last
   cycle's pull request message this converts everything to rely on
   struct mnt_idmap.

   Currently we still pass around the plain namespace that was attached
   to a mount. This is in general pretty convenient but it makes it easy
   to conflate namespaces that are relevant on the filesystem with
   namespaces that are relevant on the mount level. Especially for
   non-vfs developers without detailed knowledge in this area this was a
   potential source for bugs.

   This finishes the conversion. Instead of passing the plain namespace
   around this updates all places that currently take a pointer to a
   mnt_userns with a pointer to struct mnt_idmap.

   Now that the conversion is done all helpers down to the really
   low-level helpers only accept a struct mnt_idmap argument instead of
   two namespace arguments.

   Conflating mount and other idmappings will now cause the compiler to
   complain loudly thus eliminating the possibility of any bugs. This
   makes it impossible for filesystem developers to mix up mount and
   filesystem idmappings as they are two distinct types and require
   distinct helpers that cannot be used interchangeably.

   Everything associated with struct mnt_idmap is moved into a single
   separate file. With that change no code can poke around in struct
   mnt_idmap. It can only be interacted with through dedicated helpers.
   That means all filesystems are and all of the vfs is completely
   oblivious to the actual implementation of idmappings.

   We are now also able to extend struct mnt_idmap as we see fit. For
   example, we can decouple it completely from namespaces for users that
   don't require or don't want to use them at all. We can also extend
   the concept of idmappings so we can cover filesystem specific
   requirements.

   In combination with the vfs{g,u}id_t work we finished in v6.2 this
   makes this feature substantially more robust and thus difficult to
   implement wrong by a given filesystem and also protects the vfs.

 - Enable idmapped mounts for tmpfs and fulfill a longstanding request.

   A long-standing request from users had been to make it possible to
   create idmapped mounts for tmpfs. For example, to share the host's
   tmpfs mount between multiple sandboxes. This is a prerequisite for
   some advanced Kubernetes cases. Systemd also has a range of use-cases
   to increase service isolation. And there are more users of this.

   However, with all of the other work going on this was way down on the
   priority list but luckily someone other than ourselves picked this
   up.

   As usual the patch is tiny as all the infrastructure work had been
   done multiple kernel releases ago. In addition to all the tests that
   we already have I requested that Rodrigo add a dedicated tmpfs
   testsuite for idmapped mounts to xfstests. It is to be included into
   xfstests during the v6.3 development cycle. This should add a slew of
   additional tests.

* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
  shmem: support idmapped mounts for tmpfs
  fs: move mnt_idmap
  fs: port vfs{g,u}id helpers to mnt_idmap
  fs: port fs{g,u}id helpers to mnt_idmap
  fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
  fs: port i_{g,u}id_{needs_}update() to mnt_idmap
  quota: port to mnt_idmap
  fs: port privilege checking helpers to mnt_idmap
  fs: port inode_owner_or_capable() to mnt_idmap
  fs: port inode_init_owner() to mnt_idmap
  fs: port acl to mnt_idmap
  fs: port xattr to mnt_idmap
  fs: port ->permission() to pass mnt_idmap
  fs: port ->fileattr_set() to pass mnt_idmap
  fs: port ->set_acl() to pass mnt_idmap
  fs: port ->get_acl() to pass mnt_idmap
  fs: port ->tmpfile() to pass mnt_idmap
  fs: port ->rename() to pass mnt_idmap
  fs: port ->mknod() to pass mnt_idmap
  fs: port ->mkdir() to pass mnt_idmap
  ...
2023-02-20 11:53:11 -08:00
Jann Horn
47d586913f
fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected
Currently, filp_close() and generic_shutdown_super() use printk() to log
messages when bugs are detected. This is problematic because infrastructure
like syzkaller has no idea that this message indicates a bug.
In addition, some people explicitly want their kernels to BUG() when kernel
data corruption has been detected (CONFIG_BUG_ON_DATA_CORRUPTION).
And finally, when generic_shutdown_super() detects remaining inodes on a
system without CONFIG_BUG_ON_DATA_CORRUPTION, it would be nice if later
accesses to a busy inode would at least crash somewhat cleanly rather than
walking through freed memory.

To address all three, use CHECK_DATA_CORRUPTION() when kernel bugs are
detected.

Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-27 14:17:22 +01:00
Christian Brauner
4d7ca40901
fs: port vfs{g,u}id helpers to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:30 +01:00
Christian Brauner
9452e93e6d
fs: port privilege checking helpers to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:29 +01:00
Christian Brauner
4609e1f18e
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
abf08576af
fs: port vfs_*() helpers to struct mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-18 17:51:45 +01:00
Jeff Layton
c65454a947 fs: remove locks_inode
locks_inode was turned into a wrapper around file_inode in de2a4a501e
(Partially revert "locks: fix file locking on overlayfs"). Finish
replacing locks_inode invocations everywhere with file_inode.

Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-01-11 06:52:43 -05:00
Jeff Layton
5970e15dbc filelock: move file locking definitions to separate header file
The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.

Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-01-11 06:52:32 -05:00
Linus Torvalds
299e2b1967 Landlock updates for v6.2-rc1
-----BEGIN PGP SIGNATURE-----
 
 iIYEABYIAC4WIQSVyBthFV4iTW/VU1/l49DojIL20gUCY5b27RAcbWljQGRpZ2lr
 b2QubmV0AAoJEOXj0OiMgvbSg9YA/0K10H+VsGt1+qqR4+w9SM7SFzbgszrV3Yw9
 rwiPgaPVAP9rxXPr2bD2hAk7/Lv9LeJ2kfM9RzMErP1A6UsC5YVbDA==
 =mAG7
 -----END PGP SIGNATURE-----

Merge tag 'landlock-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux

Pull landlock updates from Mickaël Salaün:
 "This adds file truncation support to Landlock, contributed by Günther
  Noack. As described by Günther [1], the goal of these patches is to
  work towards a more complete coverage of file system operations that
  are restrictable with Landlock.

  The known set of currently unsupported file system operations in
  Landlock is described at [2]. Out of the operations listed there,
  truncate is the only one that modifies file contents, so these patches
  should make it possible to prevent the direct modification of file
  contents with Landlock.

  The new LANDLOCK_ACCESS_FS_TRUNCATE access right covers both the
  truncate(2) and ftruncate(2) families of syscalls, as well as open(2)
  with the O_TRUNC flag. This includes usages of creat() in the case
  where existing regular files are overwritten.

  Additionally, this introduces a new Landlock security blob associated
  with opened files, to track the available Landlock access rights at
  the time of opening the file. This is in line with Unix's general
  approach of checking the read and write permissions during open(), and
  associating this previously checked authorization with the opened
  file. An ongoing patch documents this use case [3].

  In order to treat truncate(2) and ftruncate(2) calls differently in an
  LSM hook, we split apart the existing security_path_truncate hook into
  security_path_truncate (for truncation by path) and
  security_file_truncate (for truncation of previously opened files)"

Link: https://lore.kernel.org/r/20221018182216.301684-1-gnoack3000@gmail.com [1]
Link: https://www.kernel.org/doc/html/v6.1/userspace-api/landlock.html#filesystem-flags [2]
Link: https://lore.kernel.org/r/20221209193813.972012-1-mic@digikod.net [3]

* tag 'landlock-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  samples/landlock: Document best-effort approach for LANDLOCK_ACCESS_FS_REFER
  landlock: Document Landlock's file truncation support
  samples/landlock: Extend sample tool to support LANDLOCK_ACCESS_FS_TRUNCATE
  selftests/landlock: Test ftruncate on FDs created by memfd_create(2)
  selftests/landlock: Test FD passing from restricted to unrestricted processes
  selftests/landlock: Locally define __maybe_unused
  selftests/landlock: Test open() and ftruncate() in multiple scenarios
  selftests/landlock: Test file truncation support
  landlock: Support file truncation
  landlock: Document init_layer_masks() helper
  landlock: Refactor check_access_path_dual() into is_access_to_paths_allowed()
  security: Create file_truncate hook from path_truncate hook
2022-12-13 09:14:50 -08:00
Günther Noack
3350607dc5
security: Create file_truncate hook from path_truncate hook
Like path_truncate, the file_truncate hook also restricts file
truncation, but is called in the cases where truncation is attempted
on an already-opened file.

This is required in a subsequent commit to handle ftruncate()
operations differently to truncate() operations.

Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20221018182216.301684-2-gnoack3000@gmail.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
2022-10-19 09:01:40 +02:00
Christian Brauner
ed5a7047d2
attr: use consistent sgid stripping checks
Currently setgid stripping in file_remove_privs()'s should_remove_suid()
helper is inconsistent with other parts of the vfs. Specifically, it only
raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
inode isn't in the caller's groups and the caller isn't privileged over the
inode although we require this already in setattr_prepare() and
setattr_copy() and so all filesystem implement this requirement implicitly
because they have to use setattr_{prepare,copy}() anyway.

But the inconsistency shows up in setgid stripping bugs for overlayfs in
xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
generic/687). For example, we test whether suid and setgid stripping works
correctly when performing various write-like operations as an unprivileged
user (fallocate, reflink, write, etc.):

echo "Test 1 - qa_user, non-exec file $verb"
setup_testfile
chmod a+rws $junk_file
commit_and_check "$qa_user" "$verb" 64k 64k

The test basically creates a file with 6666 permissions. While the file has
the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
regular filesystem like xfs what will happen is:

sys_fallocate()
-> vfs_fallocate()
   -> xfs_file_fallocate()
      -> file_modified()
         -> __file_remove_privs()
            -> dentry_needs_remove_privs()
               -> should_remove_suid()
            -> __remove_privs()
               newattrs.ia_valid = ATTR_FORCE | kill;
               -> notify_change()
                  -> setattr_copy()

In should_remove_suid() we can see that ATTR_KILL_SUID is raised
unconditionally because the file in the test has S_ISUID set.

But we also see that ATTR_KILL_SGID won't be set because while the file
is S_ISGID it is not S_IXGRP (see above) which is a condition for
ATTR_KILL_SGID being raised.

So by the time we call notify_change() we have attr->ia_valid set to
ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
ATTR_KILL_SUID is set and does:

ia_valid = attr->ia_valid |= ATTR_MODE
attr->ia_mode = (inode->i_mode & ~S_ISUID);

which means that when we call setattr_copy() later we will definitely
update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

Now we call into the filesystem's ->setattr() inode operation which will
end up calling setattr_copy(). Since ATTR_MODE is set we will hit:

if (ia_valid & ATTR_MODE) {
        umode_t mode = attr->ia_mode;
        vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
        if (!vfsgid_in_group_p(vfsgid) &&
            !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                mode &= ~S_ISGID;
        inode->i_mode = mode;
}

and since the caller in the test is neither capable nor in the group of the
inode the S_ISGID bit is stripped.

But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
has the consequence that neither the setgid nor the suid bits are stripped
even though it should be stripped because the inode isn't in the caller's
groups and the caller isn't privileged over the inode.

If overlayfs is in the mix things become a bit more complicated and the bug
shows up more clearly. When e.g., ovl_setattr() is hit from
ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
ATTR_KILL_SGID might be raised but because the check in notify_change() is
questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
stripped the S_ISGID bit isn't removed even though it should be stripped:

sys_fallocate()
-> vfs_fallocate()
   -> ovl_fallocate()
      -> file_remove_privs()
         -> dentry_needs_remove_privs()
            -> should_remove_suid()
         -> __remove_privs()
            newattrs.ia_valid = ATTR_FORCE | kill;
            -> notify_change()
               -> ovl_setattr()
                  // TAKE ON MOUNTER'S CREDS
                  -> ovl_do_notify_change()
                     -> notify_change()
                  // GIVE UP MOUNTER'S CREDS
     // TAKE ON MOUNTER'S CREDS
     -> vfs_fallocate()
        -> xfs_file_fallocate()
           -> file_modified()
              -> __file_remove_privs()
                 -> dentry_needs_remove_privs()
                    -> should_remove_suid()
                 -> __remove_privs()
                    newattrs.ia_valid = attr_force | kill;
                    -> notify_change()

The fix for all of this is to make file_remove_privs()'s
should_remove_suid() helper to perform the same checks as we already
require in setattr_prepare() and setattr_copy() and have notify_change()
not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
first place because the caller must calculate the flags via
should_remove_suid() anyway which would raise ATTR_KILL_SGID.

While we're at it we move should_remove_suid() from inode.c to attr.c
where it belongs with the rest of the iattr helpers. Especially since it
returns ATTR_KILL_S{G,U}ID flags. We also rename it to
setattr_should_drop_suidgid() to better reflect that it indicates both
setuid and setgid bit removal and also that it returns attr flags.

Running xfstests with this doesn't report any regressions. We should really
try and use consistent checks.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-18 10:09:47 +02:00
Linus Torvalds
7a3353c5c4 struct file-related stuff
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxjIQAKCRBZ7Krx/gZQ
 6/FPAQCNCZygQzd+54//vo4kTwv5T2Bv3hS8J51rASPJT87/BQD/TfCLS5urt/Gt
 81A1dFOfnTXseofuBKyGSXwQm0dWpgA=
 =PLre
 -----END PGP SIGNATURE-----

Merge tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs file updates from Al Viro:
 "struct file-related stuff"

* tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  dma_buf_getfile(): don't bother with ->f_flags reassignments
  Change calling conventions for filldir_t
  locks: fix TOCTOU race when granting write lease
2022-10-06 17:13:18 -07:00
Tetsuo Handa
f52d74b190
open: always initialize ownership fields
Beginning of the merge window we introduced the vfs{g,u}id_t types in
b27c82e129 ("attr: port attribute changes to new types") and changed
various codepaths over including chown_common().

During that change we forgot to account for the case were the passed
ownership value is -1. In this case the ownership fields in struct iattr
aren't initialized but we rely on them being initialized by the time we
generate the ownership to pass down to the LSMs. All the major LSMs
don't care about the ownership values at all. Only Tomoyo uses them and
so it took a while for syzbot to unearth this issue.

Fix this by initializing the ownership fields and do it within the
retry_deleg block. While notify_change() doesn't alter the ownership
fields currently we shouldn't rely on it.

Since no kernel has been released with these changes this does not
needed to be backported to any stable kernels.

[Christian Brauner (Microsoft) <brauner@kernel.org>]
* rewrote commit message
* use INVALID_VFS{G,U}ID macros

Fixes: b27c82e129 ("attr: port attribute changes to new types") # mainline only
Reported-and-tested-by: syzbot+541e21dcc32c4046cba9@syzkaller.appspotmail.com
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-09-20 11:57:57 +02:00
Amir Goldstein
d6da19c9ca locks: fix TOCTOU race when granting write lease
Thread A trying to acquire a write lease checks the value of i_readcount
and i_writecount in check_conflicting_open() to verify that its own fd
is the only fd referencing the file.

Thread B trying to open the file for read will call break_lease() in
do_dentry_open() before incrementing i_readcount, which leaves a small
window where thread A can acquire the write lease and then thread B
completes the open of the file for read without breaking the write lease
that was acquired by thread A.

Fix this race by incrementing i_readcount before checking for existing
leases, same as the case with i_writecount.

Use a helper put_file_access() to decrement i_readcount or i_writecount
in do_dentry_open() and __fput().

Fixes: 387e3746d0 ("locks: eliminate false positive conflicts for write lease")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-16 10:59:54 -04:00
Linus Torvalds
5264406cdb iov_iter work, part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter()
 and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
 has a surprising amount of overhead, in particular inside iocb_flags().
 That's why the beginning of the series is in this pile; it's not directly
 iov_iter-related, but it's a part of the same work...
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
 6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
 uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
 =9UCV
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs iov_iter updates from Al Viro:
 "Part 1 - isolated cleanups and optimizations.

  One of the goals is to reduce the overhead of using ->read_iter() and
  ->write_iter() instead of ->read()/->write().

  new_sync_{read,write}() has a surprising amount of overhead, in
  particular inside iocb_flags(). That's the explanation for the
  beginning of the series is in this pile; it's not directly
  iov_iter-related, but it's a part of the same work..."

* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  first_iovec_segment(): just return address
  iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
  iov_iter: first_{iovec,bvec}_segment() - simplify a bit
  iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
  iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
  iov_iter_bvec_advance(): don't bother with bvec_iter
  copy_page_{to,from}_iter(): switch iovec variants to generic
  keep iocb_flags() result cached in struct file
  iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
  struct file: use anonymous union member for rcuhead and llist
  btrfs: use IOMAP_DIO_NOSYNC
  teach iomap_dio_rw() to suppress dsync
  No need of likely/unlikely on calls of check_copy_size()
2022-08-03 13:50:22 -07:00
Linus Torvalds
a782e86649 Saner handling of "lseek should fail with ESPIPE" - gets rid of
magical no_llseek thing and makes checks consistent.  In particular,
 ad-hoc "can we do splice via internal pipe" checks got saner (and
 somewhat more permissive, which is what Jason had been after, AFAICT)
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYug2xgAKCRBZ7Krx/gZQ
 6wxWAQDqeg+xMq2FGPXmgjCa+Cp3PXH96Lp6f3hHzakIDx+t8gEAxvuiXAD22Mct
 6S1SKuGj0iDIuM4L7hUiWTiY/bDXSAc=
 =3EC/
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs lseek updates from Al Viro:
 "Jason's lseek series.

  Saner handling of 'lseek should fail with ESPIPE' - this gets rid of
  the magical no_llseek thing and makes checks consistent.

  In particular, the ad-hoc "can we do splice via internal pipe" checks
  got saner (and somewhat more permissive, which is what Jason had been
  after, AFAICT)"

* tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: remove no_llseek
  fs: check FMODE_LSEEK to control internal pipe splicing
  vfio: do not set FMODE_LSEEK flag
  dma-buf: remove useless FMODE_LSEEK flag
  fs: do not compare against ->llseek
  fs: clear or set FMODE_LSEEK based on llseek function
2022-08-03 11:35:20 -07:00
Jason A. Donenfeld
868941b144 fs: remove no_llseek
Now that all callers of ->llseek are going through vfs_llseek(), we
don't gain anything by keeping no_llseek around. Nothing actually calls
it and setting ->llseek to no_lseek is completely equivalent to
leaving it NULL.

Longer term (== by the end of merge window) we want to remove all such
intializations.  To simplify the merge window this commit does *not*
touch initializers - it only defines no_llseek as NULL (and simplifies
the tests on file opening).

At -rc1 we'll need do a mechanical removal of no_llseek -

git grep -l -w no_llseek | grep -v porting.rst | while read i; do
	sed -i '/\<no_llseek\>/d' $i
done
would do it.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-16 09:19:47 -04:00
Jason A. Donenfeld
e7478158e1 fs: clear or set FMODE_LSEEK based on llseek function
Pipe-like behaviour on llseek(2) (i.e. unconditionally failing with
-ESPIPE) can be expresses in 3 ways:
	1) ->llseek set to NULL in file_operations
	2) ->llseek set to no_llseek in file_operations
	3) FMODE_LSEEK *not* set in ->f_mode.

Enforce (3) in cases (1) and (2); that will allow to simplify the
checks and eventually get rid of no_llseek boilerplate.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-16 09:18:56 -04:00
Christian Brauner
b27c82e129
attr: port attribute changes to new types
Now that we introduced new infrastructure to increase the type safety
for filesystems supporting idmapped mounts port the first part of the
vfs over to them.

This ports the attribute changes codepaths to rely on the new better
helpers using a dedicated type.

Before this change we used to take a shortcut and place the actual
values that would be written to inode->i_{g,u}id into struct iattr. This
had the advantage that we moved idmappings mostly out of the picture
early on but it made reasoning about changes more difficult than it
should be.

The filesystem was never explicitly told that it dealt with an idmapped
mount. The transition to the value that needed to be stored in
inode->i_{g,u}id appeared way too early and increased the probability of
bugs in various codepaths.

We know place the same value in struct iattr no matter if this is an
idmapped mount or not. The vfs will only deal with type safe
vfs{g,u}id_t. This makes it massively safer to perform permission checks
as the type will tell us what checks we need to perform and what helpers
we need to use.

Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
inode->i_{g,u}id since they are different types. Instead they need to
use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
vfs{g,u}id into the filesystem.

The other nice effect is that filesystems like overlayfs don't need to
care about idmappings explicitly anymore and can simply set up struct
iattr accordingly directly.

Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-26 18:18:56 +02:00
Al Viro
164f4064ca keep iocb_flags() result cached in struct file
* calculate at the time we set FMODE_OPENED (do_dentry_open() for normal
opens, alloc_file() for pipe()/socket()/etc.)
* update when handling F_SETFL
* keep in a new field - file->f_iocb_flags; since that thing is needed only
before the refcount reaches zero, we can put it into the same anon union
where ->f_rcuhead and ->f_llist live - those are used only after refcount
reaches zero.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-06-10 16:10:23 -04:00
Linus Torvalds
35b51afd23 RISC-V Patches for the 5.19 Merge Window, Part 1
* Support for the Svpbmt extension, which allows memory attributes to be
   encoded in pages.
 * Support for the Allwinner D1's implementation of page-based memory
   attributes.
 * Support for running rv32 binaries on rv64 systems, via the compat
   subsystem.
 * Support for kexec_file().
 * Support for the new generic ticket-based spinlocks, which allows us to
   also move to qrwlock.  These should have already gone in through the
   asm-geneic tree as well.
 * A handful of cleanups and fixes, include some larger ones around
   atomics and XIP.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmKWOx8THHBhbG1lckBk
 YWJiZWx0LmNvbQAKCRAuExnzX7sYieAiEADAUdP7ctoaSQwk5skd/fdA3b4KJuKn
 1Zjl+Br32WP0DlbirYBYWRUQZnCCsvABbTiwSJMcG7NBpU5pyQ5XDtB3OA5kJswO
 Fdp8Nd53//+GK1M5zdEM9OdgvT9fbfTZ3qTu8bKsROOQhGwnYL+Csc9KjFRqEmzN
 oQii0jlb3n5PM4FL3GsbV4uMn9zzkP9mnVAPQktcock2EKFEK/Fy3uNYMQiO2KPi
 n8O6bIDaeRdQ6SurzWOuOkt0cro0tEF85ilzT04mynQsOU0el5oGqCxnOhNH3VWg
 ndqPT6Yafw12hZOtbKJeP+nF8IIR6aJLP3jOtRwEVgcfbXYAw4QwbAV8kQZISefN
 ipn8JGY7GX9Y9TYU692OUGkcmAb3/dxb6c0WihBdvJ0M6YyLD5X+YKHNuG2onLgK
 ss43C5Mxsu629rsjdu/PV91B1+pve3rG9siVmF+g4eo0x9rjMq6/JB0Kal/8SLI1
 Je5T55d5ujV1a2XxhZLQOSD5owrK7J1M9owb0bloTnr9nVwFTWDrfEQEU82o3kP+
 Xm+FfXktnz9ai55NjkMbbEur5D++dKJhBavwCTnBcTrJmMtEH0R45GTK9ZehP+WC
 rNVrRXjIsS18wsTfJxnkZeFQA38as6VBKTzvwHvOgzTrrZU1/xk3lpkouYtAO6BG
 gKacHshVilmUuA==
 =Loi6
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-5.19-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Palmer Dabbelt:

 - Support for the Svpbmt extension, which allows memory attributes to
   be encoded in pages

 - Support for the Allwinner D1's implementation of page-based memory
   attributes

 - Support for running rv32 binaries on rv64 systems, via the compat
   subsystem

 - Support for kexec_file()

 - Support for the new generic ticket-based spinlocks, which allows us
   to also move to qrwlock. These should have already gone in through
   the asm-geneic tree as well

 - A handful of cleanups and fixes, include some larger ones around
   atomics and XIP

* tag 'riscv-for-linus-5.19-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (51 commits)
  RISC-V: Prepare dropping week attribute from arch_kexec_apply_relocations[_add]
  riscv: compat: Using seperated vdso_maps for compat_vdso_info
  RISC-V: Fix the XIP build
  RISC-V: Split out the XIP fixups into their own file
  RISC-V: ignore xipImage
  RISC-V: Avoid empty create_*_mapping definitions
  riscv: Don't output a bogus mmu-type on a no MMU kernel
  riscv: atomic: Add custom conditional atomic operation implementation
  riscv: atomic: Optimize dec_if_positive functions
  riscv: atomic: Cleanup unnecessary definition
  RISC-V: Load purgatory in kexec_file
  RISC-V: Add purgatory
  RISC-V: Support for kexec_file on panic
  RISC-V: Add kexec_file support
  RISC-V: use memcpy for kexec_file mode
  kexec_file: Fix kexec_file.c build error for riscv platform
  riscv: compat: Add COMPAT Kbuild skeletal support
  riscv: compat: ptrace: Add compat_arch_ptrace implement
  riscv: compat: signal: Add rt_frame implementation
  riscv: add memory-type errata for T-Head
  ...
2022-05-31 14:10:54 -07:00
Linus Torvalds
6d29d7fe4f NFSD 5.19 Release Notes
We introduce "courteous server" in this release. Previously NFSD
 would purge open and lock state for an unresponsive client after
 one lease period (typically 90 seconds). Now, after one lease
 period, another client can open and lock those files and the
 unresponsive client's lease is purged; otherwise if the unrespon-
 sive client's open and lock state is uncontended, the server retains
 that open and lock state for up to 24 hours, allowing the client's
 workload to resume after a lengthy network partition.
 
 A longstanding issue with NFSv4 file creation is also addressed.
 Previously a file creation can fail internally, returning an error
 to the client, but leave the newly created file in place as an
 artifact. The file creation code path has been reorganized so that
 internal failures and race conditions are less likely to result in
 an unwanted file creation.
 
 A fault injector has been added to help exercise paths that are run
 during kernel metadata cache invalidation. These caches contain
 information maintained by user space about exported filesystems.
 Many of our test workloads do not trigger cache invalidation.
 
 There is one patch that is needed to support PREEMPT_RT and a fix
 for an ancient "sleep while spin-locked" splat that seems to have
 become easier to hit since v5.18-rc3.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmKPliAACgkQM2qzM29m
 f5dB3BAAorPa2L8xu5P1Ge1oTNogNSOVRkLPDzEkfEwK07ZM2qvz78eMZGkMziJ/
 strorvBWl3SWBlVtTePgNpJUjgYQ75MRRwaX7Qh2WuHeRKm1JlZm0/NId3+zKgbh
 N40QI20jdswWcNDuhidxVFFWurd09GlcM4z1cu8gZLbfthkiUOjZoPiLkXeNcvhk
 7wC9GiueWxHefYQQDAKh1nQS/L0GG1EkzJdJo7WUVAldZ9qVY9LpmJVMRqrBBbta
 XrFYfpeY1zFFDY4Qolyz5PUJSeQuDj9PctlhoZ6B1hp56PD/6yaqVhYXiPxtlALj
 tITtktfiekULZkgfvfvyzssCv+wkbYiaEBZcSSCauR7dkGOmBmajO+cf7vpsERgE
 fbCU8DWGk78SMeehdCrO+26cV37VP+8c2t2Txq/rG5Eq4ZoCi++Hj5poRboFLqb+
 oom+0Ee0LfcAKXkxH5gWTPTblHo49GzGitPZtRzTgZ9uFnVwvEaJ4+t0ij0J8JpL
 HuVtWrg5/REhqpEvOSwF0sRmkYWLTu7KdueGn/iZ8xUi7GHEue01NsVkClohKJcR
 WOjWrbNCNF/LJaG88MX0z5u7IO7s9bOHphd7PJ92vR+4YsehW3uRhk+rNi2ZBqQz
 hzULfu8BiaicV9fdB/hDcMmKQD6U6due2AVVPtxTf5XY+CHQNRY=
 =phE1
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "We introduce 'courteous server' in this release. Previously NFSD would
  purge open and lock state for an unresponsive client after one lease
  period (typically 90 seconds). Now, after one lease period, another
  client can open and lock those files and the unresponsive client's
  lease is purged; otherwise if the unresponsive client's open and lock
  state is uncontended, the server retains that open and lock state for
  up to 24 hours, allowing the client's workload to resume after a
  lengthy network partition.

  A longstanding issue with NFSv4 file creation is also addressed.
  Previously a file creation can fail internally, returning an error to
  the client, but leave the newly created file in place as an artifact.
  The file creation code path has been reorganized so that internal
  failures and race conditions are less likely to result in an unwanted
  file creation.

  A fault injector has been added to help exercise paths that are run
  during kernel metadata cache invalidation. These caches contain
  information maintained by user space about exported filesystems. Many
  of our test workloads do not trigger cache invalidation.

  There is one patch that is needed to support PREEMPT_RT and a fix for
  an ancient 'sleep while spin-locked' splat that seems to have become
  easier to hit since v5.18-rc3"

* tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (36 commits)
  NFSD: nfsd_file_put() can sleep
  NFSD: Add documenting comment for nfsd4_release_lockowner()
  NFSD: Modernize nfsd4_release_lockowner()
  NFSD: Fix possible sleep during nfsd4_release_lockowner()
  nfsd: destroy percpu stats counters after reply cache shutdown
  nfsd: Fix null-ptr-deref in nfsd_fill_super()
  nfsd: Unregister the cld notifier when laundry_wq create failed
  SUNRPC: Use RMW bitops in single-threaded hot paths
  NFSD: Clean up the show_nf_flags() macro
  NFSD: Trace filecache opens
  NFSD: Move documenting comment for nfsd4_process_open2()
  NFSD: Fix whitespace
  NFSD: Remove dprintk call sites from tail of nfsd4_open()
  NFSD: Instantiate a struct file when creating a regular NFSv4 file
  NFSD: Clean up nfsd_open_verified()
  NFSD: Remove do_nfsd_create()
  NFSD: Refactor NFSv4 OPEN(CREATE)
  NFSD: Refactor NFSv3 CREATE
  NFSD: Refactor nfsd_create_setattr()
  NFSD: Avoid calling fh_drop_write() twice in do_nfsd_create()
  ...
2022-05-26 20:52:24 -07:00
Chuck Lever
fb70bf124b NFSD: Instantiate a struct file when creating a regular NFSv4 file
There have been reports of races that cause NFSv4 OPEN(CREATE) to
return an error even though the requested file was created. NFSv4
does not provide a status code for this case.

To mitigate some of these problems, reorganize the NFSv4
OPEN(CREATE) logic to allocate resources before the file is actually
created, and open the new file while the parent directory is still
locked.

Two new APIs are added:

+ Add an API that works like nfsd_file_acquire() but does not open
the underlying file. The OPEN(CREATE) path can use this API when it
already has an open file.

+ Add an API that is kin to dentry_open(). NFSD needs to create a
file and grab an open "struct file *" atomically. The
alloc_empty_file() has to be done before the inode create. If it
fails (for example, because the NFS server has exceeded its
max_files limit), we avoid creating the file and can still return
an error to the NFS client.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=382
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: JianHong Yin <jiyin@redhat.com>
2022-05-23 11:06:29 -04:00
NeilBrown
a2ad63daa8 VFS: add FMODE_CAN_ODIRECT file flag
Currently various places test if direct IO is possible on a file by
checking for the existence of the direct_IO address space operation.
This is a poor choice, as the direct_IO operation may not be used - it is
only used if the generic_file_*_iter functions are called for direct IO
and some filesystems - particularly NFS - don't do this.

Instead, introduce a new f_mode flag: FMODE_CAN_ODIRECT and change the
various places to check this (avoiding pointer dereferences).
do_dentry_open() will set this flag if ->direct_IO is present, so
filesystems do not need to be changed.

NFS *is* changed, to set the flag explicitly and discard the direct_IO
entry in the address_space_operations for files.

Other filesystems which currently use noop_direct_IO could usefully be
changed to set this flag instead.

Link: https://lkml.kernel.org/r/164859778128.29473.15189737957277399416.stgit@noble.brown
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09 18:20:49 -07:00
Guo Ren
59c10c52f5
riscv: compat: syscall: Add compat_sys_call_table implementation
Implement compat sys_call_table and some system call functions:
truncate64, ftruncate64, fallocate, pread64, pwrite64,
sync_file_range, readahead, fadvise64_64 which need argument
translation.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Link: https://lore.kernel.org/r/20220405071314.3225832-12-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-04-26 13:36:25 -07:00
Christoph Hellwig
7b12e49669 fs: remove fs.f_write_hint
The value is now completely unused except for reporting it back through
the F_GET_FILE_RW_HINT ioctl, so remove the value and the two ioctls
for it.

Trying to use the F_SET_FILE_RW_HINT and F_GET_FILE_RW_HINT fcntls will
now return EINVAL, just like it would on a kernel that never supported
this functionality in the first place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308060529.736277-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08 17:55:03 -07:00
Christian Brauner
bd303368b7
fs: support mapped mounts of mapped filesystems
In previous patches we added new and modified existing helpers to handle
idmapped mounts of filesystems mounted with an idmapping. In this final
patch we convert all relevant places in the vfs to actually pass the
filesystem's idmapping into these helpers.

With this the vfs is in shape to handle idmapped mounts of filesystems
mounted with an idmapping. Note that this is just the generic
infrastructure. Actually adding support for idmapped mounts to a
filesystem mountable with an idmapping is follow-up work.

In this patch we extend the definition of an idmapped mount from a mount
that that has the initial idmapping attached to it to a mount that has
an idmapping attached to it which is not the same as the idmapping the
filesystem was mounted with.

As before we do not allow the initial idmapping to be attached to a
mount. In addition this patch prevents that the idmapping the filesystem
was mounted with can be attached to a mount created based on this
filesystem.

This has multiple reasons and advantages. First, attaching the initial
idmapping or the filesystem's idmapping doesn't make much sense as in
both cases the values of the i_{g,u}id and other places where k{g,u}ids
are used do not change. Second, a user that really wants to do this for
whatever reason can just create a separate dedicated identical idmapping
to attach to the mount. Third, we can continue to use the initial
idmapping as an indicator that a mount is not idmapped allowing us to
continue to keep passing the initial idmapping into the mapping helpers
to tell them that something isn't an idmapped mount even if the
filesystem is mounted with an idmapping.

Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-12-05 10:28:57 +01:00
Christian Brauner
4472071331
fs: use low-level mapping helpers
In a few places the vfs needs to interact with bare k{g,u}ids directly
instead of struct inode. These are just a few. In previous patches we
introduced low-level mapping helpers that are able to support
filesystems mounted an idmapping. This patch simply converts the places
to use these new helpers.

Link: https://lore.kernel.org/r/20211123114227.3124056-7-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-7-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-7-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-12-03 18:58:11 +01:00
Christian Brauner
a793d79ea3
fs: move mapping helpers
The low-level mapping helpers were so far crammed into fs.h. They are
out of place there. The fs.h header should just contain the higher-level
mapping helpers that interact directly with vfs objects such as struct
super_block or struct inode and not the bare mapping helpers. Similarly,
only vfs and specific fs code shall interact with low-level mapping
helpers. And so they won't be made accessible automatically through
regular {g,u}id helpers.

Link: https://lore.kernel.org/r/20211123114227.3124056-3-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-3-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-3-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-12-03 18:50:17 +01:00
Linus Torvalds
512b7931ad Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "257 patches.

  Subsystems affected by this patch series: scripts, ocfs2, vfs, and
  mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
  gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
  pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
  memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
  vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
  cleanups, kfence, and damon)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
  mm/damon: remove return value from before_terminate callback
  mm/damon: fix a few spelling mistakes in comments and a pr_debug message
  mm/damon: simplify stop mechanism
  Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
  Docs/admin-guide/mm/damon/start: simplify the content
  Docs/admin-guide/mm/damon/start: fix a wrong link
  Docs/admin-guide/mm/damon/start: fix wrong example commands
  mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
  mm/damon: remove unnecessary variable initialization
  Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
  mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
  selftests/damon: support watermarks
  mm/damon/dbgfs: support watermarks
  mm/damon/schemes: activate schemes based on a watermarks mechanism
  tools/selftests/damon: update for regions prioritization of schemes
  mm/damon/dbgfs: support prioritization weights
  mm/damon/vaddr,paddr: support pageout prioritization
  mm/damon/schemes: prioritize regions within the quotas
  mm/damon/selftests: support schemes quotas
  mm/damon/dbgfs: support quotas of schemes
  ...
2021-11-06 14:08:17 -07:00
Rongwei Wang
8468e937df mm, thp: fix incorrect unmap behavior for private pages
When truncating pagecache on file THP, the private pages of a process
should not be unmapped mapping.  This incorrect behavior on a dynamic
shared libraries which will cause related processes to happen core dump.

A simple test for a DSO (Prerequisite is the DSO mapped in file THP):

    int main(int argc, char *argv[])
    {
	int fd;

	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
	}

	close(fd);
	return 0;
    }

The test only to open a target DSO, and do nothing.  But this operation
will lead one or more process to happen core dump.  This patch mainly to
fix this bug.

Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba.com
Fixes: eb6ecbed0a ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Tested-by: Xu Yu <xuyu@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Collin Fijalkovich <cfijalkovich@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:41 -07:00
Rongwei Wang
55fc0d9174 mm, thp: lock filemap when truncating page cache
Patch series "fix two bugs for file THP".

This patch (of 2):

Transparent huge page has supported read-only non-shmem files.  The
file- backed THP is collapsed by khugepaged and truncated when written
(for shared libraries).

However, there is a race when multiple writers truncate the same page
cache concurrently.

In that case, subpage(s) of file THP can be revealed by find_get_entry
in truncate_inode_pages_range, which will trigger PageTail BUG_ON in
truncate_inode_page, as follows:

    page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
    head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
    flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
    raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
    raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
    head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
    head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
    page dumped because: VM_BUG_ON_PAGE(PageTail(page))
    ------------[ cut here ]------------
    kernel BUG at mm/truncate.c:213!
    Internal error: Oops - BUG: 0 [#1] SMP
    Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ...
    CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ...
    Hardware name: ECS, BIOS 0.0.0 02/06/2015
    pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
    Call trace:
     truncate_inode_page+0x64/0x70
     truncate_inode_pages_range+0x550/0x7e4
     truncate_pagecache+0x58/0x80
     do_dentry_open+0x1e4/0x3c0
     vfs_open+0x38/0x44
     do_open+0x1f0/0x310
     path_openat+0x114/0x1dc
     do_filp_open+0x84/0x134
     do_sys_openat2+0xbc/0x164
     __arm64_sys_openat+0x74/0xc0
     el0_svc_common.constprop.0+0x88/0x220
     do_el0_svc+0x30/0xa0
     el0_svc+0x20/0x30
     el0_sync_handler+0x1a4/0x1b0
     el0_sync+0x180/0x1c0
    Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)

This patch mainly to lock filemap when one enter truncate_pagecache(),
avoiding truncating the same page cache concurrently.

Link: https://lkml.kernel.org/r/20211025092134.18562-1-rongwei.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/20211025092134.18562-2-rongwei.wang@linux.alibaba.com
Fixes: eb6ecbed0a ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Song Liu <song@kernel.org>
Cc: Collin Fijalkovich <cfijalkovich@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:41 -07:00
Richard Guy Briggs
571e5c0efc audit: add OPENAT2 record to list "how" info
Since the openat2(2) syscall uses a struct open_how pointer to communicate
its parameters they are not usefully recorded by the audit SYSCALL record's
four existing arguments.

Add a new audit record type OPENAT2 that reports the parameters in its
third argument, struct open_how with fields oflag, mode and resolve.

The new record in the context of an event would look like:
time->Wed Mar 17 16:28:53 2021
type=PROCTITLE msg=audit(1616012933.531:184): proctitle=
  73797363616C6C735F66696C652F6F70656E617432002F746D702F61756469742D
  7465737473756974652D737641440066696C652D6F70656E617432
type=PATH msg=audit(1616012933.531:184): item=1 name="file-openat2"
  inode=29 dev=00:1f mode=0100600 ouid=0 ogid=0 rdev=00:00
  obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE
  cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1616012933.531:184):
  item=0 name="/root/rgb/git/audit-testsuite/tests"
  inode=25 dev=00:1f mode=040700 ouid=0 ogid=0 rdev=00:00
  obj=unconfined_u:object_r:user_tmp_t:s0 nametype=PARENT
  cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1616012933.531:184):
  cwd="/root/rgb/git/audit-testsuite/tests"
type=OPENAT2 msg=audit(1616012933.531:184):
  oflag=0100302 mode=0600 resolve=0xa
type=SYSCALL msg=audit(1616012933.531:184): arch=c000003e syscall=437
  success=yes exit=4 a0=3 a1=7ffe315f1c53 a2=7ffe315f1550 a3=18
  items=2 ppid=528 pid=540 auid=0 uid=0 gid=0 euid=0 suid=0
  fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=1 comm="openat2"
  exe="/root/rgb/git/audit-testsuite/tests/syscalls_file/openat2"
  subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
  key="testsuite-1616012933-bjAUcEPO"

Link: https://lore.kernel.org/r/d23fbb89186754487850367224b060e26f9b7181.1621363275.git.rgb@redhat.com
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[PM: tweak subject, wrap example, move AUDIT_OPENAT2 to 1337]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-04 12:09:27 -04:00
Jeff Layton
f7e33bdbd6 fs: remove mandatory file locking support
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.

I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.

This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-08-23 06:15:36 -04:00
Linus Torvalds
58ec9059b3 Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs name lookup updates from Al Viro:
 "Small namei.c patch series, mostly to simplify the rules for nameidata
  state. It's actually from the previous cycle - but I didn't post it
  for review in time...

  Changes visible outside of fs/namei.c: file_open_root() calling
  conventions change, some freed bits in LOOKUP_... space"

* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  namei: make sure nd->depth is always valid
  teach set_nameidata() to handle setting the root as well
  take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
  switch file_open_root() to struct path
2021-07-03 11:41:14 -07:00
Linus Torvalds
71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Collin Fijalkovich
eb6ecbed0a mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs
Transparent huge pages are supported for read-only non-shmem files, but
are only used for vmas with VM_DENYWRITE.  This condition ensures that
file THPs are protected from writes while an application is running
(ETXTBSY).  Any existing file THPs are then dropped from the page cache
when a file is opened for write in do_dentry_open().  Since sys_mmap
ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
produced by execve().

Systems that make heavy use of shared libraries (e.g.  Android) are unable
to apply VM_DENYWRITE through the dynamic linker, preventing them from
benefiting from the resultant reduced contention on the TLB.

This patch reduces the constraint on file THPs allowing use with any
executable mapping from a file not opened for write (see
inode_is_open_for_write()).  It also introduces additional conditions to
ensure that files opened for write will never be backed by file THPs.

Restricting the use of THPs to executable mappings eliminates the risk
that a read-only file later opened for write would encounter significant
latencies due to page cache truncation.

The ld linker flag '-z max-page-size=(hugepage size)' can be used to
produce executables with the necessary layout.  The dynamic linker must
map these file's segments at a hugepage size aligned vma for the mapping
to be backed with THPs.

Comparison of the performance characteristics of 4KB and 2MB-backed
libraries follows; the Android dex2oat tool was used to AOT compile an
example application on a single ARM core.

4KB Pages:
==========

count              event_name            # count / runtime
598,995,035,942    cpu-cycles            # 1.800861 GHz
 81,195,620,851    raw-stall-frontend    # 244.112 M/sec
347,754,466,597    iTLB-loads            # 1.046 G/sec
  2,970,248,900    iTLB-load-misses      # 0.854122% miss rate

Total test time: 332.854998 seconds.

2MB Pages:
==========

count              event_name            # count / runtime
592,872,663,047    cpu-cycles            # 1.800358 GHz
 76,485,624,143    raw-stall-frontend    # 232.261 M/sec
350,478,413,710    iTLB-loads            # 1.064 G/sec
    803,233,322    iTLB-load-misses      # 0.229182% miss rate

Total test time: 329.826087 seconds

A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:

/apex/com.android.art/lib64/libart.so
FilePmdMapped:      4096 kB

/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped:      2048 kB

Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
Christian Brauner
cfe80306a0
open: don't silently ignore unknown O-flags in openat2()
The new openat2() syscall verifies that no unknown O-flag values are
set and returns an error to userspace if they are while the older open
syscalls like open() and openat() simply ignore unknown flag values:

  #define O_FLAG_CURRENTLY_INVALID (1 << 31)
  struct open_how how = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID,
          .resolve = 0,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how, sizeof(how));

  /* succeeds */
  fd = openat(-EBADF, "/dev/null", O_RDONLY | O_FLAG_CURRENTLY_INVALID);

However, openat2() silently truncates the upper 32 bits meaning:

  #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1 << 31)
  #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1 << 40)

  struct open_how how_lowe32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_LOWER32,
  };

  struct open_how how_upper32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_UPPER32,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32));

  /* succeeds */
  fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32));

Fix this by preventing the immediate truncation in build_open_flags().

There's a snafu here though stripping FMODE_* directly from flags would
cause the upper 32 bits to be truncated as well due to integer promotion
rules since FMODE_* is unsigned int, O_* are signed ints (yuck).

In addition, struct open_flags currently defines flags to be 32 bit
which is reasonable. If we simply were to bump it to 64 bit we would
need to change a lot of code preemptively which doesn't seem worth it.
So simply add a compile-time check verifying that all currently known
O_* flags are within the 32 bit range and fail to build if they aren't
anymore.

This change shouldn't regress old open syscalls since they silently
truncate any unknown values anyway. It is a tiny semantic change for
openat2() but it is very unlikely people pass ing > 32 bit unknown flags
and the syscall is relatively new too.

Link: https://lore.kernel.org/r/20210528092417.3942079-3-brauner@kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reported-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-05-28 17:44:37 +02:00
Al Viro
ffb37ca3bd switch file_open_root() to struct path
... and provide file_open_root_mnt(), using the root of given mount.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-07 13:56:43 -04:00
Linus Torvalds
7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Christian Brauner
b8b546a061
open: handle idmapped mounts
For core file operations such as changing directories or chrooting,
determining file access, changing mode or ownership the vfs will verify
that the caller is privileged over the inode. Extend the various helpers
to handle idmapped mounts. If the inode is accessed through an idmapped
mount map it into the mount's user namespace. Afterwards the permissions
checks are identical to non-idmapped mounts. When changing file
ownership we need to map the uid and gid from the mount's user
namespace. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-17-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:18 +01:00
Christian Brauner
643fe55a06
open: handle idmapped mounts in do_truncate()
When truncating files the vfs will verify that the caller is privileged
over the inode. Extend it to handle idmapped mounts. If the inode is
accessed through an idmapped mount it is mapped according to the mount's
user namespace. Afterwards the permissions checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:18 +01:00
Christian Brauner
2f221d6f7b
attr: handle idmapped mounts
When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner
47291baa8d
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner
02f92b3868
fs: add file and path permissions helpers
Add two simple helpers to check permissions on a file and path
respectively and convert over some callers. It simplifies quite a few
codepaths and also reduces the churn in later patches quite a bit.
Christoph also correctly points out that this makes codepaths (e.g.
ioctls) way easier to follow that would otherwise have to do more
complex argument passing than necessary.

Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Jens Axboe
99668f6180 fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED
Now that we support non-blocking path resolution internally, expose it
via openat2() in the struct open_how ->resolve flags. This allows
applications using openat2() to limit path resolution to the extent that
it is already cached.

If the lookup cannot be satisfied in a non-blocking manner, openat2(2)
will return -1/-EAGAIN.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 11:42:26 -05:00
Linus Torvalds
faf145d6f3 Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "This set of changes ultimately fixes the interaction of posix file
  lock and exec. Fundamentally most of the change is just moving where
  unshare_files is called during exec, and tweaking the users of
  files_struct so that the count of files_struct is not unnecessarily
  played with.

  Along the way fcheck and related helpers were renamed to more
  accurately reflect what they do.

  There were also many other small changes that fell out, as this is the
  first time in a long time much of this code has been touched.

  Benchmarks haven't turned up any practical issues but Al Viro has
  observed a possibility for a lot of pounding on task_lock. So I have
  some changes in progress to convert put_files_struct to always rcu
  free files_struct. That wasn't ready for the merge window so that will
  have to wait until next time"

* 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  exec: Move io_uring_task_cancel after the point of no return
  coredump: Document coredump code exclusively used by cell spufs
  file: Remove get_files_struct
  file: Rename __close_fd_get_file close_fd_get_file
  file: Replace ksys_close with close_fd
  file: Rename __close_fd to close_fd and remove the files parameter
  file: Merge __alloc_fd into alloc_fd
  file: In f_dupfd read RLIMIT_NOFILE once.
  file: Merge __fd_install into fd_install
  proc/fd: In fdinfo seq_show don't use get_files_struct
  bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
  proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
  file: Implement task_lookup_next_fd_rcu
  kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
  proc/fd: In tid_fd_mode use task_lookup_fd_rcu
  file: Implement task_lookup_fd_rcu
  file: Rename fcheck lookup_fd_rcu
  file: Replace fcheck_files with files_lookup_fd_rcu
  file: Factor files_lookup_fd_locked out of fcheck_files
  file: Rename __fcheck_files to files_lookup_fd_raw
  ...
2020-12-15 19:29:43 -08:00
Eric W. Biederman
8760c909f5 file: Rename __close_fd to close_fd and remove the files parameter
The function __close_fd was added to support binder[1].  Now that
binder has been fixed to no longer need __close_fd[2] all calls
to __close_fd pass current->files.

Therefore transform the files parameter into a local variable
initialized to current->files, and rename __close_fd to close_fd to
reflect this change, and keep it in sync with the similar changes to
__alloc_fd, and __fd_install.

This removes the need for callers to care about the extra care that
needs to be take if anything except current->files is passed, by
limiting the callers to only operation on current->files.

[1] 483ce1d4b8 ("take descriptor-related part of close() to file.c")
[2] 44d8047f1d ("binder: use standard functions to allocate fds")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-21-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Aleksa Sarai
398840f8bb
openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT
This was an oversight in the original implementation, as it makes no
sense to specify both scoping flags to the same openat2(2) invocation
(before this patch, the result of such an invocation was equivalent to
RESOLVE_IN_ROOT being ignored).

This is a userspace-visible ABI change, but the only user of openat2(2)
at the moment is LXC which doesn't specify both flags and so no
userspace programs will break as a result.

Fixes: fddb5d430a ("open: introduce openat2(2) syscall")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <stable@vger.kernel.org> # v5.6+
Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-03 10:15:50 +01:00
Kees Cook
633fb6ac39 exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late.  This was
fixed in commit 73601ea5b7 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.

Move the test into the other inode type checks (which already look for
other pathological conditions[1]).  Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.

Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.

My notes on the call path, and related arguments, checks, etc:

do_open_execat()
    struct open_flags open_exec_flags = {
        .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
        .acc_mode = MAY_EXEC,
        ...
    do_filp_open(dfd, filename, open_flags)
        path_openat(nameidata, open_flags, flags)
            file = alloc_empty_file(open_flags, current_cred());
            do_open(nameidata, file, open_flags)
                may_open(path, acc_mode, open_flag)
		    /* new location of MAY_EXEC vs S_ISREG() test */
                    inode_permission(inode, MAY_OPEN | acc_mode)
                        security_inode_permission(inode, acc_mode)
                vfs_open(path, file)
                    do_dentry_open(file, path->dentry->d_inode, open)
                        /* old location of FMODE_EXEC vs S_ISREG() test */
                        security_file_open(f)
                        open()

[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:01 -07:00
Linus Torvalds
e1ec517e18 Merge branch 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull init and set_fs() cleanups from Al Viro:
 "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
  init: add an init_dup helper
  init: add an init_utimes helper
  init: add an init_stat helper
  init: add an init_mknod helper
  init: add an init_mkdir helper
  init: add an init_symlink helper
  init: add an init_link helper
  init: add an init_eaccess helper
  init: add an init_chmod helper
  init: add an init_chown helper
  init: add an init_chroot helper
  init: add an init_chdir helper
  init: add an init_rmdir helper
  init: add an init_unlink helper
  init: add an init_umount helper
  init: add an init_mount helper
  init: mark create_dev as __init
  init: mark console_on_rootfs as __init
  init: initialize ramdisk_execute_command at compile time
  devtmpfs: refactor devtmpfsd()
  ...
2020-08-07 09:40:34 -07:00
Christoph Hellwig
eb9d7d390e init: add an init_eaccess helper
Add a simple helper to check if a file exists based on kernel space file
name and switch the early init code over to it.  Note that this
theoretically changes behavior as it always is based on the effective
permissions.  But during early init that doesn't make a difference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig
1097742efc init: add an init_chmod helper
Add a simple helper to chmod with a kernel space file name and switch
the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig
b873498f99 init: add an init_chown helper
Add a simple helper to chown with a kernel space file name and switch
the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig
4b7ca5014c init: add an init_chroot helper
Add a simple helper to chroot with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_chroot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig
db63f1e315 init: add an init_chdir helper
Add a simple helper to chdir with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_chdir.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig
b25ba7c3c9 fs: remove ksys_fchmod
Fold it into the only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-31 08:16:01 +02:00
Christoph Hellwig
166e07c37c fs: remove ksys_open
Just open code it in the two callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-31 08:16:00 +02:00
Christoph Hellwig
9e96c8c0e9 fs: add a vfs_fchmod helper
Add a helper for struct file based chmode operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:04 +02:00
Christoph Hellwig
c04011fe8c fs: add a vfs_fchown helper
Add a helper for struct file based chown operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:00 +02:00
Christian Brauner
60997c3d45
close_range: add CLOSE_RANGE_UNSHARE
One of the use-cases of close_range() is to drop file descriptors just before
execve(). This would usually be expressed in the sequence:

unshare(CLONE_FILES);
close_range(3, ~0U);

as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
maximum number of file descriptors to copy from the old struct files. When the
user requests that all file descriptors are supposed to be closed via
close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
to do any of the heavy fput() work for everything above min.

The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
fact currently share our file descriptor table we create a new private copy.
We then close all fds in the requested range and finally after we're done we
install the new fd table.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-17 00:07:38 +02:00
Christian Brauner
278a5fbaed
open: add close_range()
This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

I was contacted by FreeBSD as they wanted to have the same close_range()
syscall as we proposed here. We've coordinated this and in the meantime, Kyle
was fast enough to merge close_range() into FreeBSD already in April:
https://reviews.freebsd.org/D21627
https://svnweb.freebsd.org/base?view=revision&revision=359836
and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
once its merged in Linux too. Python is in the process of switching to
close_range() on FreeBSD and they are waiting on us to merge this to switch on
Linux as well: https://bugs.python.org/issue38061

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating file
descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
        int dir_fd;
        DIR *dir;
        struct dirent *direntp;

        dir = opendir("/proc/self/fd");
        if (!dir)
                return -1;
        dir_fd = dirfd(dir);
        while ((direntp = readdir(dir))) {
                int fd;
                if (strcmp(direntp->d_name, ".") == 0)
                        continue;
                if (strcmp(direntp->d_name, "..") == 0)
                        continue;
                fd = atoi(direntp->d_name);
                if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                        continue;
                close(fd);
        }
        closedir(dir);
        return 0;
}

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancyfying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: 9e4f2f3a6b/Modules/_posixsubprocess.c (L220)
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: 5238e95759/src/basic/fd-util.c (L217)
[5]: ddf4b77e11/src/lxc/start.c (L236)
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
     Note, in a recent patch series Florian made grantpt() a nop thereby
     removing the code referenced here.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: 5f47c0613e/src/libstd/sys/unix/process2.rs (L303-L308)
     Rust's solution is slightly different but is equally unperformant.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kyle Evans <self@kyle-evans.net>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2020-06-17 00:05:19 +02:00
Linus Torvalds
94709049fb Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A few little subsystems and a start of a lot of MM patches.

  Subsystems affected by this patch series: squashfs, ocfs2, parisc,
  vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
  swap, memcg, pagemap, memory-failure, vmalloc, kasan"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
  kasan: move kasan_report() into report.c
  mm/mm_init.c: report kasan-tag information stored in page->flags
  ubsan: entirely disable alignment checks under UBSAN_TRAP
  kasan: fix clang compilation warning due to stack protector
  x86/mm: remove vmalloc faulting
  mm: remove vmalloc_sync_(un)mappings()
  x86/mm/32: implement arch_sync_kernel_mappings()
  x86/mm/64: implement arch_sync_kernel_mappings()
  mm/ioremap: track which page-table levels were modified
  mm/vmalloc: track which page-table levels were modified
  mm: add functions to track page directory modifications
  s390: use __vmalloc_node in stack_alloc
  powerpc: use __vmalloc_node in alloc_vm_stack
  arm64: use __vmalloc_node in arch_alloc_vmap_stack
  mm: remove vmalloc_user_node_flags
  mm: switch the test_vmalloc module to use __vmalloc_node
  mm: remove __vmalloc_node_flags_caller
  mm: remove both instances of __vmalloc_node_flags
  mm: remove the prot argument to __vmalloc_node
  mm: remove the pgprot argument to __vmalloc
  ...
2020-06-02 12:21:36 -07:00
Jeff Layton
735e4ae5ba vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.

Currently, syncfs does not return errors when one of the inodes fails to
be written back.  It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.

This patch (of 2):

Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.

Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect.  If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error.  syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.

It would be better if syncfs reported an error if there were any
writeback failures.  Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.

This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.

To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor.  This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.

An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.

I think that API is just too weird though.  This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Miklos Szeredi
c8ffd8bcdd vfs: add faccessat2 syscall
POSIX defines faccessat() as having a fourth "flags" argument, while the
linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.

Add a new faccessat(2) syscall with the added flags argument and implement
both flags.

The value of AT_EACCESS is defined in glibc headers to be the same as
AT_REMOVEDIR.  Use this value for the kernel interface as well, together
with the explanatory comment.

Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
be useful and is trivial to implement.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:25 +02:00
Miklos Szeredi
9470451505 vfs: split out access_override_creds()
Split out a helper that overrides the credentials in preparation for
actually doing the access check.

This prepares for the next patch that optionally disables the creds
override.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:24 +02:00
Linus Torvalds
9c577491b9 Merge branch 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pathwalk sanitizing from Al Viro:
 "Massive pathwalk rewrite and cleanups.

  Several iterations have been posted; hopefully this thing is getting
  readable and understandable now. Pretty much all parts of pathname
  resolutions are affected...

  The branch is identical to what has sat in -next, except for commit
  message in "lift all calls of step_into() out of follow_dotdot/
  follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
  commit message changed there."

* 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
  lookup_open(): don't bother with fallbacks to lookup+create
  atomic_open(): no need to pass struct open_flags anymore
  open_last_lookups(): move complete_walk() into do_open()
  open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
  open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
  open_last_lookups(): consolidate fsnotify_create() calls
  take post-lookup part of do_last() out of loop
  link_path_walk(): sample parent's i_uid and i_mode for the last component
  __nd_alloc_stack(): make it return bool
  reserve_stack(): switch to __nd_alloc_stack()
  pick_link(): take reserving space on stack into a new helper
  pick_link(): more straightforward handling of allocation failures
  fold path_to_nameidata() into its only remaining caller
  pick_link(): pass it struct path already with normal refcounting rules
  fs/namei.c: kill follow_mount()
  non-RCU analogue of the previous commit
  helper for mount rootwards traversal
  follow_dotdot(): be lazy about changing nd->path
  follow_dotdot_rcu(): be lazy about changing nd->path
  follow_dotdot{,_rcu}(): massage loops
  ...
2020-04-02 12:30:08 -07:00
Al Viro
d9a9f4849f cifs_atomic_open(): fix double-put on late allocation failure
several iterations of ->atomic_open() calling conventions ago, we
used to need fput() if ->atomic_open() failed at some point after
successful finish_open().  Now (since 2016) it's not needed -
struct file carries enough state to make fput() work regardless
of the point in struct file lifecycle and discarding it on
failure exits in open() got unified.  Unfortunately, I'd missed
the fact that we had an instance of ->atomic_open() (cifs one)
that used to need that fput(), as well as the stale comment in
finish_open() demanding such late failure handling.  Trivially
fixed...

Fixes: fe9ec8291f "do_last(): take fput() on error after opening to out:"
Cc: stable@kernel.org # v4.7+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-12 18:25:20 -04:00
Al Viro
31d1726d72 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
As it is, we might or might not have LOOKUP_FOLLOW in op->intent
in that case - that depends upon having O_NOFOLLOW in open flags.
It doesn't matter, since we won't be checking it in that case -
do_last() bails out earlier.

However, making sure it's not set (i.e. acting as if we had an explicit
O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the
check for O_CREAT | O_EXCL in do_last() with the call of step_into()
immediately following it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27 14:43:56 -05:00
Jens Axboe
35cb6d54c1 fs: make build_open_flags() available internally
This is a prep patch for supporting non-blocking open from io_uring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:01:53 -07:00