mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-11-16 19:20:24 +00:00
Summary
=======
This introduces FSCONFIG_CMD_CREATE_EXCL which allows userspace to
implement something like mount -t ext4 --exclusive /dev/sda /B which
fails if a superblock for the requested filesystem already exists:
Before this patch
-----------------
$ sudo ./move-mount -f xfs -o source=/dev/sda4 /A
Requesting filesystem type xfs
Mount options requested: source=/dev/sda4
Attaching mount at /A
Moving single attached mount
Setting key(source) with val(/dev/sda4)
$ sudo ./move-mount -f xfs -o source=/dev/sda4 /B
Requesting filesystem type xfs
Mount options requested: source=/dev/sda4
Attaching mount at /B
Moving single attached mount
Setting key(source) with val(/dev/sda4)
After this patch with --exclusive as a switch for FSCONFIG_CMD_CREATE_EXCL
--------------------------------------------------------------------------
$ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /A
Requesting filesystem type xfs
Request exclusive superblock creation
Mount options requested: source=/dev/sda4
Attaching mount at /A
Moving single attached mount
Setting key(source) with val(/dev/sda4)
$ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /B
Requesting filesystem type xfs
Request exclusive superblock creation
Mount options requested: source=/dev/sda4
Attaching mount at /B
Moving single attached mount
Setting key(source) with val(/dev/sda4)
Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed
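The last line of the output above ends with a message that the test tool apparently read back from the filesystem context ("i xfs: reusing existing filesystem not allowed"). A minimal sketch of splitting such a message into its level character and text follows. It assumes the one-character prefix is one of e, w or i (error, warning, info), as used by the kernel's fs context logging; split_fc_log() is a hypothetical helper name and not part of move-mount.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "<level> <text>" into its two parts.
 * Assumes <level> is one of 'e', 'w' or 'i' followed by one space.
 * Returns 0 on success and -1 if the message does not have that shape.
 */
static int split_fc_log(const char *msg, char *level, const char **text)
{
	assert(msg != NULL && level != NULL && text != NULL);

	/* Need at least the level character, a space and one text byte. */
	if (strlen(msg) < 3 || msg[1] != ' ')
		return -1;
	if (msg[0] != 'e' && msg[0] != 'w' && msg[0] != 'i')
		return -1;

	*level = msg[0];
	*text = msg + 2;	/* points into msg, nothing is copied */
	return 0;
}
```

A caller would pass the buffer it obtained from the context fd and branch on the returned level, for example treating 'e' as fatal and printing 'i' messages as hints.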
Details
=======
As mentioned on the list (cf. [1]-[3]) mount requests like
mount -t ext4 /dev/sda /A are ambiguous for userspace. Either a new
superblock has been created and mounted or an existing superblock has
been reused and a bind-mount has been created.
This becomes clear in the following example where two processes create
the same mount for the same block device:
P1                                                              P2
fd_fs = fsopen("ext4");                                         fd_fs = fsopen("ext4");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");     fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "dax", "always");          fsconfig(fd_fs, FSCONFIG_SET_STRING, "resuid", "1000");

// wins and creates superblock
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
                                                                // finds compatible superblock of P1
                                                                // spins until P1 sets SB_BORN and grabs a reference
                                                                fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)

fd_mnt1 = fsmount(fd_fs);                                       fd_mnt2 = fsmount(fd_fs);
move_mount(fd_mnt1, "/A")                                       move_mount(fd_mnt2, "/B")
Not only does P2 get a bind-mount, but the mount options that P2
requests are silently ignored. The VFS itself doesn't, can't and
shouldn't enforce filesystem specific mount option compatibility. It
only enforces incompatibility for read-only <-> read-write transitions:
mount -t ext4 /dev/sda /A
mount -t ext4 -o ro /dev/sda /B
The read-only request will fail with EBUSY as the VFS can't just
silently transition a superblock from read-write to read-only or vice
versa without risking security issues.
To userspace this silent superblock reuse can become a security issue
because there is currently no straightforward way for userspace to know
that they did indeed manage to create a new superblock and didn't just
reuse an existing one.
This adds a new FSCONFIG_CMD_CREATE_EXCL command to fsconfig() that
returns EBUSY if an existing superblock would be reused. Userspace that
needs to be sure that it did create a new superblock with the requested
mount options can request superblock creation using this command. If the
command succeeds they can be sure that they did create a new superblock
with the requested mount options.
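A minimal userspace sketch of requesting exclusive superblock creation follows. It assumes raw syscall(2) wrappers, since libc wrappers for the new mount api may be missing. The EX_* constants mirror values from the uapi header, the fallback syscall numbers 430 and 431 are assumed to hold for the target architecture, and create_superblock_excl() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older libc headers (assumed numbers, check your arch). */
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif

/* Local copies of the uapi values so the sketch builds without
 * <linux/mount.h>. The EX_ prefix marks them as illustrative. */
#define EX_FSOPEN_CLOEXEC		0x00000001
#define EX_FSCONFIG_SET_STRING		1
#define EX_FSCONFIG_CMD_CREATE_EXCL	8

/*
 * Hypothetical helper: returns a filesystem context fd whose superblock
 * was newly created, or -1 with errno set. errno is EBUSY when an
 * existing superblock would have been reused.
 */
static int create_superblock_excl(const char *fs_type, const char *source)
{
	int fd_fs, saved;

	assert(fs_type != NULL && source != NULL);

	fd_fs = (int)syscall(SYS_fsopen, fs_type, EX_FSOPEN_CLOEXEC);
	if (fd_fs < 0)
		return -1;

	if (syscall(SYS_fsconfig, fd_fs, EX_FSCONFIG_SET_STRING,
		    "source", source, 0) < 0 ||
	    syscall(SYS_fsconfig, fd_fs, EX_FSCONFIG_CMD_CREATE_EXCL,
		    NULL, NULL, 0) < 0) {
		saved = errno;	/* close() may clobber errno */
		close(fd_fs);
		errno = saved;
		return -1;
	}
	return fd_fs;
}
```

On success the caller would continue with fsmount() and move_mount() as in the P1/P2 example. On EBUSY it knows that a superblock for the source already exists and can decide whether falling back to FSCONFIG_CMD_CREATE is acceptable.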
This requires the new mount api. With the old mount api it would be
necessary to plumb this through every legacy filesystem's
file_system_type->mount() method. If they want this feature they are
most welcome to switch to the new mount api.
Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on
each high-level superblock creation helper:
(1) get_tree_nodev()
Always allocates a new superblock. Hence, FSCONFIG_CMD_CREATE and
FSCONFIG_CMD_CREATE_EXCL are equivalent.
The binderfs or overlayfs filesystems are examples.
(2) get_tree_keyed()
Finds an existing superblock based on sb->s_fs_info. Hence,
FSCONFIG_CMD_CREATE would reuse an existing superblock whereas
FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY.
The mqueue or nfsd filesystems are examples.
(3) get_tree_bdev()
This effectively works like get_tree_keyed().
The ext4 or xfs filesystems are examples.
(4) get_tree_single()
Only one superblock of this filesystem type can ever exist.
Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock
whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY.
The securityfs or configfs filesystems are examples.
Note that some single-instance filesystems never destroy the
superblock once it has been created during the first mount. For
example, if securityfs has been mounted at least once then the
created superblock will never be destroyed again as long as there is
still an LSM making use of it. Consequently, even if securityfs is
unmounted and the superblock seemingly destroyed it really isn't
which means that FSCONFIG_CMD_CREATE_EXCL will continue rejecting
reusing an existing superblock.
This is acceptable though since special purpose filesystems such as
this shouldn't have a need to use FSCONFIG_CMD_CREATE_EXCL anyway
and if they do it's probably to make sure that mount options aren't
ignored.
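The difference between the helpers above can be illustrated with a small userspace toy model. This is not kernel code: toy_get_tree(), the toy_* types and the fixed-size table are hypothetical and only mimic the lookup-or-allocate decision described above. get_tree_bdev() and get_tree_single() behave like the keyed case here, with the block device or a constant as the key.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_SB 8

enum toy_helper { TOY_NODEV, TOY_KEYED };	/* illustrative names */

struct toy_sb {
	int in_use;
	int keyed;
	char key[32];
};

static struct toy_sb toy_table[TOY_MAX_SB];

/*
 * Returns the index of the superblock that would be mounted, -EBUSY if
 * an existing one would be reused although excl was requested, or
 * -ENOSPC if the toy table is full.
 */
static int toy_get_tree(enum toy_helper helper, const char *key, int excl)
{
	int i, free_slot = -1;

	assert(key != NULL);

	for (i = 0; i < TOY_MAX_SB; i++) {
		struct toy_sb *sb = &toy_table[i];

		if (!sb->in_use) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		/* Only keyed lookups can find a compatible superblock. */
		if (helper == TOY_KEYED && sb->keyed &&
		    strcmp(sb->key, key) == 0)
			return excl ? -EBUSY : i;
	}
	if (free_slot < 0)
		return -ENOSPC;

	/* No match: allocate a new superblock. */
	toy_table[free_slot].in_use = 1;
	toy_table[free_slot].keyed = (helper == TOY_KEYED);
	snprintf(toy_table[free_slot].key, sizeof(toy_table[free_slot].key),
		 "%s", key);
	return free_slot;
}
```

With TOY_NODEV every call allocates a fresh slot, so the excl argument makes no difference. With TOY_KEYED a second call for the same key returns the existing slot, or -EBUSY when excl is set.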
Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on
filesystems that make use of the low-level sget_fc() helper directly.
They're all effectively variants on get_tree_keyed(), get_tree_bdev(),
or get_tree_nodev():
(5) mtd_get_sb()
Similar logic to get_tree_keyed().
(6) afs_get_tree()
Similar logic to get_tree_keyed().
(7) ceph_get_tree()
Similar logic to get_tree_keyed().
Already explicitly allows forcing the allocation of a new superblock
via CEPH_OPT_NOSHARE. This turns it into get_tree_nodev().
(8) fuse_get_tree_submount()
Similar logic to get_tree_nodev().
(9) fuse_get_tree()
Forces reuse of existing FUSE superblock.
Forces reuse of the existing superblock if the passed-in file refers
to an existing FUSE connection.
If FSCONFIG_CMD_CREATE_EXCL is specified together with an fd
referring to an existing FUSE connection this would cause the
superblock reuse to fail. If reusing is the intent then
FSCONFIG_CMD_CREATE_EXCL shouldn't be specified.
(10) fuse_get_tree()
-> get_tree_nodev()
Same logic as in get_tree_nodev().
(11) fuse_get_tree()
-> get_tree_bdev()
Same logic as in get_tree_bdev().
(12) virtio_fs_get_tree()
Same logic as get_tree_keyed().
(13) gfs2_meta_get_tree()
Forces reuse of existing gfs2 superblock.
Mounting gfs2meta enforces that a gfs2 superblock must already
exist. If not, it will error out. Consequently, mounting gfs2meta
with FSCONFIG_CMD_CREATE_EXCL would always fail. If reusing is the
intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified.
(14) kernfs_get_tree()
Similar logic to get_tree_keyed().
(15) nfs_get_tree_common()
Similar logic to get_tree_keyed().
Already explicitly allows forcing the allocation of a new superblock
via NFS_MOUNT_UNSHARED. This effectively turns it into
get_tree_nodev().
Link: [1] https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner
Link: [2] https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner
Link: [3] https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230802-vfs-super-exclusive-v2-4-95dc4e41b870@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
142 lines
5.2 KiB
C
#ifndef _UAPI_LINUX_MOUNT_H
#define _UAPI_LINUX_MOUNT_H

#include <linux/types.h>

/*
 * These are the fs-independent mount-flags: up to 32 flags are supported
 *
 * Usage of these is restricted within the kernel to core mount(2) code and
 * callers of sys_mount() only.  Filesystems should be using the SB_*
 * equivalent instead.
 */
#define MS_RDONLY	 1	/* Mount read-only */
#define MS_NOSUID	 2	/* Ignore suid and sgid bits */
#define MS_NODEV	 4	/* Disallow access to device special files */
#define MS_NOEXEC	 8	/* Disallow program execution */
#define MS_SYNCHRONOUS	16	/* Writes are synced at once */
#define MS_REMOUNT	32	/* Alter flags of a mounted FS */
#define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
#define MS_DIRSYNC	128	/* Directory modifications are synchronous */
#define MS_NOSYMFOLLOW	256	/* Do not follow symlinks */
#define MS_NOATIME	1024	/* Do not update access times. */
#define MS_NODIRATIME	2048	/* Do not update directory access times */
#define MS_BIND		4096
#define MS_MOVE		8192
#define MS_REC		16384
#define MS_VERBOSE	32768	/* War is peace. Verbosity is silence.
				   MS_VERBOSE is deprecated. */
#define MS_SILENT	32768
#define MS_POSIXACL	(1<<16)	/* VFS does not apply the umask */
#define MS_UNBINDABLE	(1<<17)	/* change to unbindable */
#define MS_PRIVATE	(1<<18)	/* change to private */
#define MS_SLAVE	(1<<19)	/* change to slave */
#define MS_SHARED	(1<<20)	/* change to shared */
#define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
#define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
#define MS_I_VERSION	(1<<23) /* Update inode I_version field */
#define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
#define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */

/* These sb flags are internal to the kernel */
#define MS_SUBMOUNT     (1<<26)
#define MS_NOREMOTELOCK	(1<<27)
#define MS_NOSEC	(1<<28)
#define MS_BORN		(1<<29)
#define MS_ACTIVE	(1<<30)
#define MS_NOUSER	(1<<31)

/*
 * Superblock flags that can be altered by MS_REMOUNT
 */
#define MS_RMT_MASK	(MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
			 MS_LAZYTIME)

/*
 * Old magic mount flag and mask
 */
#define MS_MGC_VAL 0xC0ED0000
#define MS_MGC_MSK 0xffff0000

/*
 * open_tree() flags.
 */
#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
#define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */

/*
 * move_mount() flags.
 */
#define MOVE_MOUNT_F_SYMLINKS		0x00000001 /* Follow symlinks on from path */
#define MOVE_MOUNT_F_AUTOMOUNTS		0x00000002 /* Follow automounts on from path */
#define MOVE_MOUNT_F_EMPTY_PATH		0x00000004 /* Empty from path permitted */
#define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
#define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
#define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
#define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
#define MOVE_MOUNT_BENEATH		0x00000200 /* Mount beneath top mount */
#define MOVE_MOUNT__MASK		0x00000377

/*
 * fsopen() flags.
 */
#define FSOPEN_CLOEXEC		0x00000001

/*
 * fspick() flags.
 */
#define FSPICK_CLOEXEC		0x00000001
#define FSPICK_SYMLINK_NOFOLLOW	0x00000002
#define FSPICK_NO_AUTOMOUNT	0x00000004
#define FSPICK_EMPTY_PATH	0x00000008

/*
 * The type of fsconfig() call made.
 */
enum fsconfig_command {
	FSCONFIG_SET_FLAG	= 0,	/* Set parameter, supplying no value */
	FSCONFIG_SET_STRING	= 1,	/* Set parameter, supplying a string value */
	FSCONFIG_SET_BINARY	= 2,	/* Set parameter, supplying a binary blob value */
	FSCONFIG_SET_PATH	= 3,	/* Set parameter, supplying an object by path */
	FSCONFIG_SET_PATH_EMPTY	= 4,	/* Set parameter, supplying an object by (empty) path */
	FSCONFIG_SET_FD		= 5,	/* Set parameter, supplying an object by fd */
	FSCONFIG_CMD_CREATE	= 6,	/* Create new or reuse existing superblock */
	FSCONFIG_CMD_RECONFIGURE = 7,	/* Invoke superblock reconfiguration */
	FSCONFIG_CMD_CREATE_EXCL = 8,	/* Create new superblock, fail if reusing existing superblock */
};

/*
 * fsmount() flags.
 */
#define FSMOUNT_CLOEXEC		0x00000001

/*
 * Mount attributes.
 */
#define MOUNT_ATTR_RDONLY	0x00000001 /* Mount read-only */
#define MOUNT_ATTR_NOSUID	0x00000002 /* Ignore suid and sgid bits */
#define MOUNT_ATTR_NODEV	0x00000004 /* Disallow access to device special files */
#define MOUNT_ATTR_NOEXEC	0x00000008 /* Disallow program execution */
#define MOUNT_ATTR__ATIME	0x00000070 /* Setting on how atime should be updated */
#define MOUNT_ATTR_RELATIME	0x00000000 /* - Update atime relative to mtime/ctime. */
#define MOUNT_ATTR_NOATIME	0x00000010 /* - Do not update access times. */
#define MOUNT_ATTR_STRICTATIME	0x00000020 /* - Always perform atime updates */
#define MOUNT_ATTR_NODIRATIME	0x00000080 /* Do not update directory access times */
#define MOUNT_ATTR_IDMAP	0x00100000 /* Idmap mount to @userns_fd in struct mount_attr. */
#define MOUNT_ATTR_NOSYMFOLLOW	0x00200000 /* Do not follow symlinks */

/*
 * mount_setattr()
 */
struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u64 propagation;
	__u64 userns_fd;
};

/* List of all mount_attr versions. */
#define MOUNT_ATTR_SIZE_VER0	32 /* sizeof first published struct */

#endif /* _UAPI_LINUX_MOUNT_H */