Commit Graph

830 Commits

Author SHA1 Message Date
Linus Torvalds
f70d24c230 vfs-6.17-rc1.nsfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCaAAKCRCRxhvAZXjc
 ouTCAQCrNCM3h9MpgcMDDUbi9b+lIR11JlvtWNGRUACv3RSNLgEA1vm30+u+JM87
 KpVYg1RrkXDyMFXXuzy7UWpzLcBRywI=
 =L0eW
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull namespace updates from Christian Brauner:
 "This contains namespace updates. This time specifically for nsfs:

   - Userspace heavily relies on the root inode numbers for namespaces
     to identify the initial namespaces. That's already a hard
     dependency. So we cannot change that anymore. Move the initial
     inode numbers to a public header and align the only two namespaces
     that currently don't do that with all the other namespaces.

   - The root inode of /proc having a fixed inode number has been part
     of the core kernel ABI since its inception, and recently some
     userspace programs (mainly container runtimes) have started to
     explicitly depend on this behaviour.

     The main reason this is useful to userspace is that by checking
     that a suspect /proc handle has fstype PROC_SUPER_MAGIC and is
     PROCFS_ROOT_INO, they can then use openat2() together with
     RESOLVE_{NO_{XDEV,MAGICLINK},BENEATH} to ensure that there isn't a
     bind-mount that replaces some procfs file with a different one.

     This kind of attack has lead to security issues in container
     runtimes in the past (such as CVE-2019-19921) and libraries like
     libpathrs[1] use this feature of procfs to provide safe procfs
     handling functions"

* tag 'vfs-6.17-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  uapi: export PROCFS_ROOT_INO
  mntns: use stable inode number for initial mount ns
  netns: use stable inode number for initial mount ns
  nsfs: move root inode number to uapi
2025-07-28 12:50:56 -07:00
Linus Torvalds
7879d7aff0 vfs-6.17-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaIM/KwAKCRCRxhvAZXjc
 opT+AP407JwhRSBjUEmHg5JzUyDoivkOySdnthunRjaBKD8rlgEApM6SOIZYucU7
 cPC3ZY6ORFM6Mwaw+iDW9lasM5ucHQ8=
 =CHha
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc VFS updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add ext4 IOCB_DONTCACHE support

     This refactors the address_space_operations write_begin() and
     write_end() callbacks to take const struct kiocb * as their first
     argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate
     to the filesystem's buffered I/O path.

     Ext4 is updated to implement handling of the IOCB_DONTCACHE flag
     and advertises support via the FOP_DONTCACHE file operation flag.

     Additionally, the i915 driver's shmem write paths are updated to
     bypass the legacy write_begin/write_end interface in favor of
     directly calling write_iter() with a constructed synchronous kiocb.
     Another i915 change replaces a manual write loop with
     kernel_write() during GEM shmem object creation.

  Cleanups:

   - don't duplicate vfs_open() in kernel_file_open()

   - proc_fd_getattr(): don't bother with S_ISDIR() check

   - fs/ecryptfs: replace snprintf with sysfs_emit in show function

   - vfs: Remove unnecessary list_for_each_entry_safe() from
     evict_inodes()

   - filelock: add new locks_wake_up_waiter() helper

   - fs: Remove three arguments from block_write_end()

   - VFS: change old_dir and new_dir in struct renamedata to dentrys

   - netfs: Remove unused declaration netfs_queue_write_request()

  Fixes:

   - eventpoll: Fix semi-unbounded recursion

   - eventpoll: fix sphinx documentation build warning

   - fs/read_write: Fix spelling typo

   - fs: annotate data race between poll_schedule_timeout() and
     pollwake()

   - fs/pipe: set FMODE_NOWAIT in create_pipe_files()

   - docs/vfs: update references to i_mutex to i_rwsem

   - fs/buffer: remove comment about hard sectorsize

   - fs/buffer: remove the min and max limit checks in __getblk_slow()

   - fs/libfs: don't assume blocksize <= PAGE_SIZE in
     generic_check_addressable

   - fs_context: fix parameter name in infofc() macro

   - fs: Prevent file descriptor table allocations exceeding INT_MAX"

* tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits)
  netfs: Remove unused declaration netfs_queue_write_request()
  eventpoll: fix sphinx documentation build warning
  ext4: support uncached buffered I/O
  mm/pagemap: add write_begin_get_folio() helper function
  fs: change write_begin/write_end interface to take struct kiocb *
  drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter
  drm/i915: Use kernel_write() in shmem object create
  eventpoll: Fix semi-unbounded recursion
  vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()
  fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable
  fs/buffer: remove the min and max limit checks in __getblk_slow()
  fs: Prevent file descriptor table allocations exceeding INT_MAX
  fs: Remove three arguments from block_write_end()
  fs/ecryptfs: replace snprintf with sysfs_emit in show function
  fs: annotate suspected data race between poll_schedule_timeout() and pollwake()
  docs/vfs: update references to i_mutex to i_rwsem
  fs/buffer: remove comment about hard sectorsize
  fs_context: fix parameter name in infofc() macro
  VFS: change old_dir and new_dir in struct renamedata to dentrys
  proc_fd_getattr(): don't bother with S_ISDIR() check
  ...
2025-07-28 11:22:56 -07:00
Al Viro
a7cce09945 statmount_mnt_basic(): simplify the logics for group id
We are holding namespace_sem shared and we have not done any group
id allocations since we grabbed it.  Therefore IS_MNT_SHARED(m)
is equivalent to non-zero m->mnt_group_id.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:46 -04:00
Al Viro
f6cc2f4e3d invent_group_ids(): zero ->mnt_group_id always implies !IS_MNT_SHARED()
All places where we call set_mnt_shared() are guaranteed to have
non-zero ->mnt_group_id - either by explicit test, or by having
done successful invent_group_ids() covering the same mount since
we'd grabbed namespace_sem.

The opposite combination (non-zero ->mnt_group_id and !IS_MNT_SHARED())
*is* possible - it means that we have allocated group id, but didn't
get around to set_mnt_shared() yet; such state is transient -
by the time we do namespace_unlock(), we must either do set_mnt_shared()
or unroll the group id allocations by cleanup_group_ids().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:46 -04:00
Al Viro
725ab435ff get rid of CL_SHARE_TO_SLAVE
the only difference between it and CL_SLAVE is in this predicate
in clone_mnt():
	if ((flag & CL_SLAVE) ||
	    ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {
However, in case of CL_SHARED_TO_SLAVE we have not allocated any
mount group ids since the time we'd grabbed namespace_sem, so
IS_MNT_SHARED() is equivalent to non-zero ->mnt_group_id.  And
in case of CL_SLAVE old has come either from the original tree,
which had ->mnt_group_id allocated for all nodes or from result
of sequence of CL_MAKE_SHARED or CL_MAKE_SHARED|CL_SLAVE copies,
ultimately going back to the original tree.  In both cases we are
guaranteed that old->mnt_group_id will be non-zero.

In other words, the predicate is always equal to
	(flags & (CL_SLAVE | CL_SHARED_TO_SLAVE)) && old->mnt_group_id
and with that replacement CL_SLAVE and CL_SHARED_TO_SLAVE have exact
same behaviour.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:46 -04:00
Al Viro
aab771f34e take freeing of emptied mnt_namespace to namespace_unlock()
Freeing of a namespace must be delayed until after we'd dealt with mount
notifications (in namespace_unlock()).  The reasons are not immediately
obvious (they are buried in ->prev_ns handling in mnt_notify()), and
having that free_mnt_ns() explicitly called after namespace_unlock()
is asking for trouble - it does feel like they should be OK to free
as soon as they've been emptied.

Make the things more explicit by setting 'emptied_ns' under namespace_sem
and having namespace_unlock() free the sucker as soon as it's safe to free.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:46 -04:00
Al Viro
663206854f copy_tree(): don't link the mounts via mnt_list
The only place that really needs to be adjusted is commit_tree() -
there we need to iterate through the copy and we might as well
use next_mnt() for that.  However, in case when our tree has been
slid under something already mounted (propagation to a mountpoint
that already has something mounted on it or a 'beneath' move_mount)
we need to take care not to walk into the overmounting tree.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:37 -04:00
Al Viro
8c5a853f58 mnt_slave_list/mnt_slave: turn into hlist_head/hlist_node
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:30 -04:00
Al Viro
406fea7999 mount: separate the flags accessed only under namespace_sem
Several flags are updated and checked only under namespace_sem; we are
already making use of that when we are checking them without mount_lock,
but we have to hold mount_lock for all updates, which makes things
clumsier than they have to be.

Take MNT_SHARED, MNT_UNBINDABLE, MNT_MARKED and MNT_UMOUNT_CANDIDATE
into a separate field (->mnt_t_flags), renaming them to T_SHARED,
etc. to avoid confusion.  All accesses must be under namespace_sem.

That changes locking requirements for mnt_change_propagation() and
set_mnt_shared() - only namespace_sem is needed now.  The same goes
for SET_MNT_MARKED et.al.

There might be more flags moved from ->mnt_flags to that field;
this is just the initial set.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:29 -04:00
Al Viro
493a4bebf5 don't have mounts pin their parents
Simplify the rules for mount refcounts.  Current rules include:
	* being a namespace root => +1
	* being someone's child => +1
	* being someone's child => +1 to parent's refcount, unless you've
				   already been through umount_tree().

The last part is not needed at all.  It makes for more places where need
to decrement refcounts and it creates an asymmetry between the situations
for something that has never been a part of a namespace and something that
left one, both for no good reason.

If mount's refcount has additions from its children, we know that
	* it's either someone's child itself (and will remain so
until umount_tree(), at which point contributions from children
will disappear), or
	* or is the root of namespace (and will remain such until
it either becomes someone's child in another namespace or goes through
umount_tree()), or
	* it is the root of some tree copy, and is currently pinned
by the caller of copy_tree() (and remains such until it either gets
into namespace, or goes to umount_tree()).
In all cases we already have contribution(s) to refcount that will last
as long as the contribution from children remains.  In other words, the
lifetime is not affected by refcount contributions from children.

It might be useful for "is it busy" checks, but those are actually
no harder to express without it.

NB: propagate_mnt_busy() part is an equivalent transformation, ugly as it
is; the current logics is actually wrong and may give false negatives,
but fixing that is for a separate patch (probably earlier in the queue).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
d72c773237 get rid of mountpoint->m_count
struct mountpoint has an odd kinda-sorta refcount in it.  It's always
either equal to or one above the number of mounts attached to that
mountpoint.

"One above" happens when a function takes a temporary reference to
mountpoint.  Things get simpler if we express that as inserting
a local object into ->m_list and removing it to drop the reference.

New calling conventions:

1) lock_mount(), do_lock_mount(), get_mountpoint() and lookup_mountpoint()
take an extra struct pinned_mountpoint * argument and returns 0/-E...
(or true/false in case of lookup_mountpoint()) instead of returning
struct mountpoint pointers.  In case of success, the struct mountpoint *
we used to get can be found as pinned_mountpoint.mp

2) unlock_mount() (always paired with lock_mount()/do_lock_mount()) takes
an address of struct pinned_mountpoint - the same that had been passed to
lock_mount()/do_lock_mount().

3) put_mountpoint() for a temporary reference (paired with get_mountpoint()
or lookup_mountpoint()) is replaced with unpin_mountpoint(), which takes
the address of pinned_mountpoint we passed to matching {get,lookup}_mountpoint().

4) all instances of pinned_mountpoint are local variables; they always live on
stack.  {} is used for initializer, after successful {get,lookup}_mountpoint()
we must make sure to call unpin_mountpoint() before leaving the scope and
after successful {do_,}lock_mount() we must make sure to call unlock_mount()
before leaving the scope.

5) all manipulations of ->m_count are gone, along with ->m_count itself.
struct mountpoint lives while its ->m_list is non-empty.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
86f6398096 combine __put_mountpoint() with unhash_mnt()
A call of unhash_mnt() is immediately followed by passing its return
value to __put_mountpoint(); the shrink list given to __put_mountpoint()
will be ex_mountpoints when called from umount_mnt() and list when called
from mntput_no_expire().

Replace with __umount_mnt(mount, shrink_list), moving the call of
__put_mountpoint() into it (and returning nothing), adjust the
callers.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
e30da2a20e pivot_root(): reorder tree surgeries, collapse unhash_mnt() and put_mountpoint()
attach new_mnt *before* detaching root_mnt; that way we don't need to keep hold
on the mountpoint and one more pair of unhash_mnt()/put_mountpoint() gets
folded together into umount_mnt().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
ec3265a245 take ->mnt_expire handling under mount_lock [read_seqlock_excl]
Doesn't take much massage, and we no longer need to make sure that
by the time of final mntput() the victim has been removed from the
list.  Makes life safer for ->d_automount() instances...

Rules:
	* all ->mnt_expire accesses are under mount_lock.
	* insertion into the list is done by mnt_set_expiry(), and
caller (->d_automount() instance) must hold a reference to mount
in question.  It shouldn't be done more than once for a mount.
	* if a mount on an expiry list is not yet mounted, it will
be ignored by anything that walks that list.
	* if the final mntput() finds its victim still on an expiry
list (in which case it must've never been mounted - umount_tree()
would've taken it out), it will remove the victim from the list.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
a8c764e1a5 attach_recursive_mnt(): remove from expiry list on move
... rather than doing that in do_move_mount().  That's the main
obstacle to moving the protection of ->mnt_expire from namespace_sem
to mount_lock (spinlock-only), which would simplify several failure
exits.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
ee1ee33ccc do_move_mount(): get rid of 'attached' flag
'attached' serves as a proxy for "source is a subtree of our namespace
and not the entirety of anon namespace"; finish massaging it away.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
761de25854 do_move_mount(): take dropping the old mountpoint into attach_recursive_mnt()
... and fold it with unhash_mnt() there - there's no need to retain a reference
to old_mp beyond that point, since by then all mountpoints we were going to add
are either explicitly pinned by get_mountpoint() or have stuff already added
to them.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
86b1da96c5 attach_recursive_mnt(): get rid of flags entirely
move vs. attach is trivially detected as mnt_has_parent(source_mnt)...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
18959bf585 attach_recursive_mnt(): pass destination mount in all cases
... and 'beneath' is no longer used there

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
96f5d2e051 attach_recursive_mnt(): unify the mnt_change_mountpoint() logics
The logics used for tucking under existing mount differs for original
and copies; copies do a mount hash lookup to see if mountpoint to be is
already overmounted, while the original is told explicitly.

But the same logics that is used for copies works for the original,
at which point the only place where we get very close to eliminating
the need of passing 'beneath' flag to attach_recursive_mnt().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
7c6fb47b2b make commit_tree() usable in same-namespace move case
Once attach_recursive_mnt() has created all copies of original subtree,
it needs to put them in place(s).

Steps needed for those are slightly different:
	1) in 'move' case, original copy doesn't need any rbtree
manipulations (everything's already in the same namespace where it will
be), but it needs to be detached from the current location
	2) in 'attach' case, original may be in anon namespace; if it is,
all those mounts need to removed from their current namespace before
insertion into the target one
	3) additional copies have a couple of extra twists - in case
of cross-userns propagation we need to lock everything other the root of
subtree and in case when we end up inserting under an existing mount,
that mount needs to be found (for original copy we have it explicitly
passed by the caller).

Quite a bit of that can be unified; as the first step, make commit_tree()
helper (inserting mounts into namespace, hashing the root of subtree
and marking the namespace as updated) usable in all cases; (2) and (3)
are already using it and for (1) we only need to make the insertion of
mounts into namespace conditional.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
f0d0ba1998 Rewrite of propagate_umount()
The variant currently in the tree has problems; trying to prove
correctness has caught at least one class of bugs (reparenting
that ends up moving the visible location of reparented mount, due
to not excluding some of the counterparts on propagation that
should've been included).

I tried to prove that it's the only bug there; I'm still not sure
whether it is.  If anyone can reconstruct and write down an analysis
of the mainline implementation, I'll gladly review it; as it is,
I ended up doing a different implementation.  Candidate collection
phase is similar, but trimming the set down until it satisfies the
constraints turned out pretty different.

I hoped to do transformation as a massage series, but that turns out
to be too convoluted.  So it's a single patch replacing propagate_umount()
and friends in one go, with notes and analysis in D/f/propagate_umount.txt
(in addition to inline comments).

As far I can tell, it is provably correct and provably linear by the number
of mounts we need to look at in order to decide what should be unmounted.
It even builds and seems to survive testing...

Another nice thing that fell out of that is that ->mnt_umounting is no longer
needed.

Compared to the first version:
	* explicit MNT_UMOUNT_CANDIDATE flag for is_candidate()
	* trim_ancestors() only clears that flag, leaving the suckers on list
	* trim_one() and handle_locked() take the stuff with flag cleared off
the list.  That allows to iterate with list_for_each_entry_safe() when calling
trim_one() - it removes at most one element from the list now.
	* no globals - I didn't bother with any kind of context, not worth it.

	* Notes updated accordingly; I have not touch the terms yet.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
24368a744b sanitize handling of long-term internal mounts
Original rationale for those had been the reduced cost of mntput()
for the stuff that is mounted somewhere.  Mount refcount increments and
decrements are frequent; what's worse, they tend to concentrate on the
same instances and cacheline pingpong is quite noticable.

As the result, mount refcounts are per-cpu; that allows a very cheap
increment.  Plain decrement would be just as easy, but decrement-and-test
is anything but (we need to add the components up, with exclusion against
possible increment-from-zero, etc.).

Fortunately, there is a very common case where we can tell that decrement
won't be the final one - if the thing we are dropping is currently
mounted somewhere.  We have an RCU delay between the removal from mount
tree and dropping the reference that used to pin it there, so we can
just take rcu_read_lock() and check if the victim is mounted somewhere.
If it is, we can go ahead and decrement without and further checks -
the reference we are dropping is not the last one.  If it isn't, we
get all the fun with locking, carefully adding up components, etc.,
but the majority of refcount decrements end up taking the fast path.

There is a major exception, though - pipes and sockets.  Those live
on the internal filesystems that are not going to be mounted anywhere.
They are not going to be _un_mounted, of course, so having to take the
slow path every time a pipe or socket gets closed is really obnoxious.
Solution had been to mark them as long-lived ones - essentially faking
"they are mounted somewhere" indicator.

With minor modification that works even for ones that do eventually get
dropped - all it takes is making sure we have an RCU delay between
clearing the "mounted somewhere" indicator and dropping the reference.

There are some additional twists (if you want to drop a dozen of such
internal mounts, you'd be better off with clearing the indicator on
all of them, doing an RCU delay once, then dropping the references),
but in the basic form it had been
	* use kern_mount() if you want your internal mount to be
a long-term one.
	* use kern_unmount() to undo that.

Unfortunately, the things did rot a bit during the mount API reshuffling.
In several cases we have lost the "fake the indicator" part; kern_unmount()
on the unmount side remained (it doesn't warn if you use it on a mount
without the indicator), but all benefits regaring mntput() cost had been
lost.

To get rid of that bitrot, let's add a new helper that would work
with fs_context-based API: fc_mount_longterm().  It's a counterpart
of fc_mount() that does, on success, mark its result as long-term.
It must be paired with kern_unmount() or equivalents.

Converted:
	1) mqueue (it used to use kern_mount_data() and the umount side
is still as it used to be)
	2) hugetlbfs (used to use kern_mount_data(), internal mount is
never unmounted in this one)
	3) i915 gemfs (used to be kern_mount() + manual remount to set
options, still uses kern_unmount() on umount side)
	4) v3d gemfs (copied from i915)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
c93ff74ff1 do_umount(): simplify the "is it still mounted" checks
Calls of do_umount() are always preceded by can_umount(), where we'd
done a racy check for mount belonging to our namespace; if it wasn't,
can_unmount() would've failed with -EINVAL and we wouldn't have
reached do_umount() at all.

That check needs to be redone once we have acquired namespace_sem
and in do_umount() we do that.  However, that's done in a very odd
way; we check that mount is still in rbtree of _some_ namespace or
its mnt_list is not empty.  It is equivalent to check_mnt(mnt) -
we know that earlier mnt was mounted in our namespace; if it has
stayed there, it's going to remain in rbtree of our namespace.
OTOH, if it ever had been removed from out namespace, it would be
removed from rbtree and it never would've re-added to a namespace
afterwards.  As for ->mnt_list, for something that had been mounted
in a namespace we'll never observe non-empty ->mnt_list while holding
namespace_sem - it does temporarily become non-empty during
umount_tree(), but that doesn't outlast the call of umount_tree(),
let alone dropping namespace_sem.

Things get much easier to follow if we replace that with (equivalent)
check_mnt(mnt) there.  What's more, currently we treat a failure of
that test as "quietly do nothing"; we might as well pretend that we'd
lost the race and fail on that the same way can_umount() would have.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
49acacdc7c clone_mnt(): simplify the propagation-related logics
The underlying rules are simple:
	* MNT_SHARED should be set iff ->mnt_group_id of new mount ends up
non-zero.
	* mounts should be on the same ->mnt_share cyclic list iff they have
the same non-zero ->mnt_group_id value.
	* CL_PRIVATE is mutually exclusive with MNT_SHARED, MNT_SLAVE,
MNT_SHARED_TO_SLAVE and MNT_EXPIRE; the whole point of that thing is to
get a clone of old mount that would *not* be on any namespace-related
lists.

The above allows to make the logics more straightforward; what's more,
it makes the proof that invariants are maintained much simpler.
The variant in mainline is safe (aside of a very narrow race with
unsafe modification of mnt_flags right after we had the mount exposed
in superblock's ->s_mounts; theoretically it can race with ro remount
of the original, but it's not easy to hit), but proof of its correctness
is really unpleasant.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
d08fa7f44a don't set MNT_LOCKED on parentless mounts
Originally MNT_LOCKED meant only one thing - "don't let this mount to
be peeled off its parent, we don't want to have its mountpoint exposed".
Accordingly, it had only been set on mounts that *do* have a parent.
Later it got overloaded with another use - setting it on the absolute
root had given free protection against umount(2) of absolute root
(was possible to trigger, oopsed).  Not a bad trick, but it ended
up costing more than it bought us.  Unfortunately, the cost included
both hard-to-reason-about logics and a subtle race between
mount -o remount,ro and mount --[r]bind - lockless &= ~MNT_LOCKED in
the end of __do_loopback() could race with sb_prepare_remount_readonly()
setting and clearing MNT_HOLD_WRITE (under mount_lock, as it should
be).  The race wouldn't be much of a problem (there are other ways to
deal with it), but the subtlety is.

Turns out that nobody except umount(2) had ever made use of having
MNT_LOCKED set on absolute root.  So let's give up on that trick,
clever as it had been, add an explicit check in do_umount() and
return to using MNT_LOCKED only for mounts that have a parent.

It means that
	* clone_mnt() no longer copies MNT_LOCKED
	* copy_tree() sets it on submounts if their counterparts had
been marked such, and does that right next to attach_mnt() in there,
in the same mount_lock scope.
	* __do_loopback() no longer needs to strip MNT_LOCKED off the
root of subtree it's about to return; no store, no race.
	* init_mount_tree() doesn't bother setting MNT_LOCKED on absolute
root.
	* lock_mnt_tree() does not set MNT_LOCKED on the subtree's root;
accordingly, its caller (loop in attach_recursive_mnt()) does not need to
bother stripping that MNT_LOCKED on root.  Note that lock_mnt_tree() setting
MNT_LOCKED on submounts happens in the same mount_lock scope as __attach_mnt()
(from commit_tree()) that makes them reachable.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
1a867d729f __attach_mnt(): lose the second argument
It's always ->mnt_parent of the first one.  What the function does is
making a mount (with already set parent and mountpoint) visible - in
mount hash and in the parent's list of children.

IOW, it takes the existing rootwards linkage and sets the matching
crownwards linkage.

Renamed to make_visible(), while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
9ed4b9eaea dissolve_on_fput(): use anon_ns_root()
that's the condition we are actually trying to check there...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
05da054d43 new predicate: anon_ns_root(mount)
checks if mount is the root of an anonymouns namespace.
Switch open-coded equivalents to using it.

For mounts that belong to anon namespace !mnt_has_parent(mount)
is the same as mount == ns->root, and intent is more obvious in
the latter form.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
e031251cb2 constify is_local_mountpoint()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
9cb79ed60e new predicate: mount_is_ancestor()
mount_is_ancestor(p1, p2) returns true iff there is a possibly
empty ancestry chain from p1 to p2.

Convert the open-coded checks.  Unlike those open-coded variants
it does not depend upon p1 not being root...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
cf53a2d423 copy_tree(): don't set ->mnt_mountpoint on the root of copy
It never made any sense - neither when copy_tree() had been introduced
(2.4.11-pre5), nor at any point afterwards.  Mountpoint is meaningless
without parent mount and the root of copied tree has no parent until we get
around to attaching it somewhere.  At that time we'll have mountpoint set;
before that we have no idea which dentry will be used as mountpoint.
IOW, copy_tree() should just leave the default value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
ffdc52fbbd prevent mount hash conflicts
Currently it's still possible to run into a pathological situation when
two hashed mounts share both parent and mountpoint.  That does not work
well, for obvious reasons.

We are not far from getting rid of that; the only remaining gap is
attach_recursive_mnt() not being careful enough when sliding a tree
under existing mount (for propagated copies or in 'beneath' case for
the original one).

To deal with that cleanly we need to be able to find overmounts
(i.e. mounts on top of parent's root); we could do hash lookups or scan
the list of children but either would be costly.  Since one of the results
we get from that will be prevention of multiple parallel overmounts, let's
just bite the bullet and store a (non-counting) reference to overmount
in struct mount.

With that done, closing the hole in attach_recursive_mnt() becomes easy
- we just need to follow the chain of overmounts before we change the
mountpoint of the mount we are sliding things under.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
431cc1d8e2 get rid of mnt_set_mountpoint_beneath()
mnt_set_mountpoint_beneath() consists of attaching new mount side-by-side
with the one we want to mount beneath (by mnt_set_mountpoint()), followed
by mnt_change_mountpoint() shifting the the top mount onto the new one
(by mnt_change_mountpoint()).

Both callers of mnt_set_mountpoint_beneath (both in attach_recursive_mnt())
have the same form - in 'beneath' case we call mnt_set_mountpoint_beneath(),
otherwise - mnt_set_mountpoint().

The thing is, expressing that as unconditional mnt_set_mountpoint(),
followed, in 'beneath' case, by mnt_change_mountpoint() is just as easy.
And these mnt_change_mountpoint() callers are similar to the ones we
do when it comes to attaching propagated copies, which will allow more
cleanups in the next commits.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
8c6ce8e86d attach_mnt(): expand in attach_recursive_mnt(), then lose the flag argument
simpler that way - all but one caller pass false as 'beneath' argument,
and that one caller is actually happier with the call expanded - the
logics with choice of mountpoint is identical for 'moving' and 'attaching'
cases, and now that is no longer hidden.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
0748e553df userns and mnt_idmap leak in open_tree_attr(2)
Once want_mount_setattr() has returned a positive, it does require
finish_mount_kattr() to release ->mnt_userns.  Failing do_mount_setattr()
does not change that.

As the result, we can end up leaking userns and possibly mnt_idmap as
well.

Fixes: c4a16820d9 ("fs: add open_tree_attr()")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-24 10:25:04 -04:00
Al Viro
ce7df19686 attach_recursive_mnt(): do not lock the covering tree when sliding something under it
If we are propagating across the userns boundary, we need to lock the
mounts added there.  However, in case when something has already
been mounted there and we end up sliding a new tree under that,
the stuff that had been there before should not get locked.

IOW, lock_mnt_tree() should be called before we reparent the
preexisting tree on top of what we are adding.

Fixes: 3bd045cc9c ("separate copying and locking mount tree on cross-userns copies")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-23 14:02:08 -04:00
Al Viro
7484e15dbb replace collect_mounts()/drop_collected_mounts() with a safer variant
collect_mounts() has several problems - one can't iterate over the results
directly, so it has to be done with callback passed to iterate_mounts();
it has an oopsable race with d_invalidate(); it creates temporary clones
of mounts invisibly for sync umount (IOW, you can have non-lazy umount
succeed leaving filesystem not mounted anywhere and yet still busy).

A saner approach is to give caller an array of struct path that would pin
every mount in a subtree, without cloning any mounts.

        * collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone
        * collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or
a pointer to array of struct path, one for each chunk of tree visible under
'where' (i.e. the first element is a copy of where, followed by (mount,root)
for everything mounted under it - the same set collect_mounts() would give).
Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning
references to the roots of subtrees in the caller's namespace.
        Array is terminated by {NULL, NULL} struct path.  If it fits into
preallocated array (on-stack, normally), that's where it goes; otherwise
it's allocated by kmalloc_array().  Passing 0 as size means that 'preallocated'
is ignored (and expected to be NULL).
        * drop_collected_paths(paths, preallocated) is given the array returned
by an earlier call of collect_paths() and the preallocated array passed to that
call.  All mount/dentry references are dropped and array is kfree'd if it's not
equal to 'preallocated'.
        * instead of iterate_mounts(), users should just iterate over array
of struct path - nothing exotic is needed for that.  Existing users (all in
audit_tree.c) are converted.

[folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>]

Fixes: 80b5dce8c5 ("vfs: Add a function to lazily unmount all mounts from any dentry")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-23 14:01:49 -04:00
Junxuan Liao
2773d282cd
docs/vfs: update references to i_mutex to i_rwsem
VFS has switched to i_rwsem for ten years now (9902af79c0: parallel
lookups actual switch to rwsem), but the VFS documentation and comments
still has references to i_mutex.

Signed-off-by: Junxuan Liao <ljx@cs.wisc.edu>
Link: https://lore.kernel.org/72223729-5471-474a-af3c-f366691fba82@cs.wisc.edu
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23 12:17:33 +02:00
Christian Brauner
7f4f229195
mntns: use stable inode number for initial mount ns
Apart from the network and mount namespace all other namespaces expose a
stable inode number and userspace has been relying on that for a very
long time now. It's very much heavily used API. Align the mount
namespace and use a stable inode number from the reserved procfs inode
number space so this is consistent across all namespaces.

Link: https://lore.kernel.org/20250606-work-nsfs-v1-3-b8749c9a8844@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-11 11:59:08 +02:00
Linus Torvalds
35b574a6c2 mount-related bugfixes
this cycle regression (well, bugfix for this cycle bugfix for v6.15-rc1 regression)
 	do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases
 	selftests/mount_setattr: adapt detached mount propagation test
 v6.15	fs: allow clone_private_mount() for a path on real rootfs
 v6.11	fs/fhandle.c: fix a race in call of has_locked_children()
 v5.15	fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)
 v5.15	clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
 v5.7	path_overmount(): avoid false negatives
 v3.12	finish_automount(): don't leak MNT_LOCKED from parent to child
 v2.6.15	do_change_type(): refuse to operate on unmounted/not ours mounts
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaEXGDwAKCRBZ7Krx/gZQ
 61WQAPwNBpcwum3F5fqT8rcKymqAUFpc0+rluJoBi+qfCQA9ywEAwn+Kh5qqtz++
 cdVnUYQxBrh0u5IOzMEFITlgfYFJZA4=
 =BIeU
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull mount fixes from Al Viro:
 "Various mount-related bugfixes:

   - split the do_move_mount() checks in subtree-of-our-ns and
     entire-anon cases and adapt detached mount propagation selftest for
     mount_setattr

   - allow clone_private_mount() for a path on real rootfs

   - fix a race in call of has_locked_children()

   - fix move_mount propagation graph breakage by MOVE_MOUNT_SET_GROUP

   - make sure clone_private_mnt() caller has CAP_SYS_ADMIN in the right
     userns

   - avoid false negatives in path_overmount()

   - don't leak MNT_LOCKED from parent to child in finish_automount()

   - do_change_type(): refuse to operate on unmounted/not ours mounts"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do_change_type(): refuse to operate on unmounted/not ours mounts
  clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
  selftests/mount_setattr: adapt detached mount propagation test
  do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases
  fs: allow clone_private_mount() for a path on real rootfs
  fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)
  finish_automount(): don't leak MNT_LOCKED from parent to child
  path_overmount(): avoid false negatives
  fs/fhandle.c: fix a race in call of has_locked_children()
2025-06-08 10:35:12 -07:00
Al Viro
12f147ddd6 do_change_type(): refuse to operate on unmounted/not ours mounts
Ensure that propagation settings can only be changed for mounts located
in the caller's mount namespace. This change aligns permission checking
with the rest of mount(2).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 07b20889e3 ("beginning of the shared-subtree proper")
Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 01:37:56 -04:00
Al Viro
c28f922c9d clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
What we want is to verify there is that clone won't expose something
hidden by a mount we wouldn't be able to undo.  "Wouldn't be able to undo"
may be a result of MNT_LOCKED on a child, but it may also come from
lacking admin rights in the userns of the namespace mount belongs to.

clone_private_mnt() checks the former, but not the latter.

There's a number of rather confusing CAP_SYS_ADMIN checks in various
userns during the mount, especially with the new mount API; they serve
different purposes and in case of clone_private_mnt() they usually,
but not always end up covering the missing check mentioned above.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com>
Fixes: 427215d85e ("ovl: prevent private clone if bind mount is not allowed")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 01:37:24 -04:00
Al Viro
290da20e33 do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases
... and fix the breakage in anon-to-anon case.  There are two cases
acceptable for do_move_mount() and mixing checks for those is making
things hard to follow.

One case is move of a subtree in caller's namespace.
        * source and destination must be in caller's namespace
	* source must be detachable from parent
Another is moving the entire anon namespace elsewhere
	* source must be the root of anon namespace
	* target must either in caller's namespace or in a suitable
	  anon namespace (see may_use_mount() for details).
	* target must not be in the same namespace as source.

It's really easier to follow if tests are *not* mixed together...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 3b5260d12b ("Don't propagate mounts into detached trees")
Reported-by: Allison Karlitskaya <lis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 00:41:20 -04:00
KONDO KAZUMA(近藤 和真)
4954346d80 fs: allow clone_private_mount() for a path on real rootfs
Mounting overlayfs with a directory on real rootfs (initramfs)
as upperdir has failed with following message since commit
db04662e2f ("fs: allow detached mounts in clone_private_mount()").

  [    4.080134] overlayfs: failed to clone upperpath

Overlayfs mount uses clone_private_mount() to create internal mount
for the underlying layers.

The commit made clone_private_mount() reject real rootfs because
it does not have a parent mount and is in the initial mount namespace,
that is not an anonymous mount namespace.

This issue can be fixed by modifying the permission check
of clone_private_mount() following [1].

Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: db04662e2f ("fs: allow detached mounts in clone_private_mount()")
Link: https://lore.kernel.org/all/20250514190252.GQ2023217@ZenIV/ [1]
Link: https://lore.kernel.org/all/20250506194849.GT2023217@ZenIV/
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kazuma Kondo <kazuma-kondo@nec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 00:41:02 -04:00
Al Viro
d8cc0362f9 fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)
9ffb14ef61 "move_mount: allow to add a mount into an existing group"
breaks assertions on ->mnt_share/->mnt_slave.  For once, the data structures
in question are actually documented.

Documentation/filesystem/sharedsubtree.rst:
        All vfsmounts in a peer group have the same ->mnt_master.  If it is
	non-NULL, they form a contiguous (ordered) segment of slave list.

do_set_group() puts a mount into the same place in propagation graph
as the old one.  As the result, if old mount gets events from somewhere
and is not a pure event sink, new one needs to be placed next to the
old one in the slave list the old one's on.  If it is a pure event
sink, we only need to make sure the new one doesn't end up in the
middle of some peer group.

"move_mount: allow to add a mount into an existing group" ends up putting
the new one in the beginning of list; that's definitely not going to be
in the middle of anything, so that's fine for case when old is not marked
shared.  In case when old one _is_ marked shared (i.e. is not a pure event
sink), that breaks the assumptions of propagation graph iterators.

Put the new mount next to the old one on the list - that does the right thing
in "old is marked shared" case and is just as correct as the current behaviour
if old is not marked shared (kudos to Pavel for pointing that out - my original
suggested fix changed behaviour in the "nor marked" case, which complicated
things for no good reason).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 9ffb14ef61 ("move_mount: allow to add a mount into an existing group")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 00:40:35 -04:00
Al Viro
5f31c54938 path_overmount(): avoid false negatives
Holding namespace_sem is enough to make sure that result remains valid.
It is *not* enough to avoid false negatives from __lookup_mnt().  Mounts
can be unhashed outside of namespace_sem (stuck children getting detached
on final mntput() of lazy-umounted mount) and having an unrelated mount
removed from the hash chain while we traverse it may end up with false
negative from __lookup_mnt().  We need to sample and recheck the seqlock
component of mount_lock...

Bug predates the introduction of path_overmount() - it had come from
the code in finish_automount() that got abstracted into that helper.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 26df6034fd ("fix automount/automount race properly")
Fixes: 6ac3928156 ("fs: allow to mount beneath top mount")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 00:38:34 -04:00
Al Viro
1f282cdc1d fs/fhandle.c: fix a race in call of has_locked_children()
may_decode_fh() is calling has_locked_children() while holding no locks.
That's an oopsable race...

The rest of the callers are safe since they are holding namespace_sem and
are guaranteed a positive refcount on the mount in question.

Rename the current has_locked_children() to __has_locked_children(), make
it static and switch the fs/namespace.c users to it.

Make has_locked_children() a wrapper for __has_locked_children(), calling
the latter under read_seqlock_excl(&mount_lock).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 620c266f39 ("fhandle: relax open_by_handle_at() permission checks")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 00:37:38 -04:00
Linus Torvalds
0f70f5b08a automount wart removal
Calling conventions of ->d_automount() made saner (flagday change)
 vfs_submount() is gone - its sole remaining user (trace_automount) had
 been switched to saner primitives.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaDoRWQAKCRBZ7Krx/gZQ
 6wxMAQCzuMc2GiGBMXzeK4SGA7d5rsK71unf+zczOd8NvbTImQEAs1Cu3u3bF3pq
 EmHQWFTKBpBf+RHsLSoDHwUA+9THowM=
 =GXLi
 -----END PGP SIGNATURE-----

Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull automount updates from Al Viro:
 "Automount wart removal

  A bunch of odd boilerplate gone from instances - the reason for
  those was the need to protect the yet-to-be-attched mount from
  mark_mounts_for_expiry() deciding to take it out.

  But that's easy to detect and take care of in mark_mounts_for_expiry()
  itself; no need to have every instance simulate mount being busy by
  grabbing an extra reference to it, with finish_automount() undoing
  that once it attaches that mount.

  Should've done it that way from the very beginning... This is a
  flagday change, thankfully there are very few instances.

  vfs_submount() is gone - its sole remaining user (trace_automount)
  had been switched to saner primitives"

* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill vfs_submount()
  saner calling conventions for ->d_automount()
2025-05-30 15:38:29 -07:00
Linus Torvalds
a82ba83991 6.15 behaviour wrt mount propagation into detached trees breaks userland.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaDn/AQAKCRBZ7Krx/gZQ
 6wAcAQCuiKUqmmrhssYEn/FsboM8eR36XzjlWUYOgzARXR0CIAEA6RmWIY3a/e0n
 DlbKbxmNkQM6xIUjml7nXYzoxaWX0A8=
 =9UbY
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull mount propagation fix from Al Viro:
 "6.15 allowed mount propagation to destinations in detached trees;
  unfortunately, that breaks existing userland, so the old behaviour
  needs to be restored.

  It's not exactly a revert - the original behaviour had a bug, where
  existence of detached tree might disrupt propagation between locations
  not in detached trees. Thankfully, userland did not depend upon that
  bug, so we want to keep the fix"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Don't propagate mounts into detached trees
2025-05-30 15:04:11 -07:00