Commit Graph

246 Commits

Author SHA1 Message Date
Al Viro
691fb82ca6 make d_set_d_op() static
Convert the last user (d_alloc_pseudo()) and be done with that.
Any out-of-tree filesystem using it should switch to d_splice_alias_ops()
or, better yet, check whether it really needs to have ->d_op vary among
its dentries.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-11 13:39:52 -04:00
Al Viro
05fb0e6664 new helper: set_default_d_op()
... to be used instead of manually assigning to ->s_d_op.
All in-tree filesystems converted (and the field itself is renamed,
so any out-of-tree ones in need of conversion will be caught
by compiler).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-10 22:21:16 -04:00
Al Viro
790fa81b8c new helper: d_splice_alias_ops()
Uses of d_set_d_op() on live dentry can be very dangerous; it is going
to be withdrawn and replaced with saner things.

The best way for a filesystem is to have the default dentry_operations
set at mount time and be done with that - __d_alloc() will use that.

Currently there are two cases when d_set_d_op() is used on a live dentry -
one is procfs, which has several genuinely different dentry_operations
instances (different ->d_revalidate(), etc.) and another is
simple_lookup(), where we would be better off without overriding ->d_op.

For procfs we have d_set_d_op() calls followed by d_splice_alias();
provide a new helper (d_splice_alias_ops(inode, dentry, d_ops)) that would
combine those two, and do the d_set_d_op() part while under ->d_lock.
That eliminates one of the places where ->d_flags had been modified
without holding ->d_lock; current behaviour is not racy, but the reasons
for that are far too brittle.  Better move to uniform locking rules and
simpler proof of correctness...

The next commit will convert procfs to use of that helper; it is not
exported and won't be until somebody comes up with convincing modular
user for it.

Again, the best approach is to have default ->d_op and let __d_alloc()
do the right thing; filesystem _may_ need non-uniform ->d_op (procfs
does), but there'd better be good reasons for that.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-10 22:11:39 -04:00
Linus Torvalds
6d5b940e1e vfs-6.16-rc1.async.dir
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBN6wAKCRCRxhvAZXjc
 ok32AQD9DTiSCAoVg+7s+gSBuLTi8drPTN++mCaxdTqRh5WpRAD9GVyrGQT0s6LH
 eo9bm8d1TAYjilEWM0c0K0TxyQ7KcAA=
 =IW7H
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs directory lookup updates from Christian Brauner:
 "This contains cleanups for the lookup_one*() family of helpers.

  We expose a set of functions with names containing "lookup_one_len"
  and others without the "_len". This difference has nothing to do with
  "len". It's rather a historical accident that can be confusing.

  The functions without "_len" take a "mnt_idmap" pointer. This is found
  in the "vfsmount" and that is an important question when choosing
  which to use: do you have a vfsmount, or are you "inside" the
  filesystem. A related question is "is permission checking relevant
  here?".

  nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len
  functions. They pass nop_mnt_idmap and refuse to work on filesystems
  which have any other idmap.

  This work changes nfsd and cachefiles to use the lookup_one family of
  functions and to explicitly pass &nop_mnt_idmap which is consistent
  with all other vfs interfaces used where &nop_mnt_idmap is explicitly
  passed.

  The remaining uses of the "_one" functions do not require permission
  checks so these are renamed to be "_noperm" and the permission
  checking is removed.

  This series also changes these lookup functions to take a qstr instead
  of separate name and len. In many cases this simplifies the call"

* tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  VFS: change lookup_one_common and lookup_noperm_common to take a qstr
  Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
  VFS: rename lookup_one_len family to lookup_noperm and remove permission check
  cachefiles: Use lookup_one() rather than lookup_one_len()
  nfsd: Use lookup_one() rather than lookup_one_len()
  VFS: improve interface for lookup_one functions
2025-05-26 08:02:43 -07:00
Omar Sandoval
b2b4483b5d
dcache: convert dentry flag macros to enum
Commit 9748cb2dc3 ("VFS: repack DENTRY_ flags.") changed the value of
DCACHE_MOUNTED, which broke drgn's path_lookup() helper. drgn is forced
to hard-code it because it's a macro, and macros aren't preserved in
debugging information by default.

Enums, on the other hand, are included in debugging information. Convert
the DCACHE_* flag macros to an enum so that debugging tools like drgn
and bpftrace can make use of them.
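
The distinction can be illustrated with a small userspace sketch. A macro disappears during preprocessing, whereas an enumerator stays a named constant that debugging information can describe. The flag names and bit positions below are hypothetical and are not the kernel's actual DCACHE_* values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names and bit positions, for illustration only. */
enum ex_dentry_flags {
	EX_DCACHE_OP_HASH = 1 << 0,
	EX_DCACHE_MOUNTED = 1 << 16,
};

/* Enumerators are ordinary integer constant expressions. */
static_assert(EX_DCACHE_MOUNTED == (1 << 16), "enumerator keeps its value");

/* Test a flag by name rather than by a hard-coded number. */
static bool ex_is_mounted(unsigned int d_flags)
{
	return (d_flags & EX_DCACHE_MOUNTED) != 0;
}
```

A tool that reads the type information can then resolve the name at run time instead of carrying its own copy of the bit value.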

Link: 2027d0fea8/drgn/helpers/linux/fs.py (L43-L46)
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/177665a082f048cf536b9cd6af467b3be6b6e6ed.1744141838.git.osandov@fb.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-11 15:24:29 +02:00
NeilBrown
06c567403a
Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
try_lookup_noperm() and d_hash_and_lookup() are nearly identical.  The
former does some validation of the name where the latter doesn't.
Outside of the VFS that validation is likely valuable, and having only
one exported function for this task is certainly a good idea.

So make d_hash_and_lookup() local to VFS files and change all other
callers to try_lookup_noperm().  Note that the arguments are swapped.

Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250319031545.2999807-6-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08 11:24:41 +02:00
NeilBrown
5741909697
VFS: improve interface for lookup_one functions
The family of functions:
  lookup_one()
  lookup_one_unlocked()
  lookup_one_positive_unlocked()

appear designed to be used by external clients of the filesystem rather
than by filesystems acting on themselves as the lookup_one_len family
are used.

They are used by:
   btrfs/ioctl - which is a user-space interface rather than an internal
     activity
   exportfs - i.e. from nfsd or the open_by_handle_at interface
   overlayfs - to access the underlying filesystems
   smb/server - for file service

They should be used by nfsd (more than just the exportfs path) and
cachefs but aren't.

It would help if the documentation didn't claim they should "not be
called by generic code".

Also the path component name is passed as "name" and "len" which are
(confusingly?) separated by the "base".  In some cases the len is simply
"strlen" and so passing a qstr using QSTR() would make the calling
clearer.
Other callers do pass separate name and len which are stored in a
struct.  Sometimes these are already stored in a qstr, other times it
easily could be.

So this patch changes these three functions to receive a 'struct qstr *',
and improves the documentation.

QSTR_LEN() is added to make it easy to pass a QSTR containing a known
len.

[brauner@kernel.org: take a struct qstr pointer]
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:25:32 +02:00
Linus Torvalds
592329e5e9 Summary
* Move vm_table members out of kernel/sysctl.c
 
   All vm_table array members have moved to their respective subsystems leading
   to the removal of vm_table from kernel/sysctl.c. This increases modularity by
   placing the ctl_tables closer to where they are actually used and at the same
   time reducing the chances of merge conflicts in kernel/sysctl.c.
 
 * ctl_table range fixes
 
   Replace the proc_handler function that checks variable ranges in
   coredump_sysctls and vdso_table with the one that actually uses the extra{1,2}
   pointers as min/max values. This tightens the range of the values that users
   can pass into the kernel effectively preventing {under,over}flows.
 
 * Misc fixes
 
   Correct grammar errors and typos in test messages. Update sysctl files in
   MAINTAINERS. Constified and removed array size in declaration for
   alignment_tbl
 
 * Testing
 
   - These have all been in linux-next for at least 1 month
   - They have gone through 0-day
   - Ran all these through sysctl selftests in x86_64
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmfhV8EACgkQupfNUreW
 QU/udAv/VCXGkndQsJ5biXpXYFnokX0gIEaYzzHiqrFycZqr8ys0/wWzc+ar1LjF
 Jvanl2uKB0mUviLKt7Gk0+Hri+PJlYIrbx+5K5eo2wsKUUxFykqLLm59y/orPODl
 gyPQjKNpHJb7COsnEc3Lrq/fvol4NPHlcBPXG8NwehccTeBHZ1ninfo+pSnxh3o8
 kI3GSLLxD4K9AgBl5QuVWH4gU7o//u7lUkKzy03NW+2jmuRv3dRcYF7IdgMINNee
 AeXnygdSBxLzECBvmkfNdyg+AmL8hdsmzbsIh7UuJDvxLlQOInVLZa+sXBotCOIc
 TImCrr1Ws1OuGrD0kpH+21tJvc8pNFWt61QlulObQdrLndWHdZEGyGOusLpXTwbn
 jIWZmMvzk1foSwdgzwPFzUqPEpW3FrBVDo4Z4kenBDrCp56QTX7hGRvkNYJNKvot
 Ue+i8BeHR/Gm/p+UMqgsSTOaNJXTqZhFqwJQVzxU/9LN/vkS0On6fbjgBd5X6Pn+
 a5dlc9gy
 =0bcX
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl updates from Joel Granados:

 - Move vm_table members out of kernel/sysctl.c

   All vm_table array members have moved to their respective subsystems
   leading to the removal of vm_table from kernel/sysctl.c. This
   increases modularity by placing the ctl_tables closer to where they
   are actually used and at the same time reducing the chances of merge
   conflicts in kernel/sysctl.c.

 - ctl_table range fixes

   Replace the proc_handler function that checks variable ranges in
   coredump_sysctls and vdso_table with the one that actually uses the
   extra{1,2} pointers as min/max values. This tightens the range of the
   values that users can pass into the kernel effectively preventing
   {under,over}flows.

 - Misc fixes

   Correct grammar errors and typos in test messages. Update sysctl
   files in MAINTAINERS. Constified and removed array size in
   declaration for alignment_tbl
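
The range tightening described under "ctl_table range fixes" can be sketched in userspace C as follows. This is only a model of the idea (reject a write whose value falls outside the bounds that extra1 and extra2 point to); struct ex_ctl and ex_write_minmax() are hypothetical names and do not reproduce the kernel's handler signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal stand-in for the parts of a sysctl table entry used here. */
struct ex_ctl {
	int *data;		/* the variable being exposed */
	const int *extra1;	/* minimum, or NULL for no lower bound */
	const int *extra2;	/* maximum, or NULL for no upper bound */
};

/* Store val only if it lies within [*extra1, *extra2]. */
static int ex_write_minmax(const struct ex_ctl *t, int val)
{
	assert(t != NULL && t->data != NULL);
	if ((t->extra1 && val < *t->extra1) ||
	    (t->extra2 && val > *t->extra2))
		return -EINVAL;	/* out of range: leave *data untouched */
	*t->data = val;
	return 0;
}
```

A handler that ignores the two bounds would accept any integer, which is how under- and overflows can reach the variable.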

* tag 'sysctl-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (22 commits)
  selftests/sysctl: fix wording of help messages
  selftests: fix spelling/grammar errors in sysctl/sysctl.sh
  MAINTAINERS: Update sysctl file list in MAINTAINERS
  sysctl: Fix underflow value setting risk in vm_table
  coredump: Fixes core_pipe_limit sysctl proc_handler
  sysctl: remove unneeded include
  sysctl: remove the vm_table
  sh: vdso: move the sysctl to arch/sh/kernel/vsyscall/vsyscall.c
  x86: vdso: move the sysctl to arch/x86/entry/vdso/vdso32-setup.c
  fs: dcache: move the sysctl to fs/dcache.c
  sunrpc: simplify rpcauth_cache_shrink_count()
  fs: drop_caches: move sysctl to fs/drop_caches.c
  fs: fs-writeback: move sysctl to fs/fs-writeback.c
  mm: nommu: move sysctl to mm/nommu.c
  security: min_addr: move sysctl to security/min_addr.c
  mm: mmap: move sysctl to mm/mmap.c
  mm: util: move sysctls to mm/util.c
  mm: vmscan: move vmscan sysctls to mm/vmscan.c
  mm: swap: move sysctl to mm/swap.c
  mm: filemap: move sysctl to mm/filemap.c
  ...
2025-03-26 21:02:05 -07:00
Christian Brauner
71628584df
Merge patch series "prep patches for my mkdir series"
NeilBrown <neilb@suse.de> says:

These two cleanup patches are dependencies for my mkdir changes and
subsequent directory locking changes.

* patches from https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de: (2 commits)
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()

Link: https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 09:25:34 +01:00
NeilBrown
3ff6c8707c
nfs/vfs: discard d_exact_alias()
d_exact_alias() is a descendant of d_add_unique() which was introduced
20 years ago, most likely to work around problems with NFS servers of
the time.  It is now not used in several situations where it was
originally needed and there have been no reports of problems -
presumably the old NFS servers have been improved.  The only place it
is now used is in NFSv4 code and the old problematic servers are thought
to have been v2/v3 only.

There is no clear benefit in reusing an unhashed() dentry which happens
to have the same name as the dentry we are adding.

So this patch removes d_exact_alias() and the one place that it is used.

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250226062135.2043651-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26 09:55:18 +01:00
NeilBrown
9748cb2dc3 VFS: repack DENTRY_ flags.
Bits 13, 23, 24, and 27 are not used.  Move all those holes to the end.

Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250206054504.2950516-7-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-10 10:41:00 +01:00
Kaixiong Yu
52e66823e0 fs: dcache: move the sysctl to fs/dcache.c
sysctl_vfs_cache_pressure belongs to fs/dcache.c, so move it there from
kernel/sysctl.c. As part of the fs/dcache.c cleanup,
sysctl_vfs_cache_pressure becomes a static variable, the inline
function vfs_pressure_ratio() becomes an out-of-line function, and
vfs_pressure_ratio() is exported with EXPORT_SYMBOL_GPL for use by other
files. Move the unneeded include (linux/dcache.h).

Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-02-07 16:53:04 +01:00
Linus Torvalds
a86bf2283d assorted stuff for this merge window
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZ5yJdgAKCRBZ7Krx/gZQ
 69W4AQDwgxceiQ6icx3rFhCWQigne4jdMO84kd8tNaa+xHGe1AD/WnkeChc5DqjQ
 wZWZxAAzml9SS01IcSiHWaF5fgrjlA0=
 =rXOq
 -----END PGP SIGNATURE-----

Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc vfs cleanups from Al Viro:
 "Two unrelated patches - one is a removal of long-obsolete include in
  overlayfs (it used to need fs/internal.h, but the extern it wanted has
  been moved back to include/linux/namei.h) and another introduces
  convenience helper constructing struct qstr by a NUL-terminated
  string"

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add a string-to-qstr constructor
  fs/overlayfs/namei.c: get rid of include ../internal.h
2025-02-01 15:07:56 -08:00
Al Viro
c1feab95e0 add a string-to-qstr constructor
Quite a few places want to build a struct qstr from a given string;
it would be convenient to have a primitive doing that, rather
than open-coding it via QSTR_INIT().

The closest approximation was in bcachefs, but that expands to
initializer list - {.len = strlen(string), .name = string}.
It would be more useful to have it as compound literal -
(struct qstr){.len = strlen(string), .name = string}.

Unlike initializer list it's a valid expression.  What's more,
it's a valid lvalue - it's an equivalent of anonymous local
variable with such initializer, so the things like
	path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name));
are valid.  It can also be used as initializer, with identical
effect -
	struct qstr x = (struct qstr){.name = s, .len = strlen(s)};
is equivalent to
	struct qstr anon_variable = {.name = s, .len = strlen(s)};
	struct qstr x = anon_variable;
	// anon_variable is never used after that point
and any even remotely sane compiler will manage to collapse that
into
	struct qstr x = {.name = s, .len = strlen(s)};

What compound literals can't be used for is initialization of
global variables, but those are covered by QSTR_INIT().

This commit lifts definition(s) of QSTR() into linux/dcache.h,
converts it to compound literal (all bcachefs users are fine
with that) and converts assorted open-coded instances to using
that.
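
A minimal userspace sketch of the compound-literal form follows. It assumes a simplified struct qstr with only the two fields named above (the kernel's structure carries more), and ex_name_len() and ex_demo() are hypothetical helpers added for illustration.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in: only the fields named in the commit message. */
struct qstr {
	unsigned int len;
	const char *name;
};

/* Compound literal: a valid expression and a valid lvalue. */
#define QSTR(s) ((struct qstr){ .len = (unsigned int)strlen(s), .name = (s) })

/* Hypothetical consumer taking a pointer, the way d_alloc_pseudo() does. */
static unsigned int ex_name_len(const struct qstr *q)
{
	assert(q != NULL);
	return q->len;
}

/* The same macro works as an initializer and as an addressable argument. */
static unsigned int ex_demo(const char *s)
{
	struct qstr x = QSTR(s);

	return ex_name_len(&QSTR(s)) + x.len;
}
```

An initializer-list version of the macro would compile only in the declaration of x; taking the address of it, as the call to ex_name_len() does, needs the compound literal.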

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27 19:25:45 -05:00
Al Viro
30d61efe11 9p: fix ->rename_sem exclusion
9p wants to be able to build a path from given dentry to fs root and keep
it valid over a blocking operation.

->s_vfs_rename_mutex would be a natural candidate, but there are places
where we need that and where we have no way to tell if ->s_vfs_rename_mutex
is already held deeper in callchain.  Moreover, it's only held for
cross-directory renames; name changes within the same directory happen
without it.

Solution:
	* have d_move() done in ->rename() rather than in its caller
	* maintain a 9p-private rwsem (per-filesystem)
	* hold it exclusive over the relevant part of ->rename()
	* hold it shared over the places where we want the path.

That almost works.  FS_RENAME_DOES_D_MOVE is enough to put all d_move()
and d_exchange() calls under filesystem's control.  However, there's
also __d_unalias(), which isn't covered by any of that.

If ->lookup() hits a directory inode with preexisting dentry elsewhere
(due to e.g. rename done on server behind our back), d_splice_alias()
called by ->lookup() will move/rename that alias.

Add a couple of optional methods, so that __d_unalias() would do
	if alias->d_op->d_unalias_trylock != NULL
		if (!alias->d_op->d_unalias_trylock(alias))
			fail (resulting in -ESTALE from lookup)
	__d_move(...)
	if alias->d_op->d_unalias_unlock != NULL
		alias->d_op->d_unalias_unlock(alias)
where it currently does __d_move().  9p instances do down_write_trylock()
and up_write() of ->rename_sem.
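
The optional-method pattern in the pseudocode above can be modelled in plain C. This is a userspace sketch under stated assumptions: struct ex_dentry, struct ex_dentry_ops, ex_unalias() and the always-busy ex_trylock_busy() are hypothetical, and a plain integer stands in for the effect of __d_move().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_dentry;

/* Both hooks are optional; NULL means no extra exclusion is needed. */
struct ex_dentry_ops {
	bool (*d_unalias_trylock)(const struct ex_dentry *);
	void (*d_unalias_unlock)(const struct ex_dentry *);
};

struct ex_dentry {
	const struct ex_dentry_ops *d_op;
	int moved;	/* stands in for the effect of __d_move() */
};

/* Move the alias, honouring the hooks if the filesystem supplies them. */
static int ex_unalias(struct ex_dentry *alias)
{
	const struct ex_dentry_ops *ops;

	assert(alias != NULL);
	ops = alias->d_op;
	if (ops && ops->d_unalias_trylock && !ops->d_unalias_trylock(alias))
		return -ESTALE;	/* fail the lookup instead of blocking */
	alias->moved = 1;
	if (ops && ops->d_unalias_unlock)
		ops->d_unalias_unlock(alias);
	return 0;
}

/* Hypothetical hook that always reports the lock as contended. */
static bool ex_trylock_busy(const struct ex_dentry *d)
{
	(void)d;
	return false;
}

static const struct ex_dentry_ops ex_busy_ops = {
	.d_unalias_trylock = ex_trylock_busy,
};
```

A dentry without hooks is moved unconditionally, while one whose trylock fails is left alone and the caller sees -ESTALE.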

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27 19:25:24 -05:00
Al Viro
5be1fa8abd Pass parent directory inode and expected name to ->d_revalidate()
->d_revalidate() often needs to access dentry parent and name; that has
to be done carefully, since the locking environment varies from caller
to caller.  We are not guaranteed that dentry in question will not be
moved right under us - not unless the filesystem is such that nothing
on it ever gets renamed.

It can be dealt with, but that results in boilerplate code that isn't
even needed - the callers normally have just found the dentry via dcache
lookup and want to verify that it's in the right place; they already
have the values of ->d_parent and ->d_name stable.  There are a couple
of exceptions (overlayfs and, to a lesser extent, ecryptfs), but for the
majority of calls that song and dance is not needed at all.

It's easier to make ecryptfs and overlayfs find and pass those values if
there's a ->d_revalidate() instance to be called, rather than doing that
in the instances.

This commit only changes the calling conventions; making use of supplied
values is left to followups.

NOTE: some instances need more than just the parent - things like CIFS
may need to build an entire path from filesystem root, so they need
more precautions than the usual boilerplate.  This series doesn't
do anything to that need - these filesystems have to keep their locking
mechanisms (rename_lock loops, use of dentry_path_raw(), private rwsem
a-la v9fs).

One thing to keep in mind when using name is that name->name will normally
point into the pathname being resolved; the filename in question occupies
name->len bytes starting at name->name, and there is NUL somewhere after it,
but the next byte might very well be '/' rather than '\0'.  Do not
ignore name->len.
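
A short userspace sketch of why the length matters follows. It assumes a simplified two-field name structure, and ex_name_is() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for the name passed to ->d_revalidate(). */
struct ex_qstr {
	unsigned int len;
	const char *name;
};

/* Compare exactly name->len bytes; never rely on a terminating NUL. */
static bool ex_name_is(const struct ex_qstr *name, const char *expected)
{
	assert(name != NULL && expected != NULL);
	return name->len == strlen(expected) &&
	       memcmp(name->name, expected, name->len) == 0;
}
```

With name->name pointing at the start of "etc/passwd" and name->len equal to 3, ex_name_is() matches "etc", whereas strcmp() on name->name would compare against the whole remaining pathname.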

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27 19:25:23 -05:00
Al Viro
58cf9c383c dcache: back inline names with a struct-wrapped array of unsigned long
... so that they can be copied with struct assignment (which generates
better code) and accessed word-by-word.

The type is union shortname_storage; it's a union of arrays of
unsigned char and unsigned long.

struct name_snapshot.inline_name turned into union shortname_storage;
users (all in fs/dcache.c) adjusted.

struct dentry.d_iname has some users outside of fs/dcache.c; to
reduce the amount of noise in commit, it is replaced with
union shortname_storage d_shortname and d_iname is turned into a macro
that expands to d_shortname.string (similar to d_lock handling).
That compat macro is temporary - most of the remaining instances will
be taken out by debugfs series, and once that is merged and few others
are taken care of this will go away.
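
The layout can be sketched in userspace C as below. The word count is a hypothetical value (the kernel chooses DNAME_INLINE_WORDS per configuration), and union ex_shortname with its two helpers is an illustration rather than the kernel's definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical size, chosen only for this example. */
#define EX_INLINE_WORDS 5
#define EX_INLINE_LEN (EX_INLINE_WORDS * sizeof(unsigned long))

/* The same storage viewed as bytes or as machine words. */
union ex_shortname {
	unsigned char string[EX_INLINE_LEN];
	unsigned long words[EX_INLINE_WORDS];
};

static_assert(sizeof(union ex_shortname) == EX_INLINE_LEN,
	      "length must be a whole number of words");

/* Copy by plain assignment instead of a byte-wise memcpy(). */
static void ex_copy_name(union ex_shortname *dst, const union ex_shortname *src)
{
	*dst = *src;
}

/* Compare word by word rather than byte by byte. */
static bool ex_same_name(const union ex_shortname *a, const union ex_shortname *b)
{
	for (size_t i = 0; i < EX_INLINE_WORDS; i++)
		if (a->words[i] != b->words[i])
			return false;
	return true;
}
```

Keeping the byte length a multiple of the word size is what lets both views cover the storage exactly, with no tail bytes left outside the word array.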

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-17 17:46:05 -05:00
Al Viro
61bc24ac97 make sure that DNAME_INLINE_LEN is a multiple of word size
... calling the number of words DNAME_INLINE_WORDS.

The next step will be to have a structure to hold inline name arrays
(both in dentry and in name_snapshot) and use that to alias the
existing arrays of unsigned char there.  That will allow both
full-structure copies and convenient word-by-word accesses.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-17 17:46:05 -05:00
Linus Torvalds
2aae1d67fd vfs-6.11.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEG2wAKCRCRxhvAZXjc
 ooW/AQDzyY+xNGt4OPMvlyFUHd5RcyiLsMhYrkKc3FaIFjesVgD+PFW5PPW12c0V
 Z4VHg9w1HDDuUn4XvELs7OXZpek7RgU=
 =eDC8
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode / dentry updates from Christian Brauner:
 "This contains smaller performance improvements to inodes and dentries:

  inode:

   - Add rcu based inode lookup variants.

     They avoid one inode hash lock acquire in the common case thereby
     significantly reducing contention. We already support RCU-based
     operations but didn't take advantage of them during inode
     insertion.

     Callers of iget_locked() get the improvement without any code
     changes. Callers that need a custom callback can switch to
     iget5_locked_rcu() as e.g., did btrfs.

     With 20 threads each walking a dedicated 1000 dirs * 1000 files
     directory tree to stat(2) on a 32 core + 24GB ram vm:

        before: 3.54s user 892.30s system 1966% cpu 45.549 total
        after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)

     Long-term we should pick up the effort to introduce more
     fine-grained locking and possibly improve on the currently used
     hash implementation.

   - Start zeroing i_state in inode_init_always() instead of doing it in
     individual filesystems.

     This allows us to remove an unneeded lock acquire in new_inode()
     and not burden individual filesystems with this.

  dcache:

   - Move d_lockref out of the area used by RCU lookup to avoid
     cacheline ping pong because the embedded name is sharing a
     cacheline with d_lockref.

   - Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
     up with 128 bytes in total"

* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: fix dentry size
  vfs: move d_lockref out of the area used by RCU lookup
  bcachefs: remove now spurious i_state initialization
  xfs: remove now spurious i_state initialization in xfs_inode_alloc
  vfs: partially sanitize i_state zeroing on inode creation
  xfs: preserve i_state around inode_init_always in xfs_reinit_inode
  btrfs: use iget5_locked_rcu
  vfs: add rcu-based find_inode variants for iget ops
2024-07-15 11:39:44 -07:00
Christian Brauner
dc99c0ff53
fs: fix dentry size
On CONFIG_SMP=y and on 32bit we need to decrease DNAME_INLINE_LEN to 36
bytes to end up with 128 bytes in total.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Links: https://lore.kernel.org/r/CAHk-=whtoqTSCcAvV-X-KPqoDWxS4vxmWpuKLB+Vv8=FtUd5vA@mail.gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03 10:34:11 +02:00
Mateusz Guzik
18a5daf0e4
vfs: move d_lockref out of the area used by RCU lookup
Stock kernel scales worse than FreeBSD when doing a 20-way stat(2) on
the same tmpfs-backed file.

According to perf top:
  38.09%  lockref_put_return
  26.08%  lockref_get_not_dead
  25.60%  __d_lookup_rcu
   0.89%  clear_bhb_loop

__d_lookup_rcu is participating in cacheline ping pong due to the
embedded name sharing a cacheline with lockref.

Moving it out resolves the problem:
  41.50%  lockref_put_return
  41.03%  lockref_get_not_dead
   1.54%  clear_bhb_loop

benchmark (will-it-scale, Sapphire Rapids, tmpfs, ops/s):
FreeBSD:	7219334
before:	5038006
after:	7842883 (+55%)

One minor remark: the 'after' result is unstable, fluctuating in the
range ~7.8 mln to ~9 mln during different runs.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240613001215.648829-3-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03 10:34:10 +02:00
Mateusz Guzik
f378ec4eec
vfs: rename parent_ino to d_parent_ino and make it use RCU
The routine is used by procfs through dir_emit_dots.

The combined RCU and lock fallback implementation is too big for an
inline. Given that the routine takes a dentry argument fs/dcache.c seems
like the place to put it in.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240627161152.802567-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-27 18:34:21 +02:00
Linus Torvalds
7ea65c89d8 vfs-6.9.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem3wQAKCRCRxhvAZXjc
 otRMAQDeo8qsuuIAcS2KUicKqZR5yMVvrY9r4sQzf7YRcJo5HQD+NQXkKwQuv1VO
 OUeScsic/+I+136AgdjWnlEYO5dp0go=
 =4WKU
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Misc features, cleanups, and fixes for vfs and individual filesystems.

  Features:

   - Support idmapped mounts for hugetlbfs.

   - Add RWF_NOAPPEND flag for pwritev2(). This allows us to fix a bug
     where the passed offset is ignored if the file is O_APPEND. The new
     flag allows a caller to enforce that the offset is honored to
     conform to posix even if the file was opened in append mode.

   - Move i_mmap_rwsem in struct address_space to avoid false sharing
     between i_mmap and i_mmap_rwsem.

   - Convert efs, qnx4, and coda to use the new mount api.

   - Add a generic is_dot_dotdot() helper that's used by various
     filesystems and the VFS code instead of open-coding it multiple
     times.

   - Recently we've added stable offsets which allows stable ordering
     when iterating directories exported through NFS on e.g., tmpfs
     filesystems. Originally an xarray was used for the offset map but
     that caused slab fragmentation issues over time. This switches the
     offset map to the maple tree which has a dense mode that handles
     this scenario a lot better. Includes tests.

   - Finally merge the case-insensitive improvement series Gabriel has
     been working on for a long time. This cleanly propagates case
     insensitive operations through ->s_d_op which in turn allows us to
     remove the quite ugly generic_set_encrypted_ci_d_ops() operations.
     It also improves performance by trying a case-sensitive comparison
     first and then fallback to case-insensitive lookup if that fails.
     This also fixes a bug where overlayfs would be able to be mounted
     over a case insensitive directory which would lead to all sort of
     odd behaviors.

  Cleanups:

   - Make file_dentry() a simple accessor now that ->d_real() is
     simplified because of the backing file work we did the last two
     cycles.

   - Use the dedicated file_mnt_idmap helper in ntfs3.

   - Use smp_load_acquire/store_release() in the i_size_read/write
     helpers and thus remove the hack to handle i_size reads in the
     filemap code.

   - The SLAB_MEM_SPREAD is a nop now. Remove it from various places in
     fs/

   - It's no longer necessary to perform a second built-in initramfs
     unpack call because we retain the contents of the previous
     extraction. Remove it.

   - Now that we have removed various allocators kfree_rcu() always
     works with kmem caches and kmalloc(). So simplify various places
     that only use an rcu callback in order to handle the kmem cache
     case.

   - Convert the pipe code to use a lockdep comparison function instead
     of open-coding the nesting making lockdep validation easier.

   - Move code into fs-writeback.c that was located in a header but can
     be made static as it's only used in that one file.

   - Rewrite the alignment checking iterators for iovec and bvec to be
     easier to read, and also significantly more compact in terms of
     generated code. This saves 270 bytes of text on x86-64 (with
     clang-18) and 224 bytes on arm64 (with gcc-13). In profiles it also
     saves a bit of time for the same workload.

   - Switch various places to use KMEM_CACHE instead of
     kmem_cache_create().

   - Use inode_set_ctime_to_ts() in inode_set_ctime_current()

   - Use kzalloc() in name_to_handle_at() to avoid kernel infoleak.

   - Various smaller cleanups for eventfds.

  Fixes:

   - Fix various comments and typos, and unneeded initializations.

   - Fix stack allocation hack for clang in the select code.

   - Improve dump_mapping() debug code on a best-effort basis.

   - Fix build errors in various selftests.

   - Avoid wrap-around instrumentation in various places.

   - Don't allow user namespaces without an idmapping to be used for
     idmapped mounts.

   - Fix sysv sb_read() call.

   - Fix fallback implementation of the get_name() export operation"

* tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (70 commits)
  hugetlbfs: support idmapped mounts
  qnx4: convert qnx4 to use the new mount api
  fs: use inode_set_ctime_to_ts to set inode ctime to current time
  libfs: Drop generic_set_encrypted_ci_d_ops
  ubifs: Configure dentry operations at dentry-creation time
  f2fs: Configure dentry operations at dentry-creation time
  ext4: Configure dentry operations at dentry-creation time
  libfs: Add helper to choose dentry operations at mount-time
  libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
  fscrypt: Drop d_revalidate once the key is added
  fscrypt: Drop d_revalidate for valid dentries during lookup
  fscrypt: Factor out a helper to configure the lookup dentry
  ovl: Always reject mounting over case-insensitive directories
  libfs: Attempt exact-match comparison first during casefolded lookup
  efs: remove SLAB_MEM_SPREAD flag usage
  jfs: remove SLAB_MEM_SPREAD flag usage
  minix: remove SLAB_MEM_SPREAD flag usage
  openpromfs: remove SLAB_MEM_SPREAD flag usage
  proc: remove SLAB_MEM_SPREAD flag usage
  qnx6: remove SLAB_MEM_SPREAD flag usage
  ...
2024-03-11 09:38:17 -07:00
Al Viro
7e4a205fe5 Revert "get rid of DCACHE_GENOCIDE"
This reverts commit 5785160732.

Unfortunately, while we only call that thing once, the callback
*can* be called more than once for the same dentry - all it
takes is rename_lock being touched while we are in d_walk().
For now let's revert it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-09 23:31:16 -05:00
Amir Goldstein
11b3f8ae70 fs: remove the inode argument to ->d_real() method
The only remaining user of ->d_real() method is d_real_inode(), which
passed NULL inode argument to get the real data dentry.

There are no longer any users that call ->d_real() with a non-NULL
inode argument for getting a dentry from a specific underlying layer.

Remove the inode argument of the method and replace it with an integer
'type' argument, to allow callers to request the real metadata dentry
instead of the real data dentry.

All the current users of d_real_inode() (e.g. uprobe) continue to get
the real data inode.  Callers that need to get the real metadata inode
(e.g. IMA/EVM) can use d_inode(d_real(dentry, D_REAL_METADATA)).
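
A rough userspace sketch of the new calling convention, assuming a stacked dentry that keeps separate pointers to its data and metadata layers. All `toy_` names and the struct layout are hypothetical; only the idea of a 'type' selector replacing the inode argument comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector, modelled on the 'type' argument described above. */
enum toy_d_real_type {
	TOY_D_REAL_DATA,
	TOY_D_REAL_METADATA,
};

struct toy_inode {
	unsigned long ino;
};

struct toy_dentry {
	struct toy_inode *inode;
	/* Non-NULL only on a stacked (overlay-like) dentry in this model. */
	struct toy_dentry *real_data;
	struct toy_dentry *real_metadata;
};

/* Return the underlying dentry of the requested kind, or the dentry itself. */
static struct toy_dentry *toy_d_real(struct toy_dentry *dentry,
				     enum toy_d_real_type type)
{
	struct toy_dentry *real;

	assert(dentry != NULL);
	real = (type == TOY_D_REAL_METADATA) ? dentry->real_metadata
					     : dentry->real_data;
	return real ? real : dentry;
}

/* Existing callers keep getting the data inode without passing a type. */
static struct toy_inode *toy_d_real_inode(struct toy_dentry *dentry)
{
	return toy_d_real(dentry, TOY_D_REAL_DATA)->inode;
}
```

A caller that wants the metadata inode would spell it out, e.g. `toy_d_real(dentry, TOY_D_REAL_METADATA)->inode`, which mirrors the `d_inode(d_real(dentry, D_REAL_METADATA))` form quoted above.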

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20240202110132.1584111-3-amir73il@gmail.com
Tested-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-06 17:00:12 +01:00
Vegard Nossum
1b6ae9f6e6 dcache: remove unnecessary NULL check in dget_dlock()
dget_dlock() requires dentry->d_lock to be held when called, yet
contains a NULL check for dentry.

An audit of all calls to dget_dlock() shows that it is never called
with a NULL pointer (as spin_lock()/spin_unlock() would crash in these
cases):

  $ git grep -W '\<dget_dlock\>'

  arch/powerpc/platforms/cell/spufs/inode.c-              spin_lock(&dentry->d_lock);
  arch/powerpc/platforms/cell/spufs/inode.c-              if (simple_positive(dentry)) {
  arch/powerpc/platforms/cell/spufs/inode.c:                      dget_dlock(dentry);

  fs/autofs/expire.c-             spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
  fs/autofs/expire.c-             if (simple_positive(child)) {
  fs/autofs/expire.c:                     dget_dlock(child);

  fs/autofs/root.c:                       dget_dlock(active);
  fs/autofs/root.c-                       spin_unlock(&active->d_lock);

  fs/autofs/root.c:                       dget_dlock(expiring);
  fs/autofs/root.c-                       spin_unlock(&expiring->d_lock);

  fs/ceph/dir.c-          if (!spin_trylock(&dentry->d_lock))
  fs/ceph/dir.c-                  continue;
  [...]
  fs/ceph/dir.c:                          dget_dlock(dentry);

  fs/ceph/mds_client.c-           spin_lock(&alias->d_lock);
  [...]
  fs/ceph/mds_client.c:                   dn = dget_dlock(alias);

  fs/configfs/inode.c-            spin_lock(&dentry->d_lock);
  fs/configfs/inode.c-            if (simple_positive(dentry)) {
  fs/configfs/inode.c:                    dget_dlock(dentry);

  fs/libfs.c:                             found = dget_dlock(d);
  fs/libfs.c-                     spin_unlock(&d->d_lock);

  fs/libfs.c:             found = dget_dlock(child);
  fs/libfs.c-     spin_unlock(&child->d_lock);

  fs/libfs.c:                             child = dget_dlock(d);
  fs/libfs.c-                     spin_unlock(&d->d_lock);

  fs/ocfs2/dcache.c:                      dget_dlock(dentry);
  fs/ocfs2/dcache.c-                      spin_unlock(&dentry->d_lock);

  include/linux/dcache.h:static inline struct dentry *dget_dlock(struct dentry *dentry)

After taking out the NULL check, dget_dlock() becomes almost identical
to __dget_dlock(); the only difference is that dget_dlock() returns the
dentry that was passed in. These are static inline helpers, so we can
rely on the compiler to discard unused return values. We can therefore
also remove __dget_dlock() and replace calls to it by dget_dlock().
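
As an illustration of why one helper is enough, here is a minimal userspace model. `toy_dentry`, its `locked` field (standing in for ->d_lock being held) and `toy_grab()` are assumptions made for the sketch, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_dentry {
	bool locked;	/* stands in for ->d_lock being held */
	int count;	/* stands in for the reference count */
};

/*
 * Model of a dget_dlock()-style helper: the caller already holds the
 * lock, so the pointer cannot be NULL and no NULL check is needed.
 * Returning the dentry lets callers write "found = toy_dget_dlock(d)";
 * callers that ignore the value lose nothing, since the helper is inline.
 */
static inline struct toy_dentry *toy_dget_dlock(struct toy_dentry *dentry)
{
	assert(dentry->locked);
	dentry->count++;
	return dentry;
}

/* Caller pattern from the grep output above: lock, grab, unlock. */
static struct toy_dentry *toy_grab(struct toy_dentry *child)
{
	struct toy_dentry *found;

	child->locked = true;
	found = toy_dget_dlock(child);
	child->locked = false;
	return found;
}
```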

Also fix up and improve the kerneldoc comments while we're at it.

Al Viro pointed out that we can also clean up some of the callers to
make use of the returned value and provided a bit more info for the
kerneldoc.

While preparing v2 I also noticed that the tabs used in the kerneldoc
comments were causing the kerneldoc to get parsed incorrectly so I also
fixed this up (including for d_unhashed, which is otherwise unrelated).

Testing: x86 defconfig build + boot; make htmldocs for the kerneldoc
warning. objdump shows there are code generation changes.

Link: https://lore.kernel.org/all/20231022164520.915013-1-vegard.nossum@oracle.com/
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25 02:51:56 -05:00
Al Viro
1b327b5ac5 kill DCACHE_MAY_FREE
With the new ordering in __dentry_kill() it has become redundant -
it's set if and only if both DCACHE_DENTRY_KILLED and DCACHE_SHRINK_LIST
are set.

We set it in __dentry_kill(), after having set DCACHE_DENTRY_KILLED
with the only condition being that DCACHE_SHRINK_LIST is there;
all of that is done without dropping ->d_lock and the only place
that checks that flag (shrink_dentry_list()) does so under ->d_lock,
after having found the victim on its shrink list.  Since DCACHE_SHRINK_LIST
is set only when placing dentry into shrink list and removed only by
shrink_dentry_list() itself, a check for DCACHE_DENTRY_KILLED in
there would be equivalent to check for DCACHE_MAY_FREE.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25 02:51:52 -05:00
Al Viro
119dcc73a9 Merge branches 'work.dcache-misc' and 'work.dcache2' into work.dcache 2023-11-25 02:51:35 -05:00
Al Viro
5785160732 get rid of DCACHE_GENOCIDE
... now that we never call d_genocide() other than from kill_litter_super()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25 02:50:19 -05:00
Al Viro
8a54b38f3e d_genocide(): move the extern into fs/internal.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25 02:50:15 -05:00
Al Viro
f2824db1b4 kill d_instantiate_anon(), fold __d_instantiate_anon() into remaining caller
now that the only user of d_instantiate_anon() is gone...
[braino fix folded - kudos to Dan Carpenter]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25 02:48:13 -05:00
Al Viro
da549bdd15 dentry: switch the lists of children to hlist
Saves a pointer per struct dentry and actually makes the things less
clumsy.  Cleaned the d_walk() and dcache_readdir() a bit by use
of hlist_for_... iterators.

A couple of new helpers - d_first_child() and d_next_sibling(),
to make the expressions less awful.
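
A userspace sketch of the idea, assuming a singly-headed hlist-style layout. The `toy_` types and helpers are hypothetical stand-ins for the kernel's hlist and for d_first_child()/d_next_sibling(), and the field names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* hlist-style node and head: the head is a single pointer. */
struct toy_hlist_node {
	struct toy_hlist_node *next, **pprev;
};
struct toy_hlist_head {
	struct toy_hlist_node *first;
};
/* Doubly-linked list head, shown only for the size comparison. */
struct toy_list_head {
	struct toy_list_head *next, *prev;
};

struct toy_dentry {
	int id;
	struct toy_hlist_node d_sib;		/* link in parent's child list */
	struct toy_hlist_head d_children;	/* our own children */
};

#define toy_entry(ptr) \
	((struct toy_dentry *)((char *)(ptr) - offsetof(struct toy_dentry, d_sib)))

/* Insert child at the head of parent's child list. */
static void toy_add_child(struct toy_dentry *parent, struct toy_dentry *child)
{
	struct toy_hlist_node *first;

	assert(parent != NULL && child != NULL);
	first = parent->d_children.first;
	child->d_sib.next = first;
	if (first)
		first->pprev = &child->d_sib.next;
	parent->d_children.first = &child->d_sib;
	child->d_sib.pprev = &parent->d_children.first;
}

static struct toy_dentry *toy_first_child(const struct toy_dentry *parent)
{
	struct toy_hlist_node *n = parent->d_children.first;

	return n ? toy_entry(n) : NULL;
}

static struct toy_dentry *toy_next_sibling(const struct toy_dentry *dentry)
{
	struct toy_hlist_node *n = dentry->d_sib.next;

	return n ? toy_entry(n) : NULL;
}

/* Iteration reads naturally with the two helpers. */
static int toy_count_children(const struct toy_dentry *parent)
{
	int n = 0;

	for (struct toy_dentry *c = toy_first_child(parent); c;
	     c = toy_next_sibling(c))
		n++;
	return n;
}
```

The saved pointer is visible in the types: the hlist-style head holds one pointer where the doubly-linked head holds two.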

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25 02:32:13 -05:00
Al Viro
698f1e2b71 kill d_backing_dentry()
no users left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-18 16:21:42 -05:00
Al Viro
2fcd38f4de [software coproarchaeology] dentry.h: kill a mysterious comment
there's a strange comment in front of d_lookup() declaration:

/* appendix may either be NULL or be used for transname suffixes */

Looks like nobody had been curious enough to track its history;
it predates git, it predates bitkeeper and if you look through
the pre-BK trees, you finally arrive at this in 2.1.44-for-davem:
  /* appendix may either be NULL or be used for transname suffixes */
 -extern struct dentry * d_lookup(struct inode * dir, struct qstr * name,
 -                               struct qstr * appendix);
 +extern struct dentry * d_lookup(struct dentry * dir, struct qstr * name);
In other words, it refers to the third argument d_lookup() used to have
back then.  It had been introduced in 2.1.43-pre, on June 12 1997,
along with d_lookup(), only to be removed by July 4 1997, presumably
when the Cthulhu-awful thing it used to be used for (look for
CONFIG_TRANS_NAMES in 2.1.43-pre, and keep a heavy-duty barfbag
ready) had been, er, noticed and recognized for what it had been.

Despite the appendectomy, the comment remained.  Some things really
need to be put out of their misery...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-18 16:21:38 -05:00
Al Viro
0d486510f8 dentry.h: trim externs
d_instantiate_unique() had been gone for 7 years; __d_lookup...()
and shrink_dcache_for_umount() are fs/internal.h fodder.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-18 16:21:33 -05:00
Al Viro
8219cb58fe kill d_{is,set}_fallthru()
Introduced in 2015 and never had any in-tree users...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-18 16:21:24 -05:00
Al Viro
0bec65a80f DCACHE_COOKIE: RIP
the last user gone in 2021...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-18 16:20:38 -05:00
Al Viro
641c3ef5cb DCACHE_... ->d_flags bits: switch to BIT()
For bits 20..22 (inode type cached in ->d_flags) turn the definitions into
expressions like (5 << 20); everything else turns into straight use of
BIT()
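
To illustrate the distinction, a small sketch with made-up names and bit positions (the `TOY_` macros are not the kernel's definitions): single-bit flags use a BIT()-style macro, while a 3-bit field in bits 20..22 holds a small integer and is therefore written as a shifted value.

```c
#include <assert.h>

#define TOY_BIT(nr)		(1u << (nr))

/* Independent single-bit flags; positions chosen only for the example. */
#define TOY_FLAG_A		TOY_BIT(0)
#define TOY_FLAG_B		TOY_BIT(7)

/* Bits 20..22 form one field holding a value 0..7, not three flags. */
#define TOY_TYPE_SHIFT		20
#define TOY_TYPE_MASK		(7u << TOY_TYPE_SHIFT)
#define TOY_TYPE_DIRECTORY	(2u << TOY_TYPE_SHIFT)
#define TOY_TYPE_SPECIAL	(5u << TOY_TYPE_SHIFT)

/* Replace the type field while leaving all other flag bits untouched. */
static unsigned int toy_set_type(unsigned int flags, unsigned int type)
{
	assert((type & ~TOY_TYPE_MASK) == 0);
	return (flags & ~TOY_TYPE_MASK) | type;
}

/* Extract the type field for comparison against a TOY_TYPE_* value. */
static unsigned int toy_get_type(unsigned int flags)
{
	return flags & TOY_TYPE_MASK;
}
```

Writing `(5u << 20)` rather than `TOY_BIT(20) | TOY_BIT(22)` keeps it obvious that the field is compared as a whole, never tested bit by bit.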

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-18 16:20:33 -05:00
Al Viro
acfde6e8ab struct dentry: get rid of randomize_layout idiocy
This is beyond ridiculous.  There is a reason why that thing is
cacheline-aligned...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-18 16:18:49 -05:00
Kent Overstreet
771eb4fe8b fs: factor out d_mark_tmpfile()
New helper for bcachefs - bcachefs doesn't want the
inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on
its own atomically with other btree updates

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
2023-09-11 23:59:47 -04:00
Linus Torvalds
f721d24e5d tmpfile API change

Merge tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs tmpfile updates from Al Viro:
 "Miklos' ->tmpfile() signature change; pass an unopened struct file to
  it, let it open the damn thing. Allows to add tmpfile support to FUSE"

* tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fuse: implement ->tmpfile()
  vfs: open inside ->tmpfile()
  vfs: move open right after ->tmpfile()
  vfs: make vfs_tmpfile() static
  ovl: use vfs_tmpfile_open() helper
  cachefiles: use vfs_tmpfile_open() helper
  cachefiles: only pass inode to *mark_inode_inuse() helpers
  cachefiles: tmpfile error handling cleanup
  hugetlbfs: cleanup mknod and tmpfile
  vfs: add vfs_tmpfile_open() helper
2022-10-10 19:45:17 -07:00
Miklos Szeredi
863f144f12 vfs: open inside ->tmpfile()
This is in preparation for adding tmpfile support to fuse, which requires
that the tmpfile creation and opening are done as a single operation.

Replace the 'struct dentry *' argument of i_op->tmpfile with
'struct file *'.

Call finish_open_simple() as the last thing in ->tmpfile() instances (may
be omitted in the error case).

Change d_tmpfile() argument to 'struct file *' as well to make callers more
readable.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-09-24 07:00:00 +02:00
Al Viro
0f60d28828 dynamic_dname(): drop unused dentry argument
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-20 11:34:04 -04:00
Linus Torvalds
786da5da56 We have a good pile of various fixes and cleanups from Xiubo, Jeff,
Luis and others, almost exclusively in the filesystem.  Several patches
 touch files outside of our normal purview to set the stage for bringing
 in Jeff's long awaited ceph+fscrypt series in the near future.  All of
 them have appropriate acks and sat in linux-next for a while.

Merge tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "We have a good pile of various fixes and cleanups from Xiubo, Jeff,
  Luis and others, almost exclusively in the filesystem.

  Several patches touch files outside of our normal purview to set the
  stage for bringing in Jeff's long awaited ceph+fscrypt series in the
  near future. All of them have appropriate acks and sat in linux-next
  for a while"

* tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client: (27 commits)
  libceph: clean up ceph_osdc_start_request prototype
  libceph: fix ceph_pagelist_reserve() comment typo
  ceph: remove useless check for the folio
  ceph: don't truncate file in atomic_open
  ceph: make f_bsize always equal to f_frsize
  ceph: flush the dirty caps immediately when quota is approaching
  libceph: print fsid and epoch with osd id
  libceph: check pointer before assigned to "c->rules[]"
  ceph: don't get the inline data for new creating files
  ceph: update the auth cap when the async create req is forwarded
  ceph: make change_auth_cap_ses a global symbol
  ceph: fix incorrect old_size length in ceph_mds_request_args
  ceph: switch back to testing for NULL folio->private in ceph_dirty_folio
  ceph: call netfs_subreq_terminated with was_async == false
  ceph: convert to generic_file_llseek
  ceph: fix the incorrect comment for the ceph_mds_caps struct
  ceph: don't leak snap_rwsem in handle_cap_grant
  ceph: prevent a client from exceeding the MDS maximum xattr size
  ceph: choose auth MDS for getxattr with the Xs caps
  ceph: add session already open notify support
  ...
2022-08-11 12:41:07 -07:00
Xiubo Li
4f48d5da81 fs/dcache: export d_same_name() helper
Compare a dentry's name with a case-exact name; return true if the
names are the same, false otherwise.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Sebastian Andrzej Siewior
45f78b0a27 fs/dcache: Move the wakeup from __d_lookup_done() to the caller.
__d_lookup_done() wakes waiters on dentry->d_wait.  On PREEMPT_RT we are
not allowed to do that with preemption disabled, since the wakeup
acquires wait_queue_head::lock, which is a "sleeping" spinlock on RT.

Calling it under dentry->d_lock is not a problem, since that is also a
"sleeping" spinlock on the same configs.  Unfortunately, two of its
callers (__d_add() and __d_move()) are holding more than just ->d_lock
and that needs to be dealt with.

The key observation is that wakeup can be moved to any point before
dropping ->d_lock.

As a first step to solve this, move the wake up outside of the
hlist_bl_lock() held section.

This is safe because:

Waiters get inserted into ->d_wait only after they'd taken ->d_lock
and observed DCACHE_PAR_LOOKUP in flags.  As long as they are
woken up (and evicted from the queue) between the moment __d_lookup_done()
has removed DCACHE_PAR_LOOKUP and dropping ->d_lock, we are safe,
since the waitqueue ->d_wait points to won't get destroyed without
having __d_lookup_done(dentry) called (under ->d_lock).

->d_wait is set only by d_alloc_parallel() and only in case when
it returns a freshly allocated in-lookup dentry.  Whenever that happens,
we are guaranteed that __d_lookup_done() will be called for resulting
dentry (under ->d_lock) before the wq in question gets destroyed.

With two exceptions wq lives in call frame of the caller of
d_alloc_parallel() and we have an explicit d_lookup_done() on the
resulting in-lookup dentry before we leave that frame.

One of those exceptions is nfs_call_unlink(), where wq is embedded into
(dynamically allocated) struct nfs_unlinkdata.  It is destroyed in
nfs_async_unlink_release() after an explicit d_lookup_done() on the
dentry wq went into.

Remaining exception is d_add_ci(). There wq is what we'd found in
->d_wait of d_add_ci() argument. Callers of d_add_ci() are two
instances of ->d_lookup() and they must have been given an in-lookup
dentry.  Which means that they'd been called by __lookup_slow() or
lookup_open(), with wq in the call frame of one of those.

Result of d_alloc_parallel() in d_add_ci() is fed to
d_splice_alias(), which either returns non-NULL (and d_add_ci() does
d_lookup_done()) or feeds dentry to __d_add() that will do
__d_lookup_done() under ->d_lock.  That concludes the analysis.

Let __d_lookup_unhash():

  1) Lock the lookup hash and clear DCACHE_PAR_LOOKUP
  2) Unhash the dentry
  3) Retrieve and clear dentry::d_wait
  4) Unlock the hash and return the retrieved waitqueue head pointer
  5) Let the caller handle the wake up.
  6) Rename __d_lookup_done() to __d_lookup_unhash_wake() to enforce
     build failures for OOT code that used __d_lookup_done() and is not
     aware of the new return value.
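
The split described in the steps above can be modelled in userspace roughly as follows. Everything here is a hypothetical stand-in: the `toy_` types replace the real dentry and waitqueue, the booleans replace DCACHE_PAR_LOOKUP and the hash membership, and locking is only indicated in comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_waitqueue {
	int wakeups;	/* counts wake-ups instead of waking real tasks */
};

struct toy_dentry {
	bool par_lookup;	/* stands in for DCACHE_PAR_LOOKUP */
	bool in_lookup_hash;	/* stands in for in-lookup hash membership */
	struct toy_waitqueue *d_wait;
};

/*
 * Steps 1-4: everything that must happen with the lookup hash locked.
 * No wake-up is done here; the waitqueue pointer is handed back instead.
 */
static struct toy_waitqueue *toy_lookup_unhash(struct toy_dentry *dentry)
{
	struct toy_waitqueue *wq;

	assert(dentry->par_lookup);
	/* (hash lock would be taken here) */
	dentry->par_lookup = false;		/* 1) clear the in-lookup flag */
	dentry->in_lookup_hash = false;		/* 2) unhash */
	wq = dentry->d_wait;			/* 3) retrieve and clear d_wait */
	dentry->d_wait = NULL;
	/* (hash lock would be dropped here) */
	return wq;				/* 4) return the waitqueue */
}

/* Step 5: the caller performs the wake-up outside the hash-locked section. */
static void toy_lookup_unhash_wake(struct toy_dentry *dentry)
{
	struct toy_waitqueue *wq = toy_lookup_unhash(dentry);

	if (wq)
		wq->wakeups++;
}
```

Returning the pointer rather than waking inside the helper is what lets each caller choose a point where the wake-up is allowed.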

This does not yet solve the PREEMPT_RT problem completely because
preemption is still disabled due to i_dir_seq being held for write. This
will be addressed in subsequent steps.

An alternative solution would be to switch the waitqueue to a simple
waitqueue, but aside from Linus not being a fan of them, moving the wake up
closer to the place where dentry::d_lock is unlocked reduces lock contention
time for the woken up waiter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220613140712.77932-3-bigeasy@linutronix.de
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:36:10 -04:00
Luis Chamberlain
c8c0c239d5 fs: move dcache sysctls to its own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, which makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move the dcache sysctl clutter out of kernel/sysctl.c.  This is a
small one-off entry, perhaps later we can simplify this representation,
but for now we use the helpers we have.  We won't know how we can
simplify this further until we're fully done with the cleanup.

[arnd@arndb.de: avoid unused-function warning]
  Link: https://lkml.kernel.org/r/20211203190123.874239-2-arnd@kernel.org

Link: https://lkml.kernel.org/r/20211129205548.605569-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-22 08:33:36 +02:00
Al Viro
80e5d1ff5d useful constants: struct qstr for ".."
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-15 22:36:45 -04:00
Al Viro
a2bbe66493 constify dentry argument of dentry_path()/dentry_path_raw()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-03-21 11:43:58 -04:00
Al Viro
bca585d24a new helper: d_find_alias_rcu()
similar to d_find_alias(inode), except that
	* the caller must be holding rcu_read_lock()
	* inode must not be freed until matching rcu_read_unlock()
	* result is *NOT* pinned and can only be dereferenced until
	  the matching rcu_read_unlock().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-16 15:12:06 -05:00