Commit Graph

4329 Commits

Author SHA1 Message Date
Chuck Lever
b44ffa4c4f NFSD: Block DESTROY_CLIENTID only when there are ongoing async COPY operations
Currently __destroy_client() consults the nfs4_client's async_copies
list to determine whether there are ongoing async COPY operations.
However, NFSD now keeps copy state in that list even when the
async copy has completed, to enable OFFLOAD_STATUS to find the
COPY results for a while after the COPY has completed.

DESTROY_CLIENTID should not be blocked if the client's async_copies
list contains state for only completed copy operations.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
5c41f32147 NFSD: Handle an NFS4ERR_DELAY response to CB_OFFLOAD
RFC 7862 permits callback services to respond to CB_OFFLOAD with
NFS4ERR_DELAY. Currently NFSD drops the CB_OFFLOAD in that case.

To improve the reliability of COPY offload, NFSD should rather send
another CB_OFFLOAD completion notification.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
409d6f52bd NFSD: Free async copy information in nfsd4_cb_offload_release()
RFC 7862 Section 4.8 states:

> A copy offload stateid will be valid until either (A) the client
> or server restarts or (B) the client returns the resource by
> issuing an OFFLOAD_CANCEL operation or the client replies to a
> CB_OFFLOAD operation.

Currently, NFSD purges the metadata for an async COPY operation as
soon as the CB_OFFLOAD callback has been sent. It does not wait even
for the client's CB_OFFLOAD response, as the paragraph above
suggests that it should.

This makes the OFFLOAD_STATUS operation ineffective during the
window between the completion of an asynchronous COPY and the
server's receipt of the corresponding CB_OFFLOAD response. This is
important if, for example, the client responds with NFS4ERR_DELAY,
or the transport is lost before the server receives the response. A
client might use OFFLOAD_STATUS to query the server about the still
pending asynchronous COPY, but NFSD will respond to OFFLOAD_STATUS
as if it had never heard of the presented copy stateid.

This patch starts to address this issue by extending the lifetime of
struct nfsd4_copy at least until the server has seen the client's
CB_OFFLOAD response, or the CB_OFFLOAD has timed out.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
62a8642ba0 NFSD: Fix nfsd4_shutdown_copy()
nfsd4_shutdown_copy() is just this:

	while ((copy = nfsd4_get_copy(clp)) != NULL)
		nfsd4_stop_copy(copy);

nfsd4_get_copy() bumps @copy's reference count, preventing
nfsd4_stop_copy() from releasing @copy.

A while loop like this usually works by removing the first element
of the list, but neither nfsd4_get_copy() nor nfsd4_stop_copy()
alters the async_copies list.

Best I can tell, then, is that nfsd4_shutdown_copy() continues to
loop until other threads manage to remove all the items from this
list. The spinning loop blocks shutdown until these items are gone.

Possibly the reason we haven't seen this issue in the field is
because client_has_state() prevents __destroy_client() from calling
nfsd4_shutdown_copy() if there are any items on this list. In a
subsequent patch I plan to remove that restriction.

Fixes: e0639dc580 ("NFSD introduce async copy feature")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:09 -05:00
Chuck Lever
a4452e661b NFSD: Add a tracepoint to record canceled async COPY operations
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:09 -05:00
Jeff Layton
10c93b5101 nfsd: make nfsd4_session->se_flags a bool
While this holds the flags from the CREATE_SESSION request, nothing
ever consults them. The only flag used is NFS4_SESSION_DEAD. Make it a
simple bool instead.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:09 -05:00
Jeff Layton
53f9ba78e0 nfsd: remove nfsd4_session->se_bchannel
This field is written and is never consulted again. Remove it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:08 -05:00
NeilBrown
6a404f475f nfsd: make use of warning provided by refcount_t
refcount_t, by design, checks for unwanted situations and provides
warnings.  It is rarely useful to have explicit warnings with refcount
usage.

In this case we have an explicit warning if a refcount_t reaches zero
when decremented.  Simply using refcount_dec() will provide a similar
warning and also mark the refcount_t as saturated to avoid any possible
use-after-free.

This patch drops the warning and uses refcount_dec() instead of
refcount_dec_and_test().

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:08 -05:00
NeilBrown
a2c0412c05 nfsd: Don't fail OP_SETCLIENTID when there are too many clients.
Failing OP_SETCLIENTID or OP_EXCHANGE_ID should only happen if there is
memory allocation failure.  Putting a hard limit on the number of
clients is not really helpful as it will either happen too early and
prevent clients that the server can easily handle, or too late and
allow clients when the server is swamped.

The calculated limit is still useful for expiring courtesy clients where
there are "too many" clients, but it shouldn't prevent the creation of
active clients.

Testing of lots of clients against small-mem servers reports repeated
NFS4ERR_DELAY responses which doesn't seem helpful.  There may have been
reports of similar problems in production use.

Also remove an outdated comment - we do use a slab cache.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:07 -05:00
Yang Erkun
f8c989a0c8 nfsd: release svc_expkey/svc_export with rcu_work
The last reference for `cache_head` can be reduced to zero in `c_show`
and `e_show`(using `rcu_read_lock` and `rcu_read_unlock`). Consequently,
`svc_export_put` and `expkey_put` will be invoked, leading to two
issues:

1. The `svc_export_put` will directly free ex_uuid. However,
   `e_show`/`c_show` will access `ex_uuid` after `cache_put`, which can
   trigger a use-after-free issue, shown below.

   ==================================================================
   BUG: KASAN: slab-use-after-free in svc_export_show+0x362/0x430 [nfsd]
   Read of size 1 at addr ff11000010fdc120 by task cat/870

   CPU: 1 UID: 0 PID: 870 Comm: cat Not tainted 6.12.0-rc3+ #1
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
   1.16.1-2.fc37 04/01/2014
   Call Trace:
    <TASK>
    dump_stack_lvl+0x53/0x70
    print_address_description.constprop.0+0x2c/0x3a0
    print_report+0xb9/0x280
    kasan_report+0xae/0xe0
    svc_export_show+0x362/0x430 [nfsd]
    c_show+0x161/0x390 [sunrpc]
    seq_read_iter+0x589/0x770
    seq_read+0x1e5/0x270
    proc_reg_read+0xe1/0x140
    vfs_read+0x125/0x530
    ksys_read+0xc1/0x160
    do_syscall_64+0x5f/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

   Allocated by task 830:
    kasan_save_stack+0x20/0x40
    kasan_save_track+0x14/0x30
    __kasan_kmalloc+0x8f/0xa0
    __kmalloc_node_track_caller_noprof+0x1bc/0x400
    kmemdup_noprof+0x22/0x50
    svc_export_parse+0x8a9/0xb80 [nfsd]
    cache_do_downcall+0x71/0xa0 [sunrpc]
    cache_write_procfs+0x8e/0xd0 [sunrpc]
    proc_reg_write+0xe1/0x140
    vfs_write+0x1a5/0x6d0
    ksys_write+0xc1/0x160
    do_syscall_64+0x5f/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

   Freed by task 868:
    kasan_save_stack+0x20/0x40
    kasan_save_track+0x14/0x30
    kasan_save_free_info+0x3b/0x60
    __kasan_slab_free+0x37/0x50
    kfree+0xf3/0x3e0
    svc_export_put+0x87/0xb0 [nfsd]
    cache_purge+0x17f/0x1f0 [sunrpc]
    nfsd_destroy_serv+0x226/0x2d0 [nfsd]
    nfsd_svc+0x125/0x1e0 [nfsd]
    write_threads+0x16a/0x2a0 [nfsd]
    nfsctl_transaction_write+0x74/0xa0 [nfsd]
    vfs_write+0x1a5/0x6d0
    ksys_write+0xc1/0x160
    do_syscall_64+0x5f/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

2. We cannot sleep while using `rcu_read_lock`/`rcu_read_unlock`.
   However, `svc_export_put`/`expkey_put` will call path_put, which
   subsequently triggers a sleeping operation due to the following
   `dput`.

   =============================
   WARNING: suspicious RCU usage
   5.10.0-dirty #141 Not tainted
   -----------------------------
   ...
   Call Trace:
   dump_stack+0x9a/0xd0
   ___might_sleep+0x231/0x240
   dput+0x39/0x600
   path_put+0x1b/0x30
   svc_export_put+0x17/0x80
   e_show+0x1c9/0x200
   seq_read_iter+0x63f/0x7c0
   seq_read+0x226/0x2d0
   vfs_read+0x113/0x2c0
   ksys_read+0xc9/0x170
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x67/0xd1

Fix these issues by using `rcu_work` to help release
`svc_expkey`/`svc_export`. This approach allows for an asynchronous
context to invoke `path_put` and also facilitates the freeing of
`uuid/exp/key` after an RCU grace period.

Fixes: 9ceddd9da1 ("knfsd: Allow lockless lookups of the exports")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:05 -05:00
Yang Erkun
be8f982c36 nfsd: make sure exp active before svc_export_show
The function `e_show` was called with protection from RCU. This only
ensures that `exp` will not be freed. Therefore, the reference count for
`exp` can drop to zero, which will trigger a refcount use-after-free
warning when `exp_get` is called. To resolve this issue, use
`cache_get_rcu` to ensure that `exp` remains active.

------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 819 at lib/refcount.c:25
refcount_warn_saturate+0xb1/0x120
CPU: 3 UID: 0 PID: 819 Comm: cat Not tainted 6.12.0-rc3+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
RIP: 0010:refcount_warn_saturate+0xb1/0x120
...
Call Trace:
 <TASK>
 e_show+0x20b/0x230 [nfsd]
 seq_read_iter+0x589/0x770
 seq_read+0x1e5/0x270
 vfs_read+0x125/0x530
 ksys_read+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: bf18f163e8 ("NFSD: Using exp_get for export getting")
Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:05 -05:00
Chuck Lever
f64ea4af43 NFSD: Cap the number of bytes copied by nfs4_reset_recoverydir()
It's only current caller already length-checks the string, but let's
be safe.

Fixes: 0964a3d3f1 ("[PATCH] knfsd: nfsd4 reboot dirname fix")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:02 -05:00
Chuck Lever
30c1d2411a NFSD: Remove unused values from nfsd4_encode_components_esc()
Clean up. The computed value of @p is saved each time through the
loop but is never used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:02 -05:00
Chuck Lever
6b9c1080a6 NFSD: Remove unused results in nfsd4_encode_pathname4()
Clean up. The result of "*p++" is saved, but is not used before it
is overwritten. The result of xdr_encode_opaque() is saved each
time through the loop but is never used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:02 -05:00
Chuck Lever
1e02c641c3 NFSD: Prevent NULL dereference in nfsd4_process_cb_update()
@ses is initialized to NULL. If __nfsd4_find_backchannel() finds no
available backchannel session, setup_callback_client() will try to
dereference @ses and segfault.

Fixes: dcbeaa68db ("nfsd4: allow backchannel recovery")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:01 -05:00
Chuck Lever
da4f777e62 NFSD: Remove a never-true comparison
fh_size is an unsigned int, thus it can never be less than 0.

Fixes: d8b26071e6 ("NFSD: simplify struct nfsfh")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:01 -05:00
Chuck Lever
d08bf5ea64 NFSD: Remove dead code in nfsd4_create_session()
Clean up. AFAICT, there is no way to reach the out_free_conn label
with @old set to a non-NULL value, so the expire_client(old) call
is never reached and can be removed.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:01 -05:00
NeilBrown
4cc9b9f2bf nfsd: refine and rename NFSD_MAY_LOCK
NFSD_MAY_LOCK means a few different things.
- it means that GSS is not required.
- it means that with NFSEXP_NOAUTHNLM, authentication is not required
- it means that OWNER_OVERRIDE is allowed.

None of these are specific to locking, they are specific to the NLM
protocol.
So:
 - rename to NFSD_MAY_NLM
 - set NFSD_MAY_OWNER_OVERRIDE and NFSD_MAY_BYPASS_GSS in nlm_fopen()
   so that NFSD_MAY_NLM doesn't need to imply these.
 - move the test on NFSEXP_NOAUTHNLM out of nfsd_permission() and
   into fh_verify where other special-case tests on the MAY flags
   happen.  nfsd_permission() can be called from other places than
   fh_verify(), but none of these will have NFSD_MAY_NLM.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:00 -05:00
Chuck Lever
6640556b0c NFSD: Replace use of NFSD_MAY_LOCK in nfsd4_lock()
NFSv4 LOCK operations should not avoid the set of authorization
checks that apply to all other NFSv4 operations. Also, the
"no_auth_nlm" export option should apply only to NLM LOCK requests.
It's not necessary or sensible to apply it to NFSv4 LOCK operations.

Instead, set no permission bits when calling fh_verify(). Subsequent
stateid processing handles authorization checks.

Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:00 -05:00
Julia Lawall
ed9887b876 nfsd: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
Since SLOB was removed and since
commit 6c6c47b063 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"),
it is not necessary to use call_rcu when the callback only performs
kmem_cache_free. Use kfree_rcu() directly.

The changes were made using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:00 -05:00
Pali Rohár
bb4f07f240 nfsd: Fix NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT
Currently NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT do not bypass
only GSS, but bypass any method. This is a problem specially for NFS3
AUTH_NULL-only exports.

The purpose of NFSD_MAY_BYPASS_GSS_ON_ROOT is described in RFC 2623,
section 2.3.2, to allow mounting NFS2/3 GSS-only export without
authentication. So few procedures which do not expose security risk used
during mount time can be called also with AUTH_NONE or AUTH_SYS, to allow
client mount operation to finish successfully.

The problem with current implementation is that for AUTH_NULL-only exports,
the NFSD_MAY_BYPASS_GSS_ON_ROOT is active also for NFS3 AUTH_UNIX mount
attempts which confuse NFS3 clients, and make them think that AUTH_UNIX is
enabled and is working. Linux NFS3 client never switches from AUTH_UNIX to
AUTH_NONE on active mount, which makes the mount inaccessible.

Fix the NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT implementation
and really allow to bypass only exports which have enabled some real
authentication (GSS, TLS, or any other).

The result would be: For AUTH_NULL-only export if client attempts to do
mount with AUTH_UNIX flavor then it will receive access errors, which
instruct client that AUTH_UNIX flavor is not usable and will either try
other auth flavor (AUTH_NULL if enabled) or fails mount procedure.
Similarly if client attempt to do mount with AUTH_NULL flavor and only
AUTH_UNIX flavor is enabled then the client will receive access error.

This should fix problems with AUTH_NULL-only or AUTH_UNIX-only exports if
client attempts to mount it with other auth flavor (e.g. with AUTH_NULL for
AUTH_UNIX-only export, or with AUTH_UNIX for AUTH_NULL-only export).

Signed-off-by: Pali Rohár <pali@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:59 -05:00
Pali Rohár
600020927b nfsd: Fill NFSv4.1 server implementation fields in OP_EXCHANGE_ID response
NFSv4.1 OP_EXCHANGE_ID response from server may contain server
implementation details (domain, name and build time) in optional
nfs_impl_id4 field. Currently nfsd does not fill this field.

Send these information in NFSv4.1 OP_EXCHANGE_ID response. Fill them with
the same values as what is Linux NFSv4.1 client doing. Domain is hardcoded
to "kernel.org", name is composed in the same way as "uname -srvm" output
and build time is hardcoded to zeros.

NFSv4.1 client and server implementation fields are useful for statistic
purposes or for identifying type of clients and servers.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:58 -05:00
Jeff Layton
b9376c7e42 nfsd: new tracepoint for after op_func in compound processing
Turn nfsd_compound_encode_err tracepoint into a class and add a new
nfsd_compound_op_err tracepoint.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:58 -05:00
Linus Torvalds
70e7730c2a vfs-6.13.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcToAAKCRCRxhvAZXjc
 osL9AP948FFumJRC28gDJ4xp+X4eohNOfkgoEG8FTbF2zU6ulwD+O0pr26FqpFli
 pqlG+38UdATImpfqqWjPbb72sBYcfQg=
 =wLUh
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Fixup and improve NLM and kNFSD file lock callbacks

     Last year both GFS2 and OCFS2 had some work done to make their
     locking more robust when exported over NFS. Unfortunately, part of
     that work caused both NLM (for NFS v3 exports) and kNFSD (for
     NFSv4.1+ exports) to no longer send lock notifications to clients

     This in itself is not a huge problem because most NFS clients will
     still poll the server in order to acquire a conflicted lock

     It's important for NLM and kNFSD that they do not block their
     kernel threads inside filesystem's file_lock implementations
     because that can produce deadlocks. We used to make sure of this by
     only trusting that posix_lock_file() can correctly handle blocking
     lock calls asynchronously, so the lock managers would only setup
     their file_lock requests for async callbacks if the filesystem did
     not define its own lock() file operation

     However, when GFS2 and OCFS2 grew the capability to correctly
     handle blocking lock requests asynchronously, they started
     signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check
     for also trusting posix_lock_file() was inadvertently dropped, so
     now most filesystems no longer produce lock notifications when
     exported over NFS

     Fix this by using an fop_flag which greatly simplifies the problem
     and grooms the way for future uses by both filesystems and lock
     managers alike

   - Add a sysctl to delete the dentry when a file is removed instead of
     making it a negative dentry

     Commit 681ce86235 ("vfs: Delete the associated dentry when
     deleting a file") introduced an unconditional deletion of the
     associated dentry when a file is removed. However, this led to
     performance regressions in specific benchmarks, such as
     ilebench.sum_operations/s, prompting a revert in commit
     4a4be1ad3a ("Revert "vfs: Delete the associated dentry when
     deleting a file""). This reintroduces the concept conditionally
     through a sysctl

   - Expand the statmount() system call:

       * Report the filesystem subtype in a new fs_subtype field to
         e.g., report fuse filesystem subtypes

       * Report the superblock source in a new sb_source field

       * Add a new way to return filesystem specific mount options in an
         option array that returns filesystem specific mount options
         separated by zero bytes and unescaped. This allows caller's to
         retrieve filesystem specific mount options and immediately pass
         them to e.g., fsconfig() without having to unescape or split
         them

       * Report security (LSM) specific mount options in a separate
         security option array. We don't lump them together with
         filesystem specific mount options as security mount options are
         generic and most users aren't interested in them

         The format is the same as for the filesystem specific mount
         option array

   - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command

   - Optimize acl_permission_check() to avoid costly {g,u}id ownership
     checks if possible

   - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict()

   - Add synchronous wakeup support for ep_poll_callback.

     Currently, epoll only uses wake_up() to wake up task. But sometimes
     there are epoll users which want to use the synchronous wakeup flag
     to give a hint to the scheduler, e.g., the Android binder driver.
     So add a wake_up_sync() define, and use wake_up_sync() when sync is
     true in ep_poll_callback()

  Fixes:

   - Fix kernel documentation for inode_insert5() and iget5_locked()

   - Annotate racy epoll check on file->f_ep

   - Make F_DUPFD_QUERY associative

   - Avoid filename buffer overrun in initramfs

   - Don't let statmount() return empty strings

   - Add a cond_resched() to dump_user_range() to avoid hogging the CPU

   - Don't query the device logical blocksize multiple times for hfsplus

   - Make filemap_read() check that the offset is positive or zero

  Cleanups:

   - Various typo fixes

   - Cleanup wbc_attach_fdatawrite_inode()

   - Add __releases annotation to wbc_attach_and_unlock_inode()

   - Add hugetlbfs tracepoints

   - Fix various vfs kernel doc parameters

   - Remove obsolete TODO comment from io_cancel()

   - Convert wbc_account_cgroup_owner() to take a folio

   - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add()

   - Reorder struct posix_acl to save 8 bytes

   - Annotate struct posix_acl with __counted_by()

   - Replace one-element array with flexible array member in freevxfs

   - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()"

* tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  statmount: retrieve security mount options
  vfs: make evict() use smp_mb__after_spinlock instead of smp_mb
  statmount: add flag to retrieve unescaped options
  fs: add the ability for statmount() to report the sb_source
  writeback: wbc_attach_fdatawrite_inode out of line
  writeback: add a __releases annoation to wbc_attach_and_unlock_inode
  fs: add the ability for statmount() to report the fs_subtype
  fs: don't let statmount return empty strings
  fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel()
  hfsplus: don't query the device logical block size multiple times
  freevxfs: Replace one-element array with flexible array member
  fs: optimize acl_permission_check()
  initramfs: avoid filename buffer overrun
  fs/writeback: convert wbc_account_cgroup_owner to take a folio
  acl: Annotate struct posix_acl with __counted_by()
  acl: Realign struct posix_acl to save 8 bytes
  epoll: Add synchronous wakeup support for ep_poll_callback
  coredump: add cond_resched() to dump_user_range
  mm/page-writeback.c: Fix comment of wb_domain_writeout_add()
  mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL
  ...
2024-11-18 09:35:30 -08:00
Kairui Song
da0c02516c mm/list_lru: simplify the list_lru walk callback function
Now isolation no longer takes the list_lru global node lock, only use the
per-cgroup lock instead.  And this lock is inside the list_lru_one being
walked, no longer needed to pass the lock explicitly.

Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:22:26 -08:00
Jeff Layton
f6259e2e4f nfsd: have nfsd4_deleg_getattr_conflict pass back write deleg pointer
Currently we pass back the size and whether it has been modified, but
those just mirror values tracked inside the delegation. In a later
patch, we'll need to get at the timestamps in the delegation too, so
just pass back a reference to the write delegation, and use that to
properly override values in the iattr.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:07 -05:00
Jeff Layton
3a405432e7 nfsd: drop the nfsd4_fattr_args "size" field
We already have a slot for this in the kstat structure. Just overwrite
that instead of keeping a copy.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:07 -05:00
Jeff Layton
c757ca1a56 nfsd: drop the ncf_cb_bmap field
This is always the same value, and in a later patch we're going to need
to set bits in WORD2. We can simplify this code and save a little space
in the delegation too. Just hardcode the bitmap in the callback encode
function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:07 -05:00
Jeff Layton
f67eef8da0 nfsd: drop inode parameter from nfsd4_change_attribute()
The inode that nfs4_open_delegation() passes to this function is
wrong, which throws off the result. The inode will end up getting a
directory-style change attr instead of a regular-file-style one.

Fix up nfs4_delegation_stat() to fetch STATX_MODE, and then drop the
inode parameter from nfsd4_change_attribute(), since it's no longer
needed.

Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:06 -05:00
Chuck Lever
612196ef5c NFSD: Remove unused function parameter
Clean up: Commit 65294c1f2c ("nfsd: add a new struct file caching
facility to nfsd") moved the fh_verify() call site out of
nfsd_open(). That was the only user of nfsd_open's @rqstp parameter,
so that parameter can be removed.

Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:41:58 -05:00
Thorsten Blum
b7165ab074 NFSD: Remove unnecessary posix_acl_entry pointer initialization
The posix_acl_entry pointer pe is already initialized by the
FOREACH_ACL_ENTRY() macro. Remove the unnecessary initialization.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:41:58 -05:00
Chuck Lever
7f33b92e5b NFSD: Prevent a potential integer overflow
If the tag length is >= U32_MAX - 3 then the "length + 4" addition
can result in an integer overflow. Address this by splitting the
decoding into several steps so that decode_cb_compound4res() does
not have to perform arithmetic on the unsafe length value.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:41:57 -05:00
Linus Torvalds
de2f378f2b nfsd-6.12 fixes:
- Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcvm+cACgkQM2qzM29m
 f5f27xAAwdRav9cbwNfLEIC2cMqS4stZH6DsjJ0VSUa87eJeO0BggqnEGxgK38a6
 jONYrjD5eWdTIuv3P07UF/dAcANT4drr6VgZURukNwnX6fIV0N2pOGH8z3y5Mban
 JByqkGNs2lXegdTn8DLJb8xhU0ZJlLQ01908IKc4E1YecAy8vq2yDeksEIBi3HOB
 y3WViieDdD1rYqh/OGCl/hQ72g+Sq+C6ionl+YsRiE4lfmcWCbnOUSrF5jLbU4Xa
 K8pTMTE1JoLvBwTe4+Dzg48PNVlqq8x386O7Lu+rK8sG+C4LX7I+BX9PlJzWDMBU
 KWN04lNxtxVqNfKN7A3ADw5bIPmcsrc5Lh3FXIomp1deuQ3dFOqsADoVSca4Z3Am
 Sci5MrNqDMG74n6IYkhDMaySpSFs4Gx7Si2G0zgBMRyaw7EdO6b8rx8s5OdPwvFj
 Xv5x+Fya5cXuJ28a8FPrYpKe/ImjNV5A2x/5cJNiOdqY7i7QBcQzu0QfFHzBcMkL
 lL9svfTFzAo9B/jj3VsBr8UudDU14QUibQNtoTpKv9gwWCIJPLsU+KqdCkTBcgrZ
 X5Cv0JSOrw8fOd9GAOksjrFHkc8p6vBIGZC34tmVLLfOSRPyeG93roT5cCMkQAEQ
 XfWvC/epKjGoLRZepA5GG/Y9EjnpiQvUWJvyznWpY5hFYdRfBu0=
 =ltss
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD

* tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
2024-11-09 13:18:07 -08:00
Chuck Lever
bb1fb40f8b NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
I noticed that recently, simple operations like "make" started
failing on NFSv3 mounts of ext4 exports. Network capture shows that
READDIRPLUS operated correctly but READDIR failed with
NFS3ERR_INVAL. The vfs_llseek() call returned EINVAL when it is
passed a non-zero starting directory cookie.

I bisected to commit c689bdd3bf ("nfsd: further centralize
protocol version checks.").

Turns out that nfsd3_proc_readdir() does not call fh_verify() before
it calls nfsd_readdir(), so the new fhp->fh_64bit_cookies boolean is
not set properly. This leaves the NFSD_MAY_64BIT_COOKIE unset when
the directory is opened.

For ext4, this causes the wrong "max file size" value to be used
when sanity checking the incoming directory cookie (which is a seek
offset value).

The fhp->fh_64bit_cookies boolean is /always/ properly initialized
after nfsd_open() returns. There doesn't seem to be a reason for the
generic NFSD open helper to handle the f_mode fix-up for
directories, so just move that to the one caller that tries to open
an S_IFDIR with NFSD_MAY_64BIT_COOKIE.

Suggested-by: NeilBrown <neilb@suse.de>
Fixes: c689bdd3bf ("nfsd: further centralize protocol version checks.")
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-07 09:11:37 -05:00
Linus Torvalds
3e5e6c9900 nfsd-6.12 fixes:
- Fix two async COPY bugs found during NFS bake-a-thon
 - Fix an svcrdma memory leak
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcmT5wACgkQM2qzM29m
 f5coUg/9FQMf1IeXhyGlDtV0ELUxkHoXaZ2T6zhfXFoLmyl/DU4AWMH4YbSAqk2M
 NnR47sVsM08tIwE3KYQgYzLbbFF41nQUa1sckeel+nYGtpcN6IwOlU5LYYpNNFeQ
 vsECxV78BA6FGdjaXwQ07r4G6lpVhCCqM/RZpDrwNSyoIWVLo77KBUVCSoQb5wzG
 z7OBvO9M7HfVJVOHcPd+tVcZaGAF0fhW812fibZQKV2mrWdhOOe+gWVs8ro3tmm1
 GocbTTQW2hlYcLCZPe1przTI9flfwon6Lk8TmIZuU5IrzcaAB+U3P140aKgl9427
 v4WdLuKYlKi+xISBdRG3omyaLroNUs8IHW4KoBXAW3FinyLzNsAyoPxb02m7SEge
 sOJ/gbeLtb2u+ur4wAp4gDmVKfg3TGyh05Hdt96LXsbQUuWIlwEcPurl+nY93Eoq
 vrPLIdPOXrOD5jBIaVQkBYlaCn04mDg+VTNbG9hW1wjorVFpWKS7MwCjVloXSVIn
 uE++cVpQtIKp8aTYBbFVXqtVREatczl++f+Npnlm8xlcquDbaORkk4ZBOs4vQuHo
 pNuZcWO0rIBR6hakr44OjTLnJBIwChPBYvBVgtq1E6oAbwHvC2SVXbiJB8IcCPOx
 nB2jnF0/tpTs2LnrHAxdAGdU9Om6RGmahfz/uwh/8djdbCVH+Ik=
 =NiQh
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix two async COPY bugs found during NFS bake-a-thon

 - Fix an svcrdma memory leak

* tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  rpcrdma: Always release the rpcrdma_device's xa_array
  NFSD: Never decrement pending_async_copies on error
  NFSD: Initialize struct nfsd4_copy earlier
2024-11-02 09:27:11 -10:00
Chuck Lever
8286f8b622 NFSD: Never decrement pending_async_copies on error
The error flow in nfsd4_copy() calls cleanup_async_copy(), which
already decrements nn->pending_async_copies.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-30 14:12:16 -04:00
Chuck Lever
63fab04cbd NFSD: Initialize struct nfsd4_copy earlier
Ensure the refcount and async_copies fields are initialized early.
cleanup_async_copy() will reference these fields if an error occurs
in nfsd4_copy(). If they are not correctly initialized, at the very
least, a refcount underflow occurs.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-29 15:31:18 -04:00
Linus Torvalds
f647053312 nfsd-6.12 fixes:
- Fix a couple of use-after-free bugs
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcbp9YACgkQM2qzM29m
 f5dzAg/4lCFRbsebia7qktW88ZDBbAoZyKolk2DHfjZXCuq5DHb/5I/Hk1rGyTYs
 VaJmCU59ZdpyBFSdQhOYKf2xvgNPvJG02U8il5KWtMAY5cStXFjeU0FSDoC5O4Dl
 9IoaVbtAWGMCjxWJ1WEGpU82JoM7moSVB4G718LlxF+4cUS7idq5se0uK31WQvft
 DmJsOfmnch1Y/7+RRcDwbBu0HwP2ZQHS8zMYMQ2JGXPDZJFenTibezVb36YyzyZn
 WQfvaW6MmdiVL9omZxvURL9WuBKA2L+Ly/92PyHaflcAXSngcpfu28IIQbzp/m7K
 JT/3ad32lB7F3LrDP5I4gwVh8oGLYiI5r7RBWo0e98LPvAR/89gBVdZHhjasstCh
 nAL7Kk6P/jQdbM/KR9T+yS7xTVScI5Wp4Xcitz2mlHgU4br67GO9gpo1e/tqXenm
 Gasapkg5qCduz+ksj2vwpeFXKQi+qwJgfVGKMxELoV8qTazyr09Dfgouqe045ztl
 /0khkOrLkw0bYLDNJKhj/XG0ZEV5V10c/0PEnivC2BHVQioBIDRQAaH2S2G/8vQH
 MdWayGhNlTV0g4DdtMVPxf5uN+uQmfMsj5BIe17NIUYxiJnw0CSM/bzHUcJ15+DH
 nTblaek6sa5BFpZ2fed8by4il4/tme3NOJEeHnSqe/3QKyZgOQ==
 =1YOr
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix a couple of use-after-free bugs

* tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
  nfsd: fix race between laundromat and free_stateid
2024-10-25 11:38:15 -07:00
Yang Erkun
d5ff2fb2e7 nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
In the normal case, when we excute `echo 0 > /proc/fs/nfsd/threads`, the
function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will
release all resources related to the hashed `nfs4_client`. If the
`nfsd_client_shrinker` is running concurrently, the `expire_client`
function will first unhash this client and then destroy it. This can
lead to the following warning. Additionally, numerous use-after-free
errors may occur as well.

nfsd_client_shrinker         echo 0 > /proc/fs/nfsd/threads

expire_client                nfsd_shutdown_net
  unhash_client                ...
                               nfs4_state_shutdown_net
                                 /* won't wait shrinker exit */
  /*                             cancel_work(&nn->nfsd_shrinker_work)
   * nfsd_file for this          /* won't destroy unhashed client1 */
   * client1 still alive         nfs4_state_destroy_net
   */

                               nfsd_file_cache_shutdown
                                 /* trigger warning */
                                 kmem_cache_destroy(nfsd_file_slab)
                                 kmem_cache_destroy(nfsd_file_mark_slab)
  /* release nfsd_file and mark */
  __destroy_client

====================================================================
BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on
__kmem_cache_shutdown()
--------------------------------------------------------------------
CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3+ #1

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xac/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

====================================================================
BUG nfsd_file_mark (Tainted: G    B   W         ): Objects remaining
nfsd_file_mark on __kmem_cache_shutdown()
--------------------------------------------------------------------

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xc8/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

To resolve this issue, cancel `nfsd_shrinker_work` using synchronous
mode in nfs4_state_shutdown_net.

Fixes: 7c24fa2250 ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker")
Signed-off-by: Yang Erkun <yangerkun@huaweicloud.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-21 10:27:36 -04:00
Olga Kornievskaia
8dd91e8d31 nfsd: fix race between laundromat and free_stateid
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.

Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:

kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20

This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).

First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.

Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.

Fixes: 2d4a532d38 ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-18 16:40:37 -04:00
Linus Torvalds
6254d53727 NFS Client Bugfixes for Linux 6.12-rc
Localio Bugfixes:
   * Remove duplicated include in localio.c
   * Fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
   * Fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT
   * Fix nfsd_file tracepoints to handle NULL rqstp pointers
 
 Other Bugfixes:
   * Fix program selection loop in svc_process_common
   * Fix integer overflow in decode_rc_list()
   * Prevent NULL-pointer dereference in nfs42_complete_copies()
   * Fix CB_RECALL performance issues when using a large number of delegations
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmcJjjQACgkQ18tUv7Cl
 QOvgJw/6A33s+pjyBVLIKT6oMCPkUJeQ4Rhg9Je0Qw/ji0eFkT4Eyd65kRz3T9M/
 qRrCfWaUd2dTYcbKQyhuGTlEfICZa9R4I0/Ztk9yvf9xcd1xFXKzTkFekGUVeHQA
 OcngDu9psFxhvyzKI8nAHs1ephX/T7TywvTKANMRbeRCYYvVkytAt9YeVMigYZa5
 dnchoUdGUdL6B6RXCU/Qhf0A1uYyA4hkk/FTBCPgv+kYx5pnjFq0y/yIIHDzCR3I
 +yE1ss3EpVTQgt2Ca/cmDyYXsa7G8G51U7cS5AeIoXfsf1EGtTujowWcBY4oqFEC
 ixx58fQe48AqwsP5XDZn8gnsuYH9snnw5rIB0IVqq55/a+XLMupHayyf/iziMV3s
 JWgT4gKDyFca2pT+bJ8iWweU+ecRYxKGnh2NydyBiqowogsHZm4uKh0vELvqqkBd
 RIjCyIiQVhYBII2jqpjRnxrqhGUT5XO99NQdQIGV0bUjCEP4YAjY4ChfEVcWXhnB
 ppyBP+r8N5O77NcVqsVQS26U0/jb9K30LyYl9VT43ank3d+VVtHA5ZqnUflWtwuc
 2XiGDvXW9mIvbVraWIZXUNVy39bzRclDf5bx4jeYLnKCMym81rkEIBOvBKQKZTrl
 v+1Nhaj+fSw+rFSUm0KPqms0UDiT0Ol7ltu84ifadYqubbSEbqU=
 =QBvR
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Localio Bugfixes:
   - remove duplicated include in localio.c
   - fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
   - fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT
   - fix nfsd_file tracepoints to handle NULL rqstp pointers

  Other Bugfixes:
   - fix program selection loop in svc_process_common
   - fix integer overflow in decode_rc_list()
   - prevent NULL-pointer dereference in nfs42_complete_copies()
   - fix CB_RECALL performance issues when using a large number of
     delegations"

* tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: remove revoked delegation from server's delegation list
  nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp
  nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT
  nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
  NFSv4: Prevent NULL-pointer dereference in nfs42_complete_copies()
  SUNRPC: Fix integer overflow in decode_rc_list()
  sunrpc: fix prog selection loop in svc_process_common
  nfs: Remove duplicated include in localio.c
2024-10-11 15:37:15 -07:00
Linus Torvalds
5870963f6c nfsd-6.12 fixes:
- Fix NFSD bring-up / shutdown
 - Fix a UAF when releasing a stateid
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcHx6IACgkQM2qzM29m
 f5ecExAAheSpSP9D+1n3hKeOlNfhY8FUzd41Arn4NYV+jIBJbtx94/FlSNOxA0mp
 Ovm1I8uAxy4TR8TLt7tsxfT7JDOStKwXFl3QlOUZT/+uyyJr7/q5R959R3oMiccR
 Rfrpj6j2yYWrI8qGDGHca4Vv2bDSxr4mzztWwDe0SHSsjwf4OAcv+XF5vcfZ/CJN
 Bxulb9WfNU8XvdFcRDHokMfk6jiY6/+FCTwX8ckvbVEG6gHT8+CRYSUJ05j0LJGo
 xKZV913NgzcuV7PH0vq6vExJE6+rEPt/ejDAT5FM5yeNe+WJ4RTDgsYyIr9iLbHF
 mWB9M4NnP+EZhejtOCbZ9RZjjKro09ilEPpqILuuGQPtcHSeWmhNbFz0kwLe+zYZ
 CdtjnPZhjB0ITWgZ1HCtoJ8k/ZcMa7iiM/kApMLGr9fVj8/BHHFzS95PK7K/Fqur
 FLdhvo6CzZCnRd16e2kqWsG7wO2lPWcz4NWTf9wxIG5GCunXoVCEnK1VfHvnldbH
 BIFXZ+ib5qnL2i3Qmz7bQxmfIp5ryZnNx1mF0OM8imR9K/rsnARd7JfQ99lpMy8D
 mD4coZVTMMk/Zg9zuH8k5GBzB2zXXqgngp4IJIxqrKR7/AsuSU3R7r+O9CWN91GQ
 GKpRtMn/rVUg81jxDr3qoKquyxONoyVrVXAKsj1PgUSQdjUJgqU=
 =Rud7
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix NFSD bring-up / shutdown

 - Fix a UAF when releasing a stateid

* tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix possible badness in FREE_STATEID
  nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed
  NFSD: Mark filecache "down" if init fails
2024-10-10 09:52:49 -07:00
Olga Kornievskaia
c88c150a46 nfsd: fix possible badness in FREE_STATEID
When multiple FREE_STATEIDs are sent for the same delegation stateid,
it can lead to a possible either use-after-free or counter refcount
underflow errors.

In nfsd4_free_stateid() under the client lock we find a delegation
stateid, however the code drops the lock before calling nfs4_put_stid(),
that allows another FREE_STATE to find the stateid again. The first one
will proceed to then free the stateid which leads to either
use-after-free or decrementing already zeroed counter.

Fixes: 3f29cc82a8 ("nfsd: split sc_status out of sc_type")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-05 15:44:25 -04:00
Mike Snitzer
76f5af9952 nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp
Otherwise nfsd_file_acquire, nfsd_file_insert_err, and
nfsd_file_cons_err will hit a NULL pointer when they are enabled and
LOCALIO used.

Example trace output (note xid is 0x0 and LOCALIO flag set):
 nfsd_file_acquire: xid=0x0 inode=0000000069a1b2e7
 may_flags=WRITE|LOCALIO ref=1 nf_flags=HASHED|GC nf_may=WRITE
 nf_file=0000000070123234 status=0

Fixes: c63f0e48fe ("nfsd: add nfsd_file_acquire_local()")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-04 14:52:04 -04:00
Mike Snitzer
65f2a5c366 nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
Add nfs_to_nfsd_file_put_local() interface to fix race with nfsd
module unload.  Similarly, use RCU around nfs_open_local_fh()'s error
path call to nfs_to->nfsd_serv_put().  Holding RCU ensures that NFS
will safely _call and return_ from its nfs_to calls into the NFSD
functions nfsd_file_put_local() and nfsd_serv_put().

Otherwise, if RCU isn't used then there is a narrow window when NFS's
reference for the nfsd_file and nfsd_serv are dropped and the NFSD
module could be unloaded, which could result in a crash from the
return instruction for either nfs_to->nfsd_file_put_local() or
nfs_to->nfsd_serv_put().

Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-03 16:19:43 -04:00
Lizhi Xu
cad3f4a22c inotify: Fix possible deadlock in fsnotify_destroy_mark
[Syzbot reported]
WARNING: possible circular locking dependency detected
6.11.0-rc4-syzkaller-00019-gb311c1b497e5 #0 Not tainted
------------------------------------------------------
kswapd0/78 is trying to acquire lock:
ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline]
ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578

but task is already holding lock:
ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6841 [inline]
ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xbb4/0x35a0 mm/vmscan.c:7223

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}-{0:0}:
       ...
       kmem_cache_alloc_noprof+0x3d/0x2a0 mm/slub.c:4044
       inotify_new_watch fs/notify/inotify/inotify_user.c:599 [inline]
       inotify_update_watch fs/notify/inotify/inotify_user.c:647 [inline]
       __do_sys_inotify_add_watch fs/notify/inotify/inotify_user.c:786 [inline]
       __se_sys_inotify_add_watch+0x72e/0x1070 fs/notify/inotify/inotify_user.c:729
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (&group->mark_mutex){+.+.}-{3:3}:
       ...
       __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
       fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline]
       fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578
       fsnotify_destroy_marks+0x14a/0x660 fs/notify/mark.c:934
       fsnotify_inoderemove include/linux/fsnotify.h:264 [inline]
       dentry_unlink_inode+0x2e0/0x430 fs/dcache.c:403
       __dentry_kill+0x20d/0x630 fs/dcache.c:610
       shrink_kill+0xa9/0x2c0 fs/dcache.c:1055
       shrink_dentry_list+0x2c0/0x5b0 fs/dcache.c:1082
       prune_dcache_sb+0x10f/0x180 fs/dcache.c:1163
       super_cache_scan+0x34f/0x4b0 fs/super.c:221
       do_shrink_slab+0x701/0x1160 mm/shrinker.c:435
       shrink_slab+0x1093/0x14d0 mm/shrinker.c:662
       shrink_one+0x43b/0x850 mm/vmscan.c:4815
       shrink_many mm/vmscan.c:4876 [inline]
       lru_gen_shrink_node mm/vmscan.c:4954 [inline]
       shrink_node+0x3799/0x3de0 mm/vmscan.c:5934
       kswapd_shrink_node mm/vmscan.c:6762 [inline]
       balance_pgdat mm/vmscan.c:6954 [inline]
       kswapd+0x1bcd/0x35a0 mm/vmscan.c:7223

[Analysis]
The problem is that inotify_new_watch() is using GFP_KERNEL to allocate
new watches under group->mark_mutex, however if dentry reclaim races
with unlinking of an inode, it can end up dropping the last dentry reference
for an unlinked inode resulting in removal of fsnotify mark from reclaim
context which wants to acquire group->mark_mutex as well.

This scenario shows that all notification groups are in principle prone
to this kind of a deadlock (previously, we considered only fanotify and
dnotify to be problematic for other reasons) so make sure all
allocations under group->mark_mutex happen with GFP_NOFS.

Reported-and-tested-by: syzbot+c679f13773f295d2da53@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c679f13773f295d2da53
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240927143642.2369508-1-lizhi.xu@windriver.com
2024-10-02 15:14:29 +02:00
Christian Brauner
09ee2a670d
Merge patch series "Fixup NLM and kNFSD file lock callbacks"
Benjamin Coddington <bcodding@redhat.com> says:

Last year both GFS2 and OCFS2 had some work done to make their locking more
robust when exported over NFS.  Unfortunately, part of that work caused both
NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send
lock notifications to clients.

This in itself is not a huge problem because most NFS clients will still
poll the server in order to acquire a conflicted lock, but now that I've
noticed it I can't help but try to fix it because there are big advantages
for setups that might depend on timely lock notifications, and we've
supported that as a feature for a long time.

Its important for NLM and kNFSD that they do not block their kernel threads
inside filesystem's file_lock implementations because that can produce
deadlocks.  We used to make sure of this by only trusting that
posix_lock_file() can correctly handle blocking lock calls asynchronously,
so the lock managers would only setup their file_lock requests for async
callbacks if the filesystem did not define its own lock() file operation.

However, when GFS2 and OCFS2 grew the capability to correctly
handle blocking lock requests asynchronously, they started signalling this
behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting
posix_lock_file() was inadvertently dropped, so now most filesystems no
longer produce lock notifications when exported over NFS.

I tried to fix this by simply including the old check for lock(), but the
resulting include mess and layering violations was more than I could accept.
There's a much cleaner way presented here using an fop_flag, which while
potentially flag-greedy, greatly simplifies the problem and grooms the
way for future uses by both filesystems and lock managers alike.

* patches from https://lore.kernel.org/r/cover.1726083391.git.bcodding@redhat.com:
  exportfs: Remove EXPORT_OP_ASYNC_LOCK
  NLM/NFSD: Fix lock notifications for async-capable filesystems
  gfs2/ocfs2: set FOP_ASYNC_LOCK
  fs: Introduce FOP_ASYNC_LOCK
  NFS: trace: show TIMEDOUT instead of 0x6e
  nfsd: use system_unbound_wq for nfsd_file_gc_worker()
  nfsd: count nfsd_file allocations
  nfsd: fix refcount leak when file is unhashed after being found
  nfsd: remove unneeded EEXIST error check in nfsd_do_file_acquire
  nfsd: add list_head nf_gc to struct nfsd_file

Link: https://lore.kernel.org/r/cover.1726083391.git.bcodding@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-02 07:52:07 +02:00
Benjamin Coddington
7e64c5bc49
NLM/NFSD: Fix lock notifications for async-capable filesystems
Instead of checking just the exportfs flag, use the new
locks_can_async_lock() helper which allows NLM and NFSD to once again
support lock notifications for all filesystems which use posix_lock_file().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Link: https://lore.kernel.org/r/865c40da44af67939e8eb560d17a26c9c50f23e0.1726083391.git.bcodding@redhat.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-01 17:00:18 +02:00
Mike Snitzer
946af9b3a0 nfsd: implement server support for NFS_LOCALIO_PROGRAM
The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
RPC method that allows the Linux NFS client to verify the local Linux
NFS server can see the nonce (single-use UUID) the client generated and
made available in nfs_common.  The server expects this protocol to use
the same transport as NFS and NFSACL for its RPCs.  This protocol
isn't part of an IETF standard, nor does it need to be considering it
is Linux-to-Linux auxiliary RPC protocol that amounts to an
implementation detail.

The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
the fixed UUID_SIZE (16 bytes).  The fixed size opaque encode and decode
XDR methods are used instead of the less efficient variable sized
methods.

The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):
Linux Kernel Organization       400122  nfslocalio

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
[neilb: factored out and simplified single localio protocol]
Co-developed-by: NeilBrown <neilb@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
Weston Andros Adamson
fa4983862e nfsd: add LOCALIO support
Add server support for bypassing NFS for localhost reads, writes, and
commits. This is only useful when both the client and server are
running on the same host.

If nfsd_open_local_fh() fails then the NFS client will both retry and
fallback to normal network-based read, write and commit operations if
localio is no longer supported.

Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc) regardless of whether localio or regular NFS
access is used.  The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for localio.  Store
auth_domain for localio in nfsd_uuid_t and transfer it to the client
if it is local to the server.

Relative to containers, localio gives the client access to the network
namespace the server has.  This is required to allow the client to
access the server's per-namespace nfsd_net struct.

This commit also introduces the use of NFSD's percpu_ref to interlock
nfsd_destroy_serv and nfsd_open_local_fh, to ensure nn->nfsd_serv is
not destroyed while in use by nfsd_open_local_fh and other LOCALIO
client code.

CONFIG_NFS_LOCALIO enables NFS server support for LOCALIO.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Co-developed-by: NeilBrown <neilb@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
Mike Snitzer
a61e147e6b nfs_common: prepare for the NFS client to use nfsd_file for LOCALIO
The next commit will introduce nfsd_open_local_fh() which returns an
nfsd_file structure.  This commit exposes LOCALIO's required NFSD
symbols to the NFS client:

- Make nfsd_open_local_fh() symbol and other required NFSD symbols
  available to NFS in a global 'nfs_to' nfsd_localio_operations
  struct (global access suggested by Trond, nfsd_localio_operations
  suggested by NeilBrown).  The next commit will also introduce
  nfsd_localio_ops_init() that init_nfsd() will call to initialize
  'nfs_to'.

- Introduce nfsd_file_file() that provides access to nfsd_file's
  backing file.  Keeps nfsd_file structure opaque to NFS client (as
  suggested by Jeff Layton).

- Introduce nfsd_file_put_local() that will put the reference to the
  nfsd_file's associated nn->nfsd_serv and then put the reference to
  the nfsd_file (as suggested by NeilBrown).

Suggested-by: Trond Myklebust <trond.myklebust@hammerspace.com> # nfs_to
Suggested-by: NeilBrown <neilb@suse.de> # nfsd_localio_operations
Suggested-by: Jeff Layton <jlayton@kernel.org> # nfsd_file_file
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
NeilBrown
86ab08beb3 SUNRPC: replace program list with program array
A service created with svc_create_pooled() can be given a linked list of
programs and all of these will be served.

Using a linked list makes it cumbersome when there are several programs
that can be optionally selected with CONFIG settings.

After this patch is applied, API consumers must use only
svc_create_pooled() when creating an RPC service that listens for more
than one RPC program.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
Mike Snitzer
47e988147f nfsd: add nfsd_serv_try_get and nfsd_serv_put
Introduce nfsd_serv_try_get and nfsd_serv_put and update the nfsd code
to prevent nfsd_destroy_serv from destroying nn->nfsd_serv until any
caller of nfsd_serv_try_get releases their reference using nfsd_serv_put.

A percpu_ref is used to implement the interlock between
nfsd_destroy_serv and any caller of nfsd_serv_try_get.

This interlock is needed to properly wait for the completion of client
initiated localio calls to nfsd (that are _not_ in the context of nfsd).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
NeilBrown
c63f0e48fe nfsd: add nfsd_file_acquire_local()
nfsd_file_acquire_local() can be used to look up a file by filehandle
without having a struct svc_rqst.  This can be used by NFS LOCALIO to
allow the NFS client to bypass the NFS protocol to directly access a
file provided by the NFS server which is running in the same kernel.

In nfsd_file_do_acquire() care is taken to always use fh_verify() if
rqstp is not NULL (as is the case for non-LOCALIO callers).  Otherwise
the non-LOCALIO callers will not supply the correct and required
arguments to __fh_verify (e.g. gssclient isn't passed).

Introduce fh_verify_local() wrapper around __fh_verify to make it
clear that LOCALIO is intended caller.

Also, use GC for nfsd_file returned by nfsd_file_acquire_local.  GC
offers performance improvements if/when a file is reopened before
launderette cleans it from the filecache's LRU.

Suggested-by: Jeff Layton <jlayton@kernel.org> # use filecache's GC
Signed-off-by: NeilBrown <neilb@suse.de>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
NeilBrown
5e66d2d92a nfsd: factor out __fh_verify to allow NULL rqstp to be passed
__fh_verify() offers an interface like fh_verify() but doesn't require
a struct svc_rqst *, instead it also takes the specific parts as
explicit required arguments.  So it is safe to call __fh_verify() with
a NULL rqstp, but the net, cred, and client args must not be NULL.

__fh_verify() does not use SVC_NET(), nor does the functions it calls.

Rather than using rqstp->rq_client pass the client and gssclient
explicitly to __fh_verify and then to nfsd_set_fh_dentry().

Lastly, it should be noted that the previous commit prepared for 4
associated tracepoints to only be used if rqstp is not NULL (this is a
stop-gap that should be properly fixed so localio also benefits from
the utility these tracepoints provide when debugging fh_verify
issues).

Signed-off-by: NeilBrown <neilb@suse.de>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
Chuck Lever
71c61a0077 NFSD: Short-circuit fh_verify tracepoints for LOCALIO
LOCALIO will be able to call fh_verify() with a NULL rqstp. In this
case, the existing trace points need to be skipped because they
want to dereference the address fields in the passed-in rqstp.

Temporarily make these trace points conditional to avoid a seg
fault in this case. Putting the "rqstp != NULL" check in the trace
points themselves makes the check more efficient.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
Chuck Lever
7c0b07b49b NFSD: Avoid using rqstp->rq_vers in nfsd_set_fh_dentry()
Currently, fh_verify() makes some daring assumptions about which
version of file handle the caller wants, based on the things it can
find in the passed-in rqstp. The about-to-be-introduced LOCALIO use
case sometimes has no svc_rqst context, so this logic won't work in
that case.

Instead, examine the passed-in file handle. It's .max_size field
should carry information to allow nfsd_set_fh_dentry() to initialize
the file handle appropriately.

The file handle used by lockd and the one created by write_filehandle
never need any of the version-specific fields (which affect things
like write and getattr requests and pre/post attributes).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
NeilBrown
b0d87dbd8b NFSD: Refactor nfsd_setuser_and_check_port()
There are several places where __fh_verify unconditionally dereferences
rqstp to check that the connection is suitably secure.  They look at
rqstp->rq_xprt which is not meaningful in the target use case of
"localio" NFS in which the client talks directly to the local server.

Prepare these to always succeed when rqstp is NULL.

Signed-off-by: NeilBrown <neilb@suse.de>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
NeilBrown
0a183f24a7 NFSD: Handle @rqstp == NULL in check_nfsd_access()
LOCALIO-initiated open operations are not running in an nfsd thread
and thus do not have an associated svc_rqst context.

Signed-off-by: NeilBrown <neilb@suse.de>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:29 -04:00
Mike Snitzer
4806ded4c1 nfs_common: factor out nfs_errtbl and nfs_stat_to_errno
Common nfs_stat_to_errno() is used by both fs/nfs/nfs2xdr.c and
fs/nfs/nfs3xdr.c

Will also be used by fs/nfsd/localio.c

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:29 -04:00
NeilBrown
53e4e17557 nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed
If nfsd_startup_net() fails and so ->nfsd_net_up is false,
nfsd_destroy_serv() doesn't currently call svc_destroy().  It should.

Fixes: 1e3577a452 ("SUNRPC: discard sv_refcnt, and svc_get/svc_put")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-23 10:37:32 -04:00
Chuck Lever
dc0d0f885a NFSD: Mark filecache "down" if init fails
NeilBrown says:
> The handling of NFSD_FILE_CACHE_UP is strange.  nfsd_file_cache_init()
> sets it, but doesn't clear it on failure.  So if nfsd_file_cache_init()
> fails for some reason, nfsd_file_cache_shutdown() would still try to
> clean up if it was called.

Reported-by: NeilBrown <neilb@suse.de>
Fixes: c7b824c3d0 ("NFSD: Replace the "init once" mechanism")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-23 10:37:20 -04:00
NeilBrown
45bb63ed20 nfsd: fix delegation_blocked() to block correctly for at least 30 seconds
The pair of bloom filtered used by delegation_blocked() was intended to
block delegations on given filehandles for between 30 and 60 seconds.  A
new filehandle would be recorded in the "new" bit set.  That would then
be switch to the "old" bit set between 0 and 30 seconds later, and it
would remain as the "old" bit set for 30 seconds.

Unfortunately the code intended to clear the old bit set once it reached
30 seconds old, preparing it to be the next new bit set, instead cleared
the *new* bit set before switching it to be the old bit set.  This means
that the "old" bit set is always empty and delegations are blocked
between 0 and 30 seconds.

This patch updates bd->new before clearing the set with that index,
instead of afterwards.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 6282cd5655 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:38 -04:00
Jeff Layton
bf92e5008b nfsd: fix initial getattr on write delegation
At this point in compound processing, currentfh refers to the parent of
the file, not the file itself. Get the correct dentry from the delegation
stateid instead.

Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:37 -04:00
NeilBrown
a078a7dc0e nfsd: untangle code in nfsd4_deleg_getattr_conflict()
The code in nfsd4_deleg_getattr_conflict() is convoluted and buggy.

With this patch we:
 - properly handle non-nfsd leases.  We must not assume flc_owner is a
    delegation unless fl_lmops == &nfsd_lease_mng_ops
 - move the main code out of the for loop
 - have a single exit which calls nfs4_put_stid()
   (and other exits which don't need to call that)

[ jlayton: refactored on top of Neil's other patch: nfsd: fix
	   nfsd4_deleg_getattr_conflict in presence of third party lease ]

Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:36 -04:00
Scott Mayhew
5559c157b7 nfsd: enforce upper limit for namelen in __cld_pipe_inprogress_downcall()
This patch is intended to go on top of "nfsd: return -EINVAL when
namelen is 0" from Li Lingfeng.  Li's patch checks for 0, but we should
be enforcing an upper bound as well.

Note that if nfsdcld somehow gets an id > NFS4_OPAQUE_LIMIT in its
database, it'll truncate it to NFS4_OPAQUE_LIMIT when it does the
downcall anyway.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:35 -04:00
Li Lingfeng
22451a16b7 nfsd: return -EINVAL when namelen is 0
When we have a corrupted main.sqlite in /var/lib/nfs/nfsdcld/, it may
result in namelen being 0, which will cause memdup_user() to return
ZERO_SIZE_PTR.
When we access the name.data that has been assigned the value of
ZERO_SIZE_PTR in nfs4_client_to_reclaim(), null pointer dereference is
triggered.

[ T1205] ==================================================================
[ T1205] BUG: KASAN: null-ptr-deref in nfs4_client_to_reclaim+0xe9/0x260
[ T1205] Read of size 1 at addr 0000000000000010 by task nfsdcld/1205
[ T1205]
[ T1205] CPU: 11 PID: 1205 Comm: nfsdcld Not tainted 5.10.0-00003-g2c1423731b8d #406
[ T1205] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[ T1205] Call Trace:
[ T1205]  dump_stack+0x9a/0xd0
[ T1205]  ? nfs4_client_to_reclaim+0xe9/0x260
[ T1205]  __kasan_report.cold+0x34/0x84
[ T1205]  ? nfs4_client_to_reclaim+0xe9/0x260
[ T1205]  kasan_report+0x3a/0x50
[ T1205]  nfs4_client_to_reclaim+0xe9/0x260
[ T1205]  ? nfsd4_release_lockowner+0x410/0x410
[ T1205]  cld_pipe_downcall+0x5ca/0x760
[ T1205]  ? nfsd4_cld_tracking_exit+0x1d0/0x1d0
[ T1205]  ? down_write_killable_nested+0x170/0x170
[ T1205]  ? avc_policy_seqno+0x28/0x40
[ T1205]  ? selinux_file_permission+0x1b4/0x1e0
[ T1205]  rpc_pipe_write+0x84/0xb0
[ T1205]  vfs_write+0x143/0x520
[ T1205]  ksys_write+0xc9/0x170
[ T1205]  ? __ia32_sys_read+0x50/0x50
[ T1205]  ? ktime_get_coarse_real_ts64+0xfe/0x110
[ T1205]  ? ktime_get_coarse_real_ts64+0xa2/0x110
[ T1205]  do_syscall_64+0x33/0x40
[ T1205]  entry_SYSCALL_64_after_hwframe+0x67/0xd1
[ T1205] RIP: 0033:0x7fdbdb761bc7
[ T1205] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 514
[ T1205] RSP: 002b:00007fff8c4b7248 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ T1205] RAX: ffffffffffffffda RBX: 000000000000042b RCX: 00007fdbdb761bc7
[ T1205] RDX: 000000000000042b RSI: 00007fff8c4b75f0 RDI: 0000000000000008
[ T1205] RBP: 00007fdbdb761bb0 R08: 0000000000000000 R09: 0000000000000001
[ T1205] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000042b
[ T1205] R13: 0000000000000008 R14: 00007fff8c4b75f0 R15: 0000000000000000
[ T1205] ==================================================================

Fix it by checking namelen.

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Fixes: 74725959c3 ("nfsd: un-deprecate nfsdcld")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Scott Mayhew <smayhew@redhat.com>
Tested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:35 -04:00
Chuck Lever
0505de9615 NFSD: Wrap async copy operations with trace points
Add an nfsd_copy_async_done to record the timestamp, the final
status code, and the callback stateid of an async copy.

Rename the nfsd_copy_do_async tracepoint to match that naming
convention to make it easier to enable both of these with a
single glob.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Chuck Lever
d3c430aa97 NFSD: Clean up extra whitespace in trace_nfsd_copy_done
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Chuck Lever
e1d2697c53 NFSD: Record the callback stateid in copy tracepoints
Match COPY operations up with CB_OFFLOAD operations.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Chuck Lever
11848e985c NFSD: Display copy stateids with conventional print formatting
Make it easier to grep for s2s COPY stateids in trace logs: Use the
same display format in nfsd_copy_class as is used to display other
stateids.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Chuck Lever
aadc3bbea1 NFSD: Limit the number of concurrent async COPY operations
Nothing appears to limit the number of concurrent async COPY
operations that clients can start. In addition, AFAICT each async
COPY can copy an unlimited number of 4MB chunks, so can run for a
long time. Thus IMO async COPY can become a DoS vector.

Add a restriction mechanism that bounds the number of concurrent
background COPY operations. Start simple and try to be fair -- this
patch implements a per-namespace limit.

An async COPY request that occurs while this limit is exceeded gets
NFS4ERR_DELAY. The requesting client can choose to send the request
again after a delay or fall back to a traditional read/write style
copy.

If there is need to make the mechanism more sophisticated, we can
visit that in future patches.

Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Chuck Lever
9ed666eba4 NFSD: Async COPY result needs to return a write verifier
Currently, when NFSD handles an asynchronous COPY, it returns a
zero write verifier, relying on the subsequent CB_OFFLOAD callback
to pass the write verifier and a stable_how4 value to the client.

However, if the CB_OFFLOAD never arrives at the client (for example,
if a network partition occurs just as the server sends the
CB_OFFLOAD operation), the client will never receive this verifier.
Thus, if the client sends a follow-up COMMIT, there is no way for
the client to assess the COMMIT result.

The usual recovery for a missing CB_OFFLOAD is for the client to
send an OFFLOAD_STATUS operation, but that operation does not carry
a write verifier in its result. Neither does it carry a stable_how4
value, so the client /must/ send a COMMIT in this case -- which will
always fail because currently there's still no write verifier in the
COPY result.

Thus the server needs to return a normal write verifier in its COPY
result even if the COPY operation is to be performed asynchronously.

If the server recognizes the callback stateid in subsequent
OFFLOAD_STATUS operations, then obviously it has not restarted, and
the write verifier the client received in the COPY result is still
valid and can be used to assess a COMMIT of the copied data, if one
is needed.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
15392c8cd1 nfsd: avoid races with wake_up_var()
wake_up_var() needs a barrier after the important change is made in the
var and before wake_up_var() is called, else it is possible that a wake
up won't be sent when it should.

In each case here the var is changed in an "atomic" manner, so
smb_mb__after_atomic() is sufficient.

In one case the important change (removing the lease) is performed
*after* the wake_up, which is backwards.  The code survives in part
because the wait_var_event is given a timeout.

This patch adds the required barriers and calls destroy_delegation()
*before* waking any threads waiting for the delegation to be destroyed.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
985eeae9c8 nfsd: use clear_and_wake_up_bit()
nfsd has two places that open-code clear_and_wake_up_bit().  One has
the required memory barriers.  The other does not.

Change both to use clear_and_wake_up_bit() so we have the barriers
without the noise.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Thorsten Blum
2869b3a00e NFSD: Annotate struct pnfs_block_deviceaddr with __counted_by()
Add the __counted_by compiler attribute to the flexible array member
volumes to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.

Use struct_size() instead of manually calculating the number of bytes to
allocate for a pnfs_block_deviceaddr with a single volume.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Guoqing Jiang
d078cbf5c3 nfsd: call cache_put if xdr_reserve_space returns NULL
If not enough buffer space available, but idmap_lookup has triggered
lookup_fn which calls cache_get and returns successfully. Then we
missed to call cache_put here which pairs with cache_get.

Fixes: ddd1ea5636 ("nfsd4: use xdr_reserve_space in attribute encoding")
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviwed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Jeff Layton
ba017fd391 nfsd: add more nfsd_cb tracepoints
Add some tracepoints in the callback client RPC operations. Also
add a tracepoint to nfsd4_cb_getattr_done.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Jeff Layton
c1c9f3ea74 nfsd: track the main opcode for callbacks
Keep track of the "main" opcode for the callback, and display it in the
tracepoint. This makes it simpler to discern what's happening when there
is more than one callback in flight.

The one special case is the CB_NULL RPC. That's not a CB_COMPOUND
opcode, so designate the value 0 for that.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Jeff Layton
e8581a9124 nfsd: add more info to WARN_ON_ONCE on failed callbacks
Currently, you get the warning and stack trace, but nothing is printed
about the relevant error codes. Add that in.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Li Lingfeng
76a3f3f164 nfsd: fix some spelling errors in comments
Fix spelling errors in comments of nfsd4_release_lockowner and
nfs4_set_delegation.

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Li Lingfeng
eb059a413c nfsd: remove unused parameter of nfsd_file_mark_find_or_create
Commit 427f5f83a3 ("NFSD: Ensure nf_inode is never dereferenced") passes
inode directly to nfsd_file_mark_find_or_create instead of getting it from
nf, so there is no need to pass nf.

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Hongbo Li
c2feb7ee39 nfsd: use LIST_HEAD() to simplify code
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD().

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Li Lingfeng
340e61e44c nfsd: map the EBADMSG to nfserr_io to avoid warning
Ext4 will throw -EBADMSG through ext4_readdir when a checksum error
occurs, resulting in the following WARNING.

Fix it by mapping EBADMSG to nfserr_io.

nfsd_buffered_readdir
 iterate_dir // -EBADMSG -74
  ext4_readdir // .iterate_shared
   ext4_dx_readdir
    ext4_htree_fill_tree
     htree_dirblock_to_tree
      ext4_read_dirblock
       __ext4_read_dirblock
        ext4_dirblock_csum_verify
         warn_no_space_for_csum
          __warn_no_space_for_csum
        return ERR_PTR(-EFSBADCRC) // -EBADMSG -74
 nfserrno // WARNING

[  161.115610] ------------[ cut here ]------------
[  161.116465] nfsd: non-standard errno: -74
[  161.117315] WARNING: CPU: 1 PID: 780 at fs/nfsd/nfsproc.c:878 nfserrno+0x9d/0xd0
[  161.118596] Modules linked in:
[  161.119243] CPU: 1 PID: 780 Comm: nfsd Not tainted 5.10.0-00014-g79679361fd5d #138
[  161.120684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qe
mu.org 04/01/2014
[  161.123601] RIP: 0010:nfserrno+0x9d/0xd0
[  161.124676] Code: 0f 87 da 30 dd 00 83 e3 01 b8 00 00 00 05 75 d7 44 89 ee 48 c7 c7 c0 57 24 98 89 44 24 04 c6
 05 ce 2b 61 03 01 e8 99 20 d8 00 <0f> 0b 8b 44 24 04 eb b5 4c 89 e6 48 c7 c7 a0 6d a4 99 e8 cc 15 33
[  161.127797] RSP: 0018:ffffc90000e2f9c0 EFLAGS: 00010286
[  161.128794] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  161.130089] RDX: 1ffff1103ee16f6d RSI: 0000000000000008 RDI: fffff520001c5f2a
[  161.131379] RBP: 0000000000000022 R08: 0000000000000001 R09: ffff8881f70c1827
[  161.132664] R10: ffffed103ee18304 R11: 0000000000000001 R12: 0000000000000021
[  161.133949] R13: 00000000ffffffb6 R14: ffff8881317c0000 R15: ffffc90000e2fbd8
[  161.135244] FS:  0000000000000000(0000) GS:ffff8881f7080000(0000) knlGS:0000000000000000
[  161.136695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  161.137761] CR2: 00007fcaad70b348 CR3: 0000000144256006 CR4: 0000000000770ee0
[  161.139041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  161.140291] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  161.141519] PKRU: 55555554
[  161.142076] Call Trace:
[  161.142575]  ? __warn+0x9b/0x140
[  161.143229]  ? nfserrno+0x9d/0xd0
[  161.143872]  ? report_bug+0x125/0x150
[  161.144595]  ? handle_bug+0x41/0x90
[  161.145284]  ? exc_invalid_op+0x14/0x70
[  161.146009]  ? asm_exc_invalid_op+0x12/0x20
[  161.146816]  ? nfserrno+0x9d/0xd0
[  161.147487]  nfsd_buffered_readdir+0x28b/0x2b0
[  161.148333]  ? nfsd4_encode_dirent_fattr+0x380/0x380
[  161.149258]  ? nfsd_buffered_filldir+0xf0/0xf0
[  161.150093]  ? wait_for_concurrent_writes+0x170/0x170
[  161.151004]  ? generic_file_llseek_size+0x48/0x160
[  161.151895]  nfsd_readdir+0x132/0x190
[  161.152606]  ? nfsd4_encode_dirent_fattr+0x380/0x380
[  161.153516]  ? nfsd_unlink+0x380/0x380
[  161.154256]  ? override_creds+0x45/0x60
[  161.155006]  nfsd4_encode_readdir+0x21a/0x3d0
[  161.155850]  ? nfsd4_encode_readlink+0x210/0x210
[  161.156731]  ? write_bytes_to_xdr_buf+0x97/0xe0
[  161.157598]  ? __write_bytes_to_xdr_buf+0xd0/0xd0
[  161.158494]  ? lock_downgrade+0x90/0x90
[  161.159232]  ? nfs4svc_decode_voidarg+0x10/0x10
[  161.160092]  nfsd4_encode_operation+0x15a/0x440
[  161.160959]  nfsd4_proc_compound+0x718/0xe90
[  161.161818]  nfsd_dispatch+0x18e/0x2c0
[  161.162586]  svc_process_common+0x786/0xc50
[  161.163403]  ? nfsd_svc+0x380/0x380
[  161.164137]  ? svc_printk+0x160/0x160
[  161.164846]  ? svc_xprt_do_enqueue.part.0+0x365/0x380
[  161.165808]  ? nfsd_svc+0x380/0x380
[  161.166523]  ? rcu_is_watching+0x23/0x40
[  161.167309]  svc_process+0x1a5/0x200
[  161.168019]  nfsd+0x1f5/0x380
[  161.168663]  ? nfsd_shutdown_threads+0x260/0x260
[  161.169554]  kthread+0x1c4/0x210
[  161.170224]  ? kthread_insert_work_sanity_check+0x80/0x80
[  161.171246]  ret_from_fork+0x1f/0x30

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Li Lingfeng
2039c5da5d NFSD: remove redundant assignment operation
Commit 5826e09bf3 ("NFSD: OP_CB_RECALL_ANY should recall both read and
write delegations") added a new assignment statement to add
RCA4_TYPE_MASK_WDATA_DLG to ra_bmval bitmask of OP_CB_RECALL_ANY. So the
old one should be removed.

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Chuck Lever
202f39039a NFSD: Fix NFSv4's PUTPUBFH operation
According to RFC 8881, all minor versions of NFSv4 support PUTPUBFH.

Replace the XDR decoder for PUTPUBFH with a "noop" since we no
longer want the minorversion check, and PUTPUBFH has no arguments to
decode. (Ideally nfsd4_decode_noop should really be called
nfsd4_decode_void).

PUTPUBFH should now behave just like PUTROOTFH.

Reported-by: Cedric Blancher <cedric.blancher@gmail.com>
Fixes: e1a90ebd8b ("NFSD: Combine decode operations for v4 and v4.1")
Cc: Dan Shelton <dan.f.shelton@gmail.com>
Cc: Roland Mainz <roland.mainz@nrubsig.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Mark Grimes
32b34fa485 nfsd: Add quotes to client info 'callback address'
The 'callback address' in client_info_show is output without quotes
causing yaml parsers to fail on processing IPv6 addresses.
Adding quotes to 'callback address' also matches that used by
the 'address' field.

Signed-off-by: Mark Grimes <mark.grimes@ixsystems.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
438f81e0e9 nfsd: move error choice for incorrect object types to version-specific code.
If an NFS operation expects a particular sort of object (file, dir, link,
etc) but gets a file handle for a different sort of object, it must
return an error.  The actual error varies among NFS versions in non-trivial
ways.

For v2 and v3 there are ISDIR and NOTDIR errors and, for NFSv4 only,
INVAL is suitable.

For v4.0 there is also NFS4ERR_SYMLINK which should be used if a SYMLINK
was found when not expected.  This take precedence over NOTDIR.

For v4.1+ there is also NFS4ERR_WRONG_TYPE which should be used in
preference to EINVAL when none of the specific error codes apply.

When nfsd_mode_check() finds a symlink where it expected a directory it
needs to return an error code that can be converted to NOTDIR for v2 or
v3 but will be SYMLINK for v4.  It must be different from the error
code returns when it finds a symlink but expects a regular file - that
must be converted to EINVAL or SYMLINK.

So we introduce an internal error code nfserr_symlink_not_dir which each
version converts as appropriate.

nfsd_check_obj_isreg() is similar to nfsd_mode_check() except that it is
only used by NFSv4 and only for OPEN.  NFSERR_INVAL is never a suitable
error if the object is the wrong time.  For v4.0 we use nfserr_symlink
for non-dirs even if not a symlink.  For v4.1 we have nfserr_wrong_type.
We handle this difference in-place in nfsd_check_obj_isreg() as there is
nothing to be gained by delaying the choice to nfsd4_map_status().

As a result of these changes, nfsd_mode_check() doesn't need an rqstp
arg any more.

Note that NFSv4 operations are actually performed in the xdr code(!!!)
so to the only place that we can map the status code successfully is in
nfsd4_encode_operation().

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
36ffa3d0de nfsd: be more systematic about selecting error codes for internal use.
Rather than using ad hoc values for internal errors (30000, 11000, ...)
use 'enum' to sequentially allocate numbers starting from the first
known available number - now visible as NFS4ERR_FIRST_FREE.

The goal is values that are distinct from all be32 error codes.  To get
those we must first select integers that are not already used, then
convert them with cpu_to_be32().

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
1459ad5767 nfsd: Move error code mapping to per-version proc code.
There is code scattered around nfsd which chooses an error status based
on the particular version of nfs being used.  It is cleaner to have the
version specific choices in version specific code.

With this patch common code returns the most specific error code
possible and the version specific code maps that if necessary.

Both v2 (nfsproc.c) and v3 (nfs3proc.c) now have a "map_status()"
function which is called to map the resp->status before each non-trivial
nfsd_proc_* or nfsd3_proc_* function returns.

NFS4ERR_SYMLINK and NFS4ERR_WRONG_TYPE introduce extra complications and
are left for a later patch.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
ef7f6c4904 nfsd: move V4ROOT version check to nfsd_set_fh_dentry()
This further centralizes version number checks.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
c689bdd3bf nfsd: further centralize protocol version checks.
With this patch the only places that test ->rq_vers against a specific
version are nfsd_v4client() and nfsd_set_fh_dentry().
The latter sets some flags in the svc_fh, which now includes:
  fh_64bit_cookies
  fh_use_wgather

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
4f67d24f72 nfsd: use nfsd_v4client() in nfsd_breaker_owns_lease()
nfsd_breaker_owns_lease() currently open-codes the same test that
nfsd_v4client() performs.

With this patch we use nfsd_v4client() instead.

Also as i_am_nfsd() is only used in combination with kthread_data(),
replace it with nfsd_current_rqst() which combines the two and returns a
valid svc_rqst, or NULL.

The test for NULL is moved into nfsd_v4client() for code clarity.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
9fd45c16f3 nfsd: Pass 'cred' instead of 'rqstp' to some functions.
nfsd_permission(), exp_rdonly(), nfsd_setuser(), and nfsexp_flags()
only ever need the cred out of rqstp, so pass it explicitly instead of
the whole rqstp.

This makes the interfaces cleaner.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
c55aeef776 nfsd: Don't pass all of rqst into rqst_exp_find()
Rather than passing the whole rqst, pass the pieces that are actually
needed.  This makes the inputs to rqst_exp_find() more obvious.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
Sagi Grimberg
11673b2a91 nfsd: don't assume copy notify when preprocessing the stateid
Move the stateid handling to nfsd4_copy_notify.
If nfs4_preprocess_stateid_op did not produce an output stateid, error out.

Copy notify specifically does not permit the use of special stateids,
so enforce that outside generic stateid pre-processing.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
3391fc92db sunrpc: allow svc threads to fail initialisation cleanly
If an svc thread needs to perform some initialisation that might fail,
it has no good way to handle the failure.

Before the thread can exit it must call svc_exit_thread(), but that
requires the service mutex to be held.  The thread cannot simply take
the mutex as that could deadlock if there is a concurrent attempt to
shut down all threads (which is unlikely, but not impossible).

nfsd currently call svc_exit_thread() unprotected in the unlikely event
that unshare_fs_struct() fails.

We can clean this up by introducing svc_thread_init_status() by which an
svc thread can report whether initialisation has succeeded.  If it has,
it continues normally into the action loop.  If it has not,
svc_thread_init_status() immediately aborts the thread.
svc_start_kthread() waits for either of these to happen, and calls
svc_exit_thread() (under the mutex) if the thread aborted.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
60749cbe3d sunrpc: change sp_nrthreads from atomic_t to unsigned int.
sp_nrthreads is only ever accessed under the service mutex
  nlmsvc_mutex nfs_callback_mutex nfsd_mutex
so these is no need for it to be an atomic_t.

The fact that all code using it is single-threaded means that we can
simplify svc_pool_victim and remove the temporary elevation of
sp_nrthreads.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
73598a0cfb nfsd: don't allocate the versions array.
Instead of using kmalloc to allocate an array for storing active version
info, just declare an array to the max size - it is only 5 or so.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:29:23 -04:00
NeilBrown
c9f10f811c nfsd: move nfsd_pool_stats_open into nfsctl.c
nfsd_pool_stats_open() is used in nfsctl.c, so move it there.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01 10:04:57 -04:00
NeilBrown
8203ab8a9d nfsd: don't EXPORT_SYMBOL nfsd4_ssc_init_umount_work()
nfsd4_ssc_init_umount_work() is only used in the nfsd module, so there
is no need to EXPORT it.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01 10:04:56 -04:00
Youzhong Yang
4b84551a35 nfsd: use system_unbound_wq for nfsd_file_gc_worker()
After many rounds of changes in filecache.c, the fix by commit
ce7df055(NFSD: Make the file_delayed_close workqueue UNBOUND)
is gone, now we are getting syslog messages like these:

[ 1618.186688] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 4 times, consider switching to WQ_UNBOUND
[ 1638.661616] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 8 times, consider switching to WQ_UNBOUND
[ 1665.284542] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 16 times, consider switching to WQ_UNBOUND
[ 1759.491342] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 32 times, consider switching to WQ_UNBOUND
[ 3013.012308] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 64 times, consider switching to WQ_UNBOUND
[ 3154.172827] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 128 times, consider switching to WQ_UNBOUND
[ 3422.461924] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 256 times, consider switching to WQ_UNBOUND
[ 3963.152054] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 512 times, consider switching to WQ_UNBOUND

Consider use system_unbound_wq instead of system_wq for
nfsd_file_gc_worker().

Signed-off-by: Youzhong Yang <youzhong@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01 10:04:55 -04:00
Jeff Layton
700bb4ff91 nfsd: count nfsd_file allocations
We already count the frees (via nfsd_file_releases). Count the
allocations as well. Also switch the direct call to nfsd_file_slab_free
in nfsd_file_do_acquire to nfsd_file_free, so that the allocs and
releases match up.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01 10:04:54 -04:00
Jeff Layton
8a79261763 nfsd: fix refcount leak when file is unhashed after being found
If we wait_for_construction and find that the file is no longer hashed,
and we're going to retry the open, the old nfsd_file reference is
currently leaked. Put the reference before retrying.

Fixes: c6593366c0 ("nfsd: don't kill nfsd_files because of lease break error")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Youzhong Yang <youzhong@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01 10:04:54 -04:00
Jeff Layton
81a95c2b1d nfsd: remove unneeded EEXIST error check in nfsd_do_file_acquire
Given that we do the search and insertion while holding the i_lock, I
don't think it's possible for us to get EEXIST here. Remove this case.

Fixes: c6593366c0 ("nfsd: don't kill nfsd_files because of lease break error")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Youzhong Yang <youzhong@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01 10:04:54 -04:00
Youzhong Yang
8e6e2ffa65 nfsd: add list_head nf_gc to struct nfsd_file
nfsd_file_put() in one thread can race with another thread doing
garbage collection (running nfsd_file_gc() -> list_lru_walk() ->
nfsd_file_lru_cb()):

  * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add().
  * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED bit set)
  * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED bit and
    returns LRU_ROTATE.
  * garbage collector kicks in again, nfsd_file_lru_cb() now decrements nf->nf_ref
    to 0, runs nfsd_file_unhash(), removes it from the LRU and adds to the dispose
    list [list_lru_isolate_move(lru, &nf->nf_lru, head)]
  * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove
    the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The 'nf' has been added
    to the 'dispose' list by nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply
    treats it as part of the LRU and removes it, which leads to its removal from
    the 'dispose' list.
  * At this moment, 'nf' is unhashed with its nf_ref being 0, and not on the LRU.
    nfsd_file_put() continues its execution [if (refcount_dec_and_test(&nf->nf_ref))],
    as nf->nf_ref is already 0, nf->nf_ref is set to REFCOUNT_SATURATED, and the 'nf'
    gets no chance of being freed.

nfsd_file_put() can also race with nfsd_file_cond_queue():
  * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add().
  * nfsd_file_lru_add() sets REFERENCED bit and returns true.
  * Some userland application runs 'exportfs -f' or something like that, which triggers
    __nfsd_file_cache_purge() -> nfsd_file_cond_queue().
  * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], unhash is done
    successfully.
  * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now nf->nf_ref goes to 2.
  * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds.
  * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, &nf->nf_ref))]
    (with "decrement" being 2), so the nf->nf_ref goes to 0, the 'nf' is added to the
    dispose list [list_add(&nf->nf_lru, dispose)]
  * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove
    the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], although the 'nf' is not
    in the LRU, but it is linked in the 'dispose' list, nfsd_file_lru_remove() simply
    treats it as part of the LRU and removes it. This leads to its removal from
    the 'dispose' list!
  * Now nf->ref is 0, unhashed. nfsd_file_put() continues its execution and set
    nf->nf_ref to REFCOUNT_SATURATED.

As shown in the above analysis, using nf_lru for both the LRU list and dispose list
can cause the leaks. This patch adds a new list_head nf_gc in struct nfsd_file, and uses
it for the dispose list. This does not fix the nfsd_file leaking issue completely.

Signed-off-by: Youzhong Yang <youzhong@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01 10:04:53 -04:00
Linus Torvalds
6a2fcc51a7 nfsd-6.11 fixes:
- One more write delegation fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmbTO2QACgkQM2qzM29m
 f5d6Jg/+L8ltg5iGzdgwZYoOrlhS7sz4Y/BcOViNZ25we0J+kFaauycCyMCG9wS1
 o1NXAZ8d1lvDTZI8Bw7rzWl1IS2mjfg1NX8t5MhVUxrkus40jjwip9VPYRegQhBT
 WZ/ggaudZinc/+i2toR7eY3wJe/PqOWeML4XWbx//tinfLnlC62UKMudOvaXk3B8
 8y0nGWQaJEuaZuFuA9FFOs7MHgR50rSevOdk90avBqFYBVvq2wA6ZvKw0TbH47Q6
 BbELVbIqlFOSfui/w+DQXqGm7SYMOUkaLsPLspXXlDBR0myjORlQ8Ch6alaWp9pd
 2yAGlYNalTJVlJt/2Uqu4USPZuUK9Ijd+2TNg1ObCdRFzpRVmQDU/wzv8A0DWNdI
 MbiwX2ckwUt3u2nh+DHWagSKcuxcRR908YwEHs3/rAmcZDSWiZdJtDZ3NiBKNZrD
 KHYdEOl5rl5P7bi6VcaR8gYREbKiq6BISo7ru3Ix7ImIQD87a/x393/tkOutw8bM
 VfIEYcnsbqlTs07KVUZ2jcIziFrttPmh5rs8qfDHsk899bzR1CBkQedwZAUD0Ghu
 dmvKebXSoLc2sWli5CcrfkWxkjRuIuSQMOPnY9RrRFFaNXBYC3JA7EUWsvbXsX0x
 WSuZPlS9Jv6bCdgvBMAIjTA/uxShLeEf33GIcKK9iI0mASKwXHY=
 =uoNK
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - One more write delegation fix

* tag 'nfsd-6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix nfsd4_deleg_getattr_conflict in presence of third party lease
2024-09-01 06:55:47 +12:00
NeilBrown
40927f3d09 nfsd: fix nfsd4_deleg_getattr_conflict in presence of third party lease
It is not safe to dereference fl->c.flc_owner without first confirming
fl->fl_lmops is the expected manager.  nfsd4_deleg_getattr_conflict()
tests fl_lmops but largely ignores the result and assumes that flc_owner
is an nfs4_delegation anyway.  This is wrong.

With this patch we restore the "!= &nfsd_lease_mng_ops" case to behave
as it did before the change mentioned below.  This is the same as the
current code, but without any reference to a possible delegation.

Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-08-30 10:48:29 -04:00
Linus Torvalds
a18093afa3 nfsd-6.11 fixes:
- Fix a number of crashers
 - Update email address for an NFSD reviewer
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmbPLJQACgkQM2qzM29m
 f5edLBAAveHRyP1XKgLIFtIX40GxK029NEHT2vq9ageaLFsFGylAaAgXSB5rUXER
 3FfhOrNuAr/ZAV2bqm/qnA3yDuh/OuD9MIxEA+C2cNJ4bjEKXaqq5iePTK+AeNGH
 cnvE3UlI9rQkflpuZxbliJZm+u+mxaKnMO2riFbZeKrQh5C6Mn6z4fXp88CZj65U
 7oDeONqyjMEtkjJPutzJZr0gJbPjeGgZlrsVgMMg/nki3y+Fal6Rt0hDO2u9evV9
 3zTyJ7S7yUnsZ0b2JTx061EfJLd7KFmefWG4UKRKYk1XtiDbHt/cUSi4/QBA0EWw
 6VK5aJWUF2OwUpkAYohU0o4/qApoce1raR0cpwrRwzLINDdwPTkfz9L8dpjRPJ68
 ubUhWP7D/xASD5RhSrbH6lG8XY3ISXkRk2knwKXFtJtq2uIz2Gxc6F1gzzPE/whR
 N6JxdiMrUaKoO4qiwOtrXwiYACR9+qsVSW/QxNG1xdQhNXyLb5L6uUSe384MPybw
 nVIkrOdwO4uors6DzkFoHI/Au2QrTi9pzn53PK01RkhrJKUIS3lMPBnPbf5RAIQN
 EowYaJNSm52Px+omXzZzHoOgI0h0P3UWiuXoui1Zcy/xCT6uEbj5QtKxM6iKSXyd
 4KU3DnC+9nr6+3ld47MFn38f7gmPNrvB56XMCDQKmr9x9OmtB5k=
 =fRqF
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix a number of crashers

 - Update email address for an NFSD reviewer

* tag 'nfsd-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  fs/nfsd: fix update of inode attrs in CB_GETATTR
  nfsd: fix potential UAF in nfsd4_cb_getattr_release
  nfsd: hold reference to delegation when updating it for cb_getattr
  MAINTAINERS: Update Olga Kornievskaia's email address
  nfsd: prevent panic for nfsv4.0 closed files in nfs4_show_open
  nfsd: ensure that nfsd4_fattr_args.context is zeroed out
2024-08-29 06:20:44 +12:00
Jeff Layton
7e8ae8486e fs/nfsd: fix update of inode attrs in CB_GETATTR
Currently, we copy the mtime and ctime to the in-core inode and then
mark the inode dirty. This is fine for certain types of filesystems, but
not all. Some require a real setattr to properly change these values
(e.g. ceph or reexported NFS).

Fix this code to call notify_change() instead, which is the proper way
to effect a setattr. There is one problem though:

In this case, the client is holding a write delegation and has sent us
attributes to update our cache. We don't want to break the delegation
for this since that would defeat the purpose. Add a new ATTR_DELEG flag
that makes notify_change bypass the try_break_deleg call.

Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-08-26 19:04:00 -04:00
Jeff Layton
1116e0e372 nfsd: fix potential UAF in nfsd4_cb_getattr_release
Once we drop the delegation reference, the fields embedded in it are no
longer safe to access. Do that last.

Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-08-26 11:53:05 -04:00
Jeff Layton
da05ba23d4 nfsd: hold reference to delegation when updating it for cb_getattr
Once we've dropped the flc_lock, there is nothing that ensures that the
delegation that was found will still be around later. Take a reference
to it while holding the lock and then drop it when we've finished with
the delegation.

Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-08-26 11:52:40 -04:00
Olga Kornievskaia
a204501e17 nfsd: prevent panic for nfsv4.0 closed files in nfs4_show_open
Prior to commit 3f29cc82a8 ("nfsd: split sc_status out of
sc_type") states_show() relied on sc_type field to be of valid
type before calling into a subfunction to show content of a
particular stateid. From that commit, we split the validity of
the stateid into sc_status and no longer changed sc_type to 0
while unhashing the stateid. This resulted in kernel oopsing
for nfsv4.0 opens that stay around and in nfs4_show_open()
would derefence sc_file which was NULL.

Instead, for closed open stateids forgo displaying information
that relies of having a valid sc_file.

To reproduce: mount the server with 4.0, read and close
a file and then on the server cat /proc/fs/nfsd/clients/2/states

[  513.590804] Call trace:
[  513.590925]  _raw_spin_lock+0xcc/0x160
[  513.591119]  nfs4_show_open+0x78/0x2c0 [nfsd]
[  513.591412]  states_show+0x44c/0x488 [nfsd]
[  513.591681]  seq_read_iter+0x5d8/0x760
[  513.591896]  seq_read+0x188/0x208
[  513.592075]  vfs_read+0x148/0x470
[  513.592241]  ksys_read+0xcc/0x178

Fixes: 3f29cc82a8 ("nfsd: split sc_status out of sc_type")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-08-23 12:25:46 -04:00
Jeff Layton
f58bab6fd4 nfsd: ensure that nfsd4_fattr_args.context is zeroed out
If nfsd4_encode_fattr4 ends up doing a "goto out" before we get to
checking for the security label, then args.context will be set to
uninitialized junk on the stack, which we'll then try to free.
Initialize it early.

Fixes: f59388a579 ("NFSD: Add nfsd4_encode_fattr4_sec_label()")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-08-22 14:49:10 -04:00
Linus Torvalds
5189dafa4c nfsd-6.11 fixes:
- Two minor fixes for recent changes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAma3pFoACgkQM2qzM29m
 f5f4dA//SiPOS7v60jQubfV3f031E+W3a6YxvB5tqByA23hOavjYowGREyvhtaS5
 lgqRO25BXz7uNOjSWUXaV7SsR1GKElWtk/gGVsfYy03A91Qvafb24oapJRwRXeJ1
 5n6X5AScxJeoDGEIUMoGU3O87v4T6UoTj5K6bjUDqr5kTRTL6rPFj3jYvy0YVLe3
 EsA1P+BLctB1XiboMYNfZyEJbxNF5gpO7OaeS8VPQIeNYhaF1cwsa7A9wDJ22UpB
 BQEguTjPqgzkthOdkc3jNcoXtDUEzAXxQqACKR8CU6cCoyr159KupSlci4LqA2o4
 8CwcVd01L79JjGcrZqo+4vmbuY6rfhN7U/EYcNNJPZKUIo7Q5h4dq/pCaBADK2Ke
 UJVPsmdYXKMwn/pPJI9kjoellqSrjOxqFDXXqBb/e8Z/xXuXmiYRl6E+dtGeBlty
 j4qm12d1BGKs7zRaLLThbFTH6s3kUCjVhEr+uphse6CVF6HV/NNK79jooHrrgCiS
 e1vgRaYIgzMREOyq8fZwvINDuk/8Pa1Z4wlR65orwVzWlZMmk9bLm+wwplcyj34x
 una8RrQhxlicY5nAGUa+aRCkKUtO9oSPs6hukoT+DB06+mhQ7u3ro96ojueMfMdQ
 pKux8zrCdJe7jZUlnIW30/QsrgWhHW9KsrdUA9eKQ1FoZIOdS9I=
 =YX9E
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Two minor fixes for recent changes

* tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: don't set SVC_SOCK_ANONYMOUS when creating nfsd sockets
  sunrpc: avoid -Wformat-security warning
2024-08-10 10:44:21 -07:00
Jeff Layton
91da337e5d nfsd: don't set SVC_SOCK_ANONYMOUS when creating nfsd sockets
When creating nfsd sockets via the netlink interface, we do want to
register with the portmapper. Don't set SVC_SOCK_ANONYMOUS.

Reported-by: Steve Dickson <steved@redhat.com>
Fixes: 16a4711774 ("NFSD: add listener-{set,get} netlink command")
Cc: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-22 09:47:39 -04:00
Linus Torvalds
586a7a8542 NFSD 6.11 Release Notes
This is a light release containing optimizations, code clean-ups,
 and minor bug fixes. This development cycle focused on work outside
 of upstream kernel development:
 
 1. Continuing to build upstream CI for NFSD based on kdevops
 2. Continuing to focus on the quality of NFSD in LTS kernels
 3. Participation in IETF nfsv4 WG discussions about NFSv4 ACLs,
    directory delegation, and NFSv4.2 COPY offload
 
 Notable features in v6.11 that were not pulled through the NFSD tree
 include NFS server-side support for the new pNFS NVMe layout type
 [RFC9561]. Functional testing for pNFS block layouts like this one
 has been introduced to our kdevops CI harness. Work on improving
 the resolution of file attribute time stamps in local filesystems
 is also ongoing tree-wide.
 
 As always I am grateful to NFSD contributors, reviewers, testers,
 and bug reporters who participated during this cycle.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmaVM0cACgkQM2qzM29m
 f5fzOQ//c5CXIF3zCLIUofm5eZSP2zIszmHR75rVTEnf0Ehm2BJRF6VZiTvWXRpz
 bOuswxfV1Bds+TofbPIP8jqDcMp8NIXemdb6+QMwh4FDY4M8t1v6TRYt35L6Ulrq
 bSV81aRS622ofQ35sRzwxpGX6rB6YbB+5L4EKuxdEqRKSB8rCxQcjPy2qypcWlRC
 hEAGDe3IiVxTz4VQBpASRqbH9Udw/XEqIhv5c8aLtPvl8i+yWyV5m2G5FMRdBj49
 u8rCLoPi/mON8TDs2U4pbhcdgfBWWvGS6woFp6qrqM0wzXIPLalWsPGK3DUtuFUg
 onxsClJXMWUvW4k4hbjiqosduLGY/kMeX62Lx1dCj/gktrJpU0GDNR/XbBhHU+hj
 UT2CL8AfedC4FQekdyJri/rDgPiTMsf8UE0lgtchHMUXH0ztrjaRxMGiIFMm5vCl
 dJBMGJfCkKR/+U1YrGRQI0tPL8CJKYI8klOEhLoOsCr/WC9p4nvvAzSg4W9mNK5P
 ni4+KU4f/bj8U0Ap2bUacTpXj6W8VcwJWeuHahVA1Slo+eqXO401hj4W88dQmm9O
 ZDR5h+6PI6KoL/KL6I4EyOv+sIEtW3s18cEWbSSu3N/CPuhSGTx8d2J201shJXRN
 uDdMkvbwv48x20pgD2oTkPrZbJHOL3BK5/WPBg7pwpfkoRrBAhY=
 =Xd5e
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "This is a light release containing optimizations, code clean-ups, and
  minor bug fixes.

  This development cycle focused on work outside of upstream kernel
  development:

   - Continuing to build upstream CI for NFSD based on kdevops

   - Continuing to focus on the quality of NFSD in LTS kernels

   - Participation in IETF nfsv4 WG discussions about NFSv4 ACLs,
     directory delegation, and NFSv4.2 COPY offload

  Notable features for v6.11 that do not come through the NFSD tree
  include NFS server-side support for the new pNFS NVMe layout type
  [RFC9561]. Functional testing for pNFS block layouts like this one has
  been introduced to our kdevops CI harness. Work on improving the
  resolution of file attribute time stamps in local filesystems is also
  ongoing tree-wide.

  As always I am grateful to NFSD contributors, reviewers, testers, and
  bug reporters who participated during this cycle"

* tag 'nfsd-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: nfsd_file_lease_notifier_call gets a file_lease as an argument
  gss_krb5: Fix the error handling path for crypto_sync_skcipher_setkey
  MAINTAINERS: Add a bugzilla link for NFSD
  nfsd: new netlink ops to get/set server pool_mode
  sunrpc: refactor pool_mode setting code
  nfsd: allow passing in array of thread counts via netlink
  nfsd: make nfsd_svc take an array of thread counts
  sunrpc: fix up the special handling of sv_nrpools == 1
  SUNRPC: Add a trace point in svc_xprt_deferred_close
  NFSD: Support write delegations in LAYOUTGET
  lockd: Use *-y instead of *-objs in Makefile
  NFSD: Fix nfsdcld warning
  svcrdma: Handle ADDR_CHANGE CM event properly
  svcrdma: Refactor the creation of listener CMA ID
  NFSD: remove unused structs 'nfsd3_voidargs'
  NFSD: harden svcxdr_dupstr() and svcxdr_tmpalloc() against integer overflows
2024-07-17 12:00:49 -07:00
Linus Torvalds
b051320d6a vfs-6.11.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEF0AAKCRCRxhvAZXjc
 oq0TAQDjfTLN75RwKQ34RIFtRun2q+OMfBQtSegtaccqazghyAD/QfmPuZDxB5DL
 rsI/5k5O4VupIKrEdIaqvNxmkmDsSAc=
 =bf7E
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Support passing NULL along AT_EMPTY_PATH for statx().

     NULL paths with any flag value other than AT_EMPTY_PATH go the
     usual route and end up with -EFAULT to retain compatibility (Rust
     is abusing calls of the sort to detect availability of statx)

     This avoids path lookup code, lockref management, memory allocation
     and in case of NULL path userspace memory access (which can be
     quite expensive with SMAP on x86_64)

   - Don't block i_writecount during exec. Remove the
     deny_write_access() mechanism for executables

   - Relax open_by_handle_at() permissions in specific cases where we
     can prove that the caller had sufficient privileges to open a file

   - Switch timespec64 fields in struct inode to discrete integers
     freeing up 4 bytes

  Fixes:

   - Fix false positive circular locking warning in hfsplus

   - Initialize hfs_inode_info after hfs_alloc_inode() in hfs

   - Avoid accidental overflows in vfs_fallocate()

   - Don't interrupt fallocate with EINTR in tmpfs to avoid constantly
     restarting shmem_fallocate()

   - Add missing quote in comment in fs/readdir

  Cleanups:

   - Don't assign and test in an if statement in mqueue. Move the
     assignment out of the if statement

   - Reflow the logic in may_create_in_sticky()

   - Remove the usage of the deprecated ida_simple_xx() API from procfs

   - Reject FSCONFIG_CMD_CREATE_EXCL requets that depend on the new
     mount api early

   - Rename variables in copy_tree() to make it easier to understand

   - Replace WARN(down_read_trylock, ...) abuse with proper asserts in
     various places in the VFS

   - Get rid of user_path_at_empty() and drop the empty argument from
     getname_flags()

   - Check for error while copying and no path in one branch in
     getname_flags()

   - Avoid redundant smp_mb() for THP handling in do_dentry_open()

   - Rename parent_ino to d_parent_ino and make it use RCU

   - Remove unused header include in fs/readdir

   - Export in_group_capable() helper and switch f2fs and fuse over to
     it instead of open-coding the logic in both places"

* tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  ipc: mqueue: remove assignment from IS_ERR argument
  vfs: rename parent_ino to d_parent_ino and make it use RCU
  vfs: support statx(..., NULL, AT_EMPTY_PATH, ...)
  stat: use vfs_empty_path() helper
  fs: new helper vfs_empty_path()
  fs: reflow may_create_in_sticky()
  vfs: remove redundant smp_mb for thp handling in do_dentry_open
  fuse: Use in_group_or_capable() helper
  f2fs: Use in_group_or_capable() helper
  fs: Export in_group_or_capable()
  vfs: reorder checks in may_create_in_sticky
  hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode()
  proc: Remove usage of the deprecated ida_simple_xx() API
  hfsplus: fix to avoid false alarm of circular locking
  Improve readability of copy_tree
  vfs: shave a branch in getname_flags
  vfs: retire user_path_at_empty and drop empty arg from getname_flags
  vfs: stop using user_path_at_empty in do_readlinkat
  tmpfs: don't interrupt fallocate with EINTR
  fs: don't block i_writecount during exec
  ...
2024-07-15 10:52:51 -07:00
Jeff Layton
769d20028f nfsd: nfsd_file_lease_notifier_call gets a file_lease as an argument
"data" actually refers to a file_lease and not a file_lock. Both structs
have their file_lock_core as the first field though, so this bug should
be harmless without struct randomization in play.

Reported-by: Florian Evers <florian-evers@gmx.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219008
Fixes: 05580bbfc6 ("nfsd: adapt to breakup of struct file_lock")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Florian Evers <florian-evers@gmx.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-12 12:58:48 -04:00
Jeff Layton
00506072d7 nfsd: new netlink ops to get/set server pool_mode
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:05 -04:00
Jeff Layton
7f5c330b26 nfsd: allow passing in array of thread counts via netlink
Now that nfsd_svc can handle an array of thread counts, fix up the
netlink threads interface to construct one from the netlink call
and pass it through so we can start a pooled server the same way we
would start a normal one.

Note that any unspecified values in the array are considered zeroes,
so it's possible to shut down a pooled server by passing in a short
array that has only zeros, or even an empty array.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:05 -04:00
Jeff Layton
b4d8f22891 nfsd: make nfsd_svc take an array of thread counts
Now that the refcounting is fixed, rework nfsd_svc to use the same
thread setup as the pool_threads interface. Have it take an array of
thread counts instead of just a single value, and pass that from the
netlink threads set interface.  Since the new netlink interface doesn't
have the same restriction as pool_threads, move the guard against
shutting down all threads to write_pool_threads.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:04 -04:00
Chuck Lever
abc02e5602 NFSD: Support write delegations in LAYOUTGET
I noticed LAYOUTGET(LAYOUTIOMODE4_RW) returning NFS4ERR_ACCESS
unexpectedly. The NFS client had created a file with mode 0444, and
the server had returned a write delegation on the OPEN(CREATE). The
client was requesting a RW layout using the write delegation stateid
so that it could flush file modifications.

Creating a read-only file does not seem to be problematic for
NFSv4.1 without pNFS, so I began looking at NFSD's implementation of
LAYOUTGET.

The failure was because fh_verify() was doing a permission check as
part of verifying the FH presented during the LAYOUTGET. It uses the
loga_iomode value to specify the @accmode argument to fh_verify().
fh_verify(MAY_WRITE) on a file whose mode is 0444 fails with -EACCES.

To permit LAYOUT* operations in this case, add OWNER_OVERRIDE when
checking the access permission of the incoming file handle for
LAYOUTGET and LAYOUTCOMMIT.

Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # v6.6+
Message-Id: 4E9C0D74-A06D-4DC3-A48A-73034DC40395@oracle.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:03 -04:00
Chuck Lever
18a5450684 NFSD: Fix nfsdcld warning
Since CONFIG_NFSD_LEGACY_CLIENT_TRACKING is a new config option, its
initial default setting should have been Y (if we are to follow the
common practice of "default Y, wait, default N, wait, remove code").

Paul also suggested adding a clearer remedy action to the warning
message.

Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Message-Id: <d2ab4ee7-ba0f-44ac-b921-90c8fa5a04d2@molgen.mpg.de>
Fixes: 74fd48739d ("nfsd: new Kconfig option for legacy client tracking")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:02 -04:00
Dr. David Alan Gilbert
022d0574ef NFSD: remove unused structs 'nfsd3_voidargs'
'nfsd3_voidargs' in nfs[23]acl.c is unused since
commit 788f7183fb ("NFSD: Add common helpers to decode void args and
encode void results").

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:01 -04:00
Dan Carpenter
dbc834e5db NFSD: harden svcxdr_dupstr() and svcxdr_tmpalloc() against integer overflows
These lengths come from xdr_stream_decode_u32() and so we should be a
bit careful with them.  Use size_add() and struct_size() to avoid
integer overflows.  Saving size_add()/struct_size() results to a u32 is
unsafe because it truncates away the high bits.

Also generally storing sizes in longs is safer.  Most systems these days
use 64 bit CPUs.  It's harder for an addition to overflow 64 bits than
it is to overflow 32 bits.  Also functions like vmalloc() can
successfully allocate UINT_MAX bytes, but nothing can allocate ULONG_MAX
bytes.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:01 -04:00
Linus Torvalds
6c0483dbfe nfsd-6.10 fixes:
- Due to a late review, revert and re-fix a recent crasher fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmZ+zksACgkQM2qzM29m
 f5ey0Q/+O8Z//rVg8L8V1EI44gZ1LRTiJEeDFoJU1Hktn0dOwTvGQJNzNbfpTBbT
 zTvvDBd9+cbHQ7Jl2r73+wkqgZbGXsRhMRUCmlA5ur0O7ujkYDo3rpOgQhP4V3LV
 1G6o2uOJT9d9UykUF74Demlg97kNrc8CY1QNujXjKrj8jYpQY4KcRCDb91KdYgc9
 GSque9zMD2jTJkTkSVRUyfWGEDvYtDxs7/4rdErjixaY30dgkPHae6euRzoQd2sC
 v3aoKHSt5Y/pvOX+3/c2QzDqOiKPOnm3dLazB+rK6lYrnt4iUeP3CiPcFmqNJZk2
 kZ3hpCPN8Le3oQFlMe450EljB8wLj2DqLix+Y/iOGyA732GLxNW6Hu8PkwRVFFwx
 cGGp6KryrcwO/BNHgxceErfhBG4IPKSgnY6C8Io+mPYKwSRAOKA4su02RoJDWJkA
 PTf67JgZiKv+bn3EGojeTo9YAvb04EYstttK2yBFUR9xylyew9vaZ9X8mYkqfyHN
 pZEWf/SI8SExZteo3Ymx9HLcX5FYsBjgvigpow33kUCn2hMhuhfBXEtmLNsxOSYC
 aEICcAYqnBovFJ+3vbz11IDS04ERIFdprlLQuExtJ21pujIlIM9pOxrIGvgu5pPl
 ++rsDP31Kh6imuZbSaJ5TFTdtDqcpah587m+mQtlaXAjOQL1qfw=
 =gwat
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Due to a late review, revert and re-fix a recent crasher fix

* tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  Revert "nfsd: fix oops when reading pool_stats before server is started"
  nfsd: initialise nfsd_info.mutex early.
2024-06-28 09:32:33 -07:00
NeilBrown
e0011bca60 nfsd: initialise nfsd_info.mutex early.
nfsd_info.mutex can be dereferenced by svc_pool_stats_start()
immediately after the new netns is created.  Currently this can
trigger an oops.

Move the initialisation earlier before it can possibly be dereferenced.

Fixes: 7b207ccd98 ("svc: don't hold reference for poolstats, only mutex.")
Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Closes: https://lore.kernel.org/all/c2e9f6de-1ec4-4d3a-b18d-d5a6ec0814a0@linux.ibm.com/
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-06-25 10:15:39 -04:00
Linus Torvalds
c2fc946223 nfsd-6.10 fixes:
- Fix crashes triggered by administrative operations on the server
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmZ2/VgACgkQM2qzM29m
 f5d+tA/+OoNtrNfaGrrXjmKte0eLLQzX4o1IbvQPcydqEQwHFIpCNXGQXdzrjYzg
 ChICf2/IKH7Dl9dOTp0QMsZbTnGORDHcImjaWBck//ObYopLOk8e+pJs2VypX+uE
 O5gIFnpFapP2IZRFiEolzOx5x73wWE7/tJYMzL5sDsV/bHavIR3abH/hBkF/1E7N
 wSCcZuBfGm0pRWJS+KI8Bwa+dVbpq80Y8ITZ3S+gQEH0nHTEw9VxpmV0oQjQinV5
 aCBegPXSBDuc2SX7KNLoHVqVX9x/htAnV5BcjbViEqhyakoa+ANIJb7LaOx/lZ2J
 9CB1lP1FPmw0AVwu4krvdnJncIGlZJPEK7eLv5cAaxNK6jb48Gv0pul5tvMphEym
 +6qw0bqalXvsvoSQFMeidVUvvkDC2fxqHBI0N8w5LVKzbYiv3JUTy3WLt/f6sd0F
 nXWgUhYGgMT62KBxyEI0f2ip38Qb2JjRrSWcKXi5D/2TfLgL5GqbRloFmX8nx6+p
 9RuYeFFNGzS6qWfYljR41lTgkD0nU0/MF9GiRn+8JvqHrACXi3z9oQ8V5jlye+HW
 PrpBvYDqrDMzBiH22H095jtgo7YK+xib87wy9ql4BxrygHCg55czLiSTZliHC1iA
 z8dGtwusFayuTaD21FytD00DkVDJQAlsahhQPfyaUOR0mcrWVR4=
 =LCB+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix crashes triggered by administrative operations on the server

* tag 'nfsd-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: grab nfsd_mutex in nfsd_nl_rpc_status_get_dumpit()
  nfsd: fix oops when reading pool_stats before server is started
2024-06-22 13:55:56 -07:00
Lorenzo Bianconi
da2c8fef13 NFSD: grab nfsd_mutex in nfsd_nl_rpc_status_get_dumpit()
Grab nfsd_mutex lock in nfsd_nl_rpc_status_get_dumpit routine and remove
nfsd_nl_rpc_status_get_start() and nfsd_nl_rpc_status_get_done(). This
patch fix the syzbot log reported below:

INFO: task syz-executor.1:17770 blocked for more than 143 seconds.
      Not tainted 6.10.0-rc3-syzkaller-00022-gcea2a26553ac #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor.1  state:D stack:23800 pid:17770 tgid:17767 ppid:11381  flags:0x00000006
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5408 [inline]
 __schedule+0x17e8/0x4a20 kernel/sched/core.c:6745
 __schedule_loop kernel/sched/core.c:6822 [inline]
 schedule+0x14b/0x320 kernel/sched/core.c:6837
 schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:6894
 __mutex_lock_common kernel/locking/mutex.c:684 [inline]
 __mutex_lock+0x6a4/0xd70 kernel/locking/mutex.c:752
 nfsd_nl_listener_get_doit+0x115/0x5d0 fs/nfsd/nfsctl.c:2124
 genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
 genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
 genl_rcv_msg+0xb16/0xec0 net/netlink/genetlink.c:1210
 netlink_rcv_skb+0x1e5/0x430 net/netlink/af_netlink.c:2564
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
 netlink_unicast+0x7ec/0x980 net/netlink/af_netlink.c:1361
 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0x223/0x270 net/socket.c:745
 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
 ___sys_sendmsg net/socket.c:2639 [inline]
 __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f24ed27cea9
RSP: 002b:00007f24ee0080c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f24ed3b3f80 RCX: 00007f24ed27cea9
RDX: 0000000000000000 RSI: 0000000020000100 RDI: 0000000000000005
RBP: 00007f24ed2ebff4 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000

Fixes: 1bd773b4f0 ("nfsd: hold nfsd_mutex across entire netlink operation")
Fixes: bd9d6a3efa ("NFSD: add rpc_status netlink support")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-06-17 13:16:49 -04:00
Christian Brauner
620c266f39 fhandle: relax open_by_handle_at() permission checks
A current limitation of open_by_handle_at() is that it's currently not possible
to use it from within containers at all because we require CAP_DAC_READ_SEARCH
in the initial namespace. That's unfortunate because there are scenarios where
using open_by_handle_at() from within containers.

Two examples:

(1) cgroupfs allows to encode cgroups to file handles and reopen them with
    open_by_handle_at().
(2) Fanotify allows placing filesystem watches they currently aren't usable in
    containers because the returned file handles cannot be used.

Here's a proposal for relaxing the permission check for open_by_handle_at().

(1) Opening file handles when the caller has privileges over the filesystem
    (1.1) The caller has an unobstructed view of the filesystem.
    (1.2) The caller has permissions to follow a path to the file handle.

This doesn't address the problem of opening a file handle when only a portion
of a filesystem is exposed as is common in containers by e.g., bind-mounting a
subtree. The proposal to solve this use-case is:

(2) Opening file handles when the caller has privileges over a subtree
    (2.1) The caller is able to reach the file from the provided mount fd.
    (2.2) The caller has permissions to construct an unobstructed path to the
          file handle.
    (2.3) The caller has permissions to follow a path to the file handle.

The relaxed permission checks are currently restricted to directory file
handles which are what both cgroupfs and fanotify need. Handling disconnected
non-directory file handles would lead to a potentially non-deterministic api.
If a disconnected non-directory file handle is provided we may fail to decode
a valid path that we could use for permission checking. That in itself isn't a
problem as we would just return EACCES in that case. However, confusion may
arise if a non-disconnected dentry ends up in the cache later and those opening
the file handle would suddenly succeed.

* It's potentially possible to use timing information (side-channel) to infer
  whether a given inode exists. I don't think that's particularly
  problematic. Thanks to Jann for bringing this to my attention.

* An unrelated note (IOW, these are thoughts that apply to
  open_by_handle_at() generically and are unrelated to the changes here):
  Jann pointed out that we should verify whether deleted files could
  potentially be reopened through open_by_handle_at(). I don't think that's
  possible though.

  Another potential thing to check is whether open_by_handle_at() could be
  abused to open internal stuff like memfds or gpu stuff. I don't think so
  but I haven't had the time to completely verify this.

This dates back to discussions Amir and I had quite some time ago and thanks to
him for providing a lot of details around the export code and related patches!

Link: https://lore.kernel.org/r/20240524-vfs-open_by_handle_at-v1-1-3d4b7d22736b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-28 15:57:23 +02:00
Steven Rostedt (Google)
2c92ca849f tracing/treewide: Remove second parameter of __assign_str()
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.

This means that with:

  __string(field, mystring)

Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.

There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:

  git grep -l __assign_str | while read a ; do
      sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
      mv /tmp/test-file $a;
  done

I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.

Note, the same updates will need to be done for:

  __assign_str_len()
  __assign_rel_str()
  __assign_rel_str_len()

I tested this with both an allmodconfig and an allyesconfig (build only for both).

[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/

Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22 20:14:47 -04:00
Linus Torvalds
5af9d1cf39 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmZLJS0ACgkQnJ2qBz9k
 QNmFlggAlIg5oDZfOhJur6h3Icldrl2DsnKer0CAP7TFK+GfkFTEb25paoydBEu4
 Y0VzZ3n3EqhmsJ8P515k1UPPPXlqqZwSRWGAek0FDhQCXhqEYxiWwf9U343hJNBS
 rya4Rnwc1pxqmJU2hrY5R5kEbugUFAIL+qNXzhhLpWonYiy/ya7P5n/qz5F5HJH2
 FufRRaPHcHFfk1u0+PvFrk019AS9C6Y3bkcUGtbpdwmFsuN3D4HKuLEkr1+C9Apb
 NmkoAwCiSobQhAxGDr6Szqu6r1VCuM+n/O9fqLknnL9u0jm95AmGdIMOdQ/ofx6d
 xn3mfRp8gUbPD8PubHhQsMjCmSjGwg==
 =kwWW
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:

 - reduce overhead of fsnotify infrastructure when no permission events
   are in use

 - a few small cleanups

* tag 'fsnotify_for_v6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: fix UAF from FS_ERROR event on a shutting down filesystem
  fsnotify: optimize the case of no permission event watchers
  fsnotify: use an enum for group priority constants
  fsnotify: move s_fsnotify_connectors into fsnotify_sb_info
  fsnotify: lazy attach fsnotify_sb_info state to sb
  fsnotify: create helper fsnotify_update_sb_watchers()
  fsnotify: pass object pointer and type to fsnotify mark helpers
  fanotify: merge two checks regarding add of ignore mark
  fsnotify: create a wrapper fsnotify_find_inode_mark()
  fsnotify: create helpers to get sb and connp from object
  fsnotify: rename fsnotify_{get,put}_sb_connectors()
  fsnotify: Avoid -Wflex-array-member-not-at-end warning
  fanotify: remove unneeded sub-zero check for unsigned value
2024-05-20 12:31:43 -07:00
Chuck Lever
8d915bbf39 NFSD: Force all NFSv4.2 COPY requests to be synchronous
We've discovered that delivering a CB_OFFLOAD operation can be
unreliable in some pretty unremarkable situations. Examples
include:

 - The server dropped the connection because it lost a forechannel
   NFSv4 request and wishes to force the client to retransmit
 - The GSS sequence number window under-flowed
 - A network partition occurred

When that happens, all pending callback operations, including
CB_OFFLOAD, are lost. NFSD does not retransmit them.

Moreover, the Linux NFS client does not yet support sending an
OFFLOAD_STATUS operation to probe whether an asynchronous COPY
operation has finished. Thus, on Linux NFS clients, when a
CB_OFFLOAD is lost, asynchronous COPY can hang until manually
interrupted.

I've tried a couple of remedies, but so far the side-effects are
worse than the disease and they have had to be reverted. So
temporarily force COPY operations to be synchronous so that the use
of CB_OFFLOAD is avoided entirely. This is a fix that can easily be
backported to LTS kernels. I am working on client patches that
introduce an implementation of OFFLOAD_STATUS.

Note that NFSD arbitrarily limits the size of a copy_file_range
to 4MB to avoid indefinitely blocking an nfsd thread. A short
COPY result is returned in that case, and the client can present
a fresh COPY request for the remainder.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-09 09:10:48 -04:00
Trond Myklebust
939cb14d51 NFS/knfsd: Remove the invalid NFS error 'NFSERR_OPNOTSUPP'
NFSERR_OPNOTSUPP is not described by any RFC, and should not be used.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 12:47:24 -04:00
Trond Myklebust
e221c45da3 knfsd: LOOKUP can return an illegal error value
The 'NFS error' NFSERR_OPNOTSUPP is not described by any of the official
NFS related RFCs, but appears to have snuck into some older .x files for
NFSv2.
Either way, it is not in RFC1094, RFC1813 or any of the NFSv4 RFCs, so
should not be returned by the knfsd server, and particularly not by the
"LOOKUP" operation.

Instead, let's return NFSERR_STALE, which is more appropriate if the
filesystem encodes the filehandle as FILEID_INVALID.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 12:47:24 -04:00
Stephen Smalley
442d27ff09 nfsd: set security label during create operations
When security labeling is enabled, the client can pass a file security
label as part of a create operation for the new file, similar to mode
and other attributes. At present, the security label is received by nfsd
and passed down to nfsd_create_setattr(), but nfsd_setattr() is never
called and therefore the label is never set on the new file. This bug
may have been introduced on or around commit d6a97d3f58 ("NFSD:
add security label to struct nfsd_attrs"). Looking at nfsd_setattr()
I am uncertain as to whether the same issue presents for
file ACLs and therefore requires a similar fix for those.

An alternative approach would be to introduce a new LSM hook to set the
"create SID" of the current task prior to the actual file creation, which
would atomically label the new inode at creation time. This would be better
for SELinux and a similar approach has been used previously
(see security_dentry_create_files_as) but perhaps not usable by other LSMs.

Reproducer:
1. Install a Linux distro with SELinux - Fedora is easiest
2. git clone https://github.com/SELinuxProject/selinux-testsuite
3. Install the requisite dependencies per selinux-testsuite/README.md
4. Run something like the following script:
MOUNT=$HOME/selinux-testsuite
sudo systemctl start nfs-server
sudo exportfs -o rw,no_root_squash,security_label localhost:$MOUNT
sudo mkdir -p /mnt/selinux-testsuite
sudo mount -t nfs -o vers=4.2 localhost:$MOUNT /mnt/selinux-testsuite
pushd /mnt/selinux-testsuite/
sudo make -C policy load
pushd tests/filesystem
sudo runcon -t test_filesystem_t ./create_file -f trans_test_file \
	-e test_filesystem_filetranscon_t -v
sudo rm -f trans_test_file
popd
sudo make -C policy unload
popd
sudo umount /mnt/selinux-testsuite
sudo exportfs -u localhost:$MOUNT
sudo rmdir /mnt/selinux-testsuite
sudo systemctl stop nfs-server

Expected output:
<eliding noise from commands run prior to or after the test itself>
Process context:
	unconfined_u:unconfined_r:test_filesystem_t:s0-s0:c0.c1023
Created file: trans_test_file
File context: unconfined_u:object_r:test_filesystem_filetranscon_t:s0
File context is correct

Actual output:
<eliding noise from commands run prior to or after the test itself>
Process context:
	unconfined_u:unconfined_r:test_filesystem_t:s0-s0:c0.c1023
Created file: trans_test_file
File context: system_u:object_r:test_file_t:s0
File context error, expected:
	test_filesystem_filetranscon_t
got:
	test_file_t

Signed-off-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:24 -04:00
Chuck Lever
cc63c21682 NFSD: Add COPY status code to OFFLOAD_STATUS response
Clients that send an OFFLOAD_STATUS might want to distinguish
between an async COPY operation that is still running, has
completed successfully, or that has failed.

The intention of this patch is to make NFSD behave like this:

 * Copy still running:
	OFFLOAD_STATUS returns NFS4_OK, the number of bytes copied
	so far, and an empty osr_status array
 * Copy completed successfully:
	OFFLOAD_STATUS returns NFS4_OK, the number of bytes copied,
	and an osr_status of NFS4_OK
 * Copy failed:
	OFFLOAD_STATUS returns NFS4_OK, the number of bytes copied,
	and an osr_status other than NFS4_OK
 * Copy operation lost, canceled, or otherwise unrecognized:
	OFFLOAD_STATUS returns NFS4ERR_BAD_STATEID

NB: Though RFC 7862 Section 11.2 lists a small set of NFS status
codes that are valid for OFFLOAD_STATUS, there do not seem to be any
explicit spec limits on the status codes that may be returned in the
osr_status field.

At this time we have no unit tests for COPY and its brethren, as
pynfs does not yet implement support for NFSv4.2.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:23 -04:00
Chuck Lever
a8483b9ad9 NFSD: Record status of async copy operation in struct nfsd4_copy
After a client has started an asynchronous COPY operation, a
subsequent OFFLOAD_STATUS operation will need to report the status
code once that COPY operation has completed. The recorded status
record will be used by a subsequent patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:23 -04:00
Lorenzo Bianconi
16a4711774 NFSD: add listener-{set,get} netlink command
Introduce write_ports netlink command. For listener-set, userspace is
expected to provide a NFS listeners list it wants enabled. All other
sockets will be closed.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Co-developed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:22 -04:00
Lorenzo Bianconi
5a939bea25 NFSD: add write_version to netlink command
Introduce write_version netlink command through a "declarative" interface.
This patch introduces a change in behavior since for version-set userspace
is expected to provide a NFS major/minor version list it wants to enable
while all the other ones will be disabled. (procfs write_version
command implements imperative interface where the admin writes +3/-3 to
enable/disable a single version.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:21 -04:00
Lorenzo Bianconi
924f4fb003 NFSD: convert write_threads to netlink command
Introduce write_threads netlink command similar to the one available
through the procfs.

Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Co-developed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:21 -04:00
Jeff Layton
9077d59847 NFSD: allow callers to pass in scope string to nfsd_svc
Currently admins set this by using unshare to create a new uts
namespace, and then resetting the hostname. With the new netlink
interface we can just pass this in directly. Prepare nfsd_svc for
this change.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:20 -04:00
Jeff Layton
0842b4c80b NFSD: move nfsd_mutex handling into nfsd_svc callers
Currently nfsd_svc holds the nfsd_mutex over the whole function. For
some of the later netlink patches though, we want to do some other
things to the server before starting it. Move the mutex handling into
the callers.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:20 -04:00
NeilBrown
0770249b90 nfsd: don't create nfsv4recoverydir in nfsdfs when not used.
When CONFIG_NFSD_LEGACY_CLIENT_TRACKING is not set, the virtual file
  /proc/fs/nfsd/nfsv4recoverydir
is created but responds EINVAL to any access.
This is not useful, is somewhat surprising, and it causes ltp to
complain.

The only known user of this file is in nfs-utils, which handles
non-existence and read-failure equally well.  So there is nothing to
gain from leaving the file present but inaccessible.

So this patch removes the file when its content is not available - i.e.
when that config option is not selected.

Also remove the #ifdef which hides some of the enum values when
CONFIG_NFSD_V$ not selection.  simple_fill_super() quietly ignores array
entries that are not present, so having slots in the array that don't
get used is perfectly acceptable.  So there is no value in this #ifdef.

Reported-by: Petr Vorel <pvorel@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 74fd48739d ("nfsd: new Kconfig option for legacy client tracking")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:19 -04:00
NeilBrown
d43113fbbf nfsd: optimise recalculate_deny_mode() for a common case
recalculate_deny_mode() takes time that is linear in the number of
stateids active on the file.

When called from
  release_openowner -> free_ol_stateid_reaplist ->nfs4_free_ol_stateid
  -> release_all_access

the number of times it is called is linear in the number of stateids.
The net result is that time taken by release_openowner is quadratic in
the number of stateids.

When the nfsd server is shut down while there are many active stateids
this can result in a soft lockup. ("CPU stuck for 302s" seen in one case).

In many cases all the states have the same deny modes and there is no
need to examine the entire list in recalculate_deny_mode().  In
particular, recalculate_deny_mode() will only reduce the deny mode,
never increase it.  So if some prefix of the list causes the original
deny mode to be required, there is no need to examine the remainder of
the list.

So we can improve recalculate_deny_mode() to usually run in constant
time, so release_openowner will typically be only linear in the number
of states.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:19 -04:00
Jeff Layton
9320f27fda nfsd: add tracepoint in mark_client_expired_locked
Show client info alongside the number of cl_rpc_users. If that's
elevated, then we can infer that this function returned nfserr_jukebox.

[ cel: For additional debugging of RPC user refcounting ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Vladimir Benes <vbenes@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:19 -04:00
Chuck Lever
2d49901150 nfsd: new tracepoint for check_slot_seqid
Replace a dprintk in check_slot_seqid with tracepoints. These new
tracepoints track slot sequence numbers during operation.

Suggested-by: Jeffrey Layton <jlayton@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:18 -04:00
Jeff Layton
db2acd6e8a nfsd: drop extraneous newline from nfsd tracepoints
We never want a newline in tracepoint output.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Vladimir Benes <vbenes@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:18 -04:00
Kefeng Wang
7d12cce878 fs: nfsd: use group allocation/free of per-cpu counters API
Use group allocation/free of per-cpu counters api to accelerate
nfsd percpu_counters init/destroy(), and also squash the
nfsd_percpu_counters_init/reset/destroy() and nfsd_counters_init/destroy()
into callers to simplify code.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:17 -04:00
Jeff Layton
33a1e6ea73 nfsd: trivial GET_DIR_DELEGATION support
This adds basic infrastructure for handing GET_DIR_DELEGATION calls from
clients, including the decoders and encoders. For now, it always just
returns NFS4_OK + GDD4_UNAVAIL.

Eventually clients may start sending this operation, and it's better if
we can return GDD4_UNAVAIL instead of having to abort the whole compound.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:17 -04:00
Chuck Lever
38f080f3cd NFSD: Move callback_wq into struct nfs4_client
Commit 8838203667 ("nfsd: update workqueue creation") made the
callback_wq single-threaded, presumably to protect modifications of
cl_cb_client. See documenting comment for nfsd4_process_cb_update().

However, cl_cb_client is per-lease. There's no other reason that all
callback operations need to be dispatched via a single thread. The
single threading here means all client callbacks can be blocked by a
problem with one client.

Change the NFSv4 callback client so it serializes per-lease instead
of serializing all NFSv4 callback operations on the server.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:16 -04:00
NeilBrown
56c35f43ee nfsd: drop st_mutex before calling move_to_close_lru()
move_to_close_lru() is currently called with ->st_mutex held.
This can lead to a deadlock as move_to_close_lru() waits for sc_count to
drop to 2, and some threads holding a reference might be waiting for the
mutex.  These references will never be dropped so sc_count will never
reach 2.

There can be no harm in dropping ->st_mutex before
move_to_close_lru() because the only place that takes the mutex is
nfsd4_lock_ol_stateid(), and it quickly aborts if sc_type is
NFS4_CLOSED_STID, which it will be before move_to_close_lru() is called.

See also
 https://lore.kernel.org/lkml/4dd1fe21e11344e5969bb112e954affb@jd.com/T/
where this problem was raised but not successfully resolved.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:16 -04:00
NeilBrown
eec7620800 nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru()
move_to_close_lru() waits for sc_count to become zero while holding
rp_mutex.  This can deadlock if another thread holds a reference and is
waiting for rp_mutex.

By the time we get to move_to_close_lru() the openowner is unhashed and
cannot be found any more.  So code waiting for the mutex can safely
retry the lookup if move_to_close_lru() has started.

So change rp_mutex to an atomic_t with three states:

 RP_UNLOCK   - state is still hashed, not locked for reply
 RP_LOCKED   - state is still hashed, is locked for reply
 RP_UNHASHED - state is not hashed, no code can get a lock.

Use wait_var_event() to wait for either a lock, or for the owner to be
unhashed.  In the latter case, retry the lookup.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:16 -04:00
NeilBrown
b3f03739ca nfsd: move nfsd4_cstate_assign_replay() earlier in open handling.
Rather than taking the rp_mutex (via nfsd4_cstate_assign_replay) in
nfsd4_cleanup_open_state() (which seems counter-intuitive), take it and
assign rp_owner as soon as possible - in nfsd4_process_open1().

This will support a future change when nfsd4_cstate_assign_replay() might
fail.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:15 -04:00
NeilBrown
23df17788c nfsd: perform all find_openstateowner_str calls in the one place.
Currently find_openstateowner_str look ups are done both in
nfsd4_process_open1() and alloc_init_open_stateowner() - the latter
possibly being a surprise based on its name.

It would be easier to follow, and more conformant to common patterns, if
the lookup was all in the one place.

So replace alloc_init_open_stateowner() with
find_or_alloc_open_stateowner() and use the latter in
nfsd4_process_open1() without any calls to find_openstateowner_str().

This means all finds are find_openstateowner_str_locked() and
find_openstateowner_str() is no longer needed.  So discard
find_openstateowner_str() and rename find_openstateowner_str_locked() to
find_openstateowner_str().

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:15 -04:00
Linus Torvalds
a91bae8794 nfsd-6.9 fixes:
- Avoid freeing unallocated memory (v6.7 regression)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYv/dwACgkQM2qzM29m
 f5dcFQ//R5MRfOwLFqvNO2wwh750ijbR9tlUn9ce+PVQRpM1m1AhcDD9h98ZxNjn
 8rwxMPTJVd0E8GV5BG5B7vaYYQySuU6C9LAbDDFL1mWOSf2VFIY8U6YntqasQ77K
 HGXdFhHKlCLCwIGTvQgJ4Dnu6Ds2mjm83wH9IZ0SXdwHZsr5MeSiOmtpjDZn/Efm
 isx+OBjf4uYEwo02qzMzdX5mmj2eXDwincjK9dEUXYn6yyzIDEwo/wUDtzsaxuui
 TKm1Ni0uul8idukNOH+Clrp+sSDN+bR57mLhBZuToGxBbdMLaHBHOABYSX5BIDKv
 YCHYaIFrya9dtcZdTk5sIEj94hSwl63PP37RZTS/o/djMNWYNkiUmay+UtRpDkk2
 xbzx+C2hgb5OeGiI4ZDBrcIDi5bf5fIAT6yE8hfL853phloH15jlup9dnZQlB4B3
 PsdQEdJZY3uoOQSJ31/okDRFSxR98xClZxZX+N7zdXWcYtP1LXUk+vNTt44riw8C
 i3MBBWF2kxsGQ/r5J0phFRKunWCRjUQD36co4m7hJUYHqdJq2jNnSvk6O6F4R+yv
 IRQOs+qpX2TdMkNJ2jGaV8d4pt0f5bMLPeh3vl3Smf1xzfX+X0E4trtXTsw0nIrD
 m52ZV93Zag3Q4SkQ3ir5r/lfqX8sybJbVqZba85abtiqSSDIsFY=
 =GqdB
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Avoid freeing unallocated memory (v6.7 regression)

* tag 'nfsd-6.9-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix nfsd4_encode_fattr4() crasher
2024-04-29 14:22:24 -07:00
Chuck Lever
18180a4550 NFSD: Fix nfsd4_encode_fattr4() crasher
Ensure that args.acl is initialized early. It is used in an
unconditional call to kfree() on the way out of
nfsd4_encode_fattr4().

Reported-by: Scott Mayhew <smayhew@redhat.com>
Fixes: 83ab8678ad ("NFSD: Add struct nfsd4_fattr_args")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-25 18:06:41 -04:00
Linus Torvalds
e33c4963bf nfsd-6.9 fixes:
- Revert some backchannel fixes that went into v6.9-rc
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYqZecACgkQM2qzM29m
 f5fOPhAAlxsB8ThAAyg+vOMeR6lxutEjPzez+U4M3N1s8dj9eMmcx8s6DPxsvfTS
 ExhcqxauCEwh9q+yZQ5ismWiE43Gvf39tvKGXzH30oM4v8xYo9nzxi6w9mEGyqR+
 bkpbvbpuCfSeVY4WXD2rFI8DMh1G8PM9llue2QVLWPny2y5K2gAsbinpV/QQtdgp
 LEwhnGcX2ZmSh9yd7aGiOOO18CHU9rMygcb4XDzmwWggO8eCrElWbp4x8yuZJFAD
 OzuIx00WYny00rlMpZUyZ67hhg0+KcGQVldhr+A8f0K1baFCfW4QDq2T3RGTonoe
 DWdDGW7zwBbBw/rg4NUhdIZ9CArml9bB8ohPsaGr99e/95mtpSnXoEDJvKYgsFVE
 QembV7zPZqzEjtrR6g/tdf8/g1yRRwUMLze+RLjzz70TohLKvj7+qHCHirNMCMVE
 +HXvEdPKCzdZ24YST2wDh6sQ571ka6TX8fL6gYU8O+w+fiitLkEVXGE7KDJvynkm
 bsoJxTtbTuFleaW79d51VpIuiW1uyqKJrD1RnwDlMpp+HDtUnoVXU5SfGnmMp4KG
 Tz9Hx7hgFthXpajDpyzgLMnYVJ8IvXnoQaOp97mqFe/jSktXfslDdsj/R5ZAjZHO
 SfViZ8hjyoxrna7bPh250tSCb/S3qljhsIOc/3tAl8nPWGLoV28=
 =O2Ox
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Revert some backchannel fixes that went into v6.9-rc

* tag 'nfsd-6.9-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  Revert "NFSD: Convert the callback workqueue to use delayed_work"
  Revert "NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"
2024-04-25 09:31:06 -07:00
Chuck Lever
8ddb7142c8 Revert "NFSD: Convert the callback workqueue to use delayed_work"
This commit was a pre-requisite for commit c1ccfcf1a9 ("NFSD:
Reschedule CB operations when backchannel rpc_clnt is shut down"),
which has already been reverted.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-23 20:12:41 -04:00
Chuck Lever
9c8ecb9308 Revert "NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"
The reverted commit attempted to enable NFSD to retransmit pending
callback operations if an NFS client disconnects, but
unintentionally introduces a hazardous behavior regression if the
client becomes permanently unreachable while callback operations are
still pending.

A disconnect can occur due to network partition or if the NFS server
needs to force the NFS client to retransmit (for example, if a GSS
window under-run occurs).

Reverting the commit will make NFSD behave the same as it did in
v6.8 and before. Pending callback operations are permanently lost if
the client connection is terminated before the client receives them.

For some callback operations, this loss is not harmful.

However, for CB_RECALL, the loss means a delegation might be revoked
unnecessarily. For CB_OFFLOAD, pending COPY operations will never
complete unless the NFS client subsequently sends an OFFLOAD_STATUS
operation, which the Linux NFS client does not currently implement.

These issues still need to be addressed somehow.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218735
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-23 20:12:34 -04:00
Linus Torvalds
96fca68c4f nfsd-6.9 fixes:
- Fix a potential tracepoint crash
 - Fix NFSv4 GETATTR on big-endian platforms
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYdlTMACgkQM2qzM29m
 f5esDBAAnXOgnizrGTMkpmqWL11UmpIjWDyTxQ7dWrk7dqQGXT3qAAya3dijaJiM
 a1eLdFiaaKFxtkFrR9QPCtqfpR/gNxkkHf05SK/LQ1SL2OMbAMa1/UIaf0teWM78
 CafmMT1YLMyiEDFpB0rAnoJ5VvTU2BVowjfzAW/0PkmwLlO5+XMMhPx/qd1061Ll
 gwl2pqwZPankZRWsUBZtDE5bCTuKQDePrG7e7J7FKVPR+1EqAcudsDMh1tmSTvar
 0NTeLH0LTJ2imZi21b+j9+VKtwXTtmuY2GxhADNb8goUuQI2+lqNakDk4AflQvuy
 Kg3Z0dnNkTWGKPIbV/020vhN/6Fev5RVF9SdPF5WcEfeaWDV5rjEY1s4svphUuS+
 Nh8VCPeQEAamAcShA584G8onWdXGP9sYgBiWXZvh8R38Akq6AC6LPEkbqT6dR5mU
 ftMDGb3BBvkOs7ahjaiUUaPqoRXxeS+Qh06Sa3JrZhbMFdccZRq/AgodtC7ZYGZZ
 4u7yG+y8MIytHbIljE2aCo8U8jV8f4nl6VV3xda3H9zZG0RRfpZfFetHiAWqRjoq
 BEB75eLFDjf1qAXENWzzdeS0wLRRr5PHIkBfDeFq71zyJO37RH15sfVnavinj2KY
 7a0ASn2xlqzDHY7MTZ2ULRCLYsS7XwN88KBF7tNghfQBKJYs59A=
 =wAk4
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix a potential tracepoint crash

 - Fix NFSv4 GETATTR on big-endian platforms

* tag 'nfsd-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: fix endianness issue in nfsd4_encode_fattr4
  SUNRPC: Fix rpcgss_context trace event acceptor field
2024-04-15 14:09:47 -07:00
Vasily Gorbik
f488138b52 NFSD: fix endianness issue in nfsd4_encode_fattr4
The nfs4 mount fails with EIO on 64-bit big endian architectures since
v6.7. The issue arises from employing a union in the nfsd4_encode_fattr4()
function to overlay a 32-bit array with a 64-bit values based bitmap,
which does not function as intended. Address the endianness issue by
utilizing bitmap_from_arr32() to copy 32-bit attribute masks into a
bitmap in an endianness-agnostic manner.

Cc: stable@vger.kernel.org
Fixes: fce7913b13 ("NFSD: Use a bitmask loop to encode FATTR4 results")
Link: https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2060217
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-11 09:21:06 -04:00
Linus Torvalds
f2f80ac809 nfsd-6.9 fixes:
- Address a slow memory leak with RPC-over-TCP
 - Prevent another NFS4ERR_DELAY loop during CREATE_SESSION
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYReWEACgkQM2qzM29m
 f5fsfxAAhVkcd5Om9iBI7/Ib2QtJdeyn9+Q6hOJi9ITDPpdbSrd1Fmd8ufyKNuxH
 dwGLyV0+ELbUl1RRNfdnl+TkzYHMTURuvDEgUyhYA28GOJVd9GWXwX2KZR7J+AP5
 HtpSGLXt+XvuO7uB+SFS85wwF0DJL39Qy4jCVYCOuN2Z8zqfTg5TwstOQ8X794QN
 b5JzLkUlxQfd6kGRvU+BZHNf7R/yBfjUQWVybyhqzdjnCbbnPH+cl0hTlEIQTYJH
 G31Gty1J/RGt1ZeURuF4OG4lFocRJW/SqoruneweBAOksN9PVcwsoMf6m16l3+AD
 ZMnBt7FInQc/mAqRqIoLTsmYT8OyDa3a6qjubqWCYicCXvj1FxxOd7IaYytXxv/2
 Z8ZvKSSvyXRwM3mUt+3E5DTM8NnsxPxnO9iSGIMUeH7n96LU0X39b/Ll6in6+eu2
 /go8cLe59uuYDF9n2srX/LLWHj5wAWxVi+OgiSsAbsDFYTtJXK+syT2CpsEFXiUZ
 5AYUbfGVqQ8uNtfGaaJd71CNCuEKC5qYpeC5cS2nnruV6SArfG69DMRAO0pxJYAC
 6X7gm9Se1zyI8r9gR0rKjJ5ojeTPQBLfk6oVavum6CCwHzkKQTLG2jHBq8cdpwoL
 KxXc37fhW9m9c2B3g2dikclM2+XrMyUzJ5Ync9SSiwFJN/956I0=
 =dGcu
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Address a slow memory leak with RPC-over-TCP

 - Prevent another NFS4ERR_DELAY loop during CREATE_SESSION

* tag 'nfsd-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: hold a lighter-weight client reference over CB_RECALL_ANY
  SUNRPC: Fix a slow server-side memory leak with RPC-over-TCP
2024-04-06 09:37:50 -07:00
Jeff Layton
10396f4df8 nfsd: hold a lighter-weight client reference over CB_RECALL_ANY
Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the
client. While a callback job is technically an RPC that counter is
really more for client-driven RPCs, and this has the effect of
preventing the client from being unhashed until the callback completes.

If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we
can end up in a situation where the callback can't complete on the (now
dead) callback channel, but the new client can't connect because the old
client can't be unhashed. This usually manifests as a NFS4ERR_DELAY
return on the CREATE_SESSION operation.

The job is only holding a reference to the client so it can clear a flag
after the RPC completes. Fix this by having CB_RECALL_ANY instead hold a
reference to the cl_nfsdfs.cl_ref. Typically we only take that sort of
reference when dealing with the nfsdfs info files, but it should work
appropriately here to ensure that the nfs4_client doesn't disappear.

Fixes: 44df6f439a ("NFSD: add delegation reaper to react to low memory condition")
Reported-by: Vladimir Benes <vbenes@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-05 14:05:35 -04:00
Amir Goldstein
230d97d39e fsnotify: create a wrapper fsnotify_find_inode_mark()
In preparation to passing an object pointer to fsnotify_find_mark(), add
a wrapper fsnotify_find_inode_mark() and use it where possible.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20240317184154.1200192-4-amir73il@gmail.com>
2024-04-04 16:24:16 +02:00
Linus Torvalds
d8e8fbec00 nfsd-6.9 fixes:
- Address three recently introduced regressions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYFtQkACgkQM2qzM29m
 f5fSbA//S0YiPCU+bVwN+mK47vhai1Dzaw1kd2rzlfnvTCyBFVocfQU85bFV8uns
 ZMcHhO6jPjRg1o/qkCzeNf+n6AVGwvgOn1jC628jTP1wJOS8WPsZtdwz8vHX/xw5
 qnMwEUtfFPiDn85VyK6z6mJBWRcF/vStPc022Ie5ba5ydxD0rFmaeIc6zYX8csUe
 8tbHlulYtRR97mXjxgtXC/H0scvag07MA1fRKiZRNk0jwUdtcAsHktiBfWPV2KXO
 ew8saA+BLyBBdE01cq1CoMH/3wt/a5rpaiV5ePPtYkmlPVT5IEmSLZUwUVZAjMuu
 qlvKF9xrH7x8/7g/SwOJdH8po0IjiRdW9EKID+6PPqXz+YYKyTe7JqZtvEzt/4x9
 tGbIfyHxpUPPTZ6VedlXYW95uWG+lrKzur+zHQwsNf+U0d5Xila8euEKqvB4kDtk
 riHUlvs8O6WroeHcOQ5pK72l7x8gELdVqHP6n+E0t5+2VKp/Vcqf8WlWV0suoDfa
 D2EYGIzxn8NrNuivGESHfzafmHKKMn51UGrm1Sl0vF5w2LlqMfL4Yo4NIT+e69VR
 E3h6wIhK7Jor9px5/tBD3Y6o9k8wT9upjA+qsrOUhxyU3T9r+s88I64gvGyABTIM
 kZJHLJYROoNO6A5FSRySSH+bk/UU+wJ0s6avgHL1A2yYnyrQUzU=
 =YV5O
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Address three recently introduced regressions

* tag 'nfsd-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY replies
  SUNRPC: Revert 561141dd49
  nfsd: Fix error cleanup path in nfsd_rename()
2024-03-28 14:35:32 -07:00
Chuck Lever
99dc2ef039 NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY replies
There are one or two cases where CREATE_SESSION returns
NFS4ERR_DELAY in order to force the client to wait a bit and try
CREATE_SESSION again. However, after commit e4469c6cc6 ("NFSD: Fix
the NFSv4.1 CREATE_SESSION operation"), NFSD caches that response in
the CREATE_SESSION slot. Thus, when the client resends the
CREATE_SESSION, the server always returns the cached NFS4ERR_DELAY
response rather than actually executing the request and properly
recording its outcome. This blocks the client from making further
progress.

RFC 8881 Section 15.1.1.3 says:
> If NFS4ERR_DELAY is returned on an operation other than SEQUENCE
> that validly appears as the first operation of a request ... [t]he
> request can be retried in full without modification. In this case
> as well, the replier MUST avoid returning a response containing
> NFS4ERR_DELAY as the response to an initial operation of a request
> solely on the basis of its presence in the reply cache.

Neither the original NFSD code nor the discussion in section 18.36.4
refer explicitly to this important requirement, so I missed it.

Note also that not only must the server not cache NFS4ERR_DELAY, but
it has to not advance the CREATE_SESSION slot sequence number so
that it can properly recognize and accept the client's retry.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Fixes: e4469c6cc6 ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation")
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-27 13:19:47 -04:00
Jan Kara
9fe6e9e7b5 nfsd: Fix error cleanup path in nfsd_rename()
Commit a8b0026847 ("rename(): avoid a deadlock in the case of parents
having no common ancestor") added an error bail out path. However this
path does not drop the remount protection that has been acquired. Fix
the cleanup path to properly drop the remount protection.

Fixes: a8b0026847 ("rename(): avoid a deadlock in the case of parents having no common ancestor")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-22 09:52:00 -04:00
Steven Rostedt (Google)
c759e60903 tracing: Remove __assign_str_len()
Now that __assign_str() gets the length from the __string() (and
__string_len()) macros, there's no reason to have a separate
__assign_str_len() macro as __assign_str() can get the length of the
string needed.

Also remove __assign_rel_str() although it had no users anyway.

Link: https://lore.kernel.org/linux-trace-kernel/20240223152206.0b650659@gandalf.local.home

Cc: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:33:05 -04:00
Steven Rostedt (Google)
9388a2aa45 NFSD: Fix nfsd_clid_class use of __string_len() macro
I'm working on restructuring the __string* macros so that it doesn't need
to recalculate the string twice. That is, it will save it off when
processing __string() and the __assign_str() will not need to do the work
again as it currently does.

Currently __string_len(item, src, len) doesn't actually use "src", but my
changes will require src to be correct as that is where the __assign_str()
will get its value from.

The event class nfsd_clid_class has:

  __string_len(name, name, clp->cl_name.len)

But the second "name" does not exist and causes my changes to fail to
build. That second parameter should be: clp->cl_name.data.

Link: https://lore.kernel.org/linux-trace-kernel/20240222122828.3d8d213c@gandalf.local.home

Cc: Neil Brown <neilb@suse.de>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: stable@vger.kernel.org
Fixes: d27b74a867 ("NFSD: Use new __string_len C macros for nfsd_clid_class")
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:17:41 -04:00
Linus Torvalds
cc4a875cf3 lsm/stable-6.9 PR 20240312
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmXwt3cUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXObOhAAqldn1nbYS/t1D/k/9ZN/PtSQetK4
 S58D8+gB59Sg0daWFaRhCwwShIbXS/6XzhqaVb3iAPptJs0YDFMbWLAW2d+dd69K
 /7C8diguHbuJdEnCJtFYQIVinavaYVRlyoQcO8uwTz8uvTgXPOhr2P9NcOApJXcR
 xqttuADVo/9Zn0O9/+GUPCH0ROL0SMnuUjwdVP3bpPHj9zEk8F1/A6chzTeSLJru
 Y4+cRrN/r0JTkvRqPdnF9LSvxK7mtAEaHkKGeLQbw0O5pv3r3w0EWMJvq+uonGU2
 WX0eR5VMfevkFMUdw8FKOTa+OZ0HJ2KKIb4sB4wDMgeGyov7Z6SxgvFeQiSyD3aB
 QnyfLDzeEuPfousxUd45dUDnsWNnSgFF+JAdi0LSzm5hMuLeQDozTsFmh0orQcX1
 L5A6VtAbSPP0ffl+tuPi48q3P3LlSjMP0B8W20NXFYhXukKXCgXVMr/dEvpwpu1m
 o1glviGIXeLQQSnX3lMWb7Ds2igmCtXPrqkdu2vpRhMp0od6n4R4jH73Aj5MeSQn
 n3sP73dg5sAaMjtI2NOisMeFUp09MMlOumCCM+AIplPXremm1kwgKRTIp0rKsLW9
 VoQPXa43LQc3hAgPrpGuE+4yBfaBUq7Z8I37IFER/2y4K8b9YkduW4kDh7OdRz+d
 iQ4Nnu2lR/+CCH0=
 =0mTM
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - Promote IMA/EVM to a proper LSM

   This is the bulk of the diffstat, and the source of all the changes
   in the VFS code. Prior to the start of the LSM stacking work it was
   important that IMA/EVM were separate from the rest of the LSMs,
   complete with their own hooks, infrastructure, etc. as it was the
   only way to enable IMA/EVM at the same time as a LSM.

   However, now that the bulk of the LSM infrastructure supports
   multiple simultaneous LSMs, we can simplify things greatly by
   bringing IMA/EVM into the LSM infrastructure as proper LSMs. This is
   something I've wanted to see happen for quite some time and Roberto
   was kind enough to put in the work to make it happen.

 - Use the LSM hook default values to simplify the call_int_hook() macro

   Previously the call_int_hook() macro required callers to supply a
   default return value, despite a default value being specified when
   the LSM hook was defined.

   This simplifies the macro by using the defined default return value
   which makes life easier for callers and should also reduce the number
   of return value bugs in the future (we've had a few pop up recently,
   hence this work).

 - Use the KMEM_CACHE() macro instead of kmem_cache_create()

   The guidance appears to be to use the KMEM_CACHE() macro when
   possible and there is no reason why we can't use the macro, so let's
   use it.

 - Fix a number of comment typos in the LSM hook comment blocks

   Not much to say here, we fixed some questionable grammar decisions in
   the LSM hook comment blocks.

* tag 'lsm-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (28 commits)
  cred: Use KMEM_CACHE() instead of kmem_cache_create()
  lsm: use default hook return value in call_int_hook()
  lsm: fix typos in security/security.c comment headers
  integrity: Remove LSM
  ima: Make it independent from 'integrity' LSM
  evm: Make it independent from 'integrity' LSM
  evm: Move to LSM infrastructure
  ima: Move IMA-Appraisal to LSM infrastructure
  ima: Move to LSM infrastructure
  integrity: Move integrity_kernel_module_request() to IMA
  security: Introduce key_post_create_or_update hook
  security: Introduce inode_post_remove_acl hook
  security: Introduce inode_post_set_acl hook
  security: Introduce inode_post_create_tmpfile hook
  security: Introduce path_post_mknod hook
  security: Introduce file_release hook
  security: Introduce file_post_open hook
  security: Introduce inode_post_removexattr hook
  security: Introduce inode_post_setattr hook
  security: Align inode_setattr hook definition with EVM
  ...
2024-03-12 20:03:34 -07:00
Linus Torvalds
a01c9fe323 NFSD 6.9 Release Notes
The bulk of the patches for this release are optimizations, code
 clean-ups, and minor bug fixes.
 
 One new feature to mention is that NFSD administrators now have the
 ability to revoke NFSv4 open and lock state. NFSD's NFSv3 support
 has had this capability for some time.
 
 As always I am grateful to NFSD contributors, reviewers, and
 testers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmXwV4QACgkQM2qzM29m
 f5c7cg/8CRe0mGbeEMonoSycBjANDuiRolCM+DhVccUvSyWPqf4blF5yrNHcf5zN
 WmjQHVXIJUMVpLovcakj+4aBIuXGgdSmBJamFTy9fVfcFadiWYRceNgMMXpLMDDI
 fMAszRUyfL/r0Evj0Zajt86R5/gGn+W9X6HlDc1k7VV0Z+fzRw9WMxADy11cgHLp
 mh2bzyPmwu0EfBYlWNWLqzWVZm1C5UCGnlInyr0KXImCLOkpJqAVXTDvDkGFW2Qw
 1kJhodyabf6fRV2ZqPjLUuR4aRqABey83rB0N5z7MumO/dJUBW3CHR3uNMqvkmh3
 XevI8bPzS2Kypijcx7dONtkDWwU+fsvCdepNpmVDB73B19BFiLG+HDbMypJ0dmp+
 rvvfILRDCmIb+FA1DUeT3lIc6ac1f1+qAVc7hi3E7rGctEJWeHDsZg+E1PuTvpxM
 3XfRaFnucY5vwyiB2/uI4eblBHcVXoKho+pUqQMegLPRbgsEUyFUfg3+ZMtntagd
 OVUXvWYIARP97HNh0J5ChcGI72UpXtFWMlbbiTiCzYx4FeiCffeczIERXNJ4FYAg
 fKUaiBhdAN1PPFCRXJORZ5XlSIeZttUNSJUPfmuOpkscMdkpRUIhuEUYo9K8/1eL
 O+YZeGW/kTG+llxOERfEHJoekLf1TgGdU7oBmTIgQIK03hTUih8=
 =75G4
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "The bulk of the patches for this release are optimizations, code
  clean-ups, and minor bug fixes.

  One new feature to mention is that NFSD administrators now have the
  ability to revoke NFSv4 open and lock state. NFSD's NFSv3 support has
  had this capability for some time.

  As always I am grateful to NFSD contributors, reviewers, and testers"

* tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (75 commits)
  NFSD: Clean up nfsd4_encode_replay()
  NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches its limit
  NFSD: Document nfsd_setattr() fill-attributes behavior
  nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr()
  nfsd: Fix a regression in nfsd_setattr()
  NFSD: OP_CB_RECALL_ANY should recall both read and write delegations
  NFSD: handle GETATTR conflict with write delegation
  NFSD: add support for CB_GETATTR callback
  NFSD: Document the phases of CREATE_SESSION
  NFSD: Fix the NFSv4.1 CREATE_SESSION operation
  nfsd: clean up comments over nfs4_client definition
  svcrdma: Add Write chunk WRs to the RPC's Send WR chain
  svcrdma: Post WRs for Write chunks in svc_rdma_sendto()
  svcrdma: Post the Reply chunk and Send WR together
  svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxt
  svcrdma: Post Send WR chain
  svcrdma: Fix retry loop in svc_rdma_send()
  svcrdma: Prevent a UAF in svc_rdma_send()
  svcrdma: Fix SQ wake-ups
  svcrdma: Increase the per-transport rw_ctx count
  ...
2024-03-12 14:27:37 -07:00
Linus Torvalds
0c750012e8 vfs-6.9.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4tQAKCRCRxhvAZXjc
 ohnfAP4sm946PZfiC4y5Euk96WDC3hC8WCSBar+fpFmYVzeD9wEAy+NVCsjkMElz
 vqNxwFULUwQjFxxvsM9gvhrgGUud1AE=
 =UZk/
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull file locking updates from Christian Brauner:
 "A few years ago struct file_lock_context was added to allow for
  separate lists to track different types of file locks instead of using
  a singly-linked list for all of them.

  Now leases no longer need to be tracked using struct file_lock.
  However, a lot of the infrastructure is identical for leases and locks
  so separating them isn't trivial.

  This splits a group of fields used by both file locks and leases into
  a new struct file_lock_core. The new core struct is embedded in struct
  file_lock. Coccinelle was used to convert a lot of the callers to deal
  with the move, with the remaining 25% or so converted by hand.

  Afterwards several internal functions in fs/locks.c are made to work
  with struct file_lock_core. Ultimately this allows to split struct
  file_lock into struct file_lock and struct file_lease. The file lease
  APIs are then converted to take struct file_lease"

* tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (51 commits)
  filelock: fix deadlock detection in POSIX locking
  filelock: always define for_each_file_lock()
  smb: remove redundant check
  filelock: don't do security checks on nfsd setlease calls
  filelock: split leases out of struct file_lock
  filelock: remove temporary compatibility macros
  smb/server: adapt to breakup of struct file_lock
  smb/client: adapt to breakup of struct file_lock
  ocfs2: adapt to breakup of struct file_lock
  nfsd: adapt to breakup of struct file_lock
  nfs: adapt to breakup of struct file_lock
  lockd: adapt to breakup of struct file_lock
  fuse: adapt to breakup of struct file_lock
  gfs2: adapt to breakup of struct file_lock
  dlm: adapt to breakup of struct file_lock
  ceph: adapt to breakup of struct file_lock
  afs: adapt to breakup of struct file_lock
  9p: adapt to breakup of struct file_lock
  filelock: convert seqfile handling to use file_lock_core
  filelock: convert locks_translate_pid to take file_lock_core
  ...
2024-03-11 10:37:45 -07:00
Chuck Lever
9b350d3e34 NFSD: Clean up nfsd4_encode_replay()
Replace open-coded encoding logic with the use of conventional XDR
utility functions. Add a tracepoint to make replays observable in
field troubleshooting situations.

The WARN_ON is removed. A stack trace is of little use, as there is
only one call site for nfsd4_encode_replay(), and a buffer length
shortage here is unlikely.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-09 13:57:50 -05:00
Dai Ngo
bad4c585cc NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches its limit
The NFS server should ask clients to voluntarily return unused
delegations when the number of granted delegations reaches the
max_delegations. This is so that the server can continue to
grant delegations for new requests.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-05 10:22:33 -05:00
Chuck Lever
7d5a352ccc NFSD: Document nfsd_setattr() fill-attributes behavior
Add an explanation to prevent the future removal of the fill-
attribute call sites in nfsd_setattr(). Some NFSv3 client
implementations don't behave correctly if wcc data is not present in
an NFSv3 SETATTR reply.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-05 10:22:33 -05:00
Trond Myklebust
24d92de918 nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr()
The main point of the guarded SETATTR is to prevent races with other
WRITE and SETATTR calls. That requires that the check of the guard time
against the inode ctime be done after taking the inode lock.

Furthermore, we need to take into account the 32-bit nature of
timestamps in NFSv3, and the possibility that files may change at a
faster rate than once a second.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:33 -05:00
Trond Myklebust
6412e44c40 nfsd: Fix a regression in nfsd_setattr()
Commit bb4d53d66e ("NFSD: use (un)lock_inode instead of
fh_(un)lock for file operations") broke the NFSv3 pre/post op
attributes behaviour when doing a SETATTR rpc call by stripping out
the calls to fh_fill_pre_attrs() and fh_fill_post_attrs().

Fixes: bb4d53d66e ("NFSD: use (un)lock_inode instead of fh_(un)lock for file operations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Message-ID: <20240216012451.22725-1-trondmy@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:33 -05:00
Dai Ngo
5826e09bf3 NFSD: OP_CB_RECALL_ANY should recall both read and write delegations
Add RCA4_TYPE_MASK_WDATA_DLG to ra_bmval bitmask of OP_CB_RECALL_ANY

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:32 -05:00
Dai Ngo
c5967721e1 NFSD: handle GETATTR conflict with write delegation
If the GETATTR request on a file that has write delegation in effect
and the request attributes include the change info and size attribute
then the request is handled as below:

Server sends CB_GETATTR to client to get the latest change info and file
size. If these values are the same as the server's cached values then
the GETATTR proceeds as normal.

If either the change info or file size is different from the server's
cached values, or the file was already marked as modified, then:

    . update time_modify and time_metadata into file's metadata
      with current time

    . encode GETATTR as normal except the file size is encoded with
      the value returned from CB_GETATTR

    . mark the file as modified

If the CB_GETATTR fails for any reasons, the delegation is recalled
and NFS4ERR_DELAY is returned for the GETATTR.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:32 -05:00
Dai Ngo
6487a13b5c NFSD: add support for CB_GETATTR callback
Includes:
   . CB_GETATTR proc for nfs4_cb_procedures[]
   . XDR encoding and decoding function for CB_GETATTR request/reply
   . add nfs4_cb_fattr to nfs4_delegation for sending CB_GETATTR
     and store file attributes from client's reply.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:31 -05:00
Chuck Lever
b910544a5a NFSD: Document the phases of CREATE_SESSION
As described in RFC 8881 Section 18.36.4, CREATE_SESSION can be
split into four phases. NFSD's implementation now does it like that
description.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:31 -05:00
Chuck Lever
e4469c6cc6 NFSD: Fix the NFSv4.1 CREATE_SESSION operation
RFC 8881 Section 18.36.4 discusses the implementation of the NFSv4.1
CREATE_SESSION operation. The section defines four phases of
operation.

Phase 2 processes the CREATE_SESSION sequence ID. As a separate
step, Phase 3 evaluates the CREATE_SESSION arguments.

The problem we are concerned with is when phase 2 is successful but
phase 3 fails. The spec language in this case is "No changes are
made to any client records on the server."

RFC 8881 Section 18.35.4 defines a "client record", and it does
/not/ contain any details related to the special CREATE_SESSION
slot. Therefore NFSD is incorrect to skip incrementing the
CREATE_SESSION sequence id when phase 3 (see Section 18.36.4) of
CREATE_SESSION processing fails. In other words, even though NFSD
happens to store the cs_slot in a client record, in terms of the
protocol the slot is logically separate from the client record.

Three complications:

1. The world has moved on since commit 86c3e16cc7 ("nfsd4: confirm
   only on succesful create_session") broke this. So we can't simply
   revert that commit.

2. NFSD's CREATE_SESSION implementation does not cleanly delineate
   the logic of phases 2 and 3. So this won't be a surgical fix.

3. Because of the way it currently handles the CREATE_SESSION slot
   sequence number, nfsd4_create_session() isn't caching error
   responses in the CREATE_SESSION slot. Instead of replaying the
   response cache in those cases, it's executing the transaction
   again.

Reorganize the CREATE_SESSION slot sequence number accounting. This
requires that error responses are appropriately cached in the
CREATE_SESSION slot (once it is found).

Reported-by: Connor Smith <connor.smith@hitachivantara.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218382
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:31 -05:00
Chen Hanxiao
f81040276a nfsd: clean up comments over nfs4_client definition
nfsd fault injection has been deprecated since
commit 9d60d93198 ("Deprecate nfsd fault injection")
and removed by
commit e56dc9e294 ("nfsd: remove fault injection code")

So remove the outdated parts about fault injection.

Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:30 -05:00
Kunwu Chan
649e58d593 nfsd: Simplify the allocation of slab caches in nfsd4_init_slabs
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Make the code cleaner and more readable.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:24 -05:00
Kunwu Chan
192d80cdcb nfsd: Simplify the allocation of slab caches in nfsd_drc_slab_create
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
And change cache name from 'nfsd_drc' to 'nfsd_cacherep'.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:24 -05:00
Kunwu Chan
2f74991a49 nfsd: Simplify the allocation of slab caches in nfsd_file_cache_init
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:23 -05:00
Kunwu Chan
10bcc2f1c8 nfsd: Simplify the allocation of slab caches in nfsd4_init_pnfs
commit 0a31bd5f2b ("KMEM_CACHE(): simplify slab cache creation")
introduces a new macro.
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:23 -05:00
NeilBrown
05eda6e757 nfsd: don't call locks_release_private() twice concurrently
It is possible for free_blocked_lock() to be called twice concurrently,
once from nfsd4_lock() and once from nfsd4_release_lockowner() calling
remove_blocked_locks().  This is why a kref was added.

It is perfectly safe for locks_delete_block() and kref_put() to be
called in parallel as they use locking or atomicity respectively as
protection.  However locks_release_private() has no locking.  It is
safe for it to be called twice sequentially, but not concurrently.

This patch moves that call from free_blocked_lock() where it could race
with itself, to free_nbl() where it cannot.  This will slightly delay
the freeing of private info or release of the owner - but not by much.
It is arguably more natural for this freeing to happen in free_nbl()
where the structure itself is freed.

This bug was found by code inspection - it has not been seen in practice.

Fixes: 47446d74f1 ("nfsd4: add refcount for nfsd4_blocked_lock")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:23 -05:00
NeilBrown
1e33e1414b nfsd: allow layout state to be admin-revoked.
When there is layout state on a filesystem that is being "unlocked" that
is now revoked, which involves closing the nfsd_file and releasing the
vfs lease.

To avoid races, ->ls_file can now be accessed either:
 - under ->fi_lock for the state's sc_file or
 - under rcu_read_lock() if nfsd_file_get() is used.
To support this, ->fence_client and nfsd4_cb_layout_fail() now take a
second argument being the nfsd_file.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:22 -05:00
NeilBrown
06efa66750 nfsd: allow delegation state ids to be revoked and then freed
Revoking state through 'unlock_filesystem' now revokes any delegation
states found.  When the stateids are then freed by the client, the
revoked stateids will be cleaned up correctly.

As there is already support for revoking delegations, we build on that
for admin-revoking.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:22 -05:00
NeilBrown
39657c7406 nfsd: allow open state ids to be revoked and then freed
Revoking state through 'unlock_filesystem' now revokes any open states
found.  When the stateids are then freed by the client, the revoked
stateids will be cleaned up correctly.

Possibly the related lock states should be revoked too, but a
subsequent patch will do that for all lock state on the superblock.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:21 -05:00
NeilBrown
1c13bf9f2e nfsd: allow lock state ids to be revoked and then freed
Revoking state through 'unlock_filesystem' now revokes any lock states
found.  When the stateids are then freed by the client, the revoked
stateids will be cleaned up correctly.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:21 -05:00
NeilBrown
d688d8585e nfsd: allow admin-revoked NFSv4.0 state to be freed.
For NFSv4.1 and later the client easily discovers if there is any
admin-revoked state and will then find and explicitly free it.

For NFSv4.0 there is no such mechanism.  The client can only find that
state is admin-revoked if it tries to use that state, and there is no
way for it to explicitly free the state.  So the server must hold on to
the stateid (at least) for an indefinite amount of time.  A
RELEASE_LOCKOWNER request might justify forgetting some of these
stateids, as would the whole clients lease lapsing, but these are not
reliable.

This patch takes two approaches.

Whenever a client uses an revoked stateid, that stateid is then
discarded and will not be recognised again.  This might confuse a client
which expect to get NFS4ERR_ADMIN_REVOKED consistently once it get it at
all, but should mostly work.  Hopefully one error will lead to other
resources being closed (e.g.  process exits), which will result in more
stateid being freed when a CLOSE attempt gets NFS4ERR_ADMIN_REVOKED.

Also, any admin-revoked stateids that have been that way for more than
one lease time are periodically revoke.

No actual freeing of state happens in this patch.  That will come in
future patches which handle the different sorts of revoked state.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:21 -05:00
NeilBrown
11b2cfbf6c nfsd: report in /proc/fs/nfsd/clients/*/states when state is admin-revoke
Add "admin-revoked" to the status information for any states that have
been admin-revoked.  This can be useful for confirming correct
behaviour.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:20 -05:00
NeilBrown
39e1be6471 nfsd: allow state with no file to appear in /proc/fs/nfsd/clients/*/states
Change the "show" functions to show some content even if a file cannot
be found.  This is the case for admin-revoked state.
This is primarily useful for debugging - to ensure states are being
removed eventually.

So change several seq_printf() to seq_puts().  Some of these are needed
to keep checkpatch happy.  Others were done for consistency.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:20 -05:00
NeilBrown
1ac3629bf0 nfsd: prepare for supporting admin-revocation of state
The NFSv4 protocol allows state to be revoked by the admin and has error
codes which allow this to be communicated to the client.

This patch
 - introduces a new state-id status SC_STATUS_ADMIN_REVOKED
   which can be set on open, lock, or delegation state.
 - reports NFS4ERR_ADMIN_REVOKED when these are accessed
 - introduces a per-client counter of these states and returns
   SEQ4_STATUS_ADMIN_STATE_REVOKED when the counter is not zero.
   Decrements this when freeing any admin-revoked state.
 - introduces stub code to find all interesting states for a given
   superblock so they can be revoked via the 'unlock_filesystem'
   file in /proc/fs/nfsd/
   No actual states are handled yet.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:19 -05:00
NeilBrown
3f29cc82a8 nfsd: split sc_status out of sc_type
sc_type identifies the type of a state - open, lock, deleg, layout - and
also the status of a state - closed or revoked.

This is a bit untidy and could get worse when "admin-revoked" states are
added.  So clean it up.

With this patch, the type is now all that is stored in sc_type.  This is
zero when the state is first added to ->cl_stateids (causing it to be
ignored), and is then set appropriately once it is fully initialised.
It is set under ->cl_lock to ensure atomicity w.r.t lookup.  It is now
never cleared.

sc_type is still a bit-set even though at most one bit is set.  This allows
lookup functions to be given a bitmap of acceptable types.

sc_type is now an unsigned short rather than char.  There is no value in
restricting to just 8 bits.

All the constants now start SC_TYPE_ matching the field in which they
are stored.  Keeping the existing names and ensuring clear separation
from non-type flags would have required something like
NFS4_STID_TYPE_CLOSED which is cumbersome.  The "NFS4" prefix is
redundant was they only appear in NFS4 code, so remove that and change
STID to SC to match the field.

The status is stored in a separate unsigned short named "sc_status".  It
has two flags: SC_STATUS_CLOSED and SC_STATUS_REVOKED.
CLOSED combines NFS4_CLOSED_STID, NFS4_CLOSED_DELEG_STID, and is used
for SC_TYPE_LOCK and SC_TYPE_LAYOUT instead of setting the sc_type to zero.
These flags are only ever set, never cleared.
For deleg stateids they are set under the global state_lock.
For open and lock stateids they are set under ->cl_lock.
For layout stateids they are set under ->ls_lock

nfs4_unhash_stid() has been removed, and we never set sc_type = 0.  This
was only used for LOCK and LAYOUT stids and they now use
SC_STATUS_CLOSED.

Also TRACE_DEFINE_NUM() calls for the various STID #define have been
removed because these things are not enums, and so that call is
incorrect.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:19 -05:00
NeilBrown
83e733161f nfsd: avoid race after unhash_delegation_locked()
NFS4_CLOSED_DELEG_STID and NFS4_REVOKED_DELEG_STID are similar in
purpose.
REVOKED is used for NFSv4.1 states which have been revoked because the
lease has expired.  CLOSED is used in other cases.
The difference has two practical effects.
1/ REVOKED states are on the ->cl_revoked list
2/ REVOKED states result in nfserr_deleg_revoked from
   nfsd4_verify_open_stid() and nfsd4_validate_stateid while
   CLOSED states result in nfserr_bad_stid.

Currently a state that is being revoked is first set to "CLOSED" in
unhash_delegation_locked(), then possibly to "REVOKED" in
revoke_delegation(), at which point it is added to the cl_revoked list.

It is possible that a stateid test could see the CLOSED state
which really should be REVOKED, and so return the wrong error code.  So
it is safest to remove this window of inconsistency.

With this patch, unhash_delegation_locked() always sets the state
correctly, and revoke_delegation() no longer changes the state.

Also remove a redundant test on minorversion when
NFS4_REVOKED_DELEG_STID is seen - it can only be seen when minorversion
is non-zero.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:19 -05:00
NeilBrown
c6540026df nfsd: don't call functions with side-effecting inside WARN_ON()
Code like:

    WARN_ON(foo())

looks like an assertion and might not be expected to have any side
effects.
When testing if a function with side-effects fails a construct like

    if (foo())
       WARN_ON(1);

makes the intent more obvious.

nfsd has several WARN_ON calls where the test has side effects, so it
would be good to change them.  These cases don't really need the
WARN_ON.  They have never failed in 8 years of usage so let's just
remove the WARN_ON wrapper.

Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:18 -05:00
NeilBrown
779457285a nfsd: hold ->cl_lock for hash_delegation_locked()
The protocol for creating a new state in nfsd is to allocate the state
leaving it largely uninitialised, add that state to the ->cl_stateids
idr so as to reserve a state-id, then complete initialisation of the
state and only set ->sc_type to non-zero once the state is fully
initialised.

If a state is found in the idr with ->sc_type == 0, it is ignored.
The ->cl_lock lock is used to avoid races - it is held while checking
sc_type during lookup, and held when a non-zero value is stored in
->sc_type.

... except... hash_delegation_locked() finalises the initialisation of a
delegation state, but does NOT hold ->cl_lock.

So this patch takes ->cl_lock at the appropriate time w.r.t other locks,
and so ensures there are no races (which are extremely unlikely in any
case).
As ->fi_lock is often taken when ->cl_lock is held, we need to take
->cl_lock first of those two.
Currently ->cl_lock and state_lock are never both taken at the same time.
We need both for this patch so an arbitrary choice is needed concerning
which to take first.  As state_lock is more global, it might be more
contended, so take it first.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:18 -05:00
NeilBrown
6b4ca49dc3 nfsd: remove stale comment in nfs4_show_deleg()
As we do now support write delegations, this comment is unhelpful and
misleading.

Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:17 -05:00
Chuck Lever
e3179e443c NFSD: Remove redundant cb_seq_status initialization
As far as I can see, setting cb_seq_status in nfsd4_init_cb() is
superfluous because it is set again in nfsd4_cb_prepare().

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:17 -05:00
Chuck Lever
84ebf02d75 NFSD: Remove BUG_ON in nfsd4_process_cb_update()
Don't kill the kworker thread, and don't panic while cl_lock is
held. There's no need for scorching the earth here.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:16 -05:00
Chuck Lever
112bdd597d NFSD: Replace comment with lockdep assertion
Convert a code comment into a real assertion.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:16 -05:00
Chuck Lever
627fb18302 NFSD: Remove unused @reason argument
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:15 -05:00
Chuck Lever
9a026aec88 NFSD: Add callback operation lifetime trace points
Help observe the flow of callback operations.

bc_shutdown() records exactly when the backchannel RPC client is
destroyed and cl_cb_client is replaced with NULL.

Examples include:

         nfsd-955   [004]   650.013997: nfsd_cb_queue:        addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff8881134b02f8 (first try)
kworker/u21:4-497   [004]   650.014050: nfsd_cb_seq_status:   task:00000001@00000001 sessionid=65b3c5b8:f541f749:00000001:00000000 tk_status=-107 seq_status=1
kworker/u21:4-497   [004]   650.014051: nfsd_cb_restart:      addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff88810e39f400 (first try)
kworker/u21:4-497   [004]   650.014066: nfsd_cb_queue:        addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff88810e39f400 (need restart)


kworker/u16:0-10    [006]   650.065750: nfsd_cb_start:        addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UNKNOWN
kworker/u16:0-10    [006]   650.065752: nfsd_cb_bc_update:    addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff8881134b02f8 (first try)
kworker/u16:0-10    [006]   650.065754: nfsd_cb_bc_shutdown:  addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff8881134b02f8 (first try)
kworker/u16:0-10    [006]   650.065810: nfsd_cb_new_state:    addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=DOWN

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:15 -05:00
Chuck Lever
a7cdf065e1 NFSD: Rename nfsd_cb_state trace point
Make it clear where backchannel state is updated.

Example trace point output:

kworker/u16:0-10    [006]  2800.080404: nfsd_cb_new_state:    addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UP
         nfsd-940   [003]  2800.478368: nfsd_cb_new_state:    addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UNKNOWN
kworker/u16:0-10    [003]  2800.478828: nfsd_cb_new_state:    addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=DOWN

kworker/u16:0-10    [005]  2802.039724: nfsd_cb_start:        addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UP
kworker/u16:0-10    [005]  2810.611452: nfsd_cb_start:        addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=FAULT
kworker/u16:0-10    [005]  2810.616832: nfsd_cb_start:        addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UNKNOWN
kworker/u16:0-10    [005]  2810.616931: nfsd_cb_start:        addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=DOWN

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:14 -05:00
Chuck Lever
8626664c87 NFSD: Replace dprintks in nfsd4_cb_sequence_done()
Improve observability of backchannel session operation.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:14 -05:00
Chuck Lever
f52f1975b1 NFSD: Add nfsd_seq4_status trace event
Add a trace point that records SEQ4_STATUS flags returned in an
NFSv4.1 SEQUENCE response. SEQ4_STATUS flags report backchannel
issues and changes to lease state to clients. Knowing what the
server is reporting to clients is useful for debugging both
configuration and operational issues in real time.

For example, upcoming patches will enable server administrators to
revoke parts of a client's lease; that revocation is indicated to
the client when a subsequent SEQUENCE operation has one or more
SEQ4_STATUS flags that are set.

Sample trace records:

nfsd-927   [006]   615.581821: nfsd_seq4_status:     xid=0x095ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT
nfsd-927   [006]   615.588043: nfsd_seq4_status:     xid=0x0a5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT
nfsd-928   [003]   615.588448: nfsd_seq4_status:     xid=0x0b5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:13 -05:00
Chuck Lever
43b02dba11 NFSD: Retransmit callbacks after client reconnects
NFSv4.1 clients assume that if they disconnect, that will force the
server to resend pending callback operations once a fresh connection
has been established.

Turns out NFSD has not been resending after reconnect.

Fixes: 7ba6cad6c8 ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:13 -05:00
Chuck Lever
c1ccfcf1a9 NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down
As part of managing a client disconnect, NFSD closes down and
replaces the backchannel rpc_clnt.

If a callback operation is pending when the backchannel rpc_clnt is
shut down, currently nfsd4_run_cb_work() just discards that
callback. But there are multiple cases to deal with here:

 o The client's lease is getting destroyed. Throw the CB away.

 o The client disconnected. It might be forcing a retransmit of
   CB operations, or it could have disconnected for other reasons.
   Reschedule the CB so it is retransmitted when the client
   reconnects.

Since callback operations can now be rescheduled, ensure that
cb_ops->prepare can be called only once by moving the
cb_ops->prepare paragraph down to just before the rpc_call_async()
call.

Fixes: 2bbfed98a4 ("nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:12 -05:00
Chuck Lever
fe0e9580e2 NFSD: Convert the callback workqueue to use delayed_work
Normally, NFSv4 callback operations are supposed to be sent to the
client as soon as they are queued up.

In a moment, I will introduce a recovery path where the server has
to wait for the client to reconnect. We don't want a hard busy wait
here -- the callback should be requeued to try again in several
milliseconds.

For now, convert nfsd4_callback from struct work_struct to struct
delayed_work, and queue with a zero delay argument. This should
avoid behavior changes for current operation.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:12 -05:00
Chuck Lever
961b4b5e86 NFSD: Reset cb_seq_status after NFS4ERR_DELAY
I noticed that once an NFSv4.1 callback operation gets a
NFS4ERR_DELAY status on CB_SEQUENCE and then the connection is lost,
the callback client loops, resending it indefinitely.

The switch arm in nfsd4_cb_sequence_done() that handles
NFS4ERR_DELAY uses rpc_restart_call() to rearm the RPC state machine
for the retransmit, but that path does not call the rpc_prepare_call
callback again. Thus cb_seq_status is set to -10008 by the first
NFS4ERR_DELAY result, but is never set back to 1 for the retransmits.

nfsd4_cb_sequence_done() thinks it's getting nothing but a
long series of CB_SEQUENCE NFS4ERR_DELAY replies.

Fixes: 7ba6cad6c8 ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:12 -05:00
Josef Bacik
16fb9808ab nfsd: make svc_stat per-network namespace instead of global
The final bit of stats that is global is the rpc svc_stat.  Move this
into the nfsd_net struct and use that everywhere instead of the global
struct.  Remove the unused global struct.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:11 -05:00
Josef Bacik
e41ee44cc6 nfsd: remove nfsd_stats, make th_cnt a global counter
This is the last global stat, take it out of the nfsd_stats struct and
make it a global part of nfsd, report it the same as always.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:11 -05:00
Josef Bacik
4b14885411 nfsd: make all of the nfsd stats per-network namespace
We have a global set of counters that we modify for all of the nfsd
operations, but now that we're exposing these stats across all network
namespaces we need to make the stats also be per-network namespace.  We
already have some caching stats that are per-network namespace, so move
these definitions into the same counter and then adjust all the helpers
and users of these stats to provide the appropriate nfsd_net struct so
that the stats are maintained for the per-network namespace objects.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:10 -05:00
Josef Bacik
93483ac5fe nfsd: expose /proc/net/sunrpc/nfsd in net namespaces
We are running nfsd servers inside of containers with their own network
namespace, and we want to monitor these services using the stats found
in /proc.  However these are not exposed in the proc inside of the
container, so we have to bind mount the host /proc into our containers
to get at this information.

Separate out the stat counters init and the proc registration, and move
the proc registration into the pernet operations entry and exit points
so that these stats can be exposed inside of network namespaces.

This is an intermediate step, this just exposes the global counters in
the network namespace.  Subsequent patches will move these counters into
the per-network namespace container.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:10 -05:00
Josef Bacik
d98416cc21 nfsd: rename NFSD_NET_* to NFSD_STATS_*
We're going to merge the stats all into per network namespace in
subsequent patches, rename these nn counters to be consistent with the
rest of the stats.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:10 -05:00
Josef Bacik
3f6ef182f1 sunrpc: remove ->pg_stats from svc_program
Now that this isn't used anywhere, remove it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:09 -05:00
Josef Bacik
f094323867 sunrpc: pass in the sv_stats struct through svc_create_pooled
Since only one service actually reports the rpc stats there's not much
of a reason to have a pointer to it in the svc_program struct.  Adjust
the svc_create_pooled function to take the sv_stats as an argument and
pass the struct through there as desired instead of getting it from the
svc_program->pg_stats.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:08 -05:00
Josef Bacik
a2214ed588 nfsd: stop setting ->pg_stats for unused stats
A lot of places are setting a blank svc_stats in ->pg_stats and never
utilizing these stats.  Remove all of these extra structs as we're not
reporting these stats anywhere.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:08 -05:00
Jorge Mora
31e4bb8fb8 NFSD: fix LISTXATTRS returning more bytes than maxcount
The maxcount is the maximum number of bytes for the LISTXATTRS4resok
result. This includes the cookie and the count for the name array,
thus subtract 12 bytes from the maxcount: 8 (cookie) + 4 (array count)
when filling up the name array.

Fixes: 23e50fe3a5 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:07 -05:00
Jorge Mora
2f73f37d66 NFSD: fix LISTXATTRS returning a short list with eof=TRUE
If the XDR buffer is not large enough to fit all attributes
and the remaining bytes left in the XDR buffer (xdrleft) is
equal to the number of bytes for the current attribute, then
the loop will prematurely exit without setting eof to FALSE.
Also in this case, adding the eof flag to the buffer will
make the reply 4 bytes larger than lsxa_maxcount.

Need to check if there are enough bytes to fit not only the
next attribute name but also the eof as well.

Fixes: 23e50fe3a5 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:07 -05:00
Jorge Mora
61ab5e0758 NFSD: change LISTXATTRS cookie encoding to big-endian
Function nfsd4_listxattr_validate_cookie() expects the cookie
as an offset to the list thus it needs to be encoded in big-endian.

Fixes: 23e50fe3a5 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:06 -05:00
Jorge Mora
52a357db80 NFSD: fix nfsd4_listxattr_validate_cookie
If LISTXATTRS is sent with a correct cookie but a small maxcount,
this could lead function nfsd4_listxattr_validate_cookie to
return NFS4ERR_BAD_COOKIE. If maxcount = 20, then second check
on function gives RHS = 3 thus any cookie larger than 3 returns
NFS4ERR_BAD_COOKIE.

There is no need to validate the cookie on the return XDR buffer
since attribute referenced by cookie will be the first in the
return buffer.

Fixes: 23e50fe3a5 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:06 -05:00
NeilBrown
5ff318f645 nfsd: use __fput_sync() to avoid delayed closing of files.
Calling fput() directly or though filp_close() from a kernel thread like
nfsd causes the final __fput() (if necessary) to be called from a
workqueue.  This means that nfsd is not forced to wait for any work to
complete.  If the ->release or ->destroy_inode function is slow for any
reason, this can result in nfsd closing files more quickly than the
workqueue can complete the close and the queue of pending closes can
grow without bounces (30 million has been seen at one customer site,
though this was in part due to a slowness in xfs which has since been
fixed).

nfsd does not need this.  It is quite appropriate and safe for nfsd to
do its own close work.  There is no reason that close should ever wait
for nfsd, so no deadlock can occur.

It should be safe and sensible to change all fput() calls to
__fput_sync().  However in the interests of caution this patch only
changes two - the two that can be most directly affected by client
behaviour and could occur at high frequency.

- the fput() implicitly in flip_close() is changed to __fput_sync()
  by calling get_file() first to ensure filp_close() doesn't do
  the final fput() itself.  If is where files opened for IO are closed.

- the fput() in nfsd_read() is also changed.  This is where directories
  opened for readdir are closed.

This ensure that minimal fput work is queued to the workqueue.

This removes the need for the flush_delayed_fput() call in
nfsd_file_close_inode_sync()

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:05 -05:00
NeilBrown
ffb4025961 nfsd: Don't leave work of closing files to a work queue
The work of closing a file can have non-trivial cost.  Doing it in a
separate work queue thread means that cost isn't imposed on the nfsd
threads and an imbalance can be created.  This can result in files being
queued for the work queue more quickly that the work queue can process
them, resulting in unbounded growth of the queue and memory exhaustion.

To avoid this work imbalance that exhausts memory, this patch moves all
closing of files into the nfsd threads.  This means that when the work
imposes a cost, that cost appears where it would be expected - in the
work of the nfsd thread.  A subsequent patch will ensure the final
__fput() is called in the same (nfsd) thread which calls filp_close().

Files opened for NFSv3 are never explicitly closed by the client and are
kept open by the server in the "filecache", which responds to memory
pressure, is garbage collected even when there is no pressure, and
sometimes closes files when there is particular need such as for rename.
These files currently have filp_close() called in a dedicated work
queue, so their __fput() can have no effect on nfsd threads.

This patch discards the work queue and instead has each nfsd thread call
flip_close() on as many as 8 files from the filecache each time it acts
on a client request (or finds there are no pending client requests).  If
there are more to be closed, more threads are woken.  This spreads the
work of __fput() over multiple threads and imposes any cost on those
threads.

The number 8 is somewhat arbitrary.  It needs to be greater than 1 to
ensure that files are closed more quickly than they can be added to the
cache.  It needs to be small enough to limit the per-request delays that
will be imposed on clients when all threads are busy closing files.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01 09:12:05 -05:00
Roberto Sassu
cd3cec0a02 ima: Move to LSM infrastructure
Move hardcoded IMA function calls (not appraisal-specific functions) from
various places in the kernel to the LSM infrastructure, by introducing a
new LSM named 'ima' (at the end of the LSM list and always enabled like
'integrity').

Having IMA before EVM in the Makefile is sufficient to preserve the
relative order of the new 'ima' LSM in respect to the upcoming 'evm' LSM,
and thus the order of IMA and EVM function calls as when they were
hardcoded.

Make moved functions as static (except ima_post_key_create_or_update(),
which is not in ima_main.c), and register them as implementation of the
respective hooks in the new function init_ima_lsm().

Select CONFIG_SECURITY_PATH, to ensure that the path-based LSM hook
path_post_mknod is always available and ima_post_path_mknod() is always
executed to mark files as new, as before the move.

A slight difference is that IMA and EVM functions registered for the
inode_post_setattr, inode_post_removexattr, path_post_mknod,
inode_post_create_tmpfile, inode_post_set_acl and inode_post_remove_acl
won't be executed for private inodes. Since those inodes are supposed to be
fs-internal, they should not be of interest to IMA or EVM. The S_PRIVATE
flag is used for anonymous inodes, hugetlbfs, reiserfs xattrs, XFS scrub
and kernel-internal tmpfs files.

Conditionally register ima_post_key_create_or_update() if
CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS is enabled. Also, conditionally register
ima_kernel_module_request() if CONFIG_INTEGRITY_ASYMMETRIC_KEYS is enabled.

Finally, add the LSM_ID_IMA case in lsm_list_modules_test.c.

Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-02-15 23:43:46 -05:00
Roberto Sassu
8f46ff5767 security: Introduce file_post_open hook
In preparation to move IMA and EVM to the LSM infrastructure, introduce the
file_post_open hook. Also, export security_file_post_open() for NFS.

Based on policy, IMA calculates the digest of the file content and
extends the TPM with the digest, verifies the file's integrity based on
the digest, and/or includes the file digest in the audit log.

LSMs could similarly take action depending on the file content and the
access mask requested with open().

The new hook returns a value and can cause the open to be aborted.

Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-02-15 23:43:42 -05:00
Linus Torvalds
c8d80f83de nfsd-6.8 fixes:
- Address a deadlock regression in RELEASE_LOCKOWNER
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmXDkGoACgkQM2qzM29m
 f5cD1xAAsqKTmrb2ABdFZadYfVl6lKxtNWEp8S9eRS6eBU6bcNr5sxoTi+eflHuB
 a58TfUwj8ffVtmd0qGaWI8RpDuAYtxhJU6l3TQZCLheMlKvsP3u6lZDjk8CeYtIK
 9bVDZV7MOh9C/p01p/21P2B3OukgzzF8Lz/AFPCxCtoK5nnaDT5F+8pX8yfZU5x0
 bM1gBHzB80BoCdzlimN6QB8EfFkOgIF9Apnk53E676KmkOuADV+CIp4aQMsm8l3r
 6lJAdsOCuw5BgSDJWh1vGmFKubfhP841QbglDnnZcc+WTBhvEO0PR8Kv0Ugn5Xek
 8NkcnOlti+gjH0EKTHx5P4frV0BcWptLjCjVdruOvBszrZtEaX547Mp/d04GGO0U
 8gUEhen/RT7l1E2RM4qII9q5nuaekjI2Da2FGZNmK/j66OPFibiOBMRn3u/haTr5
 2axrrO9NOnfWcTQX08iKZygEUY7E4h3iImqkNw+c+avZ6SRiH548TRCrKdlBeuia
 Pj3QFRfu/9PQnnl9qnROXtAP70AKmX1iSLZ4s9+KqTzQVCW7tLbZpOn9D1hHCJ1q
 0JTVql5rSXCl5YpBsDiqr7qvZhVab/2+1X9wHkLPZd5aIBoUmOQOMebZCUxpu+TR
 houx3NBGYNixk0khEkHH3+c0HydvnxVU3xClUgLxjHgEdeS7R6s=
 =z3oC
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Address a deadlock regression in RELEASE_LOCKOWNER

* tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: don't take fi_lock in nfsd_break_deleg_cb()
2024-02-07 17:48:15 +00:00
NeilBrown
5ea9a7c5fe nfsd: don't take fi_lock in nfsd_break_deleg_cb()
A recent change to check_for_locks() changed it to take ->flc_lock while
holding ->fi_lock.  This creates a lock inversion (reported by lockdep)
because there is a case where ->fi_lock is taken while holding
->flc_lock.

->flc_lock is held across ->fl_lmops callbacks, and
nfsd_break_deleg_cb() is one of those and does take ->fi_lock.  However
it doesn't need to.

Prior to v4.17-rc1~110^2~22 ("nfsd: create a separate lease for each
delegation") nfsd_break_deleg_cb() would walk the ->fi_delegations list
and so needed the lock.  Since then it doesn't walk the list and doesn't
need the lock.

Two actions are performed under the lock.  One is to call
nfsd_break_one_deleg which calls nfsd4_run_cb().  These doesn't act on
the nfs4_file at all, so don't need the lock.

The other is to set ->fi_had_conflict which is in the nfs4_file.
This field is only ever set here (except when initialised to false)
so there is no possible problem will multiple threads racing when
setting it.

The field is tested twice in nfs4_set_delegation().  The first test does
not hold a lock and is documented as an opportunistic optimisation, so
it doesn't impose any need to hold ->fi_lock while setting
->fi_had_conflict.

The second test in nfs4_set_delegation() *is* make under ->fi_lock, so
removing the locking when ->fi_had_conflict is set could make a change.
The change could only be interesting if ->fi_had_conflict tested as
false even though nfsd_break_one_deleg() ran before ->fi_lock was
unlocked.  i.e. while hash_delegation_locked() was running.
As hash_delegation_lock() doesn't interact in any way with nfs4_run_cb()
there can be no importance to this interaction.

So this patch removes the locking from nfsd_break_one_deleg() and moves
the final test on ->fi_had_conflict out of the locked region to make it
clear that locking isn't important to the test.  It is still tested
*after* vfs_setlease() has succeeded.  This might be significant and as
vfs_setlease() takes ->flc_lock, and nfsd_break_one_deleg() is called
under ->flc_lock this "after" is a true ordering provided by a spinlock.

Fixes: edcf972515 ("nfsd: fix RELEASE_LOCKOWNER")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-02-05 09:49:47 -05:00
Jeff Layton
7b8001013d filelock: don't do security checks on nfsd setlease calls
Zdenek reported seeing some AVC denials due to nfsd trying to set
delegations:

    type=AVC msg=audit(09.11.2023 09:03:46.411:496) : avc:  denied  { lease } for  pid=5127 comm=rpc.nfsd capability=lease  scontext=system_u:system_r:nfsd_t:s0 tcontext=system_u:system_r:nfsd_t:s0 tclass=capability permissive=0

When setting delegations on behalf of nfsd, we don't want to do all of
the normal capabilty and LSM checks. nfsd is a kernel thread and runs
with CAP_LEASE set, so the uid checks end up being a no-op in most cases
anyway.

Some nfsd functions can end up running in normal process context when
tearing down the server. At that point, the CAP_LEASE check can fail and
cause the client to not tear down delegations when expected.

Also, the way the per-fs ->setlease handlers work today is a little
convoluted. The non-trivial ones are wrappers around generic_setlease,
so when they fail due to permission problems they usually they end up
doing a little extra work only to determine that they can't set the
lease anyway. It would be more efficient to do those checks earlier.

Transplant the permission checking from generic_setlease to
vfs_setlease, which will make the permission checking happen earlier on
filesystems that have a ->setlease operation. Add a new kernel_setlease
function that bypasses these checks, and switch nfsd to use that instead
of vfs_setlease.

There is one behavioral change here: prior this patch the
setlease_notifier would fire even if the lease attempt was going to fail
the security checks later. With this change, it doesn't fire until the
caller has passed them. I think this is a desirable change overall. nfsd
is the only user of the setlease_notifier and it doesn't benefit from
being notified about failed attempts.

Cc: Ondrej Mosnáček <omosnacek@gmail.com>
Reported-by: Zdenek Pytela <zpytela@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2248830
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240205-bz2248830-v1-1-d0ec0daecba1@kernel.org
Acked-by: Tom Talpey <tom@talpey.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:53:03 +01:00
Jeff Layton
c69ff40719
filelock: split leases out of struct file_lock
Add a new struct file_lease and move the lease-specific fields from
struct file_lock to it. Convert the appropriate API calls to take
struct file_lease instead, and convert the callers to use them.

There is zero overlap between the lock manager operations for file
locks and the ones for file leases, so split the lease-related
operations off into a new lease_manager_operations struct.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-47-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:44 +01:00
Jeff Layton
05580bbfc6
nfsd: adapt to breakup of struct file_lock
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-42-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:43 +01:00
Jeff Layton
a69ce85ec9
filelock: split common fields into struct file_lock_core
In a future patch, we're going to split file leases into their own
structure. Since a lot of the underlying machinery uses the same fields
move those into a new file_lock_core, and embed that inside struct
file_lock.

For now, add some macros to ensure that we can continue to build while
the conversion is in progress.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:38 +01:00
Jeff Layton
60f3154d19
nfsd: convert to using new filelock helpers
Convert to using the new file locking helper functions. Also, in later
patches we're going to introduce some macros with names that clash with
the variable names in nfsd4_lock. Rename them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-12-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:36 +01:00
Linus Torvalds
b9fa4cbd84 nfsd-6.8 fixes:
- Fix in-kernel RPC UDP transport
 - Fix NFSv4.0 RELEASE_LOCKOWNER
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmWxLmUACgkQM2qzM29m
 f5eCJBAAqTyYweun0ca9wisypEVVWgfoHN8wNpxZR3iaPkL/pWudkk+ImHuSGFCC
 fxdBU+xpBbxTHJ5yrrjyIVBQwHXRx3HQyedxs5SpFxeEr6h4z4fTzC3oX69AxHvr
 zt4ynccUjZ4vjcyASQafONtPUoeCoTUEP9hkrF/WYNdETuKXhGuJ7rM969f3eLF/
 V3oPz0l8PoCsyw6IrFgUOGsNYSdBfydR73WqMJTUP9Avc5l9yG2tKdNfV5n87xTg
 xPuAxyxBEKuzcK1N7qBjiBDt+MRn+QKSid/kVelhTEwb+vgW4DEqJwmBMNHjFzqN
 gy3x8xVatIlbuWvKslSE4MbrAplTdEZrSHK0Q7CNeREHptlf41QS5dgYdVCWcs15
 vgYb18qTipkErdQ4sYGK7oN0PhPTlOkpit/28Vf8fiiSEXZTPBr9bcYvAGD3pZUS
 kzoA0JrZkZQB+TNHQ3j5T2r+XKunF3KkPVXmI7c6RSf8yWcJ1kqe29zpstIfll71
 K157NBqeVM8FyiPOxZqT0AROW9aIPGml7f6SQ4DkUUp84LbEFaz8svkFMu9a7ONs
 oEkc5xbEuvIvyHdY1liBhxw5ugg1nNMIbchVwRHqQIIZGfy76p+KLRSZ2neidhxE
 pl0QNQchT3XsxZJOwcp5YboyBB+gVcKMpFB1rgX4NBcXTh1buWo=
 =t62e
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix in-kernel RPC UDP transport

 - Fix NFSv4.0 RELEASE_LOCKOWNER

* tag 'nfsd-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix RELEASE_LOCKOWNER
  SUNRPC: use request size to initialize bio_vec in svc_udp_sendto()
2024-01-25 10:26:52 -08:00
NeilBrown
edcf972515 nfsd: fix RELEASE_LOCKOWNER
The test on so_count in nfsd4_release_lockowner() is nonsense and
harmful.  Revert to using check_for_locks(), changing that to not sleep.

First: harmful.
As is documented in the kdoc comment for nfsd4_release_lockowner(), the
test on so_count can transiently return a false positive resulting in a
return of NFS4ERR_LOCKS_HELD when in fact no locks are held.  This is
clearly a protocol violation and with the Linux NFS client it can cause
incorrect behaviour.

If RELEASE_LOCKOWNER is sent while some other thread is still
processing a LOCK request which failed because, at the time that request
was received, the given owner held a conflicting lock, then the nfsd
thread processing that LOCK request can hold a reference (conflock) to
the lock owner that causes nfsd4_release_lockowner() to return an
incorrect error.

The Linux NFS client ignores that NFS4ERR_LOCKS_HELD error because it
never sends NFS4_RELEASE_LOCKOWNER without first releasing any locks, so
it knows that the error is impossible.  It assumes the lock owner was in
fact released so it feels free to use the same lock owner identifier in
some later locking request.

When it does reuse a lock owner identifier for which a previous RELEASE
failed, it will naturally use a lock_seqid of zero.  However the server,
which didn't release the lock owner, will expect a larger lock_seqid and
so will respond with NFS4ERR_BAD_SEQID.

So clearly it is harmful to allow a false positive, which testing
so_count allows.

The test is nonsense because ... well... it doesn't mean anything.

so_count is the sum of three different counts.
1/ the set of states listed on so_stateids
2/ the set of active vfs locks owned by any of those states
3/ various transient counts such as for conflicting locks.

When it is tested against '2' it is clear that one of these is the
transient reference obtained by find_lockowner_str_locked().  It is not
clear what the other one is expected to be.

In practice, the count is often 2 because there is precisely one state
on so_stateids.  If there were more, this would fail.

In my testing I see two circumstances when RELEASE_LOCKOWNER is called.
In one case, CLOSE is called before RELEASE_LOCKOWNER.  That results in
all the lock states being removed, and so the lockowner being discarded
(it is removed when there are no more references which usually happens
when the lock state is discarded).  When nfsd4_release_lockowner() finds
that the lock owner doesn't exist, it returns success.

The other case shows an so_count of '2' and precisely one state listed
in so_stateid.  It appears that the Linux client uses a separate lock
owner for each file resulting in one lock state per lock owner, so this
test on '2' is safe.  For another client it might not be safe.

So this patch changes check_for_locks() to use the (newish)
find_any_file_locked() so that it doesn't take a reference on the
nfs4_file and so never calls nfsd_file_put(), and so never sleeps.  With
this check is it safe to restore the use of check_for_locks() rather
than testing so_count against the mysterious '2'.

Fixes: ce3c4ad7f4 ("NFSD: Fix possible sleep during nfsd4_release_lockowner()")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-24 09:49:11 -05:00
Linus Torvalds
488926926a misc cleanups (the part that hadn't been picked by individual fs trees)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ/BCAAKCRBZ7Krx/gZQ
 68qqAQD6LtfYLDJGdJM+lNpyiG4BA7coYpPlJtmH7mzL+MbFPgEAnM7XsK6zyvza
 3+rEggLM0UFWjg9Ln7Nlq035TeYtFwo=
 =w1mD
 -----END PGP SIGNATURE-----

Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc filesystem updates from Al Viro:
 "Misc cleanups (the part that hadn't been picked by individual fs
  trees)"

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  apparmorfs: don't duplicate kfree_link()
  orangefs: saner arguments passing in readdir guts
  ocfs2_find_match(): there's no such thing as NULL or negative ->d_parent
  reiserfs_add_entry(): get rid of pointless namelen checks
  __ocfs2_add_entry(), ocfs2_prepare_dir_for_insert(): namelen checks
  ext4_add_entry(): ->d_name.len is never 0
  befs: d_obtain_alias(ERR_PTR(...)) will do the right thing
  affs: d_obtain_alias(ERR_PTR(...)) will do the right thing
  /proc/sys: use d_splice_alias() calling conventions to simplify failure exits
  hostfs: use d_splice_alias() calling conventions to simplify failure exits
  udf_fiiter_add_entry(): check for zero ->d_name.len is bogus...
  udf: d_obtain_alias(ERR_PTR(...)) will do the right thing...
  udf: d_splice_alias() will do the right thing on ERR_PTR() inode
  nfsd: kill stale comment about simple_fill_super() requirements
  bfs_add_entry(): get rid of pointless ->d_name.len checks
  nilfs2: d_obtain_alias(ERR_PTR(...)) will do the right thing...
  zonefs: d_splice_alias() will do the right thing on ERR_PTR() inode
2024-01-11 20:23:50 -08:00
Linus Torvalds
499aa1ca4e dcache stuff for this cycle
change of locking rules for __dentry_kill(), regularized refcounting
 rules in that area, assorted cleanups and removal of weird corner
 cases (e.g. now ->d_iput() on child is always called before the parent
 might hit __dentry_kill(), etc.)
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+sQQAKCRBZ7Krx/gZQ
 6ybjAQDM5jiS93IUzfHjCWq0nVBX5YGbDAkZOeqxbmIdQb+2UAEA6elP5r0fBBcA
 seo3bry4DirQMDaA/Cjh4+8r71YSOQs=
 =7+Hk
 -----END PGP SIGNATURE-----

Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dcache updates from Al Viro:
 "Change of locking rules for __dentry_kill(), regularized refcounting
  rules in that area, assorted cleanups and removal of weird corner
  cases (e.g. now ->d_iput() on child is always called before the parent
  might hit __dentry_kill(), etc)"

* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  dcache: remove unnecessary NULL check in dget_dlock()
  kill DCACHE_MAY_FREE
  __d_unalias() doesn't use inode argument
  d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant
  get rid of DCACHE_GENOCIDE
  d_genocide(): move the extern into fs/internal.h
  simple_fill_super(): don't bother with d_genocide() on failure
  nsfs: use d_make_root()
  d_alloc_pseudo(): move setting ->d_op there from the (sole) caller
  kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller
  retain_dentry(): introduce a trimmed-down lockless variant
  __dentry_kill(): new locking scheme
  d_prune_aliases(): use a shrink list
  switch select_collect{,2}() to use of to_shrink_list()
  to_shrink_list(): call only if refcount is 0
  fold dentry_kill() into dput()
  don't try to cut corners in shrink_lock_dentry()
  fold the call of retain_dentry() into fast_dput()
  Call retain_dentry() with refcount 0
  dentry_kill(): don't bother with retain_dentry() on slow path
  ...
2024-01-11 20:11:35 -08:00
Linus Torvalds
bf4e7080ae fix directory locking scheme on rename
broken in 6.5; we really can't lock two unrelated directories
 without holding ->s_vfs_rename_mutex first and in case of
 same-parent rename of a subdirectory 6.5 ends up doing just
 that.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+lyQAKCRBZ7Krx/gZQ
 60MWAP94hTqeMIpjhsUIkrTnylrIFaiw4UCWFJzIRG1QQYKqCgD/XUaWI9np7dL6
 0wR/j4CQSdJjiEFKUFE2pD3QoSuJYAQ=
 =+x0+
 -----END PGP SIGNATURE-----

Merge tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull rename updates from Al Viro:
 "Fix directory locking scheme on rename

  This was broken in 6.5; we really can't lock two unrelated directories
  without holding ->s_vfs_rename_mutex first and in case of same-parent
  rename of a subdirectory 6.5 ends up doing just that"

* tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  rename(): avoid a deadlock in the case of parents having no common ancestor
  kill lock_two_inodes()
  rename(): fix the locking of subdirectories
  f2fs: Avoid reading renamed directory if parent does not change
  ext4: don't access the source subdirectory content on same-directory rename
  ext2: Avoid reading renamed directory if parent does not change
  udf_rename(): only access the child content on cross-directory rename
  ocfs2: Avoid touching renamed directory if parent does not change
  reiserfs: Avoid touching renamed directory if parent does not change
2024-01-11 20:00:22 -08:00
Linus Torvalds
49f4810356 NFSD 6.8 Release Notes
The bulk of the patches for this release are clean-ups and minor bug
 fixes.
 
 There is one significant revert to mention: support for RDMA Read
 operations in the server's RPC-over-RDMA transport implementation
 has been fixed so it waits for Read completion in a way that avoids
 tying up an nfsd thread. This prevents a possible DoS vector if an
 RPC-over-RDMA client should become unresponsive during RDMA Read
 operations.
 
 As always I am grateful to NFSD contributors, reviewers, and
 testers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmWdW34ACgkQM2qzM29m
 f5fKmw/+PcjoNDWR55kTmOo8j0h4HF8rhunvP2C50svnnsX63y1WKkLaxyAFN/Hl
 UFucJDQBjJvwi+PEbGOXcjkizuG5mhRBFvFIYDJYGWsE1s7B/v3E/Servvt1wSek
 UjoTjknYrqH6R3YfA8zBaWRJUXwvVQW3Bzo4mShrQK7He9/7nBHdUe0aWbAA9oW3
 QgzKH/FzqCS03MvuxQv74KgBcl3diIrDaj041A3CtSnXzSKqwc3LaUAd5B4BL+oq
 GnxpV1rtZla50M4Ntddi+vSjUvHWZySQ1GEJj7rKLTwpGXkxM2NuMkGx676WR4Iv
 sYDX0fsica2elKbqJem8pk68qi6XEdZVAdoOHdgNJRClmYHby8xkrL/TYKiQZf42
 IN9FogoVSZ+vSdI158Weim9+0Jqf+ffIh57ZtOyQQQAGZkdhB6GhcbdHJhQ9eOgB
 LAiAL7bsoWvDmBh5m9KnBmQYGpZoDUa6AT0bIvGD2O4/MdpHBkyT8Xwt+210nPOK
 mpBtxe5O8cUcg7A5/TwnVRg5jKp4CF8VWh2R8sGDhcYV8UfRthB38h4rHNhv4vxt
 l6ZUgmtTxrs1rCeh6aoiWTKXeQmI8meWlcet7cxw/axAsaTXkYPi5mslxF9f4O8u
 nQ8q7LuZQy2CKZO/t98STwx7s9OJcDOwcy51rnKK85TlCwnxFWg=
 =mIKg
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "The bulk of the patches for this release are clean-ups and minor bug
  fixes.

  There is one significant revert to mention: support for RDMA Read
  operations in the server's RPC-over-RDMA transport implementation has
  been fixed so it waits for Read completion in a way that avoids tying
  up an nfsd thread. This prevents a possible DoS vector if an
  RPC-over-RDMA client should become unresponsive during RDMA Read
  operations.

  As always I am grateful to NFSD contributors, reviewers, and testers"

* tag 'nfsd-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits)
  nfsd: rename nfsd_last_thread() to nfsd_destroy_serv()
  SUNRPC: discard sv_refcnt, and svc_get/svc_put
  svc: don't hold reference for poolstats, only mutex.
  SUNRPC: remove printk when back channel request not found
  svcrdma: Implement multi-stage Read completion again
  svcrdma: Copy construction of svc_rqst::rq_arg to rdma_read_complete()
  svcrdma: Add back svcxprt_rdma::sc_read_complete_q
  svcrdma: Add back svc_rdma_recv_ctxt::rc_pages
  svcrdma: Clean up comment in svc_rdma_accept()
  svcrdma: Remove queue-shortening warnings
  svcrdma: Remove pointer addresses shown in dprintk()
  svcrdma: Optimize svc_rdma_cc_init()
  svcrdma: De-duplicate completion ID initialization helpers
  svcrdma: Move the svc_rdma_cc_init() call
  svcrdma: Remove struct svc_rdma_read_info
  svcrdma: Update the synopsis of svc_rdma_read_special()
  svcrdma: Update the synopsis of svc_rdma_read_call_chunk()
  svcrdma: Update synopsis of svc_rdma_read_multiple_chunks()
  svcrdma: Update synopsis of svc_rdma_copy_inline_range()
  svcrdma: Update the synopsis of svc_rdma_read_data_item()
  ...
2024-01-10 10:20:08 -08:00
Linus Torvalds
fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Linus Torvalds
bb93c5ed45 vfs-6.8.rw
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzXQAKCRCRxhvAZXjc
 ogOtAQDpqUp1zY4dV/dZisCJ5xarZTsSZ1AvgmcxZBtS0NhbdgEAshWvYGA9ryS/
 ChL5jjtjjZDLhRA//reoFHTQIrdp2w8=
 =bF+R
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs rw updates from Christian Brauner:
 "This contains updates from Amir for read-write backing file helpers
  for stacking filesystems such as overlayfs:

   - Fanotify is currently in the process of introducing pre content
     events. Roughly, a new permission event will be added indicating
     that it is safe to write to the file being accessed. These events
     are used by hierarchical storage managers to e.g., fill the content
     of files on first access.

     During that work we noticed that our current permission checking is
     inconsistent in rw_verify_area() and remap_verify_area().
     Especially in the splice code permission checking is done multiple
     times. For example, one time for the whole range and then again for
     partial ranges inside the iterator.

     In addition, we mostly do permission checking before we call
     file_start_write() except for a few places where we call it after.
     For pre-content events we need such permission checking to be done
     before file_start_write(). So this is a nice reason to clean this
     all up.

     After this series, all permission checking is done before
     file_start_write().

     As part of this cleanup we also massaged the splice code a bit. We
     got rid of a few helpers because we are alredy drowning in special
     read-write helpers. We also cleaned up the return types for splice
     helpers.

   - Introduce generic read-write helpers for backing files. This lifts
     some overlayfs code to common code so it can be used by the FUSE
     passthrough work coming in over the next cycles. Make Amir and
     Miklos the maintainers for this new subsystem of the vfs"

* tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  fs: fix __sb_write_started() kerneldoc formatting
  fs: factor out backing_file_mmap() helper
  fs: factor out backing_file_splice_{read,write}() helpers
  fs: factor out backing_file_{read,write}_iter() helpers
  fs: prepare for stackable filesystems backing file helpers
  fsnotify: optionally pass access range in file permission hooks
  fsnotify: assert that file_start_write() is not held in permission hooks
  fsnotify: split fsnotify_perm() into two hooks
  fs: use splice_copy_file_range() inline helper
  splice: return type ssize_t from all helpers
  fs: use do_splice_direct() for nfsd/ksmbd server-side-copy
  fs: move file_start_write() into direct_splice_actor()
  fs: fork splice_file_range() from do_splice_direct()
  fs: create {sb,file}_write_not_started() helpers
  fs: create file_write_started() helper
  fs: create __sb_write_started() helper
  fs: move kiocb_start_write() into vfs_iocb_iter_write()
  fs: move permission hook out of do_iter_read()
  fs: move permission hook out of do_iter_write()
  fs: move file_start_write() into vfs_iter_write()
  ...
2024-01-08 11:11:51 -08:00
NeilBrown
17419aefcb nfsd: rename nfsd_last_thread() to nfsd_destroy_serv()
As this function now destroys the svc_serv, this is a better name.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:33 -05:00
NeilBrown
1e3577a452 SUNRPC: discard sv_refcnt, and svc_get/svc_put
sv_refcnt is no longer useful.
lockd and nfs-cb only ever have the svc active when there are a non-zero
number of threads, so sv_refcnt mirrors sv_nrthreads.

nfsd also keeps the svc active between when a socket is added and when
the first thread is started, but we don't really need a refcount for
that.  We can simply not destroy the svc while there are any permanent
sockets attached.

So remove sv_refcnt and the get/put functions.
Instead of a final call to svc_put(), call svc_destroy() instead.
This is changed to also store NULL in the passed-in pointer to make it
easier to avoid use-after-free situations.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:33 -05:00
NeilBrown
7b207ccd98 svc: don't hold reference for poolstats, only mutex.
A future patch will remove refcounting on svc_serv as it is of little
use.
It is currently used to keep the svc around while the pool_stats file is
open.
Change this to get the pointer, protected by the mutex, only in
seq_start, and the release the mutex in seq_stop.
This means that if the nfsd server is stopped and restarted while the
pool_stats file it open, then some pool stats info could be from the
first instance and some from the second.  This might appear odd, but is
unlikely to be a problem in practice.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:33 -05:00
ChenXiaoSong
52e8910075 NFSv4, NFSD: move enum nfs_cb_opnum4 to include/linux/nfs4.h
Callback operations enum is defined in client and server, move it to
common header file.

Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:26 -05:00