The physical device's favored NUMA node ID is available when
allocating a recv_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.
This clean up eliminates the hack of destroying recv_ctxts that
were not created by the receive CQ thread -- recv_ctxts are now
always allocated on a "good" node.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The physical device's NUMA node ID is available when allocating an
svc_xprt for an incoming connection. Use that value to ensure the
svc_xprt structure is allocated on the NUMA node closest to the
device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- A collection of minor bug fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRiQVEACgkQM2qzM29m
f5eYfBAAg5Qz45PL+fo1qWxkJ1ZKaNV1vPdi4tqCt9NEItDTjAnjj0am+rKNGZAz
EOM2yFt4xaZGyMYgXe4VnYl0N+rSpbI/H+Rk/wOq4OHPURQD5EO9VeP86qZ7rmGl
ECPqb39TFTwAiRomC/DHO4eNpoe1rQuXu0tW9+GmqDGxeuh8xdxTk33g17ZXwCFN
tdPkkPjxVPdWd8X7HQg9kWm8AWfV+GyuzE2rKAoOjbs6Wv6d9GCY8Cb5HXkRsQhF
4Zh0PVQuTuXurZwtPXwnS0k4kfvQwjlTIKHlXuo0ZLh+SuFbrWHzv0fVyD+kUpSK
HtWbJ8JcruUvz0WGMtZatzRLHCZLDguV6oVXPp7rtmuxTj4szzHSFpEeAV901sIm
Nkvuomvd02K/fiTo7s3yr6t1VG2vju9LDwhBe197iA3leHAlockfbbxE3NJMGbzQ
NoOPd+lu95cfsanOM1LZZLNfbLrZofoSLK9K1+HD0yAVdyq7u47FyHRrymvCaMrj
GiheuqrBfBMEq+2mCwUn37aM0FblYEXQM0xTVXPQcHtBBN/nGZxPJukmpr7ScNlR
aqMtDoOLu4OEFuo6fe2/94eNi+N5XAZgWmx/mSyaytE8Xw9LJxeQ83UTigaGcYKc
3YIuG1YXg9IyIoIdLkghB+Aj/6fivsGFK9Gud6g7I3xw4f15noA=
=PiNG
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- A collection of minor bug fixes
* tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Remove open coding of string copy
SUNRPC: Fix trace_svc_register() call site
SUNRPC: always free ctxt when freeing deferred request
SUNRPC: double free xprt_ctxt while still in use
SUNRPC: Fix error handling in svc_setup_socket()
SUNRPC: Fix encoding of accepted but unsuccessful RPC replies
lockd: define nlm_port_min,max with CONFIG_SYSCTL
nfsd: define exports_proc_ops with CONFIG_PROC_FS
SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
Since the ->xprt_ctxt pointer was added to svc_deferred_req, it has not
been sufficient to use kfree() to free a deferred request. We may need
to free the ctxt as well.
As freeing the ctxt is all that ->xpo_release_rqst() does, we repurpose
it to explicit do that even when the ctxt is not stored in an rqst.
So we now have ->xpo_release_ctxt() which is given an xprt and a ctxt,
which may have been taken either from an rqst or from a dreq. The
caller is now responsible for clearing that pointer after the call to
->xpo_release_ctxt.
We also clear dr->xprt_ctxt when the ctxt is moved into a new rqst when
revisiting a deferred request. This ensures there is only one pointer
to the ctxt, so the risk of double freeing in future is reduced. The
new code in svc_xprt_release which releases both the ctxt and any
rq_deferred depends on this.
Fixes: 773f91b2cf ("SUNRPC: Fix NFSD's request deferral on RDMA transports")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The big ticket item for this release is support for RPC-with-TLS
[RFC 9289] has been added to the Linux NFS server. The goal is to
provide a simple-to-deploy, low-overhead in-transit confidentiality
and peer authentication mechanism. It can supplement NFS Kerberos
and it can protect the use of legacy non-cryptographic user
authentication flavors such as AUTH_SYS. The TLS Record protocol is
handled entirely by kTLS, meaning it can use either software
encryption or offload encryption to smart NICs.
Work continues on improving NFSD's open file cache. Among the many
clean-ups in that area is a patch to convert the rhashtable to use
the list-hashing version of that data structure.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRK/JMACgkQM2qzM29m
f5cF5A/+JZFRSPlfSYt0YHzUQQSDdYn5n/IG9TwJQd62xheu083WuKRaCOYYoOhg
06nZd6p7nuF1E0n2ZWOKSE6YkBSE6z4M6KrQlm6lCe/nmxYCR87IYfJCXuL+Yf0e
/LdL4OTvDHzY5ec1DreERldPIUJ8GFzwChH8/z4XwbNDR7qJkF/gf8YxpFr+8K+j
Cfyl8woZiEze/Nvxy1YtAqa7HMEpitt0aWJN55rHwTh9c3b0nmDzziYFcVqXgybJ
/qUHfHBak66ll8RqhcQ3BMuyfszwASERbPsaZ2a5F/RaxLL5ZWfFyhgQwm+PZWT+
J5DdSBwLEQYtKQGD41A1aorP6v/u2QelfWrl4S7/qjRpREp8Ba2IU4fYLjGb1499
Imk68BA7NwFp87tdMi/7en1VVgina4U/S3X71aUYWe+C0g48BfTrVwq4SVbQSAo4
1638vbZnrJbsJMr9OaaysKWfv4KZB020Ji1KVwuqmgy5F8kdfJCCQ2UR/fHuJ3DY
R0Zrd1Ryjwr83viP+Xj0ERiW405gPdCT0RJqoA7rznRPCqT5M42tf5z65uO7iZeE
C1udgDaoQOtioKlem6FcDXLkryf986slGA7V91lat/Jt8A5jLKQfjVe3Q+kaaqXP
ka1DQnYelzMzILQQs39cqW5pShrH8e3tfRZ7JhdBgrpxVXz9ZZM=
=lA2+
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The big ticket item for this release is that support for RPC-with-TLS
[RFC 9289] has been added to the Linux NFS server.
The goal is to provide a simple-to-deploy, low-overhead in-transit
confidentiality and peer authentication mechanism. It can supplement
NFS Kerberos and it can protect the use of legacy non-cryptographic
user authentication flavors such as AUTH_SYS. The TLS Record protocol
is handled entirely by kTLS, meaning it can use either software
encryption or offload encryption to smart NICs.
Aside from that, work continues on improving NFSD's open file cache.
Among the many clean-ups in that area is a patch to convert the
rhashtable to use the list-hashing version of that data structure"
* tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
NFSD: Handle new xprtsec= export option
SUNRPC: Support TLS handshake in the server-side TCP socket code
NFSD: Clean up xattr memory allocation flags
NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop
SUNRPC: Clear rq_xid when receiving a new RPC Call
SUNRPC: Recognize control messages in server-side TCP socket code
SUNRPC: Be even lazier about releasing pages
SUNRPC: Convert svc_xprt_release() to the release_pages() API
SUNRPC: Relocate svc_free_res_pages()
nfsd: simplify the delayed disposal list code
SUNRPC: Ignore return value of ->xpo_sendto
SUNRPC: Ensure server-side sockets have a sock->file
NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
SUNRPC: return proper error from get_expiry()
lockd: add some client-side tracepoints
nfs: move nfs_fhandle_hash to common include file
lockd: server should unlock lock if client rejects the grant
lockd: fix races in client GRANTED_MSG wait logic
lockd: move struct nlm_wait to lockd.h
...
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There is no need to declare an extra tables to just create directory,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There's no need for the cost of this extra virtual function call
during every RPC transaction: the RQ_SECURE bit can be set properly
in ->xpo_recvfrom() instead.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Fix a race when creating NFSv4 files
- Revert the use of relaxed bitops
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmO9twgACgkQM2qzM29m
f5cQ/A//SSv/eZl2cnAMZtN1zd7wIMfI6E9y8Ccv49aebUXGmGDKSwz/CUns2sgO
avWentInUYg2cexIjaQnQeGkiQt0Do+3u/cdT86h2e8q3UhvctYWO5uRCqbP+36H
JRLfNUUbic4P8Yp/LZ5DvwOWae4PLdZq71mxJkaTXGHt8zLn/yEntCY8jb6V7D2L
SxMXAoO05bdzfPc8lXKmaGi4JMsANEOMh5ZMRpKxKTEFQG352db17MqwOAW/Qe+t
mMXY2jRfeufFwimmwLK06EzItgcs6D9g7dM3oIwDUNiPL4l3lOYeynbYOref7fD3
4u11LwZdzZ5LYIZ0HoTpRu3ZxAbrTtmd1FiT7SwN9jjq1vu0Zx0sfqk0R9VixY3c
jP+wYKEDTQUkIVdbG6g/u6yQZvwM281+GiAXoD3FJWKJDwAaqwxd6cphCn314RKY
hlgG4DGhAi0BYbiLVu5ObQwRb1yPgCP2pXqguAdAKbTM2DVC2+hAW3NDUcIKrR1U
JoXmGBaWeuJU9/0JbfVzddXUCs227hnovj1nmGW7E8JUegW4m+3oscEP8tsC5H5S
J3Jr9ovxyYGQE1qxM5909hjPjrZxI3NszKIpgWoo9/jJLUWfGtnS2BclrXUxQrdl
rvbKHvmSLyOsFYnZ5Nt7uj1l7LtWMljrjOjPqe02iU6pRDNHa9Y=
=/7AX
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix a race when creating NFSv4 files
- Revert the use of relaxed bitops
* tag 'nfsd-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Use set_bit(RQ_DROPME)
Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"
nfsd: fix handling of cached open files in nfsd4_open codepath
The premise that "Once an svc thread is scheduled and executing an
RPC, no other processes will touch svc_rqst::rq_flags" is false.
svc_xprt_enqueue() examines the RQ_BUSY flag in scheduled nfsd
threads when determining which thread to wake up next.
Found via KCSAN.
Fixes: 28df098881 ("SUNRPC: Use RMW bitops in single-threaded hot paths")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If rdma receive buffer allocate failed, should call rpcrdma_regbuf_free()
to free the send buffer, otherwise, the buffer data will be leaked.
Fixes: bb93a1ae2b ("xprtrdma: Allocate req's regbufs at xprt create time")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
net/sunrpc/xprtrdma/frwr_ops.c:151:32: warning: variable 'rc' is uninitialized when used here [-Wuninitialized]
trace_xprtrdma_frwr_alloc(mr, rc);
^~
net/sunrpc/xprtrdma/frwr_ops.c:127:8: note: initialize the variable 'rc' to silence this warning
int rc;
^
= 0
1 warning generated.
The tracepoint is intended to record the error returned from
ib_alloc_mr(). In the current code there is no other purpose for
@rc, so simply replace it.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: d8cf39a280c3b0 ('xprtrdma: MR-related memory allocation should be allowed to fail')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Many memory allocations that xprtrdma does can fail safely. Let's
use this fact to avoid some potential deadlocks: Replace GFP_KERNEL
with GFP flags that do not try hard to acquire memory.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
An attempt to establish a connection can always fail and then be
retried. GFP_KERNEL allocation is not necessary here.
Like MR allocation, establishing a connection is always done in a
worker thread. The new GFP flags align with the flags that would
be returned by rpc_task_gfp_mask() in this case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
xprtrdma always drives a retry of MR allocation if it should fail.
It should be safe to not use GFP_KERNEL for this purpose rather
than sleeping in the memory allocator.
In theory, if these weaker allocations are attempted first, memory
exhaustion is likely to cause xprtrdma to fail fast and not then
invoke the RDMA core APIs, which still might use GFP_KERNEL.
Also note that rpc_task_gfp_mask() always sets __GFP_NORETRY and
__GFP_NOWARN when an RPC-related allocation is being done in a
worker thread. MR allocation is already always done in worker
threads.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently all rpcrdma_regbuf_alloc() call sites pass the same value
as their third argument. That argument can therefore be eliminated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 1769e6a816 ("xprtrdma: Clean up rpcrdma_create_req()")
added rpcrdma_req_create() with a GFP flags argument in case a
caller might want to avoid waiting for memory.
There has never been a caller that does not pass GFP_KERNEL as
the third argument. That argument can therefore be eliminated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
xprt_rdma_bc_allocate() is now the only user of RPCRDMA_DEF_GFP.
Replace that macro with the raw flags.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
While setting up a new lab, I accidentally misconfigured the
Ethernet port for a system that tried an NFS mount using RoCE.
This made the NFS server unreachable. The following WARNING
popped on the NFS client while waiting for the mount attempt to
time out:
kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
kernel: FS: 0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel: <TASK>
kernel: __flush_work.isra.0+0xaf/0x188
kernel: ? _raw_spin_lock_irqsave+0x2c/0x37
kernel: ? lock_timer_base+0x38/0x5f
kernel: __cancel_work_timer+0xea/0x13d
kernel: ? preempt_latency_start+0x2b/0x46
kernel: rdma_addr_cancel+0x70/0x81 [ib_core]
kernel: _destroy_id+0x1a/0x246 [rdma_cm]
kernel: rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
kernel: ? _raw_spin_unlock+0x14/0x29
kernel: ? raw_spin_rq_unlock_irq+0x5/0x10
kernel: ? finish_task_switch.isra.0+0x171/0x249
kernel: xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
kernel: process_one_work+0x1d8/0x2d4
kernel: worker_thread+0x18b/0x24f
kernel: ? rescuer_thread+0x280/0x280
kernel: kthread+0xf4/0xfc
kernel: ? kthread_complete_and_exit+0x1b/0x1b
kernel: ret_from_fork+0x22/0x30
kernel: </TASK>
SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that
one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
prevent a priority inversion. The internal workqueues in the
RDMA/core are currently non-MEM_RECLAIM.
Jason Gunthorpe says this about the current state of RDMA/core:
> If you attempt to do a reconnection/etc from within a RECLAIM
> context it will deadlock on one of the many allocations that are
> made to support opening the connection.
>
> The general idea of reclaim is that the entire task context
> working under the reclaim is marked with an override of the gfp
> flags to make all allocations under that call chain reclaim safe.
>
> But rdmacm does allocations outside this, eg in the WQs processing
> the CM packets. So this doesn't work and we will deadlock.
>
> Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM
> here and there.
So we will change the ULP in this case to avoid the use of
WQ_MEM_RECLAIM where possible. Deadlocks that were possible before
are not fixed, but at least we no longer have a false sense of
confidence that the stack won't allocate memory during memory
reclaim.
Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use the standard gfp mask instead of using GFP_NOWAIT. The latter causes
issues when under memory pressure.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
- There is now a backup maintainer for NFSD
Notable fixes:
- Prevent array overruns in svc_rdma_build_writes()
- Prevent buffer overruns when encoding NFSv3 READDIR results
- Fix a potential UAF in nfsd_file_put()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmKjhbsACgkQM2qzM29m
f5cgig/6A9gC2c9v4lR2fH6ufiCWvJBfuVaBbToubwktJHaDLvqH56JcvS3s/gKL
PKGmbQTI/6lgmVgJqQSxJUnfe6wzHx8G1MdjlZEIwi3pUeiV+LpiprZz9TOhgYYV
YDnXRGhb4wKOm75w8rb6X8k106XdKwdBaRQwb88FDawffWoEY0XNYrlmNmmWi8To
ELlOlIwRCBbKJoJ6yEEWQRrBuVBXapbsn29tipZXbdo58g+vL0yDQq9s97b0mHhi
C2apAN2+k18FiBJsA7b7pW1l/P6k9FNEeetvgWyN8OSMpPNmt0vz1HvKaIstPgg1
BX6rgWe5eQBFEk2KNvSGHrV3R+wAp7jeuVpHUMjxXvzmfj4exJV/H8lu+qZJNDGN
ybCJatomR4APFxk+s1kptlzNo7zfyPz15L80HmWIngYJ/lrBOoKPIIi3bwPQcBwW
q2Rc+SlvpqbJvEcomgF/lqQN6inmx44J+KpOSA/S8qSIdSkz0iaZsDahFxgZNe82
h+X/i1maRtnSIvWdGMR7O6kEFT5jky35WlTv/VutTOsUwA4mUU9vZUnufBBHJH07
nOdLMi/QS/O5GOnlyegrODtN75wi+IeKt+WMNmnN+JB8Tsg0kZwjOsc/dbQfyJqP
PrQJ5AUP0TMm90B1873z8yKmhWeXtB71vgAI/d53aadBG4ZEstU=
=6ugk
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Notable changes:
- There is now a backup maintainer for NFSD
Notable fixes:
- Prevent array overruns in svc_rdma_build_writes()
- Prevent buffer overruns when encoding NFSv3 READDIR results
- Fix a potential UAF in nfsd_file_put()"
* tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Remove pointer type casts from xdr_get_next_encode_buffer()
SUNRPC: Clean up xdr_get_next_encode_buffer()
SUNRPC: Clean up xdr_commit_encode()
SUNRPC: Optimize xdr_reserve_space()
SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer()
SUNRPC: Trap RDMA segment overflows
NFSD: Fix potential use-after-free in nfsd_file_put()
MAINTAINERS: reciprocal co-maintainership for file locking and nfsd
Prevent svc_rdma_build_writes() from walking off the end of a Write
chunk's segment array. Caught with KASAN.
The test that this fix replaces is invalid, and might have been left
over from an earlier prototype of the PCL work.
Fixes: 7a1cbfa180 ("svcrdma: Use parsed chunk lists to construct RDMA Writes")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- New Features:
- Add support for 'dacl' and 'sacl' attributes
- Bugfixes and Cleanups:
- Fixes for reporting mapping errors
- Fixes for memory allocation errors
- Improve warning message when locks are lost
- Update documentation for the nfs4_unique_id parameter
- Add an explanation of NFSv4 client identifiers
- Ensure the i_size attribute is written to the fscache storage
- Fix freeing uninitialized nfs4_labels
- Better handling when xprtrdma bc_serv is NULL
- Marke qualified async operations as MOVEABLE tasks
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmKWhFQACgkQ18tUv7Cl
QOszjRAAllmtKLbzOkwQwcT3e5ljh9NEJ8NL+Nv1SXjozpFY+1fuXc0ivT4rniU6
68ZHz+faK2UtLytwOO94M0jo2RCAlYS5rfnts89CpdfP3bqmGPAj0Ytw/c/vg+Qf
4eQbAzz++T35DgU7cdeKKZKg9Wtwbq7g0kYv1W8QCiCbxakSjnc/V9Ll5XhS/CAC
1WqKD90TRKUkX0Y1NNsNdXB1dJn/6QAq9B6JTjan+2Rhn7/NCTU8p98mEZGcVD7r
cPHyXTqkPF4IH7lgjEMIRf6eXEzDDZNIs98QLdHJ2Gk0LxW7p7IL7VW8TKzYunvl
coA1bZfYhUZBUJ+eDrrKZ5hHMSn/+eNR5iiIcfqtyU8o3J0NXAXGlLh/iJSGsxIH
PjyjWSfpCgoZVPc4dG3lxR9Iu7UZeAuuB2ZoiNakUkd+UNKK5U5PpaPnYT6adaIp
TegivZclCmgyLQiAdPRifDzhaL5J2pp6kVb5iMY6oX+ObyclW/UcqzKMqIKSt3R8
6+JAmZ6633ojS4r3xFsw/dlEUWuuVq7kYwXK209LqiBn5vvjWNa/WgH4MaSfnJe9
rlw+fs8Aky0w59IhzRJMMVCJ/Q2EYDKmtQLQgYVw80RBFiFgBpMW0wDqMGiddTcu
1IZ2c5+t1GxfASpyu8miexQjRJW6A2MTp0gfHGiHarxdCpaAycA=
=0ccI
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Add support for 'dacl' and 'sacl' attributes
Bugfixes and Cleanups:
- Fixes for reporting mapping errors
- Fixes for memory allocation errors
- Improve warning message when locks are lost
- Update documentation for the nfs4_unique_id parameter
- Add an explanation of NFSv4 client identifiers
- Ensure the i_size attribute is written to the fscache storage
- Fix freeing uninitialized nfs4_labels
- Better handling when xprtrdma bc_serv is NULL
- Mark qualified async operations as MOVEABLE tasks"
* tag 'nfs-for-5.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4.1 mark qualified async operations as MOVEABLE tasks
xprtrdma: treat all calls not a bcall when bc_serv is NULL
NFSv4: Fix free of uninitialized nfs4_label on referral lookup.
NFS: Pass i_size to fscache_unuse_cookie() when a file is released
Documentation: Add an explanation of NFSv4 client identifiers
NFS: update documentation for the nfs4_unique_id parameter
NFS: Improve warning message when locks are lost.
NFSv4.1: Enable access to the NFSv4.1 'dacl' and 'sacl' attributes
NFSv4: Add encoders/decoders for the NFSv4.1 dacl and sacl attributes
NFSv4: Specify the type of ACL to cache
NFSv4: Don't hold the layoutget locks across multiple RPC calls
pNFS/files: Fall back to I/O through the MDS on non-fatal layout errors
NFS: Further fixes to the writeback error handling
NFSv4/pNFS: Do not fail I/O when we fail to allocate the pNFS layout
NFS: Memory allocation failures are not server fatal errors
NFS: Don't report errors from nfs_pageio_complete() more than once
NFS: Do not report flush errors in nfs_write_end()
NFS: Don't report ENOSPC write errors twice
NFS: fsync() should report filesystem errors over EINTR/ERESTARTSYS
NFS: Do not report EINTR/ERESTARTSYS as mapping errors
Small collection of incremental improvement patches:
- Minor code cleanup patches, comment improvements, etc from static tools
- Clean the some of the kernel caps, reducing the historical stealth uAPI
leftovers
- Bug fixes and minor changes for rdmavt, hns, rxe, irdma
- Remove unimplemented cruft from rxe
- Reorganize UMR QP code in mlx5 to avoid going through the IB verbs layer
- flush_workqueue(system_unbound_wq) removal
- Ensure rxe waits for objects to be unused before allowing the core to
free them
- Several rc quality bug fixes for hfi1
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYo+NxgAKCRCFwuHvBreF
YbqSAQDJ+QolaATUvOQUPLbuLopUCJLe95VS15Kl3SNXiVUUFAEA8DLL1s6+WShd
AgypUxGHipx5BAytrn45/WiwuDeEbQ8=
=jgTl
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Small collection of incremental improvement patches:
- Minor code cleanup patches, comment improvements, etc from static
tools
- Clean the some of the kernel caps, reducing the historical stealth
uAPI leftovers
- Bug fixes and minor changes for rdmavt, hns, rxe, irdma
- Remove unimplemented cruft from rxe
- Reorganize UMR QP code in mlx5 to avoid going through the IB verbs
layer
- flush_workqueue(system_unbound_wq) removal
- Ensure rxe waits for objects to be unused before allowing the core
to free them
- Several rc quality bug fixes for hfi1"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (67 commits)
RDMA/rtrs-clt: Fix one kernel-doc comment
RDMA/hfi1: Remove all traces of diagpkt support
RDMA/hfi1: Consolidate software versions
RDMA/hfi1: Remove pointless driver version
RDMA/hfi1: Fix potential integer multiplication overflow errors
RDMA/hfi1: Prevent panic when SDMA is disabled
RDMA/hfi1: Prevent use of lock before it is initialized
RDMA/rxe: Fix an error handling path in rxe_get_mcg()
IB/core: Fix typo in comment
RDMA/core: Fix typo in comment
IB/hf1: Fix typo in comment
IB/qib: Fix typo in comment
IB/iser: Fix typo in comment
RDMA/mlx4: Avoid flush_scheduled_work() usage
IB/isert: Avoid flush_scheduled_work() usage
RDMA/mlx5: Remove duplicate pointer assignment in mlx5_ib_alloc_implicit_mr()
RDMA/qedr: Remove unnecessary synchronize_irq() before free_irq()
RDMA/hns: Use hr_reg_read() instead of remaining roce_get_xxx()
RDMA/hns: Use hr_reg_xxx() instead of remaining roce_set_xxx()
RDMA/irdma: Add SW mechanism to generate completions on error
...
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmKKlIAeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGC3oH/iPm/fLG2sJut8My
sU0RC9K+6ESV5h2Qy6k00/lqKstlu4EvBjw4V8vYpx3Q2+hbSFMn2SeWqqqT3Lkk
Zb8KINCFuuyMtdCBb42PV0zhUf5pCQF7ocm/Ae4jllDHtPmqk3WJ6IGtZBK5JBlw
z6RR/wKt0y0MRj9eZyPyYjOee2L2vuVh4tgnexK/4L8g2ZtMMRThhvUzSMWG4zxR
STYYNp0uFcfT1Vt85+ODevFH4TvdECAj+SqAegN+seHLM17YY7M0/WiIYpxGRv8P
lIpDQl4PBU8EBkpI5hkpJ/3qPincbuVOMLsYfxFtpcjjG12vGjFp2krGpS3TedZQ
3mvaJ7c=
=vLke
-----END PGP SIGNATURE-----
Merge tag 'v5.18' into rdma.git for-next
Following patches have dependencies.
Resolve the merge conflict in
drivers/net/ethernet/mellanox/mlx5/core/main.c by keeping the new names
for the fs functions following linux-next:
https://lore.kernel.org/r/20220519113529.226bc3e2@canb.auug.org.au/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
I noticed CPU pipeline stalls while using perf.
Once an svc thread is scheduled and executing an RPC, no other
processes will touch svc_rqst::rq_flags. Thus bus-locked atomics are
not needed outside the svc thread scheduler.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Split out flags from ib_device::device_cap_flags that are only used
internally to the kernel into kernel_cap_flags that is not part of the
uapi. This limits the device_cap_flags to being the same bitmap that will
be copied to userspace.
This cleanly splits out the uverbs flags from the kernel flags to avoid
confusion in the flags bitmap.
Add some short comments describing which each of the kernel flags is
connected to. Remove unused kernel flags.
Link: https://lore.kernel.org/r/0-v2-22c19e565eef+139a-kern_caps_jgg@nvidia.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Trond Myklebust reports an NFSD crash in svc_rdma_sendto(). Further
investigation shows that the crash occurred while NFSD was handling
a deferred request.
This patch addresses two inter-related issues that prevent request
deferral from working correctly for RPC/RDMA requests:
1. Prevent the crash by ensuring that the original
svc_rqst::rq_xprt_ctxt value is available when the request is
revisited. Otherwise svc_rdma_sendto() does not have a Receive
context available with which to construct its reply.
2. Possibly since before commit 71641d99ce ("svcrdma: Properly
compute .len and .buflen for received RPC Calls"),
svc_rdma_recvfrom() did not include the transport header in the
returned xdr_buf. There should have been no need for svc_defer()
and friends to save and restore that header, as of that commit.
This issue is addressed in a backport-friendly way by simply
having svc_rdma_recvfrom() set rq_xprt_hlen to zero
unconditionally, just as svc_tcp_recvfrom() does. This enables
svc_deferred_recv() to correctly reconstruct an RPC message
received via RPC/RDMA.
Reported-by: Trond Myklebust <trondmy@hammerspace.com>
Link: https://lore.kernel.org/linux-nfs/82662b7190f26fb304eb0ab1bb04279072439d4e.camel@hammerspace.com/
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Highlights include:
Features:
- Switch NFS to use readahead instead of the obsolete readpages.
- Readdir fixes to improve cacheability of large directories when there
are multiple readers and writers.
- Readdir performance improvements when doing a seekdir() immediately
after opening the directory (common when re-exporting NFS).
- NFS swap improvements from Neil Brown.
- Loosen up memory allocation to permit direct reclaim and write back
in cases where there is no danger of deadlocking the writeback code or
NFS swap.
- Avoid sillyrename when the NFSv4 server claims to support the
necessary features to recover the unlinked but open file after reboot.
Bugfixes:
- Patch from Olga to add a mount option to control NFSv4.1 session
trunking discovery, and default it to being off.
- Fix a lockup in nfs_do_recoalesce().
- Two fixes for list iterator variables being used when pointing to the
list head.
- Fix a kernel memory scribble when reading from a non-socket transport
in /sys/kernel/sunrpc.
- Fix a race where reconnecting to a server could leave the TCP socket
stuck forever in the connecting state.
- Patch from Neil to fix a shutdown race which can leave the SUNRPC
transport timer primed after we free the struct xprt itself.
- Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy
offload.
- Sunrpc patch from Olga to avoid resending a task on an offlined
transport.
Cleanups:
- Patches from Dave Wysochanski to clean up the fscache code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJDLLUACgkQZwvnipYK
APJZ5Q/9FH5S38RcUOxpFS+XaVdmxltJTMQv9Bz5mcyljq6PwRjTn73Qs7MfBQfD
k90O9Pn/EXgQpFWNW4o/saQGLAO1MbVaesdNNF92Le9NziMz7lr95fsFXOqpg9EQ
rEVOImWinv/N0+P0yv5tgjx6Fheu70L4dnwsxepnmtv2R1VjXOVchK7WjsV+9qey
Xvn15OzhtVCGhRvvKfmGuGW+2I0D29tRNn2X0Pi+LjxAEQ0We/rcsJeGDMGYtbeC
2U4ADk7KB0kJq04WYCH8k8tTrljL3PUONzO6QP1lltmAZ0FOwXf/AjyaD3SML2Xu
A1yc2bEn3/S6wCQ3BiMxoE7UCSewg/ZyQD7B9GjZ/nWYNXREhLKvtVZt+ULTCA6o
UKAI2vlEHljzQI2YfbVVZFyL8KCZqi+BFcJ/H/dtFKNazIkiefkaF+703Pflazis
HjhA5G6IT0Pk6qZ6OBp+Z/4kCDYZYXXi9ysiNuMPWCiYEfZzt7ocUjm44uRAi+vR
Qpf0V01Rg4iplTj3YFXqb9TwxrudzC7AcQ7hbLJia0kD/8djG+aGZs3SJWeO+VuE
Xa41oxAx9jqOSRvSKPvj7nbexrwn8PtQI2NtDRqjPXlg6+Jbk23u8lIEVhsCH877
xoMUHtTKLyqECskzRN4nh7qoBD3QfiLuI74aDcPtGEuXzU1+Ox8=
=jeGt
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- Switch NFS to use readahead instead of the obsolete readpages.
- Readdir fixes to improve cacheability of large directories when
there are multiple readers and writers.
- Readdir performance improvements when doing a seekdir() immediately
after opening the directory (common when re-exporting NFS).
- NFS swap improvements from Neil Brown.
- Loosen up memory allocation to permit direct reclaim and write back
in cases where there is no danger of deadlocking the writeback code
or NFS swap.
- Avoid sillyrename when the NFSv4 server claims to support the
necessary features to recover the unlinked but open file after
reboot.
Bugfixes:
- Patch from Olga to add a mount option to control NFSv4.1 session
trunking discovery, and default it to being off.
- Fix a lockup in nfs_do_recoalesce().
- Two fixes for list iterator variables being used when pointing to
the list head.
- Fix a kernel memory scribble when reading from a non-socket
transport in /sys/kernel/sunrpc.
- Fix a race where reconnecting to a server could leave the TCP
socket stuck forever in the connecting state.
- Patch from Neil to fix a shutdown race which can leave the SUNRPC
transport timer primed after we free the struct xprt itself.
- Patch from Xin Xiong to fix reference count leaks in the NFSv4.2
copy offload.
- Sunrpc patch from Olga to avoid resending a task on an offlined
transport.
Cleanups:
- Patches from Dave Wysochanski to clean up the fscache code"
* tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits)
NFSv4/pNFS: Fix another issue with a list iterator pointing to the head
NFS: Don't loop forever in nfs_do_recoalesce()
SUNRPC: Don't return error values in sysfs read of closed files
SUNRPC: Do not dereference non-socket transports in sysfs
NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error
SUNRPC don't resend a task on an offlined transport
NFS: replace usage of found with dedicated list iterator variable
SUNRPC: avoid race between mod_timer() and del_timer_sync()
pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod
pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod
NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod
NFS: Avoid writeback threads getting stuck in mempool_alloc()
NFS: nfsiod should not block forever in mempool_alloc()
SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent
SUNRPC: Fix unx_lookup_cred() allocation
NFS: Fix memory allocation in rpc_alloc_task()
NFS: Fix memory allocation in rpc_malloc()
SUNRPC: Improve accuracy of socket ENOBUFS determination
SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE
SUNRPC: Fix socket waits for write buffer space
...
rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC
to be used for some allocations. This is needed in some cases, but not
in all where it is currently provided, and in some where it isn't
provided.
Currently *all* tasks associated with a rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.
GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does
need it.
xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks,
then it blocks all other queued tasks. So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.
Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.
So with this patch:
1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
2/ __rpc_execute() sets PF_MEMALLOC while handling any task
with RPC_TASK_SWAPPER set, or when handling any task that
holds the XPRT_LOCKED lock on an xprt used for swap.
This removes the need for the RPC_IS_SWAPPER() test
in ->buf_alloc handlers.
3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
any task to a swapper xprt. __rpc_execute() will clear it.
3/ PF_MEMALLOC is set for all the connect workers.
Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts)
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything. So it
must not block waiting for memory.
xprt_dynamic_alloc_slot can block indefinitely. This can tie up all
workqueue threads and NFS can deadlock. So when called from a
workqueue, set __GFP_NORETRY.
The rdma alloc_slot already does not block. However it sets the error
to -EAGAIN suggesting this will trigger a sleep. It does not. As we
can see in call_reserveresult(), only -ENOMEM causes a sleep. -EAGAIN
causes immediate retry.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.
mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running. If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.
rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If there are failures then we must not leave the non-NULL pointers with
the error value, otherwise `rpcrdma_ep_destroy` gets confused and tries
free them, resulting in an Oops.
Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Deprecated. dprintk is no longer used in xprtrdma.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Deprecated. This information is available via tracepoints.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
support for a filehandle format deprecated 20 years ago, and further
xdr-related cleanup from Chuck.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGMPYkVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+JVwQAKbrpgbzl91u+T6W9MUGgQVzDpeP
XIy3NxCu/4pZ8SToWF3trz71sskokmkPPaZyuISD2C8e4DxO5LQ3fJLhtS9CjRFB
x4iZUxH7V2BoWrb5SY6TDWBEqaq4MY9f7tIbvUu5xpa0FIupLqJjYh2CP8vqtsbm
lblQKXz4ao0jwDzSVimNnPcTccpB25VIzwHsSOszRhN4rTjMgyHoETx2cqJne5IU
Tx/hH0UlpnwuQ7aVpcjMoKqIyUWDTMejx51pyZhHB47DVKL7HsnZvg59mTpXFcBx
29edvWT9yy1+w3nGkTYSkOgO9DyHvCbmQzIsvoYlmbZ2sdmTKK8Wuv2Ehcw3OfvL
MXGmy2EXIhzvTZXyN6pL1bBwwNSxdqJhVSxvrPLz1EymIkxf/IDI8eyUicVXd3Vq
K2xOn+CXyIbXWCU85ru8UA77r1+x//gSwqcJvtKUavbNJUwNt935CE2n3+o/0OL/
pToZ89nhcaRyDP1jJKA37K48VLNtBXzZZQlRovyLelNojam/kzZkXX8dI6oV9VD1
Ymjm0mbdZzwhE3C1HxKlxwZqhN+7YoyxMQuWjFMp28wxH+dkz/USCulKZ3/H+neD
0YBSgvwe92JqkZTW2AOjipL+beAuKJ4zsfCCl2XZig/rHGutiwOf2GfgdRmJM6AD
6aiufVWKNNRQef9y
=yKBl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
support for a filehandle format deprecated 20 years ago, and further
xdr-related cleanup from Chuck"
* tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
nfsd4: remove obselete comment
nfsd: document server-to-server-copy parameters
NFSD:fix boolreturn.cocci warning
nfsd: update create verifier comment
SUNRPC: Change return value type of .pc_encode
SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
NFSD: Save location of NFSv4 COMPOUND status
SUNRPC: Change return value type of .pc_decode
SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
SUNRPC: De-duplicate .pc_release() call sites
SUNRPC: Simplify the SVC dispatch code path
SUNRPC: Capture value of xdr_buf::page_base
SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
svcrdma: Split svcrmda_wc_{read,write} tracepoints
svcrdma: Split the svcrdma_wc_send() tracepoint
svcrdma: Split the svcrdma_wc_receive() tracepoint
NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
NFSD: Initialize pointer ni with NULL and not plain integer 0
NFSD: simplify struct nfsfh
...
This minor fix-up keeps GCC from complaining that "last' may be used
uninitialized", which breaks some build workflows that have been running
with all warnings treated as errors.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: this field is no longer used.
xprt_rdma_pad_optimize is also no longer used, but is left in place
because it is part of the kernel/userspace API.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This is a buffer to be left persistently registered while a
connection is up. Connection tear-down will automatically DMA-unmap,
invalidate, and dereg the MR. A persistently registered buffer is
lower in cost to provide, and it can never be coalesced into the
RDMA segment that carries the data payload.
An RPC that provisions a Write chunk with a non-aligned length now
uses this MR rather than the tail buffer of the RPC's rq_rcv_buf.
Reviewed-By: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
There are currently three separate purposes being served by single
tracepoints. Split them up, as was done with wc_send.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.
svcrdma_wc_send:
- status is always zero, so there's no value in recording it.
- vendor_err is meaningless unless status is not zero, so
there's no value in recording it.
- This tracepoint is needed only when developing modifications,
so it should be left disabled most of the time.
svcrdma_wc_send_flush:
- As above, needed only rarely, and not an error.
svcrdma_wc_send_err:
- This tracepoint can be left persistently enabled because
completion errors are run-time problems (except for FLUSHED_ERR).
- Tracepoint name now ends in _err to reflect its purpose.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.
svcrdma_wc_recv:
- status is always zero, so there's no value in recording it.
- vendor_err is meaningless unless status is not zero, so
there's no value in recording it.
- This tracepoint is needed only when developing modifications,
so it should be left disabled most of the time.
svcrdma_wc_recv_flush:
- As above, needed only rarely, and not an error.
svcrdma_wc_recv_err:
- received is always zero, so there's no value in recording it.
- This tracepoint can be left enabled because completion
errors are run-time problems (except for FLUSHED_ERR).
- Tracepoint name now ends in _err to reflect its purpose.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
- New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`
- Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmExP7AACgkQ18tUv7Cl
QOshTg/zBz7OfrS23CcLLgNidTJ6S7JOuj1DShG+YzsYXT8f9Nl1DadLM7yAEyok
6JZzC8rXYzJcmYztHZzRyTuzj1+tGGb0u/MrD0bBk42VEel6eOjH/Y9ybn12Gf/E
aqlcJh8hPx44U8oo5EFjRJsg2h28O06vywqhJz+sTbkqKN4hlAgMOo5ysAB+1thg
BrTlR84EKBw5QqxPJ1WPmq9tEyGebU9Yrj1p8f0Uf015IeRNeTOXx3NzmdPshphf
2yJvjumwEzqkcHXTJFDfP6ikIcGPPMNVAOK8DHb+vDGzNsOXW7dDM7GuWA3U8DlU
ZHvyyb05Wwe6Wwg8xwx90FEXcYZFfZbSKmI9z2uoOuGFzNG07zWzPDzRft+qrOvU
VMMwP9oEh71+qesmWTvqIbR2RjxqbCYlTcc8mBrD66DROi6jZ2jznraNC85sxG0Q
b8GE+2SnYr2Q25yehj2xrRlOXyiYNkeeYmIpIquEqH9o7cSyDNJhBWbzIv6x+ith
O/S06ZVKMc9X1nH5t5121XcHrSTMMVA/67WMyKfKMxWnrADAWPQALG+ttoTcbRu7
Txew3Jb+hB8+ZdHAqbPf1l1i+7USQl1CRHMw3GRvNjCL2qcjZb1R7eyJRSQQtUyw
q6SJRGe6Sn1FTUnn96Hv15Zy8VHx+q0cOL/EQVzL1RzJIXYcag==
=Ad/3
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`
Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers"
* tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits)
NFS: Always provide aligned buffers to the RPC read layers
NFSv4.1 add network transport when session trunking is detected
SUNRPC enforce creation of no more than max_connect xprts
NFSv4 introduce max_connect mount options
SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
SUNRPC keep track of number of transports to unique addresses
NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox
SUNRPC: Tweak TCP socket shutdown in the RPC client
SUNRPC: Simplify socket shutdown when not reusing TCP ports
NFSv4.2: remove restriction of copy size for inter-server copy.
NFS: Clean up the synopsis of callback process_op()
NFS: Extract the xdr_init_encode/decode() calls from decode_compound
NFS: Remove unused callback void decoder
NFS: Add a private local dispatcher for NFSv4 callback operations
SUNRPC: Eliminate the RQ_AUTHERR flag
SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
SUNRPC: Add svc_rqst::rq_auth_stat
SUNRPC: Add dst_port to the sysfs xprt info file
SUNRPC: Add srcaddr as a file in sysfs
sunrpc: Fix return value of get_srcport()
...
svc_xprt_free() already "puts" the bc_xprt before calling the
transport's "free" method. No need to do it twice.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Relieve contention on sc_rw_ctxt_lock by converting rdma->sc_rw_ctxts
to an llist.
The goal is to reduce the average overhead of Send completions,
because a transport's completion handlers are single-threaded on
one CPU core. This change reduces CPU utilization of each Send
completion by 2-3% on my server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
/proc/lock_stat indicates the the sc_send_lock is heavily
contended when the server is under load from a single client.
To address this, convert the send_ctxt free list to an llist.
Returning an item to the send_ctxt cache is now waitless, which
reduces the instruction path length in the single-threaded Send
handler (svc_rdma_wc_send).
The goal is to enable the ib_comp_wq worker to handle a higher
RPC/RDMA Send completion rate given the same CPU resources. This
change reduces CPU utilization of Send completion by 2-3% on my
server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
Because wake_up() takes an IRQ-safe lock, it can be expensive,
especially to call inside of a single-threaded completion handler.
What's more, the Send wait queue almost never has waiters, so
most of the time, this is an expensive no-op.
As always, the goal is to reduce the average overhead of each
completion, because a transport's completion handlers are single-
threaded on one CPU core. This change reduces CPU utilization of
the Send completion thread by 2-3% on my server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
The xprtrdma client code currently relies on the task that initiated the
connect to hold the XPRT_LOCK for the duration of the connection
attempt. If the task is woken early, due to some other event, then that
lock could get released early.
Avoid races by using the same mechanism that the socket code uses of
transferring lock ownership to the RDMA connect worker itself. That
frees us to call rpcrdma_xprt_disconnect() directly since we're now
guaranteed exclusion w.r.t. other callers.
Fixes: 4cf44be6f1 ("xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
Now that there is only one registration mode, there is only one
target "post_send" method: frwr_send(). rpcrdma_post_sends() no
longer adds much value, especially since all of its call sites
ignore the return code value except to check if it's non-zero.
Just have them call frwr_send() directly instead.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Unlike xprtrdma_post_send(), this one can be left enabled all the
time, and should almost never fire. But we do want to know about
immediate errors when they happen.
Note that there is already a similar post_linv_err tracepoint.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In the vast majority of cases, rc=0. Don't record that in the
post_recvs tracepoint. Instead, add a separate tracepoint that can
be left enabled all the time to capture the very rare immediate
errors returned by ib_post_recv().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure the tear-down completion is awoken only /after/ we've stopped
fiddling with rpcrdma_rep objects in rpcrdma_post_recvs().
Fixes: 15788d1d10 ("xprtrdma: Do not refresh Receive Queue while it is draining")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
ib_post_send() does not disconnect the QP when it returns an
immediate error. Thus, the code that posts LocalInv has to
explicitly disconnect after an immediate error. This is just
like the frwr_send() callers handle it.
If a disconnect isn't done here, the transport deadlocks.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Highlights include:
Stable fixes:
- Two sunrpc fixes for deadlocks involving privileged rpc_wait_queues
Bugfixes
- SUNRPC: Avoid a KASAN slab-out-of-bounds bug in xdr_set_page_base()
- SUNRPC: prevent port reuse on transports which don't request it.
- NFSv3: Fix memory leak in posix_acl_create()
- NFS: Various fixes to attribute revalidation timeouts
- NFSv4: Fix handling of non-atomic change attribute updates
- NFSv4: If a server is down, don't cause mounts to other servers to
hang as well
- pNFS: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT
- NFS: Fix mount failures due to incorrect setting of the has_sec_mnt_opts
filesystem flag
- NFS: Ensure nfs_readpage returns promptly when an internal error occurs
- NFS: Fix fscache read from NFS after cache error
- pNFS: Various bugfixes around the LAYOUTGET operation
Features
- Multiple patches to add support for fcntl() leases over NFSv4.
- A sysfs interface to display more information about the various
transport connections used by the RPC client
- A sysfs interface to allow a suitably privileged user to offline a
transport that may no longer point to a valid server
- A sysfs interface to allow a suitably privileged user to change the
server IP address used by the RPC client
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmDnPrkACgkQZwvnipYK
APKGAw/9EjdGoic6VpShQyb5uxaRoDd4uwWgFLBOfPhIWC7qNMtkj49wHIOEUm7e
YfdF5RlCdmshaoMxjY84wjl8NTMwHbPahgooDd4+UsZUs2qxZ8dBsr0itfbFsJv8
BpaCYKQt6XGQngGrWfC7SiCETnMej2YsmjDfHvhD58TxnRfPWexHUvx9xi9uGRCS
sIWRA2QMNs7LwdShkkRotagodRLhu/zo4g0lon5lI8D/SRg6o8RoO4YP6oKH1FN4
OyVzy1aWZGocgwCMUtNeuigJSRyDa+bJTfJ2c27uw5g18s0XWZ3j2DxD5I+HCEuE
B4rhg+ujtPIifYLHf2Aj3nlxdBePZ5L67a2MOOUo+wSD+nPmNMZF1eIT/3Jsg/HA
Z8gqcBiTIkBfVGJxWWbrbHfxPXQiK1IRGQx9acyhLCN9M6Kv5bbkn4R4dnronvJR
g6O968fgC5uvl60CXdc8NCpWtSitXB/nH8pn7MbJ8JBGq7QIYNkS0d4E8ePhYwxk
sRYJt21O+ryjodfQDHaUxodzCKGcpRoknpirMmgoAp4zdkva4ltViNsQvHa7jFh8
HIuhU6Aia1xVYpUMDEXf2WMXCT9yLa2TyMDuS5KDfb69wBkQJWeKNkebf+1k03wQ
saEmdoP4aEEujimkA7rqyOlI8XhsudKvBd3HXg+w9+xIt4yoie0=
=NaOI
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- Multiple patches to add support for fcntl() leases over NFSv4.
- A sysfs interface to display more information about the various
transport connections used by the RPC client
- A sysfs interface to allow a suitably privileged user to offline a
transport that may no longer point to a valid server
- A sysfs interface to allow a suitably privileged user to change the
server IP address used by the RPC client
Stable fixes:
- Two sunrpc fixes for deadlocks involving privileged rpc_wait_queues
Bugfixes:
- SUNRPC: Avoid a KASAN slab-out-of-bounds bug in xdr_set_page_base()
- SUNRPC: prevent port reuse on transports which don't request it.
- NFSv3: Fix memory leak in posix_acl_create()
- NFS: Various fixes to attribute revalidation timeouts
- NFSv4: Fix handling of non-atomic change attribute updates
- NFSv4: If a server is down, don't cause mounts to other servers to
hang as well
- pNFS: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT
- NFS: Fix mount failures due to incorrect setting of the
has_sec_mnt_opts filesystem flag
- NFS: Ensure nfs_readpage returns promptly when an internal error
occurs
- NFS: Fix fscache read from NFS after cache error
- pNFS: Various bugfixes around the LAYOUTGET operation"
* tag 'nfs-for-5.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
NFSv4/pNFS: Return an error if _nfs4_pnfs_v3_ds_connect can't load NFSv3
NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
NFSv4/pnfs: Clean up layout get on open
NFSv4/pnfs: Fix layoutget behaviour after invalidation
NFSv4/pnfs: Fix the layout barrier update
NFS: Fix fscache read from NFS after cache error
NFS: Ensure nfs_readpage returns promptly when internal error occurs
sunrpc: remove an offlined xprt using sysfs
sunrpc: provide showing transport's state info in the sysfs directory
sunrpc: display xprt's queuelen of assigned tasks via sysfs
sunrpc: provide multipath info in the sysfs directory
NFSv4.1 identify and mark RPC tasks that can move between transports
sunrpc: provide transport info in the sysfs directory
SUNRPC: take a xprt offline using sysfs
sunrpc: add dst_attr attributes to the sysfs xprt directory
SUNRPC for TCP display xprt's source port in sysfs xprt_info
SUNRPC query transport's source port
SUNRPC display xprt's main value in sysfs's xprt_info
SUNRPC mark the first transport
sunrpc: add add sysfs directory per xprt under each xprt_switch
...
We need to keep track of the type for a given transport.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
- add tracepoints for callbacks and for client creation and
destruction
- cache the mounts used for server-to-server copies
- expose callback information in /proc/fs/nfsd/clients/*/info
- don't hold locks unnecessarily while waiting for commits
- update NLM to use xdr_stream, as we have for NFSv2/v3/v4
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmDlvjIVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+0MoP/RJ8Q7zwIz6WFHn3bCRaEXpnnkAH
mmMfELhmgvH0V5nXWbb2rAfhllY+/zeWtf8QHSEKUPCnVLmB7WeXKdjXSy7EnYJ8
R8DuuuII85McIrg93nJ8hxm4wXTaTZKXpS4Vxkuxc6YKxoeJoXOaTjbgRLIw8mfX
w4wPfjAsnROboVxvDHUmBS9zNKaAi2dZ0jH2x2eS7eZSWzoJC30yd+pFSxyYoOac
3fZUntDskQDGIpXHuTf53WcaK7h1bUHrwS7Joez8Z0ctg4vcbJsfdhKZUZwAxOZh
3xWAgm3PFcze5xqHuX8BYBThHfB3uTeygZQRb3zI9sG2UQtQfundrtlxZRSjMMkC
cwlSi2SQNL66EBIgOcS3U/9OeorLALnnRax1KWMWjpFzaBJJQTJDumwLRx4zogI1
Ouiu0fI+hApck+L+qCzJMidA2wxOBsDzH471YiGiqQSmgNZc6wBc+aC/JKN8QAWb
jG53vvpa3gCZa8Rs3KyOoUvtcCCdiQc+nljbzqtVfIvvGa9MSixufa+U5fojLEO7
i8aangK+mteMxrrejEKvRu1efDIfpFq0HW7ev1mzW2Jl/AguDXM5XUeGK2mMMPtc
WqT3arbtGVcXJN+Oh5TzTVuED/DecyO0Fig77G+WJTiWONgoHfs+E5nC4aHSpohn
bMpmQMIOmTa5zgQP
=BQyR
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.14' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
- add tracepoints for callbacks and for client creation and destruction
- cache the mounts used for server-to-server copies
- expose callback information in /proc/fs/nfsd/clients/*/info
- don't hold locks unnecessarily while waiting for commits
- update NLM to use xdr_stream, as we have for NFSv2/v3/v4
* tag 'nfsd-5.14' of git://linux-nfs.org/~bfields/linux: (69 commits)
nfsd: fix NULL dereference in nfs3svc_encode_getaclres
NFSD: Prevent a possible oops in the nfs_dirent() tracepoint
nfsd: remove redundant assignment to pointer 'this'
nfsd: Reduce contention for the nfsd_file nf_rwsem
lockd: Update the NLMv4 SHARE results encoder to use struct xdr_stream
lockd: Update the NLMv4 nlm_res results encoder to use struct xdr_stream
lockd: Update the NLMv4 TEST results encoder to use struct xdr_stream
lockd: Update the NLMv4 void results encoder to use struct xdr_stream
lockd: Update the NLMv4 FREE_ALL arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 SHARE arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 SM_NOTIFY arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 nlm_res arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 UNLOCK arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 CANCEL arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 LOCK arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 TEST arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 void arguments decoder to use struct xdr_stream
lockd: Update the NLMv1 SHARE results encoder to use struct xdr_stream
lockd: Update the NLMv1 nlm_res results encoder to use struct xdr_stream
lockd: Update the NLMv1 TEST results encoder to use struct xdr_stream
...
Fix some spelling mistakes in comments:
succes ==> success
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 9ed5af268e ("SUNRPC: Clean up the handling of page padding
in rpc_prepare_reply_pages()") [Dec 2020] affects RPC Replies that
have a data payload (i.e., Write chunks).
rpcrdma_prepare_readch(), as its name suggests, sets up Read chunks
which are data payloads within RPC Calls. Those payloads are
constructed by xdr_write_pages(), which continues to stuff the call
buffer's tail kvec with the payload's XDR roundup. Thus removing
the tail buffer logic in rpcrdma_prepare_readch() was the wrong
thing to do.
Fixes: 586a0787ce ("xprtrdma: Clean up rpcrdma_prepare_readch()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that we fix the XPRT_CONGESTED starvation issue for RDMA as well
as socket based transports.
Ensure we always initialise the request after waking up from the backlog
list.
Fixes: e877a88d1f ("SUNRPC in case of backlog, hand free slots directly to waiting task")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Highlights include:
Stable fixes:
- Add validation of the UDP retrans parameter to prevent shift out-of-bounds
- Don't discard pNFS layout segments that are marked for return
Bugfixes:
- Fix a NULL dereference crash in xprt_complete_bc_request() when the
NFSv4.1 server misbehaves.
- Fix the handling of NFS READDIR cookie verifiers
- Sundry fixes to ensure attribute revalidation works correctly when the
server does not return post-op attributes.
- nfs4_bitmask_adjust() must not change the server global bitmasks
- Fix major timeout handling in the RPC code.
- NFSv4.2 fallocate() fixes.
- Fix the NFSv4.2 SEEK_HOLE/SEEK_DATA end-of-file handling
- Copy offload attribute revalidation fixes
- Fix an incorrect filehandle size check in the pNFS flexfiles driver
- Fix several RDMA transport setup/teardown races
- Fix several RDMA queue wrapping issues
- Fix a misplaced memory read barrier in sunrpc's call_decode()
Features:
- Micro optimisation of the TCP transmission queue using TCP_CORK
- statx() performance improvements by further splitting up the tracking
of invalid cached file metadata.
- Support the NFSv4.2 "change_attr_type" attribute and use it to
optimise handling of change attribute updates.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmCVLooACgkQZwvnipYK
APJB5BAAtIJyhx40ooMBzcucDmXd1qovlKsb8ZlvnSI6c7wvHhFPNk9z4zwThnjL
FpVYzJzK6XzAQY/PtgbrPwnSUmW925ngPWYR/hiYe+OGPBnYV+tXP8izCyEkNgMg
45goDOxojGWl7AGTuAJiKcDSdH9PyIrbvt28iwcNSGjslasGSbAoL/836l4OIGr1
Ymxs/NDML11dPco8GIKLGtHd8leFGleDx089VeNsgud8MdaFErp16O5Iz8DdzRKd
W1l2zDMb05j8eDZIfy3w3FyrLkDXA+KgLSADiC8TcpxoadPaQJMeCvoIq8oqVndn
bZBoxduXdLgf54Aec0WnNKFAOyc7pGvZoSNmFouT7EGV73g+g1LQ+ZbEE1bb8fCQ
XHqCVaBt2+47NiTUgdxjXlZRfcn8fYKx0tVxfG3mQVMXUAWfsjmMyQMNgijDRJI2
8Wz3lZMRGMILbR9j4QpP1biVy/2zGNWG/TB5ZZyZMSY4uT+aOpzlqdknb4UsRaSp
f7MfmB7xEWpS4DJr9RIBrJ/hIdnMu1mNInxDPFo5Kl5HNp4TaPm2dPir2ZD2wMZI
daURTX7giUhpE15ZebQDBqWD+mTR0bVDqLLeo131JRmMfMEHugNrr49xe+NkBu/R
QWnFzgkGdQsOeiKRRwEUuhsi74JspqfwzdZzHqcRM5WuXVvBLcA=
=h01b
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Add validation of the UDP retrans parameter to prevent shift
out-of-bounds
- Don't discard pNFS layout segments that are marked for return
Bugfixes:
- Fix a NULL dereference crash in xprt_complete_bc_request() when the
NFSv4.1 server misbehaves.
- Fix the handling of NFS READDIR cookie verifiers
- Sundry fixes to ensure attribute revalidation works correctly when
the server does not return post-op attributes.
- nfs4_bitmask_adjust() must not change the server global bitmasks
- Fix major timeout handling in the RPC code.
- NFSv4.2 fallocate() fixes.
- Fix the NFSv4.2 SEEK_HOLE/SEEK_DATA end-of-file handling
- Copy offload attribute revalidation fixes
- Fix an incorrect filehandle size check in the pNFS flexfiles driver
- Fix several RDMA transport setup/teardown races
- Fix several RDMA queue wrapping issues
- Fix a misplaced memory read barrier in sunrpc's call_decode()
Features:
- Micro optimisation of the TCP transmission queue using TCP_CORK
- statx() performance improvements by further splitting up the
tracking of invalid cached file metadata.
- Support the NFSv4.2 'change_attr_type' attribute and use it to
optimise handling of change attribute updates"
* tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (85 commits)
xprtrdma: Fix a NULL dereference in frwr_unmap_sync()
sunrpc: Fix misplaced barrier in call_decode
NFSv4.2: Remove ifdef CONFIG_NFSD from NFSv4.2 client SSC code.
xprtrdma: Move fr_mr field to struct rpcrdma_mr
xprtrdma: Move the Work Request union to struct rpcrdma_mr
xprtrdma: Move fr_linv_done field to struct rpcrdma_mr
xprtrdma: Move cqe to struct rpcrdma_mr
xprtrdma: Move fr_cid to struct rpcrdma_mr
xprtrdma: Remove the RPC/RDMA QP event handler
xprtrdma: Don't display r_xprt memory addresses in tracepoints
xprtrdma: Add an rpcrdma_mr_completion_class
xprtrdma: Add tracepoints showing FastReg WRs and remote invalidation
xprtrdma: Avoid Send Queue wrapping
xprtrdma: Do not wake RPC consumer on a failed LocalInv
xprtrdma: Do not recycle MR after FastReg/LocalInv flushes
xprtrdma: Clarify use of barrier in frwr_wc_localinv_done()
xprtrdma: Rename frwr_release_mr()
xprtrdma: rpcrdma_mr_pop() already does list_del_init()
xprtrdma: Delete rpcrdma_recv_buffer_put()
xprtrdma: Fix cwnd update ordering
...
The normal mechanism that invalidates and unmaps MRs is
frwr_unmap_async(). frwr_unmap_sync() is used only when an RPC
Reply bearing Write or Reply chunks has been lost (ie, almost
never).
Coverity found that after commit 9a301cafc8 ("xprtrdma: Move
fr_linv_done field to struct rpcrdma_mr"), the while() loop in
frwr_unmap_sync() exits only once @mr is NULL, unconditionally
causing subsequent dereferences of @mr to Oops.
I've tested this fix by creating a client that skips invoking
frwr_unmap_async() when RPC Replies complete. That forces all
invalidation tasks to fall upon frwr_unmap_sync(). Simple workloads
with this fix applied to the adulterated client work as designed.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1504556 ("Null pointer dereferences")
Fixes: 9a301cafc8 ("xprtrdma: Move fr_linv_done field to struct rpcrdma_mr")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: The last remaining field in struct rpcrdma_frwr has been
removed, so the struct can be eliminated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: Move more of struct rpcrdma_frwr into its parent.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up.
- Simplify variable initialization in the completion handlers.
- Move another field out of struct rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up (for several purposes):
- The MR's cid is initialized sooner so that tracepoints can show
something reasonable even if the MR is never posted.
- The MR's res.id doesn't change so the cid won't change either.
Initializing the cid once is sufficient.
- struct rpcrdma_frwr is going away soon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: The handler only recorded a trace event. If indeed no
action is needed by the RPC/RDMA consumer, then the event can be
ignored.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The Send signaling logic is a little subtle, so add some
observability around it. For every xprtrdma_mr_fastreg event, there
should be an xprtrdma_mr_localinv or xprtrdma_mr_reminv event.
When these tracepoints are enabled, we can see exactly when an MR is
DMA-mapped, registered, invalidated (either locally or remotely) and
then DMA-unmapped.
kworker/u25:2-190 [000] 787.979512: xprtrdma_mr_map: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979515: xprtrdma_chunk_read: task:351@5 pos=148 5608@0x8679e0c8f6f56000:0x00000503 (last)
kworker/u25:2-190 [000] 787.979519: xprtrdma_marshal: task:351@5 xid=0x8679e0c8: hdr=52 xdr=148/5608/0 read list/inline
kworker/u25:2-190 [000] 787.979525: xprtrdma_mr_fastreg: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979526: xprtrdma_post_send: task:351@5 cq.id=0 cid=73 (2 SGEs)
...
kworker/5:1H-219 [005] 787.980567: xprtrdma_wc_receive: cq.id=1 cid=161 status=SUCCESS (0/0x0) received=164
kworker/5:1H-219 [005] 787.980571: xprtrdma_post_recvs: peer=[192.168.100.55]:20049 r_xprt=0xffff8884974d4000: 0 new recvs, 70 active (rc 0)
kworker/5:1H-219 [005] 787.980573: xprtrdma_reply: task:351@5 xid=0x8679e0c8 credits=64
kworker/5:1H-219 [005] 787.980576: xprtrdma_mr_reminv: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/5:1H-219 [005] 787.980577: xprtrdma_mr_unmap: mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
Note that I've moved the xprtrdma_post_send tracepoint so that event
always appears after the xprtrdma_mr_fastreg tracepoint. Otherwise
the event log looks counterintuitive (FastReg is always supposed to
happen before Send).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Send WRs can be signalled or unsignalled. A signalled Send WR
always has a matching Send completion, while a unsignalled Send
has a completion only if the Send WR fails.
xprtrdma has a Send account mechanism that is designed to reduce
the number of signalled Send WRs. This in turn mitigates the
interrupt rate of the underlying device.
RDMA consumers can't leave all Sends unsignaled, however, because
providers rely on Send completions to maintain their Send Queue head
and tail pointers. xprtrdma counts the number of unsignaled Send WRs
that have been posted to ensure that Sends are signalled often
enough to prevent the Send Queue from wrapping.
This mechanism neglected to account for FastReg WRs, which are
posted on the Send Queue but never signalled. As a result, the
Send Queue wrapped on occasion, resulting in duplication completions
of FastReg and LocalInv WRs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Throw away any reply where the LocalInv flushes or could not be
posted. The registered memory region is in an unknown state until
the disconnect completes.
rpcrdma_xprt_disconnect() will find and release the MR. No need to
put it back on the MR free list in this case.
The client retransmits pending RPC requests once it reestablishes a
fresh connection, so a replacement reply should be forthcoming on
the next connection instance.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Better not to touch MRs involved in a flush or post error until the
Send and Receive Queues are drained and the transport is fully
quiescent. Simply don't insert such MRs back onto the free list.
They remain on mr_all and will be released when the connection is
torn down.
I had thought that recycling would prevent hardware resources from
being tied up for a long time. However, since v5.7, a transport
disconnect destroys the QP and other hardware-owned resources. The
MRs get cleaned up nicely at that point.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: The comment and the placement of the memory barrier is
confusing. Humans want to read the function statements from head
to tail.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: To be consistent with other functions in this source file,
follow the naming convention of putting the object being acted upon
before the action itself.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The rpcrdma_mr_pop() earlier in the function has already cleared
out mr_list, so it must not be done again in the error path.
Fixes: 847568942f ("xprtrdma: Remove fr_state")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: The name recv_buffer_put() is a vestige of older code,
and the function is just a wrapper for the newer rpcrdma_rep_put().
In most of the existing call sites, a pointer to the owning
rpcrdma_buffer is already available.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After a reconnect, the reply handler is opening the cwnd (and thus
enabling more RPC Calls to be sent) /before/ rpcrdma_post_recvs()
can post enough Receive WRs to receive their replies. This causes an
RNR and the new connection is lost immediately.
The race is most clearly exposed when KASAN and disconnect injection
are enabled. This slows down rpcrdma_rep_create() enough to allow
the send side to post a bunch of RPC Calls before the Receive
completion handler can invoke ib_post_recv().
Fixes: 2ae50ad68c ("xprtrdma: Close window between waking RPC senders and posting Receives")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Defensive clean up: Protect the rb_all_reps list during rep
creation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently rpcrdma_reps_destroy() assumes that, at transport
tear-down, the content of the rb_free_reps list is the same as the
content of the rb_all_reps list. Although that is usually true,
using the rb_all_reps list should be more reliable because of
the way it's managed. And, rpcrdma_reps_unmap() uses rb_all_reps;
these two functions should both traverse the "all" list.
Ensure that all rpcrdma_reps are always destroyed whether they are
on the rep free list or not.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Defer destruction of an rpcrdma_rep until transport tear-down to
preserve the rb_all_reps list while Receives flush.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently the Receive completion handler refreshes the Receive Queue
whenever a successful Receive completion occurs.
On disconnect, xprtrdma drains the Receive Queue. The first few
Receive completions after a disconnect are typically successful,
until the first flushed Receive.
This means the Receive completion handler continues to post more
Receive WRs after the drain sentinel has been posted. The late-
posted Receives flush after the drain sentinel has completed,
leading to a crash later in rpcrdma_xprt_disconnect().
To prevent this crash, xprtrdma has to ensure that the Receive
handler stops posting Receives before ib_drain_rq() posts its
drain sentinel.
Suggested-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Commit e340c2d6ef ("xprtrdma: Reduce the doorbell rate (Receive)")
increased the number of Receive WRs that are posted by the client,
but did not increase the size of the Receive Queue allocated during
transport set-up.
This is usually not an issue because RPCRDMA_BACKWARD_WRS is defined
as (32) when SUNRPC_BACKCHANNEL is defined. In cases where it isn't,
there is a real risk of Receive Queue wrapping.
Fixes: e340c2d6ef ("xprtrdma: Reduce the doorbell rate (Receive)")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Capture error codes in @ret, which is passed to the send_err
tracepoint, so that they can be logged when something goes awry.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I've hit some crashes that occur in the xprt_rdma_inject_disconnect
path. It appears that, for some provides, rdma_disconnect() can
take so long that the transport can disconnect and release its
hardware resources while rdma_disconnect() is still running,
resulting in a UAF in the provider.
The transport's fault injection method may depend on the stability
of transport data structures. That means it needs to be invoked
only from contexts that hold the transport write lock.
Fixes: 4a06825839 ("SUNRPC: Transport fault injection")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This, to me, seems less cluttered and less redundant. I was hoping
it could help reduce lock contention on the dto_q lock by reducing
the size of the critical section, but alas, the only improvement is
readability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
These fields are no longer used.
The size of struct svc_rdma_recv_ctxt is now less than 300 bytes on
x86_64, down from 2440 bytes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently the generic RPC server layer calls svc_rdma_recvfrom()
twice to retrieve an RPC message that uses Read chunks. I'm not
exactly sure why this design was chosen originally.
Instead, let's wait for the Read chunk completion inline in the
first call to svc_rdma_recvfrom().
The goal is to eliminate some page allocator churn.
rdma_read_complete() replaces pages in the second svc_rqst by
calling put_page() repeatedly while the upper layer waits for the
request to be constructed, which adds unnecessary NFS WRITE round-
trip latency.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Currently, XPT_BUSY is not cleared until xpo_recvfrom returns.
That effectively blocks the receipt and handling of the next RPC
message until the current one has been taken off the transport.
This strict ordering is a requirement for socket transports.
For our kernel RPC/RDMA transport implementation, however, dequeuing
an ingress message is nothing more than a list_del(). The transport
can safely be marked un-busy as soon as that is done.
To keep the changes simpler, this patch just moves the
svc_xprt_received() call site from svc_handle_xprt() into the
transports, so that the actual optimization can be done in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_sendto() now waits for the NIC hardware to finish with
the pages backing rq_res. We still have to release the page array
in some cases, but now it's always safe to immediately re-use the
page backing rq_res's head buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently svc_rdma_sendto() migrates xdr_buf pages into a separate
page list and NULLs out a bunch of entries in rq_pages while the
pages are under I/O. The Send completion handler then frees those
pages later.
Instead, let's wait for the Send completion, then handle page
releasing in the nfsd thread. I'd like to avoid the cost of 250+
put_page() calls in the Send completion handler, which is single-
threaded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor a bit of commonly used logic so that every site that wants
a close deferred to an nfsd thread does all the right things
(set_bit(XPT_CLOSE) then enqueue).
Also, once XPT_CLOSE is set on a transport, it is never cleared. If
XPT_CLOSE is already set, then the close is already being handled
and the enqueue can be skipped.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Post more Receives when the number of pending Receives drops below
a water mark. The batch mechanism is disabled if the underlying
device cannot support a reasonably-sized Receive Queue.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Replace svc_rdma_post_recv() with the new batch receive mechanism.
For the moment it is posting just a single Receive WR at a time,
so no change in behavior is expected.
Since svc_rdma_wc_receive() was the last call site for
svc_rdma_post_recv(), it is removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Introduce a server-side mechanism similar to commit e340c2d6ef
("xprtrdma: Reduce the doorbell rate (Receive)") to post Receive
WRs in batch. Its first consumer is svc_rdma_post_recvs(), which
posts the initial set of Receive WRs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
xprt pinning was removed in commit 365e9992b9 ("svcrdma: Remove
transport reference counting"), but this comment was not updated
to reflect that change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: explain why svc_xprt_enqueue() is invoked in the event
handler even though no xpt_flags bits are toggled here.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I tested commit 43042b90ca ("svcrdma: Reduce Receive doorbell
rate") with mlx4 (IB) and software iWARP and didn't find any
issues. However, I recently got my hardware iWARP setup back on
line (FastLinQ) and it's crashing hard on this commit (confirmed
via bisect).
The failure mode is complex.
- After a connection is established, the first Receive completes
normally.
- But the second and third Receives have garbage in their Receive
buffers. The server responds with ERR_VERS as a result.
- When the client tears down the connection to retry, a couple
of posted Receives flush twice, and that corrupts the recv_ctxt
free list.
- __svc_rdma_free then faults or loops infinitely while destroying
the xprt's recv_ctxts.
Since 43042b90ca ("svcrdma: Reduce Receive doorbell rate") does
not fix a bug but is a scalability enhancement, it's safe and
appropriate to revert it while working on a replacement.
Fixes: 43042b90ca ("svcrdma: Reduce Receive doorbell rate")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This brings it in line with the regular tcp backchannel, which also has
all those timeouts disabled.
Prevents the backchannel from timing out, getting some async operations
like server side copying getting stuck indefinitely on the client side.
Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
Fixes: 5d252f90a8 ("svcrdma: Add class for RDMA backwards direction transport")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- New Features:
- Support for eager writes, and the write=eager and write=wait mount options
- Other Bugfixes and Cleanups:
- Fix typos in some comments
- Fix up fall-through warnings for Clang
- Cleanups to the NFS readpage codepath
- Remove FMR support in rpcrdma_convert_iovs()
- Various other cleanups to xprtrdma
- Fix xprtrdma pad optimization for servers that don't support RFC 8797
- Improvements to rpcrdma tracepoints
- Fix up nfs4_bitmask_adjust()
- Optimize sparse writes past the end of files
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmAwOLwACgkQ18tUv7Cl
QOsUfw//W2KoJ+2IQohQNFcoi+bG1OQE7jnqHtQ+tsKfpJKemcDcu8wQEAqrwALg
vXioG1Ye0QU7P5PZtNxCorylqSTVGvJSIOrfa3lTdn/PDbI7NIgN52w56TzzfeXn
pJ4gDwZzPwUFUblF0LBQUIhJv5IQvOXVgUsMqezbIbMXSiuLR/bjnZ96Q/woKpoL
eg2IZ5EO9Jb0QjuQ1e9U303X7c2qOl1jzpxyQLQfD7ONnWBx3HnJk1l+3JJRi8JV
smnae3I0L3nUZ7rBqoqsvK7YUjUchCEBvkmEMsnHT94D5tI9mxxX5OquREee6QHn
NuJRSNbsIiCD3Ne27fkCut78d6SetoMko7jZ97T6smhyijtXJiLG/6dycMPV9rt/
bVIudWMm9/A9AsXyY2YP5LC6Y6W6dhQRXygUjVgEPBl6kVsb2Eca8IA9QZghF9IL
+XSEulASvxo2rWPylJJ+3aLynfqoHrowVN/Tu61svDnJWTcb+FCxQ5zyLox7erEH
mUhraf1D0uoX9odH1069toN6favZFE6SIDvlUk1QTOjr6p3Jxmkuyl6PNs5t66/S
550z5JVb2deIHOPQxOie7xz/Dk6dnRoaFhTNq/Ootkt9GNe0A+NqSUdoRA5XxN5m
wW11ecLSZSehDksuXjyFmkHtkagLreFxLsHbVnaAtwEm7h/thRI=
=Dssn
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS Client Updates from Anna Schumaker:
"New Features:
- Support for eager writes, and the write=eager and write=wait mount
options
- Other Bugfixes and Cleanups:
- Fix typos in some comments
- Fix up fall-through warnings for Clang
- Cleanups to the NFS readpage codepath
- Remove FMR support in rpcrdma_convert_iovs()
- Various other cleanups to xprtrdma
- Fix xprtrdma pad optimization for servers that don't support
RFC 8797
- Improvements to rpcrdma tracepoints
- Fix up nfs4_bitmask_adjust()
- Optimize sparse writes past the end of files"
* tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo
NFS: Set the stable writes flag when initialising the super block
NFS: Add mount options supporting eager writes
NFS: Add support for eager writes
NFS: 'flags' field should be unsigned in struct nfs_server
NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache
NFS: Always clear an invalid mapping when attempting a buffered write
NFS: Optimise sparse writes past the end of file
NFS: Fix documenting comment for nfs_revalidate_file_size()
NFSv4: Fixes for nfs4_bitmask_adjust()
xprtrdma: Clean up rpcrdma_prepare_readch()
rpcrdma: Capture bytes received in Receive completion tracepoints
xprtrdma: Pad optimization, revisited
rpcrdma: Fix comments about reverse-direction operation
xprtrdma: Refactor invocations of offset_in_page()
xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map()
xprtrdma: Remove FMR support in rpcrdma_convert_iovs()
NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async()
NFS: Call readpage_async_filler() from nfs_readpage_async()
NFS: Refactor nfs_readpage() and nfs_readpage_async() to use nfs_readdesc
...
- Cork the socket while there are queued replies
Fixes:
- DRC shutdown ordering
- svc_rdma_accept() lockdep splat
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmAsA80ACgkQM2qzM29m
f5erXA/+MrR3ZtwK2eaTITu13TzzTrMURbp/n0wCCW/Ls1YMb6bn9ggtBwu2W5Cn
Vb0RO9OLcmoI6CjqPh0CTUvvZspMYOAX4W1jQecKt2ml075APdlqUcv9YWPUQqVJ
qTg8HxDymvHvY3I3FcBxhzofmGzF8AOmQZJw9uI5Wt/ivBfqGWcAGlxyRmB3mdsm
cJRK0Sy7QMn2LefMcpMEeSbPA049/NZNRp6fcXnpPQFer42thoosYsNhTlAJfCXC
C5S0z3/T6rpuJucV9la/WkpUA0YhWbPEHWNdAB5tzSqmoEo4LpzJzjv7uyQU4oue
QlmChIz9qasgTI/BnCkBIzPD99S4UQcXjX0BnNinkQ77e6+b/vdAR+T+NLHJdkAf
+7Xz6T9aZNaz2R49CjYl6/kG0rlNkjUzyURRYs/9zEBhogMPH/N4T7Z2M+ljCkeb
tc3OaFDXZ2rfr7EKBGsfnEKINM1gpYipzILkr8GSHUMZLzOB/64upKySaJVjCGXj
7Sf1w+vJUWwYc+FqFvbaR4ybr01VIfdsecpn1TtY870zG1JzimzAHVZk1/xC9+CX
J+lVOXbjawDl1Et3V3fWq6Y7mhAWves/NKPcbSug9sFc4qRHEmPbAq/RRtlsjQcn
foMr5R8qd8OwEamVypZ2nIFxq4q3b742AS8lZhaK+DyZKq3oLac=
=+R4U
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull more nfsd updates from Chuck Lever:
"Here are a few additional NFSD commits for the merge window:
Optimization:
- Cork the socket while there are queued replies
Fixes:
- DRC shutdown ordering
- svc_rdma_accept() lockdep splat"
* tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Further clean up svc_tcp_sendmsg()
SUNRPC: Remove redundant socket flags from svc_tcp_sendmsg()
SUNRPC: Use TCP_CORK to optimise send performance on the server
svcrdma: Hold private mutex while invoking rdma_accept()
nfsd: register pernet ops last, unregister first
RDMA core mutex locking was restructured by commit d114c6feed
("RDMA/cma: Add missing locking to rdma_accept()") [Aug 2020]. When
lock debugging is enabled, the RPC/RDMA server trips over the new
lockdep assertion in rdma_accept() because it doesn't call
rdma_accept() from its CM event handler.
As a temporary fix, have svc_rdma_accept() take the handler_mutex
explicitly. In the meantime, let's consider how to restructure the
RPC/RDMA transport to invoke rdma_accept() from the proper context.
Calls to svc_rdma_accept() are serialized with calls to
svc_rdma_free() by the generic RPC server layer.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/linux-rdma/20210209154014.GO4247@nvidia.com/
Fixes: d114c6feed ("RDMA/cma: Add missing locking to rdma_accept()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since commit 9ed5af268e ("SUNRPC: Clean up the handling of page
padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS client
passes payload data to the transport with the padding in xdr->pages
instead of in the send buffer's tail kvec. There's no need for the
extra logic to advance the base of the tail kvec because the upper
layer no longer places XDR padding there.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The NetApp Linux team discovered that with NFS/RDMA servers that do
not support RFC 8797, the Linux client is forming NFSv4.x WRITE
requests incorrectly.
In this case, the Linux NFS client disables implicit chunk round-up
for odd-length Read and Write chunks. The goal was to support old
servers that needed that padding to be sent explicitly by clients.
In that case the Linux NFS included the tail kvec in the Read chunk,
since the tail contains any needed padding. That meant a separate
memory registration is needed for the tail kvec, adding to the cost
of forming such requests. To avoid that cost for a mere 3 bytes of
zeroes that are always ignored by receivers, we try to use implicit
roundup when possible.
For NFSv4.x, the tail kvec also sometimes contains a trailing
GETATTR operation. The Linux NFS client unintentionally includes
that GETATTR operation in the Read chunk as well as inline.
The fix is simply to /never/ include the tail kvec when forming a
data payload Read chunk. The padding is thus now always present.
Note that since commit 9ed5af268e ("SUNRPC: Clean up the handling
of page padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS
client passes payload data to the transport with the padding in
xdr->pages instead of in the send buffer's tail kvec. So now the
Linux NFS client appends XDR padding to all odd-sized Read chunks.
This shouldn't be a problem because:
- RFC 8166-compliant servers are supposed to work with or without
that XDR padding in Read chunks.
- Since the padding is now in the same memory region as the data
payload, a separate memory registration is not needed. In
addition, the link layer extends data in RDMA Read responses to
4-byte boundaries anyway. Thus there is now no savings when the
padding is not included.
Because older kernels include the payload's XDR padding in the
tail kvec, a fix there will be more complicated. Thus backporting
this patch is not recommended.
Reported by: Olga Kornievskaia <Olga.Kornievskaia@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
During the final stages of publication of RFC 8167, reviewers
requested that we use the term "reverse direction" rather than
"backwards direction". Update comments to reflect this preference.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up so that offset_in_page() is invoked less often in the
most common case, which is mapping xdr->pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
Remove a conditional branch from the SGL set-up loop in frwr_map():
Instead of using either sg_set_page() or sg_set_buf(), initialize
the mr_page field properly when rpcrdma_convert_kvec() converts the
kvec to an SGL entry. frwr_map() can then invoke sg_set_page()
unconditionally.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Support for FMR was removed by commit ba69cd122e ("xprtrdma:
Remove support for FMR memory registration") [Dec 2018]. That means
the buffer-splitting behavior of rpcrdma_convert_kvec(), added by
commit 821c791a0b ("xprtrdma: Segment head and tail XDR buffers
on page boundaries") [Mar 2016], is no longer necessary. FRWR
memory registration handles this case with aplomb.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The Receive completion handler doesn't look at the contents of the
Receive buffer. The DMA sync isn't terribly expensive but it's one
less thing that needs to be done by the Receive completion handler,
which is single-threaded (per svc_xprt). This helps scalability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This is similar to commit e340c2d6ef ("xprtrdma: Reduce the
doorbell rate (Receive)") which added Receive batching to the
client.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up. We are not permitted to remove old proc files. Instead,
convert these variables to stubs that are only ever allowed to
display a value of zero.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that we have an efficient mechanism to update these two stats,
let's start maintaining them again.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Receives are frequent events. Avoid the overhead of a memory bus
lock cycle for counting a value that is hardly every used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Highlights include:
Features:
- NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
support.
- A series of patches to improve readdir performance, particularly with
large directories.
- Basic support for using NFS/RDMA with the pNFS files and flexfiles
drivers.
- Micro-optimisations for RDMA.
- RDMA tracing improvements.
Bugfixes:
- Fix a long standing bug with xs_read_xdr_buf() when receiving partial
pages (Dan Aloni).
- Various fixes for getxattr and listxattr, when used over non-TCP
transports.
- Fixes for containerised NFS from Sargun Dhillon.
- switch nfsiod to be an UNBOUND workqueue (Neil Brown).
- READDIR should not ask for security label information if there is no
LSM policy. (Olga Kornievskaia)
- Avoid using interval-based rebinding with TCP in lockd (Calum Mackay).
- A series of RPC and NFS layer fixes to support the NFSv4.2 READ_PLUS code.
- A couple of fixes for pnfs/flexfiles read failover
Cleanups:
- Various cleanups for the SUNRPC xdr code in conjunction with the
READ_PLUS fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl/aiaIACgkQZwvnipYK
APIOihAAvONscxrFSaGRh2ICNv9I/zXW/A5+R3qnkESPVLTqTPJVphoN7FlINAr1
B74pg6n4T4viycbvsogU2+kHrlJZO7B8lTkJL7ynm9Wgyw8+2Ga4QEn1bsAoqmuY
b91p/+LfOLKrYeeojoH31PC73uOYYG1WHXJhjq0l9b5CTgThWpj6O3gDaFEbFvmz
A7V3yqSp04sV70YxUhwelBHZ5BXdiXIKsPnIwvXXHuY7IcamrE4EA3wGCwtxkBnu
4dwbOtRXURNSev0r3n6FsH4wZl+/nvp9UpnGdPtVv94F1zm2JKLwkhoJejS/vpjq
eyKc7ZXBQ0uHbTWI2Yj1YjA61VIUO0R0EDuyTAnRKDeaarID42n5kMG7J8cIglZR
jQfyx99xm0eSrdwxC09tcRL/lBzYcOfc6pJo5P9BtaFtRvbp9iFIHuFKlrXbULd4
WrZzDMhiKVYGSTcTpfQyVoK2rCvn6W1Ida4iYeI0gkJ1v9X90UhbtJOyggn/bxyL
DV/Qy40+l48n7CZfPU2eDv4WXqjKGRibpDoWMBLwUH20dDEX6kKYv3BfApFYGqyO
/GTPAFUZarCy8BENvzZv/Jb9mt5pDQM5p9ZXpdUOhydLMMA+pauaT/Gr+pAHPIPx
MPj546Gh2cEaT883xvRrJmQTG0nw/WscPNcHaJcgL5oYltmuwck=
=IKWG
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
support
- A series of patches to improve readdir performance, particularly
with large directories
- Basic support for using NFS/RDMA with the pNFS files and flexfiles
drivers
- Micro-optimisations for RDMA
- RDMA tracing improvements
Bugfixes:
- Fix a long standing bug with xs_read_xdr_buf() when receiving
partial pages (Dan Aloni)
- Various fixes for getxattr and listxattr, when used over non-TCP
transports
- Fixes for containerised NFS from Sargun Dhillon
- switch nfsiod to be an UNBOUND workqueue (Neil Brown)
- READDIR should not ask for security label information if there is
no LSM policy (Olga Kornievskaia)
- Avoid using interval-based rebinding with TCP in lockd (Calum
Mackay)
- A series of RPC and NFS layer fixes to support the NFSv4.2
READ_PLUS code
- A couple of fixes for pnfs/flexfiles read failover
Cleanups:
- Various cleanups for the SUNRPC xdr code in conjunction with the
READ_PLUS fixes"
* tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits)
NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read
NFSv4/pnfs: Add tracing for the deviceid cache
fs/lockd: convert comma to semicolon
NFSv4.2: fix error return on memory allocation failure
NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet
NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow
NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer
NFSv4.2: decode_read_plus_hole() needs to check the extent offset
NFSv4.2: decode_read_plus_data() must skip padding after data segment
NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
SUNRPC: When expanding the buffer, we may need grow the sparse pages
SUNRPC: Cleanup - constify a number of xdr_buf helpers
SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field
SUNRPC: _copy_to/from_pages() now check for zero length
SUNRPC: Cleanup xdr_shrink_bufhead()
SUNRPC: Fix xdr_expand_hole()
SUNRPC: Fixes for xdr_align_data()
SUNRPC: _shift_data_left/right_pages should check the shift length
...
Olga K. observed that rpcrdma_marsh_req() allocates sparse pages
only when it has determined that a Reply chunk is necessary. There
are plenty of cases where no Reply chunk is needed, but the
XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in
rpcrdma_inline_fixup() when it tries to copy parts of the received
Reply into a missing page.
To avoid crashing, handle sparse page allocation up front.
Until XATTR support was added, this issue did not appear often
because the only SPARSE_PAGES consumer always expected a reply large
enough to always require a Reply chunk.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
According to RFC5666, the correct netid for an IPv6 addressed RDMA
transport is "rdma6", which we've supported as a mount option since
Linux-4.7. The problem is when we try to load the module "xprtrdma6",
that will fail, since there is no modulealias of that name.
Fixes: 181342c5eb ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
An efficient way to handle multiple Read chunks is to post them all
together and then take a single completion. This is also how the
code is already structured: when the Read completion fires, all
portions of the incoming RPC message are available to be assembled.
The difficult problem is setting up the Read sink buffers so that
the server pulls the client's data into place, making subsequent
pull-up unnecessary. There are several cases:
* No Read chunks. No-op.
* One data item Read chunk. This is the fast case, where the inline
part of the RPC-over-RDMA message becomes the head and tail, and
the data item chunk is placed in buf->pages.
* A Position-zero Read chunk. Treated like TCP: the Read chunk is
pulled into contiguous pages.
+ A Position-zero Read chunk with data item chunks. Treated like
TCP: all of the Read chunks are pulled into contiguous pages.
+ Multiple data item chunks. Treated like TCP: the inline part is
copied and the data item chunks are pulled into contiguous pages.
The "*" cases are already supported. This patch adds support for the
"+" cases.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As a pre-requisite for handling multiple Read chunks in each Read
list, convert svc_rdma_recv_read_chunk() to use the new parsed Read
chunk list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I'm about to change the purpose of ri_chunklen: Instead of tracking
the number of bytes in one Read chunk, it will track the total
number of bytes in the Read list. Rename it for clarity.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We already have trace_svcrdma_decode_rseg(), which records each
ingress Read segment. Instead of reporting those again when they
are about to be posted as RDMA Reads, let's fire one tracepoint
before posting each type of chunk.
So we'll get:
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905
nfsd-1998 [002] 321.666622: svcrdma_decode_rqst: cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100
nfsd-1998 [002] 321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3
kworker/2:1H-221 [002] 321.673949: svcrdma_wc_read: cq.id=3 cid=112 status=SUCCESS (0/0x0)
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts
of rq_res that do not contain a result payload.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only
the parts of rq_res that do not contain a result payload.
This change has been tested to confirm that it does not cause a
regression in the no Write chunk and single Write chunk cases.
Multiple Write chunks have not been tested.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When counting the number of SGEs needed to construct a Send request,
do not count result payloads. And, when copying the Reply message
into the pull-up buffer, result payloads are not to be copied to the
Send buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: Instead of re-parsing the ingress RPC Call transport
header when constructing the egress RPC Reply transport header, use
the new parsed Write list and Reply chunk, which are version-
agnostic and already XDR decoded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: Instead of re-parsing the ingress RPC Call transport
header when constructing RDMA Writes, use the new parsed chunk lists
for the Write list and Reply chunk, which are version-agnostic and
already XDR-decoded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: Don't duplicate header decoding smarts here. Instead, use
the new parsed chunk lists.
Note that the XID sanity test is also removed. The XID is already
looked up by the cb handler, and is rejected if it's not recognized.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This simple data structure binds the location of each data payload
inside of an RPC message to the chunk that will be used to push it
to or pull it from the client.
There are several benefits to this small additional overhead:
* It enables support for more than one chunk in incoming Read and
Write lists.
* It translates the version-specific on-the-wire format into a
generic in-memory structure, enabling support for multiple
versions of the RPC/RDMA transport protocol.
* It enables the server to re-organize a chunk list if it needs to
adjust where Read chunk data lands in server memory without
altering the contents of the XDR-encoded Receive buffer.
Construction of these lists is done while sanity checking each
incoming RPC/RDMA header. Subsequent patches will make use of the
generated data structures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The only RPC/RDMA ordering requirement between RDMA Writes and RDMA
Sends is that the responder must post the Writes on the Send queue
before posting the Send that conveys the RPC Reply for that Write
payload.
The Linux NFS server implementation now has a transport method that
can post result Payload Writes earlier than svc_rdma_sendto:
->xpo_result_payload()
This gets RDMA Writes going earlier so they are more likely to be
complete at the remote end before the Send completes.
Some care must be taken with pulled-up Replies. We don't want to
push the Write chunk and then send the same payload data via Send.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Have the NFSD encoders annotate the boundaries of every
direct-data-placement eligible result data payload. Then change
svcrdma to use that annotation instead of the xdr->page_len
when handling Write chunks.
For NFSv4 on RDMA, that enables the ability to recognize multiple
result payloads per compound. This is a pre-requisite for supporting
multiple Write chunks per RPC transaction.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: "result payload" is a less confusing name for these
payloads. "READ payload" reflects only the NFS usage.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor for subsequent changes.
Constify the xdr_buf argument to ensure the code here does not
modify it, and to enable callers to pass in a
"const struct xdr_buf *".
At the same time, rename the helper functions, which emit RDMA
Writes, not RDMA Sends, and add documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: Ensure the code in rw.c does not modify the argument, and
enable callers to also use "const struct xdr_buf *".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When space in the Reply chunk runs out in the middle of a segment,
we end up passing a zero-length SGL to rdma_rw_ctx_init(), and it
oopses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that rpcrdma_ep is no longer part of rpcrdma_xprt, there are
four or five serial address dereferences needed to get to the
IB device needed for DMA unmapping.
Instead, let's use the same pattern that regbufs use: cache a
pointer to the device in the MR, and use that as the indication
that unmapping is necessary.
This also guarantees that the exact same device is used for DMA
mapping and unmapping, even if the r_xprt's ep has been replaced. I
don't think this can happen today, but future changes might break
this assumption.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: This function is now invoked only in frwr_ops.c. The move
enables deduplication of the trace_xprtrdma_mr_unmap() call site.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>