Under some circumstances, rxrpc will fail a transmit a packet through the
underlying UDP socket (ie. UDP sendmsg returns an error). This may result
in a call getting stuck.
In the instance being seen, where AFS tries to send a probe to the Volume
Location server, tracepoints show the UDP Tx failure (in this case returing
error 99 EADDRNOTAVAIL) and then nothing more:
afs_make_vl_call: c=0000015d VL.GetCapabilities
rxrpc_call: c=0000015d NWc u=1 sp=rxrpc_kernel_begin_call+0x106/0x170 [rxrpc] a=00000000dd89ee8a
rxrpc_call: c=0000015d Gus u=2 sp=rxrpc_new_client_call+0x14f/0x580 [rxrpc] a=00000000e20e4b08
rxrpc_call: c=0000015d SEE u=2 sp=rxrpc_activate_one_channel+0x7b/0x1c0 [rxrpc] a=00000000e20e4b08
rxrpc_call: c=0000015d CON u=2 sp=rxrpc_kernel_begin_call+0x106/0x170 [rxrpc] a=00000000e20e4b08
rxrpc_tx_fail: c=0000015d r=1 ret=-99 CallDataNofrag
The problem is that if the initial packet fails and the retransmission
timer hasn't been started, the call is set to completed and an error is
returned from rxrpc_send_data_packet() to rxrpc_queue_packet(). Though
rxrpc_instant_resend() is called, this does nothing because the call is
marked completed.
So rxrpc_notify_socket() isn't called and the error is passed back up to
rxrpc_send_data(), rxrpc_kernel_send_data() and thence to afs_make_call()
and afs_vl_get_capabilities() where it is simply ignored because it is
assumed that the result of a probe will be collected asynchronously.
Fileserver probing is similarly affected via afs_fs_get_capabilities().
Fix this by always issuing a notification in __rxrpc_set_call_completion()
if it shifts a call to the completed state, even if an error is also
returned to the caller through the function return value.
Also put in a little bit of optimisation to avoid taking the call
state_lock and disabling softirqs if the call is already in the completed
state and remove some now redundant rxrpc_notify_socket() calls.
Fixes: f5c17aaeb2 ("rxrpc: Calls should only have one terminal state")
Reported-by: Gerry Seidman <gerry@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
The Rx protocol has a "previousPacket" field in it that is not handled in
the same way by all protocol implementations. Sometimes it contains the
serial number of the last DATA packet received, sometimes the sequence
number of the last DATA packet received and sometimes the highest sequence
number so far received.
AF_RXRPC is using this to weed out ACKs that are out of date (it's possible
for ACK packets to get reordered on the wire), but this does not work with
OpenAFS which will just stick the sequence number of the last packet seen
into previousPacket.
The issue being seen is that big AFS FS.StoreData RPC (eg. of ~256MiB) are
timing out when partly sent. A trace was captured, with an additional
tracepoint to show ACKs being discarded in rxrpc_input_ack(). Here's an
excerpt showing the problem.
52873.203230: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 0002449c q=00024499 fl=09
A DATA packet with sequence number 00024499 has been transmitted (the "q="
field).
...
52873.243296: rxrpc_rx_ack: c=000004ae 00012a2b DLY r=00024499 f=00024497 p=00024496 n=0
52873.243376: rxrpc_rx_ack: c=000004ae 00012a2c IDL r=0002449b f=00024499 p=00024498 n=0
52873.243383: rxrpc_rx_ack: c=000004ae 00012a2d OOS r=0002449d f=00024499 p=0002449a n=2
The Out-Of-Sequence ACK indicates that the server didn't see DATA sequence
number 00024499, but did see seq 0002449a (previousPacket, shown as "p=",
skipped the number, but firstPacket, "f=", which shows the bottom of the
window is set at that point).
52873.252663: rxrpc_retransmit: c=000004ae q=24499 a=02 xp=14581537
52873.252664: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244bc q=00024499 fl=0b *RETRANS*
The packet has been retransmitted. Retransmission recurs until the peer
says it got the packet.
52873.271013: rxrpc_rx_ack: c=000004ae 00012a31 OOS r=000244a1 f=00024499 p=0002449e n=6
More OOS ACKs indicate that the other packets that are already in the
transmission pipeline are being received. The specific-ACK list is up to 6
ACKs and NAKs.
...
52873.284792: rxrpc_rx_ack: c=000004ae 00012a49 OOS r=000244b9 f=00024499 p=000244b6 n=30
52873.284802: rxrpc_retransmit: c=000004ae q=24499 a=0a xp=63505500
52873.284804: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244c2 q=00024499 fl=0b *RETRANS*
52873.287468: rxrpc_rx_ack: c=000004ae 00012a4a OOS r=000244ba f=00024499 p=000244b7 n=31
52873.287478: rxrpc_rx_ack: c=000004ae 00012a4b OOS r=000244bb f=00024499 p=000244b8 n=32
At this point, the server's receive window is full (n=32) with presumably 1
NAK'd packet and 31 ACK'd packets. We can't transmit any more packets.
52873.287488: rxrpc_retransmit: c=000004ae q=24499 a=0a xp=61327980
52873.287489: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244c3 q=00024499 fl=0b *RETRANS*
52873.293850: rxrpc_rx_ack: c=000004ae 00012a4c DLY r=000244bc f=000244a0 p=00024499 n=25
And now we've received an ACK indicating that a DATA retransmission was
received. 7 packets have been processed (the occupied part of the window
moved, as indicated by f= and n=).
52873.293853: rxrpc_rx_discard_ack: c=000004ae r=00012a4c 000244a0<00024499 00024499<000244b8
However, the DLY ACK gets discarded because its previousPacket has gone
backwards (from p=000244b8, in the ACK at 52873.287478 to p=00024499 in the
ACK at 52873.293850).
We then end up in a continuous cycle of retransmit/discard. kafs fails to
update its window because it's discarding the ACKs and can't transmit an
extra packet that would clear the issue because the window is full.
OpenAFS doesn't change the previousPacket value in the ACKs because no new
DATA packets are received with a different previousPacket number.
Fix this by altering the discard check to only discard an ACK based on
previousPacket if there was no advance in the firstPacket. This allows us
to transmit a new packet which will cause previousPacket to advance in the
next ACK.
The check, however, needs to allow for the possibility that previousPacket
may actually have had the serial number placed in it instead - in which
case it will go outside the window and we should ignore it.
Fixes: 1a2391c30c ("rxrpc: Fix detection of out of order acks")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc currently uses a fixed 4s retransmission timeout until the RTT is
sufficiently sampled. This can cause problems with some fileservers with
calls to the cache manager in the afs filesystem being dropped from the
fileserver because a packet goes missing and the retransmission timeout is
greater than the call expiry timeout.
Fix this by:
(1) Copying the RTT/RTO calculation code from Linux's TCP implementation
and altering it to fit rxrpc.
(2) Altering the various users of the RTT to make use of the new SRTT
value.
(3) Replacing the use of rxrpc_resend_timeout to use the calculated RTO
value instead (which is needed in jiffies), along with a backoff.
Notes:
(1) rxrpc provides RTT samples by matching the serial numbers on outgoing
DATA packets that have the RXRPC_REQUEST_ACK set and PING ACK packets
against the reference serial number in incoming REQUESTED ACK and
PING-RESPONSE ACK packets.
(2) Each packet that is transmitted on an rxrpc connection gets a new
per-connection serial number, even for retransmissions, so an ACK can
be cross-referenced to a specific trigger packet. This allows RTT
information to be drawn from retransmitted DATA packets also.
(3) rxrpc maintains the RTT/RTO state on the rxrpc_peer record rather than
on an rxrpc_call because many RPC calls won't live long enough to
generate more than one sample.
(4) The calculated SRTT value is in units of 8ths of a microsecond rather
than nanoseconds.
The (S)RTT and RTO values are displayed in /proc/net/rxrpc/peers.
Fixes: 17926a7932 ([AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both"")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the handling of signals in client rxrpc calls made by the afs
filesystem. Ignore signals completely, leaving call abandonment or
connection loss to be detected by timeouts inside AF_RXRPC.
Allowing a filesystem call to be interrupted after the entire request has
been transmitted and an abort sent means that the server may or may not
have done the action - and we don't know. It may even be worse than that
for older servers.
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
In rxrpc_input_data(), rxrpc_notify_socket() is called if the base sequence
number of the packet is immediately following the hard-ack point at the end
of the function. However, this isn't sufficient, since the recvmsg side
may have been advancing the window and then overrun the position in which
we're adding - at which point rx_hard_ack >= seq0 and no notification is
generated.
Fix this by always generating a notification at the end of the input
function.
Without this, a long call may stall, possibly indefinitely.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
The subpacket scanning loop in rxrpc_receive_data() references the
subpacket count in the private data part of the sk_buff in the loop
termination condition. However, when the final subpacket is pasted into
the ring buffer, the function is no longer has a ref on the sk_buff and
should not be looking at sp->* any more. This point is actually marked in
the code when skb is cleared (but sp is not - which is an error).
Fix this by caching sp->nr_subpackets in a local variable and using that
instead.
Also clear 'sp' to catch accesses after that point.
This can show up as an oops in rxrpc_get_skb() if sp->nr_subpackets gets
trashed by the sk_buff getting freed and reused in the meantime.
Fixes: e2de6c4048 ("rxrpc: Use info in skbuff instead of reparsing a jumbo packet")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the unlock and the ping transmission for a new incoming call into
rxrpc_new_incoming_call() rather than doing it in the caller. This makes
it clearer to see what's going on.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Will Deacon <will@kernel.org>
cc: Davidlohr Bueso <dave@stgolabs.net>
There's a misplaced traceline in rxrpc_input_packet() which is looking at a
packet that just got released rather than the replacement packet.
Fix this by moving the traceline after the assignment that moves the new
packet pointer to the actual packet pointer.
Fixes: d0d5c0cd1e ("rxrpc: Use skb_unshare() rather than skb_cow_data()")
Reported-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The in-place decryption routines in AF_RXRPC's rxkad security module
currently call skb_cow_data() to make sure the data isn't shared and that
the skb can be written over. This has a problem, however, as the softirq
handler may be still holding a ref or the Rx ring may be holding multiple
refs when skb_cow_data() is called in rxkad_verify_packet() - and so
skb_shared() returns true and __pskb_pull_tail() dislikes that. If this
occurs, something like the following report will be generated.
kernel BUG at net/core/skbuff.c:1463!
...
RIP: 0010:pskb_expand_head+0x253/0x2b0
...
Call Trace:
__pskb_pull_tail+0x49/0x460
skb_cow_data+0x6f/0x300
rxkad_verify_packet+0x18b/0xb10 [rxrpc]
rxrpc_recvmsg_data.isra.11+0x4a8/0xa10 [rxrpc]
rxrpc_kernel_recv_data+0x126/0x240 [rxrpc]
afs_extract_data+0x51/0x2d0 [kafs]
afs_deliver_fs_fetch_data+0x188/0x400 [kafs]
afs_deliver_to_call+0xac/0x430 [kafs]
afs_wait_for_call_to_complete+0x22f/0x3d0 [kafs]
afs_make_call+0x282/0x3f0 [kafs]
afs_fs_fetch_data+0x164/0x300 [kafs]
afs_fetch_data+0x54/0x130 [kafs]
afs_readpages+0x20d/0x340 [kafs]
read_pages+0x66/0x180
__do_page_cache_readahead+0x188/0x1a0
ondemand_readahead+0x17d/0x2e0
generic_file_read_iter+0x740/0xc10
__vfs_read+0x145/0x1a0
vfs_read+0x8c/0x140
ksys_read+0x4a/0xb0
do_syscall_64+0x43/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by using skb_unshare() instead in the input path for DATA packets
that have a security index != 0. Non-DATA packets don't need in-place
encryption and neither do unencrypted DATA packets.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Reported-by: Julian Wollrath <jwollrath@web.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Use the previously-added transmit-phase skbuff private flag to simplify the
socket buffer tracing a bit. Which phase the skbuff comes from can now be
divined from the skb rather than having to be guessed from the call state.
We can also reduce the number of rxrpc_skb_trace values by eliminating the
difference between Tx and Rx in the symbols.
Signed-off-by: David Howells <dhowells@redhat.com>
Pass the reference held on a DATA skb in the rxrpc input handler into the
Rx ring rather than getting an additional ref for this and then dropping
the original ref at the end.
Signed-off-by: David Howells <dhowells@redhat.com>
Use the information now cached in the skbuff private data to avoid the need
to reparse a jumbo packet. We can find all the subpackets by dead
reckoning, so it's only necessary to note how many there are, whether the
last one is flagged as LAST_PACKET and whether any have the REQUEST_ACK
flag set.
This is necessary as once recvmsg() can see the packet, it can start
modifying it, such as doing in-place decryption.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
Improve the information stored about jumbo packets so that we don't need to
reparse them so much later.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Don't bother generating maxSkew in the ACK packet as it has been obsolete
since AFS 3.1.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
The object lifetime management on the rxrpc_local struct is broken in that
the rxrpc_local_processor() function is expected to clean up and remove an
object - but it may get requeued by packets coming in on the backing UDP
socket once it starts running.
This may result in the assertion in rxrpc_local_rcu() firing because the
memory has been scheduled for RCU destruction whilst still queued:
rxrpc: Assertion failed
------------[ cut here ]------------
kernel BUG at net/rxrpc/local_object.c:468!
Note that if the processor comes around before the RCU free function, it
will just do nothing because ->dead is true.
Fix this by adding a separate refcount to count active users of the
endpoint that causes the endpoint to be destroyed when it reaches 0.
The original refcount can then be used to refcount objects through the work
processor and cause the memory to be rcu freed when that reaches 0.
Fixes: 4f95dd78a7 ("rxrpc: Rework local endpoint management")
Reported-by: syzbot+1e0edc4b8b7494c28450@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The rxrpc packet serial number cannot be safely used to compute out of
order ack packets for several reasons:
1. The allocation of serial numbers cannot be assumed to imply the order
by which acks are populated and transmitted. In some rxrpc
implementations, delayed acks and ping acks are transmitted
asynchronously to the receipt of data packets and so may be transmitted
out of order. As a result, they can race with idle acks.
2. Serial numbers are allocated by the rxrpc connection and not the call
and as such may wrap independently if multiple channels are in use.
In any case, what matters is whether the ack packet provides new
information relating to the bounds of the window (the firstPacket and
previousPacket in the ACK data).
Fix this by discarding packets that appear to wind back the window bounds
rather than on serial number procession.
Fixes: 298bc15b20 ("rxrpc: Only take the rwind and mtu values from latest ACK")
Signed-off-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.
Signed-off-by: David S. Miller <davem@davemloft.net>
The rxrpc_input_packet() function and its call tree was built around the
assumption that data_ready() handler called from UDP to inform a kernel
service that there is data to be had was non-reentrant. This means that
certain locking could be dispensed with.
This, however, turns out not to be the case with a multi-queue network card
that can deliver packets to multiple cpus simultaneously. Each of those
cpus can be in the rxrpc_input_packet() function at the same time.
Fix by adding or changing some structure members:
(1) Add peer->rtt_input_lock to serialise access to the RTT buffer.
(2) Make conn->service_id into a 32-bit variable so that it can be
cmpxchg'd on all arches.
(3) Add call->input_lock to serialise access to the Rx/Tx state. Note
that although the Rx and Tx states are (almost) entirely separate,
there's no point completing the separation and having separate locks
since it's a bi-phasal RPC protocol rather than a bi-direction
streaming protocol. Data transmission and data reception do not take
place simultaneously on any particular call.
and making the following functional changes:
(1) In rxrpc_input_data(), hold call->input_lock around the core to
prevent simultaneous producing of packets into the Rx ring and
updating of tracking state for a particular call.
(2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
The bit test and bit clear can then be combined. No further locking
is needed here.
(3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
the ACK packet. The superseded ACK check is then done both before and
after the lock is taken.
The handing of ackinfo data is split, parsing before the lock is taken
and processing with it held. This is keyed on rxMTU being non-zero.
Congestion management is also done within the locked section.
(4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
rotation. The ACKALL packet carries no information and is only really
useful after all packets have been transmitted since it's imprecise.
(5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
prevent calls being simultaneously implicitly ended on two cpus and
also to prevent any races with incoming call setup.
(6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
on a connection. It is only permitted to happen once for a
connection.
(7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
rx->incoming_lock to see if someone else set up the call, connection
or peer whilst we were getting there. We can't trust the values from
the earlier routing check unless we pin refs on them - which we want
to avoid.
Further, we need to allow for an incoming call to have its state
changed on another CPU between us making it live and us adjusting it
because the conn is now in the RXRPC_CONN_SERVICE state.
(8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
to the RTT buffer. Don't need to lock around setting peer->rtt.
For reference, the inventory of state-accessing or state-altering functions
used by the packet input procedure is:
> rxrpc_input_packet()
* PACKET CHECKING
* ROUTING
> rxrpc_post_packet_to_local()
> rxrpc_find_connection_rcu() - uses RCU
> rxrpc_lookup_peer_rcu() - uses RCU
> rxrpc_find_service_conn_rcu() - uses RCU
> idr_find() - uses RCU
* CONNECTION-LEVEL PROCESSING
- Service upgrade
- Can only happen once per conn
! Changed to use cmpxchg
> rxrpc_post_packet_to_conn()
- Setting conn->hi_serial
- Probably safe not using locks
- Maybe use cmpxchg
* CALL-LEVEL PROCESSING
> Old-call checking
> rxrpc_input_implicit_end_call()
> rxrpc_call_completed()
> rxrpc_queue_call()
! Need to take rx->incoming_lock
> __rxrpc_disconnect_call()
> rxrpc_notify_socket()
> rxrpc_new_incoming_call()
- Uses rx->incoming_lock for the entire process
- Might be able to drop this earlier in favour of the call lock
> rxrpc_incoming_call()
! Conflicts with rxrpc_input_implicit_end_call()
> rxrpc_send_ping()
- Don't need locks to check rtt state
> rxrpc_propose_ACK
* PACKET DISTRIBUTION
> rxrpc_input_call_packet()
> rxrpc_input_data()
* QUEUE DATA PACKET ON CALL
> rxrpc_reduce_call_timer()
- Uses timer_reduce()
! Needs call->input_lock()
> rxrpc_receiving_reply()
! Needs locking around ack state
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_proto_abort()
> rxrpc_input_dup_data()
- Fills the Rx buffer
- rxrpc_propose_ACK()
- rxrpc_notify_socket()
> rxrpc_input_ack()
* APPLY ACK PACKET TO CALL AND DISCARD PACKET
> rxrpc_input_ping_response()
- Probably doesn't need any extra locking
! Need READ_ONCE() on call->ping_serial
> rxrpc_input_check_for_lost_ack()
- Takes call->lock to consult Tx buffer
> rxrpc_peer_add_rtt()
! Needs to take a lock (peer->rtt_input_lock)
! Could perhaps manage with cmpxchg() and xadd() instead
> rxrpc_input_requested_ack
- Consults Tx buffer
! Probably needs a lock
> rxrpc_peer_add_rtt()
> rxrpc_propose_ack()
> rxrpc_input_ackinfo()
- Changes call->tx_winsize
! Use cmpxchg to handle change
! Should perhaps track serial number
- Uses peer->lock to record MTU specification changes
> rxrpc_proto_abort()
! Need to take call->input_lock
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_input_soft_acks()
- Consults the Tx buffer
> rxrpc_congestion_management()
- Modifies the Tx annotations
! Needs call->input_lock()
> rxrpc_queue_call()
> rxrpc_input_abort()
* APPLY ABORT PACKET TO CALL AND DISCARD PACKET
> rxrpc_set_call_completion()
> rxrpc_notify_socket()
> rxrpc_input_ackall()
* APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
! Need to take call->input_lock
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_reject_packet()
There are some functions used by the above that queue the packet, after
which the procedure is terminated:
- rxrpc_post_packet_to_local()
- local->event_queue is an sk_buff_head
- local->processor is a work_struct
- rxrpc_post_packet_to_conn()
- conn->rx_queue is an sk_buff_head
- conn->processor is a work_struct
- rxrpc_reject_packet()
- local->reject_queue is an sk_buff_head
- local->processor is a work_struct
And some that offload processing to process context:
- rxrpc_notify_socket()
- Uses RCU lock
- Uses call->notify_lock to call call->notify_rx
- Uses call->recvmsg_lock to queue recvmsg side
- rxrpc_queue_call()
- call->processor is a work_struct
- rxrpc_propose_ACK()
- Uses call->lock to wrap __rxrpc_propose_ACK()
And a bunch that complete a call, all of which use call->state_lock to
protect the call state:
- rxrpc_call_completed()
- rxrpc_set_call_completion()
- rxrpc_abort_call()
- rxrpc_proto_abort()
- Also uses rxrpc_queue_call()
Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
Move the out-of-order and duplicate ACK packet check to before the call to
rxrpc_input_ackinfo() so that the receive window size and MTU size are only
checked in the latest ACK packet and don't regress.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
Carry the call state out of the locked section in rxrpc_rotate_tx_window()
rather than sampling it afterwards. This is only used to select tracepoint
data, but could have changed by the time we do the tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
We should only call the function to end a call's Tx phase if we rotated the
marked-last packet out of the transmission buffer.
Make rxrpc_rotate_tx_window() return an indication of whether it just
rotated the packet marked as the last out of the transmit buffer, carrying
the information out of the locked section in that function.
We can then check the return value instead of examining RXRPC_CALL_TX_LAST.
Fixes: 70790dbe3f ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
Signed-off-by: David Howells <dhowells@redhat.com>
We don't need to take the RCU read lock in the rxrpc packet receive
function because it's held further up the stack in the IP input routine
around the UDP receive routines.
Fix this by dropping the RCU read lock calls from rxrpc_input_packet().
This simplifies the code.
Fixes: 70790dbe3f ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
Signed-off-by: David Howells <dhowells@redhat.com>
Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception
in which a packet is placed onto the UDP receive queue and then immediately
removed again by rxrpc. Going via the queue in this manner seems like it
should be unnecessary.
This does, however, require the invention of a value to place in encap_type
as that's one of the conditions to switch packets out to the encap_rcv
hook. Possibly the value doesn't actually matter for anything other than
sockopts on the UDP socket, which aren't accessible outside of rxrpc
anyway.
This seems to cut a bit of time out of the time elapsed between each
sk_buff being timestamped and turning up in rxrpc (the final number in the
following trace excerpts). I measured this by making the rxrpc_rx_packet
trace point print the time elapsed between the skb being timestamped and
the current time (in ns), e.g.:
... 424.278721: rxrpc_rx_packet: ... ACK 25026
So doing a 512MiB DIO read from my test server, with an unmodified kernel:
N min max sum mean stddev
27605 2626 7581 7.83992e+07 2840.04 181.029
and with the patch applied:
N min max sum mean stddev
27547 1895 12165 6.77461e+07 2459.29 255.02
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the rxrpc_data_ready() function to pick up all packets and to not miss
any. There are two problems:
(1) The sk_data_ready pointer on the UDP socket is set *after* it is
bound. This means that it's open for business before we're ready to
dequeue packets and there's a tiny window exists in which a packet can
sneak onto the receive queue, but we never know about it.
Fix this by setting the pointers on the socket prior to binding it.
(2) skb_recv_udp() will return an error (such as ENETUNREACH) if there was
an error on the transmission side, even though we set the
sk_error_report hook. Because rxrpc_data_ready() returns immediately
in such a case, it never actually removes its packet from the receive
queue.
Fix this by abstracting out the UDP dequeuing and checksumming into a
separate function that keeps hammering on skb_recv_udp() until it
returns -EAGAIN, passing the packets extracted to the remainder of the
function.
and two potential problems:
(3) It might be possible in some circumstances or in the future for
packets to be being added to the UDP receive queue whilst rxrpc is
running consuming them, so the data_ready() handler might get called
less often than once per packet.
Allow for this by fully draining the queue on each call as (2).
(4) If a packet fails the checksum check, the code currently returns after
discarding the packet without checking for more.
Allow for this by fully draining the queue on each call as (2).
Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Fix some refs to init_net that should've been changed to the appropriate
network namespace.
Fixes: 2baec2c3f8 ("rxrpc: Support network namespacing")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
rxrpc_lose_skb() is now exactly the same as rxrpc_free_skb(), so remove it
and use the latter instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the following changes to improve the robustness of the code that sets
up a new service call:
(1) Cache the rxrpc_sock struct obtained in rxrpc_data_ready() to do a
service ID check and pass that along to rxrpc_new_incoming_call().
This means that I can remove the check from rxrpc_new_incoming_call()
without the need to worry about the socket attached to the local
endpoint getting replaced - which would invalidate the check.
(2) Cache the rxrpc_peer struct, thereby allowing the peer search to be
done once. The peer is passed to rxrpc_new_incoming_call(), thereby
saving the need to repeat the search.
This also reduces the possibility of rxrpc_publish_service_conn()
BUG()'ing due to the detection of a duplicate connection, despite the
initial search done by rxrpc_find_connection_rcu() having turned up
nothing.
This BUG() shouldn't ever get hit since rxrpc_data_ready() *should* be
non-reentrant and the result of the initial search should still hold
true, but it has proven possible to hit.
I *think* this may be due to __rxrpc_lookup_peer_rcu() cutting short
the iteration over the hash table if it finds a matching peer with a
zero usage count, but I don't know for sure since it's only ever been
hit once that I know of.
Another possibility is that a bug in rxrpc_data_ready() that checked
the wrong byte in the header for the RXRPC_CLIENT_INITIATED flag
might've let through a packet that caused a spurious and invalid call
to be set up. That is addressed in another patch.
(3) Fix __rxrpc_lookup_peer_rcu() to skip peer records that have a zero
usage count rather than stopping and returning not found, just in case
there's another peer record behind it in the bucket.
(4) Don't search the peer records in rxrpc_alloc_incoming_call(), but
rather either use the peer cached in (2) or, if one wasn't found,
preemptively install a new one.
Fixes: 8496af50eb ("rxrpc: Use RCU to access a peer's service connection tree")
Signed-off-by: David Howells <dhowells@redhat.com>
Do more up-front checking on incoming packets to weed out invalid ones and
also ones aimed at services that we don't support.
Whilst we're at it, replace the clearing of call and skew if we don't find
a connection with just initialising the variables to zero at the top of the
function.
Signed-off-by: David Howells <dhowells@redhat.com>
In the input path, a received sk_buff can be marked for rejection by
setting RXRPC_SKB_MARK_* in skb->mark and, if needed, some auxiliary data
(such as an abort code) in skb->priority. The rejection is handled by
queueing the sk_buff up for dealing with in process context. The output
code reads the mark and priority and, theoretically, generates an
appropriate response packet.
However, if RXRPC_SKB_MARK_BUSY is set, this isn't noticed and an ABORT
message with a random abort code is generated (since skb->priority wasn't
set to anything).
Fix this by outputting the appropriate sort of packet.
Also, whilst we're at it, most of the marks are no longer used, so remove
them and rename the remaining two to something more obvious.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix RTT information gathering in AF_RXRPC by the following means:
(1) Enable Rx timestamping on the transport socket with SO_TIMESTAMPNS.
(2) If the sk_buff doesn't have a timestamp set when rxrpc_data_ready()
collects it, set it at that point.
(3) Allow ACKs to be requested on the last packet of a client call, but
not a service call. We need to be careful lest we undo:
bf7d620abf
Author: David Howells <dhowells@redhat.com>
Date: Thu Oct 6 08:11:51 2016 +0100
rxrpc: Don't request an ACK on the last DATA packet of a call's Tx phase
but that only really applies to service calls that we're handling,
since the client side gets to send the final ACK (or not).
(4) When about to transmit an ACK or DATA packet, record the Tx timestamp
before only; don't update the timestamp afterwards.
(5) Switch the ordering between recording the serial and recording the
timestamp to always set the serial number first. The serial number
shouldn't be seen referenced by an ACK packet until we've transmitted
the packet bearing it - so in the Rx path, we don't need the timestamp
until we've checked the serial number.
Fixes: cf1a6474f8 ("rxrpc: Add per-peer RTT tracker")
Signed-off-by: David Howells <dhowells@redhat.com>
There's a check in rxrpc_data_ready() that's checking the CLIENT_INITIATED
flag in the packet type field rather than in the packet flags field.
Fix this by creating a pair of helper functions to check whether the packet
is going to the client or to the server and use them generally.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
An SKB is not on a list if skb->next is NULL.
Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.
Signed-off-by: David S. Miller <davem@davemloft.net>
Trace successful packet transmission (kernel_sendmsg() succeeded, that is)
in AF_RXRPC. We can share the enum that defines the transmission points
with the trace_rxrpc_tx_fail() tracepoint, so rename its constants to be
applicable to both.
Also, save the internal call->debug_id in the rxrpc_channel struct so that
it can be used in retransmission trace lines.
Signed-off-by: David Howells <dhowells@redhat.com>
Sometimes an in-progress call will stop responding on the fileserver when
the fileserver quietly cancels the call with an internally marked abort
(RX_CALL_DEAD), without sending an ABORT to the client.
This causes the client's call to eventually expire from lack of incoming
packets directed its way, which currently leads to it being cancelled
locally with ETIME. Note that it's not currently clear as to why this
happens as it's really hard to reproduce.
The rotation policy implement by kAFS, however, doesn't differentiate
between ETIME meaning we didn't get any response from the server and ETIME
meaning the call got cancelled mid-flow. The latter leads to an oops when
fetching data as the rotation partially resets the afs_read descriptor,
which can result in a cleared page pointer being dereferenced because that
page has already been filled.
Handle this by the following means:
(1) Set a flag on a call when we receive a packet for it.
(2) Store the highest packet serial number so far received for a call
(bearing in mind this may wrap).
(3) If, when the "not received anything recently" timeout expires on a
call, we've received at least one packet for a call and the connection
as a whole has received packets more recently than that call, then
cancel the call locally with ECONNRESET rather than ETIME.
This indicates that the call was definitely in progress on the server.
(4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME,
don't try the next server, but rather abort the call.
This avoids the oops as we don't try to reuse the afs_read struct.
Rather, as-yet ungotten pages will be reread at a later data.
Also:
(5) Add an rxrpc tracepoint to log detection of the call being reset.
Without this, I occasionally see an oops like the following:
general protection fault: 0000 [#1] SMP PTI
...
RIP: 0010:_copy_to_iter+0x204/0x310
RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206
RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560
RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000
RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400
R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958
R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560
FS: 00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0
Call Trace:
skb_copy_datagram_iter+0x14e/0x289
rxrpc_recvmsg_data.isra.0+0x6f3/0xf68
? trace_buffer_unlock_commit_regs+0x4f/0x89
rxrpc_kernel_recv_data+0x149/0x421
afs_extract_data+0x1e0/0x798
? afs_wait_for_call_to_complete+0xc9/0x52e
afs_deliver_fs_fetch_data+0x33a/0x5ab
afs_deliver_to_call+0x1ee/0x5e0
? afs_wait_for_call_to_complete+0xc9/0x52e
afs_wait_for_call_to_complete+0x12b/0x52e
? wake_up_q+0x54/0x54
afs_make_call+0x287/0x462
? afs_fs_fetch_data+0x3e6/0x3ed
? rcu_read_lock_sched_held+0x5d/0x63
afs_fs_fetch_data+0x3e6/0x3ed
afs_fetch_data+0xbb/0x14a
afs_readpages+0x317/0x40d
__do_page_cache_readahead+0x203/0x2ba
? ondemand_readahead+0x3a7/0x3c1
ondemand_readahead+0x3a7/0x3c1
generic_file_buffered_read+0x18b/0x62f
__vfs_read+0xdb/0xfe
vfs_read+0xb2/0x137
ksys_read+0x50/0x8c
do_syscall_64+0x7d/0x1a0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Note the weird value in RDI which is a result of trying to kmap() a NULL
page pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The expect_rx_by call timeout is supposed to be set when a call is started
to indicate that we need to receive a packet by that point. This is
currently put back every time we receive a packet, but it isn't started
when we first send a packet. Without this, the call may wait forever if
the server doesn't deign to reply.
Fix this by setting the timeout upon a successful UDP sendmsg call for the
first DATA packet. The timeout is initiated only for initial transmission
and not for subsequent retries as we don't want the retry mechanism to
extend the timeout indefinitely.
Fixes: a158bdd324 ("rxrpc: Fix call timeouts")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
By analogy with other Rx implementations, RxRPC packet types 9, 10 and 11
should just be discarded rather than being aborted like other undefined
packet types.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a call-level abort is received for the previous call to complete on a
connection channel, then that abort is queued for the connection processor
to handle. Unfortunately, the connection processor then assumes without
checking that the abort is connection-level (ie. callNumber is 0) and
distributes it over all active calls on that connection, thereby
incorrectly aborting them.
Fix this by discarding aborts aimed at a completed call.
Further, discard all packets aimed at a call that's complete if there's
currently an active call on a channel, since the DATA packets associated
with the new call automatically terminate the old call.
Fixes: 18bfeba50d ("rxrpc: Perform terminal call ACK/ABORT retransmission from conn processor")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the firewall route keepalive part of AF_RXRPC which is currently
function incorrectly by replying to VERSION REPLY packets from the server
with VERSION REQUEST packets.
Instead, send VERSION REPLY packets to the peers of service connections to
act as keep-alives 20s after the latest packet was transmitted to that
peer.
Also, just discard VERSION REPLY packets rather than replying to them.
Signed-off-by: David Howells <dhowells@redhat.com>
In rxrpc and afs, use the debug_ids that are monotonically allocated to
various objects as they're allocated rather than pointers as kernel
pointers are now hashed making them less useful. Further, the debug ids
aren't reused anywhere nearly as quickly.
In addition, allow kernel services that use rxrpc, such as afs, to take
numbers from the rxrpc counter, assign them to their own call struct and
pass them in to rxrpc for both client and service calls so that the trace
lines for each will have the same ID tag.
Signed-off-by: David Howells <dhowells@redhat.com>
Add an extra timeout that is set/updated when we send a DATA packet that
has the request-ack flag set. This allows us to detect if we don't get an
ACK in response to the latest flagged packet.
The ACK packet is adjudged to have been lost if it doesn't turn up within
2*RTT of the transmission.
If the timeout occurs, we schedule the sending of a PING ACK to find out
the state of the other side. If a new DATA packet is ready to go sooner,
we cancel the sending of the ping and set the request-ack flag on that
instead.
If we get back a PING-RESPONSE ACK that indicates a lower tx_top than what
we had at the time of the ping transmission, we adjudge all the DATA
packets sent between the response tx_top and the ping-time tx_top to have
been lost and retransmit immediately.
Rather than sending a PING ACK, we could just pick a DATA packet and
speculatively retransmit that with request-ack set. It should result in
either a REQUESTED ACK or a DUPLICATE ACK which we can then use in lieu the
a PING-RESPONSE ACK mentioned above.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the rxrpc call expiration timeouts and make them settable from
userspace. By analogy with other rx implementations, there should be three
timeouts:
(1) "Normal timeout"
This is set for all calls and is triggered if we haven't received any
packets from the peer in a while. It is measured from the last time
we received any packet on that call. This is not reset by any
connection packets (such as CHALLENGE/RESPONSE packets).
If a service operation takes a long time, the server should generate
PING ACKs at a duration that's substantially less than the normal
timeout so is to keep both sides alive. This is set at 1/6 of normal
timeout.
(2) "Idle timeout"
This is set only for a service call and is triggered if we stop
receiving the DATA packets that comprise the request data. It is
measured from the last time we received a DATA packet.
(3) "Hard timeout"
This can be set for a call and specified the maximum lifetime of that
call. It should not be specified by default. Some operations (such
as volume transfer) take a long time.
Allow userspace to set/change the timeouts on a call with sendmsg, using a
control message:
RXRPC_SET_CALL_TIMEOUTS
The data to the message is a number of 32-bit words, not all of which need
be given:
u32 hard_timeout; /* sec from first packet */
u32 idle_timeout; /* msec from packet Rx */
u32 normal_timeout; /* msec from data Rx */
This can be set in combination with any other sendmsg() that affects a
call.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix call expiry handling in the following ways
(1) If all the request data from a client call is acked, don't send a
follow up IDLE ACK with firstPacket == 1 and previousPacket == 0 as
this appears to fool some servers into thinking everything has been
accepted.
(2) Never send an abort back to the server once it has ACK'd all the
request packets; rather just try to reuse the channel for the next
call. The first request DATA packet of the next call on the same
channel will implicitly ACK the entire reply of the dead call - even
if we haven't transmitted it yet.
(3) Don't send RX_CALL_TIMEOUT in an ABORT packet, librx uses abort codes
to pass local errors to the caller in addition to remote errors, and
this is meant to be local only.
The following also need to be addressed in future patches:
(4) Service calls should send PING ACKs as 'keep alives' if the server is
still processing the call.
(5) VERSION REPLY packets should be sent to the peers of service
connections to act as keep-alives. This is used to keep firewall
routes in place. The AFS CM should enable this.
Signed-off-by: David Howells <dhowells@redhat.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make it possible for a client to use AuriStor's service upgrade facility.
The client does this by adding an RXRPC_UPGRADE_SERVICE control message to
the first sendmsg() of a call. This takes no parameters.
When recvmsg() starts returning data from the call, the service ID field in
the returned msg_name will reflect the result of the upgrade attempt. If
the upgrade was ignored, srx_service will match what was set in the
sendmsg(); if the upgrade happened the srx_service will be altered to
indicate the service the server upgraded to.
Note that:
(1) The choice of upgrade service is up to the server
(2) Further client calls to the same server that would share a connection
are blocked if an upgrade probe is in progress.
(3) This should only be used to probe the service. Clients should then
use the returned service ID in all subsequent communications with that
server (and not set the upgrade). Note that the kernel will not
retain this information should the connection expire from its cache.
(4) If a server that supports upgrading is replaced by one that doesn't,
whilst a connection is live, and if the replacement is running, say,
OpenAFS 1.6.4 or older or an older IBM AFS, then the replacement
server will not respond to packets sent to the upgraded connection.
At this point, calls will time out and the server must be reprobed.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint (rxrpc_rx_rwind_change) to log changes in a call's receive
window size as imposed by the peer through an ACK packet.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint (rxrpc_rx_proto) to record protocol errors in received
packets. The following changes are made:
(1) Add a function, __rxrpc_abort_eproto(), to note a protocol error on a
call and mark the call aborted. This is wrapped by
rxrpc_abort_eproto() that makes the why string usable in trace.
(2) Add trace_rxrpc_rx_proto() or rxrpc_abort_eproto() to protocol error
generation points, replacing rxrpc_abort_call() with the latter.
(3) Only send an abort packet in rxkad_verify_packet*() if we actually
managed to abort the call.
Note that a trace event is also emitted if a kernel user (e.g. afs) tries
to send data through a call when it's not in the transmission phase, though
it's not technically a receive event.
Signed-off-by: David Howells <dhowells@redhat.com>
Use negative error codes in struct rxrpc_call::error because that's what
the kernel normally deals with and to make the code consistent. We only
turn them positive when transcribing into a cmsg for userspace recvmsg.
Signed-off-by: David Howells <dhowells@redhat.com>
The RxRPC ACK packet may contain an extension that includes the peer's
current Rx window size for this call. We adjust the local Tx window size
to match. However, the transmitter can stall if the receive window is
reduced to 0 by the peer and then reopened.
This is because the normal way that the transmitter is re-energised is by
dropping something out of our Tx queue and thus making space. When a
single gap is made, the transmitter is woken up. However, because there's
nothing in the Tx queue at this point, this doesn't happen.
To fix this, perform a wake_up() any time we see the peer's Rx window size
increasing.
The observable symptom is that calls start failing on ETIMEDOUT and the
following:
kAFS: SERVER DEAD state=-62
appears in dmesg.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The call state may be changed at any time by the data-ready routine in
response to received packets, so if the call state is to be read and acted
upon several times in a function, READ_ONCE() must be used unless the call
state lock is held.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All the routines by which rxrpc is accessed from the outside are serialised
by means of the socket lock (sendmsg, recvmsg, bind,
rxrpc_kernel_begin_call(), ...) and this presents a problem:
(1) If a number of calls on the same socket are in the process of
connection to the same peer, a maximum of four concurrent live calls
are permitted before further calls need to wait for a slot.
(2) If a call is waiting for a slot, it is deep inside sendmsg() or
rxrpc_kernel_begin_call() and the entry function is holding the socket
lock.
(3) sendmsg() and recvmsg() or the in-kernel equivalents are prevented
from servicing the other calls as they need to take the socket lock to
do so.
(4) The socket is stuck until a call is aborted and makes its slot
available to the waiter.
Fix this by:
(1) Provide each call with a mutex ('user_mutex') that arbitrates access
by the users of rxrpc separately for each specific call.
(2) Make rxrpc_sendmsg() and rxrpc_recvmsg() unlock the socket as soon as
they've got a call and taken its mutex.
Note that I'm returning EWOULDBLOCK from recvmsg() if MSG_DONTWAIT is
set but someone else has the lock. Should I instead only return
EWOULDBLOCK if there's nothing currently to be done on a socket, and
sleep in this particular instance because there is something to be
done, but we appear to be blocked by the interrupt handler doing its
ping?
(3) Make rxrpc_new_client_call() unlock the socket after allocating a new
call, locking its user mutex and adding it to the socket's call tree.
The call is returned locked so that sendmsg() can add data to it
immediately.
From the moment the call is in the socket tree, it is subject to
access by sendmsg() and recvmsg() - even if it isn't connected yet.
(4) Lock new service calls in the UDP data_ready handler (in
rxrpc_new_incoming_call()) because they may already be in the socket's
tree and the data_ready handler makes them live immediately if a user
ID has already been preassigned.
Note that the new call is locked before any notifications are sent
that it is live, so doing mutex_trylock() *ought* to always succeed.
Userspace is prevented from doing sendmsg() on calls that are in a
too-early state in rxrpc_do_sendmsg().
(5) Make rxrpc_new_incoming_call() return the call with the user mutex
held so that a ping can be scheduled immediately under it.
Note that it might be worth moving the ping call into
rxrpc_new_incoming_call() and then we can drop the mutex there.
(6) Make rxrpc_accept_call() take the lock on the call it is accepting and
release the socket after adding the call to the socket's tree. This
is slightly tricky as we've dequeued the call by that point and have
to requeue it.
Note that requeuing emits a trace event.
(7) Make rxrpc_kernel_send_data() and rxrpc_kernel_recv_data() take the
new mutex immediately and don't bother with the socket mutex at all.
This patch has the nice bonus that calls on the same socket are now to some
extent parallelisable.
Note that we might want to move rxrpc_service_prealloc() calls out from the
socket lock and give it its own lock, so that we don't hang progress in
other calls because we're waiting for the allocator.
We probably also want to avoid calling rxrpc_notify_socket() from within
the socket lock (rxrpc_accept_call()).
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.c.dionne@auristor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following extra tracing information:
(1) Modify the rxrpc_transmit tracepoint to record the Tx window size as
this is varied by the slow-start algorithm.
(2) Modify the rxrpc_rx_ack tracepoint to record more information from
received ACK packets.
(3) Add an rxrpc_rx_data tracepoint to record the information in DATA
packets.
(4) Add an rxrpc_disconnect_call tracepoint to record call disconnection,
including the reason the call was disconnected.
(5) Add an rxrpc_improper_term tracepoint to record implicit termination
of a call by a client either by starting a new call on a particular
connection channel without first transmitting the final ACK for the
previous call.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the way enum values are translated into strings in AF_RXRPC
tracepoints. The problem with just doing a lookup in a normal flat array
of strings or chars is that external tracing infrastructure can't find it.
Rather, TRACE_DEFINE_ENUM must be used.
Also sort the enums and string tables to make it easier to keep them in
order so that a future patch to __print_symbolic() can be optimised to try
a direct lookup into the table first before iterating over it.
A couple of _proto() macro calls are removed because they refered to tables
that got moved to the tracing infrastructure. The relevant data can be
found by way of tracing.
Signed-off-by: David Howells <dhowells@redhat.com>
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.
Overall, this allows acquiring only once the receive queue
lock on dequeue.
Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.
nr sinks vanilla patched
1 440 560
3 2150 2300
6 3650 3800
9 4450 4600
12 6250 6450
v1 -> v2:
- do rmem and allocated memory scheduling under the receive lock
- do bulk scheduling in first_packet_length() and in udp_destruct_sock()
- avoid the typdef for the dequeue callback
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
OpenAFS doesn't always correctly terminate client calls that it makes -
this includes calls the OpenAFS servers make to the cache manager service.
It should end the client call with either:
(1) An ACK that has firstPacket set to one greater than the seq number of
the reply DATA packet with the LAST_PACKET flag set (thereby
hard-ACK'ing all packets). nAcks should be 0 and acks[] should be
empty (ie. no soft-ACKs).
(2) An ACKALL packet.
OpenAFS, though, may send an ACK packet with firstPacket set to the last
seq number or less and soft-ACKs listed for all packets up to and including
the last DATA packet.
The transmitter, however, is obliged to keep the call live and the
soft-ACK'd DATA packets around until they're hard-ACK'd as the receiver is
permitted to drop any merely soft-ACK'd packet and request retransmission
by sending an ACK packet with a NACK in it.
Further, OpenAFS will also terminate a client call by beginning the next
client call on the same connection channel. This implicitly completes the
previous call.
This patch handles implicit ACK of a call on a channel by the reception of
the first packet of the next call on that channel.
If another call doesn't come along to implicitly ACK a call, then we have
to time the call out. There are some bugs there that will be addressed in
subsequent patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Separate the output of PING ACKs from the output of other sorts of ACK so
that if we receive a PING ACK and schedule transmission of a PING RESPONSE
ACK, the response doesn't get cancelled by a PING ACK we happen to be
scheduling transmission of at the same time.
If a PING RESPONSE gets lost, the other side might just sit there waiting
for it and refuse to proceed otherwise.
Signed-off-by: David Howells <dhowells@redhat.com>
When a reply is deemed lost, we send a ping to find out the other end
received all the request data packets we sent. This should be limited to
client calls and we shouldn't do this on service calls.
Signed-off-by: David Howells <dhowells@redhat.com>
Keep that call timeouts as ktimes rather than jiffies so that they can be
expressed as functions of RTT.
Signed-off-by: David Howells <dhowells@redhat.com>
When we receive an ACK from the peer that tells us what the peer's receive
window (rwind) is, we should reduce ssthresh to rwind if rwind is smaller
than ssthresh.
Signed-off-by: David Howells <dhowells@redhat.com>
Switch to Congestion Avoidance mode at cwnd == ssthresh rather than relying
on cwnd getting incremented beyond ssthresh and the window size, the mode
being shifted and then cwnd being corrected.
We need to make sure we switch into CA mode so that we stop marking every
packet for ACK.
Signed-off-by: David Howells <dhowells@redhat.com>
Note the serial number of the packet being ACK'd in the congestion
management trace rather than the serial number of the ACK packet. Whilst
the serial number of the ACK packet is useful for matching ACK packet in
the output of wireshark, the serial number that the ACK is in response to
is of more use in working out how different trace lines relate.
Signed-off-by: David Howells <dhowells@redhat.com>
Implement RxRPC slow-start, which is similar to RFC 5681 for TCP. A
tracepoint is added to log the state of the congestion management algorithm
and the decisions it makes.
Notes:
(1) Since we send fixed-size DATA packets (apart from the final packet in
each phase), counters and calculations are in terms of packets rather
than bytes.
(2) The ACK packet carries the equivalent of TCP SACK.
(3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
suited to SACK of a small number of packets. It seems that, almost
inevitably, by the time three 'duplicate' ACKs have been seen, we have
narrowed the loss down to one or two missing packets, and the
FLIGHT_SIZE calculation ends up as 2.
(4) In rxrpc_resend(), if there was no data that apparently needed
retransmission, we transmit a PING ACK to ask the peer to tell us what
its Rx window state is.
Signed-off-by: David Howells <dhowells@redhat.com>
If we've sent all the request data in a client call but haven't seen any
sign of the reply data yet, schedule an ACK to be sent to the server to
find out if the reply data got lost.
If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
demand a response to find out whether we need to retransmit.
If the server says it has received all of the data, we send an IDLE ACK to
tell the server that we haven't received anything in the receive phase as
yet.
To make this work, a non-immediate PING ACK must carry a delay. I've chosen
the same as the IDLE ACK for the moment.
Signed-off-by: David Howells <dhowells@redhat.com>
Generate a summary of the Tx buffer packet state when an ACK is received
for use in a later patch that does congestion management.
Signed-off-by: David Howells <dhowells@redhat.com>
Clear the ACK reason, ACK timer and resend timer when entering the client
reply phase when the first DATA packet is received. New ACKs will be
proposed once the data is queued.
The resend timer is no longer relevant and we need to cancel ACKs scheduled
to probe for a lost reply.
Signed-off-by: David Howells <dhowells@redhat.com>
Send an immediate ACK if we fill in a hole in the buffer left by an
out-of-sequence packet. This may allow the congestion management in the peer
to avoid a retransmission if packets got reordered on the wire.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint to log proposed ACKs, including whether the proposal is
used to update a pending ACK or is discarded in favour of an easlier,
higher priority ACK.
Whilst we're at it, get rid of the rxrpc_acks() function and access the
name array directly. We do, however, need to validate the ACK reason
number given to trace_rxrpc_rx_ack() to make sure we don't overrun the
array.
Signed-off-by: David Howells <dhowells@redhat.com>
When the last packet of data to be transmitted on a call is queued, tx_top
is set and then the RXRPC_CALL_TX_LAST flag is set. Unfortunately, this
leaves a race in the ACK processing side of things because the flag affects
the interpretation of tx_top and also allows us to start receiving reply
data before we've finished transmitting.
To fix this, make the following changes:
(1) rxrpc_queue_packet() now sets a marker in the annotation buffer
instead of setting the RXRPC_CALL_TX_LAST flag.
(2) rxrpc_rotate_tx_window() detects the marker and sets the flag in the
same context as the routines that use it.
(3) rxrpc_end_tx_phase() is simplified to just shift the call state.
The Tx window must have been rotated before calling to discard the
last packet.
(4) rxrpc_receiving_reply() is added to handle the arrival of the first
DATA packet of a reply to a client call (which is an implicit ACK of
the Tx phase).
(5) The last part of rxrpc_input_ack() is reordered to perform Tx
rotation, then soft-ACK application and then to end the phase if we've
rotated the last packet. In the event of a terminal ACK, the soft-ACK
application will be skipped as nAcks should be 0.
(6) rxrpc_input_ackall() now has to rotate as well as ending the phase.
In addition:
(7) Alter the transmit tracepoint to log the rotation of the last packet.
(8) Remove the no-longer relevant queue_reqack tracepoint note. The
ACK-REQUESTED packet header flag is now set as needed when we actually
transmit the packet and may vary by retransmission.
Signed-off-by: David Howells <dhowells@redhat.com>
When rxrpc_input_soft_acks() is parsing the soft-ACKs from an ACK packet,
it updates the Tx packet annotations in the annotation buffer. If a
soft-ACK is an ACK, then we overwrite unack'd, nak'd or to-be-retransmitted
states and that is fine; but if the soft-ACK is an NACK, we overwrite the
to-be-retransmitted with a nak - which isn't.
Instead, we need to let any scheduled retransmission stand if the packet
was NAK'd.
Note that we don't reissue a resend if the annotation is in the
to-be-retransmitted state because someone else must've scheduled the
resend already.
Signed-off-by: David Howells <dhowells@redhat.com>
before_eq() and friends should be used to compare serial numbers (when not
checking for (non)equality) rather than casting to int, subtracting and
checking the result.
Signed-off-by: David Howells <dhowells@redhat.com>
We don't want to send a PING ACK for every new incoming call as that just
adds to the network traffic. Instead, we send a PING ACK to the first
three that we receive and then once per second thereafter.
This could probably be made adjustable in future.
Signed-off-by: David Howells <dhowells@redhat.com>
In addition to sending a PING ACK to gain RTT data, we can set the
RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK. The
ACK packet contains the serial number of the packet it is in response to,
so we can look through the Tx buffer for a matching DATA packet.
This requires that the data packets be stamped with the time of
transmission as a ktime rather than having the resend_at time in jiffies.
This further requires the resend code to do the resend determination in
ktimes and convert to jiffies to set the timer.
Signed-off-by: David Howells <dhowells@redhat.com>
Send a PING ACK packet to the peer when we get a new incoming call from a
peer we don't have a record for. The PING RESPONSE ACK packet will tell us
the following about the peer:
(1) its receive window size
(2) its MTU sizes
(3) its support for jumbo DATA packets
(4) if it supports slow start (similar to RFC 5681)
(5) an estimate of the RTT
This is necessary because the peer won't normally send us an ACK until it
gets to the Rx phase and we send it a packet, but we would like to know
some of this information before we start sending packets.
A pair of tracepoints are added so that RTT determination can be observed.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a Tx-phase annotation for packet buffers to indicate that a buffer has
already been retransmitted. This will be used by future congestion
management. Re-retransmissions of a packet don't affect the congestion
window managment in the same way as initial retransmissions.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a configuration option to inject packet loss by discarding
approximately every 8th packet received and approximately every 8th DATA
packet transmitted.
Note that no locking is used, but it shouldn't really matter.
Signed-off-by: David Howells <dhowells@redhat.com>
Improve sk_buff tracing within AF_RXRPC by the following means:
(1) Use an enum to note the event type rather than plain integers and use
an array of event names rather than a big multi ?: list.
(2) Distinguish Rx from Tx packets and account them separately. This
requires the call phase to be tracked so that we know what we might
find in rxtx_buffer[].
(3) Add a parameter to rxrpc_{new,see,get,free}_skb() to indicate the
event type.
(4) A pair of 'rotate' events are added to indicate packets that are about
to be rotated out of the Rx and Tx windows.
(5) A pair of 'lost' events are added, along with rxrpc_lose_skb() for
packet loss injection recording.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint to follow the insertion of a packet into the transmit
buffer, its transmission and its rotation out of the buffer.
Signed-off-by: David Howells <dhowells@redhat.com>
The soft-ACK parser doesn't increment the pointer into the soft-ACK list,
resulting in the first ACK/NACK value being applied to all the relevant
packets in the Tx queue. This has the potential to miss retransmissions
and cause excessive retransmissions.
Fix this by incrementing the pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
The code for determining the last packet in rxrpc_recvmsg_data() has been
using the RXRPC_CALL_RX_LAST flag to determine if the rx_top pointer points
to the last packet or not. This isn't a good idea, however, as the input
code may be running simultaneously on another CPU and that sets the flag
*before* updating the top pointer.
Fix this by the following means:
(1) Restrict the use of RXRPC_CALL_RX_LAST to the input routines only.
There's otherwise a synchronisation problem between detecting the flag
and checking tx_top. This could probably be dealt with by appropriate
application of memory barriers, but there's a simpler way.
(2) Set RXRPC_CALL_RX_LAST after setting rx_top.
(3) Make rxrpc_rotate_rx_window() consult the flags header field of the
DATA packet it's about to discard to see if that was the last packet.
Use this as the basis for ending the Rx phase. This shouldn't be a
problem because the recvmsg side of things is guaranteed to see the
packets in order.
(4) Make rxrpc_recvmsg_data() return 1 to indicate the end of the data if:
(a) the packet it has just processed is marked as RXRPC_LAST_PACKET
(b) the call's Rx phase has been ended.
Signed-off-by: David Howells <dhowells@redhat.com>
call->rx_winsize should be initialised to the sysctl setting and the sysctl
setting should be limited to the maximum we want to permit. Further, we
need to place this in the ACK info instead of the sysctl setting.
Furthermore, discard the idea of accepting the subpackets of a jumbo packet
that lie beyond the receive window when the first packet of the jumbo is
within the window. Just discard the excess subpackets instead. This
allows the receive window to be opened up right to the buffer size less one
for the dead slot.
Signed-off-by: David Howells <dhowells@redhat.com>
Allow tx_winsize to grow when the ACK info packet shows a larger receive
window at the other end rather than only permitting it to shrink.
Signed-off-by: David Howells <dhowells@redhat.com>
skb->len should be used rather than skb->data_len when referring to the
amount of data in a packet. This will only cause a malfunction in the
following cases:
(1) We receive a jumbo packet (validation and splitting both are wrong).
(2) We see if there's extra ACK info in an ACK packet (we think it's not
there and just ignore it).
Signed-off-by: David Howells <dhowells@redhat.com>
We need to wake up the sender when Tx window rotation due to an incoming
ACK makes space in the buffer otherwise the sender is liable to just hang
endlessly.
This problem isn't noticeable if the Tx phase transfers no more than will
fit in a single window or the Tx window rotates fast enough that it doesn't
get full.
Signed-off-by: David Howells <dhowells@redhat.com>
Rewrite the data and ack handling code such that:
(1) Parsing of received ACK and ABORT packets and the distribution and the
filing of DATA packets happens entirely within the data_ready context
called from the UDP socket. This allows us to process and discard ACK
and ABORT packets much more quickly (they're no longer stashed on a
queue for a background thread to process).
(2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead
keep track of the offset and length of the content of each packet in
the sk_buff metadata. This means we don't do any allocation in the
receive path.
(3) Jumbo DATA packet parsing is now done in data_ready context. Rather
than cloning the packet once for each subpacket and pulling/trimming
it, we file the packet multiple times with an annotation for each
indicating which subpacket is there. From that we can directly
calculate the offset and length.
(4) A call's receive queue can be accessed without taking locks (memory
barriers do have to be used, though).
(5) Incoming calls are set up from preallocated resources and immediately
made live. They can than have packets queued upon them and ACKs
generated. If insufficient resources exist, DATA packet #1 is given a
BUSY reply and other DATA packets are discarded).
(6) sk_buffs no longer take a ref on their parent call.
To make this work, the following changes are made:
(1) Each call's receive buffer is now a circular buffer of sk_buff
pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
between the call and the socket. This permits each sk_buff to be in
the buffer multiple times. The receive buffer is reused for the
transmit buffer.
(2) A circular buffer of annotations (rxtx_annotations) is kept parallel
to the data buffer. Transmission phase annotations indicate whether a
buffered packet has been ACK'd or not and whether it needs
retransmission.
Receive phase annotations indicate whether a slot holds a whole packet
or a jumbo subpacket and, if the latter, which subpacket. They also
note whether the packet has been decrypted in place.
(3) DATA packet window tracking is much simplified. Each phase has just
two numbers representing the window (rx_hard_ack/rx_top and
tx_hard_ack/tx_top).
The hard_ack number is the sequence number before base of the window,
representing the last packet the other side says it has consumed.
hard_ack starts from 0 and the first packet is sequence number 1.
The top number is the sequence number of the highest-numbered packet
residing in the buffer. Packets between hard_ack+1 and top are
soft-ACK'd to indicate they've been received, but not yet consumed.
Four macros, before(), before_eq(), after() and after_eq() are added
to compare sequence numbers within the window. This allows for the
top of the window to wrap when the hard-ack sequence number gets close
to the limit.
Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
to indicate when rx_top and tx_top point at the packets with the
LAST_PACKET bit set, indicating the end of the phase.
(4) Calls are queued on the socket 'receive queue' rather than packets.
This means that we don't need have to invent dummy packets to queue to
indicate abnormal/terminal states and we don't have to keep metadata
packets (such as ABORTs) around
(5) The offset and length of a (sub)packet's content are now passed to
the verify_packet security op. This is currently expected to decrypt
the packet in place and validate it.
However, there's now nowhere to store the revised offset and length of
the actual data within the decrypted blob (there may be a header and
padding to skip) because an sk_buff may represent multiple packets, so
a locate_data security op is added to retrieve these details from the
sk_buff content when needed.
(6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
individually secured and needs to be individually decrypted. The code
to do this is broken out into rxrpc_recvmsg_data() and shared with the
kernel API. It now iterates over the call's receive buffer rather
than walking the socket receive queue.
Additional changes:
(1) The timers are condensed to a single timer that is set for the soonest
of three timeouts (delayed ACK generation, DATA retransmission and
call lifespan).
(2) Transmission of ACK and ABORT packets is effected immediately from
process-context socket ops/kernel API calls that cause them instead of
them being punted off to a background work item. The data_ready
handler still has to defer to the background, though.
(3) A shutdown op is added to the AF_RXRPC socket so that the AFS
filesystem can shut down the socket and flush its own work items
before closing the socket to deal with any in-progress service calls.
Future additional changes that will need to be considered:
(1) Make sure that a call doesn't hog the front of the queue by receiving
data from the network as fast as userspace is consuming it to the
exclusion of other calls.
(2) Transmit delayed ACKs from within recvmsg() when we've consumed
sufficiently more packets to avoid the background work item needing to
run.
Signed-off-by: David Howells <dhowells@redhat.com>
Make it possible for the data_ready handler called from the UDP transport
socket to completely instantiate an rxrpc_call structure and make it
immediately live by preallocating all the memory it might need. The idea
is to cut out the background thread usage as much as possible.
[Note that the preallocated structs are not actually used in this patch -
that will be done in a future patch.]
If insufficient resources are available in the preallocation buffers, it
will be possible to discard the DATA packet in the data_ready handler or
schedule a BUSY packet without the need to schedule an attempt at
allocation in a background thread.
To this end:
(1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs to a
maximum number each of the listen backlog size. The backlog size is
limited to a maxmimum of 32. Only this many of each can be in the
preallocation buffer.
(2) For userspace sockets, the preallocation is charged initially by
listen() and will be recharged by accepting or rejecting pending
new incoming calls.
(3) For kernel services {,re,dis}charging of the preallocation buffers is
handled manually. Two notifier callbacks have to be provided before
kernel_listen() is invoked:
(a) An indication that a new call has been instantiated. This can be
used to trigger background recharging.
(b) An indication that a call is being discarded. This is used when
the socket is being released.
A function, rxrpc_kernel_charge_accept() is called by the kernel
service to preallocate a single call. It should be passed the user ID
to be used for that call and a callback to associate the rxrpc call
with the kernel service's side of the ID.
(4) Discard the preallocation when the socket is closed.
(5) Temporarily bump the refcount on the call allocated in
rxrpc_incoming_call() so that rxrpc_release_call() can ditch the
preallocation ref on service calls unconditionally. This will no
longer be necessary once the preallocation is used.
Note that this does not yet control the number of active service calls on a
client - that will come in a later patch.
A future development would be to provide a setsockopt() call that allows a
userspace server to manually charge the preallocation buffer. This would
allow user call IDs to be provided in advance and the awkward manual accept
stage to be bypassed.
Signed-off-by: David Howells <dhowells@redhat.com>
Add two tracepoints:
(1) Record the RxRPC protocol header of packets retrieved from the UDP
socket by the data_ready handler.
(2) Record the outcome of the data_ready handler.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint for working out where local aborts happen. Each
tracepoint call is labelled with a 3-letter code so that they can be
distinguished - and the DATA sequence number is added too where available.
rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
indicate the circumstances when it aborts a call.
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc calls shouldn't hold refs on the sock struct. This was done so that
the socket wouldn't go away whilst the call was in progress, such that the
call could reach the socket's queues.
However, we can mark the socket as requiring an RCU release and rely on the
RCU read lock.
To make this work, we do:
(1) rxrpc_release_call() removes the call's call user ID. This is now
only called from socket operations and not from the call processor:
rxrpc_accept_call() / rxrpc_kernel_accept_call()
rxrpc_reject_call() / rxrpc_kernel_reject_call()
rxrpc_kernel_end_call()
rxrpc_release_calls_on_socket()
rxrpc_recvmsg()
Though it is also called in the cleanup path of
rxrpc_accept_incoming_call() before we assign a user ID.
(2) Pass the socket pointer into rxrpc_release_call() rather than getting
it from the call so that we can get rid of uninitialised calls.
(3) Fix call processor queueing to pass a ref to the work queue and to
release that ref at the end of the processor function (or to pass it
back to the work queue if we have to requeue).
(4) Skip out of the call processor function asap if the call is complete
and don't requeue it if the call is complete.
(5) Clean up the call immediately that the refcount reaches 0 rather than
trying to defer it. Actual deallocation is deferred to RCU, however.
(6) Don't hold socket refs for allocated calls.
(7) Use the RCU read lock when queueing a message on a socket and treat
the call's socket pointer according to RCU rules and check it for
NULL.
We also need to use the RCU read lock when viewing a call through
procfs.
(8) Transmit the final ACK/ABORT to a client call in rxrpc_release_call()
if this hasn't been done yet so that we can then disconnect the call.
Once the call is disconnected, it won't have any access to the
connection struct and the UDP socket for the call work processor to be
able to send the ACK. Terminal retransmission will be handled by the
connection processor.
(9) Release all calls immediately on the closing of a socket rather than
trying to defer this. Incomplete calls will be aborted.
The call refcount model is much simplified. Refs are held on the call by:
(1) A socket's user ID tree.
(2) A socket's incoming call secureq and acceptq.
(3) A kernel service that has a call in progress.
(4) A queued call work processor. We have to take care to put any call
that we failed to queue.
(5) sk_buffs on a socket's receive queue. A future patch will get rid of
this.
Whilst we're at it, we can do:
(1) Get rid of the RXRPC_CALL_EV_RELEASE event. Release is now done
entirely from the socket routines and never from the call's processor.
(2) Get rid of the RXRPC_CALL_DEAD state. Calls now end in the
RXRPC_CALL_COMPLETE state.
(3) Get rid of the rxrpc_call::destroyer work item. Calls are now torn
down when their refcount reaches 0 and then handed over to RCU for
final cleanup.
(4) Get rid of the rxrpc_call::deadspan timer. Calls are cleaned up
immediately they're finished with and don't hang around.
Post-completion retransmission is handled by the connection processor
once the call is disconnected.
(5) Get rid of the dead call expiry setting as there's no longer a timer
to set.
(6) rxrpc_destroy_all_calls() can just check that the call list is empty.
Signed-off-by: David Howells <dhowells@redhat.com>
Use rxrpc_is_service_call() rather than rxrpc_conn_is_service() if the call
is available just in case call->conn is NULL.
Signed-off-by: David Howells <dhowells@redhat.com>
Pass the connection pointer to rxrpc_post_packet_to_call() as the call
might get disconnected whilst we're looking at it, but the connection
pointer determined by rxrpc_data_read() is guaranteed by RCU for the
duration of the call.
Signed-off-by: David Howells <dhowells@redhat.com>