This commit adds support for 'collect_md' mode on xfrm interfaces.
Each net can have one collect_md device, created by providing the
IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
altered and has no if_id or link device attributes.
On transmit to this device, the if_id is fetched from the attached dst
metadata on the skb. If exists, the link property is also fetched from
the metadata. The dst metadata type used is METADATA_XFRM which holds
these properties.
On the receive side, xfrmi_rcv_cb() populates a dst metadata for each
packet received and attaches it to the skb. The if_id used in this case is
fetched from the xfrm state, and the link is fetched from the incoming
device. This information can later be used by upper layers such as tc,
ebpf, and ip rules.
Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst
metadata is postponed until after scrubing. Similarly, xfrm_input() is
adapted to avoid dropping metadata dsts by only dropping 'valid'
(skb_valid_dst(skb) == true) dsts.
Policy matching on packets arriving from collect_md xfrmi devices is
done by using the xfrm state existing in the skb's sec_path.
The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session()
is changed to keep the details of the if_id extraction tucked away
in xfrm_interface.c.
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2022-07-20
1) Don't set DST_NOPOLICY in IPv4, a recent patch made this
superfluous. From Eyal Birger.
2) Convert alg_key to flexible array member to avoid an iproute2
compile warning when built with gcc-12.
From Stephen Hemminger.
3) xfrm_register_km and xfrm_unregister_km do always return 0
so change the type to void. From Zhengchao Shao.
4) Fix spelling mistake in esp6.c
From Zhang Jiaming.
5) Improve the wording of comment above XFRM_OFFLOAD flags.
From Petr Vaněk.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If we set XFRM security policy by calling setsockopt with option
IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock'
struct. However tcp_v6_send_response doesn't look up dst_entry with the
actual socket but looks up with tcp control socket. This may cause a
problem that a RST packet is sent without ESP encryption & peer's TCP
socket can't receive it.
This patch will make the function look up dest_entry with actual socket,
if the socket has XFRM policy(sock_policy), so that the TCP response
packet via this function can be encrypted, & aligned on the encrypted
TCP socket.
Tested: We encountered this problem when a TCP socket which is encrypted
in ESP transport mode encryption, receives challenge ACK at SYN_SENT
state. After receiving challenge ACK, TCP needs to send RST to
establish the socket at next SYN try. But the RST was not encrypted &
peer TCP socket still remains on ESTABLISHED state.
So we verified this with test step as below.
[Test step]
1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED).
2. Client tries a new connection on the same TCP ports(src & dst).
3. Server will return challenge ACK instead of SYN,ACK.
4. Client will send RST to server to clear the SOCKET.
5. Client will retransmit SYN to server on the same TCP ports.
[Expected result]
The TCP connection should be established.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Sehee Lee <seheele@google.com>
Signed-off-by: Sewook Seo <sewookseo@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Functions xfrm_register_km and xfrm_unregister_km do always return 0,
change the type of functions to void.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Netdev reference helpers have a dev_ prefix for historic
reasons. Renaming the old helpers would be too much churn
but we can rename the tracking ones which are relatively
recent and should be the default for new code.
Rename:
dev_hold_track() -> netdev_hold()
dev_put_track() -> netdev_put()
dev_replace_track() -> netdev_ref_replace()
Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In IPv4 setting the "disable_policy" flag on a device means no policy
should be enforced for traffic originating from the device. This was
implemented by seting the DST_NOPOLICY flag in the dst based on the
originating device.
However, dsts are cached in nexthops regardless of the originating
devices, in which case, the DST_NOPOLICY flag value may be incorrect.
Consider the following setup:
+------------------------------+
| ROUTER |
+-------------+ | +-----------------+ |
| ipsec src |----|-|ipsec0 | |
+-------------+ | |disable_policy=0 | +----+ |
| +-----------------+ |eth1|-|-----
+-------------+ | +-----------------+ +----+ |
| noipsec src |----|-|eth0 | |
+-------------+ | |disable_policy=1 | |
| +-----------------+ |
+------------------------------+
Where ROUTER has a default route towards eth1.
dst entries for traffic arriving from eth0 would have DST_NOPOLICY
and would be cached and therefore can be reused by traffic originating
from ipsec0, skipping policy check.
Fix by setting a IPSKB_NOPOLICY flag in IPCB and observing it instead
of the DST in IN/FWD IPv4 policy checks.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
After drivers were converted to rely on direction, the flags is not
used anymore and can be removed.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
XFRM state doesn't need anything from flags except to understand
direction, so store it separately. For future patches, such change
will allow us to reuse xfrm_dev_offload for policy offload too, which
has three possible directions instead of two.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The struct xfrm_state_offload has all fields needed to hold information
for offloaded policies too. In order to do not create new struct with
same fields, let's rename existing one and reuse it later.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
num_exthdrs is set but never used, so delete it.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
After removal of Innova IPsec support from mlx5 driver, the last user
of this XFRM_ESP_NO_TRAILER was gone too. This means that we can safely
remove it as no other hardware is capable (or need) to remove ESP trailer.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2022-03-19
1) Delete duplicated functions that calls same xfrm_api_check.
From Leon Romanovsky.
2) Align userland API of the default policy structure to the
internal structures. From Nicolas Dichtel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a follow up of commit f8d858e607 ("xfrm: make user policy API
complete"). The goal is to align userland API to the internal structures.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This reverts commit b515d26372.
Commit b515d26372 ("xfrm: xfrm_state_mtu
should return at least 1280 for ipv6") in v5.14 breaks the TCP MSS
calculation in ipsec transport mode, resulting complete stalls of TCP
connections. This happens when the (P)MTU is 1280 or slighly larger.
The desired formula for the MSS is:
MSS = (MTU - ESP_overhead) - IP header - TCP header
However, the above commit clamps the (MTU - ESP_overhead) to a
minimum of 1280, turning the formula into
MSS = max(MTU - ESP overhead, 1280) - IP header - TCP header
With the (P)MTU near 1280, the calculated MSS is too large and the
resulting TCP packets never make it to the destination because they
are over the actual PMTU.
The above commit also causes suboptimal double fragmentation in
xfrm tunnel mode, as described in
https://lore.kernel.org/netdev/20210429202529.codhwpc7w6kbudug@dwarf.suse.cz/
The original problem the above commit was trying to fix is now fixed
by commit 6596a02295 ("xfrm: fix MTU
regression").
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch enables distinguishing SAs and SPs based on if_id during
the xfrm_migrate flow. This ensures support for xfrm interfaces
throughout the SA/SP lifecycle.
When there are multiple existing SPs with the same direction,
the same xfrm_selector and different endpoint addresses,
xfrm_migrate might fail with ENODATA.
Specifically, the code path for performing xfrm_migrate is:
Stage 1: find policy to migrate with
xfrm_migrate_policy_find(sel, dir, type, net)
Stage 2: find and update state(s) with
xfrm_migrate_state_find(mp, net)
Stage 3: update endpoint address(es) of template(s) with
xfrm_policy_migrate(pol, m, num_migrate)
Currently "Stage 1" always returns the first xfrm_policy that
matches, and "Stage 3" looks for the xfrm_tmpl that matches the
old endpoint address. Thus if there are multiple xfrm_policy
with same selector, direction, type and net, "Stage 1" might
rertun a wrong xfrm_policy and "Stage 3" will fail with ENODATA
because it cannot find a xfrm_tmpl with the matching endpoint
address.
The fix is to allow userspace to pass an if_id and add if_id
to the matching rule in Stage 1 and Stage 2 since if_id is a
unique ID for xfrm_policy and xfrm_state. For compatibility,
if_id will only be checked if the attribute is set.
Tested with additions to Android's kernel unit test suite:
https://android-review.googlesource.com/c/kernel/tests/+/1668886
Signed-off-by: Yan Yan <evitayan@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2022-01-06
1) Fix some clang_analyzer warnings about never read variables.
From luo penghao.
2) Check for pols[0] only once in xfrm_expand_policies().
From Jean Sacren.
3) The SA curlft.use_time was updated only on SA cration time.
Update whenever the SA is used. From Antony Antony
4) Add support for SM3 secure hash.
From Xu Jia.
5) Add support for SM4 symmetric cipher algorithm.
From Xu Jia.
6) Add a rate limit for SA mapping change messages.
From Antony Antony.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel generates mapping change message, XFRM_MSG_MAPPING,
when a source port chage is detected on a input state with UDP
encapsulation set. Kernel generates a message for each IPsec packet
with new source port. For a high speed flow per packet mapping change
message can be excessive, and can overload the user space listener.
Introduce rate limiting for XFRM_MSG_MAPPING message to the user space.
The rate limiting is configurable via netlink, when adding a new SA or
updating it. Use the new attribute XFRMA_MTIMER_THRESH in seconds.
v1->v2 change:
update xfrm_sa_len()
v2->v3 changes:
use u32 insted unsigned long to reduce size of struct xfrm_state
fix xfrm_ompat size Reported-by: kernel test robot <lkp@intel.com>
accept XFRM_MSG_MAPPING only when XFRMA_ENCAP is present
Co-developed-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When there is no policy configured on the system, the default policy is
checked in xfrm_route_forward. However, it was done with the wrong
direction (XFRM_POLICY_FWD instead of XFRM_POLICY_OUT).
The default policy for XFRM_POLICY_FWD was checked just before, with a call
to xfrm[46]_policy_check().
CC: stable@vger.kernel.org
Fixes: 2d151d3907 ("xfrm: Add possibility to set the default to block if we have no policy")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
As the default we assume the traffic to pass, if we have no
matching IPsec policy. With this patch, we have a possibility to
change this default from allow to block. It can be configured
via netlink. Each direction (input/output/forward) can be
configured separately. With the default to block configuered,
we need allow policies for all packet flows we accept.
We do not use default policy lookup for the loopback device.
v1->v2
- fix compiling when XFRM is disabled
- Reported-by: kernel test robot <lkp@intel.com>
Co-developed-by: Christian Langrock <christian.langrock@secunet.com>
Signed-off-by: Christian Langrock <christian.langrock@secunet.com>
Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Trivial conflict in net/netfilter/nf_tables_api.c.
Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.
skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2021-06-28
1) Remove an unneeded error assignment in esp4_gro_receive().
From Yang Li.
2) Add a new byseq state hashtable to find acquire states faster.
From Sabrina Dubroca.
3) Remove some unnecessary variables in pfkey_create().
From zuoqilin.
4) Remove the unused description from xfrm_type struct.
From Florian Westphal.
5) Fix a spelling mistake in the comment of xfrm_state_ok().
From gushengxian.
6) Replace hdr_off indirections by a small helper function.
From Florian Westphal.
7) Remove xfrm4_output_finish and xfrm6_output_finish declarations,
they are not used anymore.From Antony Antony.
8) Remove xfrm replay indirections.
From Florian Westphal.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The inner_ipproto saves the inner IP protocol of the plain
text packet. This allows vendor's IPsec feature making offload
decision at skb's features_check and configuring hardware at
ndo_start_xmit.
For example, ConnectX6-DX IPsec device needs the plaintext's
IP protocol to support partial checksum offload on
VXLAN/GENEVE packet over IPsec transport mode tunnel.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This replaces the overflow indirection with the new xfrm_replay_overflow
helper. After this, the 'repl' pointer in xfrm_state is no longer
needed and can be removed as well.
xfrm_replay_overflow() is added in two incarnations, one is used
when the kernel is compiled with xfrm hardware offload support enabled,
the other when its disabled.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Adds new xfrm_replay_recheck() helper and calls it from
xfrm input path instead of the indirection.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Similar to other patches: add a new helper to avoid
an indirection.
v2: fix 'net/xfrm/xfrm_replay.c:519:13: warning: 'seq' may be used
uninitialized in this function' warning.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
replay protection is implemented using a callback structure and then
called via
x->repl->notify(), x->repl->recheck(), and so on.
all the differect functions are always built-in, so this could be direct
calls instead.
This first patch prepares for removal of the x->repl structure.
Add an enum with the three available replay modes to the xfrm_state
structure and then replace all x->repl->notify() calls by the new
xfrm_replay_notify() helper.
The helper checks the enum internally to adapt behaviour as needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
These function declarations are not needed any more.
The definitions were deleted.
Fixes: 2ab6096db2 ("xfrm: remove output_finish indirection from xfrm_state_afinfo")
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
After previous patches all remaining users set the function pointer to
the same function: xfrm6_find_1stfragopt.
So remove this function pointer and call ip6_find_1stfragopt directly.
Reduces size of xfrm_type to 64 bytes on 64bit platforms.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Its set but never read. Reduces size of xfrm_type to 64 bytes on 64bit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When creating new states with seq set in xfrm_usersa_info, we walk
through all the states already installed in that netns to find a
matching ACQUIRE state (__xfrm_find_acq_byseq, called from
xfrm_state_add). This causes severe slowdowns on systems with a large
number of states.
This patch introduces a hashtable using x->km.seq as key, so that the
corresponding state can be found in a reasonable time.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Jianwen reported that IPv6 Interoperability tests are failing in an
IPsec case where one of the links between the IPsec peers has an MTU
of 1280. The peer generates a packet larger than this MTU, the router
replies with a "Packet too big" message indicating an MTU of 1280.
When the peer tries to send another large packet, xfrm_state_mtu
returns 1280 - ipsec_overhead, which causes ip6_setup_cork to fail
with EINVAL.
We can fix this by forcing xfrm_state_mtu to return IPV6_MIN_MTU when
IPv6 is used. After going through IPsec, the packet will then be
fragmented to obey the actual network's PMTU, just before leaving the
host.
Currently, TFC padding is capped to PMTU - overhead to avoid
fragementation: after padding and encapsulation, we still fit within
the PMTU. That behavior is preserved in this patch.
Fixes: 91657eafb6 ("xfrm: take net hdr len into account for esp payload size calculation")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When xfrm interfaces are used in combination with namespaces
and ESP offload, we get a dst_entry NULL pointer dereference.
This is because we don't have a dst_entry attached in the ESP
offloading case and we need to do a policy lookup before the
namespace transition.
Fix this by expicit checking of skb_dst(skb) before accessing it.
Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
A situation can occur where the interface bound to the sk is different
to the interface bound to the sk attached to the skb. The interface
bound to the sk is the correct one however this information is lost inside
xfrm_output2 and instead the sk on the skb is used in xfrm_output_resume
instead. This assumes that the sk bound interface and the bound interface
attached to the sk within the skb are the same which can lead to lookup
failures inside ip_route_me_harder resulting in the packet being dropped.
We have an l2tp v3 tunnel with ipsec protection. The tunnel is in the
global VRF however we have an encapsulated dot1q tunnel interface that
is within a different VRF. We also have a mangle rule that marks the
packets causing them to be processed inside ip_route_me_harder.
Prior to commit 31c70d5956 ("l2tp: keep original skb ownership") this
worked fine as the sk attached to the skb was changed from the dot1q
encapsulated interface to the sk for the tunnel which meant the interface
bound to the sk and the interface bound to the skb were identical.
Commit 46d6c5ae95 ("netfilter: use actual socket sk rather than skb sk
when routing harder") fixed some of these issues however a similar
problem existed in the xfrm code.
Fixes: 31c70d5956 ("l2tp: keep original skb ownership")
Signed-off-by: Evan Nimmo <evan.nimmo@alliedtelesis.co.nz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Rejecting non-native endian BTF overlapped with the addition
of support for it.
The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide compat_xfrm_userpolicy_info translation for xfrm setsocketopt().
Reallocate buffer and put the missing padding for 64-bit message.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Provide the user-to-kernel translator under XFRM_USER_COMPAT, that
creates for 32-bit xfrm-user message a 64-bit translation.
The translation is afterwards reused by xfrm_user code just as if
userspace had sent 64-bit message.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Provide the kernel-to-user translator under XFRM_USER_COMPAT, that
creates for 64-bit xfrm-user message a 32-bit translation and puts it
in skb's frag_list. net/compat.c layer provides MSG_CMSG_COMPAT to
decide if the message should be taken from skb or frag_list.
(used by wext-core which has also an ABI difference)
Kernel sends 64-bit xfrm messages to the userspace for:
- multicast (monitor events)
- netlink dumps
Wire up the translator to xfrm_nlmsg_multicast().
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Add a skeleton for xfrm_compat module and provide API to register it in
xfrm_state.ko. struct xfrm_translator will have function pointers to
translate messages received from 32-bit userspace or to be sent to it
from 64-bit kernel.
module_get()/module_put() are used instead of rcu_read_lock() as the
module will vmalloc() memory for translation.
The new API is registered with xfrm_state module, not with xfrm_user as
the former needs translator for user_policy set by setsockopt() and
xfrm_user already uses functions from xfrm_state.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
XFRMA_REPLAY_ESN_VAL was not cloned completely from the old to the new.
Migrate this attribute during XFRMA_MSG_MIGRATE
v1->v2:
- move curleft cloning to a separate patch
Fixes: af2f464e32 ("xfrm: Assign esn pointers when cloning a state")
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2020-07-30
Please note that I did the first time now --no-ff merges
of my testing branch into the master branch to include
the [PATCH 0/n] message of a patchset. Please let me
know if this is desirable, or if I should do it any
different.
1) Introduce a oseq-may-wrap flag to disable anti-replay
protection for manually distributed ICVs as suggested
in RFC 4303. From Petr Vaněk.
2) Patchset to fully support IPCOMP for vti4, vti6 and
xfrm interfaces. From Xin Long.
3) Switch from a linear list to a hash list for xfrm interface
lookups. From Eyal Birger.
4) Fixes to not register one xfrm(6)_tunnel object twice.
From Xin Long.
5) Fix two compile errors that were introduced with the
IPCOMP support for vti and xfrm interfaces.
Also from Xin Long.
6) Make the policy hold queue work with VTI. This was
forgotten when VTI was implemented.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commits "xfrm: Move dst->path into struct xfrm_dst"
and "net: Create and use new helper xfrm_dst_child()."
changed xfrm bundle handling under the assumption
that xdst->path and dst->child are not a NULL pointer
only if dst->xfrm is not a NULL pointer. That is true
with one exception. If the xfrm hold queue is used
to wait until a SA is installed by the key manager,
we create a dummy bundle without a valid dst->xfrm
pointer. The current xfrm bundle handling crashes
in that case. Fix this by extending the NULL check
of dst->xfrm with a test of the DST_XFRM_QUEUE flag.
Fixes: 0f6c480f23 ("xfrm: Move dst->path into struct xfrm_dst")
Fixes: b92cf4aab8 ("net: Create and use new helper xfrm_dst_child().")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch is to register a callback function tunnel6_rcv_cb with
is_ipip set in a xfrm_input_afinfo object for tunnel6 and tunnel46.
It will be called by xfrm_rcv_cb() from xfrm_input() when family
is AF_INET6 and proto is IPPROTO_IPIP or IPPROTO_IPV6.
v1->v2:
- Fix a sparse warning caused by the missing "__rcu", as Jakub
noticed.
- Handle the err returned by xfrm_input_register_afinfo() in
tunnel6_init/fini(), as Sabrina noticed.
v2->v3:
- Add "#if IS_ENABLED(CONFIG_INET6_XFRM_TUNNEL)" to fix the build error
when xfrm is disabled, reported by kbuild test robot
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch is to register a callback function tunnel4_rcv_cb with
is_ipip set in a xfrm_input_afinfo object for tunnel4 and tunnel64.
It will be called by xfrm_rcv_cb() from xfrm_input() when family
is AF_INET and proto is IPPROTO_IPIP or IPPROTO_IPV6.
v1->v2:
- Fix a sparse warning caused by the missing "__rcu", as Jakub
noticed.
- Handle the err returned by xfrm_input_register_afinfo() in
tunnel4_init/fini(), as Sabrina noticed.
v2->v3:
- Add "#if IS_ENABLED(CONFIG_INET_XFRM_TUNNEL)" to fix the build error
when xfrm is disabled, reported by kbuild test robot.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch is to add a new member is_ipip to struct xfrm_input_afinfo,
to allow another group family of callback functions to be registered
with is_ipip set.
This will be used for doing a callback for struct xfrm(6)_tunnel of
ipip/ipv6 tunnels in xfrm_input() by calling xfrm_rcv_cb(), which is
needed by ipip/ipv6 tunnels' support in ip(6)_vti and xfrm interface
in the next patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Minor overlapping changes in xfrm_device.c, between the double
ESP trailing bug fix setting the XFRM_INIT flag and the changes
in net-next preparing for bonding encryption support.
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit ed17b8d377 ("xfrm: fix a warning in xfrm_policy_insert_list"),
it would take 'priority' to make a policy unique, and allow duplicated
policies with different 'priority' to be added, which is not expected
by userland, as Tobias reported in strongswan.
To fix this duplicated policies issue, and also fix the issue in
commit ed17b8d377 ("xfrm: fix a warning in xfrm_policy_insert_list"),
when doing add/del/get/update on user interfaces, this patch is to change
to look up a policy with both mark and mask by doing:
mark.v == pol->mark.v && mark.m == pol->mark.m
and leave the check:
(mark & pol->mark.m) == pol->mark.v
for tx/rx path only.
As the userland expects an exact mark and mask match to manage policies.
v1->v2:
- make xfrm_policy_mark_match inline and fix the changelog as
Tobias suggested.
Fixes: 295fae5688 ("xfrm: Allow user space manipulation of SPD mark")
Fixes: ed17b8d377 ("xfrm: fix a warning in xfrm_policy_insert_list")
Reported-by: Tobias Brunner <tobias@strongswan.org>
Tested-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Rather than requiring every hw crypto capable NIC driver to do a check for
slave_dev being set, set real_dev in the xfrm layer and xso init time, and
then override it in the bonding driver as needed. Then NIC drivers can
always use real_dev, and at the same time, we eliminate the use of a
variable name that probably shouldn't have been used in the first place,
particularly given recent current events.
CC: Boris Pismenny <borisp@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Leon Romanovsky <leon@kernel.org>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
Suggested-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is prep work for initial support of bonding hardware encryption
pass-through support. The bonding driver will fill in the slave_dev
pointer, and we use that to know not to skb_push() again on a given
skb that was already processed on the bond device.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During IPsec performance testing, we see bad ICMP checksum. The error packet
has duplicated ESP trailer due to double validate_xmit_xfrm calls. The first call
is from ip_output, but the packet cannot be sent because
netif_xmit_frozen_or_stopped is true and the packet gets dev_requeue_skb. The second
call is from NET_TX softirq. However after the first call, the packet already
has the ESP trailer.
Fix by marking the skb with XFRM_XMIT bit after the packet is handled by
validate_xmit_xfrm to avoid duplicate ESP trailer insertion.
Fixes: f6e27114a6 ("net: Add a xfrm validate function to validate_xmit_skb")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
There are only two implementaions, one for ipv4 and one for ipv6.
Both are almost identical, they clear skb->cb[], set the TRANSFORMED flag
in IP(6)CB and then call the common xfrm_output() function.
By placing the IPCB handling into the common function, we avoid the need
for the output_finish indirection as the output functions can simply
use xfrm_output().
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Move this to xfrm_output.c. This avoids the state->extract_output
indirection.
This patch also removes the duplicated __xfrm6_extract_header helper
added in an earlier patch, we can now use the one from xfrm_inout.h .
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
so next patch can re-use it from net/xfrm/xfrm_output.c without
causing a linker error when IPV6 is a module.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We cannot call this function from the core kernel unless we would force
CONFIG_IPV6=y.
Therefore expose this via ipv6_stubs so we can call it from net/xfrm
in the followup patch.
Since the call is expected to be unlikely, no extra code for the IPV6=y
case is added and we will always eat the indirection cost.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The function only initializes the XFRM CB in the skb.
After previous patch xfrm4_extract_header is only called from
net/xfrm/xfrm_{input,output}.c.
Because of IPV6=m linker errors the ipv6 equivalent
(xfrm6_extract_header) was already placed in xfrm_inout.h because
we can't call functions residing in a module from the core.
So do the same for the ipv4 helper and place it next to the ipv6 one.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In order to keep CONFIG_IPV6=m working, xfrm6_extract_header needs to be
duplicated. It will be removed again in a followup change when the
remaining caller is moved to net/xfrm as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We can use a direct call for ipv4, so move the needed functions
to net/xfrm/xfrm_output.c and call them directly.
For ipv6 the indirection can be avoided as well but it will need
a bit more work -- to ease review it will be done in another patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch adds support for encapsulation of ESP over UDPv6. The code
is very similar to the IPv4 encapsulation implementation, and allows
to easily add espintcp on IPv6 as a follow-up.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
TCP encapsulation of IKE and IPsec messages (RFC 8229) is implemented
as a TCP ULP, overriding in particular the sendmsg and recvmsg
operations. A Stream Parser is used to extract messages out of the TCP
stream using the first 2 bytes as length marker. Received IKE messages
are put on "ike_queue", waiting to be dequeued by the custom recvmsg
implementation. Received ESP messages are sent to XFRM, like with UDP
encapsulation.
Some of this code is taken from the original submission by Herbert
Xu. Currently, only IPv4 is supported, like for UDP encapsulation.
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This will be used by TCP encapsulation to write packets to the encap
socket without holding the user socket's lock. Without this reinjection,
we're already holding the lock of the user socket, and then try to lock
the encap socket as well when we enqueue the encrypted packet.
While at it, add a BUILD_BUG_ON like we usually do for skb->cb, since
it's missing for struct xfrm_trans_cb.
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If IPsec is not configured, there is no reason to delay the inevitable.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
With the current implementation, phydev cannot be removed:
$ ip link add dummy type dummy
$ ip link add xfrm1 type xfrm dev dummy if_id 1
$ ip l d dummy
kernel:[77938.465445] unregister_netdevice: waiting for dummy to become free. Usage count = 1
Manage it like in ip tunnels, ie just keep the ifindex. Not that the side
effect, is that the phydev is now optional.
Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Julien Floret <julien.floret@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The ifname is copied when the interface is created, but is never updated
later. In fact, this property is used only in one error message, where the
netdevice pointer is available, thus let's use it.
Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
esp4_get_mtu and esp6_get_mtu are exactly the same, the only difference
is a single sizeof() (ipv4 vs. ipv6 header).
Merge both into xfrm_state_mtu() and remove the indirection.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Only a handful of xfrm_types exist, no need to have 512 pointers for them.
Reduces size of afinfo struct from 4k to 120 bytes on 64bit platforms.
Also, the unregister function doesn't need to return an error, no single
caller does anything useful with it.
Just place a WARN_ON() where needed instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
xfrm_prepare_input needs to lookup the state afinfo backend again to fetch
the address family ethernet protocol value.
There are only two address families, so a switch statement is simpler.
While at it, use u8 for family and proto and remove the owner member --
its not used anywhere.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
No module dependency, placing this in xfrm_state.c avoids need for
an indirection.
This also removes the state spinlock -- I don't see why we would need
to hold it during sorting.
This in turn allows to remove the 'net' argument passed to
xfrm_tmpl_sort. Last, remove the EXPORT_SYMBOL, there are no modular
callers.
For the CONFIG_IPV6=m case, vmlinux size increase is about 300 byte.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
There is only one implementation of this function; just call it directly.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
same as previous patch: just place this in the caller, no need to
have an indirection for a structure initialization.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Simple initialization, handle it in the caller.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pull networking updates from David Miller:
"Highlights:
1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.
2) Add fib_sync_mem to control the amount of dirty memory we allow to
queue up between synchronize RCU calls, from David Ahern.
3) Make flow classifier more lockless, from Vlad Buslov.
4) Add PHY downshift support to aquantia driver, from Heiner
Kallweit.
5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
contention on SLAB spinlocks in heavy RPC workloads.
6) Partial GSO offload support in XFRM, from Boris Pismenny.
7) Add fast link down support to ethtool, from Heiner Kallweit.
8) Use siphash for IP ID generator, from Eric Dumazet.
9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
entries, from David Ahern.
10) Move skb->xmit_more into a per-cpu variable, from Florian
Westphal.
11) Improve eBPF verifier speed and increase maximum program size,
from Alexei Starovoitov.
12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
spinlocks. From Neil Brown.
13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.
14) Improve link partner cap detection in generic PHY code, from
Heiner Kallweit.
15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
Maguire.
16) Remove SKB list implementation assumptions in SCTP, your's truly.
17) Various cleanups, optimizations, and simplifications in r8169
driver. From Heiner Kallweit.
18) Add memory accounting on TX and RX path of SCTP, from Xin Long.
19) Switch PHY drivers over to use dynamic featue detection, from
Heiner Kallweit.
20) Support flow steering without masking in dpaa2-eth, from Ioana
Ciocoi.
21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
Pirko.
22) Increase the strict parsing of current and future netlink
attributes, also export such policies to userspace. From Johannes
Berg.
23) Allow DSA tag drivers to be modular, from Andrew Lunn.
24) Remove legacy DSA probing support, also from Andrew Lunn.
25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
Haabendal.
26) Add a generic tracepoint for TX queue timeouts to ease debugging,
from Cong Wang.
27) More indirect call optimizations, from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
cxgb4: Fix error path in cxgb4_init_module
net: phy: improve pause mode reporting in phy_print_status
dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
net: macb: Change interrupt and napi enable order in open
net: ll_temac: Improve error message on error IRQ
net/sched: remove block pointer from common offload structure
net: ethernet: support of_get_mac_address new ERR_PTR error
net: usb: smsc: fix warning reported by kbuild test robot
staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
net: dsa: support of_get_mac_address new ERR_PTR error
net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
vrf: sit mtu should not be updated when vrf netdev is the link
net: dsa: Fix error cleanup path in dsa_init_module
l2tp: Fix possible NULL pointer dereference
taprio: add null check on sched_nest to avoid potential null pointer dereference
net: mvpp2: cls: fix less than zero check on a u32 variable
net_sched: sch_fq: handle non connected flows
net_sched: sch_fq: do not assume EDT packets are ordered
net: hns3: use devm_kcalloc when allocating desc_cb
net: hns3: some cleanup for struct hns3_enet_ring
...
Pull timer updates from Ingo Molnar:
"This cycle had the following changes:
- Timer tracing improvements (Anna-Maria Gleixner)
- Continued tasklet reduction work: remove the hrtimer_tasklet
(Thomas Gleixner)
- Fix CPU hotplug remove race in the tick-broadcast mask handling
code (Thomas Gleixner)
- Force upper bound for setting CLOCK_REALTIME, to fix ABI
inconsistencies with handling values that are close to the maximum
supported and the vagueness of when uptime related wraparound might
occur. Make the consistent maximum the year 2232 across all
relevant ABIs and APIs. (Thomas Gleixner)
- various cleanups and smaller fixes"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Fix typos in comments
tick/broadcast: Fix warning about undefined tick_broadcast_oneshot_offline()
timekeeping: Force upper bound for setting CLOCK_REALTIME
timer/trace: Improve timer tracing
timer/trace: Replace deprecated vsprintf pointer extension %pf by %ps
timer: Move trace point to get proper index
tick/sched: Update tick_sched struct documentation
tick: Remove outgoing CPU from broadcast masks
timekeeping: Consistently use unsigned int for seqcount snapshot
softirq: Remove tasklet_hrtimer
xfrm: Replace hrtimer tasklet with softirq hrtimer
mac80211_hwsim: Replace hrtimer tasklet with softirq hrtimer
None of them have any external callers, make them static.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
No external dependencies, might as well handle this directly.
xfrm_afinfo_policy is now 40 bytes on x86_64.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
handle this directly, its only used by ipv6.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Only used by ipv4, we can read the fl4 tos value directly instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This structure is now only 4 bytes, so its more efficient
to cache a copy rather than its address.
No significant size difference in allmodconfig vmlinux.
With non-modular kernel that has all XFRM options enabled, this
series reduces vmlinux image size by ~11kb. All xfrm_mode
indirections are gone and all modes are built-in.
before (ipsec-next master):
text data bss dec filename
21071494 7233140 11104324 39408958 vmlinux.master
after this series:
21066448 7226772 11104324 39397544 vmlinux.patched
With allmodconfig kernel, the size increase is only 362 bytes,
even all the xfrm config options removed in this series are
modular.
before:
text data bss dec filename
15731286 6936912 4046908 26715106 vmlinux.master
after this series:
15731492 6937068 4046908 26715468 vmlinux
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
after previous changes, xfrm_mode contains no function pointers anymore
and all modules defining such struct contain no code except an init/exit
functions to register the xfrm_mode struct with the xfrm core.
Just place the xfrm modes core and remove the modules,
the run-time xfrm_mode register/unregister functionality is removed.
Before:
text data bss dec filename
7523 200 2364 10087 net/xfrm/xfrm_input.o
40003 628 440 41071 net/xfrm/xfrm_state.o
15730338 6937080 4046908 26714326 vmlinux
7389 200 2364 9953 net/xfrm/xfrm_input.o
40574 656 440 41670 net/xfrm/xfrm_state.o
15730084 6937068 4046908 26714060 vmlinux
The xfrm*_mode_{transport,tunnel,beet} modules are gone.
v2: replace CONFIG_INET6_XFRM_MODE_* IS_ENABLED guards with CONFIG_IPV6
ones rather than removing them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Adds an EXPORT_SYMBOL for afinfo_get_rcu, as it will now be called from
ipv6 in case of CONFIG_IPV6=m.
This change has virtually no effect on vmlinux size, but it reduces
afinfo size and allows followup patch to make xfrm modes const.
v2: mark if (afinfo) tests as likely (Sabrina)
re-fetch afinfo according to inner_mode in xfrm_prepare_input().
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
similar to previous patch: no external module dependencies,
so we can avoid the indirection by placing this in the core.
This change removes the last indirection from xfrm_mode and the
xfrm4|6_mode_{beet,tunnel}.c modules contain (almost) no code anymore.
Before:
text data bss dec hex filename
3957 136 0 4093 ffd net/xfrm/xfrm_output.o
587 44 0 631 277 net/ipv4/xfrm4_mode_beet.o
649 32 0 681 2a9 net/ipv4/xfrm4_mode_tunnel.o
625 44 0 669 29d net/ipv6/xfrm6_mode_beet.o
599 32 0 631 277 net/ipv6/xfrm6_mode_tunnel.o
After:
text data bss dec hex filename
5359 184 0 5543 15a7 net/xfrm/xfrm_output.o
171 24 0 195 c3 net/ipv4/xfrm4_mode_beet.o
171 24 0 195 c3 net/ipv4/xfrm4_mode_tunnel.o
172 24 0 196 c4 net/ipv6/xfrm6_mode_beet.o
172 24 0 196 c4 net/ipv6/xfrm6_mode_tunnel.o
v2: fold the *encap_add functions into xfrm*_prepare_output
preserve (move) output2 comment (Sabrina)
use x->outer_mode->encap, not inner
fix a build breakage on ppc (kbuild robot)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
No external dependencies on any module, place this in the core.
Increase is about 1800 byte for xfrm_input.o.
The beet helpers get added to internal header, as they can be reused
from xfrm_output.c in the next patch (kernel contains several
copies of them in the xfrm{4,6}_mode_beet.c files).
Before:
text data bss dec filename
5578 176 2364 8118 net/xfrm/xfrm_input.o
1180 64 0 1244 net/ipv4/xfrm4_mode_beet.o
171 40 0 211 net/ipv4/xfrm4_mode_transport.o
1163 40 0 1203 net/ipv4/xfrm4_mode_tunnel.o
1083 52 0 1135 net/ipv6/xfrm6_mode_beet.o
172 40 0 212 net/ipv6/xfrm6_mode_ro.o
172 40 0 212 net/ipv6/xfrm6_mode_transport.o
1056 40 0 1096 net/ipv6/xfrm6_mode_tunnel.o
After:
text data bss dec filename
7373 200 2364 9937 net/xfrm/xfrm_input.o
587 44 0 631 net/ipv4/xfrm4_mode_beet.o
171 32 0 203 net/ipv4/xfrm4_mode_transport.o
649 32 0 681 net/ipv4/xfrm4_mode_tunnel.o
625 44 0 669 net/ipv6/xfrm6_mode_beet.o
172 32 0 204 net/ipv6/xfrm6_mode_ro.o
172 32 0 204 net/ipv6/xfrm6_mode_transport.o
599 32 0 631 net/ipv6/xfrm6_mode_tunnel.o
v2: pass inner_mode to xfrm_inner_mode_encap_remove to fix
AF_UNSPEC selector breakage (bisected by Benedict Wong)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
There are only two versions (tunnel and transport). The ip/ipv6 versions
are only differ in sizeof(iphdr) vs ipv6hdr.
Place this in the core and use x->outer_mode->encap type to call the
correct adjustment helper.
Before:
text data bss dec filename
15730311 6937008 4046908 26714227 vmlinux
After:
15730428 6937008 4046908 26714344 vmlinux
(about 117 byte increase)
v2: use family from x->outer_mode, not inner
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Same is input indirection. Only exception: we need to export
xfrm_outer_mode_output for pktgen.
Increases size of vmlinux by about 163 byte:
Before:
text data bss dec filename
15730208 6936948 4046908 26714064 vmlinux
After:
15730311 6937008 4046908 26714227 vmlinux
xfrm_inner_extract_output has no more external callers, make it static.
v2: add IS_ENABLED(IPV6) guard in xfrm6_prepare_output
add two missing breaks in xfrm_outer_mode_output (Sabrina Dubroca)
add WARN_ON_ONCE for 'call AF_INET6 related output function, but
CONFIG_IPV6=n' case.
make xfrm_inner_extract_output static
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
No need for any indirection or abstraction here, both functions
are pretty much the same and quite small, they also have no external
dependencies.
xfrm_prepare_input can then be made static.
With allmodconfig build, size increase of vmlinux is 25 byte:
Before:
text data bss dec filename
15730207 6936924 4046908 26714039 vmlinux
After:
15730208 6936948 4046908 26714064 vmlinux
v2: Fix INET_XFRM_MODE_TRANSPORT name in is-enabled test (Sabrina Dubroca)
change copied comment to refer to transport and network header,
not skb->{h,nh}, which don't exist anymore. (Sabrina)
make xfrm_prepare_input static (Eyal Birger)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This will be useful to know if we're supposed to decode ipv4 or ipv6.
While at it, make the unregister function return void, all module_exit
functions did just BUG(); there is never a point in doing error checks
if there is no way to handle such error.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If an xfrmi is associated to a vrf layer 3 master device,
xfrm_policy_check() fails after traffic decapsulation. The input
interface is replaced by the layer 3 master device, and hence
xfrmi_decode_session() can't match the xfrmi anymore to satisfy
policy checking.
Extend ingress xfrmi lookup to honor the original layer 3 slave
device, allowing xfrm interfaces to operate within a vrf domain.
Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In commit 6a53b75932 ("xfrm: check id proto in validate_tmpl()")
I introduced a check for xfrm protocol, but according to Herbert
IPSEC_PROTO_ANY should only be used as a wildcard for lookup, so
it should be removed from validate_tmpl().
And, IPSEC_PROTO_ANY is expected to only match 3 IPSec-specific
protocols, this is why xfrm_state_flush() could still miss
IPPROTO_ROUTING, which leads that those entries are left in
net->xfrm.state_all before exit net. Fix this by replacing
IPSEC_PROTO_ANY with zero.
This patch also extracts the check from validate_tmpl() to
xfrm_id_proto_valid() and uses it in parse_ipsecrequest().
With this, no other protocols should be added into xfrm.
Fixes: 6a53b75932 ("xfrm: check id proto in validate_tmpl()")
Reported-by: syzbot+0bf0519d6e0de15914fe@syzkaller.appspotmail.com
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Switch the timer to HRTIMER_MODE_SOFT, which executed the timer
callback in softirq context and remove the hrtimer_tasklet.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lkml.kernel.org/r/20190301224821.29843-3-bigeasy@linutronix.de
xfrm_state_put() moves struct xfrm_state to the GC list
and schedules the GC work to clean it up. On net exit call
path, xfrm_state_flush() is called to clean up and
xfrm_flush_gc() is called to wait for the GC work to complete
before exit.
However, this doesn't work because one of the ->destructor(),
ipcomp_destroy(), schedules the same GC work again inside
the GC work. It is hard to wait for such a nested async
callback. This is also why syzbot still reports the following
warning:
WARNING: CPU: 1 PID: 33 at net/ipv6/xfrm6_tunnel.c:351 xfrm6_tunnel_net_exit+0x2cb/0x500 net/ipv6/xfrm6_tunnel.c:351
...
ops_exit_list.isra.0+0xb0/0x160 net/core/net_namespace.c:153
cleanup_net+0x51d/0xb10 net/core/net_namespace.c:551
process_one_work+0xd0c/0x1ce0 kernel/workqueue.c:2153
worker_thread+0x143/0x14a0 kernel/workqueue.c:2296
kthread+0x357/0x430 kernel/kthread.c:246
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
In fact, it is perfectly fine to bypass GC and destroy xfrm_state
synchronously on net exit call path, because it is in process context
and doesn't need a work struct to do any blocking work.
This patch introduces xfrm_state_put_sync() which simply bypasses
GC, and lets its callers to decide whether to use this synchronous
version. On net exit path, xfrm_state_fini() and
xfrm6_tunnel_net_exit() use it. And, as ipcomp_destroy() itself is
blocking, it can use xfrm_state_put_sync() directly too.
Also rename xfrm_state_gc_destroy() to ___xfrm_state_destroy() to
reflect this change.
Fixes: b48c05ab5d ("xfrm: Fix warning in xfrm6_tunnel_net_exit.")
Reported-and-tested-by: syzbot+e9aebef558e3ed673934@syzkaller.appspotmail.com
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>