4349 Commits

Author SHA1 Message Date
Stephen Hemminger
40443f49b3 ip: convert monitor to switch
The decoding of netlink message types is natural for a C
switch statement.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-08-16 09:49:00 -07:00
David Ahern
db71144c0c Merge branch 'iproute2-master' into iproute2-next
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-15 14:32:10 -07:00
Phil Sutter
d67eb4fbf8 testsuite: Add a first ss test validating ssfilter
This tests a few ssfilter expressions by selecting sockets from a TCP
dump file. The dump was created using the following command:

| ss -ntaD testsuite/tests/ss/ss1.dump

It is fed into ss via TCPDIAG_FILE environment variable.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-08-15 14:25:18 -07:00
Phil Sutter
744bd07662 testsuite: Prepare for ss tests
This merges the shared bits from ts_tc() and ts_ip() into a common
function that both now wrap, and adds a third wrapper, ts_ss(), for
testing ss commands.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-08-15 14:25:18 -07:00
Phil Sutter
38d209ecf2 ss: Review ssfilter
The original problem was ssfilter rejecting single expressions if
enclosed in braces, such as:

| sport = 22 or ( dport = 22 )

This is fixed by allowing 'expr' to be an 'exprlist' enclosed in braces.
The no longer required recursion in 'exprlist' being an 'exprlist'
enclosed in braces is dropped.

In addition to that, a few other things are changed:

* Remove pointless 'null' prefix in 'applet' before 'exprlist'.
* For simple equals matches, '=' operator was required for ports but not
  allowed for hosts. Make this consistent by making '=' operator
  optional in both cases.
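
The grammar change can be pictured with a small recursive-descent sketch in
Python. This is not the yacc grammar from ssfilter.y. It is a hypothetical
model that only knows 'sport' and 'dport', treats 'and'/'or' with equal
precedence, and uses made-up function names. It shows the two points above:
an 'expr' may be a parenthesised 'exprlist', and '=' is optional.

```python
import re

TOKEN = re.compile(r"\s*(\(|\)|=|[A-Za-z]+|\d+)")


def tokenize(text):
    """Split a filter string into parentheses, '=', words and numbers."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expr(tokens):
    """expr: '(' exprlist ')' | ('sport' | 'dport') ['='] NUMBER"""
    if not tokens:
        raise SyntaxError("unexpected end of filter")
    if tokens[0] == "(":
        inner, tokens = parse_exprlist(tokens[1:])
        if not tokens or tokens[0] != ")":
            raise SyntaxError("missing ')'")
        return inner, tokens[1:]
    if tokens[0] in ("sport", "dport"):
        field, tokens = tokens[0], tokens[1:]
        if tokens and tokens[0] == "=":      # the '=' operator is optional
            tokens = tokens[1:]
        if not tokens or not tokens[0].isdigit():
            raise SyntaxError("expected a port number")
        return (field, int(tokens[0])), tokens[1:]
    raise SyntaxError(f"unexpected token {tokens[0]!r}")


def parse_exprlist(tokens):
    """exprlist: expr | exprlist ('and' | 'or') expr"""
    left, tokens = parse_expr(tokens)
    while tokens and tokens[0] in ("and", "or"):
        op = tokens[0]
        right, tokens = parse_expr(tokens[1:])
        left = (op, left, right)
    return left, tokens


def parse(text):
    tree, rest = parse_exprlist(tokenize(text))
    if rest:
        raise SyntaxError(f"unexpected token {rest[0]!r}")
    return tree


if __name__ == "__main__":
    print(parse("sport = 22 or ( dport = 22 )"))
```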

Reported-by: Samuel Mannehed <samuel@cendio.se>
Fixes: b2038cc0b2 ("ssfilter: Eliminate shift/reduce conflicts")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-08-15 14:25:18 -07:00
Phil Sutter
8a03a2f36f man: ip-route: Clarify referenced versions are Linux ones
The versioning schemes of Linux and iproute2 are similar, so the
referenced kernel versions are likely to confuse readers. Clarify this
by prefixing each kernel version with 'Linux'.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-08-15 14:23:48 -07:00
Stephen Hemminger
a84639fcb2 Merge branch 'master' of git://git.kernel.org/pub/scm/network/iproute2/iproute2-next 2018-08-15 14:21:45 -07:00
David Ahern
ea9f9b910e Merge branch 'iproute2-master' into iproute2-next
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-15 09:56:30 -07:00
Phil Sutter
4d82962ccc Merge common code for conditionally colored output
Instead of calling enable_color() conditionally with identical check in
three places, introduce check_enable_color() which does it in one place.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-15 09:55:27 -07:00
Phil Sutter
5332148deb bridge: Fix check for colored output
There is no point in calling enable_color() conditionally if it was
already called each time the '-color' flag was parsed. Align the
algorithm with that in ip and tc by actually making use of the 'color'
variable.

Fixes: e9625d6aea ("Merge branch 'iproute2-master' into iproute2-next")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-15 09:54:51 -07:00
Phil Sutter
0d0e0e0bef tc: Fix typo in check for colored output
The check used binary instead of boolean AND, which means colored output
was enabled only if the number of specified '-color' flags was odd.
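
To see why the parity matters, here is a minimal Python sketch of the two
checks. It assumes, as the message implies, that the first operand counts
the '-color' flags seen and that the second is 1 when output goes to a
terminal. The function names are illustrative and not taken from tc.

```python
def color_enabled_bitwise(color_flags: int, is_tty: int) -> bool:
    """Buggy form: bitwise AND keeps only the bits both operands share."""
    return bool(color_flags & is_tty)


def color_enabled_logical(color_flags: int, is_tty: int) -> bool:
    """Fixed form: true whenever both operands are non-zero."""
    return bool(color_flags and is_tty)


if __name__ == "__main__":
    # With is_tty == 1 the bitwise form only sees the lowest bit of the count.
    for flags in range(4):
        print(flags,
              color_enabled_bitwise(flags, 1),
              color_enabled_logical(flags, 1))
```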

Fixes: 2d165c0811 ("tc: implement color output")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-15 09:54:32 -07:00
Nishanth Devarajan
141b55f854 Add SKB Priority qdisc support in tc(8)
sch_skbprio is a qdisc that prioritizes packets according to their skb->priority
field. Under congestion, it drops already-enqueued lower priority packets to
make space available for higher priority packets. Skbprio was conceived as a
solution for denial-of-service defenses that need to route packets with
different priorities as a means to overcome DoS attacks.
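
The behaviour described above can be modelled in a few lines of Python.
This is a toy sketch, not the kernel implementation: the class name, the
choice to evict the most recently queued packet of the lowest priority, and
the rule that a packet is rejected when nothing of lower priority is queued
are all assumptions made for illustration.

```python
from collections import deque


class SkbprioModel:
    """Toy priority queue with a fixed packet limit (hypothetical model)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.queues = {}     # priority -> deque of packets, FIFO per priority
        self.dropped = []    # packets dropped so far, in order

    def __len__(self):
        return sum(len(q) for q in self.queues.values())

    def enqueue(self, prio: int, packet: str) -> bool:
        """Queue a packet; under congestion evict a lower-priority one."""
        if len(self) >= self.limit:
            lowest = min(self.queues, default=None)
            if lowest is None or prio <= lowest:
                self.dropped.append(packet)      # nothing lower to evict
                return False
            self.dropped.append(self.queues[lowest].pop())
            if not self.queues[lowest]:
                del self.queues[lowest]
        self.queues.setdefault(prio, deque()).append(packet)
        return True

    def dequeue(self):
        """Return the oldest packet of the highest queued priority."""
        if not self.queues:
            return None
        highest = max(self.queues)
        packet = self.queues[highest].popleft()
        if not self.queues[highest]:
            del self.queues[highest]
        return packet


if __name__ == "__main__":
    q = SkbprioModel(limit=2)
    q.enqueue(0, "a")
    q.enqueue(0, "b")
    q.enqueue(7, "c")        # queue is full, so a priority-0 packet is evicted
    print(q.dropped, q.dequeue(), q.dequeue())
```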

Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com>
Reviewed-by: Michel Machado <michel@digirati.com.br>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-14 07:06:43 -07:00
Stephen Hemminger
fa6b90904a Merge branch 'master' of git://git.kernel.org/pub/scm/network/iproute2/iproute2-next 2018-08-13 12:17:53 -07:00
Stephen Hemminger
31ad498a01 v4.18.0 2018-08-13 12:11:32 -07:00
Tobias Klauser
b3b7c2a71b tc: bpf: update list of archs with eBPF support in manpage
Update the list of architectures supporting eBPF JIT as of Linux 4.18.
Also mention the Linux version where support for a particular
architecture was introduced. Finally, reformat the list of architectures
as a bullet list in order to make it more readable.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-08-13 12:09:11 -07:00
David Ahern
c044be6b34 Merge branch 'iproute2-master' into iproute2-next
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-13 07:47:21 -07:00
Toke Høiland-Jørgensen
23a67b008a sch_cake: Make gso-splitting configurable
This patch makes sch_cake's gso/gro splitting configurable
from userspace.

To disable breaking apart superpackets in sch_cake:

tc qdisc replace dev whatever root cake no-split-gso

to enable:

tc qdisc replace dev whatever root cake split-gso

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-13 07:41:44 -07:00
Stephen Hemminger
d97e266e5d ip: show min and max mtu
Add min/max MTU to the link details

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-12 14:24:31 -07:00
David Ahern
74eb09ad56 Update kernel headers
Update kernel headers to commit
78cbac647e61 ("Merge branch 'ip-faster-in-order-IP-fragments'")

Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-12 14:23:31 -07:00
Guillaume Nault
bbc1cd0d27 l2tp: drop lns_mode
This option is never set.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-12 14:05:11 -07:00
Guillaume Nault
6022f4dd38 l2tp: drop mtu
This option can't be set by user and is never printed.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-12 14:05:11 -07:00
Guillaume Nault
99d6ff2101 l2tp: drop data_seq
This option can't be set by user and is never printed. Furthermore,
L2TP_ATTR_DATA_SEQ has always been a noop in Linux.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-12 14:05:11 -07:00
Keara Leibovitz
e8bd395508 tc: fix bugs for tcp_flags and ip_attr hex output
Fix hex output for both the ip_attr and tcp_flags print functions.

Sample usage:

$ $TC qdisc add dev lo ingress
$ $TC filter add dev lo parent ffff: prio 3 proto ip flower ip_tos 0x8/32
$ $TC filter add dev lo parent ffff: prio 5 proto ip flower ip_proto tcp \
	tcp_flags 0x909/f00

$ $TC filter show dev lo parent ffff:

filter protocol ip pref 3 flower chain 0
filter protocol ip pref 3 flower chain 0 handle 0x1
  eth_type ipv4
  ip_tos 0x8/32
  not_in_hw
filter protocol ip pref 5 flower chain 0
filter protocol ip pref 5 flower chain 0 handle 0x1
  eth_type ipv4
  ip_proto tcp
  tcp_flags 0x909/f00
  not_in_hw

$ $TC -j filter show dev lo parent ffff:

[{
    "protocol":"ip",
    "pref":3,
    "kind":"flower",
    "chain":0
},{
    "protocol":"ip",
    "pref":3,
    "kind":"flower",
    "chain":0,
    "options": {
	"handle":1,
	"keys": {
	    "eth_type":"ipv4",
	    "ip_tos":"0x8/32"
	},
	"not_in_hw":true
    }
},{
    "protocol":"ip",
    "pref":5,
    "kind":"flower",
    "chain":0
},{
    "protocol":"ip",
    "pref":5,
    "kind":"flower",
    "chain":0,
    "options": {
	"handle":1,
	"keys": {
	    "eth_type":"ipv4",
	    "ip_proto":"tcp",
	    "tcp_flags":"0x909/f00"
	},
	"not_in_hw":true
    }
}]
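
A consumer of the JSON form might pull the flower keys out as sketched
below. The Python snippet embeds a trimmed copy of the sample above rather
than running tc. The helper name is made up, and treating both halves of
"0x909/f00" as hexadecimal is an assumption suggested by the 'f00' mask.

```python
import json

# Trimmed copy of the sample output above (pref 5 entries only).
SAMPLE = """
[{"protocol":"ip","pref":5,"kind":"flower","chain":0},
 {"protocol":"ip","pref":5,"kind":"flower","chain":0,
  "options":{"handle":1,
             "keys":{"eth_type":"ipv4","ip_proto":"tcp",
                     "tcp_flags":"0x909/f00"},
             "not_in_hw":true}}]
"""


def tcp_flags_matches(text: str):
    """Return (pref, value, mask) for every filter carrying a tcp_flags key."""
    result = []
    for entry in json.loads(text):
        # Entries without "options" describe the chain only, so skip them.
        keys = entry.get("options", {}).get("keys", {})
        if "tcp_flags" in keys:
            value, mask = keys["tcp_flags"].split("/")
            result.append((entry["pref"], int(value, 16), int(mask, 16)))
    return result


if __name__ == "__main__":
    for pref, value, mask in tcp_flags_matches(SAMPLE):
        print(pref, hex(value), hex(mask))
```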

Signed-off-by: Keara Leibovitz <kleib@mojatatu.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-08-12 14:04:00 -07:00
Matteo Croce
d56c7dde9d ip link: don't stop batch processing
When 'ip link show dev DEVICE' is processed in batch mode, ip exits
and stops processing further commands.
This is because ipaddr_list_flush_or_save() calls exit() to avoid printing
the link information twice.
Replace the exit with a classic goto out instruction.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-08-08 09:24:47 -07:00
Stephen Hemminger
d66fdfda71 tc: flush after each command in batch mode
After each command flush output.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-08-08 09:23:48 -07:00
Lubomir Rintel
3655f788d3 lib/namespace: avoid double-mounting a /sys
This partly reverts 8f0807023d, bringing
back the umount(/sys) attempt.

In a LXC container we're unable to umount the sysfs instance, nor mount
a read-write one. We still are able to create a new read-only instance.

Nevertheless, it still makes sense to attempt the umount() even though
the sysfs is mounted read-only. Otherwise we may end up attempting to
mount a sysfs with the same flags as is already mounted, resulting in
an EBUSY error (meaning "Already mounted").

Perhaps this is not a very likely scenario in the real world, but we hit
it in the NetworkManager test suite, and the change makes netns_switch()
somewhat more robust. It also fixes the case when /sys wasn't mounted at all.

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-07-27 13:40:12 -07:00
Stephen Hemminger
e5faf729cb ip: show min and max mtu
Add min/max MTU to the link details

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-07-27 13:30:19 -07:00
Stephen Hemminger
c8f7a754ed ip/address: fix bracketing in help message
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-07-27 13:26:21 -07:00
David Ahern
a0bc57e1ef Merge branch 'iproute2-master' into iproute2-next
Conflicts:
	include/uapi/linux/bpf.h

Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-25 10:08:04 -07:00
Jiri Pirko
afcd06991d tc: introduce support for chain templates
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-25 10:00:28 -07:00
Eran Ben Elisha
8c7acf3a7a ip: Add violation counters to VF statistics
Extend VFs statistics by receive and transmit violation counters.

Example: "ip -s link show dev enp5s0f0"

6: enp5s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:a5:28:f0 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       2
    TX: bytes  packets  errors  dropped carrier collsns
    1406       17       0       0       0       0
    vf 0 MAC 00:00:ca:fe:ca:fe, vlan 5, spoof checking off, link-state auto, trust off, query_rss off
    RX: bytes  packets  mcast   bcast   dropped
    1666       29       14         32      0
    TX: bytes  packets   dropped
    2880       44       2412

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-25 09:59:36 -07:00
David Ahern
8b099da560 Update kernel headers
Update kernel headers to commit
aea5f654e6b7 ("net/sched: add skbprio scheduler")

Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-25 09:58:00 -07:00
Stephen Hemminger
7327f78565 rdma: uapi update ib_user_verbs.h
Merge in latest sanitized kernel header.
Put sanitized version of current ib_user_verbs.h.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-07-23 13:49:20 -07:00
Stephen Hemminger
7c16a8da6b uapi: fix tcp.h repair
Upstream define for TCP_REPAIR changed.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-07-23 13:47:22 -07:00
David Ahern
7f57c8b726 devlink: CTRL_ATTR_FAMILY_ID is a u16
CTRL_ATTR_FAMILY_ID is a u16, not a u32. Update devlink accordingly.
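
One way a width mismatch like this can show up is sketched below in Python
with the struct module. The commit does not describe the symptom, so this is
only an illustration: it assumes the u16 payload is followed by two zero
padding bytes (netlink attributes are aligned to 4 bytes), and the family id
value and helper names are invented.

```python
import struct

FAMILY_ID = 21  # arbitrary example value, not a real family id


def padded_u16_payload(value: int, byteorder: str) -> bytes:
    """Pack a u16 and pad it to the 4-byte netlink attribute alignment."""
    return struct.pack(byteorder + "H", value) + b"\x00\x00"


def read_u16(payload: bytes, byteorder: str) -> int:
    return struct.unpack_from(byteorder + "H", payload)[0]


def read_u32(payload: bytes, byteorder: str) -> int:
    return struct.unpack_from(byteorder + "I", payload)[0]


if __name__ == "__main__":
    for name, order in (("little-endian", "<"), ("big-endian", ">")):
        payload = padded_u16_payload(FAMILY_ID, order)
        # Reading 4 bytes happens to work on little-endian with zero padding,
        # but yields a shifted value on big-endian.
        print(name, read_u16(payload, order), read_u32(payload, order))
```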

Fixes: a3c4b484a1 ("add devlink tool")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-07-23 13:44:36 -07:00
David Ahern
5f9c8c6a16 Merge branch 'tc-tunnels-tos-ttl' into iproute2-next
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-20 08:59:43 -07:00
Or Gerlitz
761ec9e29f tc/flower: Add match on encapsulating tos/ttl
Add matching on tos/ttl of the IP tunnel headers.

For example, here's a decap rule that matches on the tunnel tos:

tc filter add dev vxlan_sys_4789 protocol ip parent ffff: prio 10 flower \
   enc_src_ip 192.168.10.2 enc_dst_ip 192.168.10.1 enc_key_id 100 enc_dst_port 4789 enc_tos 0x30 \
   src_mac e4:11:22:33:44:70 dst_mac e4:11:22:33:44:50  \
   action tunnel_key unset \
   action mirred egress redirect dev eth0_0

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-20 08:59:11 -07:00
Or Gerlitz
9f89b0cc0e tc/act_tunnel_key: Enable setup of tos and ttl
Allow setting tos and ttl for the tunnel.

For example, here's an encap rule that sets the tunnel tos:

tc filter add dev eth0_0 protocol ip parent ffff: prio 10 flower \
   src_mac e4:11:22:33:44:50 dst_mac e4:11:22:33:44:70 \
   action tunnel_key set src_ip 192.168.10.1 dst_ip 192.168.10.2 id 100 dst_port 4789 tos 0x30 \
   action mirred egress redirect dev vxlan_sys_4789

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-20 08:58:31 -07:00
David Ahern
204db84eb8 Update kernel headers
Update kernel headers to
a3eed83a1895 ("Merge branch 'qed-Add-support-for-phy-module-query'")

Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-20 08:57:23 -07:00
Toke Høiland-Jørgensen
77c9fbd06e q_cake: Rename autorate_ingress parameter to use dash as word separator
This is consistent with the other multi-word parameters. Also change the
JSON output to be consistent with the way it is formatted for the other
options.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-20 08:46:42 -07:00
Jesus Sanchez-Palencia
b625e36108 tc: Do not use addattr_nest_compat on mqprio and netem
Here we are partially reverting commit c14f9d92ee
"treewide: Use addattr_nest()/addattr_nest_end() to handle nested
attributes" .

As discussed in [1], changing from the 'manually' coded version that
used addattr_l() to addattr_nest_compat() wasn't functionally
equivalent, because now the messages have extra fields appended to them.

This introduced a regression since the implementation of parse_attr()
from both mqprio and netem can't handle this new message format.

Without this fix, mqprio returns an error. netem won't return an error
but its internal configuration ends up wrong.

As an example, this can be reproduced by the following commands when
this patch is not applied:

 1) mqprio
$ tc qdisc replace dev enp3s0 parent root handle 100 mqprio \
	num_tc 3 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
	queues 1@0 1@1 2@2 hw 0

RTNETLINK answers: Numerical result out of range

 2) netem
$ tc qdisc add dev enp3s0 root netem rate 5kbit 20 100 5 \
	distribution normal latency 1 1

$ tc -s qdisc

(...)
qdisc netem 8001: dev enp3s0 root refcnt 9 limit 1000 delay 0us  0us
 Sent 402 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
(...)

With this patch applied, the tc -s qdisc command above for netem instead
reads:

(...)
qdisc netem 8002: dev enp3s0 root refcnt 9 limit 1000 delay 0us  0us \
	rate 5Kbit packetoverhead 20 cellsize 100 celloverhead 5
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
(...)

[1] https://patchwork.ozlabs.org/patch/867860/#1893405

Fixes: c14f9d92ee ("treewide: Use addattr_nest()/addattr_nest_end() to handle nested attributes")
Reported-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-07-19 15:50:07 -07:00
Toke Høiland-Jørgensen
714444c0cb Add support for CAKE qdisc
sch_cake is intended to squeeze the most bandwidth and latency out of even
the slowest ISP links and routers, while presenting an API simple enough
that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Support for DSL framing types and shapers.
* Support for ack filtering.
* Extensive statistics for measuring loss, ecn markings, latency variation.

Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-19 09:23:46 -07:00
Alex Vesker
8b4fbf0bed devlink: Add support for devlink-region access
Devlink region allows access to driver defined address regions.
Each device can create its supported address regions and register
them. A device which exposes a region will allow access to it
using devlink.

This support allows reading and dumping regions snapshots as well
as presenting information such as region size and current available
snapshots.

A snapshot represents a memory image of a region taken by the driver.
If a device collects a snapshot of an address region it can be later
exposed using devlink region read or dump commands.
This functionality allows for future analyses on the snapshots.

The dump command is designed to read the full address space of a
region or of a snapshot, unlike the read command, which allows
reading only a specific section in a region/snapshot indicated by
an address and a length. Current support is for reading and dumping
a previously taken snapshot ID.
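
The difference between the two commands can be sketched in Python by
treating a snapshot as a plain byte string. The function names and the
bounds check are assumptions for illustration and say nothing about how
devlink or a driver implements the requests.

```python
def region_dump(snapshot: bytes) -> bytes:
    """Model of 'dump': return the whole snapshot."""
    return bytes(snapshot)


def region_read(snapshot: bytes, address: int, length: int) -> bytes:
    """Model of 'read': return only [address, address + length)."""
    if address < 0 or length < 0 or address + length > len(snapshot):
        raise ValueError("requested range is outside the snapshot")
    return snapshot[address:address + length]


if __name__ == "__main__":
    snap = bytes(range(16))               # stand-in for a driver snapshot
    print(region_dump(snap).hex())
    print(region_read(snap, 4, 4).hex())
```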

New commands added:
 devlink region show [ DEV/REGION ]
 devlink region delete DEV/REGION snapshot SNAPSHOT_ID
 devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
 devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ]
                                address ADDRESS length LENGTH

Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-19 09:20:15 -07:00
Qiaobin Fu
697dce7b3a net:sched: add action inheritdsfield to skbedit
The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.
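
For reference, the DS field sits in the second byte of an IPv4 header and
straddles the first two bytes of an IPv6 header. Its upper six bits are the
DSCP and its lower two are ECN. The Python sketch below only extracts those
values from raw header bytes. How the kernel maps them into skb->priority is
not shown in this message, and the helper names are invented.

```python
def ipv4_ds_field(header: bytes) -> int:
    """DS field is byte 1 of the IPv4 header (the former TOS byte)."""
    return header[1]


def ipv6_ds_field(header: bytes) -> int:
    """Traffic class: low nibble of byte 0 and high nibble of byte 1."""
    return ((header[0] & 0x0F) << 4) | (header[1] >> 4)


def split_ds(ds: int):
    """Split the DS field into (DSCP, ECN)."""
    return ds >> 2, ds & 0x03


if __name__ == "__main__":
    v4 = bytes([0x45, 0xB8])      # version 4, IHL 5, DS field 0xb8
    v6 = bytes([0x6B, 0x80])      # version 6, traffic class 0xb8
    print(split_ds(ipv4_ds_field(v4)), split_ds(ipv6_ds_field(v6)))
```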

v4:
* Make tc use netlink helper functions

v3:
* Make flag represented in JSON output as a null value

v2:
* Align the output syntax with the input syntax

* Fix the style issues

Original idea by Jamal Hadi Salim <jhs@mojatatu.com>

Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
Reviewed-by: Michel Machado <michel@digirati.com.br>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-19 09:17:56 -07:00
Mathieu Xhonneux
04cb3c0d43 ip: add support for seg6local End.BPF action
This patch adds support for the End.BPF action of the seg6local
lightweight tunnel. Functions from the BPF lightweight tunnel are
re-used in this patch. Example:

$ ip -6 route add fc00::18 encap seg6local action End.BPF endpoint
obj my_bpf.o sec my_func dev eth0

$ ip -6 route show fc00::18
fc00::18  encap seg6local action End.BPF endpoint my_bpf.o:[my_func]
dev eth0 metric 1024 pref medium

v2: - re-use of print_encap_bpf_prog instead of fprintf
    - introduction of "endpoint" keyword for more consistency with
      other parameters

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-07-18 15:56:18 -07:00
Serhey Popovych
8df708afd6 ipaddress: Fix and make consistent label match handling
Since commit 9516823051 ("ipaddress: Improve print_linkinfo()") we
return -1 instead of 0, as we did before that change, when the
ip-address(8) label does not match the network device name. This causes
a regression when trying to output an ip address matching a label:

     # ip addr add 192.168.192.1/24 dev lo label lo:1
     # ip addr show label lo:1
     <no output>

Treat this as a special case and return 0 from print_linkinfo() early,
matching only filter.ifindex and filter.up if given, but not the remaining
fields in @filter. Then call print_selected_addrinfo() without calling
print_link_stats() in ipaddr_list_flush_or_save().

Later print_selected_addrinfo() calls print_addrinfo() that finally
matches IFA_LABEL attribute in netlink buffer with filter.label using
ifa_label_match_rta().

On the other hand there are three conditions checked in print_linkinfo()
to determine the label special case:

    1) filter.label != NULL
    2) filter.family == AF_UNSPEC || filter.family == AF_PACKET
    3) fnmatch(filter.label, name, 0)

With 1) it is ok to check if filtering by label is on by given pattern
in @filter.label.

Since label is IPv4 specific and AF_PACKET is for printing ip-link(8)
information (see ipaddr_link_list()::ipaddress.c as example) checking
for AF_PACKET in 2) doesn't make much sense: better to defer these
checks to print_addrinfo() to determine valid combinations before calling
ifa_label_match_rta() to finally match IFA_LABEL to pattern in
filter.label.

For 3) we have the following call for the test case:

    fnmatch(pattern, string, flags) ->
      fnmatch(filter.label, name, 0) ->
        fnmatch("lo:1", "lo", 0) == FNM_NOMATCH (1) or non-zero on error

To support special case in print_linkinfo() for filtering by label we
only need to check if label pattern is given in filter.label and return
0 to skip print_link_stats() in ipaddr_list_flush_or_save(): actual
filtering will be done in print_addrinfo().
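
The fnmatch() results quoted here can be reproduced with Python's fnmatch
module, as a quick sanity check rather than a copy of the C code. Note that
Python takes (name, pattern) while the C call takes (pattern, string,
flags). The wrapper name is made up and keeps the C argument order.

```python
import fnmatch


def label_matches(pattern: str, name: str) -> bool:
    """True when 'name' matches the shell-style 'pattern'."""
    # fnmatchcase() expects the name first and the pattern second.
    return fnmatch.fnmatchcase(name, pattern)


if __name__ == "__main__":
    # Label pattern against the device name: the case that broke.
    print(label_matches("lo:1", "lo"))
    # Label pattern against the address label, as done later on.
    print(label_matches("lo:1", "lo:1"))
    print(label_matches("lo:*", "lo:1"))
```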

Before commit 9516823051 ("ipaddress: Improve print_linkinfo()"):
-------------------------------------------------------------------

$ ip addr sh label lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN \
group default qlen 1000
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                          fnmatch("lo", "lo", 0) == 0
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
$ ip addr show label 'lo:*'
    inet 192.168.192.1/24 scope global lo:1
       valid_lft forever preferred_lft forever
$ ip addr sh label lo:1
    inet 192.168.192.1/24 scope global lo:1
       valid_lft forever preferred_lft forever
$ ip -4 addr sh label lo:1
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN \
group default qlen 1000
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                             filter.family == AF_INET
    inet 192.168.192.1/24 scope global lo:1
       valid_lft forever preferred_lft forever

After this change applied:
--------------------------

$ ip/ip addr show label lo
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
$ ip/ip addr show label 'lo:*'
    inet 192.168.192.1/24 scope global lo:1
        valid_lft forever preferred_lft forever
$ ip/ip addr show label lo:1
    inet 192.168.192.1/24 scope global lo:1
       valid_lft forever preferred_lft forever
$ ip/ip -4 addr show label lo:1
    inet 192.168.192.1/24 scope global lo:1
       valid_lft forever preferred_lft forever

Note that we no longer show link information as we did previously:
    we are filtering by "label" pattern, not showing by "dev".

Fixes: commit 9516823051 ("ipaddress: Improve print_linkinfo()")
Reported-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2018-07-18 15:52:55 -07:00
David Ahern
b05a68f721 Merge branch 'bpf-btf' into iproute2-next
Daniel Borkmann says:

====================

Main part of this set is to: i) avoid strict af_alg kernel dependency,
ii) add loader support for bpf to bpf calls and iii) add btf loader
support with an option to annotate maps. For details please see the
individual patches. Thanks!

====================

Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-17 19:39:06 -07:00
Daniel Borkmann
f823f36012 bpf: implement btf handling and map annotation
Implement loading of .BTF section from object file and build up
internal table for retrieving key/value id related to maps in
the BPF program. Latter is done by setting up struct btf_type
table.

One of the issues is that there's a disconnect between the data
types used in the map and struct bpf_elf_map, meaning the underlying
types are unknown from the map description. One way to overcome
this is to add an annotation such that the loader will recognize
the relation to both. BPF_ANNOTATE_KV_PAIR(map_foo, struct key,
struct val); has been added to the API that programs can use.

The loader will then pick the corresponding key/value type ids and
attach it to the maps for creation. This can later on be dumped via
bpftool for introspection.

Example with test_xdp_noinline.o from kernel selftests:

  [...]

  struct ctl_value {
        union {
                __u64 value;
                __u32 ifindex;
                __u8 mac[6];
        };
  };

  struct bpf_map_def __attribute__ ((section("maps"), used)) ctl_array = {
        .type		= BPF_MAP_TYPE_ARRAY,
        .key_size	= sizeof(__u32),
        .value_size	= sizeof(struct ctl_value),
        .max_entries	= 16,
        .map_flags	= 0,
  };
  BPF_ANNOTATE_KV_PAIR(ctl_array, __u32, struct ctl_value);

  [...]

Above could also further be wrapped in a macro. Compiling through LLVM and
converting to BTF:

  # llc --version
  LLVM (http://llvm.org/):
    LLVM version 7.0.0svn
    Optimized build.
    Default target: x86_64-unknown-linux-gnu
    Host CPU: skylake

    Registered Targets:
      bpf    - BPF (host endian)
      bpfeb  - BPF (big endian)
      bpfel  - BPF (little endian)
  [...]

  # clang [...] -O2 -target bpf -g -emit-llvm -c test_xdp_noinline.c -o - |
    llc -march=bpf -mcpu=probe -mattr=dwarfris -filetype=obj -o test_xdp_noinline.o
  # pahole -J test_xdp_noinline.o

Checking pahole dump of BPF object file:

  # file test_xdp_noinline.o
  test_xdp_noinline.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), with debug_info, not stripped
  # pahole test_xdp_noinline.o
  [...]
  struct ctl_value {
	union {
		__u64              value;                /*     0     8 */
		__u32              ifindex;              /*     0     4 */
		__u8               mac[0];               /*     0     0 */
	};                                               /*     0     8 */

	/* size: 8, cachelines: 1, members: 1 */
	/* last cacheline: 8 bytes */
  };

Now loading into kernel and dumping the map via bpftool:

  # ip -force link set dev lo xdp obj test_xdp_noinline.o sec xdp-test
  # ip a
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 xdpgeneric/id:227 qdisc noqueue state UNKNOWN group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      inet 127.0.0.1/8 scope host lo
         valid_lft forever preferred_lft forever
      inet6 ::1/128 scope host
         valid_lft forever preferred_lft forever
  [...]
  # bpftool prog show id 227
  227: xdp  tag a85e060c275c5616  gpl
      loaded_at 2018-07-17T14:41:29+0000  uid 0
      xlated 8152B  not jited  memlock 12288B  map_ids 381,385,386,382,384,383
  # bpftool map dump id 386
   [{
        "key": 0,
        "value": {
            "": {
                "value": 0,
                "ifindex": 0,
                "mac": []
            }
        }
    },{
        "key": 1,
        "value": {
            "": {
                "value": 0,
                "ifindex": 0,
                "mac": []
            }
        }
    },{
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-17 19:38:44 -07:00
Daniel Borkmann
b5cb33aec6 bpf: implement bpf to bpf calls support
Implement missing bpf to bpf calls support. The loader will
recognize .text section and handle relocation entries that
are emitted by LLVM.

First step is processing of map related relocation entries
for .text section, and in a second step loader will copy .text
section into program section and adjust call instruction
offset accordingly.

Example with test_xdp_noinline.o from kernel selftests:

 1) Every function as __attribute__ ((always_inline)), rest
    left unchanged:

  # ip -force link set dev lo xdp obj test_xdp_noinline.o sec xdp-test
  # ip a
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 xdpgeneric/id:233 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
  [...]
  # bpftool prog dump xlated id 233
  [...]
  1669: (2d) if r3 > r2 goto pc+4
  1670: (79) r2 = *(u64 *)(r10 -136)
  1671: (61) r2 = *(u32 *)(r2 +0)
  1672: (63) *(u32 *)(r1 +0) = r2
  1673: (b7) r0 = 1
  1674: (95) exit        <-- 1674 insns total

 2) Every function as __attribute__ ((noinline)), rest
    left unchanged:

  # ip -force link set dev lo xdp obj test_xdp_noinline.o sec xdp-test
  # ip a
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 xdpgeneric/id:236 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
  [...]
  # bpftool prog dump xlated id 236
  [...]
  1000: (bf) r1 = r6
  1001: (b7) r2 = 24
  1002: (85) call pc+3   <-- pc-relative call insns
  1003: (1f) r7 -= r0
  1004: (bf) r0 = r7
  1005: (95) exit
  1006: (bf) r0 = r1
  1007: (bf) r1 = r2
  1008: (67) r1 <<= 32
  1009: (77) r1 >>= 32
  1010: (bf) r3 = r0
  1011: (6f) r3 <<= r1
  1012: (87) r2 = -r2
  1013: (57) r2 &= 31
  1014: (67) r0 <<= 32
  1015: (77) r0 >>= 32
  1016: (7f) r0 >>= r2
  1017: (4f) r0 |= r3
  1018: (95) exit        <-- 1018 insns total

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-17 19:38:43 -07:00
Daniel Borkmann
6e5094dbb7 bpf: remove strict dependency on af_alg
Do not bail out when AF_ALG is not supported by the kernel and
only do so when a map is requested in object ns where we're
calculating the hash. Otherwise, the loader can operate just
fine, therefore let's not fail early when it's not needed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-17 19:38:40 -07:00