This is an update for 460c03f3f3 ("iplink: double the buffer size also in
iplink_get()"). After update, we will not need to double the buffer size
every time when VFs number increased.
With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
length parameter.
With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
answer to avoid overwrite data in nlh, because it may has more info after
nlh. also this will avoid nlh buffer not enough issue.
We need to free answer after using.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Commit 959f1428 ("color: add new COLOR_NONE and disable_color function")
introducing color enum COLOR_NONE, which is not only duplicite of
COLOR_CLEAR, but also caused segfault, when running ip with --color
switch, as 'attr + 8' in color_fprintf() access array item out of
bounds. Thus removing it and restoring "magic" offset + 7.
Reproduce with:
$ ip -c a
Signed-off-by: Petr Vorel <petr.vorel@gmail.com>
Commit d0e72011 ("ip: ipaddress.c: add support for json output")
introduced passing -1 as enum color_attr. This is not only wrong as no
color_attr has value -1, but also causes another segfault in color_fprintf()
on this setup as there is no item with index -1 in array of enum attr_colors[].
Using COLOR_CLEAR is valid option.
Reproduce with:
$ COLORFGBG='0;15' ip -c a
NOTE: COLORFGBG is environmental variable used for defining whether user
has light or dark background.
COLORFGBG="0;15" is used to ask for color set suitable for light background,
COLORFGBG="15;0" is used to ask for color set suitable for dark background.
Signed-off-by: Petr Vorel <petr.vorel@gmail.com>
Keep it as simple as possible for now: just escape anything that is not
isprint-able, is among the "escape" parameter or '\' as an octal escape
sequence. This should be pretty easy to extend if any other user needs
something more complex in the future.
Signed-off-by: Ivan Delalande <colona@arista.com>
iproute2 contains a bunch of kernel headers, including uapi ones.
Android's libc uses uapi headers almost directly, and uses a
script to fix kernel types that don't match what userspace
expects.
For example: https://issuetracker.google.com/36987220 reports
that our struct ip_mreq_source contains "__be32 imr_multiaddr"
rather than "struct in_addr imr_multiaddr". The script addresses
this by replacing the uapi struct definition with a #include
<bits/ip_mreq.h> which contains the traditional userspace
definition.
Unfortunately, when we compile iproute2, this definition
conflicts with the one in iproute2's linux/in.h.
Historically we've just solved this problem by running "git rm"
on all the iproute2 include/linux headers that break Android's
libc. However, deleting the files in this way makes it harder to
keep up with upstream, because every upstream change to
an include file causes a merge conflict with the delete.
This patch fixes the problem by moving the iproute2 linux headers
from include/linux to include/uapi/linux.
Tested: compiles on ubuntu trusty (glibc)
Signed-off-by: Elliott Hughes <enh@google.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
The original problem was that something like:
| strncpy(ifr.ifr_name, *argv, IFNAMSIZ);
might leave ifr.ifr_name unterminated if length of *argv exceeds
IFNAMSIZ. In order to fix this, I thought about replacing all those
cases with (equivalent) calls to snprintf() or even introducing
strlcpy(). But as Ulrich Drepper correctly pointed out when rejecting
the latter from being added to glibc, truncating a string without
notifying the user is not to be considered good practice. So let's
excercise what he suggested and reject empty, overlong or otherwise
invalid interface names right from the start - this way calls to
strncpy() like shown above become safe and the user has a chance to
reconsider what he was trying to do.
Note that this doesn't add calls to check_ifname() to all places where
user supplied interface name is parsed. In many cases, the interface
must exist already and is therefore looked up using ll_name_to_index(),
so if_nametoindex() will perform the necessary checks already.
Signed-off-by: Phil Sutter <phil@nwl.cc>
As Stephen Hemminger mentioned on the last submission the new_json_obj
function is always called with fp == stdout, so right now, there's no
need of this extra argument.
The background for the rework is the following:
The ip monitor didn't call `new_json_obj` (even for in non json context),
so the static FILE* _fp variable wasn't initialized, thus raising a
SIGSEGV in ipaddress.c. This patch should fix this issue for good, new
paths won't have to call `new_json_obj`.
How to reproduce:
$ ip -t mon label link
(gdb) bt
.#0 _IO_vfprintf_internal (s=s@entry=0x0, format=format@entry=0x45460d “%d: “, ap=ap@entry=0x7fffffff7f18) at vfprintf.c:1278
.#1 0x0000000000451310 in color_fprintf (fp=0x0, attr=<optimized out>, fmt=0x45460d “%d: “) at color.c:108
.#2 0x000000000044a856 in print_color_int (t=t@entry=PRINT_ANY, color=color@entry=4294967295, key=key@entry=0x4545fc “ifindex”,
fmt=fmt@entry=0x45460d “%d: “, value=<optimized out>) at ip_print.c:132
.#3 0x000000000040ccd2 in print_int (value=<optimized out>, fmt=0x45460d “%d: “, key=0x4545fc “ifindex”, t=PRINT_ANY) at ip_common.h:189
.#4 print_linkinfo (who=<optimized out>, n=0x7fffffffa380, arg=0x7ffff77a82a0 <_IO_2_1_stdout_>) at ipaddress.c:1107
.#5 0x0000000000422e13 in accept_msg (who=0x7fffffff8320, ctrl=0x7fffffff8310, n=0x7fffffffa380, arg=0x7ffff77a82a0 <_IO_2_1_stdout_>) at ipmonitor.c:89
.#6 0x000000000044c58f in rtnl_listen (rtnl=0x672160 <rth>, handler=handler@entry=0x422c70 <accept_msg>, jarg=0x7ffff77a82a0 <_IO_2_1_stdout_>)
at libnetlink.c:761
.#7 0x00000000004233db in do_ipmonitor (argc=<optimized out>, argv=0x7fffffffe5a0) at ipmonitor.c:310
.#8 0x0000000000408f74 in do_cmd (argv0=0x7fffffffe7f5 “mon”, argc=3, argv=0x7fffffffe588) at ip.c:116
.#9 0x0000000000408a94 in main (argc=4, argv=0x7fffffffe580) at ip.c:311
Fixes: 6377572f ("ip: ip_print: add new API to print JSON or regular format output")
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Julien Fortin <julien@cumulusnetworks.com>
Move the json printer which is based on json writer into the
iproute2 library, so it can be used by library code and tools
other than ip. Should probably have been done from the beginning
like that given json writer is in the library already anyway.
No functional changes.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Julien Fortin <julien@cumulusnetworks.com>
Consolidate dump of prog info to use bpf_dump_prog_info() when possible.
Moving forward, we want to have a consistent output for BPF progs when
being dumped. E.g. in cls/act case we used to dump tag as a separate
netlink attribute before we had BPF_OBJ_GET_INFO_BY_FD bpf(2) command.
Move dumping tag into bpf_dump_prog_info() as well, and only dump the
netlink attribute for older kernels. Also, reuse bpf_dump_prog_info()
for XDP case, so we can dump tag and whether program was jited, which
we currently don't show.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
By making use of strncpy(), both implementations are really simple so
there is no need to add libbsd as additional dependency.
Signed-off-by: Phil Sutter <phil@nwl.cc>
RDMA devices are cross-functional devices from one side,
but very tailored for the specific markets from another.
Such diversity caused to spread of RDMA related configuration
across various tools, e.g. devlink, ip, ethtool, ib specific and
vendor specific solutions.
This patch adds ability to fill device and port information
by reading RDMA netlink.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
BIT() macro was implemented and used by devlink for now, but following
patches of rdmatool will reuse the same macro, so put it in common
header file.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Add support for extended ack error reporting via libmnl.
Add a new function rtnl_talk_extack that takes a callback as an input
arg. If a netlink response contains extack attributes, the callback is
is invoked with the the err string, offset in the message and a pointer
to the message returned by the kernel.
If iproute2 is built without libmnl, it will still work but
extended error reports from kernel will not be available.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Make use of TCA_BPF_ID/TCA_ACT_BPF_ID that we exposed and print the ID
of the programs loaded and use the new BPF_OBJ_GET_INFO_BY_FD command
for dumping further information about the program, currently whether
the attached program is jited.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add support for map in map in the loader and add a small example program.
The outer map uses inner_id to reference a bpf_elf_map with a given ID
as the inner type. Loading maps is done in three passes, i) all non-map
in map maps are loaded, ii) all map in map maps are loaded based on the
inner_id map spec of a non-map in map with corresponding id, and iii)
related inner maps are attached to the map in map with given inner_idx
key. Pinned objetcs are assumed to be managed externally, so they are
only retrieved from BPF fs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Distinguish between externally learned vs offloaded FDBs. This is done
in order to indicate that FDBs added by software was successfully
offloaded.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Since commit a8f820a380a2a06 ('can: add Virtual CAN Tunnel driver (vxcan)')
for Linux 4.12 a virtual CAN tunnel driver analogue to veth is available in
Linux.
This patch adds the ability to create vxcan device pairs.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
ipaddr_list_flush_or_save generates a list of nlmsg's for links and
optionally for addresses. Move the code into ip_linkaddr_list and
export it along with the supporting infrastructure.
API to use this function is:
struct nlmsg_chain linfo = { NULL, NULL};
struct nlmsg_chain ainfo = { NULL, NULL};
ip_linkaddr_list(family, filter_req, &linfo, &ainfo);
... error checking and code looping over linfo/ainfo ...
free_nlmsg_chain(&linfo);
free_nlmsg_chain(&ainfo);
Signed-off-by: David Ahern <dsahern@gmail.com>
Kernel now supports up to 30 labels but not defined as part of the uapi.
iproute2 handles up to 8 labels but in a non-consistent way. Update ip
to handle more labels, but in a more programmatic way.
For the MPLS address family, the data field in inet_prefix is used for
labels. Increase that field to 64 u32's -- 64 as nothing more than a
convenient power of 2 number.
Update mpls_pton to take the length of the address field, convert that
length to number of labels and add better error handling to the parsing
of the user supplied string.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
'ip vrf pids' is used to list processes bound to a vrf, but it only
shows the pid leaving a lot of work for the user. Add the command
name to the output. With this patch you get the more user friendly:
$ ip vrf pids mgmt
1121 ntpd
1418 gdm-session-wor
1488 gnome-session
1491 dbus-launch
1492 dbus-daemon
1565 sshd
...
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
'ip vrf pids' is used to list processes bound to a vrf, but it only
shows the pid leaving a lot of work for the user. Add the command
name to the output. With this patch you get the more user friendly:
$ ip vrf pids mgmt
1121 ntpd
1418 gdm-session-wor
1488 gnome-session
1491 dbus-launch
1492 dbus-daemon
1565 sshd
...
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Allow callers of the dump API to handle nlmsg errors (e.g., an
unsupported feature). Setting RTNL_HANDLE_F_SUPPRESS_NLERR in the
rtnl_handle avoids unnecessary messages to the users in some case.
For example,
RTNETLINK answers: Operation not supported
when probing for support of a new feature.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
COLORFGBG environment variable is used to detect dark background.
Idea and a bit of code is borrowed from Vim, thanks.
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The sample tc action allows sampling packets matching a classifier. It
peeks randomly packets, and samples them using the psample netlink
channel. The user can specify the psample group, which the packet will be
sampled to, the sampling rate and the packet truncation (to save
kernel-user traffic).
The sampled packets contain informative metadata, for example, the input
interface and the original packet length.
The action syntax:
tc filter add [...] \
action sample rate <RATE> group <GROUP> [trunc <SIZE>]
[...]
Where:
RATE := The sampling rate which is the ratio of packets observed at the
data source to the samples generated
GROUP := the psample module sampling group
SIZE := optional truncation size
An example for a common usecase of the sample tc action: to sample ingress
traffic from interface eth1, one may use the commands:
tc qdisc add dev eth1 handle ffff: ingress
tc filter add dev eth1 parent ffff: \
matchall action sample rate 12 group 4
Where the first command adds an ingress qdisc and the second starts
sampling randomly with an average of one sampled packet per 12 packets
on dev eth1 to psample group 4.
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
In order to ensure no backward/forward compatiablity problems,
make sure that all kernel headers used come from the local copy.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
iplink_vrf has 2 functions used to validate a user given device name is
a VRF device and to return the table id. If the user string is not a
device name ip commands with a vrf keyword show a confusing error
message: "RTNETLINK answers: No such device".
Add a variant of rtnl_talk that does not display the "RTNETLINK answers"
message and update iplink_vrf to use it.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Add make_path to recursively call mkdir as needed to create a given
path with the given mode.
Add find_cgroup2_mount to lookup path where cgroup2 is mounted. If it
is not already mounted, cgroup2 is mounted under /var/run/cgroup2 for
use by iproute2.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Based on version in kernel repo, samples/bpf/libbpf.h
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Adds support to configure BPF programs as nexthop actions via the LWT
framework.
Example:
ip route add 192.168.253.2/32 \
encap bpf out obj lwt_len_hist_kern.o section len_hist \
dev veth0
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Now that we made the BPF loader generic as a library, reuse it
for loading XDP programs as well. This basically adds a minimal
start of a facility for iproute2 to load XDP programs. There
currently only exists the xdp1_user.c sample code in the kernel
tree that sets up netlink directly and an iovisor/bcc front-end.
Since we have all the necessary infrastructure in place already
from tc side, we can just reuse its loader back-end and thus
facilitate migration and usability among the two for people
familiar with tc/bpf already. Sharing maps, performing tail calls,
etc works the same way as with tc. Naturally, once kernel
configuration API evolves, we will extend new features for XDP
here as well, resp. extend dumping of related netlink attributes.
Minimal example:
clang -target bpf -O2 -Wall -c prog.c -o prog.o
ip [-force] link set dev em1 xdp obj prog.o # attaching
ip [-d] link # dumping
ip link set dev em1 xdp off # detaching
For the dump, intention is that in the first line for each ip
link entry, we'll see "xdp" to indicate that this device has an
XDP program attached. Once we dump some more useful information
via netlink (digest, etc), idea is that 'ip -d link' will then
display additional relevant program information below the "link/
ether [...]" output line for such devices, for example.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
This action could be used before redirecting packets to a shared tunnel
device, or when redirecting packets arriving from a such a device.
The 'unset' action is optional. It is used to explicitly unset the
metadata created by the tunnel device during decap. If not used, the
metadata will be released automatically by the kernel.
The 'set' operation, will set the metadata with the specified values for
the encap.
For example, the following flower filter will forward all ICMP packets
destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before
redirecting, a metadata for the vxlan tunnel is created using the
tunnel_key action and it's arguments:
$ tc filter add dev net0 protocol ip parent ffff: \
flower \
ip_proto 1 \
dst_ip 11.11.11.2 \
action tunnel_key set \
src_ip 11.11.0.1 \
dst_ip 11.11.0.2 \
id 11 \
action mirred egress redirect dev vxlan0
Signed-off-by: Amir Vadai <amir@vadai.me>
This is needed for some HWs to do proper macthing and steering.
Possible values are none, link, network, transport.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
This work moves the bpf loader into the iproute2 library and reworks
the tc specific parts into generic code. It's useful as we can then
more easily support new program types by just having the same ELF
loader backend. Joint work with Thomas Graf. I hacked a rough start
of a test suite to make sure nothing breaks [1] and looks all good.
[1] https://github.com/borkmann/clsact/blob/master/test_bpf.sh
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Adjusting iproute2 utility to support new macvlan link type mode called
"source".
Example of commands that can be applied:
ip link add link eth0 name macvlan0 type macvlan mode source
ip link set link dev macvlan0 type macvlan macaddr add 00:11:11:11:11:11
ip link set link dev macvlan0 type macvlan macaddr del 00:11:11:11:11:11
ip link set link dev macvlan0 type macvlan macaddr flush
ip -details link show dev macvlan0
Based on previous work of Stefan Gula <steweg@gmail.com>
Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
Cc: steweg@gmail.com
If we have multicast routes and do ip route show table all we'll get the
following output:
...
multicast ???/32 from ???/32 table default proto static iif eth0
The "???" are because the rtm_family is set to RTNL_FAMILY_IPMR instead
(or RTNL_FAMILY_IP6MR for ipv6). Add a simple workaround that returns the
real family based on the rtm_type (always RTN_MULTICAST for ipmr routes)
and the rtm_family. Similar workaround is already used in ipmroute, and
we can use this helper there as well.
After the patch the output is:
multicast 239.10.10.10/32 from 0.0.0.0/32 table default proto static iif eth0
Also fix a minor whitespace error and switch to tabs.
Reported-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
This patch adds support for the stats argument to the bridge
vlan command which will display the per-vlan statistics and the device
each vlan belongs to with its flags. The supported command filtering
options are dev and vid. Also the man page is updated to explain the new
option.
The patch uses the new RTM_GETSTATS interface with a filter_mask to dump
all bridges and ports vlans. Later we can add support for using the
per-device dump and filter it in the kernel instead.
Example:
$ bridge -s vlan show
port vlan id
br0 1 Egress Untagged
RX: 2536 bytes 20 packets
TX: 2536 bytes 20 packets
101
RX: 43158 bytes 50 packets
TX: 43158 bytes 50 packets
eth1 1 Egress Untagged
RX: 2536 bytes 20 packets
TX: 2536 bytes 20 packets
100
RX: 0 bytes 0 packets
TX: 0 bytes 0 packets
101
RX: 43158 bytes 50 packets
TX: 43158 bytes 50 packets
102
RX: 16897 bytes 93 packets
TX: 0 bytes 0 packets
The format is the same as bridge vlan show but with stats, even though
under the hood the calls done to the kernel are different.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
All users of genl have the same code to open a genl socket and resolve
the family for their specific protocol. Introduce a helper to initialize
the handle, and use it in all the genl code.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Add two NLA's that allow configuration of Infiniband node or port GUIDs
by referencing the IPoIB net device set over the physical function. The
format to be used is as follows:
ip link set dev ib0 vf 0 node_guid 00:02:c9:03:00:21:6e:70
ip link set dev ib0 vf 0 port_guid 00:02:c9:03:00:21:6e:78
Signed-off-by: Eli Cohen <eli@mellanox.com>
Kernel gained support for filtering link dumps with commit dc599f76c22b
("net: Add support for filtering link dump by master device and kind").
Add support to ip link command. If a user passes master device or
kind to ip link command they are added to the link dump request message.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Use kernel shared buffer occupancy control commands to make snapshot and
clear occupancy watermarks. Also, allow to show occupancy values in a
nice way.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Implement kernel devlink shared buffer interface. Introduce new object
"sb" and allow to browse the shared buffer parameters and also change
configuration.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Follow-up to kernel commit 6c9059817432 ("bpf: pre-allocate hash map
elements"). Add flags support, so that we can pass in BPF_F_NO_PREALLOC
flag for disallowing preallocation. Update examples accordingly and also
remove the BPF_* map helper macros from them as they were not very useful.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add new signatures for BPF_FUNC_csum_diff, BPF_FUNC_skb_get_tunnel_opt
and BPF_FUNC_skb_set_tunnel_opt.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There is only a single user who needs it to be reentrant (not really,
but it's safer like this), add rt_addr_n2a_r() for it to use.
Signed-off-by: Phil Sutter <phil@nwl.cc>
There are only three users which require it to be reentrant, the rest is
fine without. Instead, provide a reentrant format_host_r() for users
which need it.
Signed-off-by: Phil Sutter <phil@nwl.cc>
This adds two helper functions which map a given data field to a color,
so color_fprintf() statements don't have to be duplicated with only a
different color value depending on that data field's value. In order for
this to work in a generic way, COLOR_CLEAR has been added to serve as a
fallback default of uncolored output.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Netlink already provides hello_timer, tcn_timer, topology_change_timer
and gc_timer, so let's make them visible.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
This enables a user to remove an offline peer from the kernel data
structures. This could for example be useful when deliberately scaling
in peer nodes in a cloud environment.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
This enables a user to remove an offline peer from the kernel data
structures. This could for example be useful when deliberately scaling
in peer nodes in a cloud environment.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Fix a whitespace in bpf_dump_error() usage, and also a missing closing
bracket in ntohl() macro for eBPF programs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch:
- Adds a utility function for parsing a 64 bit address
- Adds a utility function for converting a 64 bit address to ASCII
- Adds and ILA encap type in lwt tunnels
Signed-off-by: Tom Herbert <tom@herbertland.com>
Improve example files further and add a more generic set of possible
helpers for them that can be used.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
The recently introduced object pinning can be further extended in order
to allow sharing maps beyond tc namespace. F.e. maps that are being pinned
from tracing side, can be accessed through this facility as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
This larger work addresses one of the bigger remaining issues on
tc's eBPF frontend, that is, to allow for persistent file descriptors.
Whenever tc parses the ELF object, extracts and loads maps into the
kernel, these file descriptors will be out of reach after the tc
instance exits.
Meaning, for simple (unnested) programs which contain one or
multiple maps, the kernel holds a reference, and they will live
on inside the kernel until the program holding them is unloaded,
but they will be out of reach for user space, even worse with
(also multiple nested) tail calls.
For this issue, we introduced the concept of an agent that can
receive the set of file descriptors from the tc instance creating
them, in order to be able to further inspect/update map data for
a specific use case. However, while that is more tied towards
specific applications, it still doesn't easily allow for sharing
maps accross multiple tc instances and would require a daemon to
be running in the background. F.e. when a map should be shared by
two eBPF programs, one attached to ingress, one to egress, this
currently doesn't work with the tc frontend.
This work solves exactly that, i.e. if requested, maps can now be
_arbitrarily_ shared between object files (PIN_GLOBAL_NS) or within
a single object (but various program sections, PIN_OBJECT_NS) without
"loosing" the file descriptor set. To make that happen, we use eBPF
object pinning introduced in kernel commit b2197755b263 ("bpf: add
support for persistent maps/progs") for exactly this purpose.
The shipped examples/bpf/bpf_shared.c code from this patch can be
easily applied, for instance, as:
- classifier-classifier shared:
tc filter add dev foo parent 1: bpf obj shared.o sec egress
tc filter add dev foo parent ffff: bpf obj shared.o sec ingress
- classifier-action shared (here: late binding to a dummy classifier):
tc actions add action bpf obj shared.o sec egress pass index 42
tc filter add dev foo parent ffff: bpf obj shared.o sec ingress
tc filter add dev foo parent 1: bpf bytecode '1,6 0 0 4294967295,' \
action bpf index 42
The toy example increments a shared counter on egress and dumps its
value on ingress (if no sharing (PIN_NONE) would have been chosen,
map value is 0, of course, due to the two map instances being created):
[...]
<idle>-0 [002] ..s. 38264.788234: : map val: 4
<idle>-0 [002] ..s. 38264.788919: : map val: 4
<idle>-0 [002] ..s. 38264.789599: : map val: 5
[...]
... thus if both sections reference the pinned map(s) in question,
tc will take care of fetching the appropriate file descriptor.
The patch has been tested extensively on both, classifier and
action sides.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If get_rt_realms() fails, try to get a possible raw u32 realms
value for the u32 RTA_FLOW/FRA_FLOW attribute, as it might be
useful to directly configure the hex value itself. And only if
that fails, then bail out.
The source realm is provided in the upper u16 (mask: 0xffff0000)
and the destination realm through the lower u16 part (mask:
0x0000ffff). This can be useful for tc's bpf realm matcher, but
also a full hex/mask param can be provided already for matching
through iptables' --realm cmdline option, for example.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds support to parse and print lwtunnel
encapsulation attributes attached to routes for MPLS
and IP tunnels.
example:
Add ipv4 route with mpls encap attributes:
Examples:
MPLS:
$ ip route add 40.1.2.0/30 encap mpls 200 via inet 40.1.1.1 dev eth3
$ ip route show
40.1.2.0/30 encap mpls 200 via 40.1.1.1 dev eth3
Add ipv4 multipath route with mpls encap attributes:
$ ip route add 10.1.1.0/30 nexthop encap mpls 200 via 10.1.1.1 dev eth0 \
nexthop encap mpls 700 via 40.1.1.2 dev eth3
$ ip route show
10.1.1.0/30
nexthop encap mpls 200 via 10.1.1.1 dev eth0 weight 1
nexthop encap mpls 700 via 40.1.1.2 dev eth3 weight 1
IP:
$ ip route add 10.1.1.1/24 encap ip id 200 dst 20.1.1.1 dev vxlan0
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jiri Benc <jbenc@redhat.com>
This patch introduces two new api's rta_nest and rta_nest_end to
nest attributes inside a rta attribute represented by 'struct rtattr'
as required to construct a nexthop. Also adds rta_addattr* variants
for u8, u16 and u64 as needed to support encapsulation.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jiri Benc <jbenc@redhat.com>
When having optional classid, most minimal command can be sth
like:
tc filter add dev foo parent X: bpf obj prog.o
Therefore, adapt the code so that a next argument will not be
enforced as the case currently.
Also, minor cleanup on the classid, where we should rather
have used addattr32(), and add flags for exec configuration,
for example (using short notation):
tc filter add dev foo parent X: bpf da obj prog.o
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Add support for filtering neighbor dumps by master device. Kernel side
support provided by commit 21fdd092acc7. Since the feature is not
available in older kernels the user is given a warning message if the
kernel does not support the request.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
This patch follows the changes of commit 4d98ab0 ("Fix FSF address in
file headers"), fixing file headers added after it.
Signed-off-by: Phil Sutter <phil@nwl.cc>
This adds support for slightly less output than is normally provided by
'ip link show' and 'ip addr show'. This is a bit better when you have a
host with lots of interfaces. Sample output:
$ ip -br link show
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
p2p1 UP 08:00:27:ee:0b:3b <BROADCAST,MULTICAST,UP,LOWER_UP>
p7p1 UP 08:00:27:9d:62:9f <BROADCAST,MULTICAST,UP,LOWER_UP>
p8p1 DOWN 08:00:27:dc:d8:ca <NO-CARRIER,BROADCAST,MULTICAST,UP>
p9p1 UP 08:00:27:76:d9:75 <BROADCAST,MULTICAST,UP,LOWER_UP>
p7p1.100@p7p1 UP 08:00:27:9d:62:9f <BROADCAST,MULTICAST,UP,LOWER_UP>
$ ip -br -4 addr show
lo UNKNOWN 127.0.0.1/8
p2p1 UP 192.168.56.2/24
p7p1 UP 70.0.0.1/24
p8p1 DOWN 80.0.0.1/24
p9p1 UP 10.0.5.15/24
p7p1.100@p7p1 UP 200.0.0.1/24
$ ip -br -6 addr show
lo UNKNOWN ::1/128
p2p1 UP fe80::a00:27ff:feee:b3b/64
p7p1 UP 7000::1/8 fe80::a00:27ff:fe9d:629f/64
p8p1 DOWN 8000::1/8
p9p1 UP fe80::a00:27ff:fe76:d975/64
p7p1.100@p7p1 UP fe80::a00:27ff:fe9d:629f/64
$ ip -br addr show p7p1
p7p1 UP 70.0.0.1/24 7000::1/8 fe80::a00:27ff:fe9d:629f/64
v2: Now with color support!
v3: Better field width estimation (except netdev names to keep output at a
decent width) and whitespace fixup.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
There have been several instances where response from kernel
has overrun the stack buffer from the caller. Avoid future problems
by passing a size argument.
Also drop the unused peer and group arguments to rtnl_talk.
With this patch, it's now possible to listen in all netns that have an nsid
assigned into the netns where the socket is opened.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
It is hard to quickly find what you are looking for in the output of the
ip command. Color helps.
This patch adds a '-c' flag to highlight these with individual colors:
- interface name
- ip address
- mac address
- up/down state
Signed-off-by: Mathias Nyman <m.nyman@iki.fi>
Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
The warning was:
In file included from namespace.c:14:0:
../include/namespace.h: In function ‘setns’:
../include/namespace.h:37:2: warning: implicit declaration of function ‘syscall’ [-Wimplicit-function-declaration]
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Two commands are added:
- ip netns list-id
- ip monitor nsid
A cache is also added to remember the association between the iproute2 netns
name (from /var/run/netns/) and the nsid.
To avoid interfering with the rth socket, a new rtnl socket (rtnsh) is used to
get nsid (we may send rtnl request during listing on rth).
Example:
$ ip netns list-id
nsid 0 (iproute2 netns name: foo)
$ ip monitor nsid
Deleted nsid 0 (iproute2 netns name: foo)
nsid 16 (iproute2 netns name: bar)
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
This work finalizes both eBPF front-ends for the classifier and action
part in tc, it allows for custom ELF section selection, a simplified tc
command frontend (while keeping compat), reusing of common maps between
classifier and actions residing in the same object file, and exporting
of all map fds to an eBPF agent for handing off further control in user
space.
It also adds an extensive example of how eBPF can be used, and a minimal
self-contained example agent that dumps map data. The example is well
documented and hopefully provides a good starting point into programming
cls_bpf and act_bpf.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
If '-nm' specified that do not fail if there is no
default class names file in /etc/iproute2.
Changed default class name file cls_names -> tc_cls.
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
- Pull in the uapi mpls.h
- Update rtnetlink.h to include the mpls rtnetlink notification multicast group.
- Define AF_MPLS in utils.h if it is not defined from elsewhere
as is done with AF_DECnet
The address syntax for multiple mpls labels is a complete invention.
When I looked there seemed to be no wide spread convention for talking
about an mpls label stack in text for. Sometimes people did:
"{ Label1, Label2, Label3 }", sometimes people would do:
"[ label3, label2, label1 ]", and most of the time label
stacks were not explicitly shown at all.
The syntax I wound up using, so it would not have spaces and so it
would visually distinct from other kinds of addresses is.
label1/label2/label3 Where label1 is the label at the top of the label
stack and label3 is the label at the bottom on the label stack.
When there is a single label this matches what seems to be convention
with other tools. Just print out the numeric value of the mpls label.
The netlink protocol for labels uses the on the wire format for a
label stack. The ttl and traffic class are expected to be 0. Using
the on the wire format is common and what happens with other address
types. BGP when passing label stacks also uses this technique with the
exception that the ttl byte is not included making each label in a BGP
label stack 3 bytes instead of 4.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add support for the RTA_VIA attribute that specifies an address family
as well as an address for the next hop gateway.
To make it easy to pass this reorder inet_prefix so that it's tail
is a proper RTA_VIA attribute.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add the functions family_name and read_family to convert an address
family to a string and to convernt a string to an address family.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
For some address families (like AF_PACKET) it is helpful to have the
length when prenting the address.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This work adds the tc frontend for kernel commit e2e9b6541dd4 ("cls_bpf:
add initial eBPF support for programmable classifiers").
A C-like classifier program (f.e. see e2e9b6541dd4) is being compiled via
LLVM's eBPF backend into an ELF file, that is then being passed to tc. tc
then loads, if any, eBPF maps and eBPF opcodes (with fixed-up eBPF map file
descriptors) out of its dedicated sections, and via bpf(2) into the kernel
and then the resulting fd via netlink down to cls_bpf. cls_bpf allows for
annotations, currently, I've used the file name for that, so that the user
can easily identify his filter when dumping configurations back.
Example usage:
clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
tc filter add dev em1 parent 1: bpf run object-file cls.o classid x:y
tc filter show dev em1 [...]
filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid x:y cls.o
I placed the parser bits derived from Alexei's kernel sample, into tc_bpf.c
as my next step is to also add the same support for BPF action, so we can
have a fully fledged eBPF classifier and action in tc.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
The kernel now provides ids for peer netns. This patch implements a new command
'set' to assign an id.
When netns are listed, if an id is assigned, it is now displayed.
Example:
$ ip netns add foo
$ ip netns set foo 1
$ ip netns
foo (id: 1)
init_net
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
This change allows to exec some cmd on each
named netns (except default) by specifying '-all' option:
# ip -all netns exec ip link
Each command executes synchronously.
Exit status is not considered, so there might be a case
that some CMD can fail on some netns but success on the other.
EXAMPLES:
1) Show link info on all netns:
$ ip -all netns exec ip link
netns: test_net
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 500
link/ether 1a:19:6f:25:eb:85 brd ff:ff:ff:ff:ff:ff
netns: home0
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 500
link/ether ea:1a:59:40:d3:29 brd ff:ff:ff:ff:ff:ff
netns: lan0
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 500
link/ether ce:49:d5:46:81:ea brd ff:ff:ff:ff:ff:ff
2) Set UP tap0 device for the all netns:
$ ip -all netns exec ip link set dev tap0 up
netns: test_net
netns: home0
netns: lan0
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
When HAVE_SETNS is not set, iproute2 provides a local implementation of this
function based on __NR_setns.
This macro is defined in sys/syscall.h, which was not included, thus the local
implementation always returned -1.
CC: Vadim Kochan <vadim4j@gmail.com>
Fixes: eb67e4498a ("lib: Add netns_switch func for change network namespace")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Warning was:
In file included from bridge.c:16:0:
../include/namespace.h:33:12: warning: ‘setns’ defined but not used [-Wunused-function]
CC: Vadim Kochan <vadim4j@gmail.com>
Fixes: eb67e4498a ("lib: Add netns_switch func for change network namespace")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Sometimes, it is more convenient to get only one specific nested attribute by
type. For example for IFLA_AF_SPEC where type is address family (AF_INET6).
So add this helper for this purpose.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
New netns_switch func moved to the lib/namespace.c from ip/ipnetns.c
so it can be used from the other tools for fast switching
network namespace.
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Added another timestamp format to look like more logging info:
[2014-12-22T22:36:50.489 ] 2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default
link/ether 3c:97:0e:a3:86:2e brd ff:ff:ff:ff:ff:ff
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Replaced handling netlink messages by rtnl_dump_filter
from lib/libnetlink.c, also:
- removed unused dump_fp arg;
- added MAGIC_SEQ #define for 123456 seq id;
- silently exit if ENOENT errno is caused for NETLINK_SOCK_DIAG proto
in lib/libnetlink.c: rtnl_duml_filter_l(...) function. This fix
was added in a3fd8e58c1 by Eric
for misc/ss.c
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
There were only few Netlink protocol names
which were printed on the screen:
rtnl, fw, tcpdiag
So added the ability to identify Netlink proto name
from /etc/iproute/nl_protos or from static table.
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Added 'ip fou...' commands to enable/disable UDP ports for doing
foo-over-udp and Generic UDP Encapsulation variant. Arguments are port
number to bind to and IP protocol to map to port (for direct FOU).
Examples:
ip fou add port 7777 gue
ip fou add port 8888 ipproto 4
The first command creates a GUE port, the second creates a direct FOU
port for IPIP (receive payload is a assumed to be an IPv4 packet).
Signed-off-by: Tom Herbert <therbert@google.com>
The RTM_NEWLINK message accepts ifi_index non-zero value and lets
creation of links with given index (if it's free, or course). This
functionality is available since linux-v3.5.
This patch makes this API available via ip tool.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Use warn_unused_result to enforce checking return value of rtnl_send,
and fix where the errors are.
Suggested by initial patch from Petr Písař <ppisar@redhat.com>
The kernel already supports it, so add the support
to iproute2 as well.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
This change adds the interface group to the output of "ip link show".
It also makes "ip link" print _all_ devices if no group filter is specified;
previously, only interfaces of the default group (0) were shown.
Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
ss -i can output "fastopen" attribute if socket used Fast Open
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>