mirror of
				https://git.proxmox.com/git/mirror_iproute2
				synced 2025-11-04 02:56:43 +00:00 
			
		
		
		
	sparc64 support was added in 7a12b5031c6b (sparc64: Add eBPF JIT., 2017-04-17)[0] and ppc64 in 156d0e290e96 (powerpc/ebpf/jit: Implement JIT compiler for extended BPF, 2016-06-22)[1]. [0]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7a12b5031c6b [1]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=156d0e290e96 Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
		
			
				
	
	
		
.TH "BPF classifier and actions in tc" 8 "18 May 2015" "iproute2" "Linux"
.SH NAME
BPF \- BPF programmable classifier and actions for ingress/egress
queueing disciplines
.SH SYNOPSIS
.SS eBPF classifier (filter) or action:
.B tc filter ... bpf
[
.B object-file
OBJ_FILE ] [
.B section
CLS_NAME ] [
.B export
UDS_FILE ] [
.B verbose
] [
.B skip_hw
|
.B skip_sw
] [
.B police
POLICE_SPEC ] [
.B action
ACTION_SPEC ] [
.B classid
CLASSID ]
.br
.B tc action ... bpf
[
.B object-file
OBJ_FILE ] [
.B section
CLS_NAME ] [
.B export
UDS_FILE ] [
.B verbose
]

.SS cBPF classifier (filter) or action:
.B tc filter ... bpf
[
.B bytecode-file
BPF_FILE |
.B bytecode
BPF_BYTECODE ] [
.B police
POLICE_SPEC ] [
.B action
ACTION_SPEC ] [
.B classid
CLASSID ]
.br
.B tc action ... bpf
[
.B bytecode-file
BPF_FILE |
.B bytecode
BPF_BYTECODE ]

.SH DESCRIPTION

Extended Berkeley Packet Filter (
.B eBPF
) and classic Berkeley Packet Filter
(originally known as BPF, for better distinction referred to as
.B cBPF
here) are both available as fully programmable and highly efficient
classifiers and actions. They both offer a minimal instruction set for
implementing small programs which can safely be loaded into the kernel
and thus executed in a tiny virtual machine from kernel space. An in-kernel
verifier guarantees that a specified program always terminates and neither
crashes nor leaks data from the kernel.

In Linux, eBPF is generally considered the successor of cBPF.
The kernel internally transforms cBPF expressions into eBPF expressions and
executes the latter. They can either be run in an interpreter or, at setup
time, be just-in-time compiled (JIT'ed) to run as
native machine code. Currently, x86_64, ARM64, s390, ppc64 and sparc64
architectures have eBPF JIT support, whereas PPC, SPARC, ARM and MIPS have
cBPF JIT support, but have not (yet) switched to eBPF JIT support.

eBPF's instruction set follows similar underlying principles to the cBPF
instruction set; however, it is modelled more closely on the underlying
architecture to better mimic native instruction sets with the aim of
achieving better run-time performance. It is designed to be JIT'ed with
a one to one mapping, which can also open up the possibility for compilers
to generate optimized eBPF code through an eBPF backend that performs
almost as fast as natively compiled code. Given that LLVM provides such
an eBPF backend, eBPF programs can easily be written in a
subset of the C language. Other than that, eBPF infrastructure also comes
with a construct called "maps". eBPF maps are key/value stores that are
shared between multiple eBPF programs, but also between eBPF programs and
user space applications.

For the traffic control subsystem, classifiers and actions that can be
attached to ingress and egress qdiscs can be written in eBPF or cBPF. The
advantage over other classifiers and actions is that eBPF/cBPF provides the
generic framework, while users can implement their highly specialized use
cases efficiently. This means that a classifier or action written that
way will not suffer from feature bloat, and can therefore execute its task
highly efficiently. It allows for non-linear classification and even merging
the action part into the classification. Combined with efficient eBPF map
data structures, user space can push new policies like classids into the
kernel without reloading a classifier, or it can gather statistics that
are pushed into one map and use another one for dynamically load balancing
traffic based on the determined load, just to provide a few examples.

.SH PARAMETERS
.SS object-file
points to an object file that has an executable and linkable format (ELF)
and contains eBPF opcodes and eBPF map definitions. The LLVM compiler
infrastructure with
.B clang(1)
as a C language front end is one project that supports emitting eBPF object
files that can be passed to the eBPF classifier (more details in the
.B EXAMPLES
section). This option is mandatory when an eBPF classifier or action is
to be loaded.

.SS section
is the name of the ELF section from the object file, where the eBPF
classifier or action resides. By default the section name for the
classifier is called "classifier", and for the action "action". Given
that a single object file can contain multiple classifiers and actions,
the corresponding section name needs to be specified, if it differs
from the defaults.

.SS export
points to a Unix domain socket file. In case the eBPF object file also
contains a section named "maps" with eBPF map specifications, then the
map file descriptors can be handed off via the Unix domain socket to
an eBPF "agent" herding all descriptors after tc lifetime. This can be
some third party application implementing the IPC counterpart for the
import, which uses them for calling into the
.B bpf(2)
system call to read out or update eBPF map data from user space, for
example, for monitoring purposes or to push down new policies.

.SS verbose
if set, it will dump the eBPF verifier output, even if loading the eBPF
program was successful. By default, the verifier log is only
emitted to the user on error.

.SS skip_hw | skip_sw
hardware offload control flags. By default TC will try to offload
filters to hardware if possible.
.B skip_hw
explicitly disables the attempt to offload.
.B skip_sw
forces the offload and disables running the eBPF program in the kernel.
If hardware offload is not possible and this flag is set, the kernel will
report an error and the filter will not be installed at all.

.SS police
is an optional parameter for an eBPF/cBPF classifier that specifies a
policing action in
.B tc(8)
which is attached to the classifier, for example, on an ingress qdisc.

.SS action
is an optional parameter for an eBPF/cBPF classifier that specifies a
subsequent action in
.B tc(8)
which is attached to a classifier.

.SS classid
.SS flowid
provides the default traffic control class identifier for this eBPF/cBPF
classifier. The default class identifier can also be overwritten by the
return code of the eBPF/cBPF program. A return code of
.B -1
selects the default class identifier provided here. A return
code of 0 from the eBPF/cBPF program implies that no match took place, and
a return code other than these two will override the default classid. This
allows for efficient, non-linear classification with only a single eBPF/cBPF
program as opposed to having multiple individual programs for various class
identifiers which would need to reparse packet contents.

.SS bytecode
is used for loading cBPF classifiers and actions only. The cBPF bytecode
is directly passed as a text string in the form of
.B \'s,c t f k,c t f k,c t f k,...\'
, where
.B s
denotes the number of subsequent 4-tuples. One such 4-tuple consists of
.B c t f k
decimals, where
.B c
represents the cBPF opcode,
.B t
the jump true offset target,
.B f
the jump false offset target and
.B k
the immediate constant/literal. There are various tools that generate code
in this loadable format, for example,
.B bpf_asm
that ships with the Linux kernel source tree under
.B tools/net/
, so you are not expected to write this by hand. The
.B bytecode
or
.B bytecode-file
option is mandatory when a cBPF classifier or action is to be loaded.

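To make the textual layout of the bytecode string concrete, the following
is a minimal sketch in C that serializes an instruction array into that
form. It is an illustration only: struct cbpf_insn and cbpf_format() are
hypothetical names that are not part of tc or the kernel, and the struct
merely mirrors the four fields c, t, f and k described above.

.in +4n
.nf
.sp
```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of one cBPF 4-tuple (not a tc or kernel type). */
struct cbpf_insn {
        uint16_t c;     /* opcode */
        uint8_t  t;     /* jump true offset */
        uint8_t  f;     /* jump false offset */
        uint32_t k;     /* immediate constant */
};

/*
 * Write "s,c t f k,c t f k,..." for n instructions into buf.
 * Returns the length of the string, or -1 if buf is too small.
 */
static int cbpf_format(const struct cbpf_insn *insns, size_t n,
                       char *buf, size_t len)
{
        size_t pos, i;
        int ret;

        /* Leading count of 4-tuples. */
        ret = snprintf(buf, len, "%zu", n);
        if (ret < 0 || (size_t)ret >= len)
                return -1;
        pos = (size_t)ret;

        /* One comma separated group of four decimals per instruction. */
        for (i = 0; i < n; i++) {
                ret = snprintf(buf + pos, len - pos, ",%u %u %u %u",
                               (unsigned int)insns[i].c,
                               (unsigned int)insns[i].t,
                               (unsigned int)insns[i].f,
                               (unsigned int)insns[i].k);
                if (ret < 0 || (size_t)ret >= len - pos)
                        return -1;
                pos += (size_t)ret;
        }
        return (int)pos;
}
```
.fi
.in

For a single instruction with c=6, t=0, f=0 and k=4294967295, this sketch
produces the string 1,6 0 0 4294967295. In practice a generator such as
.B bpf_asm
remains the more convenient way to obtain the string.
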
.SS bytecode-file
is also used to load a cBPF classifier or action. It's effectively the
same as
.B bytecode
except that the cBPF bytecode is not passed directly via command line, but
rather resides in a text file.

.SH EXAMPLES
.SS eBPF TOOLING
A full blown example including eBPF agent code can be found inside the
iproute2 source package under:
.B examples/bpf/

As prerequisites, the kernel needs to have the eBPF system call, namely
.B bpf(2)
, enabled and ship with the
.B cls_bpf
and
.B act_bpf
kernel modules for the traffic control subsystem. To enable eBPF/cBPF JIT
support, depending on which of the two the given architecture supports:

.in +4n
.B echo 1 > /proc/sys/net/core/bpf_jit_enable
.in

A given restricted C file can be compiled via LLVM as:

.in +4n
.B clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o
.in

The compiler invocation may still be simplified in the future, so for now,
it's quite handy to alias this construct in one way or another, for
example:
.in +4n
.nf
.sp
__bcc() {
        clang -O2 -emit-llvm -c $1 -o - | \\
        llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
}

alias bcc=__bcc
.fi
.in

A minimal, stand-alone unit, which matches on all traffic with the
default classid (return code of -1) looks like:

.in +4n
.nf
.sp
#include <linux/bpf.h>

#ifndef __section
# define __section(x)  __attribute__((section(x), used))
#endif

__section("classifier") int cls_main(struct __sk_buff *skb)
{
        return -1;
}

char __license[] __section("license") = "GPL";
.fi
.in

More examples can be found further below in subsection
.B eBPF PROGRAMMING
as the focus here is on tooling.

There can be various other sections, for example, also for actions.
Thus, an object file in eBPF can contain multiple entry points.
However, a specific entry point must always be specified when
configuring with tc. A license must be part of the restricted C code
and the license string syntax is the same as with Linux kernel modules.
The kernel reserves the right to restrict some eBPF helper functions to
GPL compatible licenses only, and thus may reject a program
from loading into the kernel when such a license mismatch occurs.

The resulting object file from the compilation can be inspected with
the usual set of tools that also operate on normal object files, for
example
.B objdump(1)
for inspecting ELF section headers:

.in +4n
.nf
.sp
objdump -h bpf.o
[...]
3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                CONTENTS, ALLOC, LOAD, DATA
7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                CONTENTS, ALLOC, LOAD, DATA
[...]
.fi
.in

Adding an eBPF classifier from an object file that contains a classifier
in the default ELF section is trivial (note that shortcuts such as "obj"
can also be used instead of "object-file"):

.in +4n
.B bcc bpf.c
.br
.B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1
.in

In case the classifier resides in ELF section "mycls", then that same
command needs to be invoked as:

.in +4n
.B tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1
.in

Dumping the classifier configuration will tell the location of the
classifier, in other words that it's from object file "bpf.o" under
section "mycls":

.in +4n
.B tc filter show dev em1
.br
.B filter parent 1: protocol all pref 49152 bpf
.br
.B filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]
.in

The same program can also be installed on ingress qdisc side as opposed
to egress ...

.in +4n
.B tc qdisc add dev em1 handle ffff: ingress
.br
.B tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1
.in

\&... and again dumped from there:

.in +4n
.B tc filter show dev em1 parent ffff:
.br
.B filter protocol all pref 49152 bpf
.br
.B filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]
.in

Attaching a classifier and action on ingress has the restriction that
it doesn't have an actual underlying queueing discipline. What ingress
can do is to classify, mangle, redirect or drop packets. When queueing
is required on the ingress side, then ingress must redirect packets to the
.B ifb
device, otherwise policing can be used. Moreover, ingress can be used to
have an early drop point of unwanted packets before they hit upper layers
of the networking stack, perform network accounting with eBPF maps that
could be shared with egress, or have an early mangle and/or redirection
point to different networking devices.

Multiple eBPF actions and classifiers can be placed into a single
object file within various sections. In that case, non-default section
names must be provided, which is the case for both actions in this
example:

.in +4n
.B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \e
.br
.in +25n
.B                          action bpf obj bpf.o sec action-mark \e
.br
.B                          action bpf obj bpf.o sec action-rand ok
.in -25n
.in -4n

The advantage of this is that the classifier and the two actions can
then share eBPF maps with each other, if implemented in the programs.

In order to access eBPF maps from user space beyond
.B tc(8)
setup lifetime, the ownership can be transferred to an eBPF agent via
Unix domain sockets. There are two possibilities for implementing this:

.B 1)
implement your own eBPF agent that takes care of setting up
the Unix domain socket and implementing the protocol that
.B tc(8)
dictates. A code example of this can be found inside the iproute2
source package under:
.B examples/bpf/

.B 2)
use
.B tc exec
for transferring the eBPF map file descriptors through a Unix domain
socket, and spawning an application such as
.B sh(1)
\&. This approach's advantage is that tc will place the file descriptors
into the environment and thus make them available just like the stdin, stdout
and stderr file descriptors. This means that user applications run from within
this fd-owner shell can terminate and restart without losing the eBPF
map file descriptors. Example invocation with the previous classifier and
action mixture:

.in +4n
.B tc exec bpf imp /tmp/bpf
.br
.B tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \e
.br
.in +25n
.B                          action bpf obj bpf.o sec action-mark \e
.br
.B                          action bpf obj bpf.o sec action-rand ok
.in -25n
.in -4n

Assuming that eBPF maps are shared with classifier and actions, it's
enough to export them once, for example, from within the classifier
or action command. tc will set up all eBPF map file descriptors at the
time when the object file is first parsed.

When a shell has been spawned, the environment will have a couple of
eBPF related variables. BPF_NUM_MAPS provides the total number of maps
that have been transferred over the Unix domain socket. BPF_MAP<X>'s
value is the file descriptor number that can be accessed in eBPF agent
applications, in other words, it can directly be used as the file
descriptor value for the
.B bpf(2)
system call to retrieve or alter eBPF map values. <X> denotes the
identifier of the eBPF map. It corresponds to the
.B id
member of
.B struct bpf_elf_map
\& from the tc eBPF map specification.

The environment in this example looks as follows:

.in +4n
.nf
.sp
sh# env | grep BPF
    BPF_NUM_MAPS=3
    BPF_MAP1=6
    BPF_MAP0=5
    BPF_MAP2=7
sh# ls -la /proc/self/fd
    [...]
    lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
    lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
    lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
sh# my_bpf_agent
.fi
.in

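An agent started from such a shell first has to recover the descriptor
numbers from the BPF_MAP<X> variables. The following is a minimal sketch
of that step, assuming the environment layout shown above. The function
name bpf_map_fd_from_env() is hypothetical and not provided by iproute2.

.in +4n
.nf
.sp
```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return the file descriptor stored in the
 * environment variable BPF_MAP<id>, or -1 if the variable is missing
 * or does not hold a plain non-negative decimal number.
 */
static int bpf_map_fd_from_env(unsigned int id)
{
        char name[32];
        const char *val;
        char *end;
        long fd;

        snprintf(name, sizeof(name), "BPF_MAP%u", id);
        val = getenv(name);
        if (!val || !*val)
                return -1;

        errno = 0;
        fd = strtol(val, &end, 10);
        /* Reject overflow, trailing characters and out of range values. */
        if (errno || *end || fd < 0 || fd > 0x7fffffff)
                return -1;
        return (int)fd;
}
```
.fi
.in

With the environment from the example, bpf_map_fd_from_env(1) would
return 6. The returned number is what an agent would place into the
map file descriptor field of its
.B bpf(2)
requests.
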
eBPF agents are very useful in that they can prepopulate eBPF maps from
user space, monitor statistics via maps and based on that feedback, for
example, rewrite classids in eBPF map values during runtime. Given that eBPF
agents are implemented as normal applications, they can also dynamically
receive traffic control policies from external controllers and thus push
them down into eBPF maps to dynamically adapt to network conditions. Moreover,
eBPF maps can also be shared with other eBPF program types (e.g. tracing),
so very powerful combinations can be implemented.

.SS eBPF PROGRAMMING

eBPF classifiers and actions are implemented in restricted C syntax
(in the future, additional language frontends could be supported).

The header file
.B linux/bpf.h
provides eBPF helper functions that can be called from an eBPF program.
This man page only provides two minimal, stand-alone examples; have a
look at
.B examples/bpf
from the iproute2 source package for a fully fledged flow dissector
example to better demonstrate some of the possibilities with eBPF.

Supported 32 bit classifier return codes from the C program and their meanings:
.in +4n
.B 0
, denotes a mismatch
.br
.B -1
, denotes the default classid configured from the command line
.br
.B else
, everything else will override the default classid to provide a facility for
non-linear matching
.in

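The three classifier cases can be summarized in a few lines of ordinary C.
This is only a user space sketch of the rules listed above, not the
kernel's implementation; struct cls_result and cls_interpret() are
hypothetical names introduced for illustration.

.in +4n
.nf
.sp
```c
#include <stdint.h>

/* Hypothetical outcome of interpreting a classifier return code. */
struct cls_result {
        int      matched;       /* 0 if the program reported a mismatch */
        uint32_t classid;       /* meaningful only if matched is non-zero */
};

/*
 * 0 is a mismatch, -1 selects the default classid from the command
 * line, and any other value overrides the default classid.
 */
static struct cls_result cls_interpret(int32_t ret, uint32_t default_classid)
{
        struct cls_result res = { 0, 0 };

        if (ret == 0)
                return res;

        res.matched = 1;
        res.classid = (ret == -1) ? default_classid : (uint32_t)ret;
        return res;
}
```
.fi
.in

With a default of flowid 1:1 (0x10001), a return code of -1 yields classid
0x10001, while a return code of 0x10002 yields classid 0x10002 (1:2).
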
Supported 32 bit action return codes from the C program and their meanings (
.B linux/pkt_cls.h
):
.in +4n
.B TC_ACT_OK (0)
, terminates the packet processing pipeline and allows the packet to
proceed
.br
.B TC_ACT_SHOT (2)
, terminates the packet processing pipeline and drops the packet
.br
.B TC_ACT_UNSPEC (-1)
, uses the default action configured from tc (similar to returning
.B -1
from a classifier)
.br
.B TC_ACT_PIPE (3)
, iterates to the next action, if available
.br
.B TC_ACT_RECLASSIFY (1)
, terminates the packet processing pipeline and starts classification
from the beginning
.br
.B else
, everything else is an unspecified return code
.in

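When logging or debugging from user space, it can help to turn these
numbers back into text. The sketch below uses local constants with the
values listed above rather than including the kernel header; the ACT_*
names and act_describe() are hypothetical and exist only for this example.

.in +4n
.nf
.sp
```c
#include <stddef.h>

/* Local copies of the values listed above (see linux/pkt_cls.h). */
enum {
        ACT_UNSPEC     = -1,
        ACT_OK         = 0,
        ACT_RECLASSIFY = 1,
        ACT_SHOT       = 2,
        ACT_PIPE       = 3,
};

/* Describe an action return code; NULL means an unspecified code. */
static const char *act_describe(int code)
{
        switch (code) {
        case ACT_OK:
                return "terminate pipeline, let packet proceed";
        case ACT_SHOT:
                return "terminate pipeline, drop packet";
        case ACT_UNSPEC:
                return "use default action configured from tc";
        case ACT_PIPE:
                return "continue with next action, if available";
        case ACT_RECLASSIFY:
                return "terminate pipeline, restart classification";
        default:
                return NULL;
        }
}
```
.fi
.in
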
Both classifier and action return codes are supported in eBPF and cBPF
programs.

To demonstrate restricted C syntax, a minimal toy classifier example is
provided, which assumes that egress packets, for instance originating
from a container, have previously been marked in the interval [0, 255]. The
program keeps statistics on different marks for user space and maps the
classid to the root qdisc with the marking itself as the minor handle:

.in +4n
.nf
.sp
#include <stdint.h>
#include <asm/types.h>

#include <linux/bpf.h>
#include <linux/pkt_sched.h>

#include "helpers.h"

struct tuple {
        long packets;
        long bytes;
};

#define BPF_MAP_ID_STATS        1 /* agent's map identifier */
#define BPF_MAX_MARK            256

struct bpf_elf_map __section("maps") map_stats = {
        .type           =       BPF_MAP_TYPE_ARRAY,
        .id             =       BPF_MAP_ID_STATS,
        .size_key       =       sizeof(uint32_t),
        .size_value     =       sizeof(struct tuple),
        .max_elem       =       BPF_MAX_MARK,
};

static inline void cls_update_stats(const struct __sk_buff *skb,
                                    uint32_t mark)
{
        struct tuple *tu;

        tu = bpf_map_lookup_elem(&map_stats, &mark);
        if (likely(tu)) {
                __sync_fetch_and_add(&tu->packets, 1);
                __sync_fetch_and_add(&tu->bytes, skb->len);
        }
}

__section("cls") int cls_main(struct __sk_buff *skb)
{
        uint32_t mark = skb->mark;

        if (unlikely(mark >= BPF_MAX_MARK))
                return 0;

        cls_update_stats(skb, mark);

        return TC_H_MAKE(TC_H_ROOT, mark);
}

char __license[] __section("license") = "GPL";
.fi
.in

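The value returned by cls_main() can be reproduced outside the kernel to
see which classid a given mark maps to. The sketch below defines local
copies of the handle macros instead of including linux/pkt_sched.h; the
H_* names and classid_for_mark() are hypothetical and only mirror the
arithmetic of the example.

.in +4n
.nf
.sp
```c
#include <stdint.h>

/* Local copies of the handle macros from linux/pkt_sched.h. */
#define H_MAJ_MASK      0xFFFF0000U
#define H_MIN_MASK      0x0000FFFFU
#define H_ROOT          0xFFFFFFFFU
#define H_MAKE(maj, min) (((maj) & H_MAJ_MASK) | ((min) & H_MIN_MASK))

#define MAX_MARK        256

/*
 * Same decision as cls_main(), without the statistics update: marks
 * outside [0, 255] are a mismatch (0); all others become the minor
 * number of a classid whose major number is taken from H_ROOT.
 */
static uint32_t classid_for_mark(uint32_t mark)
{
        if (mark >= MAX_MARK)
                return 0;
        return H_MAKE(H_ROOT, mark);
}
```
.fi
.in

For instance, a mark of 5 results in 0xFFFF0005, that is, major ffff and
minor 5, whereas a mark of 256 results in 0 and therefore a mismatch.
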
Another small example is a port redirector which demuxes destination port
80 into the interval [8080, 8087] steered by RSS, that can then be attached
to ingress qdisc. The exercise of adding the egress counterpart and IPv6
support is left to the reader:

.in +4n
.nf
.sp
#include <asm/types.h>
#include <asm/byteorder.h>

#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>

#include "helpers.h"

static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                 __u16 old_port, __u16 new_port)
{
        bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                            old_port, new_port, sizeof(new_port));
        bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                            &new_port, sizeof(new_port), 0);
}

static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
{
        __u16 dport, dport_new = 8080, off;
        __u8 ip_proto, ip_vl;

        ip_proto = load_byte(skb, nh_off +
                             offsetof(struct iphdr, protocol));
        if (ip_proto != IPPROTO_TCP)
                return 0;

        ip_vl = load_byte(skb, nh_off);
        if (likely(ip_vl == 0x45))
                nh_off += sizeof(struct iphdr);
        else
                nh_off += (ip_vl & 0xF) << 2;

        dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
        if (dport != 80)
                return 0;

        off = skb->queue_mapping & 7;
        set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                      __cpu_to_be16(dport_new + off));
        return -1;
}

__section("lb") int lb_main(struct __sk_buff *skb)
{
        int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

        if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
                ret = lb_do_ipv4(skb, nh_off);

        return ret;
}

char __license[] __section("license") = "GPL";
.fi
.in

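Two pieces of arithmetic in lb_do_ipv4() are easy to check in isolation:
the IPv4 header length derived from the first header byte, and the new
destination port derived from the queue mapping. The following user space
sketch restates just those two expressions; ipv4_hdr_len() and
redirect_port() are hypothetical helper names for illustration.

.in +4n
.nf
.sp
```c
#include <stdint.h>

/*
 * IPv4 header length in bytes from the first header byte, which holds
 * the version in the upper and the IHL (in 32 bit words) in the lower
 * four bits.
 */
static unsigned int ipv4_hdr_len(uint8_t ip_vl)
{
        return (unsigned int)(ip_vl & 0xF) << 2;
}

/* New destination port: 8080 plus the low three bits of the queue. */
static uint16_t redirect_port(uint32_t queue_mapping)
{
        return (uint16_t)(8080 + (queue_mapping & 7));
}
```
.fi
.in

A first byte of 0x45 gives a header length of 20 bytes, which is the fast
path in the example, and masking the queue mapping with 7 keeps the
resulting port within [8080, 8087].
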
The related helper header file
.B helpers.h
used in both examples is:

.in +4n
.nf
.sp
/* Misc helper macros. */
#define __section(x) __attribute__((section(x), used))
#define offsetof(x, y) __builtin_offsetof(x, y)
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Used map structure */
struct bpf_elf_map {
    __u32 type;
    __u32 size_key;
    __u32 size_value;
    __u32 max_elem;
    __u32 id;
};

/* Some used BPF function calls. */
static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                  int len, int flags) =
      (void *) BPF_FUNC_skb_store_bytes;
static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                  int to, int flags) =
      (void *) BPF_FUNC_l4_csum_replace;
static void *(*bpf_map_lookup_elem)(void *map, void *key) =
      (void *) BPF_FUNC_map_lookup_elem;

/* Some used BPF intrinsics. */
unsigned long long load_byte(void *skb, unsigned long long off)
    asm ("llvm.bpf.load.byte");
unsigned long long load_half(void *skb, unsigned long long off)
    asm ("llvm.bpf.load.half");
.fi
.in

Best practice, we recommend to only have a single eBPF classifier loaded
 | 
						|
in tc and perform
 | 
						|
.B all
 | 
						|
necessary matching and mangling from there instead of a list of individual
 | 
						|
classifier and separate actions. Just a single classifier tailored for a
 | 
						|
given use-case will be most efficient to run.
 | 
						|

.SS eBPF DEBUGGING

Both tc
.B filter
and
.B action
commands for
.B bpf
support an optional
.B verbose
parameter that can be used to inspect the eBPF verifier log. It is dumped
by default in case of an error.

In case the eBPF/cBPF JIT compiler has been enabled, it can also be
instructed to emit debug output of the resulting opcode image into
the kernel log, which can be read via
.BR dmesg (1):

.in +4n
.B echo 2 > /proc/sys/net/core/bpf_jit_enable
.in

The Linux kernel source tree additionally ships, under
.BR tools/net/ ,
a small helper called
.B bpf_jit_disasm
that reads the opcode image dump from the kernel log and prints the
resulting disassembly:

.in +4n
.B bpf_jit_disasm -o
.in

Other than that, the Linux kernel also contains an extensive eBPF/cBPF
test suite module called
.BR test_bpf .
Upon loading it with

.in +4n
.B modprobe test_bpf
.in

it runs a variety of test cases and dumps the results into
the kernel log, which can be inspected with
.BR dmesg (1).
The results can differ depending on whether the JIT compiler is enabled
or not. If any test case fails, the module fails to load. In
that case, we urge you to file a bug report with the authors of the
affected JIT and with the Linux kernel and networking mailing lists.

.SS cBPF

Although we generally recommend switching to implementing
.B eBPF
classifiers and actions, for the sake of completeness, a few words on how
to program in cBPF are given here.

Likewise, the
.B bpf_jit_enable
switch can be enabled as mentioned already. Tooling such as
.B bpf_jit_disasm
also works regardless of whether eBPF or cBPF code is being loaded.

Unlike in eBPF, classifiers and actions are not implemented in restricted C,
but rather in a minimal assembler-like language or with the help of other
tooling.

The raw interface with tc takes opcodes directly. For example, the most
minimal classifier, which matches every packet and results in the default
classid of 1:1, looks like:

.in +4n
.B tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1
.in

The first decimal of the bytecode sequence denotes the number of subsequent
4-tuples of cBPF opcodes. As mentioned, such a 4-tuple consists of
.B c t f k
decimals, where
.B c
represents the cBPF opcode,
.B t
the jump true offset target,
.B f
the jump false offset target and
.B k
the immediate constant/literal. Here, this denotes an unconditional return
from the program with an immediate value of -1 (written as 4294967295,
its unsigned 32-bit representation).
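.sp
To make the layout described above concrete, the following sketch parses
such a bytecode string back into its 4-tuples. It is only an illustration
under stated assumptions: struct cbpf_insn and parse_tc_bytecode are
hypothetical names and not part of tc, and the field widths (16-bit opcode,
8-bit jump offsets, 32-bit constant) follow the classic BPF instruction
layout.

.in +4n
.nf
.sp
```c
#include <assert.h>
#include <stdio.h>

/* One classic BPF instruction in the "c t f k" layout. */
struct cbpf_insn {
    unsigned short code; /* c: opcode */
    unsigned char jt;    /* t: jump true offset */
    unsigned char jf;    /* f: jump false offset */
    unsigned int k;      /* k: immediate constant/literal */
};

/*
 * Hypothetical helper (not part of tc): parse a bytecode string such as
 * "1,6 0 0 4294967295," into at most max instructions. Returns the number
 * of instructions, or -1 if the string is malformed or the leading count
 * does not match the number of 4-tuples that follow.
 */
static int parse_tc_bytecode(const char *s, struct cbpf_insn *out, int max)
{
    unsigned int count, c, t, f, k, i;
    int used = -1;

    assert(s != NULL && out != NULL && max >= 0);

    /* The first decimal is the number of subsequent 4-tuples. */
    if (sscanf(s, "%u,%n", &count, &used) != 1 || used < 0)
        return -1;
    if (count > (unsigned int)max)
        return -1;
    s += used;

    for (i = 0; i < count; i++) {
        used = -1;
        if (sscanf(s, "%u %u %u %u%n", &c, &t, &f, &k, &used) != 4 ||
            used < 0 || c > 0xffff || t > 0xff || f > 0xff)
            return -1;
        out[i].code = (unsigned short)c;
        out[i].jt = (unsigned char)t;
        out[i].jf = (unsigned char)f;
        out[i].k = k;
        s += used;

        /* A comma separates tuples; it is optional after the last one. */
        if (*s == ',')
            s++;
        else if (i + 1 < count)
            return -1;
    }

    /* Anything left over means the count was smaller than the tuple list. */
    while (*s == ' ')
        s++;
    return *s == 0 ? (int)count : -1;
}
```
.fi
.in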
.sp
For egress classification, Willem de Bruijn implemented a minimal stand-alone
helper tool, licensed under the GNU General Public License version 2, for the
.BR iptables (8)
BPF extension. It reuses the internal classic BPF compiler of
.BR libpcap .
The following code is derived from it for usage with
.BR tc (8):

.in +4n
.nf
.sp
#include <pcap.h>
#include <stdio.h>

/* Usage: bpftool [LINKTYPE] 'FILTER_EXPRESSION' */
int main(int argc, char **argv)
{
        struct bpf_program prog;
        struct bpf_insn *ins;
        int i, ret, dlt = DLT_RAW;

        if (argc < 2 || argc > 3)
                return 1;
        /* An optional first argument names the link type, e.g. EN10MB. */
        if (argc == 3) {
                dlt = pcap_datalink_name_to_val(argv[1]);
                if (dlt == -1)
                        return 1;
        }

        /* Compile the filter expression (last argument), optimized. */
        ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                                  1, PCAP_NETMASK_UNKNOWN);
        if (ret)
                return 1;

        /* Print the instruction count, then one "c t f k" tuple each. */
        printf("%u,", prog.bf_len);
        ins = prog.bf_insns;

        for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
                printf("%u %u %u %u,", ins->code,
                       ins->jt, ins->jf, ins->k);
        printf("%u %u %u %u",
               ins->code, ins->jt, ins->jf, ins->k);

        pcap_freecode(&prog);
        return 0;
}
.fi
.in

Given this small helper, any
.BR tcpdump (8)
filter expression can be used as a classifier where a match will
result in the default classid:

.in +4n
.B bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
.br
.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
.in

Basically, such a minimal generator is equivalent to:

.in +4n
.B tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\\\\n' ',' > /var/bpf/tcp-syn
.in

Since
.B libpcap
does not support all Linux-specific cBPF extensions in its compiler, the
Linux kernel also ships, under
.BR tools/net/ ,
a minimal BPF assembler called
.B bpf_asm
that provides full control. For detailed syntax and semantics on implementing
such programs by hand, see the references under
.BR "FURTHER READING" .

A trivial toy example in
.B bpf_asm
for classifying IPv4/TCP packets, saved in a text file called
.BR foobar :

.in +4n
.nf
.sp
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ret #-1
drop: ret #0
.fi
.in

Similarly, such a classifier can be loaded as:

.in +4n
.B bpf_asm foobar > /var/bpf/tcp-syn
.br
.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
.in
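.sp
For illustration, the toy program in foobar can also be written out by hand
with the BPF_STMT and BPF_JUMP macros from <linux/filter.h>, which makes the
jump offsets explicit: a "jne" is encoded as an equality test whose false
branch skips forward to the drop label. This is a sketch that assumes the
Linux UAPI headers are installed; ipv4_tcp_prog and format_tc_bytecode are
hypothetical names and not part of bpf_asm or tc, and the string it builds
is meant to follow the bytecode layout described earlier, not to reproduce
the output of any particular tool.

.in +4n
.nf
.sp
```c
#include <assert.h>
#include <stdio.h>
#include <linux/filter.h>

/* The toy IPv4/TCP classifier, encoded by hand. */
static const struct sock_filter ipv4_tcp_prog[] = {
    BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),           /* ldh [12] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x800, 0, 3), /* jne #0x800, drop */
    BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 23),           /* ldb [23] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 6, 0, 1),     /* jneq #6, drop */
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),            /* ret #-1 */
    BPF_STMT(BPF_RET | BPF_K, 0),                     /* drop: ret #0 */
};

/*
 * Hypothetical helper: write prog as "N,c t f k,c t f k,..." into buf.
 * Returns the string length, or -1 if buf is too small.
 */
static int format_tc_bytecode(const struct sock_filter *prog,
                              unsigned int len, char *buf, size_t size)
{
    size_t off;
    unsigned int i;
    int n;

    assert(prog != NULL && buf != NULL);

    /* Leading decimal: the number of 4-tuples that follow. */
    n = snprintf(buf, size, "%u,", len);
    if (n < 0 || (size_t)n >= size)
        return -1;
    off = (size_t)n;

    for (i = 0; i < len; i++) {
        n = snprintf(buf + off, size - off, "%u %u %u %u%s",
                     (unsigned int)prog[i].code,
                     (unsigned int)prog[i].jt,
                     (unsigned int)prog[i].jf,
                     (unsigned int)prog[i].k,
                     i + 1 < len ? "," : "");
        if (n < 0 || (size_t)n >= size - off)
            return -1;
        off += (size_t)n;
    }
    return (int)off;
}
```
.fi
.in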

For BPF classifiers, the Linux kernel additionally provides, under
.BR tools/net/ ,
a small BPF debugger called
.BR bpf_dbg ,
which can be used to test a classifier against pcap files, single-step
or add various breakpoints into the classifier program and dump register
contents during runtime.

Implementing an action in classic BPF is rather limited in the sense that
packet mangling is not supported. Therefore, it is generally recommended to
switch to eBPF whenever possible.

.SH FURTHER READING
Further and more technical details about the BPF architecture can be found
in the Linux kernel source tree under
.BR Documentation/networking/filter.txt .

Further details on eBPF
.BR tc (8)
examples can be found in the iproute2 source
tree under
.BR examples/bpf/ .

.SH SEE ALSO
.BR tc (8),
.BR tc-ematch (8),
.BR bpf (2),
.BR bpf (4)

.SH AUTHORS
Manpage written by Daniel Borkmann.

Please report corrections or improvements to the Linux kernel networking
mailing list:
.B <netdev@vger.kernel.org>