bpf, docs: document open-coded BPF iterators

Extract BPF open-coded iterators documentation spread out across a few
original commit messages ([0], [1]) into a dedicated doc section under
Documentation/bpf/bpf_iterators.rst. Also make explicit expectation that
BPF iterator program type should be accompanied by a corresponding
open-coded BPF iterator implementation, going forward.

  [0] https://lore.kernel.org/all/20230308184121.1165081-3-andrii@kernel.org/
  [1] https://lore.kernel.org/all/20230308184121.1165081-4-andrii@kernel.org/

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250509180350.2604946-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko 2025-05-09 11:03:50 -07:00 committed by Alexei Starovoitov
parent c8ce7db0ca
commit 7220eabff8


@ -2,10 +2,117 @@
BPF Iterators
=============
--------
Overview
--------
----------
Motivation
----------
BPF supports two separate entities collectively known as "BPF iterators": BPF
iterator *program type* and *open-coded* BPF iterators. The former is
a stand-alone BPF program type which, when attached and activated by user,
will be called once for each entity (task_struct, cgroup, etc) that is being
iterated. The latter is a set of BPF-side APIs implementing iterator
functionality and available across multiple BPF program types. Open-coded
iterators provide similar functionality to BPF iterator programs, but give
more flexibility and control to all other BPF program types. BPF iterator
programs, on the other hand, can be used to implement anonymous or BPF
FS-mounted special files, whose contents are generated by attached BPF iterator
program, backed by seq_file functionality. Both are useful depending on
specific needs.
When adding a new BPF iterator program, it is expected that similar
functionality will be added as open-coded iterator for maximum flexibility.
It's also expected that iteration logic and code will be maximally shared and
reused between two iterator API surfaces.
------------------------
Open-coded BPF Iterators
------------------------
Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
(constructor, next element fetch, destructor) and iterator-specific type
describing on-the-stack iterator state, which is guaranteed by the BPF
verifier to not be tampered with outside of the corresponding
constructor/destructor/next APIs.
Each kind of open-coded BPF iterator has its own associated
struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
bpf_iter_<type> state needs to live on BPF program stack, so make sure it's
small enough to fit on BPF stack. For performance reasons it's best to avoid
dynamic memory allocation for iterator state and size the state struct big
enough to fit everything necessary. But if necessary, dynamic memory
allocation is a way to bypass BPF stack limitations. Note that state struct
size is part of iterator's user-visible API, and changing it will break
backwards compatibility, so be deliberate about designing it.
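The sizing advice above can be illustrated with a minimal userspace sketch. All names here (`bpf_iter_demo`, `bpf_iter_demo_kern`, `demo_state_size()`) are hypothetical and exist only for illustration. The sketch assumes a common pattern: an opaque, fixed-size public state type, a private layout used by the implementation, and compile-time checks that keep the two in sync.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical public state type: opaque to users, fixed at 16 bytes.
 * Its size is part of the user-visible API, so it must never change.
 */
struct bpf_iter_demo {
	uint64_t opaque[2];
} __attribute__((aligned(8)));

/* Hypothetical private layout that the implementation actually uses. */
struct bpf_iter_demo_kern {
	int32_t next; /* next value to hand out */
	int32_t end;  /* exclusive upper bound */
	int32_t val;  /* storage for the element returned by next() */
} __attribute__((aligned(8)));

/* Catch accidental layout drift at compile time. */
static_assert(sizeof(struct bpf_iter_demo) == sizeof(struct bpf_iter_demo_kern),
	      "public and private state must have the same size");
static_assert(_Alignof(struct bpf_iter_demo) == _Alignof(struct bpf_iter_demo_kern),
	      "public and private state must have the same alignment");
static_assert(sizeof(struct bpf_iter_demo) % 8 == 0,
	      "state must occupy whole 8-byte stack slots");

/* Size that a caller would have to reserve on its stack. */
size_t demo_state_size(void)
{
	return sizeof(struct bpf_iter_demo);
}
```

With checks like these, adding a field that grows the private layout fails the build instead of silently changing the size that users depend on.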
All kfuncs (constructor, next, destructor) have to be named consistently as
bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator
type, and iterator state should be represented as a matching
`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
a pointer to this `struct bpf_iter_<type>` as the very first argument.
Additionally:
- Constructor, i.e., `bpf_iter_<type>_new()`, can have an arbitrary number
of extra arguments. Return type is not enforced either.
- Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
type and should have exactly one argument: `struct bpf_iter_<type> *`
(const/volatile/restrict and typedefs are ignored).
- Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
should have exactly one argument, similar to the next method.
- `struct bpf_iter_<type>` size is enforced to be positive and
a multiple of 8 bytes (to fit stack slots correctly).
Such strictness and consistency allow building generic helpers that abstract
important, but boilerplate, details, making open-coded iterators effective
and ergonomic to use (see libbpf's bpf_for_each() macro). These constraints
are enforced by the kernel at kfunc registration time.
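To show why consistent naming matters, here is a minimal userspace sketch of a generic loop helper built purely on the `bpf_iter_<type>_{new,next,destroy}()` convention. It is not libbpf's actual bpf_for_each() implementation. The `range` iterator, the `demo_for_each()` macro, `destroy_calls`, `sum_range()` and `destroys_for()` are all hypothetical names, and the sketch assumes a GCC- or Clang-compatible compiler for the `cleanup` attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "range" iterator: yields start, start + 1, ..., end - 1. */
struct bpf_iter_range {
	int next;
	int end;
	int val;
} __attribute__((aligned(8)));

static_assert(sizeof(struct bpf_iter_range) % 8 == 0,
	      "state must occupy whole 8-byte stack slots");

static int destroy_calls; /* counts destructor runs, for demonstration only */

int bpf_iter_range_new(struct bpf_iter_range *it, int start, int end)
{
	it->next = start;
	it->end = end;
	it->val = 0;
	return 0;
}

int *bpf_iter_range_next(struct bpf_iter_range *it)
{
	if (it->next >= it->end)
		return NULL;
	it->val = it->next++;
	return &it->val;
}

void bpf_iter_range_destroy(struct bpf_iter_range *it)
{
	it->next = it->end = 0;
	destroy_calls++;
}

/* Token pasting derives all three function names from the iterator type.
 * The cleanup attribute runs the destructor whenever the loop is left.
 */
#define demo_for_each(type, cur, ...)						\
	for (struct bpf_iter_##type it_						\
		     __attribute__((cleanup(bpf_iter_##type##_destroy))),	\
	     *once_ __attribute__((unused)) =					\
		     (bpf_iter_##type##_new(&it_, __VA_ARGS__), (void *)0);	\
	     ((cur) = bpf_iter_##type##_next(&it_)) != NULL; )

/* Sums [start, end), stopping early once the running sum exceeds limit. */
int sum_range(int start, int end, int limit)
{
	int *v, sum = 0;

	demo_for_each(range, v, start, end) {
		sum += *v;
		if (sum > limit)
			break; /* destructor still runs via cleanup */
	}
	return sum;
}

/* Reports how many destructor calls a single loop triggers. */
int destroys_for(int start, int end, int limit)
{
	int before = destroy_calls;

	(void)sum_range(start, end, limit);
	return destroy_calls - before;
}
```

Because the macro only relies on the naming scheme and the "state pointer first" rule, the same helper would work for any other iterator type that follows the convention.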
Constructor/next/destructor implementation contract is as follows:
- constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
the stack. If any of the input arguments are invalid, constructor should
make sure to still initialize it such that subsequent next() calls will
return NULL. I.e., on error, *return error and construct empty iterator*.
Constructor kfunc is marked with KF_ITER_NEW flag.
- next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state
and produces an element. Next method should always return a pointer. The
contract with the BPF verifier is that next method *guarantees* that it
will eventually return NULL when elements are exhausted. Once NULL is
returned, subsequent next calls *should keep returning NULL*. Next method
is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as
NULL-returning kfunc, of course).
- destructor, `bpf_iter_<type>_destroy()`, is always called once, even if
constructor failed or next returned nothing. Destructor frees up any
resources and marks stack space used by `struct bpf_iter_<type>` as usable
for something else. Destructor is marked with KF_ITER_DESTROY flag.
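The contract above can be sketched in plain userspace C. This is an illustration only, not kernel code: the `span` iterator and the helpers `demo_ctor_status()` and `demo_drain()` are hypothetical, and kfunc flags such as KF_ITER_NEW have no equivalent here. The sketch assumes that an invalid range (start greater than end) is the only error condition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical iterator over the half-open integer range [start, end). */
struct bpf_iter_span {
	int next;
	int end;
	int val;
} __attribute__((aligned(8)));

/* Constructor: always initializes *it, even when arguments are invalid. */
int bpf_iter_span_new(struct bpf_iter_span *it, int start, int end)
{
	it->val = 0;
	if (start > end) {
		/* On error: return error AND construct an empty iterator. */
		it->next = it->end = 0;
		return -EINVAL;
	}
	it->next = start;
	it->end = end;
	return 0;
}

/* Next: returns a pointer to the element, or NULL once exhausted.
 * After the first NULL, every later call keeps returning NULL.
 */
int *bpf_iter_span_next(struct bpf_iter_span *it)
{
	if (it->next >= it->end)
		return NULL;
	it->val = it->next++;
	return &it->val;
}

/* Destructor: nothing to free here, so just reset the state. */
void bpf_iter_span_destroy(struct bpf_iter_span *it)
{
	it->next = it->end = 0;
}

/* Returns the constructor's status; destroy is called even on failure. */
int demo_ctor_status(int start, int end)
{
	struct bpf_iter_span it;
	int err = bpf_iter_span_new(&it, start, end);

	bpf_iter_span_destroy(&it);
	return err;
}

/* Runs the full lifecycle and returns how many elements were produced. */
int demo_drain(int start, int end)
{
	struct bpf_iter_span it;
	int n = 0;

	(void)bpf_iter_span_new(&it, start, end); /* may fail; still iterable */
	while (bpf_iter_span_next(&it))
		n++;
	assert(bpf_iter_span_next(&it) == NULL); /* NULL must be sticky */
	bpf_iter_span_destroy(&it);
	return n;
}
```

Note how `demo_drain()` never has to check the constructor's return value to stay safe: a failed constructor simply yields an empty iteration, and the destructor runs exactly once on every path.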
Any open-coded BPF iterator implementation has to implement at least these
three methods. It is enforced that for any given type of iterator only
applicable constructor/destructor/next are callable. I.e., verifier ensures
you can't pass number iterator state into, say, cgroup iterator's next method.
From a 10,000-foot BPF verification point of view, next methods are the points
of forking a verification state, which are conceptually similar to what
verifier is doing when validating conditional jumps. Verifier branches at each
`call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
(iteration is done) and non-NULL (new element is returned). NULL is simulated
first and is supposed to reach exit without looping. After that non-NULL case
is validated and it either reaches exit (for trivial examples with no real
loop), or reaches another `call bpf_iter_<type>_next` instruction with the
state equivalent to already (partially) validated one. State equivalency at
that point means we technically are going to be looping forever without
"breaking out" of the established "state envelope" (i.e., subsequent
iterations don't add any new knowledge or constraints to the verifier state,
so running 1, 2, 10, or a million of them doesn't matter). But taking into
account the contract stating that iterator next method *has to* return NULL
eventually, we can conclude that loop body is safe and will eventually
terminate. Given we validated logic outside of the loop (NULL case), and
concluded that loop body is safe (though potentially looping many times),
verifier can claim safety of the overall program logic.
------------------------
BPF Iterators Motivation
------------------------
There are a few existing ways to dump kernel data into user space. The most
popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps