bpf, docs: document open-coded BPF iterators

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Extract BPF open-coded iterators documentation spread out across a few
original commit messages ([0], [1]) into a dedicated doc section under
Documentation/bpf/bpf_iterators.rst. Also make explicit the expectation
that, going forward, a BPF iterator program type should be accompanied by
a corresponding open-coded BPF iterator implementation.

[0] https://lore.kernel.org/all/20230308184121.1165081-3-andrii@kernel.org/
[1] https://lore.kernel.org/all/20230308184121.1165081-4-andrii@kernel.org/

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250509180350.2604946-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
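
For orientation, the constructor/next/destructor contract that this patch documents can be modeled in plain user-space C. The sketch below is not kernel code and not the kernel's implementation: `struct model_iter_num` and the `model_iter_num_*()` functions are hypothetical names that only mirror the `bpf_iter_<type>_{new,next,destroy}()` naming scheme, and the "empty iterator on error" behavior is written directly from the contract text in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of an open-coded "number" iterator state.
 * It lives on the caller's stack, like struct bpf_iter_<type> does. */
struct model_iter_num {
	int32_t next; /* next value to hand out */
	int32_t end;  /* exclusive upper bound */
	int32_t val;  /* storage for the element returned by next() */
	int32_t pad;  /* keeps the size a multiple of 8 bytes */
};

/* The documented rule: state size is positive and a multiple of 8 bytes. */
_Static_assert(sizeof(struct model_iter_num) % 8 == 0,
	       "iterator state must be a multiple of 8 bytes");

/* Constructor: always initializes the state. On invalid input it returns
 * an error AND leaves behind an empty iterator. */
int model_iter_num_new(struct model_iter_num *it, int32_t start, int32_t end)
{
	it->val = 0;
	it->pad = 0;
	if (start > end) {
		it->next = 0;
		it->end = 0;
		return -EINVAL;
	}
	it->next = start;
	it->end = end;
	return 0;
}

/* Next: returns a pointer to the current element, or NULL once exhausted.
 * After the first NULL, every later call keeps returning NULL. */
int32_t *model_iter_num_next(struct model_iter_num *it)
{
	if (it->next >= it->end)
		return NULL;
	it->val = it->next++;
	return &it->val;
}

/* Destructor: called exactly once, even if the constructor failed. */
void model_iter_num_destroy(struct model_iter_num *it)
{
	it->next = 0;
	it->end = 0;
}

/* Usage: sum the half-open range [start, end). The constructor's error is
 * deliberately ignored, because a failed constructor yields an empty
 * iterator and the loop body then simply never runs. */
int64_t model_sum_range(int32_t start, int32_t end)
{
	struct model_iter_num it;
	int64_t sum = 0;
	int32_t *v;

	model_iter_num_new(&it, start, end);
	while ((v = model_iter_num_next(&it)) != NULL)
		sum += *v;
	model_iter_num_destroy(&it);
	return sum;
}
```

The caller in `model_sum_range()` has the same shape as a BPF program using an open-coded iterator: construct, loop until next returns NULL, destroy unconditionally. In a real BPF program the verifier enforces that shape. Here it is only a convention.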

Authored by Andrii Nakryiko and committed by Alexei Starovoitov 11 months ago 7220eabf c8ce7db0

+110 -3

1 changed file

Documentation/bpf/bpf_iterators.rst

···
 BPF Iterators
 =============
 
+--------
+Overview
+--------
 
-----------
-Motivation
-----------
+BPF supports two separate entities collectively known as "BPF iterators": BPF
+iterator *program type* and *open-coded* BPF iterators. The former is
+a stand-alone BPF program type which, when attached and activated by user,
+will be called once for each entity (task_struct, cgroup, etc) that is being
+iterated. The latter is a set of BPF-side APIs implementing iterator
+functionality and available across multiple BPF program types. Open-coded
+iterators provide similar functionality to BPF iterator programs, but gives
+more flexibility and control to all other BPF program types. BPF iterator
+programs, on the other hand, can be used to implement anonymous or BPF
+FS-mounted special files, whose contents are generated by attached BPF iterator
+program, backed by seq_file functionality. Both are useful depending on
+specific needs.
+
+When adding a new BPF iterator program, it is expected that similar
+functionality will be added as open-coded iterator for maximum flexibility.
+It's also expected that iteration logic and code will be maximally shared and
+reused between two iterator API surfaces.
+
+------------------------
+Open-coded BPF Iterators
+------------------------
+
+Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
+(constructor, next element fetch, destructor) and iterator-specific type
+describing on-the-stack iterator state, which is guaranteed by the BPF
+verifier to not be tampered with outside of the corresponding
+constructor/destructor/next APIs.
+
+Each kind of open-coded BPF iterator has its own associated
+struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
+bpf_iter_<type> state needs to live on BPF program stack, so make sure it's
+small enough to fit on BPF stack. For performance reasons its best to avoid
+dynamic memory allocation for iterator state and size the state struct big
+enough to fit everything necessary. But if necessary, dynamic memory
+allocation is a way to bypass BPF stack limitations. Note, state struct size
+is part of iterator's user-visible API, so changing it will break backwards
+compatibility, so be deliberate about designing it.
+
+All kfuncs (constructor, next, destructor) have to be named consistently as
+bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator
+type, and iterator state should be represented as a matching
+`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
+a pointer to this `struct bpf_iter_<type>` as the very first argument.
+
+Additionally:
+  - Constructor, i.e., `bpf_iter_<type>_new()`, can have arbitrary extra
+    number of arguments. Return type is not enforced either.
+  - Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
+    type and should have exactly one argument: `struct bpf_iter_<type> *`
+    (const/volatile/restrict and typedefs are ignored).
+  - Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
+    should have exactly one argument, similar to the next method.
+  - `struct bpf_iter_<type>` size is enforced to be positive and
+    a multiple of 8 bytes (to fit stack slots correctly).
+
+Such strictness and consistency allows to build generic helpers abstracting
+important, but boilerplate, details to be able to use open-coded iterators
+effectively and ergonomically (see libbpf's bpf_for_each() macro). This is
+enforced at kfunc registration point by the kernel.
+
+Constructor/next/destructor implementation contract is as follows:
+  - constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
+    the stack. If any of the input arguments are invalid, constructor should
+    make sure to still initialize it such that subsequent next() calls will
+    return NULL. I.e., on error, *return error and construct empty iterator*.
+    Constructor kfunc is marked with KF_ITER_NEW flag.
+
+  - next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state
+    and produces an element. Next method should always return a pointer. The
+    contract between BPF verifier is that next method *guarantees* that it
+    will eventually return NULL when elements are exhausted. Once NULL is
+    returned, subsequent next calls *should keep returning NULL*. Next method
+    is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as
+    NULL-returning kfunc, of course).
+
+  - destructor, `bpf_iter_<type>_destroy()`, is always called once. Even if
+    constructor failed or next returned nothing. Destructor frees up any
+    resources and marks stack space used by `struct bpf_iter_<type>` as usable
+    for something else. Destructor is marked with KF_ITER_DESTROY flag.
+
+Any open-coded BPF iterator implementation has to implement at least these
+three methods. It is enforced that for any given type of iterator only
+applicable constructor/destructor/next are callable. I.e., verifier ensures
+you can't pass number iterator state into, say, cgroup iterator's next method.
+
+From a 10,000-feet BPF verification point of view, next methods are the points
+of forking a verification state, which are conceptually similar to what
+verifier is doing when validating conditional jumps. Verifier is branching out
+`call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
+(iteration is done) and non-NULL (new element is returned). NULL is simulated
+first and is supposed to reach exit without looping. After that non-NULL case
+is validated and it either reaches exit (for trivial examples with no real
+loop), or reaches another `call bpf_iter_<type>_next` instruction with the
+state equivalent to already (partially) validated one. State equivalency at
+that point means we technically are going to be looping forever without
+"breaking out" out of established "state envelope" (i.e., subsequent
+iterations don't add any new knowledge or constraints to the verifier state,
+so running 1, 2, 10, or a million of them doesn't matter). But taking into
+account the contract stating that iterator next method *has to* return NULL
+eventually, we can conclude that loop body is safe and will eventually
+terminate. Given we validated logic outside of the loop (NULL case), and
+concluded that loop body is safe (though potentially looping many times),
+verifier can claim safety of the overall program logic.
+
+------------------------
+BPF Iterators Motivation
+------------------------
 
 There are a few existing ways to dump kernel data into user space. The most
 popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps