Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

docs/bpf: Add BPF ring buffer design notes

Add commit description from patch #1 as a stand-alone documentation under
Documentation/bpf, as it might be more convenient format, in long term
perspective.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200529075424.3139988-6-andriin@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

authored by

Andrii Nakryiko and committed by
Alexei Starovoitov
97abb2b3 c97099b0

+209
+209
Documentation/bpf/ringbuf.rst
··· 1 + =============== 2 + BPF ring buffer 3 + =============== 4 + 5 + This document describes BPF ring buffer design, API, and implementation details. 6 + 7 + .. contents:: 8 + :local: 9 + :depth: 2 10 + 11 + Motivation 12 + ---------- 13 + 14 + There are two distinctive motivators for this work, which are not satisfied by 15 + existing perf buffer, which prompted creation of a new ring buffer 16 + implementation. 17 + 18 + - more efficient memory utilization by sharing ring buffer across CPUs; 19 + - preserving ordering of events that happen sequentially in time, even across 20 + multiple CPUs (e.g., fork/exec/exit events for a task). 21 + 22 + These two problems are independent, but perf buffer fails to satisfy both. 23 + Both are a result of a choice to have per-CPU perf ring buffer. Both can be 24 + also solved by having an MPSC implementation of ring buffer. The ordering 25 + problem could technically be solved for perf buffer with some in-kernel 26 + counting, but given the first one requires an MPSC buffer, the same solution 27 + would solve the second problem automatically. 28 + 29 + Semantics and APIs 30 + ------------------ 31 + 32 + Single ring buffer is presented to BPF programs as an instance of BPF map of 33 + type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives considered, but 34 + ultimately rejected. 35 + 36 + One way would be to, similar to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, make 37 + ``BPF_MAP_TYPE_RINGBUF`` could represent an array of ring buffers, but not 38 + enforce "same CPU only" rule. This would be more familiar interface compatible 39 + with existing perf buffer use in BPF, but would fail if application needed more 40 + advanced logic to lookup ring buffer by arbitrary key. 41 + ``BPF_MAP_TYPE_HASH_OF_MAPS`` addresses this with current approach. 42 + Additionally, given the performance of BPF ringbuf, many use cases would just 43 + opt into a simple single ring buffer shared among all CPUs, for which current 44 + approach would be an overkill. 45 + 46 + Another approach could introduce a new concept, alongside BPF map, to represent 47 + generic "container" object, which doesn't necessarily have key/value interface 48 + with lookup/update/delete operations. This approach would add a lot of extra 49 + infrastructure that has to be built for observability and verifier support. It 50 + would also add another concept that BPF developers would have to familiarize 51 + themselves with, new syntax in libbpf, etc. But then would really provide no 52 + additional benefits over the approach of using a map. ``BPF_MAP_TYPE_RINGBUF`` 53 + doesn't support lookup/update/delete operations, but so doesn't few other map 54 + types (e.g., queue and stack; array doesn't support delete, etc). 55 + 56 + The approach chosen has an advantage of re-using existing BPF map 57 + infrastructure (introspection APIs in kernel, libbpf support, etc), being 58 + familiar concept (no need to teach users a new type of object in BPF program), 59 + and utilizing existing tooling (bpftool). For common scenario of using a single 60 + ring buffer for all CPUs, it's as simple and straightforward, as would be with 61 + a dedicated "container" object. On the other hand, by being a map, it can be 62 + combined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS`` map-in-maps to implement 63 + a wide variety of topologies, from one ring buffer for each CPU (e.g., as 64 + a replacement for perf buffer use cases), to a complicated application 65 + hashing/sharding of ring buffers (e.g., having a small pool of ring buffers 66 + with hashed task's tgid being a look up key to preserve order, but reduce 67 + contention). 68 + 69 + Key and value sizes are enforced to be zero. ``max_entries`` is used to specify 70 + the size of ring buffer and has to be a power of 2 value. 71 + 72 + There are a bunch of similarities between perf buffer 73 + (``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and new BPF ring buffer semantics: 74 + 75 + - variable-length records; 76 + - if there is no more space left in ring buffer, reservation fails, no 77 + blocking; 78 + - memory-mappable data area for user-space applications for ease of 79 + consumption and high performance; 80 + - epoll notifications for new incoming data; 81 + - but still the ability to do busy polling for new data to achieve the 82 + lowest latency, if necessary. 83 + 84 + BPF ringbuf provides two sets of APIs to BPF programs: 85 + 86 + - ``bpf_ringbuf_output()`` allows to *copy* data from one place to a ring 87 + buffer, similarly to ``bpf_perf_event_output()``; 88 + - ``bpf_ringbuf_reserve()``/``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` 89 + APIs split the whole process into two steps. First, a fixed amount of space 90 + is reserved. If successful, a pointer to a data inside ring buffer data 91 + area is returned, which BPF programs can use similarly to a data inside 92 + array/hash maps. Once ready, this piece of memory is either committed or 93 + discarded. Discard is similar to commit, but makes consumer ignore the 94 + record. 95 + 96 + ``bpf_ringbuf_output()`` has disadvantage of incurring extra memory copy, 97 + because record has to be prepared in some other place first. But it allows to 98 + submit records of the length that's not known to verifier beforehand. It also 99 + closely matches ``bpf_perf_event_output()``, so will simplify migration 100 + significantly. 101 + 102 + ``bpf_ringbuf_reserve()`` avoids the extra copy of memory by providing a memory 103 + pointer directly to ring buffer memory. In a lot of cases records are larger 104 + than BPF stack space allows, so many programs have use extra per-CPU array as 105 + a temporary heap for preparing sample. bpf_ringbuf_reserve() avoid this needs 106 + completely. But in exchange, it only allows a known constant size of memory to 107 + be reserved, such that verifier can verify that BPF program can't access memory 108 + outside its reserved record space. bpf_ringbuf_output(), while slightly slower 109 + due to extra memory copy, covers some use cases that are not suitable for 110 + ``bpf_ringbuf_reserve()``. 111 + 112 + The difference between commit and discard is very small. Discard just marks 113 + a record as discarded, and such records are supposed to be ignored by consumer 114 + code. Discard is useful for some advanced use-cases, such as ensuring 115 + all-or-nothing multi-record submission, or emulating temporary 116 + ``malloc()``/``free()`` within single BPF program invocation. 117 + 118 + Each reserved record is tracked by verifier through existing 119 + reference-tracking logic, similar to socket ref-tracking. It is thus 120 + impossible to reserve a record, but forget to submit (or discard) it. 121 + 122 + ``bpf_ringbuf_query()`` helper allows to query various properties of ring 123 + buffer. Currently 4 are supported: 124 + 125 + - ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer; 126 + - ``BPF_RB_RING_SIZE`` returns the size of ring buffer; 127 + - ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical possition 128 + of consumer/producer, respectively. 129 + 130 + Returned values are momentarily snapshots of ring buffer state and could be 131 + off by the time helper returns, so this should be used only for 132 + debugging/reporting reasons or for implementing various heuristics, that take 133 + into account highly-changeable nature of some of those characteristics. 134 + 135 + One such heuristic might involve more fine-grained control over poll/epoll 136 + notifications about new data availability in ring buffer. Together with 137 + ``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for output/commit/discard 138 + helpers, it allows BPF program a high degree of control and, e.g., more 139 + efficient batched notifications. Default self-balancing strategy, though, 140 + should be adequate for most applications and will work reliable and efficiently 141 + already. 142 + 143 + Design and Implementation 144 + ------------------------- 145 + 146 + This reserve/commit schema allows a natural way for multiple producers, either 147 + on different CPUs or even on the same CPU/in the same BPF program, to reserve 148 + independent records and work with them without blocking other producers. This 149 + means that if BPF program was interruped by another BPF program sharing the 150 + same ring buffer, they will both get a record reserved (provided there is 151 + enough space left) and can work with it and submit it independently. This 152 + applies to NMI context as well, except that due to using a spinlock during 153 + reservation, in NMI context, ``bpf_ringbuf_reserve()`` might fail to get 154 + a lock, in which case reservation will fail even if ring buffer is not full. 155 + 156 + The ring buffer itself internally is implemented as a power-of-2 sized 157 + circular buffer, with two logical and ever-increasing counters (which might 158 + wrap around on 32-bit architectures, that's not a problem): 159 + 160 + - consumer counter shows up to which logical position consumer consumed the 161 + data; 162 + - producer counter denotes amount of data reserved by all producers. 163 + 164 + Each time a record is reserved, producer that "owns" the record will 165 + successfully advance producer counter. At that point, data is still not yet 166 + ready to be consumed, though. Each record has 8 byte header, which contains the 167 + length of reserved record, as well as two extra bits: busy bit to denote that 168 + record is still being worked on, and discard bit, which might be set at commit 169 + time if record is discarded. In the latter case, consumer is supposed to skip 170 + the record and move on to the next one. Record header also encodes record's 171 + relative offset from the beginning of ring buffer data area (in pages). This 172 + allows ``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` to accept only the 173 + pointer to the record itself, without requiring also the pointer to ring buffer 174 + itself. Ring buffer memory location will be restored from record metadata 175 + header. This significantly simplifies verifier, as well as improving API 176 + usability. 177 + 178 + Producer counter increments are serialized under spinlock, so there is 179 + a strict ordering between reservations. Commits, on the other hand, are 180 + completely lockless and independent. All records become available to consumer 181 + in the order of reservations, but only after all previous records where 182 + already committed. It is thus possible for slow producers to temporarily hold 183 + off submitted records, that were reserved later. 184 + 185 + Reservation/commit/consumer protocol is verified by litmus tests in 186 + Documentation/litmus_tests/bpf-rb/_. 187 + 188 + One interesting implementation bit, that significantly simplifies (and thus 189 + speeds up as well) implementation of both producers and consumers is how data 190 + area is mapped twice contiguously back-to-back in the virtual memory. This 191 + allows to not take any special measures for samples that have to wrap around 192 + at the end of the circular buffer data area, because the next page after the 193 + last data page would be first data page again, and thus the sample will still 194 + appear completely contiguous in virtual memory. See comment and a simple ASCII 195 + diagram showing this visually in ``bpf_ringbuf_area_alloc()``. 196 + 197 + Another feature that distinguishes BPF ringbuf from perf ring buffer is 198 + a self-pacing notifications of new data being availability. 199 + ``bpf_ringbuf_commit()`` implementation will send a notification of new record 200 + being available after commit only if consumer has already caught up right up to 201 + the record being committed. If not, consumer still has to catch up and thus 202 + will see new data anyways without needing an extra poll notification. 203 + Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c_) show that 204 + this allows to achieve a very high throughput without having to resort to 205 + tricks like "notify only every Nth sample", which are necessary with perf 206 + buffer. For extreme cases, when BPF program wants more manual control of 207 + notifications, commit/discard/output helpers accept ``BPF_RB_NO_WAKEUP`` and 208 + ``BPF_RB_FORCE_WAKEUP`` flags, which give full control over notifications of 209 + data availability, but require extra caution and diligence in using this API.