bpf, documentation: Add graph documentation for non-owning refs

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

It is difficult to intuit the semantics of owning and non-owning
references from verifier code. In order to keep the high-level details
from being lost in the mailing list, this patch adds documentation
explaining semantics and details.

The target audience of the doc added in this patch is folks working on BPF
internals, as there's a focus on "what should the verifier do here". Via
reorganization or copy-and-paste, much of the content can probably be
repurposed for a BPF program writer audience as well.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-9-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Authored by Dave Marchevsky and committed by Alexei Starovoitov 3 years ago (c31315c3 215249f6)

+267

2 changed files

+266

Documentation/bpf/graph_ds_impl.rst

+=========================
+BPF Graph Data Structures
+=========================
+
+This document describes implementation details of new-style "graph" data
+structures (linked_list, rbtree), with particular focus on the verifier's
+implementation of semantics specific to those data structures.
+
+Although no specific verifier code is referred to in this document, the document
+assumes that the reader has general knowledge of BPF verifier internals, BPF
+maps, and BPF program writing.
+
+Note that the intent of this document is to describe the current state of
+these graph data structures. **No guarantees** of stability for either
+semantics or APIs are made or implied here.
+
+.. contents::
+    :local:
+    :depth: 2
+
+Introduction
+------------
+
+The BPF map API has historically been the main way to expose data structures
+of various types for use within BPF programs. Some data structures fit naturally
+with the map API (HASH, ARRAY), others less so. Consequentially, programs
+interacting with the latter group of data structures can be hard to parse
+for kernel programmers without previous BPF experience.
+
+Luckily, some restrictions which necessitated the use of BPF map semantics are
+no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
+BPF allocator, it is now possible to implement BPF data structures whose API
+and semantics more closely match those exposed to the rest of the kernel.
+
+Two such data structures - linked_list and rbtree - have many verification
+details in common. Because both have "root"s ("head" for linked_list) and
+"node"s, the verifier code and this document refer to common functionality
+as "graph_api", "graph_root", "graph_node", etc.
+
+Unless otherwise stated, examples and semantics below apply to both graph data
+structures.
+
+Unstable API
+------------
+
+Data structures implemented using the BPF map API have historically used BPF
+helper functions - either standard map API helpers like ``bpf_map_update_elem``
+or map-specific helpers. The new-style graph data structures instead use kfuncs
+to define their manipulation helpers. Because there are no stability guarantees
+for kfuncs, the API and semantics for these data structures can be evolved in
+a way that breaks backwards compatibility if necessary.
+
+Root and node types for the new data structures are opaquely defined in the
+``uapi/linux/bpf.h`` header.
+
+Locking
+-------
+
+The new-style data structures are intrusive and are defined similarly to their
+vanilla kernel counterparts:
+
+.. code-block:: c
+
+        struct node_data {
+          long key;
+          long data;
+          struct bpf_rb_node node;
+        };
+
+        struct bpf_spin_lock glock;
+        struct bpf_rb_root groot __contains(node_data, node);
+
+The "root" type for both linked_list and rbtree expects to be in a map_value
+which also contains a ``bpf_spin_lock`` - in the above example both global
+variables are placed in a single-value arraymap. The verifier considers this
+spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
+the same map_value and will enforce that the correct lock is held when
+verifying BPF programs that manipulate the tree. Since this lock checking
+happens at verification time, there is no runtime penalty.
+
+Non-owning references
+---------------------
+
+**Motivation**
+
+Consider the following BPF code:
+
+.. code-block:: c
+
+        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
+
+        bpf_spin_lock(&lock);
+
+        bpf_rbtree_add(&tree, n); /* PASSED */
+
+        bpf_spin_unlock(&lock);
+
+From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
+has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
+``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
+program has ownership of the pointee's (object pointed to by ``n``) lifetime.
+The BPF program must pass off ownership before exiting - either via
+``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
+``bpf_rbtree_add``.
+
+(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
+"ownership is acquired" and "ownership is passed", respectively)
+
+What should the verifier do with ``n`` after ownership is passed off? If the
+object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
+should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
+the object is no longer valid. The underlying memory may have been reused for
+some other allocation, unmapped, etc.
+
+When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
+obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
+but that would result in programs with useful, common coding patterns being
+rejected, e.g.:
+
+.. code-block:: c
+
+        int x;
+        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
+
+        bpf_spin_lock(&lock);
+
+        bpf_rbtree_add(&tree, n); /* PASSED */
+        x = n->data;
+        n->data = 42;
+
+        bpf_spin_unlock(&lock);
+
+Both the read from and write to ``n->data`` would be rejected. The verifier
+can do better, though, by taking advantage of two details:
+
+  * Graph data structure APIs can only be used when the ``bpf_spin_lock``
+    associated with the graph root is held
+
+  * Both graph data structures have pointer stability
+
+    * Because graph nodes are allocated with ``bpf_obj_new`` and
+      adding / removing from the root involves fiddling with the
+      ``bpf_{list,rb}_node`` field of the node struct, a graph node will
+      remain at the same address after either operation.
+
+Because the associated ``bpf_spin_lock`` must be held by any program adding
+or removing, if we're in the critical section bounded by that lock, we know
+that no other program can add or remove until the end of the critical section.
+This combined with pointer stability means that, until the critical section
+ends, we can safely access the graph node through ``n`` even after it was used
+to pass ownership.
+
+The verifier considers such a reference a *non-owning reference*. The ref
+returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
+Both terms currently only have meaning in the context of graph nodes and API.
+
+**Details**
+
+Let's enumerate the properties of both types of references.
+
+*owning reference*
+
+  * This reference controls the lifetime of the pointee
+
+  * Ownership of pointee must be 'released' by passing it to some graph API
+    kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
+
+    * If not released before program ends, verifier considers program invalid
+
+  * Access to the pointee's memory will not page fault
+
+*non-owning reference*
+
+  * This reference does not own the pointee
+
+    * It cannot be used to add the graph node to a graph root, nor ``free``'d via
+      ``bpf_obj_drop``
+
+  * No explicit control of lifetime, but can infer valid lifetime based on
+    non-owning ref existence (see explanation below)
+
+  * Access to the pointee's memory will not page fault
+
+From verifier's perspective non-owning references can only exist
+between spin_lock and spin_unlock. Why? After spin_unlock another program
+can do arbitrary operations on the data structure like removing and ``free``-ing
+via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
+``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
+Or the memory could go away.
+
+To prevent this logic violation all non-owning references are invalidated by the
+verifier after a critical section ends. This is necessary to ensure the "will
+not page fault" property of non-owning references. So if the verifier hasn't
+invalidated a non-owning ref, accessing it will not page fault.
+
+Currently ``bpf_obj_drop`` is not allowed in the critical section, so
+if there's a valid non-owning ref, we must be in a critical section, and can
+conclude that the ref's memory hasn't been dropped-and-``free``'d or
+dropped-and-reused.
+
+Any reference to a node that is in an rbtree *must* be non-owning, since
+the tree has control of the pointee's lifetime. Similarly, any ref to a node
+that isn't in rbtree *must* be owning. This results in a nice property:
+graph API add / remove implementations don't need to check if a node
+has already been added (or already removed), as the ownership model
+allows the verifier to prevent such a state from being valid by simply checking
+types.
+
+However, pointer aliasing poses an issue for the above "nice property".
+Consider the following example:
+
+.. code-block:: c
+
+        struct node_data *n, *m, *o, *p;
+        n = bpf_obj_new(typeof(*n));     /* 1 */
+
+        bpf_spin_lock(&lock);
+
+        bpf_rbtree_add(&tree, n);        /* 2 */
+        m = bpf_rbtree_first(&tree);     /* 3 */
+
+        o = bpf_rbtree_remove(&tree, n); /* 4 */
+        p = bpf_rbtree_remove(&tree, m); /* 5 */
+
+        bpf_spin_unlock(&lock);
+
+        bpf_obj_drop(o);
+        bpf_obj_drop(p); /* 6 */
+
+Assume the tree is empty before this program runs. If we track verifier state
+changes here using numbers in above comments:
+
+  1) n is an owning reference
+
+  2) n is a non-owning reference, it's been added to the tree
+
+  3) n and m are non-owning references, they both point to the same node
+
+  4) o is an owning reference, n and m non-owning, all point to same node
+
+  5) o and p are owning, n and m non-owning, all point to the same node
+
+  6) a double-free has occurred, since o and p point to same node and o was
+     ``free``'d in previous statement
+
+States 4 and 5 violate our "nice property", as there are non-owning refs to
+a node which is not in an rbtree. Statement 5 will try to remove a node which
+has already been removed as a result of this violation. State 6 is a dangerous
+double-free.
+
+At a minimum we should prevent state 6 from being possible. If we can't also
+prevent state 5 then we must abandon our "nice property" and check whether a
+node has already been removed at runtime.
+
+We prevent both by generalizing the "invalidate non-owning references" behavior
+of ``bpf_spin_unlock`` and doing similar invalidation after
+``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
+
+  * takes an arbitrary node argument
+
+  * removes it from the data structure
+
+  * returns an owning reference to the removed node
+
+May result in a state where some other non-owning reference points to the same
+node. So ``remove``-type kfuncs must be considered a non-owning reference
+invalidation point as well.

Documentation/bpf/other.rst

 
    ringbuf
    llvm_reloc
+   graph_ds_impl