Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

docs/bpf: Add LRU internals description and graph

Extend the bpf hashmap docs to include a brief description of the
internals of the LRU map type (setting appropriate API expectations),
including the original commit message from Martin and a variant on the
graph that I had presented during my Linux Plumbers Conference 2022
talk on "Pressure feedback for LRU map types"[0].

The node names in the dot file correspond roughly to the functions
where the logic for those decisions or steps is defined, to help
curious developers to cross-reference and update this logic if the
details of the LRU implementation ever differ from this description.

[0] https://lpc.events/event/16/contributions/1368/

Signed-off-by: Joe Stringer <joe@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230422172054.3355436-2-joe@isovalent.com

authored by

Joe Stringer and committed by
Daniel Borkmann
1a986518 af0335d2

+214
+42
Documentation/bpf/map_hash.rst
··· 1 1 .. SPDX-License-Identifier: GPL-2.0-only 2 2 .. Copyright (C) 2022 Red Hat, Inc. 3 + .. Copyright (C) 2022-2023 Isovalent, Inc. 3 4 4 5 =============================================== 5 6 BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants ··· 216 215 cur_key = &next_key; 217 216 } 218 217 } 218 + 219 + Internals 220 + ========= 221 + 222 + This section of the document is targeted at Linux developers and describes 223 + aspects of the map implementations that are not considered stable ABI. The 224 + following details are subject to change in future versions of the kernel. 225 + 226 + ``BPF_MAP_TYPE_LRU_HASH`` and variants 227 + -------------------------------------- 228 + 229 + Updating elements in LRU maps may trigger eviction behaviour when the capacity 230 + of the map is reached. There are various steps that the update algorithm 231 + attempts in order to enforce the LRU property which have increasing impacts on 232 + other CPUs involved in the following operation attempts: 233 + 234 + - Attempt to use CPU-local state to batch operations 235 + - Attempt to fetch free nodes from global lists 236 + - Attempt to pull any node from a global list and remove it from the hashmap 237 + - Attempt to pull any node from any CPU's list and remove it from the hashmap 238 + 239 + This algorithm is described visually in the following diagram. See the 240 + description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of 241 + the corresponding operations: 242 + 243 + .. kernel-figure:: map_lru_hash_update.dot 244 + :alt: Diagram outlining the LRU eviction steps taken during map update. 245 + 246 + LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and 247 + variants. See the dot file source for kernel function name code references. 248 + 249 + Map updates start from the oval in the top right "begin ``bpf_map_update()``" 250 + and progress through the graph towards the bottom where the result may be 251 + either a successful update or a failure with various error codes. The key in 252 + the top right provides indicators for which locks may be involved in specific 253 + operations. This is intended as a visual hint for reasoning about how map 254 + contention may impact update operations, though the map type and flags may 255 + impact the actual contention on those locks, based on the logic described in 256 + the table above. For instance, if the map is created with type 257 + ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_F_NO_COMMON_LRU`` then all map 258 + properties would be per-cpu.
+172
Documentation/bpf/map_lru_hash_update.dot
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + // Copyright (C) 2022-2023 Isovalent, Inc. 3 + digraph { 4 + node [colorscheme=accent4,style=filled] # Apply colorscheme to all nodes 5 + graph [splines=ortho, nodesep=1] 6 + 7 + subgraph cluster_key { 8 + label = "Key\n(locks held during operation)"; 9 + rankdir = TB; 10 + 11 + remote_lock [shape=rectangle,fillcolor=4,label="remote CPU LRU lock"] 12 + hash_lock [shape=rectangle,fillcolor=3,label="hashtab lock"] 13 + lru_lock [shape=rectangle,fillcolor=2,label="LRU lock"] 14 + local_lock [shape=rectangle,fillcolor=1,label="local CPU LRU lock"] 15 + no_lock [shape=rectangle,label="no locks held"] 16 + } 17 + 18 + begin [shape=oval,label="begin\nbpf_map_update()"] 19 + 20 + // Nodes below with an 'fn_' prefix are roughly labeled by the C function 21 + // names that initiate the corresponding logic in kernel/bpf/bpf_lru_list.c. 22 + // Number suffixes and errno suffixes handle subsections of the corresponding 23 + // logic in the function as of the writing of this dot. 24 + 25 + // cf. __local_list_pop_free() / bpf_percpu_lru_pop_free() 26 + local_freelist_check [shape=diamond,fillcolor=1, 27 + label="Local freelist\nnode available?"]; 28 + use_local_node [shape=rectangle, 29 + label="Use node owned\nby this CPU"] 30 + 31 + // cf. bpf_lru_pop_free() 32 + common_lru_check [shape=diamond, 33 + label="Map created with\ncommon LRU?\n(!BPF_F_NO_COMMON_LRU)"]; 34 + 35 + fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2, 36 + label="Flush local pending, 37 + Rotate Global list, move 38 + LOCAL_FREE_TARGET 39 + from global -> local"] 40 + // Also corresponds to: 41 + // fn__local_list_flush() 42 + // fn_bpf_lru_list_rotate() 43 + fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2, 44 + label="Able to free\nLOCAL_FREE_TARGET\nnodes?"] 45 + 46 + fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3, 47 + label="Shrink inactive list 48 + up to remaining 49 + LOCAL_FREE_TARGET 50 + (global LRU -> local)"] 51 + fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2, 52 + label="> 0 entries in\nlocal free list?"] 53 + fn___bpf_lru_list_shrink2 [shape=rectangle,fillcolor=2, 54 + label="Steal one node from 55 + inactive, or if empty, 56 + from active global list"] 57 + fn___bpf_lru_list_shrink3 [shape=rectangle,fillcolor=3, 58 + label="Try to remove\nnode from hashtab"] 59 + 60 + local_freelist_check2 [shape=diamond,label="Htab removal\nsuccessful?"] 61 + common_lru_check2 [shape=diamond, 62 + label="Map created with\ncommon LRU?\n(!BPF_F_NO_COMMON_LRU)"]; 63 + 64 + subgraph cluster_remote_lock { 65 + label = "Iterate through CPUs\n(start from current)"; 66 + style = dashed; 67 + rankdir=LR; 68 + 69 + local_freelist_check5 [shape=diamond,fillcolor=4, 70 + label="Steal a node from\nper-cpu freelist?"] 71 + local_freelist_check6 [shape=rectangle,fillcolor=4, 72 + label="Steal a node from 73 + (1) Unreferenced pending, or 74 + (2) Any pending node"] 75 + local_freelist_check7 [shape=rectangle,fillcolor=3, 76 + label="Try to remove\nnode from hashtab"] 77 + fn_htab_lru_map_update_elem [shape=diamond, 78 + label="Stole node\nfrom remote\nCPU?"] 79 + fn_htab_lru_map_update_elem2 [shape=diamond,label="Iterated\nall CPUs?"] 80 + // Also corresponds to: 81 + // use_local_node() 82 + // fn__local_list_pop_pending() 83 + } 84 + 85 + fn_bpf_lru_list_pop_free_to_local2 [shape=rectangle, 86 + label="Use node that was\nnot recently referenced"] 87 + local_freelist_check4 [shape=rectangle, 88 + label="Use node that was\nactively referenced\nin global list"] 89 + fn_htab_lru_map_update_elem_ENOMEM [shape=oval,label="return -ENOMEM"] 90 + fn_htab_lru_map_update_elem3 [shape=rectangle, 91 + label="Use node that was\nactively referenced\nin (another?) CPU's cache"] 92 + fn_htab_lru_map_update_elem4 [shape=rectangle,fillcolor=3, 93 + label="Update hashmap\nwith new element"] 94 + fn_htab_lru_map_update_elem5 [shape=oval,label="return 0"] 95 + fn_htab_lru_map_update_elem_EBUSY [shape=oval,label="return -EBUSY"] 96 + fn_htab_lru_map_update_elem_EEXIST [shape=oval,label="return -EEXIST"] 97 + fn_htab_lru_map_update_elem_ENOENT [shape=oval,label="return -ENOENT"] 98 + 99 + begin -> local_freelist_check 100 + local_freelist_check -> use_local_node [xlabel="Y"] 101 + local_freelist_check -> common_lru_check [xlabel="N"] 102 + common_lru_check -> fn_bpf_lru_list_pop_free_to_local [xlabel="Y"] 103 + common_lru_check -> fn___bpf_lru_list_shrink_inactive [xlabel="N"] 104 + fn_bpf_lru_list_pop_free_to_local -> fn___bpf_lru_node_move_to_free 105 + fn___bpf_lru_node_move_to_free -> 106 + fn_bpf_lru_list_pop_free_to_local2 [xlabel="Y"] 107 + fn___bpf_lru_node_move_to_free -> 108 + fn___bpf_lru_list_shrink_inactive [xlabel="N"] 109 + fn___bpf_lru_list_shrink_inactive -> fn___bpf_lru_list_shrink 110 + fn___bpf_lru_list_shrink -> fn_bpf_lru_list_pop_free_to_local2 [xlabel = "Y"] 111 + fn___bpf_lru_list_shrink -> fn___bpf_lru_list_shrink2 [xlabel="N"] 112 + fn___bpf_lru_list_shrink2 -> fn___bpf_lru_list_shrink3 113 + fn___bpf_lru_list_shrink3 -> local_freelist_check2 114 + local_freelist_check2 -> local_freelist_check4 [xlabel = "Y"] 115 + local_freelist_check2 -> common_lru_check2 [xlabel = "N"] 116 + common_lru_check2 -> local_freelist_check5 [xlabel = "Y"] 117 + common_lru_check2 -> fn_htab_lru_map_update_elem_ENOMEM [xlabel = "N"] 118 + local_freelist_check5 -> fn_htab_lru_map_update_elem [xlabel = "Y"] 119 + local_freelist_check5 -> local_freelist_check6 [xlabel = "N"] 120 + local_freelist_check6 -> local_freelist_check7 121 + local_freelist_check7 -> fn_htab_lru_map_update_elem 122 + 123 + fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem3 [xlabel = "Y"] 124 + fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem2 [xlabel = "N"] 125 + fn_htab_lru_map_update_elem2 -> 126 + fn_htab_lru_map_update_elem_ENOMEM [xlabel = "Y"] 127 + fn_htab_lru_map_update_elem2 -> local_freelist_check5 [xlabel = "N"] 128 + fn_htab_lru_map_update_elem3 -> fn_htab_lru_map_update_elem4 129 + 130 + use_local_node -> fn_htab_lru_map_update_elem4 131 + fn_bpf_lru_list_pop_free_to_local2 -> fn_htab_lru_map_update_elem4 132 + local_freelist_check4 -> fn_htab_lru_map_update_elem4 133 + 134 + fn_htab_lru_map_update_elem4 -> fn_htab_lru_map_update_elem5 [headlabel="Success"] 135 + fn_htab_lru_map_update_elem4 -> 136 + fn_htab_lru_map_update_elem_EBUSY [xlabel="Hashtab lock failed"] 137 + fn_htab_lru_map_update_elem4 -> 138 + fn_htab_lru_map_update_elem_EEXIST [xlabel="BPF_EXIST set and\nkey already exists"] 139 + fn_htab_lru_map_update_elem4 -> 140 + fn_htab_lru_map_update_elem_ENOENT [headlabel="BPF_NOEXIST set\nand no such entry"] 141 + 142 + // Create invisible pad nodes to line up various nodes 143 + pad0 [style=invis] 144 + pad1 [style=invis] 145 + pad2 [style=invis] 146 + pad3 [style=invis] 147 + pad4 [style=invis] 148 + 149 + // Line up the key with the top of the graph 150 + no_lock -> local_lock [style=invis] 151 + local_lock -> lru_lock [style=invis] 152 + lru_lock -> hash_lock [style=invis] 153 + hash_lock -> remote_lock [style=invis] 154 + remote_lock -> local_freelist_check5 [style=invis] 155 + remote_lock -> fn___bpf_lru_list_shrink [style=invis] 156 + 157 + // Line up return code nodes at the bottom of the graph 158 + fn_htab_lru_map_update_elem -> pad0 [style=invis] 159 + pad0 -> pad1 [style=invis] 160 + pad1 -> pad2 [style=invis] 161 + //pad2-> fn_htab_lru_map_update_elem_ENOMEM [style=invis] 162 + fn_htab_lru_map_update_elem4 -> pad3 [style=invis] 163 + pad3 -> fn_htab_lru_map_update_elem5 [style=invis] 164 + pad3 -> fn_htab_lru_map_update_elem_EBUSY [style=invis] 165 + pad3 -> fn_htab_lru_map_update_elem_EEXIST [style=invis] 166 + pad3 -> fn_htab_lru_map_update_elem_ENOENT [style=invis] 167 + 168 + // Reduce diagram width by forcing some nodes to appear above others 169 + local_freelist_check4 -> fn_htab_lru_map_update_elem3 [style=invis] 170 + common_lru_check2 -> pad4 [style=invis] 171 + pad4 -> local_freelist_check5 [style=invis] 172 + }