Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf, docs: DEVMAPs and XDP_REDIRECT

Add documentation for BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
including kernel version introduced, usage and examples.

Add documentation that describes XDP_REDIRECT.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221115144921.165483-1-mtahhan@redhat.com

Authored by Maryam Tahhan and committed by Daniel Borkmann
d1e91173 f80e16b6

+310 -2
+1
Documentation/bpf/index.rst
···
    clang-notes
    linux-notes
    other
+   redirect
 
 .. only:: subproject and html
 
+222
Documentation/bpf/map_devmap.rst
···
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=================================================
+BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
+=================================================
+
+.. note::
+   - ``BPF_MAP_TYPE_DEVMAP`` was introduced in kernel version 4.14
+   - ``BPF_MAP_TYPE_DEVMAP_HASH`` was introduced in kernel version 5.4
+
+``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` are BPF maps primarily
+used as backend maps for the XDP BPF helper call ``bpf_redirect_map()``.
+``BPF_MAP_TYPE_DEVMAP`` is backed by an array that uses the key as
+the index to lookup a reference to a net device. While ``BPF_MAP_TYPE_DEVMAP_HASH``
+is backed by a hash table that uses a key to lookup a reference to a net device.
+The user provides either <``key``/ ``ifindex``> or <``key``/ ``struct bpf_devmap_val``>
+pairs to update the maps with new net devices.
+
+.. note::
+    - The key to a hash map doesn't have to be an ``ifindex``.
+    - While ``BPF_MAP_TYPE_DEVMAP_HASH`` allows for densely packing the net devices
+      it comes at the cost of a hash of the key when performing a look up.
+
+The setup and packet enqueue/send code is shared between the two types of
+devmap; only the lookup and insertion is different.
+
+Usage
+=====
+Kernel BPF
+----------
+.. c:function::
+     long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+
+ Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
+ For ``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` this map contains
+ references to net devices (for forwarding packets through other ports).
+
+ The lower two bits of *flags* are used as the return code if the map lookup
+ fails. This is so that the return value can be one of the XDP program return
+ codes up to ``XDP_TX``, as chosen by the caller. The higher bits of ``flags``
+ can be set to ``BPF_F_BROADCAST`` or ``BPF_F_EXCLUDE_INGRESS`` as defined
+ below.
+
+ With ``BPF_F_BROADCAST`` the packet will be broadcast to all the interfaces
+ in the map, with ``BPF_F_EXCLUDE_INGRESS`` the ingress interface will be excluded
+ from the broadcast.
+
+ .. note::
+     - The key is ignored if BPF_F_BROADCAST is set.
+     - The broadcast feature can also be used to implement multicast forwarding:
+       simply create multiple DEVMAPs, each one corresponding to a single multicast group.
+
+ This helper will return ``XDP_REDIRECT`` on success, or the value of the two
+ lower bits of the ``flags`` argument if the map lookup fails.
+
+ More information about redirection can be found :doc:`redirect`
+
+.. c:function::
+     void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+ Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
+ helper.
+
+Userspace
+---------
+.. note::
+    DEVMAP entries can only be updated/deleted from user space and not
+    from an eBPF program. Trying to call these functions from a kernel eBPF
+    program will result in the program failing to load and a verifier warning.
+
+.. c:function::
+     int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
+
+ Net device entries can be added or updated using the ``bpf_map_update_elem()``
+ helper. This helper replaces existing elements atomically. The ``value`` parameter
+ can be ``struct bpf_devmap_val`` or a simple ``int ifindex`` for backwards
+ compatibility.
+
+ .. code-block:: c
+
+    struct bpf_devmap_val {
+        __u32 ifindex;   /* device index */
+        union {
+            int   fd;  /* prog fd on map write */
+            __u32 id;  /* prog id on map read */
+        } bpf_prog;
+    };
+
+ The ``flags`` argument can be one of the following:
+
+ - ``BPF_ANY``: Create a new element or update an existing element.
+ - ``BPF_NOEXIST``: Create a new element only if it did not exist.
+ - ``BPF_EXIST``: Update an existing element.
+
+ DEVMAPs can associate a program with a device entry by adding a ``bpf_prog.fd``
+ to ``struct bpf_devmap_val``. Programs are run after ``XDP_REDIRECT`` and have
+ access to both Rx device and Tx device. The program associated with the ``fd``
+ must have type XDP with expected attach type ``xdp_devmap``.
+ When a program is associated with a device index, the program is run on an
+ ``XDP_REDIRECT`` and before the buffer is added to the per-cpu queue. Examples
+ of how to attach/use xdp_devmap progs can be found in the kernel selftests:
+
+ - ``tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c``
+ - ``tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c``
+
+.. c:function::
+     int bpf_map_lookup_elem(int fd, const void *key, void *value);
+
+ Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
+ helper.
+
+.. c:function::
+     int bpf_map_delete_elem(int fd, const void *key);
+
+ Net device entries can be deleted using the ``bpf_map_delete_elem()``
+ helper. This helper will return 0 on success, or negative error in case of
+ failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP``
+called tx_port.
+
+.. code-block:: c
+
+    struct {
+        __uint(type, BPF_MAP_TYPE_DEVMAP);
+        __type(key, __u32);
+        __type(value, __u32);
+        __uint(max_entries, 256);
+    } tx_port SEC(".maps");
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP_HASH``
+called forward_map.
+
+.. code-block:: c
+
+    struct {
+        __uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
+        __type(key, __u32);
+        __type(value, struct bpf_devmap_val);
+        __uint(max_entries, 32);
+    } forward_map SEC(".maps");
+
+.. note::
+
+    The value type in the DEVMAP above is a ``struct bpf_devmap_val``
+
+The following code snippet shows a simple xdp_redirect_map program. This program
+would work with a user space program that populates the devmap ``forward_map`` based
+on ingress ifindexes. The BPF program (below) is redirecting packets using the
+ingress ``ifindex`` as the ``key``.
+
+.. code-block:: c
+
+    SEC("xdp")
+    int xdp_redirect_map_func(struct xdp_md *ctx)
+    {
+        int index = ctx->ingress_ifindex;
+
+        return bpf_redirect_map(&forward_map, index, 0);
+    }
+
+The following code snippet shows a BPF program that is broadcasting packets to
+all the interfaces in the ``tx_port`` devmap.
+
+.. code-block:: c
+
+    SEC("xdp")
+    int xdp_redirect_map_func(struct xdp_md *ctx)
+    {
+        return bpf_redirect_map(&tx_port, 0, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
+    }
+
+User space
+----------
+
+The following code snippet shows how to update a devmap called ``tx_port``.
+
+.. code-block:: c
+
+    int update_devmap(int ifindex, int redirect_ifindex)
+    {
+        int ret;
+
+        ret = bpf_map_update_elem(bpf_map__fd(tx_port), &ifindex, &redirect_ifindex, 0);
+        if (ret < 0) {
+            fprintf(stderr, "Failed to update devmap_ value: %s\n",
+                strerror(errno));
+        }
+
+        return ret;
+    }
+
+The following code snippet shows how to update a hash_devmap called ``forward_map``.
+
+.. code-block:: c
+
+    int update_devmap(int ifindex, int redirect_ifindex)
+    {
+        struct bpf_devmap_val devmap_val = { .ifindex = redirect_ifindex };
+        int ret;
+
+        ret = bpf_map_update_elem(bpf_map__fd(forward_map), &ifindex, &devmap_val, 0);
+        if (ret < 0) {
+            fprintf(stderr, "Failed to update devmap_ value: %s\n",
+                strerror(errno));
+        }
+        return ret;
+    }
+
+References
+===========
+
+- https://lwn.net/Articles/728146/
+- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=6f9d451ab1a33728adb72d7ff66a7b374d665176
+- https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4106
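Note on the return-value rule documented for ``bpf_redirect_map()`` in map_devmap.rst: the interplay between a failed lookup, the lower two bits of ``flags`` and ``BPF_F_BROADCAST`` can be hard to picture from prose alone. The following is a minimal user space sketch of that rule, not kernel code. Every ``sim_``/``SIM_`` name is hypothetical; the numeric values are assumed to match ``enum xdp_action`` and the ``bpf_redirect_map()`` flag definitions in ``include/uapi/linux/bpf.h``, and the array-backed map is a simplified stand-in for ``BPF_MAP_TYPE_DEVMAP``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror enum xdp_action in include/uapi/linux/bpf.h. */
enum sim_xdp_action {
    SIM_XDP_ABORTED  = 0,
    SIM_XDP_DROP     = 1,
    SIM_XDP_PASS     = 2,
    SIM_XDP_TX       = 3,
    SIM_XDP_REDIRECT = 4,
};

/* Assumed to mirror the bpf_redirect_map() flag bits in the same header. */
#define SIM_F_BROADCAST       (1ULL << 3)
#define SIM_F_EXCLUDE_INGRESS (1ULL << 4)
#define SIM_F_ACTION_MASK     0x3ULL /* lower two bits: fallback return code */

#define SIM_MAX_ENTRIES 8

/* Hypothetical stand-in for an array-backed devmap; ifindex 0 marks an empty slot. */
struct sim_devmap {
    uint32_t ifindex[SIM_MAX_ENTRIES];
    uint32_t max_entries;
};

/* Models only the documented return value, not the actual packet redirect. */
static long sim_redirect_map(const struct sim_devmap *map, uint32_t key,
                             uint64_t flags)
{
    assert(map != NULL && map->max_entries <= SIM_MAX_ENTRIES);

    /* The key is ignored when the packet is broadcast to every entry. */
    if (flags & SIM_F_BROADCAST)
        return SIM_XDP_REDIRECT;

    /* Array lookup: the key is the index; out-of-range or empty slots miss. */
    if (key < map->max_entries && map->ifindex[key] != 0)
        return SIM_XDP_REDIRECT;

    /* Lookup failed: hand back the caller-chosen action from the low bits. */
    return (long)(flags & SIM_F_ACTION_MASK);
}
```

Under this model, passing ``0`` as ``flags`` (as the ``xdp_redirect_map_func()`` example in the patch does) yields ``XDP_ABORTED`` on a missing entry, while passing ``XDP_PASS`` would let unmatched packets continue up the normal network stack.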
+81
Documentation/bpf/redirect.rst
···
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+========
+Redirect
+========
+XDP_REDIRECT
+############
+Supported maps
+--------------
+
+XDP_REDIRECT works with the following map types:
+
+- ``BPF_MAP_TYPE_DEVMAP``
+- ``BPF_MAP_TYPE_DEVMAP_HASH``
+- ``BPF_MAP_TYPE_CPUMAP``
+- ``BPF_MAP_TYPE_XSKMAP``
+
+For more information on these maps, please see the specific map documentation.
+
+Process
+-------
+
+.. kernel-doc:: net/core/filter.c
+   :doc: xdp redirect
+
+.. note::
+    Not all drivers support transmitting frames after a redirect, and for
+    those that do, not all of them support non-linear frames. Non-linear xdp
+    bufs/frames are bufs/frames that contain more than one fragment.
+
+Debugging packet drops
+----------------------
+Silent packet drops for XDP_REDIRECT can be debugged using:
+
+- bpf_trace
+- perf_record
+
+bpf_trace
+^^^^^^^^^
+The following bpftrace command can be used to capture and count all XDP tracepoints:
+
+.. code-block:: none
+
+    sudo bpftrace -e 'tracepoint:xdp:* { @cnt[probe] = count(); }'
+    Attaching 12 probes...
+    ^C
+
+    @cnt[tracepoint:xdp:mem_connect]: 18
+    @cnt[tracepoint:xdp:mem_disconnect]: 18
+    @cnt[tracepoint:xdp:xdp_exception]: 19605
+    @cnt[tracepoint:xdp:xdp_devmap_xmit]: 1393604
+    @cnt[tracepoint:xdp:xdp_redirect]: 22292200
+
+.. note::
+    The various xdp tracepoints can be found in ``source/include/trace/events/xdp.h``
+
+The following bpftrace command can be used to extract the ``ERRNO`` being returned as
+part of the err parameter:
+
+.. code-block:: none
+
+    sudo bpftrace -e \
+    'tracepoint:xdp:xdp_redirect*_err {@redir_errno[-args->err] = count();}
+    tracepoint:xdp:xdp_devmap_xmit {@devmap_errno[-args->err] = count();}'
+
+perf record
+^^^^^^^^^^^
+The perf tool also supports recording tracepoints:
+
+.. code-block:: none
+
+    perf record -a -e xdp:xdp_redirect_err \
+        -e xdp:xdp_redirect_map_err \
+        -e xdp:xdp_exception \
+        -e xdp:xdp_devmap_xmit
+
+References
+===========
+
+- https://github.com/xdp-project/xdp-tutorial/tree/master/tracing02-xdp-monitor
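Note on the errno extraction in redirect.rst: the bpftrace one-liner negates ``args->err`` because the tracepoints report a negative errno value. When post-processing such values in a user space tool, the same conversion might look like the sketch below. The helper name ``xdp_err_to_str()`` is hypothetical and not part of any kernel or libbpf API; the sketch only assumes the standard C ``strerror()``.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: turn the signed err value reported by an XDP
 * tracepoint into a human-readable string. A value of 0 means no error;
 * negative values are negated to obtain the positive errno number.
 */
static const char *xdp_err_to_str(int err)
{
    if (err == 0)
        return "no error";

    /* Accept an already-positive errno too, so callers may pass either form. */
    return strerror(err < 0 ? -err : err);
}
```

For example, ``xdp_err_to_str(-ENXIO)`` yields the same text as ``strerror(ENXIO)``; the exact wording depends on the C library and locale.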
+6 -2
net/core/filter.c
···
 	.arg2_type = ARG_ANYTHING,
 };
 
-/* XDP_REDIRECT works by a three-step process, implemented in the functions
+/**
+ * DOC: xdp redirect
+ *
+ * XDP_REDIRECT works by a three-step process, implemented in the functions
  * below:
  *
  * 1. The bpf_redirect() and bpf_redirect_map() helpers will lookup the target
···
  * 3. Before exiting its NAPI poll loop, the driver will call xdp_do_flush(),
  *    which will flush all the different bulk queues, thus completing the
  *    redirect.
- *
+ */
+/*
  * Pointers to the map entries will be kept around for this whole sequence of
  * steps, protected by RCU. However, there is no top-level rcu_read_lock() in
  * the core code; instead, the RCU protection relies on everything happening
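Note on the three-step process described in the filter.c comment: the flow of "record the target, enqueue into a bulk queue, flush before leaving the poll loop" can be sketched as a toy user space model. This is not the kernel implementation. All ``sim_`` names are hypothetical, the batch size of 16 is an assumption made for illustration, and step 2 (which the hunk above elides) is modelled here simply as an enqueue that flushes first when the queue is full.

```c
#include <stddef.h>

#define SIM_BULK_SIZE 16 /* assumed batch size, chosen for this sketch */

/* Step 1 state: the target remembered between the helper and the driver. */
struct sim_redirect_info {
    int tgt_index;
    int valid;
};

/* Per-target bulk queue; 'sent' counts frames handed to the device so far. */
struct sim_bulk_queue {
    int frames[SIM_BULK_SIZE];
    size_t count;
    size_t sent;
};

/* Step 1: the helper looks up and records the target for later use. */
static void sim_redirect(struct sim_redirect_info *ri, int tgt_index)
{
    ri->tgt_index = tgt_index;
    ri->valid = 1;
}

/* Step 3: flush everything queued so far, completing the redirect. */
static void sim_flush(struct sim_bulk_queue *bq)
{
    bq->sent += bq->count;
    bq->count = 0;
}

/* Step 2: the driver enqueues the frame, flushing first if the queue is full. */
static int sim_do_redirect(struct sim_redirect_info *ri,
                           struct sim_bulk_queue *bq, int frame)
{
    if (!ri->valid)
        return -1; /* no target was recorded by step 1 */
    ri->valid = 0;

    if (bq->count == SIM_BULK_SIZE)
        sim_flush(bq);
    bq->frames[bq->count++] = frame;
    return 0;
}
```

In this model, frames only leave the queue in batches: either when the queue fills up or when ``sim_flush()`` is called at the end of a poll cycle, which is why a missing final flush would leave frames stranded.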