
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-06-01

The following pull-request contains BPF updates for your *net-next* tree.

We've added 55 non-merge commits during the last 1 day(s) which contain
a total of 91 files changed, 4986 insertions(+), 463 deletions(-).

The main changes are:

1) Add rx_queue_mapping to bpf_sock from Amritha.

2) Add BPF ring buffer, from Andrii.

3) Attach and run programs through devmap, from David.

4) Allow SO_BINDTODEVICE opt in bpf_setsockopt, from Ferenc.

5) bpf_link-based flow_dissector attachment, from Jakub.

6) Use tracing helpers for lsm programs, from Jiri.

7) Several sk_msg fixes and extensions, from John.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

+5053 -542
+209
Documentation/bpf/ringbuf.rst
···
===============
BPF ring buffer
===============

This document describes BPF ring buffer design, API, and implementation details.

.. contents::
   :local:
   :depth: 2

Motivation
----------

There are two distinct motivations for this work, neither of which is satisfied
by the existing perf buffer, that prompted the creation of a new ring buffer
implementation:

- more efficient memory utilization by sharing a ring buffer across CPUs;
- preserving the ordering of events that happen sequentially in time, even
  across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but perf buffer satisfies neither. Both are
a result of the choice to have a per-CPU perf ring buffer, and both can be
solved by an MPSC (multi-producer, single-consumer) ring buffer implementation.
The ordering problem could technically be solved for perf buffer with some
in-kernel counting, but given that the first problem requires an MPSC buffer
anyway, the same solution solves the second problem automatically.

Semantics and APIs
------------------

A single ring buffer is presented to BPF programs as an instance of a BPF map
of type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives were considered, but
ultimately rejected.

One way would have been to, similar to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, make
``BPF_MAP_TYPE_RINGBUF`` represent an array of ring buffers, but without
enforcing the "same CPU only" rule. This would have been a more familiar
interface, compatible with existing perf buffer use in BPF, but it would fail
if an application needed more advanced logic to look up a ring buffer by an
arbitrary key. With the approach actually taken, ``BPF_MAP_TYPE_HASH_OF_MAPS``
addresses this case. Additionally, given the performance of BPF ringbuf, many
use cases will simply opt for a single ring buffer shared among all CPUs, for
which an array of buffers would be overkill.

Another approach could have introduced a new concept alongside the BPF map: a
generic "container" object that doesn't necessarily have a key/value interface
with lookup/update/delete operations. This approach would require a lot of
extra infrastructure to be built for observability and verifier support. It
would also add another concept for BPF developers to familiarize themselves
with, new syntax in libbpf, and so on, yet would provide no real additional
benefit over using a map. ``BPF_MAP_TYPE_RINGBUF`` doesn't support
lookup/update/delete operations, but neither do a few other map types (e.g.,
queue and stack; array doesn't support delete, etc.).

The chosen approach has the advantage of re-using existing BPF map
infrastructure (introspection APIs in the kernel, libbpf support, etc.), is a
familiar concept (no need to teach users a new type of object in a BPF
program), and makes use of existing tooling (bpftool). For the common scenario
of a single ring buffer shared by all CPUs, it is as simple and straightforward
as a dedicated "container" object would be. On the other hand, by being a map,
it can be combined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS`` map-in-maps to
implement a wide variety of topologies, from one ring buffer per CPU (e.g., as
a replacement for perf buffer use cases) to complicated application-level
hashing/sharding of ring buffers (e.g., a small pool of ring buffers with a
hash of the task's tgid as the lookup key, to preserve ordering while reducing
contention).

Key and value sizes are enforced to be zero. ``max_entries`` is used to specify
the size of the ring buffer and must be a power of 2.

BPF ring buffer and perf buffer (``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) share a
number of semantics:

- variable-length records;
- if there is no more space left in the ring buffer, reservation fails and
  nothing blocks;
- a memory-mappable data area for user-space applications, for ease of
  consumption and high performance;
- epoll notifications for newly arrived data;
- but also the ability to busy-poll for new data to achieve the lowest
  latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:

- ``bpf_ringbuf_output()`` *copies* data from one place into the ring buffer,
  similarly to ``bpf_perf_event_output()``;
- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()``
  split the process into two steps. First, a fixed amount of space is
  reserved. If successful, a pointer to data inside the ring buffer data area
  is returned, which BPF programs can use much like data inside array/hash
  maps. Once ready, this piece of memory is either submitted or discarded.
  A discard is similar to a submit, but makes the consumer ignore the record.

``bpf_ringbuf_output()`` has the disadvantage of incurring an extra memory
copy, because the record has to be prepared somewhere else first. In exchange,
it allows submitting records whose length is not known to the verifier
beforehand. It also closely matches ``bpf_perf_event_output()``, which
simplifies migration significantly.

``bpf_ringbuf_reserve()`` avoids the extra copy by handing out a pointer
directly into ring buffer memory. In many cases records are larger than BPF
stack space allows, so programs have had to use an extra per-CPU array as a
temporary heap for preparing a sample. ``bpf_ringbuf_reserve()`` avoids this
need completely. In exchange, it only allows reserving a constant size known
at verification time, so that the verifier can prove the BPF program cannot
access memory outside its reserved record space. ``bpf_ringbuf_output()``,
while slightly slower due to the extra memory copy, covers the use cases that
are not suitable for ``bpf_ringbuf_reserve()``.

The difference between submit and discard is very small. Discard just marks a
record as discarded, and such records are supposed to be ignored by consumer
code. Discard is useful for some advanced use cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary
``malloc()``/``free()`` within a single BPF program invocation.

Each reserved record is tracked by the verifier through its existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record and forget to submit or discard it.

The ``bpf_ringbuf_query()`` helper returns various properties of the ring
buffer. Currently 4 are supported:

- ``BPF_RB_AVAIL_DATA`` returns the amount of unconsumed data in the ring
  buffer;
- ``BPF_RB_RING_SIZE`` returns the size of the ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` return the current logical position
  of the consumer/producer, respectively.

The returned values are momentary snapshots of the ring buffer state and could
be off by the time the helper returns, so they should be used only for
debugging/reporting, or for heuristics that take the highly changeable nature
of some of these characteristics into account.

One such heuristic might involve more fine-grained control over poll/epoll
notifications of new data availability in the ring buffer. Together with the
``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for the
output/submit/discard helpers, it gives a BPF program a high degree of control
and, e.g., enables more efficient batched notifications. The default
self-balancing strategy, though, should be adequate for most applications and
already works reliably and efficiently.

Design and Implementation
-------------------------

The reserve/submit scheme naturally allows multiple producers, whether on
different CPUs or even on the same CPU in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if a BPF program is interrupted by another BPF program sharing the
same ring buffer, both will get a record reserved (provided there is enough
space left) and can work with it and submit it independently. This applies to
NMI context as well, except that because a spinlock is taken during
reservation, ``bpf_ringbuf_reserve()`` in NMI context might fail to acquire
the lock, in which case the reservation fails even if the ring buffer is not
full.

The ring buffer itself is internally implemented as a power-of-2 sized
circular buffer, with two logical, ever-increasing counters (which might wrap
around on 32-bit architectures; that's not a problem):

- the consumer counter shows the logical position up to which the consumer has
  consumed the data;
- the producer counter denotes the amount of data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record advances
the producer counter. At that point, the data is still not ready to be
consumed, though. Each record has an 8-byte header containing the length of
the reserved record, as well as two extra bits: a busy bit denoting that the
record is still being worked on, and a discard bit, which may be set at submit
time if the record is discarded. In the latter case, the consumer is supposed
to skip the record and move on to the next one. The record header also encodes
the record's relative offset from the beginning of the ring buffer data area
(in pages). This allows ``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()`` to
accept only the pointer to the record itself, without also requiring a pointer
to the ring buffer: the ring buffer's memory location is restored from the
record metadata header. This significantly simplifies the verifier and
improves API usability.

Producer counter increments are serialized under a spinlock, so there is
strict ordering between reservations. Submits, on the other hand, are
completely lockless and independent. All records become available to the
consumer in the order of their reservations, but only after all previous
records have been submitted. It is thus possible for slow producers to
temporarily hold back submitted records that were reserved later.

The reservation/submit/consumer protocol is verified by litmus tests in
Documentation/litmus_tests/bpf-rb/_.

One interesting implementation detail that significantly simplifies (and thus
also speeds up) both producers and consumers is that the data area is mapped
twice, contiguously back-to-back, in virtual memory. This makes it possible to
avoid any special handling of samples that wrap around at the end of the
circular buffer data area: the next page after the last data page is the first
data page again, so the sample still appears completely contiguous in virtual
memory. See the comment and a simple ASCII diagram showing this visually in
``bpf_ringbuf_area_alloc()``.

Another feature that distinguishes BPF ringbuf from the perf ring buffer is
self-pacing of notifications about new data availability.
``bpf_ringbuf_submit()`` sends a notification of a new record being available
only if the consumer has already caught up to the record being submitted. If
not, the consumer will still have to catch up and thus will see the new data
anyway, without needing an extra poll notification. Benchmarks (see
tools/testing/selftests/bpf/benchs/bench_ringbuf.c_) show that this achieves
very high throughput without resorting to tricks like "notify only every Nth
sample", which are necessary with the perf buffer. For extreme cases, when a
BPF program wants more manual control of notifications, the
submit/discard/output helpers accept ``BPF_RB_NO_WAKEUP`` and
``BPF_RB_FORCE_WAKEUP`` flags, which give full control over notifications of
data availability, but require extra caution and diligence when using this
API.
+1 -1
MAINTAINERS
···
 L:	bpf@vger.kernel.org
 S:	Maintained
 F:	include/net/xdp_sock*
-F:	include/net/xsk_buffer_pool.h
+F:	include/net/xsk_buff_pool.h
 F:	include/uapi/linux/if_xdp.h
 F:	net/xdp/
 F:	samples/bpf/xdpsock*
+1 -1
drivers/net/ethernet/amazon/ena/ena_netdev.c
···
 	dma_addr_t dma = 0;
 	u32 size;
 
-	tx_info->xdpf = convert_to_xdp_frame(xdp);
+	tx_info->xdpf = xdp_convert_buff_to_frame(xdp);
 	size = tx_info->xdpf->len;
 	ena_buf = tx_info->bufs;
 
+1 -1
drivers/net/ethernet/intel/i40e/i40e_txrx.c
···
 
 int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring)
 {
-	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
+	struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
 
 	if (unlikely(!xdpf))
 		return I40E_XDP_CONSUMED;
+1 -1
drivers/net/ethernet/intel/ice/ice_txrx_lib.c
···
  */
 int ice_xmit_xdp_buff(struct xdp_buff *xdp, struct ice_ring *xdp_ring)
 {
-	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
+	struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
 
 	if (unlikely(!xdpf))
 		return ICE_XDP_CONSUMED;
+1 -1
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
···
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		xdpf = convert_to_xdp_frame(xdp);
+		xdpf = xdp_convert_buff_to_frame(xdp);
 		if (unlikely(!xdpf)) {
 			result = IXGBE_XDP_CONSUMED;
 			break;
+1 -1
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
···
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		xdpf = convert_to_xdp_frame(xdp);
+		xdpf = xdp_convert_buff_to_frame(xdp);
 		if (unlikely(!xdpf)) {
 			result = IXGBE_XDP_CONSUMED;
 			break;
+1 -1
drivers/net/ethernet/marvell/mvneta.c
···
 	int cpu;
 	u32 ret;
 
-	xdpf = convert_to_xdp_frame(xdp);
+	xdpf = xdp_convert_buff_to_frame(xdp);
 	if (unlikely(!xdpf))
 		return MVNETA_XDP_DROPPED;
 
+5 -5
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
···
 	struct xdp_frame *xdpf;
 	dma_addr_t dma_addr;
 
-	xdpf = convert_to_xdp_frame(xdp);
+	xdpf = xdp_convert_buff_to_frame(xdp);
 	if (unlikely(!xdpf))
 		return false;
···
 		xdpi.frame.xdpf = xdpf;
 		xdpi.frame.dma_addr = dma_addr;
 	} else {
-		/* Driver assumes that convert_to_xdp_frame returns an xdp_frame
-		 * that points to the same memory region as the original
-		 * xdp_buff. It allows to map the memory only once and to use
-		 * the DMA_BIDIRECTIONAL mode.
+		/* Driver assumes that xdp_convert_buff_to_frame returns
+		 * an xdp_frame that points to the same memory region as
+		 * the original xdp_buff. It allows to map the memory only
+		 * once and to use the DMA_BIDIRECTIONAL mode.
 		 */
 
 		xdpi.mode = MLX5E_XDP_XMIT_MODE_PAGE;
+1 -1
drivers/net/ethernet/sfc/rx.c
···
 
 	case XDP_TX:
 		/* Buffer ownership passes to tx on success. */
-		xdpf = convert_to_xdp_frame(&xdp);
+		xdpf = xdp_convert_buff_to_frame(&xdp);
 		err = efx_xdp_tx_buffers(efx, 1, &xdpf, true);
 		if (unlikely(err != 1)) {
 			efx_free_rx_buffers(rx_queue, rx_buf, 1);
+1 -1
drivers/net/ethernet/socionext/netsec.c
···
 static u32 netsec_xdp_xmit_back(struct netsec_priv *priv, struct xdp_buff *xdp)
 {
 	struct netsec_desc_ring *tx_ring = &priv->desc_ring[NETSEC_RING_TX];
-	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
+	struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
 	u32 ret;
 
 	if (unlikely(!xdpf))
+1 -1
drivers/net/ethernet/ti/cpsw_priv.c
···
 		ret = CPSW_XDP_PASS;
 		break;
 	case XDP_TX:
-		xdpf = convert_to_xdp_frame(xdp);
+		xdpf = xdp_convert_buff_to_frame(xdp);
 		if (unlikely(!xdpf))
 			goto drop;
 
+1 -1
drivers/net/tun.c
···
 
 static int tun_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
 {
-	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+	struct xdp_frame *frame = xdp_convert_buff_to_frame(xdp);
 
 	if (unlikely(!frame))
 		return -EOVERFLOW;
+2 -6
drivers/net/veth.c
···
 static int veth_xdp_tx(struct veth_rq *rq, struct xdp_buff *xdp,
 		       struct veth_xdp_tx_bq *bq)
 {
-	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+	struct xdp_frame *frame = xdp_convert_buff_to_frame(xdp);
 
 	if (unlikely(!frame))
 		return -EOVERFLOW;
···
 	struct xdp_buff xdp;
 	u32 act;
 
-	xdp.data_hard_start = hard_start;
-	xdp.data = frame->data;
-	xdp.data_end = frame->data + frame->len;
-	xdp.data_meta = frame->data - frame->metasize;
-	xdp.frame_sz = frame->frame_sz;
+	xdp_convert_frame_to_buff(frame, &xdp);
 	xdp.rxq = &rq->xdp_rxq;
 
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+2 -2
drivers/net/virtio_net.c
···
 		break;
 	case XDP_TX:
 		stats->xdp_tx++;
-		xdpf = convert_to_xdp_frame(&xdp);
+		xdpf = xdp_convert_buff_to_frame(&xdp);
 		if (unlikely(!xdpf))
 			goto err_xdp;
 		err = virtnet_xdp_xmit(dev, 1, &xdpf, 0);
···
 		break;
 	case XDP_TX:
 		stats->xdp_tx++;
-		xdpf = convert_to_xdp_frame(&xdp);
+		xdpf = xdp_convert_buff_to_frame(&xdp);
 		if (unlikely(!xdpf))
 			goto err_xdp;
 		err = virtnet_xdp_xmit(dev, 1, &xdpf, 0);
+64
include/linux/bpf-netns.h
···
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BPF_NETNS_H
#define _BPF_NETNS_H

#include <linux/mutex.h>
#include <uapi/linux/bpf.h>

enum netns_bpf_attach_type {
	NETNS_BPF_INVALID = -1,
	NETNS_BPF_FLOW_DISSECTOR = 0,
	MAX_NETNS_BPF_ATTACH_TYPE
};

static inline enum netns_bpf_attach_type
to_netns_bpf_attach_type(enum bpf_attach_type attach_type)
{
	switch (attach_type) {
	case BPF_FLOW_DISSECTOR:
		return NETNS_BPF_FLOW_DISSECTOR;
	default:
		return NETNS_BPF_INVALID;
	}
}

/* Protects updates to netns_bpf */
extern struct mutex netns_bpf_mutex;

union bpf_attr;
struct bpf_prog;

#ifdef CONFIG_NET
int netns_bpf_prog_query(const union bpf_attr *attr,
			 union bpf_attr __user *uattr);
int netns_bpf_prog_attach(const union bpf_attr *attr,
			  struct bpf_prog *prog);
int netns_bpf_prog_detach(const union bpf_attr *attr);
int netns_bpf_link_create(const union bpf_attr *attr,
			  struct bpf_prog *prog);
#else
static inline int netns_bpf_prog_query(const union bpf_attr *attr,
				       union bpf_attr __user *uattr)
{
	return -EOPNOTSUPP;
}

static inline int netns_bpf_prog_attach(const union bpf_attr *attr,
					struct bpf_prog *prog)
{
	return -EOPNOTSUPP;
}

static inline int netns_bpf_prog_detach(const union bpf_attr *attr)
{
	return -EOPNOTSUPP;
}

static inline int netns_bpf_link_create(const union bpf_attr *attr,
					struct bpf_prog *prog)
{
	return -EOPNOTSUPP;
}
#endif

#endif /* _BPF_NETNS_H */
+21
include/linux/bpf.h
···
 	int (*map_direct_value_meta)(const struct bpf_map *map,
 				     u64 imm, u32 *off);
 	int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
+	__poll_t (*map_poll)(struct bpf_map *map, struct file *filp,
+			     struct poll_table_struct *pts);
 };
 
 struct bpf_map_memory {
···
 	ARG_PTR_TO_LONG,	/* pointer to long */
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock (fullsock) */
 	ARG_PTR_TO_BTF_ID,	/* pointer to in-kernel struct */
+	ARG_PTR_TO_ALLOC_MEM,	/* pointer to dynamically allocated memory */
+	ARG_PTR_TO_ALLOC_MEM_OR_NULL,	/* pointer to dynamically allocated memory or NULL */
+	ARG_CONST_ALLOC_SIZE_OR_ZERO,	/* number of allocated bytes requested */
 };
 
 /* type of values returned from helper functions */
···
 	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
 	RET_PTR_TO_TCP_SOCK_OR_NULL,	/* returns a pointer to a tcp_sock or NULL */
 	RET_PTR_TO_SOCK_COMMON_OR_NULL,	/* returns a pointer to a sock_common or NULL */
+	RET_PTR_TO_ALLOC_MEM_OR_NULL,	/* returns a pointer to dynamically allocated memory or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
···
 	PTR_TO_XDP_SOCK,	 /* reg points to struct xdp_sock */
 	PTR_TO_BTF_ID,		 /* reg points to kernel struct */
 	PTR_TO_BTF_ID_OR_NULL,	 /* reg points to kernel struct or NULL */
+	PTR_TO_MEM,		 /* reg points to valid memory region */
+	PTR_TO_MEM_OR_NULL,	 /* reg points to valid memory region or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
···
 			     struct net_device *dev_rx);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
+bool dev_map_can_have_prog(struct bpf_map *map);
 
 struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
 void __cpu_map_flush(void);
···
 						       u32 key)
 {
 	return NULL;
+}
+static inline bool dev_map_can_have_prog(struct bpf_map *map)
+{
+	return false;
 }
 
 static inline void __dev_flush(void)
···
 extern const struct bpf_func_proto bpf_jiffies64_proto;
 extern const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto;
 extern const struct bpf_func_proto bpf_event_output_data_proto;
+extern const struct bpf_func_proto bpf_ringbuf_output_proto;
+extern const struct bpf_func_proto bpf_ringbuf_reserve_proto;
+extern const struct bpf_func_proto bpf_ringbuf_submit_proto;
+extern const struct bpf_func_proto bpf_ringbuf_discard_proto;
+extern const struct bpf_func_proto bpf_ringbuf_query_proto;
 
 const struct bpf_func_proto *bpf_tracing_func_proto(
 	enum bpf_func_id func_id, const struct bpf_prog *prog);
+
+const struct bpf_func_proto *tracing_prog_func_proto(
+	enum bpf_func_id func_id, const struct bpf_prog *prog);
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
+4
include/linux/bpf_types.h
···
 #if defined(CONFIG_BPF_JIT)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
···
 BPF_LINK_TYPE(BPF_LINK_TYPE_CGROUP, cgroup)
 #endif
 BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
+#ifdef CONFIG_NET
+BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
+#endif
+4
include/linux/bpf_verifier.h
···
 
 		u32 btf_id; /* for PTR_TO_BTF_ID */
 
+		u32 mem_size; /* for PTR_TO_MEM | PTR_TO_MEM_OR_NULL */
+
 		/* Max size from any of the above. */
 		unsigned long raw;
 	};
···
 	 * offset, so they can share range knowledge.
 	 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
 	 * came from, when one is tested for != NULL.
+	 * For PTR_TO_MEM_OR_NULL this is used to identify memory allocation
+	 * for the purpose of tracking that it's freed.
 	 * For PTR_TO_SOCKET this is used to share which pointers retain the
 	 * same reference to the socket, to determine proper reference freeing.
 	 */
-26
include/linux/skbuff.h
···
 				  const struct flow_dissector_key *key,
 				  unsigned int key_count);
 
-#ifdef CONFIG_NET
-int skb_flow_dissector_prog_query(const union bpf_attr *attr,
-				  union bpf_attr __user *uattr);
-int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
-				       struct bpf_prog *prog);
-
-int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
-#else
-static inline int skb_flow_dissector_prog_query(const union bpf_attr *attr,
-						union bpf_attr __user *uattr)
-{
-	return -EOPNOTSUPP;
-}
-
-static inline int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
-						     struct bpf_prog *prog)
-{
-	return -EOPNOTSUPP;
-}
-
-static inline int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
-{
-	return -EOPNOTSUPP;
-}
-#endif
-
 struct bpf_flow_dissector;
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
 		      __be16 proto, int nhoff, int hlen, unsigned int flags);
+8
include/linux/skmsg.h
···
 	psock_set_prog(&progs->skb_verdict, NULL);
 }
 
+int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb);
+
+static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
+{
+	if (!psock)
+		return false;
+	return psock->parser.enabled;
+}
 #endif /* _LINUX_SKMSG_H */
+6
include/net/flow_dissector.h
···
 #include <linux/string.h>
 #include <uapi/linux/if_ether.h>
 
+struct bpf_prog;
+struct net;
 struct sk_buff;
 
 /**
···
 	memset(key_control, 0, sizeof(*key_control));
 	memset(key_basic, 0, sizeof(*key_basic));
 }
+
+#ifdef CONFIG_BPF_SYSCALL
+int flow_dissector_bpf_prog_attach(struct net *net, struct bpf_prog *prog);
+#endif /* CONFIG_BPF_SYSCALL */
 
 #endif
+3 -1
include/net/net_namespace.h
···
 #include <net/netns/mpls.h>
 #include <net/netns/can.h>
 #include <net/netns/xdp.h>
+#include <net/netns/bpf.h>
 #include <linux/ns_common.h>
 #include <linux/idr.h>
 #include <linux/skbuff.h>
···
 #endif
 	struct net_generic __rcu	*gen;
 
-	struct bpf_prog __rcu	*flow_dissector_prog;
+	/* Used to store attached BPF programs */
+	struct netns_bpf	bpf;
 
 	/* Note : following structs are cache line aligned */
 #ifdef CONFIG_XFRM
+18
include/net/netns/bpf.h
···
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * BPF programs attached to network namespace
 */

#ifndef __NETNS_BPF_H__
#define __NETNS_BPF_H__

#include <linux/bpf-netns.h>

struct bpf_prog;

struct netns_bpf {
	struct bpf_prog __rcu *progs[MAX_NETNS_BPF_ATTACH_TYPE];
	struct bpf_link *links[MAX_NETNS_BPF_ATTACH_TYPE];
};

#endif /* __NETNS_BPF_H__ */
+1 -1
include/net/sock.h
···
 
 void sock_def_readable(struct sock *sk);
 
-int sock_bindtoindex(struct sock *sk, int ifindex);
+int sock_bindtoindex(struct sock *sk, int ifindex, bool lock_sk);
 void sock_enable_timestamps(struct sock *sk);
 void sock_no_linger(struct sock *sk);
 void sock_set_keepalive(struct sock *sk);
+9
include/net/tls.h
···
 	return !!tls_sw_ctx_tx(ctx);
 }
 
+static inline bool tls_sw_has_ctx_rx(const struct sock *sk)
+{
+	struct tls_context *ctx = tls_get_ctx(sk);
+
+	if (!ctx)
+		return false;
+	return !!tls_sw_ctx_rx(ctx);
+}
+
 void tls_sw_write_space(struct sock *sk, struct tls_context *ctx);
 void tls_device_write_space(struct sock *sk, struct tls_context *ctx);
 
+16 -1
include/net/xdp.h
···
 	struct xdp_mem_info mem;
 } ____cacheline_aligned; /* perf critical, avoid false-sharing */
 
+struct xdp_txq_info {
+	struct net_device *dev;
+};
+
 struct xdp_buff {
 	void *data;
 	void *data_end;
 	void *data_meta;
 	void *data_hard_start;
 	struct xdp_rxq_info *rxq;
+	struct xdp_txq_info *txq;
 	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
 };
 
···
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
 
+static inline
+void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
+{
+	xdp->data_hard_start = frame->data - frame->headroom - sizeof(*frame);
+	xdp->data = frame->data;
+	xdp->data_end = frame->data + frame->len;
+	xdp->data_meta = frame->data - frame->metasize;
+	xdp->frame_sz = frame->frame_sz;
+}
+
 /* Convert xdp_buff to xdp_frame */
 static inline
-struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
+struct xdp_frame *xdp_convert_buff_to_frame(struct xdp_buff *xdp)
 {
 	struct xdp_frame *xdp_frame;
 	int metasize;
+94 -1
include/uapi/linux/bpf.h
··· 147 147 BPF_MAP_TYPE_SK_STORAGE, 148 148 BPF_MAP_TYPE_DEVMAP_HASH, 149 149 BPF_MAP_TYPE_STRUCT_OPS, 150 + BPF_MAP_TYPE_RINGBUF, 150 151 }; 151 152 152 153 /* Note that tracing related programs such as ··· 225 224 BPF_CGROUP_INET6_GETPEERNAME, 226 225 BPF_CGROUP_INET4_GETSOCKNAME, 227 226 BPF_CGROUP_INET6_GETSOCKNAME, 227 + BPF_XDP_DEVMAP, 228 228 __MAX_BPF_ATTACH_TYPE 229 229 }; 230 230 ··· 237 235 BPF_LINK_TYPE_TRACING = 2, 238 236 BPF_LINK_TYPE_CGROUP = 3, 239 237 BPF_LINK_TYPE_ITER = 4, 238 + BPF_LINK_TYPE_NETNS = 5, 240 239 241 240 MAX_BPF_LINK_TYPE, 242 241 }; ··· 3160 3157 * **bpf_sk_cgroup_id**\ (). 3161 3158 * Return 3162 3159 * The id is returned or 0 in case the id could not be retrieved. 3160 + * 3161 + * void *bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags) 3162 + * Description 3163 + * Copy *size* bytes from *data* into a ring buffer *ringbuf*. 3164 + * If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of 3165 + * new data availability is sent. 3166 + * IF BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of 3167 + * new data availability is sent unconditionally. 3168 + * Return 3169 + * 0, on success; 3170 + * < 0, on error. 3171 + * 3172 + * void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags) 3173 + * Description 3174 + * Reserve *size* bytes of payload in a ring buffer *ringbuf*. 3175 + * Return 3176 + * Valid pointer with *size* bytes of memory available; NULL, 3177 + * otherwise. 3178 + * 3179 + * void bpf_ringbuf_submit(void *data, u64 flags) 3180 + * Description 3181 + * Submit reserved ring buffer sample, pointed to by *data*. 3182 + * If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of 3183 + * new data availability is sent. 3184 + * IF BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of 3185 + * new data availability is sent unconditionally. 3186 + * Return 3187 + * Nothing. Always succeeds. 
3188 + * 3189 + * void bpf_ringbuf_discard(void *data, u64 flags) 3190 + * Description 3191 + * Discard reserved ring buffer sample, pointed to by *data*. 3192 + * If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of 3193 + * new data availability is sent. 3194 + * If BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of 3195 + * new data availability is sent unconditionally. 3196 + * Return 3197 + * Nothing. Always succeeds. 3198 + * 3199 + * u64 bpf_ringbuf_query(void *ringbuf, u64 flags) 3200 + * Description 3201 + * Query various characteristics of provided ring buffer. What 3202 + * exactly is queried is determined by *flags*: 3203 + * - BPF_RB_AVAIL_DATA - amount of data not yet consumed; 3204 + * - BPF_RB_RING_SIZE - the size of ring buffer; 3205 + * - BPF_RB_CONS_POS - consumer position (can wrap around); 3206 + * - BPF_RB_PROD_POS - producer(s) position (can wrap around); 3207 + * Data returned is just a momentary snapshot of actual values 3208 + * and could be inaccurate, so this facility should be used to 3209 + * power heuristics and for reporting, not to make 100% correct 3210 + * calculations. 3211 + * Return 3212 + * Requested value, or 0, if flags are not recognized. 3163 3213 */ 3164 3214 #define __BPF_FUNC_MAPPER(FN) \ 3165 3215 FN(unspec), \ ··· 3344 3288 FN(seq_printf), \ 3345 3289 FN(seq_write), \ 3346 3290 FN(sk_cgroup_id), \ 3347 - FN(sk_ancestor_cgroup_id), 3291 + FN(sk_ancestor_cgroup_id), \ 3292 + FN(ringbuf_output), \ 3293 + FN(ringbuf_reserve), \ 3294 + FN(ringbuf_submit), \ 3295 + FN(ringbuf_discard), \ 3296 + FN(ringbuf_query), 3348 3297 3349 3298 /* integer value in 'imm' field of BPF_CALL instruction selects which helper 3350 3299 * function eBPF program intends to call ··· 3457 3396 /* BPF_FUNC_read_branch_records flags. 
*/ 3458 3397 enum { 3459 3398 BPF_F_GET_BRANCH_RECORDS_SIZE = (1ULL << 0), 3399 + }; 3400 + 3401 + /* BPF_FUNC_bpf_ringbuf_commit, BPF_FUNC_bpf_ringbuf_discard, and 3402 + * BPF_FUNC_bpf_ringbuf_output flags. 3403 + */ 3404 + enum { 3405 + BPF_RB_NO_WAKEUP = (1ULL << 0), 3406 + BPF_RB_FORCE_WAKEUP = (1ULL << 1), 3407 + }; 3408 + 3409 + /* BPF_FUNC_bpf_ringbuf_query flags */ 3410 + enum { 3411 + BPF_RB_AVAIL_DATA = 0, 3412 + BPF_RB_RING_SIZE = 1, 3413 + BPF_RB_CONS_POS = 2, 3414 + BPF_RB_PROD_POS = 3, 3415 + }; 3416 + 3417 + /* BPF ring buffer constants */ 3418 + enum { 3419 + BPF_RINGBUF_BUSY_BIT = (1U << 31), 3420 + BPF_RINGBUF_DISCARD_BIT = (1U << 30), 3421 + BPF_RINGBUF_HDR_SZ = 8, 3460 3422 }; 3461 3423 3462 3424 /* Mode for BPF_FUNC_skb_adjust_room helper. */ ··· 3614 3530 __u32 dst_ip4; 3615 3531 __u32 dst_ip6[4]; 3616 3532 __u32 state; 3533 + __s32 rx_queue_mapping; 3617 3534 }; 3618 3535 3619 3536 struct bpf_tcp_sock { ··· 3708 3623 /* Below access go through struct xdp_rxq_info */ 3709 3624 __u32 ingress_ifindex; /* rxq->dev->ifindex */ 3710 3625 __u32 rx_queue_index; /* rxq->queue_index */ 3626 + 3627 + __u32 egress_ifindex; /* txq->dev->ifindex */ 3711 3628 }; 3712 3629 3713 3630 enum sk_action { ··· 3732 3645 __u32 remote_port; /* Stored in network byte order */ 3733 3646 __u32 local_port; /* stored in host byte order */ 3734 3647 __u32 size; /* Total size of sk_msg */ 3648 + 3649 + __bpf_md_ptr(struct bpf_sock *, sk); /* current socket */ 3735 3650 }; 3736 3651 3737 3652 struct sk_reuseport_md { ··· 3840 3751 __u64 cgroup_id; 3841 3752 __u32 attach_type; 3842 3753 } cgroup; 3754 + struct { 3755 + __u32 netns_ino; 3756 + __u32 attach_type; 3757 + } netns; 3843 3758 }; 3844 3759 } __attribute__((aligned(8))); 3845 3760
+2 -1
kernel/bpf/Makefile
··· 4 4 5 5 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o 6 6 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o 7 - obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o 7 + obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o 8 8 obj-$(CONFIG_BPF_SYSCALL) += disasm.o 9 9 obj-$(CONFIG_BPF_JIT) += trampoline.o 10 10 obj-$(CONFIG_BPF_SYSCALL) += btf.o ··· 13 13 obj-$(CONFIG_BPF_SYSCALL) += devmap.o 14 14 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o 15 15 obj-$(CONFIG_BPF_SYSCALL) += offload.o 16 + obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o 16 17 endif 17 18 ifeq ($(CONFIG_PERF_EVENTS),y) 18 19 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
+1 -1
kernel/bpf/bpf_lsm.c
··· 49 49 }; 50 50 51 51 const struct bpf_verifier_ops lsm_verifier_ops = { 52 - .get_func_proto = bpf_tracing_func_proto, 52 + .get_func_proto = tracing_prog_func_proto, 53 53 .is_valid_access = btf_ctx_access, 54 54 };
+1 -1
kernel/bpf/cgroup.c
··· 595 595 mutex_lock(&cgroup_mutex); 596 596 /* link might have been auto-released by dying cgroup, so fail */ 597 597 if (!cg_link->cgroup) { 598 - ret = -EINVAL; 598 + ret = -ENOLINK; 599 599 goto out_unlock; 600 600 } 601 601 if (old_prog && link->prog != old_prog) {
+1 -1
kernel/bpf/core.c
··· 1543 1543 1544 1544 /* ARG1 at this point is guaranteed to point to CTX from 1545 1545 * the verifier side due to the fact that the tail call is 1546 - * handeled like a helper, that is, bpf_tail_call_proto, 1546 + * handled like a helper, that is, bpf_tail_call_proto, 1547 1547 * where arg1_type is ARG_PTR_TO_CTX. 1548 1548 */ 1549 1549 insn = prog->insnsi;
+1 -1
kernel/bpf/cpumap.c
··· 621 621 { 622 622 struct xdp_frame *xdpf; 623 623 624 - xdpf = convert_to_xdp_frame(xdp); 624 + xdpf = xdp_convert_buff_to_frame(xdp); 625 625 if (unlikely(!xdpf)) 626 626 return -EOVERFLOW; 627 627
+113 -19
kernel/bpf/devmap.c
··· 60 60 unsigned int count; 61 61 }; 62 62 63 + /* DEVMAP values */ 64 + struct bpf_devmap_val { 65 + u32 ifindex; /* device index */ 66 + union { 67 + int fd; /* prog fd on map write */ 68 + u32 id; /* prog id on map read */ 69 + } bpf_prog; 70 + }; 71 + 63 72 struct bpf_dtab_netdev { 64 73 struct net_device *dev; /* must be first member, due to tracepoint */ 65 74 struct hlist_node index_hlist; 66 75 struct bpf_dtab *dtab; 76 + struct bpf_prog *xdp_prog; 67 77 struct rcu_head rcu; 68 78 unsigned int idx; 79 + struct bpf_devmap_val val; 69 80 }; 70 81 71 82 struct bpf_dtab { ··· 116 105 117 106 static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr) 118 107 { 108 + u32 valsize = attr->value_size; 119 109 u64 cost = 0; 120 110 int err; 121 111 122 - /* check sanity of attributes */ 112 + /* check sanity of attributes. 2 value sizes supported: 113 + * 4 bytes: ifindex 114 + * 8 bytes: ifindex + prog fd 115 + */ 123 116 if (attr->max_entries == 0 || attr->key_size != 4 || 124 - attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK) 117 + (valsize != offsetofend(struct bpf_devmap_val, ifindex) && 118 + valsize != offsetofend(struct bpf_devmap_val, bpf_prog.fd)) || 119 + attr->map_flags & ~DEV_CREATE_FLAG_MASK) 125 120 return -EINVAL; 126 121 127 122 /* Lookup returns a pointer straight to dev->ifindex, so make sure the ··· 234 217 235 218 hlist_for_each_entry_safe(dev, next, head, index_hlist) { 236 219 hlist_del_rcu(&dev->index_hlist); 220 + if (dev->xdp_prog) 221 + bpf_prog_put(dev->xdp_prog); 237 222 dev_put(dev->dev); 238 223 kfree(dev); 239 224 } ··· 250 231 if (!dev) 251 232 continue; 252 233 234 + if (dev->xdp_prog) 235 + bpf_prog_put(dev->xdp_prog); 253 236 dev_put(dev->dev); 254 237 kfree(dev); 255 238 } ··· 336 315 } 337 316 338 317 return -ENOENT; 318 + } 319 + 320 + bool dev_map_can_have_prog(struct bpf_map *map) 321 + { 322 + if ((map->map_type == BPF_MAP_TYPE_DEVMAP || 323 + map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) && 324 
+ map->value_size != offsetofend(struct bpf_devmap_val, ifindex)) 325 + return true; 326 + 327 + return false; 339 328 } 340 329 341 330 static int bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags) ··· 465 434 if (unlikely(err)) 466 435 return err; 467 436 468 - xdpf = convert_to_xdp_frame(xdp); 437 + xdpf = xdp_convert_buff_to_frame(xdp); 469 438 if (unlikely(!xdpf)) 470 439 return -EOVERFLOW; 471 440 472 441 return bq_enqueue(dev, xdpf, dev_rx); 442 + } 443 + 444 + static struct xdp_buff *dev_map_run_prog(struct net_device *dev, 445 + struct xdp_buff *xdp, 446 + struct bpf_prog *xdp_prog) 447 + { 448 + struct xdp_txq_info txq = { .dev = dev }; 449 + u32 act; 450 + 451 + xdp->txq = &txq; 452 + 453 + act = bpf_prog_run_xdp(xdp_prog, xdp); 454 + switch (act) { 455 + case XDP_PASS: 456 + return xdp; 457 + case XDP_DROP: 458 + break; 459 + default: 460 + bpf_warn_invalid_xdp_action(act); 461 + fallthrough; 462 + case XDP_ABORTED: 463 + trace_xdp_exception(dev, xdp_prog, act); 464 + break; 465 + } 466 + 467 + xdp_return_buff(xdp); 468 + return NULL; 473 469 } 474 470 475 471 int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp, ··· 510 452 { 511 453 struct net_device *dev = dst->dev; 512 454 455 + if (dst->xdp_prog) { 456 + xdp = dev_map_run_prog(dev, xdp, dst->xdp_prog); 457 + if (!xdp) 458 + return 0; 459 + } 513 460 return __xdp_enqueue(dev, xdp, dev_rx); 514 461 } 515 462 ··· 535 472 static void *dev_map_lookup_elem(struct bpf_map *map, void *key) 536 473 { 537 474 struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key); 538 - struct net_device *dev = obj ? obj->dev : NULL; 539 475 540 - return dev ? &dev->ifindex : NULL; 476 + return obj ? &obj->val : NULL; 541 477 } 542 478 543 479 static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key) 544 480 { 545 481 struct bpf_dtab_netdev *obj = __dev_map_hash_lookup_elem(map, 546 482 *(u32 *)key); 547 - struct net_device *dev = obj ? obj->dev : NULL; 548 - 549 - return dev ? 
&dev->ifindex : NULL; 483 + return obj ? &obj->val : NULL; 550 484 } 551 485 552 486 static void __dev_map_entry_free(struct rcu_head *rcu) ··· 551 491 struct bpf_dtab_netdev *dev; 552 492 553 493 dev = container_of(rcu, struct bpf_dtab_netdev, rcu); 494 + if (dev->xdp_prog) 495 + bpf_prog_put(dev->xdp_prog); 554 496 dev_put(dev->dev); 555 497 kfree(dev); 556 498 } ··· 603 541 604 542 static struct bpf_dtab_netdev *__dev_map_alloc_node(struct net *net, 605 543 struct bpf_dtab *dtab, 606 - u32 ifindex, 544 + struct bpf_devmap_val *val, 607 545 unsigned int idx) 608 546 { 547 + struct bpf_prog *prog = NULL; 609 548 struct bpf_dtab_netdev *dev; 610 549 611 550 dev = kmalloc_node(sizeof(*dev), GFP_ATOMIC | __GFP_NOWARN, ··· 614 551 if (!dev) 615 552 return ERR_PTR(-ENOMEM); 616 553 617 - dev->dev = dev_get_by_index(net, ifindex); 618 - if (!dev->dev) { 619 - kfree(dev); 620 - return ERR_PTR(-EINVAL); 554 + dev->dev = dev_get_by_index(net, val->ifindex); 555 + if (!dev->dev) 556 + goto err_out; 557 + 558 + if (val->bpf_prog.fd >= 0) { 559 + prog = bpf_prog_get_type_dev(val->bpf_prog.fd, 560 + BPF_PROG_TYPE_XDP, false); 561 + if (IS_ERR(prog)) 562 + goto err_put_dev; 563 + if (prog->expected_attach_type != BPF_XDP_DEVMAP) 564 + goto err_put_prog; 621 565 } 622 566 623 567 dev->idx = idx; 624 568 dev->dtab = dtab; 569 + if (prog) { 570 + dev->xdp_prog = prog; 571 + dev->val.bpf_prog.id = prog->aux->id; 572 + } else { 573 + dev->xdp_prog = NULL; 574 + dev->val.bpf_prog.id = 0; 575 + } 576 + dev->val.ifindex = val->ifindex; 625 577 626 578 return dev; 579 + err_put_prog: 580 + bpf_prog_put(prog); 581 + err_put_dev: 582 + dev_put(dev->dev); 583 + err_out: 584 + kfree(dev); 585 + return ERR_PTR(-EINVAL); 627 586 } 628 587 629 588 static int __dev_map_update_elem(struct net *net, struct bpf_map *map, 630 589 void *key, void *value, u64 map_flags) 631 590 { 632 591 struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); 592 + struct bpf_devmap_val val = { .bpf_prog.fd 
= -1 }; 633 593 struct bpf_dtab_netdev *dev, *old_dev; 634 - u32 ifindex = *(u32 *)value; 635 594 u32 i = *(u32 *)key; 636 595 637 596 if (unlikely(map_flags > BPF_EXIST)) ··· 663 578 if (unlikely(map_flags == BPF_NOEXIST)) 664 579 return -EEXIST; 665 580 666 - if (!ifindex) { 581 + /* already verified value_size <= sizeof val */ 582 + memcpy(&val, value, map->value_size); 583 + 584 + if (!val.ifindex) { 667 585 dev = NULL; 586 + /* can not specify fd if ifindex is 0 */ 587 + if (val.bpf_prog.fd != -1) 588 + return -EINVAL; 668 589 } else { 669 - dev = __dev_map_alloc_node(net, dtab, ifindex, i); 590 + dev = __dev_map_alloc_node(net, dtab, &val, i); 670 591 if (IS_ERR(dev)) 671 592 return PTR_ERR(dev); 672 593 } ··· 699 608 void *key, void *value, u64 map_flags) 700 609 { 701 610 struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); 611 + struct bpf_devmap_val val = { .bpf_prog.fd = -1 }; 702 612 struct bpf_dtab_netdev *dev, *old_dev; 703 - u32 ifindex = *(u32 *)value; 704 613 u32 idx = *(u32 *)key; 705 614 unsigned long flags; 706 615 int err = -EEXIST; 707 616 708 - if (unlikely(map_flags > BPF_EXIST || !ifindex)) 617 + /* already verified value_size <= sizeof val */ 618 + memcpy(&val, value, map->value_size); 619 + 620 + if (unlikely(map_flags > BPF_EXIST || !val.ifindex)) 709 621 return -EINVAL; 710 622 711 623 spin_lock_irqsave(&dtab->index_lock, flags); ··· 717 623 if (old_dev && (map_flags & BPF_NOEXIST)) 718 624 goto out_err; 719 625 720 - dev = __dev_map_alloc_node(net, dtab, ifindex, idx); 626 + dev = __dev_map_alloc_node(net, dtab, &val, idx); 721 627 if (IS_ERR(dev)) { 722 628 err = PTR_ERR(dev); 723 629 goto out_err;
+34
kernel/bpf/helpers.c
··· 601 601 .arg5_type = ARG_CONST_SIZE_OR_ZERO, 602 602 }; 603 603 604 + const struct bpf_func_proto bpf_get_current_task_proto __weak; 605 + const struct bpf_func_proto bpf_probe_read_user_proto __weak; 606 + const struct bpf_func_proto bpf_probe_read_user_str_proto __weak; 607 + const struct bpf_func_proto bpf_probe_read_kernel_proto __weak; 608 + const struct bpf_func_proto bpf_probe_read_kernel_str_proto __weak; 609 + 604 610 const struct bpf_func_proto * 605 611 bpf_base_func_proto(enum bpf_func_id func_id) 606 612 { ··· 635 629 return &bpf_ktime_get_ns_proto; 636 630 case BPF_FUNC_ktime_get_boot_ns: 637 631 return &bpf_ktime_get_boot_ns_proto; 632 + case BPF_FUNC_ringbuf_output: 633 + return &bpf_ringbuf_output_proto; 634 + case BPF_FUNC_ringbuf_reserve: 635 + return &bpf_ringbuf_reserve_proto; 636 + case BPF_FUNC_ringbuf_submit: 637 + return &bpf_ringbuf_submit_proto; 638 + case BPF_FUNC_ringbuf_discard: 639 + return &bpf_ringbuf_discard_proto; 640 + case BPF_FUNC_ringbuf_query: 641 + return &bpf_ringbuf_query_proto; 638 642 default: 639 643 break; 640 644 } ··· 663 647 return bpf_get_trace_printk_proto(); 664 648 case BPF_FUNC_jiffies64: 665 649 return &bpf_jiffies64_proto; 650 + default: 651 + break; 652 + } 653 + 654 + if (!perfmon_capable()) 655 + return NULL; 656 + 657 + switch (func_id) { 658 + case BPF_FUNC_get_current_task: 659 + return &bpf_get_current_task_proto; 660 + case BPF_FUNC_probe_read_user: 661 + return &bpf_probe_read_user_proto; 662 + case BPF_FUNC_probe_read_kernel: 663 + return &bpf_probe_read_kernel_proto; 664 + case BPF_FUNC_probe_read_user_str: 665 + return &bpf_probe_read_user_str_proto; 666 + case BPF_FUNC_probe_read_kernel_str: 667 + return &bpf_probe_read_kernel_str_proto; 666 668 default: 667 669 return NULL; 668 670 }
+373
kernel/bpf/net_namespace.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/bpf.h> 4 + #include <linux/filter.h> 5 + #include <net/net_namespace.h> 6 + 7 + /* 8 + * Functions to manage BPF programs attached to netns 9 + */ 10 + 11 + struct bpf_netns_link { 12 + struct bpf_link link; 13 + enum bpf_attach_type type; 14 + enum netns_bpf_attach_type netns_type; 15 + 16 + /* We don't hold a ref to net in order to auto-detach the link 17 + * when netns is going away. Instead we rely on pernet 18 + * pre_exit callback to clear this pointer. Must be accessed 19 + * with netns_bpf_mutex held. 20 + */ 21 + struct net *net; 22 + }; 23 + 24 + /* Protects updates to netns_bpf */ 25 + DEFINE_MUTEX(netns_bpf_mutex); 26 + 27 + /* Must be called with netns_bpf_mutex held. */ 28 + static void __net_exit bpf_netns_link_auto_detach(struct bpf_link *link) 29 + { 30 + struct bpf_netns_link *net_link = 31 + container_of(link, struct bpf_netns_link, link); 32 + 33 + net_link->net = NULL; 34 + } 35 + 36 + static void bpf_netns_link_release(struct bpf_link *link) 37 + { 38 + struct bpf_netns_link *net_link = 39 + container_of(link, struct bpf_netns_link, link); 40 + enum netns_bpf_attach_type type = net_link->netns_type; 41 + struct net *net; 42 + 43 + /* Link auto-detached by dying netns. */ 44 + if (!net_link->net) 45 + return; 46 + 47 + mutex_lock(&netns_bpf_mutex); 48 + 49 + /* Recheck after potential sleep. We can race with cleanup_net 50 + * here, but if we see a non-NULL struct net pointer pre_exit 51 + * has not happened yet and will block on netns_bpf_mutex. 
52 + */ 53 + net = net_link->net; 54 + if (!net) 55 + goto out_unlock; 56 + 57 + net->bpf.links[type] = NULL; 58 + RCU_INIT_POINTER(net->bpf.progs[type], NULL); 59 + 60 + out_unlock: 61 + mutex_unlock(&netns_bpf_mutex); 62 + } 63 + 64 + static void bpf_netns_link_dealloc(struct bpf_link *link) 65 + { 66 + struct bpf_netns_link *net_link = 67 + container_of(link, struct bpf_netns_link, link); 68 + 69 + kfree(net_link); 70 + } 71 + 72 + static int bpf_netns_link_update_prog(struct bpf_link *link, 73 + struct bpf_prog *new_prog, 74 + struct bpf_prog *old_prog) 75 + { 76 + struct bpf_netns_link *net_link = 77 + container_of(link, struct bpf_netns_link, link); 78 + enum netns_bpf_attach_type type = net_link->netns_type; 79 + struct net *net; 80 + int ret = 0; 81 + 82 + if (old_prog && old_prog != link->prog) 83 + return -EPERM; 84 + if (new_prog->type != link->prog->type) 85 + return -EINVAL; 86 + 87 + mutex_lock(&netns_bpf_mutex); 88 + 89 + net = net_link->net; 90 + if (!net || !check_net(net)) { 91 + /* Link auto-detached or netns dying */ 92 + ret = -ENOLINK; 93 + goto out_unlock; 94 + } 95 + 96 + old_prog = xchg(&link->prog, new_prog); 97 + rcu_assign_pointer(net->bpf.progs[type], new_prog); 98 + bpf_prog_put(old_prog); 99 + 100 + out_unlock: 101 + mutex_unlock(&netns_bpf_mutex); 102 + return ret; 103 + } 104 + 105 + static int bpf_netns_link_fill_info(const struct bpf_link *link, 106 + struct bpf_link_info *info) 107 + { 108 + const struct bpf_netns_link *net_link = 109 + container_of(link, struct bpf_netns_link, link); 110 + unsigned int inum = 0; 111 + struct net *net; 112 + 113 + mutex_lock(&netns_bpf_mutex); 114 + net = net_link->net; 115 + if (net && check_net(net)) 116 + inum = net->ns.inum; 117 + mutex_unlock(&netns_bpf_mutex); 118 + 119 + info->netns.netns_ino = inum; 120 + info->netns.attach_type = net_link->type; 121 + return 0; 122 + } 123 + 124 + static void bpf_netns_link_show_fdinfo(const struct bpf_link *link, 125 + struct seq_file *seq) 126 + { 127 
+ struct bpf_link_info info = {}; 128 + 129 + bpf_netns_link_fill_info(link, &info); 130 + seq_printf(seq, 131 + "netns_ino:\t%u\n" 132 + "attach_type:\t%u\n", 133 + info.netns.netns_ino, 134 + info.netns.attach_type); 135 + } 136 + 137 + static const struct bpf_link_ops bpf_netns_link_ops = { 138 + .release = bpf_netns_link_release, 139 + .dealloc = bpf_netns_link_dealloc, 140 + .update_prog = bpf_netns_link_update_prog, 141 + .fill_link_info = bpf_netns_link_fill_info, 142 + .show_fdinfo = bpf_netns_link_show_fdinfo, 143 + }; 144 + 145 + int netns_bpf_prog_query(const union bpf_attr *attr, 146 + union bpf_attr __user *uattr) 147 + { 148 + __u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids); 149 + u32 prog_id, prog_cnt = 0, flags = 0; 150 + enum netns_bpf_attach_type type; 151 + struct bpf_prog *attached; 152 + struct net *net; 153 + 154 + if (attr->query.query_flags) 155 + return -EINVAL; 156 + 157 + type = to_netns_bpf_attach_type(attr->query.attach_type); 158 + if (type < 0) 159 + return -EINVAL; 160 + 161 + net = get_net_ns_by_fd(attr->query.target_fd); 162 + if (IS_ERR(net)) 163 + return PTR_ERR(net); 164 + 165 + rcu_read_lock(); 166 + attached = rcu_dereference(net->bpf.progs[type]); 167 + if (attached) { 168 + prog_cnt = 1; 169 + prog_id = attached->aux->id; 170 + } 171 + rcu_read_unlock(); 172 + 173 + put_net(net); 174 + 175 + if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags))) 176 + return -EFAULT; 177 + if (copy_to_user(&uattr->query.prog_cnt, &prog_cnt, sizeof(prog_cnt))) 178 + return -EFAULT; 179 + 180 + if (!attr->query.prog_cnt || !prog_ids || !prog_cnt) 181 + return 0; 182 + 183 + if (copy_to_user(prog_ids, &prog_id, sizeof(u32))) 184 + return -EFAULT; 185 + 186 + return 0; 187 + } 188 + 189 + int netns_bpf_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog) 190 + { 191 + enum netns_bpf_attach_type type; 192 + struct net *net; 193 + int ret; 194 + 195 + type = to_netns_bpf_attach_type(attr->attach_type); 196 + 
if (type < 0) 197 + return -EINVAL; 198 + 199 + net = current->nsproxy->net_ns; 200 + mutex_lock(&netns_bpf_mutex); 201 + 202 + /* Attaching prog directly is not compatible with links */ 203 + if (net->bpf.links[type]) { 204 + ret = -EEXIST; 205 + goto out_unlock; 206 + } 207 + 208 + switch (type) { 209 + case NETNS_BPF_FLOW_DISSECTOR: 210 + ret = flow_dissector_bpf_prog_attach(net, prog); 211 + break; 212 + default: 213 + ret = -EINVAL; 214 + break; 215 + } 216 + out_unlock: 217 + mutex_unlock(&netns_bpf_mutex); 218 + 219 + return ret; 220 + } 221 + 222 + /* Must be called with netns_bpf_mutex held. */ 223 + static int __netns_bpf_prog_detach(struct net *net, 224 + enum netns_bpf_attach_type type) 225 + { 226 + struct bpf_prog *attached; 227 + 228 + /* Progs attached via links cannot be detached */ 229 + if (net->bpf.links[type]) 230 + return -EINVAL; 231 + 232 + attached = rcu_dereference_protected(net->bpf.progs[type], 233 + lockdep_is_held(&netns_bpf_mutex)); 234 + if (!attached) 235 + return -ENOENT; 236 + RCU_INIT_POINTER(net->bpf.progs[type], NULL); 237 + bpf_prog_put(attached); 238 + return 0; 239 + } 240 + 241 + int netns_bpf_prog_detach(const union bpf_attr *attr) 242 + { 243 + enum netns_bpf_attach_type type; 244 + int ret; 245 + 246 + type = to_netns_bpf_attach_type(attr->attach_type); 247 + if (type < 0) 248 + return -EINVAL; 249 + 250 + mutex_lock(&netns_bpf_mutex); 251 + ret = __netns_bpf_prog_detach(current->nsproxy->net_ns, type); 252 + mutex_unlock(&netns_bpf_mutex); 253 + 254 + return ret; 255 + } 256 + 257 + static int netns_bpf_link_attach(struct net *net, struct bpf_link *link, 258 + enum netns_bpf_attach_type type) 259 + { 260 + struct bpf_prog *prog; 261 + int err; 262 + 263 + mutex_lock(&netns_bpf_mutex); 264 + 265 + /* Allow attaching only one prog or link for now */ 266 + if (net->bpf.links[type]) { 267 + err = -E2BIG; 268 + goto out_unlock; 269 + } 270 + /* Links are not compatible with attaching prog directly */ 271 + prog = 
rcu_dereference_protected(net->bpf.progs[type], 272 + lockdep_is_held(&netns_bpf_mutex)); 273 + if (prog) { 274 + err = -EEXIST; 275 + goto out_unlock; 276 + } 277 + 278 + switch (type) { 279 + case NETNS_BPF_FLOW_DISSECTOR: 280 + err = flow_dissector_bpf_prog_attach(net, link->prog); 281 + break; 282 + default: 283 + err = -EINVAL; 284 + break; 285 + } 286 + if (err) 287 + goto out_unlock; 288 + 289 + net->bpf.links[type] = link; 290 + 291 + out_unlock: 292 + mutex_unlock(&netns_bpf_mutex); 293 + return err; 294 + } 295 + 296 + int netns_bpf_link_create(const union bpf_attr *attr, struct bpf_prog *prog) 297 + { 298 + enum netns_bpf_attach_type netns_type; 299 + struct bpf_link_primer link_primer; 300 + struct bpf_netns_link *net_link; 301 + enum bpf_attach_type type; 302 + struct net *net; 303 + int err; 304 + 305 + if (attr->link_create.flags) 306 + return -EINVAL; 307 + 308 + type = attr->link_create.attach_type; 309 + netns_type = to_netns_bpf_attach_type(type); 310 + if (netns_type < 0) 311 + return -EINVAL; 312 + 313 + net = get_net_ns_by_fd(attr->link_create.target_fd); 314 + if (IS_ERR(net)) 315 + return PTR_ERR(net); 316 + 317 + net_link = kzalloc(sizeof(*net_link), GFP_USER); 318 + if (!net_link) { 319 + err = -ENOMEM; 320 + goto out_put_net; 321 + } 322 + bpf_link_init(&net_link->link, BPF_LINK_TYPE_NETNS, 323 + &bpf_netns_link_ops, prog); 324 + net_link->net = net; 325 + net_link->type = type; 326 + net_link->netns_type = netns_type; 327 + 328 + err = bpf_link_prime(&net_link->link, &link_primer); 329 + if (err) { 330 + kfree(net_link); 331 + goto out_put_net; 332 + } 333 + 334 + err = netns_bpf_link_attach(net, &net_link->link, netns_type); 335 + if (err) { 336 + bpf_link_cleanup(&link_primer); 337 + goto out_put_net; 338 + } 339 + 340 + put_net(net); 341 + return bpf_link_settle(&link_primer); 342 + 343 + out_put_net: 344 + put_net(net); 345 + return err; 346 + } 347 + 348 + static void __net_exit netns_bpf_pernet_pre_exit(struct net *net) 349 + { 350 
+ enum netns_bpf_attach_type type; 351 + struct bpf_link *link; 352 + 353 + mutex_lock(&netns_bpf_mutex); 354 + for (type = 0; type < MAX_NETNS_BPF_ATTACH_TYPE; type++) { 355 + link = net->bpf.links[type]; 356 + if (link) 357 + bpf_netns_link_auto_detach(link); 358 + else 359 + __netns_bpf_prog_detach(net, type); 360 + } 361 + mutex_unlock(&netns_bpf_mutex); 362 + } 363 + 364 + static struct pernet_operations netns_bpf_pernet_ops __net_initdata = { 365 + .pre_exit = netns_bpf_pernet_pre_exit, 366 + }; 367 + 368 + static int __init netns_bpf_init(void) 369 + { 370 + return register_pernet_subsys(&netns_bpf_pernet_ops); 371 + } 372 + 373 + subsys_initcall(netns_bpf_init);
+501
kernel/bpf/ringbuf.c
··· 1 + #include <linux/bpf.h> 2 + #include <linux/btf.h> 3 + #include <linux/err.h> 4 + #include <linux/irq_work.h> 5 + #include <linux/slab.h> 6 + #include <linux/filter.h> 7 + #include <linux/mm.h> 8 + #include <linux/vmalloc.h> 9 + #include <linux/wait.h> 10 + #include <linux/poll.h> 11 + #include <uapi/linux/btf.h> 12 + 13 + #define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE) 14 + 15 + /* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */ 16 + #define RINGBUF_PGOFF \ 17 + (offsetof(struct bpf_ringbuf, consumer_pos) >> PAGE_SHIFT) 18 + /* consumer page and producer page */ 19 + #define RINGBUF_POS_PAGES 2 20 + 21 + #define RINGBUF_MAX_RECORD_SZ (UINT_MAX/4) 22 + 23 + /* Maximum size of ring buffer area is limited by 32-bit page offset within 24 + * record header, counted in pages. Reserve 8 bits for extensibility, and take 25 + * into account a few extra pages for consumer/producer pages and 26 + * non-mmap()'able parts. This gives a 64GB limit, which seems plenty for a 27 + * single ring buffer. 28 + */ 29 + #define RINGBUF_MAX_DATA_SZ \ 30 + (((1ULL << 24) - RINGBUF_POS_PAGES - RINGBUF_PGOFF) * PAGE_SIZE) 31 + 32 + struct bpf_ringbuf { 33 + wait_queue_head_t waitq; 34 + struct irq_work work; 35 + u64 mask; 36 + struct page **pages; 37 + int nr_pages; 38 + spinlock_t spinlock ____cacheline_aligned_in_smp; 39 + /* Consumer and producer counters are put into separate pages to allow 40 + * mapping consumer page as r/w, but restrict producer page to r/o. 41 + * This protects producer position from being modified by user-space 42 + * application and ruining in-kernel position tracking. 
43 + */ 44 + unsigned long consumer_pos __aligned(PAGE_SIZE); 45 + unsigned long producer_pos __aligned(PAGE_SIZE); 46 + char data[] __aligned(PAGE_SIZE); 47 + }; 48 + 49 + struct bpf_ringbuf_map { 50 + struct bpf_map map; 51 + struct bpf_map_memory memory; 52 + struct bpf_ringbuf *rb; 53 + }; 54 + 55 + /* 8-byte ring buffer record header structure */ 56 + struct bpf_ringbuf_hdr { 57 + u32 len; 58 + u32 pg_off; 59 + }; 60 + 61 + static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node) 62 + { 63 + const gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | 64 + __GFP_ZERO; 65 + int nr_meta_pages = RINGBUF_PGOFF + RINGBUF_POS_PAGES; 66 + int nr_data_pages = data_sz >> PAGE_SHIFT; 67 + int nr_pages = nr_meta_pages + nr_data_pages; 68 + struct page **pages, *page; 69 + struct bpf_ringbuf *rb; 70 + size_t array_size; 71 + int i; 72 + 73 + /* Each data page is mapped twice to allow "virtual" 74 + * continuous read of samples wrapping around the end of ring 75 + * buffer area: 76 + * ------------------------------------------------------ 77 + * | meta pages | real data pages | same data pages | 78 + * ------------------------------------------------------ 79 + * | | 1 2 3 4 5 6 7 8 9 | 1 2 3 4 5 6 7 8 9 | 80 + * ------------------------------------------------------ 81 + * | | TA DA | TA DA | 82 + * ------------------------------------------------------ 83 + * ^^^^^^^ 84 + * | 85 + * Here, no need to worry about special handling of wrapped-around 86 + * data due to double-mapped data pages. This works both in kernel and 87 + * when mmap()'ed in user-space, simplifying both kernel and 88 + * user-space implementations significantly. 
89 + */ 90 + array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages); 91 + if (array_size > PAGE_SIZE) 92 + pages = vmalloc_node(array_size, numa_node); 93 + else 94 + pages = kmalloc_node(array_size, flags, numa_node); 95 + if (!pages) 96 + return NULL; 97 + 98 + for (i = 0; i < nr_pages; i++) { 99 + page = alloc_pages_node(numa_node, flags, 0); 100 + if (!page) { 101 + nr_pages = i; 102 + goto err_free_pages; 103 + } 104 + pages[i] = page; 105 + if (i >= nr_meta_pages) 106 + pages[nr_data_pages + i] = page; 107 + } 108 + 109 + rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages, 110 + VM_ALLOC | VM_USERMAP, PAGE_KERNEL); 111 + if (rb) { 112 + rb->pages = pages; 113 + rb->nr_pages = nr_pages; 114 + return rb; 115 + } 116 + 117 + err_free_pages: 118 + for (i = 0; i < nr_pages; i++) 119 + __free_page(pages[i]); 120 + kvfree(pages); 121 + return NULL; 122 + } 123 + 124 + static void bpf_ringbuf_notify(struct irq_work *work) 125 + { 126 + struct bpf_ringbuf *rb = container_of(work, struct bpf_ringbuf, work); 127 + 128 + wake_up_all(&rb->waitq); 129 + } 130 + 131 + static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) 132 + { 133 + struct bpf_ringbuf *rb; 134 + 135 + if (!data_sz || !PAGE_ALIGNED(data_sz)) 136 + return ERR_PTR(-EINVAL); 137 + 138 + #ifdef CONFIG_64BIT 139 + /* on 32-bit arch, it's impossible to overflow record's hdr->pgoff */ 140 + if (data_sz > RINGBUF_MAX_DATA_SZ) 141 + return ERR_PTR(-E2BIG); 142 + #endif 143 + 144 + rb = bpf_ringbuf_area_alloc(data_sz, numa_node); 145 + if (!rb) 146 + return ERR_PTR(-ENOMEM); 147 + 148 + spin_lock_init(&rb->spinlock); 149 + init_waitqueue_head(&rb->waitq); 150 + init_irq_work(&rb->work, bpf_ringbuf_notify); 151 + 152 + rb->mask = data_sz - 1; 153 + rb->consumer_pos = 0; 154 + rb->producer_pos = 0; 155 + 156 + return rb; 157 + } 158 + 159 + static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr) 160 + { 161 + struct bpf_ringbuf_map *rb_map; 162 + u64 cost; 163 + int err; 164 + 
165 + if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK) 166 + return ERR_PTR(-EINVAL); 167 + 168 + if (attr->key_size || attr->value_size || 169 + attr->max_entries == 0 || !PAGE_ALIGNED(attr->max_entries)) 170 + return ERR_PTR(-EINVAL); 171 + 172 + rb_map = kzalloc(sizeof(*rb_map), GFP_USER); 173 + if (!rb_map) 174 + return ERR_PTR(-ENOMEM); 175 + 176 + bpf_map_init_from_attr(&rb_map->map, attr); 177 + 178 + cost = sizeof(struct bpf_ringbuf_map) + 179 + sizeof(struct bpf_ringbuf) + 180 + attr->max_entries; 181 + err = bpf_map_charge_init(&rb_map->map.memory, cost); 182 + if (err) 183 + goto err_free_map; 184 + 185 + rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node); 186 + if (IS_ERR(rb_map->rb)) { 187 + err = PTR_ERR(rb_map->rb); 188 + goto err_uncharge; 189 + } 190 + 191 + return &rb_map->map; 192 + 193 + err_uncharge: 194 + bpf_map_charge_finish(&rb_map->map.memory); 195 + err_free_map: 196 + kfree(rb_map); 197 + return ERR_PTR(err); 198 + } 199 + 200 + static void bpf_ringbuf_free(struct bpf_ringbuf *rb) 201 + { 202 + /* copy pages pointer and nr_pages to local variable, as we are going 203 + * to unmap rb itself with vunmap() below 204 + */ 205 + struct page **pages = rb->pages; 206 + int i, nr_pages = rb->nr_pages; 207 + 208 + vunmap(rb); 209 + for (i = 0; i < nr_pages; i++) 210 + __free_page(pages[i]); 211 + kvfree(pages); 212 + } 213 + 214 + static void ringbuf_map_free(struct bpf_map *map) 215 + { 216 + struct bpf_ringbuf_map *rb_map; 217 + 218 + /* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0, 219 + * so the programs (can be more than one that used this map) were 220 + * disconnected from events. 
Wait for outstanding critical sections in 221 + * these programs to complete 222 + */ 223 + synchronize_rcu(); 224 + 225 + rb_map = container_of(map, struct bpf_ringbuf_map, map); 226 + bpf_ringbuf_free(rb_map->rb); 227 + kfree(rb_map); 228 + } 229 + 230 + static void *ringbuf_map_lookup_elem(struct bpf_map *map, void *key) 231 + { 232 + return ERR_PTR(-ENOTSUPP); 233 + } 234 + 235 + static int ringbuf_map_update_elem(struct bpf_map *map, void *key, void *value, 236 + u64 flags) 237 + { 238 + return -ENOTSUPP; 239 + } 240 + 241 + static int ringbuf_map_delete_elem(struct bpf_map *map, void *key) 242 + { 243 + return -ENOTSUPP; 244 + } 245 + 246 + static int ringbuf_map_get_next_key(struct bpf_map *map, void *key, 247 + void *next_key) 248 + { 249 + return -ENOTSUPP; 250 + } 251 + 252 + static size_t bpf_ringbuf_mmap_page_cnt(const struct bpf_ringbuf *rb) 253 + { 254 + size_t data_pages = (rb->mask + 1) >> PAGE_SHIFT; 255 + 256 + /* consumer page + producer page + 2 x data pages */ 257 + return RINGBUF_POS_PAGES + 2 * data_pages; 258 + } 259 + 260 + static int ringbuf_map_mmap(struct bpf_map *map, struct vm_area_struct *vma) 261 + { 262 + struct bpf_ringbuf_map *rb_map; 263 + size_t mmap_sz; 264 + 265 + rb_map = container_of(map, struct bpf_ringbuf_map, map); 266 + mmap_sz = bpf_ringbuf_mmap_page_cnt(rb_map->rb) << PAGE_SHIFT; 267 + 268 + if (vma->vm_pgoff * PAGE_SIZE + (vma->vm_end - vma->vm_start) > mmap_sz) 269 + return -EINVAL; 270 + 271 + return remap_vmalloc_range(vma, rb_map->rb, 272 + vma->vm_pgoff + RINGBUF_PGOFF); 273 + } 274 + 275 + static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb) 276 + { 277 + unsigned long cons_pos, prod_pos; 278 + 279 + cons_pos = smp_load_acquire(&rb->consumer_pos); 280 + prod_pos = smp_load_acquire(&rb->producer_pos); 281 + return prod_pos - cons_pos; 282 + } 283 + 284 + static __poll_t ringbuf_map_poll(struct bpf_map *map, struct file *filp, 285 + struct poll_table_struct *pts) 286 + { 287 + struct bpf_ringbuf_map 
*rb_map; 288 + 289 + rb_map = container_of(map, struct bpf_ringbuf_map, map); 290 + poll_wait(filp, &rb_map->rb->waitq, pts); 291 + 292 + if (ringbuf_avail_data_sz(rb_map->rb)) 293 + return EPOLLIN | EPOLLRDNORM; 294 + return 0; 295 + } 296 + 297 + const struct bpf_map_ops ringbuf_map_ops = { 298 + .map_alloc = ringbuf_map_alloc, 299 + .map_free = ringbuf_map_free, 300 + .map_mmap = ringbuf_map_mmap, 301 + .map_poll = ringbuf_map_poll, 302 + .map_lookup_elem = ringbuf_map_lookup_elem, 303 + .map_update_elem = ringbuf_map_update_elem, 304 + .map_delete_elem = ringbuf_map_delete_elem, 305 + .map_get_next_key = ringbuf_map_get_next_key, 306 + }; 307 + 308 + /* Given pointer to ring buffer record metadata and struct bpf_ringbuf itself, 309 + * calculate offset from record metadata to ring buffer in pages, rounded 310 + * down. This page offset is stored as part of record metadata and allows to 311 + * restore struct bpf_ringbuf * from record pointer. This page offset is 312 + * stored at offset 4 of record metadata header. 
313 + */ 314 + static size_t bpf_ringbuf_rec_pg_off(struct bpf_ringbuf *rb, 315 + struct bpf_ringbuf_hdr *hdr) 316 + { 317 + return ((void *)hdr - (void *)rb) >> PAGE_SHIFT; 318 + } 319 + 320 + /* Given pointer to ring buffer record header, restore pointer to struct 321 + * bpf_ringbuf itself by using page offset stored at offset 4 322 + */ 323 + static struct bpf_ringbuf * 324 + bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) 325 + { 326 + unsigned long addr = (unsigned long)(void *)hdr; 327 + unsigned long off = (unsigned long)hdr->pg_off << PAGE_SHIFT; 328 + 329 + return (void*)((addr & PAGE_MASK) - off); 330 + } 331 + 332 + static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) 333 + { 334 + unsigned long cons_pos, prod_pos, new_prod_pos, flags; 335 + u32 len, pg_off; 336 + struct bpf_ringbuf_hdr *hdr; 337 + 338 + if (unlikely(size > RINGBUF_MAX_RECORD_SZ)) 339 + return NULL; 340 + 341 + len = round_up(size + BPF_RINGBUF_HDR_SZ, 8); 342 + cons_pos = smp_load_acquire(&rb->consumer_pos); 343 + 344 + if (in_nmi()) { 345 + if (!spin_trylock_irqsave(&rb->spinlock, flags)) 346 + return NULL; 347 + } else { 348 + spin_lock_irqsave(&rb->spinlock, flags); 349 + } 350 + 351 + prod_pos = rb->producer_pos; 352 + new_prod_pos = prod_pos + len; 353 + 354 + /* check for out of ringbuf space by ensuring producer position 355 + * doesn't advance more than (ringbuf_size - 1) ahead 356 + */ 357 + if (new_prod_pos - cons_pos > rb->mask) { 358 + spin_unlock_irqrestore(&rb->spinlock, flags); 359 + return NULL; 360 + } 361 + 362 + hdr = (void *)rb->data + (prod_pos & rb->mask); 363 + pg_off = bpf_ringbuf_rec_pg_off(rb, hdr); 364 + hdr->len = size | BPF_RINGBUF_BUSY_BIT; 365 + hdr->pg_off = pg_off; 366 + 367 + /* pairs with consumer's smp_load_acquire() */ 368 + smp_store_release(&rb->producer_pos, new_prod_pos); 369 + 370 + spin_unlock_irqrestore(&rb->spinlock, flags); 371 + 372 + return (void *)hdr + BPF_RINGBUF_HDR_SZ; 373 + } 374 + 375 + 
BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64, flags) 376 + { 377 + struct bpf_ringbuf_map *rb_map; 378 + 379 + if (unlikely(flags)) 380 + return 0; 381 + 382 + rb_map = container_of(map, struct bpf_ringbuf_map, map); 383 + return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size); 384 + } 385 + 386 + const struct bpf_func_proto bpf_ringbuf_reserve_proto = { 387 + .func = bpf_ringbuf_reserve, 388 + .ret_type = RET_PTR_TO_ALLOC_MEM_OR_NULL, 389 + .arg1_type = ARG_CONST_MAP_PTR, 390 + .arg2_type = ARG_CONST_ALLOC_SIZE_OR_ZERO, 391 + .arg3_type = ARG_ANYTHING, 392 + }; 393 + 394 + static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) 395 + { 396 + unsigned long rec_pos, cons_pos; 397 + struct bpf_ringbuf_hdr *hdr; 398 + struct bpf_ringbuf *rb; 399 + u32 new_len; 400 + 401 + hdr = sample - BPF_RINGBUF_HDR_SZ; 402 + rb = bpf_ringbuf_restore_from_rec(hdr); 403 + new_len = hdr->len ^ BPF_RINGBUF_BUSY_BIT; 404 + if (discard) 405 + new_len |= BPF_RINGBUF_DISCARD_BIT; 406 + 407 + /* update record header with correct final size prefix */ 408 + xchg(&hdr->len, new_len); 409 + 410 + /* if consumer caught up and is waiting for our record, notify about 411 + * new data availability 412 + */ 413 + rec_pos = (void *)hdr - (void *)rb->data; 414 + cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask; 415 + 416 + if (flags & BPF_RB_FORCE_WAKEUP) 417 + irq_work_queue(&rb->work); 418 + else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP)) 419 + irq_work_queue(&rb->work); 420 + } 421 + 422 + BPF_CALL_2(bpf_ringbuf_submit, void *, sample, u64, flags) 423 + { 424 + bpf_ringbuf_commit(sample, flags, false /* discard */); 425 + return 0; 426 + } 427 + 428 + const struct bpf_func_proto bpf_ringbuf_submit_proto = { 429 + .func = bpf_ringbuf_submit, 430 + .ret_type = RET_VOID, 431 + .arg1_type = ARG_PTR_TO_ALLOC_MEM, 432 + .arg2_type = ARG_ANYTHING, 433 + }; 434 + 435 + BPF_CALL_2(bpf_ringbuf_discard, void *, sample, u64, flags) 436 + { 
437 + bpf_ringbuf_commit(sample, flags, true /* discard */); 438 + return 0; 439 + } 440 + 441 + const struct bpf_func_proto bpf_ringbuf_discard_proto = { 442 + .func = bpf_ringbuf_discard, 443 + .ret_type = RET_VOID, 444 + .arg1_type = ARG_PTR_TO_ALLOC_MEM, 445 + .arg2_type = ARG_ANYTHING, 446 + }; 447 + 448 + BPF_CALL_4(bpf_ringbuf_output, struct bpf_map *, map, void *, data, u64, size, 449 + u64, flags) 450 + { 451 + struct bpf_ringbuf_map *rb_map; 452 + void *rec; 453 + 454 + if (unlikely(flags & ~(BPF_RB_NO_WAKEUP | BPF_RB_FORCE_WAKEUP))) 455 + return -EINVAL; 456 + 457 + rb_map = container_of(map, struct bpf_ringbuf_map, map); 458 + rec = __bpf_ringbuf_reserve(rb_map->rb, size); 459 + if (!rec) 460 + return -EAGAIN; 461 + 462 + memcpy(rec, data, size); 463 + bpf_ringbuf_commit(rec, flags, false /* discard */); 464 + return 0; 465 + } 466 + 467 + const struct bpf_func_proto bpf_ringbuf_output_proto = { 468 + .func = bpf_ringbuf_output, 469 + .ret_type = RET_INTEGER, 470 + .arg1_type = ARG_CONST_MAP_PTR, 471 + .arg2_type = ARG_PTR_TO_MEM, 472 + .arg3_type = ARG_CONST_SIZE_OR_ZERO, 473 + .arg4_type = ARG_ANYTHING, 474 + }; 475 + 476 + BPF_CALL_2(bpf_ringbuf_query, struct bpf_map *, map, u64, flags) 477 + { 478 + struct bpf_ringbuf *rb; 479 + 480 + rb = container_of(map, struct bpf_ringbuf_map, map)->rb; 481 + 482 + switch (flags) { 483 + case BPF_RB_AVAIL_DATA: 484 + return ringbuf_avail_data_sz(rb); 485 + case BPF_RB_RING_SIZE: 486 + return rb->mask + 1; 487 + case BPF_RB_CONS_POS: 488 + return smp_load_acquire(&rb->consumer_pos); 489 + case BPF_RB_PROD_POS: 490 + return smp_load_acquire(&rb->producer_pos); 491 + default: 492 + return 0; 493 + } 494 + } 495 + 496 + const struct bpf_func_proto bpf_ringbuf_query_proto = { 497 + .func = bpf_ringbuf_query, 498 + .ret_type = RET_INTEGER, 499 + .arg1_type = ARG_CONST_MAP_PTR, 500 + .arg2_type = ARG_ANYTHING, 501 + };
+23 -6
kernel/bpf/syscall.c
··· 26 26 #include <linux/audit.h> 27 27 #include <uapi/linux/btf.h> 28 28 #include <linux/bpf_lsm.h> 29 + #include <linux/poll.h> 30 + #include <linux/bpf-netns.h> 29 31 30 32 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \ 31 33 (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \ ··· 664 662 return err; 665 663 } 666 664 665 + static __poll_t bpf_map_poll(struct file *filp, struct poll_table_struct *pts) 666 + { 667 + struct bpf_map *map = filp->private_data; 668 + 669 + if (map->ops->map_poll) 670 + return map->ops->map_poll(map, filp, pts); 671 + 672 + return EPOLLERR; 673 + } 674 + 667 675 const struct file_operations bpf_map_fops = { 668 676 #ifdef CONFIG_PROC_FS 669 677 .show_fdinfo = bpf_map_show_fdinfo, ··· 682 670 .read = bpf_dummy_read, 683 671 .write = bpf_dummy_write, 684 672 .mmap = bpf_map_mmap, 673 + .poll = bpf_map_poll, 685 674 }; 686 675 687 676 int bpf_map_new_fd(struct bpf_map *map, int flags) ··· 1400 1387 1401 1388 buf = kmalloc(map->key_size + value_size, GFP_USER | __GFP_NOWARN); 1402 1389 if (!buf) { 1403 - kvfree(buf_prevkey); 1390 + kfree(buf_prevkey); 1404 1391 return -ENOMEM; 1405 1392 } 1406 1393 ··· 1485 1472 map = __bpf_map_get(f); 1486 1473 if (IS_ERR(map)) 1487 1474 return PTR_ERR(map); 1488 - if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) { 1475 + if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ) || 1476 + !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) { 1489 1477 err = -EPERM; 1490 1478 goto err_put; 1491 1479 } ··· 2869 2855 ret = lirc_prog_attach(attr, prog); 2870 2856 break; 2871 2857 case BPF_PROG_TYPE_FLOW_DISSECTOR: 2872 - ret = skb_flow_dissector_bpf_prog_attach(attr, prog); 2858 + ret = netns_bpf_prog_attach(attr, prog); 2873 2859 break; 2874 2860 case BPF_PROG_TYPE_CGROUP_DEVICE: 2875 2861 case BPF_PROG_TYPE_CGROUP_SKB: ··· 2909 2895 case BPF_PROG_TYPE_FLOW_DISSECTOR: 2910 2896 if (!capable(CAP_NET_ADMIN)) 2911 2897 return -EPERM; 2912 - return 
skb_flow_dissector_bpf_prog_detach(attr); 2898 + return netns_bpf_prog_detach(attr); 2913 2899 case BPF_PROG_TYPE_CGROUP_DEVICE: 2914 2900 case BPF_PROG_TYPE_CGROUP_SKB: 2915 2901 case BPF_PROG_TYPE_CGROUP_SOCK: ··· 2962 2948 case BPF_LIRC_MODE2: 2963 2949 return lirc_prog_query(attr, uattr); 2964 2950 case BPF_FLOW_DISSECTOR: 2965 - return skb_flow_dissector_prog_query(attr, uattr); 2951 + return netns_bpf_prog_query(attr, uattr); 2966 2952 default: 2967 2953 return -EINVAL; 2968 2954 } ··· 3887 3873 case BPF_PROG_TYPE_TRACING: 3888 3874 ret = tracing_bpf_link_attach(attr, prog); 3889 3875 break; 3876 + case BPF_PROG_TYPE_FLOW_DISSECTOR: 3877 + ret = netns_bpf_link_create(attr, prog); 3878 + break; 3890 3879 default: 3891 3880 ret = -EINVAL; 3892 3881 } ··· 3941 3924 if (link->ops->update_prog) 3942 3925 ret = link->ops->update_prog(link, new_prog, old_prog); 3943 3926 else 3944 - ret = EINVAL; 3927 + ret = -EINVAL; 3945 3928 3946 3929 out_put_progs: 3947 3930 if (old_prog)
+146 -49
kernel/bpf/verifier.c
··· 233 233 bool pkt_access; 234 234 int regno; 235 235 int access_size; 236 + int mem_size; 236 237 u64 msize_max_value; 237 238 int ref_obj_id; 238 239 int func_id; ··· 409 408 type == PTR_TO_SOCKET_OR_NULL || 410 409 type == PTR_TO_SOCK_COMMON_OR_NULL || 411 410 type == PTR_TO_TCP_SOCK_OR_NULL || 412 - type == PTR_TO_BTF_ID_OR_NULL; 411 + type == PTR_TO_BTF_ID_OR_NULL || 412 + type == PTR_TO_MEM_OR_NULL; 413 413 } 414 414 415 415 static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg) ··· 424 422 return type == PTR_TO_SOCKET || 425 423 type == PTR_TO_SOCKET_OR_NULL || 426 424 type == PTR_TO_TCP_SOCK || 427 - type == PTR_TO_TCP_SOCK_OR_NULL; 425 + type == PTR_TO_TCP_SOCK_OR_NULL || 426 + type == PTR_TO_MEM || 427 + type == PTR_TO_MEM_OR_NULL; 428 428 } 429 429 430 430 static bool arg_type_may_be_refcounted(enum bpf_arg_type type) ··· 440 436 */ 441 437 static bool is_release_function(enum bpf_func_id func_id) 442 438 { 443 - return func_id == BPF_FUNC_sk_release; 439 + return func_id == BPF_FUNC_sk_release || 440 + func_id == BPF_FUNC_ringbuf_submit || 441 + func_id == BPF_FUNC_ringbuf_discard; 444 442 } 445 443 446 444 static bool may_be_acquire_function(enum bpf_func_id func_id) ··· 450 444 return func_id == BPF_FUNC_sk_lookup_tcp || 451 445 func_id == BPF_FUNC_sk_lookup_udp || 452 446 func_id == BPF_FUNC_skc_lookup_tcp || 453 - func_id == BPF_FUNC_map_lookup_elem; 447 + func_id == BPF_FUNC_map_lookup_elem || 448 + func_id == BPF_FUNC_ringbuf_reserve; 454 449 } 455 450 456 451 static bool is_acquire_function(enum bpf_func_id func_id, ··· 461 454 462 455 if (func_id == BPF_FUNC_sk_lookup_tcp || 463 456 func_id == BPF_FUNC_sk_lookup_udp || 464 - func_id == BPF_FUNC_skc_lookup_tcp) 457 + func_id == BPF_FUNC_skc_lookup_tcp || 458 + func_id == BPF_FUNC_ringbuf_reserve) 465 459 return true; 466 460 467 461 if (func_id == BPF_FUNC_map_lookup_elem && ··· 502 494 [PTR_TO_XDP_SOCK] = "xdp_sock", 503 495 [PTR_TO_BTF_ID] = "ptr_", 504 496 
[PTR_TO_BTF_ID_OR_NULL] = "ptr_or_null_", 497 + [PTR_TO_MEM] = "mem", 498 + [PTR_TO_MEM_OR_NULL] = "mem_or_null", 505 499 }; 506 500 507 501 static char slot_type_char[] = { ··· 2478 2468 return 0; 2479 2469 } 2480 2470 2481 - /* check read/write into map element returned by bpf_map_lookup_elem() */ 2482 - static int __check_map_access(struct bpf_verifier_env *env, u32 regno, int off, 2483 - int size, bool zero_size_allowed) 2471 + /* check read/write into memory region (e.g., map value, ringbuf sample, etc) */ 2472 + static int __check_mem_access(struct bpf_verifier_env *env, int regno, 2473 + int off, int size, u32 mem_size, 2474 + bool zero_size_allowed) 2484 2475 { 2485 - struct bpf_reg_state *regs = cur_regs(env); 2486 - struct bpf_map *map = regs[regno].map_ptr; 2476 + bool size_ok = size > 0 || (size == 0 && zero_size_allowed); 2477 + struct bpf_reg_state *reg; 2487 2478 2488 - if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) || 2489 - off + size > map->value_size) { 2479 + if (off >= 0 && size_ok && (u64)off + size <= mem_size) 2480 + return 0; 2481 + 2482 + reg = &cur_regs(env)[regno]; 2483 + switch (reg->type) { 2484 + case PTR_TO_MAP_VALUE: 2490 2485 verbose(env, "invalid access to map value, value_size=%d off=%d size=%d\n", 2491 - map->value_size, off, size); 2492 - return -EACCES; 2486 + mem_size, off, size); 2487 + break; 2488 + case PTR_TO_PACKET: 2489 + case PTR_TO_PACKET_META: 2490 + case PTR_TO_PACKET_END: 2491 + verbose(env, "invalid access to packet, off=%d size=%d, R%d(id=%d,off=%d,r=%d)\n", 2492 + off, size, regno, reg->id, off, mem_size); 2493 + break; 2494 + case PTR_TO_MEM: 2495 + default: 2496 + verbose(env, "invalid access to memory, mem_size=%u off=%d size=%d\n", 2497 + mem_size, off, size); 2493 2498 } 2494 - return 0; 2499 + 2500 + return -EACCES; 2495 2501 } 2496 2502 2497 - /* check read/write into a map element with possible variable offset */ 2498 - static int check_map_access(struct bpf_verifier_env *env, u32 regno, 
2499 - int off, int size, bool zero_size_allowed) 2503 + /* check read/write into a memory region with possible variable offset */ 2504 + static int check_mem_region_access(struct bpf_verifier_env *env, u32 regno, 2505 + int off, int size, u32 mem_size, 2506 + bool zero_size_allowed) 2500 2507 { 2501 2508 struct bpf_verifier_state *vstate = env->cur_state; 2502 2509 struct bpf_func_state *state = vstate->frame[vstate->curframe]; 2503 2510 struct bpf_reg_state *reg = &state->regs[regno]; 2504 2511 int err; 2505 2512 2506 - /* We may have adjusted the register to this map value, so we 2513 + /* We may have adjusted the register pointing to memory region, so we 2507 2514 * need to try adding each of min_value and max_value to off 2508 2515 * to make sure our theoretical access will be safe. 2509 2516 */ ··· 2541 2514 regno); 2542 2515 return -EACCES; 2543 2516 } 2544 - err = __check_map_access(env, regno, reg->smin_value + off, size, 2545 - zero_size_allowed); 2517 + err = __check_mem_access(env, regno, reg->smin_value + off, size, 2518 + mem_size, zero_size_allowed); 2546 2519 if (err) { 2547 - verbose(env, "R%d min value is outside of the array range\n", 2520 + verbose(env, "R%d min value is outside of the allowed memory range\n", 2548 2521 regno); 2549 2522 return err; 2550 2523 } ··· 2554 2527 * If reg->umax_value + off could overflow, treat that as unbounded too. 
2555 2528 */ 2556 2529 if (reg->umax_value >= BPF_MAX_VAR_OFF) { 2557 - verbose(env, "R%d unbounded memory access, make sure to bounds check any array access into a map\n", 2530 + verbose(env, "R%d unbounded memory access, make sure to bounds check any such access\n", 2558 2531 regno); 2559 2532 return -EACCES; 2560 2533 } 2561 - err = __check_map_access(env, regno, reg->umax_value + off, size, 2562 - zero_size_allowed); 2563 - if (err) 2564 - verbose(env, "R%d max value is outside of the array range\n", 2534 + err = __check_mem_access(env, regno, reg->umax_value + off, size, 2535 + mem_size, zero_size_allowed); 2536 + if (err) { 2537 + verbose(env, "R%d max value is outside of the allowed memory range\n", 2565 2538 regno); 2539 + return err; 2540 + } 2566 2541 2567 - if (map_value_has_spin_lock(reg->map_ptr)) { 2568 - u32 lock = reg->map_ptr->spin_lock_off; 2542 + return 0; 2543 + } 2544 + 2545 + /* check read/write into a map element with possible variable offset */ 2546 + static int check_map_access(struct bpf_verifier_env *env, u32 regno, 2547 + int off, int size, bool zero_size_allowed) 2548 + { 2549 + struct bpf_verifier_state *vstate = env->cur_state; 2550 + struct bpf_func_state *state = vstate->frame[vstate->curframe]; 2551 + struct bpf_reg_state *reg = &state->regs[regno]; 2552 + struct bpf_map *map = reg->map_ptr; 2553 + int err; 2554 + 2555 + err = check_mem_region_access(env, regno, off, size, map->value_size, 2556 + zero_size_allowed); 2557 + if (err) 2558 + return err; 2559 + 2560 + if (map_value_has_spin_lock(map)) { 2561 + u32 lock = map->spin_lock_off; 2569 2562 2570 2563 /* if any part of struct bpf_spin_lock can be touched by 2571 2564 * load/store reject this program. 
··· 2643 2596 } 2644 2597 } 2645 2598 2646 - static int __check_packet_access(struct bpf_verifier_env *env, u32 regno, 2647 - int off, int size, bool zero_size_allowed) 2648 - { 2649 - struct bpf_reg_state *regs = cur_regs(env); 2650 - struct bpf_reg_state *reg = &regs[regno]; 2651 - 2652 - if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) || 2653 - (u64)off + size > reg->range) { 2654 - verbose(env, "invalid access to packet, off=%d size=%d, R%d(id=%d,off=%d,r=%d)\n", 2655 - off, size, regno, reg->id, reg->off, reg->range); 2656 - return -EACCES; 2657 - } 2658 - return 0; 2659 - } 2660 - 2661 2599 static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off, 2662 2600 int size, bool zero_size_allowed) 2663 2601 { ··· 2663 2631 regno); 2664 2632 return -EACCES; 2665 2633 } 2666 - err = __check_packet_access(env, regno, off, size, zero_size_allowed); 2634 + err = __check_mem_access(env, regno, off, size, reg->range, 2635 + zero_size_allowed); 2667 2636 if (err) { 2668 2637 verbose(env, "R%d offset is outside of the packet\n", regno); 2669 2638 return err; 2670 2639 } 2671 2640 2672 - /* __check_packet_access has made sure "off + size - 1" is within u16. 2641 + /* __check_mem_access has made sure "off + size - 1" is within u16. 2673 2642 * reg->umax_value can't be bigger than MAX_PACKET_OFF which is 0xffff, 2674 2643 * otherwise find_good_pkt_pointers would have refused to set range info 2675 - * that __check_packet_access would have rejected this pkt access. 2644 + * that __check_mem_access would have rejected this pkt access. 2676 2645 * Therefore, "off + reg->umax_value + size - 1" won't overflow u32. 
2677 2646 */ 2678 2647 env->prog->aux->max_pkt_offset = ··· 3253 3220 mark_reg_unknown(env, regs, value_regno); 3254 3221 } 3255 3222 } 3223 + } else if (reg->type == PTR_TO_MEM) { 3224 + if (t == BPF_WRITE && value_regno >= 0 && 3225 + is_pointer_value(env, value_regno)) { 3226 + verbose(env, "R%d leaks addr into mem\n", value_regno); 3227 + return -EACCES; 3228 + } 3229 + err = check_mem_region_access(env, regno, off, size, 3230 + reg->mem_size, false); 3231 + if (!err && t == BPF_READ && value_regno >= 0) 3232 + mark_reg_unknown(env, regs, value_regno); 3256 3233 } else if (reg->type == PTR_TO_CTX) { 3257 3234 enum bpf_reg_type reg_type = SCALAR_VALUE; 3258 3235 u32 btf_id = 0; ··· 3600 3557 return -EACCES; 3601 3558 return check_map_access(env, regno, reg->off, access_size, 3602 3559 zero_size_allowed); 3560 + case PTR_TO_MEM: 3561 + return check_mem_region_access(env, regno, reg->off, 3562 + access_size, reg->mem_size, 3563 + zero_size_allowed); 3603 3564 default: /* scalar_value|ptr_to_stack or invalid ptr */ 3604 3565 return check_stack_boundary(env, regno, access_size, 3605 3566 zero_size_allowed, meta); ··· 3708 3661 type == ARG_CONST_SIZE_OR_ZERO; 3709 3662 } 3710 3663 3664 + static bool arg_type_is_alloc_mem_ptr(enum bpf_arg_type type) 3665 + { 3666 + return type == ARG_PTR_TO_ALLOC_MEM || 3667 + type == ARG_PTR_TO_ALLOC_MEM_OR_NULL; 3668 + } 3669 + 3670 + static bool arg_type_is_alloc_size(enum bpf_arg_type type) 3671 + { 3672 + return type == ARG_CONST_ALLOC_SIZE_OR_ZERO; 3673 + } 3674 + 3711 3675 static bool arg_type_is_int_ptr(enum bpf_arg_type type) 3712 3676 { 3713 3677 return type == ARG_PTR_TO_INT || ··· 3778 3720 type != expected_type) 3779 3721 goto err_type; 3780 3722 } else if (arg_type == ARG_CONST_SIZE || 3781 - arg_type == ARG_CONST_SIZE_OR_ZERO) { 3723 + arg_type == ARG_CONST_SIZE_OR_ZERO || 3724 + arg_type == ARG_CONST_ALLOC_SIZE_OR_ZERO) { 3782 3725 expected_type = SCALAR_VALUE; 3783 3726 if (type != expected_type) 3784 3727 goto 
err_type; ··· 3850 3791 * happens during stack boundary checking. 3851 3792 */ 3852 3793 if (register_is_null(reg) && 3853 - arg_type == ARG_PTR_TO_MEM_OR_NULL) 3794 + (arg_type == ARG_PTR_TO_MEM_OR_NULL || 3795 + arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL)) 3854 3796 /* final test in check_stack_boundary() */; 3855 3797 else if (!type_is_pkt_pointer(type) && 3856 3798 type != PTR_TO_MAP_VALUE && 3799 + type != PTR_TO_MEM && 3857 3800 type != expected_type) 3858 3801 goto err_type; 3859 3802 meta->raw_mode = arg_type == ARG_PTR_TO_UNINIT_MEM; 3803 + } else if (arg_type_is_alloc_mem_ptr(arg_type)) { 3804 + expected_type = PTR_TO_MEM; 3805 + if (register_is_null(reg) && 3806 + arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL) 3807 + /* final test in check_stack_boundary() */; 3808 + else if (type != expected_type) 3809 + goto err_type; 3810 + if (meta->ref_obj_id) { 3811 + verbose(env, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n", 3812 + regno, reg->ref_obj_id, 3813 + meta->ref_obj_id); 3814 + return -EFAULT; 3815 + } 3816 + meta->ref_obj_id = reg->ref_obj_id; 3860 3817 } else if (arg_type_is_int_ptr(arg_type)) { 3861 3818 expected_type = PTR_TO_STACK; 3862 3819 if (!type_is_pkt_pointer(type) && ··· 3968 3893 zero_size_allowed, meta); 3969 3894 if (!err) 3970 3895 err = mark_chain_precision(env, regno); 3896 + } else if (arg_type_is_alloc_size(arg_type)) { 3897 + if (!tnum_is_const(reg->var_off)) { 3898 + verbose(env, "R%d unbounded size, use 'var &= const' or 'if (var < const)'\n", 3899 + regno); 3900 + return -EACCES; 3901 + } 3902 + meta->mem_size = reg->var_off.value; 3971 3903 } else if (arg_type_is_int_ptr(arg_type)) { 3972 3904 int size = int_ptr_type_to_size(arg_type); 3973 3905 ··· 4009 3927 func_id != BPF_FUNC_skb_output && 4010 3928 func_id != BPF_FUNC_perf_event_read_value && 4011 3929 func_id != BPF_FUNC_xdp_output) 3930 + goto error; 3931 + break; 3932 + case BPF_MAP_TYPE_RINGBUF: 3933 + if (func_id != BPF_FUNC_ringbuf_output && 3934 + 
func_id != BPF_FUNC_ringbuf_reserve && 3935 + func_id != BPF_FUNC_ringbuf_submit && 3936 + func_id != BPF_FUNC_ringbuf_discard && 3937 + func_id != BPF_FUNC_ringbuf_query) 4012 3938 goto error; 4013 3939 break; 4014 3940 case BPF_MAP_TYPE_STACK_TRACE: ··· 4745 4655 mark_reg_known_zero(env, regs, BPF_REG_0); 4746 4656 regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL; 4747 4657 regs[BPF_REG_0].id = ++env->id_gen; 4658 + } else if (fn->ret_type == RET_PTR_TO_ALLOC_MEM_OR_NULL) { 4659 + mark_reg_known_zero(env, regs, BPF_REG_0); 4660 + regs[BPF_REG_0].type = PTR_TO_MEM_OR_NULL; 4661 + regs[BPF_REG_0].id = ++env->id_gen; 4662 + regs[BPF_REG_0].mem_size = meta.mem_size; 4748 4663 } else { 4749 4664 verbose(env, "unknown return type %d of func %s#%d\n", 4750 4665 fn->ret_type, func_id_name(func_id), func_id); ··· 6706 6611 reg->type = PTR_TO_TCP_SOCK; 6707 6612 } else if (reg->type == PTR_TO_BTF_ID_OR_NULL) { 6708 6613 reg->type = PTR_TO_BTF_ID; 6614 + } else if (reg->type == PTR_TO_MEM_OR_NULL) { 6615 + reg->type = PTR_TO_MEM; 6709 6616 } 6710 6617 if (is_null) { 6711 6618 /* We don't need id and ref_obj_id from this point
+19 -9
kernel/trace/bpf_trace.c
··· 147 147 return ret; 148 148 } 149 149 150 - static const struct bpf_func_proto bpf_probe_read_user_proto = { 150 + const struct bpf_func_proto bpf_probe_read_user_proto = { 151 151 .func = bpf_probe_read_user, 152 152 .gpl_only = true, 153 153 .ret_type = RET_INTEGER, ··· 167 167 return ret; 168 168 } 169 169 170 - static const struct bpf_func_proto bpf_probe_read_user_str_proto = { 170 + const struct bpf_func_proto bpf_probe_read_user_str_proto = { 171 171 .func = bpf_probe_read_user_str, 172 172 .gpl_only = true, 173 173 .ret_type = RET_INTEGER, ··· 198 198 return bpf_probe_read_kernel_common(dst, size, unsafe_ptr, false); 199 199 } 200 200 201 - static const struct bpf_func_proto bpf_probe_read_kernel_proto = { 201 + const struct bpf_func_proto bpf_probe_read_kernel_proto = { 202 202 .func = bpf_probe_read_kernel, 203 203 .gpl_only = true, 204 204 .ret_type = RET_INTEGER, ··· 253 253 return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr, false); 254 254 } 255 255 256 - static const struct bpf_func_proto bpf_probe_read_kernel_str_proto = { 256 + const struct bpf_func_proto bpf_probe_read_kernel_str_proto = { 257 257 .func = bpf_probe_read_kernel_str, 258 258 .gpl_only = true, 259 259 .ret_type = RET_INTEGER, ··· 585 585 goto out; 586 586 } 587 587 588 - err = strncpy_from_unsafe(bufs->buf[memcpy_cnt], 589 - (void *) (long) args[fmt_cnt], 590 - MAX_SEQ_PRINTF_STR_LEN); 588 + err = strncpy_from_unsafe_strict(bufs->buf[memcpy_cnt], 589 + (void *) (long) args[fmt_cnt], 590 + MAX_SEQ_PRINTF_STR_LEN); 591 591 if (err < 0) 592 592 bufs->buf[memcpy_cnt][0] = '\0'; 593 593 params[fmt_cnt] = (u64)(long)bufs->buf[memcpy_cnt]; ··· 907 907 return (long) current; 908 908 } 909 909 910 - static const struct bpf_func_proto bpf_get_current_task_proto = { 910 + const struct bpf_func_proto bpf_get_current_task_proto = { 911 911 .func = bpf_get_current_task, 912 912 .gpl_only = true, 913 913 .ret_type = RET_INTEGER, ··· 1088 1088 return &bpf_perf_event_read_value_proto; 
1089 1089 case BPF_FUNC_get_ns_current_pid_tgid: 1090 1090 return &bpf_get_ns_current_pid_tgid_proto; 1091 + case BPF_FUNC_ringbuf_output: 1092 + return &bpf_ringbuf_output_proto; 1093 + case BPF_FUNC_ringbuf_reserve: 1094 + return &bpf_ringbuf_reserve_proto; 1095 + case BPF_FUNC_ringbuf_submit: 1096 + return &bpf_ringbuf_submit_proto; 1097 + case BPF_FUNC_ringbuf_discard: 1098 + return &bpf_ringbuf_discard_proto; 1099 + case BPF_FUNC_ringbuf_query: 1100 + return &bpf_ringbuf_query_proto; 1091 1101 default: 1092 1102 return NULL; 1093 1103 } ··· 1467 1457 } 1468 1458 } 1469 1459 1470 - static const struct bpf_func_proto * 1460 + const struct bpf_func_proto * 1471 1461 tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) 1472 1462 { 1473 1463 switch (func_id) {
+18
net/core/dev.c
··· 5420 5420 struct bpf_prog *new = xdp->prog; 5421 5421 int ret = 0; 5422 5422 5423 + if (new) { 5424 + u32 i; 5425 + 5426 + /* generic XDP does not work with DEVMAPs that can 5427 + * have a bpf_prog installed on an entry 5428 + */ 5429 + for (i = 0; i < new->aux->used_map_cnt; i++) { 5430 + if (dev_map_can_have_prog(new->aux->used_maps[i])) 5431 + return -EINVAL; 5432 + } 5433 + } 5434 + 5423 5435 switch (xdp->command) { 5424 5436 case XDP_SETUP_PROG: 5425 5437 rcu_assign_pointer(dev->xdp_prog, new); ··· 8843 8831 8844 8832 if (!offload && bpf_prog_is_dev_bound(prog->aux)) { 8845 8833 NL_SET_ERR_MSG(extack, "using device-bound program without HW_MODE flag is not supported"); 8834 + bpf_prog_put(prog); 8835 + return -EINVAL; 8836 + } 8837 + 8838 + if (prog->expected_attach_type == BPF_XDP_DEVMAP) { 8839 + NL_SET_ERR_MSG(extack, "BPF_XDP_DEVMAP programs can not be attached to a device"); 8846 8840 bpf_prog_put(prog); 8847 8841 return -EINVAL; 8848 8842 }
+93 -1
net/core/filter.c
··· 4248 4248 static int _bpf_setsockopt(struct sock *sk, int level, int optname, 4249 4249 char *optval, int optlen, u32 flags) 4250 4250 { 4251 + char devname[IFNAMSIZ]; 4252 + struct net *net; 4253 + int ifindex; 4251 4254 int ret = 0; 4252 4255 int val; 4253 4256 ··· 4260 4257 sock_owned_by_me(sk); 4261 4258 4262 4259 if (level == SOL_SOCKET) { 4263 - if (optlen != sizeof(int)) 4260 + if (optlen != sizeof(int) && optname != SO_BINDTODEVICE) 4264 4261 return -EINVAL; 4265 4262 val = *((int *)optval); 4266 4263 ··· 4300 4297 sk->sk_mark = val; 4301 4298 sk_dst_reset(sk); 4302 4299 } 4300 + break; 4301 + case SO_BINDTODEVICE: 4302 + ret = -ENOPROTOOPT; 4303 + #ifdef CONFIG_NETDEVICES 4304 + optlen = min_t(long, optlen, IFNAMSIZ - 1); 4305 + strncpy(devname, optval, optlen); 4306 + devname[optlen] = 0; 4307 + 4308 + ifindex = 0; 4309 + if (devname[0] != '\0') { 4310 + struct net_device *dev; 4311 + 4312 + ret = -ENODEV; 4313 + 4314 + net = sock_net(sk); 4315 + dev = dev_get_by_name(net, devname); 4316 + if (!dev) 4317 + break; 4318 + ifindex = dev->ifindex; 4319 + dev_put(dev); 4320 + } 4321 + ret = sock_bindtoindex(sk, ifindex, false); 4322 + #endif 4303 4323 break; 4304 4324 default: 4305 4325 ret = -EINVAL; ··· 6469 6443 return &bpf_msg_push_data_proto; 6470 6444 case BPF_FUNC_msg_pop_data: 6471 6445 return &bpf_msg_pop_data_proto; 6446 + case BPF_FUNC_perf_event_output: 6447 + return &bpf_event_output_data_proto; 6448 + case BPF_FUNC_get_current_uid_gid: 6449 + return &bpf_get_current_uid_gid_proto; 6450 + case BPF_FUNC_get_current_pid_tgid: 6451 + return &bpf_get_current_pid_tgid_proto; 6452 + case BPF_FUNC_sk_storage_get: 6453 + return &bpf_sk_storage_get_proto; 6454 + case BPF_FUNC_sk_storage_delete: 6455 + return &bpf_sk_storage_delete_proto; 6456 + #ifdef CONFIG_CGROUPS 6457 + case BPF_FUNC_get_current_cgroup_id: 6458 + return &bpf_get_current_cgroup_id_proto; 6459 + case BPF_FUNC_get_current_ancestor_cgroup_id: 6460 + return 
&bpf_get_current_ancestor_cgroup_id_proto; 6461 + #endif 6462 + #ifdef CONFIG_CGROUP_NET_CLASSID 6463 + case BPF_FUNC_get_cgroup_classid: 6464 + return &bpf_get_cgroup_classid_curr_proto; 6465 + #endif 6472 6466 default: 6473 6467 return bpf_base_func_proto(func_id); 6474 6468 } ··· 6875 6829 case offsetof(struct bpf_sock, protocol): 6876 6830 case offsetof(struct bpf_sock, dst_port): 6877 6831 case offsetof(struct bpf_sock, src_port): 6832 + case offsetof(struct bpf_sock, rx_queue_mapping): 6878 6833 case bpf_ctx_range(struct bpf_sock, src_ip4): 6879 6834 case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]): 6880 6835 case bpf_ctx_range(struct bpf_sock, dst_ip4): ··· 7041 6994 const struct bpf_prog *prog, 7042 6995 struct bpf_insn_access_aux *info) 7043 6996 { 6997 + if (prog->expected_attach_type != BPF_XDP_DEVMAP) { 6998 + switch (off) { 6999 + case offsetof(struct xdp_md, egress_ifindex): 7000 + return false; 7001 + } 7002 + } 7003 + 7044 7004 if (type == BPF_WRITE) { 7045 7005 if (bpf_prog_is_dev_bound(prog->aux)) { 7046 7006 switch (off) { ··· 7310 7256 info->reg_type = PTR_TO_PACKET_END; 7311 7257 if (size != sizeof(__u64)) 7312 7258 return false; 7259 + break; 7260 + case offsetof(struct sk_msg_md, sk): 7261 + if (size != sizeof(__u64)) 7262 + return false; 7263 + info->reg_type = PTR_TO_SOCKET; 7313 7264 break; 7314 7265 case bpf_ctx_range(struct sk_msg_md, family): 7315 7266 case bpf_ctx_range(struct sk_msg_md, remote_ip4): ··· 7931 7872 skc_state), 7932 7873 target_size)); 7933 7874 break; 7875 + case offsetof(struct bpf_sock, rx_queue_mapping): 7876 + #ifdef CONFIG_XPS 7877 + *insn++ = BPF_LDX_MEM( 7878 + BPF_FIELD_SIZEOF(struct sock, sk_rx_queue_mapping), 7879 + si->dst_reg, si->src_reg, 7880 + bpf_target_off(struct sock, sk_rx_queue_mapping, 7881 + sizeof_field(struct sock, 7882 + sk_rx_queue_mapping), 7883 + target_size)); 7884 + *insn++ = BPF_JMP_IMM(BPF_JNE, si->dst_reg, NO_QUEUE_MAPPING, 7885 + 1); 7886 + *insn++ = 
BPF_MOV64_IMM(si->dst_reg, -1); 7887 + #else 7888 + *insn++ = BPF_MOV64_IMM(si->dst_reg, -1); 7889 + *target_size = 2; 7890 + #endif 7891 + break; 7934 7892 } 7935 7893 7936 7894 return insn - insn_buf; ··· 8017 7941 *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, 8018 7942 offsetof(struct xdp_rxq_info, 8019 7943 queue_index)); 7944 + break; 7945 + case offsetof(struct xdp_md, egress_ifindex): 7946 + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, txq), 7947 + si->dst_reg, si->src_reg, 7948 + offsetof(struct xdp_buff, txq)); 7949 + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_txq_info, dev), 7950 + si->dst_reg, si->dst_reg, 7951 + offsetof(struct xdp_txq_info, dev)); 7952 + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, 7953 + offsetof(struct net_device, ifindex)); 8020 7954 break; 8021 7955 } 8022 7956 ··· 8678 8592 *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_sg, size), 8679 8593 si->dst_reg, si->src_reg, 8680 8594 offsetof(struct sk_msg_sg, size)); 8595 + break; 8596 + 8597 + case offsetof(struct sk_msg_md, sk): 8598 + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg, sk), 8599 + si->dst_reg, si->src_reg, 8600 + offsetof(struct sk_msg, sk)); 8681 8601 break; 8682 8602 } 8683 8603
+20 -104
net/core/flow_dissector.c
··· 31 31 #include <net/netfilter/nf_conntrack_core.h> 32 32 #include <net/netfilter/nf_conntrack_labels.h> 33 33 #endif 34 - 35 - static DEFINE_MUTEX(flow_dissector_mutex); 34 + #include <linux/bpf-netns.h> 36 35 37 36 static void dissector_set_key(struct flow_dissector *flow_dissector, 38 37 enum flow_dissector_key_id key_id) ··· 69 70 } 70 71 EXPORT_SYMBOL(skb_flow_dissector_init); 71 72 72 - int skb_flow_dissector_prog_query(const union bpf_attr *attr, 73 - union bpf_attr __user *uattr) 73 + #ifdef CONFIG_BPF_SYSCALL 74 + int flow_dissector_bpf_prog_attach(struct net *net, struct bpf_prog *prog) 74 75 { 75 - __u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids); 76 - u32 prog_id, prog_cnt = 0, flags = 0; 76 + enum netns_bpf_attach_type type = NETNS_BPF_FLOW_DISSECTOR; 77 77 struct bpf_prog *attached; 78 - struct net *net; 79 - 80 - if (attr->query.query_flags) 81 - return -EINVAL; 82 - 83 - net = get_net_ns_by_fd(attr->query.target_fd); 84 - if (IS_ERR(net)) 85 - return PTR_ERR(net); 86 - 87 - rcu_read_lock(); 88 - attached = rcu_dereference(net->flow_dissector_prog); 89 - if (attached) { 90 - prog_cnt = 1; 91 - prog_id = attached->aux->id; 92 - } 93 - rcu_read_unlock(); 94 - 95 - put_net(net); 96 - 97 - if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags))) 98 - return -EFAULT; 99 - if (copy_to_user(&uattr->query.prog_cnt, &prog_cnt, sizeof(prog_cnt))) 100 - return -EFAULT; 101 - 102 - if (!attr->query.prog_cnt || !prog_ids || !prog_cnt) 103 - return 0; 104 - 105 - if (copy_to_user(prog_ids, &prog_id, sizeof(u32))) 106 - return -EFAULT; 107 - 108 - return 0; 109 - } 110 - 111 - int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr, 112 - struct bpf_prog *prog) 113 - { 114 - struct bpf_prog *attached; 115 - struct net *net; 116 - int ret = 0; 117 - 118 - net = current->nsproxy->net_ns; 119 - mutex_lock(&flow_dissector_mutex); 120 78 121 79 if (net == &init_net) { 122 80 /* BPF flow dissector in the root namespace overrides ··· 
86 130 for_each_net(ns) { 87 131 if (ns == &init_net) 88 132 continue; 89 - if (rcu_access_pointer(ns->flow_dissector_prog)) { 90 - ret = -EEXIST; 91 - goto out; 92 - } 133 + if (rcu_access_pointer(ns->bpf.progs[type])) 134 + return -EEXIST; 93 135 } 94 136 } else { 95 137 /* Make sure root flow dissector is not attached 96 138 * when attaching to the non-root namespace. 97 139 */ 98 - if (rcu_access_pointer(init_net.flow_dissector_prog)) { 99 - ret = -EEXIST; 100 - goto out; 101 - } 140 + if (rcu_access_pointer(init_net.bpf.progs[type])) 141 + return -EEXIST; 102 142 } 103 143 104 - attached = rcu_dereference_protected(net->flow_dissector_prog, 105 - lockdep_is_held(&flow_dissector_mutex)); 106 - if (attached == prog) { 144 + attached = rcu_dereference_protected(net->bpf.progs[type], 145 + lockdep_is_held(&netns_bpf_mutex)); 146 + if (attached == prog) 107 147 /* The same program cannot be attached twice */ 108 - ret = -EINVAL; 109 - goto out; 110 - } 111 - rcu_assign_pointer(net->flow_dissector_prog, prog); 148 + return -EINVAL; 149 + 150 + rcu_assign_pointer(net->bpf.progs[type], prog); 112 151 if (attached) 113 152 bpf_prog_put(attached); 114 - out: 115 - mutex_unlock(&flow_dissector_mutex); 116 - return ret; 117 - } 118 - 119 - static int flow_dissector_bpf_prog_detach(struct net *net) 120 - { 121 - struct bpf_prog *attached; 122 - 123 - mutex_lock(&flow_dissector_mutex); 124 - attached = rcu_dereference_protected(net->flow_dissector_prog, 125 - lockdep_is_held(&flow_dissector_mutex)); 126 - if (!attached) { 127 - mutex_unlock(&flow_dissector_mutex); 128 - return -ENOENT; 129 - } 130 - RCU_INIT_POINTER(net->flow_dissector_prog, NULL); 131 - bpf_prog_put(attached); 132 - mutex_unlock(&flow_dissector_mutex); 133 153 return 0; 134 154 } 135 - 136 - int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr) 137 - { 138 - return flow_dissector_bpf_prog_detach(current->nsproxy->net_ns); 139 - } 140 - 141 - static void __net_exit 
flow_dissector_pernet_pre_exit(struct net *net) 142 - { 143 - /* We're not racing with attach/detach because there are no 144 - * references to netns left when pre_exit gets called. 145 - */ 146 - if (rcu_access_pointer(net->flow_dissector_prog)) 147 - flow_dissector_bpf_prog_detach(net); 148 - } 149 - 150 - static struct pernet_operations flow_dissector_pernet_ops __net_initdata = { 151 - .pre_exit = flow_dissector_pernet_pre_exit, 152 - }; 155 + #endif /* CONFIG_BPF_SYSCALL */ 153 156 154 157 /** 155 158 * __skb_flow_get_ports - extract the upper layer ports and return them ··· 959 1044 960 1045 WARN_ON_ONCE(!net); 961 1046 if (net) { 1047 + enum netns_bpf_attach_type type = NETNS_BPF_FLOW_DISSECTOR; 1048 + 962 1049 rcu_read_lock(); 963 - attached = rcu_dereference(init_net.flow_dissector_prog); 1050 + attached = rcu_dereference(init_net.bpf.progs[type]); 964 1051 965 1052 if (!attached) 966 - attached = rcu_dereference(net->flow_dissector_prog); 1053 + attached = rcu_dereference(net->bpf.progs[type]); 967 1054 968 1055 if (attached) { 969 1056 struct bpf_flow_keys flow_keys; ··· 1786 1869 skb_flow_dissector_init(&flow_keys_basic_dissector, 1787 1870 flow_keys_basic_dissector_keys, 1788 1871 ARRAY_SIZE(flow_keys_basic_dissector_keys)); 1789 - 1790 - return register_pernet_subsys(&flow_dissector_pernet_ops); 1872 + return 0; 1791 1873 } 1792 1874 core_initcall(init_default_flow_dissectors);
+74 -24
net/core/skmsg.c
··· 7 7 8 8 #include <net/sock.h> 9 9 #include <net/tcp.h> 10 + #include <net/tls.h> 10 11 11 12 static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce) 12 13 { ··· 683 682 return container_of(parser, struct sk_psock, parser); 684 683 } 685 684 686 - static void sk_psock_verdict_apply(struct sk_psock *psock, 687 - struct sk_buff *skb, int verdict) 685 + static void sk_psock_skb_redirect(struct sk_psock *psock, struct sk_buff *skb) 688 686 { 689 687 struct sk_psock *psock_other; 690 688 struct sock *sk_other; 691 689 bool ingress; 690 + 691 + sk_other = tcp_skb_bpf_redirect_fetch(skb); 692 + if (unlikely(!sk_other)) { 693 + kfree_skb(skb); 694 + return; 695 + } 696 + psock_other = sk_psock(sk_other); 697 + if (!psock_other || sock_flag(sk_other, SOCK_DEAD) || 698 + !sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) { 699 + kfree_skb(skb); 700 + return; 701 + } 702 + 703 + ingress = tcp_skb_bpf_ingress(skb); 704 + if ((!ingress && sock_writeable(sk_other)) || 705 + (ingress && 706 + atomic_read(&sk_other->sk_rmem_alloc) <= 707 + sk_other->sk_rcvbuf)) { 708 + if (!ingress) 709 + skb_set_owner_w(skb, sk_other); 710 + skb_queue_tail(&psock_other->ingress_skb, skb); 711 + schedule_work(&psock_other->work); 712 + } else { 713 + kfree_skb(skb); 714 + } 715 + } 716 + 717 + static void sk_psock_tls_verdict_apply(struct sk_psock *psock, 718 + struct sk_buff *skb, int verdict) 719 + { 720 + switch (verdict) { 721 + case __SK_REDIRECT: 722 + sk_psock_skb_redirect(psock, skb); 723 + break; 724 + case __SK_PASS: 725 + case __SK_DROP: 726 + default: 727 + break; 728 + } 729 + } 730 + 731 + int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb) 732 + { 733 + struct bpf_prog *prog; 734 + int ret = __SK_PASS; 735 + 736 + rcu_read_lock(); 737 + prog = READ_ONCE(psock->progs.skb_verdict); 738 + if (likely(prog)) { 739 + tcp_skb_bpf_redirect_clear(skb); 740 + ret = sk_psock_bpf_run(psock, prog, skb); 741 + ret = sk_psock_map_verd(ret, 
tcp_skb_bpf_redirect_fetch(skb)); 742 + } 743 + rcu_read_unlock(); 744 + sk_psock_tls_verdict_apply(psock, skb, ret); 745 + return ret; 746 + } 747 + EXPORT_SYMBOL_GPL(sk_psock_tls_strp_read); 748 + 749 + static void sk_psock_verdict_apply(struct sk_psock *psock, 750 + struct sk_buff *skb, int verdict) 751 + { 752 + struct sock *sk_other; 692 753 693 754 switch (verdict) { 694 755 case __SK_PASS: ··· 770 707 } 771 708 goto out_free; 772 709 case __SK_REDIRECT: 773 - sk_other = tcp_skb_bpf_redirect_fetch(skb); 774 - if (unlikely(!sk_other)) 775 - goto out_free; 776 - psock_other = sk_psock(sk_other); 777 - if (!psock_other || sock_flag(sk_other, SOCK_DEAD) || 778 - !sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) 779 - goto out_free; 780 - ingress = tcp_skb_bpf_ingress(skb); 781 - if ((!ingress && sock_writeable(sk_other)) || 782 - (ingress && 783 - atomic_read(&sk_other->sk_rmem_alloc) <= 784 - sk_other->sk_rcvbuf)) { 785 - if (!ingress) 786 - skb_set_owner_w(skb, sk_other); 787 - skb_queue_tail(&psock_other->ingress_skb, skb); 788 - schedule_work(&psock_other->work); 789 - break; 790 - } 791 - /* fall-through */ 710 + sk_psock_skb_redirect(psock, skb); 711 + break; 792 712 case __SK_DROP: 793 713 /* fall-through */ 794 714 default: ··· 825 779 rcu_read_lock(); 826 780 psock = sk_psock(sk); 827 781 if (likely(psock)) { 828 - write_lock_bh(&sk->sk_callback_lock); 829 - strp_data_ready(&psock->parser.strp); 830 - write_unlock_bh(&sk->sk_callback_lock); 782 + if (tls_sw_has_ctx_rx(sk)) { 783 + psock->parser.saved_data_ready(sk); 784 + } else { 785 + write_lock_bh(&sk->sk_callback_lock); 786 + strp_data_ready(&psock->parser.strp); 787 + write_unlock_bh(&sk->sk_callback_lock); 788 + } 831 789 } 832 790 rcu_read_unlock(); 833 791 }
+6 -4
net/core/sock.c
··· 594 594 return ret; 595 595 } 596 596 597 - int sock_bindtoindex(struct sock *sk, int ifindex) 597 + int sock_bindtoindex(struct sock *sk, int ifindex, bool lock_sk) 598 598 { 599 599 int ret; 600 600 601 - lock_sock(sk); 601 + if (lock_sk) 602 + lock_sock(sk); 602 603 ret = sock_bindtoindex_locked(sk, ifindex); 603 - release_sock(sk); 604 + if (lock_sk) 605 + release_sock(sk); 604 606 605 607 return ret; 606 608 } ··· 648 646 goto out; 649 647 } 650 648 651 - return sock_bindtoindex(sk, index); 649 + return sock_bindtoindex(sk, index, true); 652 650 out: 653 651 #endif 654 652
+1 -1
net/ipv4/udp_tunnel.c
··· 22 22 goto error; 23 23 24 24 if (cfg->bind_ifindex) { 25 - err = sock_bindtoindex(sock->sk, cfg->bind_ifindex); 25 + err = sock_bindtoindex(sock->sk, cfg->bind_ifindex, true); 26 26 if (err < 0) 27 27 goto error; 28 28 }
+1 -1
net/ipv6/ip6_udp_tunnel.c
··· 30 30 goto error; 31 31 } 32 32 if (cfg->bind_ifindex) { 33 - err = sock_bindtoindex(sock->sk, cfg->bind_ifindex); 33 + err = sock_bindtoindex(sock->sk, cfg->bind_ifindex, true); 34 34 if (err < 0) 35 35 goto error; 36 36 }
+18 -2
net/tls/tls_sw.c
··· 1742 1742 long timeo; 1743 1743 bool is_kvec = iov_iter_is_kvec(&msg->msg_iter); 1744 1744 bool is_peek = flags & MSG_PEEK; 1745 + bool bpf_strp_enabled; 1745 1746 int num_async = 0; 1746 1747 int pending; 1747 1748 ··· 1753 1752 1754 1753 psock = sk_psock_get(sk); 1755 1754 lock_sock(sk); 1755 + bpf_strp_enabled = sk_psock_strp_enabled(psock); 1756 1756 1757 1757 /* Process pending decrypted records. It must be non-zero-copy */ 1758 1758 err = process_rx_list(ctx, msg, &control, &cmsg, 0, len, false, ··· 1807 1805 1808 1806 if (to_decrypt <= len && !is_kvec && !is_peek && 1809 1807 ctx->control == TLS_RECORD_TYPE_DATA && 1810 - prot->version != TLS_1_3_VERSION) 1808 + prot->version != TLS_1_3_VERSION && 1809 + !bpf_strp_enabled) 1811 1810 zc = true; 1812 1811 1813 1812 /* Do not use async mode if record is non-data */ 1814 - if (ctx->control == TLS_RECORD_TYPE_DATA) 1813 + if (ctx->control == TLS_RECORD_TYPE_DATA && !bpf_strp_enabled) 1815 1814 async_capable = ctx->async_capable; 1816 1815 else 1817 1816 async_capable = false; ··· 1862 1859 goto pick_next_record; 1863 1860 1864 1861 if (!zc) { 1862 + if (bpf_strp_enabled) { 1863 + err = sk_psock_tls_strp_read(psock, skb); 1864 + if (err != __SK_PASS) { 1865 + rxm->offset = rxm->offset + rxm->full_len; 1866 + rxm->full_len = 0; 1867 + if (err == __SK_DROP) 1868 + consume_skb(skb); 1869 + ctx->recv_pkt = NULL; 1870 + __strp_unpause(&ctx->strp); 1871 + continue; 1872 + } 1873 + } 1874 + 1865 1875 if (rxm->full_len > len) { 1866 1876 retain_skb = true; 1867 1877 chunk = len;
+5 -5
tools/bpf/bpftool/btf.c
··· 553 553 btf = btf__parse_elf(*argv, NULL); 554 554 555 555 if (IS_ERR(btf)) { 556 - err = PTR_ERR(btf); 556 + err = -PTR_ERR(btf); 557 557 btf = NULL; 558 558 p_err("failed to load BTF from %s: %s", 559 559 *argv, strerror(err)); ··· 951 951 } 952 952 953 953 fprintf(stderr, 954 - "Usage: %s btf { show | list } [id BTF_ID]\n" 955 - " %s btf dump BTF_SRC [format FORMAT]\n" 956 - " %s btf help\n" 954 + "Usage: %1$s %2$s { show | list } [id BTF_ID]\n" 955 + " %1$s %2$s dump BTF_SRC [format FORMAT]\n" 956 + " %1$s %2$s help\n" 957 957 "\n" 958 958 " BTF_SRC := { id BTF_ID | prog PROG | map MAP [{key | value | kv | all}] | file FILE }\n" 959 959 " FORMAT := { raw | c }\n" ··· 961 961 " " HELP_SPEC_PROGRAM "\n" 962 962 " " HELP_SPEC_OPTIONS "\n" 963 963 "", 964 - bin_name, bin_name, bin_name); 964 + bin_name, "btf"); 965 965 966 966 return 0; 967 967 }
+6 -8
tools/bpf/bpftool/cgroup.c
··· 491 491 } 492 492 493 493 fprintf(stderr, 494 - "Usage: %s %s { show | list } CGROUP [**effective**]\n" 495 - " %s %s tree [CGROUP_ROOT] [**effective**]\n" 496 - " %s %s attach CGROUP ATTACH_TYPE PROG [ATTACH_FLAGS]\n" 497 - " %s %s detach CGROUP ATTACH_TYPE PROG\n" 498 - " %s %s help\n" 494 + "Usage: %1$s %2$s { show | list } CGROUP [**effective**]\n" 495 + " %1$s %2$s tree [CGROUP_ROOT] [**effective**]\n" 496 + " %1$s %2$s attach CGROUP ATTACH_TYPE PROG [ATTACH_FLAGS]\n" 497 + " %1$s %2$s detach CGROUP ATTACH_TYPE PROG\n" 498 + " %1$s %2$s help\n" 499 499 "\n" 500 500 HELP_SPEC_ATTACH_TYPES "\n" 501 501 " " HELP_SPEC_ATTACH_FLAGS "\n" 502 502 " " HELP_SPEC_PROGRAM "\n" 503 503 " " HELP_SPEC_OPTIONS "\n" 504 504 "", 505 - bin_name, argv[-2], 506 - bin_name, argv[-2], bin_name, argv[-2], 507 - bin_name, argv[-2], bin_name, argv[-2]); 505 + bin_name, argv[-2]); 508 506 509 507 return 0; 510 508 }
+69 -22
tools/bpf/bpftool/feature.c
··· 758 758 print_end_section(); 759 759 } 760 760 761 + #ifdef USE_LIBCAP 762 + #define capability(c) { c, false, #c } 763 + #define capability_msg(a, i) a[i].set ? "" : a[i].name, a[i].set ? "" : ", " 764 + #endif 765 + 761 766 static int handle_perms(void) 762 767 { 763 768 #ifdef USE_LIBCAP 764 - cap_value_t cap_list[1] = { CAP_SYS_ADMIN }; 765 - bool has_sys_admin_cap = false; 769 + struct { 770 + cap_value_t cap; 771 + bool set; 772 + char name[14]; /* strlen("CAP_SYS_ADMIN") */ 773 + } bpf_caps[] = { 774 + capability(CAP_SYS_ADMIN), 775 + #ifdef CAP_BPF 776 + capability(CAP_BPF), 777 + capability(CAP_NET_ADMIN), 778 + capability(CAP_PERFMON), 779 + #endif 780 + }; 781 + cap_value_t cap_list[ARRAY_SIZE(bpf_caps)]; 782 + unsigned int i, nb_bpf_caps = 0; 783 + bool cap_sys_admin_only = true; 766 784 cap_flag_value_t val; 767 785 int res = -1; 768 786 cap_t caps; ··· 792 774 return -1; 793 775 } 794 776 795 - if (cap_get_flag(caps, CAP_SYS_ADMIN, CAP_EFFECTIVE, &val)) { 796 - p_err("bug: failed to retrieve CAP_SYS_ADMIN status"); 797 - goto exit_free; 798 - } 799 - if (val == CAP_SET) 800 - has_sys_admin_cap = true; 777 + #ifdef CAP_BPF 778 + if (CAP_IS_SUPPORTED(CAP_BPF)) 779 + cap_sys_admin_only = false; 780 + #endif 801 781 802 - if (!run_as_unprivileged && !has_sys_admin_cap) { 803 - p_err("full feature probing requires CAP_SYS_ADMIN, run as root or use 'unprivileged'"); 804 - goto exit_free; 782 + for (i = 0; i < ARRAY_SIZE(bpf_caps); i++) { 783 + const char *cap_name = bpf_caps[i].name; 784 + cap_value_t cap = bpf_caps[i].cap; 785 + 786 + if (cap_get_flag(caps, cap, CAP_EFFECTIVE, &val)) { 787 + p_err("bug: failed to retrieve %s status: %s", cap_name, 788 + strerror(errno)); 789 + goto exit_free; 790 + } 791 + 792 + if (val == CAP_SET) { 793 + bpf_caps[i].set = true; 794 + cap_list[nb_bpf_caps++] = cap; 795 + } 796 + 797 + if (cap_sys_admin_only) 798 + /* System does not know about CAP_BPF, meaning that 799 + * CAP_SYS_ADMIN is the only capability 
required. We 800 + * just checked it, break. 801 + */ 802 + break; 805 803 } 806 804 807 - if ((run_as_unprivileged && !has_sys_admin_cap) || 808 - (!run_as_unprivileged && has_sys_admin_cap)) { 805 + if ((run_as_unprivileged && !nb_bpf_caps) || 806 + (!run_as_unprivileged && nb_bpf_caps == ARRAY_SIZE(bpf_caps)) || 807 + (!run_as_unprivileged && cap_sys_admin_only && nb_bpf_caps)) { 809 808 /* We are all good, exit now */ 810 809 res = 0; 811 810 goto exit_free; 812 811 } 813 812 814 - /* if (run_as_unprivileged && has_sys_admin_cap), drop CAP_SYS_ADMIN */ 813 + if (!run_as_unprivileged) { 814 + if (cap_sys_admin_only) 815 + p_err("missing %s, required for full feature probing; run as root or use 'unprivileged'", 816 + bpf_caps[0].name); 817 + else 818 + p_err("missing %s%s%s%s%s%s%s%srequired for full feature probing; run as root or use 'unprivileged'", 819 + capability_msg(bpf_caps, 0), 820 + capability_msg(bpf_caps, 1), 821 + capability_msg(bpf_caps, 2), 822 + capability_msg(bpf_caps, 3)); 823 + goto exit_free; 824 + } 815 825 816 - if (cap_set_flag(caps, CAP_EFFECTIVE, ARRAY_SIZE(cap_list), cap_list, 826 + /* if (run_as_unprivileged && nb_bpf_caps > 0), drop capabilities. */ 827 + if (cap_set_flag(caps, CAP_EFFECTIVE, nb_bpf_caps, cap_list, 817 828 CAP_CLEAR)) { 818 - p_err("bug: failed to clear CAP_SYS_ADMIN from capabilities"); 829 + p_err("bug: failed to clear capabilities: %s", strerror(errno)); 819 830 goto exit_free; 820 831 } 821 832 822 833 if (cap_set_proc(caps)) { 823 - p_err("failed to drop CAP_SYS_ADMIN: %s", strerror(errno)); 834 + p_err("failed to drop capabilities: %s", strerror(errno)); 824 835 goto exit_free; 825 836 } 826 837 ··· 864 817 865 818 return res; 866 819 #else 867 - /* Detection assumes user has sufficient privileges (CAP_SYS_ADMIN). 820 + /* Detection assumes user has specific privileges. 868 821 * We do not use libpcap so let's approximate, and restrict usage to 869 822 * root user only. 
870 823 */ ··· 948 901 } 949 902 } 950 903 951 - /* Full feature detection requires CAP_SYS_ADMIN privilege. 904 + /* Full feature detection requires specific privileges. 952 905 * Let's approximate, and warn if user is not root. 953 906 */ 954 907 if (handle_perms()) ··· 984 937 } 985 938 986 939 fprintf(stderr, 987 - "Usage: %s %s probe [COMPONENT] [full] [unprivileged] [macros [prefix PREFIX]]\n" 988 - " %s %s help\n" 940 + "Usage: %1$s %2$s probe [COMPONENT] [full] [unprivileged] [macros [prefix PREFIX]]\n" 941 + " %1$s %2$s help\n" 989 942 "\n" 990 943 " COMPONENT := { kernel | dev NAME }\n" 991 944 "", 992 - bin_name, argv[-2], bin_name, argv[-2]); 945 + bin_name, argv[-2]); 993 946 994 947 return 0; 995 948 }
+3 -3
tools/bpf/bpftool/gen.c
··· 586 586 } 587 587 588 588 fprintf(stderr, 589 - "Usage: %1$s gen skeleton FILE\n" 590 - " %1$s gen help\n" 589 + "Usage: %1$s %2$s skeleton FILE\n" 590 + " %1$s %2$s help\n" 591 591 "\n" 592 592 " " HELP_SPEC_OPTIONS "\n" 593 593 "", 594 - bin_name); 594 + bin_name, "gen"); 595 595 596 596 return 0; 597 597 }
+4 -4
tools/bpf/bpftool/iter.c
··· 68 68 static int do_help(int argc, char **argv) 69 69 { 70 70 fprintf(stderr, 71 - "Usage: %s %s pin OBJ PATH\n" 72 - " %s %s help\n" 73 - "\n", 74 - bin_name, argv[-2], bin_name, argv[-2]); 71 + "Usage: %1$s %2$s pin OBJ PATH\n" 72 + " %1$s %2$s help\n" 73 + "", 74 + bin_name, "iter"); 75 75 76 76 return 0; 77 77 }
+32 -23
tools/bpf/bpftool/link.c
··· 17 17 [BPF_LINK_TYPE_TRACING] = "tracing", 18 18 [BPF_LINK_TYPE_CGROUP] = "cgroup", 19 19 [BPF_LINK_TYPE_ITER] = "iter", 20 + [BPF_LINK_TYPE_NETNS] = "netns", 20 21 }; 21 22 22 23 static int link_parse_fd(int *argc, char ***argv) ··· 63 62 jsonw_uint_field(json_wtr, "prog_id", info->prog_id); 64 63 } 65 64 65 + static void show_link_attach_type_json(__u32 attach_type, json_writer_t *wtr) 66 + { 67 + if (attach_type < ARRAY_SIZE(attach_type_name)) 68 + jsonw_string_field(wtr, "attach_type", 69 + attach_type_name[attach_type]); 70 + else 71 + jsonw_uint_field(wtr, "attach_type", attach_type); 72 + } 73 + 66 74 static int get_prog_info(int prog_id, struct bpf_prog_info *info) 67 75 { 68 76 __u32 len = sizeof(*info); ··· 115 105 jsonw_uint_field(json_wtr, "prog_type", 116 106 prog_info.type); 117 107 118 - if (info->tracing.attach_type < ARRAY_SIZE(attach_type_name)) 119 - jsonw_string_field(json_wtr, "attach_type", 120 - attach_type_name[info->tracing.attach_type]); 121 - else 122 - jsonw_uint_field(json_wtr, "attach_type", 123 - info->tracing.attach_type); 108 + show_link_attach_type_json(info->tracing.attach_type, 109 + json_wtr); 124 110 break; 125 111 case BPF_LINK_TYPE_CGROUP: 126 112 jsonw_lluint_field(json_wtr, "cgroup_id", 127 113 info->cgroup.cgroup_id); 128 - if (info->cgroup.attach_type < ARRAY_SIZE(attach_type_name)) 129 - jsonw_string_field(json_wtr, "attach_type", 130 - attach_type_name[info->cgroup.attach_type]); 131 - else 132 - jsonw_uint_field(json_wtr, "attach_type", 133 - info->cgroup.attach_type); 114 + show_link_attach_type_json(info->cgroup.attach_type, json_wtr); 115 + break; 116 + case BPF_LINK_TYPE_NETNS: 117 + jsonw_uint_field(json_wtr, "netns_ino", 118 + info->netns.netns_ino); 119 + show_link_attach_type_json(info->netns.attach_type, json_wtr); 134 120 break; 135 121 default: 136 122 break; ··· 159 153 printf("prog %u ", info->prog_id); 160 154 } 161 155 156 + static void show_link_attach_type_plain(__u32 attach_type) 157 + { 158 + if 
(attach_type < ARRAY_SIZE(attach_type_name)) 159 + printf("attach_type %s ", attach_type_name[attach_type]); 160 + else 161 + printf("attach_type %u ", attach_type); 162 + } 163 + 162 164 static int show_link_close_plain(int fd, struct bpf_link_info *info) 163 165 { 164 166 struct bpf_prog_info prog_info; ··· 190 176 else 191 177 printf("\n\tprog_type %u ", prog_info.type); 192 178 193 - if (info->tracing.attach_type < ARRAY_SIZE(attach_type_name)) 194 - printf("attach_type %s ", 195 - attach_type_name[info->tracing.attach_type]); 196 - else 197 - printf("attach_type %u ", info->tracing.attach_type); 179 + show_link_attach_type_plain(info->tracing.attach_type); 198 180 break; 199 181 case BPF_LINK_TYPE_CGROUP: 200 182 printf("\n\tcgroup_id %zu ", (size_t)info->cgroup.cgroup_id); 201 - if (info->cgroup.attach_type < ARRAY_SIZE(attach_type_name)) 202 - printf("attach_type %s ", 203 - attach_type_name[info->cgroup.attach_type]); 204 - else 205 - printf("attach_type %u ", info->cgroup.attach_type); 183 + show_link_attach_type_plain(info->cgroup.attach_type); 184 + break; 185 + case BPF_LINK_TYPE_NETNS: 186 + printf("\n\tnetns_ino %u ", info->netns.netns_ino); 187 + show_link_attach_type_plain(info->netns.attach_type); 206 188 break; 207 189 default: 208 190 break; ··· 322 312 " %1$s %2$s help\n" 323 313 "\n" 324 314 " " HELP_SPEC_LINK "\n" 325 - " " HELP_SPEC_PROGRAM "\n" 326 315 " " HELP_SPEC_OPTIONS "\n" 327 316 "", 328 317 bin_name, argv[-2]);
+18 -23
tools/bpf/bpftool/map.c
··· 1561 1561 } 1562 1562 1563 1563 fprintf(stderr, 1564 - "Usage: %s %s { show | list } [MAP]\n" 1565 - " %s %s create FILE type TYPE key KEY_SIZE value VALUE_SIZE \\\n" 1566 - " entries MAX_ENTRIES name NAME [flags FLAGS] \\\n" 1567 - " [dev NAME]\n" 1568 - " %s %s dump MAP\n" 1569 - " %s %s update MAP [key DATA] [value VALUE] [UPDATE_FLAGS]\n" 1570 - " %s %s lookup MAP [key DATA]\n" 1571 - " %s %s getnext MAP [key DATA]\n" 1572 - " %s %s delete MAP key DATA\n" 1573 - " %s %s pin MAP FILE\n" 1574 - " %s %s event_pipe MAP [cpu N index M]\n" 1575 - " %s %s peek MAP\n" 1576 - " %s %s push MAP value VALUE\n" 1577 - " %s %s pop MAP\n" 1578 - " %s %s enqueue MAP value VALUE\n" 1579 - " %s %s dequeue MAP\n" 1580 - " %s %s freeze MAP\n" 1581 - " %s %s help\n" 1564 + "Usage: %1$s %2$s { show | list } [MAP]\n" 1565 + " %1$s %2$s create FILE type TYPE key KEY_SIZE value VALUE_SIZE \\\n" 1566 + " entries MAX_ENTRIES name NAME [flags FLAGS] \\\n" 1567 + " [dev NAME]\n" 1568 + " %1$s %2$s dump MAP\n" 1569 + " %1$s %2$s update MAP [key DATA] [value VALUE] [UPDATE_FLAGS]\n" 1570 + " %1$s %2$s lookup MAP [key DATA]\n" 1571 + " %1$s %2$s getnext MAP [key DATA]\n" 1572 + " %1$s %2$s delete MAP key DATA\n" 1573 + " %1$s %2$s pin MAP FILE\n" 1574 + " %1$s %2$s event_pipe MAP [cpu N index M]\n" 1575 + " %1$s %2$s peek MAP\n" 1576 + " %1$s %2$s push MAP value VALUE\n" 1577 + " %1$s %2$s pop MAP\n" 1578 + " %1$s %2$s enqueue MAP value VALUE\n" 1579 + " %1$s %2$s dequeue MAP\n" 1580 + " %1$s %2$s freeze MAP\n" 1581 + " %1$s %2$s help\n" 1582 1582 "\n" 1583 1583 " " HELP_SPEC_MAP "\n" 1584 1584 " DATA := { [hex] BYTES }\n" ··· 1593 1593 " queue | stack | sk_storage | struct_ops }\n" 1594 1594 " " HELP_SPEC_OPTIONS "\n" 1595 1595 "", 1596 - bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2], 1597 - bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2], 1598 - bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2], 1599 - bin_name, argv[-2], bin_name, argv[-2], bin_name, 
argv[-2], 1600 - bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2], 1601 1596 bin_name, argv[-2]); 1602 1597 1603 1598 return 0;
+6 -6
tools/bpf/bpftool/net.c
··· 458 458 } 459 459 460 460 fprintf(stderr, 461 - "Usage: %s %s { show | list } [dev <devname>]\n" 462 - " %s %s attach ATTACH_TYPE PROG dev <devname> [ overwrite ]\n" 463 - " %s %s detach ATTACH_TYPE dev <devname>\n" 464 - " %s %s help\n" 461 + "Usage: %1$s %2$s { show | list } [dev <devname>]\n" 462 + " %1$s %2$s attach ATTACH_TYPE PROG dev <devname> [ overwrite ]\n" 463 + " %1$s %2$s detach ATTACH_TYPE dev <devname>\n" 464 + " %1$s %2$s help\n" 465 465 "\n" 466 466 " " HELP_SPEC_PROGRAM "\n" 467 467 " ATTACH_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n" ··· 470 470 " For progs attached to cgroups, use \"bpftool cgroup\"\n" 471 471 " to dump program attachments. For program types\n" 472 472 " sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n" 473 - " consult iproute2.\n", 474 - bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2], 473 + " consult iproute2.\n" 474 + "", 475 475 bin_name, argv[-2]); 476 476 477 477 return 0;
+1 -1
tools/bpf/bpftool/perf.c
··· 231 231 static int do_help(int argc, char **argv) 232 232 { 233 233 fprintf(stderr, 234 - "Usage: %s %s { show | list | help }\n" 234 + "Usage: %1$s %2$s { show | list | help }\n" 235 235 "", 236 236 bin_name, argv[-2]); 237 237
+12 -15
tools/bpf/bpftool/prog.c
··· 1984 1984 } 1985 1985 1986 1986 fprintf(stderr, 1987 - "Usage: %s %s { show | list } [PROG]\n" 1988 - " %s %s dump xlated PROG [{ file FILE | opcodes | visual | linum }]\n" 1989 - " %s %s dump jited PROG [{ file FILE | opcodes | linum }]\n" 1990 - " %s %s pin PROG FILE\n" 1991 - " %s %s { load | loadall } OBJ PATH \\\n" 1987 + "Usage: %1$s %2$s { show | list } [PROG]\n" 1988 + " %1$s %2$s dump xlated PROG [{ file FILE | opcodes | visual | linum }]\n" 1989 + " %1$s %2$s dump jited PROG [{ file FILE | opcodes | linum }]\n" 1990 + " %1$s %2$s pin PROG FILE\n" 1991 + " %1$s %2$s { load | loadall } OBJ PATH \\\n" 1992 1992 " [type TYPE] [dev NAME] \\\n" 1993 1993 " [map { idx IDX | name NAME } MAP]\\\n" 1994 1994 " [pinmaps MAP_DIR]\n" 1995 - " %s %s attach PROG ATTACH_TYPE [MAP]\n" 1996 - " %s %s detach PROG ATTACH_TYPE [MAP]\n" 1997 - " %s %s run PROG \\\n" 1995 + " %1$s %2$s attach PROG ATTACH_TYPE [MAP]\n" 1996 + " %1$s %2$s detach PROG ATTACH_TYPE [MAP]\n" 1997 + " %1$s %2$s run PROG \\\n" 1998 1998 " data_in FILE \\\n" 1999 1999 " [data_out FILE [data_size_out L]] \\\n" 2000 2000 " [ctx_in FILE [ctx_out FILE [ctx_size_out M]]] \\\n" 2001 2001 " [repeat N]\n" 2002 - " %s %s profile PROG [duration DURATION] METRICs\n" 2003 - " %s %s tracelog\n" 2004 - " %s %s help\n" 2002 + " %1$s %2$s profile PROG [duration DURATION] METRICs\n" 2003 + " %1$s %2$s tracelog\n" 2004 + " %1$s %2$s help\n" 2005 2005 "\n" 2006 2006 " " HELP_SPEC_MAP "\n" 2007 2007 " " HELP_SPEC_PROGRAM "\n" ··· 2022 2022 " METRIC := { cycles | instructions | l1d_loads | llc_misses }\n" 2023 2023 " " HELP_SPEC_OPTIONS "\n" 2024 2024 "", 2025 - bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2], 2026 - bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2], 2027 - bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2], 2028 - bin_name, argv[-2], bin_name, argv[-2]); 2025 + bin_name, argv[-2]); 2029 2026 2030 2027 return 0; 2031 2028 }
+7 -8
tools/bpf/bpftool/struct_ops.c
··· 566 566 } 567 567 568 568 fprintf(stderr, 569 - "Usage: %s %s { show | list } [STRUCT_OPS_MAP]\n" 570 - " %s %s dump [STRUCT_OPS_MAP]\n" 571 - " %s %s register OBJ\n" 572 - " %s %s unregister STRUCT_OPS_MAP\n" 573 - " %s %s help\n" 569 + "Usage: %1$s %2$s { show | list } [STRUCT_OPS_MAP]\n" 570 + " %1$s %2$s dump [STRUCT_OPS_MAP]\n" 571 + " %1$s %2$s register OBJ\n" 572 + " %1$s %2$s unregister STRUCT_OPS_MAP\n" 573 + " %1$s %2$s help\n" 574 574 "\n" 575 575 " OPTIONS := { {-j|--json} [{-p|--pretty}] }\n" 576 - " STRUCT_OPS_MAP := [ id STRUCT_OPS_MAP_ID | name STRUCT_OPS_MAP_NAME ]\n", 577 - bin_name, argv[-2], bin_name, argv[-2], 578 - bin_name, argv[-2], bin_name, argv[-2], 576 + " STRUCT_OPS_MAP := [ id STRUCT_OPS_MAP_ID | name STRUCT_OPS_MAP_NAME ]\n" 577 + "", 579 578 bin_name, argv[-2]); 580 579 581 580 return 0;
+94 -1
tools/include/uapi/linux/bpf.h
··· 147 147 BPF_MAP_TYPE_SK_STORAGE, 148 148 BPF_MAP_TYPE_DEVMAP_HASH, 149 149 BPF_MAP_TYPE_STRUCT_OPS, 150 + BPF_MAP_TYPE_RINGBUF, 150 151 }; 151 152 152 153 /* Note that tracing related programs such as ··· 225 224 BPF_CGROUP_INET6_GETPEERNAME, 226 225 BPF_CGROUP_INET4_GETSOCKNAME, 227 226 BPF_CGROUP_INET6_GETSOCKNAME, 227 + BPF_XDP_DEVMAP, 228 228 __MAX_BPF_ATTACH_TYPE 229 229 }; 230 230 ··· 237 235 BPF_LINK_TYPE_TRACING = 2, 238 236 BPF_LINK_TYPE_CGROUP = 3, 239 237 BPF_LINK_TYPE_ITER = 4, 238 + BPF_LINK_TYPE_NETNS = 5, 240 239 241 240 MAX_BPF_LINK_TYPE, 242 241 }; ··· 3160 3157 * **bpf_sk_cgroup_id**\ (). 3161 3158 * Return 3162 3159 * The id is returned or 0 in case the id could not be retrieved. 3160 + * 3161 + * void *bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags) 3162 + * Description 3163 + * Copy *size* bytes from *data* into a ring buffer *ringbuf*. 3164 + * If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of 3165 + * new data availability is sent. 3166 + * If BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of 3167 + * new data availability is sent unconditionally. 3168 + * Return 3169 + * 0, on success; 3170 + * < 0, on error. 3171 + * 3172 + * void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags) 3173 + * Description 3174 + * Reserve *size* bytes of payload in a ring buffer *ringbuf*. 3175 + * Return 3176 + * Valid pointer with *size* bytes of memory available; NULL, 3177 + * otherwise. 3178 + * 3179 + * void bpf_ringbuf_submit(void *data, u64 flags) 3180 + * Description 3181 + * Submit reserved ring buffer sample, pointed to by *data*. 3182 + * If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of 3183 + * new data availability is sent. 3184 + * If BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of 3185 + * new data availability is sent unconditionally. 3186 + * Return 3187 + * Nothing. Always succeeds. 
3188 + * 3189 + * void bpf_ringbuf_discard(void *data, u64 flags) 3190 + * Description 3191 + * Discard reserved ring buffer sample, pointed to by *data*. 3192 + * If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of 3193 + * new data availability is sent. 3194 + * IF BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of 3195 + * new data availability is sent unconditionally. 3196 + * Return 3197 + * Nothing. Always succeeds. 3198 + * 3199 + * u64 bpf_ringbuf_query(void *ringbuf, u64 flags) 3200 + * Description 3201 + * Query various characteristics of provided ring buffer. What 3202 + * exactly is queries is determined by *flags*: 3203 + * - BPF_RB_AVAIL_DATA - amount of data not yet consumed; 3204 + * - BPF_RB_RING_SIZE - the size of ring buffer; 3205 + * - BPF_RB_CONS_POS - consumer position (can wrap around); 3206 + * - BPF_RB_PROD_POS - producer(s) position (can wrap around); 3207 + * Data returned is just a momentary snapshots of actual values 3208 + * and could be inaccurate, so this facility should be used to 3209 + * power heuristics and for reporting, not to make 100% correct 3210 + * calculation. 3211 + * Return 3212 + * Requested value, or 0, if flags are not recognized. 3163 3213 */ 3164 3214 #define __BPF_FUNC_MAPPER(FN) \ 3165 3215 FN(unspec), \ ··· 3344 3288 FN(seq_printf), \ 3345 3289 FN(seq_write), \ 3346 3290 FN(sk_cgroup_id), \ 3347 - FN(sk_ancestor_cgroup_id), 3291 + FN(sk_ancestor_cgroup_id), \ 3292 + FN(ringbuf_output), \ 3293 + FN(ringbuf_reserve), \ 3294 + FN(ringbuf_submit), \ 3295 + FN(ringbuf_discard), \ 3296 + FN(ringbuf_query), 3348 3297 3349 3298 /* integer value in 'imm' field of BPF_CALL instruction selects which helper 3350 3299 * function eBPF program intends to call ··· 3457 3396 /* BPF_FUNC_read_branch_records flags. 
*/ 3458 3397 enum { 3459 3398 BPF_F_GET_BRANCH_RECORDS_SIZE = (1ULL << 0), 3399 + }; 3400 + 3401 + /* BPF_FUNC_bpf_ringbuf_commit, BPF_FUNC_bpf_ringbuf_discard, and 3402 + * BPF_FUNC_bpf_ringbuf_output flags. 3403 + */ 3404 + enum { 3405 + BPF_RB_NO_WAKEUP = (1ULL << 0), 3406 + BPF_RB_FORCE_WAKEUP = (1ULL << 1), 3407 + }; 3408 + 3409 + /* BPF_FUNC_bpf_ringbuf_query flags */ 3410 + enum { 3411 + BPF_RB_AVAIL_DATA = 0, 3412 + BPF_RB_RING_SIZE = 1, 3413 + BPF_RB_CONS_POS = 2, 3414 + BPF_RB_PROD_POS = 3, 3415 + }; 3416 + 3417 + /* BPF ring buffer constants */ 3418 + enum { 3419 + BPF_RINGBUF_BUSY_BIT = (1U << 31), 3420 + BPF_RINGBUF_DISCARD_BIT = (1U << 30), 3421 + BPF_RINGBUF_HDR_SZ = 8, 3460 3422 }; 3461 3423 3462 3424 /* Mode for BPF_FUNC_skb_adjust_room helper. */ ··· 3614 3530 __u32 dst_ip4; 3615 3531 __u32 dst_ip6[4]; 3616 3532 __u32 state; 3533 + __s32 rx_queue_mapping; 3617 3534 }; 3618 3535 3619 3536 struct bpf_tcp_sock { ··· 3708 3623 /* Below access go through struct xdp_rxq_info */ 3709 3624 __u32 ingress_ifindex; /* rxq->dev->ifindex */ 3710 3625 __u32 rx_queue_index; /* rxq->queue_index */ 3626 + 3627 + __u32 egress_ifindex; /* txq->dev->ifindex */ 3711 3628 }; 3712 3629 3713 3630 enum sk_action { ··· 3732 3645 __u32 remote_port; /* Stored in network byte order */ 3733 3646 __u32 local_port; /* stored in host byte order */ 3734 3647 __u32 size; /* Total size of sk_msg */ 3648 + 3649 + __bpf_md_ptr(struct bpf_sock *, sk); /* current socket */ 3735 3650 }; 3736 3651 3737 3652 struct sk_reuseport_md { ··· 3840 3751 __u64 cgroup_id; 3841 3752 __u32 attach_type; 3842 3753 } cgroup; 3754 + struct { 3755 + __u32 netns_ino; 3756 + __u32 attach_type; 3757 + } netns; 3843 3758 }; 3844 3759 } __attribute__((aligned(8))); 3845 3760
+1 -1
tools/lib/bpf/Build
··· 1 1 libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o \ 2 2 netlink.o bpf_prog_linfo.o libbpf_probes.o xsk.o hashmap.o \ 3 - btf_dump.o 3 + btf_dump.o ringbuf.o
+3 -3
tools/lib/bpf/Makefile
··· 151 151 sed 's/\[.*\]//' | \ 152 152 awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$NF}' | \ 153 153 sort -u | wc -l) 154 - VERSIONED_SYM_COUNT = $(shell readelf -s --wide $(OUTPUT)libbpf.so | \ 154 + VERSIONED_SYM_COUNT = $(shell readelf --dyn-syms --wide $(OUTPUT)libbpf.so | \ 155 155 grep -Eo '[^ ]+@LIBBPF_' | cut -d@ -f1 | sort -u | wc -l) 156 156 157 157 CMD_TARGETS = $(LIB_TARGET) $(PC_FILE) ··· 218 218 sed 's/\[.*\]//' | \ 219 219 awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$NF}'| \ 220 220 sort -u > $(OUTPUT)libbpf_global_syms.tmp; \ 221 - readelf -s --wide $(OUTPUT)libbpf.so | \ 221 + readelf --dyn-syms --wide $(OUTPUT)libbpf.so | \ 222 222 grep -Eo '[^ ]+@LIBBPF_' | cut -d@ -f1 | \ 223 223 sort -u > $(OUTPUT)libbpf_versioned_syms.tmp; \ 224 224 diff -u $(OUTPUT)libbpf_global_syms.tmp \ ··· 264 264 $(call QUIET_INSTALL, $(PC_FILE)) \ 265 265 $(call do_install,$(PC_FILE),$(libdir_SQ)/pkgconfig,644) 266 266 267 - install: install_lib install_pkgconfig 267 + install: install_lib install_pkgconfig install_headers 268 268 269 269 ### Cleaning rules 270 270
+43 -6
tools/lib/bpf/libbpf.c
··· 6657 6657 .expected_attach_type = BPF_TRACE_ITER, 6658 6658 .is_attach_btf = true, 6659 6659 .attach_fn = attach_iter), 6660 + BPF_EAPROG_SEC("xdp_devmap", BPF_PROG_TYPE_XDP, 6661 + BPF_XDP_DEVMAP), 6660 6662 BPF_PROG_SEC("xdp", BPF_PROG_TYPE_XDP), 6661 6663 BPF_PROG_SEC("perf_event", BPF_PROG_TYPE_PERF_EVENT), 6662 6664 BPF_PROG_SEC("lwt_in", BPF_PROG_TYPE_LWT_IN), ··· 7896 7894 return bpf_program__attach_iter(prog, NULL); 7897 7895 } 7898 7896 7899 - struct bpf_link * 7900 - bpf_program__attach_cgroup(struct bpf_program *prog, int cgroup_fd) 7897 + static struct bpf_link * 7898 + bpf_program__attach_fd(struct bpf_program *prog, int target_fd, 7899 + const char *target_name) 7901 7900 { 7902 7901 enum bpf_attach_type attach_type; 7903 7902 char errmsg[STRERR_BUFSIZE]; ··· 7918 7915 link->detach = &bpf_link__detach_fd; 7919 7916 7920 7917 attach_type = bpf_program__get_expected_attach_type(prog); 7921 - link_fd = bpf_link_create(prog_fd, cgroup_fd, attach_type, NULL); 7918 + link_fd = bpf_link_create(prog_fd, target_fd, attach_type, NULL); 7922 7919 if (link_fd < 0) { 7923 7920 link_fd = -errno; 7924 7921 free(link); 7925 - pr_warn("program '%s': failed to attach to cgroup: %s\n", 7926 - bpf_program__title(prog, false), 7922 + pr_warn("program '%s': failed to attach to %s: %s\n", 7923 + bpf_program__title(prog, false), target_name, 7927 7924 libbpf_strerror_r(link_fd, errmsg, sizeof(errmsg))); 7928 7925 return ERR_PTR(link_fd); 7929 7926 } 7930 7927 link->fd = link_fd; 7931 7928 return link; 7929 + } 7930 + 7931 + struct bpf_link * 7932 + bpf_program__attach_cgroup(struct bpf_program *prog, int cgroup_fd) 7933 + { 7934 + return bpf_program__attach_fd(prog, cgroup_fd, "cgroup"); 7935 + } 7936 + 7937 + struct bpf_link * 7938 + bpf_program__attach_netns(struct bpf_program *prog, int netns_fd) 7939 + { 7940 + return bpf_program__attach_fd(prog, netns_fd, "netns"); 7932 7941 } 7933 7942 7934 7943 struct bpf_link * ··· 8152 8137 if (!pb) 8153 8138 return; 8154 8139 
if (pb->cpu_bufs) { 8155 - for (i = 0; i < pb->cpu_cnt && pb->cpu_bufs[i]; i++) { 8140 + for (i = 0; i < pb->cpu_cnt; i++) { 8156 8141 struct perf_cpu_buf *cpu_buf = pb->cpu_bufs[i]; 8142 + 8143 + if (!cpu_buf) 8144 + continue; 8157 8145 8158 8146 bpf_map_delete_elem(pb->map_fd, &cpu_buf->map_key); 8159 8147 perf_buffer__free_cpu_buf(pb, cpu_buf); ··· 8472 8454 } 8473 8455 } 8474 8456 return cnt < 0 ? -errno : cnt; 8457 + } 8458 + 8459 + int perf_buffer__consume(struct perf_buffer *pb) 8460 + { 8461 + int i, err; 8462 + 8463 + for (i = 0; i < pb->cpu_cnt; i++) { 8464 + struct perf_cpu_buf *cpu_buf = pb->cpu_bufs[i]; 8465 + 8466 + if (!cpu_buf) 8467 + continue; 8468 + 8469 + err = perf_buffer__process_records(pb, cpu_buf); 8470 + if (err) { 8471 + pr_warn("error while processing records: %d\n", err); 8472 + return err; 8473 + } 8474 + } 8475 + return 0; 8475 8476 } 8476 8477 8477 8478 struct bpf_prog_info_array_desc {
+24
tools/lib/bpf/libbpf.h
··· 253 253 bpf_program__attach_lsm(struct bpf_program *prog); 254 254 LIBBPF_API struct bpf_link * 255 255 bpf_program__attach_cgroup(struct bpf_program *prog, int cgroup_fd); 256 + LIBBPF_API struct bpf_link * 257 + bpf_program__attach_netns(struct bpf_program *prog, int netns_fd); 256 258 257 259 struct bpf_map; 258 260 ··· 480 478 LIBBPF_API int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info, 481 479 size_t info_size, __u32 flags); 482 480 481 + /* Ring buffer APIs */ 482 + struct ring_buffer; 483 + 484 + typedef int (*ring_buffer_sample_fn)(void *ctx, void *data, size_t size); 485 + 486 + struct ring_buffer_opts { 487 + size_t sz; /* size of this struct, for forward/backward compatiblity */ 488 + }; 489 + 490 + #define ring_buffer_opts__last_field sz 491 + 492 + LIBBPF_API struct ring_buffer * 493 + ring_buffer__new(int map_fd, ring_buffer_sample_fn sample_cb, void *ctx, 494 + const struct ring_buffer_opts *opts); 495 + LIBBPF_API void ring_buffer__free(struct ring_buffer *rb); 496 + LIBBPF_API int ring_buffer__add(struct ring_buffer *rb, int map_fd, 497 + ring_buffer_sample_fn sample_cb, void *ctx); 498 + LIBBPF_API int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms); 499 + LIBBPF_API int ring_buffer__consume(struct ring_buffer *rb); 500 + 501 + /* Perf buffer APIs */ 483 502 struct perf_buffer; 484 503 485 504 typedef void (*perf_buffer_sample_fn)(void *ctx, int cpu, ··· 556 533 557 534 LIBBPF_API void perf_buffer__free(struct perf_buffer *pb); 558 535 LIBBPF_API int perf_buffer__poll(struct perf_buffer *pb, int timeout_ms); 536 + LIBBPF_API int perf_buffer__consume(struct perf_buffer *pb); 559 537 560 538 typedef enum bpf_perf_event_ret 561 539 (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
+7
tools/lib/bpf/libbpf.map
··· 262 262 bpf_link_get_fd_by_id; 263 263 bpf_link_get_next_id; 264 264 bpf_program__attach_iter; 265 + bpf_program__attach_netns; 266 + perf_buffer__consume; 267 + ring_buffer__add; 268 + ring_buffer__consume; 269 + ring_buffer__free; 270 + ring_buffer__new; 271 + ring_buffer__poll; 265 272 } LIBBPF_0.0.8;
+5
tools/lib/bpf/libbpf_probes.c
··· 238 238 if (btf_fd < 0) 239 239 return false; 240 240 break; 241 + case BPF_MAP_TYPE_RINGBUF: 242 + key_size = 0; 243 + value_size = 0; 244 + max_entries = 4096; 245 + break; 241 246 case BPF_MAP_TYPE_UNSPEC: 242 247 case BPF_MAP_TYPE_HASH: 243 248 case BPF_MAP_TYPE_ARRAY:
+288
tools/lib/bpf/ringbuf.c
··· 1 + // SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) 2 + /* 3 + * Ring buffer operations. 4 + * 5 + * Copyright (C) 2020 Facebook, Inc. 6 + */ 7 + #ifndef _GNU_SOURCE 8 + #define _GNU_SOURCE 9 + #endif 10 + #include <stdlib.h> 11 + #include <stdio.h> 12 + #include <errno.h> 13 + #include <unistd.h> 14 + #include <linux/err.h> 15 + #include <linux/bpf.h> 16 + #include <asm/barrier.h> 17 + #include <sys/mman.h> 18 + #include <sys/epoll.h> 19 + #include <tools/libc_compat.h> 20 + 21 + #include "libbpf.h" 22 + #include "libbpf_internal.h" 23 + #include "bpf.h" 24 + 25 + /* make sure libbpf doesn't use kernel-only integer typedefs */ 26 + #pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 27 + 28 + struct ring { 29 + ring_buffer_sample_fn sample_cb; 30 + void *ctx; 31 + void *data; 32 + unsigned long *consumer_pos; 33 + unsigned long *producer_pos; 34 + unsigned long mask; 35 + int map_fd; 36 + }; 37 + 38 + struct ring_buffer { 39 + struct epoll_event *events; 40 + struct ring *rings; 41 + size_t page_size; 42 + int epoll_fd; 43 + int ring_cnt; 44 + }; 45 + 46 + static void ringbuf_unmap_ring(struct ring_buffer *rb, struct ring *r) 47 + { 48 + if (r->consumer_pos) { 49 + munmap(r->consumer_pos, rb->page_size); 50 + r->consumer_pos = NULL; 51 + } 52 + if (r->producer_pos) { 53 + munmap(r->producer_pos, rb->page_size + 2 * (r->mask + 1)); 54 + r->producer_pos = NULL; 55 + } 56 + } 57 + 58 + /* Add extra RINGBUF maps to this ring buffer manager */ 59 + int ring_buffer__add(struct ring_buffer *rb, int map_fd, 60 + ring_buffer_sample_fn sample_cb, void *ctx) 61 + { 62 + struct bpf_map_info info; 63 + __u32 len = sizeof(info); 64 + struct epoll_event *e; 65 + struct ring *r; 66 + void *tmp; 67 + int err; 68 + 69 + memset(&info, 0, sizeof(info)); 70 + 71 + err = bpf_obj_get_info_by_fd(map_fd, &info, &len); 72 + if (err) { 73 + err = -errno; 74 + pr_warn("ringbuf: failed to get map info for fd=%d: %d\n", 75 + map_fd, err); 76 + return err; 77 + } 78 + 79 + if (info.type 
!= BPF_MAP_TYPE_RINGBUF) { 80 + pr_warn("ringbuf: map fd=%d is not BPF_MAP_TYPE_RINGBUF\n", 81 + map_fd); 82 + return -EINVAL; 83 + } 84 + 85 + tmp = reallocarray(rb->rings, rb->ring_cnt + 1, sizeof(*rb->rings)); 86 + if (!tmp) 87 + return -ENOMEM; 88 + rb->rings = tmp; 89 + 90 + tmp = reallocarray(rb->events, rb->ring_cnt + 1, sizeof(*rb->events)); 91 + if (!tmp) 92 + return -ENOMEM; 93 + rb->events = tmp; 94 + 95 + r = &rb->rings[rb->ring_cnt]; 96 + memset(r, 0, sizeof(*r)); 97 + 98 + r->map_fd = map_fd; 99 + r->sample_cb = sample_cb; 100 + r->ctx = ctx; 101 + r->mask = info.max_entries - 1; 102 + 103 + /* Map writable consumer page */ 104 + tmp = mmap(NULL, rb->page_size, PROT_READ | PROT_WRITE, MAP_SHARED, 105 + map_fd, 0); 106 + if (tmp == MAP_FAILED) { 107 + err = -errno; 108 + pr_warn("ringbuf: failed to mmap consumer page for map fd=%d: %d\n", 109 + map_fd, err); 110 + return err; 111 + } 112 + r->consumer_pos = tmp; 113 + 114 + /* Map read-only producer page and data pages. We map twice as big 115 + * data size to allow simple reading of samples that wrap around the 116 + * end of a ring buffer. See kernel implementation for details. 
117 + * */ 118 + tmp = mmap(NULL, rb->page_size + 2 * info.max_entries, PROT_READ, 119 + MAP_SHARED, map_fd, rb->page_size); 120 + if (tmp == MAP_FAILED) { 121 + err = -errno; 122 + ringbuf_unmap_ring(rb, r); 123 + pr_warn("ringbuf: failed to mmap data pages for map fd=%d: %d\n", 124 + map_fd, err); 125 + return err; 126 + } 127 + r->producer_pos = tmp; 128 + r->data = tmp + rb->page_size; 129 + 130 + e = &rb->events[rb->ring_cnt]; 131 + memset(e, 0, sizeof(*e)); 132 + 133 + e->events = EPOLLIN; 134 + e->data.fd = rb->ring_cnt; 135 + if (epoll_ctl(rb->epoll_fd, EPOLL_CTL_ADD, map_fd, e) < 0) { 136 + err = -errno; 137 + ringbuf_unmap_ring(rb, r); 138 + pr_warn("ringbuf: failed to epoll add map fd=%d: %d\n", 139 + map_fd, err); 140 + return err; 141 + } 142 + 143 + rb->ring_cnt++; 144 + return 0; 145 + } 146 + 147 + void ring_buffer__free(struct ring_buffer *rb) 148 + { 149 + int i; 150 + 151 + if (!rb) 152 + return; 153 + 154 + for (i = 0; i < rb->ring_cnt; ++i) 155 + ringbuf_unmap_ring(rb, &rb->rings[i]); 156 + if (rb->epoll_fd >= 0) 157 + close(rb->epoll_fd); 158 + 159 + free(rb->events); 160 + free(rb->rings); 161 + free(rb); 162 + } 163 + 164 + struct ring_buffer * 165 + ring_buffer__new(int map_fd, ring_buffer_sample_fn sample_cb, void *ctx, 166 + const struct ring_buffer_opts *opts) 167 + { 168 + struct ring_buffer *rb; 169 + int err; 170 + 171 + if (!OPTS_VALID(opts, ring_buffer_opts)) 172 + return NULL; 173 + 174 + rb = calloc(1, sizeof(*rb)); 175 + if (!rb) 176 + return NULL; 177 + 178 + rb->page_size = getpagesize(); 179 + 180 + rb->epoll_fd = epoll_create1(EPOLL_CLOEXEC); 181 + if (rb->epoll_fd < 0) { 182 + err = -errno; 183 + pr_warn("ringbuf: failed to create epoll instance: %d\n", err); 184 + goto err_out; 185 + } 186 + 187 + err = ring_buffer__add(rb, map_fd, sample_cb, ctx); 188 + if (err) 189 + goto err_out; 190 + 191 + return rb; 192 + 193 + err_out: 194 + ring_buffer__free(rb); 195 + return NULL; 196 + } 197 + 198 + static inline int 
roundup_len(__u32 len) 199 + { 200 + /* clear out top 2 bits (discard and busy, if set) */ 201 + len <<= 2; 202 + len >>= 2; 203 + /* add length prefix */ 204 + len += BPF_RINGBUF_HDR_SZ; 205 + /* round up to 8 byte alignment */ 206 + return (len + 7) / 8 * 8; 207 + } 208 + 209 + static int ringbuf_process_ring(struct ring* r) 210 + { 211 + int *len_ptr, len, err, cnt = 0; 212 + unsigned long cons_pos, prod_pos; 213 + bool got_new_data; 214 + void *sample; 215 + 216 + cons_pos = smp_load_acquire(r->consumer_pos); 217 + do { 218 + got_new_data = false; 219 + prod_pos = smp_load_acquire(r->producer_pos); 220 + while (cons_pos < prod_pos) { 221 + len_ptr = r->data + (cons_pos & r->mask); 222 + len = smp_load_acquire(len_ptr); 223 + 224 + /* sample not committed yet, bail out for now */ 225 + if (len & BPF_RINGBUF_BUSY_BIT) 226 + goto done; 227 + 228 + got_new_data = true; 229 + cons_pos += roundup_len(len); 230 + 231 + if ((len & BPF_RINGBUF_DISCARD_BIT) == 0) { 232 + sample = (void *)len_ptr + BPF_RINGBUF_HDR_SZ; 233 + err = r->sample_cb(r->ctx, sample, len); 234 + if (err) { 235 + /* update consumer pos and bail out */ 236 + smp_store_release(r->consumer_pos, 237 + cons_pos); 238 + return err; 239 + } 240 + cnt++; 241 + } 242 + 243 + smp_store_release(r->consumer_pos, cons_pos); 244 + } 245 + } while (got_new_data); 246 + done: 247 + return cnt; 248 + } 249 + 250 + /* Consume available ring buffer(s) data without event polling. 251 + * Returns number of records consumed across all registered ring buffers, or 252 + * negative number if any of the callbacks return error. 
253 + */ 254 + int ring_buffer__consume(struct ring_buffer *rb) 255 + { 256 + int i, err, res = 0; 257 + 258 + for (i = 0; i < rb->ring_cnt; i++) { 259 + struct ring *ring = &rb->rings[i]; 260 + 261 + err = ringbuf_process_ring(ring); 262 + if (err < 0) 263 + return err; 264 + res += err; 265 + } 266 + return res; 267 + } 268 + 269 + /* Poll for available data and consume records, if any are available. 270 + * Returns number of records consumed, or negative number, if any of the 271 + * registered callbacks returned error. 272 + */ 273 + int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms) 274 + { 275 + int i, cnt, err, res = 0; 276 + 277 + cnt = epoll_wait(rb->epoll_fd, rb->events, rb->ring_cnt, timeout_ms); 278 + for (i = 0; i < cnt; i++) { 279 + __u32 ring_id = rb->events[i].data.fd; 280 + struct ring *ring = &rb->rings[ring_id]; 281 + 282 + err = ringbuf_process_ring(ring); 283 + if (err < 0) 284 + return err; 285 + res += err; 286 + } 287 + return cnt < 0 ? -errno : res; 288 + }
+4 -1
tools/testing/selftests/bpf/Makefile
··· 413 413 $(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@ 414 414 $(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h 415 415 $(OUTPUT)/bench_trigger.o: $(OUTPUT)/trigger_bench.skel.h 416 + $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \ 417 + $(OUTPUT)/perfbuf_bench.skel.h 416 418 $(OUTPUT)/bench.o: bench.h testing_helpers.h 417 419 $(OUTPUT)/bench: LDLIBS += -lm 418 420 $(OUTPUT)/bench: $(OUTPUT)/bench.o $(OUTPUT)/testing_helpers.o \ 419 421 $(OUTPUT)/bench_count.o \ 420 422 $(OUTPUT)/bench_rename.o \ 421 - $(OUTPUT)/bench_trigger.o 423 + $(OUTPUT)/bench_trigger.o \ 424 + $(OUTPUT)/bench_ringbufs.o 422 425 $(call msg,BINARY,,$@) 423 426 $(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS) 424 427
+16
tools/testing/selftests/bpf/bench.c
··· 130 130 {}, 131 131 }; 132 132 133 + extern struct argp bench_ringbufs_argp; 134 + 135 + static const struct argp_child bench_parsers[] = { 136 + { &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 }, 137 + {}, 138 + }; 139 + 133 140 static error_t parse_arg(int key, char *arg, struct argp_state *state) 134 141 { 135 142 static int pos_args; ··· 215 208 .options = opts, 216 209 .parser = parse_arg, 217 210 .doc = argp_program_doc, 211 + .children = bench_parsers, 218 212 }; 219 213 if (argp_parse(&argp, argc, argv, 0, NULL, NULL)) 220 214 exit(1); ··· 318 310 extern const struct bench bench_trig_kprobe; 319 311 extern const struct bench bench_trig_fentry; 320 312 extern const struct bench bench_trig_fmodret; 313 + extern const struct bench bench_rb_libbpf; 314 + extern const struct bench bench_rb_custom; 315 + extern const struct bench bench_pb_libbpf; 316 + extern const struct bench bench_pb_custom; 321 317 322 318 static const struct bench *benchs[] = { 323 319 &bench_count_global, ··· 339 327 &bench_trig_kprobe, 340 328 &bench_trig_fentry, 341 329 &bench_trig_fmodret, 330 + &bench_rb_libbpf, 331 + &bench_rb_custom, 332 + &bench_pb_libbpf, 333 + &bench_pb_custom, 342 334 }; 343 335 344 336 static void setup_benchmark()
+566
tools/testing/selftests/bpf/benchs/bench_ringbufs.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Copyright (c) 2020 Facebook */ 3 + #include <asm/barrier.h> 4 + #include <linux/perf_event.h> 5 + #include <linux/ring_buffer.h> 6 + #include <sys/epoll.h> 7 + #include <sys/mman.h> 8 + #include <argp.h> 9 + #include <stdlib.h> 10 + #include "bench.h" 11 + #include "ringbuf_bench.skel.h" 12 + #include "perfbuf_bench.skel.h" 13 + 14 + static struct { 15 + bool back2back; 16 + int batch_cnt; 17 + bool sampled; 18 + int sample_rate; 19 + int ringbuf_sz; /* per-ringbuf, in bytes */ 20 + bool ringbuf_use_output; /* use slower output API */ 21 + int perfbuf_sz; /* per-CPU size, in pages */ 22 + } args = { 23 + .back2back = false, 24 + .batch_cnt = 500, 25 + .sampled = false, 26 + .sample_rate = 500, 27 + .ringbuf_sz = 512 * 1024, 28 + .ringbuf_use_output = false, 29 + .perfbuf_sz = 128, 30 + }; 31 + 32 + enum { 33 + ARG_RB_BACK2BACK = 2000, 34 + ARG_RB_USE_OUTPUT = 2001, 35 + ARG_RB_BATCH_CNT = 2002, 36 + ARG_RB_SAMPLED = 2003, 37 + ARG_RB_SAMPLE_RATE = 2004, 38 + }; 39 + 40 + static const struct argp_option opts[] = { 41 + { "rb-b2b", ARG_RB_BACK2BACK, NULL, 0, "Back-to-back mode"}, 42 + { "rb-use-output", ARG_RB_USE_OUTPUT, NULL, 0, "Use bpf_ringbuf_output() instead of bpf_ringbuf_reserve()"}, 43 + { "rb-batch-cnt", ARG_RB_BATCH_CNT, "CNT", 0, "Set BPF-side record batch count"}, 44 + { "rb-sampled", ARG_RB_SAMPLED, NULL, 0, "Notification sampling"}, 45 + { "rb-sample-rate", ARG_RB_SAMPLE_RATE, "RATE", 0, "Notification sample rate"}, 46 + {}, 47 + }; 48 + 49 + static error_t parse_arg(int key, char *arg, struct argp_state *state) 50 + { 51 + switch (key) { 52 + case ARG_RB_BACK2BACK: 53 + args.back2back = true; 54 + break; 55 + case ARG_RB_USE_OUTPUT: 56 + args.ringbuf_use_output = true; 57 + break; 58 + case ARG_RB_BATCH_CNT: 59 + args.batch_cnt = strtol(arg, NULL, 10); 60 + if (args.batch_cnt < 0) { 61 + fprintf(stderr, "Invalid batch count."); 62 + argp_usage(state); 63 + } 64 + break; 65 + case ARG_RB_SAMPLED: 66 + 
args.sampled = true; 67 + break; 68 + case ARG_RB_SAMPLE_RATE: 69 + args.sample_rate = strtol(arg, NULL, 10); 70 + if (args.sample_rate < 0) { 71 + fprintf(stderr, "Invalid perfbuf sample rate."); 72 + argp_usage(state); 73 + } 74 + break; 75 + default: 76 + return ARGP_ERR_UNKNOWN; 77 + } 78 + return 0; 79 + } 80 + 81 + /* exported into benchmark runner */ 82 + const struct argp bench_ringbufs_argp = { 83 + .options = opts, 84 + .parser = parse_arg, 85 + }; 86 + 87 + /* RINGBUF-LIBBPF benchmark */ 88 + 89 + static struct counter buf_hits; 90 + 91 + static inline void bufs_trigger_batch() 92 + { 93 + (void)syscall(__NR_getpgid); 94 + } 95 + 96 + static void bufs_validate() 97 + { 98 + if (env.consumer_cnt != 1) { 99 + fprintf(stderr, "rb-libbpf benchmark doesn't support multi-consumer!\n"); 100 + exit(1); 101 + } 102 + 103 + if (args.back2back && env.producer_cnt > 1) { 104 + fprintf(stderr, "back-to-back mode makes sense only for single-producer case!\n"); 105 + exit(1); 106 + } 107 + } 108 + 109 + static void *bufs_sample_producer(void *input) 110 + { 111 + if (args.back2back) { 112 + /* initial batch to get everything started */ 113 + bufs_trigger_batch(); 114 + return NULL; 115 + } 116 + 117 + while (true) 118 + bufs_trigger_batch(); 119 + return NULL; 120 + } 121 + 122 + static struct ringbuf_libbpf_ctx { 123 + struct ringbuf_bench *skel; 124 + struct ring_buffer *ringbuf; 125 + } ringbuf_libbpf_ctx; 126 + 127 + static void ringbuf_libbpf_measure(struct bench_res *res) 128 + { 129 + struct ringbuf_libbpf_ctx *ctx = &ringbuf_libbpf_ctx; 130 + 131 + res->hits = atomic_swap(&buf_hits.value, 0); 132 + res->drops = atomic_swap(&ctx->skel->bss->dropped, 0); 133 + } 134 + 135 + static struct ringbuf_bench *ringbuf_setup_skeleton() 136 + { 137 + struct ringbuf_bench *skel; 138 + 139 + setup_libbpf(); 140 + 141 + skel = ringbuf_bench__open(); 142 + if (!skel) { 143 + fprintf(stderr, "failed to open skeleton\n"); 144 + exit(1); 145 + } 146 + 147 + 
skel->rodata->batch_cnt = args.batch_cnt; 148 + skel->rodata->use_output = args.ringbuf_use_output ? 1 : 0; 149 + 150 + if (args.sampled) 151 + /* record data + header take 16 bytes */ 152 + skel->rodata->wakeup_data_size = args.sample_rate * 16; 153 + 154 + bpf_map__resize(skel->maps.ringbuf, args.ringbuf_sz); 155 + 156 + if (ringbuf_bench__load(skel)) { 157 + fprintf(stderr, "failed to load skeleton\n"); 158 + exit(1); 159 + } 160 + 161 + return skel; 162 + } 163 + 164 + static int buf_process_sample(void *ctx, void *data, size_t len) 165 + { 166 + atomic_inc(&buf_hits.value); 167 + return 0; 168 + } 169 + 170 + static void ringbuf_libbpf_setup() 171 + { 172 + struct ringbuf_libbpf_ctx *ctx = &ringbuf_libbpf_ctx; 173 + struct bpf_link *link; 174 + 175 + ctx->skel = ringbuf_setup_skeleton(); 176 + ctx->ringbuf = ring_buffer__new(bpf_map__fd(ctx->skel->maps.ringbuf), 177 + buf_process_sample, NULL, NULL); 178 + if (!ctx->ringbuf) { 179 + fprintf(stderr, "failed to create ringbuf\n"); 180 + exit(1); 181 + } 182 + 183 + link = bpf_program__attach(ctx->skel->progs.bench_ringbuf); 184 + if (IS_ERR(link)) { 185 + fprintf(stderr, "failed to attach program!\n"); 186 + exit(1); 187 + } 188 + } 189 + 190 + static void *ringbuf_libbpf_consumer(void *input) 191 + { 192 + struct ringbuf_libbpf_ctx *ctx = &ringbuf_libbpf_ctx; 193 + 194 + while (ring_buffer__poll(ctx->ringbuf, -1) >= 0) { 195 + if (args.back2back) 196 + bufs_trigger_batch(); 197 + } 198 + fprintf(stderr, "ringbuf polling failed!\n"); 199 + return NULL; 200 + } 201 + 202 + /* RINGBUF-CUSTOM benchmark */ 203 + struct ringbuf_custom { 204 + __u64 *consumer_pos; 205 + __u64 *producer_pos; 206 + __u64 mask; 207 + void *data; 208 + int map_fd; 209 + }; 210 + 211 + static struct ringbuf_custom_ctx { 212 + struct ringbuf_bench *skel; 213 + struct ringbuf_custom ringbuf; 214 + int epoll_fd; 215 + struct epoll_event event; 216 + } ringbuf_custom_ctx; 217 + 218 + static void ringbuf_custom_measure(struct bench_res *res) 
219 + { 220 + struct ringbuf_custom_ctx *ctx = &ringbuf_custom_ctx; 221 + 222 + res->hits = atomic_swap(&buf_hits.value, 0); 223 + res->drops = atomic_swap(&ctx->skel->bss->dropped, 0); 224 + } 225 + 226 + static void ringbuf_custom_setup() 227 + { 228 + struct ringbuf_custom_ctx *ctx = &ringbuf_custom_ctx; 229 + const size_t page_size = getpagesize(); 230 + struct bpf_link *link; 231 + struct ringbuf_custom *r; 232 + void *tmp; 233 + int err; 234 + 235 + ctx->skel = ringbuf_setup_skeleton(); 236 + 237 + ctx->epoll_fd = epoll_create1(EPOLL_CLOEXEC); 238 + if (ctx->epoll_fd < 0) { 239 + fprintf(stderr, "failed to create epoll fd: %d\n", -errno); 240 + exit(1); 241 + } 242 + 243 + r = &ctx->ringbuf; 244 + r->map_fd = bpf_map__fd(ctx->skel->maps.ringbuf); 245 + r->mask = args.ringbuf_sz - 1; 246 + 247 + /* Map writable consumer page */ 248 + tmp = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, 249 + r->map_fd, 0); 250 + if (tmp == MAP_FAILED) { 251 + fprintf(stderr, "failed to mmap consumer page: %d\n", -errno); 252 + exit(1); 253 + } 254 + r->consumer_pos = tmp; 255 + 256 + /* Map read-only producer page and data pages. 
*/ 257 + tmp = mmap(NULL, page_size + 2 * args.ringbuf_sz, PROT_READ, MAP_SHARED, 258 + r->map_fd, page_size); 259 + if (tmp == MAP_FAILED) { 260 + fprintf(stderr, "failed to mmap data pages: %d\n", -errno); 261 + exit(1); 262 + } 263 + r->producer_pos = tmp; 264 + r->data = tmp + page_size; 265 + 266 + ctx->event.events = EPOLLIN; 267 + err = epoll_ctl(ctx->epoll_fd, EPOLL_CTL_ADD, r->map_fd, &ctx->event); 268 + if (err < 0) { 269 + fprintf(stderr, "failed to epoll add ringbuf: %d\n", -errno); 270 + exit(1); 271 + } 272 + 273 + link = bpf_program__attach(ctx->skel->progs.bench_ringbuf); 274 + if (IS_ERR(link)) { 275 + fprintf(stderr, "failed to attach program\n"); 276 + exit(1); 277 + } 278 + } 279 + 280 + #define RINGBUF_BUSY_BIT (1 << 31) 281 + #define RINGBUF_DISCARD_BIT (1 << 30) 282 + #define RINGBUF_META_LEN 8 283 + 284 + static inline int roundup_len(__u32 len) 285 + { 286 + /* clear out top 2 bits */ 287 + len <<= 2; 288 + len >>= 2; 289 + /* add length prefix */ 290 + len += RINGBUF_META_LEN; 291 + /* round up to 8 byte alignment */ 292 + return (len + 7) / 8 * 8; 293 + } 294 + 295 + static void ringbuf_custom_process_ring(struct ringbuf_custom *r) 296 + { 297 + unsigned long cons_pos, prod_pos; 298 + int *len_ptr, len; 299 + bool got_new_data; 300 + 301 + cons_pos = smp_load_acquire(r->consumer_pos); 302 + while (true) { 303 + got_new_data = false; 304 + prod_pos = smp_load_acquire(r->producer_pos); 305 + while (cons_pos < prod_pos) { 306 + len_ptr = r->data + (cons_pos & r->mask); 307 + len = smp_load_acquire(len_ptr); 308 + 309 + /* sample not committed yet, bail out for now */ 310 + if (len & RINGBUF_BUSY_BIT) 311 + return; 312 + 313 + got_new_data = true; 314 + cons_pos += roundup_len(len); 315 + 316 + atomic_inc(&buf_hits.value); 317 + } 318 + if (got_new_data) 319 + smp_store_release(r->consumer_pos, cons_pos); 320 + else 321 + break; 322 + }; 323 + } 324 + 325 + static void *ringbuf_custom_consumer(void *input) 326 + { 327 + struct 
ringbuf_custom_ctx *ctx = &ringbuf_custom_ctx; 328 + int cnt; 329 + 330 + do { 331 + if (args.back2back) 332 + bufs_trigger_batch(); 333 + cnt = epoll_wait(ctx->epoll_fd, &ctx->event, 1, -1); 334 + if (cnt > 0) 335 + ringbuf_custom_process_ring(&ctx->ringbuf); 336 + } while (cnt >= 0); 337 + fprintf(stderr, "ringbuf polling failed!\n"); 338 + return 0; 339 + } 340 + 341 + /* PERFBUF-LIBBPF benchmark */ 342 + static struct perfbuf_libbpf_ctx { 343 + struct perfbuf_bench *skel; 344 + struct perf_buffer *perfbuf; 345 + } perfbuf_libbpf_ctx; 346 + 347 + static void perfbuf_measure(struct bench_res *res) 348 + { 349 + struct perfbuf_libbpf_ctx *ctx = &perfbuf_libbpf_ctx; 350 + 351 + res->hits = atomic_swap(&buf_hits.value, 0); 352 + res->drops = atomic_swap(&ctx->skel->bss->dropped, 0); 353 + } 354 + 355 + static struct perfbuf_bench *perfbuf_setup_skeleton() 356 + { 357 + struct perfbuf_bench *skel; 358 + 359 + setup_libbpf(); 360 + 361 + skel = perfbuf_bench__open(); 362 + if (!skel) { 363 + fprintf(stderr, "failed to open skeleton\n"); 364 + exit(1); 365 + } 366 + 367 + skel->rodata->batch_cnt = args.batch_cnt; 368 + 369 + if (perfbuf_bench__load(skel)) { 370 + fprintf(stderr, "failed to load skeleton\n"); 371 + exit(1); 372 + } 373 + 374 + return skel; 375 + } 376 + 377 + static enum bpf_perf_event_ret 378 + perfbuf_process_sample_raw(void *input_ctx, int cpu, 379 + struct perf_event_header *e) 380 + { 381 + switch (e->type) { 382 + case PERF_RECORD_SAMPLE: 383 + atomic_inc(&buf_hits.value); 384 + break; 385 + case PERF_RECORD_LOST: 386 + break; 387 + default: 388 + return LIBBPF_PERF_EVENT_ERROR; 389 + } 390 + return LIBBPF_PERF_EVENT_CONT; 391 + } 392 + 393 + static void perfbuf_libbpf_setup() 394 + { 395 + struct perfbuf_libbpf_ctx *ctx = &perfbuf_libbpf_ctx; 396 + struct perf_event_attr attr; 397 + struct perf_buffer_raw_opts pb_opts = { 398 + .event_cb = perfbuf_process_sample_raw, 399 + .ctx = (void *)(long)0, 400 + .attr = &attr, 401 + }; 402 + struct 
bpf_link *link; 403 + 404 + ctx->skel = perfbuf_setup_skeleton(); 405 + 406 + memset(&attr, 0, sizeof(attr)); 407 + attr.config = PERF_COUNT_SW_BPF_OUTPUT, 408 + attr.type = PERF_TYPE_SOFTWARE; 409 + attr.sample_type = PERF_SAMPLE_RAW; 410 + /* notify only every Nth sample */ 411 + if (args.sampled) { 412 + attr.sample_period = args.sample_rate; 413 + attr.wakeup_events = args.sample_rate; 414 + } else { 415 + attr.sample_period = 1; 416 + attr.wakeup_events = 1; 417 + } 418 + 419 + if (args.sample_rate > args.batch_cnt) { 420 + fprintf(stderr, "sample rate %d is too high for given batch count %d\n", 421 + args.sample_rate, args.batch_cnt); 422 + exit(1); 423 + } 424 + 425 + ctx->perfbuf = perf_buffer__new_raw(bpf_map__fd(ctx->skel->maps.perfbuf), 426 + args.perfbuf_sz, &pb_opts); 427 + if (!ctx->perfbuf) { 428 + fprintf(stderr, "failed to create perfbuf\n"); 429 + exit(1); 430 + } 431 + 432 + link = bpf_program__attach(ctx->skel->progs.bench_perfbuf); 433 + if (IS_ERR(link)) { 434 + fprintf(stderr, "failed to attach program\n"); 435 + exit(1); 436 + } 437 + } 438 + 439 + static void *perfbuf_libbpf_consumer(void *input) 440 + { 441 + struct perfbuf_libbpf_ctx *ctx = &perfbuf_libbpf_ctx; 442 + 443 + while (perf_buffer__poll(ctx->perfbuf, -1) >= 0) { 444 + if (args.back2back) 445 + bufs_trigger_batch(); 446 + } 447 + fprintf(stderr, "perfbuf polling failed!\n"); 448 + return NULL; 449 + } 450 + 451 + /* PERFBUF-CUSTOM benchmark */ 452 + 453 + /* copies of internal libbpf definitions */ 454 + struct perf_cpu_buf { 455 + struct perf_buffer *pb; 456 + void *base; /* mmap()'ed memory */ 457 + void *buf; /* for reconstructing segmented data */ 458 + size_t buf_size; 459 + int fd; 460 + int cpu; 461 + int map_key; 462 + }; 463 + 464 + struct perf_buffer { 465 + perf_buffer_event_fn event_cb; 466 + perf_buffer_sample_fn sample_cb; 467 + perf_buffer_lost_fn lost_cb; 468 + void *ctx; /* passed into callbacks */ 469 + 470 + size_t page_size; 471 + size_t mmap_size; 472 + 
struct perf_cpu_buf **cpu_bufs; 473 + struct epoll_event *events; 474 + int cpu_cnt; /* number of allocated CPU buffers */ 475 + int epoll_fd; /* perf event FD */ 476 + int map_fd; /* BPF_MAP_TYPE_PERF_EVENT_ARRAY BPF map FD */ 477 + }; 478 + 479 + static void *perfbuf_custom_consumer(void *input) 480 + { 481 + struct perfbuf_libbpf_ctx *ctx = &perfbuf_libbpf_ctx; 482 + struct perf_buffer *pb = ctx->perfbuf; 483 + struct perf_cpu_buf *cpu_buf; 484 + struct perf_event_mmap_page *header; 485 + size_t mmap_mask = pb->mmap_size - 1; 486 + struct perf_event_header *ehdr; 487 + __u64 data_head, data_tail; 488 + size_t ehdr_size; 489 + void *base; 490 + int i, cnt; 491 + 492 + while (true) { 493 + if (args.back2back) 494 + bufs_trigger_batch(); 495 + cnt = epoll_wait(pb->epoll_fd, pb->events, pb->cpu_cnt, -1); 496 + if (cnt <= 0) { 497 + fprintf(stderr, "perf epoll failed: %d\n", -errno); 498 + exit(1); 499 + } 500 + 501 + for (i = 0; i < cnt; ++i) { 502 + cpu_buf = pb->events[i].data.ptr; 503 + header = cpu_buf->base; 504 + base = ((void *)header) + pb->page_size; 505 + 506 + data_head = ring_buffer_read_head(header); 507 + data_tail = header->data_tail; 508 + while (data_head != data_tail) { 509 + ehdr = base + (data_tail & mmap_mask); 510 + ehdr_size = ehdr->size; 511 + 512 + if (ehdr->type == PERF_RECORD_SAMPLE) 513 + atomic_inc(&buf_hits.value); 514 + 515 + data_tail += ehdr_size; 516 + } 517 + ring_buffer_write_tail(header, data_tail); 518 + } 519 + } 520 + return NULL; 521 + } 522 + 523 + const struct bench bench_rb_libbpf = { 524 + .name = "rb-libbpf", 525 + .validate = bufs_validate, 526 + .setup = ringbuf_libbpf_setup, 527 + .producer_thread = bufs_sample_producer, 528 + .consumer_thread = ringbuf_libbpf_consumer, 529 + .measure = ringbuf_libbpf_measure, 530 + .report_progress = hits_drops_report_progress, 531 + .report_final = hits_drops_report_final, 532 + }; 533 + 534 + const struct bench bench_rb_custom = { 535 + .name = "rb-custom", 536 + .validate = 
bufs_validate, 537 + .setup = ringbuf_custom_setup, 538 + .producer_thread = bufs_sample_producer, 539 + .consumer_thread = ringbuf_custom_consumer, 540 + .measure = ringbuf_custom_measure, 541 + .report_progress = hits_drops_report_progress, 542 + .report_final = hits_drops_report_final, 543 + }; 544 + 545 + const struct bench bench_pb_libbpf = { 546 + .name = "pb-libbpf", 547 + .validate = bufs_validate, 548 + .setup = perfbuf_libbpf_setup, 549 + .producer_thread = bufs_sample_producer, 550 + .consumer_thread = perfbuf_libbpf_consumer, 551 + .measure = perfbuf_measure, 552 + .report_progress = hits_drops_report_progress, 553 + .report_final = hits_drops_report_final, 554 + }; 555 + 556 + const struct bench bench_pb_custom = { 557 + .name = "pb-custom", 558 + .validate = bufs_validate, 559 + .setup = perfbuf_libbpf_setup, 560 + .producer_thread = bufs_sample_producer, 561 + .consumer_thread = perfbuf_custom_consumer, 562 + .measure = perfbuf_measure, 563 + .report_progress = hits_drops_report_progress, 564 + .report_final = hits_drops_report_final, 565 + }; 566 +
+75
tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
··· 1 + #!/bin/bash 2 + 3 + set -eufo pipefail 4 + 5 + RUN_BENCH="sudo ./bench -w3 -d10 -a" 6 + 7 + function hits() 8 + { 9 + echo "$*" | sed -E "s/.*hits\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/" 10 + } 11 + 12 + function drops() 13 + { 14 + echo "$*" | sed -E "s/.*drops\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/" 15 + } 16 + 17 + function header() 18 + { 19 + local len=${#1} 20 + 21 + printf "\n%s\n" "$1" 22 + for i in $(seq 1 $len); do printf '='; done 23 + printf '\n' 24 + } 25 + 26 + function summarize() 27 + { 28 + bench="$1" 29 + summary=$(echo $2 | tail -n1) 30 + printf "%-20s %s (drops %s)\n" "$bench" "$(hits $summary)" "$(drops $summary)" 31 + } 32 + 33 + header "Single-producer, parallel producer" 34 + for b in rb-libbpf rb-custom pb-libbpf pb-custom; do 35 + summarize $b "$($RUN_BENCH $b)" 36 + done 37 + 38 + header "Single-producer, parallel producer, sampled notification" 39 + for b in rb-libbpf rb-custom pb-libbpf pb-custom; do 40 + summarize $b "$($RUN_BENCH --rb-sampled $b)" 41 + done 42 + 43 + header "Single-producer, back-to-back mode" 44 + for b in rb-libbpf rb-custom pb-libbpf pb-custom; do 45 + summarize $b "$($RUN_BENCH --rb-b2b $b)" 46 + summarize $b-sampled "$($RUN_BENCH --rb-sampled --rb-b2b $b)" 47 + done 48 + 49 + header "Ringbuf back-to-back, effect of sample rate" 50 + for b in 1 5 10 25 50 100 250 500 1000 2000 3000; do 51 + summarize "rb-sampled-$b" "$($RUN_BENCH --rb-b2b --rb-batch-cnt $b --rb-sampled --rb-sample-rate $b rb-custom)" 52 + done 53 + header "Perfbuf back-to-back, effect of sample rate" 54 + for b in 1 5 10 25 50 100 250 500 1000 2000 3000; do 55 + summarize "pb-sampled-$b" "$($RUN_BENCH --rb-b2b --rb-batch-cnt $b --rb-sampled --rb-sample-rate $b pb-custom)" 56 + done 57 + 58 + header "Ringbuf back-to-back, reserve+commit vs output" 59 + summarize "reserve" "$($RUN_BENCH --rb-b2b rb-custom)" 60 + summarize "output" "$($RUN_BENCH --rb-b2b --rb-use-output rb-custom)" 61 + 62 + header "Ringbuf sampled, 
reserve+commit vs output" 63 + summarize "reserve-sampled" "$($RUN_BENCH --rb-sampled rb-custom)" 64 + summarize "output-sampled" "$($RUN_BENCH --rb-sampled --rb-use-output rb-custom)" 65 + 66 + header "Single-producer, consumer/producer competing on the same CPU, low batch count" 67 + for b in rb-libbpf rb-custom pb-libbpf pb-custom; do 68 + summarize $b "$($RUN_BENCH --rb-batch-cnt 1 --rb-sample-rate 1 --prod-affinity 0 --cons-affinity 0 $b)" 69 + done 70 + 71 + header "Ringbuf, multi-producer contention" 72 + for b in 1 2 3 4 8 12 16 20 24 28 32 36 40 44 48 52; do 73 + summarize "rb-libbpf nr_prod $b" "$($RUN_BENCH -p$b --rb-batch-cnt 50 rb-libbpf)" 74 + done 75 +
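The `hits()`/`drops()` helpers above assume each benchmark summary line embeds throughput in the form `N.NNN ± N.NNNM/s`. A standalone check of that extraction with GNU sed (the sample line is made up for illustration, not real bench output):

```shell
# Hypothetical summary line, shaped like the output hits() expects.
line="Summary: hits   13.491 ± 0.062M/s (drops 0.000 ± 0.000M/s)"

# Same sed extraction as the hits() helper in the script.
hits=$(echo "$line" | sed -E "s/.*hits\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/")
echo "$hits"
```

Note that `\s` inside an ERE is a GNU sed extension; on a strictly POSIX sed, `[[:space:]]` would be needed instead.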
+128 -38
tools/testing/selftests/bpf/prog_tests/flow_dissector.c
··· 6 6 #include <linux/if_tun.h> 7 7 #include <sys/uio.h> 8 8 9 + #include "bpf_flow.skel.h" 10 + 9 11 #ifndef IP_MF 10 12 #define IP_MF 0x2000 11 13 #endif ··· 103 101 104 102 #define VLAN_HLEN 4 105 103 104 + static __u32 duration; 106 105 struct test tests[] = { 107 106 { 108 107 .name = "ipv4", ··· 447 444 return 0; 448 445 } 449 446 447 + static int init_prog_array(struct bpf_object *obj, struct bpf_map *prog_array) 448 + { 449 + int i, err, map_fd, prog_fd; 450 + struct bpf_program *prog; 451 + char prog_name[32]; 452 + 453 + map_fd = bpf_map__fd(prog_array); 454 + if (map_fd < 0) 455 + return -1; 456 + 457 + for (i = 0; i < bpf_map__def(prog_array)->max_entries; i++) { 458 + snprintf(prog_name, sizeof(prog_name), "flow_dissector/%i", i); 459 + 460 + prog = bpf_object__find_program_by_title(obj, prog_name); 461 + if (!prog) 462 + return -1; 463 + 464 + prog_fd = bpf_program__fd(prog); 465 + if (prog_fd < 0) 466 + return -1; 467 + 468 + err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY); 469 + if (err) 470 + return -1; 471 + } 472 + return 0; 473 + } 474 + 475 + static void run_tests_skb_less(int tap_fd, struct bpf_map *keys) 476 + { 477 + int i, err, keys_fd; 478 + 479 + keys_fd = bpf_map__fd(keys); 480 + if (CHECK(keys_fd < 0, "bpf_map__fd", "err %d\n", keys_fd)) 481 + return; 482 + 483 + for (i = 0; i < ARRAY_SIZE(tests); i++) { 484 + /* Keep in sync with 'flags' from eth_get_headlen. */ 485 + __u32 eth_get_headlen_flags = 486 + BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG; 487 + struct bpf_prog_test_run_attr tattr = {}; 488 + struct bpf_flow_keys flow_keys = {}; 489 + __u32 key = (__u32)(tests[i].keys.sport) << 16 | 490 + tests[i].keys.dport; 491 + 492 + /* For skb-less case we can't pass input flags; run 493 + * only the tests that have a matching set of flags. 
494 + */ 495 + 496 + if (tests[i].flags != eth_get_headlen_flags) 497 + continue; 498 + 499 + err = tx_tap(tap_fd, &tests[i].pkt, sizeof(tests[i].pkt)); 500 + CHECK(err < 0, "tx_tap", "err %d errno %d\n", err, errno); 501 + 502 + err = bpf_map_lookup_elem(keys_fd, &key, &flow_keys); 503 + CHECK_ATTR(err, tests[i].name, "bpf_map_lookup_elem %d\n", err); 504 + 505 + CHECK_ATTR(err, tests[i].name, "skb-less err %d\n", err); 506 + CHECK_FLOW_KEYS(tests[i].name, flow_keys, tests[i].keys); 507 + 508 + err = bpf_map_delete_elem(keys_fd, &key); 509 + CHECK_ATTR(err, tests[i].name, "bpf_map_delete_elem %d\n", err); 510 + } 511 + } 512 + 513 + static void test_skb_less_prog_attach(struct bpf_flow *skel, int tap_fd) 514 + { 515 + int err, prog_fd; 516 + 517 + prog_fd = bpf_program__fd(skel->progs._dissect); 518 + if (CHECK(prog_fd < 0, "bpf_program__fd", "err %d\n", prog_fd)) 519 + return; 520 + 521 + err = bpf_prog_attach(prog_fd, 0, BPF_FLOW_DISSECTOR, 0); 522 + if (CHECK(err, "bpf_prog_attach", "err %d errno %d\n", err, errno)) 523 + return; 524 + 525 + run_tests_skb_less(tap_fd, skel->maps.last_dissection); 526 + 527 + err = bpf_prog_detach(prog_fd, BPF_FLOW_DISSECTOR); 528 + CHECK(err, "bpf_prog_detach", "err %d errno %d\n", err, errno); 529 + } 530 + 531 + static void test_skb_less_link_create(struct bpf_flow *skel, int tap_fd) 532 + { 533 + struct bpf_link *link; 534 + int err, net_fd; 535 + 536 + net_fd = open("/proc/self/ns/net", O_RDONLY); 537 + if (CHECK(net_fd < 0, "open(/proc/self/ns/net)", "err %d\n", errno)) 538 + return; 539 + 540 + link = bpf_program__attach_netns(skel->progs._dissect, net_fd); 541 + if (CHECK(IS_ERR(link), "attach_netns", "err %ld\n", PTR_ERR(link))) 542 + goto out_close; 543 + 544 + run_tests_skb_less(tap_fd, skel->maps.last_dissection); 545 + 546 + err = bpf_link__destroy(link); 547 + CHECK(err, "bpf_link__destroy", "err %d\n", err); 548 + out_close: 549 + close(net_fd); 550 + } 551 + 450 552 void test_flow_dissector(void) 451 553 { 452 
554 int i, err, prog_fd, keys_fd = -1, tap_fd; 453 - struct bpf_object *obj; 454 - __u32 duration = 0; 555 + struct bpf_flow *skel; 455 556 456 - err = bpf_flow_load(&obj, "./bpf_flow.o", "flow_dissector", 457 - "jmp_table", "last_dissection", &prog_fd, &keys_fd); 458 - if (CHECK_FAIL(err)) 557 + skel = bpf_flow__open_and_load(); 558 + if (CHECK(!skel, "skel", "failed to open/load skeleton\n")) 459 559 return; 560 + 561 + prog_fd = bpf_program__fd(skel->progs._dissect); 562 + if (CHECK(prog_fd < 0, "bpf_program__fd", "err %d\n", prog_fd)) 563 + goto out_destroy_skel; 564 + keys_fd = bpf_map__fd(skel->maps.last_dissection); 565 + if (CHECK(keys_fd < 0, "bpf_map__fd", "err %d\n", keys_fd)) 566 + goto out_destroy_skel; 567 + err = init_prog_array(skel->obj, skel->maps.jmp_table); 568 + if (CHECK(err, "init_prog_array", "err %d\n", err)) 569 + goto out_destroy_skel; 460 570 461 571 for (i = 0; i < ARRAY_SIZE(tests); i++) { 462 572 struct bpf_flow_keys flow_keys; ··· 603 487 * via BPF map in this case. 604 488 */ 605 489 606 - err = bpf_prog_attach(prog_fd, 0, BPF_FLOW_DISSECTOR, 0); 607 - CHECK(err, "bpf_prog_attach", "err %d errno %d\n", err, errno); 608 - 609 490 tap_fd = create_tap("tap0"); 610 491 CHECK(tap_fd < 0, "create_tap", "tap_fd %d errno %d\n", tap_fd, errno); 611 492 err = ifup("tap0"); 612 493 CHECK(err, "ifup", "err %d errno %d\n", err, errno); 613 494 614 - for (i = 0; i < ARRAY_SIZE(tests); i++) { 615 - /* Keep in sync with 'flags' from eth_get_headlen. 
*/ 616 - __u32 eth_get_headlen_flags = 617 - BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG; 618 - struct bpf_prog_test_run_attr tattr = {}; 619 - struct bpf_flow_keys flow_keys = {}; 620 - __u32 key = (__u32)(tests[i].keys.sport) << 16 | 621 - tests[i].keys.dport; 495 + /* Test direct prog attachment */ 496 + test_skb_less_prog_attach(skel, tap_fd); 497 + /* Test indirect prog attachment via link */ 498 + test_skb_less_link_create(skel, tap_fd); 622 499 623 - /* For skb-less case we can't pass input flags; run 624 - * only the tests that have a matching set of flags. 625 - */ 626 - 627 - if (tests[i].flags != eth_get_headlen_flags) 628 - continue; 629 - 630 - err = tx_tap(tap_fd, &tests[i].pkt, sizeof(tests[i].pkt)); 631 - CHECK(err < 0, "tx_tap", "err %d errno %d\n", err, errno); 632 - 633 - err = bpf_map_lookup_elem(keys_fd, &key, &flow_keys); 634 - CHECK_ATTR(err, tests[i].name, "bpf_map_lookup_elem %d\n", err); 635 - 636 - CHECK_ATTR(err, tests[i].name, "skb-less err %d\n", err); 637 - CHECK_FLOW_KEYS(tests[i].name, flow_keys, tests[i].keys); 638 - 639 - err = bpf_map_delete_elem(keys_fd, &key); 640 - CHECK_ATTR(err, tests[i].name, "bpf_map_delete_elem %d\n", err); 641 - } 642 - 643 - bpf_prog_detach(prog_fd, BPF_FLOW_DISSECTOR); 644 - bpf_object__close(obj); 500 + close(tap_fd); 501 + out_destroy_skel: 502 + bpf_flow__destroy(skel); 645 503 }
+554 -40
tools/testing/selftests/bpf/prog_tests/flow_dissector_reattach.c
··· 11 11 #include <fcntl.h> 12 12 #include <sched.h> 13 13 #include <stdbool.h> 14 + #include <sys/stat.h> 14 15 #include <unistd.h> 15 16 16 17 #include <linux/bpf.h> ··· 19 18 20 19 #include "test_progs.h" 21 20 22 - static bool is_attached(int netns) 21 + static int init_net = -1; 22 + 23 + static __u32 query_attached_prog_id(int netns) 23 24 { 24 - __u32 cnt; 25 + __u32 prog_ids[1] = {}; 26 + __u32 prog_cnt = ARRAY_SIZE(prog_ids); 25 27 int err; 26 28 27 - err = bpf_prog_query(netns, BPF_FLOW_DISSECTOR, 0, NULL, NULL, &cnt); 29 + err = bpf_prog_query(netns, BPF_FLOW_DISSECTOR, 0, NULL, 30 + prog_ids, &prog_cnt); 28 31 if (CHECK_FAIL(err)) { 29 32 perror("bpf_prog_query"); 30 - return true; /* fail-safe */ 33 + return 0; 31 34 } 32 35 33 - return cnt > 0; 36 + return prog_cnt == 1 ? prog_ids[0] : 0; 34 37 } 35 38 36 - static int load_prog(void) 39 + static bool prog_is_attached(int netns) 40 + { 41 + return query_attached_prog_id(netns) > 0; 42 + } 43 + 44 + static int load_prog(enum bpf_prog_type type) 37 45 { 38 46 struct bpf_insn prog[] = { 39 47 BPF_MOV64_IMM(BPF_REG_0, BPF_OK), ··· 50 40 }; 51 41 int fd; 52 42 53 - fd = bpf_load_program(BPF_PROG_TYPE_FLOW_DISSECTOR, prog, 54 - ARRAY_SIZE(prog), "GPL", 0, NULL, 0); 43 + fd = bpf_load_program(type, prog, ARRAY_SIZE(prog), "GPL", 0, NULL, 0); 55 44 if (CHECK_FAIL(fd < 0)) 56 45 perror("bpf_load_program"); 57 46 58 47 return fd; 59 48 } 60 49 61 - static void do_flow_dissector_reattach(void) 50 + static __u32 query_prog_id(int prog) 62 51 { 63 - int prog_fd[2] = { -1, -1 }; 52 + struct bpf_prog_info info = {}; 53 + __u32 info_len = sizeof(info); 64 54 int err; 65 55 66 - prog_fd[0] = load_prog(); 67 - if (prog_fd[0] < 0) 68 - return; 69 - 70 - prog_fd[1] = load_prog(); 71 - if (prog_fd[1] < 0) 72 - goto out_close; 73 - 74 - err = bpf_prog_attach(prog_fd[0], 0, BPF_FLOW_DISSECTOR, 0); 75 - if (CHECK_FAIL(err)) { 76 - perror("bpf_prog_attach-0"); 77 - goto out_close; 56 + err = bpf_obj_get_info_by_fd(prog, 
&info, &info_len); 57 + if (CHECK_FAIL(err || info_len != sizeof(info))) { 58 + perror("bpf_obj_get_info_by_fd"); 59 + return 0; 78 60 } 61 + 62 + return info.id; 63 + } 64 + 65 + static int unshare_net(int old_net) 66 + { 67 + int err, new_net; 68 + 69 + err = unshare(CLONE_NEWNET); 70 + if (CHECK_FAIL(err)) { 71 + perror("unshare(CLONE_NEWNET)"); 72 + return -1; 73 + } 74 + new_net = open("/proc/self/ns/net", O_RDONLY); 75 + if (CHECK_FAIL(new_net < 0)) { 76 + perror("open(/proc/self/ns/net)"); 77 + setns(old_net, CLONE_NEWNET); 78 + return -1; 79 + } 80 + return new_net; 81 + } 82 + 83 + static void test_prog_attach_prog_attach(int netns, int prog1, int prog2) 84 + { 85 + int err; 86 + 87 + err = bpf_prog_attach(prog1, 0, BPF_FLOW_DISSECTOR, 0); 88 + if (CHECK_FAIL(err)) { 89 + perror("bpf_prog_attach(prog1)"); 90 + return; 91 + } 92 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 79 93 80 94 /* Expect success when attaching a different program */ 81 - err = bpf_prog_attach(prog_fd[1], 0, BPF_FLOW_DISSECTOR, 0); 95 + err = bpf_prog_attach(prog2, 0, BPF_FLOW_DISSECTOR, 0); 82 96 if (CHECK_FAIL(err)) { 83 - perror("bpf_prog_attach-1"); 97 + perror("bpf_prog_attach(prog2) #1"); 84 98 goto out_detach; 85 99 } 100 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog2)); 86 101 87 102 /* Expect failure when attaching the same program twice */ 88 - err = bpf_prog_attach(prog_fd[1], 0, BPF_FLOW_DISSECTOR, 0); 103 + err = bpf_prog_attach(prog2, 0, BPF_FLOW_DISSECTOR, 0); 89 104 if (CHECK_FAIL(!err || errno != EINVAL)) 90 - perror("bpf_prog_attach-2"); 105 + perror("bpf_prog_attach(prog2) #2"); 106 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog2)); 91 107 92 108 out_detach: 93 109 err = bpf_prog_detach(0, BPF_FLOW_DISSECTOR); 94 110 if (CHECK_FAIL(err)) 95 111 perror("bpf_prog_detach"); 112 + CHECK_FAIL(prog_is_attached(netns)); 113 + } 114 + 115 + static void test_link_create_link_create(int netns, int prog1, int prog2) 
116 + { 117 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts); 118 + int link1, link2; 119 + 120 + link1 = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &opts); 121 + if (CHECK_FAIL(link1 < 0)) { 122 + perror("bpf_link_create(prog1)"); 123 + return; 124 + } 125 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 126 + 127 + /* Expect failure creating link when another link exists */ 128 + errno = 0; 129 + link2 = bpf_link_create(prog2, netns, BPF_FLOW_DISSECTOR, &opts); 130 + if (CHECK_FAIL(link2 != -1 || errno != E2BIG)) 131 + perror("bpf_link_create(prog2) expected E2BIG"); 132 + if (link2 != -1) 133 + close(link2); 134 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 135 + 136 + close(link1); 137 + CHECK_FAIL(prog_is_attached(netns)); 138 + } 139 + 140 + static void test_prog_attach_link_create(int netns, int prog1, int prog2) 141 + { 142 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts); 143 + int err, link; 144 + 145 + err = bpf_prog_attach(prog1, -1, BPF_FLOW_DISSECTOR, 0); 146 + if (CHECK_FAIL(err)) { 147 + perror("bpf_prog_attach(prog1)"); 148 + return; 149 + } 150 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 151 + 152 + /* Expect failure creating link when prog attached */ 153 + errno = 0; 154 + link = bpf_link_create(prog2, netns, BPF_FLOW_DISSECTOR, &opts); 155 + if (CHECK_FAIL(link != -1 || errno != EEXIST)) 156 + perror("bpf_link_create(prog2) expected EEXIST"); 157 + if (link != -1) 158 + close(link); 159 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 160 + 161 + err = bpf_prog_detach(-1, BPF_FLOW_DISSECTOR); 162 + if (CHECK_FAIL(err)) 163 + perror("bpf_prog_detach"); 164 + CHECK_FAIL(prog_is_attached(netns)); 165 + } 166 + 167 + static void test_link_create_prog_attach(int netns, int prog1, int prog2) 168 + { 169 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts); 170 + int err, link; 171 + 172 + link = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &opts); 173 +
if (CHECK_FAIL(link < 0)) { 174 + perror("bpf_link_create(prog1)"); 175 + return; 176 + } 177 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 178 + 179 + /* Expect failure attaching prog when link exists */ 180 + errno = 0; 181 + err = bpf_prog_attach(prog2, -1, BPF_FLOW_DISSECTOR, 0); 182 + if (CHECK_FAIL(!err || errno != EEXIST)) 183 + perror("bpf_prog_attach(prog2) expected EEXIST"); 184 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 185 + 186 + close(link); 187 + CHECK_FAIL(prog_is_attached(netns)); 188 + } 189 + 190 + static void test_link_create_prog_detach(int netns, int prog1, int prog2) 191 + { 192 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts); 193 + int err, link; 194 + 195 + link = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &opts); 196 + if (CHECK_FAIL(link < 0)) { 197 + perror("bpf_link_create(prog1)"); 198 + return; 199 + } 200 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 201 + 202 + /* Expect failure detaching prog when link exists */ 203 + errno = 0; 204 + err = bpf_prog_detach(-1, BPF_FLOW_DISSECTOR); 205 + if (CHECK_FAIL(!err || errno != EINVAL)) 206 + perror("bpf_prog_detach expected EINVAL"); 207 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 208 + 209 + close(link); 210 + CHECK_FAIL(prog_is_attached(netns)); 211 + } 212 + 213 + static void test_prog_attach_detach_query(int netns, int prog1, int prog2) 214 + { 215 + int err; 216 + 217 + err = bpf_prog_attach(prog1, 0, BPF_FLOW_DISSECTOR, 0); 218 + if (CHECK_FAIL(err)) { 219 + perror("bpf_prog_attach(prog1)"); 220 + return; 221 + } 222 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 223 + 224 + err = bpf_prog_detach(0, BPF_FLOW_DISSECTOR); 225 + if (CHECK_FAIL(err)) { 226 + perror("bpf_prog_detach"); 227 + return; 228 + } 229 + 230 + /* Expect no prog attached after successful detach */ 231 + CHECK_FAIL(prog_is_attached(netns)); 232 + } 233 + 234 + static void 
test_link_create_close_query(int netns, int prog1, int prog2) 235 + { 236 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts); 237 + int link; 238 + 239 + link = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &opts); 240 + if (CHECK_FAIL(link < 0)) { 241 + perror("bpf_link_create(prog1)"); 242 + return; 243 + } 244 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 245 + 246 + close(link); 247 + /* Expect no prog attached after closing last link FD */ 248 + CHECK_FAIL(prog_is_attached(netns)); 249 + } 250 + 251 + static void test_link_update_no_old_prog(int netns, int prog1, int prog2) 252 + { 253 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, create_opts); 254 + DECLARE_LIBBPF_OPTS(bpf_link_update_opts, update_opts); 255 + int err, link; 256 + 257 + link = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &create_opts); 258 + if (CHECK_FAIL(link < 0)) { 259 + perror("bpf_link_create(prog1)"); 260 + return; 261 + } 262 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 263 + 264 + /* Expect success replacing the prog when old prog not specified */ 265 + update_opts.flags = 0; 266 + update_opts.old_prog_fd = 0; 267 + err = bpf_link_update(link, prog2, &update_opts); 268 + if (CHECK_FAIL(err)) 269 + perror("bpf_link_update"); 270 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog2)); 271 + 272 + close(link); 273 + CHECK_FAIL(prog_is_attached(netns)); 274 + } 275 + 276 + static void test_link_update_replace_old_prog(int netns, int prog1, int prog2) 277 + { 278 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, create_opts); 279 + DECLARE_LIBBPF_OPTS(bpf_link_update_opts, update_opts); 280 + int err, link; 281 + 282 + link = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &create_opts); 283 + if (CHECK_FAIL(link < 0)) { 284 + perror("bpf_link_create(prog1)"); 285 + return; 286 + } 287 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 288 + 289 + /* Expect success F_REPLACE and old prog specified to succeed 
*/ 290 + update_opts.flags = BPF_F_REPLACE; 291 + update_opts.old_prog_fd = prog1; 292 + err = bpf_link_update(link, prog2, &update_opts); 293 + if (CHECK_FAIL(err)) 294 + perror("bpf_link_update"); 295 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog2)); 296 + 297 + close(link); 298 + CHECK_FAIL(prog_is_attached(netns)); 299 + } 300 + 301 + static void test_link_update_invalid_opts(int netns, int prog1, int prog2) 302 + { 303 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, create_opts); 304 + DECLARE_LIBBPF_OPTS(bpf_link_update_opts, update_opts); 305 + int err, link; 306 + 307 + link = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &create_opts); 308 + if (CHECK_FAIL(link < 0)) { 309 + perror("bpf_link_create(prog1)"); 310 + return; 311 + } 312 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 313 + 314 + /* Expect update to fail w/ old prog FD but w/o F_REPLACE*/ 315 + errno = 0; 316 + update_opts.flags = 0; 317 + update_opts.old_prog_fd = prog1; 318 + err = bpf_link_update(link, prog2, &update_opts); 319 + if (CHECK_FAIL(!err || errno != EINVAL)) { 320 + perror("bpf_link_update expected EINVAL"); 321 + goto out_close; 322 + } 323 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 324 + 325 + /* Expect update to fail on old prog FD mismatch */ 326 + errno = 0; 327 + update_opts.flags = BPF_F_REPLACE; 328 + update_opts.old_prog_fd = prog2; 329 + err = bpf_link_update(link, prog2, &update_opts); 330 + if (CHECK_FAIL(!err || errno != EPERM)) { 331 + perror("bpf_link_update expected EPERM"); 332 + goto out_close; 333 + } 334 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 335 + 336 + /* Expect update to fail for invalid old prog FD */ 337 + errno = 0; 338 + update_opts.flags = BPF_F_REPLACE; 339 + update_opts.old_prog_fd = -1; 340 + err = bpf_link_update(link, prog2, &update_opts); 341 + if (CHECK_FAIL(!err || errno != EBADF)) { 342 + perror("bpf_link_update expected EBADF"); 343 + goto 
out_close; 344 + } 345 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 346 + 347 + /* Expect update to fail with invalid flags */ 348 + errno = 0; 349 + update_opts.flags = BPF_F_ALLOW_MULTI; 350 + update_opts.old_prog_fd = 0; 351 + err = bpf_link_update(link, prog2, &update_opts); 352 + if (CHECK_FAIL(!err || errno != EINVAL)) 353 + perror("bpf_link_update expected EINVAL"); 354 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 96 355 97 356 out_close: 98 - close(prog_fd[1]); 99 - close(prog_fd[0]); 357 + close(link); 358 + CHECK_FAIL(prog_is_attached(netns)); 359 + } 360 + 361 + static void test_link_update_invalid_prog(int netns, int prog1, int prog2) 362 + { 363 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, create_opts); 364 + DECLARE_LIBBPF_OPTS(bpf_link_update_opts, update_opts); 365 + int err, link, prog3; 366 + 367 + link = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &create_opts); 368 + if (CHECK_FAIL(link < 0)) { 369 + perror("bpf_link_create(prog1)"); 370 + return; 371 + } 372 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 373 + 374 + /* Expect failure when new prog FD is not valid */ 375 + errno = 0; 376 + update_opts.flags = 0; 377 + update_opts.old_prog_fd = 0; 378 + err = bpf_link_update(link, -1, &update_opts); 379 + if (CHECK_FAIL(!err || errno != EBADF)) { 380 + perror("bpf_link_update expected EBADF"); 381 + goto out_close_link; 382 + } 383 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 384 + 385 + prog3 = load_prog(BPF_PROG_TYPE_SOCKET_FILTER); 386 + if (prog3 < 0) 387 + goto out_close_link; 388 + 389 + /* Expect failure when new prog FD type doesn't match */ 390 + errno = 0; 391 + update_opts.flags = 0; 392 + update_opts.old_prog_fd = 0; 393 + err = bpf_link_update(link, prog3, &update_opts); 394 + if (CHECK_FAIL(!err || errno != EINVAL)) 395 + perror("bpf_link_update expected EINVAL"); 396 + CHECK_FAIL(query_attached_prog_id(netns) !=
query_prog_id(prog1)); 397 + 398 + close(prog3); 399 + out_close_link: 400 + close(link); 401 + CHECK_FAIL(prog_is_attached(netns)); 402 + } 403 + 404 + static void test_link_update_netns_gone(int netns, int prog1, int prog2) 405 + { 406 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, create_opts); 407 + DECLARE_LIBBPF_OPTS(bpf_link_update_opts, update_opts); 408 + int err, link, old_net; 409 + 410 + old_net = netns; 411 + netns = unshare_net(old_net); 412 + if (netns < 0) 413 + return; 414 + 415 + link = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &create_opts); 416 + if (CHECK_FAIL(link < 0)) { 417 + perror("bpf_link_create(prog1)"); 418 + return; 419 + } 420 + CHECK_FAIL(query_attached_prog_id(netns) != query_prog_id(prog1)); 421 + 422 + close(netns); 423 + err = setns(old_net, CLONE_NEWNET); 424 + if (CHECK_FAIL(err)) { 425 + perror("setns(CLONE_NEWNET)"); 426 + close(link); 427 + return; 428 + } 429 + 430 + /* Expect failure when netns destroyed */ 431 + errno = 0; 432 + update_opts.flags = 0; 433 + update_opts.old_prog_fd = 0; 434 + err = bpf_link_update(link, prog2, &update_opts); 435 + if (CHECK_FAIL(!err || errno != ENOLINK)) 436 + perror("bpf_link_update"); 437 + 438 + close(link); 439 + } 440 + 441 + static void test_link_get_info(int netns, int prog1, int prog2) 442 + { 443 + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, create_opts); 444 + DECLARE_LIBBPF_OPTS(bpf_link_update_opts, update_opts); 445 + struct bpf_link_info info = {}; 446 + struct stat netns_stat = {}; 447 + __u32 info_len, link_id; 448 + int err, link, old_net; 449 + 450 + old_net = netns; 451 + netns = unshare_net(old_net); 452 + if (netns < 0) 453 + return; 454 + 455 + err = fstat(netns, &netns_stat); 456 + if (CHECK_FAIL(err)) { 457 + perror("stat(netns)"); 458 + goto out_resetns; 459 + } 460 + 461 + link = bpf_link_create(prog1, netns, BPF_FLOW_DISSECTOR, &create_opts); 462 + if (CHECK_FAIL(link < 0)) { 463 + perror("bpf_link_create(prog1)"); 464 + goto out_resetns; 465 + } 466 + 467 + 
info_len = sizeof(info); 468 + err = bpf_obj_get_info_by_fd(link, &info, &info_len); 469 + if (CHECK_FAIL(err)) { 470 + perror("bpf_obj_get_info"); 471 + goto out_unlink; 472 + } 473 + CHECK_FAIL(info_len != sizeof(info)); 474 + 475 + /* Expect link info to be sane and match prog and netns details */ 476 + CHECK_FAIL(info.type != BPF_LINK_TYPE_NETNS); 477 + CHECK_FAIL(info.id == 0); 478 + CHECK_FAIL(info.prog_id != query_prog_id(prog1)); 479 + CHECK_FAIL(info.netns.netns_ino != netns_stat.st_ino); 480 + CHECK_FAIL(info.netns.attach_type != BPF_FLOW_DISSECTOR); 481 + 482 + update_opts.flags = 0; 483 + update_opts.old_prog_fd = 0; 484 + err = bpf_link_update(link, prog2, &update_opts); 485 + if (CHECK_FAIL(err)) { 486 + perror("bpf_link_update(prog2)"); 487 + goto out_unlink; 488 + } 489 + 490 + link_id = info.id; 491 + info_len = sizeof(info); 492 + err = bpf_obj_get_info_by_fd(link, &info, &info_len); 493 + if (CHECK_FAIL(err)) { 494 + perror("bpf_obj_get_info"); 495 + goto out_unlink; 496 + } 497 + CHECK_FAIL(info_len != sizeof(info)); 498 + 499 + /* Expect no info change after update except in prog id */ 500 + CHECK_FAIL(info.type != BPF_LINK_TYPE_NETNS); 501 + CHECK_FAIL(info.id != link_id); 502 + CHECK_FAIL(info.prog_id != query_prog_id(prog2)); 503 + CHECK_FAIL(info.netns.netns_ino != netns_stat.st_ino); 504 + CHECK_FAIL(info.netns.attach_type != BPF_FLOW_DISSECTOR); 505 + 506 + /* Leave netns link is attached to and close last FD to it */ 507 + err = setns(old_net, CLONE_NEWNET); 508 + if (CHECK_FAIL(err)) { 509 + perror("setns(NEWNET)"); 510 + goto out_unlink; 511 + } 512 + close(netns); 513 + old_net = -1; 514 + netns = -1; 515 + 516 + info_len = sizeof(info); 517 + err = bpf_obj_get_info_by_fd(link, &info, &info_len); 518 + if (CHECK_FAIL(err)) { 519 + perror("bpf_obj_get_info"); 520 + goto out_unlink; 521 + } 522 + CHECK_FAIL(info_len != sizeof(info)); 523 + 524 + /* Expect netns_ino to change to 0 */ 525 + CHECK_FAIL(info.type != BPF_LINK_TYPE_NETNS); 
526 + CHECK_FAIL(info.id != link_id); 527 + CHECK_FAIL(info.prog_id != query_prog_id(prog2)); 528 + CHECK_FAIL(info.netns.netns_ino != 0); 529 + CHECK_FAIL(info.netns.attach_type != BPF_FLOW_DISSECTOR); 530 + 531 + out_unlink: 532 + close(link); 533 + out_resetns: 534 + if (old_net != -1) 535 + setns(old_net, CLONE_NEWNET); 536 + if (netns != -1) 537 + close(netns); 538 + } 539 + 540 + static void run_tests(int netns) 541 + { 542 + struct test { 543 + const char *test_name; 544 + void (*test_func)(int netns, int prog1, int prog2); 545 + } tests[] = { 546 + { "prog attach, prog attach", 547 + test_prog_attach_prog_attach }, 548 + { "link create, link create", 549 + test_link_create_link_create }, 550 + { "prog attach, link create", 551 + test_prog_attach_link_create }, 552 + { "link create, prog attach", 553 + test_link_create_prog_attach }, 554 + { "link create, prog detach", 555 + test_link_create_prog_detach }, 556 + { "prog attach, detach, query", 557 + test_prog_attach_detach_query }, 558 + { "link create, close, query", 559 + test_link_create_close_query }, 560 + { "link update no old prog", 561 + test_link_update_no_old_prog }, 562 + { "link update with replace old prog", 563 + test_link_update_replace_old_prog }, 564 + { "link update invalid opts", 565 + test_link_update_invalid_opts }, 566 + { "link update invalid prog", 567 + test_link_update_invalid_prog }, 568 + { "link update netns gone", 569 + test_link_update_netns_gone }, 570 + { "link get info", 571 + test_link_get_info }, 572 + }; 573 + int i, progs[2] = { -1, -1 }; 574 + char test_name[80]; 575 + 576 + for (i = 0; i < ARRAY_SIZE(progs); i++) { 577 + progs[i] = load_prog(BPF_PROG_TYPE_FLOW_DISSECTOR); 578 + if (progs[i] < 0) 579 + goto out_close; 580 + } 581 + 582 + for (i = 0; i < ARRAY_SIZE(tests); i++) { 583 + snprintf(test_name, sizeof(test_name), 584 + "flow dissector %s%s", 585 + tests[i].test_name, 586 + netns == init_net ? 
" (init_net)" : ""); 587 + if (test__start_subtest(test_name)) 588 + tests[i].test_func(netns, progs[0], progs[1]); 589 + } 590 + out_close: 591 + for (i = 0; i < ARRAY_SIZE(progs); i++) { 592 + if (progs[i] != -1) 593 + CHECK_FAIL(close(progs[i])); 594 + } 100 595 } 101 596 102 597 void test_flow_dissector_reattach(void) 103 598 { 104 - int init_net, self_net, err; 599 + int err, new_net, saved_net; 105 600 106 - self_net = open("/proc/self/ns/net", O_RDONLY); 107 - if (CHECK_FAIL(self_net < 0)) { 601 + saved_net = open("/proc/self/ns/net", O_RDONLY); 602 + if (CHECK_FAIL(saved_net < 0)) { 108 603 perror("open(/proc/self/ns/net"); 109 604 return; 110 605 } ··· 626 111 goto out_close; 627 112 } 628 113 629 - if (is_attached(init_net)) { 114 + if (prog_is_attached(init_net)) { 630 115 test__skip(); 631 116 printf("Can't test with flow dissector attached to init_net\n"); 632 117 goto out_setns; 633 118 } 634 119 635 120 /* First run tests in root network namespace */ 636 - do_flow_dissector_reattach(); 121 + run_tests(init_net); 637 122 638 123 /* Then repeat tests in a non-root namespace */ 639 - err = unshare(CLONE_NEWNET); 640 - if (CHECK_FAIL(err)) { 641 - perror("unshare(CLONE_NEWNET)"); 124 + new_net = unshare_net(init_net); 125 + if (new_net < 0) 642 126 goto out_setns; 643 - } 644 - do_flow_dissector_reattach(); 127 + run_tests(new_net); 128 + close(new_net); 645 129 646 130 out_setns: 647 131 /* Move back to netns we started in. */ 648 - err = setns(self_net, CLONE_NEWNET); 132 + err = setns(saved_net, CLONE_NEWNET); 649 133 if (CHECK_FAIL(err)) 650 134 perror("setns(/proc/self/ns/net)"); 651 135 652 136 out_close: 653 137 close(init_net); 654 - close(self_net); 138 + close(saved_net); 655 139 }
+211
tools/testing/selftests/bpf/prog_tests/ringbuf.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #define _GNU_SOURCE 3 + #include <linux/compiler.h> 4 + #include <asm/barrier.h> 5 + #include <test_progs.h> 6 + #include <sys/mman.h> 7 + #include <sys/epoll.h> 8 + #include <time.h> 9 + #include <sched.h> 10 + #include <signal.h> 11 + #include <pthread.h> 12 + #include <sys/sysinfo.h> 13 + #include <linux/perf_event.h> 14 + #include <linux/ring_buffer.h> 15 + #include "test_ringbuf.skel.h" 16 + 17 + #define EDONE 7777 18 + 19 + static int duration = 0; 20 + 21 + struct sample { 22 + int pid; 23 + int seq; 24 + long value; 25 + char comm[16]; 26 + }; 27 + 28 + static int sample_cnt; 29 + 30 + static int process_sample(void *ctx, void *data, size_t len) 31 + { 32 + struct sample *s = data; 33 + 34 + sample_cnt++; 35 + 36 + switch (s->seq) { 37 + case 0: 38 + CHECK(s->value != 333, "sample1_value", "exp %ld, got %ld\n", 39 + 333L, s->value); 40 + return 0; 41 + case 1: 42 + CHECK(s->value != 777, "sample2_value", "exp %ld, got %ld\n", 43 + 777L, s->value); 44 + return -EDONE; 45 + default: 46 + /* we don't care about the rest */ 47 + return 0; 48 + } 49 + } 50 + 51 + static struct test_ringbuf *skel; 52 + static struct ring_buffer *ringbuf; 53 + 54 + static void trigger_samples() 55 + { 56 + skel->bss->dropped = 0; 57 + skel->bss->total = 0; 58 + skel->bss->discarded = 0; 59 + 60 + /* trigger exactly two samples */ 61 + skel->bss->value = 333; 62 + syscall(__NR_getpgid); 63 + skel->bss->value = 777; 64 + syscall(__NR_getpgid); 65 + } 66 + 67 + static void *poll_thread(void *input) 68 + { 69 + long timeout = (long)input; 70 + 71 + return (void *)(long)ring_buffer__poll(ringbuf, timeout); 72 + } 73 + 74 + void test_ringbuf(void) 75 + { 76 + const size_t rec_sz = BPF_RINGBUF_HDR_SZ + sizeof(struct sample); 77 + pthread_t thread; 78 + long bg_ret = -1; 79 + int err; 80 + 81 + skel = test_ringbuf__open_and_load(); 82 + if (CHECK(!skel, "skel_open_load", "skeleton open&load failed\n")) 83 + return; 84 + 85 + /* only trigger 
BPF program for current process */ 86 + skel->bss->pid = getpid(); 87 + 88 + ringbuf = ring_buffer__new(bpf_map__fd(skel->maps.ringbuf), 89 + process_sample, NULL, NULL); 90 + if (CHECK(!ringbuf, "ringbuf_create", "failed to create ringbuf\n")) 91 + goto cleanup; 92 + 93 + err = test_ringbuf__attach(skel); 94 + if (CHECK(err, "skel_attach", "skeleton attachment failed: %d\n", err)) 95 + goto cleanup; 96 + 97 + trigger_samples(); 98 + 99 + /* 2 submitted + 1 discarded records */ 100 + CHECK(skel->bss->avail_data != 3 * rec_sz, 101 + "err_avail_size", "exp %ld, got %ld\n", 102 + 3L * rec_sz, skel->bss->avail_data); 103 + CHECK(skel->bss->ring_size != 4096, 104 + "err_ring_size", "exp %ld, got %ld\n", 105 + 4096L, skel->bss->ring_size); 106 + CHECK(skel->bss->cons_pos != 0, 107 + "err_cons_pos", "exp %ld, got %ld\n", 108 + 0L, skel->bss->cons_pos); 109 + CHECK(skel->bss->prod_pos != 3 * rec_sz, 110 + "err_prod_pos", "exp %ld, got %ld\n", 111 + 3L * rec_sz, skel->bss->prod_pos); 112 + 113 + /* poll for samples */ 114 + err = ring_buffer__poll(ringbuf, -1); 115 + 116 + /* -EDONE is used as an indicator that we are done */ 117 + if (CHECK(err != -EDONE, "err_done", "done err: %d\n", err)) 118 + goto cleanup; 119 + 120 + /* we expect extra polling to return nothing */ 121 + err = ring_buffer__poll(ringbuf, 0); 122 + if (CHECK(err != 0, "extra_samples", "poll result: %d\n", err)) 123 + goto cleanup; 124 + 125 + CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n", 126 + 0L, skel->bss->dropped); 127 + CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n", 128 + 2L, skel->bss->total); 129 + CHECK(skel->bss->discarded != 1, "err_discarded", "exp %ld, got %ld\n", 130 + 1L, skel->bss->discarded); 131 + 132 + /* now validate consumer position is updated and returned */ 133 + trigger_samples(); 134 + CHECK(skel->bss->cons_pos != 3 * rec_sz, 135 + "err_cons_pos", "exp %ld, got %ld\n", 136 + 3L * rec_sz, skel->bss->cons_pos); 137 + err = 
ring_buffer__poll(ringbuf, -1); 138 + CHECK(err <= 0, "poll_err", "err %d\n", err); 139 + 140 + /* start poll in background w/ long timeout */ 141 + err = pthread_create(&thread, NULL, poll_thread, (void *)(long)10000); 142 + if (CHECK(err, "bg_poll", "pthread_create failed: %d\n", err)) 143 + goto cleanup; 144 + 145 + /* turn off notifications now */ 146 + skel->bss->flags = BPF_RB_NO_WAKEUP; 147 + 148 + /* give background thread a bit of a time */ 149 + usleep(50000); 150 + trigger_samples(); 151 + /* sleeping arbitrarily is bad, but no better way to know that 152 + * epoll_wait() **DID NOT** unblock in background thread 153 + */ 154 + usleep(50000); 155 + /* background poll should still be blocked */ 156 + err = pthread_tryjoin_np(thread, (void **)&bg_ret); 157 + if (CHECK(err != EBUSY, "try_join", "err %d\n", err)) 158 + goto cleanup; 159 + 160 + /* BPF side did everything right */ 161 + CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n", 162 + 0L, skel->bss->dropped); 163 + CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n", 164 + 2L, skel->bss->total); 165 + CHECK(skel->bss->discarded != 1, "err_discarded", "exp %ld, got %ld\n", 166 + 1L, skel->bss->discarded); 167 + 168 + /* clear flags to return to "adaptive" notification mode */ 169 + skel->bss->flags = 0; 170 + 171 + /* produce new samples, no notification should be triggered, because 172 + * consumer is now behind 173 + */ 174 + trigger_samples(); 175 + 176 + /* background poll should still be blocked */ 177 + err = pthread_tryjoin_np(thread, (void **)&bg_ret); 178 + if (CHECK(err != EBUSY, "try_join", "err %d\n", err)) 179 + goto cleanup; 180 + 181 + /* now force notifications */ 182 + skel->bss->flags = BPF_RB_FORCE_WAKEUP; 183 + sample_cnt = 0; 184 + trigger_samples(); 185 + 186 + /* now we should get a pending notification */ 187 + usleep(50000); 188 + err = pthread_tryjoin_np(thread, (void **)&bg_ret); 189 + if (CHECK(err, "join_bg", "err %d\n", err)) 190 + goto cleanup; 
191 + 192 + if (CHECK(bg_ret != 1, "bg_ret", "epoll_wait result: %ld", bg_ret)) 193 + goto cleanup; 194 + 195 + /* 3 rounds, 2 samples each */ 196 + CHECK(sample_cnt != 6, "wrong_sample_cnt", 197 + "expected to see %d samples, got %d\n", 6, sample_cnt); 198 + 199 + /* BPF side did everything right */ 200 + CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n", 201 + 0L, skel->bss->dropped); 202 + CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n", 203 + 2L, skel->bss->total); 204 + CHECK(skel->bss->discarded != 1, "err_discarded", "exp %ld, got %ld\n", 205 + 1L, skel->bss->discarded); 206 + 207 + test_ringbuf__detach(skel); 208 + cleanup: 209 + ring_buffer__free(ringbuf); 210 + test_ringbuf__destroy(skel); 211 + }
+102
tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #define _GNU_SOURCE 3 + #include <test_progs.h> 4 + #include <sys/epoll.h> 5 + #include "test_ringbuf_multi.skel.h" 6 + 7 + static int duration = 0; 8 + 9 + struct sample { 10 + int pid; 11 + int seq; 12 + long value; 13 + char comm[16]; 14 + }; 15 + 16 + static int process_sample(void *ctx, void *data, size_t len) 17 + { 18 + int ring = (unsigned long)ctx; 19 + struct sample *s = data; 20 + 21 + switch (s->seq) { 22 + case 0: 23 + CHECK(ring != 1, "sample1_ring", "exp %d, got %d\n", 1, ring); 24 + CHECK(s->value != 333, "sample1_value", "exp %ld, got %ld\n", 25 + 333L, s->value); 26 + break; 27 + case 1: 28 + CHECK(ring != 2, "sample2_ring", "exp %d, got %d\n", 2, ring); 29 + CHECK(s->value != 777, "sample2_value", "exp %ld, got %ld\n", 30 + 777L, s->value); 31 + break; 32 + default: 33 + CHECK(true, "extra_sample", "unexpected sample seq %d, val %ld\n", 34 + s->seq, s->value); 35 + return -1; 36 + } 37 + 38 + return 0; 39 + } 40 + 41 + void test_ringbuf_multi(void) 42 + { 43 + struct test_ringbuf_multi *skel; 44 + struct ring_buffer *ringbuf; 45 + int err; 46 + 47 + skel = test_ringbuf_multi__open_and_load(); 48 + if (CHECK(!skel, "skel_open_load", "skeleton open&load failed\n")) 49 + return; 50 + 51 + /* only trigger BPF program for current process */ 52 + skel->bss->pid = getpid(); 53 + 54 + ringbuf = ring_buffer__new(bpf_map__fd(skel->maps.ringbuf1), 55 + process_sample, (void *)(long)1, NULL); 56 + if (CHECK(!ringbuf, "ringbuf_create", "failed to create ringbuf\n")) 57 + goto cleanup; 58 + 59 + err = ring_buffer__add(ringbuf, bpf_map__fd(skel->maps.ringbuf2), 60 + process_sample, (void *)(long)2); 61 + if (CHECK(err, "ringbuf_add", "failed to add another ring\n")) 62 + goto cleanup; 63 + 64 + err = test_ringbuf_multi__attach(skel); 65 + if (CHECK(err, "skel_attach", "skeleton attachment failed: %d\n", err)) 66 + goto cleanup; 67 + 68 + /* trigger few samples, some will be skipped */ 69 + skel->bss->target_ring = 
0; 70 + skel->bss->value = 333; 71 + syscall(__NR_getpgid); 72 + 73 + /* skipped, no ringbuf in slot 1 */ 74 + skel->bss->target_ring = 1; 75 + skel->bss->value = 555; 76 + syscall(__NR_getpgid); 77 + 78 + skel->bss->target_ring = 2; 79 + skel->bss->value = 777; 80 + syscall(__NR_getpgid); 81 + 82 + /* poll for samples, should get 2 ringbufs back */ 83 + err = ring_buffer__poll(ringbuf, -1); 84 + if (CHECK(err != 2, "poll_res", "expected 2 records, got %d\n", err)) 85 + goto cleanup; 86 + 87 + /* expect extra polling to return nothing */ 88 + err = ring_buffer__poll(ringbuf, 0); 89 + if (CHECK(err < 0, "extra_samples", "poll result: %d\n", err)) 90 + goto cleanup; 91 + 92 + CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n", 93 + 0L, skel->bss->dropped); 94 + CHECK(skel->bss->skipped != 1, "err_skipped", "exp %ld, got %ld\n", 95 + 1L, skel->bss->skipped); 96 + CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n", 97 + 2L, skel->bss->total); 98 + 99 + cleanup: 100 + ring_buffer__free(ringbuf); 101 + test_ringbuf_multi__destroy(skel); 102 + }
+30
tools/testing/selftests/bpf/prog_tests/skb_helpers.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <test_progs.h> 3 + #include <network_helpers.h> 4 + 5 + void test_skb_helpers(void) 6 + { 7 + struct __sk_buff skb = { 8 + .wire_len = 100, 9 + .gso_segs = 8, 10 + .gso_size = 10, 11 + }; 12 + struct bpf_prog_test_run_attr tattr = { 13 + .data_in = &pkt_v4, 14 + .data_size_in = sizeof(pkt_v4), 15 + .ctx_in = &skb, 16 + .ctx_size_in = sizeof(skb), 17 + .ctx_out = &skb, 18 + .ctx_size_out = sizeof(skb), 19 + }; 20 + struct bpf_object *obj; 21 + int err; 22 + 23 + err = bpf_prog_load("./test_skb_helpers.o", BPF_PROG_TYPE_SCHED_CLS, &obj, 24 + &tattr.prog_fd); 25 + if (CHECK_ATTR(err, "load", "err %d errno %d\n", err, errno)) 26 + return; 27 + err = bpf_prog_test_run_xattr(&tattr); 28 + CHECK_ATTR(err, "len", "err %d errno %d\n", err, errno); 29 + bpf_object__close(obj); 30 + }
+35
tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 // Copyright (c) 2020 Cloudflare 3 + #include <error.h> 3 4 4 5 #include "test_progs.h" 6 + #include "test_skmsg_load_helpers.skel.h" 5 7 6 8 #define TCP_REPAIR 19 /* TCP sock is under repair right now */ 7 9 ··· 72 70 close(s); 73 71 } 74 72 73 + static void test_skmsg_helpers(enum bpf_map_type map_type) 74 + { 75 + struct test_skmsg_load_helpers *skel; 76 + int err, map, verdict; 77 + 78 + skel = test_skmsg_load_helpers__open_and_load(); 79 + if (CHECK_FAIL(!skel)) { 80 + perror("test_skmsg_load_helpers__open_and_load"); 81 + return; 82 + } 83 + 84 + verdict = bpf_program__fd(skel->progs.prog_msg_verdict); 85 + map = bpf_map__fd(map_type == BPF_MAP_TYPE_SOCKMAP ? skel->maps.sock_map : skel->maps.sock_hash); 86 + 87 + err = bpf_prog_attach(verdict, map, BPF_SK_MSG_VERDICT, 0); 88 + if (CHECK_FAIL(err)) { 89 + perror("bpf_prog_attach"); 90 + goto out; 91 + } 92 + 93 + err = bpf_prog_detach2(verdict, map, BPF_SK_MSG_VERDICT); 94 + if (CHECK_FAIL(err)) { 95 + perror("bpf_prog_detach2"); 96 + goto out; 97 + } 98 + out: 99 + test_skmsg_load_helpers__destroy(skel); 100 + } 101 + 75 102 void test_sockmap_basic(void) 76 103 { 77 104 if (test__start_subtest("sockmap create_update_free")) 78 105 test_sockmap_create_update_free(BPF_MAP_TYPE_SOCKMAP); 79 106 if (test__start_subtest("sockhash create_update_free")) 80 107 test_sockmap_create_update_free(BPF_MAP_TYPE_SOCKHASH); 108 + if (test__start_subtest("sockmap sk_msg load helpers")) 109 + test_skmsg_helpers(BPF_MAP_TYPE_SOCKMAP); 110 + if (test__start_subtest("sockhash sk_msg load helpers")) 111 + test_skmsg_helpers(BPF_MAP_TYPE_SOCKHASH); 81 112 }
+97
tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <uapi/linux/bpf.h> 3 + #include <linux/if_link.h> 4 + #include <test_progs.h> 5 + 6 + #include "test_xdp_devmap_helpers.skel.h" 7 + #include "test_xdp_with_devmap_helpers.skel.h" 8 + 9 + #define IFINDEX_LO 1 10 + 11 + struct bpf_devmap_val { 12 + u32 ifindex; /* device index */ 13 + union { 14 + int fd; /* prog fd on map write */ 15 + u32 id; /* prog id on map read */ 16 + } bpf_prog; 17 + }; 18 + 19 + void test_xdp_with_devmap_helpers(void) 20 + { 21 + struct test_xdp_with_devmap_helpers *skel; 22 + struct bpf_prog_info info = {}; 23 + struct bpf_devmap_val val = { 24 + .ifindex = IFINDEX_LO, 25 + }; 26 + __u32 len = sizeof(info); 27 + __u32 duration = 0, idx = 0; 28 + int err, dm_fd, map_fd; 29 + 30 + 31 + skel = test_xdp_with_devmap_helpers__open_and_load(); 32 + if (CHECK_FAIL(!skel)) { 33 + perror("test_xdp_with_devmap_helpers__open_and_load"); 34 + return; 35 + } 36 + 37 + /* can not attach program with DEVMAPs that allow programs 38 + * as xdp generic 39 + */ 40 + dm_fd = bpf_program__fd(skel->progs.xdp_redir_prog); 41 + err = bpf_set_link_xdp_fd(IFINDEX_LO, dm_fd, XDP_FLAGS_SKB_MODE); 42 + CHECK(err == 0, "Generic attach of program with 8-byte devmap", 43 + "should have failed\n"); 44 + 45 + dm_fd = bpf_program__fd(skel->progs.xdp_dummy_dm); 46 + map_fd = bpf_map__fd(skel->maps.dm_ports); 47 + err = bpf_obj_get_info_by_fd(dm_fd, &info, &len); 48 + if (CHECK_FAIL(err)) 49 + goto out_close; 50 + 51 + val.bpf_prog.fd = dm_fd; 52 + err = bpf_map_update_elem(map_fd, &idx, &val, 0); 53 + CHECK(err, "Add program to devmap entry", 54 + "err %d errno %d\n", err, errno); 55 + 56 + err = bpf_map_lookup_elem(map_fd, &idx, &val); 57 + CHECK(err, "Read devmap entry", "err %d errno %d\n", err, errno); 58 + CHECK(info.id != val.bpf_prog.id, "Expected program id in devmap entry", 59 + "expected %u read %u\n", info.id, val.bpf_prog.id); 60 + 61 + /* can not attach BPF_XDP_DEVMAP program to a device */ 62 + err = 
bpf_set_link_xdp_fd(IFINDEX_LO, dm_fd, XDP_FLAGS_SKB_MODE); 63 + CHECK(err == 0, "Attach of BPF_XDP_DEVMAP program", 64 + "should have failed\n"); 65 + 66 + val.ifindex = 1; 67 + val.bpf_prog.fd = bpf_program__fd(skel->progs.xdp_dummy_prog); 68 + err = bpf_map_update_elem(map_fd, &idx, &val, 0); 69 + CHECK(err == 0, "Add non-BPF_XDP_DEVMAP program to devmap entry", 70 + "should have failed\n"); 71 + 72 + out_close: 73 + test_xdp_with_devmap_helpers__destroy(skel); 74 + } 75 + 76 + void test_neg_xdp_devmap_helpers(void) 77 + { 78 + struct test_xdp_devmap_helpers *skel; 79 + __u32 duration = 0; 80 + 81 + skel = test_xdp_devmap_helpers__open_and_load(); 82 + if (CHECK(skel, 83 + "Load of XDP program accessing egress ifindex without attach type", 84 + "should have failed\n")) { 85 + test_xdp_devmap_helpers__destroy(skel); 86 + } 87 + } 88 + 89 + 90 + void test_xdp_devmap_attach(void) 91 + { 92 + if (test__start_subtest("DEVMAP with programs in entries")) 93 + test_xdp_with_devmap_helpers(); 94 + 95 + if (test__start_subtest("Verifier check of DEVMAP programs")) 96 + test_neg_xdp_devmap_helpers(); 97 + }
+10 -10
tools/testing/selftests/bpf/progs/bpf_flow.c
··· 20 20 #include <bpf/bpf_endian.h> 21 21 22 22 int _version SEC("version") = 1; 23 - #define PROG(F) SEC(#F) int bpf_func_##F 23 + #define PROG(F) PROG_(F, _##F) 24 + #define PROG_(NUM, NAME) SEC("flow_dissector/"#NUM) int bpf_func##NAME 24 25 25 26 /* These are the identifiers of the BPF programs that will be used in tail 26 27 * calls. Name is limited to 16 characters, with the terminating character and 27 28 * bpf_func_ above, we have only 6 to work with, anything after will be cropped. 28 29 */ 29 - enum { 30 - IP, 31 - IPV6, 32 - IPV6OP, /* Destination/Hop-by-Hop Options IPv6 Extension header */ 33 - IPV6FR, /* Fragmentation IPv6 Extension Header */ 34 - MPLS, 35 - VLAN, 36 - }; 30 + #define IP 0 31 + #define IPV6 1 32 + #define IPV6OP 2 /* Destination/Hop-by-Hop Options IPv6 Ext. Header */ 33 + #define IPV6FR 3 /* Fragmentation IPv6 Extension Header */ 34 + #define MPLS 4 35 + #define VLAN 5 36 + #define MAX_PROG 6 37 37 38 38 #define IP_MF 0x2000 39 39 #define IP_OFFSET 0x1FFF ··· 59 59 60 60 struct { 61 61 __uint(type, BPF_MAP_TYPE_PROG_ARRAY); 62 - __uint(max_entries, 8); 62 + __uint(max_entries, MAX_PROG); 63 63 __uint(key_size, sizeof(__u32)); 64 64 __uint(value_size, sizeof(__u32)); 65 65 } jmp_table SEC(".maps");
+33
tools/testing/selftests/bpf/progs/connect4_prog.c
··· 9 9 #include <linux/in6.h> 10 10 #include <sys/socket.h> 11 11 #include <netinet/tcp.h> 12 + #include <linux/if.h> 13 + #include <errno.h> 12 14 13 15 #include <bpf/bpf_helpers.h> 14 16 #include <bpf/bpf_endian.h> ··· 21 19 22 20 #ifndef TCP_CA_NAME_MAX 23 21 #define TCP_CA_NAME_MAX 16 22 + #endif 23 + 24 + #ifndef IFNAMSIZ 25 + #define IFNAMSIZ 16 24 26 #endif 25 27 26 28 int _version SEC("version") = 1; ··· 81 75 return 0; 82 76 } 83 77 78 + static __inline int bind_to_device(struct bpf_sock_addr *ctx) 79 + { 80 + char veth1[IFNAMSIZ] = "test_sock_addr1"; 81 + char veth2[IFNAMSIZ] = "test_sock_addr2"; 82 + char missing[IFNAMSIZ] = "nonexistent_dev"; 83 + char del_bind[IFNAMSIZ] = ""; 84 + 85 + if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE, 86 + &veth1, sizeof(veth1))) 87 + return 1; 88 + if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE, 89 + &veth2, sizeof(veth2))) 90 + return 1; 91 + if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE, 92 + &missing, sizeof(missing)) != -ENODEV) 93 + return 1; 94 + if (bpf_setsockopt(ctx, SOL_SOCKET, SO_BINDTODEVICE, 95 + &del_bind, sizeof(del_bind))) 96 + return 1; 97 + 98 + return 0; 99 + } 100 + 84 101 SEC("cgroup/connect4") 85 102 int connect_v4_prog(struct bpf_sock_addr *ctx) 86 103 { ··· 116 87 117 88 tuple.ipv4.daddr = bpf_htonl(DST_REWRITE_IP4); 118 89 tuple.ipv4.dport = bpf_htons(DST_REWRITE_PORT4); 90 + 91 + /* Bind to device and unbind it. */ 92 + if (bind_to_device(ctx)) 93 + return 0; 119 94 120 95 if (ctx->type != SOCK_STREAM && ctx->type != SOCK_DGRAM) 121 96 return 0;
+33
tools/testing/selftests/bpf/progs/perfbuf_bench.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + // Copyright (c) 2020 Facebook 3 + 4 + #include <linux/bpf.h> 5 + #include <stdint.h> 6 + #include <bpf/bpf_helpers.h> 7 + 8 + char _license[] SEC("license") = "GPL"; 9 + 10 + struct { 11 + __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); 12 + __uint(value_size, sizeof(int)); 13 + __uint(key_size, sizeof(int)); 14 + } perfbuf SEC(".maps"); 15 + 16 + const volatile int batch_cnt = 0; 17 + 18 + long sample_val = 42; 19 + long dropped __attribute__((aligned(128))) = 0; 20 + 21 + SEC("fentry/__x64_sys_getpgid") 22 + int bench_perfbuf(void *ctx) 23 + { 24 + __u64 *sample; 25 + int i; 26 + 27 + for (i = 0; i < batch_cnt; i++) { 28 + if (bpf_perf_event_output(ctx, &perfbuf, BPF_F_CURRENT_CPU, 29 + &sample_val, sizeof(sample_val))) 30 + __sync_add_and_fetch(&dropped, 1); 31 + } 32 + return 0; 33 + }
+60
tools/testing/selftests/bpf/progs/ringbuf_bench.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + // Copyright (c) 2020 Facebook 3 + 4 + #include <linux/bpf.h> 5 + #include <stdint.h> 6 + #include <bpf/bpf_helpers.h> 7 + 8 + char _license[] SEC("license") = "GPL"; 9 + 10 + struct { 11 + __uint(type, BPF_MAP_TYPE_RINGBUF); 12 + } ringbuf SEC(".maps"); 13 + 14 + const volatile int batch_cnt = 0; 15 + const volatile long use_output = 0; 16 + 17 + long sample_val = 42; 18 + long dropped __attribute__((aligned(128))) = 0; 19 + 20 + const volatile long wakeup_data_size = 0; 21 + 22 + static __always_inline long get_flags() 23 + { 24 + long sz; 25 + 26 + if (!wakeup_data_size) 27 + return 0; 28 + 29 + sz = bpf_ringbuf_query(&ringbuf, BPF_RB_AVAIL_DATA); 30 + return sz >= wakeup_data_size ? BPF_RB_FORCE_WAKEUP : BPF_RB_NO_WAKEUP; 31 + } 32 + 33 + SEC("fentry/__x64_sys_getpgid") 34 + int bench_ringbuf(void *ctx) 35 + { 36 + long *sample, flags; 37 + int i; 38 + 39 + if (!use_output) { 40 + for (i = 0; i < batch_cnt; i++) { 41 + sample = bpf_ringbuf_reserve(&ringbuf, 42 + sizeof(sample_val), 0); 43 + if (!sample) { 44 + __sync_add_and_fetch(&dropped, 1); 45 + } else { 46 + *sample = sample_val; 47 + flags = get_flags(); 48 + bpf_ringbuf_submit(sample, flags); 49 + } 50 + } 51 + } else { 52 + for (i = 0; i < batch_cnt; i++) { 53 + flags = get_flags(); 54 + if (bpf_ringbuf_output(&ringbuf, &sample_val, 55 + sizeof(sample_val), flags)) 56 + __sync_add_and_fetch(&dropped, 1); 57 + } 58 + } 59 + return 0; 60 + }
+78
tools/testing/selftests/bpf/progs/test_ringbuf.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + // Copyright (c) 2020 Facebook 3 + 4 + #include <linux/bpf.h> 5 + #include <bpf/bpf_helpers.h> 6 + 7 + char _license[] SEC("license") = "GPL"; 8 + 9 + struct sample { 10 + int pid; 11 + int seq; 12 + long value; 13 + char comm[16]; 14 + }; 15 + 16 + struct { 17 + __uint(type, BPF_MAP_TYPE_RINGBUF); 18 + __uint(max_entries, 1 << 12); 19 + } ringbuf SEC(".maps"); 20 + 21 + /* inputs */ 22 + int pid = 0; 23 + long value = 0; 24 + long flags = 0; 25 + 26 + /* outputs */ 27 + long total = 0; 28 + long discarded = 0; 29 + long dropped = 0; 30 + 31 + long avail_data = 0; 32 + long ring_size = 0; 33 + long cons_pos = 0; 34 + long prod_pos = 0; 35 + 36 + /* inner state */ 37 + long seq = 0; 38 + 39 + SEC("tp/syscalls/sys_enter_getpgid") 40 + int test_ringbuf(void *ctx) 41 + { 42 + int cur_pid = bpf_get_current_pid_tgid() >> 32; 43 + struct sample *sample; 44 + int zero = 0; 45 + 46 + if (cur_pid != pid) 47 + return 0; 48 + 49 + sample = bpf_ringbuf_reserve(&ringbuf, sizeof(*sample), 0); 50 + if (!sample) { 51 + __sync_fetch_and_add(&dropped, 1); 52 + return 1; 53 + } 54 + 55 + sample->pid = pid; 56 + bpf_get_current_comm(sample->comm, sizeof(sample->comm)); 57 + sample->value = value; 58 + 59 + sample->seq = seq++; 60 + __sync_fetch_and_add(&total, 1); 61 + 62 + if (sample->seq & 1) { 63 + /* copy from reserved sample to a new one... */ 64 + bpf_ringbuf_output(&ringbuf, sample, sizeof(*sample), flags); 65 + /* ...and then discard reserved sample */ 66 + bpf_ringbuf_discard(sample, flags); 67 + __sync_fetch_and_add(&discarded, 1); 68 + } else { 69 + bpf_ringbuf_submit(sample, flags); 70 + } 71 + 72 + avail_data = bpf_ringbuf_query(&ringbuf, BPF_RB_AVAIL_DATA); 73 + ring_size = bpf_ringbuf_query(&ringbuf, BPF_RB_RING_SIZE); 74 + cons_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_CONS_POS); 75 + prod_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_PROD_POS); 76 + 77 + return 0; 78 + }
+77
tools/testing/selftests/bpf/progs/test_ringbuf_multi.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + // Copyright (c) 2020 Facebook 3 + 4 + #include <linux/bpf.h> 5 + #include <bpf/bpf_helpers.h> 6 + 7 + char _license[] SEC("license") = "GPL"; 8 + 9 + struct sample { 10 + int pid; 11 + int seq; 12 + long value; 13 + char comm[16]; 14 + }; 15 + 16 + struct ringbuf_map { 17 + __uint(type, BPF_MAP_TYPE_RINGBUF); 18 + __uint(max_entries, 1 << 12); 19 + } ringbuf1 SEC(".maps"), 20 + ringbuf2 SEC(".maps"); 21 + 22 + struct { 23 + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS); 24 + __uint(max_entries, 4); 25 + __type(key, int); 26 + __array(values, struct ringbuf_map); 27 + } ringbuf_arr SEC(".maps") = { 28 + .values = { 29 + [0] = &ringbuf1, 30 + [2] = &ringbuf2, 31 + }, 32 + }; 33 + 34 + /* inputs */ 35 + int pid = 0; 36 + int target_ring = 0; 37 + long value = 0; 38 + 39 + /* outputs */ 40 + long total = 0; 41 + long dropped = 0; 42 + long skipped = 0; 43 + 44 + SEC("tp/syscalls/sys_enter_getpgid") 45 + int test_ringbuf(void *ctx) 46 + { 47 + int cur_pid = bpf_get_current_pid_tgid() >> 32; 48 + struct sample *sample; 49 + void *rb; 50 + int zero = 0; 51 + 52 + if (cur_pid != pid) 53 + return 0; 54 + 55 + rb = bpf_map_lookup_elem(&ringbuf_arr, &target_ring); 56 + if (!rb) { 57 + skipped += 1; 58 + return 1; 59 + } 60 + 61 + sample = bpf_ringbuf_reserve(rb, sizeof(*sample), 0); 62 + if (!sample) { 63 + dropped += 1; 64 + return 1; 65 + } 66 + 67 + sample->pid = pid; 68 + bpf_get_current_comm(sample->comm, sizeof(sample->comm)); 69 + sample->value = value; 70 + 71 + sample->seq = total; 72 + total += 1; 73 + 74 + bpf_ringbuf_submit(sample, 0); 75 + 76 + return 0; 77 + }
+28
tools/testing/selftests/bpf/progs/test_skb_helpers.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + #include "vmlinux.h" 3 + #include <bpf/bpf_helpers.h> 4 + #include <bpf/bpf_endian.h> 5 + 6 + #define TEST_COMM_LEN 16 7 + 8 + struct { 9 + __uint(type, BPF_MAP_TYPE_CGROUP_ARRAY); 10 + __uint(max_entries, 1); 11 + __type(key, u32); 12 + __type(value, u32); 13 + } cgroup_map SEC(".maps"); 14 + 15 + char _license[] SEC("license") = "GPL"; 16 + 17 + SEC("classifier/test_skb_helpers") 18 + int test_skb_helpers(struct __sk_buff *skb) 19 + { 20 + struct task_struct *task; 21 + char comm[TEST_COMM_LEN]; 22 + __u32 tpid; 23 + 24 + task = (struct task_struct *)bpf_get_current_task(); 25 + bpf_probe_read_kernel(&tpid , sizeof(tpid), &task->tgid); 26 + bpf_probe_read_kernel_str(&comm, sizeof(comm), &task->comm); 27 + return 0; 28 + }
+47
tools/testing/selftests/bpf/progs/test_skmsg_load_helpers.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + // Copyright (c) 2020 Isovalent, Inc. 3 + #include "vmlinux.h" 4 + #include <bpf/bpf_helpers.h> 5 + 6 + struct { 7 + __uint(type, BPF_MAP_TYPE_SOCKMAP); 8 + __uint(max_entries, 2); 9 + __type(key, __u32); 10 + __type(value, __u64); 11 + } sock_map SEC(".maps"); 12 + 13 + struct { 14 + __uint(type, BPF_MAP_TYPE_SOCKHASH); 15 + __uint(max_entries, 2); 16 + __type(key, __u32); 17 + __type(value, __u64); 18 + } sock_hash SEC(".maps"); 19 + 20 + struct { 21 + __uint(type, BPF_MAP_TYPE_SK_STORAGE); 22 + __uint(map_flags, BPF_F_NO_PREALLOC); 23 + __type(key, __u32); 24 + __type(value, __u64); 25 + } socket_storage SEC(".maps"); 26 + 27 + SEC("sk_msg") 28 + int prog_msg_verdict(struct sk_msg_md *msg) 29 + { 30 + struct task_struct *task = (struct task_struct *)bpf_get_current_task(); 31 + int verdict = SK_PASS; 32 + __u32 pid, tpid; 33 + __u64 *sk_stg; 34 + 35 + pid = bpf_get_current_pid_tgid() >> 32; 36 + sk_stg = bpf_sk_storage_get(&socket_storage, msg->sk, 0, BPF_SK_STORAGE_GET_F_CREATE); 37 + if (!sk_stg) 38 + return SK_DROP; 39 + *sk_stg = pid; 40 + bpf_probe_read_kernel(&tpid , sizeof(tpid), &task->tgid); 41 + if (pid != tpid) 42 + verdict = SK_DROP; 43 + bpf_sk_storage_delete(&socket_storage, (void *)msg->sk); 44 + return verdict; 45 + } 46 + 47 + char _license[] SEC("license") = "GPL";
+45 -1
tools/testing/selftests/bpf/progs/test_sockmap_kern.h
··· 79 79 80 80 struct { 81 81 __uint(type, BPF_MAP_TYPE_ARRAY); 82 - __uint(max_entries, 1); 82 + __uint(max_entries, 2); 83 83 __type(key, int); 84 84 __type(value, int); 85 85 } sock_skb_opts SEC(".maps"); 86 + 87 + struct { 88 + __uint(type, TEST_MAP_TYPE); 89 + __uint(max_entries, 20); 90 + __uint(key_size, sizeof(int)); 91 + __uint(value_size, sizeof(int)); 92 + } tls_sock_map SEC(".maps"); 86 93 87 94 SEC("sk_skb1") 88 95 int bpf_prog1(struct __sk_buff *skb) ··· 123 116 return bpf_sk_redirect_hash(skb, &sock_map, &ret, flags); 124 117 #endif 125 118 119 + } 120 + 121 + SEC("sk_skb3") 122 + int bpf_prog3(struct __sk_buff *skb) 123 + { 124 + const int one = 1; 125 + int err, *f, ret = SK_PASS; 126 + void *data_end; 127 + char *c; 128 + 129 + err = bpf_skb_pull_data(skb, 19); 130 + if (err) 131 + goto tls_out; 132 + 133 + c = (char *)(long)skb->data; 134 + data_end = (void *)(long)skb->data_end; 135 + 136 + if (c + 18 < data_end) 137 + memcpy(&c[13], "PASS", 4); 138 + f = bpf_map_lookup_elem(&sock_skb_opts, &one); 139 + if (f && *f) { 140 + __u64 flags = 0; 141 + 142 + ret = 0; 143 + flags = *f; 144 + #ifdef SOCKMAP 145 + return bpf_sk_redirect_map(skb, &tls_sock_map, ret, flags); 146 + #else 147 + return bpf_sk_redirect_hash(skb, &tls_sock_map, &ret, flags); 148 + #endif 149 + } 150 + 151 + f = bpf_map_lookup_elem(&sock_skb_opts, &one); 152 + if (f && *f) 153 + ret = SK_DROP; 154 + tls_out: 155 + return ret; 126 156 } 127 157 128 158 SEC("sockops")
+22
tools/testing/selftests/bpf/progs/test_xdp_devmap_helpers.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* fails to load without expected_attach_type = BPF_XDP_DEVMAP 3 + * because of access to egress_ifindex 4 + */ 5 + #include "vmlinux.h" 6 + #include <bpf/bpf_helpers.h> 7 + 8 + SEC("xdp_dm_log") 9 + int xdpdm_devlog(struct xdp_md *ctx) 10 + { 11 + char fmt[] = "devmap redirect: dev %u -> dev %u len %u\n"; 12 + void *data_end = (void *)(long)ctx->data_end; 13 + void *data = (void *)(long)ctx->data; 14 + unsigned int len = data_end - data; 15 + 16 + bpf_trace_printk(fmt, sizeof(fmt), 17 + ctx->ingress_ifindex, ctx->egress_ifindex, len); 18 + 19 + return XDP_PASS; 20 + } 21 + 22 + char _license[] SEC("license") = "GPL";
+44
tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include "vmlinux.h" 4 + #include <bpf/bpf_helpers.h> 5 + 6 + struct { 7 + __uint(type, BPF_MAP_TYPE_DEVMAP); 8 + __uint(key_size, sizeof(__u32)); 9 + __uint(value_size, sizeof(struct bpf_devmap_val)); 10 + __uint(max_entries, 4); 11 + } dm_ports SEC(".maps"); 12 + 13 + SEC("xdp_redir") 14 + int xdp_redir_prog(struct xdp_md *ctx) 15 + { 16 + return bpf_redirect_map(&dm_ports, 1, 0); 17 + } 18 + 19 + /* invalid program on DEVMAP entry; 20 + * SEC name means expected attach type not set 21 + */ 22 + SEC("xdp_dummy") 23 + int xdp_dummy_prog(struct xdp_md *ctx) 24 + { 25 + return XDP_PASS; 26 + } 27 + 28 + /* valid program on DEVMAP entry via SEC name; 29 + * has access to egress and ingress ifindex 30 + */ 31 + SEC("xdp_devmap") 32 + int xdp_dummy_dm(struct xdp_md *ctx) 33 + { 34 + char fmt[] = "devmap redirect: dev %u -> dev %u len %u\n"; 35 + void *data_end = (void *)(long)ctx->data_end; 36 + void *data = (void *)(long)ctx->data; 37 + unsigned int len = data_end - data; 38 + 39 + bpf_trace_printk(fmt, sizeof(fmt), 40 + ctx->ingress_ifindex, ctx->egress_ifindex, len); 41 + 42 + return XDP_PASS; 43 + } 44 + char _license[] SEC("license") = "GPL";
+47 -5
tools/testing/selftests/bpf/test_maps.c
··· 1394 1394 1395 1395 key = 1; 1396 1396 value = 1234; 1397 - /* Insert key=1 element. */ 1397 + /* Try to insert key=1 element. */ 1398 1398 assert(bpf_map_update_elem(fd, &key, &value, BPF_ANY) == -1 && 1399 1399 errno == EPERM); 1400 1400 1401 - /* Check that key=2 is not found. */ 1401 + /* Check that key=1 is not found. */ 1402 1402 assert(bpf_map_lookup_elem(fd, &key, &value) == -1 && errno == ENOENT); 1403 1403 assert(bpf_map_get_next_key(fd, &key, &value) == -1 && errno == ENOENT); 1404 + 1405 + close(fd); 1404 1406 } 1405 1407 1406 - static void test_map_wronly(void) 1408 + static void test_map_wronly_hash(void) 1407 1409 { 1408 1410 int fd, key = 0, value = 0; 1409 1411 1410 1412 fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1411 1413 MAP_SIZE, map_flags | BPF_F_WRONLY); 1412 1414 if (fd < 0) { 1413 - printf("Failed to create map for read only test '%s'!\n", 1415 + printf("Failed to create map for write only test '%s'!\n", 1414 1416 strerror(errno)); 1415 1417 exit(1); 1416 1418 } ··· 1422 1420 /* Insert key=1 element. */ 1423 1421 assert(bpf_map_update_elem(fd, &key, &value, BPF_ANY) == 0); 1424 1422 1425 - /* Check that key=2 is not found. */ 1423 + /* Check that reading elements and keys from the map is not allowed. 
*/ 1426 1424 assert(bpf_map_lookup_elem(fd, &key, &value) == -1 && errno == EPERM); 1427 1425 assert(bpf_map_get_next_key(fd, &key, &value) == -1 && errno == EPERM); 1426 + 1427 + close(fd); 1428 + } 1429 + 1430 + static void test_map_wronly_stack_or_queue(enum bpf_map_type map_type) 1431 + { 1432 + int fd, value = 0; 1433 + 1434 + assert(map_type == BPF_MAP_TYPE_QUEUE || 1435 + map_type == BPF_MAP_TYPE_STACK); 1436 + fd = bpf_create_map(map_type, 0, sizeof(value), MAP_SIZE, 1437 + map_flags | BPF_F_WRONLY); 1438 + /* Stack/Queue maps do not support BPF_F_NO_PREALLOC */ 1439 + if (map_flags & BPF_F_NO_PREALLOC) { 1440 + assert(fd < 0 && errno == EINVAL); 1441 + return; 1442 + } 1443 + if (fd < 0) { 1444 + printf("Failed to create map '%s'!\n", strerror(errno)); 1445 + exit(1); 1446 + } 1447 + 1448 + value = 1234; 1449 + assert(bpf_map_update_elem(fd, NULL, &value, BPF_ANY) == 0); 1450 + 1451 + /* Peek element should fail */ 1452 + assert(bpf_map_lookup_elem(fd, NULL, &value) == -1 && errno == EPERM); 1453 + 1454 + /* Pop element should fail */ 1455 + assert(bpf_map_lookup_and_delete_elem(fd, NULL, &value) == -1 && 1456 + errno == EPERM); 1457 + 1458 + close(fd); 1459 + } 1460 + 1461 + static void test_map_wronly(void) 1462 + { 1463 + test_map_wronly_hash(); 1464 + test_map_wronly_stack_or_queue(BPF_MAP_TYPE_STACK); 1465 + test_map_wronly_stack_or_queue(BPF_MAP_TYPE_QUEUE); 1428 1466 } 1429 1467 1430 1468 static void prepare_reuseport_grp(int type, int map_fd, size_t map_elem_size,
+142 -21
tools/testing/selftests/bpf/test_sockmap.c
··· 63 63 int test_cnt; 64 64 int passed; 65 65 int failed; 66 - int map_fd[8]; 67 - struct bpf_map *maps[8]; 66 + int map_fd[9]; 67 + struct bpf_map *maps[9]; 68 68 int prog_fd[11]; 69 69 70 70 int txmsg_pass; ··· 79 79 int txmsg_start_pop; 80 80 int txmsg_pop; 81 81 int txmsg_ingress; 82 - int txmsg_skb; 82 + int txmsg_redir_skb; 83 + int txmsg_ktls_skb; 84 + int txmsg_ktls_skb_drop; 85 + int txmsg_ktls_skb_redir; 83 86 int ktls; 84 87 int peek_flag; 85 88 ··· 107 104 {"txmsg_start_pop", required_argument, NULL, 'w'}, 108 105 {"txmsg_pop", required_argument, NULL, 'x'}, 109 106 {"txmsg_ingress", no_argument, &txmsg_ingress, 1 }, 110 - {"txmsg_skb", no_argument, &txmsg_skb, 1 }, 107 + {"txmsg_redir_skb", no_argument, &txmsg_redir_skb, 1 }, 111 108 {"ktls", no_argument, &ktls, 1 }, 112 109 {"peek", no_argument, &peek_flag, 1 }, 113 110 {"whitelist", required_argument, NULL, 'n' }, ··· 172 169 txmsg_start_push = txmsg_end_push = 0; 173 170 txmsg_pass = txmsg_drop = txmsg_redir = 0; 174 171 txmsg_apply = txmsg_cork = 0; 175 - txmsg_ingress = txmsg_skb = 0; 172 + txmsg_ingress = txmsg_redir_skb = 0; 173 + txmsg_ktls_skb = txmsg_ktls_skb_drop = txmsg_ktls_skb_redir = 0; 176 174 } 177 175 178 176 static int test_start_subtest(const struct _test *t, struct sockmap_options *o) ··· 506 502 507 503 static int msg_verify_data(struct msghdr *msg, int size, int chunk_sz) 508 504 { 509 - int i, j, bytes_cnt = 0; 505 + int i, j = 0, bytes_cnt = 0; 510 506 unsigned char k = 0; 511 507 512 508 for (i = 0; i < msg->msg_iovlen; i++) { 513 509 unsigned char *d = msg->msg_iov[i].iov_base; 514 510 515 - for (j = 0; 516 - j < msg->msg_iov[i].iov_len && size; j++) { 511 + /* Special case test for skb ingress + ktls */ 512 + if (i == 0 && txmsg_ktls_skb) { 513 + if (msg->msg_iov[i].iov_len < 4) 514 + return -EIO; 515 + if (txmsg_ktls_skb_redir) { 516 + if (memcmp(&d[13], "PASS", 4) != 0) { 517 + fprintf(stderr, 518 + "detected redirect ktls_skb data error with skb ingress update 
@iov[%i]:%i \"%02x %02x %02x %02x\" != \"PASS\"\n", i, 0, d[13], d[14], d[15], d[16]); 519 + return -EIO; 520 + } 521 + d[13] = 0; 522 + d[14] = 1; 523 + d[15] = 2; 524 + d[16] = 3; 525 + j = 13; 526 + } else if (txmsg_ktls_skb) { 527 + if (memcmp(d, "PASS", 4) != 0) { 528 + fprintf(stderr, 529 + "detected ktls_skb data error with skb ingress update @iov[%i]:%i \"%02x %02x %02x %02x\" != \"PASS\"\n", i, 0, d[0], d[1], d[2], d[3]); 530 + return -EIO; 531 + } 532 + d[0] = 0; 533 + d[1] = 1; 534 + d[2] = 2; 535 + d[3] = 3; 536 + } 537 + } 538 + 539 + for (; j < msg->msg_iov[i].iov_len && size; j++) { 517 540 if (d[j] != k++) { 518 541 fprintf(stderr, 519 542 "detected data corruption @iov[%i]:%i %02x != %02x, %02x ?= %02x\n", ··· 755 724 rxpid = fork(); 756 725 if (rxpid == 0) { 757 726 iov_buf -= (txmsg_pop - txmsg_start_pop + 1); 758 - if (opt->drop_expected) 727 + if (opt->drop_expected || txmsg_ktls_skb_drop) 759 728 _exit(0); 760 729 761 730 if (!iov_buf) /* zero bytes sent case */ ··· 942 911 return err; 943 912 } 944 913 914 + /* Attach programs to TLS sockmap */ 915 + if (txmsg_ktls_skb) { 916 + err = bpf_prog_attach(prog_fd[0], map_fd[8], 917 + BPF_SK_SKB_STREAM_PARSER, 0); 918 + if (err) { 919 + fprintf(stderr, 920 + "ERROR: bpf_prog_attach (TLS sockmap %i->%i): %d (%s)\n", 921 + prog_fd[0], map_fd[8], err, strerror(errno)); 922 + return err; 923 + } 924 + 925 + err = bpf_prog_attach(prog_fd[2], map_fd[8], 926 + BPF_SK_SKB_STREAM_VERDICT, 0); 927 + if (err) { 928 + fprintf(stderr, "ERROR: bpf_prog_attach (TLS sockmap): %d (%s)\n", 929 + err, strerror(errno)); 930 + return err; 931 + } 932 + } 933 + 945 934 /* Attach to cgroups */ 946 - err = bpf_prog_attach(prog_fd[2], cg_fd, BPF_CGROUP_SOCK_OPS, 0); 935 + err = bpf_prog_attach(prog_fd[3], cg_fd, BPF_CGROUP_SOCK_OPS, 0); 947 936 if (err) { 948 937 fprintf(stderr, "ERROR: bpf_prog_attach (groups): %d (%s)\n", 949 938 err, strerror(errno)); ··· 979 928 980 929 /* Attach txmsg program to sockmap */ 981 930 if 
(txmsg_pass) 982 - tx_prog_fd = prog_fd[3]; 983 - else if (txmsg_redir) 984 931 tx_prog_fd = prog_fd[4]; 985 - else if (txmsg_apply) 932 + else if (txmsg_redir) 986 933 tx_prog_fd = prog_fd[5]; 987 - else if (txmsg_cork) 934 + else if (txmsg_apply) 988 935 tx_prog_fd = prog_fd[6]; 989 - else if (txmsg_drop) 936 + else if (txmsg_cork) 990 937 tx_prog_fd = prog_fd[7]; 938 + else if (txmsg_drop) 939 + tx_prog_fd = prog_fd[8]; 991 940 else 992 941 tx_prog_fd = 0; 993 942 ··· 1159 1108 } 1160 1109 } 1161 1110 1162 - if (txmsg_skb) { 1111 + if (txmsg_ktls_skb) { 1112 + int ingress = BPF_F_INGRESS; 1113 + 1114 + i = 0; 1115 + err = bpf_map_update_elem(map_fd[8], &i, &p2, BPF_ANY); 1116 + if (err) { 1117 + fprintf(stderr, 1118 + "ERROR: bpf_map_update_elem (c1 sockmap): %d (%s)\n", 1119 + err, strerror(errno)); 1120 + } 1121 + 1122 + if (txmsg_ktls_skb_redir) { 1123 + i = 1; 1124 + err = bpf_map_update_elem(map_fd[7], 1125 + &i, &ingress, BPF_ANY); 1126 + if (err) { 1127 + fprintf(stderr, 1128 + "ERROR: bpf_map_update_elem (txmsg_ingress): %d (%s)\n", 1129 + err, strerror(errno)); 1130 + } 1131 + } 1132 + 1133 + if (txmsg_ktls_skb_drop) { 1134 + i = 1; 1135 + err = bpf_map_update_elem(map_fd[7], &i, &i, BPF_ANY); 1136 + } 1137 + } 1138 + 1139 + if (txmsg_redir_skb) { 1163 1140 int skb_fd = (test == SENDMSG || test == SENDPAGE) ? 
1164 1141 p2 : p1; 1165 1142 int ingress = BPF_F_INGRESS; ··· 1202 1123 } 1203 1124 1204 1125 i = 3; 1205 - err = bpf_map_update_elem(map_fd[0], 1206 - &i, &skb_fd, BPF_ANY); 1126 + err = bpf_map_update_elem(map_fd[0], &i, &skb_fd, BPF_ANY); 1207 1127 if (err) { 1208 1128 fprintf(stderr, 1209 1129 "ERROR: bpf_map_update_elem (c1 sockmap): %d (%s)\n", ··· 1236 1158 fprintf(stderr, "unknown test\n"); 1237 1159 out: 1238 1160 /* Detatch and zero all the maps */ 1239 - bpf_prog_detach2(prog_fd[2], cg_fd, BPF_CGROUP_SOCK_OPS); 1161 + bpf_prog_detach2(prog_fd[3], cg_fd, BPF_CGROUP_SOCK_OPS); 1240 1162 bpf_prog_detach2(prog_fd[0], map_fd[0], BPF_SK_SKB_STREAM_PARSER); 1241 1163 bpf_prog_detach2(prog_fd[1], map_fd[0], BPF_SK_SKB_STREAM_VERDICT); 1164 + bpf_prog_detach2(prog_fd[0], map_fd[8], BPF_SK_SKB_STREAM_PARSER); 1165 + bpf_prog_detach2(prog_fd[2], map_fd[8], BPF_SK_SKB_STREAM_VERDICT); 1166 + 1242 1167 if (tx_prog_fd >= 0) 1243 1168 bpf_prog_detach2(tx_prog_fd, map_fd[1], BPF_SK_MSG_VERDICT); 1244 1169 ··· 1310 1229 } 1311 1230 if (txmsg_ingress) 1312 1231 strncat(options, "ingress,", OPTSTRING); 1313 - if (txmsg_skb) 1314 - strncat(options, "skb,", OPTSTRING); 1232 + if (txmsg_redir_skb) 1233 + strncat(options, "redir_skb,", OPTSTRING); 1234 + if (txmsg_ktls_skb) 1235 + strncat(options, "ktls_skb,", OPTSTRING); 1315 1236 if (ktls) 1316 1237 strncat(options, "ktls,", OPTSTRING); 1317 1238 if (peek_flag) ··· 1444 1361 txmsg_ingress = txmsg_redir = 1; 1445 1362 test_send(opt, cgrp); 1446 1363 } 1364 + 1365 + static void test_txmsg_skb(int cgrp, struct sockmap_options *opt) 1366 + { 1367 + bool data = opt->data_test; 1368 + int k = ktls; 1369 + 1370 + opt->data_test = true; 1371 + ktls = 1; 1372 + 1373 + txmsg_pass = txmsg_drop = 0; 1374 + txmsg_ingress = txmsg_redir = 0; 1375 + txmsg_ktls_skb = 1; 1376 + txmsg_pass = 1; 1377 + 1378 + /* Using data verification so ensure iov layout is 1379 + * expected from test receiver side. e.g. 
has enough 1380 + * bytes to write test code. 1381 + */ 1382 + opt->iov_length = 100; 1383 + opt->iov_count = 1; 1384 + opt->rate = 1; 1385 + test_exec(cgrp, opt); 1386 + 1387 + txmsg_ktls_skb_drop = 1; 1388 + test_exec(cgrp, opt); 1389 + 1390 + txmsg_ktls_skb_drop = 0; 1391 + txmsg_ktls_skb_redir = 1; 1392 + test_exec(cgrp, opt); 1393 + 1394 + opt->data_test = data; 1395 + ktls = k; 1396 + } 1397 + 1447 1398 1448 1399 /* Test cork with hung data. This tests poor usage patterns where 1449 1400 * cork can leave data on the ring if user program is buggy and ··· 1659 1542 "sock_bytes", 1660 1543 "sock_redir_flags", 1661 1544 "sock_skb_opts", 1545 + "tls_sock_map", 1662 1546 }; 1663 1547 1664 1548 int prog_attach_type[] = { 1665 1549 BPF_SK_SKB_STREAM_PARSER, 1550 + BPF_SK_SKB_STREAM_VERDICT, 1666 1551 BPF_SK_SKB_STREAM_VERDICT, 1667 1552 BPF_CGROUP_SOCK_OPS, 1668 1553 BPF_SK_MSG_VERDICT, ··· 1677 1558 }; 1678 1559 1679 1560 int prog_type[] = { 1561 + BPF_PROG_TYPE_SK_SKB, 1680 1562 BPF_PROG_TYPE_SK_SKB, 1681 1563 BPF_PROG_TYPE_SK_SKB, 1682 1564 BPF_PROG_TYPE_SOCK_OPS, ··· 1740 1620 {"txmsg test redirect", test_txmsg_redir}, 1741 1621 {"txmsg test drop", test_txmsg_drop}, 1742 1622 {"txmsg test ingress redirect", test_txmsg_ingress_redir}, 1623 + {"txmsg test skb", test_txmsg_skb}, 1743 1624 {"txmsg test apply", test_txmsg_apply}, 1744 1625 {"txmsg test cork", test_txmsg_cork}, 1745 1626 {"txmsg test hanging corks", test_txmsg_cork_hangs},
+2 -2
tools/testing/selftests/bpf/verifier/and.c
··· 15 15 BPF_EXIT_INSN(), 16 16 }, 17 17 .fixup_map_hash_48b = { 3 }, 18 - .errstr = "R0 max value is outside of the array range", 18 + .errstr = "R0 max value is outside of the allowed memory range", 19 19 .result = REJECT, 20 20 .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS, 21 21 }, ··· 44 44 BPF_EXIT_INSN(), 45 45 }, 46 46 .fixup_map_hash_48b = { 3 }, 47 - .errstr = "R0 max value is outside of the array range", 47 + .errstr = "R0 max value is outside of the allowed memory range", 48 48 .result = REJECT, 49 49 .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS, 50 50 },
+2 -2
tools/testing/selftests/bpf/verifier/array_access.c
··· 117 117 BPF_EXIT_INSN(), 118 118 }, 119 119 .fixup_map_hash_48b = { 3 }, 120 - .errstr = "R0 min value is outside of the array range", 120 + .errstr = "R0 min value is outside of the allowed memory range", 121 121 .result = REJECT, 122 122 .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS, 123 123 }, ··· 137 137 BPF_EXIT_INSN(), 138 138 }, 139 139 .fixup_map_hash_48b = { 3 }, 140 - .errstr = "R0 unbounded memory access, make sure to bounds check any array access into a map", 140 + .errstr = "R0 unbounded memory access, make sure to bounds check any such access", 141 141 .result = REJECT, 142 142 .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS, 143 143 },
+3 -3
tools/testing/selftests/bpf/verifier/bounds.c
··· 20 20 BPF_EXIT_INSN(), 21 21 }, 22 22 .fixup_map_hash_8b = { 3 }, 23 - .errstr = "R0 max value is outside of the array range", 23 + .errstr = "R0 max value is outside of the allowed memory range", 24 24 .result = REJECT, 25 25 }, 26 26 { ··· 146 146 BPF_EXIT_INSN(), 147 147 }, 148 148 .fixup_map_hash_8b = { 3 }, 149 - .errstr = "R0 min value is outside of the array range", 149 + .errstr = "R0 min value is outside of the allowed memory range", 150 150 .result = REJECT 151 151 }, 152 152 { ··· 354 354 BPF_EXIT_INSN(), 355 355 }, 356 356 .fixup_map_hash_8b = { 3 }, 357 - .errstr = "R0 max value is outside of the array range", 357 + .errstr = "R0 max value is outside of the allowed memory range", 358 358 .result = REJECT 359 359 }, 360 360 {
+1 -1
tools/testing/selftests/bpf/verifier/calls.c
··· 105 105 .prog_type = BPF_PROG_TYPE_SCHED_CLS, 106 106 .fixup_map_hash_8b = { 16 }, 107 107 .result = REJECT, 108 - .errstr = "R0 min value is outside of the array range", 108 + .errstr = "R0 min value is outside of the allowed memory range", 109 109 }, 110 110 { 111 111 "calls: overlapping caller/callee",
+2 -2
tools/testing/selftests/bpf/verifier/direct_value_access.c
··· 68 68 }, 69 69 .fixup_map_array_48b = { 1 }, 70 70 .result = REJECT, 71 - .errstr = "R1 min value is outside of the array range", 71 + .errstr = "R1 min value is outside of the allowed memory range", 72 72 }, 73 73 { 74 74 "direct map access, write test 7", ··· 220 220 }, 221 221 .fixup_map_array_small = { 1 }, 222 222 .result = REJECT, 223 - .errstr = "R1 min value is outside of the array range", 223 + .errstr = "R1 min value is outside of the allowed memory range", 224 224 }, 225 225 { 226 226 "direct map access, write test 19",
+1 -1
tools/testing/selftests/bpf/verifier/helper_access_var_len.c
··· 318 318 BPF_EXIT_INSN(), 319 319 }, 320 320 .fixup_map_hash_48b = { 4 }, 321 - .errstr = "R1 min value is outside of the array range", 321 + .errstr = "R1 min value is outside of the allowed memory range", 322 322 .result = REJECT, 323 323 .prog_type = BPF_PROG_TYPE_TRACEPOINT, 324 324 },
+3 -3
tools/testing/selftests/bpf/verifier/helper_value_access.c
··· 280 280 BPF_EXIT_INSN(), 281 281 }, 282 282 .fixup_map_hash_48b = { 3 }, 283 - .errstr = "R1 min value is outside of the array range", 283 + .errstr = "R1 min value is outside of the allowed memory range", 284 284 .result = REJECT, 285 285 .prog_type = BPF_PROG_TYPE_TRACEPOINT, 286 286 }, ··· 415 415 BPF_EXIT_INSN(), 416 416 }, 417 417 .fixup_map_hash_48b = { 3 }, 418 - .errstr = "R1 min value is outside of the array range", 418 + .errstr = "R1 min value is outside of the allowed memory range", 419 419 .result = REJECT, 420 420 .prog_type = BPF_PROG_TYPE_TRACEPOINT, 421 421 }, ··· 926 926 }, 927 927 .fixup_map_hash_16b = { 3, 10 }, 928 928 .result = REJECT, 929 - .errstr = "R2 unbounded memory access, make sure to bounds check any array access into a map", 929 + .errstr = "R2 unbounded memory access, make sure to bounds check any such access", 930 930 .prog_type = BPF_PROG_TYPE_TRACEPOINT, 931 931 }, 932 932 {
+4 -4
tools/testing/selftests/bpf/verifier/value_ptr_arith.c
··· 50 50 .fixup_map_array_48b = { 8 }, 51 51 .result = ACCEPT, 52 52 .result_unpriv = REJECT, 53 - .errstr_unpriv = "R0 min value is outside of the array range", 53 + .errstr_unpriv = "R0 min value is outside of the allowed memory range", 54 54 .retval = 1, 55 55 }, 56 56 { ··· 325 325 }, 326 326 .fixup_map_array_48b = { 3 }, 327 327 .result = REJECT, 328 - .errstr = "R0 min value is outside of the array range", 328 + .errstr = "R0 min value is outside of the allowed memory range", 329 329 .result_unpriv = REJECT, 330 330 .errstr_unpriv = "R0 pointer arithmetic of map value goes out of range", 331 331 }, ··· 601 601 }, 602 602 .fixup_map_array_48b = { 3 }, 603 603 .result = REJECT, 604 - .errstr = "R1 max value is outside of the array range", 604 + .errstr = "R1 max value is outside of the allowed memory range", 605 605 .errstr_unpriv = "R1 pointer arithmetic of map value goes out of range", 606 606 .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS, 607 607 }, ··· 726 726 }, 727 727 .fixup_map_array_48b = { 3 }, 728 728 .result = REJECT, 729 - .errstr = "R0 min value is outside of the array range", 729 + .errstr = "R0 min value is outside of the allowed memory range", 730 730 }, 731 731 { 732 732 "map access: value_ptr -= known scalar, 2",