
Merge branch 'bpf: Add user-space-publisher ring buffer map type'

David Vernet says:

====================
This patch set defines a new map type, BPF_MAP_TYPE_USER_RINGBUF, which
provides single-user-space-producer / single-kernel-consumer semantics over
a ring buffer. Along with the new map type, a helper function called
bpf_user_ringbuf_drain() is added which allows a BPF program to specify a
callback with the following signature, to which samples are posted by the
helper:

long (*callback_fn)(struct bpf_dynptr *dynptr, void *context);

The program can then use the bpf_dynptr_read() or bpf_dynptr_data() helper
functions to safely read the sample from the dynptr. There are currently no
helpers available to determine the size of the sample, but one could easily
be added if required.
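
As an illustrative sketch (not taken from the patch set itself), a BPF
program could drain user-space samples as follows. The map name, sample
layout, and attach point are hypothetical; the map type, helper, and
callback signature are the ones introduced by this series:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical sample layout; the user-space producer must agree on it. */
struct sample {
	int pid;
	long value;
};

struct {
	__uint(type, BPF_MAP_TYPE_USER_RINGBUF);
	__uint(max_entries, 256 * 1024); /* buffer size in bytes */
} user_ringbuf SEC(".maps");

static long handle_sample(struct bpf_dynptr *dynptr, void *ctx)
{
	struct sample s;

	/* Safely copy the sample out of the ring buffer via the dynptr. */
	if (bpf_dynptr_read(&s, sizeof(s), dynptr, 0, 0))
		return 0; /* skip this sample, keep draining */

	bpf_printk("pid=%d value=%ld", s.pid, s.value);
	return 0; /* returning 1 would stop draining early */
}

SEC("tracepoint/syscalls/sys_enter_getpgid")
int drain_user_samples(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, handle_sample, NULL, 0);
	return 0;
}

char _license[] SEC("license") = "GPL";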

On the user-space side, libbpf has been updated to export a new
'struct user_ring_buffer' type, along with the following symbols:

struct user_ring_buffer *
user_ring_buffer__new(int map_fd,
		      const struct user_ring_buffer_opts *opts);
void user_ring_buffer__free(struct user_ring_buffer *rb);
void *user_ring_buffer__reserve(struct user_ring_buffer *rb, __u32 size);
void *user_ring_buffer__reserve_blocking(struct user_ring_buffer *rb,
					 __u32 size, int timeout_ms);
void user_ring_buffer__discard(struct user_ring_buffer *rb, void *sample);
void user_ring_buffer__submit(struct user_ring_buffer *rb, void *sample);

These symbols are exported for inclusion in libbpf version 1.1.0.
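
A producer follows a reserve/commit pattern with this API. A minimal
sketch under the same hypothetical 'struct sample' as above (map_fd
would come from a loaded BPF_MAP_TYPE_USER_RINGBUF map, e.g. via
bpf_map__fd(); error handling abbreviated):

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <bpf/libbpf.h>

/* Hypothetical sample layout, shared with the BPF program. */
struct sample {
	int pid;
	long value;
};

static int publish_sample(int map_fd, long value)
{
	struct user_ring_buffer *rb;
	struct sample *s;

	rb = user_ring_buffer__new(map_fd, NULL);
	if (!rb)
		return -errno;

	/* Reserve an 8-byte-aligned region, blocking up to 100ms for space. */
	s = user_ring_buffer__reserve_blocking(rb, sizeof(*s), 100);
	if (!s) {
		user_ring_buffer__free(rb);
		return -errno;
	}

	s->pid = getpid();
	s->value = value;
	user_ring_buffer__submit(rb, s);

	/* Kick the kernel so an attached BPF program can drain the sample. */
	syscall(__NR_getpgid);

	user_ring_buffer__free(rb);
	return 0;
}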

Signed-off-by: David Vernet <void@manifault.com>
---
v5 -> v6:
- Fixed s/BPF_MAP_TYPE_RINGBUF/BPF_MAP_TYPE_USER_RINGBUF typo in the
libbpf user ringbuf doxygen header comment for ring_buffer_user__new()
(Andrii).
- Specify that pointer returned from ring_buffer_user__reserve() and its
blocking counterpart is 8-byte aligned (Andrii).
- Renamed user_ringbuf__commit() to user_ringbuf_commit(), as it's static
(Andrii).
- Another slight reworking of user_ring_buffer__reserve_blocking() to
remove some extraneous nanosecond variables + checking (Andrii).
- Add a final check of user_ring_buffer__reserve() in
user_ring_buffer__reserve_blocking().
- Moved busy bit lock / unlock logic from __bpf_user_ringbuf_peek() to
bpf_user_ringbuf_drain() (Andrii).
- -ENOSPC -> -ENODATA for an empty ring buffer in
__bpf_user_ringbuf_peek() (Andrii).
- Updated BPF_RB_FORCE_WAKEUP to force a wakeup notification to be
sent even if no sample was drained.
- Changed a bit of the wording in the UAPI header for
bpf_user_ringbuf_drain() to mention the BPF_RB_FORCE_WAKEUP behavior.
- Remove extra space after return in ringbuf_map_poll_user() (Andrii).
- Removed now-extraneous paragraph from the commit summary of patch 2/4
(Andrii).
v4 -> v5:
- DENYLISTed the user-ringbuf test suite on s390x. We have a number of
functions in the progs/user_ringbuf_success.c prog that user-space
fires by invoking a syscall. Not all of these syscalls are available
on s390x. If and when we add the ability to kick the kernel from
user-space, or if we end up using iterators for that per Hao's
suggestion, we could re-enable this test suite on s390x.
- Fixed a few more places that needed ringbuffer -> ring buffer.
v3 -> v4:
- Update BPF_MAX_USER_RINGBUF_SAMPLES to not specify a bit, and instead
just specify a number of samples. (Andrii)
- Update "ringbuffer" in comments and commit summaries to say "ring
buffer". (Andrii)
- Return -E2BIG from bpf_user_ringbuf_drain() both when a sample can't
fit into the ring buffer, and when it can't fit into a dynptr. (Andrii)
- Don't loop over samples in __bpf_user_ringbuf_peek() if a sample was
discarded. Instead, return -EAGAIN so the caller can deal with it. Also
updated the caller to detect -EAGAIN and skip over it when iterating.
(Andrii)
- Removed the heuristic for notifying user-space when a sample is drained,
causing the ring buffer to no longer be full. This may be useful in the
future, but is being removed now because it's strictly a heuristic.
- Re-add BPF_RB_FORCE_WAKEUP flag to bpf_user_ringbuf_drain(). (Andrii)
- Remove helper_allocated_dynptr tracker from verifier. (Andrii)
- Add libbpf function header comments to tools/lib/bpf/libbpf.h, so that
they will be included in rendered libbpf docs. (Andrii)
- Add symbols to a new LIBBPF_1.1.0 section in linker version script,
rather than including them in LIBBPF_1.0.0. (Andrii)
- Remove libbpf_err() calls from static libbpf functions. (Andrii)
- Check user_ring_buffer_opts instead of ring_buffer_opts in
user_ring_buffer__new(). (Andrii)
- Avoid an extra if in the hot path in user_ringbuf__commit(). (Andrii)
- Use ENOSPC rather than ENODATA if no space is available in the ring
buffer. (Andrii)
- Don't round sample size in header to 8, but still round size that is
reserved and written to 8, and validate positions are multiples of 8
(Andrii).
- Use nanoseconds for most calculations in
user_ring_buffer__reserve_blocking(). (Andrii)
- Don't use CHECK() in testcases, instead use ASSERT_*. (Andrii)
- Use SEC("?raw_tp") instead of SEC("?raw_tp/sys_nanosleep") in negative
test. (Andrii)
- Move test_user_ringbuf.h header to live next to BPF program instead of
a directory up from both it and the user-space test program. (Andrii)
- Update bpftool help message / docs to also include user_ringbuf.
v2 -> v3:
- Lots of formatting fixes, such as keeping things on one line if they fit
within 100 characters, and removing some extraneous newlines. Applies
to all diffs in the patch-set. (Andrii)
- Renamed ring_buffer_user__* symbols to user_ring_buffer__*. (Andrii)
- Added a missing smp_mb__before_atomic() in
__bpf_user_ringbuf_sample_release(). (Hao)
- Restructure how and when notification events are sent from the kernel to
the user-space producers via the .map_poll() callback for the
BPF_MAP_TYPE_USER_RINGBUF map. Before, we only sent a notification when
the ringbuffer was fully drained. Now, we guarantee user-space that
we'll send an event at least once per bpf_user_ringbuf_drain(), as long
as at least one sample was drained, and BPF_RB_NO_WAKEUP was not passed.
As a heuristic, we also send a notification event any time a sample being
drained causes the ringbuffer to no longer be full. (Andrii)
- Continuing on the above point, updated
user_ring_buffer__reserve_blocking() to loop around epoll_wait() until
enough space is available for the requested sample. (Andrii)
- Communicate BPF_RINGBUF_BUSY_BIT and BPF_RINGBUF_DISCARD_BIT in sample
headers. The ringbuffer implementation still only supports
single-producer semantics, but we can now add synchronization support in
user_ring_buffer__reserve(), and will automatically get multi-producer
semantics. (Andrii)
- Updated some commit summaries, specifically adding more details where
warranted. (Andrii)
- Improved function documentation for bpf_user_ringbuf_drain(), more
clearly explaining all function arguments and return types, as well as
the semantics for waking up user-space producers.
- Add function header comments for user_ring_buffer__reserve{_blocking}().
(Andrii)
- Round all samples up to 8 bytes in the user-space producer, and
enforce that all samples are properly aligned in the kernel. (Andrii)
- Added testcases that verify that bpf_user_ringbuf_drain() properly
validates samples, and returns error conditions if any invalid samples
are encountered. (Andrii)
- Move atomic_t busy field out of the consumer page, and into the
struct bpf_ringbuf. (Andrii)
- Split ringbuf_map_{mmap, poll}_{kern, user}() into separate
implementations. (Andrii)
- Don't silently consume errors in bpf_user_ringbuf_drain(). (Andrii)
- Remove magic number of samples (4096) from bpf_user_ringbuf_drain(),
and instead use BPF_MAX_USER_RINGBUF_SAMPLES macro, which allows
128k samples. (Andrii)
- Remove MEM_ALLOC modifier from PTR_TO_DYNPTR register in verifier, and
instead rely solely on the register being PTR_TO_DYNPTR. (Andrii)
- Move freeing of atomic_t busy bit to before we invoke irq_work_queue() in
__bpf_user_ringbuf_sample_release(). (Andrii)
- Only check for BPF_RB_NO_WAKEUP flag in bpf_ringbuf_drain().
- Remove libbpf function names from kernel smp_{load, store}* comments in
the kernel. (Andrii)
- Don't use double-underscore naming convention in libbpf functions.
(Andrii)
- Use proper __u32 and __u64 for types where we need to guarantee their
size. (Andrii)

v1 -> v2:
- Following Joanne landing 883743422ced ("bpf: Fix ref_obj_id for dynptr
data slices in verifier") [0], removed [PATCH 1/5] bpf: Clear callee
saved regs after updating REG0 [1]. (Joanne)
- Following the above adjustment, updated check_helper_call() to not store
a reference for bpf_dynptr_data() if the register containing the dynptr
is of type MEM_ALLOC. (Joanne)
- Fixed casting issue pointed out by kernel test robot by adding a missing
(uintptr_t) cast. (lkp)

[0] https://lore.kernel.org/all/20220809214055.4050604-1-joannelkoong@gmail.com/
[1] https://lore.kernel.org/all/20220808155341.2479054-1-void@manifault.com/
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

+1967 -21
+9 -2
include/linux/bpf.h
··· 451 451 /* DYNPTR points to memory local to the bpf program. */ 452 452 DYNPTR_TYPE_LOCAL = BIT(8 + BPF_BASE_TYPE_BITS), 453 453 454 - /* DYNPTR points to a ringbuf record. */ 454 + /* DYNPTR points to a kernel-produced ringbuf record. */ 455 455 DYNPTR_TYPE_RINGBUF = BIT(9 + BPF_BASE_TYPE_BITS), 456 456 457 457 /* Size is known at compile time. */ ··· 656 656 PTR_TO_MEM, /* reg points to valid memory region */ 657 657 PTR_TO_BUF, /* reg points to a read/write buffer */ 658 658 PTR_TO_FUNC, /* reg points to a bpf program function */ 659 + PTR_TO_DYNPTR, /* reg points to a dynptr */ 659 660 __BPF_REG_TYPE_MAX, 660 661 661 662 /* Extended reg_types. */ ··· 1394 1393 1395 1394 #define BPF_MAP_CAN_READ BIT(0) 1396 1395 #define BPF_MAP_CAN_WRITE BIT(1) 1396 + 1397 + /* Maximum number of user-producer ring buffer samples that can be drained in 1398 + * a call to bpf_user_ringbuf_drain(). 1399 + */ 1400 + #define BPF_MAX_USER_RINGBUF_SAMPLES (128 * 1024) 1397 1401 1398 1402 static inline u32 bpf_map_flags_to_cap(struct bpf_map *map) 1399 1403 { ··· 2501 2495 extern const struct bpf_func_proto bpf_copy_from_user_task_proto; 2502 2496 extern const struct bpf_func_proto bpf_set_retval_proto; 2503 2497 extern const struct bpf_func_proto bpf_get_retval_proto; 2498 + extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; 2504 2499 2505 2500 const struct bpf_func_proto *tracing_prog_func_proto( 2506 2501 enum bpf_func_id func_id, const struct bpf_prog *prog); ··· 2646 2639 BPF_DYNPTR_TYPE_INVALID, 2647 2640 /* Points to memory that is local to the bpf program */ 2648 2641 BPF_DYNPTR_TYPE_LOCAL, 2649 - /* Underlying data is a ringbuf record */ 2642 + /* Underlying data is a kernel-produced ringbuf record */ 2650 2643 BPF_DYNPTR_TYPE_RINGBUF, 2651 2644 }; 2652 2645
+1
include/linux/bpf_types.h
··· 126 126 #endif 127 127 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops) 128 128 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops) 129 + BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops) 129 130 130 131 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint) 131 132 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
+39
include/uapi/linux/bpf.h
··· 928 928 BPF_MAP_TYPE_INODE_STORAGE, 929 929 BPF_MAP_TYPE_TASK_STORAGE, 930 930 BPF_MAP_TYPE_BLOOM_FILTER, 931 + BPF_MAP_TYPE_USER_RINGBUF, 931 932 }; 932 933 933 934 /* Note that tracing related programs such as ··· 5388 5387 * Return 5389 5388 * Current *ktime*. 5390 5389 * 5390 + * long bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx, u64 flags) 5391 + * Description 5392 + * Drain samples from the specified user ring buffer, and invoke 5393 + * the provided callback for each such sample: 5394 + * 5395 + * long (\*callback_fn)(struct bpf_dynptr \*dynptr, void \*ctx); 5396 + * 5397 + * If **callback_fn** returns 0, the helper will continue to try 5398 + * and drain the next sample, up to a maximum of 5399 + * BPF_MAX_USER_RINGBUF_SAMPLES samples. If the return value is 1, 5400 + * the helper will skip the rest of the samples and return. Other 5401 + * return values are not used now, and will be rejected by the 5402 + * verifier. 5403 + * Return 5404 + * The number of drained samples if no error was encountered while 5405 + * draining samples, or 0 if no samples were present in the ring 5406 + * buffer. If a user-space producer was epoll-waiting on this map, 5407 + * and at least one sample was drained, they will receive an event 5408 + * notification notifying them of available space in the ring 5409 + * buffer. If the BPF_RB_NO_WAKEUP flag is passed to this 5410 + * function, no wakeup notification will be sent. If the 5411 + * BPF_RB_FORCE_WAKEUP flag is passed, a wakeup notification will 5412 + * be sent even if no sample was drained. 5413 + * 5414 + * On failure, the returned value is one of the following: 5415 + * 5416 + * **-EBUSY** if the ring buffer is contended, and another calling 5417 + * context was concurrently draining the ring buffer. 5418 + * 5419 + * **-EINVAL** if user-space is not properly tracking the ring 5420 + * buffer due to the producer position not being aligned to 8 5421 + * bytes, a sample not being aligned to 8 bytes, or the producer 5422 + * position not matching the advertised length of a sample. 5423 + * 5424 + * **-E2BIG** if user-space has tried to publish a sample which is 5425 + * larger than the size of the ring buffer, or which cannot fit 5426 + * within a struct bpf_dynptr. 5391 5427 */ 5392 5428 #define __BPF_FUNC_MAPPER(FN) \ 5393 5429 FN(unspec), \ ··· 5636 5598 FN(tcp_raw_check_syncookie_ipv4), \ 5637 5599 FN(tcp_raw_check_syncookie_ipv6), \ 5638 5600 FN(ktime_get_tai_ns), \ 5601 + FN(user_ringbuf_drain), \ 5639 5602 /* */ 5640 5603 5641 5604 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
+2
kernel/bpf/helpers.c
··· 1659 1659 return &bpf_for_each_map_elem_proto; 1660 1660 case BPF_FUNC_loop: 1661 1661 return &bpf_loop_proto; 1662 + case BPF_FUNC_user_ringbuf_drain: 1663 + return &bpf_user_ringbuf_drain_proto; 1662 1664 default: 1663 1665 break; 1664 1666 }
+232 -11
kernel/bpf/ringbuf.c
··· 38 38 struct page **pages; 39 39 int nr_pages; 40 40 spinlock_t spinlock ____cacheline_aligned_in_smp; 41 - /* Consumer and producer counters are put into separate pages to allow 42 - * mapping consumer page as r/w, but restrict producer page to r/o. 43 - * This protects producer position from being modified by user-space 44 - * application and ruining in-kernel position tracking. 41 + /* For user-space producer ring buffers, an atomic_t busy bit is used 42 + * to synchronize access to the ring buffers in the kernel, rather than 43 + * the spinlock that is used for kernel-producer ring buffers. This is 44 + * done because the ring buffer must hold a lock across a BPF program's 45 + * callback: 46 + * 47 + * __bpf_user_ringbuf_peek() // lock acquired 48 + * -> program callback_fn() 49 + * -> __bpf_user_ringbuf_sample_release() // lock released 50 + * 51 + * It is unsafe and incorrect to hold an IRQ spinlock across what could 52 + * be a long execution window, so we instead simply disallow concurrent 53 + * access to the ring buffer by kernel consumers, and return -EBUSY from 54 + * __bpf_user_ringbuf_peek() if the busy bit is held by another task. 55 + */ 56 + atomic_t busy ____cacheline_aligned_in_smp; 57 + /* Consumer and producer counters are put into separate pages to 58 + * allow each position to be mapped with different permissions. 59 + * This prevents a user-space application from modifying the 60 + * position and ruining in-kernel tracking. The permissions of the 61 + * pages depend on who is producing samples: user-space or the 62 + * kernel. 63 + * 64 + * Kernel-producer 65 + * --------------- 66 + * The producer position and data pages are mapped as r/o in 67 + * userspace. For this approach, bits in the header of samples are 68 + * used to signal to user-space, and to other producers, whether a 69 + * sample is currently being written. 70 + * 71 + * User-space producer 72 + * ------------------- 73 + * Only the page containing the consumer position is mapped r/o in 74 + * user-space. User-space producers also use bits of the header to 75 + * communicate to the kernel, but the kernel must carefully check and 76 + * validate each sample to ensure that they're correctly formatted, and 77 + * fully contained within the ring buffer. 45 78 */ 46 79 unsigned long consumer_pos __aligned(PAGE_SIZE); 47 80 unsigned long producer_pos __aligned(PAGE_SIZE); ··· 169 136 return NULL; 170 137 171 138 spin_lock_init(&rb->spinlock); 139 + atomic_set(&rb->busy, 0); 172 140 init_waitqueue_head(&rb->waitq); 173 141 init_irq_work(&rb->work, bpf_ringbuf_notify); 174 142 ··· 258 224 return -ENOTSUPP; 259 225 } 260 226 261 - static int ringbuf_map_mmap(struct bpf_map *map, struct vm_area_struct *vma) 227 + static int ringbuf_map_mmap_kern(struct bpf_map *map, struct vm_area_struct *vma) 262 228 { 263 229 struct bpf_ringbuf_map *rb_map; 264 230 ··· 276 242 vma->vm_pgoff + RINGBUF_PGOFF); 277 243 } 278 244 245 + static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma) 246 + { 247 + struct bpf_ringbuf_map *rb_map; 248 + 249 + rb_map = container_of(map, struct bpf_ringbuf_map, map); 250 + 251 + if (vma->vm_flags & VM_WRITE) { 252 + if (vma->vm_pgoff == 0) 253 + /* Disallow writable mappings to the consumer pointer, 254 + * and allow writable mappings to both the producer 255 + * position, and the ring buffer data itself. 
256 + */ 257 + return -EPERM; 258 + } else { 259 + vma->vm_flags &= ~VM_MAYWRITE; 260 + } 261 + /* remap_vmalloc_range() checks size and offset constraints */ 262 + return remap_vmalloc_range(vma, rb_map->rb, vma->vm_pgoff + RINGBUF_PGOFF); 263 + } 264 + 279 265 static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb) 280 266 { 281 267 unsigned long cons_pos, prod_pos; ··· 305 251 return prod_pos - cons_pos; 306 252 } 307 253 308 - static __poll_t ringbuf_map_poll(struct bpf_map *map, struct file *filp, 309 - struct poll_table_struct *pts) 254 + static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb) 255 + { 256 + return rb->mask + 1; 257 + } 258 + 259 + static __poll_t ringbuf_map_poll_kern(struct bpf_map *map, struct file *filp, 260 + struct poll_table_struct *pts) 310 261 { 311 262 struct bpf_ringbuf_map *rb_map; 312 263 ··· 323 264 return 0; 324 265 } 325 266 267 + static __poll_t ringbuf_map_poll_user(struct bpf_map *map, struct file *filp, 268 + struct poll_table_struct *pts) 269 + { 270 + struct bpf_ringbuf_map *rb_map; 271 + 272 + rb_map = container_of(map, struct bpf_ringbuf_map, map); 273 + poll_wait(filp, &rb_map->rb->waitq, pts); 274 + 275 + if (ringbuf_avail_data_sz(rb_map->rb) < ringbuf_total_data_sz(rb_map->rb)) 276 + return EPOLLOUT | EPOLLWRNORM; 277 + return 0; 278 + } 279 + 326 280 BTF_ID_LIST_SINGLE(ringbuf_map_btf_ids, struct, bpf_ringbuf_map) 327 281 const struct bpf_map_ops ringbuf_map_ops = { 328 282 .map_meta_equal = bpf_map_meta_equal, 329 283 .map_alloc = ringbuf_map_alloc, 330 284 .map_free = ringbuf_map_free, 331 - .map_mmap = ringbuf_map_mmap, 332 - .map_poll = ringbuf_map_poll, 285 + .map_mmap = ringbuf_map_mmap_kern, 286 + .map_poll = ringbuf_map_poll_kern, 333 287 .map_lookup_elem = ringbuf_map_lookup_elem, 334 288 .map_update_elem = ringbuf_map_update_elem, 335 289 .map_delete_elem = ringbuf_map_delete_elem, 336 290 .map_get_next_key = ringbuf_map_get_next_key, 337 291 .map_btf_id = &ringbuf_map_btf_ids[0], 292 + }; 293 + 294 + BTF_ID_LIST_SINGLE(user_ringbuf_map_btf_ids, struct, bpf_ringbuf_map) 295 + const struct bpf_map_ops user_ringbuf_map_ops = { 296 + .map_meta_equal = bpf_map_meta_equal, 297 + .map_alloc = ringbuf_map_alloc, 298 + .map_free = ringbuf_map_free, 299 + .map_mmap = ringbuf_map_mmap_user, 300 + .map_poll = ringbuf_map_poll_user, 301 + .map_lookup_elem = ringbuf_map_lookup_elem, 302 + .map_update_elem = ringbuf_map_update_elem, 303 + .map_delete_elem = ringbuf_map_delete_elem, 304 + .map_get_next_key = ringbuf_map_get_next_key, 305 + .map_btf_id = &user_ringbuf_map_btf_ids[0], 338 306 }; 339 307 340 308 /* Given pointer to ring buffer record metadata and struct bpf_ringbuf itself, ··· 398 312 return NULL; 399 313 400 314 len = round_up(size + BPF_RINGBUF_HDR_SZ, 8); 401 - if (len > rb->mask + 1) 315 + if (len > ringbuf_total_data_sz(rb)) 402 316 return NULL; 403 317 404 318 cons_pos = smp_load_acquire(&rb->consumer_pos); ··· 545 459 case BPF_RB_AVAIL_DATA: 546 460 return ringbuf_avail_data_sz(rb); 547 461 case BPF_RB_RING_SIZE: 548 - return rb->mask + 1; 462 + return ringbuf_total_data_sz(rb); 549 463 case BPF_RB_CONS_POS: 550 464 return smp_load_acquire(&rb->consumer_pos); 551 465 case BPF_RB_PROD_POS: ··· 638 552 .ret_type = RET_VOID, 639 553 .arg1_type = ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_RINGBUF | OBJ_RELEASE, 640 554 .arg2_type = ARG_ANYTHING, 555 + }; 556 + 557 + static int __bpf_user_ringbuf_peek(struct bpf_ringbuf *rb, void **sample, u32 *size) 558 + { 559 + int err; 560 + u32 hdr_len, sample_len, total_len, flags, 
*hdr; 561 + u64 cons_pos, prod_pos; 562 + 563 + /* Synchronizes with smp_store_release() in user-space producer. */ 564 + prod_pos = smp_load_acquire(&rb->producer_pos); 565 + if (prod_pos % 8) 566 + return -EINVAL; 567 + 568 + /* Synchronizes with smp_store_release() in __bpf_user_ringbuf_sample_release() */ 569 + cons_pos = smp_load_acquire(&rb->consumer_pos); 570 + if (cons_pos >= prod_pos) 571 + return -ENODATA; 572 + 573 + hdr = (u32 *)((uintptr_t)rb->data + (uintptr_t)(cons_pos & rb->mask)); 574 + /* Synchronizes with smp_store_release() in user-space producer. */ 575 + hdr_len = smp_load_acquire(hdr); 576 + flags = hdr_len & (BPF_RINGBUF_BUSY_BIT | BPF_RINGBUF_DISCARD_BIT); 577 + sample_len = hdr_len & ~flags; 578 + total_len = round_up(sample_len + BPF_RINGBUF_HDR_SZ, 8); 579 + 580 + /* The sample must fit within the region advertised by the producer position. */ 581 + if (total_len > prod_pos - cons_pos) 582 + return -EINVAL; 583 + 584 + /* The sample must fit within the data region of the ring buffer. */ 585 + if (total_len > ringbuf_total_data_sz(rb)) 586 + return -E2BIG; 587 + 588 + /* The sample must fit into a struct bpf_dynptr. */ 589 + err = bpf_dynptr_check_size(sample_len); 590 + if (err) 591 + return -E2BIG; 592 + 593 + if (flags & BPF_RINGBUF_DISCARD_BIT) { 594 + /* If the discard bit is set, the sample should be skipped. 595 + * 596 + * Update the consumer pos, and return -EAGAIN so the caller 597 + * knows to skip this sample and try to read the next one. 598 + */ 599 + smp_store_release(&rb->consumer_pos, cons_pos + total_len); 600 + return -EAGAIN; 601 + } 602 + 603 + if (flags & BPF_RINGBUF_BUSY_BIT) 604 + return -ENODATA; 605 + 606 + *sample = (void *)((uintptr_t)rb->data + 607 + (uintptr_t)((cons_pos + BPF_RINGBUF_HDR_SZ) & rb->mask)); 608 + *size = sample_len; 609 + return 0; 610 + } 611 + 612 + static void __bpf_user_ringbuf_sample_release(struct bpf_ringbuf *rb, size_t size, u64 flags) 613 + { 614 + u64 consumer_pos; 615 + u32 rounded_size = round_up(size + BPF_RINGBUF_HDR_SZ, 8); 616 + 617 + /* Using smp_load_acquire() is unnecessary here, as the busy-bit 618 + * prevents another task from writing to consumer_pos after it was read 619 + * by this task with smp_load_acquire() in __bpf_user_ringbuf_peek(). 620 + */ 621 + consumer_pos = rb->consumer_pos; 622 + /* Synchronizes with smp_load_acquire() in user-space producer. */ 623 + smp_store_release(&rb->consumer_pos, consumer_pos + rounded_size); 624 + } 625 + 626 + BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map, 627 + void *, callback_fn, void *, callback_ctx, u64, flags) 628 + { 629 + struct bpf_ringbuf *rb; 630 + long samples, discarded_samples = 0, ret = 0; 631 + bpf_callback_t callback = (bpf_callback_t)callback_fn; 632 + u64 wakeup_flags = BPF_RB_NO_WAKEUP | BPF_RB_FORCE_WAKEUP; 633 + int busy = 0; 634 + 635 + if (unlikely(flags & ~wakeup_flags)) 636 + return -EINVAL; 637 + 638 + rb = container_of(map, struct bpf_ringbuf_map, map)->rb; 639 + 640 + /* If another consumer is already consuming a sample, wait for them to finish. 
*/ 641 + if (!atomic_try_cmpxchg(&rb->busy, &busy, 1)) 642 + return -EBUSY; 643 + 644 + for (samples = 0; samples < BPF_MAX_USER_RINGBUF_SAMPLES && ret == 0; samples++) { 645 + int err; 646 + u32 size; 647 + void *sample; 648 + struct bpf_dynptr_kern dynptr; 649 + 650 + err = __bpf_user_ringbuf_peek(rb, &sample, &size); 651 + if (err) { 652 + if (err == -ENODATA) { 653 + break; 654 + } else if (err == -EAGAIN) { 655 + discarded_samples++; 656 + continue; 657 + } else { 658 + ret = err; 659 + goto schedule_work_return; 660 + } 661 + } 662 + 663 + bpf_dynptr_init(&dynptr, sample, BPF_DYNPTR_TYPE_LOCAL, 0, size); 664 + ret = callback((uintptr_t)&dynptr, (uintptr_t)callback_ctx, 0, 0, 0); 665 + __bpf_user_ringbuf_sample_release(rb, size, flags); 666 + } 667 + ret = samples - discarded_samples; 668 + 669 + schedule_work_return: 670 + /* Prevent the clearing of the busy-bit from being reordered before the 671 + * storing of any rb consumer or producer positions. 672 + */ 673 + smp_mb__before_atomic(); 674 + atomic_set(&rb->busy, 0); 675 + 676 + if (flags & BPF_RB_FORCE_WAKEUP) 677 + irq_work_queue(&rb->work); 678 + else if (!(flags & BPF_RB_NO_WAKEUP) && samples > 0) 679 + irq_work_queue(&rb->work); 680 + return ret; 681 + } 682 + 683 + const struct bpf_func_proto bpf_user_ringbuf_drain_proto = { 684 + .func = bpf_user_ringbuf_drain, 685 + .ret_type = RET_INTEGER, 686 + .arg1_type = ARG_CONST_MAP_PTR, 687 + .arg2_type = ARG_PTR_TO_FUNC, 688 + .arg3_type = ARG_PTR_TO_STACK_OR_NULL, 689 + .arg4_type = ARG_ANYTHING, 641 690 };
+59 -3
kernel/bpf/verifier.c
··· 563 563 [PTR_TO_BUF] = "buf", 564 564 [PTR_TO_FUNC] = "func", 565 565 [PTR_TO_MAP_KEY] = "map_key", 566 + [PTR_TO_DYNPTR] = "dynptr_ptr", 566 567 }; 567 568 568 569 if (type & PTR_MAYBE_NULL) { ··· 5689 5688 static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } }; 5690 5689 static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } }; 5691 5690 static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } }; 5691 + static const struct bpf_reg_types dynptr_types = { 5692 + .types = { 5693 + PTR_TO_STACK, 5694 + PTR_TO_DYNPTR | DYNPTR_TYPE_LOCAL, 5695 + } 5696 + }; 5692 5697 5693 5698 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = { 5694 5699 [ARG_PTR_TO_MAP_KEY] = &map_key_value_types, ··· 5721 5714 [ARG_PTR_TO_CONST_STR] = &const_str_ptr_types, 5722 5715 [ARG_PTR_TO_TIMER] = &timer_types, 5723 5716 [ARG_PTR_TO_KPTR] = &kptr_types, 5724 - [ARG_PTR_TO_DYNPTR] = &stack_ptr_types, 5717 + [ARG_PTR_TO_DYNPTR] = &dynptr_types, 5725 5718 }; 5726 5719 5727 5720 static int check_reg_type(struct bpf_verifier_env *env, u32 regno, ··· 6073 6066 err = check_mem_size_reg(env, reg, regno, true, meta); 6074 6067 break; 6075 6068 case ARG_PTR_TO_DYNPTR: 6069 + /* We only need to check for initialized / uninitialized helper 6070 + * dynptr args if the dynptr is not PTR_TO_DYNPTR, as the 6071 + * assumption is that if it is, that a helper function 6072 + * initialized the dynptr on behalf of the BPF program. 6073 + */ 6074 + if (base_type(reg->type) == PTR_TO_DYNPTR) 6075 + break; 6076 6076 if (arg_type & MEM_UNINIT) { 6077 6077 if (!is_dynptr_reg_valid_uninit(env, reg)) { 6078 6078 verbose(env, "Dynptr has to be an uninitialized dynptr\n"); ··· 6254 6240 func_id != BPF_FUNC_ringbuf_discard_dynptr) 6255 6241 goto error; 6256 6242 break; 6243 + case BPF_MAP_TYPE_USER_RINGBUF: 6244 + if (func_id != BPF_FUNC_user_ringbuf_drain) 6245 + goto error; 6246 + break; 6257 6247 case BPF_MAP_TYPE_STACK_TRACE: 6258 6248 if (func_id != BPF_FUNC_get_stackid) 6259 6249 goto error; ··· 6375 6357 case BPF_FUNC_ringbuf_submit_dynptr: 6376 6358 case BPF_FUNC_ringbuf_discard_dynptr: 6377 6359 if (map->map_type != BPF_MAP_TYPE_RINGBUF) 6360 + goto error; 6361 + break; 6362 + case BPF_FUNC_user_ringbuf_drain: 6363 + if (map->map_type != BPF_MAP_TYPE_USER_RINGBUF) 6378 6364 goto error; 6379 6365 break; 6380 6366 case BPF_FUNC_get_stackid: ··· 6907 6885 return 0; 6908 6886 } 6909 6887 6888 + static int set_user_ringbuf_callback_state(struct bpf_verifier_env *env, 6889 + struct bpf_func_state *caller, 6890 + struct bpf_func_state *callee, 6891 + int insn_idx) 6892 + { 6893 + /* bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void 6894 + * callback_ctx, u64 flags); 6895 + * callback_fn(struct bpf_dynptr_t* dynptr, void *callback_ctx); 6896 + */ 6897 + __mark_reg_not_init(env, &callee->regs[BPF_REG_0]); 6898 + callee->regs[BPF_REG_1].type = PTR_TO_DYNPTR | DYNPTR_TYPE_LOCAL; 6899 + __mark_reg_known_zero(&callee->regs[BPF_REG_1]); 6900 + callee->regs[BPF_REG_2] = caller->regs[BPF_REG_3]; 6901 + 6902 + /* unused */ 6903 + __mark_reg_not_init(env, &callee->regs[BPF_REG_3]); 6904 + __mark_reg_not_init(env, &callee->regs[BPF_REG_4]); 6905 + __mark_reg_not_init(env, &callee->regs[BPF_REG_5]); 6906 + 6907 + callee->in_callback_fn = true; 6908 + return 0; 6909 + } 6910 + 6910 6911 static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx) 6911 6912 { 6912 6913 struct bpf_verifier_state *state = env->cur_state; 
··· 7389 7344 case BPF_FUNC_dynptr_data: 7390 7345 for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) { 7391 7346 if (arg_type_is_dynptr(fn->arg_type[i])) { 7347 + struct bpf_reg_state *reg = &regs[BPF_REG_1 + i]; 7348 + 7392 7349 if (meta.ref_obj_id) { 7393 7350 verbose(env, "verifier internal error: meta.ref_obj_id already set\n"); 7394 7351 return -EFAULT; 7395 7352 } 7396 - /* Find the id of the dynptr we're tracking the reference of */ 7397 - meta.ref_obj_id = stack_slot_get_id(env, &regs[BPF_REG_1 + i]); 7353 + 7354 + if (base_type(reg->type) != PTR_TO_DYNPTR) 7355 + /* Find the id of the dynptr we're 7356 + * tracking the reference of 7357 + */ 7358 + meta.ref_obj_id = stack_slot_get_id(env, reg); 7398 7359 break; 7399 7360 } 7400 7361 } ··· 7408 7357 verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n"); 7409 7358 return -EFAULT; 7410 7359 } 7360 + break; 7361 + case BPF_FUNC_user_ringbuf_drain: 7362 + err = __check_func_call(env, insn, insn_idx_p, meta.subprogno, 7363 + set_user_ringbuf_callback_state); 7411 7364 break; 7412 7365 } 7413 7366 ··· 12690 12635 case BPF_MAP_TYPE_ARRAY_OF_MAPS: 12691 12636 case BPF_MAP_TYPE_HASH_OF_MAPS: 12692 12637 case BPF_MAP_TYPE_RINGBUF: 12638 + case BPF_MAP_TYPE_USER_RINGBUF: 12693 12639 case BPF_MAP_TYPE_INODE_STORAGE: 12694 12640 case BPF_MAP_TYPE_SK_STORAGE: 12695 12641 case BPF_MAP_TYPE_TASK_STORAGE:
+1 -1
tools/bpf/bpftool/Documentation/bpftool-map.rst
··· 55 55 | | **devmap** | **devmap_hash** | **sockmap** | **cpumap** | **xskmap** | **sockhash** 56 56 | | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage** 57 57 | | **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** | **inode_storage** 58 - | | **task_storage** | **bloom_filter** } 58 + | | **task_storage** | **bloom_filter** | **user_ringbuf** } 59 59 60 60 DESCRIPTION 61 61 ===========
+1 -1
tools/bpf/bpftool/map.c
··· 1459 1459 " devmap | devmap_hash | sockmap | cpumap | xskmap | sockhash |\n" 1460 1460 " cgroup_storage | reuseport_sockarray | percpu_cgroup_storage |\n" 1461 1461 " queue | stack | sk_storage | struct_ops | ringbuf | inode_storage |\n" 1462 - " task_storage | bloom_filter }\n" 1462 + " task_storage | bloom_filter | user_ringbuf }\n" 1463 1463 " " HELP_SPEC_OPTIONS " |\n" 1464 1464 " {-f|--bpffs} | {-n|--nomount} }\n" 1465 1465 "",
+39
tools/include/uapi/linux/bpf.h
··· 928 928 BPF_MAP_TYPE_INODE_STORAGE, 929 929 BPF_MAP_TYPE_TASK_STORAGE, 930 930 BPF_MAP_TYPE_BLOOM_FILTER, 931 + BPF_MAP_TYPE_USER_RINGBUF, 931 932 }; 932 933 933 934 /* Note that tracing related programs such as ··· 5388 5387 * Return 5389 5388 * Current *ktime*. 5390 5389 * 5390 + * long bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx, u64 flags) 5391 + * Description 5392 + * Drain samples from the specified user ring buffer, and invoke 5393 + * the provided callback for each such sample: 5394 + * 5395 + * long (\*callback_fn)(struct bpf_dynptr \*dynptr, void \*ctx); 5396 + * 5397 + * If **callback_fn** returns 0, the helper will continue to try 5398 + * and drain the next sample, up to a maximum of 5399 + * BPF_MAX_USER_RINGBUF_SAMPLES samples. If the return value is 1, 5400 + * the helper will skip the rest of the samples and return. Other 5401 + * return values are not used now, and will be rejected by the 5402 + * verifier. 5403 + * Return 5404 + * The number of drained samples if no error was encountered while 5405 + * draining samples, or 0 if no samples were present in the ring 5406 + * buffer. If a user-space producer was epoll-waiting on this map, 5407 + * and at least one sample was drained, they will receive an event 5408 + * notification notifying them of available space in the ring 5409 + * buffer. If the BPF_RB_NO_WAKEUP flag is passed to this 5410 + * function, no wakeup notification will be sent. If the 5411 + * BPF_RB_FORCE_WAKEUP flag is passed, a wakeup notification will 5412 + * be sent even if no sample was drained. 5413 + * 5414 + * On failure, the returned value is one of the following: 5415 + * 5416 + * **-EBUSY** if the ring buffer is contended, and another calling 5417 + * context was concurrently draining the ring buffer. 5418 + * 5419 + * **-EINVAL** if user-space is not properly tracking the ring 5420 + * buffer due to the producer position not being aligned to 8 5421 + * bytes, a sample not being aligned to 8 bytes, or the producer 5422 + * position not matching the advertised length of a sample. 5423 + * 5424 + * **-E2BIG** if user-space has tried to publish a sample which is 5425 + * larger than the size of the ring buffer, or which cannot fit 5426 + * within a struct bpf_dynptr. 5391 5427 */ 5392 5428 #define __BPF_FUNC_MAPPER(FN) \ 5393 5429 FN(unspec), \ ··· 5636 5598 FN(tcp_raw_check_syncookie_ipv4), \ 5637 5599 FN(tcp_raw_check_syncookie_ipv6), \ 5638 5600 FN(ktime_get_tai_ns), \ 5601 + FN(user_ringbuf_drain), \ 5639 5602 /* */ 5640 5603 5641 5604 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
+9 -2
tools/lib/bpf/libbpf.c
··· 163 163 [BPF_MAP_TYPE_INODE_STORAGE] = "inode_storage", 164 164 [BPF_MAP_TYPE_TASK_STORAGE] = "task_storage", 165 165 [BPF_MAP_TYPE_BLOOM_FILTER] = "bloom_filter", 166 + [BPF_MAP_TYPE_USER_RINGBUF] = "user_ringbuf", 166 167 }; 167 168 168 169 static const char * const prog_type_name[] = { ··· 2373 2372 return sz; 2374 2373 } 2375 2374 2375 + static bool map_is_ringbuf(const struct bpf_map *map) 2376 + { 2377 + return map->def.type == BPF_MAP_TYPE_RINGBUF || 2378 + map->def.type == BPF_MAP_TYPE_USER_RINGBUF; 2379 + } 2380 + 2376 2381 static void fill_map_from_def(struct bpf_map *map, const struct btf_map_def *def) 2377 2382 { 2378 2383 map->def.type = def->map_type; ··· 2393 2386 map->btf_value_type_id = def->value_type_id; 2394 2387 2395 2388 /* auto-adjust BPF ringbuf map max_entries to be a multiple of page size */ 2396 - if (map->def.type == BPF_MAP_TYPE_RINGBUF) 2389 + if (map_is_ringbuf(map)) 2397 2390 map->def.max_entries = adjust_ringbuf_sz(map->def.max_entries); 2398 2391 2399 2392 if (def->parts & MAP_DEF_MAP_TYPE) ··· 4376 4369 map->def.max_entries = max_entries; 4377 4370 4378 4371 /* auto-adjust BPF ringbuf map max_entries to be a multiple of page size */ 4379 - if (map->def.type == BPF_MAP_TYPE_RINGBUF) 4372 + if (map_is_ringbuf(map)) 4380 4373 map->def.max_entries = adjust_ringbuf_sz(map->def.max_entries); 4381 4374 4382 4375 return 0;
+107
tools/lib/bpf/libbpf.h
··· 1011 1011 1012 1012 /* Ring buffer APIs */ 1013 1013 struct ring_buffer; 1014 + struct user_ring_buffer; 1014 1015 1015 1016 typedef int (*ring_buffer_sample_fn)(void *ctx, void *data, size_t size); 1016 1017 ··· 1030 1029 LIBBPF_API int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms); 1031 1030 LIBBPF_API int ring_buffer__consume(struct ring_buffer *rb); 1032 1031 LIBBPF_API int ring_buffer__epoll_fd(const struct ring_buffer *rb); 1032 + 1033 + struct user_ring_buffer_opts { 1034 + size_t sz; /* size of this struct, for forward/backward compatibility */ 1035 + }; 1036 + 1037 + #define user_ring_buffer_opts__last_field sz 1038 + 1039 + /* @brief **user_ring_buffer__new()** creates a new instance of a user ring 1040 + * buffer. 1041 + * 1042 + * @param map_fd A file descriptor to a BPF_MAP_TYPE_USER_RINGBUF map. 1043 + * @param opts Options for how the ring buffer should be created. 1044 + * @return A user ring buffer on success; NULL and errno being set on a 1045 + * failure. 1046 + */ 1047 + LIBBPF_API struct user_ring_buffer * 1048 + user_ring_buffer__new(int map_fd, const struct user_ring_buffer_opts *opts); 1049 + 1050 + /* @brief **user_ring_buffer__reserve()** reserves a pointer to a sample in the 1051 + * user ring buffer. 1052 + * @param rb A pointer to a user ring buffer. 1053 + * @param size The size of the sample, in bytes. 1054 + * @return A pointer to an 8-byte aligned reserved region of the user ring 1055 + * buffer; NULL, and errno being set if a sample could not be reserved. 1056 + * 1057 + * This function is *not* thread safe, and callers must synchronize accessing 1058 + * this function if there are multiple producers. If a size is requested that 1059 + * is larger than the size of the entire ring buffer, errno will be set to 1060 + * E2BIG and NULL is returned. If the ring buffer could accommodate the size, 1061 + * but currently does not have enough space, errno is set to ENOSPC and NULL is 1062 + * returned. 1063 + * 1064 + * After initializing the sample, callers must invoke 1065 + * **user_ring_buffer__submit()** to post the sample to the kernel. Otherwise, 1066 + * the sample must be freed with **user_ring_buffer__discard()**. 1067 + */ 1068 + LIBBPF_API void *user_ring_buffer__reserve(struct user_ring_buffer *rb, __u32 size); 1069 + 1070 + /* @brief **user_ring_buffer__reserve_blocking()** reserves a record in the 1071 + * ring buffer, possibly blocking for up to @timeout_ms until a sample becomes 1072 + * available. 1073 + * @param rb The user ring buffer. 1074 + * @param size The size of the sample, in bytes. 1075 + * @param timeout_ms The amount of time, in milliseconds, for which the caller 1076 + * should block when waiting for a sample. -1 causes the caller to block 1077 + * indefinitely. 1078 + * @return A pointer to an 8-byte aligned reserved region of the user ring 1079 + * buffer; NULL, and errno being set if a sample could not be reserved. 1080 + * 1081 + * This function is *not* thread safe, and callers must synchronize 1082 + * accessing this function if there are multiple producers 1083 + * 1084 + * If **timeout_ms** is -1, the function will block indefinitely until a sample 1085 + * becomes available. Otherwise, **timeout_ms** must be non-negative, or errno 1086 + * is set to EINVAL, and NULL is returned. If **timeout_ms** is 0, no blocking 1087 + * will occur and the function will return immediately after attempting to 1088 + * reserve a sample. 
1089 + * 1090 + * If **size** is larger than the size of the entire ring buffer, errno is set 1091 + * to E2BIG and NULL is returned. If the ring buffer could accommodate 1092 + * **size**, but currently does not have enough space, the caller will block 1093 + * until at most **timeout_ms** has elapsed. If insufficient space is available 1094 + * at that time, errno is set to ENOSPC, and NULL is returned. 1095 + * 1096 + * The kernel guarantees that it will wake up this thread to check if 1097 + * sufficient space is available in the ring buffer at least once per 1098 + * invocation of the **bpf_ringbuf_drain()** helper function, provided that at 1099 + * least one sample is consumed, and the BPF program did not invoke the 1100 + * function with BPF_RB_NO_WAKEUP. A wakeup may occur sooner than that, but the 1101 + * kernel does not guarantee this. If the helper function is invoked with 1102 + * BPF_RB_FORCE_WAKEUP, a wakeup event will be sent even if no sample is 1103 + * consumed. 1104 + * 1105 + * When a sample of size **size** is found within **timeout_ms**, a pointer to 1106 + * the sample is returned. After initializing the sample, callers must invoke 1107 + * **user_ring_buffer__submit()** to post the sample to the ring buffer. 1108 + * Otherwise, the sample must be freed with **user_ring_buffer__discard()**. 1109 + */ 1110 + LIBBPF_API void *user_ring_buffer__reserve_blocking(struct user_ring_buffer *rb, 1111 + __u32 size, 1112 + int timeout_ms); 1113 + 1114 + /* @brief **user_ring_buffer__submit()** submits a previously reserved sample 1115 + * into the ring buffer. 1116 + * @param rb The user ring buffer. 1117 + * @param sample A reserved sample. 1118 + * 1119 + * It is not necessary to synchronize amongst multiple producers when invoking 1120 + * this function. 1121 + */ 1122 + LIBBPF_API void user_ring_buffer__submit(struct user_ring_buffer *rb, void *sample); 1123 + 1124 + /* @brief **user_ring_buffer__discard()** discards a previously reserved sample. 1125 + * @param rb The user ring buffer. 1126 + * @param sample A reserved sample. 1127 + * 1128 + * It is not necessary to synchronize amongst multiple producers when invoking 1129 + * this function. 1130 + */ 1131 + LIBBPF_API void user_ring_buffer__discard(struct user_ring_buffer *rb, void *sample); 1132 + 1133 + /* @brief **user_ring_buffer__free()** frees a ring buffer that was previously 1134 + * created with **user_ring_buffer__new()**. 1135 + * @param rb The user ring buffer being freed. 1136 + */ 1137 + LIBBPF_API void user_ring_buffer__free(struct user_ring_buffer *rb); 1033 1138 1034 1139 /* Perf buffer APIs */ 1035 1140 struct perf_buffer;
+10
tools/lib/bpf/libbpf.map
··· 368 368 libbpf_bpf_prog_type_str; 369 369 perf_buffer__buffer; 370 370 }; 371 + 372 + LIBBPF_1.1.0 { 373 + global: 374 + user_ring_buffer__discard; 375 + user_ring_buffer__free; 376 + user_ring_buffer__new; 377 + user_ring_buffer__reserve; 378 + user_ring_buffer__reserve_blocking; 379 + user_ring_buffer__submit; 380 + } LIBBPF_1.0.0;
+1
tools/lib/bpf/libbpf_probes.c
··· 231 231 return btf_fd; 232 232 break; 233 233 case BPF_MAP_TYPE_RINGBUF: 234 + case BPF_MAP_TYPE_USER_RINGBUF: 234 235 key_size = 0; 235 236 value_size = 0; 236 237 max_entries = 4096;
+1 -1
tools/lib/bpf/libbpf_version.h
··· 4 4 #define __LIBBPF_VERSION_H 5 5 6 6 #define LIBBPF_MAJOR_VERSION 1 7 - #define LIBBPF_MINOR_VERSION 0 7 + #define LIBBPF_MINOR_VERSION 1 8 8 9 9 #endif /* __LIBBPF_VERSION_H */
+271
tools/lib/bpf/ringbuf.c
··· 16 16 #include <asm/barrier.h> 17 17 #include <sys/mman.h> 18 18 #include <sys/epoll.h> 19 + #include <time.h> 19 20 20 21 #include "libbpf.h" 21 22 #include "libbpf_internal.h" ··· 38 37 size_t page_size; 39 38 int epoll_fd; 40 39 int ring_cnt; 40 + }; 41 + 42 + struct user_ring_buffer { 43 + struct epoll_event event; 44 + unsigned long *consumer_pos; 45 + unsigned long *producer_pos; 46 + void *data; 47 + unsigned long mask; 48 + size_t page_size; 49 + int map_fd; 50 + int epoll_fd; 51 + }; 52 + 53 + /* 8-byte ring buffer header structure */ 54 + struct ringbuf_hdr { 55 + __u32 len; 56 + __u32 pad; 41 57 }; 42 58 43 59 static void ringbuf_unmap_ring(struct ring_buffer *rb, struct ring *r) ··· 317 299 int ring_buffer__epoll_fd(const struct ring_buffer *rb) 318 300 { 319 301 return rb->epoll_fd; 302 + } 303 + 304 + static void user_ringbuf_unmap_ring(struct user_ring_buffer *rb) 305 + { 306 + if (rb->consumer_pos) { 307 + munmap(rb->consumer_pos, rb->page_size); 308 + rb->consumer_pos = NULL; 309 + } 310 + if (rb->producer_pos) { 311 + munmap(rb->producer_pos, rb->page_size + 2 * (rb->mask + 1)); 312 + rb->producer_pos = NULL; 313 + } 314 + } 315 + 316 + void user_ring_buffer__free(struct user_ring_buffer *rb) 317 + { 318 + if (!rb) 319 + return; 320 + 321 + user_ringbuf_unmap_ring(rb); 322 + 323 + if (rb->epoll_fd >= 0) 324 + close(rb->epoll_fd); 325 + 326 + free(rb); 327 + } 328 + 329 + static int user_ringbuf_map(struct user_ring_buffer *rb, int map_fd) 330 + { 331 + struct bpf_map_info info; 332 + __u32 len = sizeof(info); 333 + void *tmp; 334 + struct epoll_event *rb_epoll; 335 + int err; 336 + 337 + memset(&info, 0, sizeof(info)); 338 + 339 + err = bpf_obj_get_info_by_fd(map_fd, &info, &len); 340 + if (err) { 341 + err = -errno; 342 + pr_warn("user ringbuf: failed to get map info for fd=%d: %d\n", map_fd, err); 343 + return err; 344 + } 345 + 346 + if (info.type != BPF_MAP_TYPE_USER_RINGBUF) { 347 + pr_warn("user ringbuf: map fd=%d is not BPF_MAP_TYPE_USER_RINGBUF\n", map_fd); 348 + return -EINVAL; 349 + } 350 + 351 + rb->map_fd = map_fd; 352 + rb->mask = info.max_entries - 1; 353 + 354 + /* Map read-only consumer page */ 355 + tmp = mmap(NULL, rb->page_size, PROT_READ, MAP_SHARED, map_fd, 0); 356 + if (tmp == MAP_FAILED) { 357 + err = -errno; 358 + pr_warn("user ringbuf: failed to mmap consumer page for map fd=%d: %d\n", 359 + map_fd, err); 360 + return err; 361 + } 362 + rb->consumer_pos = tmp; 363 + 364 + /* Map read-write the producer page and data pages. We map the data 365 + * region as twice the total size of the ring buffer to allow the 366 + * simple reading and writing of samples that wrap around the end of 367 + * the buffer. See the kernel implementation for details. 
368 + */ 369 + tmp = mmap(NULL, rb->page_size + 2 * info.max_entries, 370 + PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, rb->page_size); 371 + if (tmp == MAP_FAILED) { 372 + err = -errno; 373 + pr_warn("user ringbuf: failed to mmap data pages for map fd=%d: %d\n", 374 + map_fd, err); 375 + return err; 376 + } 377 + 378 + rb->producer_pos = tmp; 379 + rb->data = tmp + rb->page_size; 380 + 381 + rb_epoll = &rb->event; 382 + rb_epoll->events = EPOLLOUT; 383 + if (epoll_ctl(rb->epoll_fd, EPOLL_CTL_ADD, map_fd, rb_epoll) < 0) { 384 + err = -errno; 385 + pr_warn("user ringbuf: failed to epoll add map fd=%d: %d\n", map_fd, err); 386 + return err; 387 + } 388 + 389 + return 0; 390 + } 391 + 392 + struct user_ring_buffer * 393 + user_ring_buffer__new(int map_fd, const struct user_ring_buffer_opts *opts) 394 + { 395 + struct user_ring_buffer *rb; 396 + int err; 397 + 398 + if (!OPTS_VALID(opts, user_ring_buffer_opts)) 399 + return errno = EINVAL, NULL; 400 + 401 + rb = calloc(1, sizeof(*rb)); 402 + if (!rb) 403 + return errno = ENOMEM, NULL; 404 + 405 + rb->page_size = getpagesize(); 406 + 407 + rb->epoll_fd = epoll_create1(EPOLL_CLOEXEC); 408 + if (rb->epoll_fd < 0) { 409 + err = -errno; 410 + pr_warn("user ringbuf: failed to create epoll instance: %d\n", err); 411 + goto err_out; 412 + } 413 + 414 + err = user_ringbuf_map(rb, map_fd); 415 + if (err) 416 + goto err_out; 417 + 418 + return rb; 419 + 420 + err_out: 421 + user_ring_buffer__free(rb); 422 + return errno = -err, NULL; 423 + } 424 + 425 + static void user_ringbuf_commit(struct user_ring_buffer *rb, void *sample, bool discard) 426 + { 427 + __u32 new_len; 428 + struct ringbuf_hdr *hdr; 429 + uintptr_t hdr_offset; 430 + 431 + hdr_offset = rb->mask + 1 + (sample - rb->data) - BPF_RINGBUF_HDR_SZ; 432 + hdr = rb->data + (hdr_offset & rb->mask); 433 + 434 + new_len = hdr->len & ~BPF_RINGBUF_BUSY_BIT; 435 + if (discard) 436 + new_len |= BPF_RINGBUF_DISCARD_BIT; 437 + 438 + /* Synchronizes with smp_load_acquire() in __bpf_user_ringbuf_peek() in 439 + * the kernel. 440 + */ 441 + __atomic_exchange_n(&hdr->len, new_len, __ATOMIC_ACQ_REL); 442 + } 443 + 444 + void user_ring_buffer__discard(struct user_ring_buffer *rb, void *sample) 445 + { 446 + user_ringbuf_commit(rb, sample, true); 447 + } 448 + 449 + void user_ring_buffer__submit(struct user_ring_buffer *rb, void *sample) 450 + { 451 + user_ringbuf_commit(rb, sample, false); 452 + } 453 + 454 + void *user_ring_buffer__reserve(struct user_ring_buffer *rb, __u32 size) 455 + { 456 + __u32 avail_size, total_size, max_size; 457 + /* 64-bit to avoid overflow in case of extreme application behavior */ 458 + __u64 cons_pos, prod_pos; 459 + struct ringbuf_hdr *hdr; 460 + 461 + /* Synchronizes with smp_store_release() in __bpf_user_ringbuf_peek() in 462 + * the kernel. 463 + */ 464 + cons_pos = smp_load_acquire(rb->consumer_pos); 465 + /* Synchronizes with smp_store_release() in user_ringbuf_commit() */ 466 + prod_pos = smp_load_acquire(rb->producer_pos); 467 + 468 + max_size = rb->mask + 1; 469 + avail_size = max_size - (prod_pos - cons_pos); 470 + /* Round up total size to a multiple of 8. 
*/ 471 + total_size = (size + BPF_RINGBUF_HDR_SZ + 7) / 8 * 8; 472 + 473 + if (total_size > max_size) 474 + return errno = E2BIG, NULL; 475 + 476 + if (avail_size < total_size) 477 + return errno = ENOSPC, NULL; 478 + 479 + hdr = rb->data + (prod_pos & rb->mask); 480 + hdr->len = size | BPF_RINGBUF_BUSY_BIT; 481 + hdr->pad = 0; 482 + 483 + /* Synchronizes with smp_load_acquire() in __bpf_user_ringbuf_peek() in 484 + * the kernel. 485 + */ 486 + smp_store_release(rb->producer_pos, prod_pos + total_size); 487 + 488 + return (void *)rb->data + ((prod_pos + BPF_RINGBUF_HDR_SZ) & rb->mask); 489 + } 490 + 491 + static __u64 ns_elapsed_timespec(const struct timespec *start, const struct timespec *end) 492 + { 493 + __u64 start_ns, end_ns, ns_per_s = 1000000000; 494 + 495 + start_ns = (__u64)start->tv_sec * ns_per_s + start->tv_nsec; 496 + end_ns = (__u64)end->tv_sec * ns_per_s + end->tv_nsec; 497 + 498 + return end_ns - start_ns; 499 + } 500 + 501 + void *user_ring_buffer__reserve_blocking(struct user_ring_buffer *rb, __u32 size, int timeout_ms) 502 + { 503 + void *sample; 504 + int err, ms_remaining = timeout_ms; 505 + struct timespec start; 506 + 507 + if (timeout_ms < 0 && timeout_ms != -1) 508 + return errno = EINVAL, NULL; 509 + 510 + if (timeout_ms != -1) { 511 + err = clock_gettime(CLOCK_MONOTONIC, &start); 512 + if (err) 513 + return NULL; 514 + } 515 + 516 + do { 517 + int cnt, ms_elapsed; 518 + struct timespec curr; 519 + __u64 ns_per_ms = 1000000; 520 + 521 + sample = user_ring_buffer__reserve(rb, size); 522 + if (sample) 523 + return sample; 524 + else if (errno != ENOSPC) 525 + return NULL; 526 + 527 + /* The kernel guarantees at least one event notification 528 + * delivery whenever at least one sample is drained from the 529 + * ring buffer in an invocation to bpf_ringbuf_drain(). Other 530 + * additional events may be delivered at any time, but only one 531 + * event is guaranteed per bpf_ringbuf_drain() invocation, 532 + * provided that a sample is drained, and the BPF program did 533 + * not pass BPF_RB_NO_WAKEUP to bpf_ringbuf_drain(). If 534 + * BPF_RB_FORCE_WAKEUP is passed to bpf_ringbuf_drain(), a 535 + * wakeup event will be delivered even if no samples are 536 + * drained. 537 + */ 538 + cnt = epoll_wait(rb->epoll_fd, &rb->event, 1, ms_remaining); 539 + if (cnt < 0) 540 + return NULL; 541 + 542 + if (timeout_ms == -1) 543 + continue; 544 + 545 + err = clock_gettime(CLOCK_MONOTONIC, &curr); 546 + if (err) 547 + return NULL; 548 + 549 + ms_elapsed = ns_elapsed_timespec(&start, &curr) / ns_per_ms; 550 + ms_remaining = timeout_ms - ms_elapsed; 551 + } while (ms_remaining > 0); 552 + 553 + /* Try one more time to reserve a sample after the specified timeout has elapsed. */ 554 + return user_ring_buffer__reserve(rb, size); 320 555 }
+1
tools/testing/selftests/bpf/DENYLIST.s390x
··· 71 71 cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc) 72 72 htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline) 73 73 tracing_struct # failed to auto-attach: -524 (trampoline) 74 + user_ringbuf # failed to find kernel BTF type ID of '__s390x_sys_prctl': -3 (?)
+754
tools/testing/selftests/bpf/prog_tests/user_ringbuf.c
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */

#define _GNU_SOURCE
#include <linux/compiler.h>
#include <linux/ring_buffer.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/sysinfo.h>
#include <test_progs.h>
#include <uapi/linux/bpf.h>
#include <unistd.h>

#include "user_ringbuf_fail.skel.h"
#include "user_ringbuf_success.skel.h"

#include "../progs/test_user_ringbuf.h"

static size_t log_buf_sz = 1 << 20; /* 1 MB */
static char obj_log_buf[1048576];
static const long c_sample_size = sizeof(struct sample) + BPF_RINGBUF_HDR_SZ;
static const long c_ringbuf_size = 1 << 12; /* 1 small page */
static const long c_max_entries = c_ringbuf_size / c_sample_size;

static void drain_current_samples(void)
{
	syscall(__NR_getpgid);
}

static int write_samples(struct user_ring_buffer *ringbuf, uint32_t num_samples)
{
	int i, err = 0;

	/* Write some number of samples to the ring buffer. */
	for (i = 0; i < num_samples; i++) {
		struct sample *entry;
		int read;

		entry = user_ring_buffer__reserve(ringbuf, sizeof(*entry));
		if (!entry) {
			err = -errno;
			goto done;
		}

		entry->pid = getpid();
		entry->seq = i;
		entry->value = i * i;

		read = snprintf(entry->comm, sizeof(entry->comm), "%u", i);
		if (read <= 0) {
			/* Assert on the error path to avoid spamming logs with
			 * mostly success messages.
			 */
			ASSERT_GT(read, 0, "snprintf_comm");
			err = read;
			user_ring_buffer__discard(ringbuf, entry);
			goto done;
		}

		user_ring_buffer__submit(ringbuf, entry);
	}

done:
	drain_current_samples();

	return err;
}

static struct user_ringbuf_success *open_load_ringbuf_skel(void)
{
	struct user_ringbuf_success *skel;
	int err;

	skel = user_ringbuf_success__open();
	if (!ASSERT_OK_PTR(skel, "skel_open"))
		return NULL;

	err = bpf_map__set_max_entries(skel->maps.user_ringbuf, c_ringbuf_size);
	if (!ASSERT_OK(err, "set_max_entries"))
		goto cleanup;

	err = bpf_map__set_max_entries(skel->maps.kernel_ringbuf, c_ringbuf_size);
	if (!ASSERT_OK(err, "set_max_entries"))
		goto cleanup;

	err = user_ringbuf_success__load(skel);
	if (!ASSERT_OK(err, "skel_load"))
		goto cleanup;

	return skel;

cleanup:
	user_ringbuf_success__destroy(skel);
	return NULL;
}

static void test_user_ringbuf_mappings(void)
{
	int err, rb_fd;
	int page_size = getpagesize();
	void *mmap_ptr;
	struct user_ringbuf_success *skel;

	skel = open_load_ringbuf_skel();
	if (!skel)
		return;

	rb_fd = bpf_map__fd(skel->maps.user_ringbuf);
	/* cons_pos can be mapped R/O, can't add +X with mprotect. */
	mmap_ptr = mmap(NULL, page_size, PROT_READ, MAP_SHARED, rb_fd, 0);
	ASSERT_OK_PTR(mmap_ptr, "ro_cons_pos");
	ASSERT_ERR(mprotect(mmap_ptr, page_size, PROT_WRITE), "write_cons_pos_protect");
	ASSERT_ERR(mprotect(mmap_ptr, page_size, PROT_EXEC), "exec_cons_pos_protect");
	ASSERT_ERR_PTR(mremap(mmap_ptr, 0, 4 * page_size, MREMAP_MAYMOVE), "wr_prod_pos");
	err = -errno;
	ASSERT_ERR(err, "wr_prod_pos_err");
	ASSERT_OK(munmap(mmap_ptr, page_size), "unmap_ro_cons");

	/* prod_pos can be mapped RW, can't add +X with mprotect. */
	mmap_ptr = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
			rb_fd, page_size);
	ASSERT_OK_PTR(mmap_ptr, "rw_prod_pos");
	ASSERT_ERR(mprotect(mmap_ptr, page_size, PROT_EXEC), "exec_prod_pos_protect");
	err = -errno;
	ASSERT_ERR(err, "wr_prod_pos_err");
	ASSERT_OK(munmap(mmap_ptr, page_size), "unmap_rw_prod");

	/* data pages can be mapped RW, can't add +X with mprotect. */
	mmap_ptr = mmap(NULL, page_size, PROT_WRITE, MAP_SHARED, rb_fd,
			2 * page_size);
	ASSERT_OK_PTR(mmap_ptr, "rw_data");
	ASSERT_ERR(mprotect(mmap_ptr, page_size, PROT_EXEC), "exec_data_protect");
	err = -errno;
	ASSERT_ERR(err, "exec_data_err");
	ASSERT_OK(munmap(mmap_ptr, page_size), "unmap_rw_data");

	user_ringbuf_success__destroy(skel);
}

static int load_skel_create_ringbufs(struct user_ringbuf_success **skel_out,
				     struct ring_buffer **kern_ringbuf_out,
				     ring_buffer_sample_fn callback,
				     struct user_ring_buffer **user_ringbuf_out)
{
	struct user_ringbuf_success *skel;
	struct ring_buffer *kern_ringbuf = NULL;
	struct user_ring_buffer *user_ringbuf = NULL;
	int err = -ENOMEM, rb_fd;

	skel = open_load_ringbuf_skel();
	if (!skel)
		return err;

	/* only trigger BPF program for current process */
	skel->bss->pid = getpid();

	if (kern_ringbuf_out) {
		rb_fd = bpf_map__fd(skel->maps.kernel_ringbuf);
		kern_ringbuf = ring_buffer__new(rb_fd, callback, skel, NULL);
		if (!ASSERT_OK_PTR(kern_ringbuf, "kern_ringbuf_create"))
			goto cleanup;

		*kern_ringbuf_out = kern_ringbuf;
	}

	if (user_ringbuf_out) {
		rb_fd = bpf_map__fd(skel->maps.user_ringbuf);
		user_ringbuf = user_ring_buffer__new(rb_fd, NULL);
		if (!ASSERT_OK_PTR(user_ringbuf, "user_ringbuf_create"))
			goto cleanup;

		*user_ringbuf_out = user_ringbuf;
		ASSERT_EQ(skel->bss->read, 0, "no_reads_after_load");
	}

	err = user_ringbuf_success__attach(skel);
	if (!ASSERT_OK(err, "skel_attach"))
		goto cleanup;

	*skel_out = skel;
	return 0;

cleanup:
	if (kern_ringbuf_out)
		*kern_ringbuf_out = NULL;
	if (user_ringbuf_out)
		*user_ringbuf_out = NULL;
	ring_buffer__free(kern_ringbuf);
	user_ring_buffer__free(user_ringbuf);
	user_ringbuf_success__destroy(skel);
	return err;
}

static int load_skel_create_user_ringbuf(struct user_ringbuf_success **skel_out,
					 struct user_ring_buffer **ringbuf_out)
{
	return load_skel_create_ringbufs(skel_out, NULL, NULL, ringbuf_out);
}

static void manually_write_test_invalid_sample(struct user_ringbuf_success *skel,
					       __u32 size, __u64 producer_pos, int err)
{
	void *data_ptr;
	__u64 *producer_pos_ptr;
	int rb_fd, page_size = getpagesize();

	rb_fd = bpf_map__fd(skel->maps.user_ringbuf);

	ASSERT_EQ(skel->bss->read, 0, "num_samples_before_bad_sample");

	/* Map the producer_pos as RW. */
	producer_pos_ptr = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
				MAP_SHARED, rb_fd, page_size);
	ASSERT_OK_PTR(producer_pos_ptr, "producer_pos_ptr");

	/* Map the data pages as RW. */
	data_ptr = mmap(NULL, page_size, PROT_WRITE, MAP_SHARED, rb_fd, 2 * page_size);
	ASSERT_OK_PTR(data_ptr, "rw_data");

	memset(data_ptr, 0, BPF_RINGBUF_HDR_SZ);
	*(__u32 *)data_ptr = size;

	/* Synchronizes with smp_load_acquire() in __bpf_user_ringbuf_peek() in the kernel. */
	smp_store_release(producer_pos_ptr, producer_pos + BPF_RINGBUF_HDR_SZ);

	drain_current_samples();
	ASSERT_EQ(skel->bss->read, 0, "num_samples_after_bad_sample");
	ASSERT_EQ(skel->bss->err, err, "err_after_bad_sample");

	ASSERT_OK(munmap(producer_pos_ptr, page_size), "unmap_producer_pos");
	ASSERT_OK(munmap(data_ptr, page_size), "unmap_data_ptr");
}

static void test_user_ringbuf_post_misaligned(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	int err;
	__u32 size = (1 << 5) + 7;

	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (!ASSERT_OK(err, "misaligned_skel"))
		return;

	manually_write_test_invalid_sample(skel, size, size, -EINVAL);
	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static void test_user_ringbuf_post_producer_wrong_offset(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	int err;
	__u32 size = (1 << 5);

	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (!ASSERT_OK(err, "wrong_offset_skel"))
		return;

	manually_write_test_invalid_sample(skel, size, size - 8, -EINVAL);
	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static void test_user_ringbuf_post_larger_than_ringbuf_sz(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	int err;
	__u32 size = c_ringbuf_size;

	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (!ASSERT_OK(err, "huge_sample_skel"))
		return;

	manually_write_test_invalid_sample(skel, size, size, -E2BIG);
	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static void test_user_ringbuf_basic(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	int err;

	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (!ASSERT_OK(err, "ringbuf_basic_skel"))
		return;

	ASSERT_EQ(skel->bss->read, 0, "num_samples_read_before");

	err = write_samples(ringbuf, 2);
	if (!ASSERT_OK(err, "write_samples"))
		goto cleanup;

	ASSERT_EQ(skel->bss->read, 2, "num_samples_read_after");

cleanup:
	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static void test_user_ringbuf_sample_full_ring_buffer(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	int err;
	void *sample;

	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (!ASSERT_OK(err, "ringbuf_full_sample_skel"))
		return;

	sample = user_ring_buffer__reserve(ringbuf, c_ringbuf_size - BPF_RINGBUF_HDR_SZ);
	if (!ASSERT_OK_PTR(sample, "full_sample"))
		goto cleanup;

	user_ring_buffer__submit(ringbuf, sample);
	ASSERT_EQ(skel->bss->read, 0, "num_samples_read_before");
	drain_current_samples();
	ASSERT_EQ(skel->bss->read, 1, "num_samples_read_after");

cleanup:
	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static void test_user_ringbuf_post_alignment_autoadjust(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	struct sample *sample;
	int err;

	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (!ASSERT_OK(err, "ringbuf_align_autoadjust_skel"))
		return;

	/* libbpf should automatically round any sample up to an 8-byte alignment. */
	sample = user_ring_buffer__reserve(ringbuf, sizeof(*sample) + 1);
	ASSERT_OK_PTR(sample, "reserve_autoaligned");
	user_ring_buffer__submit(ringbuf, sample);

	ASSERT_EQ(skel->bss->read, 0, "num_samples_read_before");
	drain_current_samples();
	ASSERT_EQ(skel->bss->read, 1, "num_samples_read_after");

	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static void test_user_ringbuf_overfill(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	int err;

	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (err)
		return;

	err = write_samples(ringbuf, c_max_entries * 5);
	ASSERT_ERR(err, "write_samples");
	ASSERT_EQ(skel->bss->read, c_max_entries, "max_entries");

	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static void test_user_ringbuf_discards_properly_ignored(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	int err, num_discarded = 0;
	__u64 *token;

	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (err)
		return;

	ASSERT_EQ(skel->bss->read, 0, "num_samples_read_before");

	while (1) {
		/* Reserve tokens until the buffer is full, discarding each one. */
		token = user_ring_buffer__reserve(ringbuf, sizeof(*token));
		if (!token)
			break;

		user_ring_buffer__discard(ringbuf, token);
		num_discarded++;
	}

	if (!ASSERT_GT(num_discarded, 0, "num_discarded"))
		goto cleanup;

	/* Should not read any samples, as they are all discarded. */
	ASSERT_EQ(skel->bss->read, 0, "num_pre_kick");
	drain_current_samples();
	ASSERT_EQ(skel->bss->read, 0, "num_post_kick");

	/* Now that the ring buffer has been drained, we should be able to
	 * reserve another token.
	 */
	token = user_ring_buffer__reserve(ringbuf, sizeof(*token));
	if (!ASSERT_OK_PTR(token, "new_token"))
		goto cleanup;

	user_ring_buffer__discard(ringbuf, token);
cleanup:
	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static void test_user_ringbuf_loop(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	uint32_t total_samples = 8192;
	uint32_t remaining_samples = total_samples;
	int err;

	BUILD_BUG_ON(total_samples <= c_max_entries);
	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (err)
		return;

	do {
		uint32_t curr_samples;

		curr_samples = remaining_samples > c_max_entries
			? c_max_entries : remaining_samples;
		err = write_samples(ringbuf, curr_samples);
		if (err != 0) {
			/* Assert inside of if statement to avoid flooding logs
			 * on the success path.
			 */
			ASSERT_OK(err, "write_samples");
			goto cleanup;
		}

		remaining_samples -= curr_samples;
		ASSERT_EQ(skel->bss->read, total_samples - remaining_samples,
			  "current_batched_entries");
	} while (remaining_samples > 0);
	ASSERT_EQ(skel->bss->read, total_samples, "total_batched_entries");

cleanup:
	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static int send_test_message(struct user_ring_buffer *ringbuf,
			     enum test_msg_op op, s64 operand_64,
			     s32 operand_32)
{
	struct test_msg *msg;

	msg = user_ring_buffer__reserve(ringbuf, sizeof(*msg));
	if (!msg) {
		/* Assert on the error path to avoid spamming logs with mostly
		 * success messages.
		 */
		ASSERT_OK_PTR(msg, "reserve_msg");
		return -ENOMEM;
	}

	msg->msg_op = op;

	switch (op) {
	case TEST_MSG_OP_INC64:
	case TEST_MSG_OP_MUL64:
		msg->operand_64 = operand_64;
		break;
	case TEST_MSG_OP_INC32:
	case TEST_MSG_OP_MUL32:
		msg->operand_32 = operand_32;
		break;
	default:
		PRINT_FAIL("Invalid op %d\n", op);
		user_ring_buffer__discard(ringbuf, msg);
		return -EINVAL;
	}

	user_ring_buffer__submit(ringbuf, msg);

	return 0;
}

static void kick_kernel_read_messages(void)
{
	syscall(__NR_prctl);
}

static int handle_kernel_msg(void *ctx, void *data, size_t len)
{
	struct user_ringbuf_success *skel = ctx;
	struct test_msg *msg = data;

	switch (msg->msg_op) {
	case TEST_MSG_OP_INC64:
		skel->bss->user_mutated += msg->operand_64;
		return 0;
	case TEST_MSG_OP_INC32:
		skel->bss->user_mutated += msg->operand_32;
		return 0;
	case TEST_MSG_OP_MUL64:
		skel->bss->user_mutated *= msg->operand_64;
		return 0;
	case TEST_MSG_OP_MUL32:
		skel->bss->user_mutated *= msg->operand_32;
		return 0;
	default:
		fprintf(stderr, "Invalid op %d\n", msg->msg_op);
		return -EINVAL;
	}
}

static void drain_kernel_messages_buffer(struct ring_buffer *kern_ringbuf,
					 struct user_ringbuf_success *skel)
{
	int cnt;

	cnt = ring_buffer__consume(kern_ringbuf);
	ASSERT_EQ(cnt, 8, "consume_kern_ringbuf");
	ASSERT_OK(skel->bss->err, "consume_kern_ringbuf_err");
}

static void test_user_ringbuf_msg_protocol(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *user_ringbuf;
	struct ring_buffer *kern_ringbuf;
	int err, i;
	__u64 expected_kern = 0;

	err = load_skel_create_ringbufs(&skel, &kern_ringbuf, handle_kernel_msg, &user_ringbuf);
	if (!ASSERT_OK(err, "create_ringbufs"))
		return;

	for (i = 0; i < 64; i++) {
		enum test_msg_op op = i % TEST_MSG_OP_NUM_OPS;
		__u64 operand_64 = TEST_OP_64;
		__u32 operand_32 = TEST_OP_32;

		err = send_test_message(user_ringbuf, op, operand_64, operand_32);
		if (err) {
			/* Only assert on a failure to avoid spamming success logs. */
			ASSERT_OK(err, "send_test_message");
			goto cleanup;
		}

		switch (op) {
		case TEST_MSG_OP_INC64:
			expected_kern += operand_64;
			break;
		case TEST_MSG_OP_INC32:
			expected_kern += operand_32;
			break;
		case TEST_MSG_OP_MUL64:
			expected_kern *= operand_64;
			break;
		case TEST_MSG_OP_MUL32:
			expected_kern *= operand_32;
			break;
		default:
			PRINT_FAIL("Unexpected op %d\n", op);
			goto cleanup;
		}

		if (i % 8 == 0) {
			kick_kernel_read_messages();
			ASSERT_EQ(skel->bss->kern_mutated, expected_kern, "expected_kern");
			ASSERT_EQ(skel->bss->err, 0, "bpf_prog_err");
			drain_kernel_messages_buffer(kern_ringbuf, skel);
		}
	}

cleanup:
	ring_buffer__free(kern_ringbuf);
	user_ring_buffer__free(user_ringbuf);
	user_ringbuf_success__destroy(skel);
}

static void *kick_kernel_cb(void *arg)
{
	/* Kick the kernel, causing it to drain the ring buffer and then wake
	 * up the test thread waiting on epoll.
	 */
	syscall(__NR_getrlimit);

	return NULL;
}

static int spawn_kick_thread_for_poll(void)
{
	pthread_t thread;

	return pthread_create(&thread, NULL, kick_kernel_cb, NULL);
}

static void test_user_ringbuf_blocking_reserve(void)
{
	struct user_ringbuf_success *skel;
	struct user_ring_buffer *ringbuf;
	int err, num_written = 0;
	__u64 *token;

	err = load_skel_create_user_ringbuf(&skel, &ringbuf);
	if (err)
		return;

	ASSERT_EQ(skel->bss->read, 0, "num_samples_read_before");

	while (1) {
		/* Write samples until the buffer is full. */
		token = user_ring_buffer__reserve(ringbuf, sizeof(*token));
		if (!token)
			break;

		*token = 0xdeadbeef;

		user_ring_buffer__submit(ringbuf, token);
		num_written++;
	}

	if (!ASSERT_GT(num_written, 0, "num_written"))
		goto cleanup;

	/* Should not have read any samples until the kernel is kicked. */
	ASSERT_EQ(skel->bss->read, 0, "num_pre_kick");

	/* We correctly time out after 1 second, without a sample. */
	token = user_ring_buffer__reserve_blocking(ringbuf, sizeof(*token), 1000);
	if (!ASSERT_EQ(token, NULL, "pre_kick_timeout_token"))
		goto cleanup;

	err = spawn_kick_thread_for_poll();
	if (!ASSERT_EQ(err, 0, "deferred_kick_thread"))
		goto cleanup;

	/* After spawning another thread that asynchronously kicks the kernel
	 * to drain the messages, we're able to block and successfully get a
	 * sample once we receive an event notification.
	 */
	token = user_ring_buffer__reserve_blocking(ringbuf, sizeof(*token), 10000);
	if (!ASSERT_OK_PTR(token, "block_token"))
		goto cleanup;

	ASSERT_GT(skel->bss->read, 0, "num_post_kill");
	ASSERT_LE(skel->bss->read, num_written, "num_post_kill");
	ASSERT_EQ(skel->bss->err, 0, "err_post_poll");
	user_ring_buffer__discard(ringbuf, token);

cleanup:
	user_ring_buffer__free(ringbuf);
	user_ringbuf_success__destroy(skel);
}

static struct {
	const char *prog_name;
	const char *expected_err_msg;
} failure_tests[] = {
	/* failure cases */
	{"user_ringbuf_callback_bad_access1", "negative offset dynptr_ptr ptr"},
	{"user_ringbuf_callback_bad_access2", "dereference of modified dynptr_ptr ptr"},
	{"user_ringbuf_callback_write_forbidden", "invalid mem access 'dynptr_ptr'"},
	{"user_ringbuf_callback_null_context_write", "invalid mem access 'scalar'"},
	{"user_ringbuf_callback_null_context_read", "invalid mem access 'scalar'"},
	{"user_ringbuf_callback_discard_dynptr", "arg 1 is an unacquired reference"},
	{"user_ringbuf_callback_submit_dynptr", "arg 1 is an unacquired reference"},
	{"user_ringbuf_callback_invalid_return", "At callback return the register R0 has value"},
};

#define SUCCESS_TEST(_func) { _func, #_func }

static struct {
	void (*test_callback)(void);
	const char *test_name;
} success_tests[] = {
	SUCCESS_TEST(test_user_ringbuf_mappings),
	SUCCESS_TEST(test_user_ringbuf_post_misaligned),
	SUCCESS_TEST(test_user_ringbuf_post_producer_wrong_offset),
	SUCCESS_TEST(test_user_ringbuf_post_larger_than_ringbuf_sz),
	SUCCESS_TEST(test_user_ringbuf_basic),
	SUCCESS_TEST(test_user_ringbuf_sample_full_ring_buffer),
	SUCCESS_TEST(test_user_ringbuf_post_alignment_autoadjust),
	SUCCESS_TEST(test_user_ringbuf_overfill),
	SUCCESS_TEST(test_user_ringbuf_discards_properly_ignored),
	SUCCESS_TEST(test_user_ringbuf_loop),
	SUCCESS_TEST(test_user_ringbuf_msg_protocol),
	SUCCESS_TEST(test_user_ringbuf_blocking_reserve),
};

static void verify_fail(const char *prog_name, const char *expected_err_msg)
{
	LIBBPF_OPTS(bpf_object_open_opts, opts);
	struct bpf_program *prog;
	struct user_ringbuf_fail *skel;
	int err;

	opts.kernel_log_buf = obj_log_buf;
	opts.kernel_log_size = log_buf_sz;
	opts.kernel_log_level = 1;

	skel = user_ringbuf_fail__open_opts(&opts);
	if (!ASSERT_OK_PTR(skel, "user_ringbuf_fail__open_opts"))
		goto cleanup;

	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
		goto cleanup;

	bpf_program__set_autoload(prog, true);

	bpf_map__set_max_entries(skel->maps.user_ringbuf, getpagesize());

	err = user_ringbuf_fail__load(skel);
	if (!ASSERT_ERR(err, "unexpected load success"))
		goto cleanup;

	if (!ASSERT_OK_PTR(strstr(obj_log_buf, expected_err_msg), "expected_err_msg")) {
		fprintf(stderr, "Expected err_msg: %s\n", expected_err_msg);
		fprintf(stderr, "Verifier output: %s\n", obj_log_buf);
	}

cleanup:
	user_ringbuf_fail__destroy(skel);
}

void test_user_ringbuf(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(success_tests); i++) {
		if (!test__start_subtest(success_tests[i].test_name))
			continue;

		success_tests[i].test_callback();
	}

	for (i = 0; i < ARRAY_SIZE(failure_tests); i++) {
		if (!test__start_subtest(failure_tests[i].prog_name))
			continue;

		verify_fail(failure_tests[i].prog_name, failure_tests[i].expected_err_msg);
	}
}
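For reference, the producer-side API that these tests exercise reduces to a short reserve/fill/submit loop. The sketch below is illustrative only (the map fd would come from a loaded skeleton, as in the tests above; the reserved pointer is 8-byte aligned, and reserve failures report the reason via errno):

#include <bpf/libbpf.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Minimal sketch: publish one sample to a BPF_MAP_TYPE_USER_RINGBUF map,
 * assuming map_fd was obtained via bpf_map__fd() on a loaded object and a
 * BPF program will drain it with bpf_user_ringbuf_drain().
 */
static int produce_one(int map_fd, const void *data, uint32_t size)
{
	struct user_ring_buffer *rb;
	void *sample;

	rb = user_ring_buffer__new(map_fd, NULL);
	if (!rb)
		return -errno;

	/* Returns NULL and sets errno on failure (e.g. no space left). */
	sample = user_ring_buffer__reserve(rb, size);
	if (!sample) {
		user_ring_buffer__free(rb);
		return -errno;
	}

	memcpy(sample, data, size);
	user_ring_buffer__submit(rb, sample);
	user_ring_buffer__free(rb);
	return 0;
}

Note that nothing here notifies the kernel by itself; the drain runs in a BPF program, which these tests trigger by invoking a syscall hooked with an fentry program.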
tools/testing/selftests/bpf/progs/test_user_ringbuf.h (+35 lines)
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */

#ifndef _TEST_USER_RINGBUF_H
#define _TEST_USER_RINGBUF_H

#define TEST_OP_64 4
#define TEST_OP_32 2

enum test_msg_op {
	TEST_MSG_OP_INC64,
	TEST_MSG_OP_INC32,
	TEST_MSG_OP_MUL64,
	TEST_MSG_OP_MUL32,

	// Must come last.
	TEST_MSG_OP_NUM_OPS,
};

struct test_msg {
	enum test_msg_op msg_op;
	union {
		__s64 operand_64;
		__s32 operand_32;
	};
};

struct sample {
	int pid;
	int seq;
	long value;
	char comm[16];
};

#endif /* _TEST_USER_RINGBUF_H */
tools/testing/selftests/bpf/progs/user_ringbuf_fail.c (+177 lines)
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_misc.h"

char _license[] SEC("license") = "GPL";

struct sample {
	int pid;
	int seq;
	long value;
	char comm[16];
};

struct {
	__uint(type, BPF_MAP_TYPE_USER_RINGBUF);
} user_ringbuf SEC(".maps");

static long
bad_access1(struct bpf_dynptr *dynptr, void *context)
{
	const struct sample *sample;

	sample = bpf_dynptr_data(dynptr - 1, 0, sizeof(*sample));
	bpf_printk("Was able to pass bad pointer %lx\n", (__u64)dynptr - 1);

	return 0;
}

/* A callback that accesses a dynptr in a bpf_user_ringbuf_drain callback should
 * not be able to read before the pointer.
 */
SEC("?raw_tp/sys_nanosleep")
int user_ringbuf_callback_bad_access1(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, bad_access1, NULL, 0);

	return 0;
}

static long
bad_access2(struct bpf_dynptr *dynptr, void *context)
{
	const struct sample *sample;

	sample = bpf_dynptr_data(dynptr + 1, 0, sizeof(*sample));
	bpf_printk("Was able to pass bad pointer %lx\n", (__u64)dynptr + 1);

	return 0;
}

/* A callback that accesses a dynptr in a bpf_user_ringbuf_drain callback should
 * not be able to read past the end of the pointer.
 */
SEC("?raw_tp/sys_nanosleep")
int user_ringbuf_callback_bad_access2(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, bad_access2, NULL, 0);

	return 0;
}

static long
write_forbidden(struct bpf_dynptr *dynptr, void *context)
{
	*((long *)dynptr) = 0;

	return 0;
}

/* A callback that accesses a dynptr in a bpf_user_ringbuf_drain callback should
 * not be able to write to that pointer.
 */
SEC("?raw_tp/sys_nanosleep")
int user_ringbuf_callback_write_forbidden(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, write_forbidden, NULL, 0);

	return 0;
}

static long
null_context_write(struct bpf_dynptr *dynptr, void *context)
{
	*((__u64 *)context) = 0;

	return 0;
}

/* A bpf_user_ringbuf_drain callback should not be able to write through the
 * NULL context pointer passed to the drain call.
 */
SEC("?raw_tp/sys_nanosleep")
int user_ringbuf_callback_null_context_write(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, null_context_write, NULL, 0);

	return 0;
}

static long
null_context_read(struct bpf_dynptr *dynptr, void *context)
{
	__u64 id = *((__u64 *)context);

	bpf_printk("Read id %lu\n", id);

	return 0;
}

/* A bpf_user_ringbuf_drain callback should not be able to read through the
 * NULL context pointer passed to the drain call.
 */
SEC("?raw_tp/sys_nanosleep")
int user_ringbuf_callback_null_context_read(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, null_context_read, NULL, 0);

	return 0;
}

static long
try_discard_dynptr(struct bpf_dynptr *dynptr, void *context)
{
	bpf_ringbuf_discard_dynptr(dynptr, 0);

	return 0;
}

/* A bpf_user_ringbuf_drain callback should not be able to discard the dynptr
 * passed to it, as the reference is not acquired by the program.
 */
SEC("?raw_tp/sys_nanosleep")
int user_ringbuf_callback_discard_dynptr(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, try_discard_dynptr, NULL, 0);

	return 0;
}

static long
try_submit_dynptr(struct bpf_dynptr *dynptr, void *context)
{
	bpf_ringbuf_submit_dynptr(dynptr, 0);

	return 0;
}

/* A bpf_user_ringbuf_drain callback should not be able to submit the dynptr
 * passed to it, as the reference is not acquired by the program.
 */
SEC("?raw_tp/sys_nanosleep")
int user_ringbuf_callback_submit_dynptr(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, try_submit_dynptr, NULL, 0);

	return 0;
}

static long
invalid_drain_callback_return(struct bpf_dynptr *dynptr, void *context)
{
	return 2;
}

/* A bpf_user_ringbuf_drain callback may only return 0 or 1; any other return
 * value should be rejected by the verifier.
 */
SEC("?raw_tp/sys_nanosleep")
int user_ringbuf_callback_invalid_return(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, invalid_drain_callback_return, NULL, 0);

	return 0;
}
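For contrast with the failure cases above, a callback shape that the verifier is expected to accept copies the sample out with bpf_dynptr_read() and returns 0 (the invalid_return case shows that a value such as 2 is rejected). A minimal sketch, assuming the same struct sample and user_ringbuf map defined above; the function names here are hypothetical:

static long
valid_cb(struct bpf_dynptr *dynptr, void *context)
{
	struct sample s;

	/* Read-only copy of the sample; no writes through the dynptr,
	 * no submit/discard of a reference the program never acquired.
	 */
	if (bpf_dynptr_read(&s, sizeof(s), dynptr, 0, 0))
		return 0; /* sample smaller than expected; skip it */

	bpf_printk("pid %d seq %d\n", s.pid, s.seq);
	return 0; /* continue draining; 1 would stop early */
}

SEC("?raw_tp/sys_nanosleep")
int user_ringbuf_callback_valid(void *ctx)
{
	bpf_user_ringbuf_drain(&user_ringbuf, valid_cb, NULL, 0);

	return 0;
}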
tools/testing/selftests/bpf/progs/user_ringbuf_success.c (+218 lines)
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_misc.h"
#include "test_user_ringbuf.h"

char _license[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_USER_RINGBUF);
} user_ringbuf SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
} kernel_ringbuf SEC(".maps");

/* input: pid of the test process */
int pid;

/* outputs */
int err, val;
int read = 0;

/* Counters used for the end-to-end protocol test */
__u64 kern_mutated = 0;
__u64 user_mutated = 0;
__u64 expected_user_mutated = 0;

static int
is_test_process(void)
{
	int cur_pid = bpf_get_current_pid_tgid() >> 32;

	return cur_pid == pid;
}

static long
record_sample(struct bpf_dynptr *dynptr, void *context)
{
	const struct sample *sample = NULL;
	struct sample stack_sample;
	int status;
	static int num_calls;

	/* Alternate between the two safe ways of reading a sample, to
	 * exercise both bpf_dynptr_read() and bpf_dynptr_data().
	 */
	if (num_calls++ % 2 == 0) {
		status = bpf_dynptr_read(&stack_sample, sizeof(stack_sample), dynptr, 0, 0);
		if (status) {
			bpf_printk("bpf_dynptr_read() failed: %d\n", status);
			err = 1;
			return 0;
		}
	} else {
		sample = bpf_dynptr_data(dynptr, 0, sizeof(*sample));
		if (!sample) {
			bpf_printk("Unexpectedly failed to get sample\n");
			err = 2;
			return 0;
		}
		stack_sample = *sample;
	}

	__sync_fetch_and_add(&read, 1);
	return 0;
}

static void
handle_sample_msg(const struct test_msg *msg)
{
	switch (msg->msg_op) {
	case TEST_MSG_OP_INC64:
		kern_mutated += msg->operand_64;
		break;
	case TEST_MSG_OP_INC32:
		kern_mutated += msg->operand_32;
		break;
	case TEST_MSG_OP_MUL64:
		kern_mutated *= msg->operand_64;
		break;
	case TEST_MSG_OP_MUL32:
		kern_mutated *= msg->operand_32;
		break;
	default:
		bpf_printk("Unrecognized op %d\n", msg->msg_op);
		err = 2;
	}
}

static long
read_protocol_msg(struct bpf_dynptr *dynptr, void *context)
{
	const struct test_msg *msg = NULL;

	msg = bpf_dynptr_data(dynptr, 0, sizeof(*msg));
	if (!msg) {
		err = 1;
		bpf_printk("Unexpectedly failed to get msg\n");
		return 0;
	}

	handle_sample_msg(msg);

	return 0;
}

static int publish_next_kern_msg(__u32 index, void *context)
{
	struct test_msg *msg = NULL;
	__s64 operand_64 = TEST_OP_64;
	__s32 operand_32 = TEST_OP_32;

	msg = bpf_ringbuf_reserve(&kernel_ringbuf, sizeof(*msg), 0);
	if (!msg) {
		err = 4;
		return 1;
	}

	switch (index % TEST_MSG_OP_NUM_OPS) {
	case TEST_MSG_OP_INC64:
		msg->operand_64 = operand_64;
		msg->msg_op = TEST_MSG_OP_INC64;
		expected_user_mutated += operand_64;
		break;
	case TEST_MSG_OP_INC32:
		msg->operand_32 = operand_32;
		msg->msg_op = TEST_MSG_OP_INC32;
		expected_user_mutated += operand_32;
		break;
	case TEST_MSG_OP_MUL64:
		msg->operand_64 = operand_64;
		msg->msg_op = TEST_MSG_OP_MUL64;
		expected_user_mutated *= operand_64;
		break;
	case TEST_MSG_OP_MUL32:
		msg->operand_32 = operand_32;
		msg->msg_op = TEST_MSG_OP_MUL32;
		expected_user_mutated *= operand_32;
		break;
	default:
		bpf_ringbuf_discard(msg, 0);
		err = 5;
		return 1;
	}

	bpf_ringbuf_submit(msg, 0);

	return 0;
}

static void
publish_kern_messages(void)
{
	if (expected_user_mutated != user_mutated) {
		bpf_printk("%lu != %lu\n", expected_user_mutated, user_mutated);
		err = 3;
		return;
	}

	bpf_loop(8, publish_next_kern_msg, NULL, 0);
}

SEC("fentry/" SYS_PREFIX "sys_prctl")
int test_user_ringbuf_protocol(void *ctx)
{
	long status = 0;

	if (!is_test_process())
		return 0;

	status = bpf_user_ringbuf_drain(&user_ringbuf, read_protocol_msg, NULL, 0);
	if (status < 0) {
		bpf_printk("Drain returned: %ld\n", status);
		err = 1;
		return 0;
	}

	publish_kern_messages();

	return 0;
}

SEC("fentry/" SYS_PREFIX "sys_getpgid")
int test_user_ringbuf(void *ctx)
{
	if (!is_test_process())
		return 0;

	err = bpf_user_ringbuf_drain(&user_ringbuf, record_sample, NULL, 0);

	return 0;
}

static long
do_nothing_cb(struct bpf_dynptr *dynptr, void *context)
{
	__sync_fetch_and_add(&read, 1);
	return 0;
}

SEC("fentry/" SYS_PREFIX "sys_getrlimit")
int test_user_ringbuf_epoll(void *ctx)
{
	long num_samples;

	if (!is_test_process())
		return 0;

	num_samples = bpf_user_ringbuf_drain(&user_ringbuf, do_nothing_cb, NULL, 0);
	if (num_samples <= 0)
		err = 1;

	return 0;
}
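The kernel_ringbuf map above carries the kernel-to-user half of the protocol, which user space consumes through the long-standing ring_buffer API, as drain_kernel_messages_buffer() does in the test. A minimal sketch of that consumer side, with a hypothetical callback name and error handling trimmed:

#include <bpf/libbpf.h>
#include <stdio.h>

/* Consume every pending sample from a BPF_MAP_TYPE_RINGBUF map fd; the
 * callback runs once per sample that the kernel has submitted.
 */
static int on_sample(void *ctx, void *data, size_t len)
{
	printf("got %zu-byte sample\n", len);
	return 0; /* a non-zero return aborts the consume loop */
}

static int consume_all(int map_fd)
{
	struct ring_buffer *rb;
	int n;

	rb = ring_buffer__new(map_fd, on_sample, NULL, NULL);
	if (!rb)
		return -1;

	n = ring_buffer__consume(rb); /* number of samples consumed, or < 0 */
	ring_buffer__free(rb);
	return n;
}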