Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf: Add overwrite mode for BPF ring buffer

When the BPF ring buffer is full, a new event cannot be recorded until one
or more old events are consumed to make enough space for it. In cases such
as fault diagnostics, where recent events are more useful than older ones,
this mechanism may lead to critical events being lost.

So add an overwrite mode for the BPF ring buffer to address this. In this
mode, the new event overwrites the oldest event(s) when the buffer is full.

The basic idea is as follows:

1. producer_pos tracks the next position at which to record a new event. When
there is enough free space, the producer simply advances producer_pos to
make space for the new event.

2. To avoid waiting for the consumer when the buffer is full, a new variable,
overwrite_pos, is introduced for producers. It points to the oldest event
committed in the buffer. When the buffer is full, the producer advances it
past one or more of the oldest events to make space for the new event.

3. pending_pos tracks the oldest event that has not yet been committed.
producer_pos is never advanced more than the buffer size past pending_pos,
so multiple producers never write to the same position at the same time.
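The space check and overwrite advance described above can be sketched as a
minimal userspace model in plain C. This is illustrative only: names such as
rb_model and rb_reserve are ours, per-record headers, BUSY bits and locking
are abstracted away, and pending_pos is assumed to be maintained by the
caller (the kernel recomputes it on each reserve).

```c
#include <assert.h>
#include <stdbool.h>

#define RB_SIZE 4096UL
#define RB_MASK (RB_SIZE - 1)

/* Model of the ring buffer positions; all grow monotonically. */
struct rb_model {
	unsigned long producer_pos;   /* next position to record a new event */
	unsigned long overwrite_pos;  /* oldest event still kept in the buffer */
	unsigned long pending_pos;    /* oldest event not yet committed */
	unsigned long consumer_pos;
	bool overwrite_mode;
};

/* Pending (uncommitted) records must never be overwritten; in
 * non-overwrite mode the consumer position limits us as well. */
static bool rb_has_space(const struct rb_model *rb, unsigned long new_prod_pos)
{
	if (new_prod_pos - rb->pending_pos > RB_MASK)
		return false;
	if (rb->overwrite_mode)
		return true;
	return new_prod_pos - rb->consumer_pos <= RB_MASK;
}

/* Reserve `size` bytes (sizes here include any record header, as in the
 * walkthrough below). `rec_sizes`/`nr_recs` list the committed records
 * starting at overwrite_pos, so discarding stays on record boundaries. */
static bool rb_reserve(struct rb_model *rb, unsigned long size,
		       const unsigned long *rec_sizes, int nr_recs)
{
	unsigned long new_prod_pos = rb->producer_pos + size;
	int i = 0;

	if (!rb_has_space(rb, new_prod_pos))
		return false;

	/* Overwrite mode: discard whole old records until the new one fits. */
	while (rb->overwrite_mode &&
	       new_prod_pos - rb->overwrite_pos > RB_MASK && i < nr_recs)
		rb->overwrite_pos += rec_sizes[i++];

	rb->producer_pos = new_prod_pos;
	return true;
}
```

For example, starting from the state right before step 7 of the walkthrough
below (producer_pos = 3584, pending_pos = 1536), reserving 1536 bytes
discards the 512-byte and 1024-byte records at the tail and advances
overwrite_pos to 1536.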

The following example diagrams show how it works in a 4096-byte ring buffer.

1. At first, {producer,overwrite,pending,consumer}_pos are all set to 0.

0       512     1024    1536    2048    2560    3072    3584    4096
+---------------------------------------------------------------+
|                                                               |
|                                                               |
|                                                               |
+---------------------------------------------------------------+
^
|
|
producer_pos = 0
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

2. Now reserve a 512-byte event A.

There is enough free space, so A is allocated at offset 0 and producer_pos is
advanced to 512, the end of A. Since A has not been submitted yet, its BUSY
bit is set.

0       512     1024    1536    2048    2560    3072    3584    4096
+-------+-------------------------------------------------------+
|       |                                                       |
|   A   |                                                       |
| [BUSY]|                                                       |
+-------+-------------------------------------------------------+
^       ^
|       |
|       |
|       producer_pos = 512
|
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

3. Reserve event B, size 1024.

B is allocated at offset 512 with BUSY bit set, and producer_pos is advanced
to the end of B.

0       512     1024    1536    2048    2560    3072    3584    4096
+-------+---------------+---------------------------------------+
|       |               |                                       |
|   A   |       B       |                                       |
| [BUSY]|     [BUSY]    |                                       |
+-------+---------------+---------------------------------------+
^                       ^
|                       |
|                       |
|                       producer_pos = 1536
|
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

4. Reserve event C, size 2048.

C is allocated at offset 1536, and producer_pos is advanced to 3584.

0       512     1024    1536    2048    2560    3072    3584    4096
+-------+---------------+-------------------------------+-------+
|       |               |                               |       |
|   A   |       B       |               C               |       |
| [BUSY]|     [BUSY]    |             [BUSY]            |       |
+-------+---------------+-------------------------------+-------+
^                                                       ^
|                                                       |
|                                                       |
|                                                       producer_pos = 3584
|
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

5. Submit event A.

The BUSY bit of A is cleared. B becomes the oldest event to be committed, so
pending_pos is advanced to 512, the start of B.

0       512     1024    1536    2048    2560    3072    3584    4096
+-------+---------------+-------------------------------+-------+
|       |               |                               |       |
|   A   |       B       |               C               |       |
|       |     [BUSY]    |             [BUSY]            |       |
+-------+---------------+-------------------------------+-------+
^       ^                                               ^
|       |                                               |
|       |                                               |
|       pending_pos = 512                               producer_pos = 3584
|
overwrite_pos = 0
consumer_pos = 0

6. Submit event B.

The BUSY bit of B is cleared, and pending_pos is advanced to the start of C,
which is now the oldest event to be committed.

0       512     1024    1536    2048    2560    3072    3584    4096
+-------+---------------+-------------------------------+-------+
|       |               |                               |       |
|   A   |       B       |               C               |       |
|       |               |             [BUSY]            |       |
+-------+---------------+-------------------------------+-------+
^                       ^                               ^
|                       |                               |
|                       |                               |
|                       pending_pos = 1536              producer_pos = 3584
|
overwrite_pos = 0
consumer_pos = 0

7. Reserve event D, size 1536 (3 * 512).

There are 2048 bytes that are not being written between producer_pos
(currently 3584) and pending_pos (1536), counting the wrap-around, so D is
allocated at offset 3584, and producer_pos is advanced by 1536 (from 3584 to
5120).

Since event D overwrites all of event A and the first 512 bytes of event B,
overwrite_pos is advanced to the start of event C, the oldest event that is
not overwritten.

0       512     1024    1536    2048    2560    3072    3584    4096
+---------------+-------+-------------------------------+-------+
|               |       |                               |       |
|     D End     |       |               C               |D Begin|
|     [BUSY]    |       |             [BUSY]            | [BUSY]|
+---------------+-------+-------------------------------+-------+
^               ^       ^
|               |       |
|               |       pending_pos = 1536
|               |       overwrite_pos = 1536
|               |
|               producer_pos = 5120
|
consumer_pos = 0

8. Reserve event E, size 1024.

Although there are 512 bytes that are not being written between producer_pos
and pending_pos, E cannot be reserved: it would overwrite the first 512 bytes
of event C, which is still being written.
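Checked with the positions above: the proposed end of E would be
5120 + 1024 = 6144, and 6144 - pending_pos (1536) = 4608, which exceeds
ringbuf_size - 1 = 4095, so the reservation is rejected. A tiny sanity check
of that arithmetic in plain C (the function name is ours, not the kernel's):

```c
#include <stdbool.h>

/* Replays the step-8 arithmetic: can a 1024-byte event E be reserved when
 * producer_pos = 5120 and pending_pos = 1536 (C is still BUSY)? */
static bool can_reserve_event_e(void)
{
	const unsigned long mask = 4096 - 1;          /* ringbuf_size - 1 */
	unsigned long prod_pos = 5120;                /* after reserving D */
	unsigned long pend_pos = 1536;                /* start of BUSY C   */
	unsigned long new_prod_pos = prod_pos + 1024; /* proposed end of E */

	/* Same condition the kernel checks: the span from the oldest
	 * uncommitted record to the new record end must stay within
	 * (ringbuf_size - 1) bytes. Here 6144 - 1536 = 4608 > 4095. */
	return new_prod_pos - pend_pos <= mask;
}
```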

9. Submit events C and D.

pending_pos is advanced to the end of D.

0       512     1024    1536    2048    2560    3072    3584    4096
+---------------+-------+-------------------------------+-------+
|               |       |                               |       |
|     D End     |       |               C               |D Begin|
|               |       |                               |       |
+---------------+-------+-------------------------------+-------+
^               ^       ^
|               |       |
|               |       overwrite_pos = 1536
|               |
|               producer_pos = 5120
|               pending_pos = 5120
|
consumer_pos = 0
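The whole walkthrough can be replayed as a small userspace model in plain C.
This is purely illustrative, not kernel code: each reservation size is taken
to be the full space the record occupies (as in the diagrams), the BUSY bit
is reduced to a per-record flag, and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define RB_SIZE 4096UL
#define RB_MASK (RB_SIZE - 1)
#define MAX_RECS 16

/* One reserved record; `busy` stands in for the BUSY bit in the header. */
struct rec { unsigned long pos, size; bool busy; };

static struct rec recs[MAX_RECS];
static int nr_recs;
static unsigned long producer_pos, overwrite_pos, pending_pos;

/* Advance pending_pos over committed records, like the loop at the start
 * of the kernel's __bpf_ringbuf_reserve(). */
static void advance_pending(void)
{
	for (int i = 0; i < nr_recs; i++) {
		if (recs[i].pos != pending_pos || recs[i].busy)
			continue;
		pending_pos += recs[i].size;
		i = -1; /* restart the scan from the new pending_pos */
	}
}

/* Reserve `size` bytes in overwrite mode; returns a record id, or -1. */
static int reserve(unsigned long size)
{
	unsigned long new_prod_pos = producer_pos + size;

	advance_pending();
	/* Never step over a record that is still being written. */
	if (new_prod_pos - pending_pos > RB_MASK)
		return -1;
	/* Discard whole old records until the new record fits. */
	while (new_prod_pos - overwrite_pos > RB_MASK) {
		bool found = false;

		for (int i = 0; i < nr_recs; i++) {
			if (recs[i].pos == overwrite_pos) {
				overwrite_pos += recs[i].size;
				found = true;
				break;
			}
		}
		assert(found); /* overwrite_pos always sits on a record start */
	}
	recs[nr_recs] = (struct rec){ producer_pos, size, true };
	producer_pos = new_prod_pos;
	return nr_recs++;
}

static void submit(int id) { recs[id].busy = false; }
```

Replaying steps 2-9 with this model reproduces the positions shown in the
diagrams: after reserving D, producer_pos = 5120 and overwrite_pos = 1536;
the attempt to reserve E fails while C is busy; and after submitting C and D,
pending_pos catches up to 5120.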

The performance data for overwrite mode will be provided in a follow-up
patch that adds overwrite-mode benchmarks.

A sample of performance data for non-overwrite mode, collected on an x86_64
CPU and an arm64 CPU before and after this patch, is shown below. As we can
see, there is no obvious performance regression.

- x86_64 (AMD EPYC 9654)

Before:

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 11.623 ± 0.027M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2 15.812 ± 0.014M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 7.871 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 6.703 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 2.896 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 2.054 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 1.864 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 1.580 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 1.484 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 1.369 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 1.316 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 1.272 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 1.239 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 1.226 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 1.213 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 1.193 ± 0.001M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 11.845 ± 0.036M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2 15.889 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 8.155 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 6.708 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 2.918 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 2.065 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 1.870 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 1.582 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 1.482 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 1.372 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 1.323 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 1.264 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 1.236 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 1.209 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 1.189 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 1.165 ± 0.002M/s (drops 0.000 ± 0.000M/s)

- arm64 (HiSilicon Kunpeng 920)

Before:

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 11.310 ± 0.623M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2 9.947 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 6.634 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 4.502 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 3.888 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 3.372 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 3.189 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 2.998 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 3.086 ± 0.018M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 2.845 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 2.815 ± 0.008M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 2.771 ± 0.009M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 2.814 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 2.752 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 2.695 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 2.710 ± 0.006M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 11.283 ± 0.550M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2 9.993 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 6.898 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 5.257 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 3.830 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 3.528 ± 0.013M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 3.265 ± 0.018M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 2.990 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 2.929 ± 0.014M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 2.898 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 2.818 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 2.789 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 2.770 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 2.651 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 2.669 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 2.695 ± 0.009M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251018035738.4039621-2-xukuohai@huaweicloud.com

Authored by Xu Kuohai, committed by Andrii Nakryiko (feeaf134 ff880798)

3 files changed: +103 -19

include/uapi/linux/bpf.h (+4)

···
 	/* Do not translate kernel bpf_arena pointers to user pointers */
 	BPF_F_NO_USER_CONV = (1U << 18),
+
+	/* Enable BPF ringbuf overwrite mode */
+	BPF_F_RB_OVERWRITE = (1U << 19),
 };
 
 /* Flags for BPF_PROG_QUERY. */
···
 	BPF_RB_RING_SIZE = 1,
 	BPF_RB_CONS_POS = 2,
 	BPF_RB_PROD_POS = 3,
+	BPF_RB_OVERWRITE_POS = 4,
 };
 
 /* BPF ring buffer constants */
kernel/bpf/ringbuf.c (+95 -19)

···
 #include <linux/btf_ids.h>
 #include <asm/rqspinlock.h>
 
-#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE)
+#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_RB_OVERWRITE)
 
 /* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */
 #define RINGBUF_PGOFF \
···
 	u64 mask;
 	struct page **pages;
 	int nr_pages;
+	bool overwrite_mode;
 	rqspinlock_t spinlock ____cacheline_aligned_in_smp;
 	/* For user-space producer ring buffers, an atomic_t busy bit is used
 	 * to synchronize access to the ring buffers in the kernel, rather than
···
 	unsigned long consumer_pos __aligned(PAGE_SIZE);
 	unsigned long producer_pos __aligned(PAGE_SIZE);
 	unsigned long pending_pos;
+	unsigned long overwrite_pos; /* position after the last overwritten record */
 	char data[] __aligned(PAGE_SIZE);
 };
···
  * considering that the maximum value of data_sz is (4GB - 1), there
  * will be no overflow, so just note the size limit in the comments.
  */
-static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
+static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node, bool overwrite_mode)
 {
 	struct bpf_ringbuf *rb;
 
···
 	rb->consumer_pos = 0;
 	rb->producer_pos = 0;
 	rb->pending_pos = 0;
+	rb->overwrite_mode = overwrite_mode;
 
 	return rb;
 }
 
 static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 {
+	bool overwrite_mode = false;
 	struct bpf_ringbuf_map *rb_map;
 
 	if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
+
+	if (attr->map_flags & BPF_F_RB_OVERWRITE) {
+		if (attr->map_type != BPF_MAP_TYPE_RINGBUF)
+			return ERR_PTR(-EINVAL);
+		overwrite_mode = true;
+	}
 
 	if (attr->key_size || attr->value_size ||
 	    !is_power_of_2(attr->max_entries) ||
···
 
 	bpf_map_init_from_attr(&rb_map->map, attr);
 
-	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
+	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node, overwrite_mode);
 	if (!rb_map->rb) {
 		bpf_map_area_free(rb_map);
 		return ERR_PTR(-ENOMEM);
···
 	return remap_vmalloc_range(vma, rb_map->rb, vma->vm_pgoff + RINGBUF_PGOFF);
 }
 
+/*
+ * Return an estimate of the available data in the ring buffer.
+ * Note: the returned value can exceed the actual ring buffer size because the
+ * function is not synchronized with the producer. The producer acquires the
+ * ring buffer's spinlock, but this function does not.
+ */
 static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb)
 {
-	unsigned long cons_pos, prod_pos;
+	unsigned long cons_pos, prod_pos, over_pos;
 
 	cons_pos = smp_load_acquire(&rb->consumer_pos);
-	prod_pos = smp_load_acquire(&rb->producer_pos);
-	return prod_pos - cons_pos;
+
+	if (unlikely(rb->overwrite_mode)) {
+		over_pos = smp_load_acquire(&rb->overwrite_pos);
+		prod_pos = smp_load_acquire(&rb->producer_pos);
+		return prod_pos - max(cons_pos, over_pos);
+	} else {
+		prod_pos = smp_load_acquire(&rb->producer_pos);
+		return prod_pos - cons_pos;
+	}
 }
 
 static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
···
 	return (void*)((addr & PAGE_MASK) - off);
 }
 
+static bool bpf_ringbuf_has_space(const struct bpf_ringbuf *rb,
+				  unsigned long new_prod_pos,
+				  unsigned long cons_pos,
+				  unsigned long pend_pos)
+{
+	/*
+	 * No space if the span from the oldest not-yet-committed record to
+	 * the newest record exceeds (ringbuf_size - 1).
+	 */
+	if (new_prod_pos - pend_pos > rb->mask)
+		return false;
+
+	/* OK, we have space in overwrite mode */
+	if (unlikely(rb->overwrite_mode))
+		return true;
+
+	/*
+	 * No space if the producer position advances more than
+	 * (ringbuf_size - 1) ahead of the consumer position when not in
+	 * overwrite mode.
+	 */
+	if (new_prod_pos - cons_pos > rb->mask)
+		return false;
+
+	return true;
+}
+
+static u32 bpf_ringbuf_round_up_hdr_len(u32 hdr_len)
+{
+	hdr_len &= ~BPF_RINGBUF_DISCARD_BIT;
+	return round_up(hdr_len + BPF_RINGBUF_HDR_SZ, 8);
+}
+
 static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
 {
-	unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;
+	unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, over_pos, flags;
 	struct bpf_ringbuf_hdr *hdr;
-	u32 len, pg_off, tmp_size, hdr_len;
+	u32 len, pg_off, hdr_len;
 
 	if (unlikely(size > RINGBUF_MAX_RECORD_SZ))
 		return NULL;
···
 		hdr_len = READ_ONCE(hdr->len);
 		if (hdr_len & BPF_RINGBUF_BUSY_BIT)
 			break;
-		tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT;
-		tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);
-		pend_pos += tmp_size;
+		pend_pos += bpf_ringbuf_round_up_hdr_len(hdr_len);
 	}
 	rb->pending_pos = pend_pos;
 
-	/* check for out of ringbuf space:
-	 * - by ensuring producer position doesn't advance more than
-	 *   (ringbuf_size - 1) ahead
-	 * - by ensuring oldest not yet committed record until newest
-	 *   record does not span more than (ringbuf_size - 1)
-	 */
-	if (new_prod_pos - cons_pos > rb->mask ||
-	    new_prod_pos - pend_pos > rb->mask) {
+	if (!bpf_ringbuf_has_space(rb, new_prod_pos, cons_pos, pend_pos)) {
 		raw_res_spin_unlock_irqrestore(&rb->spinlock, flags);
 		return NULL;
 	}
+
+	/*
+	 * In overwrite mode, advance overwrite_pos when the ring buffer is
+	 * full. The key points are to stay on record boundaries and consume
+	 * enough records to fit the new one.
+	 */
+	if (unlikely(rb->overwrite_mode)) {
+		over_pos = rb->overwrite_pos;
+		while (new_prod_pos - over_pos > rb->mask) {
+			hdr = (void *)rb->data + (over_pos & rb->mask);
+			hdr_len = READ_ONCE(hdr->len);
+			/*
+			 * The bpf_ringbuf_has_space() check above ensures we
+			 * won't step over a record currently being worked on
+			 * by another producer.
+			 */
+			over_pos += bpf_ringbuf_round_up_hdr_len(hdr_len);
+		}
+		/*
+		 * smp_store_release(&rb->producer_pos, new_prod_pos) at the
+		 * end of the function ensures that when the consumer sees the
+		 * updated rb->producer_pos, it also sees the updated
+		 * rb->overwrite_pos, so an overwrite_pos read after
+		 * smp_load_acquire(&rb->producer_pos) is always valid.
+		 */
+		WRITE_ONCE(rb->overwrite_pos, over_pos);
+	}
 
 	hdr = (void *)rb->data + (prod_pos & rb->mask);
···
 		return smp_load_acquire(&rb->consumer_pos);
 	case BPF_RB_PROD_POS:
 		return smp_load_acquire(&rb->producer_pos);
+	case BPF_RB_OVERWRITE_POS:
+		return smp_load_acquire(&rb->overwrite_pos);
 	default:
 		return 0;
 	}
tools/include/uapi/linux/bpf.h (+4)

···
 	/* Do not translate kernel bpf_arena pointers to user pointers */
 	BPF_F_NO_USER_CONV = (1U << 18),
+
+	/* Enable BPF ringbuf overwrite mode */
+	BPF_F_RB_OVERWRITE = (1U << 19),
 };
 
 /* Flags for BPF_PROG_QUERY. */
···
 	BPF_RB_RING_SIZE = 1,
 	BPF_RB_CONS_POS = 2,
 	BPF_RB_PROD_POS = 3,
+	BPF_RB_OVERWRITE_POS = 4,
 };
 
 /* BPF ring buffer constants */