
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2021-06-28

The following pull-request contains BPF updates for your *net-next* tree.

We've added 37 non-merge commits during the last 12 day(s) which contain
a total of 56 files changed, 394 insertions(+), 380 deletions(-).

The main changes are:

1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney.

2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski.

3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev.

4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria.

5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards.

6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa.

7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi.

8) Allow retrieving the ancestor cgroup id in tracing programs, from Namhyung Kim.

9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young.

10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer.

11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski.

12) Fix up bpfilter startup log-level to info level, from Gary Lin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

+393 -379
+33 -20
Documentation/RCU/checklist.rst
··· 211 211 of the system, especially to real-time workloads running on 212 212 the rest of the system. 213 213 214 - 7. As of v4.20, a given kernel implements only one RCU flavor, 215 - which is RCU-sched for PREEMPTION=n and RCU-preempt for PREEMPTION=y. 216 - If the updater uses call_rcu() or synchronize_rcu(), 217 - then the corresponding readers may use rcu_read_lock() and 218 - rcu_read_unlock(), rcu_read_lock_bh() and rcu_read_unlock_bh(), 219 - or any pair of primitives that disables and re-enables preemption, 220 - for example, rcu_read_lock_sched() and rcu_read_unlock_sched(). 221 - If the updater uses synchronize_srcu() or call_srcu(), 222 - then the corresponding readers must use srcu_read_lock() and 223 - srcu_read_unlock(), and with the same srcu_struct. The rules for 224 - the expedited primitives are the same as for their non-expedited 225 - counterparts. Mixing things up will result in confusion and 226 - broken kernels, and has even resulted in an exploitable security 227 - issue. 214 + 7. As of v4.20, a given kernel implements only one RCU flavor, which 215 + is RCU-sched for PREEMPTION=n and RCU-preempt for PREEMPTION=y. 216 + If the updater uses call_rcu() or synchronize_rcu(), then 217 + the corresponding readers may use: (1) rcu_read_lock() and 218 + rcu_read_unlock(), (2) any pair of primitives that disables 219 + and re-enables softirq, for example, rcu_read_lock_bh() and 220 + rcu_read_unlock_bh(), or (3) any pair of primitives that disables 221 + and re-enables preemption, for example, rcu_read_lock_sched() and 222 + rcu_read_unlock_sched(). If the updater uses synchronize_srcu() 223 + or call_srcu(), then the corresponding readers must use 224 + srcu_read_lock() and srcu_read_unlock(), and with the same 225 + srcu_struct. The rules for the expedited RCU grace-period-wait 226 + primitives are the same as for their non-expedited counterparts. 
228 227 229 - One exception to this rule: rcu_read_lock() and rcu_read_unlock() 230 - may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() 231 - in cases where local bottom halves are already known to be 232 - disabled, for example, in irq or softirq context. Commenting 233 - such cases is a must, of course! And the jury is still out on 234 - whether the increased speed is worth it. 228 + If the updater uses call_rcu_tasks() or synchronize_rcu_tasks(), 229 + then the readers must refrain from executing voluntary 230 + context switches, that is, from blocking. If the updater uses 231 + call_rcu_tasks_trace() or synchronize_rcu_tasks_trace(), then 232 + the corresponding readers must use rcu_read_lock_trace() and 233 + rcu_read_unlock_trace(). If an updater uses call_rcu_tasks_rude() 234 + or synchronize_rcu_tasks_rude(), then the corresponding readers 235 + must use anything that disables interrupts. 236 + 237 + Mixing things up will result in confusion and broken kernels, and 238 + has even resulted in an exploitable security issue. Therefore, 239 + when using non-obvious pairs of primitives, commenting is 240 + of course a must. One example of non-obvious pairing is 241 + the XDP feature in networking, which calls BPF programs from 242 + network-driver NAPI (softirq) context. BPF relies heavily on RCU 243 + protection for its data structures, but because the BPF program 244 + invocation happens entirely within a single local_bh_disable() 245 + section in a NAPI poll cycle, this usage is safe. The reason 246 + that this usage is safe is that readers can use anything that 247 + disables BH when updaters use call_rcu() or synchronize_rcu(). 235 248 236 249 8. Although synchronize_rcu() is slower than is call_rcu(), it 237 250 usually results in simpler code. So, unless update performance is
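The reader/updater pairing rules in the rewritten checklist text above can be sketched in kernel-style C. This is an illustrative fragment only, not a buildable unit: the `cfg` structure, `cfg_mutex`, and both functions are hypothetical names introduced for the sketch.

```c
/*
 * Illustrative kernel-style fragment (not a standalone program) showing
 * one matched reader/updater pair from the checklist text above.  The
 * "cfg" structure, cfg_mutex, and both functions are hypothetical.
 */
struct cfg {
	int val;
	struct rcu_head rcu;
};

static struct cfg __rcu *active_cfg;
static DEFINE_MUTEX(cfg_mutex);

/* Updater: publish a new version under cfg_mutex and reclaim the old
 * one only after an RCU grace period has elapsed. */
static void update_cfg(struct cfg *newc)
{
	struct cfg *oldc;

	mutex_lock(&cfg_mutex);
	oldc = rcu_replace_pointer(active_cfg, newc,
				   lockdep_is_held(&cfg_mutex));
	mutex_unlock(&cfg_mutex);
	if (oldc)
		kfree_rcu(oldc, rcu);
}

/* Reader: because the updater uses an RCU grace period, any of
 * rcu_read_lock(), rcu_read_lock_bh(), or rcu_read_lock_sched() may
 * mark the read side.  A softirq region such as a NAPI poll cycle
 * (the XDP case described above) already runs under
 * local_bh_disable() and therefore needs no extra marking. */
static int read_cfg_val(void)
{
	struct cfg *c;
	int val = 0;

	rcu_read_lock();
	c = rcu_dereference(active_cfg);
	if (c)
		val = c->val;
	rcu_read_unlock();
	return val;
}
```

Had the updater instead used `call_srcu()`, the reader would have to switch to `srcu_read_lock()`/`srcu_read_unlock()` on the same `srcu_struct`; mixing the pairs is exactly what the checklist warns against.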
+13
Documentation/bpf/index.rst
··· 12 12 The Cilium project also maintains a `BPF and XDP Reference Guide`_ 13 13 that goes into great technical depth about the BPF Architecture. 14 14 15 + libbpf 16 + ====== 17 + 18 + Libbpf is a userspace library for loading and interacting with bpf programs. 19 + 20 + .. toctree:: 21 + :maxdepth: 1 22 + 23 + libbpf/libbpf 24 + libbpf/libbpf_api 25 + libbpf/libbpf_build 26 + libbpf/libbpf_naming_convention 27 + 15 28 BPF Type Format (BTF) 16 29 ===================== 17 30
+14
Documentation/bpf/libbpf/libbpf.rst
··· 1 + .. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) 2 + 3 + libbpf 4 + ====== 5 + 6 + This is documentation for libbpf, a userspace library for loading and 7 + interacting with bpf programs. 8 + 9 + All general BPF questions, including kernel functionality, libbpf APIs and 10 + their application, should be sent to the bpf@vger.kernel.org mailing list. 11 + You can `subscribe <http://vger.kernel.org/vger-lists.html#bpf>`_ to the 12 + mailing list and search its `archive <https://lore.kernel.org/bpf/>`_. 13 + Please search the archive before asking new questions. It very well might 14 + be that this was already addressed or answered before.
+27
Documentation/bpf/libbpf/libbpf_api.rst
··· 1 + .. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) 2 + 3 + API 4 + === 5 + 6 + This documentation is autogenerated from header files in libbpf, tools/lib/bpf 7 + 8 + .. kernel-doc:: tools/lib/bpf/libbpf.h 9 + :internal: 10 + 11 + .. kernel-doc:: tools/lib/bpf/bpf.h 12 + :internal: 13 + 14 + .. kernel-doc:: tools/lib/bpf/btf.h 15 + :internal: 16 + 17 + .. kernel-doc:: tools/lib/bpf/xsk.h 18 + :internal: 19 + 20 + .. kernel-doc:: tools/lib/bpf/bpf_tracing.h 21 + :internal: 22 + 23 + .. kernel-doc:: tools/lib/bpf/bpf_core_read.h 24 + :internal: 25 + 26 + .. kernel-doc:: tools/lib/bpf/bpf_endian.h 27 + :internal:
+37
Documentation/bpf/libbpf/libbpf_build.rst
··· 1 + .. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) 2 + 3 + Building libbpf 4 + =============== 5 + 6 + libelf and zlib are internal dependencies of libbpf and thus are required to link 7 + against and must be installed on the system for applications to work. 8 + pkg-config is used by default to find libelf, and the program called 9 + can be overridden with PKG_CONFIG. 10 + 11 + If using pkg-config at build time is not desired, it can be disabled by 12 + setting NO_PKG_CONFIG=1 when calling make. 13 + 14 + To build both static libbpf.a and shared libbpf.so: 15 + 16 + .. code-block:: bash 17 + 18 + $ cd src 19 + $ make 20 + 21 + To build only static libbpf.a library in directory build/ and install them 22 + together with libbpf headers in a staging directory root/: 23 + 24 + .. code-block:: bash 25 + 26 + $ cd src 27 + $ mkdir build root 28 + $ BUILD_STATIC_ONLY=y OBJDIR=build DESTDIR=root make install 29 + 30 + To build both static libbpf.a and shared libbpf.so against a custom libelf 31 + dependency installed in /build/root/ and install them together with libbpf 32 + headers in a build directory /build/root/: 33 + 34 + .. code-block:: bash 35 + 36 + $ cd src 37 + $ PKG_CONFIG_PATH=/build/root/lib64/pkgconfig DESTDIR=/build/root make
+16 -16
Documentation/networking/af_xdp.rst
··· 290 290 #define MAX_SOCKS 16 291 291 292 292 struct { 293 - __uint(type, BPF_MAP_TYPE_XSKMAP); 294 - __uint(max_entries, MAX_SOCKS); 295 - __uint(key_size, sizeof(int)); 296 - __uint(value_size, sizeof(int)); 293 + __uint(type, BPF_MAP_TYPE_XSKMAP); 294 + __uint(max_entries, MAX_SOCKS); 295 + __uint(key_size, sizeof(int)); 296 + __uint(value_size, sizeof(int)); 297 297 } xsks_map SEC(".maps"); 298 298 299 299 static unsigned int rr; 300 300 301 301 SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) 302 302 { 303 - rr = (rr + 1) & (MAX_SOCKS - 1); 303 + rr = (rr + 1) & (MAX_SOCKS - 1); 304 304 305 - return bpf_redirect_map(&xsks_map, rr, XDP_DROP); 305 + return bpf_redirect_map(&xsks_map, rr, XDP_DROP); 306 306 } 307 307 308 308 Note, that since there is only a single set of FILL and COMPLETION ··· 379 379 .. code-block:: c 380 380 381 381 if (xsk_ring_prod__needs_wakeup(&my_tx_ring)) 382 - sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0); 382 + sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0); 383 383 384 384 I.e., only use the syscall if the flag is set. 385 385 ··· 442 442 .. code-block:: c 443 443 444 444 struct xdp_statistics { 445 - __u64 rx_dropped; /* Dropped for reasons other than invalid desc */ 446 - __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */ 447 - __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ 445 + __u64 rx_dropped; /* Dropped for reasons other than invalid desc */ 446 + __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */ 447 + __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ 448 448 }; 449 449 450 450 XDP_OPTIONS getsockopt ··· 483 483 .. 
code-block:: c 484 484 485 485 // struct xdp_rxtx_ring { 486 - // __u32 *producer; 487 - // __u32 *consumer; 488 - // struct xdp_desc *desc; 486 + // __u32 *producer; 487 + // __u32 *consumer; 488 + // struct xdp_desc *desc; 489 489 // }; 490 490 491 491 // struct xdp_umem_ring { 492 - // __u32 *producer; 493 - // __u32 *consumer; 494 - // __u64 *desc; 492 + // __u32 *producer; 493 + // __u32 *consumer; 494 + // __u64 *desc; 495 495 // }; 496 496 497 497 // typedef struct xdp_rxtx_ring RING;
+13 -33
arch/x86/net/bpf_jit_comp.c
··· 31 31 } 32 32 33 33 #define EMIT(bytes, len) \ 34 - do { prog = emit_code(prog, bytes, len); cnt += len; } while (0) 34 + do { prog = emit_code(prog, bytes, len); } while (0) 35 35 36 36 #define EMIT1(b1) EMIT(b1, 1) 37 37 #define EMIT2(b1, b2) EMIT((b1) + ((b2) << 8), 2) ··· 239 239 static void push_callee_regs(u8 **pprog, bool *callee_regs_used) 240 240 { 241 241 u8 *prog = *pprog; 242 - int cnt = 0; 243 242 244 243 if (callee_regs_used[0]) 245 244 EMIT1(0x53); /* push rbx */ ··· 254 255 static void pop_callee_regs(u8 **pprog, bool *callee_regs_used) 255 256 { 256 257 u8 *prog = *pprog; 257 - int cnt = 0; 258 258 259 259 if (callee_regs_used[3]) 260 260 EMIT2(0x41, 0x5F); /* pop r15 */ ··· 275 277 bool tail_call_reachable, bool is_subprog) 276 278 { 277 279 u8 *prog = *pprog; 278 - int cnt = X86_PATCH_SIZE; 279 280 280 281 /* BPF trampoline can be made to work without these nops, 281 282 * but let's waste 5 bytes for now and optimize later 282 283 */ 283 - memcpy(prog, x86_nops[5], cnt); 284 - prog += cnt; 284 + memcpy(prog, x86_nops[5], X86_PATCH_SIZE); 285 + prog += X86_PATCH_SIZE; 285 286 if (!ebpf_from_cbpf) { 286 287 if (tail_call_reachable && !is_subprog) 287 288 EMIT2(0x31, 0xC0); /* xor eax, eax */ ··· 300 303 static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) 301 304 { 302 305 u8 *prog = *pprog; 303 - int cnt = 0; 304 306 s64 offset; 305 307 306 308 offset = func - (ip + X86_PATCH_SIZE); ··· 419 423 int off1 = 42; 420 424 int off2 = 31; 421 425 int off3 = 9; 422 - int cnt = 0; 423 426 424 427 /* count the additional bytes used for popping callee regs from stack 425 428 * that need to be taken into account for each of the offsets that ··· 508 513 int pop_bytes = 0; 509 514 int off1 = 20; 510 515 int poke_off; 511 - int cnt = 0; 512 516 513 517 /* count the additional bytes used for popping callee regs to stack 514 518 * that need to be taken into account for jump offset that is used for ··· 609 615 { 610 616 u8 *prog = *pprog; 611 617 
u8 b1, b2, b3; 612 - int cnt = 0; 613 618 614 619 /* 615 620 * Optimization: if imm32 is positive, use 'mov %eax, imm32' ··· 648 655 const u32 imm32_hi, const u32 imm32_lo) 649 656 { 650 657 u8 *prog = *pprog; 651 - int cnt = 0; 652 658 653 659 if (is_uimm32(((u64)imm32_hi << 32) | (u32)imm32_lo)) { 654 660 /* ··· 670 678 static void emit_mov_reg(u8 **pprog, bool is64, u32 dst_reg, u32 src_reg) 671 679 { 672 680 u8 *prog = *pprog; 673 - int cnt = 0; 674 681 675 682 if (is64) { 676 683 /* mov dst, src */ ··· 688 697 static void emit_insn_suffix(u8 **pprog, u32 ptr_reg, u32 val_reg, int off) 689 698 { 690 699 u8 *prog = *pprog; 691 - int cnt = 0; 692 700 693 701 if (is_imm8(off)) { 694 702 /* 1-byte signed displacement. ··· 710 720 static void maybe_emit_mod(u8 **pprog, u32 dst_reg, u32 src_reg, bool is64) 711 721 { 712 722 u8 *prog = *pprog; 713 - int cnt = 0; 714 723 715 724 if (is64) 716 725 EMIT1(add_2mod(0x48, dst_reg, src_reg)); ··· 722 733 static void emit_ldx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off) 723 734 { 724 735 u8 *prog = *pprog; 725 - int cnt = 0; 726 736 727 737 switch (size) { 728 738 case BPF_B: ··· 752 764 static void emit_stx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off) 753 765 { 754 766 u8 *prog = *pprog; 755 - int cnt = 0; 756 767 757 768 switch (size) { 758 769 case BPF_B: ··· 786 799 u32 dst_reg, u32 src_reg, s16 off, u8 bpf_size) 787 800 { 788 801 u8 *prog = *pprog; 789 - int cnt = 0; 790 802 791 803 EMIT1(0xF0); /* lock prefix */ 792 804 ··· 855 869 } 856 870 } 857 871 858 - static int emit_nops(u8 **pprog, int len) 872 + static void emit_nops(u8 **pprog, int len) 859 873 { 860 874 u8 *prog = *pprog; 861 - int i, noplen, cnt = 0; 875 + int i, noplen; 862 876 863 877 while (len > 0) { 864 878 noplen = len; ··· 872 886 } 873 887 874 888 *pprog = prog; 875 - 876 - return cnt; 877 889 } 878 890 879 891 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp))) ··· 886 902 bool tail_call_seen = false; 887 903 
bool seen_exit = false; 888 904 u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY]; 889 - int i, cnt = 0, excnt = 0; 905 + int i, excnt = 0; 890 906 int ilen, proglen = 0; 891 907 u8 *prog = temp; 892 908 int err; ··· 1281 1297 emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off); 1282 1298 if (BPF_MODE(insn->code) == BPF_PROBE_MEM) { 1283 1299 struct exception_table_entry *ex; 1284 - u8 *_insn = image + proglen; 1300 + u8 *_insn = image + proglen + (start_of_ldx - temp); 1285 1301 s64 delta; 1286 1302 1287 1303 /* populate jmp_offset for JMP above */ ··· 1560 1576 nops); 1561 1577 return -EFAULT; 1562 1578 } 1563 - cnt += emit_nops(&prog, nops); 1579 + emit_nops(&prog, nops); 1564 1580 } 1565 1581 EMIT2(jmp_cond, jmp_offset); 1566 1582 } else if (is_simm32(jmp_offset)) { ··· 1606 1622 nops); 1607 1623 return -EFAULT; 1608 1624 } 1609 - cnt += emit_nops(&prog, nops); 1625 + emit_nops(&prog, nops); 1610 1626 } 1611 1627 break; 1612 1628 } ··· 1631 1647 nops); 1632 1648 return -EFAULT; 1633 1649 } 1634 - cnt += emit_nops(&prog, INSN_SZ_DIFF - 2); 1650 + emit_nops(&prog, INSN_SZ_DIFF - 2); 1635 1651 } 1636 1652 EMIT2(0xEB, jmp_offset); 1637 1653 } else if (is_simm32(jmp_offset)) { ··· 1738 1754 { 1739 1755 u8 *prog = *pprog; 1740 1756 u8 *jmp_insn; 1741 - int cnt = 0; 1742 1757 1743 1758 /* arg1: mov rdi, progs[i] */ 1744 1759 emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, (u32) (long) p); ··· 1805 1822 static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond) 1806 1823 { 1807 1824 u8 *prog = *pprog; 1808 - int cnt = 0; 1809 1825 s64 offset; 1810 1826 1811 1827 offset = func - (ip + 2 + 4); ··· 1836 1854 u8 **branches) 1837 1855 { 1838 1856 u8 *prog = *pprog; 1839 - int i, cnt = 0; 1857 + int i; 1840 1858 1841 1859 /* The first fmod_ret program will receive a garbage return value. 1842 1860 * Set this to 0 to avoid confusing the program. 
··· 1932 1950 struct bpf_tramp_progs *tprogs, 1933 1951 void *orig_call) 1934 1952 { 1935 - int ret, i, cnt = 0, nr_args = m->nr_args; 1953 + int ret, i, nr_args = m->nr_args; 1936 1954 int stack_size = nr_args * 8; 1937 1955 struct bpf_tramp_progs *fentry = &tprogs[BPF_TRAMP_FENTRY]; 1938 1956 struct bpf_tramp_progs *fexit = &tprogs[BPF_TRAMP_FEXIT]; ··· 2077 2095 */ 2078 2096 err = emit_jump(&prog, __x86_indirect_thunk_rdx, prog); 2079 2097 #else 2080 - int cnt = 0; 2081 - 2082 2098 EMIT2(0xFF, 0xE2); /* jmp rdx */ 2083 2099 #endif 2084 2100 *pprog = prog; ··· 2086 2106 static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs) 2087 2107 { 2088 2108 u8 *jg_reloc, *prog = *pprog; 2089 - int pivot, err, jg_bytes = 1, cnt = 0; 2109 + int pivot, err, jg_bytes = 1; 2090 2110 s64 jg_offset; 2091 2111 2092 2112 if (a == b) {
+2 -1
drivers/media/rc/bpf-lirc.c
··· 326 326 } 327 327 328 328 if (attr->query.prog_cnt != 0 && prog_ids && cnt) 329 - ret = bpf_prog_array_copy_to_user(progs, prog_ids, cnt); 329 + ret = bpf_prog_array_copy_to_user(progs, prog_ids, 330 + attr->query.prog_cnt); 330 331 331 332 unlock: 332 333 mutex_unlock(&ir_raw_handler_lock);
-3
drivers/net/ethernet/amazon/ena/ena_netdev.c
··· 384 384 struct xdp_frame *xdpf; 385 385 u64 *xdp_stat; 386 386 387 - rcu_read_lock(); 388 387 xdp_prog = READ_ONCE(rx_ring->xdp_bpf_prog); 389 388 390 389 if (!xdp_prog) ··· 440 441 441 442 ena_increase_stat(xdp_stat, 1, &rx_ring->syncp); 442 443 out: 443 - rcu_read_unlock(); 444 - 445 444 return verdict; 446 445 } 447 446
-2
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
··· 138 138 xdp_prepare_buff(&xdp, *data_ptr - offset, offset, *len, false); 139 139 orig_data = xdp.data; 140 140 141 - rcu_read_lock(); 142 141 act = bpf_prog_run_xdp(xdp_prog, &xdp); 143 - rcu_read_unlock(); 144 142 145 143 tx_avail = bnxt_tx_avail(bp, txr); 146 144 /* If the tx ring is not full, we must not update the rx producer yet
-2
drivers/net/ethernet/cavium/thunder/nicvf_main.c
··· 555 555 xdp_prepare_buff(&xdp, hard_start, data - hard_start, len, false); 556 556 orig_data = xdp.data; 557 557 558 - rcu_read_lock(); 559 558 action = bpf_prog_run_xdp(prog, &xdp); 560 - rcu_read_unlock(); 561 559 562 560 len = xdp.data_end - xdp.data; 563 561 /* Check if XDP program has changed headers */
+1 -7
drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
··· 2558 2558 u32 xdp_act; 2559 2559 int err; 2560 2560 2561 - rcu_read_lock(); 2562 - 2563 2561 xdp_prog = READ_ONCE(priv->xdp_prog); 2564 - if (!xdp_prog) { 2565 - rcu_read_unlock(); 2562 + if (!xdp_prog) 2566 2563 return XDP_PASS; 2567 - } 2568 2564 2569 2565 xdp_init_buff(&xdp, DPAA_BP_RAW_SIZE - DPAA_TX_PRIV_DATA_SIZE, 2570 2566 &dpaa_fq->xdp_rxq); ··· 2633 2637 free_pages((unsigned long)vaddr, 0); 2634 2638 break; 2635 2639 } 2636 - 2637 - rcu_read_unlock(); 2638 2640 2639 2641 return xdp_act; 2640 2642 }
-3
drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
··· 352 352 u32 xdp_act = XDP_PASS; 353 353 int err, offset; 354 354 355 - rcu_read_lock(); 356 - 357 355 xdp_prog = READ_ONCE(ch->xdp.prog); 358 356 if (!xdp_prog) 359 357 goto out; ··· 412 414 413 415 ch->xdp.res |= xdp_act; 414 416 out: 415 - rcu_read_unlock(); 416 417 return xdp_act; 417 418 } 418 419
-2
drivers/net/ethernet/intel/i40e/i40e_txrx.c
··· 2298 2298 struct bpf_prog *xdp_prog; 2299 2299 u32 act; 2300 2300 2301 - rcu_read_lock(); 2302 2301 xdp_prog = READ_ONCE(rx_ring->xdp_prog); 2303 2302 2304 2303 if (!xdp_prog) ··· 2333 2334 break; 2334 2335 } 2335 2336 xdp_out: 2336 - rcu_read_unlock(); 2337 2337 return result; 2338 2338 } 2339 2339
-3
drivers/net/ethernet/intel/i40e/i40e_xsk.c
··· 153 153 struct bpf_prog *xdp_prog; 154 154 u32 act; 155 155 156 - rcu_read_lock(); 157 156 /* NB! xdp_prog will always be !NULL, due to the fact that 158 157 * this path is enabled by setting an XDP program. 159 158 */ ··· 163 164 err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog); 164 165 if (err) 165 166 goto out_failure; 166 - rcu_read_unlock(); 167 167 return I40E_XDP_REDIR; 168 168 } 169 169 ··· 186 188 result = I40E_XDP_CONSUMED; 187 189 break; 188 190 } 189 - rcu_read_unlock(); 190 191 return result; 191 192 } 192 193
+1 -5
drivers/net/ethernet/intel/ice/ice_txrx.c
··· 1140 1140 xdp.frame_sz = ice_rx_frame_truesize(rx_ring, size); 1141 1141 #endif 1142 1142 1143 - rcu_read_lock(); 1144 1143 xdp_prog = READ_ONCE(rx_ring->xdp_prog); 1145 - if (!xdp_prog) { 1146 - rcu_read_unlock(); 1144 + if (!xdp_prog) 1147 1145 goto construct_skb; 1148 - } 1149 1146 1150 1147 xdp_res = ice_run_xdp(rx_ring, &xdp, xdp_prog); 1151 - rcu_read_unlock(); 1152 1148 if (!xdp_res) 1153 1149 goto construct_skb; 1154 1150 if (xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR)) {
-3
drivers/net/ethernet/intel/ice/ice_xsk.c
··· 466 466 struct ice_ring *xdp_ring; 467 467 u32 act; 468 468 469 - rcu_read_lock(); 470 469 /* ZC patch is enabled only when XDP program is set, 471 470 * so here it can not be NULL 472 471 */ ··· 477 478 err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog); 478 479 if (err) 479 480 goto out_failure; 480 - rcu_read_unlock(); 481 481 return ICE_XDP_REDIR; 482 482 } 483 483 ··· 501 503 break; 502 504 } 503 505 504 - rcu_read_unlock(); 505 506 return result; 506 507 } 507 508
-2
drivers/net/ethernet/intel/igb/igb_main.c
··· 8381 8381 struct bpf_prog *xdp_prog; 8382 8382 u32 act; 8383 8383 8384 - rcu_read_lock(); 8385 8384 xdp_prog = READ_ONCE(rx_ring->xdp_prog); 8386 8385 8387 8386 if (!xdp_prog) ··· 8415 8416 break; 8416 8417 } 8417 8418 xdp_out: 8418 - rcu_read_unlock(); 8419 8419 return ERR_PTR(-result); 8420 8420 } 8421 8421
+2 -5
drivers/net/ethernet/intel/igc/igc_main.c
··· 2240 2240 struct bpf_prog *prog; 2241 2241 int res; 2242 2242 2243 - rcu_read_lock(); 2244 - 2245 2243 prog = READ_ONCE(adapter->xdp_prog); 2246 2244 if (!prog) { 2247 2245 res = IGC_XDP_PASS; 2248 - goto unlock; 2246 + goto out; 2249 2247 } 2250 2248 2251 2249 res = __igc_xdp_run_prog(adapter, prog, xdp); 2252 2250 2253 - unlock: 2254 - rcu_read_unlock(); 2251 + out: 2255 2252 return ERR_PTR(-res); 2256 2253 } 2257 2254
-2
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
··· 2199 2199 struct xdp_frame *xdpf; 2200 2200 u32 act; 2201 2201 2202 - rcu_read_lock(); 2203 2202 xdp_prog = READ_ONCE(rx_ring->xdp_prog); 2204 2203 2205 2204 if (!xdp_prog) ··· 2236 2237 break; 2237 2238 } 2238 2239 xdp_out: 2239 - rcu_read_unlock(); 2240 2240 return ERR_PTR(-result); 2241 2241 } 2242 2242
-3
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
··· 100 100 struct xdp_frame *xdpf; 101 101 u32 act; 102 102 103 - rcu_read_lock(); 104 103 xdp_prog = READ_ONCE(rx_ring->xdp_prog); 105 104 act = bpf_prog_run_xdp(xdp_prog, xdp); 106 105 ··· 107 108 err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog); 108 109 if (err) 109 110 goto out_failure; 110 - rcu_read_unlock(); 111 111 return IXGBE_XDP_REDIR; 112 112 } 113 113 ··· 132 134 result = IXGBE_XDP_CONSUMED; 133 135 break; 134 136 } 135 - rcu_read_unlock(); 136 137 return result; 137 138 } 138 139
-2
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
··· 1054 1054 struct bpf_prog *xdp_prog; 1055 1055 u32 act; 1056 1056 1057 - rcu_read_lock(); 1058 1057 xdp_prog = READ_ONCE(rx_ring->xdp_prog); 1059 1058 1060 1059 if (!xdp_prog) ··· 1081 1082 break; 1082 1083 } 1083 1084 xdp_out: 1084 - rcu_read_unlock(); 1085 1085 return ERR_PTR(-result); 1086 1086 } 1087 1087
-2
drivers/net/ethernet/marvell/mvneta.c
··· 2369 2369 /* Get number of received packets */ 2370 2370 rx_todo = mvneta_rxq_busy_desc_num_get(pp, rxq); 2371 2371 2372 - rcu_read_lock(); 2373 2372 xdp_prog = READ_ONCE(pp->xdp_prog); 2374 2373 2375 2374 /* Fairness NAPI loop */ ··· 2446 2447 xdp_buf.data_hard_start = NULL; 2447 2448 sinfo.nr_frags = 0; 2448 2449 } 2449 - rcu_read_unlock(); 2450 2450 2451 2451 if (xdp_buf.data_hard_start) 2452 2452 mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo, -1);
-4
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
··· 3877 3877 int rx_done = 0; 3878 3878 u32 xdp_ret = 0; 3879 3879 3880 - rcu_read_lock(); 3881 - 3882 3880 xdp_prog = READ_ONCE(port->xdp_prog); 3883 3881 3884 3882 /* Get number of received packets and clamp the to-do */ ··· 4021 4023 else 4022 4024 mvpp2_bm_pool_put(port, pool, dma_addr, phys_addr); 4023 4025 } 4024 - 4025 - rcu_read_unlock(); 4026 4026 4027 4027 if (xdp_ret & MVPP2_XDP_REDIR) 4028 4028 xdp_do_flush_map();
+2 -6
drivers/net/ethernet/mellanox/mlx4/en_rx.c
··· 679 679 680 680 ring = priv->rx_ring[cq_ring]; 681 681 682 - /* Protect accesses to: ring->xdp_prog, priv->mac_hash list */ 683 - rcu_read_lock(); 684 - xdp_prog = rcu_dereference(ring->xdp_prog); 682 + xdp_prog = rcu_dereference_bh(ring->xdp_prog); 685 683 xdp_init_buff(&xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq); 686 684 doorbell_pending = false; 687 685 ··· 742 744 /* Drop the packet, since HW loopback-ed it */ 743 745 mac_hash = ethh->h_source[MLX4_EN_MAC_HASH_IDX]; 744 746 bucket = &priv->mac_hash[mac_hash]; 745 - hlist_for_each_entry_rcu(entry, bucket, hlist) { 747 + hlist_for_each_entry_rcu_bh(entry, bucket, hlist) { 746 748 if (ether_addr_equal_64bits(entry->mac, 747 749 ethh->h_source)) 748 750 goto next; ··· 896 898 if (unlikely(++polled == budget)) 897 899 break; 898 900 } 899 - 900 - rcu_read_unlock(); 901 901 902 902 if (likely(polled)) { 903 903 if (doorbell_pending) {
-2
drivers/net/ethernet/netronome/nfp/nfp_net_common.c
··· 1819 1819 struct xdp_buff xdp; 1820 1820 int idx; 1821 1821 1822 - rcu_read_lock(); 1823 1822 xdp_prog = READ_ONCE(dp->xdp_prog); 1824 1823 true_bufsz = xdp_prog ? PAGE_SIZE : dp->fl_bufsz; 1825 1824 xdp_init_buff(&xdp, PAGE_SIZE - NFP_NET_RX_BUF_HEADROOM, ··· 2035 2036 if (!nfp_net_xdp_complete(tx_ring)) 2036 2037 pkts_polled = budget; 2037 2038 } 2038 - rcu_read_unlock(); 2039 2039 2040 2040 return pkts_polled; 2041 2041 }
-6
drivers/net/ethernet/qlogic/qede/qede_fp.c
··· 1089 1089 xdp_prepare_buff(&xdp, page_address(bd->data), *data_offset, 1090 1090 *len, false); 1091 1091 1092 - /* Queues always have a full reset currently, so for the time 1093 - * being until there's atomic program replace just mark read 1094 - * side for map helpers. 1095 - */ 1096 - rcu_read_lock(); 1097 1092 act = bpf_prog_run_xdp(prog, &xdp); 1098 - rcu_read_unlock(); 1099 1093 1100 1094 /* Recalculate, as XDP might have changed the headers */ 1101 1095 *data_offset = xdp.data - xdp.data_hard_start;
+2 -7
drivers/net/ethernet/sfc/rx.c
··· 260 260 s16 offset; 261 261 int err; 262 262 263 - rcu_read_lock(); 264 - xdp_prog = rcu_dereference(efx->xdp_prog); 265 - if (!xdp_prog) { 266 - rcu_read_unlock(); 263 + xdp_prog = rcu_dereference_bh(efx->xdp_prog); 264 + if (!xdp_prog) 267 265 return true; 268 - } 269 266 270 267 rx_queue = efx_channel_get_rx_queue(channel); 271 268 272 269 if (unlikely(channel->rx_pkt_n_frags > 1)) { 273 270 /* We can't do XDP on fragmented packets - drop. */ 274 - rcu_read_unlock(); 275 271 efx_free_rx_buffers(rx_queue, rx_buf, 276 272 channel->rx_pkt_n_frags); 277 273 if (net_ratelimit()) ··· 292 296 rx_buf->len, false); 293 297 294 298 xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp); 295 - rcu_read_unlock(); 296 299 297 300 offset = (u8 *)xdp.data - *ehp; 298 301
-3
drivers/net/ethernet/socionext/netsec.c
··· 958 958 959 959 xdp_init_buff(&xdp, PAGE_SIZE, &dring->xdp_rxq); 960 960 961 - rcu_read_lock(); 962 961 xdp_prog = READ_ONCE(priv->xdp_prog); 963 962 dma_dir = page_pool_get_dma_dir(dring->page_pool); 964 963 ··· 1067 1068 dring->tail = (dring->tail + 1) % DESC_NUM; 1068 1069 } 1069 1070 netsec_finalize_xdp_rx(priv, xdp_act, xdp_xmit); 1070 - 1071 - rcu_read_unlock(); 1072 1071 1073 1072 return done; 1074 1073 }
+2 -8
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
··· 4651 4651 return res; 4652 4652 } 4653 4653 4654 - /* This function assumes rcu_read_lock() is held by the caller. */ 4655 4654 static int __stmmac_xdp_run_prog(struct stmmac_priv *priv, 4656 4655 struct bpf_prog *prog, 4657 4656 struct xdp_buff *xdp) ··· 4692 4693 struct bpf_prog *prog; 4693 4694 int res; 4694 4695 4695 - rcu_read_lock(); 4696 - 4697 4696 prog = READ_ONCE(priv->xdp_prog); 4698 4697 if (!prog) { 4699 4698 res = STMMAC_XDP_PASS; 4700 - goto unlock; 4699 + goto out; 4701 4700 } 4702 4701 4703 4702 res = __stmmac_xdp_run_prog(priv, prog, xdp); 4704 - unlock: 4705 - rcu_read_unlock(); 4703 + out: 4706 4704 return ERR_PTR(-res); 4707 4705 } 4708 4706 ··· 4969 4973 buf->xdp->data_end = buf->xdp->data + buf1_len; 4970 4974 xsk_buff_dma_sync_for_cpu(buf->xdp, rx_q->xsk_pool); 4971 4975 4972 - rcu_read_lock(); 4973 4976 prog = READ_ONCE(priv->xdp_prog); 4974 4977 res = __stmmac_xdp_run_prog(priv, prog, buf->xdp); 4975 - rcu_read_unlock(); 4976 4978 4977 4979 switch (res) { 4978 4980 case STMMAC_XDP_PASS:
+2 -8
drivers/net/ethernet/ti/cpsw_priv.c
··· 1328 1328 struct bpf_prog *prog; 1329 1329 u32 act; 1330 1330 1331 - rcu_read_lock(); 1332 - 1333 1331 prog = READ_ONCE(priv->xdp_prog); 1334 - if (!prog) { 1335 - ret = CPSW_XDP_PASS; 1336 - goto out; 1337 - } 1332 + if (!prog) 1333 + return CPSW_XDP_PASS; 1338 1334 1339 1335 act = bpf_prog_run_xdp(prog, xdp); 1340 1336 /* XDP prog might have changed packet data and boundaries */ ··· 1374 1378 ndev->stats.rx_bytes += *len; 1375 1379 ndev->stats.rx_packets++; 1376 1380 out: 1377 - rcu_read_unlock(); 1378 1381 return ret; 1379 1382 drop: 1380 - rcu_read_unlock(); 1381 1383 page_pool_recycle_direct(cpsw->page_pool[ch], page); 1382 1384 return ret; 1383 1385 }
+3 -5
include/linux/filter.h
··· 763 763 static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog, 764 764 struct xdp_buff *xdp) 765 765 { 766 - /* Caller needs to hold rcu_read_lock() (!), otherwise program 767 - * can be released while still running, or map elements could be 768 - * freed early while still having concurrent users. XDP fastpath 769 - * already takes rcu_read_lock() when fetching the program, so 770 - * it's not necessary here anymore. 766 + /* Driver XDP hooks are invoked within a single NAPI poll cycle and thus 767 + * under local_bh_disable(), which provides the needed RCU protection 768 + * for accessing map entries. 771 769 */ 772 770 return __BPF_PROG_RUN(prog, xdp, BPF_DISPATCHER_FUNC(xdp)); 773 771 }
+14
include/linux/rcupdate.h
··· 363 363 #define rcu_check_sparse(p, space) 364 364 #endif /* #else #ifdef __CHECKER__ */ 365 365 366 + /** 367 + * unrcu_pointer - mark a pointer as not being RCU protected 368 + * @p: pointer needing to lose its __rcu property 369 + * 370 + * Converts @p from an __rcu pointer to a __kernel pointer. 371 + * This allows an __rcu pointer to be used with xchg() and friends. 372 + */ 373 + #define unrcu_pointer(p) \ 374 + ({ \ 375 + typeof(*p) *_________p1 = (typeof(*p) *__force)(p); \ 376 + rcu_check_sparse(p, __rcu); \ 377 + ((typeof(*p) __force __kernel *)(_________p1)); \ 378 + }) 379 + 366 380 #define __rcu_access_pointer(p, space) \ 367 381 ({ \ 368 382 typeof(*p) *_________p1 = (typeof(*p) *__force)READ_ONCE(p); \
+1 -1
include/net/xdp_sock.h
··· 37 37 struct xsk_map { 38 38 struct bpf_map map; 39 39 spinlock_t lock; /* Synchronize map updates */ 40 - struct xdp_sock *xsk_map[]; 40 + struct xdp_sock __rcu *xsk_map[]; 41 41 }; 42 42 43 43 struct xdp_sock {
+9 -4
kernel/bpf/cpumap.c
··· 74 74 struct bpf_cpu_map { 75 75 struct bpf_map map; 76 76 /* Below members specific for map type */ 77 - struct bpf_cpu_map_entry **cpu_map; 77 + struct bpf_cpu_map_entry __rcu **cpu_map; 78 78 }; 79 79 80 80 static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list); ··· 469 469 { 470 470 struct bpf_cpu_map_entry *old_rcpu; 471 471 472 - old_rcpu = xchg(&cmap->cpu_map[key_cpu], rcpu); 472 + old_rcpu = unrcu_pointer(xchg(&cmap->cpu_map[key_cpu], RCU_INITIALIZER(rcpu))); 473 473 if (old_rcpu) { 474 474 call_rcu(&old_rcpu->rcu, __cpu_map_entry_free); 475 475 INIT_WORK(&old_rcpu->kthread_stop_wq, cpu_map_kthread_stop); ··· 551 551 for (i = 0; i < cmap->map.max_entries; i++) { 552 552 struct bpf_cpu_map_entry *rcpu; 553 553 554 - rcpu = READ_ONCE(cmap->cpu_map[i]); 554 + rcpu = rcu_dereference_raw(cmap->cpu_map[i]); 555 555 if (!rcpu) 556 556 continue; 557 557 ··· 562 562 kfree(cmap); 563 563 } 564 564 565 + /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or 566 + * by local_bh_disable() (from XDP calls inside NAPI). The 567 + * rcu_read_lock_bh_held() below makes lockdep accept both. 568 + */ 565 569 static void *__cpu_map_lookup_elem(struct bpf_map *map, u32 key) 566 570 { 567 571 struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map); ··· 574 570 if (key >= map->max_entries) 575 571 return NULL; 576 572 577 - rcpu = READ_ONCE(cmap->cpu_map[key]); 573 + rcpu = rcu_dereference_check(cmap->cpu_map[key], 574 + rcu_read_lock_bh_held()); 578 575 return rcpu; 579 576 } 580 577
+21 -28
kernel/bpf/devmap.c
··· 73 73 74 74 struct bpf_dtab { 75 75 struct bpf_map map; 76 - struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */ 76 + struct bpf_dtab_netdev __rcu **netdev_map; /* DEVMAP type only */ 77 77 struct list_head list; 78 78 79 79 /* these are only used for DEVMAP_HASH type maps */ ··· 226 226 for (i = 0; i < dtab->map.max_entries; i++) { 227 227 struct bpf_dtab_netdev *dev; 228 228 229 - dev = dtab->netdev_map[i]; 229 + dev = rcu_dereference_raw(dtab->netdev_map[i]); 230 230 if (!dev) 231 231 continue; 232 232 ··· 259 259 return 0; 260 260 } 261 261 262 + /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or 263 + * by local_bh_disable() (from XDP calls inside NAPI). The 264 + * rcu_read_lock_bh_held() below makes lockdep accept both. 265 + */ 262 266 static void *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key) 263 267 { 264 268 struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); ··· 414 410 trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent, err); 415 411 } 416 412 417 - /* __dev_flush is called from xdp_do_flush() which _must_ be signaled 418 - * from the driver before returning from its napi->poll() routine. The poll() 419 - * routine is called either from busy_poll context or net_rx_action signaled 420 - * from NET_RX_SOFTIRQ. Either way the poll routine must complete before the 421 - * net device can be torn down. On devmap tear down we ensure the flush list 422 - * is empty before completing to ensure all flush operations have completed. 423 - * When drivers update the bpf program they may need to ensure any flush ops 424 - * are also complete. Using synchronize_rcu or call_rcu will suffice for this 425 - * because both wait for napi context to exit. 413 + /* __dev_flush is called from xdp_do_flush() which _must_ be signalled from the 414 + * driver before returning from its napi->poll() routine. See the comment above 415 + * xdp_do_flush() in filter.c. 
426 416 */ 427 417 void __dev_flush(void) 428 418 { ··· 431 433 } 432 434 } 433 435 434 - /* rcu_read_lock (from syscall and BPF contexts) ensures that if a delete and/or 435 - * update happens in parallel here a dev_put won't happen until after reading 436 - * the ifindex. 436 + /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or 437 + * by local_bh_disable() (from XDP calls inside NAPI). The 438 + * rcu_read_lock_bh_held() below makes lockdep accept both. 437 439 */ 438 440 static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key) 439 441 { ··· 443 445 if (key >= map->max_entries) 444 446 return NULL; 445 447 446 - obj = READ_ONCE(dtab->netdev_map[key]); 448 + obj = rcu_dereference_check(dtab->netdev_map[key], 449 + rcu_read_lock_bh_held()); 447 450 return obj; 448 451 } 449 452 450 - /* Runs under RCU-read-side, plus in softirq under NAPI protection. 451 - * Thus, safe percpu variable access. 453 + /* Runs in NAPI, i.e., softirq under local_bh_disable(). Thus, safe percpu 454 + * variable access, and map elements stick around. See comment above 455 + * xdp_do_flush() in filter.c. 452 456 */ 453 457 static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf, 454 458 struct net_device *dev_rx, struct bpf_prog *xdp_prog) ··· 735 735 if (k >= map->max_entries) 736 736 return -EINVAL; 737 737 738 - /* Use call_rcu() here to ensure any rcu critical sections have 739 - * completed as well as any flush operations because call_rcu 740 - * will wait for preempt-disable region to complete, NAPI in this 741 - * context. And additionally, the driver tear down ensures all 742 - * soft irqs are complete before removing the net device in the 743 - * case of dev_put equals zero. 
744 - */ 745 - old_dev = xchg(&dtab->netdev_map[k], NULL); 738 + old_dev = unrcu_pointer(xchg(&dtab->netdev_map[k], NULL)); 746 739 if (old_dev) 747 740 call_rcu(&old_dev->rcu, __dev_map_entry_free); 748 741 return 0; ··· 844 851 * Remembering the driver side flush operation will happen before the 845 852 * net device is removed. 846 853 */ 847 - old_dev = xchg(&dtab->netdev_map[i], dev); 854 + old_dev = unrcu_pointer(xchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev))); 848 855 if (old_dev) 849 856 call_rcu(&old_dev->rcu, __dev_map_entry_free); 850 857 ··· 1024 1031 for (i = 0; i < dtab->map.max_entries; i++) { 1025 1032 struct bpf_dtab_netdev *dev, *odev; 1026 1033 1027 - dev = READ_ONCE(dtab->netdev_map[i]); 1034 + dev = rcu_dereference(dtab->netdev_map[i]); 1028 1035 if (!dev || netdev != dev->dev) 1029 1036 continue; 1030 - odev = cmpxchg(&dtab->netdev_map[i], dev, NULL); 1037 + odev = unrcu_pointer(cmpxchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev), NULL)); 1031 1038 if (dev == odev) 1032 1039 call_rcu(&dev->rcu, 1033 1040 __dev_map_entry_free);
+14 -7
kernel/bpf/hashtab.c
··· 596 596 struct htab_elem *l; 597 597 u32 hash, key_size; 598 598 599 - WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); 599 + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() && 600 + !rcu_read_lock_bh_held()); 600 601 601 602 key_size = map->key_size; 602 603 ··· 990 989 /* unknown flags */ 991 990 return -EINVAL; 992 991 993 - WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); 992 + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() && 993 + !rcu_read_lock_bh_held()); 994 994 995 995 key_size = map->key_size; 996 996 ··· 1084 1082 /* unknown flags */ 1085 1083 return -EINVAL; 1086 1084 1087 - WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); 1085 + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() && 1086 + !rcu_read_lock_bh_held()); 1088 1087 1089 1088 key_size = map->key_size; 1090 1089 ··· 1151 1148 /* unknown flags */ 1152 1149 return -EINVAL; 1153 1150 1154 - WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); 1151 + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() && 1152 + !rcu_read_lock_bh_held()); 1155 1153 1156 1154 key_size = map->key_size; 1157 1155 ··· 1206 1202 /* unknown flags */ 1207 1203 return -EINVAL; 1208 1204 1209 - WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); 1205 + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() && 1206 + !rcu_read_lock_bh_held()); 1210 1207 1211 1208 key_size = map->key_size; 1212 1209 ··· 1281 1276 u32 hash, key_size; 1282 1277 int ret; 1283 1278 1284 - WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); 1279 + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() && 1280 + !rcu_read_lock_bh_held()); 1285 1281 1286 1282 key_size = map->key_size; 1287 1283 ··· 1317 1311 u32 hash, key_size; 1318 1312 int ret; 1319 1313 1320 - WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); 1314 + 
WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() && 1315 + !rcu_read_lock_bh_held()); 1321 1316 1322 1317 key_size = map->key_size; 1323 1318
+3 -3
kernel/bpf/helpers.c
··· 29 29 */ 30 30 BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key) 31 31 { 32 - WARN_ON_ONCE(!rcu_read_lock_held()); 32 + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held()); 33 33 return (unsigned long) map->ops->map_lookup_elem(map, key); 34 34 } 35 35 ··· 45 45 BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key, 46 46 void *, value, u64, flags) 47 47 { 48 - WARN_ON_ONCE(!rcu_read_lock_held()); 48 + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held()); 49 49 return map->ops->map_update_elem(map, key, value, flags); 50 50 } 51 51 ··· 62 62 63 63 BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key) 64 64 { 65 - WARN_ON_ONCE(!rcu_read_lock_held()); 65 + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held()); 66 66 return map->ops->map_delete_elem(map, key); 67 67 } 68 68
+4 -2
kernel/bpf/lpm_trie.c
··· 232 232 233 233 /* Start walking the trie from the root node ... */ 234 234 235 - for (node = rcu_dereference(trie->root); node;) { 235 + for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held()); 236 + node;) { 236 237 unsigned int next_bit; 237 238 size_t matchlen; 238 239 ··· 265 264 * traverse down. 266 265 */ 267 266 next_bit = extract_bit(key->data, node->prefixlen); 268 - node = rcu_dereference(node->child[next_bit]); 267 + node = rcu_dereference_check(node->child[next_bit], 268 + rcu_read_lock_bh_held()); 269 269 } 270 270 271 271 if (!found)
+2
kernel/bpf/ringbuf.c
··· 8 8 #include <linux/vmalloc.h> 9 9 #include <linux/wait.h> 10 10 #include <linux/poll.h> 11 + #include <linux/kmemleak.h> 11 12 #include <uapi/linux/btf.h> 12 13 13 14 #define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE) ··· 106 105 rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages, 107 106 VM_ALLOC | VM_USERMAP, PAGE_KERNEL); 108 107 if (rb) { 108 + kmemleak_not_leak(pages); 109 109 rb->pages = pages; 110 110 rb->nr_pages = nr_pages; 111 111 return rb;
+2
kernel/trace/bpf_trace.c
··· 1017 1017 #ifdef CONFIG_CGROUPS 1018 1018 case BPF_FUNC_get_current_cgroup_id: 1019 1019 return &bpf_get_current_cgroup_id_proto; 1020 + case BPF_FUNC_get_current_ancestor_cgroup_id: 1021 + return &bpf_get_current_ancestor_cgroup_id_proto; 1020 1022 #endif 1021 1023 case BPF_FUNC_send_signal: 1022 1024 return &bpf_send_signal_proto;
+1 -1
net/bpfilter/main.c
··· 57 57 { 58 58 debug_f = fopen("/dev/kmsg", "w"); 59 59 setvbuf(debug_f, 0, _IOLBF, 0); 60 - fprintf(debug_f, "Started bpfilter\n"); 60 + fprintf(debug_f, "<5>Started bpfilter\n"); 61 61 loop(); 62 62 fclose(debug_f); 63 63 return 0;
+37 -35
net/core/filter.c
··· 3235 3235 return ret; 3236 3236 } 3237 3237 3238 - static int bpf_skb_proto_4_to_6(struct sk_buff *skb, u64 flags) 3238 + static int bpf_skb_proto_4_to_6(struct sk_buff *skb) 3239 3239 { 3240 3240 const u32 len_diff = sizeof(struct ipv6hdr) - sizeof(struct iphdr); 3241 3241 u32 off = skb_mac_header_len(skb); 3242 3242 int ret; 3243 - 3244 - if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) 3245 - return -ENOTSUPP; 3246 3243 3247 3244 ret = skb_cow(skb, len_diff); 3248 3245 if (unlikely(ret < 0)) ··· 3252 3255 if (skb_is_gso(skb)) { 3253 3256 struct skb_shared_info *shinfo = skb_shinfo(skb); 3254 3257 3255 - /* SKB_GSO_TCPV4 needs to be changed into 3256 - * SKB_GSO_TCPV6. 3257 - */ 3258 + /* SKB_GSO_TCPV4 needs to be changed into SKB_GSO_TCPV6. */ 3258 3259 if (shinfo->gso_type & SKB_GSO_TCPV4) { 3259 3260 shinfo->gso_type &= ~SKB_GSO_TCPV4; 3260 3261 shinfo->gso_type |= SKB_GSO_TCPV6; 3261 3262 } 3262 - 3263 - /* Due to IPv6 header, MSS needs to be downgraded. */ 3264 - if (!(flags & BPF_F_ADJ_ROOM_FIXED_GSO)) 3265 - skb_decrease_gso_size(shinfo, len_diff); 3266 - 3267 - /* Header must be checked, and gso_segs recomputed. */ 3268 - shinfo->gso_type |= SKB_GSO_DODGY; 3269 - shinfo->gso_segs = 0; 3270 3263 } 3271 3264 3272 3265 skb->protocol = htons(ETH_P_IPV6); ··· 3265 3278 return 0; 3266 3279 } 3267 3280 3268 - static int bpf_skb_proto_6_to_4(struct sk_buff *skb, u64 flags) 3281 + static int bpf_skb_proto_6_to_4(struct sk_buff *skb) 3269 3282 { 3270 3283 const u32 len_diff = sizeof(struct ipv6hdr) - sizeof(struct iphdr); 3271 3284 u32 off = skb_mac_header_len(skb); 3272 3285 int ret; 3273 - 3274 - if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) 3275 - return -ENOTSUPP; 3276 3286 3277 3287 ret = skb_unclone(skb, GFP_ATOMIC); 3278 3288 if (unlikely(ret < 0)) ··· 3282 3298 if (skb_is_gso(skb)) { 3283 3299 struct skb_shared_info *shinfo = skb_shinfo(skb); 3284 3300 3285 - /* SKB_GSO_TCPV6 needs to be changed into 3286 - * SKB_GSO_TCPV4. 
3287 - */ 3301 + /* SKB_GSO_TCPV6 needs to be changed into SKB_GSO_TCPV4. */ 3288 3302 if (shinfo->gso_type & SKB_GSO_TCPV6) { 3289 3303 shinfo->gso_type &= ~SKB_GSO_TCPV6; 3290 3304 shinfo->gso_type |= SKB_GSO_TCPV4; 3291 3305 } 3292 - 3293 - /* Due to IPv4 header, MSS can be upgraded. */ 3294 - if (!(flags & BPF_F_ADJ_ROOM_FIXED_GSO)) 3295 - skb_increase_gso_size(shinfo, len_diff); 3296 - 3297 - /* Header must be checked, and gso_segs recomputed. */ 3298 - shinfo->gso_type |= SKB_GSO_DODGY; 3299 - shinfo->gso_segs = 0; 3300 3306 } 3301 3307 3302 3308 skb->protocol = htons(ETH_P_IP); ··· 3295 3321 return 0; 3296 3322 } 3297 3323 3298 - static int bpf_skb_proto_xlat(struct sk_buff *skb, __be16 to_proto, u64 flags) 3324 + static int bpf_skb_proto_xlat(struct sk_buff *skb, __be16 to_proto) 3299 3325 { 3300 3326 __be16 from_proto = skb->protocol; 3301 3327 3302 3328 if (from_proto == htons(ETH_P_IP) && 3303 3329 to_proto == htons(ETH_P_IPV6)) 3304 - return bpf_skb_proto_4_to_6(skb, flags); 3330 + return bpf_skb_proto_4_to_6(skb); 3305 3331 3306 3332 if (from_proto == htons(ETH_P_IPV6) && 3307 3333 to_proto == htons(ETH_P_IP)) 3308 - return bpf_skb_proto_6_to_4(skb, flags); 3334 + return bpf_skb_proto_6_to_4(skb); 3309 3335 3310 3336 return -ENOTSUPP; 3311 3337 } ··· 3315 3341 { 3316 3342 int ret; 3317 3343 3318 - if (unlikely(flags & ~(BPF_F_ADJ_ROOM_FIXED_GSO))) 3344 + if (unlikely(flags)) 3319 3345 return -EINVAL; 3320 3346 3321 3347 /* General idea is that this helper does the basic groundwork ··· 3335 3361 * that. For offloads, we mark packet as dodgy, so that headers 3336 3362 * need to be verified first. 
3337 3363 */ 3338 - ret = bpf_skb_proto_xlat(skb, proto, flags); 3364 + ret = bpf_skb_proto_xlat(skb, proto); 3339 3365 bpf_compute_data_pointers(skb); 3340 3366 return ret; 3341 3367 } ··· 3897 3923 .arg2_type = ARG_ANYTHING, 3898 3924 }; 3899 3925 3926 + /* XDP_REDIRECT works by a three-step process, implemented in the functions 3927 + * below: 3928 + * 3929 + * 1. The bpf_redirect() and bpf_redirect_map() helpers will lookup the target 3930 + * of the redirect and store it (along with some other metadata) in a per-CPU 3931 + * struct bpf_redirect_info. 3932 + * 3933 + * 2. When the program returns the XDP_REDIRECT return code, the driver will 3934 + * call xdp_do_redirect() which will use the information in struct 3935 + * bpf_redirect_info to actually enqueue the frame into a map type-specific 3936 + * bulk queue structure. 3937 + * 3938 + * 3. Before exiting its NAPI poll loop, the driver will call xdp_do_flush(), 3939 + * which will flush all the different bulk queues, thus completing the 3940 + * redirect. 3941 + * 3942 + * Pointers to the map entries will be kept around for this whole sequence of 3943 + * steps, protected by RCU. However, there is no top-level rcu_read_lock() in 3944 + * the core code; instead, the RCU protection relies on everything happening 3945 + * inside a single NAPI poll sequence, which means it's between a pair of calls 3946 + * to local_bh_disable()/local_bh_enable(). 3947 + * 3948 + * The map entries are marked as __rcu and the map code makes sure to 3949 + * dereference those pointers with rcu_dereference_check() in a way that works 3950 + * for both sections that hold an rcu_read_lock() and sections that are 3951 + * called from NAPI without a separate rcu_read_lock(). The code below does not 3952 + * use RCU annotations, but relies on those in the map code. 3953 + */ 3900 3954 void xdp_do_flush(void) 3901 3955 { 3902 3956 __dev_flush();
+6 -5
net/core/xdp.c
··· 113 113 void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq) 114 114 { 115 115 struct xdp_mem_allocator *xa; 116 + int type = xdp_rxq->mem.type; 116 117 int id = xdp_rxq->mem.id; 118 + 119 + /* Reset mem info to defaults */ 120 + xdp_rxq->mem.id = 0; 121 + xdp_rxq->mem.type = 0; 117 122 118 123 if (xdp_rxq->reg_state != REG_STATE_REGISTERED) { 119 124 WARN(1, "Missing register, driver bug"); ··· 128 123 if (id == 0) 129 124 return; 130 125 131 - if (xdp_rxq->mem.type == MEM_TYPE_PAGE_POOL) { 126 + if (type == MEM_TYPE_PAGE_POOL) { 132 127 rcu_read_lock(); 133 128 xa = rhashtable_lookup(mem_id_ht, &id, mem_id_rht_params); 134 129 page_pool_destroy(xa->page_pool); ··· 149 144 150 145 xdp_rxq->reg_state = REG_STATE_UNREGISTERED; 151 146 xdp_rxq->dev = NULL; 152 - 153 - /* Reset mem info to defaults */ 154 - xdp_rxq->mem.id = 0; 155 - xdp_rxq->mem.type = 0; 156 147 } 157 148 EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg); 158 149
-2
net/sched/act_bpf.c
··· 43 43 tcf_lastuse_update(&prog->tcf_tm); 44 44 bstats_cpu_update(this_cpu_ptr(prog->common.cpu_bstats), skb); 45 45 46 - rcu_read_lock(); 47 46 filter = rcu_dereference(prog->filter); 48 47 if (at_ingress) { 49 48 __skb_push(skb, skb->mac_len); ··· 55 56 } 56 57 if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK) 57 58 skb_orphan(skb); 58 - rcu_read_unlock(); 59 59 60 60 /* A BPF program may overwrite the default action opcode. 61 61 * Similarly as in cls_bpf, if filter_res == -1 we use the
-3
net/sched/cls_bpf.c
··· 85 85 struct cls_bpf_prog *prog; 86 86 int ret = -1; 87 87 88 - /* Needed here for accessing maps. */ 89 - rcu_read_lock(); 90 88 list_for_each_entry_rcu(prog, &head->plist, link) { 91 89 int filter_res; 92 90 ··· 129 131 130 132 break; 131 133 } 132 - rcu_read_unlock(); 133 134 134 135 return ret; 135 136 }
+2 -2
net/xdp/xsk.c
··· 749 749 } 750 750 751 751 static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs, 752 - struct xdp_sock ***map_entry) 752 + struct xdp_sock __rcu ***map_entry) 753 753 { 754 754 struct xsk_map *map = NULL; 755 755 struct xsk_map_node *node; ··· 785 785 * might be updates to the map between 786 786 * xsk_get_map_list_entry() and xsk_map_try_sock_delete(). 787 787 */ 788 - struct xdp_sock **map_entry = NULL; 788 + struct xdp_sock __rcu **map_entry = NULL; 789 789 struct xsk_map *map; 790 790 791 791 while ((map = xsk_get_map_list_entry(xs, &map_entry))) {
+2 -2
net/xdp/xsk.h
··· 31 31 struct xsk_map_node { 32 32 struct list_head node; 33 33 struct xsk_map *map; 34 - struct xdp_sock **map_entry; 34 + struct xdp_sock __rcu **map_entry; 35 35 }; 36 36 37 37 static inline struct xdp_sock *xdp_sk(struct sock *sk) ··· 40 40 } 41 41 42 42 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs, 43 - struct xdp_sock **map_entry); 43 + struct xdp_sock __rcu **map_entry); 44 44 void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id); 45 45 int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool, 46 46 u16 queue_id);
+17 -12
net/xdp/xskmap.c
··· 12 12 #include "xsk.h" 13 13 14 14 static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map, 15 - struct xdp_sock **map_entry) 15 + struct xdp_sock __rcu **map_entry) 16 16 { 17 17 struct xsk_map_node *node; 18 18 ··· 42 42 } 43 43 44 44 static void xsk_map_sock_delete(struct xdp_sock *xs, 45 - struct xdp_sock **map_entry) 45 + struct xdp_sock __rcu **map_entry) 46 46 { 47 47 struct xsk_map_node *n, *tmp; 48 48 ··· 124 124 return insn - insn_buf; 125 125 } 126 126 127 + /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or 128 + * by local_bh_disable() (from XDP calls inside NAPI). The 129 + * rcu_read_lock_bh_held() below makes lockdep accept both. 130 + */ 127 131 static void *__xsk_map_lookup_elem(struct bpf_map *map, u32 key) 128 132 { 129 133 struct xsk_map *m = container_of(map, struct xsk_map, map); ··· 135 131 if (key >= map->max_entries) 136 132 return NULL; 137 133 138 - return READ_ONCE(m->xsk_map[key]); 134 + return rcu_dereference_check(m->xsk_map[key], rcu_read_lock_bh_held()); 139 135 } 140 136 141 137 static void *xsk_map_lookup_elem(struct bpf_map *map, void *key) 142 138 { 143 - WARN_ON_ONCE(!rcu_read_lock_held()); 144 139 return __xsk_map_lookup_elem(map, *(u32 *)key); 145 140 } 146 141 ··· 152 149 u64 map_flags) 153 150 { 154 151 struct xsk_map *m = container_of(map, struct xsk_map, map); 155 - struct xdp_sock *xs, *old_xs, **map_entry; 152 + struct xdp_sock __rcu **map_entry; 153 + struct xdp_sock *xs, *old_xs; 156 154 u32 i = *(u32 *)key, fd = *(u32 *)value; 157 155 struct xsk_map_node *node; 158 156 struct socket *sock; ··· 183 179 } 184 180 185 181 spin_lock_bh(&m->lock); 186 - old_xs = READ_ONCE(*map_entry); 182 + old_xs = rcu_dereference_protected(*map_entry, lockdep_is_held(&m->lock)); 187 183 if (old_xs == xs) { 188 184 err = 0; 189 185 goto out; ··· 195 191 goto out; 196 192 } 197 193 xsk_map_sock_add(xs, node); 198 - WRITE_ONCE(*map_entry, xs); 194 + rcu_assign_pointer(*map_entry, xs); 199 195 if 
(old_xs) 200 196 xsk_map_sock_delete(old_xs, map_entry); 201 197 spin_unlock_bh(&m->lock); ··· 212 208 static int xsk_map_delete_elem(struct bpf_map *map, void *key) 213 209 { 214 210 struct xsk_map *m = container_of(map, struct xsk_map, map); 215 - struct xdp_sock *old_xs, **map_entry; 211 + struct xdp_sock __rcu **map_entry; 212 + struct xdp_sock *old_xs; 216 213 int k = *(u32 *)key; 217 214 218 215 if (k >= map->max_entries) ··· 221 216 222 217 spin_lock_bh(&m->lock); 223 218 map_entry = &m->xsk_map[k]; 224 - old_xs = xchg(map_entry, NULL); 219 + old_xs = unrcu_pointer(xchg(map_entry, NULL)); 225 220 if (old_xs) 226 221 xsk_map_sock_delete(old_xs, map_entry); 227 222 spin_unlock_bh(&m->lock); ··· 236 231 } 237 232 238 233 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs, 239 - struct xdp_sock **map_entry) 234 + struct xdp_sock __rcu **map_entry) 240 235 { 241 236 spin_lock_bh(&map->lock); 242 - if (READ_ONCE(*map_entry) == xs) { 243 - WRITE_ONCE(*map_entry, NULL); 237 + if (rcu_access_pointer(*map_entry) == xs) { 238 + rcu_assign_pointer(*map_entry, NULL); 244 239 xsk_map_sock_delete(xs, map_entry); 245 240 } 246 241 spin_unlock_bh(&map->lock);
+2 -2
samples/bpf/xdp_redirect_user.c
··· 130 130 if (!(xdp_flags & XDP_FLAGS_SKB_MODE)) 131 131 xdp_flags |= XDP_FLAGS_DRV_MODE; 132 132 133 - if (optind == argc) { 133 + if (optind + 2 != argc) { 134 134 printf("usage: %s <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n", argv[0]); 135 135 return 1; 136 136 } ··· 213 213 poll_stats(2, ifindex_out); 214 214 215 215 out: 216 - return 0; 216 + return ret; 217 217 }
+12 -18
tools/lib/bpf/README.rst Documentation/bpf/libbpf/libbpf_naming_convention.rst
··· 1 1 .. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) 2 2 3 - libbpf API naming convention 4 - ============================ 3 + API naming convention 4 + ===================== 5 5 6 6 libbpf API provides access to a few logically separated groups of 7 7 functions and types. Every group has its own naming convention ··· 10 10 11 11 All types and functions provided by libbpf API should have one of the 12 12 following prefixes: ``bpf_``, ``btf_``, ``libbpf_``, ``xsk_``, 13 - ``perf_buffer_``. 13 + ``btf_dump_``, ``ring_buffer_``, ``perf_buffer_``. 14 14 15 15 System call wrappers 16 16 -------------------- 17 17 18 18 System call wrappers are simple wrappers for commands supported by 19 19 sys_bpf system call. These wrappers should go to ``bpf.h`` header file 20 - and map one-on-one to corresponding commands. 20 + and map one to one to corresponding commands. 21 21 22 22 For example ``bpf_map_lookup_elem`` wraps ``BPF_MAP_LOOKUP_ELEM`` 23 23 command of sys_bpf, ``bpf_prog_attach`` wraps ``BPF_PROG_ATTACH``, etc. ··· 49 49 purpose of the function to open ELF file and create ``bpf_object`` from 50 50 it. 51 51 52 - Another example: ``bpf_program__load`` is named for corresponding 53 - object, ``bpf_program``, that is separated from other part of the name 54 - by double underscore. 55 - 56 52 All objects and corresponding functions other than BTF related should go 57 53 to ``libbpf.h``. BTF types and functions should go to ``btf.h``. 58 54 ··· 68 72 functions. These can be mixed and matched. Note that these functions 69 73 are not reentrant for performance reasons. 70 74 71 - Please take a look at Documentation/networking/af_xdp.rst in the Linux 72 - kernel source tree on how to use XDP sockets and for some common 73 - mistakes in case you do not get any traffic up to user space. 74 - 75 - libbpf ABI 75 + ABI 76 76 ========== 77 77 78 78 libbpf can be both linked statically or used as DSO. 
To avoid possible ··· 108 116 109 117 For example, if current state of ``libbpf.map`` is: 110 118 111 - .. code-block:: 119 + .. code-block:: c 120 + 112 121 LIBBPF_0.0.1 { 113 122 global: 114 123 bpf_func_a; ··· 121 128 , and a new symbol ``bpf_func_c`` is being introduced, then 122 129 ``libbpf.map`` should be changed like this: 123 130 124 - .. code-block:: 131 + .. code-block:: c 132 + 125 133 LIBBPF_0.0.1 { 126 134 global: 127 135 bpf_func_a; ··· 142 148 incompatible ones, described in details in [1]. 143 149 144 150 Stand-alone build 145 - ================= 151 + ------------------- 146 152 147 153 Under https://github.com/libbpf/libbpf there is a (semi-)automated 148 154 mirror of the mainline's version of libbpf for a stand-alone build. ··· 151 157 the mainline kernel tree. 152 158 153 159 License 154 - ======= 160 + ------------------- 155 161 156 162 libbpf is dual-licensed under LGPL 2.1 and BSD 2-Clause. 157 163 158 164 Links 159 - ===== 165 + ------------------- 160 166 161 167 [1] https://www.akkadia.org/drepper/dsohowto.pdf 162 168 (Chapter 3. Maintaining APIs and ABIs).
+4
tools/lib/bpf/libbpf.c
··· 4001 4001 4002 4002 ret = bpf_load_program_xattr(&attr, NULL, 0); 4003 4003 if (ret < 0) { 4004 + attr.prog_type = BPF_PROG_TYPE_TRACEPOINT; 4005 + ret = bpf_load_program_xattr(&attr, NULL, 0); 4006 + } 4007 + if (ret < 0) { 4004 4008 ret = errno; 4005 4009 cp = libbpf_strerror_r(ret, errmsg, sizeof(errmsg)); 4006 4010 pr_warn("Error in %s():%s(%d). Couldn't load trivial BPF "
+44 -71
tools/lib/bpf/netlink.c
··· 154 154 return ret; 155 155 } 156 156 157 - static int libbpf_netlink_send_recv(struct nlmsghdr *nh, 157 + static int libbpf_netlink_send_recv(struct libbpf_nla_req *req, 158 158 __dump_nlmsg_t parse_msg, 159 159 libbpf_dump_nlmsg_t parse_attr, 160 160 void *cookie) ··· 166 166 if (sock < 0) 167 167 return sock; 168 168 169 - nh->nlmsg_pid = 0; 170 - nh->nlmsg_seq = time(NULL); 169 + req->nh.nlmsg_pid = 0; 170 + req->nh.nlmsg_seq = time(NULL); 171 171 172 - if (send(sock, nh, nh->nlmsg_len, 0) < 0) { 172 + if (send(sock, req, req->nh.nlmsg_len, 0) < 0) { 173 173 ret = -errno; 174 174 goto out; 175 175 } 176 176 177 - ret = libbpf_netlink_recv(sock, nl_pid, nh->nlmsg_seq, 177 + ret = libbpf_netlink_recv(sock, nl_pid, req->nh.nlmsg_seq, 178 178 parse_msg, parse_attr, cookie); 179 179 out: 180 180 libbpf_netlink_close(sock); ··· 186 186 { 187 187 struct nlattr *nla; 188 188 int ret; 189 - struct { 190 - struct nlmsghdr nh; 191 - struct ifinfomsg ifinfo; 192 - char attrbuf[64]; 193 - } req; 189 + struct libbpf_nla_req req; 194 190 195 191 memset(&req, 0, sizeof(req)); 196 192 req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)); ··· 195 199 req.ifinfo.ifi_family = AF_UNSPEC; 196 200 req.ifinfo.ifi_index = ifindex; 197 201 198 - nla = nlattr_begin_nested(&req.nh, sizeof(req), IFLA_XDP); 202 + nla = nlattr_begin_nested(&req, IFLA_XDP); 199 203 if (!nla) 200 204 return -EMSGSIZE; 201 - ret = nlattr_add(&req.nh, sizeof(req), IFLA_XDP_FD, &fd, sizeof(fd)); 205 + ret = nlattr_add(&req, IFLA_XDP_FD, &fd, sizeof(fd)); 202 206 if (ret < 0) 203 207 return ret; 204 208 if (flags) { 205 - ret = nlattr_add(&req.nh, sizeof(req), IFLA_XDP_FLAGS, &flags, 206 - sizeof(flags)); 209 + ret = nlattr_add(&req, IFLA_XDP_FLAGS, &flags, sizeof(flags)); 207 210 if (ret < 0) 208 211 return ret; 209 212 } 210 213 if (flags & XDP_FLAGS_REPLACE) { 211 - ret = nlattr_add(&req.nh, sizeof(req), IFLA_XDP_EXPECTED_FD, 212 - &old_fd, sizeof(old_fd)); 214 + ret = nlattr_add(&req, 
IFLA_XDP_EXPECTED_FD, &old_fd, 215 + sizeof(old_fd)); 213 216 if (ret < 0) 214 217 return ret; 215 218 } 216 - nlattr_end_nested(&req.nh, nla); 219 + nlattr_end_nested(&req, nla); 217 220 218 - return libbpf_netlink_send_recv(&req.nh, NULL, NULL, NULL); 221 + return libbpf_netlink_send_recv(&req, NULL, NULL, NULL); 219 222 } 220 223 221 224 int bpf_set_link_xdp_fd_opts(int ifindex, int fd, __u32 flags, ··· 309 314 struct xdp_id_md xdp_id = {}; 310 315 __u32 mask; 311 316 int ret; 312 - struct { 313 - struct nlmsghdr nh; 314 - struct ifinfomsg ifm; 315 - } req = { 316 - .nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)), 317 - .nh.nlmsg_type = RTM_GETLINK, 318 - .nh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST, 319 - .ifm.ifi_family = AF_PACKET, 317 + struct libbpf_nla_req req = { 318 + .nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)), 319 + .nh.nlmsg_type = RTM_GETLINK, 320 + .nh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST, 321 + .ifinfo.ifi_family = AF_PACKET, 320 322 }; 321 323 322 324 if (flags & ~XDP_FLAGS_MASK || !info_size) ··· 328 336 xdp_id.ifindex = ifindex; 329 337 xdp_id.flags = flags; 330 338 331 - ret = libbpf_netlink_send_recv(&req.nh, __dump_link_nlmsg, 339 + ret = libbpf_netlink_send_recv(&req, __dump_link_nlmsg, 332 340 get_xdp_info, &xdp_id); 333 341 if (!ret) { 334 342 size_t sz = min(info_size, sizeof(xdp_id.info)); ··· 368 376 return libbpf_err(ret); 369 377 } 370 378 371 - typedef int (*qdisc_config_t)(struct nlmsghdr *nh, struct tcmsg *t, 372 - size_t maxsz); 379 + typedef int (*qdisc_config_t)(struct libbpf_nla_req *req); 373 380 374 - static int clsact_config(struct nlmsghdr *nh, struct tcmsg *t, size_t maxsz) 381 + static int clsact_config(struct libbpf_nla_req *req) 375 382 { 376 - t->tcm_parent = TC_H_CLSACT; 377 - t->tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0); 383 + req->tc.tcm_parent = TC_H_CLSACT; 384 + req->tc.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0); 378 385 379 - return nlattr_add(nh, maxsz, TCA_KIND, "clsact", sizeof("clsact")); 386 
+ return nlattr_add(req, TCA_KIND, "clsact", sizeof("clsact")); 380 387 } 381 388 382 389 static int attach_point_to_config(struct bpf_tc_hook *hook, ··· 422 431 { 423 432 qdisc_config_t config; 424 433 int ret; 425 - struct { 426 - struct nlmsghdr nh; 427 - struct tcmsg tc; 428 - char buf[256]; 429 - } req; 434 + struct libbpf_nla_req req; 430 435 431 436 ret = attach_point_to_config(hook, &config); 432 437 if (ret < 0) ··· 435 448 req.tc.tcm_family = AF_UNSPEC; 436 449 req.tc.tcm_ifindex = OPTS_GET(hook, ifindex, 0); 437 450 438 - ret = config(&req.nh, &req.tc, sizeof(req)); 451 + ret = config(&req); 439 452 if (ret < 0) 440 453 return ret; 441 454 442 - return libbpf_netlink_send_recv(&req.nh, NULL, NULL, NULL); 455 + return libbpf_netlink_send_recv(&req, NULL, NULL, NULL); 443 456 } 444 457 445 458 static int tc_qdisc_create_excl(struct bpf_tc_hook *hook) ··· 524 537 struct nlattr *tb[TCA_MAX + 1]; 525 538 526 539 libbpf_nla_parse(tb, TCA_MAX, 527 - (struct nlattr *)((char *)tc + NLMSG_ALIGN(sizeof(*tc))), 540 + (struct nlattr *)((void *)tc + NLMSG_ALIGN(sizeof(*tc))), 528 541 NLMSG_PAYLOAD(nh, sizeof(*tc)), NULL); 529 542 if (!tb[TCA_KIND]) 530 543 return NL_CONT; 531 544 return __get_tc_info(cookie, tc, tb, nh->nlmsg_flags & NLM_F_ECHO); 532 545 } 533 546 534 - static int tc_add_fd_and_name(struct nlmsghdr *nh, size_t maxsz, int fd) 547 + static int tc_add_fd_and_name(struct libbpf_nla_req *req, int fd) 535 548 { 536 549 struct bpf_prog_info info = {}; 537 550 __u32 info_len = sizeof(info); ··· 542 555 if (ret < 0) 543 556 return ret; 544 557 545 - ret = nlattr_add(nh, maxsz, TCA_BPF_FD, &fd, sizeof(fd)); 558 + ret = nlattr_add(req, TCA_BPF_FD, &fd, sizeof(fd)); 546 559 if (ret < 0) 547 560 return ret; 548 561 len = snprintf(name, sizeof(name), "%s:[%u]", info.name, info.id); ··· 550 563 return -errno; 551 564 if (len >= sizeof(name)) 552 565 return -ENAMETOOLONG; 553 - return nlattr_add(nh, maxsz, TCA_BPF_NAME, name, len + 1); 566 + return nlattr_add(req, 
		TCA_BPF_NAME, name, len + 1);
554 567 }
555 568
556 569 int bpf_tc_attach(const struct bpf_tc_hook *hook, struct bpf_tc_opts *opts)
···
558 571 	__u32 protocol, bpf_flags, handle, priority, parent, prog_id, flags;
559 572 	int ret, ifindex, attach_point, prog_fd;
560 573 	struct bpf_cb_ctx info = {};
574 +	struct libbpf_nla_req req;
561 575 	struct nlattr *nla;
562 -	struct {
563 -		struct nlmsghdr nh;
564 -		struct tcmsg tc;
565 -		char buf[256];
566 -	} req;
567 576
568 577 	if (!hook || !opts ||
569 578 	    !OPTS_VALID(hook, bpf_tc_hook) ||
···
601 618 		return libbpf_err(ret);
602 619 	req.tc.tcm_parent = parent;
603 620
604 -	ret = nlattr_add(&req.nh, sizeof(req), TCA_KIND, "bpf", sizeof("bpf"));
621 +	ret = nlattr_add(&req, TCA_KIND, "bpf", sizeof("bpf"));
605 622 	if (ret < 0)
606 623 		return libbpf_err(ret);
607 -	nla = nlattr_begin_nested(&req.nh, sizeof(req), TCA_OPTIONS);
624 +	nla = nlattr_begin_nested(&req, TCA_OPTIONS);
608 625 	if (!nla)
609 626 		return libbpf_err(-EMSGSIZE);
610 -	ret = tc_add_fd_and_name(&req.nh, sizeof(req), prog_fd);
627 +	ret = tc_add_fd_and_name(&req, prog_fd);
611 628 	if (ret < 0)
612 629 		return libbpf_err(ret);
613 630 	bpf_flags = TCA_BPF_FLAG_ACT_DIRECT;
614 -	ret = nlattr_add(&req.nh, sizeof(req), TCA_BPF_FLAGS, &bpf_flags,
615 -			 sizeof(bpf_flags));
631 +	ret = nlattr_add(&req, TCA_BPF_FLAGS, &bpf_flags, sizeof(bpf_flags));
616 632 	if (ret < 0)
617 633 		return libbpf_err(ret);
618 -	nlattr_end_nested(&req.nh, nla);
634 +	nlattr_end_nested(&req, nla);
619 635
620 636 	info.opts = opts;
621 637
622 -	ret = libbpf_netlink_send_recv(&req.nh, get_tc_info, NULL, &info);
638 +	ret = libbpf_netlink_send_recv(&req, get_tc_info, NULL, &info);
623 639 	if (ret < 0)
624 640 		return libbpf_err(ret);
625 641 	if (!info.processed)
···
632 650 {
633 651 	__u32 protocol = 0, handle, priority, parent, prog_id, flags;
634 652 	int ret, ifindex, attach_point, prog_fd;
635 -	struct {
636 -		struct nlmsghdr nh;
637 -		struct tcmsg tc;
638 -		char buf[256];
639 -	} req;
653 +	struct libbpf_nla_req req;
640 654
641 655 	if (!hook ||
642 656 	    !OPTS_VALID(hook, bpf_tc_hook) ||
···
679 701 	req.tc.tcm_parent = parent;
680 702
681 703 	if (!flush) {
682 -		ret = nlattr_add(&req.nh, sizeof(req), TCA_KIND,
683 -				 "bpf", sizeof("bpf"));
704 +		ret = nlattr_add(&req, TCA_KIND, "bpf", sizeof("bpf"));
684 705 		if (ret < 0)
685 706 			return ret;
686 707 	}
687 708
688 -	return libbpf_netlink_send_recv(&req.nh, NULL, NULL, NULL);
709 +	return libbpf_netlink_send_recv(&req, NULL, NULL, NULL);
689 710 }
690 711
691 712 int bpf_tc_detach(const struct bpf_tc_hook *hook,
···
704 727 	__u32 protocol, handle, priority, parent, prog_id, flags;
705 728 	int ret, ifindex, attach_point, prog_fd;
706 729 	struct bpf_cb_ctx info = {};
707 -	struct {
708 -		struct nlmsghdr nh;
709 -		struct tcmsg tc;
710 -		char buf[256];
711 -	} req;
730 +	struct libbpf_nla_req req;
712 731
713 732 	if (!hook || !opts ||
···
743 770 		return libbpf_err(ret);
744 771 	req.tc.tcm_parent = parent;
745 772
746 -	ret = nlattr_add(&req.nh, sizeof(req), TCA_KIND, "bpf", sizeof("bpf"));
773 +	ret = nlattr_add(&req, TCA_KIND, "bpf", sizeof("bpf"));
747 774 	if (ret < 0)
748 775 		return libbpf_err(ret);
749 776
750 777 	info.opts = opts;
751 778
752 -	ret = libbpf_netlink_send_recv(&req.nh, get_tc_info, NULL, &info);
779 +	ret = libbpf_netlink_send_recv(&req, get_tc_info, NULL, &info);
753 780 	if (ret < 0)
754 781 		return libbpf_err(ret);
755 782 	if (!info.processed)
+1 -1
tools/lib/bpf/nlattr.c
···
27 27 	int totlen = NLA_ALIGN(nla->nla_len);
28 28
29 29 	*remaining -= totlen;
30 -	return (struct nlattr *) ((char *) nla + totlen);
30 +	return (struct nlattr *)((void *)nla + totlen);
31 31 }
32 32
33 33 static int nla_ok(const struct nlattr *nla, int remaining)
+24 -14
tools/lib/bpf/nlattr.h
···
13 13 #include <string.h>
14 14 #include <errno.h>
15 15 #include <linux/netlink.h>
16 +#include <linux/rtnetlink.h>
16 17
17 18 /* avoid multiple definition of netlink features */
18 19 #define __LINUX_NETLINK_H
···
53 52 	uint16_t maxlen;
54 53 };
55 54
55 +struct libbpf_nla_req {
56 +	struct nlmsghdr nh;
57 +	union {
58 +		struct ifinfomsg ifinfo;
59 +		struct tcmsg tc;
60 +	};
61 +	char buf[128];
62 +};
63 +
56 64 /**
57 65  * @ingroup attr
58 66  * Iterate over a stream of attributes
···
81 71  */
82 72 static inline void *libbpf_nla_data(const struct nlattr *nla)
83 73 {
84 -	return (char *) nla + NLA_HDRLEN;
74 +	return (void *)nla + NLA_HDRLEN;
85 75 }
86 76
87 77 static inline uint8_t libbpf_nla_getattr_u8(const struct nlattr *nla)
···
118 108
119 109 static inline struct nlattr *nla_data(struct nlattr *nla)
120 110 {
121 -	return (struct nlattr *)((char *)nla + NLA_HDRLEN);
111 +	return (struct nlattr *)((void *)nla + NLA_HDRLEN);
122 112 }
123 113
124 -static inline struct nlattr *nh_tail(struct nlmsghdr *nh)
114 +static inline struct nlattr *req_tail(struct libbpf_nla_req *req)
125 115 {
126 -	return (struct nlattr *)((char *)nh + NLMSG_ALIGN(nh->nlmsg_len));
116 +	return (struct nlattr *)((void *)req + NLMSG_ALIGN(req->nh.nlmsg_len));
127 117 }
128 118
129 -static inline int nlattr_add(struct nlmsghdr *nh, size_t maxsz, int type,
119 +static inline int nlattr_add(struct libbpf_nla_req *req, int type,
130 120 			     const void *data, int len)
131 121 {
132 122 	struct nlattr *nla;
133 123
134 -	if (NLMSG_ALIGN(nh->nlmsg_len) + NLA_ALIGN(NLA_HDRLEN + len) > maxsz)
124 +	if (NLMSG_ALIGN(req->nh.nlmsg_len) + NLA_ALIGN(NLA_HDRLEN + len) > sizeof(*req))
135 125 		return -EMSGSIZE;
136 126 	if (!!data != !!len)
137 127 		return -EINVAL;
138 128
139 -	nla = nh_tail(nh);
129 +	nla = req_tail(req);
140 130 	nla->nla_type = type;
141 131 	nla->nla_len = NLA_HDRLEN + len;
142 132 	if (data)
143 133 		memcpy(nla_data(nla), data, len);
144 -	nh->nlmsg_len = NLMSG_ALIGN(nh->nlmsg_len) + NLA_ALIGN(nla->nla_len);
134 +	req->nh.nlmsg_len = NLMSG_ALIGN(req->nh.nlmsg_len) + NLA_ALIGN(nla->nla_len);
145 135 	return 0;
146 136 }
147 137
148 -static inline struct nlattr *nlattr_begin_nested(struct nlmsghdr *nh,
149 -						 size_t maxsz, int type)
138 +static inline struct nlattr *nlattr_begin_nested(struct libbpf_nla_req *req, int type)
150 139 {
151 140 	struct nlattr *tail;
152 141
153 -	tail = nh_tail(nh);
154 -	if (nlattr_add(nh, maxsz, type | NLA_F_NESTED, NULL, 0))
142 +	tail = req_tail(req);
143 +	if (nlattr_add(req, type | NLA_F_NESTED, NULL, 0))
155 144 		return NULL;
156 145 	return tail;
157 146 }
158 147
159 -static inline void nlattr_end_nested(struct nlmsghdr *nh, struct nlattr *tail)
148 +static inline void nlattr_end_nested(struct libbpf_nla_req *req,
149 +				     struct nlattr *tail)
160 150 {
161 -	tail->nla_len = (char *)nh_tail(nh) - (char *)tail;
151 +	tail->nla_len = (void *)req_tail(req) - (void *)tail;
162 152 }
163 153
164 154 #endif /* __LIBBPF_NLATTR_H */
+1 -1
tools/testing/selftests/bpf/prog_tests/ringbuf.c
···
100 100 	if (CHECK(err != 0, "skel_load", "skeleton load failed\n"))
101 101 		goto cleanup;
102 102
103 -	rb_fd = bpf_map__fd(skel->maps.ringbuf);
103 +	rb_fd = skel->maps.ringbuf.map_fd;
104 104 	/* good read/write cons_pos */
105 105 	mmap_ptr = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, rb_fd, 0);
106 106 	ASSERT_OK_PTR(mmap_ptr, "rw_cons_pos");