Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-11-02

The following pull-request contains BPF updates for your *net-next* tree.

We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 41 files changed, 1864 insertions(+), 474 deletions(-).

The main changes are:

1) Fix a long-standing user vs kernel access issue by introducing the
bpf_probe_read_user() and bpf_probe_read_kernel() helpers, from Daniel.

2) Accelerated xskmap lookup, from Björn and Maciej.

3) Support for automatic map pinning in libbpf, from Toke.

4) Cleanup of BTF-enabled raw tracepoints, from Alexei.

5) Various fixes to libbpf and selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

+1864 -477
+9
Documentation/bpf/index.rst
···
    prog_flow_dissector


+Testing BPF
+===========
+
+.. toctree::
+   :maxdepth: 1
+
+   s390
+
+
 .. Links:
 .. _Documentation/networking/filter.txt: ../networking/filter.txt
 .. _man-pages: https://www.kernel.org/doc/man-pages/
+205
Documentation/bpf/s390.rst
···
+===================
+Testing BPF on s390
+===================
+
+1. Introduction
+***************
+
+IBM Z are mainframe computers, which are descendants of IBM System/360 from
+year 1964. They are supported by the Linux kernel under the name "s390". This
+document describes how to test BPF in an s390 QEMU guest.
+
+2. One-time setup
+*****************
+
+The following is required to build and run the test suite:
+
+  * s390 GCC
+  * s390 development headers and libraries
+  * Clang with BPF support
+  * QEMU with s390 support
+  * Disk image with s390 rootfs
+
+Debian supports installing compiler and libraries for s390 out of the box.
+Users of other distros may use debootstrap in order to set up a Debian chroot::
+
+  sudo debootstrap \
+    --variant=minbase \
+    --include=sudo \
+    testing \
+    ./s390-toolchain
+  sudo mount --rbind /dev ./s390-toolchain/dev
+  sudo mount --rbind /proc ./s390-toolchain/proc
+  sudo mount --rbind /sys ./s390-toolchain/sys
+  sudo chroot ./s390-toolchain
+
+Once on Debian, the build prerequisites can be installed as follows::
+
+  sudo dpkg --add-architecture s390x
+  sudo apt-get update
+  sudo apt-get install \
+    bc \
+    bison \
+    cmake \
+    debootstrap \
+    dwarves \
+    flex \
+    g++ \
+    gcc \
+    g++-s390x-linux-gnu \
+    gcc-s390x-linux-gnu \
+    gdb-multiarch \
+    git \
+    make \
+    python3 \
+    qemu-system-misc \
+    qemu-utils \
+    rsync \
+    libcap-dev:s390x \
+    libelf-dev:s390x \
+    libncurses-dev
+
+Latest Clang targeting BPF can be installed as follows::
+
+  git clone https://github.com/llvm/llvm-project.git
+  ln -s ../../clang llvm-project/llvm/tools/
+  mkdir llvm-project-build
+  cd llvm-project-build
+  cmake \
+    -DLLVM_TARGETS_TO_BUILD=BPF \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DCMAKE_INSTALL_PREFIX=/opt/clang-bpf \
+    ../llvm-project/llvm
+  make
+  sudo make install
+  export PATH=/opt/clang-bpf/bin:$PATH
+
+The disk image can be prepared using a loopback mount and debootstrap::
+
+  qemu-img create -f raw ./s390.img 1G
+  sudo losetup -f ./s390.img
+  sudo mkfs.ext4 /dev/loopX
+  mkdir ./s390.rootfs
+  sudo mount /dev/loopX ./s390.rootfs
+  sudo debootstrap \
+    --foreign \
+    --arch=s390x \
+    --variant=minbase \
+    --include=" \
+      iproute2, \
+      iputils-ping, \
+      isc-dhcp-client, \
+      kmod, \
+      libcap2, \
+      libelf1, \
+      netcat, \
+      procps" \
+    testing \
+    ./s390.rootfs
+  sudo umount ./s390.rootfs
+  sudo losetup -d /dev/loopX
+
+3. Compilation
+**************
+
+In addition to the usual Kconfig options required to run the BPF test suite, it
+is also helpful to select::
+
+  CONFIG_NET_9P=y
+  CONFIG_9P_FS=y
+  CONFIG_NET_9P_VIRTIO=y
+  CONFIG_VIRTIO_PCI=y
+
+as that would enable a very easy way to share files with the s390 virtual
+machine.
+
+Compiling kernel, modules and testsuite, as well as preparing gdb scripts to
+simplify debugging, can be done using the following commands::
+
+  make ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- menuconfig
+  make ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- bzImage modules scripts_gdb
+  make ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- \
+    -C tools/testing/selftests \
+    TARGETS=bpf \
+    INSTALL_PATH=$PWD/tools/testing/selftests/kselftest_install \
+    install
+
+4. Running the test suite
+*************************
+
+The virtual machine can be started as follows::
+
+  qemu-system-s390x \
+    -cpu max,zpci=on \
+    -smp 2 \
+    -m 4G \
+    -kernel linux/arch/s390/boot/compressed/vmlinux \
+    -drive file=./s390.img,if=virtio,format=raw \
+    -nographic \
+    -append 'root=/dev/vda rw console=ttyS1' \
+    -virtfs local,path=./linux,security_model=none,mount_tag=linux \
+    -object rng-random,filename=/dev/urandom,id=rng0 \
+    -device virtio-rng-ccw,rng=rng0 \
+    -netdev user,id=net0 \
+    -device virtio-net-ccw,netdev=net0
+
+When using this on a real IBM Z, ``-enable-kvm`` may be added for better
+performance. When starting the virtual machine for the first time, disk image
+setup must be finalized using the following command::
+
+  /debootstrap/debootstrap --second-stage
+
+Directory with the code built on the host as well as ``/proc`` and ``/sys``
+need to be mounted as follows::
+
+  mkdir -p /linux
+  mount -t 9p linux /linux
+  mount -t proc proc /proc
+  mount -t sysfs sys /sys
+
+After that, the test suite can be run using the following commands::
+
+  cd /linux/tools/testing/selftests/kselftest_install
+  ./run_kselftest.sh
+
+As usual, tests can be also run individually::
+
+  cd /linux/tools/testing/selftests/bpf
+  ./test_verifier
+
+5. Debugging
+************
+
+It is possible to debug the s390 kernel using QEMU GDB stub, which is activated
+by passing ``-s`` to QEMU.
+
+It is preferable to turn KASLR off, so that gdb would know where to find the
+kernel image in memory, by building the kernel with::
+
+  RANDOMIZE_BASE=n
+
+GDB can then be attached using the following command::
+
+  gdb-multiarch -ex 'target remote localhost:1234' ./vmlinux
+
+6. Network
+**********
+
+In case one needs to use the network in the virtual machine in order to e.g.
+install additional packages, it can be configured using::
+
+  dhclient eth0
+
+7. Links
+********
+
+This document is a compilation of techniques, whose more comprehensive
+descriptions can be found by following these links:
+
+- `Debootstrap <https://wiki.debian.org/EmDebian/CrossDebootstrap>`_
+- `Multiarch <https://wiki.debian.org/Multiarch/HOWTO>`_
+- `Building LLVM <https://llvm.org/docs/CMake.html>`_
+- `Cross-compiling the kernel <https://wiki.gentoo.org/wiki/Embedded_Handbook/General/Cross-compiling_the_kernel>`_
+- `QEMU s390x Guest Support <https://wiki.qemu.org/Documentation/Platforms/S390X>`_
+- `Plan 9 folder sharing over Virtio <https://wiki.qemu.org/Documentation/9psetup>`_
+- `Using GDB with QEMU <https://wiki.osdev.org/Kernel_Debugging#Use_GDB_with_QEMU>`_
+1 -1
arch/x86/mm/Makefile
···
 endif

 obj-y	:= init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o
+	    pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o maccess.o

 # Make sure __phys_addr has no stackprotector
 nostackp := $(call cc-option, -fno-stack-protector)
+43
arch/x86/mm/maccess.c
···
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/uaccess.h>
+#include <linux/kernel.h>
+
+#ifdef CONFIG_X86_64
+static __always_inline u64 canonical_address(u64 vaddr, u8 vaddr_bits)
+{
+	return ((s64)vaddr << (64 - vaddr_bits)) >> (64 - vaddr_bits);
+}
+
+static __always_inline bool invalid_probe_range(u64 vaddr)
+{
+	/*
+	 * Range covering the highest possible canonical userspace address
+	 * as well as non-canonical address range. For the canonical range
+	 * we also need to include the userspace guard page.
+	 */
+	return vaddr < TASK_SIZE_MAX + PAGE_SIZE ||
+	       canonical_address(vaddr, boot_cpu_data.x86_virt_bits) != vaddr;
+}
+#else
+static __always_inline bool invalid_probe_range(u64 vaddr)
+{
+	return vaddr < TASK_SIZE_MAX;
+}
+#endif
+
+long probe_kernel_read_strict(void *dst, const void *src, size_t size)
+{
+	if (unlikely(invalid_probe_range((unsigned long)src)))
+		return -EFAULT;
+
+	return __probe_kernel_read(dst, src, size);
+}
+
+long strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr, long count)
+{
+	if (unlikely(invalid_probe_range((unsigned long)unsafe_addr)))
+		return -EFAULT;
+
+	return __strncpy_from_unsafe(dst, unsafe_addr, count);
+}
+5 -25
include/linux/bpf.h
···

 #define MAX_BPF_CGROUP_STORAGE_TYPE __BPF_CGROUP_STORAGE_MAX

+/* The longest tracepoint has 12 args.
+ * See include/trace/bpf_probe.h
+ */
+#define MAX_BPF_FUNC_ARGS 12
+
 struct bpf_prog_stats {
	u64 cnt;
	u64 nsecs;
···
					struct bpf_prog *prog)
 {
	return -EINVAL;
-}
-#endif
-
-#if defined(CONFIG_XDP_SOCKETS)
-struct xdp_sock;
-struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key);
-int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
-		       struct xdp_sock *xs);
-void __xsk_map_flush(struct bpf_map *map);
-#else
-struct xdp_sock;
-static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
-						     u32 key)
-{
-	return NULL;
-}
-
-static inline int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
-				     struct xdp_sock *xs)
-{
-	return -EOPNOTSUPP;
-}
-
-static inline void __xsk_map_flush(struct bpf_map *map)
-{
 }
 #endif

+1
include/linux/bpf_types.h
···
 BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event)
 BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
+BPF_PROG_TYPE(BPF_PROG_TYPE_TRACING, tracing)
 #endif
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
+16
include/linux/uaccess.h
···
  * happens, handle that and return -EFAULT.
  */
 extern long probe_kernel_read(void *dst, const void *src, size_t size);
+extern long probe_kernel_read_strict(void *dst, const void *src, size_t size);
 extern long __probe_kernel_read(void *dst, const void *src, size_t size);

 /*
···
 extern long notrace probe_kernel_write(void *dst, const void *src, size_t size);
 extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size);

+/*
+ * probe_user_write(): safely attempt to write to a location in user space
+ * @dst: address to write to
+ * @src: pointer to the data that shall be written
+ * @size: size of the data chunk
+ *
+ * Safely write to address @dst from the buffer at @src. If a kernel fault
+ * happens, handle that and return -EFAULT.
+ */
+extern long notrace probe_user_write(void __user *dst, const void *src, size_t size);
+extern long notrace __probe_user_write(void __user *dst, const void *src, size_t size);
+
 extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
+extern long strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr,
+				       long count);
+extern long __strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
				     long count);
 extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);
+39 -12
include/net/xdp_sock.h
···
 /* Nodes are linked in the struct xdp_sock map_list field, and used to
  * track which maps a certain socket reside in.
  */
-struct xsk_map;
+
+struct xsk_map {
+	struct bpf_map map;
+	struct list_head __percpu *flush_list;
+	spinlock_t lock; /* Synchronize map updates */
+	struct xdp_sock *xsk_map[];
+};
+
 struct xsk_map_node {
	struct list_head node;
	struct xsk_map *map;
···
 struct xdp_buff;
 #ifdef CONFIG_XDP_SOCKETS
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
-int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
-void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 /* Used from netdev driver */
 bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt);
···
			     struct xdp_sock **map_entry);
 int xsk_map_inc(struct xsk_map *map);
 void xsk_map_put(struct xsk_map *map);
+int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
+		       struct xdp_sock *xs);
+void __xsk_map_flush(struct bpf_map *map);
+
+static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
+						     u32 key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xdp_sock *xs;
+
+	if (key >= map->max_entries)
+		return NULL;
+
+	xs = READ_ONCE(m->xsk_map[key]);
+	return xs;
+}

 static inline u64 xsk_umem_extract_addr(u64 addr)
 {
···
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
	return -ENOTSUPP;
-}
-
-static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
-{
-	return -ENOTSUPP;
-}
-
-static inline void xsk_flush(struct xdp_sock *xs)
-{
 }

 static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
···
	return 0;
 }

+static inline int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
+				     struct xdp_sock *xs)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void __xsk_map_flush(struct bpf_map *map)
+{
+}
+
+static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
+						     u32 key)
+{
+	return NULL;
+}
 #endif /* CONFIG_XDP_SOCKETS */

 #endif /* _LINUX_XDP_SOCK_H */
+84 -40
include/uapi/linux/bpf.h
···
	BPF_PROG_TYPE_CGROUP_SYSCTL,
	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
	BPF_PROG_TYPE_CGROUP_SOCKOPT,
+	BPF_PROG_TYPE_TRACING,
 };

 enum bpf_attach_type {
···
	BPF_CGROUP_UDP6_RECVMSG,
	BPF_CGROUP_GETSOCKOPT,
	BPF_CGROUP_SETSOCKOPT,
+	BPF_TRACE_RAW_TP,
	__MAX_BPF_ATTACH_TYPE
 };
···
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read(void *dst, u32 size, const void *src)
+ * int bpf_probe_read(void *dst, u32 size, const void *unsafe_ptr)
  *	Description
  *		For tracing programs, safely attempt to read *size* bytes from
- *		address *src* and store the data in *dst*.
+ *		kernel space address *unsafe_ptr* and store the data in *dst*.
+ *
+ *		Generally, use bpf_probe_read_user() or bpf_probe_read_kernel()
+ *		instead.
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
···
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
+ * int bpf_probe_read_str(void *dst, u32 size, const void *unsafe_ptr)
  *	Description
- *		Copy a NUL terminated string from an unsafe address
- *		*unsafe_ptr* to *dst*. The *size* should include the
- *		terminating NUL byte. In case the string length is smaller than
- *		*size*, the target is not padded with further NUL bytes. If the
- *		string length is larger than *size*, just *size*-1 bytes are
- *		copied and the last byte is set to NUL.
+ *		Copy a NUL terminated string from an unsafe kernel address
+ *		*unsafe_ptr* to *dst*. See bpf_probe_read_kernel_str() for
+ *		more details.
  *
- *		On success, the length of the copied string is returned. This
- *		makes this helper useful in tracing programs for reading
- *		strings, and more importantly to get its length at runtime. See
- *		the following snippet:
- *
- *		::
- *
- *			SEC("kprobe/sys_open")
- *			void bpf_sys_open(struct pt_regs *ctx)
- *			{
- *				char buf[PATHLEN]; // PATHLEN is defined to 256
- *				int res = bpf_probe_read_str(buf, sizeof(buf),
- *					                     ctx->di);
- *
- *				// Consume buf, for example push it to
- *				// userspace via bpf_perf_event_output(); we
- *				// can use res (the string length) as event
- *				// size, after checking its boundaries.
- *			}
- *
- *		In comparison, using **bpf_probe_read()** helper here instead
- *		to read the string would require to estimate the length at
- *		compile time, and would often result in copying more memory
- *		than necessary.
- *
- *		Another useful use case is when parsing individual process
- *		arguments or individual environment variables navigating
- *		*current*\ **->mm->arg_start** and *current*\
- *		**->mm->env_start**: using this helper and the return value,
- *		one can quickly iterate at the right offset of the memory area.
+ *		Generally, use bpf_probe_read_user_str() or bpf_probe_read_kernel_str()
+ *		instead.
  *	Return
  *		On success, the strictly positive length of the string,
  *		including the trailing NUL character. On error, a negative
···
  *		restricted to raw_tracepoint bpf programs.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_probe_read_user(void *dst, u32 size, const void *unsafe_ptr)
+ *	Description
+ *		Safely attempt to read *size* bytes from user space address
+ *		*unsafe_ptr* and store the data in *dst*.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
+ *	Description
+ *		Safely attempt to read *size* bytes from kernel space address
+ *		*unsafe_ptr* and store the data in *dst*.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_probe_read_user_str(void *dst, u32 size, const void *unsafe_ptr)
+ *	Description
+ *		Copy a NUL terminated string from an unsafe user address
+ *		*unsafe_ptr* to *dst*. The *size* should include the
+ *		terminating NUL byte. In case the string length is smaller than
+ *		*size*, the target is not padded with further NUL bytes. If the
+ *		string length is larger than *size*, just *size*-1 bytes are
+ *		copied and the last byte is set to NUL.
+ *
+ *		On success, the length of the copied string is returned. This
+ *		makes this helper useful in tracing programs for reading
+ *		strings, and more importantly to get its length at runtime. See
+ *		the following snippet:
+ *
+ *		::
+ *
+ *			SEC("kprobe/sys_open")
+ *			void bpf_sys_open(struct pt_regs *ctx)
+ *			{
+ *				char buf[PATHLEN]; // PATHLEN is defined to 256
+ *				int res = bpf_probe_read_user_str(buf, sizeof(buf),
+ *					                          ctx->di);
+ *
+ *				// Consume buf, for example push it to
+ *				// userspace via bpf_perf_event_output(); we
+ *				// can use res (the string length) as event
+ *				// size, after checking its boundaries.
+ *			}
+ *
+ *		In comparison, using **bpf_probe_read_user()** helper here
+ *		instead to read the string would require to estimate the length
+ *		at compile time, and would often result in copying more memory
+ *		than necessary.
+ *
+ *		Another useful use case is when parsing individual process
+ *		arguments or individual environment variables navigating
+ *		*current*\ **->mm->arg_start** and *current*\
+ *		**->mm->env_start**: using this helper and the return value,
+ *		one can quickly iterate at the right offset of the memory area.
+ *	Return
+ *		On success, the strictly positive length of the string,
+ *		including the trailing NUL character. On error, a negative
+ *		value.
+ *
+ * int bpf_probe_read_kernel_str(void *dst, u32 size, const void *unsafe_ptr)
+ *	Description
+ *		Copy a NUL terminated string from an unsafe kernel address *unsafe_ptr*
+ *		to *dst*. Same semantics as with bpf_probe_read_user_str() apply.
+ *	Return
+ *		On success, the strictly positive length of the string, including
+ *		the trailing NUL character. On error, a negative value.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
	FN(unspec),			\
···
	FN(sk_storage_delete),		\
	FN(send_signal),		\
	FN(tcp_gen_syncookie),		\
-	FN(skb_output),
+	FN(skb_output),			\
+	FN(probe_read_user),		\
+	FN(probe_read_kernel),		\
+	FN(probe_read_user_str),	\
+	FN(probe_read_kernel_str),

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
+5 -7
kernel/bpf/core.c
···
 {
	struct latch_tree_node *n;

-	if (!bpf_jit_kallsyms_enabled())
-		return NULL;
-
	n = latch_tree_find((void *)addr, &bpf_tree, &bpf_tree_ops);
	return n ?
	       container_of(n, struct bpf_prog_aux, ksym_tnode)->prog :
···
 }

 #ifndef CONFIG_BPF_JIT_ALWAYS_ON
-u64 __weak bpf_probe_read(void * dst, u32 size, const void * unsafe_ptr)
+u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
 {
	memset(dst, 0, size);
	return -EFAULT;
 }
+
 /**
  *	__bpf_prog_run - run eBPF program on a given context
  *	@regs: is the array of MAX_BPF_EXT_REG eBPF pseudo-registers
···
	LDST(W,  u32)
	LDST(DW, u64)
 #undef LDST
-#define LDX_PROBE(SIZEOP, SIZE)					\
-	LDX_PROBE_MEM_##SIZEOP:					\
-		bpf_probe_read(&DST, SIZE, (const void *)(long) SRC);	\
+#define LDX_PROBE(SIZEOP, SIZE)						\
+	LDX_PROBE_MEM_##SIZEOP:						\
+		bpf_probe_read_kernel(&DST, SIZE, (const void *)(long) SRC);	\
		CONT;
	LDX_PROBE(B,  1)
	LDX_PROBE(H,  2)
+3 -3
kernel/bpf/syscall.c
···
				       u32 btf_id)
 {
	switch (prog_type) {
-	case BPF_PROG_TYPE_RAW_TRACEPOINT:
+	case BPF_PROG_TYPE_TRACING:
		if (btf_id > BTF_MAX_TYPE)
			return -EINVAL;
		break;
···
		return PTR_ERR(prog);

	if (prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT &&
+	    prog->type != BPF_PROG_TYPE_TRACING &&
	    prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE) {
		err = -EINVAL;
		goto out_put_prog;
	}

-	if (prog->type == BPF_PROG_TYPE_RAW_TRACEPOINT &&
-	    prog->aux->attach_btf_id) {
+	if (prog->type == BPF_PROG_TYPE_TRACING) {
		if (attr->raw_tracepoint.name) {
			/* raw_tp name should not be specified in raw_tp
			 * programs that were verified via in-kernel BTF info
+29 -10
kernel/bpf/verifier.c
···
	case BPF_PROG_TYPE_CGROUP_SYSCTL:
	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
		break;
+	case BPF_PROG_TYPE_RAW_TRACEPOINT:
+		if (!env->prog->aux->attach_btf_id)
+			return 0;
+		range = tnum_const(0);
+		break;
	default:
		return 0;
	}
···
 {
	struct bpf_prog *prog = env->prog;
	u32 btf_id = prog->aux->attach_btf_id;
+	const char prefix[] = "btf_trace_";
	const struct btf_type *t;
	const char *tname;

-	if (prog->type == BPF_PROG_TYPE_RAW_TRACEPOINT && btf_id) {
-		const char prefix[] = "btf_trace_";
+	if (prog->type != BPF_PROG_TYPE_TRACING)
+		return 0;

-		t = btf_type_by_id(btf_vmlinux, btf_id);
-		if (!t) {
-			verbose(env, "attach_btf_id %u is invalid\n", btf_id);
-			return -EINVAL;
-		}
+	if (!btf_id) {
+		verbose(env, "Tracing programs must provide btf_id\n");
+		return -EINVAL;
+	}
+	t = btf_type_by_id(btf_vmlinux, btf_id);
+	if (!t) {
+		verbose(env, "attach_btf_id %u is invalid\n", btf_id);
+		return -EINVAL;
+	}
+	tname = btf_name_by_offset(btf_vmlinux, t->name_off);
+	if (!tname) {
+		verbose(env, "attach_btf_id %u doesn't have a name\n", btf_id);
+		return -EINVAL;
+	}
+
+	switch (prog->expected_attach_type) {
+	case BPF_TRACE_RAW_TP:
		if (!btf_type_is_typedef(t)) {
			verbose(env, "attach_btf_id %u is not a typedef\n",
				btf_id);
			return -EINVAL;
		}
-		tname = btf_name_by_offset(btf_vmlinux, t->name_off);
-		if (!tname || strncmp(prefix, tname, sizeof(prefix) - 1)) {
+		if (strncmp(prefix, tname, sizeof(prefix) - 1)) {
			verbose(env, "attach_btf_id %u points to wrong type name %s\n",
				btf_id, tname);
			return -EINVAL;
···
		prog->aux->attach_func_name = tname;
		prog->aux->attach_func_proto = t;
		prog->aux->attach_btf_trace = true;
+		return 0;
+	default:
+		return -EINVAL;
	}
-	return 0;
 }

 int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
+35 -75
kernel/bpf/xskmap.c
···
 #include <linux/slab.h>
 #include <linux/sched.h>

-struct xsk_map {
-	struct bpf_map map;
-	struct xdp_sock **xsk_map;
-	struct list_head __percpu *flush_list;
-	spinlock_t lock; /* Synchronize map updates */
-};
-
 int xsk_map_inc(struct xsk_map *map)
 {
	struct bpf_map *m = &map->map;
···

 static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 {
+	struct bpf_map_memory mem;
+	int cpu, err, numa_node;
	struct xsk_map *m;
-	int cpu, err;
-	u64 cost;
+	u64 cost, size;

	if (!capable(CAP_NET_ADMIN))
		return ERR_PTR(-EPERM);
···
	    attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY))
		return ERR_PTR(-EINVAL);

-	m = kzalloc(sizeof(*m), GFP_USER);
-	if (!m)
+	numa_node = bpf_map_attr_numa_node(attr);
+	size = struct_size(m, xsk_map, attr->max_entries);
+	cost = size + array_size(sizeof(*m->flush_list), num_possible_cpus());
+
+	err = bpf_map_charge_init(&mem, cost);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	m = bpf_map_area_alloc(size, numa_node);
+	if (!m) {
+		bpf_map_charge_finish(&mem);
		return ERR_PTR(-ENOMEM);
+	}

	bpf_map_init_from_attr(&m->map, attr);
+	bpf_map_charge_move(&m->map.memory, &mem);
	spin_lock_init(&m->lock);

-	cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
-	cost += sizeof(struct list_head) * num_possible_cpus();
-
-	/* Notice returns -EPERM on if map size is larger than memlock limit */
-	err = bpf_map_charge_init(&m->map.memory, cost);
-	if (err)
-		goto free_m;
-
-	err = -ENOMEM;
-
	m->flush_list = alloc_percpu(struct list_head);
-	if (!m->flush_list)
-		goto free_charge;
+	if (!m->flush_list) {
+		bpf_map_charge_finish(&m->map.memory);
+		bpf_map_area_free(m);
+		return ERR_PTR(-ENOMEM);
+	}

	for_each_possible_cpu(cpu)
		INIT_LIST_HEAD(per_cpu_ptr(m->flush_list, cpu));

-	m->xsk_map = bpf_map_area_alloc(m->map.max_entries *
-					sizeof(struct xdp_sock *),
-					m->map.numa_node);
-	if (!m->xsk_map)
-		goto free_percpu;
	return &m->map;
-
-free_percpu:
-	free_percpu(m->flush_list);
-free_charge:
-	bpf_map_charge_finish(&m->map.memory);
-free_m:
-	kfree(m);
-	return ERR_PTR(err);
 }

 static void xsk_map_free(struct bpf_map *map)
···
	bpf_clear_redirect_map(map);
	synchronize_net();
	free_percpu(m->flush_list);
-	bpf_map_area_free(m->xsk_map);
-	kfree(m);
+	bpf_map_area_free(m);
 }

 static int xsk_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
···
	return 0;
 }

-struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
+static u32 xsk_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 {
-	struct xsk_map *m = container_of(map, struct xsk_map, map);
-	struct xdp_sock *xs;
+	const int ret = BPF_REG_0, mp = BPF_REG_1, index = BPF_REG_2;
+	struct bpf_insn *insn = insn_buf;

-	if (key >= map->max_entries)
-		return NULL;
-
-	xs = READ_ONCE(m->xsk_map[key]);
-	return xs;
-}
-
-int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
-		       struct xdp_sock *xs)
-{
-	struct xsk_map *m = container_of(map, struct xsk_map, map);
-	struct list_head *flush_list = this_cpu_ptr(m->flush_list);
-	int err;
-
-	err = xsk_rcv(xs, xdp);
-	if (err)
-		return err;
-
-	if (!xs->flush_node.prev)
-		list_add(&xs->flush_node, flush_list);
-
-	return 0;
-}
-
-void __xsk_map_flush(struct bpf_map *map)
-{
-	struct xsk_map *m = container_of(map, struct xsk_map, map);
-	struct list_head *flush_list = this_cpu_ptr(m->flush_list);
-	struct xdp_sock *xs, *tmp;
-
-	list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
-		xsk_flush(xs);
-		__list_del_clearprev(&xs->flush_node);
-	}
+	*insn++ = BPF_LDX_MEM(BPF_W, ret, index, 0);
+	*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 5);
+	*insn++ = BPF_ALU64_IMM(BPF_LSH, ret, ilog2(sizeof(struct xsk_sock *)));
+	*insn++ = BPF_ALU64_IMM(BPF_ADD, mp, offsetof(struct xsk_map, xsk_map));
+	*insn++ = BPF_ALU64_REG(BPF_ADD, ret, mp);
+	*insn++ = BPF_LDX_MEM(BPF_SIZEOF(struct xsk_sock *), ret, ret, 0);
+	*insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1);
+	*insn++ = BPF_MOV64_IMM(ret, 0);
+	return insn - insn_buf;
 }

 static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
···
	.map_free = xsk_map_free,
	.map_get_next_key = xsk_map_get_next_key,
	.map_lookup_elem = xsk_map_lookup_elem,
+	.map_gen_lookup = xsk_map_gen_lookup,
	.map_lookup_elem_sys_only = xsk_map_lookup_elem_sys_only,
	.map_update_elem = xsk_map_update_elem,
	.map_delete_elem = xsk_map_delete_elem,
+175 -60
kernel/trace/bpf_trace.c
···
 };
 #endif

-BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
+BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size,
+	   const void __user *, unsafe_ptr)
 {
-	int ret;
+	int ret = probe_user_read(dst, unsafe_ptr, size);

-	ret = security_locked_down(LOCKDOWN_BPF_READ);
-	if (ret < 0)
-		goto out;
-
-	ret = probe_kernel_read(dst, unsafe_ptr, size);
	if (unlikely(ret < 0))
-out:
		memset(dst, 0, size);

	return ret;
 }

-static const struct bpf_func_proto bpf_probe_read_proto = {
-	.func		= bpf_probe_read,
+static const struct bpf_func_proto bpf_probe_read_user_proto = {
+	.func		= bpf_probe_read_user,
	.gpl_only	= true,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
···
	.arg3_type	= ARG_ANYTHING,
 };

-BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src,
+BPF_CALL_3(bpf_probe_read_user_str, void *, dst, u32, size,
+	   const void __user *, unsafe_ptr)
+{
+	int ret = strncpy_from_unsafe_user(dst, unsafe_ptr, size);
+
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_user_str_proto = {
+	.func		= bpf_probe_read_user_str,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static __always_inline int
+bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr,
+			     const bool compat)
+{
+	int ret = security_locked_down(LOCKDOWN_BPF_READ);
+
+	if (unlikely(ret < 0))
+		goto out;
+	ret = compat ? probe_kernel_read(dst, unsafe_ptr, size) :
+	      probe_kernel_read_strict(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+out:
+		memset(dst, 0, size);
+	return ret;
+}
+
+BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_common(dst, size, unsafe_ptr, false);
+}
+
+static const struct bpf_func_proto bpf_probe_read_kernel_proto = {
+	.func		= bpf_probe_read_kernel,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_probe_read_compat, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_common(dst, size, unsafe_ptr, true);
+}
+
+static const struct bpf_func_proto bpf_probe_read_compat_proto = {
+	.func		= bpf_probe_read_compat,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static __always_inline int
+bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr,
+				 const bool compat)
+{
+	int ret = security_locked_down(LOCKDOWN_BPF_READ);
+
+	if (unlikely(ret < 0))
+		goto out;
+	/*
+	 * The strncpy_from_unsafe_*() call will likely not fill the entire
+	 * buffer, but that's okay in this circumstance as we're probing
+	 * arbitrary memory anyway similar to bpf_probe_read_*() and might
+	 * as well probe the stack. Thus, memory is explicitly cleared
+	 * only in error case, so that improper users ignoring return
+	 * code altogether don't copy garbage; otherwise length of string
+	 * is returned that can be used for bpf_perf_event_output() et al.
+	 */
+	ret = compat ? strncpy_from_unsafe(dst, unsafe_ptr, size) :
+	      strncpy_from_unsafe_strict(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+out:
+		memset(dst, 0, size);
+	return ret;
+}
+
+BPF_CALL_3(bpf_probe_read_kernel_str, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr, false);
+}
+
+static const struct bpf_func_proto bpf_probe_read_kernel_str_proto = {
+	.func		= bpf_probe_read_kernel_str,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_probe_read_compat_str, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr, true);
+}
+
+static const struct bpf_func_proto bpf_probe_read_compat_str_proto = {
+	.func		= bpf_probe_read_compat_str,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_probe_write_user, void __user *, unsafe_ptr, const void *, src,
	   u32, size)
 {
	/*
···
		return -EPERM;
	if (unlikely(!nmi_uaccess_okay()))
		return -EPERM;
-	if (!access_ok(unsafe_ptr, size))
-		return -EPERM;

-	return probe_kernel_write(unsafe_ptr, src, size);
+	return probe_user_write(unsafe_ptr, src, size);
 }

 static const struct bpf_func_proto bpf_probe_write_user_proto = {
···
	.arg2_type	= ARG_ANYTHING,
 };

-BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
-	   const void *, unsafe_ptr)
-{
-	int ret;
-
-	ret = security_locked_down(LOCKDOWN_BPF_READ);
-	if (ret < 0)
-		goto out;
-
-	/*
-	 * The
strncpy_from_unsafe() call will likely not fill the entire 713 - * buffer, but that's okay in this circumstance as we're probing 714 - * arbitrary memory anyway similar to bpf_probe_read() and might 715 - * as well probe the stack. Thus, memory is explicitly cleared 716 - * only in error case, so that improper users ignoring return 717 - * code altogether don't copy garbage; otherwise length of string 718 - * is returned that can be used for bpf_perf_event_output() et al. 719 - */ 720 - ret = strncpy_from_unsafe(dst, unsafe_ptr, size); 721 - if (unlikely(ret < 0)) 722 - out: 723 - memset(dst, 0, size); 724 - 725 - return ret; 726 - } 727 - 728 - static const struct bpf_func_proto bpf_probe_read_str_proto = { 729 - .func = bpf_probe_read_str, 730 - .gpl_only = true, 731 - .ret_type = RET_INTEGER, 732 - .arg1_type = ARG_PTR_TO_UNINIT_MEM, 733 - .arg2_type = ARG_CONST_SIZE_OR_ZERO, 734 - .arg3_type = ARG_ANYTHING, 735 - }; 736 - 737 588 struct send_signal_irq_work { 738 589 struct irq_work irq_work; 739 590 struct task_struct *task; ··· 778 699 return &bpf_map_pop_elem_proto; 779 700 case BPF_FUNC_map_peek_elem: 780 701 return &bpf_map_peek_elem_proto; 781 - case BPF_FUNC_probe_read: 782 - return &bpf_probe_read_proto; 783 702 case BPF_FUNC_ktime_get_ns: 784 703 return &bpf_ktime_get_ns_proto; 785 704 case BPF_FUNC_tail_call: ··· 804 727 return &bpf_current_task_under_cgroup_proto; 805 728 case BPF_FUNC_get_prandom_u32: 806 729 return &bpf_get_prandom_u32_proto; 730 + case BPF_FUNC_probe_read_user: 731 + return &bpf_probe_read_user_proto; 732 + case BPF_FUNC_probe_read_kernel: 733 + return &bpf_probe_read_kernel_proto; 734 + case BPF_FUNC_probe_read: 735 + return &bpf_probe_read_compat_proto; 736 + case BPF_FUNC_probe_read_user_str: 737 + return &bpf_probe_read_user_str_proto; 738 + case BPF_FUNC_probe_read_kernel_str: 739 + return &bpf_probe_read_kernel_str_proto; 807 740 case BPF_FUNC_probe_read_str: 808 - return &bpf_probe_read_str_proto; 741 + return 
&bpf_probe_read_compat_str_proto; 809 742 #ifdef CONFIG_CGROUPS 810 743 case BPF_FUNC_get_current_cgroup_id: 811 744 return &bpf_get_current_cgroup_id_proto; ··· 1142 1055 switch (func_id) { 1143 1056 case BPF_FUNC_perf_event_output: 1144 1057 return &bpf_perf_event_output_proto_raw_tp; 1145 - #ifdef CONFIG_NET 1146 - case BPF_FUNC_skb_output: 1147 - return &bpf_skb_output_proto; 1148 - #endif 1149 1058 case BPF_FUNC_get_stackid: 1150 1059 return &bpf_get_stackid_proto_raw_tp; 1151 1060 case BPF_FUNC_get_stack: ··· 1151 1068 } 1152 1069 } 1153 1070 1071 + static const struct bpf_func_proto * 1072 + tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) 1073 + { 1074 + switch (func_id) { 1075 + #ifdef CONFIG_NET 1076 + case BPF_FUNC_skb_output: 1077 + return &bpf_skb_output_proto; 1078 + #endif 1079 + default: 1080 + return raw_tp_prog_func_proto(func_id, prog); 1081 + } 1082 + } 1083 + 1154 1084 static bool raw_tp_prog_is_valid_access(int off, int size, 1155 1085 enum bpf_access_type type, 1156 1086 const struct bpf_prog *prog, 1157 1087 struct bpf_insn_access_aux *info) 1158 1088 { 1159 - /* largest tracepoint in the kernel has 12 args */ 1160 - if (off < 0 || off >= sizeof(__u64) * 12) 1089 + if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS) 1161 1090 return false; 1162 1091 if (type != BPF_READ) 1163 1092 return false; 1164 1093 if (off % size != 0) 1165 1094 return false; 1166 - if (!prog->aux->attach_btf_id) 1167 - return true; 1095 + return true; 1096 + } 1097 + 1098 + static bool tracing_prog_is_valid_access(int off, int size, 1099 + enum bpf_access_type type, 1100 + const struct bpf_prog *prog, 1101 + struct bpf_insn_access_aux *info) 1102 + { 1103 + if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS) 1104 + return false; 1105 + if (type != BPF_READ) 1106 + return false; 1107 + if (off % size != 0) 1108 + return false; 1168 1109 return btf_ctx_access(off, size, type, prog, info); 1169 1110 } 1170 1111 ··· 1198 1091 }; 
1199 1092 1200 1093 const struct bpf_prog_ops raw_tracepoint_prog_ops = { 1094 + }; 1095 + 1096 + const struct bpf_verifier_ops tracing_verifier_ops = { 1097 + .get_func_proto = tracing_prog_func_proto, 1098 + .is_valid_access = tracing_prog_is_valid_access, 1099 + }; 1100 + 1101 + const struct bpf_prog_ops tracing_prog_ops = { 1201 1102 }; 1202 1103 1203 1104 static bool raw_tp_writable_prog_is_valid_access(int off, int size,
+103 -9
lib/test_bpf.c
··· 6859 6859 return NULL; 6860 6860 } 6861 6861 6862 - static __init int test_skb_segment(void) 6862 + static __init struct sk_buff *build_test_skb_linear_no_head_frag(void) 6863 6863 { 6864 + unsigned int alloc_size = 2000; 6865 + unsigned int headroom = 102, doffset = 72, data_size = 1308; 6866 + struct sk_buff *skb[2]; 6867 + int i; 6868 + 6869 + /* skbs linked in a frag_list, both with linear data, with head_frag=0 6870 + * (data allocated by kmalloc), both have tcp data of 1308 bytes 6871 + * (total payload is 2616 bytes). 6872 + * Data offset is 72 bytes (40 ipv6 hdr, 32 tcp hdr). Some headroom. 6873 + */ 6874 + for (i = 0; i < 2; i++) { 6875 + skb[i] = alloc_skb(alloc_size, GFP_KERNEL); 6876 + if (!skb[i]) { 6877 + if (i == 0) 6878 + goto err_skb0; 6879 + else 6880 + goto err_skb1; 6881 + } 6882 + 6883 + skb[i]->protocol = htons(ETH_P_IPV6); 6884 + skb_reserve(skb[i], headroom); 6885 + skb_put(skb[i], doffset + data_size); 6886 + skb_reset_network_header(skb[i]); 6887 + if (i == 0) 6888 + skb_reset_mac_header(skb[i]); 6889 + else 6890 + skb_set_mac_header(skb[i], -ETH_HLEN); 6891 + __skb_pull(skb[i], doffset); 6892 + } 6893 + 6894 + /* setup shinfo. 6895 + * mimic bpf_skb_proto_4_to_6, which resets gso_segs and assigns a 6896 + * reduced gso_size. 
6897 + */ 6898 + skb_shinfo(skb[0])->gso_size = 1288; 6899 + skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV6 | SKB_GSO_DODGY; 6900 + skb_shinfo(skb[0])->gso_segs = 0; 6901 + skb_shinfo(skb[0])->frag_list = skb[1]; 6902 + 6903 + /* adjust skb[0]'s len */ 6904 + skb[0]->len += skb[1]->len; 6905 + skb[0]->data_len += skb[1]->len; 6906 + skb[0]->truesize += skb[1]->truesize; 6907 + 6908 + return skb[0]; 6909 + 6910 + err_skb1: 6911 + kfree_skb(skb[0]); 6912 + err_skb0: 6913 + return NULL; 6914 + } 6915 + 6916 + struct skb_segment_test { 6917 + const char *descr; 6918 + struct sk_buff *(*build_skb)(void); 6864 6919 netdev_features_t features; 6920 + }; 6921 + 6922 + static struct skb_segment_test skb_segment_tests[] __initconst = { 6923 + { 6924 + .descr = "gso_with_rx_frags", 6925 + .build_skb = build_test_skb, 6926 + .features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM | 6927 + NETIF_F_IPV6_CSUM | NETIF_F_RXCSUM 6928 + }, 6929 + { 6930 + .descr = "gso_linear_no_head_frag", 6931 + .build_skb = build_test_skb_linear_no_head_frag, 6932 + .features = NETIF_F_SG | NETIF_F_FRAGLIST | 6933 + NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_GSO | 6934 + NETIF_F_LLTX_BIT | NETIF_F_GRO | 6935 + NETIF_F_IPV6_CSUM | NETIF_F_RXCSUM | 6936 + NETIF_F_HW_VLAN_STAG_TX_BIT 6937 + } 6938 + }; 6939 + 6940 + static __init int test_skb_segment_single(const struct skb_segment_test *test) 6941 + { 6865 6942 struct sk_buff *skb, *segs; 6866 6943 int ret = -1; 6867 6944 6868 - features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM | 6869 - NETIF_F_IPV6_CSUM; 6870 - features |= NETIF_F_RXCSUM; 6871 - skb = build_test_skb(); 6945 + skb = test->build_skb(); 6872 6946 if (!skb) { 6873 6947 pr_info("%s: failed to build_test_skb", __func__); 6874 6948 goto done; 6875 6949 } 6876 6950 6877 - segs = skb_segment(skb, features); 6951 + segs = skb_segment(skb, test->features); 6878 6952 if (!IS_ERR(segs)) { 6879 6953 kfree_skb_list(segs); 6880 6954 ret = 0; 6881 - pr_info("%s: success in skb_segment!", 
__func__); 6882 - } else { 6883 - pr_info("%s: failed in skb_segment!", __func__); 6884 6955 } 6885 6956 kfree_skb(skb); 6886 6957 done: 6887 6958 return ret; 6959 + } 6960 + 6961 + static __init int test_skb_segment(void) 6962 + { 6963 + int i, err_cnt = 0, pass_cnt = 0; 6964 + 6965 + for (i = 0; i < ARRAY_SIZE(skb_segment_tests); i++) { 6966 + const struct skb_segment_test *test = &skb_segment_tests[i]; 6967 + 6968 + pr_info("#%d %s ", i, test->descr); 6969 + 6970 + if (test_skb_segment_single(test)) { 6971 + pr_cont("FAIL\n"); 6972 + err_cnt++; 6973 + } else { 6974 + pr_cont("PASS\n"); 6975 + pass_cnt++; 6976 + } 6977 + } 6978 + 6979 + pr_info("%s: Summary: %d PASSED, %d FAILED\n", __func__, 6980 + pass_cnt, err_cnt); 6981 + return err_cnt ? -EINVAL : 0; 6888 6982 } 6889 6983 6890 6984 static __init int test_bpf(void)
+65 -5
mm/maccess.c
··· 18 18 return ret ? -EFAULT : 0; 19 19 } 20 20 21 + static __always_inline long 22 + probe_write_common(void __user *dst, const void *src, size_t size) 23 + { 24 + long ret; 25 + 26 + pagefault_disable(); 27 + ret = __copy_to_user_inatomic(dst, src, size); 28 + pagefault_enable(); 29 + 30 + return ret ? -EFAULT : 0; 31 + } 32 + 21 33 /** 22 34 * probe_kernel_read(): safely attempt to read from a kernel-space location 23 35 * @dst: pointer to the buffer that shall take the data ··· 43 31 * do_page_fault() doesn't attempt to take mmap_sem. This makes 44 32 * probe_kernel_read() suitable for use within regions where the caller 45 33 * already holds mmap_sem, or other locks which nest inside mmap_sem. 34 + * 35 + * probe_kernel_read_strict() is the same as probe_kernel_read() except for 36 + * the case where architectures have non-overlapping user and kernel address 37 + * ranges: probe_kernel_read_strict() will additionally return -EFAULT for 38 + * probing memory on a user address range where probe_user_read() is supposed 39 + * to be used instead. 46 40 */ 47 41 48 42 long __weak probe_kernel_read(void *dst, const void *src, size_t size) 43 + __attribute__((alias("__probe_kernel_read"))); 44 + 45 + long __weak probe_kernel_read_strict(void *dst, const void *src, size_t size) 49 46 __attribute__((alias("__probe_kernel_read"))); 50 47 51 48 long __probe_kernel_read(void *dst, const void *src, size_t size) ··· 106 85 * Safely write to address @dst from the buffer at @src. If a kernel fault 107 86 * happens, handle that and return -EFAULT. 
108 87 */ 88 + 109 89 long __weak probe_kernel_write(void *dst, const void *src, size_t size) 110 90 __attribute__((alias("__probe_kernel_write"))); 111 91 ··· 116 94 mm_segment_t old_fs = get_fs(); 117 95 118 96 set_fs(KERNEL_DS); 119 - pagefault_disable(); 120 - ret = __copy_to_user_inatomic((__force void __user *)dst, src, size); 121 - pagefault_enable(); 97 + ret = probe_write_common((__force void __user *)dst, src, size); 122 98 set_fs(old_fs); 123 99 124 - return ret ? -EFAULT : 0; 100 + return ret; 125 101 } 126 102 EXPORT_SYMBOL_GPL(probe_kernel_write); 127 103 104 + /** 105 + * probe_user_write(): safely attempt to write to a user-space location 106 + * @dst: address to write to 107 + * @src: pointer to the data that shall be written 108 + * @size: size of the data chunk 109 + * 110 + * Safely write to address @dst from the buffer at @src. If a kernel fault 111 + * happens, handle that and return -EFAULT. 112 + */ 113 + 114 + long __weak probe_user_write(void __user *dst, const void *src, size_t size) 115 + __attribute__((alias("__probe_user_write"))); 116 + 117 + long __probe_user_write(void __user *dst, const void *src, size_t size) 118 + { 119 + long ret = -EFAULT; 120 + mm_segment_t old_fs = get_fs(); 121 + 122 + set_fs(USER_DS); 123 + if (access_ok(dst, size)) 124 + ret = probe_write_common(dst, src, size); 125 + set_fs(old_fs); 126 + 127 + return ret; 128 + } 129 + EXPORT_SYMBOL_GPL(probe_user_write); 128 130 129 131 /** 130 132 * strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address. ··· 166 120 * 167 121 * If @count is smaller than the length of the string, copies @count-1 bytes, 168 122 * sets the last byte of @dst buffer to NUL and returns @count. 
123 + * 124 + * strncpy_from_unsafe_strict() is the same as strncpy_from_unsafe() except 125 + * for the case where architectures have non-overlapping user and kernel address 126 + * ranges: strncpy_from_unsafe_strict() will additionally return -EFAULT for 127 + * probing memory on a user address range where strncpy_from_unsafe_user() is 128 + * supposed to be used instead. 169 129 */ 170 - long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count) 130 + 131 + long __weak strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count) 132 + __attribute__((alias("__strncpy_from_unsafe"))); 133 + 134 + long __weak strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr, 135 + long count) 136 + __attribute__((alias("__strncpy_from_unsafe"))); 137 + 138 + long __strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count) 171 139 { 172 140 mm_segment_t old_fs = get_fs(); 173 141 const void *src = unsafe_addr;
+31 -2
net/xdp/xsk.c
··· 196 196 return false; 197 197 } 198 198 199 - int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) 199 + static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) 200 200 { 201 201 u32 len; 202 202 ··· 212 212 __xsk_rcv_zc(xs, xdp, len) : __xsk_rcv(xs, xdp, len); 213 213 } 214 214 215 - void xsk_flush(struct xdp_sock *xs) 215 + static void xsk_flush(struct xdp_sock *xs) 216 216 { 217 217 xskq_produce_flush_desc(xs->rx); 218 218 xs->sk.sk_data_ready(&xs->sk); ··· 262 262 out_unlock: 263 263 spin_unlock_bh(&xs->rx_lock); 264 264 return err; 265 + } 266 + 267 + int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp, 268 + struct xdp_sock *xs) 269 + { 270 + struct xsk_map *m = container_of(map, struct xsk_map, map); 271 + struct list_head *flush_list = this_cpu_ptr(m->flush_list); 272 + int err; 273 + 274 + err = xsk_rcv(xs, xdp); 275 + if (err) 276 + return err; 277 + 278 + if (!xs->flush_node.prev) 279 + list_add(&xs->flush_node, flush_list); 280 + 281 + return 0; 282 + } 283 + 284 + void __xsk_map_flush(struct bpf_map *map) 285 + { 286 + struct xsk_map *m = container_of(map, struct xsk_map, map); 287 + struct list_head *flush_list = this_cpu_ptr(m->flush_list); 288 + struct xdp_sock *xs, *tmp; 289 + 290 + list_for_each_entry_safe(xs, tmp, flush_list, flush_node) { 291 + xsk_flush(xs); 292 + __list_del_clearprev(&xs->flush_node); 293 + } 265 294 } 266 295 267 296 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+2 -2
samples/bpf/map_perf_test_kern.c
··· 181 181 if (addrlen != sizeof(*in6)) 182 182 return 0; 183 183 184 - ret = bpf_probe_read(test_params.dst6, sizeof(test_params.dst6), 185 - &in6->sin6_addr); 184 + ret = bpf_probe_read_user(test_params.dst6, sizeof(test_params.dst6), 185 + &in6->sin6_addr); 186 186 if (ret) 187 187 goto done; 188 188
+2 -2
samples/bpf/test_map_in_map_kern.c
··· 118 118 if (addrlen != sizeof(*in6)) 119 119 return 0; 120 120 121 - ret = bpf_probe_read(dst6, sizeof(dst6), &in6->sin6_addr); 121 + ret = bpf_probe_read_user(dst6, sizeof(dst6), &in6->sin6_addr); 122 122 if (ret) { 123 123 inline_ret = ret; 124 124 goto done; ··· 129 129 130 130 test_case = dst6[7]; 131 131 132 - ret = bpf_probe_read(&port, sizeof(port), &in6->sin6_port); 132 + ret = bpf_probe_read_user(&port, sizeof(port), &in6->sin6_port); 133 133 if (ret) { 134 134 inline_ret = ret; 135 135 goto done;
+1 -1
samples/bpf/test_probe_write_user_kern.c
··· 37 37 if (sockaddr_len > sizeof(orig_addr)) 38 38 return 0; 39 39 40 - if (bpf_probe_read(&orig_addr, sizeof(orig_addr), sockaddr_arg) != 0) 40 + if (bpf_probe_read_user(&orig_addr, sizeof(orig_addr), sockaddr_arg) != 0) 41 41 return 0; 42 42 43 43 mapped_addr = bpf_map_lookup_elem(&dnat_map, &orig_addr);
+84 -40
tools/include/uapi/linux/bpf.h
··· 173 173 BPF_PROG_TYPE_CGROUP_SYSCTL, 174 174 BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, 175 175 BPF_PROG_TYPE_CGROUP_SOCKOPT, 176 + BPF_PROG_TYPE_TRACING, 176 177 }; 177 178 178 179 enum bpf_attach_type { ··· 200 199 BPF_CGROUP_UDP6_RECVMSG, 201 200 BPF_CGROUP_GETSOCKOPT, 202 201 BPF_CGROUP_SETSOCKOPT, 202 + BPF_TRACE_RAW_TP, 203 203 __MAX_BPF_ATTACH_TYPE 204 204 }; 205 205 ··· 563 561 * Return 564 562 * 0 on success, or a negative error in case of failure. 565 563 * 566 - * int bpf_probe_read(void *dst, u32 size, const void *src) 564 + * int bpf_probe_read(void *dst, u32 size, const void *unsafe_ptr) 567 565 * Description 568 566 * For tracing programs, safely attempt to read *size* bytes from 569 - * address *src* and store the data in *dst*. 567 + * kernel space address *unsafe_ptr* and store the data in *dst*. 568 + * 569 + * Generally, use bpf_probe_read_user() or bpf_probe_read_kernel() 570 + * instead. 570 571 * Return 571 572 * 0 on success, or a negative error in case of failure. 572 573 * ··· 1431 1426 * Return 1432 1427 * 0 on success, or a negative error in case of failure. 1433 1428 * 1434 - * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr) 1429 + * int bpf_probe_read_str(void *dst, u32 size, const void *unsafe_ptr) 1435 1430 * Description 1436 - * Copy a NUL terminated string from an unsafe address 1437 - * *unsafe_ptr* to *dst*. The *size* should include the 1438 - * terminating NUL byte. In case the string length is smaller than 1439 - * *size*, the target is not padded with further NUL bytes. If the 1440 - * string length is larger than *size*, just *size*-1 bytes are 1441 - * copied and the last byte is set to NUL. 1431 + * Copy a NUL terminated string from an unsafe kernel address 1432 + * *unsafe_ptr* to *dst*. See bpf_probe_read_kernel_str() for 1433 + * more details. 1442 1434 * 1443 - * On success, the length of the copied string is returned. 
This 1444 - * makes this helper useful in tracing programs for reading 1445 - * strings, and more importantly to get its length at runtime. See 1446 - * the following snippet: 1447 - * 1448 - * :: 1449 - * 1450 - * SEC("kprobe/sys_open") 1451 - * void bpf_sys_open(struct pt_regs *ctx) 1452 - * { 1453 - * char buf[PATHLEN]; // PATHLEN is defined to 256 1454 - * int res = bpf_probe_read_str(buf, sizeof(buf), 1455 - * ctx->di); 1456 - * 1457 - * // Consume buf, for example push it to 1458 - * // userspace via bpf_perf_event_output(); we 1459 - * // can use res (the string length) as event 1460 - * // size, after checking its boundaries. 1461 - * } 1462 - * 1463 - * In comparison, using **bpf_probe_read()** helper here instead 1464 - * to read the string would require to estimate the length at 1465 - * compile time, and would often result in copying more memory 1466 - * than necessary. 1467 - * 1468 - * Another useful use case is when parsing individual process 1469 - * arguments or individual environment variables navigating 1470 - * *current*\ **->mm->arg_start** and *current*\ 1471 - * **->mm->env_start**: using this helper and the return value, 1472 - * one can quickly iterate at the right offset of the memory area. 1435 + * Generally, use bpf_probe_read_user_str() or bpf_probe_read_kernel_str() 1436 + * instead. 1473 1437 * Return 1474 1438 * On success, the strictly positive length of the string, 1475 1439 * including the trailing NUL character. On error, a negative ··· 2749 2775 * restricted to raw_tracepoint bpf programs. 2750 2776 * Return 2751 2777 * 0 on success, or a negative error in case of failure. 2778 + * 2779 + * int bpf_probe_read_user(void *dst, u32 size, const void *unsafe_ptr) 2780 + * Description 2781 + * Safely attempt to read *size* bytes from user space address 2782 + * *unsafe_ptr* and store the data in *dst*. 2783 + * Return 2784 + * 0 on success, or a negative error in case of failure. 
2785 + * 2786 + * int bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr) 2787 + * Description 2788 + * Safely attempt to read *size* bytes from kernel space address 2789 + * *unsafe_ptr* and store the data in *dst*. 2790 + * Return 2791 + * 0 on success, or a negative error in case of failure. 2792 + * 2793 + * int bpf_probe_read_user_str(void *dst, u32 size, const void *unsafe_ptr) 2794 + * Description 2795 + * Copy a NUL terminated string from an unsafe user address 2796 + * *unsafe_ptr* to *dst*. The *size* should include the 2797 + * terminating NUL byte. In case the string length is smaller than 2798 + * *size*, the target is not padded with further NUL bytes. If the 2799 + * string length is larger than *size*, just *size*-1 bytes are 2800 + * copied and the last byte is set to NUL. 2801 + * 2802 + * On success, the length of the copied string is returned. This 2803 + * makes this helper useful in tracing programs for reading 2804 + * strings, and more importantly to get its length at runtime. See 2805 + * the following snippet: 2806 + * 2807 + * :: 2808 + * 2809 + * SEC("kprobe/sys_open") 2810 + * void bpf_sys_open(struct pt_regs *ctx) 2811 + * { 2812 + * char buf[PATHLEN]; // PATHLEN is defined to 256 2813 + * int res = bpf_probe_read_user_str(buf, sizeof(buf), 2814 + * ctx->di); 2815 + * 2816 + * // Consume buf, for example push it to 2817 + * // userspace via bpf_perf_event_output(); we 2818 + * // can use res (the string length) as event 2819 + * // size, after checking its boundaries. 2820 + * } 2821 + * 2822 + * In comparison, using **bpf_probe_read_user()** helper here 2823 + * instead to read the string would require to estimate the length 2824 + * at compile time, and would often result in copying more memory 2825 + * than necessary. 
2826 + * 2827 + * Another useful use case is when parsing individual process 2828 + * arguments or individual environment variables navigating 2829 + * *current*\ **->mm->arg_start** and *current*\ 2830 + * **->mm->env_start**: using this helper and the return value, 2831 + * one can quickly iterate at the right offset of the memory area. 2832 + * Return 2833 + * On success, the strictly positive length of the string, 2834 + * including the trailing NUL character. On error, a negative 2835 + * value. 2836 + * 2837 + * int bpf_probe_read_kernel_str(void *dst, u32 size, const void *unsafe_ptr) 2838 + * Description 2839 + * Copy a NUL terminated string from an unsafe kernel address *unsafe_ptr* 2840 + * to *dst*. Same semantics as with bpf_probe_read_user_str() apply. 2841 + * Return 2842 + * On success, the strictly positive length of the string, including 2843 + * the trailing NUL character. On error, a negative value. 2752 2844 */ 2753 2845 #define __BPF_FUNC_MAPPER(FN) \ 2754 2846 FN(unspec), \ ··· 2928 2888 FN(sk_storage_delete), \ 2929 2889 FN(send_signal), \ 2930 2890 FN(tcp_gen_syncookie), \ 2931 - FN(skb_output), 2891 + FN(skb_output), \ 2892 + FN(probe_read_user), \ 2893 + FN(probe_read_kernel), \ 2894 + FN(probe_read_user_str), \ 2895 + FN(probe_read_kernel_str), 2932 2896 2933 2897 /* integer value in 'imm' field of BPF_CALL instruction selects which helper 2934 2898 * function eBPF program intends to call
+4 -4
tools/lib/bpf/bpf.c
··· 228 228 memset(&attr, 0, sizeof(attr)); 229 229 attr.prog_type = load_attr->prog_type; 230 230 attr.expected_attach_type = load_attr->expected_attach_type; 231 - if (attr.prog_type == BPF_PROG_TYPE_RAW_TRACEPOINT) 232 - /* expected_attach_type is ignored for tracing progs */ 233 - attr.attach_btf_id = attr.expected_attach_type; 231 + if (attr.prog_type == BPF_PROG_TYPE_TRACING) 232 + attr.attach_btf_id = load_attr->attach_btf_id; 233 + else 234 + attr.prog_ifindex = load_attr->prog_ifindex; 234 235 attr.insn_cnt = (__u32)load_attr->insns_cnt; 235 236 attr.insns = ptr_to_u64(load_attr->insns); 236 237 attr.license = ptr_to_u64(load_attr->license); ··· 246 245 } 247 246 248 247 attr.kern_version = load_attr->kern_version; 249 - attr.prog_ifindex = load_attr->prog_ifindex; 250 248 attr.prog_btf_fd = load_attr->prog_btf_fd; 251 249 attr.func_info_rec_size = load_attr->func_info_rec_size; 252 250 attr.func_info_cnt = load_attr->func_info_cnt;
+4 -1
tools/lib/bpf/bpf.h
··· 78 78 size_t insns_cnt; 79 79 const char *license; 80 80 __u32 kern_version; 81 - __u32 prog_ifindex; 81 + union { 82 + __u32 prog_ifindex; 83 + __u32 attach_btf_id; 84 + }; 82 85 __u32 prog_btf_fd; 83 86 __u32 func_info_rec_size; 84 87 const void *func_info;
+6
tools/lib/bpf/bpf_helpers.h
··· 38 38 unsigned int map_flags; 39 39 }; 40 40 41 + enum libbpf_pin_type { 42 + LIBBPF_PIN_NONE, 43 + /* PIN_BY_NAME: pin maps by name (in /sys/fs/bpf by default) */ 44 + LIBBPF_PIN_BY_NAME, 45 + }; 46 + 41 47 #endif
+362 -104
tools/lib/bpf/libbpf.c
··· 188 188 bpf_program_clear_priv_t clear_priv; 189 189 190 190 enum bpf_attach_type expected_attach_type; 191 + __u32 attach_btf_id; 191 192 void *func_info; 192 193 __u32 func_info_rec_size; 193 194 __u32 func_info_cnt; ··· 227 226 void *priv; 228 227 bpf_map_clear_priv_t clear_priv; 229 228 enum libbpf_map_type libbpf_type; 229 + char *pin_path; 230 + bool pinned; 230 231 }; 231 232 232 233 struct bpf_secdata { ··· 1093 1090 return true; 1094 1091 } 1095 1092 1093 + static int build_map_pin_path(struct bpf_map *map, const char *path) 1094 + { 1095 + char buf[PATH_MAX]; 1096 + int err, len; 1097 + 1098 + if (!path) 1099 + path = "/sys/fs/bpf"; 1100 + 1101 + len = snprintf(buf, PATH_MAX, "%s/%s", path, bpf_map__name(map)); 1102 + if (len < 0) 1103 + return -EINVAL; 1104 + else if (len >= PATH_MAX) 1105 + return -ENAMETOOLONG; 1106 + 1107 + err = bpf_map__set_pin_path(map, buf); 1108 + if (err) 1109 + return err; 1110 + 1111 + return 0; 1112 + } 1113 + 1096 1114 static int bpf_object__init_user_btf_map(struct bpf_object *obj, 1097 1115 const struct btf_type *sec, 1098 1116 int var_idx, int sec_idx, 1099 - const Elf_Data *data, bool strict) 1117 + const Elf_Data *data, bool strict, 1118 + const char *pin_root_path) 1100 1119 { 1101 1120 const struct btf_type *var, *def, *t; 1102 1121 const struct btf_var_secinfo *vi; ··· 1293 1268 } 1294 1269 map->def.value_size = sz; 1295 1270 map->btf_value_type_id = t->type; 1271 + } else if (strcmp(name, "pinning") == 0) { 1272 + __u32 val; 1273 + int err; 1274 + 1275 + if (!get_map_field_int(map_name, obj->btf, def, m, 1276 + &val)) 1277 + return -EINVAL; 1278 + pr_debug("map '%s': found pinning = %u.\n", 1279 + map_name, val); 1280 + 1281 + if (val != LIBBPF_PIN_NONE && 1282 + val != LIBBPF_PIN_BY_NAME) { 1283 + pr_warn("map '%s': invalid pinning value %u.\n", 1284 + map_name, val); 1285 + return -EINVAL; 1286 + } 1287 + if (val == LIBBPF_PIN_BY_NAME) { 1288 + err = build_map_pin_path(map, pin_root_path); 1289 + if (err) { 
1290 + pr_warn("map '%s': couldn't build pin path.\n", 1291 + map_name); 1292 + return err; 1293 + } 1294 + } 1296 1295 } else { 1297 1296 if (strict) { 1298 1297 pr_warn("map '%s': unknown field '%s'.\n", ··· 1336 1287 return 0; 1337 1288 } 1338 1289 1339 - static int bpf_object__init_user_btf_maps(struct bpf_object *obj, bool strict) 1290 + static int bpf_object__init_user_btf_maps(struct bpf_object *obj, bool strict, 1291 + const char *pin_root_path) 1340 1292 { 1341 1293 const struct btf_type *sec = NULL; 1342 1294 int nr_types, i, vlen, err; ··· 1379 1329 for (i = 0; i < vlen; i++) { 1380 1330 err = bpf_object__init_user_btf_map(obj, sec, i, 1381 1331 obj->efile.btf_maps_shndx, 1382 - data, strict); 1332 + data, strict, pin_root_path); 1383 1333 if (err) 1384 1334 return err; 1385 1335 } ··· 1387 1337 return 0; 1388 1338 } 1389 1339 1390 - static int bpf_object__init_maps(struct bpf_object *obj, bool relaxed_maps) 1340 + static int bpf_object__init_maps(struct bpf_object *obj, bool relaxed_maps, 1341 + const char *pin_root_path) 1391 1342 { 1392 1343 bool strict = !relaxed_maps; 1393 1344 int err; ··· 1397 1346 if (err) 1398 1347 return err; 1399 1348 1400 - err = bpf_object__init_user_btf_maps(obj, strict); 1349 + err = bpf_object__init_user_btf_maps(obj, strict, pin_root_path); 1401 1350 if (err) 1402 1351 return err; 1403 1352 ··· 1586 1535 return 0; 1587 1536 } 1588 1537 1589 - static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps) 1538 + static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps, 1539 + const char *pin_root_path) 1590 1540 { 1591 1541 Elf *elf = obj->efile.elf; 1592 1542 GElf_Ehdr *ep = &obj->efile.ehdr; ··· 1716 1664 } 1717 1665 } 1718 1666 1719 - if (!obj->efile.strtabidx || obj->efile.strtabidx >= idx) { 1667 + if (!obj->efile.strtabidx || obj->efile.strtabidx > idx) { 1720 1668 pr_warn("Corrupted ELF file: index of strtab invalid\n"); 1721 1669 return -LIBBPF_ERRNO__FORMAT; 1722 1670 } 1723 
1671 err = bpf_object__init_btf(obj, btf_data, btf_ext_data); 1724 1672 if (!err) 1725 - err = bpf_object__init_maps(obj, relaxed_maps); 1673 + err = bpf_object__init_maps(obj, relaxed_maps, pin_root_path); 1726 1674 if (!err) 1727 1675 err = bpf_object__sanitize_and_load_btf(obj); 1728 1676 if (!err) ··· 1968 1916 return -errno; 1969 1917 1970 1918 new_fd = open("/", O_RDONLY | O_CLOEXEC); 1971 - if (new_fd < 0) 1919 + if (new_fd < 0) { 1920 + err = -errno; 1972 1921 goto err_free_new_name; 1922 + } 1973 1923 1974 1924 new_fd = dup3(fd, new_fd, O_CLOEXEC); 1975 - if (new_fd < 0) 1925 + if (new_fd < 0) { 1926 + err = -errno; 1976 1927 goto err_close_new_fd; 1928 + } 1977 1929 1978 1930 err = zclose(map->fd); 1979 - if (err) 1931 + if (err) { 1932 + err = -errno; 1980 1933 goto err_close_new_fd; 1934 + } 1981 1935 free(map->name); 1982 1936 1983 1937 map->fd = new_fd; ··· 2002 1944 close(new_fd); 2003 1945 err_free_new_name: 2004 1946 free(new_name); 2005 - return -errno; 1947 + return err; 2006 1948 } 2007 1949 2008 1950 int bpf_map__resize(struct bpf_map *map, __u32 max_entries) ··· 2178 2120 return 0; 2179 2121 } 2180 2122 2123 + static bool map_is_reuse_compat(const struct bpf_map *map, int map_fd) 2124 + { 2125 + struct bpf_map_info map_info = {}; 2126 + char msg[STRERR_BUFSIZE]; 2127 + __u32 map_info_len; 2128 + 2129 + map_info_len = sizeof(map_info); 2130 + 2131 + if (bpf_obj_get_info_by_fd(map_fd, &map_info, &map_info_len)) { 2132 + pr_warn("failed to get map info for map FD %d: %s\n", 2133 + map_fd, libbpf_strerror_r(errno, msg, sizeof(msg))); 2134 + return false; 2135 + } 2136 + 2137 + return (map_info.type == map->def.type && 2138 + map_info.key_size == map->def.key_size && 2139 + map_info.value_size == map->def.value_size && 2140 + map_info.max_entries == map->def.max_entries && 2141 + map_info.map_flags == map->def.map_flags); 2142 + } 2143 + 2144 + static int 2145 + bpf_object__reuse_map(struct bpf_map *map) 2146 + { 2147 + char *cp, 
errmsg[STRERR_BUFSIZE]; 2148 + int err, pin_fd; 2149 + 2150 + pin_fd = bpf_obj_get(map->pin_path); 2151 + if (pin_fd < 0) { 2152 + err = -errno; 2153 + if (err == -ENOENT) { 2154 + pr_debug("found no pinned map to reuse at '%s'\n", 2155 + map->pin_path); 2156 + return 0; 2157 + } 2158 + 2159 + cp = libbpf_strerror_r(-err, errmsg, sizeof(errmsg)); 2160 + pr_warn("couldn't retrieve pinned map '%s': %s\n", 2161 + map->pin_path, cp); 2162 + return err; 2163 + } 2164 + 2165 + if (!map_is_reuse_compat(map, pin_fd)) { 2166 + pr_warn("couldn't reuse pinned map at '%s': parameter mismatch\n", 2167 + map->pin_path); 2168 + close(pin_fd); 2169 + return -EINVAL; 2170 + } 2171 + 2172 + err = bpf_map__reuse_fd(map, pin_fd); 2173 + if (err) { 2174 + close(pin_fd); 2175 + return err; 2176 + } 2177 + map->pinned = true; 2178 + pr_debug("reused pinned map at '%s'\n", map->pin_path); 2179 + 2180 + return 0; 2181 + } 2182 + 2181 2183 static int 2182 2184 bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map) 2183 2185 { ··· 2279 2161 struct bpf_map_def *def = &map->def; 2280 2162 char *cp, errmsg[STRERR_BUFSIZE]; 2281 2163 int *pfd = &map->fd; 2164 + 2165 + if (map->pin_path) { 2166 + err = bpf_object__reuse_map(map); 2167 + if (err) { 2168 + pr_warn("error reusing pinned map %s\n", 2169 + map->name); 2170 + return err; 2171 + } 2172 + } 2282 2173 2283 2174 if (map->fd >= 0) { 2284 2175 pr_debug("skip map create (preset) %s: fd=%d\n", ··· 2364 2237 if (err < 0) { 2365 2238 zclose(*pfd); 2366 2239 goto err_out; 2240 + } 2241 + } 2242 + 2243 + if (map->pin_path && !map->pinned) { 2244 + err = bpf_map__pin(map, NULL); 2245 + if (err) { 2246 + pr_warn("failed to auto-pin map name '%s' at '%s'\n", 2247 + map->name, map->pin_path); 2248 + return err; 2367 2249 } 2368 2250 } 2369 2251 ··· 3582 3446 load_attr.line_info_cnt = prog->line_info_cnt; 3583 3447 load_attr.log_level = prog->log_level; 3584 3448 load_attr.prog_flags = prog->prog_flags; 3449 + 
load_attr.attach_btf_id = prog->attach_btf_id; 3585 3450 3586 3451 retry_load: 3587 3452 log_buf = malloc(log_buf_size); ··· 3744 3607 return 0; 3745 3608 } 3746 3609 3610 + static int libbpf_attach_btf_id_by_name(const char *name, __u32 *btf_id); 3611 + 3747 3612 static struct bpf_object * 3748 3613 __bpf_object__open(const char *path, const void *obj_buf, size_t obj_buf_sz, 3749 3614 struct bpf_object_open_opts *opts) 3750 3615 { 3616 + const char *pin_root_path; 3751 3617 struct bpf_program *prog; 3752 3618 struct bpf_object *obj; 3753 3619 const char *obj_name; ··· 3785 3645 3786 3646 obj->relaxed_core_relocs = OPTS_GET(opts, relaxed_core_relocs, false); 3787 3647 relaxed_maps = OPTS_GET(opts, relaxed_maps, false); 3648 + pin_root_path = OPTS_GET(opts, pin_root_path, NULL); 3788 3649 3789 3650 CHECK_ERR(bpf_object__elf_init(obj), err, out); 3790 3651 CHECK_ERR(bpf_object__check_endianness(obj), err, out); 3791 3652 CHECK_ERR(bpf_object__probe_caps(obj), err, out); 3792 - CHECK_ERR(bpf_object__elf_collect(obj, relaxed_maps), err, out); 3653 + CHECK_ERR(bpf_object__elf_collect(obj, relaxed_maps, pin_root_path), 3654 + err, out); 3793 3655 CHECK_ERR(bpf_object__collect_reloc(obj), err, out); 3794 3656 bpf_object__elf_finish(obj); 3795 3657 3796 3658 bpf_object__for_each_program(prog, obj) { 3797 3659 enum bpf_prog_type prog_type; 3798 3660 enum bpf_attach_type attach_type; 3661 + __u32 btf_id; 3799 3662 3800 3663 err = libbpf_prog_type_by_name(prog->section_name, &prog_type, 3801 3664 &attach_type); ··· 3810 3667 3811 3668 bpf_program__set_type(prog, prog_type); 3812 3669 bpf_program__set_expected_attach_type(prog, attach_type); 3670 + if (prog_type == BPF_PROG_TYPE_TRACING) { 3671 + err = libbpf_attach_btf_id_by_name(prog->section_name, &btf_id); 3672 + if (err) 3673 + goto out; 3674 + prog->attach_btf_id = btf_id; 3675 + } 3813 3676 } 3814 3677 3815 3678 return obj; ··· 3946 3797 return bpf_object__load_xattr(&attr); 3947 3798 } 3948 3799 3800 + static int 
make_parent_dir(const char *path) 3801 + { 3802 + char *cp, errmsg[STRERR_BUFSIZE]; 3803 + char *dname, *dir; 3804 + int err = 0; 3805 + 3806 + dname = strdup(path); 3807 + if (dname == NULL) 3808 + return -ENOMEM; 3809 + 3810 + dir = dirname(dname); 3811 + if (mkdir(dir, 0700) && errno != EEXIST) 3812 + err = -errno; 3813 + 3814 + free(dname); 3815 + if (err) { 3816 + cp = libbpf_strerror_r(-err, errmsg, sizeof(errmsg)); 3817 + pr_warn("failed to mkdir %s: %s\n", path, cp); 3818 + } 3819 + return err; 3820 + } 3821 + 3949 3822 static int check_path(const char *path) 3950 3823 { 3951 3824 char *cp, errmsg[STRERR_BUFSIZE]; ··· 4003 3832 { 4004 3833 char *cp, errmsg[STRERR_BUFSIZE]; 4005 3834 int err; 3835 + 3836 + err = make_parent_dir(path); 3837 + if (err) 3838 + return err; 4006 3839 4007 3840 err = check_path(path); 4008 3841 if (err) ··· 4061 3886 return 0; 4062 3887 } 4063 3888 4064 - static int make_dir(const char *path) 4065 - { 4066 - char *cp, errmsg[STRERR_BUFSIZE]; 4067 - int err = 0; 4068 - 4069 - if (mkdir(path, 0700) && errno != EEXIST) 4070 - err = -errno; 4071 - 4072 - if (err) { 4073 - cp = libbpf_strerror_r(-err, errmsg, sizeof(errmsg)); 4074 - pr_warn("failed to mkdir %s: %s\n", path, cp); 4075 - } 4076 - return err; 4077 - } 4078 - 4079 3889 int bpf_program__pin(struct bpf_program *prog, const char *path) 4080 3890 { 4081 3891 int i, err; 3892 + 3893 + err = make_parent_dir(path); 3894 + if (err) 3895 + return err; 4082 3896 4083 3897 err = check_path(path); 4084 3898 if (err) ··· 4088 3924 /* don't create subdirs when pinning single instance */ 4089 3925 return bpf_program__pin_instance(prog, path, 0); 4090 3926 } 4091 - 4092 - err = make_dir(path); 4093 - if (err) 4094 - return err; 4095 3927 4096 3928 for (i = 0; i < prog->instances.nr; i++) { 4097 3929 char buf[PATH_MAX]; ··· 4179 4019 char *cp, errmsg[STRERR_BUFSIZE]; 4180 4020 int err; 4181 4021 4182 - err = check_path(path); 4183 - if (err) 4184 - return err; 4185 - 4186 4022 if (map == 
NULL) { 4187 4023 pr_warn("invalid map pointer\n"); 4188 4024 return -EINVAL; 4189 4025 } 4190 4026 4191 - if (bpf_obj_pin(map->fd, path)) { 4192 - cp = libbpf_strerror_r(errno, errmsg, sizeof(errmsg)); 4193 - pr_warn("failed to pin map: %s\n", cp); 4194 - return -errno; 4027 + if (map->pin_path) { 4028 + if (path && strcmp(path, map->pin_path)) { 4029 + pr_warn("map '%s' already has pin path '%s' different from '%s'\n", 4030 + bpf_map__name(map), map->pin_path, path); 4031 + return -EINVAL; 4032 + } else if (map->pinned) { 4033 + pr_debug("map '%s' already pinned at '%s'; not re-pinning\n", 4034 + bpf_map__name(map), map->pin_path); 4035 + return 0; 4036 + } 4037 + } else { 4038 + if (!path) { 4039 + pr_warn("missing a path to pin map '%s' at\n", 4040 + bpf_map__name(map)); 4041 + return -EINVAL; 4042 + } else if (map->pinned) { 4043 + pr_warn("map '%s' already pinned\n", bpf_map__name(map)); 4044 + return -EEXIST; 4045 + } 4046 + 4047 + map->pin_path = strdup(path); 4048 + if (!map->pin_path) { 4049 + err = -errno; 4050 + goto out_err; 4051 + } 4195 4052 } 4196 4053 4197 - pr_debug("pinned map '%s'\n", path); 4054 + err = make_parent_dir(map->pin_path); 4055 + if (err) 4056 + return err; 4057 + 4058 + err = check_path(map->pin_path); 4059 + if (err) 4060 + return err; 4061 + 4062 + if (bpf_obj_pin(map->fd, map->pin_path)) { 4063 + err = -errno; 4064 + goto out_err; 4065 + } 4066 + 4067 + map->pinned = true; 4068 + pr_debug("pinned map '%s'\n", map->pin_path); 4198 4069 4199 4070 return 0; 4071 + 4072 + out_err: 4073 + cp = libbpf_strerror_r(-err, errmsg, sizeof(errmsg)); 4074 + pr_warn("failed to pin map: %s\n", cp); 4075 + return err; 4200 4076 } 4201 4077 4202 4078 int bpf_map__unpin(struct bpf_map *map, const char *path) 4203 4079 { 4204 4080 int err; 4205 4081 4206 - err = check_path(path); 4207 - if (err) 4208 - return err; 4209 - 4210 4082 if (map == NULL) { 4211 4083 pr_warn("invalid map pointer\n"); 4212 4084 return -EINVAL; 4213 4085 } 4214 4086 4087 + 
if (map->pin_path) { 4088 + if (path && strcmp(path, map->pin_path)) { 4089 + pr_warn("map '%s' already has pin path '%s' different from '%s'\n", 4090 + bpf_map__name(map), map->pin_path, path); 4091 + return -EINVAL; 4092 + } 4093 + path = map->pin_path; 4094 + } else if (!path) { 4095 + pr_warn("no path to unpin map '%s' from\n", 4096 + bpf_map__name(map)); 4097 + return -EINVAL; 4098 + } 4099 + 4100 + err = check_path(path); 4101 + if (err) 4102 + return err; 4103 + 4215 4104 err = unlink(path); 4216 4105 if (err != 0) 4217 4106 return -errno; 4218 - pr_debug("unpinned map '%s'\n", path); 4107 + 4108 + map->pinned = false; 4109 + pr_debug("unpinned map '%s' from '%s'\n", bpf_map__name(map), path); 4219 4110 4220 4111 return 0; 4112 + } 4113 + 4114 + int bpf_map__set_pin_path(struct bpf_map *map, const char *path) 4115 + { 4116 + char *new = NULL; 4117 + 4118 + if (path) { 4119 + new = strdup(path); 4120 + if (!new) 4121 + return -errno; 4122 + } 4123 + 4124 + free(map->pin_path); 4125 + map->pin_path = new; 4126 + return 0; 4127 + } 4128 + 4129 + const char *bpf_map__get_pin_path(const struct bpf_map *map) 4130 + { 4131 + return map->pin_path; 4132 + } 4133 + 4134 + bool bpf_map__is_pinned(const struct bpf_map *map) 4135 + { 4136 + return map->pinned; 4221 4137 } 4222 4138 4223 4139 int bpf_object__pin_maps(struct bpf_object *obj, const char *path) ··· 4309 4073 return -ENOENT; 4310 4074 } 4311 4075 4312 - err = make_dir(path); 4313 - if (err) 4314 - return err; 4315 - 4316 4076 bpf_object__for_each_map(map, obj) { 4077 + char *pin_path = NULL; 4317 4078 char buf[PATH_MAX]; 4318 - int len; 4319 4079 4320 - len = snprintf(buf, PATH_MAX, "%s/%s", path, 4321 - bpf_map__name(map)); 4322 - if (len < 0) { 4323 - err = -EINVAL; 4324 - goto err_unpin_maps; 4325 - } else if (len >= PATH_MAX) { 4326 - err = -ENAMETOOLONG; 4327 - goto err_unpin_maps; 4080 + if (path) { 4081 + int len; 4082 + 4083 + len = snprintf(buf, PATH_MAX, "%s/%s", path, 4084 + bpf_map__name(map)); 
4085 + if (len < 0) { 4086 + err = -EINVAL; 4087 + goto err_unpin_maps; 4088 + } else if (len >= PATH_MAX) { 4089 + err = -ENAMETOOLONG; 4090 + goto err_unpin_maps; 4091 + } 4092 + pin_path = buf; 4093 + } else if (!map->pin_path) { 4094 + continue; 4328 4095 } 4329 4096 4330 - err = bpf_map__pin(map, buf); 4097 + err = bpf_map__pin(map, pin_path); 4331 4098 if (err) 4332 4099 goto err_unpin_maps; 4333 4100 } ··· 4339 4100 4340 4101 err_unpin_maps: 4341 4102 while ((map = bpf_map__prev(map, obj))) { 4342 - char buf[PATH_MAX]; 4343 - int len; 4344 - 4345 - len = snprintf(buf, PATH_MAX, "%s/%s", path, 4346 - bpf_map__name(map)); 4347 - if (len < 0) 4348 - continue; 4349 - else if (len >= PATH_MAX) 4103 + if (!map->pin_path) 4350 4104 continue; 4351 4105 4352 - bpf_map__unpin(map, buf); 4106 + bpf_map__unpin(map, NULL); 4353 4107 } 4354 4108 4355 4109 return err; ··· 4357 4125 return -ENOENT; 4358 4126 4359 4127 bpf_object__for_each_map(map, obj) { 4128 + char *pin_path = NULL; 4360 4129 char buf[PATH_MAX]; 4361 - int len; 4362 4130 4363 - len = snprintf(buf, PATH_MAX, "%s/%s", path, 4364 - bpf_map__name(map)); 4365 - if (len < 0) 4366 - return -EINVAL; 4367 - else if (len >= PATH_MAX) 4368 - return -ENAMETOOLONG; 4131 + if (path) { 4132 + int len; 4369 4133 4370 - err = bpf_map__unpin(map, buf); 4134 + len = snprintf(buf, PATH_MAX, "%s/%s", path, 4135 + bpf_map__name(map)); 4136 + if (len < 0) 4137 + return -EINVAL; 4138 + else if (len >= PATH_MAX) 4139 + return -ENAMETOOLONG; 4140 + pin_path = buf; 4141 + } else if (!map->pin_path) { 4142 + continue; 4143 + } 4144 + 4145 + err = bpf_map__unpin(map, pin_path); 4371 4146 if (err) 4372 4147 return err; 4373 4148 } ··· 4394 4155 pr_warn("object not yet loaded; load it first\n"); 4395 4156 return -ENOENT; 4396 4157 } 4397 - 4398 - err = make_dir(path); 4399 - if (err) 4400 - return err; 4401 4158 4402 4159 bpf_object__for_each_program(prog, obj) { 4403 4160 char buf[PATH_MAX]; ··· 4495 4260 4496 4261 for (i = 0; i < 
obj->nr_maps; i++) { 4497 4262 zfree(&obj->maps[i].name); 4263 + zfree(&obj->maps[i].pin_path); 4498 4264 if (obj->maps[i].clear_priv) 4499 4265 obj->maps[i].clear_priv(&obj->maps[i], 4500 4266 obj->maps[i].priv); ··· 4754 4518 BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT); 4755 4519 BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP); 4756 4520 BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT); 4521 + BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING); 4757 4522 4758 4523 enum bpf_attach_type 4759 4524 bpf_program__get_expected_attach_type(struct bpf_program *prog) ··· 4783 4546 BPF_PROG_SEC_IMPL(string, ptype, eatype, 1, 0, eatype) 4784 4547 4785 4548 /* Programs that use BTF to identify attach point */ 4786 - #define BPF_PROG_BTF(string, ptype) BPF_PROG_SEC_IMPL(string, ptype, 0, 0, 1, 0) 4549 + #define BPF_PROG_BTF(string, ptype, eatype) \ 4550 + BPF_PROG_SEC_IMPL(string, ptype, eatype, 0, 1, 0) 4787 4551 4788 4552 /* Programs that can be attached but attach type can't be identified by section 4789 4553 * name. Kept for backward compatibility. 
··· 4811 4573 BPF_PROG_SEC("tp/", BPF_PROG_TYPE_TRACEPOINT), 4812 4574 BPF_PROG_SEC("raw_tracepoint/", BPF_PROG_TYPE_RAW_TRACEPOINT), 4813 4575 BPF_PROG_SEC("raw_tp/", BPF_PROG_TYPE_RAW_TRACEPOINT), 4814 - BPF_PROG_BTF("tp_btf/", BPF_PROG_TYPE_RAW_TRACEPOINT), 4576 + BPF_PROG_BTF("tp_btf/", BPF_PROG_TYPE_TRACING, 4577 + BPF_TRACE_RAW_TP), 4815 4578 BPF_PROG_SEC("xdp", BPF_PROG_TYPE_XDP), 4816 4579 BPF_PROG_SEC("perf_event", BPF_PROG_TYPE_PERF_EVENT), 4817 4580 BPF_PROG_SEC("lwt_in", BPF_PROG_TYPE_LWT_IN), ··· 4917 4678 continue; 4918 4679 *prog_type = section_names[i].prog_type; 4919 4680 *expected_attach_type = section_names[i].expected_attach_type; 4920 - if (section_names[i].is_attach_btf) { 4921 - struct btf *btf = bpf_core_find_kernel_btf(); 4922 - char raw_tp_btf_name[128] = "btf_trace_"; 4923 - char *dst = raw_tp_btf_name + sizeof("btf_trace_") - 1; 4924 - int ret; 4925 - 4926 - if (IS_ERR(btf)) { 4927 - pr_warn("vmlinux BTF is not found\n"); 4928 - return -EINVAL; 4929 - } 4930 - /* prepend "btf_trace_" prefix per kernel convention */ 4931 - strncat(dst, name + section_names[i].len, 4932 - sizeof(raw_tp_btf_name) - sizeof("btf_trace_")); 4933 - ret = btf__find_by_name(btf, raw_tp_btf_name); 4934 - btf__free(btf); 4935 - if (ret <= 0) { 4936 - pr_warn("%s is not found in vmlinux BTF\n", dst); 4937 - return -EINVAL; 4938 - } 4939 - *expected_attach_type = ret; 4940 - } 4941 4681 return 0; 4942 4682 } 4943 4683 pr_warn("failed to guess program type based on ELF section name '%s'\n", name); ··· 4927 4709 } 4928 4710 4929 4711 return -ESRCH; 4712 + } 4713 + 4714 + #define BTF_PREFIX "btf_trace_" 4715 + static int libbpf_attach_btf_id_by_name(const char *name, __u32 *btf_id) 4716 + { 4717 + struct btf *btf = bpf_core_find_kernel_btf(); 4718 + char raw_tp_btf_name[128] = BTF_PREFIX; 4719 + char *dst = raw_tp_btf_name + sizeof(BTF_PREFIX) - 1; 4720 + int ret, i, err = -EINVAL; 4721 + 4722 + if (IS_ERR(btf)) { 4723 + pr_warn("vmlinux BTF is not found\n"); 4724 + 
return -EINVAL; 4725 + } 4726 + 4727 + if (!name) 4728 + goto out; 4729 + 4730 + for (i = 0; i < ARRAY_SIZE(section_names); i++) { 4731 + if (!section_names[i].is_attach_btf) 4732 + continue; 4733 + if (strncmp(name, section_names[i].sec, section_names[i].len)) 4734 + continue; 4735 + /* prepend "btf_trace_" prefix per kernel convention */ 4736 + strncat(dst, name + section_names[i].len, 4737 + sizeof(raw_tp_btf_name) - sizeof(BTF_PREFIX)); 4738 + ret = btf__find_by_name(btf, raw_tp_btf_name); 4739 + if (ret <= 0) { 4740 + pr_warn("%s is not found in vmlinux BTF\n", dst); 4741 + goto out; 4742 + } 4743 + *btf_id = ret; 4744 + err = 0; 4745 + goto out; 4746 + } 4747 + pr_warn("failed to identify btf_id based on ELF section name '%s'\n", name); 4748 + err = -ESRCH; 4749 + out: 4750 + btf__free(btf); 4751 + return err; 4930 4752 } 4931 4753 4932 4754 int libbpf_attach_type_by_name(const char *name,
+22 -1
tools/lib/bpf/libbpf.h
··· 103 103 bool relaxed_maps; 104 104 /* process CO-RE relocations non-strictly, allowing them to fail */ 105 105 bool relaxed_core_relocs; 106 + /* maps that set the 'pinning' attribute in their definition will have 107 + * their pin_path attribute set to a file in this directory, and be 108 + * auto-pinned to that path on load; defaults to "/sys/fs/bpf". 109 + */ 110 + const char *pin_root_path; 106 111 }; 107 - #define bpf_object_open_opts__last_field relaxed_core_relocs 112 + #define bpf_object_open_opts__last_field pin_root_path 108 113 109 114 LIBBPF_API struct bpf_object *bpf_object__open(const char *path); 110 115 LIBBPF_API struct bpf_object * ··· 129 124 __u32 *size); 130 125 int bpf_object__variable_offset(const struct bpf_object *obj, const char *name, 131 126 __u32 *off); 127 + 128 + enum libbpf_pin_type { 129 + LIBBPF_PIN_NONE, 130 + /* PIN_BY_NAME: pin maps by name (in /sys/fs/bpf by default) */ 131 + LIBBPF_PIN_BY_NAME, 132 + }; 133 + 134 + /* pin_maps and unpin_maps can both be called with a NULL path, in which case 135 + * they will use the pin_path attribute of each map (and ignore all maps that 136 + * don't have a pin_path set). 
137 + */ 132 138 LIBBPF_API int bpf_object__pin_maps(struct bpf_object *obj, const char *path); 133 139 LIBBPF_API int bpf_object__unpin_maps(struct bpf_object *obj, 134 140 const char *path); ··· 323 307 LIBBPF_API int bpf_program__set_sched_act(struct bpf_program *prog); 324 308 LIBBPF_API int bpf_program__set_xdp(struct bpf_program *prog); 325 309 LIBBPF_API int bpf_program__set_perf_event(struct bpf_program *prog); 310 + LIBBPF_API int bpf_program__set_tracing(struct bpf_program *prog); 326 311 327 312 LIBBPF_API enum bpf_prog_type bpf_program__get_type(struct bpf_program *prog); 328 313 LIBBPF_API void bpf_program__set_type(struct bpf_program *prog, ··· 343 326 LIBBPF_API bool bpf_program__is_sched_act(const struct bpf_program *prog); 344 327 LIBBPF_API bool bpf_program__is_xdp(const struct bpf_program *prog); 345 328 LIBBPF_API bool bpf_program__is_perf_event(const struct bpf_program *prog); 329 + LIBBPF_API bool bpf_program__is_tracing(const struct bpf_program *prog); 346 330 347 331 /* 348 332 * No need for __attribute__((packed)), all members of 'bpf_map_def' ··· 403 385 LIBBPF_API bool bpf_map__is_offload_neutral(const struct bpf_map *map); 404 386 LIBBPF_API bool bpf_map__is_internal(const struct bpf_map *map); 405 387 LIBBPF_API void bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex); 388 + LIBBPF_API int bpf_map__set_pin_path(struct bpf_map *map, const char *path); 389 + LIBBPF_API const char *bpf_map__get_pin_path(const struct bpf_map *map); 390 + LIBBPF_API bool bpf_map__is_pinned(const struct bpf_map *map); 406 391 LIBBPF_API int bpf_map__pin(struct bpf_map *map, const char *path); 407 392 LIBBPF_API int bpf_map__unpin(struct bpf_map *map, const char *path); 408 393
+5
tools/lib/bpf/libbpf.map
··· 193 193 194 194 LIBBPF_0.0.6 { 195 195 global: 196 + bpf_map__get_pin_path; 197 + bpf_map__is_pinned; 198 + bpf_map__set_pin_path; 196 199 bpf_object__open_file; 197 200 bpf_object__open_mem; 198 201 bpf_program__get_expected_attach_type; 199 202 bpf_program__get_type; 203 + bpf_program__is_tracing; 204 + bpf_program__set_tracing; 200 205 } LIBBPF_0.0.5;
+1
tools/lib/bpf/libbpf_probes.c
··· 102 102 case BPF_PROG_TYPE_FLOW_DISSECTOR: 103 103 case BPF_PROG_TYPE_CGROUP_SYSCTL: 104 104 case BPF_PROG_TYPE_CGROUP_SOCKOPT: 105 + case BPF_PROG_TYPE_TRACING: 105 106 default: 106 107 break; 107 108 }
+71 -12
tools/lib/bpf/xsk.c
··· 73 73 int fd; 74 74 }; 75 75 76 + /* Up until and including Linux 5.3 */ 77 + struct xdp_ring_offset_v1 { 78 + __u64 producer; 79 + __u64 consumer; 80 + __u64 desc; 81 + }; 82 + 83 + /* Up until and including Linux 5.3 */ 84 + struct xdp_mmap_offsets_v1 { 85 + struct xdp_ring_offset_v1 rx; 86 + struct xdp_ring_offset_v1 tx; 87 + struct xdp_ring_offset_v1 fr; 88 + struct xdp_ring_offset_v1 cr; 89 + }; 90 + 76 91 int xsk_umem__fd(const struct xsk_umem *umem) 77 92 { 78 93 return umem ? umem->fd : -EINVAL; ··· 148 133 return 0; 149 134 } 150 135 136 + static void xsk_mmap_offsets_v1(struct xdp_mmap_offsets *off) 137 + { 138 + struct xdp_mmap_offsets_v1 off_v1; 139 + 140 + /* getsockopt on a kernel <= 5.3 has no flags fields. 141 + * Copy over the offsets to the correct places in the >=5.4 format 142 + * and put the flags where they would have been on that kernel. 143 + */ 144 + memcpy(&off_v1, off, sizeof(off_v1)); 145 + 146 + off->rx.producer = off_v1.rx.producer; 147 + off->rx.consumer = off_v1.rx.consumer; 148 + off->rx.desc = off_v1.rx.desc; 149 + off->rx.flags = off_v1.rx.consumer + sizeof(__u32); 150 + 151 + off->tx.producer = off_v1.tx.producer; 152 + off->tx.consumer = off_v1.tx.consumer; 153 + off->tx.desc = off_v1.tx.desc; 154 + off->tx.flags = off_v1.tx.consumer + sizeof(__u32); 155 + 156 + off->fr.producer = off_v1.fr.producer; 157 + off->fr.consumer = off_v1.fr.consumer; 158 + off->fr.desc = off_v1.fr.desc; 159 + off->fr.flags = off_v1.fr.consumer + sizeof(__u32); 160 + 161 + off->cr.producer = off_v1.cr.producer; 162 + off->cr.consumer = off_v1.cr.consumer; 163 + off->cr.desc = off_v1.cr.desc; 164 + off->cr.flags = off_v1.cr.consumer + sizeof(__u32); 165 + } 166 + 167 + static int xsk_get_mmap_offsets(int fd, struct xdp_mmap_offsets *off) 168 + { 169 + socklen_t optlen; 170 + int err; 171 + 172 + optlen = sizeof(*off); 173 + err = getsockopt(fd, SOL_XDP, XDP_MMAP_OFFSETS, off, &optlen); 174 + if (err) 175 + return err; 176 + 177 + if (optlen == 
sizeof(*off)) 178 + return 0; 179 + 180 + if (optlen == sizeof(struct xdp_mmap_offsets_v1)) { 181 + xsk_mmap_offsets_v1(off); 182 + return 0; 183 + } 184 + 185 + return -EINVAL; 186 + } 187 + 151 188 int xsk_umem__create_v0_0_4(struct xsk_umem **umem_ptr, void *umem_area, 152 189 __u64 size, struct xsk_ring_prod *fill, 153 190 struct xsk_ring_cons *comp, ··· 208 141 struct xdp_mmap_offsets off; 209 142 struct xdp_umem_reg mr; 210 143 struct xsk_umem *umem; 211 - socklen_t optlen; 212 144 void *map; 213 145 int err; 214 146 ··· 256 190 goto out_socket; 257 191 } 258 192 259 - optlen = sizeof(off); 260 - err = getsockopt(umem->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen); 193 + err = xsk_get_mmap_offsets(umem->fd, &off); 261 194 if (err) { 262 195 err = -errno; 263 196 goto out_socket; ··· 579 514 struct sockaddr_xdp sxdp = {}; 580 515 struct xdp_mmap_offsets off; 581 516 struct xsk_socket *xsk; 582 - socklen_t optlen; 583 517 int err; 584 518 585 519 if (!umem || !xsk_ptr || !rx || !tx) ··· 637 573 } 638 574 } 639 575 640 - optlen = sizeof(off); 641 - err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen); 576 + err = xsk_get_mmap_offsets(xsk->fd, &off); 642 577 if (err) { 643 578 err = -errno; 644 579 goto out_socket; ··· 723 660 int xsk_umem__delete(struct xsk_umem *umem) 724 661 { 725 662 struct xdp_mmap_offsets off; 726 - socklen_t optlen; 727 663 int err; 728 664 729 665 if (!umem) ··· 731 669 if (umem->refcount) 732 670 return -EBUSY; 733 671 734 - optlen = sizeof(off); 735 - err = getsockopt(umem->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen); 672 + err = xsk_get_mmap_offsets(umem->fd, &off); 736 673 if (!err) { 737 674 munmap(umem->fill->ring - off.fr.desc, 738 675 off.fr.desc + umem->config.fill_size * sizeof(__u64)); ··· 749 688 { 750 689 size_t desc_sz = sizeof(struct xdp_desc); 751 690 struct xdp_mmap_offsets off; 752 - socklen_t optlen; 753 691 int err; 754 692 755 693 if (!xsk) ··· 759 699 close(xsk->prog_fd); 760 700 } 761 701 762 - 
optlen = sizeof(off); 763 - err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen); 702 + err = xsk_get_mmap_offsets(xsk->fd, &off); 764 703 if (!err) { 765 704 if (xsk->rx) { 766 705 munmap(xsk->rx->ring - off.rx.desc,
+10 -6
tools/testing/selftests/bpf/Makefile
··· 89 89 $(OUTPUT)/urandom_read: urandom_read.c 90 90 $(CC) -o $@ $< -Wl,--build-id 91 91 92 + $(OUTPUT)/test_stub.o: test_stub.c 93 + $(CC) -c $(CFLAGS) -o $@ $< 94 + 92 95 BPFOBJ := $(OUTPUT)/libbpf.a 93 96 94 97 $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/test_stub.o $(BPFOBJ) ··· 134 131 | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') 135 132 endef 136 133 134 + # Determine target endianness. 135 + IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \ 136 + grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__') 137 + MENDIAN=$(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) 138 + 137 139 CLANG_SYS_INCLUDES = $(call get_sys_includes,$(CLANG)) 138 - BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \ 140 + BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) $(MENDIAN) \ 139 141 -I. -I./include/uapi -I$(APIDIR) \ 140 142 -I$(BPFDIR) -I$(abspath $(OUTPUT)/../usr/include) 141 143 ··· 279 271 280 272 # Define test_progs BPF-GCC-flavored test runner. 281 273 ifneq ($(BPF_GCC),) 282 - IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \ 283 - grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__') 284 - MENDIAN=$(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) 285 - 286 274 TRUNNER_BPF_BUILD_RULE := GCC_BPF_BUILD_RULE 287 - TRUNNER_BPF_CFLAGS := $(BPF_CFLAGS) $(call get_sys_includes,gcc) $(MENDIAN) 275 + TRUNNER_BPF_CFLAGS := $(BPF_CFLAGS) $(call get_sys_includes,gcc) 288 276 TRUNNER_BPF_LDFLAGS := 289 277 $(eval $(call DEFINE_TEST_RUNNER,test_progs,bpf_gcc)) 290 278 endif
+210
tools/testing/selftests/bpf/prog_tests/pinning.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <sys/types.h> 4 + #include <sys/stat.h> 5 + #include <unistd.h> 6 + #include <test_progs.h> 7 + 8 + __u32 get_map_id(struct bpf_object *obj, const char *name) 9 + { 10 + struct bpf_map_info map_info = {}; 11 + __u32 map_info_len, duration = 0; 12 + struct bpf_map *map; 13 + int err; 14 + 15 + map_info_len = sizeof(map_info); 16 + 17 + map = bpf_object__find_map_by_name(obj, name); 18 + if (CHECK(!map, "find map", "NULL map")) 19 + return 0; 20 + 21 + err = bpf_obj_get_info_by_fd(bpf_map__fd(map), 22 + &map_info, &map_info_len); 23 + CHECK(err, "get map info", "err %d errno %d", err, errno); 24 + return map_info.id; 25 + } 26 + 27 + void test_pinning(void) 28 + { 29 + const char *file_invalid = "./test_pinning_invalid.o"; 30 + const char *custpinpath = "/sys/fs/bpf/custom/pinmap"; 31 + const char *nopinpath = "/sys/fs/bpf/nopinmap"; 32 + const char *nopinpath2 = "/sys/fs/bpf/nopinmap2"; 33 + const char *custpath = "/sys/fs/bpf/custom"; 34 + const char *pinpath = "/sys/fs/bpf/pinmap"; 35 + const char *file = "./test_pinning.o"; 36 + __u32 map_id, map_id2, duration = 0; 37 + struct stat statbuf = {}; 38 + struct bpf_object *obj; 39 + struct bpf_map *map; 40 + int err; 41 + DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts, 42 + .pin_root_path = custpath, 43 + ); 44 + 45 + /* check that opening fails with invalid pinning value in map def */ 46 + obj = bpf_object__open_file(file_invalid, NULL); 47 + err = libbpf_get_error(obj); 48 + if (CHECK(err != -EINVAL, "invalid open", "err %d errno %d\n", err, errno)) { 49 + obj = NULL; 50 + goto out; 51 + } 52 + 53 + /* open the valid object file */ 54 + obj = bpf_object__open_file(file, NULL); 55 + err = libbpf_get_error(obj); 56 + if (CHECK(err, "default open", "err %d errno %d\n", err, errno)) { 57 + obj = NULL; 58 + goto out; 59 + } 60 + 61 + err = bpf_object__load(obj); 62 + if (CHECK(err, "default load", "err %d errno %d\n", err, errno)) 63 + goto out; 64 + 65 + /* 
check that pinmap was pinned */ 66 + err = stat(pinpath, &statbuf); 67 + if (CHECK(err, "stat pinpath", "err %d errno %d\n", err, errno)) 68 + goto out; 69 + 70 + /* check that nopinmap was *not* pinned */ 71 + err = stat(nopinpath, &statbuf); 72 + if (CHECK(!err || errno != ENOENT, "stat nopinpath", 73 + "err %d errno %d\n", err, errno)) 74 + goto out; 75 + 76 + /* check that nopinmap2 was *not* pinned */ 77 + err = stat(nopinpath2, &statbuf); 78 + if (CHECK(!err || errno != ENOENT, "stat nopinpath2", 79 + "err %d errno %d\n", err, errno)) 80 + goto out; 81 + 82 + map_id = get_map_id(obj, "pinmap"); 83 + if (!map_id) 84 + goto out; 85 + 86 + bpf_object__close(obj); 87 + 88 + obj = bpf_object__open_file(file, NULL); 89 + if (CHECK_FAIL(libbpf_get_error(obj))) { 90 + obj = NULL; 91 + goto out; 92 + } 93 + 94 + err = bpf_object__load(obj); 95 + if (CHECK(err, "default load", "err %d errno %d\n", err, errno)) 96 + goto out; 97 + 98 + /* check that same map ID was reused for second load */ 99 + map_id2 = get_map_id(obj, "pinmap"); 100 + if (CHECK(map_id != map_id2, "check reuse", 101 + "err %d errno %d id %d id2 %d\n", err, errno, map_id, map_id2)) 102 + goto out; 103 + 104 + /* should be no-op to re-pin same map */ 105 + map = bpf_object__find_map_by_name(obj, "pinmap"); 106 + if (CHECK(!map, "find map", "NULL map")) 107 + goto out; 108 + 109 + err = bpf_map__pin(map, NULL); 110 + if (CHECK(err, "re-pin map", "err %d errno %d\n", err, errno)) 111 + goto out; 112 + 113 + /* but error to pin at different location */ 114 + err = bpf_map__pin(map, "/sys/fs/bpf/other"); 115 + if (CHECK(!err, "pin map different", "err %d errno %d\n", err, errno)) 116 + goto out; 117 + 118 + /* unpin maps with a pin_path set */ 119 + err = bpf_object__unpin_maps(obj, NULL); 120 + if (CHECK(err, "unpin maps", "err %d errno %d\n", err, errno)) 121 + goto out; 122 + 123 + /* and re-pin them... 
*/ 124 + err = bpf_object__pin_maps(obj, NULL); 125 + if (CHECK(err, "pin maps", "err %d errno %d\n", err, errno)) 126 + goto out; 127 + 128 + /* set pinning path of other map and re-pin all */ 129 + map = bpf_object__find_map_by_name(obj, "nopinmap"); 130 + if (CHECK(!map, "find map", "NULL map")) 131 + goto out; 132 + 133 + err = bpf_map__set_pin_path(map, custpinpath); 134 + if (CHECK(err, "set pin path", "err %d errno %d\n", err, errno)) 135 + goto out; 136 + 137 + /* should only pin the one unpinned map */ 138 + err = bpf_object__pin_maps(obj, NULL); 139 + if (CHECK(err, "pin maps", "err %d errno %d\n", err, errno)) 140 + goto out; 141 + 142 + /* check that nopinmap was pinned at the custom path */ 143 + err = stat(custpinpath, &statbuf); 144 + if (CHECK(err, "stat custpinpath", "err %d errno %d\n", err, errno)) 145 + goto out; 146 + 147 + /* remove the custom pin path to re-test it with auto-pinning below */ 148 + err = unlink(custpinpath); 149 + if (CHECK(err, "unlink custpinpath", "err %d errno %d\n", err, errno)) 150 + goto out; 151 + 152 + err = rmdir(custpath); 153 + if (CHECK(err, "rmdir custpindir", "err %d errno %d\n", err, errno)) 154 + goto out; 155 + 156 + bpf_object__close(obj); 157 + 158 + /* open the valid object file again */ 159 + obj = bpf_object__open_file(file, NULL); 160 + err = libbpf_get_error(obj); 161 + if (CHECK(err, "default open", "err %d errno %d\n", err, errno)) { 162 + obj = NULL; 163 + goto out; 164 + } 165 + 166 + /* swap pin paths of the two maps */ 167 + bpf_object__for_each_map(map, obj) { 168 + if (!strcmp(bpf_map__name(map), "nopinmap")) 169 + err = bpf_map__set_pin_path(map, pinpath); 170 + else if (!strcmp(bpf_map__name(map), "pinmap")) 171 + err = bpf_map__set_pin_path(map, NULL); 172 + else 173 + continue; 174 + 175 + if (CHECK(err, "set pin path", "err %d errno %d\n", err, errno)) 176 + goto out; 177 + } 178 + 179 + /* should fail because of map parameter mismatch */ 180 + err = bpf_object__load(obj); 181 + if 
(CHECK(err != -EINVAL, "param mismatch load", "err %d errno %d\n", err, errno)) 182 + goto out; 183 + 184 + bpf_object__close(obj); 185 + 186 + /* test auto-pinning at custom path with open opt */ 187 + obj = bpf_object__open_file(file, &opts); 188 + if (CHECK_FAIL(libbpf_get_error(obj))) { 189 + obj = NULL; 190 + goto out; 191 + } 192 + 193 + err = bpf_object__load(obj); 194 + if (CHECK(err, "custom load", "err %d errno %d\n", err, errno)) 195 + goto out; 196 + 197 + /* check that pinmap was pinned at the custom path */ 198 + err = stat(custpinpath, &statbuf); 199 + if (CHECK(err, "stat custpinpath", "err %d errno %d\n", err, errno)) 200 + goto out; 201 + 202 + out: 203 + unlink(pinpath); 204 + unlink(nopinpath); 205 + unlink(nopinpath2); 206 + unlink(custpinpath); 207 + rmdir(custpath); 208 + if (obj) 209 + bpf_object__close(obj); 210 + }
+78
tools/testing/selftests/bpf/prog_tests/probe_user.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <test_progs.h> 3 + 4 + void test_probe_user(void) 5 + { 6 + #define kprobe_name "__sys_connect" 7 + const char *prog_name = "kprobe/" kprobe_name; 8 + const char *obj_file = "./test_probe_user.o"; 9 + DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts, ); 10 + int err, results_map_fd, sock_fd, duration = 0; 11 + struct sockaddr curr, orig, tmp; 12 + struct sockaddr_in *in = (struct sockaddr_in *)&curr; 13 + struct bpf_link *kprobe_link = NULL; 14 + struct bpf_program *kprobe_prog; 15 + struct bpf_object *obj; 16 + static const int zero = 0; 17 + 18 + obj = bpf_object__open_file(obj_file, &opts); 19 + if (CHECK(IS_ERR(obj), "obj_open_file", "err %ld\n", PTR_ERR(obj))) 20 + return; 21 + 22 + kprobe_prog = bpf_object__find_program_by_title(obj, prog_name); 23 + if (CHECK(!kprobe_prog, "find_probe", 24 + "prog '%s' not found\n", prog_name)) 25 + goto cleanup; 26 + 27 + err = bpf_object__load(obj); 28 + if (CHECK(err, "obj_load", "err %d\n", err)) 29 + goto cleanup; 30 + 31 + results_map_fd = bpf_find_map(__func__, obj, "test_pro.bss"); 32 + if (CHECK(results_map_fd < 0, "find_bss_map", 33 + "err %d\n", results_map_fd)) 34 + goto cleanup; 35 + 36 + kprobe_link = bpf_program__attach_kprobe(kprobe_prog, false, 37 + kprobe_name); 38 + if (CHECK(IS_ERR(kprobe_link), "attach_kprobe", 39 + "err %ld\n", PTR_ERR(kprobe_link))) { 40 + kprobe_link = NULL; 41 + goto cleanup; 42 + } 43 + 44 + memset(&curr, 0, sizeof(curr)); 45 + in->sin_family = AF_INET; 46 + in->sin_port = htons(5555); 47 + in->sin_addr.s_addr = inet_addr("255.255.255.255"); 48 + memcpy(&orig, &curr, sizeof(curr)); 49 + 50 + sock_fd = socket(AF_INET, SOCK_STREAM, 0); 51 + if (CHECK(sock_fd < 0, "create_sock_fd", "err %d\n", sock_fd)) 52 + goto cleanup; 53 + 54 + connect(sock_fd, &curr, sizeof(curr)); 55 + close(sock_fd); 56 + 57 + err = bpf_map_lookup_elem(results_map_fd, &zero, &tmp); 58 + if (CHECK(err, "get_kprobe_res", 59 + "failed to get kprobe res: %d\n", 
err)) 60 + goto cleanup; 61 + 62 + in = (struct sockaddr_in *)&tmp; 63 + if (CHECK(memcmp(&tmp, &orig, sizeof(orig)), "check_kprobe_res", 64 + "wrong kprobe res from probe read: %s:%u\n", 65 + inet_ntoa(in->sin_addr), ntohs(in->sin_port))) 66 + goto cleanup; 67 + 68 + memset(&tmp, 0xab, sizeof(tmp)); 69 + 70 + in = (struct sockaddr_in *)&curr; 71 + if (CHECK(memcmp(&curr, &tmp, sizeof(tmp)), "check_kprobe_res", 72 + "wrong kprobe res from probe write: %s:%u\n", 73 + inet_ntoa(in->sin_addr), ntohs(in->sin_port))) 74 + goto cleanup; 75 + cleanup: 76 + bpf_link__destroy(kprobe_link); 77 + bpf_object__close(obj); 78 + }
+2 -2
tools/testing/selftests/bpf/progs/kfree_skb.c
··· 79 79 func = ptr->func; 80 80 })); 81 81 82 - bpf_probe_read(&pkt_type, sizeof(pkt_type), _(&skb->__pkt_type_offset)); 82 + bpf_probe_read_kernel(&pkt_type, sizeof(pkt_type), _(&skb->__pkt_type_offset)); 83 83 pkt_type &= 7; 84 84 85 85 /* read eth proto */ 86 - bpf_probe_read(&pkt_data, sizeof(pkt_data), data + 12); 86 + bpf_probe_read_kernel(&pkt_data, sizeof(pkt_data), data + 12); 87 87 88 88 bpf_printk("rcuhead.next %llx func %llx\n", ptr, func); 89 89 bpf_printk("skb->len %d users %d pkt_type %x\n",
+36 -31
tools/testing/selftests/bpf/progs/pyperf.h
··· 72 72 void* thread_state; 73 73 int key; 74 74 75 - bpf_probe_read(&key, sizeof(key), (void*)(long)pidData->tls_key_addr); 76 - bpf_probe_read(&thread_state, sizeof(thread_state), 77 - tls_base + 0x310 + key * 0x10 + 0x08); 75 + bpf_probe_read_user(&key, sizeof(key), (void*)(long)pidData->tls_key_addr); 76 + bpf_probe_read_user(&thread_state, sizeof(thread_state), 77 + tls_base + 0x310 + key * 0x10 + 0x08); 78 78 return thread_state; 79 79 } 80 80 ··· 82 82 FrameData *frame, Symbol *symbol) 83 83 { 84 84 // read data from PyFrameObject 85 - bpf_probe_read(&frame->f_back, 86 - sizeof(frame->f_back), 87 - frame_ptr + pidData->offsets.PyFrameObject_back); 88 - bpf_probe_read(&frame->f_code, 89 - sizeof(frame->f_code), 90 - frame_ptr + pidData->offsets.PyFrameObject_code); 85 + bpf_probe_read_user(&frame->f_back, 86 + sizeof(frame->f_back), 87 + frame_ptr + pidData->offsets.PyFrameObject_back); 88 + bpf_probe_read_user(&frame->f_code, 89 + sizeof(frame->f_code), 90 + frame_ptr + pidData->offsets.PyFrameObject_code); 91 91 92 92 // read data from PyCodeObject 93 93 if (!frame->f_code) 94 94 return false; 95 - bpf_probe_read(&frame->co_filename, 96 - sizeof(frame->co_filename), 97 - frame->f_code + pidData->offsets.PyCodeObject_filename); 98 - bpf_probe_read(&frame->co_name, 99 - sizeof(frame->co_name), 100 - frame->f_code + pidData->offsets.PyCodeObject_name); 95 + bpf_probe_read_user(&frame->co_filename, 96 + sizeof(frame->co_filename), 97 + frame->f_code + pidData->offsets.PyCodeObject_filename); 98 + bpf_probe_read_user(&frame->co_name, 99 + sizeof(frame->co_name), 100 + frame->f_code + pidData->offsets.PyCodeObject_name); 101 101 // read actual names into symbol 102 102 if (frame->co_filename) 103 - bpf_probe_read_str(&symbol->file, 104 - sizeof(symbol->file), 105 - frame->co_filename + pidData->offsets.String_data); 103 + bpf_probe_read_user_str(&symbol->file, 104 + sizeof(symbol->file), 105 + frame->co_filename + 106 + pidData->offsets.String_data); 106 107 if 
(frame->co_name) 107 - bpf_probe_read_str(&symbol->name, 108 - sizeof(symbol->name), 109 - frame->co_name + pidData->offsets.String_data); 108 + bpf_probe_read_user_str(&symbol->name, 109 + sizeof(symbol->name), 110 + frame->co_name + 111 + pidData->offsets.String_data); 110 112 return true; 111 113 } 112 114 ··· 176 174 event->kernel_stack_id = bpf_get_stackid(ctx, &stackmap, 0); 177 175 178 176 void* thread_state_current = (void*)0; 179 - bpf_probe_read(&thread_state_current, 180 - sizeof(thread_state_current), 181 - (void*)(long)pidData->current_state_addr); 177 + bpf_probe_read_user(&thread_state_current, 178 + sizeof(thread_state_current), 179 + (void*)(long)pidData->current_state_addr); 182 180 183 181 struct task_struct* task = (struct task_struct*)bpf_get_current_task(); 184 182 void* tls_base = (void*)task; ··· 190 188 if (pidData->use_tls) { 191 189 uint64_t pthread_created; 192 190 uint64_t pthread_self; 193 - bpf_probe_read(&pthread_self, sizeof(pthread_self), tls_base + 0x10); 191 + bpf_probe_read_user(&pthread_self, sizeof(pthread_self), 192 + tls_base + 0x10); 194 193 195 - bpf_probe_read(&pthread_created, 196 - sizeof(pthread_created), 197 - thread_state + pidData->offsets.PyThreadState_thread); 194 + bpf_probe_read_user(&pthread_created, 195 + sizeof(pthread_created), 196 + thread_state + 197 + pidData->offsets.PyThreadState_thread); 198 198 event->pthread_match = pthread_created == pthread_self; 199 199 } else { 200 200 event->pthread_match = 1; ··· 208 204 Symbol sym = {}; 209 205 int cur_cpu = bpf_get_smp_processor_id(); 210 206 211 - bpf_probe_read(&frame_ptr, 212 - sizeof(frame_ptr), 213 - thread_state + pidData->offsets.PyThreadState_frame); 207 + bpf_probe_read_user(&frame_ptr, 208 + sizeof(frame_ptr), 209 + thread_state + 210 + pidData->offsets.PyThreadState_frame); 214 211 215 212 int32_t* symbol_counter = bpf_map_lookup_elem(&symbolmap, &sym); 216 213 if (symbol_counter == NULL)
+18 -18
tools/testing/selftests/bpf/progs/strobemeta.h
··· 98 98 /* 99 99 * having volatile doesn't change anything on BPF side, but clang 100 100 * emits warnings for passing `volatile const char *` into 101 - * bpf_probe_read_str that expects just `const char *` 101 + * bpf_probe_read_user_str that expects just `const char *` 102 102 */ 103 103 const char* tag; 104 104 /* ··· 309 309 dtv_t *dtv; 310 310 void *tls_ptr; 311 311 312 - bpf_probe_read(&tls_index, sizeof(struct tls_index), 313 - (void *)loc->offset); 312 + bpf_probe_read_user(&tls_index, sizeof(struct tls_index), 313 + (void *)loc->offset); 314 314 /* valid module index is always positive */ 315 315 if (tls_index.module > 0) { 316 316 /* dtv = ((struct tcbhead *)tls_base)->dtv[tls_index.module] */ 317 - bpf_probe_read(&dtv, sizeof(dtv), 318 - &((struct tcbhead *)tls_base)->dtv); 317 + bpf_probe_read_user(&dtv, sizeof(dtv), 318 + &((struct tcbhead *)tls_base)->dtv); 319 319 dtv += tls_index.module; 320 320 } else { 321 321 dtv = NULL; 322 322 } 323 - bpf_probe_read(&tls_ptr, sizeof(void *), dtv); 323 + bpf_probe_read_user(&tls_ptr, sizeof(void *), dtv); 324 324 /* if pointer has (void *)-1 value, then TLS wasn't initialized yet */ 325 325 return tls_ptr && tls_ptr != (void *)-1 326 326 ? 
tls_ptr + tls_index.offset ··· 336 336 if (!location) 337 337 return; 338 338 339 - bpf_probe_read(value, sizeof(struct strobe_value_generic), location); 339 + bpf_probe_read_user(value, sizeof(struct strobe_value_generic), location); 340 340 data->int_vals[idx] = value->val; 341 341 if (value->header.len) 342 342 data->int_vals_set_mask |= (1 << idx); ··· 356 356 if (!location) 357 357 return 0; 358 358 359 - bpf_probe_read(value, sizeof(struct strobe_value_generic), location); 360 - len = bpf_probe_read_str(payload, STROBE_MAX_STR_LEN, value->ptr); 359 + bpf_probe_read_user(value, sizeof(struct strobe_value_generic), location); 360 + len = bpf_probe_read_user_str(payload, STROBE_MAX_STR_LEN, value->ptr); 361 361 /* 362 - * if bpf_probe_read_str returns error (<0), due to casting to 362 + * if bpf_probe_read_user_str returns error (<0), due to casting to 363 363 * unsinged int, it will become big number, so next check is 364 364 * sufficient to check for errors AND prove to BPF verifier, that 365 - * bpf_probe_read_str won't return anything bigger than 365 + * bpf_probe_read_user_str won't return anything bigger than 366 366 * STROBE_MAX_STR_LEN 367 367 */ 368 368 if (len > STROBE_MAX_STR_LEN) ··· 391 391 if (!location) 392 392 return payload; 393 393 394 - bpf_probe_read(value, sizeof(struct strobe_value_generic), location); 395 - if (bpf_probe_read(&map, sizeof(struct strobe_map_raw), value->ptr)) 394 + bpf_probe_read_user(value, sizeof(struct strobe_value_generic), location); 395 + if (bpf_probe_read_user(&map, sizeof(struct strobe_map_raw), value->ptr)) 396 396 return payload; 397 397 398 398 descr->id = map.id; ··· 402 402 data->req_meta_valid = 1; 403 403 } 404 404 405 - len = bpf_probe_read_str(payload, STROBE_MAX_STR_LEN, map.tag); 405 + len = bpf_probe_read_user_str(payload, STROBE_MAX_STR_LEN, map.tag); 406 406 if (len <= STROBE_MAX_STR_LEN) { 407 407 descr->tag_len = len; 408 408 payload += len; ··· 418 418 break; 419 419 420 420 descr->key_lens[i] = 0; 
421 - len = bpf_probe_read_str(payload, STROBE_MAX_STR_LEN, 422 - map.entries[i].key); 421 + len = bpf_probe_read_user_str(payload, STROBE_MAX_STR_LEN, 422 + map.entries[i].key); 423 423 if (len <= STROBE_MAX_STR_LEN) { 424 424 descr->key_lens[i] = len; 425 425 payload += len; 426 426 } 427 427 descr->val_lens[i] = 0; 428 - len = bpf_probe_read_str(payload, STROBE_MAX_STR_LEN, 429 - map.entries[i].val); 428 + len = bpf_probe_read_user_str(payload, STROBE_MAX_STR_LEN, 429 + map.entries[i].val); 430 430 if (len <= STROBE_MAX_STR_LEN) { 431 431 descr->val_lens[i] = len; 432 432 payload += len;
+31
tools/testing/selftests/bpf/progs/test_pinning.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + int _version SEC("version") = 1; 7 + 8 + struct { 9 + __uint(type, BPF_MAP_TYPE_ARRAY); 10 + __uint(max_entries, 1); 11 + __type(key, __u32); 12 + __type(value, __u64); 13 + __uint(pinning, LIBBPF_PIN_BY_NAME); 14 + } pinmap SEC(".maps"); 15 + 16 + struct { 17 + __uint(type, BPF_MAP_TYPE_HASH); 18 + __uint(max_entries, 1); 19 + __type(key, __u32); 20 + __type(value, __u64); 21 + } nopinmap SEC(".maps"); 22 + 23 + struct { 24 + __uint(type, BPF_MAP_TYPE_ARRAY); 25 + __uint(max_entries, 1); 26 + __type(key, __u32); 27 + __type(value, __u64); 28 + __uint(pinning, LIBBPF_PIN_NONE); 29 + } nopinmap2 SEC(".maps"); 30 + 31 + char _license[] SEC("license") = "GPL";
+16
tools/testing/selftests/bpf/progs/test_pinning_invalid.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + int _version SEC("version") = 1; 7 + 8 + struct { 9 + __uint(type, BPF_MAP_TYPE_ARRAY); 10 + __uint(max_entries, 1); 11 + __type(key, __u32); 12 + __type(value, __u64); 13 + __uint(pinning, 2); /* invalid */ 14 + } nopinmap3 SEC(".maps"); 15 + 16 + char _license[] SEC("license") = "GPL";
+26
tools/testing/selftests/bpf/progs/test_probe_user.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/ptrace.h> 4 + #include <linux/bpf.h> 5 + 6 + #include <netinet/in.h> 7 + 8 + #include "bpf_helpers.h" 9 + #include "bpf_tracing.h" 10 + 11 + static struct sockaddr_in old; 12 + 13 + SEC("kprobe/__sys_connect") 14 + int handle_sys_connect(struct pt_regs *ctx) 15 + { 16 + void *ptr = (void *)PT_REGS_PARM2(ctx); 17 + struct sockaddr_in new; 18 + 19 + bpf_probe_read_user(&old, sizeof(old), ptr); 20 + __builtin_memset(&new, 0xab, sizeof(new)); 21 + bpf_probe_write_user(ptr, &new, sizeof(new)); 22 + 23 + return 0; 24 + } 25 + 26 + char _license[] SEC("license") = "GPL";
+1 -1
tools/testing/selftests/bpf/progs/test_tcp_estats.c
··· 38 38 #include <sys/socket.h> 39 39 #include "bpf_helpers.h" 40 40 41 - #define _(P) ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;}) 41 + #define _(P) ({typeof(P) val = 0; bpf_probe_read_kernel(&val, sizeof(val), &P); val;}) 42 42 #define TCP_ESTATS_MAGIC 0xBAADBEEF 43 43 44 44 /* This test case needs "sock" and "pt_regs" data structure.
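The `_()` wrapper updated above is a GNU C statement-expression macro: it declares a zero-initialized temporary of the field's type, fills it through a probe helper, and yields the copied value, so the BPF program never dereferences the pointer directly. A host-side sketch of the same pattern, with a hypothetical `probe_read_stub()` standing in for `bpf_probe_read_kernel()` (which only exists in-kernel):

```c
#include <string.h>
#include <assert.h>

/* Hypothetical stand-in for bpf_probe_read_kernel(): a plain bounded
 * copy, so the macro pattern can run on the host. */
static int probe_read_stub(void *dst, unsigned long size, const void *src)
{
	memcpy(dst, src, size);
	return 0;
}

/* Same shape as the selftest's _() macro: zero-init a temporary of the
 * field's type, fill it via the helper, evaluate to the copied value. */
#define _(P) ({ __typeof__(P) val = 0; probe_read_stub(&val, sizeof(val), &(P)); val; })

/* Minimal struct standing in for the kernel's struct sock. */
struct sock_stub {
	int sk_state;
};
```

Because all reads funnel through the helper, only the helper needs to be swapped when the access target changes, which is exactly why the one-line diff above suffices to move this test from `bpf_probe_read()` to `bpf_probe_read_kernel()`.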
-3
tools/testing/selftests/bpf/test_offload.py
··· 314 314 continue 315 315 316 316 p = os.path.join(path, f) 317 - if not os.stat(p).st_mode & stat.S_IRUSR: 318 - continue 319 - 320 317 if os.path.isfile(p) and os.access(p, os.R_OK): 321 318 _, out = cmd('cat %s/%s' % (path, f)) 322 319 dfs[f] = out.strip()
+23
tools/testing/selftests/bpf/test_sysctl.c
··· 121 121 .result = OP_EPERM, 122 122 }, 123 123 { 124 + .descr = "ctx:write sysctl:write read ok narrow", 125 + .insns = { 126 + /* u64 w = (u16)write & 1; */ 127 + #if __BYTE_ORDER == __LITTLE_ENDIAN 128 + BPF_LDX_MEM(BPF_H, BPF_REG_7, BPF_REG_1, 129 + offsetof(struct bpf_sysctl, write)), 130 + #else 131 + BPF_LDX_MEM(BPF_H, BPF_REG_7, BPF_REG_1, 132 + offsetof(struct bpf_sysctl, write) + 2), 133 + #endif 134 + BPF_ALU64_IMM(BPF_AND, BPF_REG_7, 1), 135 + /* return 1 - w; */ 136 + BPF_MOV64_IMM(BPF_REG_0, 1), 137 + BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_7), 138 + BPF_EXIT_INSN(), 139 + }, 140 + .attach_type = BPF_CGROUP_SYSCTL, 141 + .sysctl = "kernel/domainname", 142 + .open_flags = O_WRONLY, 143 + .newval = "(none)", /* same as default, should fail anyway */ 144 + .result = OP_EPERM, 145 + }, 146 + { 124 147 .descr = "ctx:write sysctl:read write reject", 125 148 .insns = { 126 149 /* write = X */
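The `#if __BYTE_ORDER` split in the new test case exists because a 2-byte narrow load of the 32-bit `write` context field must target different byte offsets depending on endianness: the low-order bits sit at byte offset 0 on little-endian and at offset 2 on big-endian, hence the `+ 2` added to `offsetof()`. A host-side C sketch (plain userspace code, not BPF) of the same offset arithmetic:

```c
#include <stdint.h>
#include <string.h>

/* Find the byte offset within a 32-bit field at which a 2-byte narrow
 * load observes the field's low-order bits. Mirrors the reason the
 * big-endian branch of the BPF_LDX_MEM above adds +2. */
static unsigned int low_half_offset(void)
{
	uint32_t field = 1;	/* like "write = 1" in the test */
	uint16_t half;
	unsigned int off;

	for (off = 0; off + sizeof(half) <= sizeof(field); off += sizeof(half)) {
		memcpy(&half, (unsigned char *)&field + off, sizeof(half));
		if (half & 1)
			return off;	/* this half-word holds the low bits */
	}
	return 0;	/* unreachable for a non-zero field */
}
```

On a little-endian host this returns 0 and on a big-endian host it returns 2, matching the two arms of the preprocessor conditional in the test's instruction sequence.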