Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-08-31

We've added 15 non-merge commits during the last 3 day(s) which contain
a total of 17 files changed, 468 insertions(+), 97 deletions(-).

The main changes are:

1) BPF selftest fixes: one flake and one related to clang18 testing,
from Yonghong Song.

2) Fix a d_path BPF selftest failure after fast-forward from Linus'
tree, from Jiri Olsa.

3) Fix a preempt_rt splat in sockmap when using raw_spin_lock_t,
from John Fastabend.

4) Fix an xsk_diag_fill use-after-free race during socket cleanup,
from Magnus Karlsson.

5) Fix xsk_build_skb to address a buggy dereference of an ERR_PTR(),
from Tirthendu Sarkar.

6) Fix a bpftool build warning when compiled with -Wtype-limits,
from Yafang Shao.

7) Several misc fixes and cleanups in standardization docs,
from David Vernet.

8) Fix BPF selftest install to consider no_alu32/cpuv4/bpf-gcc flavors,
from Björn Töpel.

9) Annotate a data race in bpf_long_memcpy for KCSAN, from Daniel Borkmann.

10) Extend documentation with a description for CO-RE relocations,
from Eduard Zingerman.

11) Fix several invalid escape sequence warnings in bpf_doc.py script,
from Vishal Chourasia.

12) Fix the instruction set doc wrt offset of BPF-to-BPF call,
from Will Hawkins.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Include build flavors for install target
bpf: Annotate bpf_long_memcpy with data_race
selftests/bpf: Fix d_path test
bpf, docs: Fix invalid escape sequence warnings in bpf_doc.py
xsk: Fix xsk_diag use-after-free error during socket cleanup
bpf, docs: s/eBPF/BPF in standards documents
bpf, docs: Add abi.rst document to standardization subdirectory
bpf, docs: Move linux-notes.rst to root bpf docs tree
bpf, sockmap: Fix preempt_rt splat when using raw_spin_lock_t
docs/bpf: Add description for CO-RE relocations
bpf, docs: Correct source of offset for program-local call
selftests/bpf: Fix flaky cgroup_iter_sleepable subtest
xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR()
bpftool: Fix build warnings with -Wtype-limits
bpf: Prevent inlining of bpf_fentry_test7()
====================

Link: https://lore.kernel.org/r/20230831210019.14417-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

+468 -97
+25 -6
Documentation/bpf/btf.rst
··· 726 726 4.2 .BTF.ext section 727 727 -------------------- 728 728 729 - The .BTF.ext section encodes func_info and line_info which needs loader 730 - manipulation before loading into the kernel. 729 + The .BTF.ext section encodes func_info, line_info and CO-RE relocations 730 + which needs loader manipulation before loading into the kernel. 731 731 732 732 The specification for .BTF.ext section is defined at ``tools/lib/bpf/btf.h`` 733 733 and ``tools/lib/bpf/btf.c``. ··· 745 745 __u32 func_info_len; 746 746 __u32 line_info_off; 747 747 __u32 line_info_len; 748 + 749 + /* optional part of .BTF.ext header */ 750 + __u32 core_relo_off; 751 + __u32 core_relo_len; 748 752 }; 749 753 750 754 It is very similar to .BTF section. Instead of type/string section, it 751 - contains func_info and line_info section. See :ref:`BPF_Prog_Load` for details 752 - about func_info and line_info record format. 755 + contains func_info, line_info and core_relo sub-sections. 756 + See :ref:`BPF_Prog_Load` for details about func_info and line_info 757 + record format. 753 758 754 759 The func_info is organized as below.:: 755 760 756 - func_info_rec_size 761 + func_info_rec_size /* __u32 value */ 757 762 btf_ext_info_sec for section #1 /* func_info for section #1 */ 758 763 btf_ext_info_sec for section #2 /* func_info for section #2 */ 759 764 ... ··· 778 773 779 774 The line_info is organized as below.:: 780 775 781 - line_info_rec_size 776 + line_info_rec_size /* __u32 value */ 782 777 btf_ext_info_sec for section #1 /* line_info for section #1 */ 783 778 btf_ext_info_sec for section #2 /* line_info for section #2 */ 784 779 ... ··· 791 786 kernel API, the ``insn_off`` is the instruction offset in the unit of ``struct 792 787 bpf_insn``. For ELF API, the ``insn_off`` is the byte offset from the 793 788 beginning of section (``btf_ext_info_sec->sec_name_off``). 
789 + 790 + The core_relo is organized as below.:: 791 + 792 + core_relo_rec_size /* __u32 value */ 793 + btf_ext_info_sec for section #1 /* core_relo for section #1 */ 794 + btf_ext_info_sec for section #2 /* core_relo for section #2 */ 795 + 796 + ``core_relo_rec_size`` specifies the size of ``bpf_core_relo`` 797 + structure when .BTF.ext is generated. All ``bpf_core_relo`` structures 798 + within a single ``btf_ext_info_sec`` describe relocations applied to 799 + section named by ``btf_ext_info_sec->sec_name_off``. 800 + 801 + See :ref:`Documentation/bpf/llvm_reloc <btf-co-re-relocations>` 802 + for more information on CO-RE relocations. 794 803 795 804 4.2 .BTF_ids section 796 805 --------------------
+1
Documentation/bpf/index.rst
··· 29 29 bpf_licensing 30 30 test_debug 31 31 clang-notes 32 + linux-notes 32 33 other 33 34 redirect 34 35
+304
Documentation/bpf/llvm_reloc.rst
··· 240 240 Offset Info Type Symbol's Value Symbol's Name 241 241 000000000000002c 0000000200000004 R_BPF_64_NODYLD32 0000000000000000 .text 242 242 0000000000000040 0000000200000004 R_BPF_64_NODYLD32 0000000000000000 .text 243 + 244 + .. _btf-co-re-relocations: 245 + 246 + ================= 247 + CO-RE Relocations 248 + ================= 249 + 250 + From object file point of view CO-RE mechanism is implemented as a set 251 + of CO-RE specific relocation records. These relocation records are not 252 + related to ELF relocations and are encoded in .BTF.ext section. 253 + See :ref:`Documentation/bpf/btf <BTF_Ext_Section>` for more 254 + information on .BTF.ext structure. 255 + 256 + CO-RE relocations are applied to BPF instructions to update immediate 257 + or offset fields of the instruction at load time with information 258 + relevant for target kernel. 259 + 260 + Field to patch is selected basing on the instruction class: 261 + 262 + * For BPF_ALU, BPF_ALU64, BPF_LD `immediate` field is patched; 263 + * For BPF_LDX, BPF_STX, BPF_ST `offset` field is patched; 264 + * BPF_JMP, BPF_JMP32 instructions **should not** be patched. 265 + 266 + Relocation kinds 267 + ================ 268 + 269 + There are several kinds of CO-RE relocations that could be split in 270 + three groups: 271 + 272 + * Field-based - patch instruction with field related information, e.g. 273 + change offset field of the BPF_LDX instruction to reflect offset 274 + of a specific structure field in the target kernel. 275 + 276 + * Type-based - patch instruction with type related information, e.g. 277 + change immediate field of the BPF_ALU move instruction to 0 or 1 to 278 + reflect if specific type is present in the target kernel. 279 + 280 + * Enum-based - patch instruction with enum related information, e.g. 281 + change immediate field of the BPF_LD_IMM64 instruction to reflect 282 + value of a specific enum literal in the target kernel. 
283 + 284 + The complete list of relocation kinds is represented by the following enum: 285 + 286 + .. code-block:: c 287 + 288 + enum bpf_core_relo_kind { 289 + BPF_CORE_FIELD_BYTE_OFFSET = 0, /* field byte offset */ 290 + BPF_CORE_FIELD_BYTE_SIZE = 1, /* field size in bytes */ 291 + BPF_CORE_FIELD_EXISTS = 2, /* field existence in target kernel */ 292 + BPF_CORE_FIELD_SIGNED = 3, /* field signedness (0 - unsigned, 1 - signed) */ 293 + BPF_CORE_FIELD_LSHIFT_U64 = 4, /* bitfield-specific left bitshift */ 294 + BPF_CORE_FIELD_RSHIFT_U64 = 5, /* bitfield-specific right bitshift */ 295 + BPF_CORE_TYPE_ID_LOCAL = 6, /* type ID in local BPF object */ 296 + BPF_CORE_TYPE_ID_TARGET = 7, /* type ID in target kernel */ 297 + BPF_CORE_TYPE_EXISTS = 8, /* type existence in target kernel */ 298 + BPF_CORE_TYPE_SIZE = 9, /* type size in bytes */ 299 + BPF_CORE_ENUMVAL_EXISTS = 10, /* enum value existence in target kernel */ 300 + BPF_CORE_ENUMVAL_VALUE = 11, /* enum value integer value */ 301 + BPF_CORE_TYPE_MATCHES = 12, /* type match in target kernel */ 302 + }; 303 + 304 + Notes: 305 + 306 + * ``BPF_CORE_FIELD_LSHIFT_U64`` and ``BPF_CORE_FIELD_RSHIFT_U64`` are 307 + supposed to be used to read bitfield values using the following 308 + algorithm: 309 + 310 + .. 
code-block:: c 311 + 312 + // To read bitfield ``f`` from ``struct s`` 313 + is_signed = relo(s->f, BPF_CORE_FIELD_SIGNED) 314 + off = relo(s->f, BPF_CORE_FIELD_BYTE_OFFSET) 315 + sz = relo(s->f, BPF_CORE_FIELD_BYTE_SIZE) 316 + l = relo(s->f, BPF_CORE_FIELD_LSHIFT_U64) 317 + r = relo(s->f, BPF_CORE_FIELD_RSHIFT_U64) 318 + // define ``v`` as signed or unsigned integer of size ``sz`` 319 + v = *({s|u}<sz> *)((void *)s + off) 320 + v <<= l 321 + v >>= r 322 + 323 + * The ``BPF_CORE_TYPE_MATCHES`` queries matching relation, defined as 324 + follows: 325 + 326 + * for integers: types match if size and signedness match; 327 + * for arrays & pointers: target types are recursively matched; 328 + * for structs & unions: 329 + 330 + * local members need to exist in target with the same name; 331 + 332 + * for each member we recursively check match unless it is already behind a 333 + pointer, in which case we only check matching names and compatible kind; 334 + 335 + * for enums: 336 + 337 + * local variants have to have a match in target by symbolic name (but not 338 + numeric value); 339 + 340 + * size has to match (but enum may match enum64 and vice versa); 341 + 342 + * for function pointers: 343 + 344 + * number and position of arguments in local type has to match target; 345 + * for each argument and the return value we recursively check match. 346 + 347 + CO-RE Relocation Record 348 + ======================= 349 + 350 + Relocation record is encoded as the following structure: 351 + 352 + .. code-block:: c 353 + 354 + struct bpf_core_relo { 355 + __u32 insn_off; 356 + __u32 type_id; 357 + __u32 access_str_off; 358 + enum bpf_core_relo_kind kind; 359 + }; 360 + 361 + * ``insn_off`` - instruction offset (in bytes) within a code section 362 + associated with this relocation; 363 + 364 + * ``type_id`` - BTF type ID of the "root" (containing) entity of a 365 + relocatable type or field; 366 + 367 + * ``access_str_off`` - offset into corresponding .BTF string section. 
368 + String interpretation depends on specific relocation kind: 369 + 370 + * for field-based relocations, string encodes an accessed field using 371 + a sequence of field and array indices, separated by colon (:). It's 372 + conceptually very close to LLVM's `getelementptr <GEP_>`_ instruction's 373 + arguments for identifying offset to a field. For example, consider the 374 + following C code: 375 + 376 + .. code-block:: c 377 + 378 + struct sample { 379 + int a; 380 + int b; 381 + struct { int c[10]; }; 382 + } __attribute__((preserve_access_index)); 383 + struct sample *s; 384 + 385 + * Access to ``s[0].a`` would be encoded as ``0:0``: 386 + 387 + * ``0``: first element of ``s`` (as if ``s`` is an array); 388 + * ``0``: index of field ``a`` in ``struct sample``. 389 + 390 + * Access to ``s->a`` would be encoded as ``0:0`` as well. 391 + * Access to ``s->b`` would be encoded as ``0:1``: 392 + 393 + * ``0``: first element of ``s``; 394 + * ``1``: index of field ``b`` in ``struct sample``. 395 + 396 + * Access to ``s[1].c[5]`` would be encoded as ``1:2:0:5``: 397 + 398 + * ``1``: second element of ``s``; 399 + * ``2``: index of anonymous structure field in ``struct sample``; 400 + * ``0``: index of field ``c`` in anonymous structure; 401 + * ``5``: access to array element #5. 402 + 403 + * for type-based relocations, string is expected to be just "0"; 404 + 405 + * for enum value-based relocations, string contains an index of enum 406 + value within its enum type; 407 + 408 + * ``kind`` - one of ``enum bpf_core_relo_kind``. 409 + 410 + .. _GEP: https://llvm.org/docs/LangRef.html#getelementptr-instruction 411 + 412 + .. _btf_co_re_relocation_examples: 413 + 414 + CO-RE Relocation Examples 415 + ========================= 416 + 417 + For the following C code: 418 + 419 + .. 
code-block:: c 420 + 421 + struct foo { 422 + int a; 423 + int b; 424 + unsigned c:15; 425 + } __attribute__((preserve_access_index)); 426 + 427 + enum bar { U, V }; 428 + 429 + With the following BTF definitions: 430 + 431 + .. code-block:: 432 + 433 + ... 434 + [2] STRUCT 'foo' size=8 vlen=2 435 + 'a' type_id=3 bits_offset=0 436 + 'b' type_id=3 bits_offset=32 437 + 'c' type_id=4 bits_offset=64 bitfield_size=15 438 + [3] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED 439 + [4] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none) 440 + ... 441 + [16] ENUM 'bar' encoding=UNSIGNED size=4 vlen=2 442 + 'U' val=0 443 + 'V' val=1 444 + 445 + Field offset relocations are generated automatically when 446 + ``__attribute__((preserve_access_index))`` is used, for example: 447 + 448 + .. code-block:: c 449 + 450 + void alpha(struct foo *s, volatile unsigned long *g) { 451 + *g = s->a; 452 + s->a = 1; 453 + } 454 + 455 + 00 <alpha>: 456 + 0: r3 = *(s32 *)(r1 + 0x0) 457 + 00: CO-RE <byte_off> [2] struct foo::a (0:0) 458 + 1: *(u64 *)(r2 + 0x0) = r3 459 + 2: *(u32 *)(r1 + 0x0) = 0x1 460 + 10: CO-RE <byte_off> [2] struct foo::a (0:0) 461 + 3: exit 462 + 463 + 464 + All relocation kinds could be requested via built-in functions. 465 + E.g. field-based relocations: 466 + 467 + .. 
code-block:: c 468 + 469 + void bravo(struct foo *s, volatile unsigned long *g) { 470 + *g = __builtin_preserve_field_info(s->b, 0 /* field byte offset */); 471 + *g = __builtin_preserve_field_info(s->b, 1 /* field byte size */); 472 + *g = __builtin_preserve_field_info(s->b, 2 /* field existence */); 473 + *g = __builtin_preserve_field_info(s->b, 3 /* field signedness */); 474 + *g = __builtin_preserve_field_info(s->c, 4 /* bitfield left shift */); 475 + *g = __builtin_preserve_field_info(s->c, 5 /* bitfield right shift */); 476 + } 477 + 478 + 20 <bravo>: 479 + 4: r1 = 0x4 480 + 20: CO-RE <byte_off> [2] struct foo::b (0:1) 481 + 5: *(u64 *)(r2 + 0x0) = r1 482 + 6: r1 = 0x4 483 + 30: CO-RE <byte_sz> [2] struct foo::b (0:1) 484 + 7: *(u64 *)(r2 + 0x0) = r1 485 + 8: r1 = 0x1 486 + 40: CO-RE <field_exists> [2] struct foo::b (0:1) 487 + 9: *(u64 *)(r2 + 0x0) = r1 488 + 10: r1 = 0x1 489 + 50: CO-RE <signed> [2] struct foo::b (0:1) 490 + 11: *(u64 *)(r2 + 0x0) = r1 491 + 12: r1 = 0x31 492 + 60: CO-RE <lshift_u64> [2] struct foo::c (0:2) 493 + 13: *(u64 *)(r2 + 0x0) = r1 494 + 14: r1 = 0x31 495 + 70: CO-RE <rshift_u64> [2] struct foo::c (0:2) 496 + 15: *(u64 *)(r2 + 0x0) = r1 497 + 16: exit 498 + 499 + 500 + Type-based relocations: 501 + 502 + .. 
code-block:: c 503 + 504 + void charlie(struct foo *s, volatile unsigned long *g) { 505 + *g = __builtin_preserve_type_info(*s, 0 /* type existence */); 506 + *g = __builtin_preserve_type_info(*s, 1 /* type size */); 507 + *g = __builtin_preserve_type_info(*s, 2 /* type matches */); 508 + *g = __builtin_btf_type_id(*s, 0 /* type id in this object file */); 509 + *g = __builtin_btf_type_id(*s, 1 /* type id in target kernel */); 510 + } 511 + 512 + 88 <charlie>: 513 + 17: r1 = 0x1 514 + 88: CO-RE <type_exists> [2] struct foo 515 + 18: *(u64 *)(r2 + 0x0) = r1 516 + 19: r1 = 0xc 517 + 98: CO-RE <type_size> [2] struct foo 518 + 20: *(u64 *)(r2 + 0x0) = r1 519 + 21: r1 = 0x1 520 + a8: CO-RE <type_matches> [2] struct foo 521 + 22: *(u64 *)(r2 + 0x0) = r1 522 + 23: r1 = 0x2 ll 523 + b8: CO-RE <local_type_id> [2] struct foo 524 + 25: *(u64 *)(r2 + 0x0) = r1 525 + 26: r1 = 0x2 ll 526 + d0: CO-RE <target_type_id> [2] struct foo 527 + 28: *(u64 *)(r2 + 0x0) = r1 528 + 29: exit 529 + 530 + Enum-based relocations: 531 + 532 + .. code-block:: c 533 + 534 + void delta(struct foo *s, volatile unsigned long *g) { 535 + *g = __builtin_preserve_enum_value(*(enum bar *)U, 0 /* enum literal existence */); 536 + *g = __builtin_preserve_enum_value(*(enum bar *)V, 1 /* enum literal value */); 537 + } 538 + 539 + f0 <delta>: 540 + 30: r1 = 0x1 ll 541 + f0: CO-RE <enumval_exists> [16] enum bar::U = 0 542 + 32: *(u64 *)(r2 + 0x0) = r1 543 + 33: r1 = 0x1 ll 544 + 108: CO-RE <enumval_value> [16] enum bar::V = 1 545 + 35: *(u64 *)(r2 + 0x0) = r1 546 + 36: exit
+25
Documentation/bpf/standardization/abi.rst
··· 1 + .. contents:: 2 + .. sectnum:: 3 + 4 + =================================================== 5 + BPF ABI Recommended Conventions and Guidelines v1.0 6 + =================================================== 7 + 8 + This is version 1.0 of an informational document containing recommended 9 + conventions and guidelines for producing portable BPF program binaries. 10 + 11 + Registers and calling convention 12 + ================================ 13 + 14 + BPF has 10 general purpose registers and a read-only frame pointer register, 15 + all of which are 64-bits wide. 16 + 17 + The BPF calling convention is defined as: 18 + 19 + * R0: return value from function calls, and exit value for BPF programs 20 + * R1 - R5: arguments for function calls 21 + * R6 - R9: callee saved registers that function calls will preserve 22 + * R10: read-only frame pointer to access stack 23 + 24 + R0 - R5 are scratch registers and BPF programs needs to spill/fill them if 25 + necessary across calls.
+1 -1
Documentation/bpf/standardization/index.rst
··· 12 12 :maxdepth: 1 13 13 14 14 instruction-set 15 - linux-notes 15 + abi 16 16 17 17 .. Links: 18 18 .. _IETF BPF Working Group: https://datatracker.ietf.org/wg/bpf/about/
+14 -30
Documentation/bpf/standardization/instruction-set.rst
··· 1 1 .. contents:: 2 2 .. sectnum:: 3 3 4 - ======================================== 5 - eBPF Instruction Set Specification, v1.0 6 - ======================================== 4 + ======================================= 5 + BPF Instruction Set Specification, v1.0 6 + ======================================= 7 7 8 - This document specifies version 1.0 of the eBPF instruction set. 8 + This document specifies version 1.0 of the BPF instruction set. 9 9 10 10 Documentation conventions 11 11 ========================= ··· 97 97 A: 10000110 98 98 B: 11111111 10000110 99 99 100 - Registers and calling convention 101 - ================================ 102 - 103 - eBPF has 10 general purpose registers and a read-only frame pointer register, 104 - all of which are 64-bits wide. 105 - 106 - The eBPF calling convention is defined as: 107 - 108 - * R0: return value from function calls, and exit value for eBPF programs 109 - * R1 - R5: arguments for function calls 110 - * R6 - R9: callee saved registers that function calls will preserve 111 - * R10: read-only frame pointer to access stack 112 - 113 - R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if 114 - necessary across calls. 115 - 116 100 Instruction encoding 117 101 ==================== 118 102 119 - eBPF has two instruction encodings: 103 + BPF has two instruction encodings: 120 104 121 105 * the basic instruction encoding, which uses 64 bits to encode an instruction 122 106 * the wide instruction encoding, which appends a second 64-bit immediate (i.e., ··· 244 260 ========= ===== ======= ========================================================== 245 261 246 262 Underflow and overflow are allowed during arithmetic operations, meaning 247 - the 64-bit or 32-bit value will wrap. If eBPF program execution would 263 + the 64-bit or 32-bit value will wrap. If BPF program execution would 248 264 result in division by zero, the destination register is instead set to zero. 
249 265 If execution would result in modulo by zero, for ``BPF_ALU64`` the value of 250 266 the destination register is unchanged whereas for ``BPF_ALU`` the upper ··· 357 373 BPF_JSGT 0x6 any PC += offset if dst > src signed 358 374 BPF_JSGE 0x7 any PC += offset if dst >= src signed 359 375 BPF_CALL 0x8 0x0 call helper function by address see `Helper functions`_ 360 - BPF_CALL 0x8 0x1 call PC += offset see `Program-local functions`_ 376 + BPF_CALL 0x8 0x1 call PC += imm see `Program-local functions`_ 361 377 BPF_CALL 0x8 0x2 call helper function by BTF ID see `Helper functions`_ 362 378 BPF_EXIT 0x9 0x0 return BPF_JMP only 363 379 BPF_JLT 0xa any PC += offset if dst < src unsigned ··· 366 382 BPF_JSLE 0xd any PC += offset if dst <= src signed 367 383 ======== ===== === =========================================== ========================================= 368 384 369 - The eBPF program needs to store the return value into register R0 before doing a 385 + The BPF program needs to store the return value into register R0 before doing a 370 386 ``BPF_EXIT``. 371 387 372 388 Example: ··· 408 424 ~~~~~~~~~~~~~~~~~~~~~~~ 409 425 Program-local functions are functions exposed by the same BPF program as the 410 426 caller, and are referenced by offset from the call instruction, similar to 411 - ``BPF_JA``. A ``BPF_EXIT`` within the program-local function will return to 412 - the caller. 427 + ``BPF_JA``. The offset is encoded in the imm field of the call instruction. 428 + A ``BPF_EXIT`` within the program-local function will return to the caller. 413 429 414 430 Load and store instructions 415 431 =========================== ··· 486 502 487 503 Atomic operations are operations that operate on memory and can not be 488 504 interrupted or corrupted by other access to the same memory region 489 - by other eBPF programs or means outside of this specification. 505 + by other BPF programs or means outside of this specification. 
490 506 491 - All atomic operations supported by eBPF are encoded as store operations 507 + All atomic operations supported by BPF are encoded as store operations 492 508 that use the ``BPF_ATOMIC`` mode modifier as follows: 493 509 494 510 * ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations ··· 578 594 Maps 579 595 ~~~~ 580 596 581 - Maps are shared memory regions accessible by eBPF programs on some platforms. 597 + Maps are shared memory regions accessible by BPF programs on some platforms. 582 598 A map can have various semantics as defined in a separate document, and may or 583 599 may not have a single contiguous memory region, but the 'map_val(map)' is 584 600 currently only defined for maps that do have a single contiguous memory region. ··· 600 616 Legacy BPF Packet access instructions 601 617 ------------------------------------- 602 618 603 - eBPF previously introduced special instructions for access to packet data that were 619 + BPF previously introduced special instructions for access to packet data that were 604 620 carried over from classic BPF. However, these instructions are 605 621 deprecated and should no longer be used.
Documentation/bpf/standardization/linux-notes.rst → Documentation/bpf/linux-notes.rst
+1 -1
include/linux/bpf.h
··· 438 438 439 439 size /= sizeof(long); 440 440 while (size--) 441 - *ldst++ = *lsrc++; 441 + data_race(*ldst++ = *lsrc++); 442 442 } 443 443 444 444 /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
+1
net/bpf/test_run.c
··· 543 543 544 544 int noinline bpf_fentry_test7(struct bpf_fentry_test_t *arg) 545 545 { 546 + asm volatile (""); 546 547 return (long)arg; 547 548 } 548 549
+18 -18
net/core/sock_map.c
··· 18 18 struct bpf_map map; 19 19 struct sock **sks; 20 20 struct sk_psock_progs progs; 21 - raw_spinlock_t lock; 21 + spinlock_t lock; 22 22 }; 23 23 24 24 #define SOCK_CREATE_FLAG_MASK \ ··· 44 44 return ERR_PTR(-ENOMEM); 45 45 46 46 bpf_map_init_from_attr(&stab->map, attr); 47 - raw_spin_lock_init(&stab->lock); 47 + spin_lock_init(&stab->lock); 48 48 49 49 stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries * 50 50 sizeof(struct sock *), ··· 411 411 struct sock *sk; 412 412 int err = 0; 413 413 414 - raw_spin_lock_bh(&stab->lock); 414 + spin_lock_bh(&stab->lock); 415 415 sk = *psk; 416 416 if (!sk_test || sk_test == sk) 417 417 sk = xchg(psk, NULL); ··· 421 421 else 422 422 err = -EINVAL; 423 423 424 - raw_spin_unlock_bh(&stab->lock); 424 + spin_unlock_bh(&stab->lock); 425 425 return err; 426 426 } 427 427 ··· 487 487 psock = sk_psock(sk); 488 488 WARN_ON_ONCE(!psock); 489 489 490 - raw_spin_lock_bh(&stab->lock); 490 + spin_lock_bh(&stab->lock); 491 491 osk = stab->sks[idx]; 492 492 if (osk && flags == BPF_NOEXIST) { 493 493 ret = -EEXIST; ··· 501 501 stab->sks[idx] = sk; 502 502 if (osk) 503 503 sock_map_unref(osk, &stab->sks[idx]); 504 - raw_spin_unlock_bh(&stab->lock); 504 + spin_unlock_bh(&stab->lock); 505 505 return 0; 506 506 out_unlock: 507 - raw_spin_unlock_bh(&stab->lock); 507 + spin_unlock_bh(&stab->lock); 508 508 if (psock) 509 509 sk_psock_put(sk, psock); 510 510 out_free: ··· 835 835 836 836 struct bpf_shtab_bucket { 837 837 struct hlist_head head; 838 - raw_spinlock_t lock; 838 + spinlock_t lock; 839 839 }; 840 840 841 841 struct bpf_shtab { ··· 910 910 * is okay since it's going away only after RCU grace period. 911 911 * However, we need to check whether it's still present. 
912 912 */ 913 - raw_spin_lock_bh(&bucket->lock); 913 + spin_lock_bh(&bucket->lock); 914 914 elem_probe = sock_hash_lookup_elem_raw(&bucket->head, elem->hash, 915 915 elem->key, map->key_size); 916 916 if (elem_probe && elem_probe == elem) { ··· 918 918 sock_map_unref(elem->sk, elem); 919 919 sock_hash_free_elem(htab, elem); 920 920 } 921 - raw_spin_unlock_bh(&bucket->lock); 921 + spin_unlock_bh(&bucket->lock); 922 922 } 923 923 924 924 static long sock_hash_delete_elem(struct bpf_map *map, void *key) ··· 932 932 hash = sock_hash_bucket_hash(key, key_size); 933 933 bucket = sock_hash_select_bucket(htab, hash); 934 934 935 - raw_spin_lock_bh(&bucket->lock); 935 + spin_lock_bh(&bucket->lock); 936 936 elem = sock_hash_lookup_elem_raw(&bucket->head, hash, key, key_size); 937 937 if (elem) { 938 938 hlist_del_rcu(&elem->node); ··· 940 940 sock_hash_free_elem(htab, elem); 941 941 ret = 0; 942 942 } 943 - raw_spin_unlock_bh(&bucket->lock); 943 + spin_unlock_bh(&bucket->lock); 944 944 return ret; 945 945 } 946 946 ··· 1000 1000 hash = sock_hash_bucket_hash(key, key_size); 1001 1001 bucket = sock_hash_select_bucket(htab, hash); 1002 1002 1003 - raw_spin_lock_bh(&bucket->lock); 1003 + spin_lock_bh(&bucket->lock); 1004 1004 elem = sock_hash_lookup_elem_raw(&bucket->head, hash, key, key_size); 1005 1005 if (elem && flags == BPF_NOEXIST) { 1006 1006 ret = -EEXIST; ··· 1026 1026 sock_map_unref(elem->sk, elem); 1027 1027 sock_hash_free_elem(htab, elem); 1028 1028 } 1029 - raw_spin_unlock_bh(&bucket->lock); 1029 + spin_unlock_bh(&bucket->lock); 1030 1030 return 0; 1031 1031 out_unlock: 1032 - raw_spin_unlock_bh(&bucket->lock); 1032 + spin_unlock_bh(&bucket->lock); 1033 1033 sk_psock_put(sk, psock); 1034 1034 out_free: 1035 1035 sk_psock_free_link(link); ··· 1115 1115 1116 1116 for (i = 0; i < htab->buckets_num; i++) { 1117 1117 INIT_HLIST_HEAD(&htab->buckets[i].head); 1118 - raw_spin_lock_init(&htab->buckets[i].lock); 1118 + spin_lock_init(&htab->buckets[i].lock); 1119 1119 } 1120 
1120 1121 1121 return &htab->map; ··· 1147 1147 * exists, psock exists and holds a ref to socket. That 1148 1148 * lets us to grab a socket ref too. 1149 1149 */ 1150 - raw_spin_lock_bh(&bucket->lock); 1150 + spin_lock_bh(&bucket->lock); 1151 1151 hlist_for_each_entry(elem, &bucket->head, node) 1152 1152 sock_hold(elem->sk); 1153 1153 hlist_move_list(&bucket->head, &unlink_list); 1154 - raw_spin_unlock_bh(&bucket->lock); 1154 + spin_unlock_bh(&bucket->lock); 1155 1155 1156 1156 /* Process removed entries out of atomic context to 1157 1157 * block for socket lock before deleting the psock's
+13 -9
net/xdp/xsk.c
··· 602 602 603 603 for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) { 604 604 if (unlikely(i >= MAX_SKB_FRAGS)) 605 - return ERR_PTR(-EFAULT); 605 + return ERR_PTR(-EOVERFLOW); 606 606 607 607 page = pool->umem->pgs[addr >> PAGE_SHIFT]; 608 608 get_page(page); ··· 655 655 skb_put(skb, len); 656 656 657 657 err = skb_store_bits(skb, 0, buffer, len); 658 - if (unlikely(err)) 658 + if (unlikely(err)) { 659 + kfree_skb(skb); 659 660 goto free_err; 661 + } 660 662 } else { 661 663 int nr_frags = skb_shinfo(skb)->nr_frags; 662 664 struct page *page; 663 665 u8 *vaddr; 664 666 665 667 if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) { 666 - err = -EFAULT; 668 + err = -EOVERFLOW; 667 669 goto free_err; 668 670 } 669 671 ··· 692 690 return skb; 693 691 694 692 free_err: 695 - if (err == -EAGAIN) { 696 - xsk_cq_cancel_locked(xs, 1); 697 - } else { 698 - xsk_set_destructor_arg(skb); 699 - xsk_drop_skb(skb); 693 + if (err == -EOVERFLOW) { 694 + /* Drop the packet */ 695 + xsk_set_destructor_arg(xs->skb); 696 + xsk_drop_skb(xs->skb); 700 697 xskq_cons_release(xs->tx); 698 + } else { 699 + /* Let application retry */ 700 + xsk_cq_cancel_locked(xs, 1); 701 701 } 702 702 703 703 return ERR_PTR(err); ··· 742 738 skb = xsk_build_skb(xs, &desc); 743 739 if (IS_ERR(skb)) { 744 740 err = PTR_ERR(skb); 745 - if (err == -EAGAIN) 741 + if (err != -EOVERFLOW) 746 742 goto out; 747 743 err = 0; 748 744 continue;
+3
net/xdp/xsk_diag.c
··· 111 111 sock_diag_save_cookie(sk, msg->xdiag_cookie); 112 112 113 113 mutex_lock(&xs->mutex); 114 + if (READ_ONCE(xs->state) == XSK_UNBOUND) 115 + goto out_nlmsg_trim; 116 + 114 117 if ((req->xdiag_show & XDP_SHOW_INFO) && xsk_diag_put_info(xs, nlskb)) 115 118 goto out_nlmsg_trim; 116 119
+28 -28
scripts/bpf_doc.py
···
         Break down helper function protocol into smaller chunks: return type,
         name, distincts arguments.
         """
-        arg_re = re.compile('((\w+ )*?(\w+|...))( (\**)(\w+))?$')
+        arg_re = re.compile(r'((\w+ )*?(\w+|...))( (\**)(\w+))?$')
         res = {}
-        proto_re = re.compile('(.+) (\**)(\w+)\(((([^,]+)(, )?){1,5})\)$')
+        proto_re = re.compile(r'(.+) (\**)(\w+)\(((([^,]+)(, )?){1,5})\)$')
 
         capture = proto_re.match(self.proto)
         res['ret_type'] = capture.group(1)
···
         return Helper(proto=proto, desc=desc, ret=ret)
 
     def parse_symbol(self):
-        p = re.compile(' \* ?(BPF\w+)$')
+        p = re.compile(r' \* ?(BPF\w+)$')
         capture = p.match(self.line)
         if not capture:
             raise NoSyscallCommandFound
-        end_re = re.compile(' \* ?NOTES$')
+        end_re = re.compile(r' \* ?NOTES$')
         end = end_re.match(self.line)
         if end:
             raise NoSyscallCommandFound
···
         # - Same as above, with "const" and/or "struct" in front of type
         # - "..." (undefined number of arguments, for bpf_trace_printk())
         # There is at least one term ("void"), and at most five arguments.
-        p = re.compile(' \* ?((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$')
+        p = re.compile(r' \* ?((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$')
         capture = p.match(self.line)
         if not capture:
             raise NoHelperFound
···
         return capture.group(1)
 
     def parse_desc(self, proto):
-        p = re.compile(' \* ?(?:\t| {5,8})Description$')
+        p = re.compile(r' \* ?(?:\t| {5,8})Description$')
         capture = p.match(self.line)
         if not capture:
             raise Exception("No description section found for " + proto)
···
             if self.line == ' *\n':
                 desc += '\n'
             else:
-                p = re.compile(' \* ?(?:\t| {5,8})(?:\t| {8})(.*)')
+                p = re.compile(r' \* ?(?:\t| {5,8})(?:\t| {8})(.*)')
                 capture = p.match(self.line)
                 if capture:
                     desc_present = True
···
         return desc
 
     def parse_ret(self, proto):
-        p = re.compile(' \* ?(?:\t| {5,8})Return$')
+        p = re.compile(r' \* ?(?:\t| {5,8})Return$')
         capture = p.match(self.line)
         if not capture:
             raise Exception("No return section found for " + proto)
···
             if self.line == ' *\n':
                 ret += '\n'
             else:
-                p = re.compile(' \* ?(?:\t| {5,8})(?:\t| {8})(.*)')
+                p = re.compile(r' \* ?(?:\t| {5,8})(?:\t| {8})(.*)')
                 capture = p.match(self.line)
                 if capture:
                     ret_present = True
···
         self.seek_to('enum bpf_cmd {',
                      'Could not find start of bpf_cmd enum', 0)
         # Searches for either one or more BPF\w+ enums
-        bpf_p = re.compile('\s*(BPF\w+)+')
+        bpf_p = re.compile(r'\s*(BPF\w+)+')
         # Searches for an enum entry assigned to another entry,
         # for e.g. BPF_PROG_RUN = BPF_PROG_TEST_RUN, which is
         # not documented hence should be skipped in check to
         # determine if the right number of syscalls are documented
-        assign_p = re.compile('\s*(BPF\w+)\s*=\s*(BPF\w+)')
+        assign_p = re.compile(r'\s*(BPF\w+)\s*=\s*(BPF\w+)')
         bpf_cmd_str = ''
         while True:
             capture = assign_p.match(self.line)
···
                 break
             self.line = self.reader.readline()
         # Find the number of occurences of BPF\w+
-        self.enum_syscalls = re.findall('(BPF\w+)+', bpf_cmd_str)
+        self.enum_syscalls = re.findall(r'(BPF\w+)+', bpf_cmd_str)
 
     def parse_desc_helpers(self):
         self.seek_to(helpersDocStart,
···
         self.seek_to('#define ___BPF_FUNC_MAPPER(FN, ctx...)',
                      'Could not find start of eBPF helper definition list')
         # Searches for one FN(\w+) define or a backslash for newline
-        p = re.compile('\s*FN\((\w+), (\d+), ##ctx\)|\\\\')
+        p = re.compile(r'\s*FN\((\w+), (\d+), ##ctx\)|\\\\')
         fn_defines_str = ''
         i = 0
         while True:
···
                 break
             self.line = self.reader.readline()
         # Find the number of occurences of FN(\w+)
-        self.define_unique_helpers = re.findall('FN\(\w+, \d+, ##ctx\)', fn_defines_str)
+        self.define_unique_helpers = re.findall(r'FN\(\w+, \d+, ##ctx\)', fn_defines_str)
 
     def validate_helpers(self):
         last_helper = ''
···
         try:
             cmd = ['git', 'log', '-1', '--pretty=format:%cs', '--no-patch',
                    '-L',
-                   '/{}/,/\*\//:include/uapi/linux/bpf.h'.format(delimiter)]
+                   '/{}/,/\\*\\//:include/uapi/linux/bpf.h'.format(delimiter)]
             date = subprocess.run(cmd, cwd=linuxRoot,
                                   capture_output=True, check=True)
             return date.stdout.decode().rstrip()
···
 programs that are compatible with the GNU Privacy License (GPL).
 
 In order to use such helpers, the eBPF program must be loaded with the correct
-license string passed (via **attr**) to the **bpf**\ () system call, and this
+license string passed (via **attr**) to the **bpf**\\ () system call, and this
 generally translates into the C source code of the program containing a line
 similar to the following:
 
···
 * The bpftool utility can be used to probe the availability of helper functions
   on the system (as well as supported program and map types, and a number of
   other parameters). To do so, run **bpftool feature probe** (see
-  **bpftool-feature**\ (8) for details). Add the **unprivileged** keyword to
+  **bpftool-feature**\\ (8) for details). Add the **unprivileged** keyword to
   list features available to unprivileged users.
 
 Compatibility between helper functions and program types can generally be found
···
 requirement for GPL license is also in those **struct bpf_func_proto**.
 
 Compatibility between helper functions and map types can be found in the
-**check_map_func_compatibility**\ () function in file *kernel/bpf/verifier.c*.
+**check_map_func_compatibility**\\ () function in file *kernel/bpf/verifier.c*.
 
 Helper functions that invalidate the checks on **data** and **data_end**
 pointers for network processing are listed in function
-**bpf_helper_changes_pkt_data**\ () in file *net/core/filter.c*.
+**bpf_helper_changes_pkt_data**\\ () in file *net/core/filter.c*.
 
 SEE ALSO
 ========
 
-**bpf**\ (2),
-**bpftool**\ (8),
-**cgroups**\ (7),
-**ip**\ (8),
-**perf_event_open**\ (2),
-**sendmsg**\ (2),
-**socket**\ (7),
-**tc-bpf**\ (8)'''
+**bpf**\\ (2),
+**bpftool**\\ (8),
+**cgroups**\\ (7),
+**ip**\\ (8),
+**perf_event_open**\\ (2),
+**sendmsg**\\ (2),
+**socket**\\ (7),
+**tc-bpf**\\ (8)'''
         print(footer)
 
     def print_proto(self, helper):
···
             one_arg = '{}{}'.format(comma, a['type'])
             if a['name']:
                 if a['star']:
-                    one_arg += ' {}**\ '.format(a['star'].replace('*', '\\*'))
+                    one_arg += ' {}**\\ '.format(a['star'].replace('*', '\\*'))
                 else:
                     one_arg += '** '
                 one_arg += '*{}*\\ **'.format(a['name'])
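The bpf_doc.py changes above are mechanical: each pattern string gains an `r` prefix (or a doubled backslash) so that sequences like `\w` and `\*` stop being parsed as invalid string escapes. A small standalone sketch of the warning being silenced; the `const char *fmt` prototype fragment is a made-up input for illustration, not taken from bpf.h:

```python
import re
import warnings

# Compiling source that spells '\w' inside a plain string literal triggers
# Python's invalid-escape warning; the same pattern as a raw string does not.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    compile("r'\\w+'", "<demo>", "eval")   # source: r'\w+'  -> silent
    compile("'\\w+'", "<demo>", "eval")    # source: '\w+'   -> warns
flagged = any("invalid escape sequence" in str(w.message) for w in caught)
print(flagged)

# The converted regexes match exactly as before, e.g. the argument matcher:
arg_re = re.compile(r'((\w+ )*?(\w+|...))( (\**)(\w+))?$')
m = arg_re.match('const char *fmt')
print(m.group(1), m.group(5), m.group(6))
```

Up to Python 3.11 such literals raise DeprecationWarning; since 3.12 it is a SyntaxWarning, slated to eventually become a hard error, which is why the raw-string conversion matters.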
+1 -1
tools/bpf/bpftool/link.c
···
 #define perf_event_name(array, id) ({ \
 	const char *event_str = NULL; \
 \
-	if ((id) >= 0 && (id) < ARRAY_SIZE(array)) \
+	if ((id) < ARRAY_SIZE(array)) \
 		event_str = array[id]; \
 	event_str; \
 })
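The dropped `(id) >= 0` test is what `-Wtype-limits` complains about: `id` has an unsigned type in the callers, so the comparison is always true, and the `< ARRAY_SIZE(array)` bound already rejects any wrapped-around value. A hedged sketch of that reasoning, modelling a C `unsigned int` with Python's ctypes (the array size here is a stand-in, not bpftool's real table):

```python
import ctypes

ARRAY_SIZE = 16                       # stand-in bound, not the real array
wrapped = ctypes.c_uint(-1).value     # what (unsigned int)-1 becomes in C
print(wrapped)                        # 2**32 - 1 for a 32-bit unsigned int
print(wrapped >= 0)                   # always True: the check GCC flagged
print(wrapped < ARRAY_SIZE)           # False: the upper bound alone suffices
```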
+12
tools/testing/selftests/bpf/Makefile
···
 	test_cgroup_storage \
 	test_tcpnotify_user test_sysctl \
 	test_progs-no_alu32
+TEST_INST_SUBDIRS := no_alu32
 
 # Also test bpf-gcc, if present
 ifneq ($(BPF_GCC),)
 TEST_GEN_PROGS += test_progs-bpf_gcc
+TEST_INST_SUBDIRS += bpf_gcc
 endif
 
 ifneq ($(CLANG_CPUV4),)
 TEST_GEN_PROGS += test_progs-cpuv4
+TEST_INST_SUBDIRS += cpuv4
 endif
 
 TEST_GEN_FILES = test_lwt_ip_encap.bpf.o test_tc_edt.bpf.o
···
 
 # Delete partially updated (corrupted) files on error
 .DELETE_ON_ERROR:
+
+DEFAULT_INSTALL_RULE := $(INSTALL_RULE)
+override define INSTALL_RULE
+	$(DEFAULT_INSTALL_RULE)
+	@for DIR in $(TEST_INST_SUBDIRS); do \
+		mkdir -p $(INSTALL_PATH)/$$DIR; \
+		rsync -a $(OUTPUT)/$$DIR/*.bpf.o $(INSTALL_PATH)/$$DIR;\
+	done
+endef
+3 -2
tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c
···
 #include <linux/unistd.h>
 #include <linux/mount.h>
 #include <sys/syscall.h>
+#include "bpf/libbpf_internal.h"
 
 static inline int sys_fsopen(const char *fsname, unsigned flags)
 {
···
 	ASSERT_OK(err, "obj_pin");
 
 	/* cleanup */
-	if (pin_opts.path_fd >= 0)
+	if (path_kind == PATH_FD_REL && pin_opts.path_fd >= 0)
 		close(pin_opts.path_fd);
 	if (old_cwd[0])
 		ASSERT_OK(chdir(old_cwd), "restore_cwd");
···
 		goto cleanup;
 
 	/* cleanup */
-	if (get_opts.path_fd >= 0)
+	if (path_kind == PATH_FD_REL && get_opts.path_fd >= 0)
 		close(get_opts.path_fd);
 	if (old_cwd[0])
 		ASSERT_OK(chdir(old_cwd), "restore_cwd");
+18 -1
tools/testing/selftests/bpf/prog_tests/d_path.c
···
 #include "test_d_path_check_rdonly_mem.skel.h"
 #include "test_d_path_check_types.skel.h"
 
+/* sys_close_range is not around for long time, so let's
+ * make sure we can call it on systems with older glibc
+ */
+#ifndef __NR_close_range
+#ifdef __alpha__
+#define __NR_close_range 546
+#else
+#define __NR_close_range 436
+#endif
+#endif
+
 static int duration;
 
 static struct {
···
 	fstat(indicatorfd, &fileStat);
 
 out_close:
-	/* triggers filp_close */
+	/* sys_close no longer triggers filp_close, but we can
+	 * call sys_close_range instead which still does
+	 */
+#define close(fd) syscall(__NR_close_range, fd, fd, 0)
+
 	close(pipefd[0]);
 	close(pipefd[1]);
 	close(sockfd);
···
 	close(devfd);
 	close(localfd);
 	close(indicatorfd);
+
+#undef close
 	return ret;
 }
 
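The `#define close(fd) syscall(__NR_close_range, fd, fd, 0)` wrapper in the hunk above routes each cleanup `close()` through `close_range(2)`, which per the patch comment still triggers `filp_close` where plain `sys_close` no longer does. The same single-descriptor range call can be sketched in Python; `os.close_range` wraps the syscall from Python 3.10 on, and the fallback branch is an assumption for older interpreters:

```python
import errno
import os

def close_one(fd):
    """Close a single fd via close_range(fd, fd), like the test's macro."""
    if hasattr(os, "close_range"):   # wraps close_range(2) on Python >= 3.10
        os.close_range(fd, fd)
    else:                            # plain close(2) on older interpreters
        os.close(fd)

r, w = os.pipe()
close_one(r)
close_one(w)
try:
    os.close(w)                      # proves the descriptor really is gone
except OSError as e:
    print(e.errno == errno.EBADF)
```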