Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

docs/bpf: Add description of register liveness tracking algorithm

This is a followup for [1], adds an overview for the register liveness
tracking, covers the following points:
- why register liveness tracking is useful;
- how register parentage chains are constructed;
- how liveness marks are applied using the parentage chains.

[1] https://lore.kernel.org/bpf/CAADnVQKs2i1iuZ5SUGuJtxWVfGYR9kDgYKhq3rNV+kBLQCu7rA@mail.gmail.com/

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/r/20230202125713.821931-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

authored by

Eduard Zingerman and committed by
Alexei Starovoitov
cb601848 354bb4a0

+295
+295
Documentation/bpf/verifier.rst
··· 316 316 registers it may hold). They must all be safe for the branch to be pruned. 317 317 This is implemented in states_equal(). 318 318 319 + Some technical details about state pruning implementation could be found below. 320 + 321 + Register liveness tracking 322 + -------------------------- 323 + 324 + In order to make state pruning effective, liveness state is tracked for each 325 + register and stack slot. The basic idea is to track which registers and stack 326 + slots are actually used during subseqeuent execution of the program, until 327 + program exit is reached. Registers and stack slots that were never used could be 328 + removed from the cached state thus making more states equivalent to a cached 329 + state. This could be illustrated by the following program:: 330 + 331 + 0: call bpf_get_prandom_u32() 332 + 1: r1 = 0 333 + 2: if r0 == 0 goto +1 334 + 3: r0 = 1 335 + --- checkpoint --- 336 + 4: r0 = r1 337 + 5: exit 338 + 339 + Suppose that a state cache entry is created at instruction #4 (such entries are 340 + also called "checkpoints" in the text below). The verifier could reach the 341 + instruction with one of two possible register states: 342 + 343 + * r0 = 1, r1 = 0 344 + * r0 = 0, r1 = 0 345 + 346 + However, only the value of register ``r1`` is important to successfully finish 347 + verification. The goal of the liveness tracking algorithm is to spot this fact 348 + and figure out that both states are actually equivalent. 349 + 350 + Data structures 351 + ~~~~~~~~~~~~~~~ 352 + 353 + Liveness is tracked using the following data structures:: 354 + 355 + enum bpf_reg_liveness { 356 + REG_LIVE_NONE = 0, 357 + REG_LIVE_READ32 = 0x1, 358 + REG_LIVE_READ64 = 0x2, 359 + REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64, 360 + REG_LIVE_WRITTEN = 0x4, 361 + REG_LIVE_DONE = 0x8, 362 + }; 363 + 364 + struct bpf_reg_state { 365 + ... 366 + struct bpf_reg_state *parent; 367 + ... 368 + enum bpf_reg_liveness live; 369 + ... 370 + }; 371 + 372 + struct bpf_stack_state { 373 + struct bpf_reg_state spilled_ptr; 374 + ... 375 + }; 376 + 377 + struct bpf_func_state { 378 + struct bpf_reg_state regs[MAX_BPF_REG]; 379 + ... 380 + struct bpf_stack_state *stack; 381 + } 382 + 383 + struct bpf_verifier_state { 384 + struct bpf_func_state *frame[MAX_CALL_FRAMES]; 385 + struct bpf_verifier_state *parent; 386 + ... 387 + } 388 + 389 + * ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new 390 + verifier state creation; 391 + 392 + * ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is 393 + defined by some instruction verified between this verifier state's parent and 394 + verifier state itself; 395 + 396 + * ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot) 397 + is read by a some child state of this verifier state; 398 + 399 + * ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid 400 + processing same verifier state multiple times and for some sanity checks; 401 + 402 + * ``->live`` field values are formed by combining ``enum bpf_reg_liveness`` 403 + values using bitwise or. 404 + 405 + Register parentage chains 406 + ~~~~~~~~~~~~~~~~~~~~~~~~~ 407 + 408 + In order to propagate information between parent and child states, a *register 409 + parentage chain* is established. Each register or stack slot is linked to a 410 + corresponding register or stack slot in its parent state via a ``->parent`` 411 + pointer. This link is established upon state creation in ``is_state_visited()`` 412 + and might be modified by ``set_callee_state()`` called from 413 + ``__check_func_call()``. 414 + 415 + The rules for correspondence between registers / stack slots are as follows: 416 + 417 + * For the current stack frame, registers and stack slots of the new state are 418 + linked to the registers and stack slots of the parent state with the same 419 + indices. 420 + 421 + * For the outer stack frames, only caller saved registers (r6-r9) and stack 422 + slots are linked to the registers and stack slots of the parent state with the 423 + same indices. 424 + 425 + * When function call is processed a new ``struct bpf_func_state`` instance is 426 + allocated, it encapsulates a new set of registers and stack slots. For this 427 + new frame, parent links for r6-r9 and stack slots are set to nil, parent links 428 + for r1-r5 are set to match caller r1-r5 parent links. 429 + 430 + This could be illustrated by the following diagram (arrows stand for 431 + ``->parent`` pointers):: 432 + 433 + ... ; Frame #0, some instructions 434 + --- checkpoint #0 --- 435 + 1 : r6 = 42 ; Frame #0 436 + --- checkpoint #1 --- 437 + 2 : call foo() ; Frame #0 438 + ... ; Frame #1, instructions from foo() 439 + --- checkpoint #2 --- 440 + ... ; Frame #1, instructions from foo() 441 + --- checkpoint #3 --- 442 + exit ; Frame #1, return from foo() 443 + 3 : r1 = r6 ; Frame #0 <- current state 444 + 445 + +-------------------------------+-------------------------------+ 446 + | Frame #0 | Frame #1 | 447 + Checkpoint +-------------------------------+-------------------------------+ 448 + #0 | r0 | r1-r5 | r6-r9 | fp-8 ... | 449 + +-------------------------------+ 450 + ^ ^ ^ ^ 451 + | | | | 452 + Checkpoint +-------------------------------+ 453 + #1 | r0 | r1-r5 | r6-r9 | fp-8 ... | 454 + +-------------------------------+ 455 + ^ ^ ^ 456 + |_______|_______|_______________ 457 + | | | 458 + nil nil | | | nil nil 459 + | | | | | | | 460 + Checkpoint +-------------------------------+-------------------------------+ 461 + #2 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... | 462 + +-------------------------------+-------------------------------+ 463 + ^ ^ ^ ^ ^ 464 + nil nil | | | | | 465 + | | | | | | | 466 + Checkpoint +-------------------------------+-------------------------------+ 467 + #3 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... | 468 + +-------------------------------+-------------------------------+ 469 + ^ ^ 470 + nil nil | | 471 + | | | | 472 + Current +-------------------------------+ 473 + state | r0 | r1-r5 | r6-r9 | fp-8 ... | 474 + +-------------------------------+ 475 + \ 476 + r6 read mark is propagated via these links 477 + all the way up to checkpoint #1. 478 + The checkpoint #1 contains a write mark for r6 479 + because of instruction (1), thus read propagation 480 + does not reach checkpoint #0 (see section below). 481 + 482 + Liveness marks tracking 483 + ~~~~~~~~~~~~~~~~~~~~~~~ 484 + 485 + For each processed instruction, the verifier tracks read and written registers 486 + and stack slots. The main idea of the algorithm is that read marks propagate 487 + back along the state parentage chain until they hit a write mark, which 'screens 488 + off' earlier states from the read. The information about reads is propagated by 489 + function ``mark_reg_read()`` which could be summarized as follows:: 490 + 491 + mark_reg_read(struct bpf_reg_state *state, ...): 492 + parent = state->parent 493 + while parent: 494 + if state->live & REG_LIVE_WRITTEN: 495 + break 496 + if parent->live & REG_LIVE_READ64: 497 + break 498 + parent->live |= REG_LIVE_READ64 499 + state = parent 500 + parent = state->parent 501 + 502 + Notes: 503 + 504 + * The read marks are applied to the **parent** state while write marks are 505 + applied to the **current** state. The write mark on a register or stack slot 506 + means that it is updated by some instruction in the straight-line code leading 507 + from the parent state to the current state. 508 + 509 + * Details about REG_LIVE_READ32 are omitted. 510 + 511 + * Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`) 512 + might override the first parent link. Please refer to the comments in the 513 + ``propagate_liveness()`` and ``mark_reg_read()`` source code for further 514 + details. 515 + 516 + Because stack writes could have different sizes ``REG_LIVE_WRITTEN`` marks are 517 + applied conservatively: stack slots are marked as written only if write size 518 + corresponds to the size of the register, e.g. see function ``save_register_state()``. 519 + 520 + Consider the following example:: 521 + 522 + 0: (*u64)(r10 - 8) = 0 ; define 8 bytes of fp-8 523 + --- checkpoint #0 --- 524 + 1: (*u32)(r10 - 8) = 1 ; redefine lower 4 bytes 525 + 2: r1 = (*u32)(r10 - 8) ; read lower 4 bytes defined at (1) 526 + 3: r2 = (*u32)(r10 - 4) ; read upper 4 bytes defined at (0) 527 + 528 + As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should 529 + it be otherwise, the algorithm above wouldn't be able to propagate the read mark 530 + from (3) to checkpoint #0. 531 + 532 + Once the ``BPF_EXIT`` instruction is reached ``update_branch_counts()`` is 533 + called to update the ``->branches`` counter for each verifier state in a chain 534 + of parent verifier states. When the ``->branches`` counter reaches zero the 535 + verifier state becomes a valid entry in a set of cached verifier states. 536 + 537 + Each entry of the verifier states cache is post-processed by a function 538 + ``clean_live_states()``. This function marks all registers and stack slots 539 + without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``. 540 + Registers/stack slots marked in this way are ignored in function ``stacksafe()`` 541 + called from ``states_equal()`` when a state cache entry is considered for 542 + equivalence with a current state. 543 + 544 + Now it is possible to explain how the example from the beginning of the section 545 + works:: 546 + 547 + 0: call bpf_get_prandom_u32() 548 + 1: r1 = 0 549 + 2: if r0 == 0 goto +1 550 + 3: r0 = 1 551 + --- checkpoint[0] --- 552 + 4: r0 = r1 553 + 5: exit 554 + 555 + * At instruction #2 branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }`` 556 + is pushed to states processing queue (pc stands for program counter). 557 + 558 + * At instruction #4: 559 + 560 + * ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``; 561 + * ``checkpoint[0].r0`` is marked as written; 562 + * ``checkpoint[0].r1`` is marked as read; 563 + 564 + * At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed 565 + by ``clean_live_states()``. After this processing ``checkpoint[0].r0`` has a 566 + read mark and all other registers and stack slots are marked as ``NOT_INIT`` 567 + or ``STACK_INVALID`` 568 + 569 + * The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue 570 + and is compared against a cached state ``{ r1 == 0, pc == 4 }``, the states 571 + are considered equivalent. 572 + 573 + .. _read_marks_for_cache_hits: 574 + 575 + Read marks propagation for cache hits 576 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 577 + 578 + Another point is the handling of read marks when a previously verified state is 579 + found in the states cache. Upon cache hit verifier must behave in the same way 580 + as if the current state was verified to the program exit. This means that all 581 + read marks, present on registers and stack slots of the cached state, must be 582 + propagated over the parentage chain of the current state. Example below shows 583 + why this is important. Function ``propagate_liveness()`` handles this case. 584 + 585 + Consider the following state parentage chain (S is a starting state, A-E are 586 + derived states, -> arrows show which state is derived from which):: 587 + 588 + r1 read 589 + <------------- A[r1] == 0 590 + C[r1] == 0 591 + S ---> A ---> B ---> exit E[r1] == 1 592 + | 593 + ` ---> C ---> D 594 + | 595 + ` ---> E ^ 596 + |___ suppose all these 597 + ^ states are at insn #Y 598 + | 599 + suppose all these 600 + states are at insn #X 601 + 602 + * Chain of states ``S -> A -> B -> exit`` is verified first. 603 + 604 + * While ``B -> exit`` is verified, register ``r1`` is read and this read mark is 605 + propagated up to state ``A``. 606 + 607 + * When chain of states ``C -> D`` is verified the state ``D`` turns out to be 608 + equivalent to state ``B``. 609 + 610 + * The read mark for ``r1`` has to be propagated to state ``C``, otherwise state 611 + ``C`` might get mistakenly marked as equivalent to state ``E`` even though 612 + values for register ``r1`` differ between ``C`` and ``E``. 613 + 319 614 Understanding eBPF verifier messages 320 615 ==================================== 321 616