Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

docs: networking: convert filter.txt to ReST

- add SPDX header;
- adjust title markup;
- mark code blocks and literals as such;
- use footnote markup;
- mark tables as such;
- adjust indentation, whitespaces and blank lines;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

authored by

Mauro Carvalho Chehab and committed by
David S. Miller
cb3f0d56 aee11342

+484 -377
+2 -2
Documentation/bpf/index.rst
··· 7 7 8 8 This kernel side documentation is still work in progress. The main 9 9 textual documentation is (for historical reasons) described in 10 - `Documentation/networking/filter.txt`_, which describe both classical 10 + `Documentation/networking/filter.rst`_, which describe both classical 11 11 and extended BPF instruction-set. 12 12 The Cilium project also maintains a `BPF and XDP Reference Guide`_ 13 13 that goes into great technical depth about the BPF Architecture. ··· 59 59 60 60 61 61 .. Links: 62 - .. _Documentation/networking/filter.txt: ../networking/filter.txt 62 + .. _Documentation/networking/filter.rst: ../networking/filter.txt 63 63 .. _man-pages: https://www.kernel.org/doc/man-pages/ 64 64 .. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html 65 65 .. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/
+477 -371
Documentation/networking/filter.txt Documentation/networking/filter.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ======================================================= 1 4 Linux Socket Filtering aka Berkeley Packet Filter (BPF) 2 5 ======================================================= 3 6 ··· 45 42 46 43 Although we were only speaking about sockets here, BPF in Linux is used 47 44 in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel 48 - qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places 45 + qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places 49 46 such as team driver, PTP code, etc where BPF is being used. 50 47 51 - [1] Documentation/userspace-api/seccomp_filter.rst 48 + .. [1] Documentation/userspace-api/seccomp_filter.rst 52 49 53 50 Original BPF paper: 54 51 ··· 62 59 --------- 63 60 64 61 User space applications include <linux/filter.h> which contains the 65 - following relevant structures: 62 + following relevant structures:: 66 63 67 - struct sock_filter { /* Filter block */ 68 - __u16 code; /* Actual filter code */ 69 - __u8 jt; /* Jump true */ 70 - __u8 jf; /* Jump false */ 71 - __u32 k; /* Generic multiuse field */ 72 - }; 64 + struct sock_filter { /* Filter block */ 65 + __u16 code; /* Actual filter code */ 66 + __u8 jt; /* Jump true */ 67 + __u8 jf; /* Jump false */ 68 + __u32 k; /* Generic multiuse field */ 69 + }; 73 70 74 71 Such a structure is assembled as an array of 4-tuples, that contains 75 72 a code, jt, jf and k value. jt and jf are jump offsets and k a generic 76 - value to be used for a provided code. 73 + value to be used for a provided code:: 77 74 78 - struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ 79 - unsigned short len; /* Number of filter blocks */ 80 - struct sock_filter __user *filter; 81 - }; 75 + struct sock_fprog { /* Required for SO_ATTACH_FILTER. 
*/ 76 + unsigned short len; /* Number of filter blocks */ 77 + struct sock_filter __user *filter; 78 + }; 82 79 83 80 For socket filtering, a pointer to this structure (as shown in 84 81 follow-up example) is being passed to the kernel through setsockopt(2). ··· 86 83 Example 87 84 ------- 88 85 89 - #include <sys/socket.h> 90 - #include <sys/types.h> 91 - #include <arpa/inet.h> 92 - #include <linux/if_ether.h> 93 - /* ... */ 86 + :: 94 87 95 - /* From the example above: tcpdump -i em1 port 22 -dd */ 96 - struct sock_filter code[] = { 97 - { 0x28, 0, 0, 0x0000000c }, 98 - { 0x15, 0, 8, 0x000086dd }, 99 - { 0x30, 0, 0, 0x00000014 }, 100 - { 0x15, 2, 0, 0x00000084 }, 101 - { 0x15, 1, 0, 0x00000006 }, 102 - { 0x15, 0, 17, 0x00000011 }, 103 - { 0x28, 0, 0, 0x00000036 }, 104 - { 0x15, 14, 0, 0x00000016 }, 105 - { 0x28, 0, 0, 0x00000038 }, 106 - { 0x15, 12, 13, 0x00000016 }, 107 - { 0x15, 0, 12, 0x00000800 }, 108 - { 0x30, 0, 0, 0x00000017 }, 109 - { 0x15, 2, 0, 0x00000084 }, 110 - { 0x15, 1, 0, 0x00000006 }, 111 - { 0x15, 0, 8, 0x00000011 }, 112 - { 0x28, 0, 0, 0x00000014 }, 113 - { 0x45, 6, 0, 0x00001fff }, 114 - { 0xb1, 0, 0, 0x0000000e }, 115 - { 0x48, 0, 0, 0x0000000e }, 116 - { 0x15, 2, 0, 0x00000016 }, 117 - { 0x48, 0, 0, 0x00000010 }, 118 - { 0x15, 0, 1, 0x00000016 }, 119 - { 0x06, 0, 0, 0x0000ffff }, 120 - { 0x06, 0, 0, 0x00000000 }, 121 - }; 88 + #include <sys/socket.h> 89 + #include <sys/types.h> 90 + #include <arpa/inet.h> 91 + #include <linux/if_ether.h> 92 + /* ... 
*/ 122 93 123 - struct sock_fprog bpf = { 124 - .len = ARRAY_SIZE(code), 125 - .filter = code, 126 - }; 94 + /* From the example above: tcpdump -i em1 port 22 -dd */ 95 + struct sock_filter code[] = { 96 + { 0x28, 0, 0, 0x0000000c }, 97 + { 0x15, 0, 8, 0x000086dd }, 98 + { 0x30, 0, 0, 0x00000014 }, 99 + { 0x15, 2, 0, 0x00000084 }, 100 + { 0x15, 1, 0, 0x00000006 }, 101 + { 0x15, 0, 17, 0x00000011 }, 102 + { 0x28, 0, 0, 0x00000036 }, 103 + { 0x15, 14, 0, 0x00000016 }, 104 + { 0x28, 0, 0, 0x00000038 }, 105 + { 0x15, 12, 13, 0x00000016 }, 106 + { 0x15, 0, 12, 0x00000800 }, 107 + { 0x30, 0, 0, 0x00000017 }, 108 + { 0x15, 2, 0, 0x00000084 }, 109 + { 0x15, 1, 0, 0x00000006 }, 110 + { 0x15, 0, 8, 0x00000011 }, 111 + { 0x28, 0, 0, 0x00000014 }, 112 + { 0x45, 6, 0, 0x00001fff }, 113 + { 0xb1, 0, 0, 0x0000000e }, 114 + { 0x48, 0, 0, 0x0000000e }, 115 + { 0x15, 2, 0, 0x00000016 }, 116 + { 0x48, 0, 0, 0x00000010 }, 117 + { 0x15, 0, 1, 0x00000016 }, 118 + { 0x06, 0, 0, 0x0000ffff }, 119 + { 0x06, 0, 0, 0x00000000 }, 120 + }; 127 121 128 - sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); 129 - if (sock < 0) 130 - /* ... bail out ... */ 122 + struct sock_fprog bpf = { 123 + .len = ARRAY_SIZE(code), 124 + .filter = code, 125 + }; 131 126 132 - ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); 133 - if (ret < 0) 134 - /* ... bail out ... */ 127 + sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); 128 + if (sock < 0) 129 + /* ... bail out ... */ 135 130 136 - /* ... */ 137 - close(sock); 131 + ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); 132 + if (ret < 0) 133 + /* ... bail out ... */ 134 + 135 + /* ... */ 136 + close(sock); 138 137 139 138 The above example code attaches a socket filter for a PF_PACKET socket 140 139 in order to let all IPv4/IPv6 packets with port 22 pass. 
The rest will ··· 183 178 184 179 The BPF architecture consists of the following basic elements: 185 180 181 + ======= ==================================================== 186 182 Element Description 187 - 183 + ======= ==================================================== 188 184 A 32 bit wide accumulator 189 185 X 32 bit wide X register 190 186 M[] 16 x 32 bit wide misc registers aka "scratch memory 191 - store", addressable from 0 to 15 187 + store", addressable from 0 to 15 188 + ======= ==================================================== 192 189 193 190 A program, that is translated by bpf_asm into "opcodes" is an array that 194 - consists of the following elements (as already mentioned): 191 + consists of the following elements (as already mentioned):: 195 192 196 193 op:16, jt:8, jf:8, k:32 197 194 ··· 208 201 table lists all bpf_asm instructions available resp. what their underlying 209 202 opcodes as defined in linux/filter.h stand for: 210 203 204 + =========== =================== ===================== 211 205 Instruction Addressing mode Description 212 - 206 + =========== =================== ===================== 213 207 ld 1, 2, 3, 4, 12 Load word into A 214 208 ldi 4 Load word into A 215 209 ldh 1, 2 Load half-word into A ··· 249 241 txa Copy X into A 250 242 251 243 ret 4, 11 Return 244 + =========== =================== ===================== 252 245 253 246 The next table shows addressing formats from the 2nd column: 254 247 248 + =============== =================== =============================================== 255 249 Addressing mode Syntax Description 256 - 250 + =============== =================== =============================================== 257 251 0 x/%x Register X 258 252 1 [k] BHW at byte offset k in the packet 259 253 2 [x + k] BHW at the offset X + k in the packet ··· 269 259 10 x/%x,Lt Jump to Lt if predicate is true 270 260 11 a/%a Accumulator A 271 261 12 extension BPF extension 262 + =============== =================== 
=============================================== 272 263 273 264 The Linux kernel also has a couple of BPF extensions that are used along 274 265 with the class of load instructions by "overloading" the k argument with ··· 278 267 279 268 Possible BPF extensions are shown in the following table: 280 269 270 + =================================== ================================================= 281 271 Extension Description 282 - 272 + =================================== ================================================= 283 273 len skb->len 284 274 proto skb->protocol 285 275 type skb->pkt_type ··· 297 285 vlan_avail skb_vlan_tag_present(skb) 298 286 vlan_tpid skb->vlan_proto 299 287 rand prandom_u32() 288 + =================================== ================================================= 300 289 301 290 These extensions can also be prefixed with '#'. 302 291 Examples for low-level BPF: 303 292 304 - ** ARP packets: 293 + **ARP packets**:: 305 294 306 295 ldh [12] 307 296 jne #0x806, drop 308 297 ret #-1 309 298 drop: ret #0 310 299 311 - ** IPv4 TCP packets: 300 + **IPv4 TCP packets**:: 312 301 313 302 ldh [12] 314 303 jne #0x800, drop ··· 318 305 ret #-1 319 306 drop: ret #0 320 307 321 - ** (Accelerated) VLAN w/ id 10: 308 + **(Accelerated) VLAN w/ id 10**:: 322 309 323 310 ld vlan_tci 324 311 jneq #10, drop 325 312 ret #-1 326 313 drop: ret #0 327 314 328 - ** icmp random packet sampling, 1 in 4 315 + **icmp random packet sampling, 1 in 4**: 316 + 329 317 ldh [12] 330 318 jne #0x800, drop 331 319 ldb [23] ··· 338 324 ret #-1 339 325 drop: ret #0 340 326 341 - ** SECCOMP filter example: 327 + **SECCOMP filter example**:: 342 328 343 329 ld [4] /* offsetof(struct seccomp_data, arch) */ 344 330 jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ ··· 359 345 The above example code can be placed into a file (here called "foo"), and 360 346 then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf 361 347 and cls_bpf understands and can directly be 
loaded with. Example with above 362 - ARP code: 348 + ARP code:: 363 349 364 - $ ./bpf_asm foo 365 - 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, 350 + $ ./bpf_asm foo 351 + 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, 366 352 367 - In copy and paste C-like output: 353 + In copy and paste C-like output:: 368 354 369 - $ ./bpf_asm -c foo 370 - { 0x28, 0, 0, 0x0000000c }, 371 - { 0x15, 0, 1, 0x00000806 }, 372 - { 0x06, 0, 0, 0xffffffff }, 373 - { 0x06, 0, 0, 0000000000 }, 355 + $ ./bpf_asm -c foo 356 + { 0x28, 0, 0, 0x0000000c }, 357 + { 0x15, 0, 1, 0x00000806 }, 358 + { 0x06, 0, 0, 0xffffffff }, 359 + { 0x06, 0, 0, 0000000000 }, 374 360 375 361 In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF 376 362 filters that might not be obvious at first, it's good to test filters before ··· 379 365 for testing BPF filters against given pcap files, single stepping through the 380 366 BPF code on the pcap's packets and to do BPF machine register dumps. 381 367 382 - Starting bpf_dbg is trivial and just requires issuing: 368 + Starting bpf_dbg is trivial and just requires issuing:: 383 369 384 - # ./bpf_dbg 370 + # ./bpf_dbg 385 371 386 372 In case input and output do not equal stdin/stdout, bpf_dbg takes an 387 373 alternative stdin source as a first argument, and an alternative stdout ··· 395 381 support (follow-up example commands starting with '>' denote bpf_dbg shell). 396 382 The usual workflow would be to ... 397 383 398 - > load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 384 + * load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 399 385 Loads a BPF filter from standard output of bpf_asm, or transformed via 400 - e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT 386 + e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT 401 387 debugging (next section), this command creates a temporary socket and 402 388 loads the BPF code into the kernel. 
Thus, this will also be useful for 403 389 JIT developers. 404 390 405 - > load pcap foo.pcap 391 + * load pcap foo.pcap 392 + 406 393 Loads standard tcpdump pcap file. 407 394 408 - > run [<n>] 395 + * run [<n>] 396 + 409 397 bpf passes:1 fails:9 410 398 Runs through all packets from a pcap to account how many passes and fails 411 399 the filter will generate. A limit of packets to traverse can be given. 412 400 413 - > disassemble 414 - l0: ldh [12] 415 - l1: jeq #0x800, l2, l5 416 - l2: ldb [23] 417 - l3: jeq #0x1, l4, l5 418 - l4: ret #0xffff 419 - l5: ret #0 401 + * disassemble:: 402 + 403 + l0: ldh [12] 404 + l1: jeq #0x800, l2, l5 405 + l2: ldb [23] 406 + l3: jeq #0x1, l4, l5 407 + l4: ret #0xffff 408 + l5: ret #0 409 + 420 410 Prints out BPF code disassembly. 421 411 422 - > dump 423 - /* { op, jt, jf, k }, */ 424 - { 0x28, 0, 0, 0x0000000c }, 425 - { 0x15, 0, 3, 0x00000800 }, 426 - { 0x30, 0, 0, 0x00000017 }, 427 - { 0x15, 0, 1, 0x00000001 }, 428 - { 0x06, 0, 0, 0x0000ffff }, 429 - { 0x06, 0, 0, 0000000000 }, 412 + * dump:: 413 + 414 + /* { op, jt, jf, k }, */ 415 + { 0x28, 0, 0, 0x0000000c }, 416 + { 0x15, 0, 3, 0x00000800 }, 417 + { 0x30, 0, 0, 0x00000017 }, 418 + { 0x15, 0, 1, 0x00000001 }, 419 + { 0x06, 0, 0, 0x0000ffff }, 420 + { 0x06, 0, 0, 0000000000 }, 421 + 430 422 Prints out C-style BPF code dump. 431 423 432 - > breakpoint 0 433 - breakpoint at: l0: ldh [12] 434 - > breakpoint 1 435 - breakpoint at: l1: jeq #0x800, l2, l5 424 + * breakpoint 0:: 425 + 426 + breakpoint at: l0: ldh [12] 427 + 428 + * breakpoint 1:: 429 + 430 + breakpoint at: l1: jeq #0x800, l2, l5 431 + 436 432 ... 433 + 437 434 Sets breakpoints at particular BPF instructions. 
Issuing a `run` command 438 435 will walk through the pcap file continuing from the current packet and 439 436 break when a breakpoint is being hit (another `run` will continue from 440 437 the currently active breakpoint executing next instructions): 441 438 442 - > run 443 - -- register dump -- 444 - pc: [0] <-- program counter 445 - code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction 446 - curr: l0: ldh [12] <-- disassembly of current instruction 447 - A: [00000000][0] <-- content of A (hex, decimal) 448 - X: [00000000][0] <-- content of X (hex, decimal) 449 - M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) 450 - -- packet dump -- <-- Current packet from pcap (hex) 451 - len: 42 452 - 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 453 - 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 454 - 32: 00 00 00 00 00 00 0a 3b 01 01 455 - (breakpoint) 456 - > 439 + * run:: 457 440 458 - > breakpoint 459 - breakpoints: 0 1 460 - Prints currently set breakpoints. 441 + -- register dump -- 442 + pc: [0] <-- program counter 443 + code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction 444 + curr: l0: ldh [12] <-- disassembly of current instruction 445 + A: [00000000][0] <-- content of A (hex, decimal) 446 + X: [00000000][0] <-- content of X (hex, decimal) 447 + M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) 448 + -- packet dump -- <-- Current packet from pcap (hex) 449 + len: 42 450 + 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 451 + 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 452 + 32: 00 00 00 00 00 00 0a 3b 01 01 453 + (breakpoint) 454 + > 461 455 462 - > step [-<n>, +<n>] 456 + * breakpoint:: 457 + 458 + breakpoints: 0 1 459 + 460 + Prints currently set breakpoints. 461 + 462 + * step [-<n>, +<n>] 463 + 463 464 Performs single stepping through the BPF program from the current pc 464 465 offset. Thus, on each step invocation, above register dump is issued. 
465 466 This can go forwards and backwards in time, a plain `step` will break 466 467 on the next BPF instruction, thus +1. (No `run` needs to be issued here.) 467 468 468 - > select <n> 469 + * select <n> 470 + 469 471 Selects a given packet from the pcap file to continue from. Thus, on 470 472 the next `run` or `step`, the BPF program is being evaluated against 471 473 the user pre-selected packet. Numbering starts just as in Wireshark 472 474 with index 1. 473 475 474 - > quit 475 - # 476 + * quit 477 + 476 478 Exits bpf_dbg. 477 479 478 480 JIT compiler ··· 498 468 PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through 499 469 CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each 500 470 attached filter from user space or for internal kernel users if it has 501 - been previously enabled by root: 471 + been previously enabled by root:: 502 472 503 473 echo 1 > /proc/sys/net/core/bpf_jit_enable 504 474 505 475 For JIT developers, doing audits etc, each compile run can output the generated 506 - opcode image into the kernel log via: 476 + opcode image into the kernel log via:: 507 477 508 478 echo 2 > /proc/sys/net/core/bpf_jit_enable 509 479 510 - Example output from dmesg: 480 + Example output from dmesg:: 511 481 512 - [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f 513 - [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 514 - [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 515 - [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 516 - [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 517 - [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 482 + [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f 483 + [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 484 + [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 485 
+ [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 486 + [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 487 + [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 518 488 519 489 When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and 520 490 setting any other value than that will return in failure. This is even the case for ··· 523 493 generally recommended approach instead. 524 494 525 495 In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for 526 - generating disassembly out of the kernel log's hexdump: 496 + generating disassembly out of the kernel log's hexdump:: 527 497 528 - # ./bpf_jit_disasm 529 - 70 bytes emitted from JIT compiler (pass:3, flen:6) 530 - ffffffffa0069c8f + <x>: 531 - 0: push %rbp 532 - 1: mov %rsp,%rbp 533 - 4: sub $0x60,%rsp 534 - 8: mov %rbx,-0x8(%rbp) 535 - c: mov 0x68(%rdi),%r9d 536 - 10: sub 0x6c(%rdi),%r9d 537 - 14: mov 0xd8(%rdi),%r8 538 - 1b: mov $0xc,%esi 539 - 20: callq 0xffffffffe0ff9442 540 - 25: cmp $0x800,%eax 541 - 2a: jne 0x0000000000000042 542 - 2c: mov $0x17,%esi 543 - 31: callq 0xffffffffe0ff945e 544 - 36: cmp $0x1,%eax 545 - 39: jne 0x0000000000000042 546 - 3b: mov $0xffff,%eax 547 - 40: jmp 0x0000000000000044 548 - 42: xor %eax,%eax 549 - 44: leaveq 550 - 45: retq 498 + # ./bpf_jit_disasm 499 + 70 bytes emitted from JIT compiler (pass:3, flen:6) 500 + ffffffffa0069c8f + <x>: 501 + 0: push %rbp 502 + 1: mov %rsp,%rbp 503 + 4: sub $0x60,%rsp 504 + 8: mov %rbx,-0x8(%rbp) 505 + c: mov 0x68(%rdi),%r9d 506 + 10: sub 0x6c(%rdi),%r9d 507 + 14: mov 0xd8(%rdi),%r8 508 + 1b: mov $0xc,%esi 509 + 20: callq 0xffffffffe0ff9442 510 + 25: cmp $0x800,%eax 511 + 2a: jne 0x0000000000000042 512 + 2c: mov $0x17,%esi 513 + 31: callq 0xffffffffe0ff945e 514 + 36: cmp $0x1,%eax 515 + 39: jne 0x0000000000000042 516 + 3b: mov $0xffff,%eax 517 + 40: jmp 0x0000000000000044 518 + 42: xor %eax,%eax 519 + 44: leaveq 520 + 45: retq 551 521 
552 - Issuing option `-o` will "annotate" opcodes to resulting assembler 553 - instructions, which can be very useful for JIT developers: 522 + Issuing option `-o` will "annotate" opcodes to resulting assembler 523 + instructions, which can be very useful for JIT developers: 554 524 555 - # ./bpf_jit_disasm -o 556 - 70 bytes emitted from JIT compiler (pass:3, flen:6) 557 - ffffffffa0069c8f + <x>: 558 - 0: push %rbp 559 - 55 560 - 1: mov %rsp,%rbp 561 - 48 89 e5 562 - 4: sub $0x60,%rsp 563 - 48 83 ec 60 564 - 8: mov %rbx,-0x8(%rbp) 565 - 48 89 5d f8 566 - c: mov 0x68(%rdi),%r9d 567 - 44 8b 4f 68 568 - 10: sub 0x6c(%rdi),%r9d 569 - 44 2b 4f 6c 570 - 14: mov 0xd8(%rdi),%r8 571 - 4c 8b 87 d8 00 00 00 572 - 1b: mov $0xc,%esi 573 - be 0c 00 00 00 574 - 20: callq 0xffffffffe0ff9442 575 - e8 1d 94 ff e0 576 - 25: cmp $0x800,%eax 577 - 3d 00 08 00 00 578 - 2a: jne 0x0000000000000042 579 - 75 16 580 - 2c: mov $0x17,%esi 581 - be 17 00 00 00 582 - 31: callq 0xffffffffe0ff945e 583 - e8 28 94 ff e0 584 - 36: cmp $0x1,%eax 585 - 83 f8 01 586 - 39: jne 0x0000000000000042 587 - 75 07 588 - 3b: mov $0xffff,%eax 589 - b8 ff ff 00 00 590 - 40: jmp 0x0000000000000044 591 - eb 02 592 - 42: xor %eax,%eax 593 - 31 c0 594 - 44: leaveq 595 - c9 596 - 45: retq 597 - c3 525 + # ./bpf_jit_disasm -o 526 + 70 bytes emitted from JIT compiler (pass:3, flen:6) 527 + ffffffffa0069c8f + <x>: 528 + 0: push %rbp 529 + 55 530 + 1: mov %rsp,%rbp 531 + 48 89 e5 532 + 4: sub $0x60,%rsp 533 + 48 83 ec 60 534 + 8: mov %rbx,-0x8(%rbp) 535 + 48 89 5d f8 536 + c: mov 0x68(%rdi),%r9d 537 + 44 8b 4f 68 538 + 10: sub 0x6c(%rdi),%r9d 539 + 44 2b 4f 6c 540 + 14: mov 0xd8(%rdi),%r8 541 + 4c 8b 87 d8 00 00 00 542 + 1b: mov $0xc,%esi 543 + be 0c 00 00 00 544 + 20: callq 0xffffffffe0ff9442 545 + e8 1d 94 ff e0 546 + 25: cmp $0x800,%eax 547 + 3d 00 08 00 00 548 + 2a: jne 0x0000000000000042 549 + 75 16 550 + 2c: mov $0x17,%esi 551 + be 17 00 00 00 552 + 31: callq 0xffffffffe0ff945e 553 + e8 28 94 ff e0 554 + 36: cmp 
$0x1,%eax 555 + 83 f8 01 556 + 39: jne 0x0000000000000042 557 + 75 07 558 + 3b: mov $0xffff,%eax 559 + b8 ff ff 00 00 560 + 40: jmp 0x0000000000000044 561 + eb 02 562 + 42: xor %eax,%eax 563 + 31 c0 564 + 44: leaveq 565 + c9 566 + 45: retq 567 + c3 598 568 599 569 For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful 600 570 toolchain for developing and testing the kernel's JIT compiler. ··· 693 663 694 664 - Conditional jt/jf targets replaced with jt/fall-through: 695 665 696 - While the original design has constructs such as "if (cond) jump_true; 697 - else jump_false;", they are being replaced into alternative constructs like 698 - "if (cond) jump_true; /* else fall-through */". 666 + While the original design has constructs such as ``if (cond) jump_true; 667 + else jump_false;``, they are being replaced into alternative constructs like 668 + ``if (cond) jump_true; /* else fall-through */``. 699 669 700 670 - Introduces bpf_call insn and register passing convention for zero overhead 701 671 calls from/to other kernel functions: ··· 714 684 a return value of the function. Since R6 - R9 are callee saved, their state 715 685 is preserved across the call. 
716 686 717 - For example, consider three C functions: 687 + For example, consider three C functions:: 718 688 719 - u64 f1() { return (*_f2)(1); } 720 - u64 f2(u64 a) { return f3(a + 1, a); } 721 - u64 f3(u64 a, u64 b) { return a - b; } 689 + u64 f1() { return (*_f2)(1); } 690 + u64 f2(u64 a) { return f3(a + 1, a); } 691 + u64 f3(u64 a, u64 b) { return a - b; } 722 692 723 - GCC can compile f1, f3 into x86_64: 693 + GCC can compile f1, f3 into x86_64:: 724 694 725 - f1: 726 - movl $1, %edi 727 - movq _f2(%rip), %rax 728 - jmp *%rax 729 - f3: 730 - movq %rdi, %rax 731 - subq %rsi, %rax 732 - ret 695 + f1: 696 + movl $1, %edi 697 + movq _f2(%rip), %rax 698 + jmp *%rax 699 + f3: 700 + movq %rdi, %rax 701 + subq %rsi, %rax 702 + ret 733 703 734 - Function f2 in eBPF may look like: 704 + Function f2 in eBPF may look like:: 735 705 736 - f2: 737 - bpf_mov R2, R1 738 - bpf_add R1, 1 739 - bpf_call f3 740 - bpf_exit 706 + f2: 707 + bpf_mov R2, R1 708 + bpf_add R1, 1 709 + bpf_call f3 710 + bpf_exit 741 711 742 - If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and 712 + If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and 743 713 returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to 744 714 be used to call into f2. 745 715 ··· 751 721 752 722 On 64-bit architectures all register map to HW registers one to one. For 753 723 example, x86_64 JIT compiler can map them as ... 724 + 725 + :: 754 726 755 727 R0 - rax 756 728 R1 - rdi ··· 769 737 ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing 770 738 and rbx, r12 - r15 are callee saved. 
771 739 772 - Then the following internal BPF pseudo-program: 740 + Then the following internal BPF pseudo-program:: 773 741 774 742 bpf_mov R6, R1 /* save ctx */ 775 743 bpf_mov R2, 2 ··· 787 755 bpf_add R0, R7 788 756 bpf_exit 789 757 790 - After JIT to x86_64 may look like: 758 + After JIT to x86_64 may look like:: 791 759 792 760 push %rbp 793 761 mov %rsp,%rbp ··· 813 781 leaveq 814 782 retq 815 783 816 - Which is in this example equivalent in C to: 784 + Which is in this example equivalent in C to:: 817 785 818 786 u64 bpf_filter(u64 ctx) 819 787 { 820 - return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); 788 + return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); 821 789 } 822 790 823 791 In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 824 792 arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper 825 - registers and place their return value into '%rax' which is R0 in eBPF. 793 + registers and place their return value into ``%rax`` which is R0 in eBPF. 826 794 Prologue and epilogue are emitted by JIT and are implicit in the 827 795 interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve 828 796 them across the calls as defined by calling convention. 829 797 830 - For example the following program is invalid: 798 + For example the following program is invalid:: 831 799 832 800 bpf_mov R1, 1 833 801 bpf_call foo ··· 846 814 its content is defined by a specific use case. For seccomp register R1 points 847 815 to seccomp_data, for converted BPF filters R1 points to a skb. 848 816 849 - A program, that is translated internally consists of the following elements: 817 + A program, that is translated internally consists of the following elements:: 850 818 851 819 op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 852 820 ··· 856 824 857 825 Internal BPF is a general purpose RISC instruction set. 
Not every register and 858 826 every instruction are used during translation from original BPF to new format. 859 - For example, socket filters are not using 'exclusive add' instruction, but 827 + For example, socket filters are not using ``exclusive add`` instruction, but 860 828 tracing filters may do to maintain counters of events, for example. Register R9 861 829 is not used by socket filters either, but more complex filters may be running 862 830 out of registers and would have to resort to spill/fill to stack. ··· 881 849 882 850 eBPF is reusing most of the opcode encoding from classic to simplify conversion 883 851 of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' 884 - field is divided into three parts: 852 + field is divided into three parts:: 885 853 886 854 +----------------+--------+--------------------+ 887 855 | 4 bits | 1 bit | 3 bits | ··· 891 859 892 860 Three LSB bits store instruction class which is one of: 893 861 894 - Classic BPF classes: eBPF classes: 895 - 862 + =================== =============== 863 + Classic BPF classes eBPF classes 864 + =================== =============== 896 865 BPF_LD 0x00 BPF_LD 0x00 897 866 BPF_LDX 0x01 BPF_LDX 0x01 898 867 BPF_ST 0x02 BPF_ST 0x02 ··· 902 869 BPF_JMP 0x05 BPF_JMP 0x05 903 870 BPF_RET 0x06 BPF_JMP32 0x06 904 871 BPF_MISC 0x07 BPF_ALU64 0x07 872 + =================== =============== 905 873 906 874 When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... 
907 875 908 - BPF_K 0x00 909 - BPF_X 0x08 876 + :: 910 877 911 - * in classic BPF, this means: 878 + BPF_K 0x00 879 + BPF_X 0x08 912 880 913 - BPF_SRC(code) == BPF_X - use register X as source operand 914 - BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand 881 + * in classic BPF, this means:: 915 882 916 - * in eBPF, this means: 883 + BPF_SRC(code) == BPF_X - use register X as source operand 884 + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand 917 885 918 - BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand 919 - BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand 886 + * in eBPF, this means:: 887 + 888 + BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand 889 + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand 920 890 921 891 ... and four MSB bits store operation code. 922 892 923 - If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: 893 + If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:: 924 894 925 895 BPF_ADD 0x00 926 896 BPF_SUB 0x10 ··· 940 904 BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ 941 905 BPF_END 0xd0 /* eBPF only: endianness conversion */ 942 906 943 - If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of: 907 + If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:: 944 908 945 909 BPF_JA 0x00 /* BPF_JMP only */ 946 910 BPF_JEQ 0x10 ··· 970 934 instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: 971 935 dst_reg = dst_reg + src_reg 972 936 973 - Classic BPF wastes the whole BPF_RET class to represent a single 'ret' 937 + Classic BPF wastes the whole BPF_RET class to represent a single ``ret`` 974 938 operation. Classic BPF_RET | BPF_K means copy imm32 into return register 975 939 and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT 976 940 in eBPF means function exit only. 
The eBPF program needs to store return ··· 978 942 BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide 979 943 operands for the comparisons instead. 980 944 981 - For load and store instructions the 8-bit 'code' field is divided as: 945 + For load and store instructions the 8-bit 'code' field is divided as:: 982 946 983 947 +--------+--------+-------------------+ 984 948 | 3 bits | 2 bits | 3 bits | ··· 988 952 989 953 Size modifier is one of ... 990 954 955 + :: 956 + 991 957 BPF_W 0x00 /* word */ 992 958 BPF_H 0x08 /* half word */ 993 959 BPF_B 0x10 /* byte */ 994 960 BPF_DW 0x18 /* eBPF only, double word */ 995 961 996 - ... which encodes size of load/store operation: 962 + ... which encodes size of load/store operation:: 997 963 998 964 B - 1 byte 999 965 H - 2 byte 1000 966 W - 4 byte 1001 967 DW - 8 byte (eBPF only) 1002 968 1003 - Mode modifier is one of: 969 + Mode modifier is one of:: 1004 970 1005 971 BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ 1006 972 BPF_ABS 0x20 ··· 1017 979 1018 980 They had to be carried over from classic to have strong performance of 1019 981 socket filters running in eBPF interpreter. These instructions can only 1020 - be used when interpreter context is a pointer to 'struct sk_buff' and 982 + be used when interpreter context is a pointer to ``struct sk_buff`` and 1021 983 have seven implicit operands. Register R6 is an implicit input that must 1022 984 contain pointer to sk_buff. Register R0 is an implicit output which contains 1023 985 the data fetched from the packet. Registers R1-R5 are scratch registers ··· 1030 992 therefore must preserve this property. src_reg and imm32 fields are 1031 993 explicit inputs to these instructions. 1032 994 1033 - For example: 995 + For example:: 1034 996 1035 997 BPF_IND | BPF_W | BPF_LD means: 1036 998 1037 999 R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) 1038 1000 and R1 - R5 were scratched. 
1039 1001 1040 - Unlike classic BPF instruction set, eBPF has generic load/store operations: 1002 + Unlike classic BPF instruction set, eBPF has generic load/store operations:: 1041 1003 1042 - BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg 1043 - BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 1044 - BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) 1045 - BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg 1046 - BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg 1004 + BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg 1005 + BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 1006 + BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) 1007 + BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg 1008 + BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg 1047 1009 1048 1010 Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and 1049 1011 2 byte atomic increments are not supported. 1050 1012 1051 1013 eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists 1052 - of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single 1014 + of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single 1053 1015 instruction that loads 64-bit immediate value into a dst_reg. 1054 1016 Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads 1055 1017 32-bit immediate value into a register. ··· 1075 1037 (In 'secure' mode verifier will reject any type of pointer arithmetic to make 1076 1038 sure that kernel addresses don't leak to unprivileged users) 1077 1039 1078 - If register was never written to, it's not readable: 1040 + If register was never written to, it's not readable:: 1041 + 1079 1042 bpf_mov R0 = R2 1080 1043 bpf_exit 1044 + 1081 1045 will be rejected, since R2 is unreadable at the start of the program. 
1082 1046 1083 1047 After kernel function call, R1-R5 are reset to unreadable and 1084 1048 R0 has a return type of the function. 1085 1049 1086 1050 Since R6-R9 are callee saved, their state is preserved across the call. 1051 + 1052 + :: 1053 + 1087 1054 bpf_mov R6 = 1 1088 1055 bpf_call foo 1089 1056 bpf_mov R0 = R6 1090 1057 bpf_exit 1058 + 1091 1059 is a correct program. If there was R1 instead of R6, it would have 1092 1060 been rejected. 1093 1061 1094 1062 load/store instructions are allowed only with registers of valid types, which 1095 1063 are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. 1096 - For example: 1064 + For example:: 1065 + 1097 1066 bpf_mov R1 = 1 1098 1067 bpf_mov R2 = 2 1099 1068 bpf_xadd *(u32 *)(R1 + 3) += R2 1100 1069 bpf_exit 1070 + 1101 1071 will be rejected, since R1 doesn't have a valid pointer type at the time of 1102 1072 execution of instruction bpf_xadd. 1103 1073 1104 - At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct bpf_context') 1074 + At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``) 1105 1075 A callback is used to customize verifier to restrict eBPF program access to only 1106 1076 certain fields within ctx structure with specified size and alignment. 1107 1077 1108 - For example, the following insn: 1078 + For example, the following insn:: 1079 + 1109 1080 bpf_ld R0 = *(u32 *)(R6 + 8) 1081 + 1110 1082 intends to load a word from address R6 + 8 and store it into R0 1111 1083 If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know 1112 1084 that offset 8 of size 4 bytes can be accessed for reading, otherwise ··· 1127 1079 1128 1080 The verifier will allow eBPF program to read data from stack only after 1129 1081 it wrote into it. 1082 + 1130 1083 Classic BPF verifier does similar check with M[0-15] memory slots. 
1131 - For example: 1084 + For example:: 1085 + 1132 1086 bpf_ld R0 = *(u32 *)(R10 - 4) 1133 1087 bpf_exit 1088 + 1134 1089 is invalid program. 1135 1090 Though R10 is correct read-only register and has type PTR_TO_STACK 1136 1091 and R10 - 4 is within stack bounds, there were no stores into that location. ··· 1164 1113 ----------------------- 1165 1114 In order to determine the safety of an eBPF program, the verifier must track 1166 1115 the range of possible values in each register and also in each stack slot. 1167 - This is done with 'struct bpf_reg_state', defined in include/linux/ 1116 + This is done with ``struct bpf_reg_state``, defined in include/linux/ 1168 1117 bpf_verifier.h, which unifies tracking of scalar and pointer values. Each 1169 1118 register state has a type, which is either NOT_INIT (the register has not been 1170 1119 written to), SCALAR_VALUE (some value which is not usable as a pointer), or a 1171 1120 pointer type. The types of pointers describe their base, as follows: 1172 - PTR_TO_CTX Pointer to bpf_context. 1173 - CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic 1174 - on these pointers is forbidden. 1175 - PTR_TO_MAP_VALUE Pointer to the value stored in a map element. 1121 + 1122 + 1123 + PTR_TO_CTX 1124 + Pointer to bpf_context. 1125 + CONST_PTR_TO_MAP 1126 + Pointer to struct bpf_map. "Const" because arithmetic 1127 + on these pointers is forbidden. 1128 + PTR_TO_MAP_VALUE 1129 + Pointer to the value stored in a map element. 1176 1130 PTR_TO_MAP_VALUE_OR_NULL 1177 - Either a pointer to a map value, or NULL; map accesses 1178 - (see section 'eBPF maps', below) return this type, 1179 - which becomes a PTR_TO_MAP_VALUE when checked != NULL. 1180 - Arithmetic on these pointers is forbidden. 1181 - PTR_TO_STACK Frame pointer. 1182 - PTR_TO_PACKET skb->data. 1183 - PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden. 1184 - PTR_TO_SOCKET Pointer to struct bpf_sock_ops, implicitly refcounted. 
1131 + Either a pointer to a map value, or NULL; map accesses 1132 + (see section 'eBPF maps', below) return this type, 1133 + which becomes a PTR_TO_MAP_VALUE when checked != NULL. 1134 + Arithmetic on these pointers is forbidden. 1135 + PTR_TO_STACK 1136 + Frame pointer. 1137 + PTR_TO_PACKET 1138 + skb->data. 1139 + PTR_TO_PACKET_END 1140 + skb->data + headlen; arithmetic forbidden. 1141 + PTR_TO_SOCKET 1142 + Pointer to struct bpf_sock_ops, implicitly refcounted. 1185 1143 PTR_TO_SOCKET_OR_NULL 1186 - Either a pointer to a socket, or NULL; socket lookup 1187 - returns this type, which becomes a PTR_TO_SOCKET when 1188 - checked != NULL. PTR_TO_SOCKET is reference-counted, 1189 - so programs must release the reference through the 1190 - socket release function before the end of the program. 1191 - Arithmetic on these pointers is forbidden. 1144 + Either a pointer to a socket, or NULL; socket lookup 1145 + returns this type, which becomes a PTR_TO_SOCKET when 1146 + checked != NULL. PTR_TO_SOCKET is reference-counted, 1147 + so programs must release the reference through the 1148 + socket release function before the end of the program. 1149 + Arithmetic on these pointers is forbidden. 1150 + 1192 1151 However, a pointer may be offset from this base (as a result of pointer 1193 1152 arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable 1194 1153 offset'. The former is used when an exactly-known value (e.g. an immediate 1195 1154 operand) is added to a pointer, while the latter is used for values which are 1196 1155 not exactly known. The variable offset is also used in SCALAR_VALUEs, to track 1197 1156 the range of possible values in the register. 
1157 + 1198 1158 The verifier's knowledge about the variable offset consists of: 1159 + 1199 1160 * minimum and maximum values as unsigned 1200 1161 * minimum and maximum values as signed 1162 + 1201 1163 * knowledge of the values of individual bits, in the form of a 'tnum': a u64 1202 - 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; 1203 - 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both 1204 - mask and value; no bit should ever be 1 in both. For example, if a byte is read 1205 - into a register from memory, the register's top 56 bits are known zero, while 1206 - the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we 1207 - then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; 1208 - 0x1ff), because of potential carries. 1164 + 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; 1165 + 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both 1166 + mask and value; no bit should ever be 1 in both. For example, if a byte is read 1167 + into a register from memory, the register's top 56 bits are known zero, while 1168 + the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we 1169 + then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; 1170 + 0x1ff), because of potential carries. 1209 1171 1210 1172 Besides arithmetic, the register state can also be updated by conditional 1211 1173 branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch ··· 1252 1188 to all copies of the pointer returned from a socket lookup. This has similar 1253 1189 behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but 1254 1190 it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly 1255 - represents a reference to the corresponding 'struct sock'. To ensure that the 1191 + represents a reference to the corresponding ``struct sock``. 
To ensure that the 1256 1192 reference is not leaked, it is imperative to NULL-check the reference and, in 1257 1193 the non-NULL case, pass the valid reference to the socket release function. 1258 1194 ··· 1260 1196 -------------------- 1261 1197 In cls_bpf and act_bpf programs the verifier allows direct access to the packet 1262 1198 data via skb->data and skb->data_end pointers. 1263 - Ex: 1264 - 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ 1265 - 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ 1266 - 3: r5 = r3 1267 - 4: r5 += 14 1268 - 5: if r5 > r4 goto pc+16 1269 - R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp 1270 - 6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ 1199 + Ex:: 1200 + 1201 + 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ 1202 + 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ 1203 + 3: r5 = r3 1204 + 4: r5 += 14 1205 + 5: if r5 > r4 goto pc+16 1206 + R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp 1207 + 6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ 1271 1208 1272 1209 this 2-byte load from the packet is safe to do, since the program author 1273 1210 did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which 1274 1211 means that in the fall-through case the register R3 (which points to skb->data) 1275 1212 has at least 14 directly accessible bytes. The verifier marks it 1276 1213 as R3=pkt(id=0,off=0,r=14). ··· 1280 1215 r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. 1281 1216 Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points 1282 1217 to the packet data, but constant 14 was added to the register, so 1283 1218 it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14) 1284 1219 which is zero bytes. 
1285 1220 1286 - More complex packet access may look like: 1287 - R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp 1288 - 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ 1289 - 7: r4 = *(u8 *)(r3 +12) 1290 - 8: r4 *= 14 1291 - 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ 1292 - 10: r3 += r4 1293 - 11: r2 = r1 1294 - 12: r2 <<= 48 1295 - 13: r2 >>= 48 1296 - 14: r3 += r2 1297 - 15: r2 = r3 1298 - 16: r2 += 8 1299 - 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ 1300 - 18: if r2 > r1 goto pc+2 1301 - R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp 1302 - 19: r1 = *(u8 *)(r3 +4) 1221 + More complex packet access may look like:: 1222 + 1223 + 1224 + R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp 1225 + 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ 1226 + 7: r4 = *(u8 *)(r3 +12) 1227 + 8: r4 *= 14 1228 + 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ 1229 + 10: r3 += r4 1230 + 11: r2 = r1 1231 + 12: r2 <<= 48 1232 + 13: r2 >>= 48 1233 + 14: r3 += r2 1234 + 15: r2 = r3 1235 + 16: r2 += 8 1236 + 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ 1237 + 18: if r2 > r1 goto pc+2 1238 + R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp 1239 + 19: r1 = *(u8 *)(r3 +4) 1240 + 1303 1241 The state of the register R3 is R3=pkt(id=2,off=0,r=8) 1304 - id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some 1242 + id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some 1305 1243 offset within a packet and since the program author did 1306 - 'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). 
1244 + ``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8). 1307 1245 The verifier only allows 'add'/'sub' operations on packet registers. Any other 1308 1246 operation will set the register state to 'SCALAR_VALUE' and it won't be 1309 1247 available for direct packet access. 1310 - Operation 'r3 += rX' may overflow and become less than original skb->data, 1311 - therefore the verifier has to prevent that. So when it sees 'r3 += rX' 1248 + 1249 + Operation ``r3 += rX`` may overflow and become less than original skb->data, 1250 + therefore the verifier has to prevent that. So when it sees ``r3 += rX`` 1312 1251 instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 1313 1252 against skb->data_end will not give us 'range' information, so attempts to read 1314 1253 through the pointer will give "invalid access to packet" error. 1315 - Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is 1254 + 1255 + Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is 1316 1256 R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits 1317 1257 of the register are guaranteed to be zero, and nothing is known about the lower 1318 - 8 bits. After insn 'r4 *= 14' the state becomes 1258 + 8 bits. After insn ``r4 *= 14`` the state becomes 1319 1259 R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit 1320 1260 value by constant 14 will keep upper 52 bits as zero, also the least significant 1321 - bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make 1261 + bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make 1322 1262 R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign 1323 1263 extending. 
This logic is implemented in adjust_reg_min_max_vals() function, 1324 1264 which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice 1325 1265 versa) and adjust_scalar_min_max_vals() for operations on two scalars. 1326 1266 1327 1267 The end result is that bpf program author can access packet directly 1328 - using normal C code as: 1268 + using normal C code as:: 1269 + 1329 1270 void *data = (void *)(long)skb->data; 1330 1271 void *data_end = (void *)(long)skb->data_end; 1331 1272 struct eth_hdr *eth = data; ··· 1339 1268 struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); 1340 1269 1341 1270 if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) 1342 - return 0; 1271 + return 0; 1343 1272 if (eth->h_proto != htons(ETH_P_IP)) 1344 - return 0; 1273 + return 0; 1345 1274 if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) 1346 - return 0; 1275 + return 0; 1347 1276 if (udp->dest == 53 || udp->source == 9) 1348 - ...; 1277 + ...; 1278 + 1349 1279 which makes such programs easier to write compared to LD_ABS insn 1350 1280 and significantly faster. 1351 1281 1352 1282 ··· 1356 1284 and userspace. 
1357 1285 1358 1286 The maps are accessed from user space via BPF syscall, which has commands: 1287 + 1359 1288 - create a map with given type and attributes 1360 - map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size) 1289 + ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)`` 1361 1290 using attr->map_type, attr->key_size, attr->value_size, attr->max_entries 1362 1291 returns process-local file descriptor or negative error 1363 1292 1364 1293 - lookup key in a given map 1365 - err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) 1294 + ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)`` 1366 1295 using attr->map_fd, attr->key, attr->value 1367 1296 returns zero and stores found elem into value or negative error 1368 1297 1369 1298 - create or update key/value pair in a given map 1370 - err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) 1299 + ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)`` 1371 1300 using attr->map_fd, attr->key, attr->value 1372 1301 returns zero or negative error 1373 1302 1374 1303 - find and delete element by key in a given map 1375 - err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) 1304 + ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)`` 1376 1305 using attr->map_fd, attr->key 1377 1306 1378 1307 - to delete map: close(fd) ··· 1385 1312 maps can have different types: hash, array, bloom filter, radix-tree, etc. 1386 1313 1387 1314 The map is defined by: 1388 - . type 1389 - . max number of elements 1390 - . key size in bytes 1391 - . 
value size in bytes 1315 + 1316 + - type 1317 + - max number of elements 1318 + - key size in bytes 1319 + - value size in bytes 1392 1320 1393 1321 Pruning 1394 1322 ------- ··· 1413 1339 The following are few examples of invalid eBPF programs and verifier error 1414 1340 messages as seen in the log: 1415 1341 1416 - Program with unreachable instructions: 1417 - static struct bpf_insn prog[] = { 1342 + Program with unreachable instructions:: 1343 + 1344 + static struct bpf_insn prog[] = { 1418 1345 BPF_EXIT_INSN(), 1419 1346 BPF_EXIT_INSN(), 1420 - }; 1347 + }; 1348 + 1421 1349 Error: 1350 + 1422 1351 unreachable insn 1 1423 1352 1424 - Program that reads uninitialized register: 1353 + Program that reads uninitialized register:: 1354 + 1425 1355 BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), 1426 1356 BPF_EXIT_INSN(), 1427 - Error: 1357 + 1358 + Error:: 1359 + 1428 1360 0: (bf) r0 = r2 1429 1361 R2 !read_ok 1430 1362 1431 - Program that doesn't initialize R0 before exiting: 1363 + Program that doesn't initialize R0 before exiting:: 1364 + 1432 1365 BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), 1433 1366 BPF_EXIT_INSN(), 1434 - Error: 1367 + 1368 + Error:: 1369 + 1435 1370 0: (bf) r2 = r1 1436 1371 1: (95) exit 1437 1372 R0 !read_ok 1438 1373 1439 - Program that accesses stack out of bounds: 1440 - BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), 1441 - BPF_EXIT_INSN(), 1442 - Error: 1443 - 0: (7a) *(u64 *)(r10 +8) = 0 1444 - invalid stack off=8 size=8 1374 + Program that accesses stack out of bounds:: 1445 1375 1446 - Program that doesn't initialize stack before passing its address into function: 1376 + BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), 1377 + BPF_EXIT_INSN(), 1378 + 1379 + Error:: 1380 + 1381 + 0: (7a) *(u64 *)(r10 +8) = 0 1382 + invalid stack off=8 size=8 1383 + 1384 + Program that doesn't initialize stack before passing its address into function:: 1385 + 1447 1386 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 1448 1387 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), 1449 1388 BPF_LD_MAP_FD(BPF_REG_1, 
0), 1450 1389 BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), 1451 1390 BPF_EXIT_INSN(), 1452 - Error: 1391 + 1392 + Error:: 1393 + 1453 1394 0: (bf) r2 = r10 1454 1395 1: (07) r2 += -8 1455 1396 2: (b7) r1 = 0x0 1456 1397 3: (85) call 1 1457 1398 invalid indirect read from stack off -8+0 size 8 1458 1399 1459 - Program that uses invalid map_fd=0 while calling to map_lookup_elem() function: 1400 + Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:: 1401 + 1460 1402 BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), 1461 1403 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 1462 1404 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), 1463 1405 BPF_LD_MAP_FD(BPF_REG_1, 0), 1464 1406 BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), 1465 1407 BPF_EXIT_INSN(), 1466 - Error: 1408 + 1409 + Error:: 1410 + 1467 1411 0: (7a) *(u64 *)(r10 -8) = 0 1468 1412 1: (bf) r2 = r10 1469 1413 2: (07) r2 += -8 ··· 1490 1398 fd 0 is not pointing to valid bpf_map 1491 1399 1492 1400 Program that doesn't check return value of map_lookup_elem() before accessing 1493 - map element: 1401 + map element:: 1402 + 1494 1403 BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), 1495 1404 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 1496 1405 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), ··· 1499 1406 BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), 1500 1407 BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), 1501 1408 BPF_EXIT_INSN(), 1502 - Error: 1409 + 1410 + Error:: 1411 + 1503 1412 0: (7a) *(u64 *)(r10 -8) = 0 1504 1413 1: (bf) r2 = r10 1505 1414 2: (07) r2 += -8 ··· 1511 1416 R0 invalid mem access 'map_value_or_null' 1512 1417 1513 1418 Program that correctly checks map_lookup_elem() returned value for NULL, but 1514 - accesses the memory with incorrect alignment: 1419 + accesses the memory with incorrect alignment:: 1420 + 1515 1421 BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), 1516 1422 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 1517 1423 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), ··· 1521 
1425 BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), 1522 1426 BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), 1523 1427 BPF_EXIT_INSN(), 1524 - Error: 1428 + 1429 + Error:: 1430 + 1525 1431 0: (7a) *(u64 *)(r10 -8) = 0 1526 1432 1: (bf) r2 = r10 1527 1433 2: (07) r2 += -8 ··· 1536 1438 1537 1439 Program that correctly checks map_lookup_elem() returned value for NULL and 1538 1440 accesses memory with correct alignment in one side of 'if' branch, but fails 1539 - to do so in the other side of 'if' branch: 1441 + to do so in the other side of 'if' branch:: 1442 + 1540 1443 BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), 1541 1444 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 1542 1445 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), ··· 1548 1449 BPF_EXIT_INSN(), 1549 1450 BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), 1550 1451 BPF_EXIT_INSN(), 1551 - Error: 1452 + 1453 + Error:: 1454 + 1552 1455 0: (7a) *(u64 *)(r10 -8) = 0 1553 1456 1: (bf) r2 = r10 1554 1457 2: (07) r2 += -8 ··· 1566 1465 R0 invalid mem access 'imm' 1567 1466 1568 1467 Program that performs a socket lookup then sets the pointer to NULL without 1569 - checking it: 1570 - value: 1468 + checking it:: 1469 + 1571 1470 BPF_MOV64_IMM(BPF_REG_2, 0), 1572 1471 BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), 1573 1472 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), ··· 1578 1477 BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), 1579 1478 BPF_MOV64_IMM(BPF_REG_0, 0), 1580 1479 BPF_EXIT_INSN(), 1581 - Error: 1480 + 1481 + Error:: 1482 + 1582 1483 0: (b7) r2 = 0 1583 1484 1: (63) *(u32 *)(r10 -8) = r2 1584 1485 2: (bf) r2 = r10 ··· 1594 1491 Unreleased reference id=1, alloc_insn=7 1595 1492 1596 1493 Program that performs a socket lookup but does not NULL-check the returned 1597 - value: 1494 + value:: 1495 + 1598 1496 BPF_MOV64_IMM(BPF_REG_2, 0), 1599 1497 BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), 1600 1498 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), ··· 1605 1501 BPF_MOV64_IMM(BPF_REG_5, 0), 1606 1502 BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), 1607 1503 BPF_EXIT_INSN(), 1608 - Error: 
1504 + 1505 + Error:: 1506 + 1609 1507 0: (b7) r2 = 0 1610 1508 1: (63) *(u32 *)(r10 -8) = r2 1611 1509 2: (bf) r2 = r10 ··· 1625 1519 Next to the BPF toolchain, the kernel also ships a test module that contains 1626 1520 various test cases for classic and internal BPF that can be executed against 1627 1521 the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and 1628 - enabled via Kconfig: 1522 + enabled via Kconfig:: 1629 1523 1630 1524 CONFIG_TEST_BPF=m 1631 1525 ··· 1646 1540 to give potential BPF hackers or security auditors a better overview of 1647 1541 the underlying architecture. 1648 1542 1649 - Jay Schulist <jschlst@samba.org> 1650 - Daniel Borkmann <daniel@iogearbox.net> 1651 - Alexei Starovoitov <ast@kernel.org> 1543 + - Jay Schulist <jschlst@samba.org> 1544 + - Daniel Borkmann <daniel@iogearbox.net> 1545 + - Alexei Starovoitov <ast@kernel.org>
+1
Documentation/networking/index.rst
··· 56 56 driver 57 57 eql 58 58 fib_trie 59 + filter 59 60 60 61 .. only:: subproject and html 61 62
+1 -1
Documentation/networking/packet_mmap.txt
··· 1051 1051 ------------------------------------------------------------------------------- 1052 1052 1053 1053 - Packet sockets work well together with Linux socket filters, thus you also 1054 - might want to have a look at Documentation/networking/filter.txt 1054 + might want to have a look at Documentation/networking/filter.rst 1055 1055 1056 1056 -------------------------------------------------------------------------------- 1057 1057 + THANKS
+1 -1
MAINTAINERS
··· 3192 3192 T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 3193 3193 T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 3194 3194 F: Documentation/bpf/ 3195 - F: Documentation/networking/filter.txt 3195 + F: Documentation/networking/filter.rst 3196 3196 F: arch/*/net/* 3197 3197 F: include/linux/bpf* 3198 3198 F: include/linux/filter.h
+1 -1
tools/bpf/bpf_asm.c
··· 11 11 * 12 12 * How to get into it: 13 13 * 14 - * 1) read Documentation/networking/filter.txt 14 + * 1) read Documentation/networking/filter.rst 15 15 * 2) Run `bpf_asm [-c] <filter-prog file>` to translate into binary 16 16 * blob that is loadable with xt_bpf, cls_bpf et al. Note: -c will 17 17 * pretty print a C-like construct.
+1 -1
tools/bpf/bpf_dbg.c
··· 13 13 * for making a verdict when multiple simple BPF programs are combined 14 14 * into one in order to prevent parsing same headers multiple times. 15 15 * 16 - * More on how to debug BPF opcodes see Documentation/networking/filter.txt 16 + * More on how to debug BPF opcodes see Documentation/networking/filter.rst 17 17 * which is the main document on BPF. Mini howto for getting started: 18 18 * 19 19 * 1) `./bpf_dbg` to enter the shell (shell cmds denoted with '>'):