Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf, docs: Split the comparison to classic BPF from instruction-set.rst

Split the introductory material that explains eBPF vs classic BPF, and how
eBPF maps to hardware, out of the instruction set specification and into a
standalone document. This duplicates a little bit of information but gives us
a useful reference for the eBPF instruction set that is not encumbered by
classic BPF.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211223101906.977624-3-hch@lst.de

Authored by Christoph Hellwig, committed by Alexei Starovoitov
41db511a fa86aa77

+439 -304
Documentation/bpf/classic_vs_extended.rst (+376)
===================
Classic BPF vs eBPF
===================

eBPF is designed to be JITed with one to one mapping, which can also open up
the possibility for GCC/LLVM compilers to generate optimized eBPF code through
an eBPF backend that performs almost as fast as natively compiled code.

Some core changes of the eBPF format from classic BPF:

- Number of registers increases from 2 to 10:

  The old format had two registers A and X, and a hidden frame pointer. The
  new layout extends this to 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs pass arguments to functions via registers, the
  number of args from an eBPF program to an in-kernel function is restricted
  to 5, and one register is used to accept the return value from an in-kernel
  function. Natively, x86_64 passes the first 6 arguments in registers;
  aarch64/sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6
  callee-saved registers, and aarch64/sparcv9/mips64 have 11 or more
  callee-saved registers.

  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
  etc., and the eBPF calling convention maps directly to ABIs used by the
  kernel on 64-bit architectures.

  On 32-bit architectures a JIT may compile programs that use only 32-bit
  arithmetic and let more complex programs be interpreted.

  R0 - R5 are scratch registers and an eBPF program needs to spill/fill them
  if necessary across calls. Note that there is only one eBPF program (== one
  eBPF main routine); it cannot call other eBPF functions, it can only call
  predefined in-kernel functions.

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters.
  All eBPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit when written to. That behavior
  maps directly to the x86_64 and arm64 subregister definition, but makes
  other JITs more difficult.

  32-bit architectures run 64-bit eBPF programs via the interpreter. Their
  JITs may convert BPF programs that only use 32-bit subregisters into the
  native instruction set and let the rest be interpreted.

  Operation is 64-bit because on 64-bit architectures pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
  so 32-bit eBPF registers would otherwise require a register-pair ABI;
  there would then be no direct eBPF register to HW register mapping, and
  the JIT would need to do combine/split/move operations for every register
  in and out of the function, which is complex, bug prone and slow. Another
  reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as ``if (cond) jump_true;
  else jump_false;``, they are being replaced with alternative constructs
  like ``if (cond) jump_true; /* else fall-through */``.

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  Before an in-kernel function call, the eBPF program needs to place
  function arguments into R1 to R5 registers to satisfy the calling
  convention, then the interpreter will take them from registers and pass
  them to the in-kernel function. If R1 - R5 registers are mapped to CPU
  registers that are used for argument passing on the given architecture,
  the JIT compiler doesn't need to emit extra moves. Function arguments will
  be in the correct registers and the BPF_CALL instruction will be JITed as
  a single 'call' HW instruction.
  This calling convention was picked to cover common call situations without
  performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0
  holds the return value of the function. Since R6 - R9 are callee saved,
  their state is preserved across the call.

  For example, consider three C functions::

    u64 f1() { return (*_f2)(1); }
    u64 f2(u64 a) { return f3(a + 1, a); }
    u64 f3(u64 a, u64 b) { return a - b; }

  GCC can compile f1, f3 into x86_64::

    f1:
        movl $1, %edi
        movq _f2(%rip), %rax
        jmp *%rax
    f3:
        movq %rdi, %rax
        subq %rsi, %rax
        ret

  Function f2 in eBPF may look like::

    f2:
        bpf_mov R2, R1
        bpf_add R1, 1
        bpf_call f3
        bpf_exit

  If f2 is JITed and the pointer stored to ``_f2``, the calls f1 -> f2 -> f3
  and returns will be seamless. Without JIT, the __bpf_prog_run()
  interpreter needs to be used to call into f2.

  For practical reasons all eBPF programs have only one argument 'ctx' which
  is already placed into R1 (e.g. on __bpf_prog_run() startup) and the
  programs can call kernel functions with up to 5 arguments. Calls with 6 or
  more arguments are currently not supported, but these restrictions can be
  lifted if necessary in the future.

  On 64-bit architectures all registers map to HW registers one to one. For
  example, the x86_64 JIT compiler can map them as ...

  ::

    R0  - rax
    R1  - rdi
    R2  - rsi
    R3  - rdx
    R4  - rcx
    R5  - r8
    R6  - rbx
    R7  - r13
    R8  - r14
    R9  - r15
    R10 - rbp

  ... since the x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument
  passing and rbx, r12 - r15 are callee saved.
  Then the following eBPF pseudo-program::

    bpf_mov R6, R1 /* save ctx */
    bpf_mov R2, 2
    bpf_mov R3, 3
    bpf_mov R4, 4
    bpf_mov R5, 5
    bpf_call foo
    bpf_mov R7, R0 /* save foo() return value */
    bpf_mov R1, R6 /* restore ctx for next call */
    bpf_mov R2, 6
    bpf_mov R3, 7
    bpf_mov R4, 8
    bpf_mov R5, 9
    bpf_call bar
    bpf_add R0, R7
    bpf_exit

  after JIT to x86_64 may look like::

    push %rbp
    mov %rsp,%rbp
    sub $0x228,%rsp
    mov %rbx,-0x228(%rbp)
    mov %r13,-0x220(%rbp)
    mov %rdi,%rbx
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq foo
    mov %rax,%r13
    mov %rbx,%rdi
    mov $0x6,%esi
    mov $0x7,%edx
    mov $0x8,%ecx
    mov $0x9,%r8d
    callq bar
    add %r13,%rax
    mov -0x228(%rbp),%rbx
    mov -0x220(%rbp),%r13
    leaveq
    retq

  which in this example is equivalent in C to::

    u64 bpf_filter(u64 ctx)
    {
        return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    }

  In-kernel functions foo() and bar() with prototype u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in the proper
  registers and place their return value into ``%rax``, which is R0 in eBPF.
  Prologue and epilogue are emitted by the JIT and are implicit in the
  interpreter. R0 - R5 are scratch registers, so an eBPF program needs to
  preserve them across calls as defined by the calling convention.

  For example the following program is invalid::

    bpf_mov R1, 1
    bpf_call foo
    bpf_mov R0, R1
    bpf_exit

  After the call the registers R1 - R5 contain junk values and cannot be
  read. An in-kernel verifier (see verifier.rst) is used to validate eBPF
  programs.
Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Original BPF and eBPF are two-operand instruction sets, which
helps to do one-to-one mapping between an eBPF insn and an x86 insn during
JIT.

The input context pointer for invoking the interpreter function is generic;
its content is defined by the specific use case. For seccomp, register R1
points to seccomp_data; for converted BPF filters, R1 points to an skb.

A program that is translated internally consists of the following elements::

  op:16, jt:8, jf:8, k:32  ==>  op:8, dst_reg:4, src_reg:4, off:16, imm:32

So far 87 eBPF instructions have been implemented. The 8-bit 'op' opcode
field has room for new instructions. Some of them may use 16/24/32 byte
encoding. New instructions must be a multiple of 8 bytes to preserve
backward compatibility.

eBPF is a general purpose RISC instruction set. Not every register and every
instruction is used during translation from original BPF to eBPF. For
example, socket filters are not using the ``exclusive add`` instruction, but
tracing filters may do so to maintain event counters. Register R9 is not
used by socket filters either, but more complex filters may run out of
registers and would have to resort to spill/fill to the stack.

eBPF can be used as a generic assembler for last step performance
optimizations; socket filters and seccomp are using it as an assembler.
Tracing filters may use it as an assembler to generate code from the kernel.
In-kernel usage may not be bounded by security considerations, since
generated eBPF code may be optimizing an internal code path without being
exposed to user space. Safety of eBPF can come from the verifier (see
verifier.rst).
In such use cases as described, it may be used as a safe instruction set.

Just like the original BPF, eBPF runs within a controlled environment, is
deterministic and the kernel can easily prove that. The safety of the
program can be determined in two steps: the first step does
depth-first-search to disallow loops and performs other CFG validation; the
second step starts from the first insn and descends all possible paths. It
simulates execution of every insn and observes the state change of registers
and stack.

opcode encoding
===============

eBPF reuses most of the opcode encoding from classic BPF to simplify the
conversion of classic BPF to eBPF.

For arithmetic and jump instructions the 8-bit 'code' field is divided into
three parts::

  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                      (LSB)

Three LSB bits store instruction class which is one of:

  ===================     ===============
  Classic BPF classes     eBPF classes
  ===================     ===============
  BPF_LD    0x00          BPF_LD    0x00
  BPF_LDX   0x01          BPF_LDX   0x01
  BPF_ST    0x02          BPF_ST    0x02
  BPF_STX   0x03          BPF_STX   0x03
  BPF_ALU   0x04          BPF_ALU   0x04
  BPF_JMP   0x05          BPF_JMP   0x05
  BPF_RET   0x06          BPF_JMP32 0x06
  BPF_MISC  0x07          BPF_ALU64 0x07
  ===================     ===============

The 4th bit encodes the source operand ...
::

  BPF_K   0x00
  BPF_X   0x08

* in classic BPF, this means::

    BPF_SRC(code) == BPF_X - use register X as source operand
    BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

* in eBPF, this means::

    BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
    BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

... and four MSB bits store the operation code.

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one
of::

  BPF_ADD   0x00
  BPF_SUB   0x10
  BPF_MUL   0x20
  BPF_DIV   0x30
  BPF_OR    0x40
  BPF_AND   0x50
  BPF_LSH   0x60
  BPF_RSH   0x70
  BPF_NEG   0x80
  BPF_MOD   0x90
  BPF_XOR   0xa0
  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
  BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one
of::

  BPF_JA    0x00  /* BPF_JMP only */
  BPF_JEQ   0x10
  BPF_JGT   0x20
  BPF_JGE   0x30
  BPF_JSET  0x40
  BPF_JNE   0x50  /* eBPF only: jump != */
  BPF_JSGT  0x60  /* eBPF only: signed '>' */
  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
  BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
  BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */

So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF and
eBPF. There are only two registers in classic BPF, so it means A += X. In
eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and the analogous
src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.
Classic BPF uses the BPF_MISC class to represent the A = X and X = A moves.
eBPF uses the BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, class 7 is used as BPF_ALU64 to mean exactly
the same operations as BPF_ALU, but with 64-bit wide operands instead. So
BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.
dst_reg = dst_reg + src_reg.

Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
operation. Classic BPF_RET | BPF_K means copy imm32 into the return register
and perform function exit. eBPF is modeled to match the CPU, so
BPF_JMP | BPF_EXIT in eBPF means function exit only. The eBPF program needs
to store the return value into register R0 before doing a BPF_EXIT. Class 6
in eBPF is used as BPF_JMP32 to mean exactly the same operations as BPF_JMP,
but with 32-bit wide operands for the comparisons instead.

For load and store instructions the 8-bit 'code' field is divided as::

  +--------+--------+-------------------+
  | 3 bits | 2 bits |      3 bits       |
  |  mode  |  size  | instruction class |
  +--------+--------+-------------------+
  (MSB)                             (LSB)

Size modifier is one of ...

::

  BPF_W   0x00    /* word */
  BPF_H   0x08    /* half word */
  BPF_B   0x10    /* byte */
  BPF_DW  0x18    /* eBPF only, double word */

... which encodes the size of the load/store operation::

  B  - 1 byte
  H  - 2 byte
  W  - 4 byte
  DW - 8 byte (eBPF only)

Mode modifier is one of::

  BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
  BPF_ABS     0x20
  BPF_IND     0x40
  BPF_MEM     0x60
  BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
  BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
  BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */
Documentation/bpf/index.rst (+1)
     helpers
     programs
     maps
 +   classic_vs_extended.rst
     bpf_licensing
     test_debug
     other
Documentation/bpf/instruction-set.rst (+62 -304)
  eBPF Instruction Set
  ====================

- eBPF is designed to be JITed with one to one mapping, which can also open up
- the possibility for GCC/LLVM compilers to generate optimized eBPF code through
- an eBPF backend that performs almost as fast as natively compiled code.
-
- Some core changes of the eBPF format from classic BPF:
-
- ··· (the remainder of the classic BPF comparison — registers, calling
-      convention, examples and safety discussion, most of this file's 304
-      removed lines — now lives in Documentation/bpf/classic_vs_extended.rst,
-      shown above) ···
+ Registers and calling convention
+ ================================
+
+ eBPF has 10 general purpose registers and a read-only frame pointer
+ register, all of which are 64 bits wide.
+
+ The eBPF calling convention is defined as:
+
+ * R0: return value from function calls, and exit value for eBPF programs
+ * R1 - R5: arguments for function calls
+ * R6 - R9: callee saved registers that function calls will preserve
+ * R10: read-only frame pointer to access stack
+
+ R0 - R5 are scratch registers and eBPF programs need to spill/fill them if
+ necessary across calls.

  eBPF opcode encoding
  ====================

- eBPF is reusing most of the opcode encoding from classic to simplify
- conversion of classic BPF to eBPF. For arithmetic and jump instructions the
- 8-bit 'code' field is divided into three parts::
+ For arithmetic and jump instructions the 8-bit 'opcode' field is divided
+ into three parts::

    +----------------+--------+--------------------+
    |   4 bits       |  1 bit |   3 bits           |
    | operation code | source | instruction class  |
    +----------------+--------+--------------------+
···
  Three LSB bits store instruction class which is one of:

- =================== ===============
- Classic BPF classes eBPF classes
- =================== ===============
- BPF_LD    0x00      BPF_LD    0x00
- BPF_LDX   0x01      BPF_LDX   0x01
- BPF_ST    0x02      BPF_ST    0x02
- BPF_STX   0x03      BPF_STX   0x03
- BPF_ALU   0x04      BPF_ALU   0x04
- BPF_JMP   0x05      BPF_JMP   0x05
- BPF_RET   0x06      BPF_JMP32 0x06
- BPF_MISC  0x07      BPF_ALU64 0x07
- =================== ===============
+ ========= =====
+ class     value
+ ========= =====
+ BPF_LD    0x00
+ BPF_LDX   0x01
+ BPF_ST    0x02
+ BPF_STX   0x03
+ BPF_ALU   0x04
+ BPF_JMP   0x05
+ BPF_JMP32 0x06
+ BPF_ALU64 0x07
+ ========= =====

  When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...

- ::
-
-   BPF_K     0x00
-   BPF_X     0x08
-
- * in classic BPF, this means::
-
-     BPF_SRC(code) == BPF_X - use register X as source operand
-     BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
-
- * in eBPF, this means::
-
-     BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
-     BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
+ ::
+
+   BPF_K   0x00  /* use 32-bit immediate as source operand */
+   BPF_X   0x08  /* use 'src_reg' register as source operand */

  ... and four MSB bits store operation code.

- If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
+ If BPF_CLASS(code) == BPF_ALU or BPF_ALU64, BPF_OP(code) is one of::

    BPF_ADD   0x00
    BPF_SUB   0x10
···
    BPF_NEG   0x80
    BPF_MOD   0x90
    BPF_XOR   0xa0
-   BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
-   BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
-   BPF_END   0xd0  /* eBPF only: endianness conversion */
+   BPF_MOV   0xb0  /* mov reg to reg */
+   BPF_ARSH  0xc0  /* sign extending shift right */
+   BPF_END   0xd0  /* endianness conversion */

- If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
+ If BPF_CLASS(code) == BPF_JMP or BPF_JMP32, BPF_OP(code) is one of::

    BPF_JA    0x00  /* BPF_JMP only */
    BPF_JEQ   0x10
    BPF_JGT   0x20
    BPF_JGE   0x30
    BPF_JSET  0x40
-   BPF_JNE   0x50  /* eBPF only: jump != */
-   BPF_JSGT  0x60  /* eBPF only: signed '>' */
-   BPF_JSGE  0x70  /* eBPF only: signed '>=' */
-   BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
-   BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
-   BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
-   BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
-   BPF_JSLT  0xc0  /* eBPF only: signed '<' */
-   BPF_JSLE  0xd0  /* eBPF only: signed '<=' */
+   BPF_JNE   0x50  /* jump != */
+   BPF_JSGT  0x60  /* signed '>' */
+   BPF_JSGE  0x70  /* signed '>=' */
+   BPF_CALL  0x80  /* function call */
+   BPF_EXIT  0x90  /* function return */
+   BPF_JLT   0xa0  /* unsigned '<' */
+   BPF_JLE   0xb0  /* unsigned '<=' */
+   BPF_JSLT  0xc0  /* signed '<' */
+   BPF_JSLE  0xd0  /* signed '<=' */

- So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
- and eBPF. There are only two registers in classic BPF, so it means A += X.
- In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
- BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
- src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.
+ So BPF_ADD | BPF_X | BPF_ALU means::
+
+   dst_reg = (u32) dst_reg + (u32) src_reg;
+
+ Similarly, BPF_XOR | BPF_K | BPF_ALU means::
+
+   src_reg = (u32) src_reg ^ (u32) imm32

- Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
- eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
- BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
- exactly the same operations as BPF_ALU, but with 64-bit wide operands
- instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
- dst_reg = dst_reg + src_reg
+ eBPF uses BPF_MOV | BPF_X | BPF_ALU to represent A = B moves. BPF_ALU64 is
+ used to mean exactly the same operations as BPF_ALU, but with 64-bit wide
+ operands instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition,
+ i.e.::
+
+   dst_reg = dst_reg + src_reg

- Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
- operation. Classic BPF_RET | BPF_K means copy imm32 into return register
- and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
- in eBPF means function exit only. The eBPF program needs to store return
- value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
+ BPF_JMP | BPF_EXIT means function exit only. The eBPF program needs to store
+ the return value into register R0 before doing a BPF_EXIT. Class 6 is used as
  BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
  operands for the comparisons instead.
···
    BPF_W   0x00    /* word */
    BPF_H   0x08    /* half word */
    BPF_B   0x10    /* byte */
-   BPF_DW  0x18    /* eBPF only, double word */
+   BPF_DW  0x18    /* double word */

  ... which encodes size of load/store operation::

    B  - 1 byte
    H  - 2 byte
    W  - 4 byte
-   DW - 8 byte (eBPF only)
+   DW - 8 byte

  Mode modifier is one of::

-   BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
+   BPF_IMM     0x00  /* used for 64-bit mov */
    BPF_ABS     0x20
    BPF_IND     0x40
    BPF_MEM     0x60
-   BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
-   BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
-   BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */
+   BPF_ATOMIC  0xc0  /* atomic operations */

  eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
  (BPF_IND | <size> | BPF_LD) which are used to access packet data.

- They had to be carried over from classic to have strong performance of
+ They had to be carried over from classic BPF to have strong performance of
  socket filters running in eBPF interpreter. These instructions can only
  be used when interpreter context is a pointer to ``struct sk_buff`` and
  have seven implicit operands. Register R6 is an implicit input that must
···
    R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
  and R1 - R5 were scratched.

- Unlike classic BPF instruction set, eBPF has generic load/store operations::
+ eBPF has generic load/store operations::

    BPF_MEM | <size> | BPF_STX:  *(size *) (dst_reg + off) = src_reg
    BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32
···
  eBPF has one 16-byte instruction: ``BPF_LD | BPF_DW | BPF_IMM`` which
  consists of two consecutive ``struct bpf_insn`` 8-byte blocks and is
  interpreted as a single instruction that loads a 64-bit immediate value
  into a dst_reg.
- Classic BPF has similar instruction: ``BPF_LD | BPF_W | BPF_IMM`` which loads
- 32-bit immediate value into a register.