Documentation/networking/filter.txt at v3.16

   1Linux Socket Filtering aka Berkeley Packet Filter (BPF)
   2=======================================================
   3
   4Introduction
   5------------
   6
Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Although there are some distinct differences between BSD and Linux kernel
filtering, when we speak of BPF or LSF in the Linux context, we mean the
very same mechanism of filtering in the Linux kernel.
  11
  12BPF allows a user-space program to attach a filter onto any socket and
  13allow or disallow certain types of data to come through the socket. LSF
  14follows exactly the same filter code structure as BSD's BPF, so referring
  15to the BSD bpf.4 manpage is very helpful in creating filters.
  16
  17On Linux, BPF is much simpler than on BSD. One does not have to worry
  18about devices or anything like that. You simply create your filter code,
  19send it to the kernel via the SO_ATTACH_FILTER option and if your filter
  20code passes the kernel check on it, you then immediately begin filtering
  21data on that socket.
  22
You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much, since closing a socket that
has a filter attached removes the filter automatically. The other, less
common case is attaching a different filter to a socket that already has
a running filter: the kernel takes care of removing the old one and
putting your new one in its place, provided the new filter passes the
checks. If it fails them, the old filter remains on that socket.
  31
The SO_LOCK_FILTER option allows locking the filter attached to a socket.
Once set, a filter cannot be removed or changed. This allows one process to
set up a socket, attach a filter, lock it, then drop privileges and be
assured that the filter will be kept until the socket is closed.
  36
  37The biggest user of this construct might be libpcap. Issuing a high-level
  38filter command like `tcpdump -i em1 port 22` passes through the libpcap
  39internal compiler that generates a structure that can eventually be loaded
  40via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
  41displays what is being placed into this structure.
  42
  43Although we were only speaking about sockets here, BPF in Linux is used
  44in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
  45qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
  46such as team driver, PTP code, etc where BPF is being used.
  47
  48 [1] Documentation/prctl/seccomp_filter.txt
  49
  50Original BPF paper:
  51
Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference (USENIX'93). USENIX Association, Berkeley,
CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]
  57
  58Structure
  59---------
  60
  61User space applications include <linux/filter.h> which contains the
  62following relevant structures:
  63
  64struct sock_filter {	/* Filter block */
  65	__u16	code;   /* Actual filter code */
  66	__u8	jt;	/* Jump true */
  67	__u8	jf;	/* Jump false */
  68	__u32	k;      /* Generic multiuse field */
  69};
  70
A filter is assembled as an array of such 4-tuples, each containing a
code, jt, jf and k value. jt and jf are jump offsets and k is a generic
value whose meaning depends on the given code.
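
As a rough illustration of such 4-tuples, the sketch below builds a
four-instruction filter that accepts ARP frames, using the BPF_STMT() and
BPF_JUMP() helper macros that <linux/filter.h> exports to user space. The
array name arp_filter is made up for this example.

```c
#include <linux/filter.h>

/*
 * Hypothetical example array: accept ARP frames, drop everything else.
 * Each initializer expands to one { code, jt, jf, k } tuple.
 */
static const struct sock_filter arp_filter[] = {
	/* A = half-word at byte offset 12 (the EtherType) */
	BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
	/* if (A == 0x0806) skip 0 insns, else skip 1 insn */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0806, 0, 1),
	/* accept: return the maximum snapshot length */
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
	/* drop: return 0 bytes */
	BPF_STMT(BPF_RET | BPF_K, 0),
};
```

The jump offsets count instructions to skip relative to the instruction
that follows the jump, which is why jf = 1 lands on the final "drop"
return.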
  74
  75struct sock_fprog {			/* Required for SO_ATTACH_FILTER. */
  76	unsigned short		   len;	/* Number of filter blocks */
  77	struct sock_filter __user *filter;
  78};
  79
For socket filtering, a pointer to this structure (as shown in the
following example) is passed to the kernel through setsockopt(2).
  82
  83Example
  84-------
  85
#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <linux/if_ether.h>
#include <linux/filter.h>
/* ... */

/* ARRAY_SIZE() is a kernel macro; user space has to define its own. */
#ifndef ARRAY_SIZE
#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
#endif

/* From the example above: tcpdump -i em1 port 22 -dd */
struct sock_filter code[] = {
	{ 0x28,  0,  0, 0x0000000c },
	{ 0x15,  0,  8, 0x000086dd },
	{ 0x30,  0,  0, 0x00000014 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0, 17, 0x00000011 },
	{ 0x28,  0,  0, 0x00000036 },
	{ 0x15, 14,  0, 0x00000016 },
	{ 0x28,  0,  0, 0x00000038 },
	{ 0x15, 12, 13, 0x00000016 },
	{ 0x15,  0, 12, 0x00000800 },
	{ 0x30,  0,  0, 0x00000017 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0,  8, 0x00000011 },
	{ 0x28,  0,  0, 0x00000014 },
	{ 0x45,  6,  0, 0x00001fff },
	{ 0xb1,  0,  0, 0x0000000e },
	{ 0x48,  0,  0, 0x0000000e },
	{ 0x15,  2,  0, 0x00000016 },
	{ 0x48,  0,  0, 0x00000010 },
	{ 0x15,  0,  1, 0x00000016 },
	{ 0x06,  0,  0, 0x0000ffff },
	{ 0x06,  0,  0, 0x00000000 },
};

struct sock_fprog bpf = {
	.len = ARRAY_SIZE(code),
	.filter = code,
};

sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (sock < 0) {
	/* ... bail out ... */
}

ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
if (ret < 0) {
	/* ... bail out ... */
}

/* ... */
close(sock);
 135
 136The above example code attaches a socket filter for a PF_PACKET socket
 137in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
 138be dropped for this socket.
 139
The setsockopt(2) call for SO_DETACH_FILTER doesn't need any arguments,
while SO_LOCK_FILTER, which prevents the filter from being detached, takes
an integer value of 0 or 1.
 143
 144Note that socket filters are not restricted to PF_PACKET sockets only,
 145but can also be used on other socket families.
 146
 147Summary of system calls:
 148
 149 * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
 150 * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
 151 * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER,   &val, sizeof(val));
 152
 153Normally, most use cases for socket filtering on packet sockets will be
 154covered by libpcap in high-level syntax, so as an application developer
 155you should stick to that. libpcap wraps its own layer around all that.
 156
Writing a filter "by hand" can be an alternative if i) using/linking to
libpcap is not an option, ii) the required BPF filters use Linux extensions
that are not supported by libpcap's compiler, iii) a filter is too complex
to be cleanly implemented with libpcap's compiler, or iv) particular filter
code should be optimized differently than libpcap's internal compiler does.
For example, xt_bpf and cls_bpf users might have requirements that could
result in more complex filter code, or code that cannot be expressed with
libpcap (e.g. different return codes for various code paths). Moreover, BPF
JIT implementors may wish to write test cases manually and thus need
low-level access to BPF code as well.
 168
 169BPF engine and instruction set
 170------------------------------
 171
 172Under tools/net/ there's a small helper tool called bpf_asm which can
 173be used to write low-level filters for example scenarios mentioned in the
 174previous section. Asm-like syntax mentioned here has been implemented in
 175bpf_asm and will be used for further explanations (instead of dealing with
 176less readable opcodes directly, principles are the same). The syntax is
 177closely modelled after Steven McCanne's and Van Jacobson's BPF paper.
 178
 179The BPF architecture consists of the following basic elements:
 180
 181  Element          Description
 182
 183  A                32 bit wide accumulator
 184  X                32 bit wide X register
 185  M[]              16 x 32 bit wide misc registers aka "scratch memory
 186                   store", addressable from 0 to 15
 187
A program that is translated by bpf_asm into "opcodes" is an array that
consists of the following elements (as already mentioned):
 190
 191  op:16, jt:8, jf:8, k:32
 192
The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for the condition
"jump if true", the other for "jump if false". Finally, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.
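
To make the encoding of op a little more tangible, the following sketch
picks an opcode apart with the BPF_CLASS(), BPF_SIZE(), BPF_MODE(), BPF_OP()
and BPF_SRC() macros available through <linux/filter.h>. The two helper
functions are hypothetical and exist only for this illustration.

```c
#include <linux/filter.h>

/* Hypothetical helper: is this insn "ldh [k]" (BPF_LD | BPF_H | BPF_ABS)? */
static int is_ldh_abs(const struct sock_filter *insn)
{
	return BPF_CLASS(insn->code) == BPF_LD &&
	       BPF_SIZE(insn->code) == BPF_H &&
	       BPF_MODE(insn->code) == BPF_ABS;
}

/* Hypothetical helper: is this insn "jeq #k, ..." with a literal operand? */
static int is_jeq_k(const struct sock_filter *insn)
{
	return BPF_CLASS(insn->code) == BPF_JMP &&
	       BPF_OP(insn->code) == BPF_JEQ &&
	       BPF_SRC(insn->code) == BPF_K;
}
```

For instance, the tuple { 0x28, 0, 0, 12 } satisfies is_ldh_abs(), and
{ 0x15, 0, 1, 0x806 } satisfies is_jeq_k().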
 198
The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. This
table lists all available bpf_asm instructions and what their underlying
opcodes, as defined in linux/filter.h, stand for:
 203
  Instruction      Addressing mode      Description

  ld               1, 2, 3, 4, 10       Load word into A
  ldi              4                    Load word into A
  ldh              1, 2                 Load half-word into A
  ldb              1, 2                 Load byte into A
  ldx              3, 4, 5, 10          Load word into X
  ldxi             4                    Load word into X
  ldxb             5                    Load byte into X

  st               3                    Store A into M[]
  stx              3                    Store X into M[]

  jmp              6                    Jump to label
  ja               6                    Jump to label
  jeq              7, 8                 Jump on A == k
  jneq             8                    Jump on A != k
  jne              8                    Jump on A != k
  jlt              8                    Jump on A < k
  jle              8                    Jump on A <= k
  jgt              7, 8                 Jump on A > k
  jge              7, 8                 Jump on A >= k
  jset             7, 8                 Jump on A & k

  add              0, 4                 A + <x>
  sub              0, 4                 A - <x>
  mul              0, 4                 A * <x>
  div              0, 4                 A / <x>
  mod              0, 4                 A % <x>
  neg                                   -A
  and              0, 4                 A & <x>
  or               0, 4                 A | <x>
  xor              0, 4                 A ^ <x>
  lsh              0, 4                 A << <x>
  rsh              0, 4                 A >> <x>

  tax                                   Copy A into X
  txa                                   Copy X into A

  ret              4, 9                 Return
 244
 245The next table shows addressing formats from the 2nd column:
 246
 247  Addressing mode  Syntax               Description
 248
 249   0               x/%x                 Register X
 250   1               [k]                  BHW at byte offset k in the packet
 251   2               [x + k]              BHW at the offset X + k in the packet
 252   3               M[k]                 Word at offset k in M[]
 253   4               #k                   Literal value stored in k
 254   5               4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
 255   6               L                    Jump label L
 256   7               #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
 257   8               #k,Lt                Jump to Lt if predicate is true
 258   9               a/%a                 Accumulator A
 259  10               extension            BPF extension
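
How A, X, the jump offsets and a few of the addressing modes interact can be
sketched with a toy interpreter. This is an illustration only, not the
kernel's implementation: it handles a small subset of instructions, ignores
M[] and the BPF extensions, and the function name toy_bpf_run() is made up.

```c
#include <linux/filter.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy interpreter for a small subset of classic BPF. Returns the filter's
 * return value, or 0 on an out-of-bounds load or an unsupported opcode.
 */
static uint32_t toy_bpf_run(const struct sock_filter *prog, size_t len,
			    const uint8_t *pkt, size_t pkt_len)
{
	uint32_t A = 0, X = 0;
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		const struct sock_filter *i = &prog[pc];
		uint32_t k = i->k;

		switch (i->code) {
		case BPF_LD | BPF_B | BPF_ABS:		/* ldb [k] */
			if (k >= pkt_len)
				return 0;
			A = pkt[k];
			break;
		case BPF_LD | BPF_H | BPF_ABS:		/* ldh [k] */
			if ((uint64_t)k + 2 > pkt_len)
				return 0;
			A = (uint32_t)pkt[k] << 8 | pkt[k + 1];
			break;
		case BPF_LD | BPF_H | BPF_IND:		/* ldh [x + k] */
			if ((uint64_t)X + k + 2 > pkt_len)
				return 0;
			A = (uint32_t)pkt[(uint64_t)X + k] << 8 |
			    pkt[(uint64_t)X + k + 1];
			break;
		case BPF_LDX | BPF_B | BPF_MSH:		/* ldxb 4*([k]&0xf) */
			if (k >= pkt_len)
				return 0;
			X = 4 * (pkt[k] & 0xf);
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:		/* jeq #k, Lt, Lf */
			pc += (A == k) ? i->jt : i->jf;
			break;
		case BPF_JMP | BPF_JGT | BPF_K:		/* jgt #k, Lt, Lf */
			pc += (A > k) ? i->jt : i->jf;
			break;
		case BPF_JMP | BPF_JSET | BPF_K:	/* jset #k, Lt, Lf */
			pc += (A & k) ? i->jt : i->jf;
			break;
		case BPF_RET | BPF_K:			/* ret #k */
			return k;
		case BPF_RET | BPF_A:			/* ret a */
			return A;
		default:
			return 0;
		}
	}
	return 0;
}
```

Multi-byte loads read the packet in network byte order, and a taken jump
skips jt (or jf) instructions counted from the one following the jump.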
 260
The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such a BPF
extension is loaded into A.
 265
 266Possible BPF extensions are shown in the following table:
 267
 268  Extension                             Description
 269
 270  len                                   skb->len
 271  proto                                 skb->protocol
 272  type                                  skb->pkt_type
 273  poff                                  Payload start offset
 274  ifidx                                 skb->dev->ifindex
 275  nla                                   Netlink attribute of type X with offset A
 276  nlan                                  Nested Netlink attribute of type X with offset A
 277  mark                                  skb->mark
 278  queue                                 skb->queue_mapping
 279  hatype                                skb->dev->type
 280  rxhash                                skb->hash
 281  cpu                                   raw_smp_processor_id()
 282  vlan_tci                              vlan_tx_tag_get(skb)
 283  vlan_pr                               vlan_tx_tag_present(skb)
 284  rand                                  prandom_u32()
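
The "overloading" of k can be sketched in C with the SKF_AD_OFF and
SKF_AD_* constants from <linux/filter.h>. The example below corresponds
roughly to a "ld vlan_tci" followed by a comparison with 10; the array name
vlan10_filter is made up for this illustration.

```c
#include <linux/filter.h>

/* Hypothetical example array: accept packets whose vlan_tci equals 10. */
static const struct sock_filter vlan10_filter[] = {
	/* A = vlan_tci; k is the negative base plus the extension offset */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG),
	/* if (A == 10) fall through to accept, else skip to drop */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 10, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),	/* accept */
	BPF_STMT(BPF_RET | BPF_K, 0),		/* drop */
};
```

Since k is an unsigned 32 bit field, the negative sum is stored as a large
unsigned value that no real packet offset would use.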
 285
These extensions can also be prefixed with '#'.

Examples for low-level BPF:
 288
 289** ARP packets:
 290
 291  ldh [12]
 292  jne #0x806, drop
 293  ret #-1
 294  drop: ret #0
 295
 296** IPv4 TCP packets:
 297
 298  ldh [12]
 299  jne #0x800, drop
 300  ldb [23]
 301  jneq #6, drop
 302  ret #-1
 303  drop: ret #0
 304
 305** (Accelerated) VLAN w/ id 10:
 306
 307  ld vlan_tci
 308  jneq #10, drop
 309  ret #-1
 310  drop: ret #0
 311
** ICMP random packet sampling, 1 in 4:

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #1, drop
  # get a random uint32 number
  ld rand
  mod #4
  jneq #1, drop
  ret #-1
  drop: ret #0
 323
 324** SECCOMP filter example:
 325
 326  ld [4]                  /* offsetof(struct seccomp_data, arch) */
 327  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
 328  ld [0]                  /* offsetof(struct seccomp_data, nr) */
 329  jeq #15, good           /* __NR_rt_sigreturn */
 330  jeq #231, good          /* __NR_exit_group */
 331  jeq #60, good           /* __NR_exit */
 332  jeq #0, good            /* __NR_read */
 333  jeq #1, good            /* __NR_write */
 334  jeq #5, good            /* __NR_fstat */
 335  jeq #9, good            /* __NR_mmap */
 336  jeq #14, good           /* __NR_rt_sigprocmask */
 337  jeq #13, good           /* __NR_rt_sigaction */
 338  jeq #35, good           /* __NR_nanosleep */
 339  bad: ret #0             /* SECCOMP_RET_KILL */
 340  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */
 341
The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, an output format
that xt_bpf and cls_bpf understand and that can be loaded directly. Example
with the above ARP code:
 346
 347$ ./bpf_asm foo
 3484,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
 349
 350In copy and paste C-like output:
 351
 352$ ./bpf_asm -c foo
 353{ 0x28,  0,  0, 0x0000000c },
 354{ 0x15,  0,  1, 0x00000806 },
 355{ 0x06,  0,  0, 0xffffffff },
 356{ 0x06,  0,  0, 0000000000 },
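
For programs that want to consume the default (comma separated) bpf_asm
output themselves, a parser could look roughly like the sketch below. The
function name parse_bpf_asm_output() is hypothetical, and the accepted
format is inferred only from the sample output shown above.

```c
#include <linux/filter.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse "len,code jt jf k,code jt jf k,..." into
 * 'out', which has room for 'max' instructions. Returns the number of
 * instructions parsed, or -1 on malformed or oversized input.
 */
static int parse_bpf_asm_output(const char *s, struct sock_filter *out, int max)
{
	unsigned int len, code, jt, jf, k;
	unsigned int i;
	int n = 0;

	/* The leading "len," gives the number of instructions that follow. */
	if (max < 0 || sscanf(s, "%u,%n", &len, &n) != 1 || n == 0 ||
	    len > (unsigned int)max)
		return -1;
	s += n;

	for (i = 0; i < len; i++) {
		n = 0;
		if (sscanf(s, "%u %u %u %u%n", &code, &jt, &jf, &k, &n) != 4)
			return -1;
		if (code > 0xffff || jt > 0xff || jf > 0xff)
			return -1;
		out[i].code = (__u16)code;
		out[i].jt = (__u8)jt;
		out[i].jf = (__u8)jf;
		out[i].k = k;
		s += n;
		if (*s == ',')	/* instructions are comma separated */
			s++;
	}
	return (int)len;
}
```

The resulting array can then be referenced from a struct sock_fprog in the
same way as the hand-written array in the earlier example.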
 357
 358In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
 359filters that might not be obvious at first, it's good to test filters before
 360attaching to a live system. For that purpose, there's a small tool called
 361bpf_dbg under tools/net/ in the kernel source directory. This debugger allows
 362for testing BPF filters against given pcap files, single stepping through the
 363BPF code on the pcap's packets and to do BPF machine register dumps.
 364
 365Starting bpf_dbg is trivial and just requires issuing:
 366
 367# ./bpf_dbg
 368
 369In case input and output do not equal stdin/stdout, bpf_dbg takes an
 370alternative stdin source as a first argument, and an alternative stdout
 371sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.
 372
 373Other than that, a particular libreadline configuration can be set via
 374file "~/.bpf_dbg_init" and the command history is stored in the file
 375"~/.bpf_dbg_history".
 376
 377Interaction in bpf_dbg happens through a shell that also has auto-completion
 378support (follow-up example commands starting with '>' denote bpf_dbg shell).
 379The usual workflow would be to ...
 380
 381> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
 382  Loads a BPF filter from standard output of bpf_asm, or transformed via
 383  e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
 384  debugging (next section), this command creates a temporary socket and
 385  loads the BPF code into the kernel. Thus, this will also be useful for
 386  JIT developers.
 387
 388> load pcap foo.pcap
 389  Loads standard tcpdump pcap file.
 390
 391> run [<n>]
 392bpf passes:1 fails:9
 393  Runs through all packets from a pcap to account how many passes and fails
 394  the filter will generate. A limit of packets to traverse can be given.
 395
 396> disassemble
 397l0:	ldh [12]
 398l1:	jeq #0x800, l2, l5
 399l2:	ldb [23]
 400l3:	jeq #0x1, l4, l5
 401l4:	ret #0xffff
 402l5:	ret #0
 403  Prints out BPF code disassembly.
 404
 405> dump
 406/* { op, jt, jf, k }, */
 407{ 0x28,  0,  0, 0x0000000c },
 408{ 0x15,  0,  3, 0x00000800 },
 409{ 0x30,  0,  0, 0x00000017 },
 410{ 0x15,  0,  1, 0x00000001 },
 411{ 0x06,  0,  0, 0x0000ffff },
 412{ 0x06,  0,  0, 0000000000 },
 413  Prints out C-style BPF code dump.
 414
 415> breakpoint 0
 416breakpoint at: l0:	ldh [12]
 417> breakpoint 1
 418breakpoint at: l1:	jeq #0x800, l2, l5
 419  ...
 420  Sets breakpoints at particular BPF instructions. Issuing a `run` command
 421  will walk through the pcap file continuing from the current packet and
 422  break when a breakpoint is being hit (another `run` will continue from
 423  the currently active breakpoint executing next instructions):
 424
 425  > run
 426  -- register dump --
 427  pc:       [0]                       <-- program counter
 428  code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
 429  curr:     l0:	ldh [12]              <-- disassembly of current instruction
 430  A:        [00000000][0]             <-- content of A (hex, decimal)
 431  X:        [00000000][0]             <-- content of X (hex, decimal)
 432  M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
 433  -- packet dump --                   <-- Current packet from pcap (hex)
 434  len: 42
 435    0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
 436   16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
 437   32: 00 00 00 00 00 00 0a 3b 01 01
 438  (breakpoint)
 439  >
 440
 441> breakpoint
 442breakpoints: 0 1
 443  Prints currently set breakpoints.
 444
 445> step [-<n>, +<n>]
 446  Performs single stepping through the BPF program from the current pc
 447  offset. Thus, on each step invocation, above register dump is issued.
 448  This can go forwards and backwards in time, a plain `step` will break
 449  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
 450
 451> select <n>
 452  Selects a given packet from the pcap file to continue from. Thus, on
 453  the next `run` or `step`, the BPF program is being evaluated against
 454  the user pre-selected packet. Numbering starts just as in Wireshark
 455  with index 1.
 456
 457> quit
 458#
 459  Exits bpf_dbg.
 460
 461JIT compiler
 462------------
 463
The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC,
ARM and s390, which can be enabled through CONFIG_BPF_JIT. The JIT compiler is
transparently invoked for each filter attached from user space or for internal
kernel users if it has been previously enabled by root:
 468
 469  echo 1 > /proc/sys/net/core/bpf_jit_enable
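
A program that wants to know whether the JIT is currently enabled could read
the same file back. The helper below is a small sketch with a made-up name;
it simply reports -1 when the file cannot be opened or parsed, for instance
when it does not exist on the running kernel.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read an integer sysctl file such as
 * /proc/sys/net/core/bpf_jit_enable. Returns the value read, or -1 if
 * the file cannot be opened or does not start with an integer.
 */
static int read_int_sysctl(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

Called as read_int_sysctl("/proc/sys/net/core/bpf_jit_enable"), a result of
0 would mean that filters are run by the interpreter.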
 470
 471For JIT developers, doing audits etc, each compile run can output the generated
 472opcode image into the kernel log via:
 473
 474  echo 2 > /proc/sys/net/core/bpf_jit_enable
 475
 476Example output from dmesg:
 477
 478[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
 479[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
 480[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
 481[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
 482[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
 483[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
 484
 485In the kernel source tree under tools/net/, there's bpf_jit_disasm for
 486generating disassembly out of the kernel log's hexdump:
 487
 488# ./bpf_jit_disasm
 48970 bytes emitted from JIT compiler (pass:3, flen:6)
 490ffffffffa0069c8f + <x>:
 491   0:	push   %rbp
 492   1:	mov    %rsp,%rbp
 493   4:	sub    $0x60,%rsp
 494   8:	mov    %rbx,-0x8(%rbp)
 495   c:	mov    0x68(%rdi),%r9d
 496  10:	sub    0x6c(%rdi),%r9d
 497  14:	mov    0xd8(%rdi),%r8
 498  1b:	mov    $0xc,%esi
 499  20:	callq  0xffffffffe0ff9442
 500  25:	cmp    $0x800,%eax
 501  2a:	jne    0x0000000000000042
 502  2c:	mov    $0x17,%esi
 503  31:	callq  0xffffffffe0ff945e
 504  36:	cmp    $0x1,%eax
 505  39:	jne    0x0000000000000042
 506  3b:	mov    $0xffff,%eax
 507  40:	jmp    0x0000000000000044
 508  42:	xor    %eax,%eax
 509  44:	leaveq
 510  45:	retq
 511
 512Issuing option `-o` will "annotate" opcodes to resulting assembler
 513instructions, which can be very useful for JIT developers:
 514
 515# ./bpf_jit_disasm -o
 51670 bytes emitted from JIT compiler (pass:3, flen:6)
 517ffffffffa0069c8f + <x>:
 518   0:	push   %rbp
 519	55
 520   1:	mov    %rsp,%rbp
 521	48 89 e5
 522   4:	sub    $0x60,%rsp
 523	48 83 ec 60
 524   8:	mov    %rbx,-0x8(%rbp)
 525	48 89 5d f8
 526   c:	mov    0x68(%rdi),%r9d
 527	44 8b 4f 68
 528  10:	sub    0x6c(%rdi),%r9d
 529	44 2b 4f 6c
 530  14:	mov    0xd8(%rdi),%r8
 531	4c 8b 87 d8 00 00 00
 532  1b:	mov    $0xc,%esi
 533	be 0c 00 00 00
 534  20:	callq  0xffffffffe0ff9442
 535	e8 1d 94 ff e0
 536  25:	cmp    $0x800,%eax
 537	3d 00 08 00 00
 538  2a:	jne    0x0000000000000042
 539	75 16
 540  2c:	mov    $0x17,%esi
 541	be 17 00 00 00
 542  31:	callq  0xffffffffe0ff945e
 543	e8 28 94 ff e0
 544  36:	cmp    $0x1,%eax
 545	83 f8 01
 546  39:	jne    0x0000000000000042
 547	75 07
 548  3b:	mov    $0xffff,%eax
 549	b8 ff ff 00 00
 550  40:	jmp    0x0000000000000044
 551	eb 02
 552  42:	xor    %eax,%eax
 553	31 c0
 554  44:	leaveq
 555	c9
 556  45:	retq
 557	c3
 558
For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.
 561
BPF kernel internals
--------------------

Internally, for the kernel interpreter, a different instruction set
format with similar underlying principles to the BPF described in previous
paragraphs is being used. However, the instruction set format is modelled
closer to the underlying architecture to mimic native instruction sets, so
that a better performance can be achieved (more details later). This new
ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which
originates from [e]xtended BPF is not the same as BPF extensions! While
eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
 573
 574It is designed to be JITed with one to one mapping, which can also open up
 575the possibility for GCC/LLVM compilers to generate optimized eBPF code through
 576an eBPF backend that performs almost as fast as natively compiled code.
 577
The new instruction set was originally designed with the possible goal in
mind to write programs in "restricted C" and compile them into eBPF with an
optional GCC/LLVM backend, so that it can just-in-time map to modern 64-bit
CPUs with minimal performance overhead over two steps, that is, C -> eBPF ->
native code.
 582
Currently, the new format is being used for running user BPF programs, which
includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, PTP dissector/classifier, and much more. They are all internally
converted by the kernel into the new instruction set representation and run
in the eBPF interpreter. For in-kernel handlers, this all works transparently
by using sk_unattached_filter_create() for setting up the filter, and
sk_unattached_filter_destroy() for destroying it. The macro
SK_RUN_FILTER(filter, ctx) transparently invokes the eBPF interpreter or JITed
code to run the filter. 'filter' is a pointer to struct sk_filter that we
got from sk_unattached_filter_create(), and 'ctx' the given context (e.g.
skb pointer). All constraints and restrictions from sk_chk_filter() apply
before a conversion to the new layout is being done behind the scenes!
 596
 597Currently, the classic BPF format is being used for JITing on most of the
 598architectures. Only x86-64 performs JIT compilation from eBPF instruction set,
 599however, future work will migrate other JIT compilers as well, so that they
 600will profit from the very same benefits.
 601
 602Some core changes of the new internal format:
 603
 604- Number of registers increase from 2 to 10:
 605
 606  The old format had two registers A and X, and a hidden frame pointer. The
 607  new layout extends this to be 10 internal registers and a read-only frame
 608  pointer. Since 64-bit CPUs are passing arguments to functions via registers
 609  the number of args from eBPF program to in-kernel function is restricted
 610  to 5 and one register is used to accept return value from an in-kernel
 611  function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
 612  sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
 613  registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
 614
 615  Therefore, eBPF calling convention is defined as:
 616
 617    * R0	- return value from in-kernel function, and exit value for eBPF program
 618    * R1 - R5	- arguments from eBPF program to in-kernel function
 619    * R6 - R9	- callee saved registers that in-kernel function will preserve
 620    * R10	- read-only frame pointer to access stack
 621
 622  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
 623  etc, and eBPF calling convention maps directly to ABIs used by the kernel on
 624  64-bit architectures.
 625
  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
  and may let more complex programs be interpreted.

  R0 - R5 are scratch registers and an eBPF program needs to spill/fill them
  if necessary across calls. Note that there is only one eBPF program (== one
  eBPF main routine) and it cannot call other eBPF functions; it can only
  call predefined in-kernel functions.
 633
 634- Register width increases from 32-bit to 64-bit:
 635
 636  Still, the semantics of the original 32-bit ALU operations are preserved
 637  via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
 638  subregisters that zero-extend into 64-bit if they are being written to.
 639  That behavior maps directly to x86_64 and arm64 subregister definition, but
 640  makes other JITs more difficult.
 641
  32-bit architectures run 64-bit internal BPF programs via interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  native instruction set and let the rest be interpreted.

  Operation is 64-bit, because on 64-bit architectures, pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions.
  32-bit eBPF registers would otherwise require defining a register-pair
  ABI, so a direct eBPF register to HW register mapping could not be used
  and the JIT would need to do combine/split/move operations for every
  register in and out of the function, which is complex, bug prone and slow.
  Another reason is the use of atomic 64-bit counters.
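
The difference between a 64-bit operation and a 32-bit subregister operation
can be modelled in plain C. The two functions below are only a sketch of the
semantics described above (names are made up); they are not taken from the
kernel sources.

```c
#include <stdint.h>

/* Model of a 64-bit ALU add: the full register width takes part. */
static uint64_t alu64_add(uint64_t dst, uint64_t src)
{
	return dst + src;
}

/*
 * Model of a 32-bit ALU add: only the lower halves are added, and the
 * 32-bit result is zero-extended when written back to the 64-bit register.
 */
static uint64_t alu32_add(uint64_t dst, uint64_t src)
{
	return (uint32_t)((uint32_t)dst + (uint32_t)src);
}
```

With dst = 0xffffffff and src = 1, the 64-bit variant yields 0x100000000,
whereas the 32-bit variant wraps around to 0 and leaves the upper half
cleared.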
 653
 654- Conditional jt/jf targets replaced with jt/fall-through:
 655
  While the original design has constructs such as "if (cond) jump_true;
  else jump_false;", they are being replaced with alternative constructs like
  "if (cond) jump_true; /* else fall-through */".
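
One way to picture this rewrite is the following simplified model. The
structures and the function split_jeq() are hypothetical and cover only a
"jump if equal"; the kernel's actual conversion code is not shown here and
may differ in its details.

```c
#include <stdint.h>

/* Hypothetical, simplified instruction forms used only in this sketch. */
struct classic_jeq {		/* if (A == k) skip jt insns, else skip jf */
	uint32_t k;
	uint8_t jt, jf;
};

enum new_op { NEW_JEQ, NEW_JNE, NEW_JA };

struct new_jmp {		/* if (cond) skip off insns, else fall through */
	enum new_op op;
	uint32_t k;
	int off;
};

/*
 * Rewrite one classic "jeq #k, jt, jf" into at most two jt/fall-through
 * jumps. Offsets are relative to the instruction following the emitted
 * jump and ignore any growth of the rest of the program.
 * Returns the number of jumps written to 'out' (1 or 2).
 */
static int split_jeq(struct classic_jeq in, struct new_jmp out[2])
{
	if (in.jf == 0) {
		/* The false branch already falls through. */
		out[0] = (struct new_jmp){ NEW_JEQ, in.k, in.jt };
		return 1;
	}
	if (in.jt == 0) {
		/* The true branch falls through: invert the condition. */
		out[0] = (struct new_jmp){ NEW_JNE, in.k, in.jf };
		return 1;
	}
	/* Both targets are real: conditional jump, then unconditional jump. */
	out[0] = (struct new_jmp){ NEW_JEQ, in.k, in.jt + 1 };
	out[1] = (struct new_jmp){ NEW_JA, 0, in.jf };
	return 2;
}
```

In the two-jump case the conditional offset grows by one because the
inserted unconditional jump has to be skipped as well.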
 659
 660- Introduces bpf_call insn and register passing convention for zero overhead
 661  calls from/to other kernel functions:
 662
 663  Before an in-kernel function call, the internal BPF program needs to
 664  place function arguments into R1 to R5 registers to satisfy calling
 665  convention, then the interpreter will take them from registers and pass
 666  to in-kernel function. If R1 - R5 registers are mapped to CPU registers
 667  that are used for argument passing on given architecture, the JIT compiler
 668  doesn't need to emit extra moves. Function arguments will be in the correct
 669  registers and BPF_CALL instruction will be JITed as single 'call' HW
 670  instruction. This calling convention was picked to cover common call
 671  situations without performance penalty.
 672
 673  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
 674  a return value of the function. Since R6 - R9 are callee saved, their state
 675  is preserved across the call.
 676
 677  For example, consider three C functions:
 678
 679  u64 f1() { return (*_f2)(1); }
 680  u64 f2(u64 a) { return f3(a + 1, a); }
 681  u64 f3(u64 a, u64 b) { return a - b; }
 682
 683  GCC can compile f1, f3 into x86_64:
 684
 685  f1:
 686    movl $1, %edi
 687    movq _f2(%rip), %rax
 688    jmp  *%rax
 689  f3:
 690    movq %rdi, %rax
 691    subq %rsi, %rax
 692    ret
 693
 694  Function f2 in eBPF may look like:
 695
 696  f2:
 697    bpf_mov R2, R1
 698    bpf_add R1, 1
 699    bpf_call f3
 700    bpf_exit
 701
  If f2 is JITed and the pointer is stored to '_f2', the calls f1 -> f2 -> f3
  and the returns will be seamless. Without JIT, the __sk_run_filter()
  interpreter needs to be used to call into f2.
 705
 706  For practical reasons all eBPF programs have only one argument 'ctx' which is
 707  already placed into R1 (e.g. on __sk_run_filter() startup) and the programs
 708  can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
 709  are currently not supported, but these restrictions can be lifted if necessary
 710  in the future.
 711
  On 64-bit architectures all registers map to HW registers one to one. For
  example, x86_64 JIT compiler can map them as ...
 714
 715    R0 - rax
 716    R1 - rdi
 717    R2 - rsi
 718    R3 - rdx
 719    R4 - rcx
 720    R5 - r8
 721    R6 - rbx
 722    R7 - r13
 723    R8 - r14
 724    R9 - r15
 725    R10 - rbp
 726
 727  ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
 728  and rbx, r12 - r15 are callee saved.
 729
 730  Then the following internal BPF pseudo-program:
 731
 732    bpf_mov R6, R1 /* save ctx */
 733    bpf_mov R2, 2
 734    bpf_mov R3, 3
 735    bpf_mov R4, 4
 736    bpf_mov R5, 5
 737    bpf_call foo
 738    bpf_mov R7, R0 /* save foo() return value */
 739    bpf_mov R1, R6 /* restore ctx for next call */
 740    bpf_mov R2, 6
 741    bpf_mov R3, 7
 742    bpf_mov R4, 8
 743    bpf_mov R5, 9
 744    bpf_call bar
 745    bpf_add R0, R7
 746    bpf_exit
 747
  After JIT, the x86_64 code may look like:

    push %rbp
    mov %rsp,%rbp
    sub $0x228,%rsp
    mov %rbx,-0x228(%rbp)
    mov %r13,-0x220(%rbp)
    mov %rdi,%rbx
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq foo
    mov %rax,%r13
    mov %rbx,%rdi
    mov $0x6,%esi
    mov $0x7,%edx
    mov $0x8,%ecx
    mov $0x9,%r8d
    callq bar
    add %r13,%rax
    mov -0x228(%rbp),%rbx
    mov -0x220(%rbp),%r13
    leaveq
    retq
 
  Which is in this example equivalent in C to:

    u64 bpf_filter(u64 ctx)
    {
        return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    }

  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
  registers and place their return value into '%rax' which is R0 in eBPF.
  Prologue and epilogue are emitted by JIT and are implicit in the
  interpreter. R0-R5 are scratch registers, so an eBPF program needs to
  preserve them across calls as defined by the calling convention.
 
  For example the following program is invalid:

    bpf_mov R1, 1
    bpf_call foo
    bpf_mov R0, R1
    bpf_exit

  After the call the registers R1-R5 contain junk values and cannot be read.
  In the future an eBPF verifier can be used to validate internal BPF programs.
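 
  The rule can be illustrated with a toy checker. This is only a sketch: the
  'toy_insn' encoding and toy_check() are hypothetical names, not the kernel's
  instruction format or the future verifier, and only straight-line code is
  handled. It assumes that on entry only R1 (ctx) and R10 hold defined values
  and that a call leaves R0 defined and R1-R5 unreadable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy encoding, not the kernel's instruction format. */
enum toy_op { TOY_MOV_IMM, TOY_MOV_REG, TOY_CALL, TOY_EXIT };

struct toy_insn {
    enum toy_op op;
    int dst;    /* destination register, 0..10 */
    int src;    /* source register, used by TOY_MOV_REG only */
};

/*
 * Walk a straight-line program and reject any read of a register that
 * does not hold a defined value. Assumption: on entry only R1 (ctx) and
 * R10 are readable. A call clobbers R1-R5 and defines R0.
 */
static bool toy_check(const struct toy_insn *prog, size_t len)
{
    bool readable[11] = { false };
    size_t i;
    int r;

    readable[1] = true;
    readable[10] = true;

    for (i = 0; i < len; i++) {
        assert(prog[i].dst >= 0 && prog[i].dst <= 10);
        assert(prog[i].src >= 0 && prog[i].src <= 10);

        switch (prog[i].op) {
        case TOY_MOV_IMM:
            readable[prog[i].dst] = true;
            break;
        case TOY_MOV_REG:
            if (!readable[prog[i].src])
                return false;   /* read of a junk register */
            readable[prog[i].dst] = true;
            break;
        case TOY_CALL:
            for (r = 1; r <= 5; r++)
                readable[r] = false;
            readable[0] = true;
            break;
        case TOY_EXIT:
            return readable[0]; /* R0 carries the return value */
        }
    }
    return false;   /* fell off the end without an exit */
}
```

  Fed with the invalid program above, toy_check() returns false at
  'bpf_mov R0, R1'. Saving ctx in R6 before the call and restoring R1 from R6
  afterwards makes the same sequence pass.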
 
Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Both original BPF and the new format use two-operand instructions,
which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.

The input context pointer for invoking the interpreter function is generic;
its content is defined by a specific use case. For seccomp, register R1 points
to seccomp_data; for converted BPF filters, R1 points to a skb.
 
A program that is translated internally consists of the following elements:

  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32
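 
The two layouts can be written down as C structures. The following is a sketch
only: the struct and function names are hypothetical, the signedness of 'off'
and 'imm' is an assumption of the sketch, and bit-fields on uint8_t rely on
GCC/Clang behaviour rather than strict ISO C.

```c
#include <assert.h>
#include <stdint.h>

/* Classic layout: op:16, jt:8, jf:8, k:32 */
struct classic_insn_sketch {
    uint16_t code;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

/* New layout: op:8, dst_reg:4, src_reg:4, off:16, imm:32 */
struct ebpf_insn_sketch {
    uint8_t code;
    uint8_t dst_reg:4;
    uint8_t src_reg:4;
    int16_t off;    /* assumed signed */
    int32_t imm;    /* assumed signed */
};

/* Both formats occupy 8 bytes per instruction. */
static_assert(sizeof(struct classic_insn_sketch) == 8, "classic insn size");
static_assert(sizeof(struct ebpf_insn_sketch) == 8, "new insn size");

/*
 * Encode 'bpf_mov dst, src' using the opcode values listed in the opcode
 * encoding section: BPF_MOV (0xb0) | BPF_X (0x08) | BPF_ALU64 (0x07).
 */
static struct ebpf_insn_sketch sk_mov64_reg(uint8_t dst, uint8_t src)
{
    struct ebpf_insn_sketch insn = {
        .code = 0xb0 | 0x08 | 0x07,
        .dst_reg = dst,
        .src_reg = src,
        .off = 0,
        .imm = 0,
    };

    return insn;
}
```

With these definitions, sk_mov64_reg(6, 1) corresponds to 'bpf_mov R6, R1'.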
 
So far 87 internal BPF instructions were implemented. The 8-bit 'op' opcode
field has room for new instructions. Some of them may use 16/24/32 byte
encoding. New instructions must be a multiple of 8 bytes to preserve backward
compatibility.

Internal BPF is a general purpose RISC instruction set. Not every register and
every instruction is used during translation from original BPF to the new
format. For example, socket filters are not using the 'exclusive add'
instruction, but tracing filters may do so to maintain counters of events.
Register R9 is not used by socket filters either, but more complex filters may
run out of registers and would have to resort to spill/fill to stack.
 
Internal BPF can be used as a generic assembler for last step performance
optimizations; socket filters and seccomp are using it as an assembler. Tracing
filters may use it as an assembler to generate code from the kernel. In-kernel
usage may not be bounded by security considerations, since generated internal
BPF code may be optimizing an internal code path and is not exposed to user
space. Safety of internal BPF can come from a verifier (TBD). In such use cases
as described, it may be used as a safe instruction set.

Just like the original BPF, the new format runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the program
can be determined in two steps: the first step does a depth-first search to
disallow loops and performs other CFG validation; the second step starts from
the first insn and descends all possible paths. It simulates the execution of
every insn and observes the state change of registers and stack.
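 
The first step can be sketched as a depth-first search that rejects any edge
leading back to an instruction on the current path. This is an illustration
under stated assumptions, not the kernel's code: 'toy_node', toy_dfs() and
toy_cfg_is_loop_free() are hypothetical, and each instruction is reduced to at
most two successors (fall-through and jump target).

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_INSNS 4096

/* Hypothetical CFG node: up to two successors, -1 means "none". */
struct toy_node {
    int next;   /* fall-through successor */
    int target; /* jump target */
};

enum { UNSEEN, ON_PATH, DONE };

static bool toy_dfs(const struct toy_node *cfg, int len, int i, char *state)
{
    int succ[2] = { cfg[i].next, cfg[i].target };
    int k;

    state[i] = ON_PATH;
    for (k = 0; k < 2; k++) {
        int s = succ[k];

        if (s < 0)
            continue;
        if (s >= len || state[s] == ON_PATH)
            return false;   /* out of range, or a back-edge (loop) */
        if (state[s] == UNSEEN && !toy_dfs(cfg, len, s, state))
            return false;
    }
    state[i] = DONE;
    return true;
}

/* Returns true if the CFG reachable from insn 0 contains no loop. */
static bool toy_cfg_is_loop_free(const struct toy_node *cfg, int len)
{
    char state[TOY_MAX_INSNS] = { UNSEEN };

    assert(len > 0 && len <= TOY_MAX_INSNS);
    return toy_dfs(cfg, len, 0, state);
}
```

A forward-only graph such as 0 -> {1, 2}, 1 -> 2 passes, while a graph in which
insn 1 jumps back to insn 0 is rejected.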
 
eBPF opcode encoding
--------------------

eBPF is reusing most of the opcode encoding from classic to simplify conversion
of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
field is divided into three parts:

  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                      (LSB)

Three LSB bits store instruction class which is one of:

  Classic BPF classes:    eBPF classes:

  BPF_LD    0x00          BPF_LD    0x00
  BPF_LDX   0x01          BPF_LDX   0x01
  BPF_ST    0x02          BPF_ST    0x02
  BPF_STX   0x03          BPF_STX   0x03
  BPF_ALU   0x04          BPF_ALU   0x04
  BPF_JMP   0x05          BPF_JMP   0x05
  BPF_RET   0x06          [ class 6 unused, for future if needed ]
  BPF_MISC  0x07          BPF_ALU64 0x07
 
When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...

  BPF_K     0x00
  BPF_X     0x08

 * in classic BPF, this means:

  BPF_SRC(code) == BPF_X - use register X as source operand
  BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

 * in eBPF, this means:

  BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
  BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

... and four MSB bits store operation code.
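 
The three-way split of the 'code' byte can be expressed with simple masks. The
sketch below uses hypothetical SK_* names so that it stays self-contained and
does not clash with the BPF_CLASS(), BPF_SRC() and BPF_OP() macros referred to
in the text; the constants are copied from the tables above.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks for the 8-bit 'code' of arithmetic and jump instructions. */
#define SK_CLASS(code)  ((code) & 0x07) /* 3 LSB bits: instruction class */
#define SK_SRC(code)    ((code) & 0x08) /* 4th bit: source operand */
#define SK_OP(code)     ((code) & 0xf0) /* 4 MSB bits: operation code */

/* A few values from the tables in this section. */
#define SK_ALU  0x04
#define SK_JMP  0x05
#define SK_K    0x00
#define SK_X    0x08

/* Combine the three fields into one 'code' byte. */
static uint8_t sk_make_code(uint8_t op, uint8_t src, uint8_t cls)
{
    /* Each argument must only use the bits of its own field. */
    assert((op & ~0xf0) == 0);
    assert((src & ~0x08) == 0);
    assert((cls & ~0x07) == 0);
    return (uint8_t)(op | src | cls);
}
```

For instance, sk_make_code(0x00, SK_X, SK_ALU) yields 0x0c, and applying the
three masks to that value gives back class 0x04, source 0x08 and operation 0x00.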
 
If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:

  BPF_ADD   0x00
  BPF_SUB   0x10
  BPF_MUL   0x20
  BPF_DIV   0x30
  BPF_OR    0x40
  BPF_AND   0x50
  BPF_LSH   0x60
  BPF_RSH   0x70
  BPF_NEG   0x80
  BPF_MOD   0x90
  BPF_XOR   0xa0
  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
  BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:

  BPF_JA    0x00
  BPF_JEQ   0x10
  BPF_JGT   0x20
  BPF_JGE   0x30
  BPF_JSET  0x40
  BPF_JNE   0x50  /* eBPF only: jump != */
  BPF_JSGT  0x60  /* eBPF only: signed '>' */
  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
  BPF_CALL  0x80  /* eBPF only: function call */
  BPF_EXIT  0x90  /* eBPF only: function return */
 
So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.

Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
exactly the same operations as BPF_ALU, but with 64-bit wide operands
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
dst_reg = dst_reg + src_reg
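 
The difference between the two classes can be shown with plain C arithmetic.
This is a sketch with hypothetical function names; it assumes that the 32-bit
result is zero-extended when written back into the 64-bit register, which the
text above does not spell out.

```c
#include <assert.h>
#include <stdint.h>

/* BPF_ADD | BPF_X | BPF_ALU: dst_reg = (u32) dst_reg + (u32) src_reg */
static uint64_t sk_add32(uint64_t dst, uint64_t src)
{
    /* Assumption: the upper 32 bits of the result are zero. */
    return (uint32_t)((uint32_t)dst + (uint32_t)src);
}

/* BPF_ADD | BPF_X | BPF_ALU64: dst_reg = dst_reg + src_reg */
static uint64_t sk_add64(uint64_t dst, uint64_t src)
{
    return dst + src;
}

/* Usage: the 32-bit form wraps at 2^32, the 64-bit form does not. */
void sk_add_demo(void)
{
    assert(sk_add32(0xffffffffULL, 1) == 0);
    assert(sk_add64(0xffffffffULL, 1) == 0x100000000ULL);
}
```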
 
Classic BPF wastes the whole BPF_RET class to represent a single 'ret'
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
in eBPF means function exit only. The eBPF program needs to store return
value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently
unused and reserved for future use.
 
For load and store instructions the 8-bit 'code' field is divided as:

  +--------+--------+-------------------+
  | 3 bits | 2 bits |   3 bits          |
  |  mode  |  size  | instruction class |
  +--------+--------+-------------------+
  (MSB)                             (LSB)

Size modifier is one of ...

  BPF_W   0x00    /* word */
  BPF_H   0x08    /* half word */
  BPF_B   0x10    /* byte */
  BPF_DW  0x18    /* eBPF only, double word */

... which encodes size of load/store operation:

 B  - 1 byte
 H  - 2 byte
 W  - 4 byte
 DW - 8 byte (eBPF only)
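 
As a small illustration, the size bits can be turned into an access width in
bytes. SK_SIZE() and sk_access_bytes() are hypothetical names used only for
this sketch; the constants come from the table above.

```c
#include <assert.h>
#include <stdint.h>

#define SK_SIZE(code)   ((code) & 0x18) /* the 2 size bits of 'code' */

/* Returns the access width in bytes for a load/store 'code'. */
static int sk_access_bytes(uint8_t code)
{
    switch (SK_SIZE(code)) {
    case 0x00:
        return 4;   /* BPF_W */
    case 0x08:
        return 2;   /* BPF_H */
    case 0x10:
        return 1;   /* BPF_B */
    default:
        assert(SK_SIZE(code) == 0x18);
        return 8;   /* BPF_DW, eBPF only */
    }
}
```

For example, BPF_MEM | BPF_DW | BPF_STX is 0x60 | 0x18 | 0x03 = 0x7b, for which
sk_access_bytes() returns 8.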
 
Mode modifier is one of:

  BPF_IMM  0x00  /* classic BPF only, reserved in eBPF */
  BPF_ABS  0x20
  BPF_IND  0x40
  BPF_MEM  0x60
  BPF_LEN  0x80  /* classic BPF only, reserved in eBPF */
  BPF_MSH  0xa0  /* classic BPF only, reserved in eBPF */
  BPF_XADD 0xc0  /* eBPF only, exclusive add */
 
eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
(BPF_IND | <size> | BPF_LD) which are used to access packet data.

They had to be carried over from classic BPF to keep strong performance of
socket filters running in the eBPF interpreter. These instructions can only
be used when the interpreter context is a pointer to 'struct sk_buff' and
have seven implicit operands. Register R6 is an implicit input that must
contain a pointer to the sk_buff. Register R0 is an implicit output which
contains the data fetched from the packet. Registers R1-R5 are scratch
registers and must not be used to store data across BPF_ABS | BPF_LD or
BPF_IND | BPF_LD instructions.

These instructions have an implicit program exit condition as well. When an
eBPF program is trying to access data beyond the packet boundary, the
interpreter will abort the execution of the program. JIT compilers
therefore must preserve this property. The src_reg and imm32 fields are
explicit inputs to these instructions.

For example:

  BPF_IND | BPF_W | BPF_LD means:

    R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
    and R1 - R5 were scratched.
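 
The load and its implicit exit condition can be modelled on a flat byte buffer.
This is a simplified sketch, not the interpreter's code: sk_ld_ind_w() is a
hypothetical helper, the buffer stands in for skb->data, src_reg is taken as a
32-bit value, and any negative offset is simply treated as out of bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of BPF_IND | BPF_W | BPF_LD. On success the 4 bytes at
 * data + src_reg + imm32 are read in network byte order into *r0.
 * Returns false when the access would cross the packet boundary,
 * which stands for the implicit program exit.
 */
static bool sk_ld_ind_w(const uint8_t *data, size_t len,
                        uint32_t src_reg, int32_t imm32, uint64_t *r0)
{
    int64_t off = (int64_t)src_reg + (int64_t)imm32;

    assert(data != NULL && r0 != NULL);

    if (off < 0 || (uint64_t)off + 4 > len)
        return false;

    /* Equivalent to ntohl() applied to the 4 bytes at 'off'. */
    *r0 = ((uint32_t)data[off] << 24) |
          ((uint32_t)data[off + 1] << 16) |
          ((uint32_t)data[off + 2] << 8) |
          (uint32_t)data[off + 3];
    return true;
}
```

With the 5-byte packet de ad be ef 01, a load at src_reg = 1, imm32 = 0 yields
R0 = 0xadbeef01, while a load at offset 2 fails because it would read past the
end of the packet.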
 
Unlike classic BPF instruction set, eBPF has generic load/store operations:

BPF_MEM | <size> | BPF_STX:  *(size *) (dst_reg + off) = src_reg
BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32
BPF_MEM | <size> | BPF_LDX:  dst_reg = *(size *) (src_reg + off)
BPF_XADD | BPF_W  | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg

Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
2 byte atomic increments are not supported.
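 
In C terms these operations could look roughly as follows. The helpers are
hypothetical and model registers as plain C values; the atomic add uses the
GCC/Clang __atomic_fetch_add() builtin with sequentially consistent ordering,
which is a choice made for this sketch and not a statement about the kernel's
implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* BPF_MEM | BPF_W | BPF_STX:  *(u32 *) (dst_reg + off) = src_reg */
static void sk_stx_w(uint8_t *dst_reg, int16_t off, uint64_t src_reg)
{
    uint32_t v = (uint32_t)src_reg; /* only the low 4 bytes are stored */

    memcpy(dst_reg + off, &v, sizeof(v));
}

/* BPF_MEM | BPF_W | BPF_LDX:  dst_reg = *(u32 *) (src_reg + off) */
static uint64_t sk_ldx_w(const uint8_t *src_reg, int16_t off)
{
    uint32_t v;

    memcpy(&v, src_reg + off, sizeof(v));
    return v;
}

/* BPF_XADD | BPF_DW | BPF_STX: atomic *(u64 *) counter += src_reg */
static void sk_xadd_dw(uint64_t *counter, uint64_t src_reg)
{
    __atomic_fetch_add(counter, src_reg, __ATOMIC_SEQ_CST);
}

/* Usage: store and reload a word, then bump a counter atomically. */
void sk_mem_demo(void)
{
    uint8_t stack[16] = { 0 };
    uint64_t counter = 40;

    sk_stx_w(stack, 8, 0x1122334455667788ULL);
    assert(sk_ldx_w(stack, 8) == 0x55667788ULL);

    sk_xadd_dw(&counter, 2);
    assert(counter == 42);
}
```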
 
Testing
-------

Next to the BPF toolchain, the kernel also ships a test module that contains
various test cases for classic and internal BPF that can be executed against
the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
enabled via Kconfig:

  CONFIG_TEST_BPF=m

After the module has been built and installed, the test suite can be executed
via insmod or modprobe against 'test_bpf' module. Results of the test cases
including timings in nsec can be found in the kernel log (dmesg).

Misc
----

Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
SECCOMP-BPF kernel fuzzing.

Written by
----------

The document was written in the hope that it is found useful and in order
to give potential BPF hackers or security auditors a better overview of
the underlying architecture.

Jay Schulist <jschlst@samba.org>
Daniel Borkmann <dborkman@redhat.com>
Alexei Starovoitov <ast@plumgrid.com>