
Merge branch 'filter-next'

Daniel Borkmann says:

====================
BPF updates

We sat down and have heavily reworked the whole previous patchset
from v10 [1] to address all comments/concerns. This patchset therefore
*replaces* the internal BPF interpreter with the new layout as
discussed in [1], and migrates some exotic callers to properly use the
BPF API for a transparent upgrade. All other callers that already use
the BPF API the way it should be used need no further changes to run
on the new internals. We also removed the sysctl knob entirely and do
not expose any structure to userland, so that implementation details
only reside in kernel space. Since we are replacing the interpreter,
we had to migrate seccomp in the same patch as the interpreter so as
not to break anything. When attaching a new filter, the flow can be
described as follows: i) test if the JIT compiler is enabled and can
compile the user BPF, ii) if so, then go for it, iii) if not, then
transparently migrate the filter into the new representation, and run
it in the interpreter. Also, we have removed the JIT flag from the len
attribute and made that change the initial patch in this series, as
Pablo suggested in his last feedback, thanks.

We did extensive testing of BPF and seccomp on the new interpreter
itself as well as on the user ABIs and could not find any issues; the
performance numbers posted in patch 8 are also still the same.

Please find more details in the patches themselves.

For all the previous history from v1 to v10, see [1]. We decided to
drop the v11 numbering since we pedantically reworked the set, but of
course included all previous feedback.

v3 -> v4:
- Applied feedback from Dave regarding swap insns
- Rebased on net-next
v2 -> v3:
- Rebased to latest net-next (i.e. w/ rxhash->hash rename)
- Fixed patch 8/9 commit message/doc as suggested by Dave
- Rest is unchanged
v1 -> v2:
- Rebased to latest net-next
- Added static to ptp_filter as suggested by Dave
- Fixed a typo in patch 8's commit message
- Rest unchanged

Thanks!

[1] http://thread.gmane.org/gmane.linux.kernel/1665858
====================
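
For illustration, the attach flow i)-iii) described in the message above
maps onto the new API roughly as follows. This is a simplified,
hypothetical sketch, not verbatim code from the patches; allocation
sizing and error handling are elided, and the helper name is made up:

    /* Hedged sketch of the attach flow: bpf_jit_compile() marks the
     * filter as jited on success; otherwise the classic program is
     * transparently converted and run by the new interpreter.
     */
    static void attach_flow_sketch(struct sk_filter *fp,
                                   struct sock_filter *prog, int len)
    {
            int new_len;

            bpf_jit_compile(fp);            /* i) + ii): JIT if possible */
            if (!fp->jited)                 /* iii): migrate + interpret */
                    sk_convert_filter(prog, len, fp->insnsi, &new_len);
    }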

Signed-off-by: David S. Miller <davem@davemloft.net>

+1657 -535
+125
Documentation/networking/filter.txt
···
 For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
 toolchain for developing and testing the kernel's JIT compiler.
 
+BPF kernel internals
+--------------------
+Internally, for the kernel interpreter, a different BPF instruction set
+format with similar underlying principles from BPF described in previous
+paragraphs is being used. However, the instruction set format is modelled
+closer to the underlying architecture to mimic native instruction sets, so
+that a better performance can be achieved (more details later).
+
+It is designed to be JITed with one to one mapping, which can also open up
+the possibility for GCC/LLVM compilers to generate optimized BPF code through
+a BPF backend that performs almost as fast as natively compiled code.
+
+The new instruction set was originally designed with the possible goal in
+mind to write programs in "restricted C" and compile into BPF with a optional
+GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
+minimal performance overhead over two steps, that is, C -> BPF -> native code.
+
+Currently, the new format is being used for running user BPF programs, which
+includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
+team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, PTP dissector/classifier, and much more. They are all internally
+converted by the kernel into the new instruction set representation and run
+in the extended interpreter. For in-kernel handlers, this all works
+transparently by using sk_unattached_filter_create() for setting up the
+filter, resp. sk_unattached_filter_destroy() for destroying it. The macro
+SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to
+run the filter. 'filter' is a pointer to struct sk_filter that we got from
+sk_unattached_filter_create(), and 'ctx' the given context (e.g. skb pointer).
+All constraints and restrictions from sk_chk_filter() apply before a
+conversion to the new layout is being done behind the scenes!
+
+Currently, for JITing, the user BPF format is being used and current BPF JIT
+compilers reused whenever possible. In other words, we do not (yet!) perform
+a JIT compilation in the new layout, however, future work will successively
+migrate traditional JIT compilers into the new instruction format as well, so
+that they will profit from the very same benefits. Thus, when speaking about
+JIT in the following, a JIT compiler (TBD) for the new instruction format is
+meant in this context.
+
+Some core changes of the new internal format:
+
+- Number of registers increase from 2 to 10:
+
+  The old format had two registers A and X, and a hidden frame pointer. The
+  new layout extends this to be 10 internal registers and a read-only frame
+  pointer. Since 64-bit CPUs are passing arguments to functions via registers
+  the number of args from BPF program to in-kernel function is restricted
+  to 5 and one register is used to accept return value from an in-kernel
+  function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
+  sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
+  registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+  Therefore, BPF calling convention is defined as:
+
+    * R0        - return value from in-kernel function
+    * R1 - R5   - arguments from BPF program to in-kernel function
+    * R6 - R9   - callee saved registers that in-kernel function will preserve
+    * R10       - read-only frame pointer to access stack
+
+  Thus, all BPF registers map one to one to HW registers on x86_64, aarch64,
+  etc, and BPF calling convention maps directly to ABIs used by the kernel on
+  64-bit architectures.
+
+  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
+  and may let more complex programs to be interpreted.
+
+  R0 - R5 are scratch registers and BPF program needs spill/fill them if
+  necessary across calls. Note that there is only one BPF program (== one BPF
+  main routine) and it cannot call other BPF functions, it can only call
+  predefined in-kernel functions, though.
+
+- Register width increases from 32-bit to 64-bit:
+
+  Still, the semantics of the original 32-bit ALU operations are preserved
+  via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower
+  subregisters that zero-extend into 64-bit if they are being written to.
+  That behavior maps directly to x86_64 and arm64 subregister definition, but
+  makes other JITs more difficult.
+
+  32-bit architectures run 64-bit internal BPF programs via interpreter.
+  Their JITs may convert BPF programs that only use 32-bit subregisters into
+  native instruction set and let the rest being interpreted.
+
+  Operation is 64-bit, because on 64-bit architectures, pointers are also
+  64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
+  so 32-bit BPF registers would otherwise require to define register-pair
+  ABI, thus, there won't be able to use a direct BPF register to HW register
+  mapping and JIT would need to do combine/split/move operations for every
+  register in and out of the function, which is complex, bug prone and slow.
+  Another reason is the use of atomic 64-bit counters.
+
+- Conditional jt/jf targets replaced with jt/fall-through:
+
+  While the original design has constructs such as "if (cond) jump_true;
+  else jump_false;", they are being replaced into alternative constructs like
+  "if (cond) jump_true; /* else fall-through */".
+
+- Introduces bpf_call insn and register passing convention for zero overhead
+  calls from/to other kernel functions:
+
+  After a kernel function call, R1 - R5 are reset to unreadable and R0 has a
+  return type of the function. Since R6 - R9 are callee saved, their state is
+  preserved across the call.
+
+  Also in the new design, BPF is limited to 4096 insns, which means that any
+  program will terminate quickly and will only call a fixed number of kernel
+  functions. Original BPF and the new format are two operand instructions,
+  which helps to do one-to-one mapping between BPF insn and x86 insn during JIT.
+
+  The input context pointer for invoking the interpreter function is generic,
+  its content is defined by a specific use case. For seccomp register R1 points
+  to seccomp_data, for converted BPF filters R1 points to a skb.
+
+A program, that is translated internally consists of the following elements:
+
+  op:16, jt:8, jf:8, k:32    ==>    op:8, a_reg:4, x_reg:4, off:16, imm:32
+
+Just like the original BPF, the new format runs within a controlled environment,
+is deterministic and the kernel can easily prove that. The safety of the program
+can be determined in two steps: first step does depth-first-search to disallow
+loops and other CFG validation; second step starts from the first insn and
+descends all possible paths. It simulates execution of every insn and observes
+the state change of registers and stack.
+
 Misc
 ----
···
 
 Jay Schulist <jschlst@samba.org>
 Daniel Borkmann <dborkman@redhat.com>
+Alexei Starovoitov <ast@plumgrid.com>
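
As an illustration of the in-kernel API documented above, namely
sk_unattached_filter_create(), SK_RUN_FILTER() and
sk_unattached_filter_destroy(), a minimal caller might look like the
following sketch. The trivial "accept all" program and the helper name
are made up for this example; only the API calls are the documented ones:

    #include <linux/filter.h>

    /* Made-up example program: classic BPF "return 0xffffffff",
     * i.e. accept the whole packet.
     */
    static struct sock_filter accept_all[] = {
            BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
    };

    static unsigned int run_example(struct sk_buff *skb)
    {
            struct sock_fprog fprog = {
                    .len    = ARRAY_SIZE(accept_all),
                    .filter = accept_all,
            };
            struct sk_filter *fp;
            unsigned int res;

            if (sk_unattached_filter_create(&fp, &fprog))
                    return 0;

            /* Dispatches to the JIT image or the new interpreter. */
            res = SK_RUN_FILTER(fp, skb);
            sk_unattached_filter_destroy(fp);

            return res;
    }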
+2 -1
arch/arm/net/bpf_jit_32.c
···
     bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
 
     fp->bpf_func = (void *)ctx.target;
+    fp->jited = 1;
 out:
     kfree(ctx.offsets);
     return;
···
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-    if (fp->bpf_func != sk_run_filter)
+    if (fp->jited)
         module_free(NULL, fp->bpf_func);
     kfree(fp);
 }
+2 -1
arch/powerpc/net/bpf_jit_comp.c
···
         ((u64 *)image)[0] = (u64)code_base;
         ((u64 *)image)[1] = local_paca->kernel_toc;
         fp->bpf_func = (void *)image;
+        fp->jited = 1;
     }
 out:
     kfree(addrs);
···
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-    if (fp->bpf_func != sk_run_filter)
+    if (fp->jited)
         module_free(NULL, fp->bpf_func);
     kfree(fp);
 }
+4 -1
arch/s390/net/bpf_jit_comp.c
···
     if (jit.start) {
         set_memory_ro((unsigned long)header, header->pages);
         fp->bpf_func = (void *) jit.start;
+        fp->jited = 1;
     }
 out:
     kfree(addrs);
···
     unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
     struct bpf_binary_header *header = (void *)addr;
 
-    if (fp->bpf_func == sk_run_filter)
+    if (!fp->jited)
         goto free_filter;
+
     set_memory_rw(addr, header->pages);
     module_free(NULL, header);
+
 free_filter:
     kfree(fp);
 }
+2 -1
arch/sparc/net/bpf_jit_comp.c
···
     if (image) {
         bpf_flush_icache(image, image + proglen);
         fp->bpf_func = (void *)image;
+        fp->jited = 1;
     }
 out:
     kfree(addrs);
···
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-    if (fp->bpf_func != sk_run_filter)
+    if (fp->jited)
         module_free(NULL, fp->bpf_func);
     kfree(fp);
 }
+2 -1
arch/x86/net/bpf_jit_comp.c
···
         bpf_flush_icache(header, image + proglen);
         set_memory_ro((unsigned long)header, header->pages);
         fp->bpf_func = (void *)image;
+        fp->jited = 1;
     }
 out:
     kfree(addrs);
···
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-    if (fp->bpf_func != sk_run_filter) {
+    if (fp->jited) {
         INIT_WORK(&fp->work, bpf_jit_free_deferred);
         schedule_work(&fp->work);
     } else {
+41 -20
drivers/isdn/i4l/isdn_ppp.c
···
     is->slcomp = NULL;
 #endif
 #ifdef CONFIG_IPPP_FILTER
-    kfree(is->pass_filter);
-    is->pass_filter = NULL;
-    kfree(is->active_filter);
-    is->active_filter = NULL;
+    if (is->pass_filter) {
+        sk_unattached_filter_destroy(is->pass_filter);
+        is->pass_filter = NULL;
+    }
+
+    if (is->active_filter) {
+        sk_unattached_filter_destroy(is->active_filter);
+        is->active_filter = NULL;
+    }
 #endif
 
 /* TODO: if this was the previous master: link the stuff to the new master */
···
 #ifdef CONFIG_IPPP_FILTER
     case PPPIOCSPASS:
     {
+        struct sock_fprog fprog;
         struct sock_filter *code;
-        int len = get_filter(argp, &code);
+        int err, len = get_filter(argp, &code);
+
         if (len < 0)
             return len;
-        kfree(is->pass_filter);
-        is->pass_filter = code;
-        is->pass_len = len;
-        break;
+
+        fprog.len = len;
+        fprog.filter = code;
+
+        if (is->pass_filter)
+            sk_unattached_filter_destroy(is->pass_filter);
+        err = sk_unattached_filter_create(&is->pass_filter, &fprog);
+        kfree(code);
+
+        return err;
     }
     case PPPIOCSACTIVE:
     {
+        struct sock_fprog fprog;
         struct sock_filter *code;
-        int len = get_filter(argp, &code);
+        int err, len = get_filter(argp, &code);
+
         if (len < 0)
             return len;
-        kfree(is->active_filter);
-        is->active_filter = code;
-        is->active_len = len;
-        break;
+
+        fprog.len = len;
+        fprog.filter = code;
+
+        if (is->active_filter)
+            sk_unattached_filter_destroy(is->active_filter);
+        err = sk_unattached_filter_create(&is->active_filter, &fprog);
+        kfree(code);
+
+        return err;
     }
 #endif /* CONFIG_IPPP_FILTER */
     default:
···
     }
 
     if (is->pass_filter
-        && sk_run_filter(skb, is->pass_filter) == 0) {
+        && SK_RUN_FILTER(is->pass_filter, skb) == 0) {
         if (is->debug & 0x2)
             printk(KERN_DEBUG "IPPP: inbound frame filtered.\n");
         kfree_skb(skb);
         return;
     }
     if (!(is->active_filter
-          && sk_run_filter(skb, is->active_filter) == 0)) {
+          && SK_RUN_FILTER(is->active_filter, skb) == 0)) {
         if (is->debug & 0x2)
             printk(KERN_DEBUG "IPPP: link-active filter: resetting huptimer.\n");
         lp->huptimer = 0;
···
     }
 
     if (ipt->pass_filter
-        && sk_run_filter(skb, ipt->pass_filter) == 0) {
+        && SK_RUN_FILTER(ipt->pass_filter, skb) == 0) {
         if (ipt->debug & 0x4)
             printk(KERN_DEBUG "IPPP: outbound frame filtered.\n");
         kfree_skb(skb);
         goto unlock;
     }
     if (!(ipt->active_filter
-          && sk_run_filter(skb, ipt->active_filter) == 0)) {
+          && SK_RUN_FILTER(ipt->active_filter, skb) == 0)) {
         if (ipt->debug & 0x4)
             printk(KERN_DEBUG "IPPP: link-active filter: resetting huptimer.\n");
         lp->huptimer = 0;
···
     }
 
     drop |= is->pass_filter
-        && sk_run_filter(skb, is->pass_filter) == 0;
+        && SK_RUN_FILTER(is->pass_filter, skb) == 0;
     drop |= is->active_filter
-        && sk_run_filter(skb, is->active_filter) == 0;
+        && SK_RUN_FILTER(is->active_filter, skb) == 0;
 
     skb_push(skb, IPPP_MAX_HEADER - 4);
     return drop;
+1 -10
drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
···
                     int data);
 static void pch_gbe_set_multi(struct net_device *netdev);
 
-static struct sock_filter ptp_filter[] = {
-    PTP_FILTER
-};
-
 static int pch_ptp_match(struct sk_buff *skb, u16 uid_hi, u32 uid_lo, u16 seqid)
 {
     u8 *data = skb->data;
···
     u16 *hi, *id;
     u32 lo;
 
-    if (sk_run_filter(skb, ptp_filter) == PTP_CLASS_NONE)
+    if (ptp_classify_raw(skb) == PTP_CLASS_NONE)
         return 0;
 
     offset = ETH_HLEN + IPV4_HLEN(data) + UDP_HLEN;
···
 
     adapter->ptp_pdev = pci_get_bus_and_slot(adapter->pdev->bus->number,
                                              PCI_DEVFN(12, 4));
-    if (ptp_filter_init(ptp_filter, ARRAY_SIZE(ptp_filter))) {
-        dev_err(&pdev->dev, "Bad ptp filter\n");
-        ret = -EINVAL;
-        goto err_free_netdev;
-    }
 
     netdev->netdev_ops = &pch_gbe_netdev_ops;
     netdev->watchdog_timeo = PCH_GBE_WATCHDOG_PERIOD;
+1 -9
drivers/net/ethernet/ti/cpts.c
···
 
 #ifdef CONFIG_TI_CPTS
 
-static struct sock_filter ptp_filter[] = {
-    PTP_FILTER
-};
-
 #define cpts_read32(c, r)   __raw_readl(&c->reg->r)
 #define cpts_write32(c, v, r)   __raw_writel(v, &c->reg->r)
 
···
     u64 ns = 0;
     struct cpts_event *event;
     struct list_head *this, *next;
-    unsigned int class = sk_run_filter(skb, ptp_filter);
+    unsigned int class = ptp_classify_raw(skb);
     unsigned long flags;
     u16 seqid;
     u8 mtype;
···
     int err, i;
     unsigned long flags;
 
-    if (ptp_filter_init(ptp_filter, ARRAY_SIZE(ptp_filter))) {
-        pr_err("cpts: bad ptp filter\n");
-        return -EINVAL;
-    }
     cpts->info = cpts_info;
     cpts->clock = ptp_clock_register(&cpts->info, dev);
     if (IS_ERR(cpts->clock)) {
+1 -10
drivers/net/ethernet/xscale/ixp4xx_eth.c
···
 static struct port *npe_port_tab[MAX_NPES];
 static struct dma_pool *dma_pool;
 
-static struct sock_filter ptp_filter[] = {
-    PTP_FILTER
-};
-
 static int ixp_ptp_match(struct sk_buff *skb, u16 uid_hi, u32 uid_lo, u16 seqid)
 {
     u8 *data = skb->data;
···
     u16 *hi, *id;
     u32 lo;
 
-    if (sk_run_filter(skb, ptp_filter) != PTP_CLASS_V1_IPV4)
+    if (ptp_classify_raw(skb) != PTP_CLASS_V1_IPV4)
         return 0;
 
     offset = ETH_HLEN + IPV4_HLEN(data) + UDP_HLEN;
···
     u32 regs_phys;
     char phy_id[MII_BUS_ID_SIZE + 3];
     int err;
-
-    if (ptp_filter_init(ptp_filter, ARRAY_SIZE(ptp_filter))) {
-        pr_err("ixp4xx_eth: bad ptp filter\n");
-        return -EINVAL;
-    }
 
     if (!(dev = alloc_etherdev(sizeof(struct port))))
         return -ENOMEM;
+41 -19
drivers/net/ppp/ppp_generic.c
···
     struct sk_buff_head mrq;    /* MP: receive reconstruction queue */
 #endif /* CONFIG_PPP_MULTILINK */
 #ifdef CONFIG_PPP_FILTER
-    struct sock_filter *pass_filter;    /* filter for packets to pass */
-    struct sock_filter *active_filter;/* filter for pkts to reset idle */
-    unsigned pass_len, active_len;
+    struct sk_filter *pass_filter;      /* filter for packets to pass */
+    struct sk_filter *active_filter;    /* filter for pkts to reset idle */
 #endif /* CONFIG_PPP_FILTER */
     struct net *ppp_net;    /* the net we belong to */
     struct ppp_link_stats stats64;  /* 64 bit network stats */
···
     case PPPIOCSPASS:
     {
         struct sock_filter *code;
+
         err = get_filter(argp, &code);
         if (err >= 0) {
+            struct sock_fprog fprog = {
+                .len = err,
+                .filter = code,
+            };
+
             ppp_lock(ppp);
-            kfree(ppp->pass_filter);
-            ppp->pass_filter = code;
-            ppp->pass_len = err;
+            if (ppp->pass_filter)
+                sk_unattached_filter_destroy(ppp->pass_filter);
+            err = sk_unattached_filter_create(&ppp->pass_filter,
+                                              &fprog);
+            kfree(code);
             ppp_unlock(ppp);
-            err = 0;
         }
         break;
     }
     case PPPIOCSACTIVE:
     {
         struct sock_filter *code;
+
         err = get_filter(argp, &code);
         if (err >= 0) {
+            struct sock_fprog fprog = {
+                .len = err,
+                .filter = code,
+            };
+
             ppp_lock(ppp);
-            kfree(ppp->active_filter);
-            ppp->active_filter = code;
-            ppp->active_len = err;
+            if (ppp->active_filter)
+                sk_unattached_filter_destroy(ppp->active_filter);
+            err = sk_unattached_filter_create(&ppp->active_filter,
+                                              &fprog);
+            kfree(code);
             ppp_unlock(ppp);
-            err = 0;
         }
         break;
     }
···
            a four-byte PPP header on each packet */
         *skb_push(skb, 2) = 1;
         if (ppp->pass_filter &&
-            sk_run_filter(skb, ppp->pass_filter) == 0) {
+            SK_RUN_FILTER(ppp->pass_filter, skb) == 0) {
             if (ppp->debug & 1)
                 netdev_printk(KERN_DEBUG, ppp->dev,
                               "PPP: outbound frame "
···
         }
         /* if this packet passes the active filter, record the time */
         if (!(ppp->active_filter &&
-              sk_run_filter(skb, ppp->active_filter) == 0))
+              SK_RUN_FILTER(ppp->active_filter, skb) == 0))
             ppp->last_xmit = jiffies;
         skb_pull(skb, 2);
 #else
···
 
         *skb_push(skb, 2) = 0;
         if (ppp->pass_filter &&
-            sk_run_filter(skb, ppp->pass_filter) == 0) {
+            SK_RUN_FILTER(ppp->pass_filter, skb) == 0) {
             if (ppp->debug & 1)
                 netdev_printk(KERN_DEBUG, ppp->dev,
                               "PPP: inbound frame "
···
             return;
         }
         if (!(ppp->active_filter &&
-              sk_run_filter(skb, ppp->active_filter) == 0))
+              SK_RUN_FILTER(ppp->active_filter, skb) == 0))
             ppp->last_recv = jiffies;
         __skb_pull(skb, 2);
     } else
···
     ppp->minseq = -1;
     skb_queue_head_init(&ppp->mrq);
 #endif /* CONFIG_PPP_MULTILINK */
+#ifdef CONFIG_PPP_FILTER
+    ppp->pass_filter = NULL;
+    ppp->active_filter = NULL;
+#endif /* CONFIG_PPP_FILTER */
 
     /*
      * drum roll: don't forget to set
···
     skb_queue_purge(&ppp->mrq);
 #endif /* CONFIG_PPP_MULTILINK */
 #ifdef CONFIG_PPP_FILTER
-    kfree(ppp->pass_filter);
-    ppp->pass_filter = NULL;
-    kfree(ppp->active_filter);
-    ppp->active_filter = NULL;
+    if (ppp->pass_filter) {
+        sk_unattached_filter_destroy(ppp->pass_filter);
+        ppp->pass_filter = NULL;
+    }
+
+    if (ppp->active_filter) {
+        sk_unattached_filter_destroy(ppp->active_filter);
+        ppp->active_filter = NULL;
+    }
 #endif /* CONFIG_PPP_FILTER */
 
     kfree_skb(ppp->xmit_pending);
+94 -24
include/linux/filter.h
···
 #include <linux/workqueue.h>
 #include <uapi/linux/filter.h>
 
-#ifdef CONFIG_COMPAT
-/*
- * A struct sock_filter is architecture independent.
+/* Internally used and optimized filter representation with extended
+ * instruction set based on top of classic BPF.
  */
+
+/* instruction classes */
+#define BPF_ALU64   0x07    /* alu mode in double word width */
+
+/* ld/ldx fields */
+#define BPF_DW      0x18    /* double word */
+#define BPF_XADD    0xc0    /* exclusive add */
+
+/* alu/jmp fields */
+#define BPF_MOV     0xb0    /* mov reg to reg */
+#define BPF_ARSH    0xc0    /* sign extending arithmetic shift right */
+
+/* change endianness of a register */
+#define BPF_END     0xd0    /* flags for endianness conversion: */
+#define BPF_TO_LE   0x00    /* convert to little-endian */
+#define BPF_TO_BE   0x08    /* convert to big-endian */
+#define BPF_FROM_LE BPF_TO_LE
+#define BPF_FROM_BE BPF_TO_BE
+
+#define BPF_JNE     0x50    /* jump != */
+#define BPF_JSGT    0x60    /* SGT is signed '>', GT in x86 */
+#define BPF_JSGE    0x70    /* SGE is signed '>=', GE in x86 */
+#define BPF_CALL    0x80    /* function call */
+#define BPF_EXIT    0x90    /* function return */
+
+/* BPF has 10 general purpose 64-bit registers and stack frame. */
+#define MAX_BPF_REG 11
+
+/* BPF program can access up to 512 bytes of stack space. */
+#define MAX_BPF_STACK   512
+
+/* Arg1, context and stack frame pointer register positions. */
+#define ARG1_REG    1
+#define CTX_REG     6
+#define FP_REG      10
+
+struct sock_filter_int {
+    __u8    code;       /* opcode */
+    __u8    a_reg:4;    /* dest register */
+    __u8    x_reg:4;    /* source register */
+    __s16   off;        /* signed offset */
+    __s32   imm;        /* signed immediate constant */
+};
+
+#ifdef CONFIG_COMPAT
+/* A struct sock_filter is architecture independent. */
 struct compat_sock_fprog {
     u16     len;
-    compat_uptr_t   filter;     /* struct sock_filter * */
+    compat_uptr_t   filter; /* struct sock_filter * */
 };
 #endif
 
+struct sock_fprog_kern {
+    u16         len;
+    struct sock_filter  *filter;
+};
+
 struct sk_buff;
 struct sock;
+struct seccomp_data;
 
-struct sk_filter
-{
+struct sk_filter {
     atomic_t        refcnt;
-    unsigned int        len;    /* Number of filter blocks */
+    u32         jited:1,    /* Is our filter JIT'ed? */
+                len:31;     /* Number of filter blocks */
+    struct sock_fprog_kern  *orig_prog; /* Original BPF program */
     struct rcu_head     rcu;
     unsigned int        (*bpf_func)(const struct sk_buff *skb,
-                        const struct sock_filter *filter);
+                        const struct sock_filter_int *filter);
     union {
-        struct sock_filter  insns[0];
+        struct sock_filter  insns[0];
+        struct sock_filter_int  insnsi[0];
         struct work_struct  work;
     };
 };
···
            offsetof(struct sk_filter, insns[proglen]));
 }
 
-extern int sk_filter(struct sock *sk, struct sk_buff *skb);
-extern unsigned int sk_run_filter(const struct sk_buff *skb,
-                  const struct sock_filter *filter);
-extern int sk_unattached_filter_create(struct sk_filter **pfp,
-                       struct sock_fprog *fprog);
-extern void sk_unattached_filter_destroy(struct sk_filter *fp);
-extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
-extern int sk_detach_filter(struct sock *sk);
-extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
-extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
-extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
+#define sk_filter_proglen(fprog)            \
+        (fprog->len * sizeof(fprog->filter[0]))
+
+#define SK_RUN_FILTER(filter, ctx)          \
+        (*filter->bpf_func)(ctx, filter->insnsi)
+
+int sk_filter(struct sock *sk, struct sk_buff *skb);
+
+u32 sk_run_filter_int_seccomp(const struct seccomp_data *ctx,
+                  const struct sock_filter_int *insni);
+u32 sk_run_filter_int_skb(const struct sk_buff *ctx,
+              const struct sock_filter_int *insni);
+
+int sk_convert_filter(struct sock_filter *prog, int len,
+              struct sock_filter_int *new_prog, int *new_len);
+
+int sk_unattached_filter_create(struct sk_filter **pfp,
+                struct sock_fprog *fprog);
+void sk_unattached_filter_destroy(struct sk_filter *fp);
+
+int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
+int sk_detach_filter(struct sock *sk);
+
+int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
+int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
+          unsigned int len);
+void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
+
+void sk_filter_charge(struct sock *sk, struct sk_filter *fp);
+void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
 
 #ifdef CONFIG_BPF_JIT
 #include <stdarg.h>
 #include <linux/linkage.h>
 #include <linux/printk.h>
 
-extern void bpf_jit_compile(struct sk_filter *fp);
-extern void bpf_jit_free(struct sk_filter *fp);
+void bpf_jit_compile(struct sk_filter *fp);
+void bpf_jit_free(struct sk_filter *fp);
 
 static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
                 u32 pass, void *image)
···
     print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
                16, 1, image, proglen, false);
 }
-#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #else
 #include <linux/slab.h>
 static inline void bpf_jit_compile(struct sk_filter *fp)
···
 {
     kfree(fp);
 }
-#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
 #endif
 
 static inline int bpf_tell_extensions(void)
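
To make the new encoding above concrete, here is a hedged, made-up
two-instruction program in the sock_filter_int layout, "R0 = 42;
return R0", composed from the opcode fields this header defines
(BPF_ALU64 | BPF_MOV | BPF_K, then BPF_JMP | BPF_EXIT):

    /* Illustrative only: "R0 = 42; return R0" in the new format. */
    static const struct sock_filter_int demo_prog[] = {
            {
                    .code  = BPF_ALU64 | BPF_MOV | BPF_K, /* R0 = imm */
                    .a_reg = 0,     /* dest: R0, the return value reg */
                    .x_reg = 0,
                    .off   = 0,
                    .imm   = 42,
            },
            {
                    .code  = BPF_JMP | BPF_EXIT,          /* return R0 */
            },
    };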
+2 -3
include/linux/isdn_ppp.h
···
     struct slcompress *slcomp;
 #endif
 #ifdef CONFIG_IPPP_FILTER
-    struct sock_filter *pass_filter;    /* filter for packets to pass */
-    struct sock_filter *active_filter;  /* filter for pkts to reset idle */
-    unsigned pass_len, active_len;
+    struct sk_filter *pass_filter;      /* filter for packets to pass */
+    struct sk_filter *active_filter;    /* filter for pkts to reset idle */
 #endif
     unsigned long debug;
     struct isdn_ppp_compressor *compressor,*decompressor;
+2 -12
include/linux/ptp_classify.h
···
 #include <linux/if_vlan.h>
 #include <linux/ip.h>
 #include <linux/filter.h>
-#ifdef __KERNEL__
 #include <linux/in.h>
-#else
-#include <netinet/in.h>
-#endif
 
 #define PTP_CLASS_NONE  0x00 /* not a PTP event message */
 #define PTP_CLASS_V1    0x01 /* protocol version 1 */
···
 #define OP_RETA (BPF_RET | BPF_A)
 #define OP_RETK (BPF_RET | BPF_K)
 
-static inline int ptp_filter_init(struct sock_filter *f, int len)
-{
-    if (OP_LDH == f[0].code)
-        return sk_chk_filter(f, len);
-    else
-        return 0;
-}
-
 #define PTP_FILTER \
     {OP_LDH,    0,   0, OFF_ETYPE       }, /*              */ \
     {OP_JEQ,    0,  12, ETH_P_IP        }, /* f goto L20   */ \
···
     {OP_OR,     0,   0, PTP_CLASS_L2    }, /*              */ \
     {OP_RETA,   0,   0, 0               }, /*              */ \
/*L6x*/ {OP_RETK,   0,   0, PTP_CLASS_NONE  },
+
+unsigned int ptp_classify_raw(const struct sk_buff *skb);
 
 #endif
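
The driver hunks above all follow the same pattern: the per-driver
ptp_filter array and its ptp_filter_init() call disappear, and
classification goes through the centralized ptp_classify_raw() declared
here. A hedged sketch of the resulting caller side (the helper name is
ours, not from the patchset):

    static bool skb_is_ptp_event(const struct sk_buff *skb)
    {
            /* PTP_CLASS_NONE means "not a PTP event message". */
            return ptp_classify_raw(skb) != PTP_CLASS_NONE;
    }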
-1
include/linux/seccomp.h
···
 #ifdef CONFIG_SECCOMP_FILTER
 extern void put_seccomp_filter(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
-extern u32 seccomp_bpf_load(int off);
 #else /* CONFIG_SECCOMP_FILTER */
 static inline void put_seccomp_filter(struct task_struct *tsk)
 {
-27
include/net/sock.h
···
 /* Initialise core socket variables */
 void sock_init_data(struct socket *sock, struct sock *sk);
 
-void sk_filter_release_rcu(struct rcu_head *rcu);
-
-/**
- *  sk_filter_release - release a socket filter
- *  @fp: filter to remove
- *
- *  Remove a filter from a socket and release its resources.
- */
-
-static inline void sk_filter_release(struct sk_filter *fp)
-{
-    if (atomic_dec_and_test(&fp->refcnt))
-        call_rcu(&fp->rcu, sk_filter_release_rcu);
-}
-
-static inline void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
-{
-    atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
-    sk_filter_release(fp);
-}
-
-static inline void sk_filter_charge(struct sock *sk, struct sk_filter *fp)
-{
-    atomic_inc(&fp->refcnt);
-    atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
-}
-
 /*
  * Socket reference counting postulates.
  *
+58 -61
kernel/seccomp.c
···
     atomic_t usage;
     struct seccomp_filter *prev;
     unsigned short len;  /* Instruction count */
-    struct sock_filter insns[];
+    struct sock_filter_int insnsi[];
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
 #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
 
-/**
- * get_u32 - returns a u32 offset into data
- * @data: a unsigned 64 bit value
- * @index: 0 or 1 to return the first or second 32-bits
- *
- * This inline exists to hide the length of unsigned long. If a 32-bit
- * unsigned long is passed in, it will be extended and the top 32-bits will be
- * 0. If it is a 64-bit unsigned long, then whatever data is resident will be
- * properly returned.
- *
+/*
  * Endianness is explicitly ignored and left for BPF program authors to manage
  * as per the specific architecture.
  */
-static inline u32 get_u32(u64 data, int index)
+static void populate_seccomp_data(struct seccomp_data *sd)
 {
-    return ((u32 *)&data)[index];
-}
+    struct task_struct *task = current;
+    struct pt_regs *regs = task_pt_regs(task);
 
-/* Helper for bpf_load below. */
-#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
-/**
- * bpf_load: checks and returns a pointer to the requested offset
- * @off: offset into struct seccomp_data to load from
- *
- * Returns the requested 32-bits of data.
- * seccomp_check_filter() should assure that @off is 32-bit aligned
- * and not out of bounds. Failure to do so is a BUG.
- */
-u32 seccomp_bpf_load(int off)
-{
-    struct pt_regs *regs = task_pt_regs(current);
-    if (off == BPF_DATA(nr))
-        return syscall_get_nr(current, regs);
-    if (off == BPF_DATA(arch))
-        return syscall_get_arch(current, regs);
-    if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
-        unsigned long value;
-        int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
-        int index = !!(off % sizeof(u64));
-        syscall_get_arguments(current, regs, arg, 1, &value);
-        return get_u32(value, index);
-    }
-    if (off == BPF_DATA(instruction_pointer))
-        return get_u32(KSTK_EIP(current), 0);
-    if (off == BPF_DATA(instruction_pointer) + sizeof(u32))
-        return get_u32(KSTK_EIP(current), 1);
-    /* seccomp_check_filter should make this impossible. */
-    BUG();
+    sd->nr = syscall_get_nr(task, regs);
+    sd->arch = syscall_get_arch(task, regs);
+
+    /* Unroll syscall_get_args to help gcc on arm. */
+    syscall_get_arguments(task, regs, 0, 1, (unsigned long *) &sd->args[0]);
+    syscall_get_arguments(task, regs, 1, 1, (unsigned long *) &sd->args[1]);
+    syscall_get_arguments(task, regs, 2, 1, (unsigned long *) &sd->args[2]);
+    syscall_get_arguments(task, regs, 3, 1, (unsigned long *) &sd->args[3]);
+    syscall_get_arguments(task, regs, 4, 1, (unsigned long *) &sd->args[4]);
+    syscall_get_arguments(task, regs, 5, 1, (unsigned long *) &sd->args[5]);
+
+    sd->instruction_pointer = KSTK_EIP(task);
 }
 
 /**
···
 
         switch (code) {
         case BPF_S_LD_W_ABS:
-            ftest->code = BPF_S_ANC_SECCOMP_LD_W;
+            ftest->code = BPF_LDX | BPF_W | BPF_ABS;
             /* 32-bit aligned and not out of bounds. */
             if (k >= sizeof(struct seccomp_data) || k & 3)
                 return -EINVAL;
             continue;
         case BPF_S_LD_W_LEN:
-            ftest->code = BPF_S_LD_IMM;
+            ftest->code = BPF_LD | BPF_IMM;
             ftest->k = sizeof(struct seccomp_data);
             continue;
         case BPF_S_LDX_W_LEN:
-            ftest->code = BPF_S_LDX_IMM;
+            ftest->code = BPF_LDX | BPF_IMM;
             ftest->k = sizeof(struct seccomp_data);
             continue;
         /* Explicitly include allowed calls. */
···
         case BPF_S_JMP_JGT_X:
         case BPF_S_JMP_JSET_K:
         case BPF_S_JMP_JSET_X:
+            sk_decode_filter(ftest, ftest);
             continue;
         default:
             return -EINVAL;
···
 static u32 seccomp_run_filters(int syscall)
 {
     struct seccomp_filter *f;
+    struct seccomp_data sd;
     u32 ret = SECCOMP_RET_ALLOW;
 
     /* Ensure unexpected behavior doesn't result in failing open. */
     if (WARN_ON(current->seccomp.filter == NULL))
         return SECCOMP_RET_KILL;
 
+    populate_seccomp_data(&sd);
+
     /*
      * All filters in the list are evaluated and the lowest BPF return
      * value always takes priority (ignoring the DATA).
      */
     for (f = current->seccomp.filter; f; f = f->prev) {
-        u32 cur_ret = sk_run_filter(NULL, f->insns);
+        u32 cur_ret = sk_run_filter_int_seccomp(&sd, f->insnsi);
         if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
             ret = cur_ret;
     }
···
     struct seccomp_filter *filter;
     unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
     unsigned long total_insns = fprog->len;
+    struct sock_filter *fp;
+    int new_len;
     long ret;
 
     if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
···
              CAP_SYS_ADMIN) != 0)
         return -EACCES;
 
-    /* Allocate a new seccomp_filter */
-    filter = kzalloc(sizeof(struct seccomp_filter) + fp_size,
-             GFP_KERNEL|__GFP_NOWARN);
-    if (!filter)
+    fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
+    if (!fp)
         return -ENOMEM;
-    atomic_set(&filter->usage, 1);
-    filter->len = fprog->len;
 
     /* Copy the instructions from fprog. */
     ret = -EFAULT;
-    if (copy_from_user(filter->insns, fprog->filter, fp_size))
-        goto fail;
+    if (copy_from_user(fp, fprog->filter, fp_size))
+        goto free_prog;
 
     /* Check and rewrite the fprog via the skb checker */
-    ret = sk_chk_filter(filter->insns, filter->len);
+    ret = sk_chk_filter(fp, fprog->len);
     if (ret)
-        goto fail;
+        goto free_prog;
 
     /* Check and rewrite the fprog for seccomp use */
-    ret = seccomp_check_filter(filter->insns, filter->len);
+    ret = seccomp_check_filter(fp, fprog->len);
     if (ret)
-        goto fail;
+        goto free_prog;
+
+    /* Convert 'sock_filter' insns to 'sock_filter_int' insns */
+    ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
+    if (ret)
+        goto free_prog;
+
+    /* Allocate a new seccomp_filter */
+    filter = kzalloc(sizeof(struct seccomp_filter) +
+             sizeof(struct sock_filter_int) * new_len,
+             GFP_KERNEL|__GFP_NOWARN);
+    if (!filter)
+        goto free_prog;
+
+    ret = sk_convert_filter(fp, fprog->len, filter->insnsi, &new_len);
+    if (ret)
+        goto free_filter;
+
+    atomic_set(&filter->usage, 1);
+    filter->len = new_len;
 
     /*
      * If there is an existing filter, make it the prev and don't drop its
···
     filter->prev = current->seccomp.filter;
     current->seccomp.filter = filter;
     return 0;
-fail:
+
+free_filter:
     kfree(filter);
+free_prog:
+    kfree(fp);
     return ret;
 }
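
Note the two-pass contract of sk_convert_filter() used above: a NULL
destination makes it only compute the resulting instruction count, so the
caller can size the allocation before converting for real. A hedged,
generic sketch of that pattern (the wrapper and its names are ours):

    static int convert_prog(struct sock_filter *old, int old_len,
                            struct sock_filter_int **new_prog, int *new_len)
    {
            int err;

            /* Pass 1: NULL destination, only compute the new length. */
            err = sk_convert_filter(old, old_len, NULL, new_len);
            if (err)
                    return err;

            *new_prog = kcalloc(*new_len, sizeof(**new_prog), GFP_KERNEL);
            if (!*new_prog)
                    return -ENOMEM;

            /* Pass 2: emit the converted program. */
            err = sk_convert_filter(old, old_len, *new_prog, new_len);
            if (err)
                    kfree(*new_prog);

            return err;
    }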
+1249 -314
net/core/filter.c
··· 1 1 /* 2 2 * Linux Socket Filter - Kernel level socket filtering 3 3 * 4 - * Author: 5 - * Jay Schulist <jschlst@samba.org> 4 + * Based on the design of the Berkeley Packet Filter. The new 5 + * internal format has been designed by PLUMgrid: 6 6 * 7 - * Based on the design of: 8 - * - The Berkeley Packet Filter 7 + * Copyright (c) 2011 - 2014 PLUMgrid, http://plumgrid.com 8 + * 9 + * Authors: 10 + * 11 + * Jay Schulist <jschlst@samba.org> 12 + * Alexei Starovoitov <ast@plumgrid.com> 13 + * Daniel Borkmann <dborkman@redhat.com> 9 14 * 10 15 * This program is free software; you can redistribute it and/or 11 16 * modify it under the terms of the GNU General Public License ··· 113 108 } 114 109 EXPORT_SYMBOL(sk_filter); 115 110 111 + /* Base function for offset calculation. Needs to go into .text section, 112 + * therefore keeping it non-static as well; will also be used by JITs 113 + * anyway later on, so do not let the compiler omit it. 114 + */ 115 + noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) 116 + { 117 + return 0; 118 + } 119 + 116 120 /** 117 - * sk_run_filter - run a filter on a socket 118 - * @skb: buffer to run the filter on 121 + * __sk_run_filter - run a filter on a given context 122 + * @ctx: buffer to run the filter on 119 123 * @fentry: filter to apply 120 124 * 121 - * Decode and apply filter instructions to the skb->data. 122 - * Return length to keep, 0 for none. @skb is the data we are 123 - * filtering, @filter is the array of filter instructions. 124 - * Because all jumps are guaranteed to be before last instruction, 125 - * and last instruction guaranteed to be a RET, we dont need to check 126 - * flen. (We used to pass to this function the length of filter) 125 + * Decode and apply filter instructions to the skb->data. Return length to 126 + * keep, 0 for none. @ctx is the data we are operating on, @filter is the 127 + * array of filter instructions. 127 128 */ 128 - unsigned int sk_run_filter(const struct sk_buff *skb, 129 - const struct sock_filter *fentry) 129 + unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn) 130 130 { 131 + u64 stack[MAX_BPF_STACK / sizeof(u64)]; 132 + u64 regs[MAX_BPF_REG], tmp; 131 133 void *ptr; 132 - u32 A = 0; /* Accumulator */ 133 - u32 X = 0; /* Index Register */ 134 - u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */ 135 - u32 tmp; 136 - int k; 134 + int off; 137 135 138 - /* 139 - * Process array of filter instructions. 
140 - */ 141 - for (;; fentry++) { 142 - #if defined(CONFIG_X86_32) 143 - #define K (fentry->k) 144 - #else 145 - const u32 K = fentry->k; 146 - #endif 136 + #define K insn->imm 137 + #define A regs[insn->a_reg] 138 + #define X regs[insn->x_reg] 139 + #define R0 regs[0] 147 140 148 - switch (fentry->code) { 149 - case BPF_S_ALU_ADD_X: 150 - A += X; 151 - continue; 152 - case BPF_S_ALU_ADD_K: 153 - A += K; 154 - continue; 155 - case BPF_S_ALU_SUB_X: 156 - A -= X; 157 - continue; 158 - case BPF_S_ALU_SUB_K: 159 - A -= K; 160 - continue; 161 - case BPF_S_ALU_MUL_X: 162 - A *= X; 163 - continue; 164 - case BPF_S_ALU_MUL_K: 165 - A *= K; 166 - continue; 167 - case BPF_S_ALU_DIV_X: 168 - if (X == 0) 169 - return 0; 170 - A /= X; 171 - continue; 172 - case BPF_S_ALU_DIV_K: 173 - A /= K; 174 - continue; 175 - case BPF_S_ALU_MOD_X: 176 - if (X == 0) 177 - return 0; 178 - A %= X; 179 - continue; 180 - case BPF_S_ALU_MOD_K: 181 - A %= K; 182 - continue; 183 - case BPF_S_ALU_AND_X: 184 - A &= X; 185 - continue; 186 - case BPF_S_ALU_AND_K: 187 - A &= K; 188 - continue; 189 - case BPF_S_ALU_OR_X: 190 - A |= X; 191 - continue; 192 - case BPF_S_ALU_OR_K: 193 - A |= K; 194 - continue; 195 - case BPF_S_ANC_ALU_XOR_X: 196 - case BPF_S_ALU_XOR_X: 197 - A ^= X; 198 - continue; 199 - case BPF_S_ALU_XOR_K: 200 - A ^= K; 201 - continue; 202 - case BPF_S_ALU_LSH_X: 203 - A <<= X; 204 - continue; 205 - case BPF_S_ALU_LSH_K: 206 - A <<= K; 207 - continue; 208 - case BPF_S_ALU_RSH_X: 209 - A >>= X; 210 - continue; 211 - case BPF_S_ALU_RSH_K: 212 - A >>= K; 213 - continue; 214 - case BPF_S_ALU_NEG: 215 - A = -A; 216 - continue; 217 - case BPF_S_JMP_JA: 218 - fentry += K; 219 - continue; 220 - case BPF_S_JMP_JGT_K: 221 - fentry += (A > K) ? fentry->jt : fentry->jf; 222 - continue; 223 - case BPF_S_JMP_JGE_K: 224 - fentry += (A >= K) ? fentry->jt : fentry->jf; 225 - continue; 226 - case BPF_S_JMP_JEQ_K: 227 - fentry += (A == K) ? fentry->jt : fentry->jf; 228 - continue; 229 - case BPF_S_JMP_JSET_K: 230 - fentry += (A & K) ? fentry->jt : fentry->jf; 231 - continue; 232 - case BPF_S_JMP_JGT_X: 233 - fentry += (A > X) ? fentry->jt : fentry->jf; 234 - continue; 235 - case BPF_S_JMP_JGE_X: 236 - fentry += (A >= X) ? fentry->jt : fentry->jf; 237 - continue; 238 - case BPF_S_JMP_JEQ_X: 239 - fentry += (A == X) ? fentry->jt : fentry->jf; 240 - continue; 241 - case BPF_S_JMP_JSET_X: 242 - fentry += (A & X) ? 
fentry->jt : fentry->jf; 243 - continue; 244 - case BPF_S_LD_W_ABS: 245 - k = K; 246 - load_w: 247 - ptr = load_pointer(skb, k, 4, &tmp); 248 - if (ptr != NULL) { 249 - A = get_unaligned_be32(ptr); 250 - continue; 251 - } 252 - return 0; 253 - case BPF_S_LD_H_ABS: 254 - k = K; 255 - load_h: 256 - ptr = load_pointer(skb, k, 2, &tmp); 257 - if (ptr != NULL) { 258 - A = get_unaligned_be16(ptr); 259 - continue; 260 - } 261 - return 0; 262 - case BPF_S_LD_B_ABS: 263 - k = K; 264 - load_b: 265 - ptr = load_pointer(skb, k, 1, &tmp); 266 - if (ptr != NULL) { 267 - A = *(u8 *)ptr; 268 - continue; 269 - } 270 - return 0; 271 - case BPF_S_LD_W_LEN: 272 - A = skb->len; 273 - continue; 274 - case BPF_S_LDX_W_LEN: 275 - X = skb->len; 276 - continue; 277 - case BPF_S_LD_W_IND: 278 - k = X + K; 279 - goto load_w; 280 - case BPF_S_LD_H_IND: 281 - k = X + K; 282 - goto load_h; 283 - case BPF_S_LD_B_IND: 284 - k = X + K; 285 - goto load_b; 286 - case BPF_S_LDX_B_MSH: 287 - ptr = load_pointer(skb, K, 1, &tmp); 288 - if (ptr != NULL) { 289 - X = (*(u8 *)ptr & 0xf) << 2; 290 - continue; 291 - } 292 - return 0; 293 - case BPF_S_LD_IMM: 294 - A = K; 295 - continue; 296 - case BPF_S_LDX_IMM: 297 - X = K; 298 - continue; 299 - case BPF_S_LD_MEM: 300 - A = mem[K]; 301 - continue; 302 - case BPF_S_LDX_MEM: 303 - X = mem[K]; 304 - continue; 305 - case BPF_S_MISC_TAX: 306 - X = A; 307 - continue; 308 - case BPF_S_MISC_TXA: 309 - A = X; 310 - continue; 311 - case BPF_S_RET_K: 312 - return K; 313 - case BPF_S_RET_A: 314 - return A; 315 - case BPF_S_ST: 316 - mem[K] = A; 317 - continue; 318 - case BPF_S_STX: 319 - mem[K] = X; 320 - continue; 321 - case BPF_S_ANC_PROTOCOL: 322 - A = ntohs(skb->protocol); 323 - continue; 324 - case BPF_S_ANC_PKTTYPE: 325 - A = skb->pkt_type; 326 - continue; 327 - case BPF_S_ANC_IFINDEX: 328 - if (!skb->dev) 329 - return 0; 330 - A = skb->dev->ifindex; 331 - continue; 332 - case BPF_S_ANC_MARK: 333 - A = skb->mark; 334 - continue; 335 - case BPF_S_ANC_QUEUE: 336 - A = skb->queue_mapping; 337 - continue; 338 - case BPF_S_ANC_HATYPE: 339 - if (!skb->dev) 340 - return 0; 341 - A = skb->dev->type; 342 - continue; 343 - case BPF_S_ANC_RXHASH: 344 - A = skb->hash; 345 - continue; 346 - case BPF_S_ANC_CPU: 347 - A = raw_smp_processor_id(); 348 - continue; 349 - case BPF_S_ANC_VLAN_TAG: 350 - A = vlan_tx_tag_get(skb); 351 - continue; 352 - case BPF_S_ANC_VLAN_TAG_PRESENT: 353 - A = !!vlan_tx_tag_present(skb); 354 - continue; 355 - case BPF_S_ANC_PAY_OFFSET: 356 - A = __skb_get_poff(skb); 357 - continue; 358 - case BPF_S_ANC_NLATTR: { 359 - struct nlattr *nla; 141 + #define CONT ({insn++; goto select_insn; }) 142 + #define CONT_JMP ({insn++; goto select_insn; }) 360 143 361 - if (skb_is_nonlinear(skb)) 362 - return 0; 363 - if (A > skb->len - sizeof(struct nlattr)) 364 - return 0; 144 + static const void *jumptable[256] = { 145 + [0 ... 255] = &&default_label, 146 + /* Now overwrite non-defaults ... 
*/ 147 + #define DL(A, B, C) [A|B|C] = &&A##_##B##_##C 148 + DL(BPF_ALU, BPF_ADD, BPF_X), 149 + DL(BPF_ALU, BPF_ADD, BPF_K), 150 + DL(BPF_ALU, BPF_SUB, BPF_X), 151 + DL(BPF_ALU, BPF_SUB, BPF_K), 152 + DL(BPF_ALU, BPF_AND, BPF_X), 153 + DL(BPF_ALU, BPF_AND, BPF_K), 154 + DL(BPF_ALU, BPF_OR, BPF_X), 155 + DL(BPF_ALU, BPF_OR, BPF_K), 156 + DL(BPF_ALU, BPF_LSH, BPF_X), 157 + DL(BPF_ALU, BPF_LSH, BPF_K), 158 + DL(BPF_ALU, BPF_RSH, BPF_X), 159 + DL(BPF_ALU, BPF_RSH, BPF_K), 160 + DL(BPF_ALU, BPF_XOR, BPF_X), 161 + DL(BPF_ALU, BPF_XOR, BPF_K), 162 + DL(BPF_ALU, BPF_MUL, BPF_X), 163 + DL(BPF_ALU, BPF_MUL, BPF_K), 164 + DL(BPF_ALU, BPF_MOV, BPF_X), 165 + DL(BPF_ALU, BPF_MOV, BPF_K), 166 + DL(BPF_ALU, BPF_DIV, BPF_X), 167 + DL(BPF_ALU, BPF_DIV, BPF_K), 168 + DL(BPF_ALU, BPF_MOD, BPF_X), 169 + DL(BPF_ALU, BPF_MOD, BPF_K), 170 + DL(BPF_ALU, BPF_NEG, 0), 171 + DL(BPF_ALU, BPF_END, BPF_TO_BE), 172 + DL(BPF_ALU, BPF_END, BPF_TO_LE), 173 + DL(BPF_ALU64, BPF_ADD, BPF_X), 174 + DL(BPF_ALU64, BPF_ADD, BPF_K), 175 + DL(BPF_ALU64, BPF_SUB, BPF_X), 176 + DL(BPF_ALU64, BPF_SUB, BPF_K), 177 + DL(BPF_ALU64, BPF_AND, BPF_X), 178 + DL(BPF_ALU64, BPF_AND, BPF_K), 179 + DL(BPF_ALU64, BPF_OR, BPF_X), 180 + DL(BPF_ALU64, BPF_OR, BPF_K), 181 + DL(BPF_ALU64, BPF_LSH, BPF_X), 182 + DL(BPF_ALU64, BPF_LSH, BPF_K), 183 + DL(BPF_ALU64, BPF_RSH, BPF_X), 184 + DL(BPF_ALU64, BPF_RSH, BPF_K), 185 + DL(BPF_ALU64, BPF_XOR, BPF_X), 186 + DL(BPF_ALU64, BPF_XOR, BPF_K), 187 + DL(BPF_ALU64, BPF_MUL, BPF_X), 188 + DL(BPF_ALU64, BPF_MUL, BPF_K), 189 + DL(BPF_ALU64, BPF_MOV, BPF_X), 190 + DL(BPF_ALU64, BPF_MOV, BPF_K), 191 + DL(BPF_ALU64, BPF_ARSH, BPF_X), 192 + DL(BPF_ALU64, BPF_ARSH, BPF_K), 193 + DL(BPF_ALU64, BPF_DIV, BPF_X), 194 + DL(BPF_ALU64, BPF_DIV, BPF_K), 195 + DL(BPF_ALU64, BPF_MOD, BPF_X), 196 + DL(BPF_ALU64, BPF_MOD, BPF_K), 197 + DL(BPF_ALU64, BPF_NEG, 0), 198 + DL(BPF_JMP, BPF_CALL, 0), 199 + DL(BPF_JMP, BPF_JA, 0), 200 + DL(BPF_JMP, BPF_JEQ, BPF_X), 201 + DL(BPF_JMP, BPF_JEQ, BPF_K), 202 + DL(BPF_JMP, BPF_JNE, BPF_X), 203 + DL(BPF_JMP, BPF_JNE, BPF_K), 204 + DL(BPF_JMP, BPF_JGT, BPF_X), 205 + DL(BPF_JMP, BPF_JGT, BPF_K), 206 + DL(BPF_JMP, BPF_JGE, BPF_X), 207 + DL(BPF_JMP, BPF_JGE, BPF_K), 208 + DL(BPF_JMP, BPF_JSGT, BPF_X), 209 + DL(BPF_JMP, BPF_JSGT, BPF_K), 210 + DL(BPF_JMP, BPF_JSGE, BPF_X), 211 + DL(BPF_JMP, BPF_JSGE, BPF_K), 212 + DL(BPF_JMP, BPF_JSET, BPF_X), 213 + DL(BPF_JMP, BPF_JSET, BPF_K), 214 + DL(BPF_JMP, BPF_EXIT, 0), 215 + DL(BPF_STX, BPF_MEM, BPF_B), 216 + DL(BPF_STX, BPF_MEM, BPF_H), 217 + DL(BPF_STX, BPF_MEM, BPF_W), 218 + DL(BPF_STX, BPF_MEM, BPF_DW), 219 + DL(BPF_STX, BPF_XADD, BPF_W), 220 + DL(BPF_STX, BPF_XADD, BPF_DW), 221 + DL(BPF_ST, BPF_MEM, BPF_B), 222 + DL(BPF_ST, BPF_MEM, BPF_H), 223 + DL(BPF_ST, BPF_MEM, BPF_W), 224 + DL(BPF_ST, BPF_MEM, BPF_DW), 225 + DL(BPF_LDX, BPF_MEM, BPF_B), 226 + DL(BPF_LDX, BPF_MEM, BPF_H), 227 + DL(BPF_LDX, BPF_MEM, BPF_W), 228 + DL(BPF_LDX, BPF_MEM, BPF_DW), 229 + DL(BPF_LD, BPF_ABS, BPF_W), 230 + DL(BPF_LD, BPF_ABS, BPF_H), 231 + DL(BPF_LD, BPF_ABS, BPF_B), 232 + DL(BPF_LD, BPF_IND, BPF_W), 233 + DL(BPF_LD, BPF_IND, BPF_H), 234 + DL(BPF_LD, BPF_IND, BPF_B), 235 + #undef DL 236 + }; 365 237 366 - nla = nla_find((struct nlattr *)&skb->data[A], 367 - skb->len - A, X); 368 - if (nla) 369 - A = (void *)nla - (void *)skb->data; 370 - else 371 - A = 0; 372 - continue; 238 + regs[FP_REG] = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)]; 239 + regs[ARG1_REG] = (u64) (unsigned long) ctx; 240 + 241 + select_insn: 242 + goto *jumptable[insn->code]; 243 + 244 + /* ALU */ 245 + 
#define ALU(OPCODE, OP) \ 246 + BPF_ALU64_##OPCODE##_BPF_X: \ 247 + A = A OP X; \ 248 + CONT; \ 249 + BPF_ALU_##OPCODE##_BPF_X: \ 250 + A = (u32) A OP (u32) X; \ 251 + CONT; \ 252 + BPF_ALU64_##OPCODE##_BPF_K: \ 253 + A = A OP K; \ 254 + CONT; \ 255 + BPF_ALU_##OPCODE##_BPF_K: \ 256 + A = (u32) A OP (u32) K; \ 257 + CONT; 258 + 259 + ALU(BPF_ADD, +) 260 + ALU(BPF_SUB, -) 261 + ALU(BPF_AND, &) 262 + ALU(BPF_OR, |) 263 + ALU(BPF_LSH, <<) 264 + ALU(BPF_RSH, >>) 265 + ALU(BPF_XOR, ^) 266 + ALU(BPF_MUL, *) 267 + #undef ALU 268 + BPF_ALU_BPF_NEG_0: 269 + A = (u32) -A; 270 + CONT; 271 + BPF_ALU64_BPF_NEG_0: 272 + A = -A; 273 + CONT; 274 + BPF_ALU_BPF_MOV_BPF_X: 275 + A = (u32) X; 276 + CONT; 277 + BPF_ALU_BPF_MOV_BPF_K: 278 + A = (u32) K; 279 + CONT; 280 + BPF_ALU64_BPF_MOV_BPF_X: 281 + A = X; 282 + CONT; 283 + BPF_ALU64_BPF_MOV_BPF_K: 284 + A = K; 285 + CONT; 286 + BPF_ALU64_BPF_ARSH_BPF_X: 287 + (*(s64 *) &A) >>= X; 288 + CONT; 289 + BPF_ALU64_BPF_ARSH_BPF_K: 290 + (*(s64 *) &A) >>= K; 291 + CONT; 292 + BPF_ALU64_BPF_MOD_BPF_X: 293 + tmp = A; 294 + if (X) 295 + A = do_div(tmp, X); 296 + CONT; 297 + BPF_ALU_BPF_MOD_BPF_X: 298 + tmp = (u32) A; 299 + if (X) 300 + A = do_div(tmp, (u32) X); 301 + CONT; 302 + BPF_ALU64_BPF_MOD_BPF_K: 303 + tmp = A; 304 + if (K) 305 + A = do_div(tmp, K); 306 + CONT; 307 + BPF_ALU_BPF_MOD_BPF_K: 308 + tmp = (u32) A; 309 + if (K) 310 + A = do_div(tmp, (u32) K); 311 + CONT; 312 + BPF_ALU64_BPF_DIV_BPF_X: 313 + if (X) 314 + do_div(A, X); 315 + CONT; 316 + BPF_ALU_BPF_DIV_BPF_X: 317 + tmp = (u32) A; 318 + if (X) 319 + do_div(tmp, (u32) X); 320 + A = (u32) tmp; 321 + CONT; 322 + BPF_ALU64_BPF_DIV_BPF_K: 323 + if (K) 324 + do_div(A, K); 325 + CONT; 326 + BPF_ALU_BPF_DIV_BPF_K: 327 + tmp = (u32) A; 328 + if (K) 329 + do_div(tmp, (u32) K); 330 + A = (u32) tmp; 331 + CONT; 332 + BPF_ALU_BPF_END_BPF_TO_BE: 333 + switch (K) { 334 + case 16: 335 + A = (__force u16) cpu_to_be16(A); 336 + break; 337 + case 32: 338 + A = (__force u32) cpu_to_be32(A); 339 + break; 340 + case 64: 341 + A = (__force u64) cpu_to_be64(A); 342 + break; 373 343 } 374 - case BPF_S_ANC_NLATTR_NEST: { 375 - struct nlattr *nla; 376 - 377 - if (skb_is_nonlinear(skb)) 378 - return 0; 379 - if (A > skb->len - sizeof(struct nlattr)) 380 - return 0; 381 - 382 - nla = (struct nlattr *)&skb->data[A]; 383 - if (nla->nla_len > A - skb->len) 384 - return 0; 385 - 386 - nla = nla_find_nested(nla, X); 387 - if (nla) 388 - A = (void *)nla - (void *)skb->data; 389 - else 390 - A = 0; 391 - continue; 344 + CONT; 345 + BPF_ALU_BPF_END_BPF_TO_LE: 346 + switch (K) { 347 + case 16: 348 + A = (__force u16) cpu_to_le16(A); 349 + break; 350 + case 32: 351 + A = (__force u32) cpu_to_le32(A); 352 + break; 353 + case 64: 354 + A = (__force u64) cpu_to_le64(A); 355 + break; 392 356 } 393 - #ifdef CONFIG_SECCOMP_FILTER 394 - case BPF_S_ANC_SECCOMP_LD_W: 395 - A = seccomp_bpf_load(fentry->k); 396 - continue; 397 - #endif 398 - default: 399 - WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n", 400 - fentry->code, fentry->jt, 401 - fentry->jf, fentry->k); 402 - return 0; 357 + CONT; 358 + 359 + /* CALL */ 360 + BPF_JMP_BPF_CALL_0: 361 + /* Function call scratches R1-R5 registers, preserves R6-R9, 362 + * and stores return value into R0. 
363 + */ 364 + R0 = (__bpf_call_base + insn->imm)(regs[1], regs[2], regs[3], 365 + regs[4], regs[5]); 366 + CONT; 367 + 368 + /* JMP */ 369 + BPF_JMP_BPF_JA_0: 370 + insn += insn->off; 371 + CONT; 372 + BPF_JMP_BPF_JEQ_BPF_X: 373 + if (A == X) { 374 + insn += insn->off; 375 + CONT_JMP; 403 376 } 377 + CONT; 378 + BPF_JMP_BPF_JEQ_BPF_K: 379 + if (A == K) { 380 + insn += insn->off; 381 + CONT_JMP; 382 + } 383 + CONT; 384 + BPF_JMP_BPF_JNE_BPF_X: 385 + if (A != X) { 386 + insn += insn->off; 387 + CONT_JMP; 388 + } 389 + CONT; 390 + BPF_JMP_BPF_JNE_BPF_K: 391 + if (A != K) { 392 + insn += insn->off; 393 + CONT_JMP; 394 + } 395 + CONT; 396 + BPF_JMP_BPF_JGT_BPF_X: 397 + if (A > X) { 398 + insn += insn->off; 399 + CONT_JMP; 400 + } 401 + CONT; 402 + BPF_JMP_BPF_JGT_BPF_K: 403 + if (A > K) { 404 + insn += insn->off; 405 + CONT_JMP; 406 + } 407 + CONT; 408 + BPF_JMP_BPF_JGE_BPF_X: 409 + if (A >= X) { 410 + insn += insn->off; 411 + CONT_JMP; 412 + } 413 + CONT; 414 + BPF_JMP_BPF_JGE_BPF_K: 415 + if (A >= K) { 416 + insn += insn->off; 417 + CONT_JMP; 418 + } 419 + CONT; 420 + BPF_JMP_BPF_JSGT_BPF_X: 421 + if (((s64)A) > ((s64)X)) { 422 + insn += insn->off; 423 + CONT_JMP; 424 + } 425 + CONT; 426 + BPF_JMP_BPF_JSGT_BPF_K: 427 + if (((s64)A) > ((s64)K)) { 428 + insn += insn->off; 429 + CONT_JMP; 430 + } 431 + CONT; 432 + BPF_JMP_BPF_JSGE_BPF_X: 433 + if (((s64)A) >= ((s64)X)) { 434 + insn += insn->off; 435 + CONT_JMP; 436 + } 437 + CONT; 438 + BPF_JMP_BPF_JSGE_BPF_K: 439 + if (((s64)A) >= ((s64)K)) { 440 + insn += insn->off; 441 + CONT_JMP; 442 + } 443 + CONT; 444 + BPF_JMP_BPF_JSET_BPF_X: 445 + if (A & X) { 446 + insn += insn->off; 447 + CONT_JMP; 448 + } 449 + CONT; 450 + BPF_JMP_BPF_JSET_BPF_K: 451 + if (A & K) { 452 + insn += insn->off; 453 + CONT_JMP; 454 + } 455 + CONT; 456 + BPF_JMP_BPF_EXIT_0: 457 + return R0; 458 + 459 + /* STX and ST and LDX*/ 460 + #define LDST(SIZEOP, SIZE) \ 461 + BPF_STX_BPF_MEM_##SIZEOP: \ 462 + *(SIZE *)(unsigned long) (A + insn->off) = X; \ 463 + CONT; \ 464 + BPF_ST_BPF_MEM_##SIZEOP: \ 465 + *(SIZE *)(unsigned long) (A + insn->off) = K; \ 466 + CONT; \ 467 + BPF_LDX_BPF_MEM_##SIZEOP: \ 468 + A = *(SIZE *)(unsigned long) (X + insn->off); \ 469 + CONT; 470 + 471 + LDST(BPF_B, u8) 472 + LDST(BPF_H, u16) 473 + LDST(BPF_W, u32) 474 + LDST(BPF_DW, u64) 475 + #undef LDST 476 + BPF_STX_BPF_XADD_BPF_W: /* lock xadd *(u32 *)(A + insn->off) += X */ 477 + atomic_add((u32) X, (atomic_t *)(unsigned long) 478 + (A + insn->off)); 479 + CONT; 480 + BPF_STX_BPF_XADD_BPF_DW: /* lock xadd *(u64 *)(A + insn->off) += X */ 481 + atomic64_add((u64) X, (atomic64_t *)(unsigned long) 482 + (A + insn->off)); 483 + CONT; 484 + BPF_LD_BPF_ABS_BPF_W: /* R0 = ntohl(*(u32 *) (skb->data + K)) */ 485 + off = K; 486 + load_word: 487 + /* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are only 488 + * appearing in the programs where ctx == skb. All programs 489 + * keep 'ctx' in regs[CTX_REG] == R6, sk_convert_filter() 490 + * saves it in R6, internal BPF verifier will check that 491 + * R6 == ctx. 492 + * 493 + * BPF_ABS and BPF_IND are wrappers of function calls, so 494 + * they scratch R1-R5 registers, preserve R6-R9, and store 495 + * return value into R0. 
496 + * 497 + * Implicit input: 498 + * ctx 499 + * 500 + * Explicit input: 501 + * X == any register 502 + * K == 32-bit immediate 503 + * 504 + * Output: 505 + * R0 - 8/16/32-bit skb data converted to cpu endianness 506 + */ 507 + ptr = load_pointer((struct sk_buff *) ctx, off, 4, &tmp); 508 + if (likely(ptr != NULL)) { 509 + R0 = get_unaligned_be32(ptr); 510 + CONT; 511 + } 512 + return 0; 513 + BPF_LD_BPF_ABS_BPF_H: /* R0 = ntohs(*(u16 *) (skb->data + K)) */ 514 + off = K; 515 + load_half: 516 + ptr = load_pointer((struct sk_buff *) ctx, off, 2, &tmp); 517 + if (likely(ptr != NULL)) { 518 + R0 = get_unaligned_be16(ptr); 519 + CONT; 520 + } 521 + return 0; 522 + BPF_LD_BPF_ABS_BPF_B: /* R0 = *(u8 *) (ctx + K) */ 523 + off = K; 524 + load_byte: 525 + ptr = load_pointer((struct sk_buff *) ctx, off, 1, &tmp); 526 + if (likely(ptr != NULL)) { 527 + R0 = *(u8 *)ptr; 528 + CONT; 529 + } 530 + return 0; 531 + BPF_LD_BPF_IND_BPF_W: /* R0 = ntohl(*(u32 *) (skb->data + X + K)) */ 532 + off = K + X; 533 + goto load_word; 534 + BPF_LD_BPF_IND_BPF_H: /* R0 = ntohs(*(u16 *) (skb->data + X + K)) */ 535 + off = K + X; 536 + goto load_half; 537 + BPF_LD_BPF_IND_BPF_B: /* R0 = *(u8 *) (skb->data + X + K) */ 538 + off = K + X; 539 + goto load_byte; 540 + 541 + default_label: 542 + /* If we ever reach this, we have a bug somewhere. */ 543 + WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code); 544 + return 0; 545 + #undef CONT_JMP 546 + #undef CONT 547 + 548 + #undef R0 549 + #undef X 550 + #undef A 551 + #undef K 552 + } 553 + 554 + u32 sk_run_filter_int_seccomp(const struct seccomp_data *ctx, 555 + const struct sock_filter_int *insni) 556 + __attribute__ ((alias ("__sk_run_filter"))); 557 + 558 + u32 sk_run_filter_int_skb(const struct sk_buff *ctx, 559 + const struct sock_filter_int *insni) 560 + __attribute__ ((alias ("__sk_run_filter"))); 561 + EXPORT_SYMBOL_GPL(sk_run_filter_int_skb); 562 + 563 + /* Helper to find the offset of pkt_type in sk_buff structure. We want 564 + * to make sure its still a 3bit field starting at a byte boundary; 565 + * taken from arch/x86/net/bpf_jit_comp.c. 
566 + */
567 + #define PKT_TYPE_MAX 7
568 + static unsigned int pkt_type_offset(void)
569 + {
570 + struct sk_buff skb_probe = { .pkt_type = ~0, };
571 + u8 *ct = (u8 *) &skb_probe;
572 + unsigned int off;
573 +
574 + for (off = 0; off < sizeof(struct sk_buff); off++) {
575 + if (ct[off] == PKT_TYPE_MAX)
576 + return off;
404 577 }
578 +
579 + pr_err_once("Please fix %s, as pkt_type couldn't be found!\n", __func__);
580 + return -1;
581 + }
582 +
583 + static u64 __skb_get_pay_offset(u64 ctx, u64 A, u64 X, u64 r4, u64 r5)
584 + {
585 + struct sk_buff *skb = (struct sk_buff *)(long) ctx;
586 +
587 + return __skb_get_poff(skb);
588 + }
589 +
590 + static u64 __skb_get_nlattr(u64 ctx, u64 A, u64 X, u64 r4, u64 r5)
591 + {
592 + struct sk_buff *skb = (struct sk_buff *)(long) ctx;
593 + struct nlattr *nla;
594 +
595 + if (skb_is_nonlinear(skb))
596 + return 0;
597 +
598 + if (A > skb->len - sizeof(struct nlattr))
599 + return 0;
600 +
601 + nla = nla_find((struct nlattr *) &skb->data[A], skb->len - A, X);
602 + if (nla)
603 + return (void *) nla - (void *) skb->data;
405 604
406 605 return 0;
407 606 }
408 - EXPORT_SYMBOL(sk_run_filter);
409 607
410 - /*
411 - * Security :
608 + static u64 __skb_get_nlattr_nest(u64 ctx, u64 A, u64 X, u64 r4, u64 r5)
609 + {
610 + struct sk_buff *skb = (struct sk_buff *)(long) ctx;
611 + struct nlattr *nla;
612 +
613 + if (skb_is_nonlinear(skb))
614 + return 0;
615 +
616 + if (A > skb->len - sizeof(struct nlattr))
617 + return 0;
618 +
619 + nla = (struct nlattr *) &skb->data[A];
620 + if (nla->nla_len > skb->len - A)
621 + return 0;
622 +
623 + nla = nla_find_nested(nla, X);
624 + if (nla)
625 + return (void *) nla - (void *) skb->data;
626 +
627 + return 0;
628 + }
629 +
630 + static u64 __get_raw_cpu_id(u64 ctx, u64 A, u64 X, u64 r4, u64 r5)
631 + {
632 + return raw_smp_processor_id();
633 + }
634 +
635 + /* Register mappings for user programs.
*/ 636 + #define A_REG 0 637 + #define X_REG 7 638 + #define TMP_REG 8 639 + #define ARG2_REG 2 640 + #define ARG3_REG 3 641 + 642 + static bool convert_bpf_extensions(struct sock_filter *fp, 643 + struct sock_filter_int **insnp) 644 + { 645 + struct sock_filter_int *insn = *insnp; 646 + 647 + switch (fp->k) { 648 + case SKF_AD_OFF + SKF_AD_PROTOCOL: 649 + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, protocol) != 2); 650 + 651 + insn->code = BPF_LDX | BPF_MEM | BPF_H; 652 + insn->a_reg = A_REG; 653 + insn->x_reg = CTX_REG; 654 + insn->off = offsetof(struct sk_buff, protocol); 655 + insn++; 656 + 657 + /* A = ntohs(A) [emitting a nop or swap16] */ 658 + insn->code = BPF_ALU | BPF_END | BPF_FROM_BE; 659 + insn->a_reg = A_REG; 660 + insn->imm = 16; 661 + break; 662 + 663 + case SKF_AD_OFF + SKF_AD_PKTTYPE: 664 + insn->code = BPF_LDX | BPF_MEM | BPF_B; 665 + insn->a_reg = A_REG; 666 + insn->x_reg = CTX_REG; 667 + insn->off = pkt_type_offset(); 668 + if (insn->off < 0) 669 + return false; 670 + insn++; 671 + 672 + insn->code = BPF_ALU | BPF_AND | BPF_K; 673 + insn->a_reg = A_REG; 674 + insn->imm = PKT_TYPE_MAX; 675 + break; 676 + 677 + case SKF_AD_OFF + SKF_AD_IFINDEX: 678 + case SKF_AD_OFF + SKF_AD_HATYPE: 679 + if (FIELD_SIZEOF(struct sk_buff, dev) == 8) 680 + insn->code = BPF_LDX | BPF_MEM | BPF_DW; 681 + else 682 + insn->code = BPF_LDX | BPF_MEM | BPF_W; 683 + insn->a_reg = TMP_REG; 684 + insn->x_reg = CTX_REG; 685 + insn->off = offsetof(struct sk_buff, dev); 686 + insn++; 687 + 688 + insn->code = BPF_JMP | BPF_JNE | BPF_K; 689 + insn->a_reg = TMP_REG; 690 + insn->imm = 0; 691 + insn->off = 1; 692 + insn++; 693 + 694 + insn->code = BPF_JMP | BPF_EXIT; 695 + insn++; 696 + 697 + BUILD_BUG_ON(FIELD_SIZEOF(struct net_device, ifindex) != 4); 698 + BUILD_BUG_ON(FIELD_SIZEOF(struct net_device, type) != 2); 699 + 700 + insn->a_reg = A_REG; 701 + insn->x_reg = TMP_REG; 702 + 703 + if (fp->k == SKF_AD_OFF + SKF_AD_IFINDEX) { 704 + insn->code = BPF_LDX | BPF_MEM | BPF_W; 705 + insn->off = offsetof(struct net_device, ifindex); 706 + } else { 707 + insn->code = BPF_LDX | BPF_MEM | BPF_H; 708 + insn->off = offsetof(struct net_device, type); 709 + } 710 + break; 711 + 712 + case SKF_AD_OFF + SKF_AD_MARK: 713 + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4); 714 + 715 + insn->code = BPF_LDX | BPF_MEM | BPF_W; 716 + insn->a_reg = A_REG; 717 + insn->x_reg = CTX_REG; 718 + insn->off = offsetof(struct sk_buff, mark); 719 + break; 720 + 721 + case SKF_AD_OFF + SKF_AD_RXHASH: 722 + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4); 723 + 724 + insn->code = BPF_LDX | BPF_MEM | BPF_W; 725 + insn->a_reg = A_REG; 726 + insn->x_reg = CTX_REG; 727 + insn->off = offsetof(struct sk_buff, hash); 728 + break; 729 + 730 + case SKF_AD_OFF + SKF_AD_QUEUE: 731 + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, queue_mapping) != 2); 732 + 733 + insn->code = BPF_LDX | BPF_MEM | BPF_H; 734 + insn->a_reg = A_REG; 735 + insn->x_reg = CTX_REG; 736 + insn->off = offsetof(struct sk_buff, queue_mapping); 737 + break; 738 + 739 + case SKF_AD_OFF + SKF_AD_VLAN_TAG: 740 + case SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT: 741 + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2); 742 + 743 + insn->code = BPF_LDX | BPF_MEM | BPF_H; 744 + insn->a_reg = A_REG; 745 + insn->x_reg = CTX_REG; 746 + insn->off = offsetof(struct sk_buff, vlan_tci); 747 + insn++; 748 + 749 + BUILD_BUG_ON(VLAN_TAG_PRESENT != 0x1000); 750 + 751 + if (fp->k == SKF_AD_OFF + SKF_AD_VLAN_TAG) { 752 + insn->code = BPF_ALU | BPF_AND | BPF_K; 753 + insn->a_reg = A_REG; 754 + 
insn->imm = ~VLAN_TAG_PRESENT;
755 + } else {
756 + insn->code = BPF_ALU | BPF_RSH | BPF_K;
757 + insn->a_reg = A_REG;
758 + insn->imm = 12;
759 + insn++;
760 +
761 + insn->code = BPF_ALU | BPF_AND | BPF_K;
762 + insn->a_reg = A_REG;
763 + insn->imm = 1;
764 + }
765 + break;
766 +
767 + case SKF_AD_OFF + SKF_AD_PAY_OFFSET:
768 + case SKF_AD_OFF + SKF_AD_NLATTR:
769 + case SKF_AD_OFF + SKF_AD_NLATTR_NEST:
770 + case SKF_AD_OFF + SKF_AD_CPU:
771 + /* arg1 = ctx */
772 + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
773 + insn->a_reg = ARG1_REG;
774 + insn->x_reg = CTX_REG;
775 + insn++;
776 +
777 + /* arg2 = A */
778 + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
779 + insn->a_reg = ARG2_REG;
780 + insn->x_reg = A_REG;
781 + insn++;
782 +
783 + /* arg3 = X */
784 + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
785 + insn->a_reg = ARG3_REG;
786 + insn->x_reg = X_REG;
787 + insn++;
788 +
789 + /* Emit call(ctx, arg2=A, arg3=X) */
790 + insn->code = BPF_JMP | BPF_CALL;
791 + switch (fp->k) {
792 + case SKF_AD_OFF + SKF_AD_PAY_OFFSET:
793 + insn->imm = __skb_get_pay_offset - __bpf_call_base;
794 + break;
795 + case SKF_AD_OFF + SKF_AD_NLATTR:
796 + insn->imm = __skb_get_nlattr - __bpf_call_base;
797 + break;
798 + case SKF_AD_OFF + SKF_AD_NLATTR_NEST:
799 + insn->imm = __skb_get_nlattr_nest - __bpf_call_base;
800 + break;
801 + case SKF_AD_OFF + SKF_AD_CPU:
802 + insn->imm = __get_raw_cpu_id - __bpf_call_base;
803 + break;
804 + }
805 + break;
806 +
807 + case SKF_AD_OFF + SKF_AD_ALU_XOR_X:
808 + insn->code = BPF_ALU | BPF_XOR | BPF_X;
809 + insn->a_reg = A_REG;
810 + insn->x_reg = X_REG;
811 + break;
812 +
813 + default:
814 + /* This is just a dummy call to avoid letting the compiler
815 + * evict __bpf_call_base() as an optimization. Placed here
816 + * where no-one bothers.
817 + */
818 + BUG_ON(__bpf_call_base(0, 0, 0, 0, 0) != 0);
819 + return false;
820 + }
821 +
822 + *insnp = insn;
823 + return true;
824 + }
825 +
826 + /**
827 + * sk_convert_filter - convert filter program
828 + * @prog: the user passed filter program
829 + * @len: the length of the user passed filter program
830 + * @new_prog: buffer where converted program will be stored
831 + * @new_len: pointer to store length of converted program
832 +
833 + * Remap 'sock_filter' style BPF instruction set to 'sock_filter_int' style.
834 + * Conversion workflow:
835 +
836 + * 1) First pass for calculating the new program length:
837 + * sk_convert_filter(old_prog, old_len, NULL, &new_len)
838 +
839 + * 2) 2nd pass to remap in two passes: the 1st pass finds new
840 + * jump offsets, the 2nd pass does the remapping:
841 + * new_prog = kmalloc(sizeof(struct sock_filter_int) * new_len);
842 + * sk_convert_filter(old_prog, old_len, new_prog, &new_len);
843 +
844 + * User BPF's register A is mapped to our BPF register 0, user BPF
845 + * register X is mapped to BPF register 7; frame pointer is always
846 + * register 10; Context 'void *ctx' is passed in register 1 and kept
847 + * in register 6, that is, for socket filters: ctx == 'struct sk_buff *',
848 + * for seccomp: ctx == 'struct seccomp_data *'.
849 + */ 850 + int sk_convert_filter(struct sock_filter *prog, int len, 851 + struct sock_filter_int *new_prog, int *new_len) 852 + { 853 + int new_flen = 0, pass = 0, target, i; 854 + struct sock_filter_int *new_insn; 855 + struct sock_filter *fp; 856 + int *addrs = NULL; 857 + u8 bpf_src; 858 + 859 + BUILD_BUG_ON(BPF_MEMWORDS * sizeof(u32) > MAX_BPF_STACK); 860 + BUILD_BUG_ON(FP_REG + 1 != MAX_BPF_REG); 861 + 862 + if (len <= 0 || len >= BPF_MAXINSNS) 863 + return -EINVAL; 864 + 865 + if (new_prog) { 866 + addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL); 867 + if (!addrs) 868 + return -ENOMEM; 869 + } 870 + 871 + do_pass: 872 + new_insn = new_prog; 873 + fp = prog; 874 + 875 + if (new_insn) { 876 + new_insn->code = BPF_ALU64 | BPF_MOV | BPF_X; 877 + new_insn->a_reg = CTX_REG; 878 + new_insn->x_reg = ARG1_REG; 879 + } 880 + new_insn++; 881 + 882 + for (i = 0; i < len; fp++, i++) { 883 + struct sock_filter_int tmp_insns[6] = { }; 884 + struct sock_filter_int *insn = tmp_insns; 885 + 886 + if (addrs) 887 + addrs[i] = new_insn - new_prog; 888 + 889 + switch (fp->code) { 890 + /* All arithmetic insns and skb loads map as-is. */ 891 + case BPF_ALU | BPF_ADD | BPF_X: 892 + case BPF_ALU | BPF_ADD | BPF_K: 893 + case BPF_ALU | BPF_SUB | BPF_X: 894 + case BPF_ALU | BPF_SUB | BPF_K: 895 + case BPF_ALU | BPF_AND | BPF_X: 896 + case BPF_ALU | BPF_AND | BPF_K: 897 + case BPF_ALU | BPF_OR | BPF_X: 898 + case BPF_ALU | BPF_OR | BPF_K: 899 + case BPF_ALU | BPF_LSH | BPF_X: 900 + case BPF_ALU | BPF_LSH | BPF_K: 901 + case BPF_ALU | BPF_RSH | BPF_X: 902 + case BPF_ALU | BPF_RSH | BPF_K: 903 + case BPF_ALU | BPF_XOR | BPF_X: 904 + case BPF_ALU | BPF_XOR | BPF_K: 905 + case BPF_ALU | BPF_MUL | BPF_X: 906 + case BPF_ALU | BPF_MUL | BPF_K: 907 + case BPF_ALU | BPF_DIV | BPF_X: 908 + case BPF_ALU | BPF_DIV | BPF_K: 909 + case BPF_ALU | BPF_MOD | BPF_X: 910 + case BPF_ALU | BPF_MOD | BPF_K: 911 + case BPF_ALU | BPF_NEG: 912 + case BPF_LD | BPF_ABS | BPF_W: 913 + case BPF_LD | BPF_ABS | BPF_H: 914 + case BPF_LD | BPF_ABS | BPF_B: 915 + case BPF_LD | BPF_IND | BPF_W: 916 + case BPF_LD | BPF_IND | BPF_H: 917 + case BPF_LD | BPF_IND | BPF_B: 918 + /* Check for overloaded BPF extension and 919 + * directly convert it if found, otherwise 920 + * just move on with mapping. 921 + */ 922 + if (BPF_CLASS(fp->code) == BPF_LD && 923 + BPF_MODE(fp->code) == BPF_ABS && 924 + convert_bpf_extensions(fp, &insn)) 925 + break; 926 + 927 + insn->code = fp->code; 928 + insn->a_reg = A_REG; 929 + insn->x_reg = X_REG; 930 + insn->imm = fp->k; 931 + break; 932 + 933 + /* Jump opcodes map as-is, but offsets need adjustment. */ 934 + case BPF_JMP | BPF_JA: 935 + target = i + fp->k + 1; 936 + insn->code = fp->code; 937 + #define EMIT_JMP \ 938 + do { \ 939 + if (target >= len || target < 0) \ 940 + goto err; \ 941 + insn->off = addrs ? addrs[target] - addrs[i] - 1 : 0; \ 942 + /* Adjust pc relative offset for 2nd or 3rd insn. */ \ 943 + insn->off -= insn - tmp_insns; \ 944 + } while (0) 945 + 946 + EMIT_JMP; 947 + break; 948 + 949 + case BPF_JMP | BPF_JEQ | BPF_K: 950 + case BPF_JMP | BPF_JEQ | BPF_X: 951 + case BPF_JMP | BPF_JSET | BPF_K: 952 + case BPF_JMP | BPF_JSET | BPF_X: 953 + case BPF_JMP | BPF_JGT | BPF_K: 954 + case BPF_JMP | BPF_JGT | BPF_X: 955 + case BPF_JMP | BPF_JGE | BPF_K: 956 + case BPF_JMP | BPF_JGE | BPF_X: 957 + if (BPF_SRC(fp->code) == BPF_K && (int) fp->k < 0) { 958 + /* BPF immediates are signed, zero extend 959 + * immediate into tmp register and use it 960 + * in compare insn. 
961 + */
962 + insn->code = BPF_ALU | BPF_MOV | BPF_K;
963 + insn->a_reg = TMP_REG;
964 + insn->imm = fp->k;
965 + insn++;
966 +
967 + insn->a_reg = A_REG;
968 + insn->x_reg = TMP_REG;
969 + bpf_src = BPF_X;
970 + } else {
971 + insn->a_reg = A_REG;
972 + insn->x_reg = X_REG;
973 + insn->imm = fp->k;
974 + bpf_src = BPF_SRC(fp->code);
975 + }
976 +
977 + /* Common case where 'jump_false' is next insn. */
978 + if (fp->jf == 0) {
979 + insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
980 + target = i + fp->jt + 1;
981 + EMIT_JMP;
982 + break;
983 + }
984 +
985 + /* Convert JEQ into JNE when 'jump_true' is next insn. */
986 + if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
987 + insn->code = BPF_JMP | BPF_JNE | bpf_src;
988 + target = i + fp->jf + 1;
989 + EMIT_JMP;
990 + break;
991 + }
992 +
993 + /* Other jumps are mapped into two insns: Jxx and JA. */
994 + target = i + fp->jt + 1;
995 + insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
996 + EMIT_JMP;
997 + insn++;
998 +
999 + insn->code = BPF_JMP | BPF_JA;
1000 + target = i + fp->jf + 1;
1001 + EMIT_JMP;
1002 + break;
1003 +
1004 + /* ldxb 4 * ([14] & 0xf) is remapped into 6 insns. */
1005 + case BPF_LDX | BPF_MSH | BPF_B:
1006 + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
1007 + insn->a_reg = TMP_REG;
1008 + insn->x_reg = A_REG;
1009 + insn++;
1010 +
1011 + insn->code = BPF_LD | BPF_ABS | BPF_B;
1012 + insn->a_reg = A_REG;
1013 + insn->imm = fp->k;
1014 + insn++;
1015 +
1016 + insn->code = BPF_ALU | BPF_AND | BPF_K;
1017 + insn->a_reg = A_REG;
1018 + insn->imm = 0xf;
1019 + insn++;
1020 +
1021 + insn->code = BPF_ALU | BPF_LSH | BPF_K;
1022 + insn->a_reg = A_REG;
1023 + insn->imm = 2;
1024 + insn++;
1025 +
1026 + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
1027 + insn->a_reg = X_REG;
1028 + insn->x_reg = A_REG;
1029 + insn++;
1030 +
1031 + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
1032 + insn->a_reg = A_REG;
1033 + insn->x_reg = TMP_REG;
1034 + break;
1035 +
1036 + /* RET_K, RET_A are remapped into 2 insns. */
1037 + case BPF_RET | BPF_A:
1038 + case BPF_RET | BPF_K:
1039 + insn->code = BPF_ALU | BPF_MOV |
1040 + (BPF_RVAL(fp->code) == BPF_K ?
1041 + BPF_K : BPF_X);
1042 + insn->a_reg = 0;
1043 + insn->x_reg = A_REG;
1044 + insn->imm = fp->k;
1045 + insn++;
1046 +
1047 + insn->code = BPF_JMP | BPF_EXIT;
1048 + break;
1049 +
1050 + /* Store to stack. */
1051 + case BPF_ST:
1052 + case BPF_STX:
1053 + insn->code = BPF_STX | BPF_MEM | BPF_W;
1054 + insn->a_reg = FP_REG;
1055 + insn->x_reg = fp->code == BPF_ST ? A_REG : X_REG;
1056 + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
1057 + break;
1058 +
1059 + /* Load from stack. */
1060 + case BPF_LD | BPF_MEM:
1061 + case BPF_LDX | BPF_MEM:
1062 + insn->code = BPF_LDX | BPF_MEM | BPF_W;
1063 + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ?
1064 + A_REG : X_REG;
1065 + insn->x_reg = FP_REG;
1066 + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
1067 + break;
1068 +
1069 + /* A = K or X = K */
1070 + case BPF_LD | BPF_IMM:
1071 + case BPF_LDX | BPF_IMM:
1072 + insn->code = BPF_ALU | BPF_MOV | BPF_K;
1073 + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ?
1074 + A_REG : X_REG; 1075 + insn->imm = fp->k; 1076 + break; 1077 + 1078 + /* X = A */ 1079 + case BPF_MISC | BPF_TAX: 1080 + insn->code = BPF_ALU64 | BPF_MOV | BPF_X; 1081 + insn->a_reg = X_REG; 1082 + insn->x_reg = A_REG; 1083 + break; 1084 + 1085 + /* A = X */ 1086 + case BPF_MISC | BPF_TXA: 1087 + insn->code = BPF_ALU64 | BPF_MOV | BPF_X; 1088 + insn->a_reg = A_REG; 1089 + insn->x_reg = X_REG; 1090 + break; 1091 + 1092 + /* A = skb->len or X = skb->len */ 1093 + case BPF_LD | BPF_W | BPF_LEN: 1094 + case BPF_LDX | BPF_W | BPF_LEN: 1095 + insn->code = BPF_LDX | BPF_MEM | BPF_W; 1096 + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 1097 + A_REG : X_REG; 1098 + insn->x_reg = CTX_REG; 1099 + insn->off = offsetof(struct sk_buff, len); 1100 + break; 1101 + 1102 + /* access seccomp_data fields */ 1103 + case BPF_LDX | BPF_ABS | BPF_W: 1104 + insn->code = BPF_LDX | BPF_MEM | BPF_W; 1105 + insn->a_reg = A_REG; 1106 + insn->x_reg = CTX_REG; 1107 + insn->off = fp->k; 1108 + break; 1109 + 1110 + default: 1111 + goto err; 1112 + } 1113 + 1114 + insn++; 1115 + if (new_prog) 1116 + memcpy(new_insn, tmp_insns, 1117 + sizeof(*insn) * (insn - tmp_insns)); 1118 + 1119 + new_insn += insn - tmp_insns; 1120 + } 1121 + 1122 + if (!new_prog) { 1123 + /* Only calculating new length. */ 1124 + *new_len = new_insn - new_prog; 1125 + return 0; 1126 + } 1127 + 1128 + pass++; 1129 + if (new_flen != new_insn - new_prog) { 1130 + new_flen = new_insn - new_prog; 1131 + if (pass > 2) 1132 + goto err; 1133 + 1134 + goto do_pass; 1135 + } 1136 + 1137 + kfree(addrs); 1138 + BUG_ON(*new_len != new_flen); 1139 + return 0; 1140 + err: 1141 + kfree(addrs); 1142 + return -EINVAL; 1143 + } 1144 + 1145 + /* Security: 1146 + * 412 1147 * A BPF program is able to use 16 cells of memory to store intermediate 413 - * values (check u32 mem[BPF_MEMWORDS] in sk_run_filter()) 1148 + * values (check u32 mem[BPF_MEMWORDS] in sk_run_filter()). 
1149 + *
414 1150 * As we don't want to clear mem[] array for each packet going through
415 1151 * sk_run_filter(), we check that a filter loaded by user never tries to read
416 1152 * a cell if not previously written, and we check all branches to be sure
··· 1375 629 }
1376 630 EXPORT_SYMBOL(sk_chk_filter);
1377 631
632 + static int sk_store_orig_filter(struct sk_filter *fp,
633 + const struct sock_fprog *fprog)
634 + {
635 + unsigned int fsize = sk_filter_proglen(fprog);
636 + struct sock_fprog_kern *fkprog;
637 +
638 + fp->orig_prog = kmalloc(sizeof(*fkprog), GFP_KERNEL);
639 + if (!fp->orig_prog)
640 + return -ENOMEM;
641 +
642 + fkprog = fp->orig_prog;
643 + fkprog->len = fprog->len;
644 + fkprog->filter = kmemdup(fp->insns, fsize, GFP_KERNEL);
645 + if (!fkprog->filter) {
646 + kfree(fp->orig_prog);
647 + return -ENOMEM;
648 + }
649 +
650 + return 0;
651 + }
652 +
653 + static void sk_release_orig_filter(struct sk_filter *fp)
654 + {
655 + struct sock_fprog_kern *fprog = fp->orig_prog;
656 +
657 + if (fprog) {
658 + kfree(fprog->filter);
659 + kfree(fprog);
660 + }
661 + }
662 +
1378 663 /**
1379 664 * sk_filter_release_rcu - Release a socket filter by rcu_head
1380 665 * @rcu: rcu_head that contains the sk_filter to free
1381 666 */
1382 - void sk_filter_release_rcu(struct rcu_head *rcu)
667 + static void sk_filter_release_rcu(struct rcu_head *rcu)
1383 668 {
1384 669 struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
1385 670
671 + sk_release_orig_filter(fp);
1386 672 bpf_jit_free(fp);
1387 673 }
1388 - EXPORT_SYMBOL(sk_filter_release_rcu);
1389 674
1390 - static int __sk_prepare_filter(struct sk_filter *fp)
675 + /**
676 + * sk_filter_release - release a socket filter
677 + * @fp: filter to remove
678 + *
679 + * Remove a filter from a socket and release its resources.
680 + */
681 + static void sk_filter_release(struct sk_filter *fp)
682 + {
683 + if (atomic_dec_and_test(&fp->refcnt))
684 + call_rcu(&fp->rcu, sk_filter_release_rcu);
685 + }
686 +
687 + void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
688 + {
689 + atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
690 + sk_filter_release(fp);
691 + }
692 +
693 + void sk_filter_charge(struct sock *sk, struct sk_filter *fp)
694 + {
695 + atomic_inc(&fp->refcnt);
696 + atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
697 + }
698 +
699 + static struct sk_filter *__sk_migrate_realloc(struct sk_filter *fp,
700 + struct sock *sk,
701 + unsigned int len)
702 + {
703 + struct sk_filter *fp_new;
704 +
705 + if (sk == NULL)
706 + return krealloc(fp, len, GFP_KERNEL);
707 +
708 + fp_new = sock_kmalloc(sk, len, GFP_KERNEL);
709 + if (fp_new) {
710 + memcpy(fp_new, fp, sizeof(struct sk_filter));
711 + /* As we're keeping orig_prog around in fp_new,
712 + * we need to make sure we're not evicting it
713 + * from the old fp.
714 + */
715 + fp->orig_prog = NULL;
716 + sk_filter_uncharge(sk, fp);
717 + }
718 +
719 + return fp_new;
720 + }
721 +
722 + static struct sk_filter *__sk_migrate_filter(struct sk_filter *fp,
723 + struct sock *sk)
724 + {
725 + struct sock_filter *old_prog;
726 + struct sk_filter *old_fp;
727 + int i, err, new_len, old_len = fp->len;
728 +
729 + /* We are free to overwrite insns et al right here as they
730 + * won't be used internally anymore after the
731 + * migration to the internal BPF instruction
732 + * representation.
733 + */
734 + BUILD_BUG_ON(sizeof(struct sock_filter) !=
735 + sizeof(struct sock_filter_int));
736 +
737 + /* For now, we need to unfiddle BPF_S_* identifiers in place.
738 + * This can sooner or later be subject to removal, e.g. when
739 + * JITs have been converted.
740 + */
741 + for (i = 0; i < fp->len; i++)
742 + sk_decode_filter(&fp->insns[i], &fp->insns[i]);
743 +
744 + /* Conversion cannot happen on overlapping memory areas,
745 + * so we need to keep the user BPF around until the 2nd
746 + * pass. At this time, the user BPF is stored in fp->insns.
747 + */
748 + old_prog = kmemdup(fp->insns, old_len * sizeof(struct sock_filter),
749 + GFP_KERNEL);
750 + if (!old_prog) {
751 + err = -ENOMEM;
752 + goto out_err;
753 + }
754 +
755 + /* 1st pass: calculate the new program length. */
756 + err = sk_convert_filter(old_prog, old_len, NULL, &new_len);
757 + if (err)
758 + goto out_err_free;
759 +
760 + /* Expand fp for appending the new filter representation. */
761 + old_fp = fp;
762 + fp = __sk_migrate_realloc(old_fp, sk, sk_filter_size(new_len));
763 + if (!fp) {
764 + /* The old_fp is still around in case we couldn't
765 + * allocate new memory, so uncharge on that one.
766 + */
767 + fp = old_fp;
768 + err = -ENOMEM;
769 + goto out_err_free;
770 + }
771 +
772 + fp->bpf_func = sk_run_filter_int_skb;
773 + fp->len = new_len;
774 +
775 + /* 2nd pass: remap sock_filter insns into sock_filter_int insns. */
776 + err = sk_convert_filter(old_prog, old_len, fp->insnsi, &new_len);
777 + if (err)
778 + /* 2nd sk_convert_filter() can fail only if it fails
779 + * to allocate memory; the remapping itself must succeed.
780 + * Note that at this time old_fp has already been released
781 + * by __sk_migrate_realloc().
782 + */
783 + goto out_err_free;
784 +
785 + kfree(old_prog);
786 + return fp;
787 +
788 + out_err_free:
789 + kfree(old_prog);
790 + out_err:
791 + /* Rollback filter setup. */
792 + if (sk != NULL)
793 + sk_filter_uncharge(sk, fp);
794 + else
795 + kfree(fp);
796 + return ERR_PTR(err);
797 + }
798 +
799 + static struct sk_filter *__sk_prepare_filter(struct sk_filter *fp,
800 + struct sock *sk)
1391 801 {
1392 802 int err;
1393 803
1394 - fp->bpf_func = sk_run_filter;
804 + fp->bpf_func = NULL;
805 + fp->jited = 0;
1395 806
1396 807 err = sk_chk_filter(fp->insns, fp->len);
1397 808 if (err)
1398 - return err;
809 + return ERR_PTR(err);
1399 810
811 + /* Probe if we can JIT compile the filter and, if so, do
812 + * the compilation.
813 + */
1400 814 bpf_jit_compile(fp);
1401 - return 0;
815 +
816 + /* JIT compiler couldn't process this filter, so do the
817 + * internal BPF translation for the optimized interpreter.
818 + */
819 + if (!fp->jited)
820 + fp = __sk_migrate_filter(fp, sk);
821 +
822 + return fp;
1402 823 }
1403 824
1404 825 /**
··· 1581 668 int sk_unattached_filter_create(struct sk_filter **pfp,
1582 669 struct sock_fprog *fprog)
1583 670 {
671 + unsigned int fsize = sk_filter_proglen(fprog);
1584 672 struct sk_filter *fp;
1585 - unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
1586 - int err;
1587 673
1588 674 /* Make sure the new filter is there and of the right length.
*/
1589 675 if (fprog->filter == NULL)
··· 1591 679 fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
1592 680 if (!fp)
1593 681 return -ENOMEM;
682 +
1594 683 memcpy(fp->insns, fprog->filter, fsize);
1595 684
1596 685 atomic_set(&fp->refcnt, 1);
1597 686 fp->len = fprog->len;
687 + /* Since unattached filters are not copied back to user
688 + * space through sk_get_filter(), we do not need to hold
689 + * a copy here, and can spare us the work.
690 + */
691 + fp->orig_prog = NULL;
1598 692
1599 - err = __sk_prepare_filter(fp);
1600 - if (err)
1601 - goto free_mem;
693 + /* __sk_prepare_filter() already takes care of freeing
694 + * memory in case something goes wrong.
695 + */
696 + fp = __sk_prepare_filter(fp, NULL);
697 + if (IS_ERR(fp))
698 + return PTR_ERR(fp);
1602 699
1603 700 *pfp = fp;
1604 701 return 0;
1605 - free_mem:
1606 - kfree(fp);
1607 - return err;
1608 702 }
1609 703 EXPORT_SYMBOL_GPL(sk_unattached_filter_create);
1610 704
··· 1633 715 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
1634 716 {
1635 717 struct sk_filter *fp, *old_fp;
1636 - unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
718 + unsigned int fsize = sk_filter_proglen(fprog);
1637 719 unsigned int sk_fsize = sk_filter_size(fprog->len);
1638 720 int err;
1639 721
··· 1647 729 fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
1648 730 if (!fp)
1649 731 return -ENOMEM;
732 +
1650 733 if (copy_from_user(fp->insns, fprog->filter, fsize)) {
1651 734 sock_kfree_s(sk, fp, sk_fsize);
1652 735 return -EFAULT;
··· 1656 737 atomic_set(&fp->refcnt, 1);
1657 738 fp->len = fprog->len;
1658 739
1659 - err = __sk_prepare_filter(fp);
740 + err = sk_store_orig_filter(fp, fprog);
1660 741 if (err) {
1661 742 sk_filter_uncharge(sk, fp);
1662 - return err;
743 + return -ENOMEM;
1663 744 }
745 +
746 + /* __sk_prepare_filter() already takes care of uncharging
747 + * memory in case something goes wrong.
748 + */
749 + fp = __sk_prepare_filter(fp, sk);
750 + if (IS_ERR(fp))
751 + return PTR_ERR(fp);
1664 752
1665 753 old_fp = rcu_dereference_protected(sk->sk_filter,
1666 754 sock_owned_by_user(sk));
··· 1675 749
1676 750 if (old_fp)
1677 751 sk_filter_uncharge(sk, old_fp);
752 +
1678 753 return 0;
1679 754 }
1680 755 EXPORT_SYMBOL_GPL(sk_attach_filter);
··· 1695 768 sk_filter_uncharge(sk, filter);
1696 769 ret = 0;
1697 770 }
771 +
1698 772 return ret;
1699 773 }
1700 774 EXPORT_SYMBOL_GPL(sk_detach_filter);
··· 1778 850 to->k = filt->k;
1779 851 }
1780 852
1781 - int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf, unsigned int len)
853 + int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
854 + unsigned int len)
1782 855 {
856 + struct sock_fprog_kern *fprog;
1783 857 struct sk_filter *filter;
1784 - int i, ret;
858 + int ret = 0;
1785 859
1786 860 lock_sock(sk);
1787 861 filter = rcu_dereference_protected(sk->sk_filter,
1788 - sock_owned_by_user(sk));
1789 - ret = 0;
862 + sock_owned_by_user(sk));
1790 863 if (!filter)
1791 864 goto out;
1792 - ret = filter->len;
865 +
866 + /* We're copying the filter that has been originally attached,
867 + * so no conversion/decode needed anymore.
868 + */
869 + fprog = filter->orig_prog;
870 +
871 + ret = fprog->len;
1793 872 if (!len)
873 + /* User space only asks for the number of filter blocks.
*/
1794 874 goto out;
875 +
1795 876 ret = -EINVAL;
1796 - if (len < filter->len)
877 + if (len < fprog->len)
1797 878 goto out;
1798 879
1799 880 ret = -EFAULT;
1800 - for (i = 0; i < filter->len; i++) {
1801 - struct sock_filter fb;
881 + if (copy_to_user(ubuf, fprog->filter, sk_filter_proglen(fprog)))
882 + goto out;
1802 883
1803 - sk_decode_filter(&filter->insns[i], &fb);
1804 - if (copy_to_user(&ubuf[i], &fb, sizeof(fb)))
1805 - goto out;
1806 - }
1807 -
1808 - ret = filter->len;
884 + /* Instead of bytes, the API expects us to return the number
885 + * of filter blocks.
886 + */
887 + ret = fprog->len;
1809 888 out:
1810 889 release_sock(sk);
1811 890 return ret;
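The kernel-doc on sk_convert_filter() above prescribes a two-pass calling
convention: a first call with a NULL buffer that only computes the converted
length, an allocation, then a second call that emits the remapped
instructions. A minimal sketch of a caller following that convention, for
illustration only and not part of the patch (assumes 'old_prog'/'old_len'
already passed sk_chk_filter(), error handling shortened):

    #include <linux/filter.h>
    #include <linux/slab.h>

    /* Hedged sketch of the two-pass sk_convert_filter() workflow. */
    static struct sock_filter_int *convert_example(struct sock_filter *old_prog,
                                                   int old_len, int *new_len)
    {
            struct sock_filter_int *new_prog;

            /* 1st pass: only calculate the new program length. */
            if (sk_convert_filter(old_prog, old_len, NULL, new_len))
                    return NULL;

            new_prog = kmalloc(*new_len * sizeof(*new_prog), GFP_KERNEL);
            if (!new_prog)
                    return NULL;

            /* 2nd pass: remap the instructions into the new buffer. */
            if (sk_convert_filter(old_prog, old_len, new_prog, new_len)) {
                    kfree(new_prog);
                    return NULL;
            }

            return new_prog;
    }

This is exactly the pattern __sk_migrate_filter() implements above, with the
twist that it reuses the sk_filter allocation via __sk_migrate_realloc()
instead of a plain kmalloc().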
+10 -13
net/core/sock_diag.c
··· 52 52 int sock_diag_put_filterinfo(struct user_namespace *user_ns, struct sock *sk, 53 53 struct sk_buff *skb, int attrtype) 54 54 { 55 - struct nlattr *attr; 55 + struct sock_fprog_kern *fprog; 56 56 struct sk_filter *filter; 57 - unsigned int len; 57 + struct nlattr *attr; 58 + unsigned int flen; 58 59 int err = 0; 59 60 60 61 if (!ns_capable(user_ns, CAP_NET_ADMIN)) { ··· 64 63 } 65 64 66 65 rcu_read_lock(); 67 - 68 66 filter = rcu_dereference(sk->sk_filter); 69 - len = filter ? filter->len * sizeof(struct sock_filter) : 0; 67 + if (!filter) 68 + goto out; 70 69 71 - attr = nla_reserve(skb, attrtype, len); 70 + fprog = filter->orig_prog; 71 + flen = sk_filter_proglen(fprog); 72 + 73 + attr = nla_reserve(skb, attrtype, flen); 72 74 if (attr == NULL) { 73 75 err = -EMSGSIZE; 74 76 goto out; 75 77 } 76 78 77 - if (filter) { 78 - struct sock_filter *fb = (struct sock_filter *)nla_data(attr); 79 - int i; 80 - 81 - for (i = 0; i < filter->len; i++, fb++) 82 - sk_decode_filter(&filter->insns[i], fb); 83 - } 84 - 79 + memcpy(nla_data(attr), fprog->filter, flen); 85 80 out: 86 81 rcu_read_unlock(); 87 82 return err;
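With orig_prog now holding the classic program as the user supplied it,
sock_diag can report it verbatim and the per-instruction sk_decode_filter()
loop goes away. The sk_filter_proglen() helper used for the attribute length
is not defined in this hunk (presumably added in include/linux/filter.h); its
assumed shape is simply the classic program size in bytes:

    /* Assumed shape of sk_filter_proglen(), for illustration only:
     * length in bytes of a classic BPF program of 'len' instructions.
     */
    #define sk_filter_proglen(fprog) \
            ((fprog)->len * sizeof((fprog)->filter[0]))

With that, the nla_reserve()/memcpy() pair above hands the originally
attached filter back byte for byte.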
+20 -7
net/core/timestamping.c
··· 23 23 #include <linux/skbuff.h> 24 24 #include <linux/export.h> 25 25 26 - static struct sock_filter ptp_filter[] = { 27 - PTP_FILTER 28 - }; 26 + static struct sk_filter *ptp_insns __read_mostly; 27 + 28 + unsigned int ptp_classify_raw(const struct sk_buff *skb) 29 + { 30 + return SK_RUN_FILTER(ptp_insns, skb); 31 + } 32 + EXPORT_SYMBOL_GPL(ptp_classify_raw); 29 33 30 34 static unsigned int classify(const struct sk_buff *skb) 31 35 { 32 - if (likely(skb->dev && 33 - skb->dev->phydev && 36 + if (likely(skb->dev && skb->dev->phydev && 34 37 skb->dev->phydev->drv)) 35 - return sk_run_filter(skb, ptp_filter); 38 + return ptp_classify_raw(skb); 36 39 else 37 40 return PTP_CLASS_NONE; 38 41 } ··· 63 60 if (likely(phydev->drv->txtstamp)) { 64 61 if (!atomic_inc_not_zero(&sk->sk_refcnt)) 65 62 return; 63 + 66 64 clone = skb_clone(skb, GFP_ATOMIC); 67 65 if (!clone) { 68 66 sock_put(sk); 69 67 return; 70 68 } 69 + 71 70 clone->sk = sk; 72 71 phydev->drv->txtstamp(phydev, clone, type); 73 72 } ··· 94 89 } 95 90 96 91 *skb_hwtstamps(skb) = *hwtstamps; 92 + 97 93 serr = SKB_EXT_ERR(skb); 98 94 memset(serr, 0, sizeof(*serr)); 99 95 serr->ee.ee_errno = ENOMSG; 100 96 serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING; 101 97 skb->sk = NULL; 98 + 102 99 err = sock_queue_err_skb(sk, skb); 100 + 103 101 sock_put(sk); 104 102 if (err) 105 103 kfree_skb(skb); ··· 143 135 144 136 void __init skb_timestamping_init(void) 145 137 { 146 - BUG_ON(sk_chk_filter(ptp_filter, ARRAY_SIZE(ptp_filter))); 138 + static struct sock_filter ptp_filter[] = { PTP_FILTER }; 139 + struct sock_fprog ptp_prog = { 140 + .len = ARRAY_SIZE(ptp_filter), .filter = ptp_filter, 141 + }; 142 + 143 + BUG_ON(sk_unattached_filter_create(&ptp_insns, &ptp_prog)); 147 144 }
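The timestamping conversion above shows the pattern any in-kernel BPF user
follows now: build the classic sock_fprog once, let
sk_unattached_filter_create() check it and transparently JIT or migrate it,
then classify through SK_RUN_FILTER(). A hedged sketch of that lifecycle for
a hypothetical in-kernel user (all 'example_*' names are invented for
illustration):

    #include <linux/filter.h>
    #include <linux/init.h>
    #include <linux/skbuff.h>

    static struct sk_filter *example_filter __read_mostly;

    static int __init example_filter_init(void)
    {
            /* Trivial classic BPF program that returns 0 for every
             * packet; a real user would plug in e.g. PTP_FILTER.
             */
            static struct sock_filter insns[] = {
                    BPF_STMT(BPF_RET | BPF_K, 0),
            };
            struct sock_fprog prog = {
                    .len = ARRAY_SIZE(insns),
                    .filter = insns,
            };

            /* Checks the filter, then JITs it if possible or migrates
             * it into the new internal representation.
             */
            return sk_unattached_filter_create(&example_filter, &prog);
    }

    static unsigned int example_classify(const struct sk_buff *skb)
    {
            return SK_RUN_FILTER(example_filter, skb);
    }

    static void example_filter_exit(void)
    {
            sk_unattached_filter_destroy(example_filter);
    }

Note that skb_timestamping_init() takes the same route, which is why the
sk_chk_filter() BUG_ON could be replaced by a single
sk_unattached_filter_create() call.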