
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-05-17

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Provide a new BPF helper for doing a FIB and neighbor lookup
in the kernel tables from an XDP or tc BPF program. The helper
provides a fast-path for forwarding packets. The API supports
IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6 are
implemented in this initial work, from David (Ahern).

2) Just a tiny diff, but a huge feature enabled for the nfp driver by
extending the BPF offload beyond a pure host processing offload.
Offloaded XDP programs are allowed to set the RX queue index, thus
opening the door to a fully programmable RSS/n-tuple filter
replacement. Once BPF has decided on a queue, the device data-path
will skip the conventional RSS processing completely, from Jakub.

3) The original sockmap implementation was array based, similar to
devmap. However, unlike devmap where an ifindex has a 1:1 mapping
into the map, there are use cases with sockets that need to be
referenced using longer keys. Hence, a sockhash map is added, reusing
as much of the sockmap code as possible, from John.

4) Introduce BTF ID. The ID is allocated through an IDR, similarly to
BPF maps and progs. It also makes BTF accessible to user
space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
through BPF_OBJ_GET_INFO_BY_FD, from Martin.

5) Enable BPF stackmap with build_id also in NMI context. Since the
up_read() of current->mm->mmap_sem cannot be called in NMI context,
build_id could not be parsed there. This work defers the up_read()
via a per-cpu irq_work so that at least limited support can be
enabled, from Song.

6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
JIT conversion, as well as implementation of an optimized 32/64 bit
immediate load in the arm64 JIT that reduces the number of
emitted instructions; tested real-world programs shrank by three
percent, from Daniel.

7) Add an ifindex parameter to the libbpf loader in order to enable
BPF offload support. Right now only iproute2 can load offloaded
BPF; this enables libbpf to do so as well, allowing direct integration
into other applications, from David (Beckett).

8) Convert the plain text documentation under Documentation/bpf/ into
RST format since this is the appropriate standard the kernel is
moving to for all documentation. Also add an overview README.rst,
from Jesper.

9) Add __printf verification attribute to the bpf_verifier_vlog()
helper. Though it uses va_list we can still allow gcc to check
the format string, from Mathieu.

10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
is a bash 4.0+ feature which is not guaranteed to be available
when calling out to shell, therefore use a more portable variant,
from Joe.

11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
instead of relying on the gcc built-in, from Björn.

12) Fix a sock hashmap kmalloc warning reported by syzbot when an
overly large key size is used in the hashmap, causing overflows
in htab->elem_size. Reject a bogus attr->key_size early in
sock_hash_alloc(), from Yonghong.

13) Ensure in BPF selftests when urandom_read is being linked that
--build-id is always enabled so that test_stacktrace_build_id[_nmi]
won't be failing, from Alexei.

14) Add bitsperlong.h as well as errno.h uapi headers into the tools
header infrastructure which point to one of the arch specific
uapi headers. This was needed in order to fix a build error on
some systems for the BPF selftests, from Sirio.

15) Allow for short options to be used in the xdp_monitor BPF sample
code. And also a bpf.h tools uapi header sync in order to fix a
selftest build failure. Both from Prashant.

16) More formally clarify the meaning of ID in the direct packet access
section of the BPF documentation, from Wang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

+4601 -1866

Documentation/bpf/README.rst (+36 lines, new file)
=================
BPF documentation
=================

This directory contains documentation for the BPF (Berkeley Packet
Filter) facility, with a focus on the extended BPF version (eBPF).

This kernel side documentation is still work in progress. The main
textual documentation is (for historical reasons) described in
`Documentation/networking/filter.txt`_, which describes both the
classical and extended BPF instruction-set.
The Cilium project also maintains a `BPF and XDP Reference Guide`_
that goes into great technical depth about the BPF architecture.

The primary info for the bpf syscall is available in the `man-pages`_
for `bpf(2)`_.


Frequently asked questions (FAQ)
================================

Two sets of Questions and Answers (Q&A) are maintained.

* Q&A for common questions about BPF: bpf_design_QA_

* Q&A for developers interacting with the BPF subsystem: bpf_devel_QA_


.. Links:
.. _bpf_design_QA: bpf_design_QA.rst
.. _bpf_devel_QA: bpf_devel_QA.rst
.. _Documentation/networking/filter.txt: ../networking/filter.txt
.. _man-pages: https://www.kernel.org/doc/man-pages/
.. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html
.. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/
Documentation/bpf/bpf_design_QA.rst (+221 lines, new file)
==============
BPF Design Q&A
==============

BPF's extensibility and applicability to networking, tracing and security
in the Linux kernel, together with the several user space implementations
of the BPF virtual machine, have led to a number of misunderstandings about
what BPF actually is. This short Q&A is an attempt to address that and to
outline the direction in which BPF is heading long term.

.. contents::
   :local:
   :depth: 3

Questions and Answers
=====================

Q: Is BPF a generic instruction set similar to x64 and arm64?
-------------------------------------------------------------
A: NO.

Q: Is BPF a generic virtual machine?
------------------------------------
A: NO.

BPF is a generic instruction set *with* C calling convention.
-------------------------------------------------------------

Q: Why was the C calling convention chosen?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A: Because BPF programs are designed to run in the Linux kernel,
which is written in C. Hence BPF defines an instruction set compatible
with the two most used architectures, x64 and arm64 (taking into
consideration important quirks of other architectures), and defines
a calling convention that is compatible with the C calling
convention of the Linux kernel on those architectures.

Q: Can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as the return value.

Q: Can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. The BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set.
(unlike the x64 ISA that allows msft, cdecl and other conventions)

Q: Can BPF programs access the instruction pointer or return address?
---------------------------------------------------------------------
A: NO.

Q: Can BPF programs access the stack pointer?
---------------------------------------------
A: NO.

Only the frame pointer (register R10) is accessible.
From the compiler's point of view it is necessary to have a stack pointer.
For example, LLVM defines register R11 as the stack pointer in its
BPF back end, but it makes sure that generated code never uses it.

Q: Does the C calling convention diminish possible use cases?
-------------------------------------------------------------
A: YES.

The BPF design forces addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps, with
seamless interoperability between them. It lets the kernel call into
BPF programs and programs call kernel helpers with zero overhead,
as if all of them were native C code. That is particularly the case
for JITed BPF programs that are indistinguishable from
native kernel C code.

Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.

At least for now, until the BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read-only sections and all other normal constructs
that C code can produce.

Q: Can loops be supported in a safe way?
----------------------------------------
A: It's not clear yet.

BPF developers are trying to find a way to
support bounded loops where the verifier can guarantee that
the program terminates in less than 4096 instructions.

Instruction level questions
---------------------------

Q: LD_ABS and LD_IND instructions vs C code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Q: How come LD_ABS and LD_IND instructions are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?

A: This is an artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.

Q: BPF instructions not mapping one-to-one to native CPU instructions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: It seems not all BPF instructions map one-to-one to native CPU
instructions. For example, why are BPF_JNE and other compare-and-jump
instructions not cpu-like?

A: This was necessary to avoid introducing flags into the ISA, which are
impossible to make generic and efficient across CPU architectures.

Q: Why doesn't the BPF_DIV instruction map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because if we picked a one-to-one relationship to x64 it would have made
it more complicated to support on arm64 and other archs. It also
needs a div-by-zero runtime check.

Q: Why is there no BPF_SDIV for signed divide operations?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. llvm errors out in such cases and
prints a suggestion to use unsigned divide instead.

Q: Why does BPF have an implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because architectures like sparc have register windows, and in general
there are enough subtle differences between architectures that a naive
store of the return address onto the stack won't work. Another reason is
that BPF has to be safe from division by zero (and the legacy exception
path of the LD_ABS insn). Those instructions need to invoke the epilogue
and return implicitly.

Q: Why were BPF_JLT and BPF_JLE instructions not introduced in the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because classic BPF didn't have them and the BPF authors felt that a
compiler workaround would be acceptable. It turned out that programs lose
performance due to the lack of these compare instructions, so they were
added. These two instructions are a perfect example of what kind of new
BPF instructions are acceptable and can be added in the future:
they already had equivalent instructions in native CPUs.
New instructions that don't have a one-to-one mapping to HW instructions
will not be accepted.

Q: BPF 32-bit subregister requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: BPF 32-bit subregisters have a requirement to zero the upper 32 bits of
BPF registers, which makes BPF an inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?

A: NO. The first step to improve performance on 32-bit archs is to teach
LLVM to generate code that uses 32-bit subregisters. The second step
is to teach the verifier to mark operations where zeroing the upper bits
is unnecessary. Then JITs can take advantage of those markings and
drastically reduce the size of the generated code and improve performance.

Q: Does BPF have a stable ABI?
------------------------------
A: YES. BPF instructions, arguments to BPF programs, the set of helper
functions and their arguments, and the recognized return codes are all
part of the ABI. However, when tracing programs use the bpf_probe_read()
helper to walk kernel internal data structures and compile with kernel
internal headers, these accesses can and will break with newer
kernels. The union bpf_attr -> kern_version is checked at load time
to prevent accidentally loading kprobe-based bpf programs written
for a different kernel. Networking programs don't do a kern_version check.

Q: How much stack space does a BPF program use?
-----------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used,
and both the interpreter and most JITed code consume only the
necessary amount.

Q: Can BPF be offloaded to HW?
------------------------------
A: YES. BPF HW offload is supported by the NFP driver.

Q: Does the classic BPF interpreter still exist?
------------------------------------------------
A: NO. Classic BPF programs are converted into extended BPF instructions.

Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.

Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
A: NO.

Tracing bpf programs can *read* arbitrary memory with the bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.

Q: Can BPF overwrite arbitrary user memory?
-------------------------------------------
A: Sort-of.

Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such
a program is loaded the kernel will print a warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.

Q: bpf_trace_printk() helper warning
------------------------------------
Q: When the bpf_trace_printk() helper is used, the kernel prints a nasty
warning message. Why is that?

A: This is done to nudge program authors toward better interfaces when
programs need to pass data to user space. For example,
bpf_perf_event_output() can be used to efficiently stream data via the
perf ring buffer, and BPF maps can be used for asynchronous data sharing
between kernel and user space. bpf_trace_printk() should only be used
for debugging.

Q: New functionality via kernel modules?
----------------------------------------
Q: Can BPF functionality such as new program or map types, new
helpers, etc. be added from kernel module code?

A: NO.
Documentation/bpf/bpf_design_QA.txt (-156 lines, removed)
Documentation/bpf/bpf_devel_QA.rst (+640 lines, new file)
··· 1 + ================================= 2 + HOWTO interact with BPF subsystem 3 + ================================= 4 + 5 + This document provides information for the BPF subsystem about various 6 + workflows related to reporting bugs, submitting patches, and queueing 7 + patches for stable kernels. 8 + 9 + For general information about submitting patches, please refer to 10 + `Documentation/process/`_. This document only describes additional specifics 11 + related to BPF. 12 + 13 + .. contents:: 14 + :local: 15 + :depth: 2 16 + 17 + Reporting bugs 18 + ============== 19 + 20 + Q: How do I report bugs for BPF kernel code? 21 + -------------------------------------------- 22 + A: Since all BPF kernel development as well as bpftool and iproute2 BPF 23 + loader development happens through the netdev kernel mailing list, 24 + please report any found issues around BPF to the following mailing 25 + list: 26 + 27 + netdev@vger.kernel.org 28 + 29 + This may also include issues related to XDP, BPF tracing, etc. 30 + 31 + Given netdev has a high volume of traffic, please also add the BPF 32 + maintainers to Cc (from kernel MAINTAINERS_ file): 33 + 34 + * Alexei Starovoitov <ast@kernel.org> 35 + * Daniel Borkmann <daniel@iogearbox.net> 36 + 37 + In case a buggy commit has already been identified, make sure to keep 38 + the actual commit authors in Cc as well for the report. They can 39 + typically be identified through the kernel's git tree. 40 + 41 + **Please do NOT report BPF issues to bugzilla.kernel.org since it 42 + is a guarantee that the reported issue will be overlooked.** 43 + 44 + Submitting patches 45 + ================== 46 + 47 + Q: To which mailing list do I need to submit my BPF patches? 
48 + ------------------------------------------------------------ 49 + A: Please submit your BPF patches to the netdev kernel mailing list: 50 + 51 + netdev@vger.kernel.org 52 + 53 + Historically, BPF came out of networking and has always been maintained 54 + by the kernel networking community. Although these days BPF touches 55 + many other subsystems as well, the patches are still routed mainly 56 + through the networking community. 57 + 58 + In case your patch has changes in various different subsystems (e.g. 59 + tracing, security, etc), make sure to Cc the related kernel mailing 60 + lists and maintainers from there as well, so they are able to review 61 + the changes and provide their Acked-by's to the patches. 62 + 63 + Q: Where can I find patches currently under discussion for BPF subsystem? 64 + ------------------------------------------------------------------------- 65 + A: All patches that are Cc'ed to netdev are queued for review under netdev 66 + patchwork project: 67 + 68 + http://patchwork.ozlabs.org/project/netdev/list/ 69 + 70 + Those patches which target BPF, are assigned to a 'bpf' delegate for 71 + further processing from BPF maintainers. The current queue with 72 + patches under review can be found at: 73 + 74 + https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 75 + 76 + Once the patches have been reviewed by the BPF community as a whole 77 + and approved by the BPF maintainers, their status in patchwork will be 78 + changed to 'Accepted' and the submitter will be notified by mail. This 79 + means that the patches look good from a BPF perspective and have been 80 + applied to one of the two BPF kernel trees. 81 + 82 + In case feedback from the community requires a respin of the patches, 83 + their status in patchwork will be set to 'Changes Requested', and purged 84 + from the current review queue. 
Likewise for cases where patches would 85 + get rejected or are not applicable to the BPF trees (but assigned to 86 + the 'bpf' delegate). 87 + 88 + Q: How do the changes make their way into Linux? 89 + ------------------------------------------------ 90 + A: There are two BPF kernel trees (git repositories). Once patches have 91 + been accepted by the BPF maintainers, they will be applied to one 92 + of the two BPF trees: 93 + 94 + * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/ 95 + * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ 96 + 97 + The bpf tree itself is for fixes only, whereas bpf-next for features, 98 + cleanups or other kind of improvements ("next-like" content). This is 99 + analogous to net and net-next trees for networking. Both bpf and 100 + bpf-next will only have a master branch in order to simplify against 101 + which branch patches should get rebased to. 102 + 103 + Accumulated BPF patches in the bpf tree will regularly get pulled 104 + into the net kernel tree. Likewise, accumulated BPF patches accepted 105 + into the bpf-next tree will make their way into net-next tree. net and 106 + net-next are both run by David S. Miller. From there, they will go 107 + into the kernel mainline tree run by Linus Torvalds. To read up on the 108 + process of net and net-next being merged into the mainline tree, see 109 + the `netdev FAQ`_ under: 110 + 111 + `Documentation/networking/netdev-FAQ.txt`_ 112 + 113 + Occasionally, to prevent merge conflicts, we might send pull requests 114 + to other trees (e.g. tracing) with a small subset of the patches, but 115 + net and net-next are always the main trees targeted for integration. 
116 + 117 + The pull requests will contain a high-level summary of the accumulated 118 + patches and can be searched on netdev kernel mailing list through the 119 + following subject lines (``yyyy-mm-dd`` is the date of the pull 120 + request):: 121 + 122 + pull-request: bpf yyyy-mm-dd 123 + pull-request: bpf-next yyyy-mm-dd 124 + 125 + Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to? 126 + --------------------------------------------------------------------------------- 127 + 128 + A: The process is the very same as described in the `netdev FAQ`_, so 129 + please read up on it. The subject line must indicate whether the 130 + patch is a fix or rather "next-like" content in order to let the 131 + maintainers know whether it is targeted at bpf or bpf-next. 132 + 133 + For fixes eventually landing in bpf -> net tree, the subject must 134 + look like:: 135 + 136 + git format-patch --subject-prefix='PATCH bpf' start..finish 137 + 138 + For features/improvements/etc that should eventually land in 139 + bpf-next -> net-next, the subject must look like:: 140 + 141 + git format-patch --subject-prefix='PATCH bpf-next' start..finish 142 + 143 + If unsure whether the patch or patch series should go into bpf 144 + or net directly, or bpf-next or net-next directly, it is not a 145 + problem either if the subject line says net or net-next as target. 146 + It is eventually up to the maintainers to do the delegation of 147 + the patches. 148 + 149 + If it is clear that patches should go into bpf or bpf-next tree, 150 + please make sure to rebase the patches against those trees in 151 + order to reduce potential conflicts. 152 + 153 + In case the patch or patch series has to be reworked and sent out 154 + again in a second or later revision, it is also required to add a 155 + version number (``v2``, ``v3``, ...) 
into the subject prefix:: 156 + 157 + git format-patch --subject-prefix='PATCH net-next v2' start..finish 158 + 159 + When changes have been requested to the patch series, always send the 160 + whole patch series again with the feedback incorporated (never send 161 + individual diffs on top of the old series). 162 + 163 + Q: What does it mean when a patch gets applied to bpf or bpf-next tree? 164 + ----------------------------------------------------------------------- 165 + A: It means that the patch looks good for mainline inclusion from 166 + a BPF point of view. 167 + 168 + Be aware that this is not a final verdict that the patch will 169 + automatically get accepted into net or net-next trees eventually: 170 + 171 + On the netdev kernel mailing list reviews can come in at any point 172 + in time. If discussions around a patch conclude that they cannot 173 + get included as-is, we will either apply a follow-up fix or drop 174 + them from the trees entirely. Therefore, we also reserve to rebase 175 + the trees when deemed necessary. After all, the purpose of the tree 176 + is to: 177 + 178 + i) accumulate and stage BPF patches for integration into trees 179 + like net and net-next, and 180 + 181 + ii) run extensive BPF test suite and 182 + workloads on the patches before they make their way any further. 183 + 184 + Once the BPF pull request was accepted by David S. Miller, then 185 + the patches end up in net or net-next tree, respectively, and 186 + make their way from there further into mainline. Again, see the 187 + `netdev FAQ`_ for additional information e.g. on how often they are 188 + merged to mainline. 189 + 190 + Q: How long do I need to wait for feedback on my BPF patches? 191 + ------------------------------------------------------------- 192 + A: We try to keep the latency low. The usual time to feedback will 193 + be around 2 or 3 business days. It may vary depending on the 194 + complexity of changes and current patch load. 
Q: How often do you send pull requests to major kernel trees like net or net-next?
----------------------------------------------------------------------------------

A: Pull requests will be sent out rather often in order to not
accumulate too many patches in bpf or bpf-next.

As a rule of thumb, expect pull requests for each tree regularly
at the end of the week. In some cases, pull requests may additionally
be sent in the middle of the week, depending on the current patch
load or urgency.

Q: Are patches applied to bpf-next when the merge window is open?
-----------------------------------------------------------------
A: For the time when the merge window is open, bpf-next will not be
processed. This is roughly analogous to net-next patch processing,
so feel free to read up on the `netdev FAQ`_ for further details.

During those two weeks of the merge window, we might ask you to resend
your patch series once bpf-next is open again. Once Linus has released
a ``v*-rc1`` after the merge window, we continue processing of bpf-next.

For non-subscribers to kernel mailing lists, there is also a status
page run by David S. Miller on net-next that provides guidance:

  http://vger.kernel.org/~davem/net-next.html

Q: Verifier changes and test cases
----------------------------------
Q: I made a BPF verifier change, do I need to add test cases for
BPF kernel selftests_?

A: If the patch changes the behavior of the verifier, then yes, it is
absolutely necessary to add test cases to the BPF kernel selftests_
suite. If they are not present and we think they are needed, then we
might ask for them before accepting any changes.
In particular, test_verifier.c tracks a high number of BPF test
cases, including a lot of corner cases that the LLVM BPF back end may
generate out of the restricted C code. Thus, adding test cases is
absolutely crucial to make sure future changes do not accidentally
affect prior use-cases. In other words, treat those test cases as a
contract: verifier behavior that is not tracked in test_verifier.c
could potentially be subject to change.

Q: samples/bpf preference vs selftests?
---------------------------------------
Q: When should I add code to `samples/bpf/`_ and when to BPF kernel
selftests_ ?

A: In general, we prefer additions to BPF kernel selftests_ rather than
`samples/bpf/`_. The rationale is very simple: kernel selftests are
regularly run by various bots to test for kernel regressions.

The more test cases we add to BPF selftests, the better the coverage
and the less likely it is that those could accidentally break. This is
not to say that BPF kernel selftests cannot also demonstrate how a
specific feature can be used.

That said, `samples/bpf/`_ may be a good place for people to get
started, so simple demos of features can go into `samples/bpf/`_, while
advanced functional and corner-case testing belongs in kernel
selftests.

If your sample looks like a test case, then go for BPF kernel selftests
instead!

Q: When should I add code to the bpftool?
-----------------------------------------
A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide
a central user space tool for debugging and introspection of BPF programs
and maps that are active in the kernel. If UAPI changes related to BPF
enable dumping additional information about programs or maps, then
bpftool should be extended as well to support dumping them.
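For introspection, bpftool provides subcommands for listing and dumping
what is loaded in the kernel; a brief sketch (these commands need a
BPF-enabled kernel and typically root privileges, and the program id
``10`` is only a placeholder for an id reported by ``prog show``):

```shell
# List BPF programs and maps currently loaded in the kernel.
bpftool prog show
bpftool map show

# Dump the translated instructions of one program by its id.
bpftool prog dump xlated id 10
```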
Q: When should I add code to iproute2's BPF loader?
---------------------------------------------------
A: For UAPI changes related to the XDP or tc layer (e.g. ``cls_bpf``),
the convention is that those control-path related changes are added to
iproute2's BPF loader as well on the user space side. This is not only
useful to have UAPI changes properly designed to be usable, but also
to make those changes available to a wider user base of major
downstream distributions.

Q: Do you accept patches as well for iproute2's BPF loader?
-----------------------------------------------------------
A: Patches for iproute2's BPF loader have to be sent to:

  netdev@vger.kernel.org

While those patches are not processed by the BPF kernel maintainers,
please keep them in Cc as well, so they can be reviewed.

The official git repository for iproute2 is run by Stephen Hemminger
and can be found at:

  https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/

The patches need to have a subject prefix of '``[PATCH iproute2
master]``' or '``[PATCH iproute2 net-next]``'. '``master``' or
'``net-next``' describes the target branch where the patch should be
applied. Meaning, if kernel changes went into the net-next kernel
tree, then the related iproute2 changes need to go into the iproute2
net-next branch, otherwise they can be targeted at the master branch.
The iproute2 net-next branch will get merged into the master branch
after the current iproute2 version from master has been released.

Like BPF, the patches end up in patchwork under the netdev project and
are delegated to 'shemminger' for further processing:

  http://patchwork.ozlabs.org/project/netdev/list/?delegate=389

Q: What is the minimum requirement before I submit my BPF patches?
------------------------------------------------------------------
A: When submitting patches, always take the time and properly test your
patches *prior* to submission. Never rush them! If maintainers find
that your patches have not been properly tested, it is a good way to
get them grumpy. Testing patch submissions is a hard requirement!

Note, fixes that go to the bpf tree *must* have a ``Fixes:`` tag
included. The same applies to fixes that target bpf-next, where the
affected commit is in net-next (or in some cases bpf-next). The
``Fixes:`` tag is crucial in order to identify follow-up commits and
tremendously helps people who have to do backporting, so it is a must
have!

We also don't accept patches with an empty commit message. Take your
time and properly write up a high quality commit message, it is
essential!

Think about it this way: other developers looking at your code a month
from now need to understand *why* a certain change has been done that
way, and whether there have been flaws in the analysis or assumptions
that the original author made. Thus providing a proper rationale and
describing the use-case for the changes is a must.

Patch submissions with >1 patch must have a cover letter which includes
a high level description of the series. This high level summary will
then be placed into the merge commit by the BPF maintainers such that
it is also accessible from the git log for future reference.

Q: Features changing BPF JIT and/or LLVM
----------------------------------------
Q: What do I need to consider when adding a new instruction or feature
that would require BPF JIT and/or LLVM integration as well?
A: We try hard to keep all BPF JITs up to date such that the same user
experience can be guaranteed when running BPF programs on different
architectures, without having the program punt to the less efficient
interpreter in case the in-kernel BPF JIT is enabled.

If you are unable to implement or test the required JIT changes for
certain architectures, please work together with the related BPF JIT
developers in order to get the feature implemented in a timely manner.
Please refer to the git log (``arch/*/net/``) to locate the necessary
people for helping out.

Also always make sure to add BPF test cases (e.g. test_bpf.c and
test_verifier.c) for new instructions, so that they can receive
broad test coverage and help with run-time testing of the various BPF
JITs.

In case of new BPF instructions, once the changes have been accepted
into the Linux kernel, please implement support in LLVM's BPF back
end. See the LLVM_ section below for further information.

Stable submission
=================

Q: I need a specific BPF commit in stable kernels. What should I do?
--------------------------------------------------------------------
A: In case you need a specific fix in stable kernels, first check whether
the commit has already been applied in the related ``linux-*.y`` branches:

  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/

If that is not the case, then drop an email to the BPF maintainers with
the netdev kernel mailing list in Cc and ask for the fix to be queued up:

  netdev@vger.kernel.org

The process in general is the same as on netdev itself, see also the
`netdev FAQ`_ document.

Q: Do you also backport to kernels not currently maintained as stable?
----------------------------------------------------------------------
A: No. If you need a specific BPF commit in kernels that are currently
not maintained by the stable maintainers, then you are on your own.

The current stable and longterm stable kernels are all listed here:

  https://www.kernel.org/

Q: The BPF patch I am about to submit needs to go to stable as well
-------------------------------------------------------------------
What should I do?

A: The same rules apply as with netdev patch submissions in general, see
the `netdev FAQ`_ under:

  `Documentation/networking/netdev-FAQ.txt`_

Never add "``Cc: stable@vger.kernel.org``" to the patch description, but
ask the BPF maintainers to queue the patches instead. This can be done
with a note, for example, under the ``---`` part of the patch which does
not go into the git log. Alternatively, this can be done as a simple
request by mail instead.

Q: Queue stable patches
-----------------------
Q: Where do I find currently queued BPF patches that will be submitted
to stable?

A: Once patches that fix critical bugs have been applied to the bpf
tree, they are queued up for stable submission under:

  http://patchwork.ozlabs.org/bundle/bpf/stable/?state=*

They will be on hold there at minimum until the related commit has made
its way into the mainline kernel tree.

After having been under broader exposure, the queued patches will be
submitted by the BPF maintainers to the stable maintainers.
Testing patches
===============

Q: How to run BPF selftests
---------------------------
A: After you have booted into the newly compiled kernel, navigate to
the BPF selftests_ suite in order to test BPF functionality (current
working directory points to the root of the cloned git tree)::

  $ cd tools/testing/selftests/bpf/
  $ make

To run the verifier tests::

  $ sudo ./test_verifier

The verifier tests print out all the current checks being
performed. The summary at the end of running all tests will dump
information on test successes and failures::

  Summary: 418 PASSED, 0 FAILED

In order to run through all BPF selftests, the following command is
needed::

  $ sudo make run_tests

See the kernel's selftest `Documentation/dev-tools/kselftest.rst`_
document for further documentation.

Q: Which BPF kernel selftests version should I run my kernel against?
---------------------------------------------------------------------
A: If you run a kernel ``xyz``, then always run the BPF kernel selftests
from that kernel ``xyz`` as well. Do not expect that the BPF selftests
from the latest mainline tree will pass all the time.

In particular, test_bpf.c and test_verifier.c have a large number of
test cases and are constantly updated with new BPF test sequences, or
existing ones are adapted to verifier changes, e.g. due to the verifier
becoming smarter and being able to better track certain things.

LLVM
====

Q: Where do I find LLVM with BPF support?
-----------------------------------------
A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1.
All major distributions these days ship LLVM with the BPF back end
enabled, so for the majority of use-cases it is not required to compile
LLVM by hand anymore; just install the distribution-provided package.

LLVM's static compiler lists the supported targets through
``llc --version``; make sure BPF targets are listed. Example::

  $ llc --version
  LLVM (http://llvm.org/):
    LLVM version 6.0.0svn
    Optimized build.
    Default target: x86_64-unknown-linux-gnu
    Host CPU: skylake

    Registered Targets:
      bpf    - BPF (host endian)
      bpfeb  - BPF (big endian)
      bpfel  - BPF (little endian)
      x86    - 32-bit X86: Pentium-Pro and above
      x86-64 - 64-bit X86: EM64T and AMD64

Developers who want to utilize the latest features added to LLVM's
BPF back end are advised to run the latest LLVM releases. Support
for new BPF kernel features such as additions to the BPF instruction
set are often developed together.

All LLVM releases can be found at: http://releases.llvm.org/

Q: Got it, so how do I build LLVM manually anyway?
--------------------------------------------------
A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have
that set up, proceed with building the latest LLVM and clang version
from the git repositories::

  $ git clone http://llvm.org/git/llvm.git
  $ cd llvm/tools
  $ git clone --depth 1 http://llvm.org/git/clang.git
  $ cd ..; mkdir build; cd build
  $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
             -DBUILD_SHARED_LIBS=OFF           \
             -DCMAKE_BUILD_TYPE=Release        \
             -DLLVM_BUILD_RUNTIME=OFF
  $ make -j $(getconf _NPROCESSORS_ONLN)

The built binaries can then be found in the build/bin/ directory, to
which you can point your PATH variable.
Q: Reporting LLVM BPF issues
----------------------------
Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code
generation back end or about LLVM generated code that the verifier
refuses to accept?

A: Yes, please do!

LLVM's BPF back end is a key piece of the whole BPF
infrastructure and it ties deeply into verification of programs from the
kernel side. Therefore, any issues on either side need to be investigated
and fixed whenever necessary.

Therefore, please make sure to bring them up at the netdev kernel
mailing list and Cc BPF maintainers for LLVM and kernel bits:

* Yonghong Song <yhs@fb.com>
* Alexei Starovoitov <ast@kernel.org>
* Daniel Borkmann <daniel@iogearbox.net>

LLVM also has an issue tracker where BPF related bugs can be found:

  https://bugs.llvm.org/buglist.cgi?quicksearch=bpf

However, it is better to reach out through mailing lists with the
maintainers in Cc.

Q: New BPF instruction for kernel and LLVM
------------------------------------------
Q: I have added a new BPF instruction to the kernel, how can I integrate
it into LLVM?

A: LLVM has a ``-mcpu`` selector for the BPF back end in order to allow
the selection of BPF instruction set extensions. By default the
``generic`` processor target is used, which is the base instruction set
(v1) of BPF.

LLVM has an option to select ``-mcpu=probe`` where it will probe the host
kernel for supported BPF instruction set extensions and selects the
optimal set automatically.

For cross-compilation, a specific version can be selected manually as
well::

  $ llc -march bpf -mcpu=help
  Available CPUs for this target:

    generic - Select the generic processor.
    probe   - Select the probe processor.
    v1      - Select the v1 processor.
    v2      - Select the v2 processor.
  [...]

Newly added BPF instructions to the Linux kernel need to follow the same
scheme: bump the instruction set version and implement probing for the
extensions such that ``-mcpu=probe`` users can benefit from the
optimization transparently when upgrading their kernels.

If you are unable to implement support for the newly added BPF
instruction, please reach out to BPF developers for help.

By the way, the BPF kernel selftests run with ``-mcpu=probe`` for better
test coverage.

Q: clang flag for target bpf?
-----------------------------
Q: In some cases clang flag ``-target bpf`` is used but in other cases the
default clang target, which matches the underlying architecture, is used.
What is the difference and when should I use which?

A: Although LLVM IR generation and optimization try to stay architecture
independent, ``-target <arch>`` still has some impact on generated code:

- A BPF program may recursively include header file(s) with file scope
  inline assembly codes. The default target can handle this well,
  while the ``bpf`` target may fail if the bpf back end assembler does
  not understand these assembly codes, which is true in most cases.

- When compiled without ``-g``, additional elf sections, e.g.,
  .eh_frame and .rela.eh_frame, may be present in the object file
  with the default target, but not with the ``bpf`` target.

- The default target may turn a C switch statement into a switch table
  lookup and jump operation. Since the switch table is placed
  in the global read-only section, the bpf program will fail to load.
  The bpf target does not support switch table optimization.
  The clang option ``-fno-jump-tables`` can be used to disable
  switch table generation.
- For clang ``-target bpf``, it is guaranteed that pointer or long /
  unsigned long types will always have a width of 64 bit, no matter
  whether the underlying clang binary or default target (or kernel) is
  32 bit. However, when the native clang target is used, it will
  compile these types based on the underlying architecture's
  conventions, meaning that in case of a 32 bit architecture, pointer
  or long / unsigned long types, e.g. in the BPF context structure,
  will have a width of 32 bit while the BPF LLVM back end still
  operates in 64 bit. The native target is mostly needed in tracing
  for the case of walking ``pt_regs`` or other kernel structures where
  the CPU's register width matters. Otherwise, ``clang -target bpf``
  is generally recommended.

You should use the default target when:

- Your program includes a header file, e.g., ptrace.h, which eventually
  pulls in some header files containing file scope host assembly codes.

- You can add ``-fno-jump-tables`` to work around the switch table issue.

Otherwise, you can use the ``bpf`` target. Additionally, you *must* use
the bpf target when:

- Your program uses data structures with pointer or long / unsigned long
  types that interface with BPF helpers or context data structures. Access
  into these structures is verified by the BPF verifier and may result
  in verification failures if the native architecture is not aligned with
  the BPF architecture, e.g. 64-bit. An example of this is
  BPF_PROG_TYPE_SK_MSG, which requires ``-target bpf``.


.. Links
.. _Documentation/process/: https://www.kernel.org/doc/html/latest/process/
.. _MAINTAINERS: ../../MAINTAINERS
.. _Documentation/networking/netdev-FAQ.txt: ../networking/netdev-FAQ.txt
.. _netdev FAQ: ../networking/netdev-FAQ.txt
.. _samples/bpf/: ../../samples/bpf/
.. _selftests: ../../tools/testing/selftests/bpf/
.. _Documentation/dev-tools/kselftest.rst:
   https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html

Happy BPF hacking!
··· 1 - This document provides information for the BPF subsystem about various 2 - workflows related to reporting bugs, submitting patches, and queueing 3 - patches for stable kernels. 4 - 5 - For general information about submitting patches, please refer to 6 - Documentation/process/. This document only describes additional specifics 7 - related to BPF. 8 - 9 - Reporting bugs: 10 - --------------- 11 - 12 - Q: How do I report bugs for BPF kernel code? 13 - 14 - A: Since all BPF kernel development as well as bpftool and iproute2 BPF 15 - loader development happens through the netdev kernel mailing list, 16 - please report any found issues around BPF to the following mailing 17 - list: 18 - 19 - netdev@vger.kernel.org 20 - 21 - This may also include issues related to XDP, BPF tracing, etc. 22 - 23 - Given netdev has a high volume of traffic, please also add the BPF 24 - maintainers to Cc (from kernel MAINTAINERS file): 25 - 26 - Alexei Starovoitov <ast@kernel.org> 27 - Daniel Borkmann <daniel@iogearbox.net> 28 - 29 - In case a buggy commit has already been identified, make sure to keep 30 - the actual commit authors in Cc as well for the report. They can 31 - typically be identified through the kernel's git tree. 32 - 33 - Please do *not* report BPF issues to bugzilla.kernel.org since it 34 - is a guarantee that the reported issue will be overlooked. 35 - 36 - Submitting patches: 37 - ------------------- 38 - 39 - Q: To which mailing list do I need to submit my BPF patches? 40 - 41 - A: Please submit your BPF patches to the netdev kernel mailing list: 42 - 43 - netdev@vger.kernel.org 44 - 45 - Historically, BPF came out of networking and has always been maintained 46 - by the kernel networking community. Although these days BPF touches 47 - many other subsystems as well, the patches are still routed mainly 48 - through the networking community. 49 - 50 - In case your patch has changes in various different subsystems (e.g. 
51 - tracing, security, etc), make sure to Cc the related kernel mailing 52 - lists and maintainers from there as well, so they are able to review 53 - the changes and provide their Acked-by's to the patches. 54 - 55 - Q: Where can I find patches currently under discussion for BPF subsystem? 56 - 57 - A: All patches that are Cc'ed to netdev are queued for review under netdev 58 - patchwork project: 59 - 60 - http://patchwork.ozlabs.org/project/netdev/list/ 61 - 62 - Those patches which target BPF, are assigned to a 'bpf' delegate for 63 - further processing from BPF maintainers. The current queue with 64 - patches under review can be found at: 65 - 66 - https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 67 - 68 - Once the patches have been reviewed by the BPF community as a whole 69 - and approved by the BPF maintainers, their status in patchwork will be 70 - changed to 'Accepted' and the submitter will be notified by mail. This 71 - means that the patches look good from a BPF perspective and have been 72 - applied to one of the two BPF kernel trees. 73 - 74 - In case feedback from the community requires a respin of the patches, 75 - their status in patchwork will be set to 'Changes Requested', and purged 76 - from the current review queue. Likewise for cases where patches would 77 - get rejected or are not applicable to the BPF trees (but assigned to 78 - the 'bpf' delegate). 79 - 80 - Q: How do the changes make their way into Linux? 81 - 82 - A: There are two BPF kernel trees (git repositories). Once patches have 83 - been accepted by the BPF maintainers, they will be applied to one 84 - of the two BPF trees: 85 - 86 - https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/ 87 - https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ 88 - 89 - The bpf tree itself is for fixes only, whereas bpf-next for features, 90 - cleanups or other kind of improvements ("next-like" content). 
This is 91 - analogous to net and net-next trees for networking. Both bpf and 92 - bpf-next will only have a master branch in order to simplify against 93 - which branch patches should get rebased to. 94 - 95 - Accumulated BPF patches in the bpf tree will regularly get pulled 96 - into the net kernel tree. Likewise, accumulated BPF patches accepted 97 - into the bpf-next tree will make their way into net-next tree. net and 98 - net-next are both run by David S. Miller. From there, they will go 99 - into the kernel mainline tree run by Linus Torvalds. To read up on the 100 - process of net and net-next being merged into the mainline tree, see 101 - the netdev FAQ under: 102 - 103 - Documentation/networking/netdev-FAQ.txt 104 - 105 - Occasionally, to prevent merge conflicts, we might send pull requests 106 - to other trees (e.g. tracing) with a small subset of the patches, but 107 - net and net-next are always the main trees targeted for integration. 108 - 109 - The pull requests will contain a high-level summary of the accumulated 110 - patches and can be searched on netdev kernel mailing list through the 111 - following subject lines (yyyy-mm-dd is the date of the pull request): 112 - 113 - pull-request: bpf yyyy-mm-dd 114 - pull-request: bpf-next yyyy-mm-dd 115 - 116 - Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be 117 - applied to? 118 - 119 - A: The process is the very same as described in the netdev FAQ, so 120 - please read up on it. The subject line must indicate whether the 121 - patch is a fix or rather "next-like" content in order to let the 122 - maintainers know whether it is targeted at bpf or bpf-next. 
123 - 124 - For fixes eventually landing in bpf -> net tree, the subject must 125 - look like: 126 - 127 - git format-patch --subject-prefix='PATCH bpf' start..finish 128 - 129 - For features/improvements/etc that should eventually land in 130 - bpf-next -> net-next, the subject must look like: 131 - 132 - git format-patch --subject-prefix='PATCH bpf-next' start..finish 133 - 134 - If unsure whether the patch or patch series should go into bpf 135 - or net directly, or bpf-next or net-next directly, it is not a 136 - problem either if the subject line says net or net-next as target. 137 - It is eventually up to the maintainers to do the delegation of 138 - the patches. 139 - 140 - If it is clear that patches should go into bpf or bpf-next tree, 141 - please make sure to rebase the patches against those trees in 142 - order to reduce potential conflicts. 143 - 144 - In case the patch or patch series has to be reworked and sent out 145 - again in a second or later revision, it is also required to add a 146 - version number (v2, v3, ...) into the subject prefix: 147 - 148 - git format-patch --subject-prefix='PATCH net-next v2' start..finish 149 - 150 - When changes have been requested to the patch series, always send the 151 - whole patch series again with the feedback incorporated (never send 152 - individual diffs on top of the old series). 153 - 154 - Q: What does it mean when a patch gets applied to bpf or bpf-next tree? 155 - 156 - A: It means that the patch looks good for mainline inclusion from 157 - a BPF point of view. 158 - 159 - Be aware that this is not a final verdict that the patch will 160 - automatically get accepted into net or net-next trees eventually: 161 - 162 - On the netdev kernel mailing list reviews can come in at any point 163 - in time. If discussions around a patch conclude that they cannot 164 - get included as-is, we will either apply a follow-up fix or drop 165 - them from the trees entirely. 
Therefore, we also reserve to rebase 166 - the trees when deemed necessary. After all, the purpose of the tree 167 - is to i) accumulate and stage BPF patches for integration into trees 168 - like net and net-next, and ii) run extensive BPF test suite and 169 - workloads on the patches before they make their way any further. 170 - 171 - Once the BPF pull request was accepted by David S. Miller, then 172 - the patches end up in net or net-next tree, respectively, and 173 - make their way from there further into mainline. Again, see the 174 - netdev FAQ for additional information e.g. on how often they are 175 - merged to mainline. 176 - 177 - Q: How long do I need to wait for feedback on my BPF patches? 178 - 179 - A: We try to keep the latency low. The usual time to feedback will 180 - be around 2 or 3 business days. It may vary depending on the 181 - complexity of changes and current patch load. 182 - 183 - Q: How often do you send pull requests to major kernel trees like 184 - net or net-next? 185 - 186 - A: Pull requests will be sent out rather often in order to not 187 - accumulate too many patches in bpf or bpf-next. 188 - 189 - As a rule of thumb, expect pull requests for each tree regularly 190 - at the end of the week. In some cases pull requests could additionally 191 - come also in the middle of the week depending on the current patch 192 - load or urgency. 193 - 194 - Q: Are patches applied to bpf-next when the merge window is open? 195 - 196 - A: For the time when the merge window is open, bpf-next will not be 197 - processed. This is roughly analogous to net-next patch processing, 198 - so feel free to read up on the netdev FAQ about further details. 199 - 200 - During those two weeks of merge window, we might ask you to resend 201 - your patch series once bpf-next is open again. Once Linus released 202 - a v*-rc1 after the merge window, we continue processing of bpf-next. 
203 - 204 - For non-subscribers to kernel mailing lists, there is also a status 205 - page run by David S. Miller on net-next that provides guidance: 206 - 207 - http://vger.kernel.org/~davem/net-next.html 208 - 209 - Q: I made a BPF verifier change, do I need to add test cases for 210 - BPF kernel selftests? 211 - 212 - A: If the patch has changes to the behavior of the verifier, then yes, 213 - it is absolutely necessary to add test cases to the BPF kernel 214 - selftests suite. If they are not present and we think they are 215 - needed, then we might ask for them before accepting any changes. 216 - 217 - In particular, test_verifier.c is tracking a high number of BPF test 218 - cases, including a lot of corner cases that LLVM BPF back end may 219 - generate out of the restricted C code. Thus, adding test cases is 220 - absolutely crucial to make sure future changes do not accidentally 221 - affect prior use-cases. Thus, treat those test cases as: verifier 222 - behavior that is not tracked in test_verifier.c could potentially 223 - be subject to change. 224 - 225 - Q: When should I add code to samples/bpf/ and when to BPF kernel 226 - selftests? 227 - 228 - A: In general, we prefer additions to BPF kernel selftests rather than 229 - samples/bpf/. The rationale is very simple: kernel selftests are 230 - regularly run by various bots to test for kernel regressions. 231 - 232 - The more test cases we add to BPF selftests, the better the coverage 233 - and the less likely it is that those could accidentally break. It is 234 - not that BPF kernel selftests cannot demo how a specific feature can 235 - be used. 236 - 237 - That said, samples/bpf/ may be a good place for people to get started, 238 - so it might be advisable that simple demos of features could go into 239 - samples/bpf/, but advanced functional and corner-case testing rather 240 - into kernel selftests. 241 - 242 - If your sample looks like a test case, then go for BPF kernel selftests 243 - instead! 
Q: When should I add code to bpftool?

A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide
a central user space tool for debugging and introspection of BPF
programs and maps that are active in the kernel. If UAPI changes
related to BPF allow dumping additional information about programs or
maps, then bpftool should be extended as well to support dumping them.

Q: When should I add code to iproute2's BPF loader?

A: For UAPI changes related to the XDP or tc layer (e.g. cls_bpf), the
convention is that those control-path related changes are added to
iproute2's BPF loader as well on the user space side. This is not only
useful to have UAPI changes properly designed to be usable, but also to
make those changes available to a wider user base of major downstream
distributions.

Q: Do you accept patches as well for iproute2's BPF loader?

A: Patches for iproute2's BPF loader have to be sent to:

   netdev@vger.kernel.org

While those patches are not processed by the BPF kernel maintainers,
please keep them in Cc as well so they can be reviewed.

The official git repository for iproute2 is run by Stephen Hemminger
and can be found at:

   https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/

The patches need to have a subject prefix of '[PATCH iproute2 master]'
or '[PATCH iproute2 net-next]'. 'master' or 'net-next' describes the
target branch where the patch should be applied. Meaning, if kernel
changes went into the net-next kernel tree, then the related iproute2
changes need to go into the iproute2 net-next branch; otherwise they
can be targeted at the master branch. The iproute2 net-next branch will
get merged into the master branch after the current iproute2 version
from master has been released.
Like BPF, the patches end up in patchwork under the netdev project and
are delegated to 'shemminger' for further processing:

   http://patchwork.ozlabs.org/project/netdev/list/?delegate=389

Q: What is the minimum requirement before I submit my BPF patches?

A: When submitting patches, always take the time to properly test your
patches *prior* to submission. Never rush them! If maintainers find
that your patches have not been properly tested, it is a good way to
get them grumpy. Testing patch submissions is a hard requirement!

Note, fixes that go to the bpf tree *must* have a Fixes: tag included.
The same applies to fixes that target bpf-next, where the affected
commit is in net-next (or in some cases bpf-next). The Fixes: tag is
crucial in order to identify follow-up commits and tremendously helps
people who have to do backporting, so it is a must-have!

We also don't accept patches with an empty commit message. Take your
time and properly write up a high quality commit message, it is
essential!

Think about it this way: other developers looking at your code a month
from now need to understand *why* a certain change has been done that
way, and whether there have been flaws in the analysis or assumptions
that the original author made. Thus, providing a proper rationale and
describing the use-case for the changes is a must.

Patch submissions with more than one patch must have a cover letter
which includes a high level description of the series. This high level
summary will then be placed into the merge commit by the BPF
maintainers such that it is also accessible from the git log for future
reference.

Q: What do I need to consider when adding a new instruction or feature
that would require BPF JIT and/or LLVM integration as well?
A: We try hard to keep all BPF JITs up to date such that the same user
experience can be guaranteed when running BPF programs on different
architectures, without having the program punt to the less efficient
interpreter in case the in-kernel BPF JIT is enabled.

If you are unable to implement or test the required JIT changes for
certain architectures, please work together with the related BPF JIT
developers in order to get the feature implemented in a timely manner.
Please refer to the git log (arch/*/net/) to locate the necessary
people for helping out.

Also always make sure to add BPF test cases (e.g. test_bpf.c and
test_verifier.c) for new instructions, so that they can receive broad
test coverage and help run-time testing of the various BPF JITs.

In case of new BPF instructions, once the changes have been accepted
into the Linux kernel, please implement support in LLVM's BPF back end.
See the LLVM section below for further information.

Stable submission:
------------------

Q: I need a specific BPF commit in stable kernels. What should I do?

A: In case you need a specific fix in stable kernels, first check
whether the commit has already been applied in the related linux-*.y
branches:

   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/

If that is not the case, then drop an email to the BPF maintainers with
the netdev kernel mailing list in Cc and ask for the fix to be queued
up:

   netdev@vger.kernel.org

The process in general is the same as on netdev itself, see also the
netdev FAQ document.

Q: Do you also backport to kernels not currently maintained as stable?

A: No. If you need a specific BPF commit in kernels that are currently
not maintained by the stable maintainers, then you are on your own.
The current stable and longterm stable kernels are all listed here:

   https://www.kernel.org/

Q: The BPF patch I am about to submit needs to go to stable as well.
What should I do?

A: The same rules apply as with netdev patch submissions in general,
see the netdev FAQ under:

   Documentation/networking/netdev-FAQ.txt

Never add "Cc: stable@vger.kernel.org" to the patch description, but
ask the BPF maintainers to queue the patches instead. This can be done
with a note, for example, under the "---" part of the patch, which does
not go into the git log. Alternatively, this can be done as a simple
request by mail instead.

Q: Where do I find currently queued BPF patches that will be submitted
to stable?

A: Once patches that fix critical bugs have been applied to the bpf
tree, they are queued up for stable submission under:

   http://patchwork.ozlabs.org/bundle/bpf/stable/?state=*

They will be on hold there at minimum until the related commit has made
its way into the mainline kernel tree.

After having been under broader exposure, the queued patches will be
submitted by the BPF maintainers to the stable maintainers.

Testing patches:
----------------

Q: Which BPF kernel selftests version should I run my kernel against?

A: If you run a kernel xyz, then always run the BPF kernel selftests
from that kernel xyz as well. Do not expect that the BPF selftests from
the latest mainline tree will pass all the time.

In particular, test_bpf.c and test_verifier.c have a large number of
test cases and are constantly updated with new BPF test sequences, or
existing ones are adapted to verifier changes, e.g. due to the verifier
becoming smarter and being able to better track certain things.
LLVM:
-----

Q: Where do I find LLVM with BPF support?

A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1.

All major distributions these days ship LLVM with the BPF back end
enabled, so for the majority of use-cases it is not required to compile
LLVM by hand anymore; just install the distribution-provided package.

LLVM's static compiler lists the supported targets through
'llc --version', make sure BPF targets are listed. Example:

   $ llc --version
   LLVM (http://llvm.org/):
     LLVM version 6.0.0svn
     Optimized build.
     Default target: x86_64-unknown-linux-gnu
     Host CPU: skylake

     Registered Targets:
       bpf    - BPF (host endian)
       bpfeb  - BPF (big endian)
       bpfel  - BPF (little endian)
       x86    - 32-bit X86: Pentium-Pro and above
       x86-64 - 64-bit X86: EM64T and AMD64

Developers who want to utilize the latest features added to LLVM's BPF
back end are advised to run the latest LLVM releases. Support for new
BPF kernel features such as additions to the BPF instruction set is
often developed together.

All LLVM releases can be found at: http://releases.llvm.org/

Q: Got it, so how do I build LLVM manually anyway?

A: You need cmake and gcc-c++ as build requisites for LLVM. Once you
have that set up, proceed with building the latest LLVM and clang
version from the git repositories:

   $ git clone http://llvm.org/git/llvm.git
   $ cd llvm/tools
   $ git clone --depth 1 http://llvm.org/git/clang.git
   $ cd ..; mkdir build; cd build
   $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
              -DBUILD_SHARED_LIBS=OFF           \
              -DCMAKE_BUILD_TYPE=Release        \
              -DLLVM_BUILD_RUNTIME=OFF
   $ make -j $(getconf _NPROCESSORS_ONLN)

The built binaries can then be found in the build/bin/ directory, to
which you can point your PATH variable.

Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF
code generation back end or about LLVM generated code that the verifier
refuses to accept?

A: Yes, please do! LLVM's BPF back end is a key piece of the whole BPF
infrastructure and it ties deeply into the verification of programs on
the kernel side. Therefore, any issues on either side need to be
investigated and fixed whenever necessary.

Please make sure to bring them up at the netdev kernel mailing list and
Cc the BPF maintainers for the LLVM and kernel bits:

   Yonghong Song <yhs@fb.com>
   Alexei Starovoitov <ast@kernel.org>
   Daniel Borkmann <daniel@iogearbox.net>

LLVM also has an issue tracker where BPF related bugs can be found:

   https://bugs.llvm.org/buglist.cgi?quicksearch=bpf

However, it is better to reach out through mailing lists with the
maintainers in Cc.

Q: I have added a new BPF instruction to the kernel, how can I
integrate it into LLVM?

A: LLVM has a -mcpu selector for the BPF back end in order to allow the
selection of BPF instruction set extensions. By default the 'generic'
processor target is used, which is the base instruction set (v1) of
BPF.

LLVM has an option to select -mcpu=probe, where it will probe the host
kernel for supported BPF instruction set extensions and select the
optimal set automatically.

For cross-compilation, a specific version can be selected manually as
well:
   $ llc -march bpf -mcpu=help
   Available CPUs for this target:

     generic - Select the generic processor.
     probe   - Select the probe processor.
     v1      - Select the v1 processor.
     v2      - Select the v2 processor.
   [...]

Newly added BPF instructions to the Linux kernel need to follow the
same scheme: bump the instruction set version and implement probing for
the extensions such that -mcpu=probe users can benefit from the
optimization transparently when upgrading their kernels.

If you are unable to implement support for the newly added BPF
instruction, please reach out to the BPF developers for help.

By the way, the BPF kernel selftests run with -mcpu=probe for better
test coverage.

Q: In some cases the clang flag "-target bpf" is used, but in other
cases the default clang target, which matches the underlying
architecture, is used. What is the difference and when should I use
which?

A: Although LLVM IR generation and optimization try to stay
architecture independent, "-target <arch>" still has some impact on the
generated code:

- A BPF program may recursively include header file(s) with file scope
  inline assembly codes. The default target can handle this well, while
  the bpf target may fail if the bpf back end assembler does not
  understand these assembly codes, which is true in most cases.

- When compiled without -g, additional elf sections, e.g. .eh_frame and
  .rela.eh_frame, may be present in the object file with the default
  target, but not with the bpf target.

- The default target may turn a C switch statement into a switch table
  lookup and jump operation. Since the switch table is placed in the
  global read-only section, the BPF program will fail to load. The bpf
  target does not support switch table optimization. The clang option
  "-fno-jump-tables" can be used to disable switch table generation.

- For clang -target bpf, it is guaranteed that pointer or long /
  unsigned long types will always have a width of 64 bit, no matter
  whether the underlying clang binary or the default target (or kernel)
  is 32 bit. However, when the native clang target is used, it will
  compile these types based on the underlying architecture's
  conventions, meaning that in case of a 32 bit architecture, pointer
  or long / unsigned long types e.g. in the BPF context structure will
  have a width of 32 bit while the BPF LLVM back end still operates in
  64 bit. The native target is mostly needed in tracing for the case of
  walking pt_regs or other kernel structures where the CPU's register
  width matters. Otherwise, clang -target bpf is generally recommended.

You should use the default target when:

- Your program includes a header file, e.g. ptrace.h, which eventually
  pulls in some header files containing file scope host assembly codes.
- You can add "-fno-jump-tables" to work around the switch table issue.

Otherwise, you can use the bpf target. Additionally, you _must_ use the
bpf target when:

- Your program uses data structures with pointer or long / unsigned
  long types that interface with BPF helpers or context data
  structures. Access into these structures is verified by the BPF
  verifier and may result in verification failures if the native
  architecture is not aligned with the BPF architecture, e.g. 64-bit.
  An example of this is BPF_PROG_TYPE_SK_MSG, which requires
  '-target bpf'.

Happy BPF hacking!
Documentation/networking/filter.txt (+9 -6):

 the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
 then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
 0x1ff), because of potential carries.
+
 Besides arithmetic, the register state can also be updated by conditional
 branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
 it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
[...]
 from the signed and unsigned bounds can be combined; for instance if a value is
 first tested < 8 and then tested s> 4, the verifier will conclude that the value
 is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
+
 PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
 pointers sharing that same variable offset. This is important for packet range
-checks: after adding some variable to a packet pointer, if you then copy it to
-another register and (say) add a constant 4, both registers will share the same
-'id' but one will have a fixed offset of +4. Then if it is bounds-checked and
-found to be less than a PTR_TO_PACKET_END, the other register is now known to
-have a safe range of at least 4 bytes. See 'Direct packet access', below, for
-more on PTR_TO_PACKET ranges.
+checks: after adding a variable to a packet pointer register A, if you then copy
+it to another register B and then add a constant 4 to A, both registers will
+share the same 'id' but the A will have a fixed offset of +4. Then if A is
+bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
+now known to have a safe range of at least 4 bytes. See 'Direct packet access',
+below, for more on PTR_TO_PACKET ranges.
+
 The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
 the pointer returned from a map lookup. This means that when one copy is
 checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
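The tnum example quoted in the documentation hunk above can be worked
through in shell arithmetic. This is a simplified userspace sketch
mirroring the kernel's tnum_or()/tnum_add() (kernel/bpf/tnum.c), not
the kernel code itself:

```shell
# A tnum is a pair (value, mask): mask bits are unknown, the rest
# equal value. Start with "low 8 bits unknown": (0x0; 0xff).
tnum_print() { printf '(0x%x; 0x%x)\n' "$1" "$2"; }
v=$((0x0)); m=$((0xff))

# OR with the constant 0x40 (a constant is a tnum with mask 0):
# known-one bits become known ones, so the 0x40 bit leaves the mask.
ov=$(( v | 0x40 ))
om=$(( m & ~ov ))
tnum_print "$ov" "$om"

# ADD the constant 1: carries may propagate through unknown bits,
# so the result can spill into bit 8.
sv=$(( ov + 1 ))
sigma=$(( om + sv ))
chi=$(( sigma ^ sv ))
mu=$(( chi | om ))
tnum_print $(( sv & ~mu )) "$mu"
```

The two printed tnums match the values in the text: (0x40; 0xbf) after
the OR, and (0x0; 0x1ff) after the add.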
arch/arm/net/bpf_jit_32.c (+3 -10):

 #define SCRATCH_SIZE 80
 
 /* total stack size used in JITed code */
-#define _STACK_SIZE \
-	(ctx->prog->aux->stack_depth + \
-	 + SCRATCH_SIZE + \
-	 + 4 /* extra for skb_copy_bits buffer */)
-
-#define STACK_SIZE ALIGN(_STACK_SIZE, STACK_ALIGNMENT)
+#define _STACK_SIZE	(ctx->prog->aux->stack_depth + SCRATCH_SIZE)
+#define STACK_SIZE	ALIGN(_STACK_SIZE, STACK_ALIGNMENT)
 
 /* Get the offset of eBPF REGISTERs stored on scratch space. */
-#define STACK_VAR(off) (STACK_SIZE-off-4)
-
-/* Offset of skb_copy_bits buffer */
-#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
+#define STACK_VAR(off) (STACK_SIZE - off)
 
 #if __LINUX_ARM_ARCH__ < 7
arch/arm64/net/bpf_jit_comp.c (+70 -47):

 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/printk.h>
-#include <linux/skbuff.h>
 #include <linux/slab.h>
 
 #include <asm/byteorder.h>
[...]
 	ctx->idx++;
 }
 
-static inline void emit_a64_mov_i64(const int reg, const u64 val,
-				    struct jit_ctx *ctx)
-{
-	u64 tmp = val;
-	int shift = 0;
-
-	emit(A64_MOVZ(1, reg, tmp & 0xffff, shift), ctx);
-	tmp >>= 16;
-	shift += 16;
-	while (tmp) {
-		if (tmp & 0xffff)
-			emit(A64_MOVK(1, reg, tmp & 0xffff, shift), ctx);
-		tmp >>= 16;
-		shift += 16;
-	}
-}
-
-static inline void emit_addr_mov_i64(const int reg, const u64 val,
-				     struct jit_ctx *ctx)
-{
-	u64 tmp = val;
-	int shift = 0;
-
-	emit(A64_MOVZ(1, reg, tmp & 0xffff, shift), ctx);
-	for (;shift < 48;) {
-		tmp >>= 16;
-		shift += 16;
-		emit(A64_MOVK(1, reg, tmp & 0xffff, shift), ctx);
-	}
-}
-
 static inline void emit_a64_mov_i(const int is64, const int reg,
 				  const s32 val, struct jit_ctx *ctx)
 {
[...]
 			emit(A64_MOVN(is64, reg, (u16)~lo, 0), ctx);
 		} else {
 			emit(A64_MOVN(is64, reg, (u16)~hi, 16), ctx);
-			emit(A64_MOVK(is64, reg, lo, 0), ctx);
+			if (lo != 0xffff)
+				emit(A64_MOVK(is64, reg, lo, 0), ctx);
 		}
 	} else {
 		emit(A64_MOVZ(is64, reg, lo, 0), ctx);
 		if (hi)
 			emit(A64_MOVK(is64, reg, hi, 16), ctx);
 	}
 }
+
+static int i64_i16_blocks(const u64 val, bool inverse)
+{
+	return (((val >>  0) & 0xffff) != (inverse ? 0xffff : 0x0000)) +
+	       (((val >> 16) & 0xffff) != (inverse ? 0xffff : 0x0000)) +
+	       (((val >> 32) & 0xffff) != (inverse ? 0xffff : 0x0000)) +
+	       (((val >> 48) & 0xffff) != (inverse ? 0xffff : 0x0000));
+}
+
+static inline void emit_a64_mov_i64(const int reg, const u64 val,
+				    struct jit_ctx *ctx)
+{
+	u64 nrm_tmp = val, rev_tmp = ~val;
+	bool inverse;
+	int shift;
+
+	if (!(nrm_tmp >> 32))
+		return emit_a64_mov_i(0, reg, (u32)val, ctx);
+
+	inverse = i64_i16_blocks(nrm_tmp, true) < i64_i16_blocks(nrm_tmp, false);
+	shift = max(round_down((inverse ? (fls64(rev_tmp) - 1) :
+					  (fls64(nrm_tmp) - 1)), 16), 0);
+	if (inverse)
+		emit(A64_MOVN(1, reg, (rev_tmp >> shift) & 0xffff, shift), ctx);
+	else
+		emit(A64_MOVZ(1, reg, (nrm_tmp >> shift) & 0xffff, shift), ctx);
+	shift -= 16;
+	while (shift >= 0) {
+		if (((nrm_tmp >> shift) & 0xffff) != (inverse ? 0xffff : 0x0000))
+			emit(A64_MOVK(1, reg, (nrm_tmp >> shift) & 0xffff, shift), ctx);
+		shift -= 16;
+	}
+}
+
+/*
+ * This is an unoptimized 64 immediate emission used for BPF to BPF call
+ * addresses. It will always do a full 64 bit decomposition as otherwise
+ * more complexity in the last extra pass is required since we previously
+ * reserved 4 instructions for the address.
+ */
+static inline void emit_addr_mov_i64(const int reg, const u64 val,
+				     struct jit_ctx *ctx)
+{
+	u64 tmp = val;
+	int shift = 0;
+
+	emit(A64_MOVZ(1, reg, tmp & 0xffff, shift), ctx);
+	for (;shift < 48;) {
+		tmp >>= 16;
+		shift += 16;
+		emit(A64_MOVK(1, reg, tmp & 0xffff, shift), ctx);
+	}
+}
[...]
 /* Tail call offset to jump into */
 #define PROLOGUE_OFFSET 7
 
-static int build_prologue(struct jit_ctx *ctx)
+static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
 	const struct bpf_prog *prog = ctx->prog;
 	const u8 r6 = bpf2a64[BPF_REG_6];
[...]
 	 * | ... | BPF prog stack
 	 * |     |
 	 * +-----+ <= (BPF_FP - prog->aux->stack_depth)
-	 * |RSVD | JIT scratchpad
+	 * |RSVD | padding
 	 * current A64_SP =>  +-----+ <= (BPF_FP - ctx->stack_size)
 	 * |     |
 	 * | ... | Function call stack
[...]
 	/* Set up BPF prog stack base register */
 	emit(A64_MOV(1, fp, A64_SP), ctx);
 
-	/* Initialize tail_call_cnt */
-	emit(A64_MOVZ(1, tcc, 0, 0), ctx);
+	if (!ebpf_from_cbpf) {
+		/* Initialize tail_call_cnt */
+		emit(A64_MOVZ(1, tcc, 0, 0), ctx);
 
-	cur_offset = ctx->idx - idx0;
-	if (cur_offset != PROLOGUE_OFFSET) {
-		pr_err_once("PROLOGUE_OFFSET = %d, expected %d!\n",
-			    cur_offset, PROLOGUE_OFFSET);
-		return -1;
+		cur_offset = ctx->idx - idx0;
+		if (cur_offset != PROLOGUE_OFFSET) {
+			pr_err_once("PROLOGUE_OFFSET = %d, expected %d!\n",
+				    cur_offset, PROLOGUE_OFFSET);
+			return -1;
+		}
 	}
 
-	/* 4 byte extra for skb_copy_bits buffer */
-	ctx->stack_size = prog->aux->stack_depth + 4;
-	ctx->stack_size = STACK_ALIGN(ctx->stack_size);
+	ctx->stack_size = STACK_ALIGN(prog->aux->stack_depth);
 
 	/* Set up function call stack */
 	emit(A64_SUB_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
[...]
 	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
 	struct arm64_jit_data *jit_data;
+	bool was_classic = bpf_prog_was_classic(prog);
 	bool tmp_blinded = false;
 	bool extra_pass = false;
 	struct jit_ctx ctx;
[...]
 		goto out_off;
 	}
 
-	if (build_prologue(&ctx)) {
+	if (build_prologue(&ctx, was_classic)) {
 		prog = orig_prog;
 		goto out_off;
 	}
[...]
 skip_init_ctx:
 	ctx.idx = 0;
 
-	build_prologue(&ctx);
+	build_prologue(&ctx, was_classic);
 
 	if (build_body(&ctx)) {
 		bpf_jit_binary_free(header);
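The heuristic in the arm64 JIT change above picks between a MOVZ-based
and a MOVN-based sequence by counting how many 16-bit chunks of the
immediate would still need a MOVK fixup. A userspace shell sketch
mirroring i64_i16_blocks() (an illustration, not the kernel code):

```shell
# Count 16-bit chunks of a 64-bit value that differ from the fill
# pattern: != 0x0000 for a MOVZ start, != 0xffff for a MOVN start.
i64_i16_blocks() {
	val=$1; want=0
	[ "$2" = "inverse" ] && want=$((0xffff))
	n=0
	for shift in 0 16 32 48; do
		chunk=$(( (val >> shift) & 0xffff ))
		[ "$chunk" -ne "$want" ] && n=$((n + 1))
	done
	echo "$n"
}

val=$(( 0xffffffffffff1234 ))   # a mostly-ones immediate
norm=$(i64_i16_blocks "$val" normal)
inv=$(i64_i16_blocks "$val" inverse)
if [ "$inv" -lt "$norm" ]; then
	echo "start with MOVN: $inv chunk(s) to patch, vs $norm with MOVZ"
else
	echo "start with MOVZ: $norm chunk(s) to patch"
fi
```

For a mostly-ones value like this, the MOVN path wins: only one chunk
(0x1234) needs a follow-up MOVK, instead of four instructions with the
old unconditional MOVZ/MOVK sequence.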
arch/mips/net/ebpf_jit.c (-26):

  * struct jit_ctx - JIT context
  * @skf:		The sk_filter
  * @stack_size:		eBPF stack size
- * @tmp_offset:		eBPF $sp offset to 8-byte temporary memory
  * @idx:		Instruction index
  * @flags:		JIT flags
  * @offsets:		Instruction offsets
[...]
 struct jit_ctx {
 	const struct bpf_prog *skf;
 	int stack_size;
-	int tmp_offset;
 	u32 idx;
 	u32 flags;
 	u32 *offsets;
[...]
 	locals_size = (ctx->flags & EBPF_SEEN_FP) ? MAX_BPF_STACK : 0;
 
 	stack_adjust += locals_size;
-	ctx->tmp_offset = locals_size;
 
 	ctx->stack_size = stack_adjust;
[...]
 		emit_instr(ctx, lui, reg, upper >> 16);
 		emit_instr(ctx, addiu, reg, reg, lower);
 	}
-
 }
 
 static int gen_imm_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
[...]
 		}
 	}
 
-	return 0;
-}
-
-static void * __must_check
-ool_skb_header_pointer(const struct sk_buff *skb, int offset,
-		       int len, void *buffer)
-{
-	return skb_header_pointer(skb, offset, len, buffer);
-}
-
-static int size_to_len(const struct bpf_insn *insn)
-{
-	switch (BPF_SIZE(insn->code)) {
-	case BPF_B:
-		return 1;
-	case BPF_H:
-		return 2;
-	case BPF_W:
-		return 4;
-	case BPF_DW:
-		return 8;
-	}
 	return 0;
 }
arch/sparc/net/bpf_jit_comp_64.c (-1):

 	const int i = insn - ctx->prog->insnsi;
 	const s16 off = insn->off;
 	const s32 imm = insn->imm;
-	u32 *func;
 
 	if (insn->src_reg == BPF_REG_FP)
 		ctx->saw_frame_pointer = true;
arch/x86/include/asm/nospec-branch.h (+14 -15):

  *      jmp *%edx for x86_32
  */
 #ifdef CONFIG_RETPOLINE
-#ifdef CONFIG_X86_64
-# define RETPOLINE_RAX_BPF_JIT_SIZE	17
-# define RETPOLINE_RAX_BPF_JIT()				\
+# ifdef CONFIG_X86_64
+#  define RETPOLINE_RAX_BPF_JIT_SIZE	17
+#  define RETPOLINE_RAX_BPF_JIT()				\
 do {								\
 	EMIT1_off32(0xE8, 7);	 /* callq do_rop */		\
 	/* spec_trap: */					\
[...]
 	EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */	\
 	EMIT1(0xC3);             /* retq */			\
 } while (0)
-#else
-# define RETPOLINE_EDX_BPF_JIT()				\
+# else /* !CONFIG_X86_64 */
+#  define RETPOLINE_EDX_BPF_JIT()				\
 do {								\
 	EMIT1_off32(0xE8, 7);	 /* call do_rop */		\
 	/* spec_trap: */					\
[...]
 	EMIT3(0x89, 0x14, 0x24); /* mov %edx,(%esp) */		\
 	EMIT1(0xC3);             /* ret */			\
 } while (0)
-#endif
+# endif
 #else /* !CONFIG_RETPOLINE */
-
-#ifdef CONFIG_X86_64
-# define RETPOLINE_RAX_BPF_JIT_SIZE	2
-# define RETPOLINE_RAX_BPF_JIT()				\
-	EMIT2(0xFF, 0xE0);	 /* jmp *%rax */
-#else
-# define RETPOLINE_EDX_BPF_JIT()				\
-	EMIT2(0xFF, 0xE2)	 /* jmp *%edx */
-#endif
+# ifdef CONFIG_X86_64
+#  define RETPOLINE_RAX_BPF_JIT_SIZE	2
+#  define RETPOLINE_RAX_BPF_JIT()				\
+	EMIT2(0xFF, 0xE0);       /* jmp *%rax */
+# else /* !CONFIG_X86_64 */
+#  define RETPOLINE_EDX_BPF_JIT()				\
+	EMIT2(0xFF, 0xE2)        /* jmp *%edx */
+# endif
 #endif
 
 #endif /* _ASM_X86_NOSPEC_BRANCH_H_ */
drivers/net/ethernet/netronome/nfp/bpf/fw.h (+1):

 	NFP_BPF_CAP_TYPE_ADJUST_HEAD	= 2,
 	NFP_BPF_CAP_TYPE_MAPS		= 3,
 	NFP_BPF_CAP_TYPE_RANDOM		= 4,
+	NFP_BPF_CAP_TYPE_QUEUE_SELECT	= 5,
 };
 
 struct nfp_bpf_cap_tlv_func {
drivers/net/ethernet/netronome/nfp/bpf/jit.c (+47):

 #include "main.h"
 #include "../nfp_asm.h"
+#include "../nfp_net_ctrl.h"
 
 /* --- NFP prog --- */
 /* Foreach "multiple" entries macros provide pos and next<n> pointers.
[...]
 	return 0;
 }
 
+static int
+nfp_queue_select(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+	u32 jmp_tgt;
+
+	jmp_tgt = nfp_prog_current_offset(nfp_prog) + 5;
+
+	/* Make sure the queue id fits into FW field */
+	emit_alu(nfp_prog, reg_none(), reg_a(meta->insn.src_reg * 2),
+		 ALU_OP_AND_NOT_B, reg_imm(0xff));
+	emit_br(nfp_prog, BR_BEQ, jmp_tgt, 2);
+
+	/* Set the 'queue selected' bit and the queue value */
+	emit_shf(nfp_prog, pv_qsel_set(nfp_prog),
+		 pv_qsel_set(nfp_prog), SHF_OP_OR, reg_imm(1),
+		 SHF_SC_L_SHF, PKT_VEL_QSEL_SET_BIT);
+	emit_ld_field(nfp_prog,
+		      pv_qsel_val(nfp_prog), 0x1, reg_b(meta->insn.src_reg * 2),
+		      SHF_SC_NONE, 0);
+	/* Delay slots end here, we will jump over next instruction if queue
+	 * value fits into the field.
+	 */
+	emit_ld_field(nfp_prog,
+		      pv_qsel_val(nfp_prog), 0x1, reg_imm(NFP_NET_RXR_MAX),
+		      SHF_SC_NONE, 0);
+
+	if (!nfp_prog_confirm_current_offset(nfp_prog, jmp_tgt))
+		return -EINVAL;
+
+	return 0;
+}
+
 /* --- Callbacks --- */
 static int mov_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
[...]
 			   false, wrp_lmem_store);
 }
 
+static int mem_stx_xdp(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+	switch (meta->insn.off) {
+	case offsetof(struct xdp_md, rx_queue_index):
+		return nfp_queue_select(nfp_prog, meta);
+	}
+
+	WARN_ON_ONCE(1); /* verifier should have rejected bad accesses */
+	return -EOPNOTSUPP;
+}
+
 static int
 mem_stx(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
 	unsigned int size)
[...]
 
 static int mem_stx4(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
+	if (meta->ptr.type == PTR_TO_CTX)
+		if (nfp_prog->type == BPF_PROG_TYPE_XDP)
+			return mem_stx_xdp(nfp_prog, meta);
 	return mem_stx(nfp_prog, meta, 4);
 }
drivers/net/ethernet/netronome/nfp/bpf/main.c (+11):

 	return 0;
 }
 
+static int
+nfp_bpf_parse_cap_qsel(struct nfp_app_bpf *bpf, void __iomem *value, u32 length)
+{
+	bpf->queue_select = true;
+	return 0;
+}
+
 static int nfp_bpf_parse_capabilities(struct nfp_app *app)
 {
 	struct nfp_cpp *cpp = app->pf->cpp;
[...]
 			break;
 		case NFP_BPF_CAP_TYPE_RANDOM:
 			if (nfp_bpf_parse_cap_random(app->priv, value, length))
+				goto err_release_free;
+			break;
+		case NFP_BPF_CAP_TYPE_QUEUE_SELECT:
+			if (nfp_bpf_parse_cap_qsel(app->priv, value, length))
 				goto err_release_free;
 			break;
 		default:
drivers/net/ethernet/netronome/nfp/bpf/main.h (+8):

 enum pkt_vec {
 	PKT_VEC_PKT_LEN		= 0,
 	PKT_VEC_PKT_PTR		= 2,
+	PKT_VEC_QSEL_SET	= 4,
+	PKT_VEC_QSEL_VAL	= 6,
 };
+
+#define PKT_VEL_QSEL_SET_BIT	4
 
 #define pv_len(np)	reg_lm(1, PKT_VEC_PKT_LEN)
 #define pv_ctm_ptr(np)	reg_lm(1, PKT_VEC_PKT_PTR)
+#define pv_qsel_set(np)	reg_lm(1, PKT_VEC_QSEL_SET)
+#define pv_qsel_val(np)	reg_lm(1, PKT_VEC_QSEL_VAL)
 
 #define stack_reg(np)	reg_a(STATIC_REG_STACK)
 #define stack_imm(np)	imm_b(np)
[...]
  * @helpers.perf_event_output:	output perf event to a ring buffer
  *
  * @pseudo_random:	FW initialized the pseudo-random machinery (CSRs)
+ * @queue_select:	BPF can set the RX queue ID in packet vector
  */
 struct nfp_app_bpf {
 	struct nfp_app *app;
[...]
 	} helpers;
 
 	bool pseudo_random;
+	bool queue_select;
 };
 
 enum nfp_bpf_map_use {
drivers/net/ethernet/netronome/nfp/bpf/verifier.c (+26 -2):

 }
 
 static int
+nfp_bpf_check_store(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
+		    struct bpf_verifier_env *env)
+{
+	const struct bpf_reg_state *reg = cur_regs(env) + meta->insn.dst_reg;
+
+	if (reg->type == PTR_TO_CTX) {
+		if (nfp_prog->type == BPF_PROG_TYPE_XDP) {
+			/* XDP ctx accesses must be 4B in size */
+			switch (meta->insn.off) {
+			case offsetof(struct xdp_md, rx_queue_index):
+				if (nfp_prog->bpf->queue_select)
+					goto exit_check_ptr;
+				pr_vlog(env, "queue selection not supported by FW\n");
+				return -EOPNOTSUPP;
+			}
+		}
+		pr_vlog(env, "unsupported store to context field\n");
+		return -EOPNOTSUPP;
+	}
+exit_check_ptr:
+	return nfp_bpf_check_ptr(nfp_prog, meta, env, meta->insn.dst_reg);
+}
+
+static int
 nfp_bpf_check_xadd(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
 		   struct bpf_verifier_env *env)
 {
[...]
 		return nfp_bpf_check_ptr(nfp_prog, meta, env,
 					 meta->insn.src_reg);
 	if (is_mbpf_store(meta))
-		return nfp_bpf_check_ptr(nfp_prog, meta, env,
-					 meta->insn.dst_reg);
+		return nfp_bpf_check_store(nfp_prog, meta, env);
+
 	if (is_mbpf_xadd(meta))
 		return nfp_bpf_check_xadd(nfp_prog, meta, env);
+12 -10
drivers/net/ethernet/netronome/nfp/nfp_asm.h
···
 183  183  #define OP_ALU_DST_LMEXTN	0x80000000000ULL
 184  184  
 185  185  enum alu_op {
 186      -	ALU_OP_NONE	= 0x00,
 187      -	ALU_OP_ADD	= 0x01,
 188      -	ALU_OP_NOT	= 0x04,
 189      -	ALU_OP_ADD_2B	= 0x05,
 190      -	ALU_OP_AND	= 0x08,
 191      -	ALU_OP_SUB_C	= 0x0d,
 192      -	ALU_OP_ADD_C	= 0x11,
 193      -	ALU_OP_OR	= 0x14,
 194      -	ALU_OP_SUB	= 0x15,
 195      -	ALU_OP_XOR	= 0x18,
      186 +	ALU_OP_NONE		= 0x00,
      187 +	ALU_OP_ADD		= 0x01,
      188 +	ALU_OP_NOT		= 0x04,
      189 +	ALU_OP_ADD_2B		= 0x05,
      190 +	ALU_OP_AND		= 0x08,
      191 +	ALU_OP_AND_NOT_A	= 0x0c,
      192 +	ALU_OP_SUB_C		= 0x0d,
      193 +	ALU_OP_AND_NOT_B	= 0x10,
      194 +	ALU_OP_ADD_C		= 0x11,
      195 +	ALU_OP_OR		= 0x14,
      196 +	ALU_OP_SUB		= 0x15,
      197 +	ALU_OP_XOR		= 0x18,
 196  198  };
 197  199  
 198  200  enum alu_dst_ab {
+9 -1
include/linux/bpf.h
··· 627 627 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL) 628 628 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr); 629 629 630 - static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux) 630 + static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux) 631 631 { 632 632 return aux->offload_requested; 633 633 } ··· 668 668 669 669 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_INET) 670 670 struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key); 671 + struct sock *__sock_hash_lookup_elem(struct bpf_map *map, void *key); 671 672 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type); 672 673 #else 673 674 static inline struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key) 675 + { 676 + return NULL; 677 + } 678 + 679 + static inline struct sock *__sock_hash_lookup_elem(struct bpf_map *map, 680 + void *key) 674 681 { 675 682 return NULL; 676 683 } ··· 731 724 extern const struct bpf_func_proto bpf_get_stackid_proto; 732 725 extern const struct bpf_func_proto bpf_get_stack_proto; 733 726 extern const struct bpf_func_proto bpf_sock_map_update_proto; 727 + extern const struct bpf_func_proto bpf_sock_hash_update_proto; 734 728 735 729 /* Shared helpers among cBPF and eBPF. */ 736 730 void bpf_user_rnd_init_once(void);
+1
include/linux/bpf_types.h
···
  47   47  BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
  48   48  #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
  49   49  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
       50 +BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
  50   51  #endif
  51   52  BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
  52   53  #if defined(CONFIG_XDP_SOCKETS)
+2 -2
include/linux/bpf_verifier.h
···
 200  200  	u32 subprog_cnt;
 201  201  };
 202  202  
 203      -void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt,
 204      -		       va_list args);
      203 +__printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log,
      204 +				      const char *fmt, va_list args);
 205  205  __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
 206  206  					   const char *fmt, ...);
 207  207  
+2
include/linux/btf.h
···
  44   44  		   u32 *ret_size);
  45   45  void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj,
  46   46  		       struct seq_file *m);
       47 +int btf_get_fd_by_id(u32 id);
       48 +u32 btf_id(const struct btf *btf);
  47   49  
  48   50  #endif
+1 -2
include/linux/filter.h
···
 515  515  	int sg_end;
 516  516  	struct scatterlist sg_data[MAX_SKB_FRAGS];
 517  517  	bool sg_copy[MAX_SKB_FRAGS];
 518      -	__u32 key;
 519  518  	__u32 flags;
 520      -	struct bpf_map *map;
      519 +	struct sock *sk_redir;
 521  520  	struct sk_buff *skb;
 522  521  	struct list_head list;
 523  522  };
+14
include/net/addrconf.h
··· 223 223 const struct in6_addr *addr); 224 224 int (*ipv6_dst_lookup)(struct net *net, struct sock *sk, 225 225 struct dst_entry **dst, struct flowi6 *fl6); 226 + 227 + struct fib6_table *(*fib6_get_table)(struct net *net, u32 id); 228 + struct fib6_info *(*fib6_lookup)(struct net *net, int oif, 229 + struct flowi6 *fl6, int flags); 230 + struct fib6_info *(*fib6_table_lookup)(struct net *net, 231 + struct fib6_table *table, 232 + int oif, struct flowi6 *fl6, 233 + int flags); 234 + struct fib6_info *(*fib6_multipath_select)(const struct net *net, 235 + struct fib6_info *f6i, 236 + struct flowi6 *fl6, int oif, 237 + const struct sk_buff *skb, 238 + int strict); 239 + 226 240 void (*udpv6_encap_enable)(void); 227 241 void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr *daddr, 228 242 const struct in6_addr *solicited_addr,
+18 -3
include/net/ip6_fib.h
···
 376  376  			       const struct sk_buff *skb,
 377  377  			       int flags, pol_lookup_t lookup);
 378  378  
 379      -struct fib6_node *fib6_lookup(struct fib6_node *root,
 380      -			      const struct in6_addr *daddr,
 381      -			      const struct in6_addr *saddr);
      379 +/* called with rcu lock held; can return error pointer
      380 + * caller needs to select path
      381 + */
      382 +struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
      383 +			      int flags);
      384 +
      385 +/* called with rcu lock held; caller needs to select path */
      386 +struct fib6_info *fib6_table_lookup(struct net *net, struct fib6_table *table,
      387 +				    int oif, struct flowi6 *fl6, int strict);
      388 +
      389 +struct fib6_info *fib6_multipath_select(const struct net *net,
      390 +					struct fib6_info *match,
      391 +					struct flowi6 *fl6, int oif,
      392 +					const struct sk_buff *skb, int strict);
      393 +
      394 +struct fib6_node *fib6_node_lookup(struct fib6_node *root,
      395 +				   const struct in6_addr *daddr,
      396 +				   const struct in6_addr *saddr);
 382  397  
 383  398  struct fib6_node *fib6_locate(struct fib6_node *root,
 384  399  			      const struct in6_addr *daddr, int dst_len,
+1 -2
include/net/tcp.h
···
 816  816  #endif
 817  817  	} header;	/* For incoming skbs */
 818  818  	struct {
 819      -		__u32 key;
 820  819  		__u32 flags;
 821      -		struct bpf_map *map;
      820 +		struct sock *sk_redir;
 822  821  		void *data_end;
 823  822  	} bpf;
 824  823  };
+7 -7
include/trace/events/fib6.h
···
  12   12  
  13   13  TRACE_EVENT(fib6_table_lookup,
  14   14  
  15      -	TP_PROTO(const struct net *net, const struct rt6_info *rt,
       15 +	TP_PROTO(const struct net *net, const struct fib6_info *f6i,
  16   16  		 struct fib6_table *table, const struct flowi6 *flp),
  17   17  
  18      -	TP_ARGS(net, rt, table, flp),
       18 +	TP_ARGS(net, f6i, table, flp),
  19   19  
  20   20  	TP_STRUCT__entry(
  21   21  		__field(	u32,	tb_id	)
···
  48   48  		in6 = (struct in6_addr *)__entry->dst;
  49   49  		*in6 = flp->daddr;
  50   50  
  51      -		if (rt->rt6i_idev) {
  52      -			__assign_str(name, rt->rt6i_idev->dev->name);
       51 +		if (f6i->fib6_nh.nh_dev) {
       52 +			__assign_str(name, f6i->fib6_nh.nh_dev->name);
  53   53  		} else {
  54   54  			__assign_str(name, "");
  55   55  		}
  56      -		if (rt == net->ipv6.ip6_null_entry) {
       56 +		if (f6i == net->ipv6.fib6_null_entry) {
  57   57  			struct in6_addr in6_zero = {};
  58   58  
  59   59  			in6 = (struct in6_addr *)__entry->gw;
  60   60  			*in6 = in6_zero;
  61   61  
  62      -		} else if (rt) {
       62 +		} else if (f6i) {
  63   63  			in6 = (struct in6_addr *)__entry->gw;
  64      -			*in6 = rt->rt6i_gateway;
       64 +			*in6 = f6i->fib6_nh.nh_gw;
  65   65  		}
  66   66  	),
  67   67  
+141 -1
include/uapi/linux/bpf.h
···
   96    96  	BPF_PROG_QUERY,
   97    97  	BPF_RAW_TRACEPOINT_OPEN,
   98    98  	BPF_BTF_LOAD,
         99 +	BPF_BTF_GET_FD_BY_ID,
   99   100  };
  100   101  
  101   102  enum bpf_map_type {
···
  118   117  	BPF_MAP_TYPE_SOCKMAP,
  119   118  	BPF_MAP_TYPE_CPUMAP,
  120   119  	BPF_MAP_TYPE_XSKMAP,
        120 +	BPF_MAP_TYPE_SOCKHASH,
  121   121  };
  122   122  
  123   123  enum bpf_prog_type {
···
  346   344  		__u32		start_id;
  347   345  		__u32		prog_id;
  348   346  		__u32		map_id;
        347 +		__u32		btf_id;
  349   348  	};
  350   349  	__u32		next_id;
  351   350  	__u32		open_flags;
···
 1829  1826   *	Return
 1830  1827   *		0 on success, or a negative error in case of failure.
 1831  1828   *
       1829 + * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
       1830 + *	Description
       1831 + *		Do FIB lookup in kernel tables using parameters in *params*.
       1832 + *		If lookup is successful and result shows packet is to be
       1833 + *		forwarded, the neighbor tables are searched for the nexthop.
       1834 + *		If successful (i.e., FIB lookup shows forwarding and nexthop
       1835 + *		is resolved), the nexthop address is returned in ipv4_dst,
       1836 + *		ipv6_dst or mpls_out based on family, smac is set to mac
       1837 + *		address of egress device, dmac is set to nexthop mac address,
       1838 + *		rt_metric is set to metric from route.
       1839 + *
       1840 + *		*plen* argument is the size of the passed in struct.
       1841 + *		*flags* argument can be one or more BPF_FIB_LOOKUP_ flags:
       1842 + *
       1843 + *		**BPF_FIB_LOOKUP_DIRECT** means do a direct table lookup vs
       1844 + *		full lookup using FIB rules
       1845 + *		**BPF_FIB_LOOKUP_OUTPUT** means do lookup from an egress
       1846 + *		perspective (default is ingress)
       1847 + *
       1848 + *		*ctx* is either **struct xdp_md** for XDP programs or
       1849 + *		**struct sk_buff** for tc cls_act programs.
       1850 + *
       1851 + *	Return
       1852 + *		Egress device index on success, 0 if packet needs to continue
       1853 + *		up the stack for further processing or a negative error in case
       1854 + *		of failure.
       1855 + *
       1856 + * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
       1857 + *	Description
       1858 + *		Add an entry to, or update a sockhash *map* referencing sockets.
       1859 + *		The *skops* is used as a new value for the entry associated to
       1860 + *		*key*. *flags* is one of:
       1861 + *
       1862 + *		**BPF_NOEXIST**
       1863 + *			The entry for *key* must not exist in the map.
       1864 + *		**BPF_EXIST**
       1865 + *			The entry for *key* must already exist in the map.
       1866 + *		**BPF_ANY**
       1867 + *			No condition on the existence of the entry for *key*.
       1868 + *
       1869 + *		If the *map* has eBPF programs (parser and verdict), those will
       1870 + *		be inherited by the socket being added. If the socket is
       1871 + *		already attached to eBPF programs, this results in an error.
       1872 + *	Return
       1873 + *		0 on success, or a negative error in case of failure.
       1874 + *
       1875 + * int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
       1876 + *	Description
       1877 + *		This helper is used in programs implementing policies at the
       1878 + *		socket level. If the message *msg* is allowed to pass (i.e. if
       1879 + *		the verdict eBPF program returns **SK_PASS**), redirect it to
       1880 + *		the socket referenced by *map* (of type
       1881 + *		**BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
       1882 + *		egress interfaces can be used for redirection. The
       1883 + *		**BPF_F_INGRESS** value in *flags* is used to make the
       1884 + *		distinction (ingress path is selected if the flag is present,
       1885 + *		egress path otherwise). This is the only flag supported for now.
       1886 + *	Return
       1887 + *		**SK_PASS** on success, or **SK_DROP** on error.
       1888 + *
       1889 + * int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
       1890 + *	Description
       1891 + *		This helper is used in programs implementing policies at the
       1892 + *		skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
       1893 + *		if the verdict eBPF program returns **SK_PASS**), redirect it
       1894 + *		to the socket referenced by *map* (of type
       1895 + *		**BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
       1896 + *		egress interfaces can be used for redirection. The
       1897 + *		**BPF_F_INGRESS** value in *flags* is used to make the
       1898 + *		distinction (ingress path is selected if the flag is present,
       1899 + *		egress otherwise). This is the only flag supported for now.
       1900 + *	Return
       1901 + *		**SK_PASS** on success, or **SK_DROP** on error.
 1832  1902   */
 1833  1903  #define __BPF_FUNC_MAPPER(FN)		\
 1834  1904  	FN(unspec),			\
···
 1972  1896  	FN(xdp_adjust_tail),		\
 1973  1897  	FN(skb_get_xfrm_state),		\
 1974  1898  	FN(get_stack),			\
 1975       -	FN(skb_load_bytes_relative),
       1899 +	FN(skb_load_bytes_relative),	\
       1900 +	FN(fib_lookup),			\
       1901 +	FN(sock_hash_update),		\
       1902 +	FN(msg_redirect_hash),		\
       1903 +	FN(sk_redirect_hash),
 1976  1904  
 1977  1905  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
 1978  1906   * function eBPF program intends to call
···
 2210  2130  	__u32 ifindex;
 2211  2131  	__u64 netns_dev;
 2212  2132  	__u64 netns_ino;
       2133 +	__u32 btf_id;
       2134 +	__u32 btf_key_id;
       2135 +	__u32 btf_value_id;
       2136 +} __attribute__((aligned(8)));
       2137 +
       2138 +struct bpf_btf_info {
       2139 +	__aligned_u64 btf;
       2140 +	__u32 btf_size;
       2141 +	__u32 id;
 2213  2142  } __attribute__((aligned(8)));
 2214  2143  
 2215  2144  /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
···
 2397  2308  
 2398  2309  struct bpf_raw_tracepoint_args {
 2399  2310  	__u64 args[0];
       2311 +};
       2312 +
       2313 +/* DIRECT: Skip the FIB rules and go to FIB table associated with device
       2314 + * OUTPUT: Do lookup from egress perspective; default is ingress
       2315 + */
       2316 +#define BPF_FIB_LOOKUP_DIRECT  BIT(0)
       2317 +#define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
       2318 +
       2319 +struct bpf_fib_lookup {
       2320 +	/* input */
       2321 +	__u8	family;   /* network family, AF_INET, AF_INET6, AF_MPLS */
       2322 +
       2323 +	/* set if lookup is to consider L4 data - e.g., FIB rules */
       2324 +	__u8	l4_protocol;
       2325 +	__be16	sport;
       2326 +	__be16	dport;
       2327 +
       2328 +	/* total length of packet from network header - used for MTU check */
       2329 +	__u16	tot_len;
       2330 +	__u32	ifindex;  /* L3 device index for lookup */
       2331 +
       2332 +	union {
       2333 +		/* inputs to lookup */
       2334 +		__u8	tos;		/* AF_INET */
       2335 +		__be32	flowlabel;	/* AF_INET6 */
       2336 +
       2337 +		/* output: metric of fib result */
       2338 +		__u32	rt_metric;
       2339 +	};
       2340 +
       2341 +	union {
       2342 +		__be32		mpls_in;
       2343 +		__be32		ipv4_src;
       2344 +		__u32		ipv6_src[4];  /* in6_addr; network order */
       2345 +	};
       2346 +
       2347 +	/* input to bpf_fib_lookup, *dst is destination address.
       2348 +	 * output: bpf_fib_lookup sets to gateway address
       2349 +	 */
       2350 +	union {
       2351 +		/* return for MPLS lookups */
       2352 +		__be32		mpls_out[4];  /* support up to 4 labels */
       2353 +		__be32		ipv4_dst;
       2354 +		__u32		ipv6_dst[4];  /* in6_addr; network order */
       2355 +	};
       2356 +
       2357 +	/* output */
       2358 +	__be16	h_vlan_proto;
       2359 +	__be16	h_vlan_TCI;
       2360 +	__u8	smac[6];     /* ETH_ALEN */
       2361 +	__u8	dmac[6];     /* ETH_ALEN */
 2400  2362  };
 2401  2363  
 2402  2364  #endif /* _UAPI__LINUX_BPF_H__ */
+1
init/Kconfig
···
 1391  1391  	bool "Enable bpf() system call"
 1392  1392  	select ANON_INODES
 1393  1393  	select BPF
       1394 +	select IRQ_WORK
 1394  1395  	default n
 1395  1396  	help
 1396  1397  	  Enable the bpf() system call that allows to manipulate eBPF
+120 -16
kernel/bpf/btf.c
··· 11 11 #include <linux/file.h> 12 12 #include <linux/uaccess.h> 13 13 #include <linux/kernel.h> 14 + #include <linux/idr.h> 14 15 #include <linux/bpf_verifier.h> 15 16 #include <linux/btf.h> 16 17 ··· 180 179 i < btf_type_vlen(struct_type); \ 181 180 i++, member++) 182 181 182 + static DEFINE_IDR(btf_idr); 183 + static DEFINE_SPINLOCK(btf_idr_lock); 184 + 183 185 struct btf { 184 186 union { 185 187 struct btf_header *hdr; ··· 197 193 u32 types_size; 198 194 u32 data_size; 199 195 refcount_t refcnt; 196 + u32 id; 197 + struct rcu_head rcu; 200 198 }; 201 199 202 200 enum verifier_phase { ··· 604 598 return 0; 605 599 } 606 600 601 + static int btf_alloc_id(struct btf *btf) 602 + { 603 + int id; 604 + 605 + idr_preload(GFP_KERNEL); 606 + spin_lock_bh(&btf_idr_lock); 607 + id = idr_alloc_cyclic(&btf_idr, btf, 1, INT_MAX, GFP_ATOMIC); 608 + if (id > 0) 609 + btf->id = id; 610 + spin_unlock_bh(&btf_idr_lock); 611 + idr_preload_end(); 612 + 613 + if (WARN_ON_ONCE(!id)) 614 + return -ENOSPC; 615 + 616 + return id > 0 ? 0 : id; 617 + } 618 + 619 + static void btf_free_id(struct btf *btf) 620 + { 621 + unsigned long flags; 622 + 623 + /* 624 + * In map-in-map, calling map_delete_elem() on outer 625 + * map will call bpf_map_put on the inner map. 626 + * It will then eventually call btf_free_id() 627 + * on the inner map. Some of the map_delete_elem() 628 + * implementation may have irq disabled, so 629 + * we need to use the _irqsave() version instead 630 + * of the _bh() version. 
631 + */ 632 + spin_lock_irqsave(&btf_idr_lock, flags); 633 + idr_remove(&btf_idr, btf->id); 634 + spin_unlock_irqrestore(&btf_idr_lock, flags); 635 + } 636 + 607 637 static void btf_free(struct btf *btf) 608 638 { 609 639 kvfree(btf->types); ··· 649 607 kfree(btf); 650 608 } 651 609 652 - static void btf_get(struct btf *btf) 610 + static void btf_free_rcu(struct rcu_head *rcu) 653 611 { 654 - refcount_inc(&btf->refcnt); 612 + struct btf *btf = container_of(rcu, struct btf, rcu); 613 + 614 + btf_free(btf); 655 615 } 656 616 657 617 void btf_put(struct btf *btf) 658 618 { 659 - if (btf && refcount_dec_and_test(&btf->refcnt)) 660 - btf_free(btf); 619 + if (btf && refcount_dec_and_test(&btf->refcnt)) { 620 + btf_free_id(btf); 621 + call_rcu(&btf->rcu, btf_free_rcu); 622 + } 661 623 } 662 624 663 625 static int env_resolve_init(struct btf_verifier_env *env) ··· 2023 1977 2024 1978 if (!err) { 2025 1979 btf_verifier_env_free(env); 2026 - btf_get(btf); 1980 + refcount_set(&btf->refcnt, 1); 2027 1981 return btf; 2028 1982 } 2029 1983 ··· 2052 2006 .release = btf_release, 2053 2007 }; 2054 2008 2009 + static int __btf_new_fd(struct btf *btf) 2010 + { 2011 + return anon_inode_getfd("btf", &btf_fops, btf, O_RDONLY | O_CLOEXEC); 2012 + } 2013 + 2055 2014 int btf_new_fd(const union bpf_attr *attr) 2056 2015 { 2057 2016 struct btf *btf; 2058 - int fd; 2017 + int ret; 2059 2018 2060 2019 btf = btf_parse(u64_to_user_ptr(attr->btf), 2061 2020 attr->btf_size, attr->btf_log_level, ··· 2069 2018 if (IS_ERR(btf)) 2070 2019 return PTR_ERR(btf); 2071 2020 2072 - fd = anon_inode_getfd("btf", &btf_fops, btf, 2073 - O_RDONLY | O_CLOEXEC); 2074 - if (fd < 0) 2021 + ret = btf_alloc_id(btf); 2022 + if (ret) { 2023 + btf_free(btf); 2024 + return ret; 2025 + } 2026 + 2027 + /* 2028 + * The BTF ID is published to the userspace. 2029 + * All BTF free must go through call_rcu() from 2030 + * now on (i.e. free by calling btf_put()). 
2031 + */ 2032 + 2033 + ret = __btf_new_fd(btf); 2034 + if (ret < 0) 2075 2035 btf_put(btf); 2076 2036 2077 - return fd; 2037 + return ret; 2078 2038 } 2079 2039 2080 2040 struct btf *btf_get_by_fd(int fd) ··· 2104 2042 } 2105 2043 2106 2044 btf = f.file->private_data; 2107 - btf_get(btf); 2045 + refcount_inc(&btf->refcnt); 2108 2046 fdput(f); 2109 2047 2110 2048 return btf; ··· 2114 2052 const union bpf_attr *attr, 2115 2053 union bpf_attr __user *uattr) 2116 2054 { 2117 - void __user *udata = u64_to_user_ptr(attr->info.info); 2118 - u32 copy_len = min_t(u32, btf->data_size, 2119 - attr->info.info_len); 2055 + struct bpf_btf_info __user *uinfo; 2056 + struct bpf_btf_info info = {}; 2057 + u32 info_copy, btf_copy; 2058 + void __user *ubtf; 2059 + u32 uinfo_len; 2120 2060 2121 - if (copy_to_user(udata, btf->data, copy_len) || 2122 - put_user(btf->data_size, &uattr->info.info_len)) 2061 + uinfo = u64_to_user_ptr(attr->info.info); 2062 + uinfo_len = attr->info.info_len; 2063 + 2064 + info_copy = min_t(u32, uinfo_len, sizeof(info)); 2065 + if (copy_from_user(&info, uinfo, info_copy)) 2066 + return -EFAULT; 2067 + 2068 + info.id = btf->id; 2069 + ubtf = u64_to_user_ptr(info.btf); 2070 + btf_copy = min_t(u32, btf->data_size, info.btf_size); 2071 + if (copy_to_user(ubtf, btf->data, btf_copy)) 2072 + return -EFAULT; 2073 + info.btf_size = btf->data_size; 2074 + 2075 + if (copy_to_user(uinfo, &info, info_copy) || 2076 + put_user(info_copy, &uattr->info.info_len)) 2123 2077 return -EFAULT; 2124 2078 2125 2079 return 0; 2080 + } 2081 + 2082 + int btf_get_fd_by_id(u32 id) 2083 + { 2084 + struct btf *btf; 2085 + int fd; 2086 + 2087 + rcu_read_lock(); 2088 + btf = idr_find(&btf_idr, id); 2089 + if (!btf || !refcount_inc_not_zero(&btf->refcnt)) 2090 + btf = ERR_PTR(-ENOENT); 2091 + rcu_read_unlock(); 2092 + 2093 + if (IS_ERR(btf)) 2094 + return PTR_ERR(btf); 2095 + 2096 + fd = __btf_new_fd(btf); 2097 + if (fd < 0) 2098 + btf_put(btf); 2099 + 2100 + return fd; 2101 + } 2102 + 2103 
+ u32 btf_id(const struct btf *btf) 2104 + { 2105 + return btf->id; 2126 2106 }
+1
kernel/bpf/core.c
···
 1707  1707  const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 1708  1708  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 1709  1709  const struct bpf_func_proto bpf_sock_map_update_proto __weak;
       1710 +const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 1710  1711  
 1711  1712  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 1712  1713  {
+571 -73
kernel/bpf/sockmap.c
··· 48 48 #define SOCK_CREATE_FLAG_MASK \ 49 49 (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) 50 50 51 - struct bpf_stab { 52 - struct bpf_map map; 53 - struct sock **sock_map; 51 + struct bpf_sock_progs { 54 52 struct bpf_prog *bpf_tx_msg; 55 53 struct bpf_prog *bpf_parse; 56 54 struct bpf_prog *bpf_verdict; 55 + }; 56 + 57 + struct bpf_stab { 58 + struct bpf_map map; 59 + struct sock **sock_map; 60 + struct bpf_sock_progs progs; 61 + }; 62 + 63 + struct bucket { 64 + struct hlist_head head; 65 + raw_spinlock_t lock; 66 + }; 67 + 68 + struct bpf_htab { 69 + struct bpf_map map; 70 + struct bucket *buckets; 71 + atomic_t count; 72 + u32 n_buckets; 73 + u32 elem_size; 74 + struct bpf_sock_progs progs; 75 + }; 76 + 77 + struct htab_elem { 78 + struct rcu_head rcu; 79 + struct hlist_node hash_node; 80 + u32 hash; 81 + struct sock *sk; 82 + char key[0]; 57 83 }; 58 84 59 85 enum smap_psock_state { ··· 89 63 struct smap_psock_map_entry { 90 64 struct list_head list; 91 65 struct sock **entry; 66 + struct htab_elem *hash_link; 67 + struct bpf_htab *htab; 92 68 }; 93 69 94 70 struct smap_psock { ··· 219 191 rcu_read_unlock(); 220 192 } 221 193 194 + static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) 195 + { 196 + atomic_dec(&htab->count); 197 + kfree_rcu(l, rcu); 198 + } 199 + 222 200 static void bpf_tcp_close(struct sock *sk, long timeout) 223 201 { 224 202 void (*close_fun)(struct sock *sk, long timeout); ··· 261 227 } 262 228 263 229 list_for_each_entry_safe(e, tmp, &psock->maps, list) { 264 - osk = cmpxchg(e->entry, sk, NULL); 265 - if (osk == sk) { 266 - list_del(&e->list); 267 - smap_release_sock(psock, sk); 230 + if (e->entry) { 231 + osk = cmpxchg(e->entry, sk, NULL); 232 + if (osk == sk) { 233 + list_del(&e->list); 234 + smap_release_sock(psock, sk); 235 + } 236 + } else { 237 + hlist_del_rcu(&e->hash_link->hash_node); 238 + smap_release_sock(psock, e->hash_link->sk); 239 + free_htab_elem(e->htab, e->hash_link); 268 240 } 269 241 } 270 242 
write_unlock_bh(&sk->sk_callback_lock); ··· 501 461 static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md) 502 462 { 503 463 return ((_rc == SK_PASS) ? 504 - (md->map ? __SK_REDIRECT : __SK_PASS) : 464 + (md->sk_redir ? __SK_REDIRECT : __SK_PASS) : 505 465 __SK_DROP); 506 466 } 507 467 ··· 1132 1092 * when we orphan the skb so that we don't have the possibility 1133 1093 * to reference a stale map. 1134 1094 */ 1135 - TCP_SKB_CB(skb)->bpf.map = NULL; 1095 + TCP_SKB_CB(skb)->bpf.sk_redir = NULL; 1136 1096 skb->sk = psock->sock; 1137 1097 bpf_compute_data_pointers(skb); 1138 1098 preempt_disable(); ··· 1142 1102 1143 1103 /* Moving return codes from UAPI namespace into internal namespace */ 1144 1104 return rc == SK_PASS ? 1145 - (TCP_SKB_CB(skb)->bpf.map ? __SK_REDIRECT : __SK_PASS) : 1105 + (TCP_SKB_CB(skb)->bpf.sk_redir ? __SK_REDIRECT : __SK_PASS) : 1146 1106 __SK_DROP; 1147 1107 } 1148 1108 ··· 1412 1372 } 1413 1373 1414 1374 static void smap_init_progs(struct smap_psock *psock, 1415 - struct bpf_stab *stab, 1416 1375 struct bpf_prog *verdict, 1417 1376 struct bpf_prog *parse) 1418 1377 { ··· 1489 1450 kfree(psock); 1490 1451 } 1491 1452 1492 - static struct smap_psock *smap_init_psock(struct sock *sock, 1493 - struct bpf_stab *stab) 1453 + static struct smap_psock *smap_init_psock(struct sock *sock, int node) 1494 1454 { 1495 1455 struct smap_psock *psock; 1496 1456 1497 1457 psock = kzalloc_node(sizeof(struct smap_psock), 1498 1458 GFP_ATOMIC | __GFP_NOWARN, 1499 - stab->map.numa_node); 1459 + node); 1500 1460 if (!psock) 1501 1461 return ERR_PTR(-ENOMEM); 1502 1462 ··· 1563 1525 return ERR_PTR(err); 1564 1526 } 1565 1527 1566 - static void smap_list_remove(struct smap_psock *psock, struct sock **entry) 1528 + static void smap_list_remove(struct smap_psock *psock, 1529 + struct sock **entry, 1530 + struct htab_elem *hash_link) 1567 1531 { 1568 1532 struct smap_psock_map_entry *e, *tmp; 1569 1533 1570 1534 list_for_each_entry_safe(e, tmp, &psock->maps, 
list) { 1571 - if (e->entry == entry) { 1535 + if (e->entry == entry || e->hash_link == hash_link) { 1572 1536 list_del(&e->list); 1573 1537 break; 1574 1538 } ··· 1608 1568 * to be null and queued for garbage collection. 1609 1569 */ 1610 1570 if (likely(psock)) { 1611 - smap_list_remove(psock, &stab->sock_map[i]); 1571 + smap_list_remove(psock, &stab->sock_map[i], NULL); 1612 1572 smap_release_sock(psock, sock); 1613 1573 } 1614 1574 write_unlock_bh(&sock->sk_callback_lock); ··· 1667 1627 1668 1628 if (psock->bpf_parse) 1669 1629 smap_stop_sock(psock, sock); 1670 - smap_list_remove(psock, &stab->sock_map[k]); 1630 + smap_list_remove(psock, &stab->sock_map[k], NULL); 1671 1631 smap_release_sock(psock, sock); 1672 1632 out: 1673 1633 write_unlock_bh(&sock->sk_callback_lock); ··· 1702 1662 * - sock_map must use READ_ONCE and (cmp)xchg operations 1703 1663 * - BPF verdict/parse programs must use READ_ONCE and xchg operations 1704 1664 */ 1705 - static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops, 1706 - struct bpf_map *map, 1707 - void *key, u64 flags) 1665 + 1666 + static int __sock_map_ctx_update_elem(struct bpf_map *map, 1667 + struct bpf_sock_progs *progs, 1668 + struct sock *sock, 1669 + struct sock **map_link, 1670 + void *key) 1708 1671 { 1709 - struct bpf_stab *stab = container_of(map, struct bpf_stab, map); 1710 - struct smap_psock_map_entry *e = NULL; 1711 1672 struct bpf_prog *verdict, *parse, *tx_msg; 1712 - struct sock *osock, *sock; 1673 + struct smap_psock_map_entry *e = NULL; 1713 1674 struct smap_psock *psock; 1714 - u32 i = *(u32 *)key; 1715 1675 bool new = false; 1716 1676 int err; 1717 - 1718 - if (unlikely(flags > BPF_EXIST)) 1719 - return -EINVAL; 1720 - 1721 - if (unlikely(i >= stab->map.max_entries)) 1722 - return -E2BIG; 1723 - 1724 - sock = READ_ONCE(stab->sock_map[i]); 1725 - if (flags == BPF_EXIST && !sock) 1726 - return -ENOENT; 1727 - else if (flags == BPF_NOEXIST && sock) 1728 - return -EEXIST; 1729 - 1730 - sock = 
skops->sk; 1731 1677 1732 1678 /* 1. If sock map has BPF programs those will be inherited by the 1733 1679 * sock being added. If the sock is already attached to BPF programs 1734 1680 * this results in an error. 1735 1681 */ 1736 - verdict = READ_ONCE(stab->bpf_verdict); 1737 - parse = READ_ONCE(stab->bpf_parse); 1738 - tx_msg = READ_ONCE(stab->bpf_tx_msg); 1682 + verdict = READ_ONCE(progs->bpf_verdict); 1683 + parse = READ_ONCE(progs->bpf_parse); 1684 + tx_msg = READ_ONCE(progs->bpf_tx_msg); 1739 1685 1740 1686 if (parse && verdict) { 1741 1687 /* bpf prog refcnt may be zero if a concurrent attach operation ··· 1729 1703 * we increment the refcnt. If this is the case abort with an 1730 1704 * error. 1731 1705 */ 1732 - verdict = bpf_prog_inc_not_zero(stab->bpf_verdict); 1706 + verdict = bpf_prog_inc_not_zero(progs->bpf_verdict); 1733 1707 if (IS_ERR(verdict)) 1734 1708 return PTR_ERR(verdict); 1735 1709 1736 - parse = bpf_prog_inc_not_zero(stab->bpf_parse); 1710 + parse = bpf_prog_inc_not_zero(progs->bpf_parse); 1737 1711 if (IS_ERR(parse)) { 1738 1712 bpf_prog_put(verdict); 1739 1713 return PTR_ERR(parse); ··· 1741 1715 } 1742 1716 1743 1717 if (tx_msg) { 1744 - tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg); 1718 + tx_msg = bpf_prog_inc_not_zero(progs->bpf_tx_msg); 1745 1719 if (IS_ERR(tx_msg)) { 1746 1720 if (verdict) 1747 1721 bpf_prog_put(verdict); ··· 1774 1748 goto out_progs; 1775 1749 } 1776 1750 } else { 1777 - psock = smap_init_psock(sock, stab); 1751 + psock = smap_init_psock(sock, map->numa_node); 1778 1752 if (IS_ERR(psock)) { 1779 1753 err = PTR_ERR(psock); 1780 1754 goto out_progs; ··· 1784 1758 new = true; 1785 1759 } 1786 1760 1787 - e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN); 1788 - if (!e) { 1789 - err = -ENOMEM; 1790 - goto out_progs; 1761 + if (map_link) { 1762 + e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN); 1763 + if (!e) { 1764 + err = -ENOMEM; 1765 + goto out_progs; 1766 + } 1791 1767 } 1792 - e->entry = &stab->sock_map[i]; 
1793 1768 1794 1769 /* 3. At this point we have a reference to a valid psock that is 1795 1770 * running. Attach any BPF programs needed. ··· 1807 1780 err = smap_init_sock(psock, sock); 1808 1781 if (err) 1809 1782 goto out_free; 1810 - smap_init_progs(psock, stab, verdict, parse); 1783 + smap_init_progs(psock, verdict, parse); 1811 1784 smap_start_sock(psock, sock); 1812 1785 } 1813 1786 ··· 1816 1789 * it with. Because we can only have a single set of programs if 1817 1790 * old_sock has a strp we can stop it. 1818 1791 */ 1819 - list_add_tail(&e->list, &psock->maps); 1820 - write_unlock_bh(&sock->sk_callback_lock); 1821 - 1822 - osock = xchg(&stab->sock_map[i], sock); 1823 - if (osock) { 1824 - struct smap_psock *opsock = smap_psock_sk(osock); 1825 - 1826 - write_lock_bh(&osock->sk_callback_lock); 1827 - smap_list_remove(opsock, &stab->sock_map[i]); 1828 - smap_release_sock(opsock, osock); 1829 - write_unlock_bh(&osock->sk_callback_lock); 1792 + if (map_link) { 1793 + e->entry = map_link; 1794 + list_add_tail(&e->list, &psock->maps); 1830 1795 } 1831 - return 0; 1796 + write_unlock_bh(&sock->sk_callback_lock); 1797 + return err; 1832 1798 out_free: 1799 + kfree(e); 1833 1800 smap_release_sock(psock, sock); 1834 1801 out_progs: 1835 1802 if (verdict) ··· 1837 1816 return err; 1838 1817 } 1839 1818 1840 - int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type) 1819 + static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops, 1820 + struct bpf_map *map, 1821 + void *key, u64 flags) 1841 1822 { 1842 1823 struct bpf_stab *stab = container_of(map, struct bpf_stab, map); 1824 + struct bpf_sock_progs *progs = &stab->progs; 1825 + struct sock *osock, *sock; 1826 + u32 i = *(u32 *)key; 1827 + int err; 1828 + 1829 + if (unlikely(flags > BPF_EXIST)) 1830 + return -EINVAL; 1831 + 1832 + if (unlikely(i >= stab->map.max_entries)) 1833 + return -E2BIG; 1834 + 1835 + sock = READ_ONCE(stab->sock_map[i]); 1836 + if (flags == BPF_EXIST && !sock) 1837 + 
return -ENOENT; 1838 + else if (flags == BPF_NOEXIST && sock) 1839 + return -EEXIST; 1840 + 1841 + sock = skops->sk; 1842 + err = __sock_map_ctx_update_elem(map, progs, sock, &stab->sock_map[i], 1843 + key); 1844 + if (err) 1845 + goto out; 1846 + 1847 + osock = xchg(&stab->sock_map[i], sock); 1848 + if (osock) { 1849 + struct smap_psock *opsock = smap_psock_sk(osock); 1850 + 1851 + write_lock_bh(&osock->sk_callback_lock); 1852 + smap_list_remove(opsock, &stab->sock_map[i], NULL); 1853 + smap_release_sock(opsock, osock); 1854 + write_unlock_bh(&osock->sk_callback_lock); 1855 + } 1856 + out: 1857 + return err; 1858 + } 1859 + 1860 + int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type) 1861 + { 1862 + struct bpf_sock_progs *progs; 1843 1863 struct bpf_prog *orig; 1844 1864 1845 - if (unlikely(map->map_type != BPF_MAP_TYPE_SOCKMAP)) 1865 + if (map->map_type == BPF_MAP_TYPE_SOCKMAP) { 1866 + struct bpf_stab *stab = container_of(map, struct bpf_stab, map); 1867 + 1868 + progs = &stab->progs; 1869 + } else if (map->map_type == BPF_MAP_TYPE_SOCKHASH) { 1870 + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 1871 + 1872 + progs = &htab->progs; 1873 + } else { 1846 1874 return -EINVAL; 1875 + } 1847 1876 1848 1877 switch (type) { 1849 1878 case BPF_SK_MSG_VERDICT: 1850 - orig = xchg(&stab->bpf_tx_msg, prog); 1879 + orig = xchg(&progs->bpf_tx_msg, prog); 1851 1880 break; 1852 1881 case BPF_SK_SKB_STREAM_PARSER: 1853 - orig = xchg(&stab->bpf_parse, prog); 1882 + orig = xchg(&progs->bpf_parse, prog); 1854 1883 break; 1855 1884 case BPF_SK_SKB_STREAM_VERDICT: 1856 - orig = xchg(&stab->bpf_verdict, prog); 1885 + orig = xchg(&progs->bpf_verdict, prog); 1857 1886 break; 1858 1887 default: 1859 1888 return -EOPNOTSUPP; ··· 1951 1880 1952 1881 static void sock_map_release(struct bpf_map *map) 1953 1882 { 1954 - struct bpf_stab *stab = container_of(map, struct bpf_stab, map); 1883 + struct bpf_sock_progs *progs; 1955 1884 struct bpf_prog *orig; 1956 
1885 1957 - orig = xchg(&stab->bpf_parse, NULL); 1886 + if (map->map_type == BPF_MAP_TYPE_SOCKMAP) { 1887 + struct bpf_stab *stab = container_of(map, struct bpf_stab, map); 1888 + 1889 + progs = &stab->progs; 1890 + } else { 1891 + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 1892 + 1893 + progs = &htab->progs; 1894 + } 1895 + 1896 + orig = xchg(&progs->bpf_parse, NULL); 1958 1897 if (orig) 1959 1898 bpf_prog_put(orig); 1960 - orig = xchg(&stab->bpf_verdict, NULL); 1899 + orig = xchg(&progs->bpf_verdict, NULL); 1961 1900 if (orig) 1962 1901 bpf_prog_put(orig); 1963 1902 1964 - orig = xchg(&stab->bpf_tx_msg, NULL); 1903 + orig = xchg(&progs->bpf_tx_msg, NULL); 1965 1904 if (orig) 1966 1905 bpf_prog_put(orig); 1906 + } 1907 + 1908 + static struct bpf_map *sock_hash_alloc(union bpf_attr *attr) 1909 + { 1910 + struct bpf_htab *htab; 1911 + int i, err; 1912 + u64 cost; 1913 + 1914 + if (!capable(CAP_NET_ADMIN)) 1915 + return ERR_PTR(-EPERM); 1916 + 1917 + /* check sanity of attributes */ 1918 + if (attr->max_entries == 0 || attr->value_size != 4 || 1919 + attr->map_flags & ~SOCK_CREATE_FLAG_MASK) 1920 + return ERR_PTR(-EINVAL); 1921 + 1922 + if (attr->key_size > MAX_BPF_STACK) 1923 + /* eBPF programs initialize keys on stack, so they cannot be 1924 + * larger than max stack size 1925 + */ 1926 + return ERR_PTR(-E2BIG); 1927 + 1928 + err = bpf_tcp_ulp_register(); 1929 + if (err && err != -EEXIST) 1930 + return ERR_PTR(err); 1931 + 1932 + htab = kzalloc(sizeof(*htab), GFP_USER); 1933 + if (!htab) 1934 + return ERR_PTR(-ENOMEM); 1935 + 1936 + bpf_map_init_from_attr(&htab->map, attr); 1937 + 1938 + htab->n_buckets = roundup_pow_of_two(htab->map.max_entries); 1939 + htab->elem_size = sizeof(struct htab_elem) + 1940 + round_up(htab->map.key_size, 8); 1941 + err = -EINVAL; 1942 + if (htab->n_buckets == 0 || 1943 + htab->n_buckets > U32_MAX / sizeof(struct bucket)) 1944 + goto free_htab; 1945 + 1946 + cost = (u64) htab->n_buckets * sizeof(struct bucket) + 
1947 + (u64) htab->elem_size * htab->map.max_entries; 1948 + 1949 + if (cost >= U32_MAX - PAGE_SIZE) 1950 + goto free_htab; 1951 + 1952 + htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; 1953 + err = bpf_map_precharge_memlock(htab->map.pages); 1954 + if (err) 1955 + goto free_htab; 1956 + 1957 + err = -ENOMEM; 1958 + htab->buckets = bpf_map_area_alloc( 1959 + htab->n_buckets * sizeof(struct bucket), 1960 + htab->map.numa_node); 1961 + if (!htab->buckets) 1962 + goto free_htab; 1963 + 1964 + for (i = 0; i < htab->n_buckets; i++) { 1965 + INIT_HLIST_HEAD(&htab->buckets[i].head); 1966 + raw_spin_lock_init(&htab->buckets[i].lock); 1967 + } 1968 + 1969 + return &htab->map; 1970 + free_htab: 1971 + kfree(htab); 1972 + return ERR_PTR(err); 1973 + } 1974 + 1975 + static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash) 1976 + { 1977 + return &htab->buckets[hash & (htab->n_buckets - 1)]; 1978 + } 1979 + 1980 + static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash) 1981 + { 1982 + return &__select_bucket(htab, hash)->head; 1983 + } 1984 + 1985 + static void sock_hash_free(struct bpf_map *map) 1986 + { 1987 + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 1988 + int i; 1989 + 1990 + synchronize_rcu(); 1991 + 1992 + /* At this point no update, lookup or delete operations can happen. 1993 + * However, be aware we can still get socket state event updates, 1994 + * and data ready callbacks that reference the psock from sk_user_data. 1995 + * Also psock worker threads are still in-flight. So smap_release_sock 1996 + * will only free the psock after cancel_sync on the worker threads 1997 + * and a grace period expires to ensure psock is really safe to remove. 
1998 + */ 1999 + rcu_read_lock(); 2000 + for (i = 0; i < htab->n_buckets; i++) { 2001 + struct hlist_head *head = select_bucket(htab, i); 2002 + struct hlist_node *n; 2003 + struct htab_elem *l; 2004 + 2005 + hlist_for_each_entry_safe(l, n, head, hash_node) { 2006 + struct sock *sock = l->sk; 2007 + struct smap_psock *psock; 2008 + 2009 + hlist_del_rcu(&l->hash_node); 2010 + write_lock_bh(&sock->sk_callback_lock); 2011 + psock = smap_psock_sk(sock); 2012 + /* This check handles a racing sock event that can get 2013 + * the sk_callback_lock before this case but after xchg 2014 + * causing the refcnt to hit zero and sock user data 2015 + * (psock) to be null and queued for garbage collection. 2016 + */ 2017 + if (likely(psock)) { 2018 + smap_list_remove(psock, NULL, l); 2019 + smap_release_sock(psock, sock); 2020 + } 2021 + write_unlock_bh(&sock->sk_callback_lock); 2022 + kfree(l); 2023 + } 2024 + } 2025 + rcu_read_unlock(); 2026 + bpf_map_area_free(htab->buckets); 2027 + kfree(htab); 2028 + } 2029 + 2030 + static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab, 2031 + void *key, u32 key_size, u32 hash, 2032 + struct sock *sk, 2033 + struct htab_elem *old_elem) 2034 + { 2035 + struct htab_elem *l_new; 2036 + 2037 + if (atomic_inc_return(&htab->count) > htab->map.max_entries) { 2038 + if (!old_elem) { 2039 + atomic_dec(&htab->count); 2040 + return ERR_PTR(-E2BIG); 2041 + } 2042 + } 2043 + l_new = kmalloc_node(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN, 2044 + htab->map.numa_node); 2045 + if (!l_new) 2046 + return ERR_PTR(-ENOMEM); 2047 + 2048 + memcpy(l_new->key, key, key_size); 2049 + l_new->sk = sk; 2050 + l_new->hash = hash; 2051 + return l_new; 2052 + } 2053 + 2054 + static struct htab_elem *lookup_elem_raw(struct hlist_head *head, 2055 + u32 hash, void *key, u32 key_size) 2056 + { 2057 + struct htab_elem *l; 2058 + 2059 + hlist_for_each_entry_rcu(l, head, hash_node) { 2060 + if (l->hash == hash && !memcmp(&l->key, key, key_size)) 2061 + return l; 
2062 + } 2063 + 2064 + return NULL; 2065 + } 2066 + 2067 + static inline u32 htab_map_hash(const void *key, u32 key_len) 2068 + { 2069 + return jhash(key, key_len, 0); 2070 + } 2071 + 2072 + static int sock_hash_get_next_key(struct bpf_map *map, 2073 + void *key, void *next_key) 2074 + { 2075 + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 2076 + struct htab_elem *l, *next_l; 2077 + struct hlist_head *h; 2078 + u32 hash, key_size; 2079 + int i = 0; 2080 + 2081 + WARN_ON_ONCE(!rcu_read_lock_held()); 2082 + 2083 + key_size = map->key_size; 2084 + if (!key) 2085 + goto find_first_elem; 2086 + hash = htab_map_hash(key, key_size); 2087 + h = select_bucket(htab, hash); 2088 + 2089 + l = lookup_elem_raw(h, hash, key, key_size); 2090 + if (!l) 2091 + goto find_first_elem; 2092 + next_l = hlist_entry_safe( 2093 + rcu_dereference_raw(hlist_next_rcu(&l->hash_node)), 2094 + struct htab_elem, hash_node); 2095 + if (next_l) { 2096 + memcpy(next_key, next_l->key, key_size); 2097 + return 0; 2098 + } 2099 + 2100 + /* no more elements in this hash list, go to the next bucket */ 2101 + i = hash & (htab->n_buckets - 1); 2102 + i++; 2103 + 2104 + find_first_elem: 2105 + /* iterate over buckets */ 2106 + for (; i < htab->n_buckets; i++) { 2107 + h = select_bucket(htab, i); 2108 + 2109 + /* pick first element in the bucket */ 2110 + next_l = hlist_entry_safe( 2111 + rcu_dereference_raw(hlist_first_rcu(h)), 2112 + struct htab_elem, hash_node); 2113 + if (next_l) { 2114 + /* if it's not empty, just return it */ 2115 + memcpy(next_key, next_l->key, key_size); 2116 + return 0; 2117 + } 2118 + } 2119 + 2120 + /* iterated over all buckets and all elements */ 2121 + return -ENOENT; 2122 + } 2123 + 2124 + static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops, 2125 + struct bpf_map *map, 2126 + void *key, u64 map_flags) 2127 + { 2128 + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 2129 + struct bpf_sock_progs *progs = &htab->progs; 2130 + 
struct htab_elem *l_new = NULL, *l_old; 2131 + struct smap_psock_map_entry *e = NULL; 2132 + struct hlist_head *head; 2133 + struct smap_psock *psock; 2134 + u32 key_size, hash; 2135 + struct sock *sock; 2136 + struct bucket *b; 2137 + int err; 2138 + 2139 + sock = skops->sk; 2140 + 2141 + if (sock->sk_type != SOCK_STREAM || 2142 + sock->sk_protocol != IPPROTO_TCP) 2143 + return -EOPNOTSUPP; 2144 + 2145 + if (unlikely(map_flags > BPF_EXIST)) 2146 + return -EINVAL; 2147 + 2148 + e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN); 2149 + if (!e) 2150 + return -ENOMEM; 2151 + 2152 + WARN_ON_ONCE(!rcu_read_lock_held()); 2153 + key_size = map->key_size; 2154 + hash = htab_map_hash(key, key_size); 2155 + b = __select_bucket(htab, hash); 2156 + head = &b->head; 2157 + 2158 + err = __sock_map_ctx_update_elem(map, progs, sock, NULL, key); 2159 + if (err) 2160 + goto err; 2161 + 2162 + /* bpf_map_update_elem() can be called in_irq() */ 2163 + raw_spin_lock_bh(&b->lock); 2164 + l_old = lookup_elem_raw(head, hash, key, key_size); 2165 + if (l_old && map_flags == BPF_NOEXIST) { 2166 + err = -EEXIST; 2167 + goto bucket_err; 2168 + } 2169 + if (!l_old && map_flags == BPF_EXIST) { 2170 + err = -ENOENT; 2171 + goto bucket_err; 2172 + } 2173 + 2174 + l_new = alloc_sock_hash_elem(htab, key, key_size, hash, sock, l_old); 2175 + if (IS_ERR(l_new)) { 2176 + err = PTR_ERR(l_new); 2177 + goto bucket_err; 2178 + } 2179 + 2180 + psock = smap_psock_sk(sock); 2181 + if (unlikely(!psock)) { 2182 + err = -EINVAL; 2183 + goto bucket_err; 2184 + } 2185 + 2186 + e->hash_link = l_new; 2187 + e->htab = container_of(map, struct bpf_htab, map); 2188 + list_add_tail(&e->list, &psock->maps); 2189 + 2190 + /* add new element to the head of the list, so that 2191 + * concurrent search will find it before old elem 2192 + */ 2193 + hlist_add_head_rcu(&l_new->hash_node, head); 2194 + if (l_old) { 2195 + psock = smap_psock_sk(l_old->sk); 2196 + 2197 + hlist_del_rcu(&l_old->hash_node); 2198 + 
smap_list_remove(psock, NULL, l_old); 2199 + smap_release_sock(psock, l_old->sk); 2200 + free_htab_elem(htab, l_old); 2201 + } 2202 + raw_spin_unlock_bh(&b->lock); 2203 + return 0; 2204 + bucket_err: 2205 + raw_spin_unlock_bh(&b->lock); 2206 + err: 2207 + kfree(e); 2208 + psock = smap_psock_sk(sock); 2209 + if (psock) 2210 + smap_release_sock(psock, sock); 2211 + return err; 2212 + } 2213 + 2214 + static int sock_hash_update_elem(struct bpf_map *map, 2215 + void *key, void *value, u64 flags) 2216 + { 2217 + struct bpf_sock_ops_kern skops; 2218 + u32 fd = *(u32 *)value; 2219 + struct socket *socket; 2220 + int err; 2221 + 2222 + socket = sockfd_lookup(fd, &err); 2223 + if (!socket) 2224 + return err; 2225 + 2226 + skops.sk = socket->sk; 2227 + if (!skops.sk) { 2228 + fput(socket->file); 2229 + return -EINVAL; 2230 + } 2231 + 2232 + err = sock_hash_ctx_update_elem(&skops, map, key, flags); 2233 + fput(socket->file); 2234 + return err; 2235 + } 2236 + 2237 + static int sock_hash_delete_elem(struct bpf_map *map, void *key) 2238 + { 2239 + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 2240 + struct hlist_head *head; 2241 + struct bucket *b; 2242 + struct htab_elem *l; 2243 + u32 hash, key_size; 2244 + int ret = -ENOENT; 2245 + 2246 + key_size = map->key_size; 2247 + hash = htab_map_hash(key, key_size); 2248 + b = __select_bucket(htab, hash); 2249 + head = &b->head; 2250 + 2251 + raw_spin_lock_bh(&b->lock); 2252 + l = lookup_elem_raw(head, hash, key, key_size); 2253 + if (l) { 2254 + struct sock *sock = l->sk; 2255 + struct smap_psock *psock; 2256 + 2257 + hlist_del_rcu(&l->hash_node); 2258 + write_lock_bh(&sock->sk_callback_lock); 2259 + psock = smap_psock_sk(sock); 2260 + /* This check handles a racing sock event that can get the 2261 + * sk_callback_lock before this case but after xchg happens 2262 + * causing the refcnt to hit zero and sock user data (psock) 2263 + * to be null and queued for garbage collection. 
2264 + */ 2265 + if (likely(psock)) { 2266 + smap_list_remove(psock, NULL, l); 2267 + smap_release_sock(psock, sock); 2268 + } 2269 + write_unlock_bh(&sock->sk_callback_lock); 2270 + free_htab_elem(htab, l); 2271 + ret = 0; 2272 + } 2273 + raw_spin_unlock_bh(&b->lock); 2274 + return ret; 2275 + } 2276 + 2277 + struct sock *__sock_hash_lookup_elem(struct bpf_map *map, void *key) 2278 + { 2279 + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 2280 + struct hlist_head *head; 2281 + struct htab_elem *l; 2282 + u32 key_size, hash; 2283 + struct bucket *b; 2284 + struct sock *sk; 2285 + 2286 + key_size = map->key_size; 2287 + hash = htab_map_hash(key, key_size); 2288 + b = __select_bucket(htab, hash); 2289 + head = &b->head; 2290 + 2291 + raw_spin_lock_bh(&b->lock); 2292 + l = lookup_elem_raw(head, hash, key, key_size); 2293 + sk = l ? l->sk : NULL; 2294 + raw_spin_unlock_bh(&b->lock); 2295 + return sk; 1967 2296 } 1968 2297 1969 2298 const struct bpf_map_ops sock_map_ops = { ··· 2376 1905 .map_release_uref = sock_map_release, 2377 1906 }; 2378 1907 1908 + const struct bpf_map_ops sock_hash_ops = { 1909 + .map_alloc = sock_hash_alloc, 1910 + .map_free = sock_hash_free, 1911 + .map_lookup_elem = sock_map_lookup, 1912 + .map_get_next_key = sock_hash_get_next_key, 1913 + .map_update_elem = sock_hash_update_elem, 1914 + .map_delete_elem = sock_hash_delete_elem, 1915 + }; 1916 + 2379 1917 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock, 2380 1918 struct bpf_map *, map, void *, key, u64, flags) 2381 1919 { ··· 2394 1914 2395 1915 const struct bpf_func_proto bpf_sock_map_update_proto = { 2396 1916 .func = bpf_sock_map_update, 1917 + .gpl_only = false, 1918 + .pkt_access = true, 1919 + .ret_type = RET_INTEGER, 1920 + .arg1_type = ARG_PTR_TO_CTX, 1921 + .arg2_type = ARG_CONST_MAP_PTR, 1922 + .arg3_type = ARG_PTR_TO_MAP_KEY, 1923 + .arg4_type = ARG_ANYTHING, 1924 + }; 1925 + 1926 + BPF_CALL_4(bpf_sock_hash_update, struct bpf_sock_ops_kern *, 
bpf_sock, 1927 + struct bpf_map *, map, void *, key, u64, flags) 1928 + { 1929 + WARN_ON_ONCE(!rcu_read_lock_held()); 1930 + return sock_hash_ctx_update_elem(bpf_sock, map, key, flags); 1931 + } 1932 + 1933 + const struct bpf_func_proto bpf_sock_hash_update_proto = { 1934 + .func = bpf_sock_hash_update, 2397 1935 .gpl_only = false, 2398 1936 .pkt_access = true, 2399 1937 .ret_type = RET_INTEGER,
+53 -6
kernel/bpf/stackmap.c
··· 11 11 #include <linux/perf_event.h> 12 12 #include <linux/elf.h> 13 13 #include <linux/pagemap.h> 14 + #include <linux/irq_work.h> 14 15 #include "percpu_freelist.h" 15 16 16 17 #define STACK_CREATE_FLAG_MASK \ ··· 32 31 u32 n_buckets; 33 32 struct stack_map_bucket *buckets[]; 34 33 }; 34 + 35 + /* irq_work to run up_read() for build_id lookup in nmi context */ 36 + struct stack_map_irq_work { 37 + struct irq_work irq_work; 38 + struct rw_semaphore *sem; 39 + }; 40 + 41 + static void do_up_read(struct irq_work *entry) 42 + { 43 + struct stack_map_irq_work *work; 44 + 45 + work = container_of(entry, struct stack_map_irq_work, irq_work); 46 + up_read(work->sem); 47 + work->sem = NULL; 48 + } 49 + 50 + static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work); 35 51 36 52 static inline bool stack_map_use_build_id(struct bpf_map *map) 37 53 { ··· 285 267 { 286 268 int i; 287 269 struct vm_area_struct *vma; 270 + bool in_nmi_ctx = in_nmi(); 271 + bool irq_work_busy = false; 272 + struct stack_map_irq_work *work; 273 + 274 + if (in_nmi_ctx) { 275 + work = this_cpu_ptr(&up_read_work); 276 + if (work->irq_work.flags & IRQ_WORK_BUSY) 277 + /* cannot queue more up_read, fallback */ 278 + irq_work_busy = true; 279 + } 288 280 289 281 /* 290 - * We cannot do up_read() in nmi context, so build_id lookup is 291 - * only supported for non-nmi events. If at some point, it is 292 - * possible to run find_vma() without taking the semaphore, we 293 - * would like to allow build_id lookup in nmi context. 282 + * We cannot do up_read() in nmi context. To do build_id lookup 283 + * in nmi context, we need to run up_read() in irq_work. We use 284 + * a percpu variable to do the irq_work. If the irq_work is 285 + * already used by another lookup, we fall back to report ips. 294 286 * 295 287 * Same fallback is used for kernel stack (!user) on a stackmap 296 288 * with build_id. 
297 289 */ 298 - if (!user || !current || !current->mm || in_nmi() || 290 + if (!user || !current || !current->mm || irq_work_busy || 299 291 down_read_trylock(&current->mm->mmap_sem) == 0) { 300 292 /* cannot access current->mm, fall back to ips */ 301 293 for (i = 0; i < trace_nr; i++) { ··· 327 299 - vma->vm_start; 328 300 id_offs[i].status = BPF_STACK_BUILD_ID_VALID; 329 301 } 330 - up_read(&current->mm->mmap_sem); 302 + 303 + if (!in_nmi_ctx) { 304 + up_read(&current->mm->mmap_sem); 305 + } else { 306 + work->sem = &current->mm->mmap_sem; 307 + irq_work_queue(&work->irq_work); 308 + } 331 309 } 332 310 333 311 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, ··· 609 575 .map_update_elem = stack_map_update_elem, 610 576 .map_delete_elem = stack_map_delete_elem, 611 577 }; 578 + 579 + static int __init stack_map_init(void) 580 + { 581 + int cpu; 582 + struct stack_map_irq_work *work; 583 + 584 + for_each_possible_cpu(cpu) { 585 + work = per_cpu_ptr(&up_read_work, cpu); 586 + init_irq_work(&work->irq_work, do_up_read); 587 + } 588 + return 0; 589 + } 590 + subsys_initcall(stack_map_init);
+39 -2
kernel/bpf/syscall.c
··· 255 255 256 256 bpf_map_uncharge_memlock(map); 257 257 security_bpf_map_free(map); 258 - btf_put(map->btf); 259 258 /* implementation dependent freeing */ 260 259 map->ops->map_free(map); 261 260 } ··· 275 276 if (atomic_dec_and_test(&map->refcnt)) { 276 277 /* bpf_map_free_id() must be called first */ 277 278 bpf_map_free_id(map, do_idr_lock); 279 + btf_put(map->btf); 278 280 INIT_WORK(&map->work, bpf_map_free_deferred); 279 281 schedule_work(&map->work); 280 282 } ··· 2011 2011 info.map_flags = map->map_flags; 2012 2012 memcpy(info.name, map->name, sizeof(map->name)); 2013 2013 2014 + if (map->btf) { 2015 + info.btf_id = btf_id(map->btf); 2016 + info.btf_key_id = map->btf_key_id; 2017 + info.btf_value_id = map->btf_value_id; 2018 + } 2019 + 2014 2020 if (bpf_map_is_dev_bound(map)) { 2015 2021 err = bpf_map_offload_info_fill(&info, map); 2016 2022 if (err) ··· 2028 2022 return -EFAULT; 2029 2023 2030 2024 return 0; 2025 + } 2026 + 2027 + static int bpf_btf_get_info_by_fd(struct btf *btf, 2028 + const union bpf_attr *attr, 2029 + union bpf_attr __user *uattr) 2030 + { 2031 + struct bpf_btf_info __user *uinfo = u64_to_user_ptr(attr->info.info); 2032 + u32 info_len = attr->info.info_len; 2033 + int err; 2034 + 2035 + err = check_uarg_tail_zero(uinfo, sizeof(*uinfo), info_len); 2036 + if (err) 2037 + return err; 2038 + 2039 + return btf_get_info_by_fd(btf, attr, uattr); 2031 2040 } 2032 2041 2033 2042 #define BPF_OBJ_GET_INFO_BY_FD_LAST_FIELD info.info ··· 2068 2047 err = bpf_map_get_info_by_fd(f.file->private_data, attr, 2069 2048 uattr); 2070 2049 else if (f.file->f_op == &btf_fops) 2071 - err = btf_get_info_by_fd(f.file->private_data, attr, uattr); 2050 + err = bpf_btf_get_info_by_fd(f.file->private_data, attr, uattr); 2072 2051 else 2073 2052 err = -EINVAL; 2074 2053 ··· 2087 2066 return -EPERM; 2088 2067 2089 2068 return btf_new_fd(attr); 2069 + } 2070 + 2071 + #define BPF_BTF_GET_FD_BY_ID_LAST_FIELD btf_id 2072 + 2073 + static int bpf_btf_get_fd_by_id(const 
union bpf_attr *attr) 2074 + { 2075 + if (CHECK_ATTR(BPF_BTF_GET_FD_BY_ID)) 2076 + return -EINVAL; 2077 + 2078 + if (!capable(CAP_SYS_ADMIN)) 2079 + return -EPERM; 2080 + 2081 + return btf_get_fd_by_id(attr->btf_id); 2090 2082 } 2091 2083 2092 2084 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) ··· 2184 2150 break; 2185 2151 case BPF_BTF_LOAD: 2186 2152 err = bpf_btf_load(&attr); 2153 + break; 2154 + case BPF_BTF_GET_FD_BY_ID: 2155 + err = bpf_btf_get_fd_by_id(&attr); 2187 2156 break; 2188 2157 default: 2189 2158 err = -EINVAL;
+13 -3
kernel/bpf/verifier.c
··· 2093 2093 func_id != BPF_FUNC_msg_redirect_map) 2094 2094 goto error; 2095 2095 break; 2096 + case BPF_MAP_TYPE_SOCKHASH: 2097 + if (func_id != BPF_FUNC_sk_redirect_hash && 2098 + func_id != BPF_FUNC_sock_hash_update && 2099 + func_id != BPF_FUNC_map_delete_elem && 2100 + func_id != BPF_FUNC_msg_redirect_hash) 2101 + goto error; 2102 + break; 2096 2103 default: 2097 2104 break; 2098 2105 } ··· 2137 2130 break; 2138 2131 case BPF_FUNC_sk_redirect_map: 2139 2132 case BPF_FUNC_msg_redirect_map: 2133 + case BPF_FUNC_sock_map_update: 2140 2134 if (map->map_type != BPF_MAP_TYPE_SOCKMAP) 2141 2135 goto error; 2142 2136 break; 2143 - case BPF_FUNC_sock_map_update: 2144 - if (map->map_type != BPF_MAP_TYPE_SOCKMAP) 2137 + case BPF_FUNC_sk_redirect_hash: 2138 + case BPF_FUNC_msg_redirect_hash: 2139 + case BPF_FUNC_sock_hash_update: 2140 + if (map->map_type != BPF_MAP_TYPE_SOCKHASH) 2145 2141 goto error; 2146 2142 break; 2147 2143 default: ··· 5225 5215 } 5226 5216 } 5227 5217 5228 - if (!ops->convert_ctx_access) 5218 + if (!ops->convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux)) 5229 5219 return 0; 5230 5220 5231 5221 insn = env->prog->insnsi + delta;
+341 -24
net/core/filter.c
··· 60 60 #include <net/xfrm.h> 61 61 #include <linux/bpf_trace.h> 62 62 #include <net/xdp_sock.h> 63 + #include <linux/inetdevice.h> 64 + #include <net/ip_fib.h> 65 + #include <net/flow.h> 66 + #include <net/arp.h> 63 67 64 68 /** 65 69 * sk_filter_trim_cap - run a packet through a socket filter ··· 2074 2070 .arg2_type = ARG_ANYTHING, 2075 2071 }; 2076 2072 2073 + BPF_CALL_4(bpf_sk_redirect_hash, struct sk_buff *, skb, 2074 + struct bpf_map *, map, void *, key, u64, flags) 2075 + { 2076 + struct tcp_skb_cb *tcb = TCP_SKB_CB(skb); 2077 + 2078 + /* If user passes invalid input drop the packet. */ 2079 + if (unlikely(flags & ~(BPF_F_INGRESS))) 2080 + return SK_DROP; 2081 + 2082 + tcb->bpf.flags = flags; 2083 + tcb->bpf.sk_redir = __sock_hash_lookup_elem(map, key); 2084 + if (!tcb->bpf.sk_redir) 2085 + return SK_DROP; 2086 + 2087 + return SK_PASS; 2088 + } 2089 + 2090 + static const struct bpf_func_proto bpf_sk_redirect_hash_proto = { 2091 + .func = bpf_sk_redirect_hash, 2092 + .gpl_only = false, 2093 + .ret_type = RET_INTEGER, 2094 + .arg1_type = ARG_PTR_TO_CTX, 2095 + .arg2_type = ARG_CONST_MAP_PTR, 2096 + .arg3_type = ARG_PTR_TO_MAP_KEY, 2097 + .arg4_type = ARG_ANYTHING, 2098 + }; 2099 + 2077 2100 BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb, 2078 2101 struct bpf_map *, map, u32, key, u64, flags) 2079 2102 { ··· 2110 2079 if (unlikely(flags & ~(BPF_F_INGRESS))) 2111 2080 return SK_DROP; 2112 2081 2113 - tcb->bpf.key = key; 2114 2082 tcb->bpf.flags = flags; 2115 - tcb->bpf.map = map; 2083 + tcb->bpf.sk_redir = __sock_map_lookup_elem(map, key); 2084 + if (!tcb->bpf.sk_redir) 2085 + return SK_DROP; 2116 2086 2117 2087 return SK_PASS; 2118 2088 } ··· 2121 2089 struct sock *do_sk_redirect_map(struct sk_buff *skb) 2122 2090 { 2123 2091 struct tcp_skb_cb *tcb = TCP_SKB_CB(skb); 2124 - struct sock *sk = NULL; 2125 2092 2126 - if (tcb->bpf.map) { 2127 - sk = __sock_map_lookup_elem(tcb->bpf.map, tcb->bpf.key); 2128 - 2129 - tcb->bpf.key = 0; 2130 - tcb->bpf.map = 
NULL; 2131 - } 2132 - 2133 - return sk; 2093 + return tcb->bpf.sk_redir; 2134 2094 } 2135 2095 2136 2096 static const struct bpf_func_proto bpf_sk_redirect_map_proto = { ··· 2135 2111 .arg4_type = ARG_ANYTHING, 2136 2112 }; 2137 2113 2114 + BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg_buff *, msg, 2115 + struct bpf_map *, map, void *, key, u64, flags) 2116 + { 2117 + /* If user passes invalid input drop the packet. */ 2118 + if (unlikely(flags & ~(BPF_F_INGRESS))) 2119 + return SK_DROP; 2120 + 2121 + msg->flags = flags; 2122 + msg->sk_redir = __sock_hash_lookup_elem(map, key); 2123 + if (!msg->sk_redir) 2124 + return SK_DROP; 2125 + 2126 + return SK_PASS; 2127 + } 2128 + 2129 + static const struct bpf_func_proto bpf_msg_redirect_hash_proto = { 2130 + .func = bpf_msg_redirect_hash, 2131 + .gpl_only = false, 2132 + .ret_type = RET_INTEGER, 2133 + .arg1_type = ARG_PTR_TO_CTX, 2134 + .arg2_type = ARG_CONST_MAP_PTR, 2135 + .arg3_type = ARG_PTR_TO_MAP_KEY, 2136 + .arg4_type = ARG_ANYTHING, 2137 + }; 2138 + 2138 2139 BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg, 2139 2140 struct bpf_map *, map, u32, key, u64, flags) 2140 2141 { ··· 2167 2118 if (unlikely(flags & ~(BPF_F_INGRESS))) 2168 2119 return SK_DROP; 2169 2120 2170 - msg->key = key; 2171 2121 msg->flags = flags; 2172 - msg->map = map; 2122 + msg->sk_redir = __sock_map_lookup_elem(map, key); 2123 + if (!msg->sk_redir) 2124 + return SK_DROP; 2173 2125 2174 2126 return SK_PASS; 2175 2127 } 2176 2128 2177 2129 struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) 2178 2130 { 2179 - struct sock *sk = NULL; 2180 - 2181 - if (msg->map) { 2182 - sk = __sock_map_lookup_elem(msg->map, msg->key); 2183 - 2184 - msg->key = 0; 2185 - msg->map = NULL; 2186 - } 2187 - 2188 - return sk; 2131 + return msg->sk_redir; 2189 2132 } 2190 2133 2191 2134 static const struct bpf_func_proto bpf_msg_redirect_map_proto = { ··· 4073 4032 }; 4074 4033 #endif 4075 4034 4035 + #if IS_ENABLED(CONFIG_INET) || 
IS_ENABLED(CONFIG_IPV6) 4036 + static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, 4037 + const struct neighbour *neigh, 4038 + const struct net_device *dev) 4039 + { 4040 + memcpy(params->dmac, neigh->ha, ETH_ALEN); 4041 + memcpy(params->smac, dev->dev_addr, ETH_ALEN); 4042 + params->h_vlan_TCI = 0; 4043 + params->h_vlan_proto = 0; 4044 + 4045 + return dev->ifindex; 4046 + } 4047 + #endif 4048 + 4049 + #if IS_ENABLED(CONFIG_INET) 4050 + static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, 4051 + u32 flags) 4052 + { 4053 + struct in_device *in_dev; 4054 + struct neighbour *neigh; 4055 + struct net_device *dev; 4056 + struct fib_result res; 4057 + struct fib_nh *nh; 4058 + struct flowi4 fl4; 4059 + int err; 4060 + 4061 + dev = dev_get_by_index_rcu(net, params->ifindex); 4062 + if (unlikely(!dev)) 4063 + return -ENODEV; 4064 + 4065 + /* verify forwarding is enabled on this interface */ 4066 + in_dev = __in_dev_get_rcu(dev); 4067 + if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev))) 4068 + return 0; 4069 + 4070 + if (flags & BPF_FIB_LOOKUP_OUTPUT) { 4071 + fl4.flowi4_iif = 1; 4072 + fl4.flowi4_oif = params->ifindex; 4073 + } else { 4074 + fl4.flowi4_iif = params->ifindex; 4075 + fl4.flowi4_oif = 0; 4076 + } 4077 + fl4.flowi4_tos = params->tos & IPTOS_RT_MASK; 4078 + fl4.flowi4_scope = RT_SCOPE_UNIVERSE; 4079 + fl4.flowi4_flags = 0; 4080 + 4081 + fl4.flowi4_proto = params->l4_protocol; 4082 + fl4.daddr = params->ipv4_dst; 4083 + fl4.saddr = params->ipv4_src; 4084 + fl4.fl4_sport = params->sport; 4085 + fl4.fl4_dport = params->dport; 4086 + 4087 + if (flags & BPF_FIB_LOOKUP_DIRECT) { 4088 + u32 tbid = l3mdev_fib_table_rcu(dev) ? 
: RT_TABLE_MAIN; 4089 + struct fib_table *tb; 4090 + 4091 + tb = fib_get_table(net, tbid); 4092 + if (unlikely(!tb)) 4093 + return 0; 4094 + 4095 + err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF); 4096 + } else { 4097 + fl4.flowi4_mark = 0; 4098 + fl4.flowi4_secid = 0; 4099 + fl4.flowi4_tun_key.tun_id = 0; 4100 + fl4.flowi4_uid = sock_net_uid(net, NULL); 4101 + 4102 + err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF); 4103 + } 4104 + 4105 + if (err || res.type != RTN_UNICAST) 4106 + return 0; 4107 + 4108 + if (res.fi->fib_nhs > 1) 4109 + fib_select_path(net, &res, &fl4, NULL); 4110 + 4111 + nh = &res.fi->fib_nh[res.nh_sel]; 4112 + 4113 + /* do not handle lwt encaps right now */ 4114 + if (nh->nh_lwtstate) 4115 + return 0; 4116 + 4117 + dev = nh->nh_dev; 4118 + if (unlikely(!dev)) 4119 + return 0; 4120 + 4121 + if (nh->nh_gw) 4122 + params->ipv4_dst = nh->nh_gw; 4123 + 4124 + params->rt_metric = res.fi->fib_priority; 4125 + 4126 + /* xdp and cls_bpf programs are run in RCU-bh so 4127 + * rcu_read_lock_bh is not needed here 4128 + */ 4129 + neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst); 4130 + if (neigh) 4131 + return bpf_fib_set_fwd_params(params, neigh, dev); 4132 + 4133 + return 0; 4134 + } 4135 + #endif 4136 + 4137 + #if IS_ENABLED(CONFIG_IPV6) 4138 + static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, 4139 + u32 flags) 4140 + { 4141 + struct in6_addr *src = (struct in6_addr *) params->ipv6_src; 4142 + struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst; 4143 + struct neighbour *neigh; 4144 + struct net_device *dev; 4145 + struct inet6_dev *idev; 4146 + struct fib6_info *f6i; 4147 + struct flowi6 fl6; 4148 + int strict = 0; 4149 + int oif; 4150 + 4151 + /* link local addresses are never forwarded */ 4152 + if (rt6_need_strict(dst) || rt6_need_strict(src)) 4153 + return 0; 4154 + 4155 + dev = dev_get_by_index_rcu(net, params->ifindex); 4156 + if (unlikely(!dev)) 4157 + return -ENODEV; 4158 + 
4159 +	idev = __in6_dev_get_safely(dev);
4160 +	if (unlikely(!idev || !net->ipv6.devconf_all->forwarding))
4161 +		return 0;
4162 +
4163 +	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
4164 +		fl6.flowi6_iif = 1;
4165 +		oif = fl6.flowi6_oif = params->ifindex;
4166 +	} else {
4167 +		oif = fl6.flowi6_iif = params->ifindex;
4168 +		fl6.flowi6_oif = 0;
4169 +		strict = RT6_LOOKUP_F_HAS_SADDR;
4170 +	}
4171 +	fl6.flowlabel = params->flowlabel;
4172 +	fl6.flowi6_scope = 0;
4173 +	fl6.flowi6_flags = 0;
4174 +	fl6.mp_hash = 0;
4175 +
4176 +	fl6.flowi6_proto = params->l4_protocol;
4177 +	fl6.daddr = *dst;
4178 +	fl6.saddr = *src;
4179 +	fl6.fl6_sport = params->sport;
4180 +	fl6.fl6_dport = params->dport;
4181 +
4182 +	if (flags & BPF_FIB_LOOKUP_DIRECT) {
4183 +		u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
4184 +		struct fib6_table *tb;
4185 +
4186 +		tb = ipv6_stub->fib6_get_table(net, tbid);
4187 +		if (unlikely(!tb))
4188 +			return 0;
4189 +
4190 +		f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
4191 +	} else {
4192 +		fl6.flowi6_mark = 0;
4193 +		fl6.flowi6_secid = 0;
4194 +		fl6.flowi6_tun_key.tun_id = 0;
4195 +		fl6.flowi6_uid = sock_net_uid(net, NULL);
4196 +
4197 +		f6i = ipv6_stub->fib6_lookup(net, oif, &fl6, strict);
4198 +	}
4199 +
4200 +	if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
4201 +		return 0;
4202 +
4203 +	if (unlikely(f6i->fib6_flags & RTF_REJECT ||
4204 +		     f6i->fib6_type != RTN_UNICAST))
4205 +		return 0;
4206 +
4207 +	if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
4208 +		f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
4209 +						       fl6.flowi6_oif, NULL,
4210 +						       strict);
4211 +
4212 +	if (f6i->fib6_nh.nh_lwtstate)
4213 +		return 0;
4214 +
4215 +	if (f6i->fib6_flags & RTF_GATEWAY)
4216 +		*dst = f6i->fib6_nh.nh_gw;
4217 +
4218 +	dev = f6i->fib6_nh.nh_dev;
4219 +	params->rt_metric = f6i->fib6_metric;
4220 +
4221 +	/* xdp and cls_bpf programs are run in RCU-bh so rcu_read_lock_bh is
4222 +	 * not needed here. Can not use __ipv6_neigh_lookup_noref here
4223 +	 * because we need to get nd_tbl via the stub
4224 +	 */
4225 +	neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
4226 +				      ndisc_hashfn, dst, dev);
4227 +	if (neigh)
4228 +		return bpf_fib_set_fwd_params(params, neigh, dev);
4229 +
4230 +	return 0;
4231 +}
4232 +#endif
4233 +
4234 +BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
4235 +	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
4236 +{
4237 +	if (plen < sizeof(*params))
4238 +		return -EINVAL;
4239 +
4240 +	switch (params->family) {
4241 +#if IS_ENABLED(CONFIG_INET)
4242 +	case AF_INET:
4243 +		return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
4244 +					   flags);
4245 +#endif
4246 +#if IS_ENABLED(CONFIG_IPV6)
4247 +	case AF_INET6:
4248 +		return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
4249 +					   flags);
4250 +#endif
4251 +	}
4252 +	return 0;
4253 +}
4254 +
4255 +static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
4256 +	.func		= bpf_xdp_fib_lookup,
4257 +	.gpl_only	= true,
4258 +	.ret_type	= RET_INTEGER,
4259 +	.arg1_type	= ARG_PTR_TO_CTX,
4260 +	.arg2_type	= ARG_PTR_TO_MEM,
4261 +	.arg3_type	= ARG_CONST_SIZE,
4262 +	.arg4_type	= ARG_ANYTHING,
4263 +};
4264 +
4265 +BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
4266 +	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
4267 +{
4268 +	if (plen < sizeof(*params))
4269 +		return -EINVAL;
4270 +
4271 +	switch (params->family) {
4272 +#if IS_ENABLED(CONFIG_INET)
4273 +	case AF_INET:
4274 +		return bpf_ipv4_fib_lookup(dev_net(skb->dev), params, flags);
4275 +#endif
4276 +#if IS_ENABLED(CONFIG_IPV6)
4277 +	case AF_INET6:
4278 +		return bpf_ipv6_fib_lookup(dev_net(skb->dev), params, flags);
4279 +#endif
4280 +	}
4281 +	return -ENOTSUPP;
4282 +}
4283 +
4284 +static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
4285 +	.func		= bpf_skb_fib_lookup,
4286 +	.gpl_only	= true,
4287 +	.ret_type	= RET_INTEGER,
4288 +	.arg1_type	= ARG_PTR_TO_CTX,
4289 +	.arg2_type	= ARG_PTR_TO_MEM,
4290 +	.arg3_type	= ARG_CONST_SIZE,
4291 +	.arg4_type	= ARG_ANYTHING,
4292 +};
4293 +
4076 4294 static const struct bpf_func_proto *
4077 4295 bpf_base_func_proto(enum bpf_func_id func_id)
4078 4296 {
···
4481 4181 	case BPF_FUNC_skb_get_xfrm_state:
4482 4182 		return &bpf_skb_get_xfrm_state_proto;
4483 4183 #endif
4184 +	case BPF_FUNC_fib_lookup:
4185 +		return &bpf_skb_fib_lookup_proto;
4484 4186 	default:
4485 4187 		return bpf_base_func_proto(func_id);
4486 4188 	}
···
4508 4206 		return &bpf_xdp_redirect_map_proto;
4509 4207 	case BPF_FUNC_xdp_adjust_tail:
4510 4208 		return &bpf_xdp_adjust_tail_proto;
4209 +	case BPF_FUNC_fib_lookup:
4210 +		return &bpf_xdp_fib_lookup_proto;
4511 4211 	default:
4512 4212 		return bpf_base_func_proto(func_id);
4513 4213 	}
···
4554 4250 		return &bpf_sock_ops_cb_flags_set_proto;
4555 4251 	case BPF_FUNC_sock_map_update:
4556 4252 		return &bpf_sock_map_update_proto;
4253 +	case BPF_FUNC_sock_hash_update:
4254 +		return &bpf_sock_hash_update_proto;
4557 4255 	default:
4558 4256 		return bpf_base_func_proto(func_id);
4559 4257 	}
···
4567 4261 	switch (func_id) {
4568 4262 	case BPF_FUNC_msg_redirect_map:
4569 4263 		return &bpf_msg_redirect_map_proto;
4264 +	case BPF_FUNC_msg_redirect_hash:
4265 +		return &bpf_msg_redirect_hash_proto;
4570 4266 	case BPF_FUNC_msg_apply_bytes:
4571 4267 		return &bpf_msg_apply_bytes_proto;
4572 4268 	case BPF_FUNC_msg_cork_bytes:
···
4600 4292 		return &bpf_get_socket_uid_proto;
4601 4293 	case BPF_FUNC_sk_redirect_map:
4602 4294 		return &bpf_sk_redirect_map_proto;
4295 +	case BPF_FUNC_sk_redirect_hash:
4296 +		return &bpf_sk_redirect_hash_proto;
4603 4297 	default:
4604 4298 		return bpf_base_func_proto(func_id);
4605 4299 	}
···
4955 4645 				const struct bpf_prog *prog,
4956 4646 				struct bpf_insn_access_aux *info)
4957 4647 {
4958      -	if (type == BPF_WRITE)
     4648 +	if (type == BPF_WRITE) {
     4649 +		if (bpf_prog_is_dev_bound(prog->aux)) {
     4650 +			switch (off) {
     4651 +			case offsetof(struct xdp_md, rx_queue_index):
     4652 +				return __is_valid_xdp_access(off, size);
     4653 +			}
     4654 +		}
4959 4655 		return false;
     4656 +	}
4960 4657 
4961 4658 	switch (off) {
4962 4659 	case offsetof(struct xdp_md, data):
+32 -1
net/ipv6/addrconf_core.c
···
134 134 	return -EAFNOSUPPORT;
135 135 }
136 136 
    137 +static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id)
    138 +{
    139 +	return NULL;
    140 +}
    141 +
    142 +static struct fib6_info *
    143 +eafnosupport_fib6_table_lookup(struct net *net, struct fib6_table *table,
    144 +			       int oif, struct flowi6 *fl6, int flags)
    145 +{
    146 +	return NULL;
    147 +}
    148 +
    149 +static struct fib6_info *
    150 +eafnosupport_fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
    151 +			 int flags)
    152 +{
    153 +	return NULL;
    154 +}
    155 +
    156 +static struct fib6_info *
    157 +eafnosupport_fib6_multipath_select(const struct net *net, struct fib6_info *f6i,
    158 +				   struct flowi6 *fl6, int oif,
    159 +				   const struct sk_buff *skb, int strict)
    160 +{
    161 +	return f6i;
    162 +}
    163 +
137 164 const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
138     -	.ipv6_dst_lookup = eafnosupport_ipv6_dst_lookup,
    165 +	.ipv6_dst_lookup   = eafnosupport_ipv6_dst_lookup,
    166 +	.fib6_get_table    = eafnosupport_fib6_get_table,
    167 +	.fib6_table_lookup = eafnosupport_fib6_table_lookup,
    168 +	.fib6_lookup       = eafnosupport_fib6_lookup,
    169 +	.fib6_multipath_select = eafnosupport_fib6_multipath_select,
139 170 };
140 171 EXPORT_SYMBOL_GPL(ipv6_stub);
141 172 
+5 -1
net/ipv6/af_inet6.c
···
889 889 static const struct ipv6_stub ipv6_stub_impl = {
890 890 	.ipv6_sock_mc_join = ipv6_sock_mc_join,
891 891 	.ipv6_sock_mc_drop = ipv6_sock_mc_drop,
892     -	.ipv6_dst_lookup = ip6_dst_lookup,
    892 +	.ipv6_dst_lookup   = ip6_dst_lookup,
    893 +	.fib6_get_table    = fib6_get_table,
    894 +	.fib6_table_lookup = fib6_table_lookup,
    895 +	.fib6_lookup       = fib6_lookup,
    896 +	.fib6_multipath_select = fib6_multipath_select,
893 897 	.udpv6_encap_enable = udpv6_encap_enable,
894 898 	.ndisc_send_na = ndisc_send_na,
895 899 	.nd_tbl = &nd_tbl,
+113 -21
net/ipv6/fib6_rules.c
···
 60  60 	return fib_rules_seq_read(net, AF_INET6);
 61  61 }
 62  62 
     63 +/* called with rcu lock held; no reference taken on fib6_info */
     64 +struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
     65 +			      int flags)
     66 +{
     67 +	struct fib6_info *f6i;
     68 +	int err;
     69 +
     70 +	if (net->ipv6.fib6_has_custom_rules) {
     71 +		struct fib_lookup_arg arg = {
     72 +			.lookup_ptr = fib6_table_lookup,
     73 +			.lookup_data = &oif,
     74 +			.flags = FIB_LOOKUP_NOREF,
     75 +		};
     76 +
     77 +		l3mdev_update_flow(net, flowi6_to_flowi(fl6));
     78 +
     79 +		err = fib_rules_lookup(net->ipv6.fib6_rules_ops,
     80 +				       flowi6_to_flowi(fl6), flags, &arg);
     81 +		if (err)
     82 +			return ERR_PTR(err);
     83 +
     84 +		f6i = arg.result ? : net->ipv6.fib6_null_entry;
     85 +	} else {
     86 +		f6i = fib6_table_lookup(net, net->ipv6.fib6_local_tbl,
     87 +					oif, fl6, flags);
     88 +		if (!f6i || f6i == net->ipv6.fib6_null_entry)
     89 +			f6i = fib6_table_lookup(net, net->ipv6.fib6_main_tbl,
     90 +						oif, fl6, flags);
     91 +	}
     92 +
     93 +	return f6i;
     94 +}
     95 +
 63  96 struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
 64  97 				   const struct sk_buff *skb,
 65  98 				   int flags, pol_lookup_t lookup)
···
129  96 	return &net->ipv6.ip6_null_entry->dst;
130  97 }
131  98 
132     -static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
133     -			    int flags, struct fib_lookup_arg *arg)
     99 +static int fib6_rule_saddr(struct net *net, struct fib_rule *rule, int flags,
    100 +			   struct flowi6 *flp6, const struct net_device *dev)
    101 +{
    102 +	struct fib6_rule *r = (struct fib6_rule *)rule;
    103 +
    104 +	/* If we need to find a source address for this traffic,
    105 +	 * we check the result if it meets requirement of the rule.
    106 +	 */
    107 +	if ((rule->flags & FIB_RULE_FIND_SADDR) &&
    108 +	    r->src.plen && !(flags & RT6_LOOKUP_F_HAS_SADDR)) {
    109 +		struct in6_addr saddr;
    110 +
    111 +		if (ipv6_dev_get_saddr(net, dev, &flp6->daddr,
    112 +				       rt6_flags2srcprefs(flags), &saddr))
    113 +			return -EAGAIN;
    114 +
    115 +		if (!ipv6_prefix_equal(&saddr, &r->src.addr, r->src.plen))
    116 +			return -EAGAIN;
    117 +
    118 +		flp6->saddr = saddr;
    119 +	}
    120 +
    121 +	return 0;
    122 +}
    123 +
    124 +static int fib6_rule_action_alt(struct fib_rule *rule, struct flowi *flp,
    125 +				int flags, struct fib_lookup_arg *arg)
    126 +{
    127 +	struct flowi6 *flp6 = &flp->u.ip6;
    128 +	struct net *net = rule->fr_net;
    129 +	struct fib6_table *table;
    130 +	struct fib6_info *f6i;
    131 +	int err = -EAGAIN, *oif;
    132 +	u32 tb_id;
    133 +
    134 +	switch (rule->action) {
    135 +	case FR_ACT_TO_TBL:
    136 +		break;
    137 +	case FR_ACT_UNREACHABLE:
    138 +		return -ENETUNREACH;
    139 +	case FR_ACT_PROHIBIT:
    140 +		return -EACCES;
    141 +	case FR_ACT_BLACKHOLE:
    142 +	default:
    143 +		return -EINVAL;
    144 +	}
    145 +
    146 +	tb_id = fib_rule_get_table(rule, arg);
    147 +	table = fib6_get_table(net, tb_id);
    148 +	if (!table)
    149 +		return -EAGAIN;
    150 +
    151 +	oif = (int *)arg->lookup_data;
    152 +	f6i = fib6_table_lookup(net, table, *oif, flp6, flags);
    153 +	if (f6i != net->ipv6.fib6_null_entry) {
    154 +		err = fib6_rule_saddr(net, rule, flags, flp6,
    155 +				      fib6_info_nh_dev(f6i));
    156 +
    157 +		if (likely(!err))
    158 +			arg->result = f6i;
    159 +	}
    160 +
    161 +	return err;
    162 +}
    163 +
    164 +static int __fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
    165 +			      int flags, struct fib_lookup_arg *arg)
134 166 {
135 167 	struct flowi6 *flp6 = &flp->u.ip6;
136 168 	struct rt6_info *rt = NULL;
···
232 134 
233 135 	rt = lookup(net, table, flp6, arg->lookup_data, flags);
234 136 	if (rt != net->ipv6.ip6_null_entry) {
235     -		struct fib6_rule *r = (struct fib6_rule *)rule;
    137 +		err = fib6_rule_saddr(net, rule, flags, flp6,
    138 +				      ip6_dst_idev(&rt->dst)->dev);
236     -
237     -		/*
238     -		 * If we need to find a source address for this traffic,
239     -		 * we check the result if it meets requirement of the rule.
240     -		 */
241     -		if ((rule->flags & FIB_RULE_FIND_SADDR) &&
242     -		    r->src.plen && !(flags & RT6_LOOKUP_F_HAS_SADDR)) {
243     -			struct in6_addr saddr;
    139 +
    140 +		if (err == -EAGAIN)
    141 +			goto again;
244     -
245     -			if (ipv6_dev_get_saddr(net,
246     -					       ip6_dst_idev(&rt->dst)->dev,
247     -					       &flp6->daddr,
248     -					       rt6_flags2srcprefs(flags),
249     -					       &saddr))
250     -				goto again;
251     -			if (!ipv6_prefix_equal(&saddr, &r->src.addr,
252     -					       r->src.plen))
253     -				goto again;
254     -			flp6->saddr = saddr;
255     -		}
    142 +
256 143 		err = rt->dst.error;
257 144 		if (err != -EAGAIN)
258 145 			goto out;
···
253 170 out:
254 171 	arg->result = rt;
255 172 	return err;
    173 +}
    174 +
    175 +static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
    176 +			    int flags, struct fib_lookup_arg *arg)
    177 +{
    178 +	if (arg->lookup_ptr == fib6_table_lookup)
    179 +		return fib6_rule_action_alt(rule, flp, flags, arg);
    180 +
    181 +	return __fib6_rule_action(rule, flp, flags, arg);
257 182 }
258 183 
259 184 static bool fib6_rule_suppress(struct fib_rule *rule, struct fib_lookup_arg *arg)
+15 -6
net/ipv6/ip6_fib.c
···
 354  354 	return &rt->dst;
 355  355 }
 356  356 
      357 +/* called with rcu lock held; no reference taken on fib6_info */
      358 +struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
      359 +			      int flags)
      360 +{
      361 +	return fib6_table_lookup(net, net->ipv6.fib6_main_tbl, oif, fl6, flags);
      362 +}
      363 +
 357  364 static void __net_init fib6_tables_init(struct net *net)
 358  365 {
 359  366 	fib6_link_table(net, net->ipv6.fib6_main_tbl);
···
1361 1354 	const struct in6_addr	*addr;		/* search key */
1362 1355 };
1363 1356 
1364      -static struct fib6_node *fib6_lookup_1(struct fib6_node *root,
1365      -				       struct lookup_args *args)
     1357 +static struct fib6_node *fib6_node_lookup_1(struct fib6_node *root,
     1358 +					    struct lookup_args *args)
1366 1359 {
1367 1360 	struct fib6_node *fn;
1368 1361 	__be32 dir;
···
1407 1400 #ifdef CONFIG_IPV6_SUBTREES
1408 1401 		if (subtree) {
1409 1402 			struct fib6_node *sfn;
1410      -			sfn = fib6_lookup_1(subtree, args + 1);
     1403 +			sfn = fib6_node_lookup_1(subtree,
     1404 +						 args + 1);
1411 1405 			if (!sfn)
1412 1406 				goto backtrack;
1413 1407 			fn = sfn;
···
1430 1422 
1431 1423 /* called with rcu_read_lock() held
1432 1424  */
1433      -struct fib6_node *fib6_lookup(struct fib6_node *root, const struct in6_addr *daddr,
1434      -			      const struct in6_addr *saddr)
     1425 +struct fib6_node *fib6_node_lookup(struct fib6_node *root,
     1426 +				   const struct in6_addr *daddr,
     1427 +				   const struct in6_addr *saddr)
1435 1428 {
1436 1429 	struct fib6_node *fn;
1437 1430 	struct lookup_args args[] = {
···
1451 1442 		}
1452 1443 	};
1453 1444 
1454      -	fn = fib6_lookup_1(root, daddr ? args : args + 1);
     1445 +	fn = fib6_node_lookup_1(root, daddr ? args : args + 1);
1455 1446 	if (!fn || fn->fn_flags & RTN_TL_ROOT)
1456 1447 		fn = root;
1457 1448 
+43 -33
net/ipv6/route.c
···
 419  419 	return false;
 420  420 }
 421  421 
 422      -static struct fib6_info *rt6_multipath_select(const struct net *net,
 423      -					      struct fib6_info *match,
 424      -					      struct flowi6 *fl6, int oif,
 425      -					      const struct sk_buff *skb,
 426      -					      int strict)
      422 +struct fib6_info *fib6_multipath_select(const struct net *net,
      423 +					struct fib6_info *match,
      424 +					struct flowi6 *fl6, int oif,
      425 +					const struct sk_buff *skb,
      426 +					int strict)
 427  427 {
 428  428 	struct fib6_info *sibling, *next_sibling;
 429  429 
···
1006 1006 		pn = rcu_dereference(fn->parent);
1007 1007 		sn = FIB6_SUBTREE(pn);
1008 1008 		if (sn && sn != fn)
1009      -			fn = fib6_lookup(sn, NULL, saddr);
     1009 +			fn = fib6_node_lookup(sn, NULL, saddr);
1010 1010 		else
1011 1011 			fn = pn;
1012 1012 		if (fn->fn_flags & RTN_RTINFO)
···
1059 1059 		flags &= ~RT6_LOOKUP_F_IFACE;
1060 1060 
1061 1061 	rcu_read_lock();
1062      -	fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
     1062 +	fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
1063 1063 restart:
1064 1064 	f6i = rcu_dereference(fn->leaf);
1065 1065 	if (!f6i) {
···
1068 1068 		f6i = rt6_device_match(net, f6i, &fl6->saddr,
1069 1069 				       fl6->flowi6_oif, flags);
1070 1070 		if (f6i->fib6_nsiblings && fl6->flowi6_oif == 0)
1071      -			f6i = rt6_multipath_select(net, f6i, fl6,
1072      -						   fl6->flowi6_oif, skb, flags);
     1071 +			f6i = fib6_multipath_select(net, f6i, fl6,
     1072 +						    fl6->flowi6_oif, skb,
     1073 +						    flags);
1073 1074 	}
1074 1075 	if (f6i == net->ipv6.fib6_null_entry) {
1075 1076 		fn = fib6_backtrack(fn, &fl6->saddr);
1076 1077 		if (fn)
1077 1078 			goto restart;
1078 1079 	}
     1080 +
     1081 +	trace_fib6_table_lookup(net, f6i, table, fl6);
1079 1082 
1080 1083 	/* Search through exception table */
1081 1084 	rt = rt6_find_cached_rt(f6i, &fl6->daddr, &fl6->saddr);
···
1097 1094 	}
1098 1095 
1099 1096 	rcu_read_unlock();
1100      -
1101      -	trace_fib6_table_lookup(net, rt, table, fl6);
1102 1097 
1103 1098 	return rt;
1104 1099 }
···
1800 1799 	rcu_read_unlock_bh();
1801 1800 }
1802 1801 
1803      -struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
1804      -			       int oif, struct flowi6 *fl6,
1805      -			       const struct sk_buff *skb, int flags)
     1802 +/* must be called with rcu lock held */
     1803 +struct fib6_info *fib6_table_lookup(struct net *net, struct fib6_table *table,
     1804 +				    int oif, struct flowi6 *fl6, int strict)
1806 1805 {
1807 1806 	struct fib6_node *fn, *saved_fn;
1808 1807 	struct fib6_info *f6i;
1809      -	struct rt6_info *rt;
1810      -	int strict = 0;
1811 1808 
1812      -	strict |= flags & RT6_LOOKUP_F_IFACE;
1813      -	strict |= flags & RT6_LOOKUP_F_IGNORE_LINKSTATE;
1814      -	if (net->ipv6.devconf_all->forwarding == 0)
1815      -		strict |= RT6_LOOKUP_F_REACHABLE;
1816      -
1817      -	rcu_read_lock();
1818      -
1819      -	fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
     1809 +	fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
1820 1810 	saved_fn = fn;
1821 1811 
1822 1812 	if (fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF)
···
1815 1823 
1816 1824 redo_rt6_select:
1817 1825 	f6i = rt6_select(net, fn, oif, strict);
1818      -	if (f6i->fib6_nsiblings)
1819      -		f6i = rt6_multipath_select(net, f6i, fl6, oif, skb, strict);
1820 1826 	if (f6i == net->ipv6.fib6_null_entry) {
1821 1827 		fn = fib6_backtrack(fn, &fl6->saddr);
1822 1828 		if (fn)
···
1827 1837 		}
1828 1838 	}
1829 1839 
     1840 +	trace_fib6_table_lookup(net, f6i, table, fl6);
     1841 +
     1842 +	return f6i;
     1843 +}
     1844 +
     1845 +struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
     1846 +			       int oif, struct flowi6 *fl6,
     1847 +			       const struct sk_buff *skb, int flags)
     1848 +{
     1849 +	struct fib6_info *f6i;
     1850 +	struct rt6_info *rt;
     1851 +	int strict = 0;
     1852 +
     1853 +	strict |= flags & RT6_LOOKUP_F_IFACE;
     1854 +	strict |= flags & RT6_LOOKUP_F_IGNORE_LINKSTATE;
     1855 +	if (net->ipv6.devconf_all->forwarding == 0)
     1856 +		strict |= RT6_LOOKUP_F_REACHABLE;
     1857 +
     1858 +	rcu_read_lock();
     1859 +
     1860 +	f6i = fib6_table_lookup(net, table, oif, fl6, strict);
     1861 +	if (f6i->fib6_nsiblings)
     1862 +		f6i = fib6_multipath_select(net, f6i, fl6, oif, skb, strict);
     1863 +
1830 1864 	if (f6i == net->ipv6.fib6_null_entry) {
1831 1865 		rt = net->ipv6.ip6_null_entry;
1832 1866 		rcu_read_unlock();
1833 1867 		dst_hold(&rt->dst);
1834      -		trace_fib6_table_lookup(net, rt, table, fl6);
1835 1868 		return rt;
1836 1869 	}
1837 1870 
···
1865 1852 		dst_use_noref(&rt->dst, jiffies);
1866 1853 
1867 1854 		rcu_read_unlock();
1868      -		trace_fib6_table_lookup(net, rt, table, fl6);
1869 1855 		return rt;
1870 1856 	} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
1871 1857 			    !(f6i->fib6_flags & RTF_GATEWAY))) {
···
1890 1878 			dst_hold(&uncached_rt->dst);
1891 1879 		}
1892 1880 
1893      -		trace_fib6_table_lookup(net, uncached_rt, table, fl6);
1894 1881 		return uncached_rt;
1895      -
1896 1882 	} else {
1897 1883 		/* Get a percpu copy */
1898 1884 
···
1904 1894 
1905 1895 		local_bh_enable();
1906 1896 		rcu_read_unlock();
1907      -		trace_fib6_table_lookup(net, pcpu_rt, table, fl6);
     1897 +
1908 1898 		return pcpu_rt;
1909 1899 	}
1910 1900 }
···
2435 2425 	 */
2436 2426 
2437 2427 	rcu_read_lock();
2438      -	fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
     2428 +	fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
2439 2429 restart:
2440 2430 	for_each_fib6_node_rt_rcu(fn) {
2441 2431 		if (rt->fib6_nh.nh_flags & RTNH_F_DEAD)
···
2489 2479 
2490 2480 	rcu_read_unlock();
2491 2481 
2492      -	trace_fib6_table_lookup(net, ret, table, fl6);
     2482 +	trace_fib6_table_lookup(net, rt, table, fl6);
2493 2483 	return ret;
2494 2484 };
+1 -1
net/xdp/xdp_umem.c
···
209 209 	if ((addr + size) < addr)
210 210 		return -EINVAL;
211 211 
212     -	nframes = size / frame_size;
    212 +	nframes = (unsigned int)div_u64(size, frame_size);
213 213 	if (nframes == 0 || nframes > UINT_MAX)
214 214 		return -EINVAL;
215 215 
+75 -91
samples/bpf/Makefile
···
  1   1 # SPDX-License-Identifier: GPL-2.0
      2 +
      3 +BPF_SAMPLES_PATH ?= $(abspath $(srctree)/$(src))
      4 +TOOLS_PATH := $(BPF_SAMPLES_PATH)/../../tools
      5 +
  2   6 # List of programs to build
  3   7 hostprogs-y := test_lru_dist
  4   8 hostprogs-y += sock_example
···
 50  46 hostprogs-y += cpustat
 51  47 hostprogs-y += xdp_adjust_tail
 52  48 hostprogs-y += xdpsock
     49 +hostprogs-y += xdp_fwd
 53  50 
 54  51 # Libbpf dependencies
 55     -LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
     52 +LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
     53 +
 56  54 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
 57  55 TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
 58  56 
 59     -test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
 60     -sock_example-objs := sock_example.o $(LIBBPF)
 61     -fds_example-objs := bpf_load.o $(LIBBPF) fds_example.o
 62     -sockex1-objs := bpf_load.o $(LIBBPF) sockex1_user.o
 63     -sockex2-objs := bpf_load.o $(LIBBPF) sockex2_user.o
 64     -sockex3-objs := bpf_load.o $(LIBBPF) sockex3_user.o
 65     -tracex1-objs := bpf_load.o $(LIBBPF) tracex1_user.o
 66     -tracex2-objs := bpf_load.o $(LIBBPF) tracex2_user.o
 67     -tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 68     -tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 69     -tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 70     -tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
 71     -tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 72     -load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 73     -test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 74     -trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o $(TRACE_HELPERS)
 75     -lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
 76     -offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o $(TRACE_HELPERS)
 77     -spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o $(TRACE_HELPERS)
 78     -map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
 79     -test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
 80     -test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o
 81     -test_cgrp2_attach-objs := $(LIBBPF) test_cgrp2_attach.o
 82     -test_cgrp2_attach2-objs := $(LIBBPF) test_cgrp2_attach2.o $(CGROUP_HELPERS)
 83     -test_cgrp2_sock-objs := $(LIBBPF) test_cgrp2_sock.o
 84     -test_cgrp2_sock2-objs := bpf_load.o $(LIBBPF) test_cgrp2_sock2.o
 85     -xdp1-objs := bpf_load.o $(LIBBPF) xdp1_user.o
     57 +fds_example-objs := bpf_load.o fds_example.o
     58 +sockex1-objs := bpf_load.o sockex1_user.o
     59 +sockex2-objs := bpf_load.o sockex2_user.o
     60 +sockex3-objs := bpf_load.o sockex3_user.o
     61 +tracex1-objs := bpf_load.o tracex1_user.o
     62 +tracex2-objs := bpf_load.o tracex2_user.o
     63 +tracex3-objs := bpf_load.o tracex3_user.o
     64 +tracex4-objs := bpf_load.o tracex4_user.o
     65 +tracex5-objs := bpf_load.o tracex5_user.o
     66 +tracex6-objs := bpf_load.o tracex6_user.o
     67 +tracex7-objs := bpf_load.o tracex7_user.o
     68 +load_sock_ops-objs := bpf_load.o load_sock_ops.o
     69 +test_probe_write_user-objs := bpf_load.o test_probe_write_user_user.o
     70 +trace_output-objs := bpf_load.o trace_output_user.o $(TRACE_HELPERS)
     71 +lathist-objs := bpf_load.o lathist_user.o
     72 +offwaketime-objs := bpf_load.o offwaketime_user.o $(TRACE_HELPERS)
     73 +spintest-objs := bpf_load.o spintest_user.o $(TRACE_HELPERS)
     74 +map_perf_test-objs := bpf_load.o map_perf_test_user.o
     75 +test_overhead-objs := bpf_load.o test_overhead_user.o
     76 +test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o
     77 +test_cgrp2_attach-objs := test_cgrp2_attach.o
     78 +test_cgrp2_attach2-objs := test_cgrp2_attach2.o $(CGROUP_HELPERS)
     79 +test_cgrp2_sock-objs := test_cgrp2_sock.o
     80 +test_cgrp2_sock2-objs := bpf_load.o test_cgrp2_sock2.o
     81 +xdp1-objs := xdp1_user.o
 86  82 # reuse xdp1 source intentionally
 87     -xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o
 88     -xdp_router_ipv4-objs := bpf_load.o $(LIBBPF) xdp_router_ipv4_user.o
 89     -test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) $(CGROUP_HELPERS) \
     83 +xdp2-objs := xdp1_user.o
     84 +xdp_router_ipv4-objs := bpf_load.o xdp_router_ipv4_user.o
     85 +test_current_task_under_cgroup-objs := bpf_load.o $(CGROUP_HELPERS) \
 90  86 				       test_current_task_under_cgroup_user.o
 91     -trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o $(TRACE_HELPERS)
 92     -sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o $(TRACE_HELPERS)
 93     -tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
 94     -lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
 95     -xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
 96     -test_map_in_map-objs := bpf_load.o $(LIBBPF) test_map_in_map_user.o
 97     -per_socket_stats_example-objs := $(LIBBPF) cookie_uid_helper_example.o
 98     -xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
 99     -xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
100     -xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
101     -xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
102     -xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
103     -syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
104     -cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
105     -xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
106     -xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
     87 +trace_event-objs := bpf_load.o trace_event_user.o $(TRACE_HELPERS)
     88 +sampleip-objs := bpf_load.o sampleip_user.o $(TRACE_HELPERS)
     89 +tc_l2_redirect-objs := bpf_load.o tc_l2_redirect_user.o
     90 +lwt_len_hist-objs := bpf_load.o lwt_len_hist_user.o
     91 +xdp_tx_iptunnel-objs := bpf_load.o xdp_tx_iptunnel_user.o
     92 +test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
     93 +per_socket_stats_example-objs := cookie_uid_helper_example.o
     94 +xdp_redirect-objs := bpf_load.o xdp_redirect_user.o
     95 +xdp_redirect_map-objs := bpf_load.o xdp_redirect_map_user.o
     96 +xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
     97 +xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
     98 +xdp_rxq_info-objs := xdp_rxq_info_user.o
     99 +syscall_tp-objs := bpf_load.o syscall_tp_user.o
    100 +cpustat-objs := bpf_load.o cpustat_user.o
    101 +xdp_adjust_tail-objs := xdp_adjust_tail_user.o
    102 +xdpsock-objs := bpf_load.o xdpsock_user.o
    103 +xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
107 104 
108 105 # Tell kbuild to always build the programs
109 106 always := $(hostprogs-y)
···
159 154 always += cpustat_kern.o
160 155 always += xdp_adjust_tail_kern.o
161 156 always += xdpsock_kern.o
    157 +always += xdp_fwd_kern.o
162 158 
163 159 HOSTCFLAGS += -I$(objtree)/usr/include
164 160 HOSTCFLAGS += -I$(srctree)/tools/lib/
···
168 162 HOSTCFLAGS += -I$(srctree)/tools/perf
169 163 
170 164 HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
171     -HOSTLOADLIBES_fds_example += -lelf
172     -HOSTLOADLIBES_sockex1 += -lelf
173     -HOSTLOADLIBES_sockex2 += -lelf
174     -HOSTLOADLIBES_sockex3 += -lelf
175     -HOSTLOADLIBES_tracex1 += -lelf
176     -HOSTLOADLIBES_tracex2 += -lelf
177     -HOSTLOADLIBES_tracex3 += -lelf
178     -HOSTLOADLIBES_tracex4 += -lelf -lrt
179     -HOSTLOADLIBES_tracex5 += -lelf
180     -HOSTLOADLIBES_tracex6 += -lelf
181     -HOSTLOADLIBES_tracex7 += -lelf
182     -HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
183     -HOSTLOADLIBES_load_sock_ops += -lelf
184     -HOSTLOADLIBES_test_probe_write_user += -lelf
185     -HOSTLOADLIBES_trace_output += -lelf -lrt
186     -HOSTLOADLIBES_lathist += -lelf
187     -HOSTLOADLIBES_offwaketime += -lelf
188     -HOSTLOADLIBES_spintest += -lelf
189     -HOSTLOADLIBES_map_perf_test += -lelf -lrt
190     -HOSTLOADLIBES_test_overhead += -lelf -lrt
191     -HOSTLOADLIBES_xdp1 += -lelf
192     -HOSTLOADLIBES_xdp2 += -lelf
193     -HOSTLOADLIBES_xdp_router_ipv4 += -lelf
194     -HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
195     -HOSTLOADLIBES_trace_event += -lelf
196     -HOSTLOADLIBES_sampleip += -lelf
197     -HOSTLOADLIBES_tc_l2_redirect += -l elf
198     -HOSTLOADLIBES_lwt_len_hist += -l elf
199     -HOSTLOADLIBES_xdp_tx_iptunnel += -lelf
200     -HOSTLOADLIBES_test_map_in_map += -lelf
201     -HOSTLOADLIBES_xdp_redirect += -lelf
202     -HOSTLOADLIBES_xdp_redirect_map += -lelf
203     -HOSTLOADLIBES_xdp_redirect_cpu += -lelf
204     -HOSTLOADLIBES_xdp_monitor += -lelf
205     -HOSTLOADLIBES_xdp_rxq_info += -lelf
206     -HOSTLOADLIBES_syscall_tp += -lelf
207     -HOSTLOADLIBES_cpustat += -lelf
208     -HOSTLOADLIBES_xdp_adjust_tail += -lelf
209     -HOSTLOADLIBES_xdpsock += -lelf -pthread
    165 +HOSTCFLAGS_trace_helpers.o += -I$(srctree)/tools/lib/bpf/
    166 +
    167 +HOSTCFLAGS_trace_output_user.o += -I$(srctree)/tools/lib/bpf/
    168 +HOSTCFLAGS_offwaketime_user.o += -I$(srctree)/tools/lib/bpf/
    169 +HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
    170 +HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
    171 +HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
    172 +
    173 +HOST_LOADLIBES += $(LIBBPF) -lelf
    174 +HOSTLOADLIBES_tracex4 += -lrt
    175 +HOSTLOADLIBES_trace_output += -lrt
    176 +HOSTLOADLIBES_map_perf_test += -lrt
    177 +HOSTLOADLIBES_test_overhead += -lrt
    178 +HOSTLOADLIBES_xdpsock += -pthread
210 179 
211 180 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
212 181 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
···
195 214 endif
196 215 
197 216 # Trick to allow make to be run from this directory
198     -all: $(LIBBPF)
199     -	$(MAKE) -C ../../ $(CURDIR)/
    217 +all:
    218 +	$(MAKE) -C ../../ $(CURDIR)/ BPF_SAMPLES_PATH=$(CURDIR)
200 219 
201 220 clean:
202 221 	$(MAKE) -C ../../ M=$(CURDIR) clean
203 222 	@rm -f *~
204 223 
205 224 $(LIBBPF): FORCE
206     -	$(MAKE) -C $(dir $@) $(notdir $@)
    225 +# Fix up variables inherited from Kbuild that tools/ build system won't like
    226 +	$(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(BPF_SAMPLES_PATH)/../../ O=
207 227 
208 228 $(obj)/syscall_nrs.s: $(src)/syscall_nrs.c
209 229 	$(call if_changed_dep,cc_s_c)
···
235 253 		exit 2; \
236 254 	else true; fi
237 255 
238     -$(src)/*.c: verify_target_bpf
    256 +$(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF)
    257 +$(src)/*.c: verify_target_bpf $(LIBBPF)
239 258 
240 259 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
241 260 
···
244 261 # But, there is no easy way to fix it, so just exclude it since it is
245 262 # useless for BPF samples.
246 263 $(obj)/%.o: $(src)/%.c
247     -	$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
    264 +	@echo "  CLANG-bpf " $@
    265 +	$(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
248 266 		-I$(srctree)/tools/testing/selftests/bpf/ \
249 267 		-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
250 268 		-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
+6 -6
samples/bpf/bpf_load.c
···
 24  24 #include <poll.h>
 25  25 #include <ctype.h>
 26  26 #include <assert.h>
 27     -#include "libbpf.h"
     27 +#include <bpf/bpf.h>
 28  28 #include "bpf_load.h"
 29  29 #include "perf-sys.h"
 30  30 
···
420 420 
421 421 	/* Keeping compatible with ELF maps section changes
422 422 	 * ------------------------------------------------
423     -	 * The program size of struct bpf_map_def is known by loader
    423 +	 * The program size of struct bpf_load_map_def is known by loader
424 424 	 * code, but struct stored in ELF file can be different.
425 425 	 *
426 426 	 * Unfortunately sym[i].st_size is zero.  To calculate the
···
429 429 	 * symbols.
430 430 	 */
431 431 	map_sz_elf = data_maps->d_size / nr_maps;
432     -	map_sz_copy = sizeof(struct bpf_map_def);
    432 +	map_sz_copy = sizeof(struct bpf_load_map_def);
433 433 	if (map_sz_elf < map_sz_copy) {
434 434 		/*
435 435 		 * Backward compat, loading older ELF file with
···
448 448 
449 449 	/* Memcpy relevant part of ELF maps data to loader maps */
450 450 	for (i = 0; i < nr_maps; i++) {
    451 +		struct bpf_load_map_def *def;
451 452 		unsigned char *addr, *end;
452     -		struct bpf_map_def *def;
453 453 		const char *map_name;
454 454 		size_t offset;
455 455 
···
464 464 
465 465 		/* Symbol value is offset into ELF maps section data area */
466 466 		offset = sym[i].st_value;
467     -		def = (struct bpf_map_def *)(data_maps->d_buf + offset);
    467 +		def = (struct bpf_load_map_def *)(data_maps->d_buf + offset);
468 468 		maps[i].elf_offset = offset;
469     -		memset(&maps[i].def, 0, sizeof(struct bpf_map_def));
    469 +		memset(&maps[i].def, 0, sizeof(struct bpf_load_map_def));
470 470 		memcpy(&maps[i].def, def, map_sz_copy);
471 471 
472 472 		/* Verify no newer features were requested */
+3 -3
samples/bpf/bpf_load.h
···
 2  2 #ifndef __BPF_LOAD_H
 3  3 #define __BPF_LOAD_H
 4  4 
 5    -#include "libbpf.h"
    5 +#include <bpf/bpf.h>
 6  6 
 7  7 #define MAX_MAPS 32
 8  8 #define MAX_PROGS 32
 9  9 
10    -struct bpf_map_def {
   10 +struct bpf_load_map_def {
11 11 	unsigned int type;
12 12 	unsigned int key_size;
13 13 	unsigned int value_size;
···
21 21 	int fd;
22 22 	char *name;
23 23 	size_t elf_offset;
24    -	struct bpf_map_def def;
   24 +	struct bpf_load_map_def def;
25 25 };
26 26 
27 27 typedef void (*fixup_map_cb)(struct bpf_map_data *map, int idx);
+1 -1
samples/bpf/cookie_uid_helper_example.c
···
51 51 #include <sys/types.h>
52 52 #include <unistd.h>
53 53 #include <bpf/bpf.h>
54    -#include "libbpf.h"
   54 +#include "bpf_insn.h"
55 55 
56 56 #define PORT 8888
57 57 
+1 -1
samples/bpf/cpustat_user.c
···
17 17 #include <sys/resource.h>
18 18 #include <sys/wait.h>
19 19 
20    -#include "libbpf.h"
   20 +#include <bpf/bpf.h>
21 21 #include "bpf_load.h"
22 22 
23 23 #define MAX_CPU 8
+3 -1
samples/bpf/fds_example.c
···
12 12 #include <sys/types.h>
13 13 #include <sys/socket.h>
14 14 
   15 +#include <bpf/bpf.h>
   16 +
   17 +#include "bpf_insn.h"
15 18 #include "bpf_load.h"
16    -#include "libbpf.h"
17 19 #include "sock_example.h"
18 20 
19 21 #define BPF_F_PIN	(1 << 0)
+1 -1
samples/bpf/lathist_user.c
···
10 10 #include <stdlib.h>
11 11 #include <signal.h>
12 12 #include <linux/bpf.h>
13    -#include "libbpf.h"
   13 +#include <bpf/bpf.h>
14 14 #include "bpf_load.h"
15 15 
16 16 #define MAX_ENTRIES	20
+3 -5
samples/bpf/libbpf.h samples/bpf/bpf_insn.h
···
1 1 /* SPDX-License-Identifier: GPL-2.0 */
2   -/* eBPF mini library */
3   -#ifndef __LIBBPF_H
4   -#define __LIBBPF_H
5   -
6   -#include <bpf/bpf.h>
  2 +/* eBPF instruction mini library */
  3 +#ifndef __BPF_INSN_H
  4 +#define __BPF_INSN_H
7 5 
8 6 struct bpf_insn;
9 7 
+1 -1
samples/bpf/load_sock_ops.c
···
 8  8 #include <stdlib.h>
 9  9 #include <string.h>
10 10 #include <linux/bpf.h>
11    -#include "libbpf.h"
   11 +#include <bpf/bpf.h>
12 12 #include "bpf_load.h"
13 13 #include <unistd.h>
14 14 #include <errno.h>
+1 -1
samples/bpf/lwt_len_hist_user.c
···
 9  9 #include <errno.h>
10 10 #include <arpa/inet.h>
11 11 
12    -#include "libbpf.h"
   12 +#include <bpf/bpf.h>
13 13 #include "bpf_util.h"
14 14 
15 15 #define MAX_INDEX 64
+1 -1
samples/bpf/map_perf_test_user.c
···
21 21 #include <arpa/inet.h>
22 22 #include <errno.h>
23 23 
24    -#include "libbpf.h"
   24 +#include <bpf/bpf.h>
25 25 #include "bpf_load.h"
26 26 
27 27 #define TEST_BIT(t) (1U << (t))
+2 -1
samples/bpf/sock_example.c
···
26 26 #include <linux/if_ether.h>
27 27 #include <linux/ip.h>
28 28 #include <stddef.h>
29    -#include "libbpf.h"
   29 +#include <bpf/bpf.h>
   30 +#include "bpf_insn.h"
30 31 #include "sock_example.h"
31 32 
32 33 char bpf_log_buf[BPF_LOG_BUF_SIZE];
-1
samples/bpf/sock_example.h
···
 9  9 #include <net/if.h>
10 10 #include <linux/if_packet.h>
11 11 #include <arpa/inet.h>
12    -#include "libbpf.h"
13 12 
14 13 static inline int open_raw_sock(const char *name)
15 14 {
+1 -1
samples/bpf/sockex1_user.c
···
2 2 #include <stdio.h>
3 3 #include <assert.h>
4 4 #include <linux/bpf.h>
5   -#include "libbpf.h"
  5 +#include <bpf/bpf.h>
6 6 #include "bpf_load.h"
7 7 #include "sock_example.h"
8 8 #include <unistd.h>
+1 -1
samples/bpf/sockex2_user.c
··· 2 2 #include <stdio.h> 3 3 #include <assert.h> 4 4 #include <linux/bpf.h> 5 - #include "libbpf.h" 5 + #include <bpf/bpf.h> 6 6 #include "bpf_load.h" 7 7 #include "sock_example.h" 8 8 #include <unistd.h>
+1 -1
samples/bpf/sockex3_user.c
··· 2 2 #include <stdio.h> 3 3 #include <assert.h> 4 4 #include <linux/bpf.h> 5 - #include "libbpf.h" 5 + #include <bpf/bpf.h> 6 6 #include "bpf_load.h" 7 7 #include "sock_example.h" 8 8 #include <unistd.h>
+1 -1
samples/bpf/syscall_tp_user.c
··· 16 16 #include <assert.h> 17 17 #include <stdbool.h> 18 18 #include <sys/resource.h> 19 - #include "libbpf.h" 19 + #include <bpf/bpf.h> 20 20 #include "bpf_load.h" 21 21 22 22 /* This program verifies bpf attachment to tracepoint sys_enter_* and sys_exit_*.
+1 -1
samples/bpf/tc_l2_redirect_user.c
··· 13 13 #include <string.h> 14 14 #include <errno.h> 15 15 16 - #include "libbpf.h" 16 + #include <bpf/bpf.h> 17 17 18 18 static void usage(void) 19 19 {
+1 -1
samples/bpf/test_cgrp2_array_pin.c
··· 14 14 #include <errno.h> 15 15 #include <fcntl.h> 16 16 17 - #include "libbpf.h" 17 + #include <bpf/bpf.h> 18 18 19 19 static void usage(void) 20 20 {
+2 -1
samples/bpf/test_cgrp2_attach.c
··· 28 28 #include <fcntl.h> 29 29 30 30 #include <linux/bpf.h> 31 + #include <bpf/bpf.h> 31 32 32 - #include "libbpf.h" 33 + #include "bpf_insn.h" 33 34 34 35 enum { 35 36 MAP_KEY_PACKETS,
+2 -1
samples/bpf/test_cgrp2_attach2.c
··· 24 24 #include <unistd.h> 25 25 26 26 #include <linux/bpf.h> 27 + #include <bpf/bpf.h> 27 28 28 - #include "libbpf.h" 29 + #include "bpf_insn.h" 29 30 #include "cgroup_helpers.h" 30 31 31 32 #define FOO "/foo"
+2 -1
samples/bpf/test_cgrp2_sock.c
··· 21 21 #include <net/if.h> 22 22 #include <inttypes.h> 23 23 #include <linux/bpf.h> 24 + #include <bpf/bpf.h> 24 25 25 - #include "libbpf.h" 26 + #include "bpf_insn.h" 26 27 27 28 char bpf_log_buf[BPF_LOG_BUF_SIZE]; 28 29
+2 -1
samples/bpf/test_cgrp2_sock2.c
··· 19 19 #include <fcntl.h> 20 20 #include <net/if.h> 21 21 #include <linux/bpf.h> 22 + #include <bpf/bpf.h> 22 23 23 - #include "libbpf.h" 24 + #include "bpf_insn.h" 24 25 #include "bpf_load.h" 25 26 26 27 static int usage(const char *argv0)
+1 -1
samples/bpf/test_current_task_under_cgroup_user.c
··· 9 9 #include <stdio.h> 10 10 #include <linux/bpf.h> 11 11 #include <unistd.h> 12 - #include "libbpf.h" 12 + #include <bpf/bpf.h> 13 13 #include "bpf_load.h" 14 14 #include <linux/bpf.h> 15 15 #include "cgroup_helpers.h"
+1 -1
samples/bpf/test_lru_dist.c
··· 21 21 #include <stdlib.h> 22 22 #include <time.h> 23 23 24 - #include "libbpf.h" 24 + #include <bpf/bpf.h> 25 25 #include "bpf_util.h" 26 26 27 27 #define min(a, b) ((a) < (b) ? (a) : (b))
+1 -1
samples/bpf/test_map_in_map_user.c
··· 13 13 #include <errno.h> 14 14 #include <stdlib.h> 15 15 #include <stdio.h> 16 - #include "libbpf.h" 16 + #include <bpf/bpf.h> 17 17 #include "bpf_load.h" 18 18 19 19 #define PORT_A (map_fd[0])
+1 -1
samples/bpf/test_overhead_user.c
··· 19 19 #include <string.h> 20 20 #include <time.h> 21 21 #include <sys/resource.h> 22 - #include "libbpf.h" 22 + #include <bpf/bpf.h> 23 23 #include "bpf_load.h" 24 24 25 25 #define MAX_CNT 1000000
+1 -1
samples/bpf/test_probe_write_user_user.c
··· 3 3 #include <assert.h> 4 4 #include <linux/bpf.h> 5 5 #include <unistd.h> 6 - #include "libbpf.h" 6 + #include <bpf/bpf.h> 7 7 #include "bpf_load.h" 8 8 #include <sys/socket.h> 9 9 #include <string.h>
+4 -4
samples/bpf/trace_output_user.c
··· 18 18 #include <sys/mman.h> 19 19 #include <time.h> 20 20 #include <signal.h> 21 - #include "libbpf.h" 21 + #include <libbpf.h> 22 22 #include "bpf_load.h" 23 23 #include "perf-sys.h" 24 24 #include "trace_helpers.h" ··· 48 48 if (e->cookie != 0x12345678) { 49 49 printf("BUG pid %llx cookie %llx sized %d\n", 50 50 e->pid, e->cookie, size); 51 - return PERF_EVENT_ERROR; 51 + return LIBBPF_PERF_EVENT_ERROR; 52 52 } 53 53 54 54 cnt++; ··· 56 56 if (cnt == MAX_CNT) { 57 57 printf("recv %lld events per sec\n", 58 58 MAX_CNT * 1000000000ll / (time_get_ns() - start_time)); 59 - return PERF_EVENT_DONE; 59 + return LIBBPF_PERF_EVENT_DONE; 60 60 } 61 61 62 - return PERF_EVENT_CONT; 62 + return LIBBPF_PERF_EVENT_CONT; 63 63 } 64 64 65 65 static void test_bpf_perf_event(void)
+1 -1
samples/bpf/tracex1_user.c
··· 2 2 #include <stdio.h> 3 3 #include <linux/bpf.h> 4 4 #include <unistd.h> 5 - #include "libbpf.h" 5 + #include <bpf/bpf.h> 6 6 #include "bpf_load.h" 7 7 8 8 int main(int ac, char **argv)
+1 -1
samples/bpf/tracex2_user.c
··· 7 7 #include <string.h> 8 8 #include <sys/resource.h> 9 9 10 - #include "libbpf.h" 10 + #include <bpf/bpf.h> 11 11 #include "bpf_load.h" 12 12 #include "bpf_util.h" 13 13
+1 -1
samples/bpf/tracex3_user.c
··· 13 13 #include <linux/bpf.h> 14 14 #include <sys/resource.h> 15 15 16 - #include "libbpf.h" 16 + #include <bpf/bpf.h> 17 17 #include "bpf_load.h" 18 18 #include "bpf_util.h" 19 19
+1 -1
samples/bpf/tracex4_user.c
··· 14 14 #include <linux/bpf.h> 15 15 #include <sys/resource.h> 16 16 17 - #include "libbpf.h" 17 + #include <bpf/bpf.h> 18 18 #include "bpf_load.h" 19 19 20 20 struct pair {
+1 -1
samples/bpf/tracex5_user.c
··· 5 5 #include <linux/filter.h> 6 6 #include <linux/seccomp.h> 7 7 #include <sys/prctl.h> 8 - #include "libbpf.h" 8 + #include <bpf/bpf.h> 9 9 #include "bpf_load.h" 10 10 #include <sys/resource.h> 11 11
+1 -1
samples/bpf/tracex6_user.c
··· 16 16 #include <unistd.h> 17 17 18 18 #include "bpf_load.h" 19 - #include "libbpf.h" 19 + #include <bpf/bpf.h> 20 20 #include "perf-sys.h" 21 21 22 22 #define SAMPLE_PERIOD 0x7fffffffffffffffULL
+1 -1
samples/bpf/tracex7_user.c
··· 3 3 #include <stdio.h> 4 4 #include <linux/bpf.h> 5 5 #include <unistd.h> 6 - #include "libbpf.h" 6 + #include <bpf/bpf.h> 7 7 #include "bpf_load.h" 8 8 9 9 int main(int argc, char **argv)
+21 -10
samples/bpf/xdp1_user.c
··· 16 16 #include <libgen.h> 17 17 #include <sys/resource.h> 18 18 19 - #include "bpf_load.h" 20 19 #include "bpf_util.h" 21 - #include "libbpf.h" 20 + #include "bpf/bpf.h" 21 + #include "bpf/libbpf.h" 22 22 23 23 static int ifindex; 24 24 static __u32 xdp_flags; ··· 31 31 32 32 /* simple per-protocol drop counter 33 33 */ 34 - static void poll_stats(int interval) 34 + static void poll_stats(int map_fd, int interval) 35 35 { 36 36 unsigned int nr_cpus = bpf_num_possible_cpus(); 37 37 const unsigned int nr_keys = 256; ··· 47 47 for (key = 0; key < nr_keys; key++) { 48 48 __u64 sum = 0; 49 49 50 - assert(bpf_map_lookup_elem(map_fd[0], &key, values) == 0); 50 + assert(bpf_map_lookup_elem(map_fd, &key, values) == 0); 51 51 for (i = 0; i < nr_cpus; i++) 52 52 sum += (values[i] - prev[key][i]); 53 53 if (sum) ··· 71 71 int main(int argc, char **argv) 72 72 { 73 73 struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; 74 + struct bpf_prog_load_attr prog_load_attr = { 75 + .prog_type = BPF_PROG_TYPE_XDP, 76 + }; 74 77 const char *optstr = "SN"; 78 + int prog_fd, map_fd, opt; 79 + struct bpf_object *obj; 80 + struct bpf_map *map; 75 81 char filename[256]; 76 - int opt; 77 82 78 83 while ((opt = getopt(argc, argv, optstr)) != -1) { 79 84 switch (opt) { ··· 107 102 ifindex = strtoul(argv[optind], NULL, 0); 108 103 109 104 snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); 105 + prog_load_attr.file = filename; 110 106 111 - if (load_bpf_file(filename)) { 112 - printf("%s", bpf_log_buf); 107 + if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) 108 + return 1; 109 + 110 + map = bpf_map__next(NULL, obj); 111 + if (!map) { 112 + printf("finding a map in obj file failed\n"); 113 113 return 1; 114 114 } 115 + map_fd = bpf_map__fd(map); 115 116 116 - if (!prog_fd[0]) { 117 + if (!prog_fd) { 117 118 printf("load_bpf_file: %s\n", strerror(errno)); 118 119 return 1; 119 120 } ··· 127 116 signal(SIGINT, int_exit); 128 117 signal(SIGTERM, int_exit); 129 118 130 - if 
(bpf_set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0) { 119 + if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) { 131 120 printf("link set xdp fd failed\n"); 132 121 return 1; 133 122 } 134 123 135 - poll_stats(2); 124 + poll_stats(map_fd, 2); 136 125 137 126 return 0; 138 127 }
+22 -14
samples/bpf/xdp_adjust_tail_user.c
··· 18 18 #include <netinet/ether.h> 19 19 #include <unistd.h> 20 20 #include <time.h> 21 - #include "bpf_load.h" 22 - #include "libbpf.h" 23 - #include "bpf_util.h" 21 + #include "bpf/bpf.h" 22 + #include "bpf/libbpf.h" 24 23 25 24 #define STATS_INTERVAL_S 2U 26 25 ··· 35 36 36 37 /* simple "icmp packet too big sent" counter 37 38 */ 38 - static void poll_stats(unsigned int kill_after_s) 39 + static void poll_stats(unsigned int map_fd, unsigned int kill_after_s) 39 40 { 40 41 time_t started_at = time(NULL); 41 42 __u64 value = 0; ··· 45 46 while (!kill_after_s || time(NULL) - started_at <= kill_after_s) { 46 47 sleep(STATS_INTERVAL_S); 47 48 48 - assert(bpf_map_lookup_elem(map_fd[0], &key, &value) == 0); 49 + assert(bpf_map_lookup_elem(map_fd, &key, &value) == 0); 49 50 50 51 printf("icmp \"packet too big\" sent: %10llu pkts\n", value); 51 52 } ··· 65 66 66 67 int main(int argc, char **argv) 67 68 { 69 + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; 70 + struct bpf_prog_load_attr prog_load_attr = { 71 + .prog_type = BPF_PROG_TYPE_XDP, 72 + }; 68 73 unsigned char opt_flags[256] = {}; 69 74 unsigned int kill_after_s = 0; 70 75 const char *optstr = "i:T:SNh"; 71 - struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; 76 + int i, prog_fd, map_fd, opt; 77 + struct bpf_object *obj; 78 + struct bpf_map *map; 72 79 char filename[256]; 73 - int opt; 74 - int i; 75 - 76 80 77 81 for (i = 0; i < strlen(optstr); i++) 78 82 if (optstr[i] != 'h' && 'a' <= optstr[i] && optstr[i] <= 'z') ··· 117 115 } 118 116 119 117 snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); 118 + prog_load_attr.file = filename; 120 119 121 - if (load_bpf_file(filename)) { 122 - printf("%s", bpf_log_buf); 120 + if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) 121 + return 1; 122 + 123 + map = bpf_map__next(NULL, obj); 124 + if (!map) { 125 + printf("finding a map in obj file failed\n"); 123 126 return 1; 124 127 } 128 + map_fd = bpf_map__fd(map); 125 129 126 - if (!prog_fd[0]) { 130 + 
if (!prog_fd) { 127 131 printf("load_bpf_file: %s\n", strerror(errno)); 128 132 return 1; 129 133 } ··· 137 129 signal(SIGINT, int_exit); 138 130 signal(SIGTERM, int_exit); 139 131 140 - if (bpf_set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0) { 132 + if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) { 141 133 printf("link set xdp fd failed\n"); 142 134 return 1; 143 135 } 144 136 145 - poll_stats(kill_after_s); 137 + poll_stats(map_fd, kill_after_s); 146 138 147 139 bpf_set_link_xdp_fd(ifindex, -1, xdp_flags); 148 140
+138
samples/bpf/xdp_fwd_kern.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Copyright (c) 2017-18 David Ahern <dsahern@gmail.com> 3 + * 4 + * This program is free software; you can redistribute it and/or 5 + * modify it under the terms of version 2 of the GNU General Public 6 + * License as published by the Free Software Foundation. 7 + * 8 + * This program is distributed in the hope that it will be useful, but 9 + * WITHOUT ANY WARRANTY; without even the implied warranty of 10 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 11 + * General Public License for more details. 12 + */ 13 + #define KBUILD_MODNAME "foo" 14 + #include <uapi/linux/bpf.h> 15 + #include <linux/in.h> 16 + #include <linux/if_ether.h> 17 + #include <linux/if_packet.h> 18 + #include <linux/if_vlan.h> 19 + #include <linux/ip.h> 20 + #include <linux/ipv6.h> 21 + 22 + #include "bpf_helpers.h" 23 + 24 + #define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFFFFFF) 25 + 26 + struct bpf_map_def SEC("maps") tx_port = { 27 + .type = BPF_MAP_TYPE_DEVMAP, 28 + .key_size = sizeof(int), 29 + .value_size = sizeof(int), 30 + .max_entries = 64, 31 + }; 32 + 33 + /* from include/net/ip.h */ 34 + static __always_inline int ip_decrease_ttl(struct iphdr *iph) 35 + { 36 + u32 check = (__force u32)iph->check; 37 + 38 + check += (__force u32)htons(0x0100); 39 + iph->check = (__force __sum16)(check + (check >= 0xFFFF)); 40 + return --iph->ttl; 41 + } 42 + 43 + static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) 44 + { 45 + void *data_end = (void *)(long)ctx->data_end; 46 + void *data = (void *)(long)ctx->data; 47 + struct bpf_fib_lookup fib_params; 48 + struct ethhdr *eth = data; 49 + struct ipv6hdr *ip6h; 50 + struct iphdr *iph; 51 + int out_index; 52 + u16 h_proto; 53 + u64 nh_off; 54 + 55 + nh_off = sizeof(*eth); 56 + if (data + nh_off > data_end) 57 + return XDP_DROP; 58 + 59 + __builtin_memset(&fib_params, 0, sizeof(fib_params)); 60 + 61 + h_proto = eth->h_proto; 62 + if (h_proto == htons(ETH_P_IP)) { 63 + 
iph = data + nh_off; 64 + 65 + if (iph + 1 > data_end) 66 + return XDP_DROP; 67 + 68 + if (iph->ttl <= 1) 69 + return XDP_PASS; 70 + 71 + fib_params.family = AF_INET; 72 + fib_params.tos = iph->tos; 73 + fib_params.l4_protocol = iph->protocol; 74 + fib_params.sport = 0; 75 + fib_params.dport = 0; 76 + fib_params.tot_len = ntohs(iph->tot_len); 77 + fib_params.ipv4_src = iph->saddr; 78 + fib_params.ipv4_dst = iph->daddr; 79 + } else if (h_proto == htons(ETH_P_IPV6)) { 80 + struct in6_addr *src = (struct in6_addr *) fib_params.ipv6_src; 81 + struct in6_addr *dst = (struct in6_addr *) fib_params.ipv6_dst; 82 + 83 + ip6h = data + nh_off; 84 + if (ip6h + 1 > data_end) 85 + return XDP_DROP; 86 + 87 + if (ip6h->hop_limit <= 1) 88 + return XDP_PASS; 89 + 90 + fib_params.family = AF_INET6; 91 + fib_params.flowlabel = *(__be32 *)ip6h & IPV6_FLOWINFO_MASK; 92 + fib_params.l4_protocol = ip6h->nexthdr; 93 + fib_params.sport = 0; 94 + fib_params.dport = 0; 95 + fib_params.tot_len = ntohs(ip6h->payload_len); 96 + *src = ip6h->saddr; 97 + *dst = ip6h->daddr; 98 + } else { 99 + return XDP_PASS; 100 + } 101 + 102 + fib_params.ifindex = ctx->ingress_ifindex; 103 + 104 + out_index = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); 105 + 106 + /* verify egress index has xdp support 107 + * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with 108 + * cannot pass map_type 14 into func bpf_map_lookup_elem#1: 109 + * NOTE: without verification that egress index supports XDP 110 + * forwarding packets are dropped. 
111 + */ 112 + if (out_index > 0) { 113 + if (h_proto == htons(ETH_P_IP)) 114 + ip_decrease_ttl(iph); 115 + else if (h_proto == htons(ETH_P_IPV6)) 116 + ip6h->hop_limit--; 117 + 118 + memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN); 119 + memcpy(eth->h_source, fib_params.smac, ETH_ALEN); 120 + return bpf_redirect_map(&tx_port, out_index, 0); 121 + } 122 + 123 + return XDP_PASS; 124 + } 125 + 126 + SEC("xdp_fwd") 127 + int xdp_fwd_prog(struct xdp_md *ctx) 128 + { 129 + return xdp_fwd_flags(ctx, 0); 130 + } 131 + 132 + SEC("xdp_fwd_direct") 133 + int xdp_fwd_direct_prog(struct xdp_md *ctx) 134 + { 135 + return xdp_fwd_flags(ctx, BPF_FIB_LOOKUP_DIRECT); 136 + } 137 + 138 + char _license[] SEC("license") = "GPL";
+136
samples/bpf/xdp_fwd_user.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Copyright (c) 2017-18 David Ahern <dsahern@gmail.com> 3 + * 4 + * This program is free software; you can redistribute it and/or 5 + * modify it under the terms of version 2 of the GNU General Public 6 + * License as published by the Free Software Foundation. 7 + * 8 + * This program is distributed in the hope that it will be useful, but 9 + * WITHOUT ANY WARRANTY; without even the implied warranty of 10 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 11 + * General Public License for more details. 12 + */ 13 + 14 + #include <linux/bpf.h> 15 + #include <linux/if_link.h> 16 + #include <linux/limits.h> 17 + #include <net/if.h> 18 + #include <errno.h> 19 + #include <stdio.h> 20 + #include <stdlib.h> 21 + #include <stdbool.h> 22 + #include <string.h> 23 + #include <unistd.h> 24 + #include <fcntl.h> 25 + #include <libgen.h> 26 + 27 + #include "bpf_load.h" 28 + #include "bpf_util.h" 29 + #include <bpf/bpf.h> 30 + 31 + 32 + static int do_attach(int idx, int fd, const char *name) 33 + { 34 + int err; 35 + 36 + err = bpf_set_link_xdp_fd(idx, fd, 0); 37 + if (err < 0) 38 + printf("ERROR: failed to attach program to %s\n", name); 39 + 40 + return err; 41 + } 42 + 43 + static int do_detach(int idx, const char *name) 44 + { 45 + int err; 46 + 47 + err = bpf_set_link_xdp_fd(idx, -1, 0); 48 + if (err < 0) 49 + printf("ERROR: failed to detach program from %s\n", name); 50 + 51 + return err; 52 + } 53 + 54 + static void usage(const char *prog) 55 + { 56 + fprintf(stderr, 57 + "usage: %s [OPTS] interface-list\n" 58 + "\nOPTS:\n" 59 + " -d detach program\n" 60 + " -D direct table lookups (skip fib rules)\n", 61 + prog); 62 + } 63 + 64 + int main(int argc, char **argv) 65 + { 66 + char filename[PATH_MAX]; 67 + int opt, i, idx, err; 68 + int prog_id = 0; 69 + int attach = 1; 70 + int ret = 0; 71 + 72 + while ((opt = getopt(argc, argv, ":dD")) != -1) { 73 + switch (opt) { 74 + case 'd': 75 + attach = 0; 76 + break; 
77 + case 'D': 78 + prog_id = 1; 79 + break; 80 + default: 81 + usage(basename(argv[0])); 82 + return 1; 83 + } 84 + } 85 + 86 + if (optind == argc) { 87 + usage(basename(argv[0])); 88 + return 1; 89 + } 90 + 91 + if (attach) { 92 + snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); 93 + 94 + if (access(filename, O_RDONLY) < 0) { 95 + printf("error accessing file %s: %s\n", 96 + filename, strerror(errno)); 97 + return 1; 98 + } 99 + 100 + if (load_bpf_file(filename)) { 101 + printf("%s", bpf_log_buf); 102 + return 1; 103 + } 104 + 105 + if (!prog_fd[prog_id]) { 106 + printf("load_bpf_file: %s\n", strerror(errno)); 107 + return 1; 108 + } 109 + } 110 + if (attach) { 111 + for (i = 1; i < 64; ++i) 112 + bpf_map_update_elem(map_fd[0], &i, &i, 0); 113 + } 114 + 115 + for (i = optind; i < argc; ++i) { 116 + idx = if_nametoindex(argv[i]); 117 + if (!idx) 118 + idx = strtoul(argv[i], NULL, 0); 119 + 120 + if (!idx) { 121 + fprintf(stderr, "Invalid arg\n"); 122 + return 1; 123 + } 124 + if (!attach) { 125 + err = do_detach(idx, argv[i]); 126 + if (err) 127 + ret = err; 128 + } else { 129 + err = do_attach(idx, prog_fd[prog_id], argv[i]); 130 + if (err) 131 + ret = err; 132 + } 133 + } 134 + 135 + return ret; 136 + }
+3 -3
samples/bpf/xdp_monitor_user.c
··· 26 26 #include <net/if.h> 27 27 #include <time.h> 28 28 29 - #include "libbpf.h" 29 + #include <bpf/bpf.h> 30 30 #include "bpf_load.h" 31 31 #include "bpf_util.h" 32 32 ··· 58 58 printf(" flag (internal value:%d)", 59 59 *long_options[i].flag); 60 60 else 61 - printf("(internal short-option: -%c)", 61 + printf("short-option: -%c", 62 62 long_options[i].val); 63 63 printf("\n"); 64 64 } ··· 594 594 snprintf(bpf_obj_file, sizeof(bpf_obj_file), "%s_kern.o", argv[0]); 595 595 596 596 /* Parse commands line args */ 597 - while ((opt = getopt_long(argc, argv, "h", 597 + while ((opt = getopt_long(argc, argv, "hDSs:", 598 598 long_options, &longindex)) != -1) { 599 599 switch (opt) { 600 600 case 'D':
+1 -1
samples/bpf/xdp_redirect_cpu_user.c
··· 28 28 * use bpf/libbpf.h), but cannot as (currently) needed for XDP 29 29 * attaching to a device via bpf_set_link_xdp_fd() 30 30 */ 31 - #include "libbpf.h" 31 + #include <bpf/bpf.h> 32 32 #include "bpf_load.h" 33 33 34 34 #include "bpf_util.h"
+1 -1
samples/bpf/xdp_redirect_map_user.c
··· 24 24 25 25 #include "bpf_load.h" 26 26 #include "bpf_util.h" 27 - #include "libbpf.h" 27 + #include <bpf/bpf.h> 28 28 29 29 static int ifindex_in; 30 30 static int ifindex_out;
+1 -1
samples/bpf/xdp_redirect_user.c
··· 24 24 25 25 #include "bpf_load.h" 26 26 #include "bpf_util.h" 27 - #include "libbpf.h" 27 + #include <bpf/bpf.h> 28 28 29 29 static int ifindex_in; 30 30 static int ifindex_out;
+1 -1
samples/bpf/xdp_router_ipv4_user.c
··· 16 16 #include <sys/socket.h> 17 17 #include <unistd.h> 18 18 #include "bpf_load.h" 19 - #include "libbpf.h" 19 + #include <bpf/bpf.h> 20 20 #include <arpa/inet.h> 21 21 #include <fcntl.h> 22 22 #include <poll.h>
+31 -15
samples/bpf/xdp_rxq_info_user.c
··· 22 22 #include <arpa/inet.h> 23 23 #include <linux/if_link.h> 24 24 25 - #include "libbpf.h" 26 - #include "bpf_load.h" 25 + #include "bpf/bpf.h" 26 + #include "bpf/libbpf.h" 27 27 #include "bpf_util.h" 28 28 29 29 static int ifindex = -1; ··· 31 31 static char *ifname; 32 32 33 33 static __u32 xdp_flags; 34 + 35 + static struct bpf_map *stats_global_map; 36 + static struct bpf_map *rx_queue_index_map; 34 37 35 38 /* Exit return codes */ 36 39 #define EXIT_OK 0 ··· 177 174 178 175 static struct record *alloc_record_per_rxq(void) 179 176 { 180 - unsigned int nr_rxqs = map_data[2].def.max_entries; 177 + unsigned int nr_rxqs = bpf_map__def(rx_queue_index_map)->max_entries; 181 178 struct record *array; 182 179 size_t size; 183 180 ··· 193 190 194 191 static struct stats_record *alloc_stats_record(void) 195 192 { 196 - unsigned int nr_rxqs = map_data[2].def.max_entries; 193 + unsigned int nr_rxqs = bpf_map__def(rx_queue_index_map)->max_entries; 197 194 struct stats_record *rec; 198 195 int i; 199 196 ··· 213 210 214 211 static void free_stats_record(struct stats_record *r) 215 212 { 216 - unsigned int nr_rxqs = map_data[2].def.max_entries; 213 + unsigned int nr_rxqs = bpf_map__def(rx_queue_index_map)->max_entries; 217 214 int i; 218 215 219 216 for (i = 0; i < nr_rxqs; i++) ··· 257 254 { 258 255 int fd, i, max_rxqs; 259 256 260 - fd = map_data[1].fd; /* map: stats_global_map */ 257 + fd = bpf_map__fd(stats_global_map); 261 258 map_collect_percpu(fd, 0, &rec->stats); 262 259 263 - fd = map_data[2].fd; /* map: rx_queue_index_map */ 264 - max_rxqs = map_data[2].def.max_entries; 260 + fd = bpf_map__fd(rx_queue_index_map); 261 + max_rxqs = bpf_map__def(rx_queue_index_map)->max_entries; 265 262 for (i = 0; i < max_rxqs; i++) 266 263 map_collect_percpu(fd, i, &rec->rxq[i]); 267 264 } ··· 307 304 struct stats_record *stats_prev, 308 305 int action) 309 306 { 307 + unsigned int nr_rxqs = bpf_map__def(rx_queue_index_map)->max_entries; 310 308 unsigned int nr_cpus = 
bpf_num_possible_cpus(); 311 - unsigned int nr_rxqs = map_data[2].def.max_entries; 312 309 double pps = 0, err = 0; 313 310 struct record *rec, *prev; 314 311 double t; ··· 422 419 int main(int argc, char **argv) 423 420 { 424 421 struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY}; 422 + struct bpf_prog_load_attr prog_load_attr = { 423 + .prog_type = BPF_PROG_TYPE_XDP, 424 + }; 425 + int prog_fd, map_fd, opt, err; 425 426 bool use_separators = true; 426 427 struct config cfg = { 0 }; 428 + struct bpf_object *obj; 429 + struct bpf_map *map; 427 430 char filename[256]; 428 431 int longindex = 0; 429 432 int interval = 2; 430 433 __u32 key = 0; 431 - int opt, err; 432 434 433 435 char action_str_buf[XDP_ACTION_MAX_STRLEN + 1 /* for \0 */] = { 0 }; 434 436 int action = XDP_PASS; /* Default action */ 435 437 char *action_str = NULL; 436 438 437 439 snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); 440 + prog_load_attr.file = filename; 438 441 439 442 if (setrlimit(RLIMIT_MEMLOCK, &r)) { 440 443 perror("setrlimit(RLIMIT_MEMLOCK)"); 441 444 return 1; 442 445 } 443 446 444 - if (load_bpf_file(filename)) { 445 - fprintf(stderr, "ERR in load_bpf_file(): %s", bpf_log_buf); 447 + if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) 448 + return EXIT_FAIL; 449 + 450 + map = bpf_map__next(NULL, obj); 451 + stats_global_map = bpf_map__next(map, obj); 452 + rx_queue_index_map = bpf_map__next(stats_global_map, obj); 453 + if (!map || !stats_global_map || !rx_queue_index_map) { 454 + printf("finding a map in obj file failed\n"); 446 455 return EXIT_FAIL; 447 456 } 457 + map_fd = bpf_map__fd(map); 448 458 449 - if (!prog_fd[0]) { 459 + if (!prog_fd) { 450 460 fprintf(stderr, "ERR: load_bpf_file: %s\n", strerror(errno)); 451 461 return EXIT_FAIL; 452 462 } ··· 528 512 setlocale(LC_NUMERIC, "en_US"); 529 513 530 514 /* User-side setup ifindex in config_map */ 531 - err = bpf_map_update_elem(map_fd[0], &key, &cfg, 0); 515 + err = bpf_map_update_elem(map_fd, &key, &cfg, 
0); 532 516 if (err) { 533 517 fprintf(stderr, "Store config failed (err:%d)\n", err); 534 518 exit(EXIT_FAIL_BPF); ··· 537 521 /* Remove XDP program when program is interrupted */ 538 522 signal(SIGINT, int_exit); 539 523 540 - if (bpf_set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0) { 524 + if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) { 541 525 fprintf(stderr, "link set xdp fd failed\n"); 542 526 return EXIT_FAIL_XDP; 543 527 }
+1 -1
samples/bpf/xdp_tx_iptunnel_user.c
··· 18 18 #include <unistd.h> 19 19 #include <time.h> 20 20 #include "bpf_load.h" 21 - #include "libbpf.h" 21 + #include <bpf/bpf.h> 22 22 #include "bpf_util.h" 23 23 #include "xdp_tx_iptunnel_common.h" 24 24
+1 -1
samples/bpf/xdpsock_user.c
··· 38 38 39 39 #include "bpf_load.h" 40 40 #include "bpf_util.h" 41 - #include "libbpf.h" 41 + #include <bpf/bpf.h> 42 42 43 43 #include "xdpsock.h" 44 44
+3
tools/bpf/bpftool/.gitignore
··· 1 + *.d 2 + bpftool 3 + FEATURE-DUMP.bpftool
+1
tools/bpf/bpftool/map.c
··· 66 66 [BPF_MAP_TYPE_DEVMAP] = "devmap", 67 67 [BPF_MAP_TYPE_SOCKMAP] = "sockmap", 68 68 [BPF_MAP_TYPE_CPUMAP] = "cpumap", 69 + [BPF_MAP_TYPE_SOCKHASH] = "sockhash", 69 70 }; 70 71 71 72 static bool map_is_per_cpu(__u32 type)
+20 -61
tools/bpf/bpftool/map_perf_ring.c
··· 39 39 40 40 struct perf_event_sample { 41 41 struct perf_event_header header; 42 + u64 time; 42 43 __u32 size; 43 44 unsigned char data[]; 44 45 }; ··· 50 49 stop = true; 51 50 } 52 51 53 - static void 54 - print_bpf_output(struct event_ring_info *ring, struct perf_event_sample *e) 52 + static enum bpf_perf_event_ret print_bpf_output(void *event, void *priv) 55 53 { 54 + struct event_ring_info *ring = priv; 55 + struct perf_event_sample *e = event; 56 56 struct { 57 57 struct perf_event_header header; 58 58 __u64 id; 59 59 __u64 lost; 60 - } *lost = (void *)e; 61 - struct timespec ts; 62 - 63 - if (clock_gettime(CLOCK_MONOTONIC, &ts)) { 64 - perror("Can't read clock for timestamp"); 65 - return; 66 - } 60 + } *lost = event; 67 61 68 62 if (json_output) { 69 63 jsonw_start_object(json_wtr); 70 - jsonw_name(json_wtr, "timestamp"); 71 - jsonw_uint(json_wtr, ts.tv_sec * 1000000000ull + ts.tv_nsec); 72 64 jsonw_name(json_wtr, "type"); 73 65 jsonw_uint(json_wtr, e->header.type); 74 66 jsonw_name(json_wtr, "cpu"); ··· 69 75 jsonw_name(json_wtr, "index"); 70 76 jsonw_uint(json_wtr, ring->key); 71 77 if (e->header.type == PERF_RECORD_SAMPLE) { 78 + jsonw_name(json_wtr, "timestamp"); 79 + jsonw_uint(json_wtr, e->time); 72 80 jsonw_name(json_wtr, "data"); 73 81 print_data_json(e->data, e->size); 74 82 } else if (e->header.type == PERF_RECORD_LOST) { ··· 85 89 jsonw_end_object(json_wtr); 86 90 } else { 87 91 if (e->header.type == PERF_RECORD_SAMPLE) { 88 - printf("== @%ld.%ld CPU: %d index: %d =====\n", 89 - (long)ts.tv_sec, ts.tv_nsec, 92 + printf("== @%lld.%09lld CPU: %d index: %d =====\n", 93 + e->time / 1000000000ULL, e->time % 1000000000ULL, 90 94 ring->cpu, ring->key); 91 95 fprint_hex(stdout, e->data, e->size, " "); 92 96 printf("\n"); ··· 97 101 e->header.type, e->header.size); 98 102 } 99 103 } 104 + 105 + return LIBBPF_PERF_EVENT_CONT; 100 106 } 101 107 102 108 static void 103 109 perf_event_read(struct event_ring_info *ring, void **buf, size_t *buf_len) 104 110 
{ 105 - volatile struct perf_event_mmap_page *header = ring->mem; 106 - __u64 buffer_size = MMAP_PAGE_CNT * get_page_size(); 107 - __u64 data_tail = header->data_tail; 108 - __u64 data_head = header->data_head; 109 - void *base, *begin, *end; 111 + enum bpf_perf_event_ret ret; 110 112 111 - asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */ 112 - if (data_head == data_tail) 113 - return; 114 - 115 - base = ((char *)header) + get_page_size(); 116 - 117 - begin = base + data_tail % buffer_size; 118 - end = base + data_head % buffer_size; 119 - 120 - while (begin != end) { 121 - struct perf_event_sample *e; 122 - 123 - e = begin; 124 - if (begin + e->header.size > base + buffer_size) { 125 - long len = base + buffer_size - begin; 126 - 127 - if (*buf_len < e->header.size) { 128 - free(*buf); 129 - *buf = malloc(e->header.size); 130 - if (!*buf) { 131 - fprintf(stderr, 132 - "can't allocate memory"); 133 - stop = true; 134 - return; 135 - } 136 - *buf_len = e->header.size; 137 - } 138 - 139 - memcpy(*buf, begin, len); 140 - memcpy(*buf + len, base, e->header.size - len); 141 - e = (void *)*buf; 142 - begin = base + e->header.size - len; 143 - } else if (begin + e->header.size == base + buffer_size) { 144 - begin = base; 145 - } else { 146 - begin += e->header.size; 147 - } 148 - 149 - print_bpf_output(ring, e); 113 + ret = bpf_perf_event_read_simple(ring->mem, 114 + MMAP_PAGE_CNT * get_page_size(), 115 + get_page_size(), buf, buf_len, 116 + print_bpf_output, ring); 117 + if (ret != LIBBPF_PERF_EVENT_CONT) { 118 + fprintf(stderr, "perf read loop failed with %d\n", ret); 119 + stop = true; 150 120 } 151 - 152 - __sync_synchronize(); /* smp_mb() */ 153 - header->data_tail = data_head; 154 121 } 155 122 156 123 static int perf_mmap_size(void) ··· 144 185 static int bpf_perf_event_open(int map_fd, int key, int cpu) 145 186 { 146 187 struct perf_event_attr attr = { 147 - .sample_type = PERF_SAMPLE_RAW, 188 + .sample_type = PERF_SAMPLE_RAW | 
PERF_SAMPLE_TIME, 148 189 .type = PERF_TYPE_SOFTWARE, 149 190 .config = PERF_COUNT_SW_BPF_OUTPUT, 150 191 };
+18
tools/include/uapi/asm/bitsperlong.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #if defined(__i386__) || defined(__x86_64__) 3 + #include "../../arch/x86/include/uapi/asm/bitsperlong.h" 4 + #elif defined(__aarch64__) 5 + #include "../../arch/arm64/include/uapi/asm/bitsperlong.h" 6 + #elif defined(__powerpc__) 7 + #include "../../arch/powerpc/include/uapi/asm/bitsperlong.h" 8 + #elif defined(__s390__) 9 + #include "../../arch/s390/include/uapi/asm/bitsperlong.h" 10 + #elif defined(__sparc__) 11 + #include "../../arch/sparc/include/uapi/asm/bitsperlong.h" 12 + #elif defined(__mips__) 13 + #include "../../arch/mips/include/uapi/asm/bitsperlong.h" 14 + #elif defined(__ia64__) 15 + #include "../../arch/ia64/include/uapi/asm/bitsperlong.h" 16 + #else 17 + #include <asm-generic/bitsperlong.h> 18 + #endif
+18
tools/include/uapi/asm/errno.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #if defined(__i386__) || defined(__x86_64__) 3 + #include "../../arch/x86/include/uapi/asm/errno.h" 4 + #elif defined(__powerpc__) 5 + #include "../../arch/powerpc/include/uapi/asm/errno.h" 6 + #elif defined(__sparc__) 7 + #include "../../arch/sparc/include/uapi/asm/errno.h" 8 + #elif defined(__alpha__) 9 + #include "../../arch/alpha/include/uapi/asm/errno.h" 10 + #elif defined(__mips__) 11 + #include "../../arch/mips/include/uapi/asm/errno.h" 12 + #elif defined(__ia64__) 13 + #include "../../arch/ia64/include/uapi/asm/errno.h" 14 + #elif defined(__xtensa__) 15 + #include "../../arch/xtensa/include/uapi/asm/errno.h" 16 + #else 17 + #include <asm-generic/errno.h> 18 + #endif
+142 -1
tools/include/uapi/linux/bpf.h
··· 96 96 BPF_PROG_QUERY, 97 97 BPF_RAW_TRACEPOINT_OPEN, 98 98 BPF_BTF_LOAD, 99 + BPF_BTF_GET_FD_BY_ID, 99 100 }; 100 101 101 102 enum bpf_map_type { ··· 117 116 BPF_MAP_TYPE_DEVMAP, 118 117 BPF_MAP_TYPE_SOCKMAP, 119 118 BPF_MAP_TYPE_CPUMAP, 119 + BPF_MAP_TYPE_XSKMAP, 120 + BPF_MAP_TYPE_SOCKHASH, 120 121 }; 121 122 122 123 enum bpf_prog_type { ··· 346 343 __u32 start_id; 347 344 __u32 prog_id; 348 345 __u32 map_id; 346 + __u32 btf_id; 349 347 }; 350 348 __u32 next_id; 351 349 __u32 open_flags; ··· 1829 1825 * Return 1830 1826 * 0 on success, or a negative error in case of failure. 1831 1827 * 1828 + * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags) 1829 + * Description 1830 + * Do FIB lookup in kernel tables using parameters in *params*. 1831 + * If lookup is successful and result shows packet is to be 1832 + * forwarded, the neighbor tables are searched for the nexthop. 1833 + * If successful (ie., FIB lookup shows forwarding and nexthop 1834 + * is resolved), the nexthop address is returned in ipv4_dst, 1835 + * ipv6_dst or mpls_out based on family, smac is set to mac 1836 + * address of egress device, dmac is set to nexthop mac address, 1837 + * rt_metric is set to metric from route. 1838 + * 1839 + * *plen* argument is the size of the passed in struct. 1840 + * *flags* argument can be one or more BPF_FIB_LOOKUP_ flags: 1841 + * 1842 + * **BPF_FIB_LOOKUP_DIRECT** means do a direct table lookup vs 1843 + * full lookup using FIB rules 1844 + * **BPF_FIB_LOOKUP_OUTPUT** means do lookup from an egress 1845 + * perspective (default is ingress) 1846 + * 1847 + * *ctx* is either **struct xdp_md** for XDP programs or 1848 + * **struct sk_buff** tc cls_act programs. 1849 + * 1850 + * Return 1851 + * Egress device index on success, 0 if packet needs to continue 1852 + * up the stack for further processing or a negative error in case 1853 + * of failure. 
1854 + * 1855 + * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags) 1856 + * Description 1857 + * Add an entry to, or update a sockhash *map* referencing sockets. 1858 + * The *skops* is used as a new value for the entry associated to 1859 + * *key*. *flags* is one of: 1860 + * 1861 + * **BPF_NOEXIST** 1862 + * The entry for *key* must not exist in the map. 1863 + * **BPF_EXIST** 1864 + * The entry for *key* must already exist in the map. 1865 + * **BPF_ANY** 1866 + * No condition on the existence of the entry for *key*. 1867 + * 1868 + * If the *map* has eBPF programs (parser and verdict), those will 1869 + * be inherited by the socket being added. If the socket is 1870 + * already attached to eBPF programs, this results in an error. 1871 + * Return 1872 + * 0 on success, or a negative error in case of failure. 1873 + * 1874 + * int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags) 1875 + * Description 1876 + * This helper is used in programs implementing policies at the 1877 + * socket level. If the message *msg* is allowed to pass (i.e. if 1878 + * the verdict eBPF program returns **SK_PASS**), redirect it to 1879 + * the socket referenced by *map* (of type 1880 + * **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and 1881 + * egress interfaces can be used for redirection. The 1882 + * **BPF_F_INGRESS** value in *flags* is used to make the 1883 + * distinction (ingress path is selected if the flag is present, 1884 + * egress path otherwise). This is the only flag supported for now. 1885 + * Return 1886 + * **SK_PASS** on success, or **SK_DROP** on error. 1887 + * 1888 + * int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags) 1889 + * Description 1890 + * This helper is used in programs implementing policies at the 1891 + * skb socket level. If the sk_buff *skb* is allowed to pass (i.e. 
1892 + * if the verdict eBPF program returns **SK_PASS**), redirect it 1893 + * to the socket referenced by *map* (of type 1894 + * **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and 1895 + * egress interfaces can be used for redirection. The 1896 + * **BPF_F_INGRESS** value in *flags* is used to make the 1897 + * distinction (ingress path is selected if the flag is present, 1898 + * egress otherwise). This is the only flag supported for now. 1899 + * Return 1900 + * **SK_PASS** on success, or **SK_DROP** on error. 1832 1901 */ 1833 1902 #define __BPF_FUNC_MAPPER(FN) \ 1834 1903 FN(unspec), \ ··· 1972 1895 FN(xdp_adjust_tail), \ 1973 1896 FN(skb_get_xfrm_state), \ 1974 1897 FN(get_stack), \ 1975 - FN(skb_load_bytes_relative), 1898 + FN(skb_load_bytes_relative), \ 1899 + FN(fib_lookup), \ 1900 + FN(sock_hash_update), \ 1901 + FN(msg_redirect_hash), \ 1902 + FN(sk_redirect_hash), 1976 1903 1977 1904 /* integer value in 'imm' field of BPF_CALL instruction selects which helper 1978 1905 * function eBPF program intends to call ··· 2210 2129 __u32 ifindex; 2211 2130 __u64 netns_dev; 2212 2131 __u64 netns_ino; 2132 + __u32 btf_id; 2133 + __u32 btf_key_id; 2134 + __u32 btf_value_id; 2135 + } __attribute__((aligned(8))); 2136 + 2137 + struct bpf_btf_info { 2138 + __aligned_u64 btf; 2139 + __u32 btf_size; 2140 + __u32 id; 2213 2141 } __attribute__((aligned(8))); 2214 2142 2215 2143 /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed ··· 2397 2307 2398 2308 struct bpf_raw_tracepoint_args { 2399 2309 __u64 args[0]; 2310 + }; 2311 + 2312 + /* DIRECT: Skip the FIB rules and go to FIB table associated with device 2313 + * OUTPUT: Do lookup from egress perspective; default is ingress 2314 + */ 2315 + #define BPF_FIB_LOOKUP_DIRECT BIT(0) 2316 + #define BPF_FIB_LOOKUP_OUTPUT BIT(1) 2317 + 2318 + struct bpf_fib_lookup { 2319 + /* input */ 2320 + __u8 family; /* network family, AF_INET, AF_INET6, AF_MPLS */ 2321 + 2322 + /* set if lookup is to 
consider L4 data - e.g., FIB rules */ 2323 + __u8 l4_protocol; 2324 + __be16 sport; 2325 + __be16 dport; 2326 + 2327 + /* total length of packet from network header - used for MTU check */ 2328 + __u16 tot_len; 2329 + __u32 ifindex; /* L3 device index for lookup */ 2330 + 2331 + union { 2332 + /* inputs to lookup */ 2333 + __u8 tos; /* AF_INET */ 2334 + __be32 flowlabel; /* AF_INET6 */ 2335 + 2336 + /* output: metric of fib result */ 2337 + __u32 rt_metric; 2338 + }; 2339 + 2340 + union { 2341 + __be32 mpls_in; 2342 + __be32 ipv4_src; 2343 + __u32 ipv6_src[4]; /* in6_addr; network order */ 2344 + }; 2345 + 2346 + /* input to bpf_fib_lookup, *dst is destination address. 2347 + * output: bpf_fib_lookup sets to gateway address 2348 + */ 2349 + union { 2350 + /* return for MPLS lookups */ 2351 + __be32 mpls_out[4]; /* support up to 4 labels */ 2352 + __be32 ipv4_dst; 2353 + __u32 ipv6_dst[4]; /* in6_addr; network order */ 2354 + }; 2355 + 2356 + /* output */ 2357 + __be16 h_vlan_proto; 2358 + __be16 h_vlan_TCI; 2359 + __u8 smac[6]; /* ETH_ALEN */ 2360 + __u8 dmac[6]; /* ETH_ALEN */ 2400 2361 }; 2401 2362 2402 2363 #endif /* _UAPI__LINUX_BPF_H__ */
+1 -1
tools/lib/bpf/Makefile
··· 69 69 FEATURE_TESTS = libelf libelf-getphdrnum libelf-mmap bpf 70 70 FEATURE_DISPLAY = libelf bpf 71 71 72 - INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi 72 + INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi -I$(srctree)/tools/perf 73 73 FEATURE_CHECK_CFLAGS-bpf = $(INCLUDES) 74 74 75 75 check_feat := 1
+12
tools/lib/bpf/bpf.c
··· 91 91 attr.btf_fd = create_attr->btf_fd; 92 92 attr.btf_key_id = create_attr->btf_key_id; 93 93 attr.btf_value_id = create_attr->btf_value_id; 94 + attr.map_ifindex = create_attr->map_ifindex; 94 95 95 96 return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); 96 97 } ··· 202 201 attr.log_size = 0; 203 202 attr.log_level = 0; 204 203 attr.kern_version = load_attr->kern_version; 204 + attr.prog_ifindex = load_attr->prog_ifindex; 205 205 memcpy(attr.prog_name, load_attr->name, 206 206 min(name_len, BPF_OBJ_NAME_LEN - 1)); 207 207 ··· 458 456 attr.map_id = id; 459 457 460 458 return sys_bpf(BPF_MAP_GET_FD_BY_ID, &attr, sizeof(attr)); 459 + } 460 + 461 + int bpf_btf_get_fd_by_id(__u32 id) 462 + { 463 + union bpf_attr attr; 464 + 465 + bzero(&attr, sizeof(attr)); 466 + attr.btf_id = id; 467 + 468 + return sys_bpf(BPF_BTF_GET_FD_BY_ID, &attr, sizeof(attr)); 461 469 } 462 470 463 471 int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len)
+3
tools/lib/bpf/bpf.h
··· 38 38 __u32 btf_fd; 39 39 __u32 btf_key_id; 40 40 __u32 btf_value_id; 41 + __u32 map_ifindex; 41 42 }; 42 43 43 44 int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr); ··· 65 64 size_t insns_cnt; 66 65 const char *license; 67 66 __u32 kern_version; 67 + __u32 prog_ifindex; 68 68 }; 69 69 70 70 /* Recommend log buffer size */ ··· 100 98 int bpf_map_get_next_id(__u32 start_id, __u32 *next_id); 101 99 int bpf_prog_get_fd_by_id(__u32 id); 102 100 int bpf_map_get_fd_by_id(__u32 id); 101 + int bpf_btf_get_fd_by_id(__u32 id); 103 102 int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len); 104 103 int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags, 105 104 __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt);
+115 -10
tools/lib/bpf/libbpf.c
··· 31 31 #include <unistd.h> 32 32 #include <fcntl.h> 33 33 #include <errno.h> 34 + #include <perf-sys.h> 34 35 #include <asm/unistd.h> 35 36 #include <linux/err.h> 36 37 #include <linux/kernel.h> ··· 178 177 /* Index in elf obj file, for relocation use. */ 179 178 int idx; 180 179 char *name; 180 + int prog_ifindex; 181 181 char *section_name; 182 182 struct bpf_insn *insns; 183 183 size_t insns_cnt, main_prog_cnt; ··· 214 212 int fd; 215 213 char *name; 216 214 size_t offset; 215 + int map_ifindex; 217 216 struct bpf_map_def def; 218 217 uint32_t btf_key_id; 219 218 uint32_t btf_value_id; ··· 1093 1090 int *pfd = &map->fd; 1094 1091 1095 1092 create_attr.name = map->name; 1093 + create_attr.map_ifindex = map->map_ifindex; 1096 1094 create_attr.map_type = def->type; 1097 1095 create_attr.map_flags = def->map_flags; 1098 1096 create_attr.key_size = def->key_size; ··· 1276 1272 static int 1277 1273 load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type, 1278 1274 const char *name, struct bpf_insn *insns, int insns_cnt, 1279 - char *license, u32 kern_version, int *pfd) 1275 + char *license, u32 kern_version, int *pfd, int prog_ifindex) 1280 1276 { 1281 1277 struct bpf_load_program_attr load_attr; 1282 1278 char *log_buf; ··· 1290 1286 load_attr.insns_cnt = insns_cnt; 1291 1287 load_attr.license = license; 1292 1288 load_attr.kern_version = kern_version; 1289 + load_attr.prog_ifindex = prog_ifindex; 1293 1290 1294 1291 if (!load_attr.insns || !load_attr.insns_cnt) 1295 1292 return -EINVAL; ··· 1372 1367 } 1373 1368 err = load_program(prog->type, prog->expected_attach_type, 1374 1369 prog->name, prog->insns, prog->insns_cnt, 1375 - license, kern_version, &fd); 1370 + license, kern_version, &fd, 1371 + prog->prog_ifindex); 1376 1372 if (!err) 1377 1373 prog->instances.fds[0] = fd; 1378 1374 goto out; ··· 1404 1398 err = load_program(prog->type, prog->expected_attach_type, 1405 1399 prog->name, result.new_insn_ptr, 1406 1400 result.new_insn_cnt, 
1407 - license, kern_version, &fd); 1401 + license, kern_version, &fd, 1402 + prog->prog_ifindex); 1408 1403 1409 1404 if (err) { 1410 1405 pr_warning("Loading the %dth instance of program '%s' failed\n", ··· 1444 1437 return 0; 1445 1438 } 1446 1439 1447 - static int bpf_object__validate(struct bpf_object *obj) 1440 + static bool bpf_prog_type__needs_kver(enum bpf_prog_type type) 1448 1441 { 1449 - if (obj->kern_version == 0) { 1442 + switch (type) { 1443 + case BPF_PROG_TYPE_SOCKET_FILTER: 1444 + case BPF_PROG_TYPE_SCHED_CLS: 1445 + case BPF_PROG_TYPE_SCHED_ACT: 1446 + case BPF_PROG_TYPE_XDP: 1447 + case BPF_PROG_TYPE_CGROUP_SKB: 1448 + case BPF_PROG_TYPE_CGROUP_SOCK: 1449 + case BPF_PROG_TYPE_LWT_IN: 1450 + case BPF_PROG_TYPE_LWT_OUT: 1451 + case BPF_PROG_TYPE_LWT_XMIT: 1452 + case BPF_PROG_TYPE_SOCK_OPS: 1453 + case BPF_PROG_TYPE_SK_SKB: 1454 + case BPF_PROG_TYPE_CGROUP_DEVICE: 1455 + case BPF_PROG_TYPE_SK_MSG: 1456 + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: 1457 + return false; 1458 + case BPF_PROG_TYPE_UNSPEC: 1459 + case BPF_PROG_TYPE_KPROBE: 1460 + case BPF_PROG_TYPE_TRACEPOINT: 1461 + case BPF_PROG_TYPE_PERF_EVENT: 1462 + case BPF_PROG_TYPE_RAW_TRACEPOINT: 1463 + default: 1464 + return true; 1465 + } 1466 + } 1467 + 1468 + static int bpf_object__validate(struct bpf_object *obj, bool needs_kver) 1469 + { 1470 + if (needs_kver && obj->kern_version == 0) { 1450 1471 pr_warning("%s doesn't provide kernel version\n", 1451 1472 obj->path); 1452 1473 return -LIBBPF_ERRNO__KVERSION; ··· 1483 1448 } 1484 1449 1485 1450 static struct bpf_object * 1486 - __bpf_object__open(const char *path, void *obj_buf, size_t obj_buf_sz) 1451 + __bpf_object__open(const char *path, void *obj_buf, size_t obj_buf_sz, 1452 + bool needs_kver) 1487 1453 { 1488 1454 struct bpf_object *obj; 1489 1455 int err; ··· 1502 1466 CHECK_ERR(bpf_object__check_endianness(obj), err, out); 1503 1467 CHECK_ERR(bpf_object__elf_collect(obj), err, out); 1504 1468 CHECK_ERR(bpf_object__collect_reloc(obj), 
err, out); 1505 - CHECK_ERR(bpf_object__validate(obj), err, out); 1469 + CHECK_ERR(bpf_object__validate(obj, needs_kver), err, out); 1506 1470 1507 1471 bpf_object__elf_finish(obj); 1508 1472 return obj; ··· 1519 1483 1520 1484 pr_debug("loading %s\n", path); 1521 1485 1522 - return __bpf_object__open(path, NULL, 0); 1486 + return __bpf_object__open(path, NULL, 0, true); 1523 1487 } 1524 1488 1525 1489 struct bpf_object *bpf_object__open_buffer(void *obj_buf, ··· 1542 1506 pr_debug("loading object '%s' from buffer\n", 1543 1507 name); 1544 1508 1545 - return __bpf_object__open(name, obj_buf, obj_buf_sz); 1509 + return __bpf_object__open(name, obj_buf, obj_buf_sz, true); 1546 1510 } 1547 1511 1548 1512 int bpf_object__unload(struct bpf_object *obj) ··· 2194 2158 enum bpf_attach_type expected_attach_type; 2195 2159 enum bpf_prog_type prog_type; 2196 2160 struct bpf_object *obj; 2161 + struct bpf_map *map; 2197 2162 int section_idx; 2198 2163 int err; 2199 2164 2200 2165 if (!attr) 2201 2166 return -EINVAL; 2167 + if (!attr->file) 2168 + return -EINVAL; 2202 2169 2203 - obj = bpf_object__open(attr->file); 2170 + obj = __bpf_object__open(attr->file, NULL, 0, 2171 + bpf_prog_type__needs_kver(attr->prog_type)); 2204 2172 if (IS_ERR(obj)) 2205 2173 return -ENOENT; 2206 2174 ··· 2214 2174 * section name. 
2215 2175 */ 2216 2176 prog_type = attr->prog_type; 2177 + prog->prog_ifindex = attr->ifindex; 2217 2178 expected_attach_type = attr->expected_attach_type; 2218 2179 if (prog_type == BPF_PROG_TYPE_UNSPEC) { 2219 2180 section_idx = bpf_program__identify_section(prog); ··· 2235 2194 first_prog = prog; 2236 2195 } 2237 2196 2197 + bpf_map__for_each(map, obj) { 2198 + map->map_ifindex = attr->ifindex; 2199 + } 2200 + 2238 2201 if (!first_prog) { 2239 2202 pr_warning("object file doesn't contain bpf program\n"); 2240 2203 bpf_object__close(obj); ··· 2254 2209 *pobj = obj; 2255 2210 *prog_fd = bpf_program__fd(first_prog); 2256 2211 return 0; 2212 + } 2213 + 2214 + enum bpf_perf_event_ret 2215 + bpf_perf_event_read_simple(void *mem, unsigned long size, 2216 + unsigned long page_size, void **buf, size_t *buf_len, 2217 + bpf_perf_event_print_t fn, void *priv) 2218 + { 2219 + volatile struct perf_event_mmap_page *header = mem; 2220 + __u64 data_tail = header->data_tail; 2221 + __u64 data_head = header->data_head; 2222 + void *base, *begin, *end; 2223 + int ret; 2224 + 2225 + asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */ 2226 + if (data_head == data_tail) 2227 + return LIBBPF_PERF_EVENT_CONT; 2228 + 2229 + base = ((char *)header) + page_size; 2230 + 2231 + begin = base + data_tail % size; 2232 + end = base + data_head % size; 2233 + 2234 + while (begin != end) { 2235 + struct perf_event_header *ehdr; 2236 + 2237 + ehdr = begin; 2238 + if (begin + ehdr->size > base + size) { 2239 + long len = base + size - begin; 2240 + 2241 + if (*buf_len < ehdr->size) { 2242 + free(*buf); 2243 + *buf = malloc(ehdr->size); 2244 + if (!*buf) { 2245 + ret = LIBBPF_PERF_EVENT_ERROR; 2246 + break; 2247 + } 2248 + *buf_len = ehdr->size; 2249 + } 2250 + 2251 + memcpy(*buf, begin, len); 2252 + memcpy(*buf + len, base, ehdr->size - len); 2253 + ehdr = (void *)*buf; 2254 + begin = base + ehdr->size - len; 2255 + } else if (begin + ehdr->size == base + size) { 2256 + begin = 
base; 2257 + } else { 2258 + begin += ehdr->size; 2259 + } 2260 + 2261 + ret = fn(ehdr, priv); 2262 + if (ret != LIBBPF_PERF_EVENT_CONT) 2263 + break; 2264 + 2265 + data_tail += ehdr->size; 2266 + } 2267 + 2268 + __sync_synchronize(); /* smp_mb() */ 2269 + header->data_tail = data_tail; 2270 + 2271 + return ret; 2257 2272 }
+38 -24
tools/lib/bpf/libbpf.h
··· 52 52 int libbpf_strerror(int err, char *buf, size_t size); 53 53 54 54 /* 55 - * In include/linux/compiler-gcc.h, __printf is defined. However 56 - * it should be better if libbpf.h doesn't depend on Linux header file. 55 + * __printf is defined in include/linux/compiler-gcc.h. However, 56 + * it would be better if libbpf.h didn't depend on Linux header files. 57 57 * So instead of __printf, here we use gcc attribute directly. 58 58 */ 59 59 typedef int (*libbpf_print_fn_t)(const char *, ...) ··· 92 92 bpf_object_clear_priv_t clear_priv); 93 93 void *bpf_object__priv(struct bpf_object *prog); 94 94 95 - /* Accessors of bpf_program. */ 95 + /* Accessors of bpf_program */ 96 96 struct bpf_program; 97 97 struct bpf_program *bpf_program__next(struct bpf_program *prog, 98 98 struct bpf_object *obj); ··· 121 121 122 122 /* 123 123 * Libbpf allows callers to adjust BPF programs before being loaded 124 - * into kernel. One program in an object file can be transform into 125 - * multiple variants to be attached to different code. 124 + * into kernel. One program in an object file can be transformed into 125 + * multiple variants to be attached to different hooks. 126 126 * 127 127 * bpf_program_prep_t, bpf_program__set_prep and bpf_program__nth_fd 128 - * are APIs for this propose. 128 + * form an API for this purpose. 129 129 * 130 130 * - bpf_program_prep_t: 131 - * It defines 'preprocessor', which is a caller defined function 131 + * Defines a 'preprocessor', which is a caller defined function 132 132 * passed to libbpf through bpf_program__set_prep(), and will be 133 133 * called before program is loaded. The processor should adjust 134 - * the program one time for each instances according to the number 134 + * the program one time for each instance according to the instance id 135 135 * passed to it. 136 136 * 137 137 * - bpf_program__set_prep: 138 - * Attachs a preprocessor to a BPF program. 
The number of instances 139 - * whould be created is also passed through this function. 138 + * Attaches a preprocessor to a BPF program. The number of instances 139 + * that should be created is also passed through this function. 140 140 * 141 141 * - bpf_program__nth_fd: 142 - * After the program is loaded, get resuling fds from bpf program for 143 - * each instances. 142 + * After the program is loaded, get resulting FD of a given instance 143 + * of the BPF program. 144 144 * 145 - * If bpf_program__set_prep() is not used, the program whould be loaded 145 + * If bpf_program__set_prep() is not used, the program would be loaded 146 146 * without adjustment during bpf_object__load(). The program has only 147 147 * one instance. In this case bpf_program__fd(prog) is equal to 148 148 * bpf_program__nth_fd(prog, 0). ··· 156 156 struct bpf_insn *new_insn_ptr; 157 157 int new_insn_cnt; 158 158 159 - /* If not NULL, result fd is set to it */ 159 + /* If not NULL, result FD is written to it. */ 160 160 int *pfd; 161 161 }; 162 162 ··· 169 169 * - res: Output parameter, result of transformation. 170 170 * 171 171 * Return value: 172 - * - Zero: pre-processing success. 173 - * - Non-zero: pre-processing, stop loading. 172 + * - Zero: pre-processing success. 173 + * - Non-zero: pre-processing error, stop loading. 174 174 */ 175 175 typedef int (*bpf_program_prep_t)(struct bpf_program *prog, int n, 176 176 struct bpf_insn *insns, int insns_cnt, ··· 182 182 int bpf_program__nth_fd(struct bpf_program *prog, int n); 183 183 184 184 /* 185 - * Adjust type of bpf program. Default is kprobe. 185 + * Adjust type of BPF program. Default is kprobe. 
186 186 */ 187 187 int bpf_program__set_socket_filter(struct bpf_program *prog); 188 188 int bpf_program__set_tracepoint(struct bpf_program *prog); ··· 206 206 bool bpf_program__is_perf_event(struct bpf_program *prog); 207 207 208 208 /* 209 - * We don't need __attribute__((packed)) now since it is 210 - * unnecessary for 'bpf_map_def' because they are all aligned. 211 - * In addition, using it will trigger -Wpacked warning message, 212 - * and will be treated as an error due to -Werror. 209 + * No need for __attribute__((packed)), all members of 'bpf_map_def' 210 + * are all aligned. In addition, using __attribute__((packed)) 211 + * would trigger a -Wpacked warning message, and lead to an error 212 + * if -Werror is set. 213 213 */ 214 214 struct bpf_map_def { 215 215 unsigned int type; ··· 220 220 }; 221 221 222 222 /* 223 - * There is another 'struct bpf_map' in include/linux/map.h. However, 224 - * it is not a uapi header so no need to consider name clash. 223 + * The 'struct bpf_map' in include/linux/bpf.h is internal to the kernel, 224 + * so no need to worry about a name clash. 225 225 */ 226 226 struct bpf_map; 227 227 struct bpf_map * ··· 229 229 230 230 /* 231 231 * Get bpf_map through the offset of corresponding struct bpf_map_def 232 - * in the bpf object file. 232 + * in the BPF object file. 
233 233 */ 234 234 struct bpf_map * 235 235 bpf_object__find_map_by_offset(struct bpf_object *obj, size_t offset); ··· 259 259 const char *file; 260 260 enum bpf_prog_type prog_type; 261 261 enum bpf_attach_type expected_attach_type; 262 + int ifindex; 262 263 }; 263 264 264 265 int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr, ··· 268 267 struct bpf_object **pobj, int *prog_fd); 269 268 270 269 int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags); 270 + 271 + enum bpf_perf_event_ret { 272 + LIBBPF_PERF_EVENT_DONE = 0, 273 + LIBBPF_PERF_EVENT_ERROR = -1, 274 + LIBBPF_PERF_EVENT_CONT = -2, 275 + }; 276 + 277 + typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(void *event, 278 + void *priv); 279 + int bpf_perf_event_read_simple(void *mem, unsigned long size, 280 + unsigned long page_size, 281 + void **buf, size_t *buf_len, 282 + bpf_perf_event_print_t fn, void *priv); 271 283 #endif
+1
tools/testing/selftests/bpf/.gitignore
··· 16 16 test_sock_addr 17 17 urandom_read 18 18 test_btf 19 + test_sockmap
+6 -6
tools/testing/selftests/bpf/Makefile
··· 10 10 GENFLAGS := -DHAVE_GENHDR 11 11 endif 12 12 13 - CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include 13 + CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(BPFDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include 14 14 LDLIBS += -lcap -lelf -lrt -lpthread 15 15 16 16 TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read ··· 19 19 $(TEST_CUSTOM_PROGS): urandom_read 20 20 21 21 urandom_read: urandom_read.c 22 - $(CC) -o $(TEST_CUSTOM_PROGS) -static $< 22 + $(CC) -o $(TEST_CUSTOM_PROGS) -static $< -Wl,--build-id 23 23 24 24 # Order correspond to 'make run_tests' order 25 25 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \ ··· 33 33 sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \ 34 34 sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \ 35 35 test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \ 36 - test_get_stack_rawtp.o 36 + test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o 37 37 38 38 # Order correspond to 'make run_tests' order 39 39 TEST_PROGS := test_kmod.sh \ ··· 90 90 $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline 91 91 $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline 92 92 93 - BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help |& grep dwarfris) 94 - BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help |& grep BTF) 95 - BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --version |& grep LLVM) 93 + BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris) 94 + BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF) 95 + BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --version 2>&1 | grep LLVM) 96 96 97 97 ifneq ($(BTF_LLC_PROBE),) 98 98 ifneq ($(BTF_PAHOLE_PROBE),)
+11
tools/testing/selftests/bpf/bpf_helpers.h
··· 75 75 (void *) BPF_FUNC_sock_ops_cb_flags_set; 76 76 static int (*bpf_sk_redirect_map)(void *ctx, void *map, int key, int flags) = 77 77 (void *) BPF_FUNC_sk_redirect_map; 78 + static int (*bpf_sk_redirect_hash)(void *ctx, void *map, void *key, int flags) = 79 + (void *) BPF_FUNC_sk_redirect_hash; 78 80 static int (*bpf_sock_map_update)(void *map, void *key, void *value, 79 81 unsigned long long flags) = 80 82 (void *) BPF_FUNC_sock_map_update; 83 + static int (*bpf_sock_hash_update)(void *map, void *key, void *value, 84 + unsigned long long flags) = 85 + (void *) BPF_FUNC_sock_hash_update; 81 86 static int (*bpf_perf_event_read_value)(void *map, unsigned long long flags, 82 87 void *buf, unsigned int buf_size) = 83 88 (void *) BPF_FUNC_perf_event_read_value; ··· 93 88 (void *) BPF_FUNC_override_return; 94 89 static int (*bpf_msg_redirect_map)(void *ctx, void *map, int key, int flags) = 95 90 (void *) BPF_FUNC_msg_redirect_map; 91 + static int (*bpf_msg_redirect_hash)(void *ctx, 92 + void *map, void *key, int flags) = 93 + (void *) BPF_FUNC_msg_redirect_hash; 96 94 static int (*bpf_msg_apply_bytes)(void *ctx, int len) = 97 95 (void *) BPF_FUNC_msg_apply_bytes; 98 96 static int (*bpf_msg_cork_bytes)(void *ctx, int len) = ··· 111 103 (void *) BPF_FUNC_skb_get_xfrm_state; 112 104 static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) = 113 105 (void *) BPF_FUNC_get_stack; 106 + static int (*bpf_fib_lookup)(void *ctx, struct bpf_fib_lookup *params, 107 + int plen, __u32 flags) = 108 + (void *) BPF_FUNC_fib_lookup; 114 109 115 110 /* llvm builtin functions that eBPF C program may use to 116 111 * emit BPF_LD_ABS and BPF_LD_IND instructions
+80
tools/testing/selftests/bpf/bpf_rand.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef __BPF_RAND__ 3 + #define __BPF_RAND__ 4 + 5 + #include <stdint.h> 6 + #include <stdlib.h> 7 + #include <time.h> 8 + 9 + static inline uint64_t bpf_rand_mask(uint64_t mask) 10 + { 11 + return (((uint64_t)(uint32_t)rand()) | 12 + ((uint64_t)(uint32_t)rand() << 32)) & mask; 13 + } 14 + 15 + #define bpf_rand_ux(x, m) \ 16 + static inline uint64_t bpf_rand_u##x(int shift) \ 17 + { \ 18 + return bpf_rand_mask((m)) << shift; \ 19 + } 20 + 21 + bpf_rand_ux( 8, 0xffULL) 22 + bpf_rand_ux(16, 0xffffULL) 23 + bpf_rand_ux(24, 0xffffffULL) 24 + bpf_rand_ux(32, 0xffffffffULL) 25 + bpf_rand_ux(40, 0xffffffffffULL) 26 + bpf_rand_ux(48, 0xffffffffffffULL) 27 + bpf_rand_ux(56, 0xffffffffffffffULL) 28 + bpf_rand_ux(64, 0xffffffffffffffffULL) 29 + 30 + static inline void bpf_semi_rand_init(void) 31 + { 32 + srand(time(NULL)); 33 + } 34 + 35 + static inline uint64_t bpf_semi_rand_get(void) 36 + { 37 + switch (rand() % 39) { 38 + case 0: return 0x000000ff00000000ULL | bpf_rand_u8(0); 39 + case 1: return 0xffffffff00000000ULL | bpf_rand_u16(0); 40 + case 2: return 0x00000000ffff0000ULL | bpf_rand_u16(0); 41 + case 3: return 0x8000000000000000ULL | bpf_rand_u32(0); 42 + case 4: return 0x00000000f0000000ULL | bpf_rand_u32(0); 43 + case 5: return 0x0000000100000000ULL | bpf_rand_u24(0); 44 + case 6: return 0x800ff00000000000ULL | bpf_rand_u32(0); 45 + case 7: return 0x7fffffff00000000ULL | bpf_rand_u32(0); 46 + case 8: return 0xffffffffffffff00ULL ^ bpf_rand_u32(24); 47 + case 9: return 0xffffffffffffff00ULL | bpf_rand_u8(0); 48 + case 10: return 0x0000000010000000ULL | bpf_rand_u32(0); 49 + case 11: return 0xf000000000000000ULL | bpf_rand_u8(0); 50 + case 12: return 0x0000f00000000000ULL | bpf_rand_u8(8); 51 + case 13: return 0x000000000f000000ULL | bpf_rand_u8(16); 52 + case 14: return 0x0000000000000f00ULL | bpf_rand_u8(32); 53 + case 15: return 0x00fff00000000f00ULL | bpf_rand_u8(48); 54 + case 16: return 0x00007fffffffffffULL ^ 
bpf_rand_u32(1); 55 + case 17: return 0xffff800000000000ULL | bpf_rand_u8(4); 56 + case 18: return 0xffff800000000000ULL | bpf_rand_u8(20); 57 + case 19: return (0xffffffc000000000ULL + 0x80000ULL) | bpf_rand_u32(0); 58 + case 20: return (0xffffffc000000000ULL - 0x04000000ULL) | bpf_rand_u32(0); 59 + case 21: return 0x0000000000000000ULL | bpf_rand_u8(55) | bpf_rand_u32(20); 60 + case 22: return 0xffffffffffffffffULL ^ bpf_rand_u8(3) ^ bpf_rand_u32(40); 61 + case 23: return 0x0000000000000000ULL | bpf_rand_u8(bpf_rand_u8(0) % 64); 62 + case 24: return 0x0000000000000000ULL | bpf_rand_u16(bpf_rand_u8(0) % 64); 63 + case 25: return 0xffffffffffffffffULL ^ bpf_rand_u8(bpf_rand_u8(0) % 64); 64 + case 26: return 0xffffffffffffffffULL ^ bpf_rand_u40(bpf_rand_u8(0) % 64); 65 + case 27: return 0x0000800000000000ULL; 66 + case 28: return 0x8000000000000000ULL; 67 + case 29: return 0x0000000000000000ULL; 68 + case 30: return 0xffffffffffffffffULL; 69 + case 31: return bpf_rand_u16(bpf_rand_u8(0) % 64); 70 + case 32: return bpf_rand_u24(bpf_rand_u8(0) % 64); 71 + case 33: return bpf_rand_u32(bpf_rand_u8(0) % 64); 72 + case 34: return bpf_rand_u40(bpf_rand_u8(0) % 64); 73 + case 35: return bpf_rand_u48(bpf_rand_u8(0) % 64); 74 + case 36: return bpf_rand_u56(bpf_rand_u8(0) % 64); 75 + case 37: return bpf_rand_u64(bpf_rand_u8(0) % 64); 76 + default: return bpf_rand_u64(0); 77 + } 78 + } 79 + 80 + #endif /* __BPF_RAND__ */
+372 -112
tools/testing/selftests/bpf/test_btf.c
··· 20 20 21 21 #include "bpf_rlimit.h" 22 22 23 + static uint32_t pass_cnt; 24 + static uint32_t error_cnt; 25 + static uint32_t skip_cnt; 26 + 27 + #define CHECK(condition, format...) ({ \ 28 + int __ret = !!(condition); \ 29 + if (__ret) { \ 30 + fprintf(stderr, "%s:%d:FAIL ", __func__, __LINE__); \ 31 + fprintf(stderr, format); \ 32 + } \ 33 + __ret; \ 34 + }) 35 + 36 + static int count_result(int err) 37 + { 38 + if (err) 39 + error_cnt++; 40 + else 41 + pass_cnt++; 42 + 43 + fprintf(stderr, "\n"); 44 + return err; 45 + } 46 + 23 47 #define min(a, b) ((a) < (b) ? (a) : (b)) 24 48 #define __printf(a, b) __attribute__((format(printf, a, b))) 25 49 ··· 918 894 void *raw_btf; 919 895 920 896 type_sec_size = get_type_sec_size(raw_types); 921 - if (type_sec_size < 0) { 922 - fprintf(stderr, "Cannot get nr_raw_types\n"); 897 + if (CHECK(type_sec_size < 0, "Cannot get nr_raw_types")) 923 898 return NULL; 924 - } 925 899 926 900 size_needed = sizeof(*hdr) + type_sec_size + str_sec_size; 927 901 raw_btf = malloc(size_needed); 928 - if (!raw_btf) { 929 - fprintf(stderr, "Cannot allocate memory for raw_btf\n"); 902 + if (CHECK(!raw_btf, "Cannot allocate memory for raw_btf")) 930 903 return NULL; 931 - } 932 904 933 905 /* Copy header */ 934 906 memcpy(raw_btf, hdr, sizeof(*hdr)); ··· 935 915 for (i = 0; i < type_sec_size / sizeof(raw_types[0]); i++) { 936 916 if (raw_types[i] == NAME_TBD) { 937 917 next_str = get_next_str(next_str, end_str); 938 - if (!next_str) { 939 - fprintf(stderr, "Error in getting next_str\n"); 918 + if (CHECK(!next_str, "Error in getting next_str")) { 940 919 free(raw_btf); 941 920 return NULL; 942 921 } ··· 992 973 free(raw_btf); 993 974 994 975 err = ((btf_fd == -1) != test->btf_load_err); 995 - if (err) 996 - fprintf(stderr, "btf_load_err:%d btf_fd:%d\n", 997 - test->btf_load_err, btf_fd); 976 + CHECK(err, "btf_fd:%d test->btf_load_err:%u", 977 + btf_fd, test->btf_load_err); 998 978 999 979 if (err || btf_fd == -1) 1000 980 goto done; ··· 1010 
992 map_fd = bpf_create_map_xattr(&create_attr); 1011 993 1012 994 err = ((map_fd == -1) != test->map_create_err); 1013 - if (err) 1014 - fprintf(stderr, "map_create_err:%d map_fd:%d\n", 1015 - test->map_create_err, map_fd); 995 + CHECK(err, "map_fd:%d test->map_create_err:%u", 996 + map_fd, test->map_create_err); 1016 997 1017 998 done: 1018 999 if (!err) 1019 - fprintf(stderr, "OK\n"); 1000 + fprintf(stderr, "OK"); 1020 1001 1021 1002 if (*btf_log_buf && (err || args.always_log)) 1022 - fprintf(stderr, "%s\n", btf_log_buf); 1003 + fprintf(stderr, "\n%s", btf_log_buf); 1023 1004 1024 1005 if (btf_fd != -1) 1025 1006 close(btf_fd); ··· 1034 1017 int err = 0; 1035 1018 1036 1019 if (args.raw_test_num) 1037 - return do_test_raw(args.raw_test_num); 1020 + return count_result(do_test_raw(args.raw_test_num)); 1038 1021 1039 1022 for (i = 1; i <= ARRAY_SIZE(raw_tests); i++) 1040 - err |= do_test_raw(i); 1023 + err |= count_result(do_test_raw(i)); 1041 1024 1042 1025 return err; 1043 1026 } ··· 1047 1030 const char *str_sec; 1048 1031 __u32 raw_types[MAX_NR_RAW_TYPES]; 1049 1032 __u32 str_sec_size; 1050 - int info_size_delta; 1033 + int btf_size_delta; 1034 + int (*special_test)(unsigned int test_num); 1051 1035 }; 1036 + 1037 + static int test_big_btf_info(unsigned int test_num); 1038 + static int test_btf_id(unsigned int test_num); 1052 1039 1053 1040 const struct btf_get_info_test get_info_tests[] = { 1054 1041 { ··· 1064 1043 }, 1065 1044 .str_sec = "", 1066 1045 .str_sec_size = sizeof(""), 1067 - .info_size_delta = 1, 1046 + .btf_size_delta = 1, 1068 1047 }, 1069 1048 { 1070 1049 .descr = "== raw_btf_size-3", ··· 1075 1054 }, 1076 1055 .str_sec = "", 1077 1056 .str_sec_size = sizeof(""), 1078 - .info_size_delta = -3, 1057 + .btf_size_delta = -3, 1058 + }, 1059 + { 1060 + .descr = "Large bpf_btf_info", 1061 + .raw_types = { 1062 + /* int */ /* [1] */ 1063 + BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), 1064 + BTF_END_RAW, 1065 + }, 1066 + .str_sec = "", 1067 + 
.str_sec_size = sizeof(""), 1068 + .special_test = test_big_btf_info, 1069 + }, 1070 + { 1071 + .descr = "BTF ID", 1072 + .raw_types = { 1073 + /* int */ /* [1] */ 1074 + BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), 1075 + /* unsigned int */ /* [2] */ 1076 + BTF_TYPE_INT_ENC(0, 0, 0, 32, 4), 1077 + BTF_END_RAW, 1078 + }, 1079 + .str_sec = "", 1080 + .str_sec_size = sizeof(""), 1081 + .special_test = test_btf_id, 1079 1082 }, 1080 1083 }; 1081 1084 1082 - static int do_test_get_info(unsigned int test_num) 1085 + static inline __u64 ptr_to_u64(const void *ptr) 1086 + { 1087 + return (__u64)(unsigned long)ptr; 1088 + } 1089 + 1090 + static int test_big_btf_info(unsigned int test_num) 1083 1091 { 1084 1092 const struct btf_get_info_test *test = &get_info_tests[test_num - 1]; 1085 - unsigned int raw_btf_size, user_btf_size, expected_nbytes; 1086 1093 uint8_t *raw_btf = NULL, *user_btf = NULL; 1094 + unsigned int raw_btf_size; 1095 + struct { 1096 + struct bpf_btf_info info; 1097 + uint64_t garbage; 1098 + } info_garbage; 1099 + struct bpf_btf_info *info; 1087 1100 int btf_fd = -1, err; 1088 - 1089 - fprintf(stderr, "BTF GET_INFO_BY_ID test[%u] (%s): ", 1090 - test_num, test->descr); 1101 + uint32_t info_len; 1091 1102 1092 1103 raw_btf = btf_raw_create(&hdr_tmpl, 1093 1104 test->raw_types, ··· 1133 1080 *btf_log_buf = '\0'; 1134 1081 1135 1082 user_btf = malloc(raw_btf_size); 1136 - if (!user_btf) { 1137 - fprintf(stderr, "Cannot allocate memory for user_btf\n"); 1083 + if (CHECK(!user_btf, "!user_btf")) { 1138 1084 err = -1; 1139 1085 goto done; 1140 1086 } ··· 1141 1089 btf_fd = bpf_load_btf(raw_btf, raw_btf_size, 1142 1090 btf_log_buf, BTF_LOG_BUF_SIZE, 1143 1091 args.always_log); 1144 - if (btf_fd == -1) { 1145 - fprintf(stderr, "bpf_load_btf:%s(%d)\n", 1146 - strerror(errno), errno); 1092 + if (CHECK(btf_fd == -1, "errno:%d", errno)) { 1147 1093 err = -1; 1148 1094 goto done; 1149 1095 } 1150 1096 1151 - user_btf_size = (int)raw_btf_size + test->info_size_delta; 
1097 + /*
1098 + * GET_INFO should error out if the userspace info
1099 + * has non-zero trailing bytes.
1100 + */
1101 + info = &info_garbage.info;
1102 + memset(info, 0, sizeof(*info));
1103 + info_garbage.garbage = 0xdeadbeef;
1104 + info_len = sizeof(info_garbage);
1105 + info->btf = ptr_to_u64(user_btf);
1106 + info->btf_size = raw_btf_size;
1107 +
1108 + err = bpf_obj_get_info_by_fd(btf_fd, info, &info_len);
1109 + if (CHECK(!err, "!err")) {
1110 + err = -1;
1111 + goto done;
1112 + }
1113 +
1114 + /*
1115 + * GET_INFO should succeed even if info_len is larger than
1116 + * what the kernel supports, as long as the trailing bytes
1117 + * are zero. The kernel-supported info length should also
1118 + * be returned to userspace.
1119 + */
1120 + info_garbage.garbage = 0;
1121 + err = bpf_obj_get_info_by_fd(btf_fd, info, &info_len);
1122 + if (CHECK(err || info_len != sizeof(*info),
1123 + "err:%d errno:%d info_len:%u sizeof(*info):%lu",
1124 + err, errno, info_len, sizeof(*info))) {
1125 + err = -1;
1126 + goto done;
1127 + }
1128 +
1129 + fprintf(stderr, "OK");
1130 +
1131 + done:
1132 + if (*btf_log_buf && (err || args.always_log))
1133 + fprintf(stderr, "\n%s", btf_log_buf);
1134 +
1135 + free(raw_btf);
1136 + free(user_btf);
1137 +
1138 + if (btf_fd != -1)
1139 + close(btf_fd);
1140 +
1141 + return err;
1142 + }
1143 +
1144 + static int test_btf_id(unsigned int test_num)
1145 + {
1146 + const struct btf_get_info_test *test = &get_info_tests[test_num - 1];
1147 + struct bpf_create_map_attr create_attr = {};
1148 + uint8_t *raw_btf = NULL, *user_btf[2] = {};
1149 + int btf_fd[2] = {-1, -1}, map_fd = -1;
1150 + struct bpf_map_info map_info = {};
1151 + struct bpf_btf_info info[2] = {};
1152 + unsigned int raw_btf_size;
1153 + uint32_t info_len;
1154 + int err, i, ret;
1155 +
1156 + raw_btf = btf_raw_create(&hdr_tmpl,
1157 + test->raw_types,
1158 + test->str_sec,
1159 + test->str_sec_size,
1160 + &raw_btf_size);
1161 +
1162 + if (!raw_btf)
1163 + return -1;
1164 +
1165 +
*btf_log_buf = '\0'; 1166 + 1167 + for (i = 0; i < 2; i++) { 1168 + user_btf[i] = malloc(raw_btf_size); 1169 + if (CHECK(!user_btf[i], "!user_btf[%d]", i)) { 1170 + err = -1; 1171 + goto done; 1172 + } 1173 + info[i].btf = ptr_to_u64(user_btf[i]); 1174 + info[i].btf_size = raw_btf_size; 1175 + } 1176 + 1177 + btf_fd[0] = bpf_load_btf(raw_btf, raw_btf_size, 1178 + btf_log_buf, BTF_LOG_BUF_SIZE, 1179 + args.always_log); 1180 + if (CHECK(btf_fd[0] == -1, "errno:%d", errno)) { 1181 + err = -1; 1182 + goto done; 1183 + } 1184 + 1185 + /* Test BPF_OBJ_GET_INFO_BY_ID on btf_id */ 1186 + info_len = sizeof(info[0]); 1187 + err = bpf_obj_get_info_by_fd(btf_fd[0], &info[0], &info_len); 1188 + if (CHECK(err, "errno:%d", errno)) { 1189 + err = -1; 1190 + goto done; 1191 + } 1192 + 1193 + btf_fd[1] = bpf_btf_get_fd_by_id(info[0].id); 1194 + if (CHECK(btf_fd[1] == -1, "errno:%d", errno)) { 1195 + err = -1; 1196 + goto done; 1197 + } 1198 + 1199 + ret = 0; 1200 + err = bpf_obj_get_info_by_fd(btf_fd[1], &info[1], &info_len); 1201 + if (CHECK(err || info[0].id != info[1].id || 1202 + info[0].btf_size != info[1].btf_size || 1203 + (ret = memcmp(user_btf[0], user_btf[1], info[0].btf_size)), 1204 + "err:%d errno:%d id0:%u id1:%u btf_size0:%u btf_size1:%u memcmp:%d", 1205 + err, errno, info[0].id, info[1].id, 1206 + info[0].btf_size, info[1].btf_size, ret)) { 1207 + err = -1; 1208 + goto done; 1209 + } 1210 + 1211 + /* Test btf members in struct bpf_map_info */ 1212 + create_attr.name = "test_btf_id"; 1213 + create_attr.map_type = BPF_MAP_TYPE_ARRAY; 1214 + create_attr.key_size = sizeof(int); 1215 + create_attr.value_size = sizeof(unsigned int); 1216 + create_attr.max_entries = 4; 1217 + create_attr.btf_fd = btf_fd[0]; 1218 + create_attr.btf_key_id = 1; 1219 + create_attr.btf_value_id = 2; 1220 + 1221 + map_fd = bpf_create_map_xattr(&create_attr); 1222 + if (CHECK(map_fd == -1, "errno:%d", errno)) { 1223 + err = -1; 1224 + goto done; 1225 + } 1226 + 1227 + info_len = sizeof(map_info); 
1228 + err = bpf_obj_get_info_by_fd(map_fd, &map_info, &info_len); 1229 + if (CHECK(err || map_info.btf_id != info[0].id || 1230 + map_info.btf_key_id != 1 || map_info.btf_value_id != 2, 1231 + "err:%d errno:%d info.id:%u btf_id:%u btf_key_id:%u btf_value_id:%u", 1232 + err, errno, info[0].id, map_info.btf_id, map_info.btf_key_id, 1233 + map_info.btf_value_id)) { 1234 + err = -1; 1235 + goto done; 1236 + } 1237 + 1238 + for (i = 0; i < 2; i++) { 1239 + close(btf_fd[i]); 1240 + btf_fd[i] = -1; 1241 + } 1242 + 1243 + /* Test BTF ID is removed from the kernel */ 1244 + btf_fd[0] = bpf_btf_get_fd_by_id(map_info.btf_id); 1245 + if (CHECK(btf_fd[0] == -1, "errno:%d", errno)) { 1246 + err = -1; 1247 + goto done; 1248 + } 1249 + close(btf_fd[0]); 1250 + btf_fd[0] = -1; 1251 + 1252 + /* The map holds the last ref to BTF and its btf_id */ 1253 + close(map_fd); 1254 + map_fd = -1; 1255 + btf_fd[0] = bpf_btf_get_fd_by_id(map_info.btf_id); 1256 + if (CHECK(btf_fd[0] != -1, "BTF lingers")) { 1257 + err = -1; 1258 + goto done; 1259 + } 1260 + 1261 + fprintf(stderr, "OK"); 1262 + 1263 + done: 1264 + if (*btf_log_buf && (err || args.always_log)) 1265 + fprintf(stderr, "\n%s", btf_log_buf); 1266 + 1267 + free(raw_btf); 1268 + if (map_fd != -1) 1269 + close(map_fd); 1270 + for (i = 0; i < 2; i++) { 1271 + free(user_btf[i]); 1272 + if (btf_fd[i] != -1) 1273 + close(btf_fd[i]); 1274 + } 1275 + 1276 + return err; 1277 + } 1278 + 1279 + static int do_test_get_info(unsigned int test_num) 1280 + { 1281 + const struct btf_get_info_test *test = &get_info_tests[test_num - 1]; 1282 + unsigned int raw_btf_size, user_btf_size, expected_nbytes; 1283 + uint8_t *raw_btf = NULL, *user_btf = NULL; 1284 + struct bpf_btf_info info = {}; 1285 + int btf_fd = -1, err, ret; 1286 + uint32_t info_len; 1287 + 1288 + fprintf(stderr, "BTF GET_INFO test[%u] (%s): ", 1289 + test_num, test->descr); 1290 + 1291 + if (test->special_test) 1292 + return test->special_test(test_num); 1293 + 1294 + raw_btf = 
btf_raw_create(&hdr_tmpl, 1295 + test->raw_types, 1296 + test->str_sec, 1297 + test->str_sec_size, 1298 + &raw_btf_size); 1299 + 1300 + if (!raw_btf) 1301 + return -1; 1302 + 1303 + *btf_log_buf = '\0'; 1304 + 1305 + user_btf = malloc(raw_btf_size); 1306 + if (CHECK(!user_btf, "!user_btf")) { 1307 + err = -1; 1308 + goto done; 1309 + } 1310 + 1311 + btf_fd = bpf_load_btf(raw_btf, raw_btf_size, 1312 + btf_log_buf, BTF_LOG_BUF_SIZE, 1313 + args.always_log); 1314 + if (CHECK(btf_fd == -1, "errno:%d", errno)) { 1315 + err = -1; 1316 + goto done; 1317 + } 1318 + 1319 + user_btf_size = (int)raw_btf_size + test->btf_size_delta; 1152 1320 expected_nbytes = min(raw_btf_size, user_btf_size); 1153 1321 if (raw_btf_size > expected_nbytes) 1154 1322 memset(user_btf + expected_nbytes, 0xff, 1155 1323 raw_btf_size - expected_nbytes); 1156 1324 1157 - err = bpf_obj_get_info_by_fd(btf_fd, user_btf, &user_btf_size); 1158 - if (err || user_btf_size != raw_btf_size || 1159 - memcmp(raw_btf, user_btf, expected_nbytes)) { 1160 - fprintf(stderr, 1161 - "err:%d(errno:%d) raw_btf_size:%u user_btf_size:%u expected_nbytes:%u memcmp:%d\n", 1162 - err, errno, 1163 - raw_btf_size, user_btf_size, expected_nbytes, 1164 - memcmp(raw_btf, user_btf, expected_nbytes)); 1325 + info_len = sizeof(info); 1326 + info.btf = ptr_to_u64(user_btf); 1327 + info.btf_size = user_btf_size; 1328 + 1329 + ret = 0; 1330 + err = bpf_obj_get_info_by_fd(btf_fd, &info, &info_len); 1331 + if (CHECK(err || !info.id || info_len != sizeof(info) || 1332 + info.btf_size != raw_btf_size || 1333 + (ret = memcmp(raw_btf, user_btf, expected_nbytes)), 1334 + "err:%d errno:%d info.id:%u info_len:%u sizeof(info):%lu raw_btf_size:%u info.btf_size:%u expected_nbytes:%u memcmp:%d", 1335 + err, errno, info.id, info_len, sizeof(info), 1336 + raw_btf_size, info.btf_size, expected_nbytes, ret)) { 1165 1337 err = -1; 1166 1338 goto done; 1167 1339 } 1168 1340 1169 1341 while (expected_nbytes < raw_btf_size) { 1170 1342 fprintf(stderr, 
"%u...", expected_nbytes); 1171 - if (user_btf[expected_nbytes++] != 0xff) { 1172 - fprintf(stderr, "!= 0xff\n"); 1343 + if (CHECK(user_btf[expected_nbytes++] != 0xff, 1344 + "user_btf[%u]:%x != 0xff", expected_nbytes - 1, 1345 + user_btf[expected_nbytes - 1])) { 1173 1346 err = -1; 1174 1347 goto done; 1175 1348 } 1176 1349 } 1177 1350 1178 - fprintf(stderr, "OK\n"); 1351 + fprintf(stderr, "OK"); 1179 1352 1180 1353 done: 1181 1354 if (*btf_log_buf && (err || args.always_log)) 1182 - fprintf(stderr, "%s\n", btf_log_buf); 1355 + fprintf(stderr, "\n%s", btf_log_buf); 1183 1356 1184 1357 free(raw_btf); 1185 1358 free(user_btf); ··· 1421 1144 int err = 0; 1422 1145 1423 1146 if (args.get_info_test_num) 1424 - return do_test_get_info(args.get_info_test_num); 1147 + return count_result(do_test_get_info(args.get_info_test_num)); 1425 1148 1426 1149 for (i = 1; i <= ARRAY_SIZE(get_info_tests); i++) 1427 - err |= do_test_get_info(i); 1150 + err |= count_result(do_test_get_info(i)); 1428 1151 1429 1152 return err; 1430 1153 } ··· 1452 1175 Elf *elf; 1453 1176 int ret; 1454 1177 1455 - if (elf_version(EV_CURRENT) == EV_NONE) { 1456 - fprintf(stderr, "Failed to init libelf\n"); 1178 + if (CHECK(elf_version(EV_CURRENT) == EV_NONE, 1179 + "elf_version(EV_CURRENT) == EV_NONE")) 1457 1180 return -1; 1458 - } 1459 1181 1460 1182 elf_fd = open(fn, O_RDONLY); 1461 - if (elf_fd == -1) { 1462 - fprintf(stderr, "Cannot open file %s: %s(%d)\n", 1463 - fn, strerror(errno), errno); 1183 + if (CHECK(elf_fd == -1, "open(%s): errno:%d", fn, errno)) 1464 1184 return -1; 1465 - } 1466 1185 1467 1186 elf = elf_begin(elf_fd, ELF_C_READ, NULL); 1468 - if (!elf) { 1469 - fprintf(stderr, "Failed to read ELF from %s. 
%s\n", fn, 1470 - elf_errmsg(elf_errno())); 1187 + if (CHECK(!elf, "elf_begin(%s): %s", fn, elf_errmsg(elf_errno()))) { 1471 1188 ret = -1; 1472 1189 goto done; 1473 1190 } 1474 1191 1475 - if (!gelf_getehdr(elf, &ehdr)) { 1476 - fprintf(stderr, "Failed to get EHDR from %s\n", fn); 1192 + if (CHECK(!gelf_getehdr(elf, &ehdr), "!gelf_getehdr(%s)", fn)) { 1477 1193 ret = -1; 1478 1194 goto done; 1479 1195 } ··· 1475 1205 const char *sh_name; 1476 1206 GElf_Shdr sh; 1477 1207 1478 - if (gelf_getshdr(scn, &sh) != &sh) { 1479 - fprintf(stderr, 1480 - "Failed to get section header from %s\n", fn); 1208 + if (CHECK(gelf_getshdr(scn, &sh) != &sh, 1209 + "file:%s gelf_getshdr != &sh", fn)) { 1481 1210 ret = -1; 1482 1211 goto done; 1483 1212 } ··· 1512 1243 return err; 1513 1244 1514 1245 if (err == 0) { 1515 - fprintf(stderr, "SKIP. No ELF %s found\n", BTF_ELF_SEC); 1246 + fprintf(stderr, "SKIP. No ELF %s found", BTF_ELF_SEC); 1247 + skip_cnt++; 1516 1248 return 0; 1517 1249 } 1518 1250 1519 1251 obj = bpf_object__open(test->file); 1520 - if (IS_ERR(obj)) 1252 + if (CHECK(IS_ERR(obj), "obj: %ld", PTR_ERR(obj))) 1521 1253 return PTR_ERR(obj); 1522 1254 1523 1255 err = bpf_object__btf_fd(obj); 1524 - if (err == -1) { 1525 - fprintf(stderr, "bpf_object__btf_fd: -1\n"); 1256 + if (CHECK(err == -1, "bpf_object__btf_fd: -1")) 1526 1257 goto done; 1527 - } 1528 1258 1529 1259 prog = bpf_program__next(NULL, obj); 1530 - if (!prog) { 1531 - fprintf(stderr, "Cannot find bpf_prog\n"); 1260 + if (CHECK(!prog, "Cannot find bpf_prog")) { 1532 1261 err = -1; 1533 1262 goto done; 1534 1263 } 1535 1264 1536 1265 bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT); 1537 1266 err = bpf_object__load(obj); 1538 - if (err < 0) { 1539 - fprintf(stderr, "bpf_object__load: %d\n", err); 1267 + if (CHECK(err < 0, "bpf_object__load: %d", err)) 1540 1268 goto done; 1541 - } 1542 1269 1543 1270 map = bpf_object__find_map_by_name(obj, "btf_map"); 1544 - if (!map) { 1545 - fprintf(stderr, "btf_map not 
found\n"); 1271 + if (CHECK(!map, "btf_map not found")) { 1546 1272 err = -1; 1547 1273 goto done; 1548 1274 } 1549 1275 1550 1276 err = (bpf_map__btf_key_id(map) == 0 || bpf_map__btf_value_id(map) == 0) 1551 1277 != test->btf_kv_notfound; 1552 - if (err) { 1553 - fprintf(stderr, 1554 - "btf_kv_notfound:%u btf_key_id:%u btf_value_id:%u\n", 1555 - test->btf_kv_notfound, 1556 - bpf_map__btf_key_id(map), 1557 - bpf_map__btf_value_id(map)); 1278 + if (CHECK(err, "btf_key_id:%u btf_value_id:%u test->btf_kv_notfound:%u", 1279 + bpf_map__btf_key_id(map), bpf_map__btf_value_id(map), 1280 + test->btf_kv_notfound)) 1558 1281 goto done; 1559 - } 1560 1282 1561 - fprintf(stderr, "OK\n"); 1283 + fprintf(stderr, "OK"); 1562 1284 1563 1285 done: 1564 1286 bpf_object__close(obj); ··· 1562 1302 int err = 0; 1563 1303 1564 1304 if (args.file_test_num) 1565 - return do_test_file(args.file_test_num); 1305 + return count_result(do_test_file(args.file_test_num)); 1566 1306 1567 1307 for (i = 1; i <= ARRAY_SIZE(file_tests); i++) 1568 - err |= do_test_file(i); 1308 + err |= count_result(do_test_file(i)); 1569 1309 1570 1310 return err; 1571 1311 } ··· 1685 1425 unsigned int key; 1686 1426 uint8_t *raw_btf; 1687 1427 ssize_t nread; 1688 - int err; 1428 + int err, ret; 1689 1429 1690 1430 fprintf(stderr, "%s......", test->descr); 1691 1431 raw_btf = btf_raw_create(&hdr_tmpl, test->raw_types, ··· 1701 1441 args.always_log); 1702 1442 free(raw_btf); 1703 1443 1704 - if (btf_fd == -1) { 1444 + if (CHECK(btf_fd == -1, "errno:%d", errno)) { 1705 1445 err = -1; 1706 - fprintf(stderr, "bpf_load_btf: %s(%d)\n", 1707 - strerror(errno), errno); 1708 1446 goto done; 1709 1447 } 1710 1448 ··· 1716 1458 create_attr.btf_value_id = test->value_id; 1717 1459 1718 1460 map_fd = bpf_create_map_xattr(&create_attr); 1719 - if (map_fd == -1) { 1461 + if (CHECK(map_fd == -1, "errno:%d", errno)) { 1720 1462 err = -1; 1721 - fprintf(stderr, "bpf_creat_map_btf: %s(%d)\n", 1722 - strerror(errno), errno); 1723 1463 
goto done; 1724 1464 } 1725 1465 1726 - if (snprintf(pin_path, sizeof(pin_path), "%s/%s", 1727 - "/sys/fs/bpf", test->map_name) == sizeof(pin_path)) { 1466 + ret = snprintf(pin_path, sizeof(pin_path), "%s/%s", 1467 + "/sys/fs/bpf", test->map_name); 1468 + 1469 + if (CHECK(ret == sizeof(pin_path), "pin_path %s/%s is too long", 1470 + "/sys/fs/bpf", test->map_name)) { 1728 1471 err = -1; 1729 - fprintf(stderr, "pin_path is too long\n"); 1730 1472 goto done; 1731 1473 } 1732 1474 1733 1475 err = bpf_obj_pin(map_fd, pin_path); 1734 - if (err) { 1735 - fprintf(stderr, "Cannot pin to %s. %s(%d).\n", pin_path, 1736 - strerror(errno), errno); 1476 + if (CHECK(err, "bpf_obj_pin(%s): errno:%d.", pin_path, errno)) 1737 1477 goto done; 1738 - } 1739 1478 1740 1479 for (key = 0; key < test->max_entries; key++) { 1741 1480 set_pprint_mapv(&mapv, key); ··· 1740 1485 } 1741 1486 1742 1487 pin_file = fopen(pin_path, "r"); 1743 - if (!pin_file) { 1488 + if (CHECK(!pin_file, "fopen(%s): errno:%d", pin_path, errno)) { 1744 1489 err = -1; 1745 - fprintf(stderr, "fopen(%s): %s(%d)\n", pin_path, 1746 - strerror(errno), errno); 1747 1490 goto done; 1748 1491 } 1749 1492 ··· 1750 1497 *line == '#') 1751 1498 ; 1752 1499 1753 - if (nread <= 0) { 1500 + if (CHECK(nread <= 0, "Unexpected EOF")) { 1754 1501 err = -1; 1755 - fprintf(stderr, "Unexpected EOF\n"); 1756 1502 goto done; 1757 1503 } 1758 1504 ··· 1770 1518 mapv.ui8a[4], mapv.ui8a[5], mapv.ui8a[6], mapv.ui8a[7], 1771 1519 pprint_enum_str[mapv.aenum]); 1772 1520 1773 - if (nexpected_line == sizeof(expected_line)) { 1521 + if (CHECK(nexpected_line == sizeof(expected_line), 1522 + "expected_line is too long")) { 1774 1523 err = -1; 1775 - fprintf(stderr, "expected_line is too long\n"); 1776 1524 goto done; 1777 1525 } 1778 1526 ··· 1787 1535 nread = getline(&line, &line_len, pin_file); 1788 1536 } while (++key < test->max_entries && nread > 0); 1789 1537 1790 - if (key < test->max_entries) { 1538 + if (CHECK(key < test->max_entries, 1539 
+ "Unexpected EOF. key:%u test->max_entries:%u", 1540 + key, test->max_entries)) { 1791 1541 err = -1; 1792 - fprintf(stderr, "Unexpected EOF\n"); 1793 1542 goto done; 1794 1543 } 1795 1544 1796 - if (nread > 0) { 1545 + if (CHECK(nread > 0, "Unexpected extra pprint output: %s", line)) { 1797 1546 err = -1; 1798 - fprintf(stderr, "Unexpected extra pprint output: %s\n", line); 1799 1547 goto done; 1800 1548 } 1801 1549 ··· 1803 1551 1804 1552 done: 1805 1553 if (!err) 1806 - fprintf(stderr, "OK\n"); 1554 + fprintf(stderr, "OK"); 1807 1555 if (*btf_log_buf && (err || args.always_log)) 1808 - fprintf(stderr, "%s\n", btf_log_buf); 1556 + fprintf(stderr, "\n%s", btf_log_buf); 1809 1557 if (btf_fd != -1) 1810 1558 close(btf_fd); 1811 1559 if (map_fd != -1) ··· 1886 1634 return 0; 1887 1635 } 1888 1636 1637 + static void print_summary(void) 1638 + { 1639 + fprintf(stderr, "PASS:%u SKIP:%u FAIL:%u\n", 1640 + pass_cnt - skip_cnt, skip_cnt, error_cnt); 1641 + } 1642 + 1889 1643 int main(int argc, char **argv) 1890 1644 { 1891 1645 int err = 0; ··· 1913 1655 err |= test_file(); 1914 1656 1915 1657 if (args.pprint_test) 1916 - err |= test_pprint(); 1658 + err |= count_result(test_pprint()); 1917 1659 1918 1660 if (args.raw_test || args.get_info_test || args.file_test || 1919 1661 args.pprint_test) 1920 - return err; 1662 + goto done; 1921 1663 1922 1664 err |= test_raw(); 1923 1665 err |= test_get_info(); 1924 1666 err |= test_file(); 1925 1667 1668 + done: 1669 + print_summary(); 1926 1670 return err; 1927 1671 }
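The CHECK()/count_result() pattern threaded through test_btf.c above can be exercised on its own. Below is a minimal sketch of the two helpers as the patch defines them; do_demo_test() is a hypothetical stand-in for a real test case, not part of the series:

```c
#include <stdint.h>
#include <stdio.h>

/* Pared-down copy of the patch's CHECK()/count_result() helpers. */
static uint32_t pass_cnt;
static uint32_t error_cnt;

#define CHECK(condition, format...) ({				\
	int __ret = !!(condition);				\
	if (__ret) {						\
		fprintf(stderr, "%s:%d:FAIL ", __func__, __LINE__); \
		fprintf(stderr, format);			\
	}							\
	__ret;							\
})

static int count_result(int err)
{
	if (err)
		error_cnt++;
	else
		pass_cnt++;

	fprintf(stderr, "\n");
	return err;
}

/* Hypothetical test case: CHECK() is an expression, so it can gate
 * an early exit while also printing the failure location. */
static int do_demo_test(int should_fail)
{
	if (CHECK(should_fail, "should_fail:%d", should_fail))
		return -1;

	fprintf(stderr, "OK");
	return 0;
}
```

Each test prints either "OK" or a FAIL line on stderr, count_result() tallies it into the pass/error counters, and the caller can OR the returns together exactly as test_raw() and test_get_info() do above.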
+137 -3
tools/testing/selftests/bpf/test_progs.c
··· 1272 1272 return; 1273 1273 } 1274 1274 1275 + static void test_stacktrace_build_id_nmi(void) 1276 + { 1277 + int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd; 1278 + const char *file = "./test_stacktrace_build_id.o"; 1279 + int err, pmu_fd, prog_fd; 1280 + struct perf_event_attr attr = { 1281 + .sample_freq = 5000, 1282 + .freq = 1, 1283 + .type = PERF_TYPE_HARDWARE, 1284 + .config = PERF_COUNT_HW_CPU_CYCLES, 1285 + }; 1286 + __u32 key, previous_key, val, duration = 0; 1287 + struct bpf_object *obj; 1288 + char buf[256]; 1289 + int i, j; 1290 + struct bpf_stack_build_id id_offs[PERF_MAX_STACK_DEPTH]; 1291 + int build_id_matches = 0; 1292 + 1293 + err = bpf_prog_load(file, BPF_PROG_TYPE_PERF_EVENT, &obj, &prog_fd); 1294 + if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno)) 1295 + return; 1296 + 1297 + pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 1298 + 0 /* cpu 0 */, -1 /* group id */, 1299 + 0 /* flags */); 1300 + if (CHECK(pmu_fd < 0, "perf_event_open", 1301 + "err %d errno %d. 
Does the test host support PERF_COUNT_HW_CPU_CYCLES?\n", 1302 + pmu_fd, errno)) 1303 + goto close_prog; 1304 + 1305 + err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0); 1306 + if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", 1307 + err, errno)) 1308 + goto close_pmu; 1309 + 1310 + err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); 1311 + if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", 1312 + err, errno)) 1313 + goto disable_pmu; 1314 + 1315 + /* find map fds */ 1316 + control_map_fd = bpf_find_map(__func__, obj, "control_map"); 1317 + if (CHECK(control_map_fd < 0, "bpf_find_map control_map", 1318 + "err %d errno %d\n", err, errno)) 1319 + goto disable_pmu; 1320 + 1321 + stackid_hmap_fd = bpf_find_map(__func__, obj, "stackid_hmap"); 1322 + if (CHECK(stackid_hmap_fd < 0, "bpf_find_map stackid_hmap", 1323 + "err %d errno %d\n", err, errno)) 1324 + goto disable_pmu; 1325 + 1326 + stackmap_fd = bpf_find_map(__func__, obj, "stackmap"); 1327 + if (CHECK(stackmap_fd < 0, "bpf_find_map stackmap", "err %d errno %d\n", 1328 + err, errno)) 1329 + goto disable_pmu; 1330 + 1331 + stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap"); 1332 + if (CHECK(stack_amap_fd < 0, "bpf_find_map stack_amap", 1333 + "err %d errno %d\n", err, errno)) 1334 + goto disable_pmu; 1335 + 1336 + assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null") 1337 + == 0); 1338 + assert(system("taskset 0x1 ./urandom_read 100000") == 0); 1339 + /* disable stack trace collection */ 1340 + key = 0; 1341 + val = 1; 1342 + bpf_map_update_elem(control_map_fd, &key, &val, 0); 1343 + 1344 + /* for every element in stackid_hmap, we can find a corresponding one 1345 + * in stackmap, and vise versa. 1346 + */ 1347 + err = compare_map_keys(stackid_hmap_fd, stackmap_fd); 1348 + if (CHECK(err, "compare_map_keys stackid_hmap vs. 
stackmap", 1349 + "err %d errno %d\n", err, errno)) 1350 + goto disable_pmu; 1351 + 1352 + err = compare_map_keys(stackmap_fd, stackid_hmap_fd); 1353 + if (CHECK(err, "compare_map_keys stackmap vs. stackid_hmap", 1354 + "err %d errno %d\n", err, errno)) 1355 + goto disable_pmu; 1356 + 1357 + err = extract_build_id(buf, 256); 1358 + 1359 + if (CHECK(err, "get build_id with readelf", 1360 + "err %d errno %d\n", err, errno)) 1361 + goto disable_pmu; 1362 + 1363 + err = bpf_map_get_next_key(stackmap_fd, NULL, &key); 1364 + if (CHECK(err, "get_next_key from stackmap", 1365 + "err %d, errno %d\n", err, errno)) 1366 + goto disable_pmu; 1367 + 1368 + do { 1369 + char build_id[64]; 1370 + 1371 + err = bpf_map_lookup_elem(stackmap_fd, &key, id_offs); 1372 + if (CHECK(err, "lookup_elem from stackmap", 1373 + "err %d, errno %d\n", err, errno)) 1374 + goto disable_pmu; 1375 + for (i = 0; i < PERF_MAX_STACK_DEPTH; ++i) 1376 + if (id_offs[i].status == BPF_STACK_BUILD_ID_VALID && 1377 + id_offs[i].offset != 0) { 1378 + for (j = 0; j < 20; ++j) 1379 + sprintf(build_id + 2 * j, "%02x", 1380 + id_offs[i].build_id[j] & 0xff); 1381 + if (strstr(buf, build_id) != NULL) 1382 + build_id_matches = 1; 1383 + } 1384 + previous_key = key; 1385 + } while (bpf_map_get_next_key(stackmap_fd, &previous_key, &key) == 0); 1386 + 1387 + if (CHECK(build_id_matches < 1, "build id match", 1388 + "Didn't find expected build ID from the map\n")) 1389 + goto disable_pmu; 1390 + 1391 + /* 1392 + * We intentionally skip compare_stack_ips(). 
This is because we 1393 + * only support one in_nmi() ips-to-build_id translation per cpu 1394 + * at any time, thus stack_amap here will always fallback to 1395 + * BPF_STACK_BUILD_ID_IP; 1396 + */ 1397 + 1398 + disable_pmu: 1399 + ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE); 1400 + 1401 + close_pmu: 1402 + close(pmu_fd); 1403 + 1404 + close_prog: 1405 + bpf_object__close(obj); 1406 + } 1407 + 1275 1408 #define MAX_CNT_RAWTP 10ull 1276 1409 #define MAX_STACK_RAWTP 100 1277 1410 struct get_stack_trace_t { ··· 1470 1337 good_user_stack = true; 1471 1338 } 1472 1339 if (!good_kern_stack || !good_user_stack) 1473 - return PERF_EVENT_ERROR; 1340 + return LIBBPF_PERF_EVENT_ERROR; 1474 1341 1475 1342 if (cnt == MAX_CNT_RAWTP) 1476 - return PERF_EVENT_DONE; 1343 + return LIBBPF_PERF_EVENT_DONE; 1477 1344 1478 - return PERF_EVENT_CONT; 1345 + return LIBBPF_PERF_EVENT_CONT; 1479 1346 } 1480 1347 1481 1348 static void test_get_stack_raw_tp(void) ··· 1558 1425 test_tp_attach_query(); 1559 1426 test_stacktrace_map(); 1560 1427 test_stacktrace_build_id(); 1428 + test_stacktrace_build_id_nmi(); 1561 1429 test_stacktrace_map_raw_tp(); 1562 1430 test_get_stack_raw_tp(); 1563 1431
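Inside test_stacktrace_build_id_nmi() above, the inner loop renders the 20 raw build-id bytes as lowercase hex before strstr()-matching them against the readelf output held in buf. That loop can be factored out and checked in isolation; render_build_id() is an invented helper name for this sketch only:

```c
#include <stdio.h>

/* Render a 20-byte GNU build ID as lowercase hex, as the NMI
 * stackmap test does with its sprintf("%02x") loop over
 * id_offs[i].build_id. render_build_id() is a hypothetical
 * refactoring of the patch's inline loop. */
#define BUILD_ID_SIZE 20

static void render_build_id(const unsigned char *id,
			    char out[2 * BUILD_ID_SIZE + 1])
{
	int j;

	/* Each sprintf() writes two hex digits plus a NUL, and the
	 * next iteration overwrites that NUL, so out ends up as one
	 * 40-character NUL-terminated string. */
	for (j = 0; j < BUILD_ID_SIZE; ++j)
		sprintf(out + 2 * j, "%02x", id[j] & 0xff);
}
```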
+5
tools/testing/selftests/bpf/test_sockhash_kern.c
···
1 + // SPDX-License-Identifier: GPL-2.0
2 + // Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
3 + #undef SOCKMAP
4 + #define TEST_MAP_TYPE BPF_MAP_TYPE_SOCKHASH
5 + #include "./test_sockmap_kern.h"
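The five-line file above works because the shared test_sockmap_kern.h template (not shown in this diff) keys its map definition off the SOCKMAP guard and TEST_MAP_TYPE. The compile-time selection can be sketched as follows; the enum values mirror the 4.17-era enum bpf_map_type UAPI, and template_map_type() is an illustrative stand-in for the header's SEC("maps") definition:

```c
/* Mimic test_sockhash_kern.c: drop the sockmap default and pick the
 * map type before pulling in the shared template. */
enum bpf_map_type_sketch {
	BPF_MAP_TYPE_SOCKMAP = 15,
	BPF_MAP_TYPE_SOCKHASH = 18,
};

#undef SOCKMAP
#define TEST_MAP_TYPE BPF_MAP_TYPE_SOCKHASH

/* In the real header this macro parameterizes a
 * struct bpf_map_def SEC("maps") definition; a function stands in
 * for it here so the selection is observable. */
static int template_map_type(void)
{
	return TEST_MAP_TYPE;
}
```

The same template compiled without the #undef/#define pair yields the sockmap variant, which is how one header serves both test_sockmap_kern.c and test_sockhash_kern.c.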
+20 -7
tools/testing/selftests/bpf/test_sockmap.c
··· 47 47 #define S1_PORT 10000
48 48 #define S2_PORT 10001
49 49
50 - #define BPF_FILENAME "test_sockmap_kern.o"
50 + #define BPF_SOCKMAP_FILENAME "test_sockmap_kern.o"
51 + #define BPF_SOCKHASH_FILENAME "test_sockhash_kern.o"
51 52 #define CG_PATH "/sockmap"
52 53
53 54 /* global sockets */
··· 1261 1260 BPF_PROG_TYPE_SK_MSG,
1262 1261 };
1263 1262
1264 - static int populate_progs(void)
1263 + static int populate_progs(char *bpf_file)
1265 1264 {
1266 - char *bpf_file = BPF_FILENAME;
1267 1265 struct bpf_program *prog;
1268 1266 struct bpf_object *obj;
1269 1267 int i = 0;
··· 1306 1306 return 0;
1307 1307 }
1308 1308
1309 - static int test_suite(void)
1309 + static int __test_suite(char *bpf_file)
1310 1310 {
1311 1311 int cg_fd, err;
1312 1312
1313 - err = populate_progs();
1313 + err = populate_progs(bpf_file);
1314 1314 if (err < 0) {
1315 1315 fprintf(stderr, "ERROR: (%i) load bpf failed\n", err);
1316 1316 return err;
··· 1347 1347
1348 1348 out:
1349 1349 printf("Summary: %i PASSED %i FAILED\n", passed, failed);
1350 + cleanup_cgroup_environment();
1350 1351 close(cg_fd);
1352 + return err;
1353 + }
1354 +
1355 + static int test_suite(void)
1356 + {
1357 + int err;
1358 +
1359 + err = __test_suite(BPF_SOCKMAP_FILENAME);
1360 + if (err)
1361 + goto out;
1362 + err = __test_suite(BPF_SOCKHASH_FILENAME);
1363 + out:
1351 1364 return err;
1352 1365 }
1353 1366
··· 1370 1357 int iov_count = 1, length = 1024, rate = 1;
1371 1358 struct sockmap_options options = {0};
1372 1359 int opt, longindex, err, cg_fd = 0;
1373 - char *bpf_file = BPF_FILENAME;
1360 + char *bpf_file = BPF_SOCKMAP_FILENAME;
1374 1361 int test = PING_PONG;
1375 1362
1376 1363 if (setrlimit(RLIMIT_MEMLOCK, &r)) {
··· 1451 1438 return -1;
1452 1439 }
1453 1440
1454 - err = populate_progs();
1441 + err = populate_progs(bpf_file);
1455 1442 if (err) {
1456 1443 fprintf(stderr, "populate program: (%s) %s\n",
1457 1444 bpf_file, strerror(errno));
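Note that the reworked test_suite() in test_sockmap.c short-circuits: a failure in the sockmap pass skips the sockhash pass entirely. A minimal model of that control flow, where run_one() and the fail_sockmap knob are invented for this sketch and stand in for __test_suite() and a real load failure:

```c
#include <string.h>

/* Model of the two-pass driver: the suite runs once per BPF object
 * file, and the first failure wins. */
static int fail_sockmap;	/* test knob, not in the patch */

static int run_one(const char *bpf_file)
{
	/* Pretend the sockmap object fails to load when asked to. */
	if (fail_sockmap && strstr(bpf_file, "sockmap_kern"))
		return -1;
	return 0;		/* pretend the suite passed */
}

static int run_all(void)
{
	int err;

	err = run_one("test_sockmap_kern.o");
	if (err)
		goto out;	/* sockmap failure skips the sockhash pass */
	err = run_one("test_sockhash_kern.o");
out:
	return err;
}
```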
+4 -339
tools/testing/selftests/bpf/test_sockmap_kern.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 - // Copyright (c) 2017-2018 Covalent IO, Inc. http://covalent.io 3 - #include <stddef.h> 4 - #include <string.h> 5 - #include <linux/bpf.h> 6 - #include <linux/if_ether.h> 7 - #include <linux/if_packet.h> 8 - #include <linux/ip.h> 9 - #include <linux/ipv6.h> 10 - #include <linux/in.h> 11 - #include <linux/udp.h> 12 - #include <linux/tcp.h> 13 - #include <linux/pkt_cls.h> 14 - #include <sys/socket.h> 15 - #include "bpf_helpers.h" 16 - #include "bpf_endian.h" 17 - 18 - /* Sockmap sample program connects a client and a backend together 19 - * using cgroups. 20 - * 21 - * client:X <---> frontend:80 client:X <---> backend:80 22 - * 23 - * For simplicity we hard code values here and bind 1:1. The hard 24 - * coded values are part of the setup in sockmap.sh script that 25 - * is associated with this BPF program. 26 - * 27 - * The bpf_printk is verbose and prints information as connections 28 - * are established and verdicts are decided. 29 - */ 30 - 31 - #define bpf_printk(fmt, ...) 
\ 32 - ({ \ 33 - char ____fmt[] = fmt; \ 34 - bpf_trace_printk(____fmt, sizeof(____fmt), \ 35 - ##__VA_ARGS__); \ 36 - }) 37 - 38 - struct bpf_map_def SEC("maps") sock_map = { 39 - .type = BPF_MAP_TYPE_SOCKMAP, 40 - .key_size = sizeof(int), 41 - .value_size = sizeof(int), 42 - .max_entries = 20, 43 - }; 44 - 45 - struct bpf_map_def SEC("maps") sock_map_txmsg = { 46 - .type = BPF_MAP_TYPE_SOCKMAP, 47 - .key_size = sizeof(int), 48 - .value_size = sizeof(int), 49 - .max_entries = 20, 50 - }; 51 - 52 - struct bpf_map_def SEC("maps") sock_map_redir = { 53 - .type = BPF_MAP_TYPE_SOCKMAP, 54 - .key_size = sizeof(int), 55 - .value_size = sizeof(int), 56 - .max_entries = 20, 57 - }; 58 - 59 - struct bpf_map_def SEC("maps") sock_apply_bytes = { 60 - .type = BPF_MAP_TYPE_ARRAY, 61 - .key_size = sizeof(int), 62 - .value_size = sizeof(int), 63 - .max_entries = 1 64 - }; 65 - 66 - struct bpf_map_def SEC("maps") sock_cork_bytes = { 67 - .type = BPF_MAP_TYPE_ARRAY, 68 - .key_size = sizeof(int), 69 - .value_size = sizeof(int), 70 - .max_entries = 1 71 - }; 72 - 73 - struct bpf_map_def SEC("maps") sock_pull_bytes = { 74 - .type = BPF_MAP_TYPE_ARRAY, 75 - .key_size = sizeof(int), 76 - .value_size = sizeof(int), 77 - .max_entries = 2 78 - }; 79 - 80 - struct bpf_map_def SEC("maps") sock_redir_flags = { 81 - .type = BPF_MAP_TYPE_ARRAY, 82 - .key_size = sizeof(int), 83 - .value_size = sizeof(int), 84 - .max_entries = 1 85 - }; 86 - 87 - struct bpf_map_def SEC("maps") sock_skb_opts = { 88 - .type = BPF_MAP_TYPE_ARRAY, 89 - .key_size = sizeof(int), 90 - .value_size = sizeof(int), 91 - .max_entries = 1 92 - }; 93 - 94 - SEC("sk_skb1") 95 - int bpf_prog1(struct __sk_buff *skb) 96 - { 97 - return skb->len; 98 - } 99 - 100 - SEC("sk_skb2") 101 - int bpf_prog2(struct __sk_buff *skb) 102 - { 103 - __u32 lport = skb->local_port; 104 - __u32 rport = skb->remote_port; 105 - int len, *f, ret, zero = 0; 106 - __u64 flags = 0; 107 - 108 - if (lport == 10000) 109 - ret = 10; 110 - else 111 - ret = 1; 
112 - 113 - len = (__u32)skb->data_end - (__u32)skb->data; 114 - f = bpf_map_lookup_elem(&sock_skb_opts, &zero); 115 - if (f && *f) { 116 - ret = 3; 117 - flags = *f; 118 - } 119 - 120 - bpf_printk("sk_skb2: redirect(%iB) flags=%i\n", 121 - len, flags); 122 - return bpf_sk_redirect_map(skb, &sock_map, ret, flags); 123 - } 124 - 125 - SEC("sockops") 126 - int bpf_sockmap(struct bpf_sock_ops *skops) 127 - { 128 - __u32 lport, rport; 129 - int op, err = 0, index, key, ret; 130 - 131 - 132 - op = (int) skops->op; 133 - 134 - switch (op) { 135 - case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: 136 - lport = skops->local_port; 137 - rport = skops->remote_port; 138 - 139 - if (lport == 10000) { 140 - ret = 1; 141 - err = bpf_sock_map_update(skops, &sock_map, &ret, 142 - BPF_NOEXIST); 143 - bpf_printk("passive(%i -> %i) map ctx update err: %d\n", 144 - lport, bpf_ntohl(rport), err); 145 - } 146 - break; 147 - case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: 148 - lport = skops->local_port; 149 - rport = skops->remote_port; 150 - 151 - if (bpf_ntohl(rport) == 10001) { 152 - ret = 10; 153 - err = bpf_sock_map_update(skops, &sock_map, &ret, 154 - BPF_NOEXIST); 155 - bpf_printk("active(%i -> %i) map ctx update err: %d\n", 156 - lport, bpf_ntohl(rport), err); 157 - } 158 - break; 159 - default: 160 - break; 161 - } 162 - 163 - return 0; 164 - } 165 - 166 - SEC("sk_msg1") 167 - int bpf_prog4(struct sk_msg_md *msg) 168 - { 169 - int *bytes, zero = 0, one = 1; 170 - int *start, *end; 171 - 172 - bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 173 - if (bytes) 174 - bpf_msg_apply_bytes(msg, *bytes); 175 - bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 176 - if (bytes) 177 - bpf_msg_cork_bytes(msg, *bytes); 178 - start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 179 - end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 180 - if (start && end) 181 - bpf_msg_pull_data(msg, *start, *end, 0); 182 - return SK_PASS; 183 - } 184 - 185 - SEC("sk_msg2") 186 - int bpf_prog5(struct sk_msg_md 
*msg) 187 - { 188 - int err1 = -1, err2 = -1, zero = 0, one = 1; 189 - int *bytes, *start, *end, len1, len2; 190 - 191 - bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 192 - if (bytes) 193 - err1 = bpf_msg_apply_bytes(msg, *bytes); 194 - bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 195 - if (bytes) 196 - err2 = bpf_msg_cork_bytes(msg, *bytes); 197 - len1 = (__u64)msg->data_end - (__u64)msg->data; 198 - start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 199 - end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 200 - if (start && end) { 201 - int err; 202 - 203 - bpf_printk("sk_msg2: pull(%i:%i)\n", 204 - start ? *start : 0, end ? *end : 0); 205 - err = bpf_msg_pull_data(msg, *start, *end, 0); 206 - if (err) 207 - bpf_printk("sk_msg2: pull_data err %i\n", 208 - err); 209 - len2 = (__u64)msg->data_end - (__u64)msg->data; 210 - bpf_printk("sk_msg2: length update %i->%i\n", 211 - len1, len2); 212 - } 213 - bpf_printk("sk_msg2: data length %i err1 %i err2 %i\n", 214 - len1, err1, err2); 215 - return SK_PASS; 216 - } 217 - 218 - SEC("sk_msg3") 219 - int bpf_prog6(struct sk_msg_md *msg) 220 - { 221 - int *bytes, zero = 0, one = 1, key = 0; 222 - int *start, *end, *f; 223 - __u64 flags = 0; 224 - 225 - bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 226 - if (bytes) 227 - bpf_msg_apply_bytes(msg, *bytes); 228 - bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 229 - if (bytes) 230 - bpf_msg_cork_bytes(msg, *bytes); 231 - start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 232 - end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 233 - if (start && end) 234 - bpf_msg_pull_data(msg, *start, *end, 0); 235 - f = bpf_map_lookup_elem(&sock_redir_flags, &zero); 236 - if (f && *f) { 237 - key = 2; 238 - flags = *f; 239 - } 240 - return bpf_msg_redirect_map(msg, &sock_map_redir, key, flags); 241 - } 242 - 243 - SEC("sk_msg4") 244 - int bpf_prog7(struct sk_msg_md *msg) 245 - { 246 - int err1 = 0, err2 = 0, zero = 0, one = 1, key = 0; 247 - int *f, 
*bytes, *start, *end, len1, len2; 248 - __u64 flags = 0; 249 - 250 - int err; 251 - bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 252 - if (bytes) 253 - err1 = bpf_msg_apply_bytes(msg, *bytes); 254 - bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 255 - if (bytes) 256 - err2 = bpf_msg_cork_bytes(msg, *bytes); 257 - len1 = (__u64)msg->data_end - (__u64)msg->data; 258 - start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 259 - end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 260 - if (start && end) { 261 - 262 - bpf_printk("sk_msg2: pull(%i:%i)\n", 263 - start ? *start : 0, end ? *end : 0); 264 - err = bpf_msg_pull_data(msg, *start, *end, 0); 265 - if (err) 266 - bpf_printk("sk_msg2: pull_data err %i\n", 267 - err); 268 - len2 = (__u64)msg->data_end - (__u64)msg->data; 269 - bpf_printk("sk_msg2: length update %i->%i\n", 270 - len1, len2); 271 - } 272 - f = bpf_map_lookup_elem(&sock_redir_flags, &zero); 273 - if (f && *f) { 274 - key = 2; 275 - flags = *f; 276 - } 277 - bpf_printk("sk_msg3: redirect(%iB) flags=%i err=%i\n", 278 - len1, flags, err1 ? 
err1 : err2); 279 - err = bpf_msg_redirect_map(msg, &sock_map_redir, key, flags); 280 - bpf_printk("sk_msg3: err %i\n", err); 281 - return err; 282 - } 283 - 284 - SEC("sk_msg5") 285 - int bpf_prog8(struct sk_msg_md *msg) 286 - { 287 - void *data_end = (void *)(long) msg->data_end; 288 - void *data = (void *)(long) msg->data; 289 - int ret = 0, *bytes, zero = 0; 290 - 291 - bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 292 - if (bytes) { 293 - ret = bpf_msg_apply_bytes(msg, *bytes); 294 - if (ret) 295 - return SK_DROP; 296 - } else { 297 - return SK_DROP; 298 - } 299 - return SK_PASS; 300 - } 301 - SEC("sk_msg6") 302 - int bpf_prog9(struct sk_msg_md *msg) 303 - { 304 - void *data_end = (void *)(long) msg->data_end; 305 - void *data = (void *)(long) msg->data; 306 - int ret = 0, *bytes, zero = 0; 307 - 308 - bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 309 - if (bytes) { 310 - if (((__u64)data_end - (__u64)data) >= *bytes) 311 - return SK_PASS; 312 - ret = bpf_msg_cork_bytes(msg, *bytes); 313 - if (ret) 314 - return SK_DROP; 315 - } 316 - return SK_PASS; 317 - } 318 - 319 - SEC("sk_msg7") 320 - int bpf_prog10(struct sk_msg_md *msg) 321 - { 322 - int *bytes, zero = 0, one = 1; 323 - int *start, *end; 324 - 325 - bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 326 - if (bytes) 327 - bpf_msg_apply_bytes(msg, *bytes); 328 - bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 329 - if (bytes) 330 - bpf_msg_cork_bytes(msg, *bytes); 331 - start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 332 - end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 333 - if (start && end) 334 - bpf_msg_pull_data(msg, *start, *end, 0); 335 - 336 - return SK_DROP; 337 - } 338 - 339 - int _version SEC("version") = 1; 340 - char _license[] SEC("license") = "GPL"; 2 + // Copyright (c) 2018 Covalent IO, Inc. http://covalent.io 3 + #define SOCKMAP 4 + #define TEST_MAP_TYPE BPF_MAP_TYPE_SOCKMAP 5 + #include "./test_sockmap_kern.h"
tools/testing/selftests/bpf/test_sockmap_kern.h (+363)
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* Copyright (c) 2017-2018 Covalent IO, Inc. http://covalent.io */ 3 + #include <stddef.h> 4 + #include <string.h> 5 + #include <linux/bpf.h> 6 + #include <linux/if_ether.h> 7 + #include <linux/if_packet.h> 8 + #include <linux/ip.h> 9 + #include <linux/ipv6.h> 10 + #include <linux/in.h> 11 + #include <linux/udp.h> 12 + #include <linux/tcp.h> 13 + #include <linux/pkt_cls.h> 14 + #include <sys/socket.h> 15 + #include "bpf_helpers.h" 16 + #include "bpf_endian.h" 17 + 18 + /* Sockmap sample program connects a client and a backend together 19 + * using cgroups. 20 + * 21 + * client:X <---> frontend:80 client:X <---> backend:80 22 + * 23 + * For simplicity we hard code values here and bind 1:1. The hard 24 + * coded values are part of the setup in sockmap.sh script that 25 + * is associated with this BPF program. 26 + * 27 + * The bpf_printk is verbose and prints information as connections 28 + * are established and verdicts are decided. 29 + */ 30 + 31 + #define bpf_printk(fmt, ...) 
\ 32 + ({ \ 33 + char ____fmt[] = fmt; \ 34 + bpf_trace_printk(____fmt, sizeof(____fmt), \ 35 + ##__VA_ARGS__); \ 36 + }) 37 + 38 + struct bpf_map_def SEC("maps") sock_map = { 39 + .type = TEST_MAP_TYPE, 40 + .key_size = sizeof(int), 41 + .value_size = sizeof(int), 42 + .max_entries = 20, 43 + }; 44 + 45 + struct bpf_map_def SEC("maps") sock_map_txmsg = { 46 + .type = TEST_MAP_TYPE, 47 + .key_size = sizeof(int), 48 + .value_size = sizeof(int), 49 + .max_entries = 20, 50 + }; 51 + 52 + struct bpf_map_def SEC("maps") sock_map_redir = { 53 + .type = TEST_MAP_TYPE, 54 + .key_size = sizeof(int), 55 + .value_size = sizeof(int), 56 + .max_entries = 20, 57 + }; 58 + 59 + struct bpf_map_def SEC("maps") sock_apply_bytes = { 60 + .type = BPF_MAP_TYPE_ARRAY, 61 + .key_size = sizeof(int), 62 + .value_size = sizeof(int), 63 + .max_entries = 1 64 + }; 65 + 66 + struct bpf_map_def SEC("maps") sock_cork_bytes = { 67 + .type = BPF_MAP_TYPE_ARRAY, 68 + .key_size = sizeof(int), 69 + .value_size = sizeof(int), 70 + .max_entries = 1 71 + }; 72 + 73 + struct bpf_map_def SEC("maps") sock_pull_bytes = { 74 + .type = BPF_MAP_TYPE_ARRAY, 75 + .key_size = sizeof(int), 76 + .value_size = sizeof(int), 77 + .max_entries = 2 78 + }; 79 + 80 + struct bpf_map_def SEC("maps") sock_redir_flags = { 81 + .type = BPF_MAP_TYPE_ARRAY, 82 + .key_size = sizeof(int), 83 + .value_size = sizeof(int), 84 + .max_entries = 1 85 + }; 86 + 87 + struct bpf_map_def SEC("maps") sock_skb_opts = { 88 + .type = BPF_MAP_TYPE_ARRAY, 89 + .key_size = sizeof(int), 90 + .value_size = sizeof(int), 91 + .max_entries = 1 92 + }; 93 + 94 + SEC("sk_skb1") 95 + int bpf_prog1(struct __sk_buff *skb) 96 + { 97 + return skb->len; 98 + } 99 + 100 + SEC("sk_skb2") 101 + int bpf_prog2(struct __sk_buff *skb) 102 + { 103 + __u32 lport = skb->local_port; 104 + __u32 rport = skb->remote_port; 105 + int len, *f, ret, zero = 0; 106 + __u64 flags = 0; 107 + 108 + if (lport == 10000) 109 + ret = 10; 110 + else 111 + ret = 1; 112 + 113 + len = 
(__u32)skb->data_end - (__u32)skb->data; 114 + f = bpf_map_lookup_elem(&sock_skb_opts, &zero); 115 + if (f && *f) { 116 + ret = 3; 117 + flags = *f; 118 + } 119 + 120 + bpf_printk("sk_skb2: redirect(%iB) flags=%i\n", 121 + len, flags); 122 + #ifdef SOCKMAP 123 + return bpf_sk_redirect_map(skb, &sock_map, ret, flags); 124 + #else 125 + return bpf_sk_redirect_hash(skb, &sock_map, &ret, flags); 126 + #endif 127 + 128 + } 129 + 130 + SEC("sockops") 131 + int bpf_sockmap(struct bpf_sock_ops *skops) 132 + { 133 + __u32 lport, rport; 134 + int op, err = 0, index, key, ret; 135 + 136 + 137 + op = (int) skops->op; 138 + 139 + switch (op) { 140 + case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: 141 + lport = skops->local_port; 142 + rport = skops->remote_port; 143 + 144 + if (lport == 10000) { 145 + ret = 1; 146 + #ifdef SOCKMAP 147 + err = bpf_sock_map_update(skops, &sock_map, &ret, 148 + BPF_NOEXIST); 149 + #else 150 + err = bpf_sock_hash_update(skops, &sock_map, &ret, 151 + BPF_NOEXIST); 152 + #endif 153 + bpf_printk("passive(%i -> %i) map ctx update err: %d\n", 154 + lport, bpf_ntohl(rport), err); 155 + } 156 + break; 157 + case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: 158 + lport = skops->local_port; 159 + rport = skops->remote_port; 160 + 161 + if (bpf_ntohl(rport) == 10001) { 162 + ret = 10; 163 + #ifdef SOCKMAP 164 + err = bpf_sock_map_update(skops, &sock_map, &ret, 165 + BPF_NOEXIST); 166 + #else 167 + err = bpf_sock_hash_update(skops, &sock_map, &ret, 168 + BPF_NOEXIST); 169 + #endif 170 + bpf_printk("active(%i -> %i) map ctx update err: %d\n", 171 + lport, bpf_ntohl(rport), err); 172 + } 173 + break; 174 + default: 175 + break; 176 + } 177 + 178 + return 0; 179 + } 180 + 181 + SEC("sk_msg1") 182 + int bpf_prog4(struct sk_msg_md *msg) 183 + { 184 + int *bytes, zero = 0, one = 1; 185 + int *start, *end; 186 + 187 + bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 188 + if (bytes) 189 + bpf_msg_apply_bytes(msg, *bytes); 190 + bytes = bpf_map_lookup_elem(&sock_cork_bytes, 
&zero); 191 + if (bytes) 192 + bpf_msg_cork_bytes(msg, *bytes); 193 + start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 194 + end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 195 + if (start && end) 196 + bpf_msg_pull_data(msg, *start, *end, 0); 197 + return SK_PASS; 198 + } 199 + 200 + SEC("sk_msg2") 201 + int bpf_prog5(struct sk_msg_md *msg) 202 + { 203 + int err1 = -1, err2 = -1, zero = 0, one = 1; 204 + int *bytes, *start, *end, len1, len2; 205 + 206 + bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 207 + if (bytes) 208 + err1 = bpf_msg_apply_bytes(msg, *bytes); 209 + bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 210 + if (bytes) 211 + err2 = bpf_msg_cork_bytes(msg, *bytes); 212 + len1 = (__u64)msg->data_end - (__u64)msg->data; 213 + start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 214 + end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 215 + if (start && end) { 216 + int err; 217 + 218 + bpf_printk("sk_msg2: pull(%i:%i)\n", 219 + start ? *start : 0, end ? 
*end : 0); 220 + err = bpf_msg_pull_data(msg, *start, *end, 0); 221 + if (err) 222 + bpf_printk("sk_msg2: pull_data err %i\n", 223 + err); 224 + len2 = (__u64)msg->data_end - (__u64)msg->data; 225 + bpf_printk("sk_msg2: length update %i->%i\n", 226 + len1, len2); 227 + } 228 + bpf_printk("sk_msg2: data length %i err1 %i err2 %i\n", 229 + len1, err1, err2); 230 + return SK_PASS; 231 + } 232 + 233 + SEC("sk_msg3") 234 + int bpf_prog6(struct sk_msg_md *msg) 235 + { 236 + int *bytes, zero = 0, one = 1, key = 0; 237 + int *start, *end, *f; 238 + __u64 flags = 0; 239 + 240 + bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 241 + if (bytes) 242 + bpf_msg_apply_bytes(msg, *bytes); 243 + bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 244 + if (bytes) 245 + bpf_msg_cork_bytes(msg, *bytes); 246 + start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 247 + end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 248 + if (start && end) 249 + bpf_msg_pull_data(msg, *start, *end, 0); 250 + f = bpf_map_lookup_elem(&sock_redir_flags, &zero); 251 + if (f && *f) { 252 + key = 2; 253 + flags = *f; 254 + } 255 + #ifdef SOCKMAP 256 + return bpf_msg_redirect_map(msg, &sock_map_redir, key, flags); 257 + #else 258 + return bpf_msg_redirect_hash(msg, &sock_map_redir, &key, flags); 259 + #endif 260 + } 261 + 262 + SEC("sk_msg4") 263 + int bpf_prog7(struct sk_msg_md *msg) 264 + { 265 + int err1 = 0, err2 = 0, zero = 0, one = 1, key = 0; 266 + int *f, *bytes, *start, *end, len1, len2; 267 + __u64 flags = 0; 268 + 269 + int err; 270 + bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 271 + if (bytes) 272 + err1 = bpf_msg_apply_bytes(msg, *bytes); 273 + bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 274 + if (bytes) 275 + err2 = bpf_msg_cork_bytes(msg, *bytes); 276 + len1 = (__u64)msg->data_end - (__u64)msg->data; 277 + start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 278 + end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 279 + if (start && end) { 280 + 281 + 
bpf_printk("sk_msg2: pull(%i:%i)\n", 282 + start ? *start : 0, end ? *end : 0); 283 + err = bpf_msg_pull_data(msg, *start, *end, 0); 284 + if (err) 285 + bpf_printk("sk_msg2: pull_data err %i\n", 286 + err); 287 + len2 = (__u64)msg->data_end - (__u64)msg->data; 288 + bpf_printk("sk_msg2: length update %i->%i\n", 289 + len1, len2); 290 + } 291 + f = bpf_map_lookup_elem(&sock_redir_flags, &zero); 292 + if (f && *f) { 293 + key = 2; 294 + flags = *f; 295 + } 296 + bpf_printk("sk_msg3: redirect(%iB) flags=%i err=%i\n", 297 + len1, flags, err1 ? err1 : err2); 298 + #ifdef SOCKMAP 299 + err = bpf_msg_redirect_map(msg, &sock_map_redir, key, flags); 300 + #else 301 + err = bpf_msg_redirect_hash(msg, &sock_map_redir, &key, flags); 302 + #endif 303 + bpf_printk("sk_msg3: err %i\n", err); 304 + return err; 305 + } 306 + 307 + SEC("sk_msg5") 308 + int bpf_prog8(struct sk_msg_md *msg) 309 + { 310 + void *data_end = (void *)(long) msg->data_end; 311 + void *data = (void *)(long) msg->data; 312 + int ret = 0, *bytes, zero = 0; 313 + 314 + bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 315 + if (bytes) { 316 + ret = bpf_msg_apply_bytes(msg, *bytes); 317 + if (ret) 318 + return SK_DROP; 319 + } else { 320 + return SK_DROP; 321 + } 322 + return SK_PASS; 323 + } 324 + SEC("sk_msg6") 325 + int bpf_prog9(struct sk_msg_md *msg) 326 + { 327 + void *data_end = (void *)(long) msg->data_end; 328 + void *data = (void *)(long) msg->data; 329 + int ret = 0, *bytes, zero = 0; 330 + 331 + bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 332 + if (bytes) { 333 + if (((__u64)data_end - (__u64)data) >= *bytes) 334 + return SK_PASS; 335 + ret = bpf_msg_cork_bytes(msg, *bytes); 336 + if (ret) 337 + return SK_DROP; 338 + } 339 + return SK_PASS; 340 + } 341 + 342 + SEC("sk_msg7") 343 + int bpf_prog10(struct sk_msg_md *msg) 344 + { 345 + int *bytes, zero = 0, one = 1; 346 + int *start, *end; 347 + 348 + bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero); 349 + if (bytes) 350 + 
bpf_msg_apply_bytes(msg, *bytes); 351 + bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero); 352 + if (bytes) 353 + bpf_msg_cork_bytes(msg, *bytes); 354 + start = bpf_map_lookup_elem(&sock_pull_bytes, &zero); 355 + end = bpf_map_lookup_elem(&sock_pull_bytes, &one); 356 + if (start && end) 357 + bpf_msg_pull_data(msg, *start, *end, 0); 358 + 359 + return SK_DROP; 360 + } 361 + 362 + int _version SEC("version") = 1; 363 + char _license[] SEC("license") = "GPL";
tools/testing/selftests/bpf/test_verifier.c (+62)
··· 41 41 # endif 42 42 #endif 43 43 #include "bpf_rlimit.h" 44 + #include "bpf_rand.h" 44 45 #include "../../../include/linux/filter.h" 45 46 46 47 #ifndef ARRAY_SIZE ··· 151 150 while (i < len - 1) 152 151 insn[i++] = BPF_LD_ABS(BPF_B, 1); 153 152 insn[i] = BPF_EXIT_INSN(); 153 + } 154 + 155 + static void bpf_fill_rand_ld_dw(struct bpf_test *self) 156 + { 157 + struct bpf_insn *insn = self->insns; 158 + uint64_t res = 0; 159 + int i = 0; 160 + 161 + insn[i++] = BPF_MOV32_IMM(BPF_REG_0, 0); 162 + while (i < self->retval) { 163 + uint64_t val = bpf_semi_rand_get(); 164 + struct bpf_insn tmp[2] = { BPF_LD_IMM64(BPF_REG_1, val) }; 165 + 166 + res ^= val; 167 + insn[i++] = tmp[0]; 168 + insn[i++] = tmp[1]; 169 + insn[i++] = BPF_ALU64_REG(BPF_XOR, BPF_REG_0, BPF_REG_1); 170 + } 171 + insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_0); 172 + insn[i++] = BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 32); 173 + insn[i++] = BPF_ALU64_REG(BPF_XOR, BPF_REG_0, BPF_REG_1); 174 + insn[i] = BPF_EXIT_INSN(); 175 + res ^= (res >> 32); 176 + self->retval = (uint32_t)res; 154 177 } 155 178 156 179 static struct bpf_test tests[] = { ··· 11999 11974 .result = ACCEPT, 12000 11975 .retval = 10, 12001 11976 }, 11977 + { 11978 + "ld_dw: xor semi-random 64 bit imms, test 1", 11979 + .insns = { }, 11980 + .data = { }, 11981 + .fill_helper = bpf_fill_rand_ld_dw, 11982 + .prog_type = BPF_PROG_TYPE_SCHED_CLS, 11983 + .result = ACCEPT, 11984 + .retval = 4090, 11985 + }, 11986 + { 11987 + "ld_dw: xor semi-random 64 bit imms, test 2", 11988 + .insns = { }, 11989 + .data = { }, 11990 + .fill_helper = bpf_fill_rand_ld_dw, 11991 + .prog_type = BPF_PROG_TYPE_SCHED_CLS, 11992 + .result = ACCEPT, 11993 + .retval = 2047, 11994 + }, 11995 + { 11996 + "ld_dw: xor semi-random 64 bit imms, test 3", 11997 + .insns = { }, 11998 + .data = { }, 11999 + .fill_helper = bpf_fill_rand_ld_dw, 12000 + .prog_type = BPF_PROG_TYPE_SCHED_CLS, 12001 + .result = ACCEPT, 12002 + .retval = 511, 12003 + }, 12004 + { 12005 + "ld_dw: xor 
semi-random 64 bit imms, test 4", 12006 + .insns = { }, 12007 + .data = { }, 12008 + .fill_helper = bpf_fill_rand_ld_dw, 12009 + .prog_type = BPF_PROG_TYPE_SCHED_CLS, 12010 + .result = ACCEPT, 12011 + .retval = 5, 12012 + }, 12002 12013 }; 12003 12014 12004 12015 static int probe_filter_length(const struct bpf_insn *fp) ··· 12407 12346 return EXIT_FAILURE; 12408 12347 } 12409 12348 12349 + bpf_semi_rand_init(); 12410 12350 return do_test(unpriv, from, to); 12411 12351 }
tools/testing/selftests/bpf/trace_helpers.c (+30, -57)
··· 74 74 75 75 static int page_size; 76 76 static int page_cnt = 8; 77 - static volatile struct perf_event_mmap_page *header; 77 + static struct perf_event_mmap_page *header; 78 78 79 79 int perf_event_mmap(int fd) 80 80 { ··· 107 107 char data[]; 108 108 }; 109 109 110 - static int perf_event_read(perf_event_print_fn fn) 110 + static enum bpf_perf_event_ret bpf_perf_event_print(void *event, void *priv) 111 111 { 112 - __u64 data_tail = header->data_tail; 113 - __u64 data_head = header->data_head; 114 - __u64 buffer_size = page_cnt * page_size; 115 - void *base, *begin, *end; 116 - char buf[256]; 112 + struct perf_event_sample *e = event; 113 + perf_event_print_fn fn = priv; 117 114 int ret; 118 115 119 - asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */ 120 - if (data_head == data_tail) 121 - return PERF_EVENT_CONT; 122 - 123 - base = ((char *)header) + page_size; 124 - 125 - begin = base + data_tail % buffer_size; 126 - end = base + data_head % buffer_size; 127 - 128 - while (begin != end) { 129 - struct perf_event_sample *e; 130 - 131 - e = begin; 132 - if (begin + e->header.size > base + buffer_size) { 133 - long len = base + buffer_size - begin; 134 - 135 - assert(len < e->header.size); 136 - memcpy(buf, begin, len); 137 - memcpy(buf + len, base, e->header.size - len); 138 - e = (void *) buf; 139 - begin = base + e->header.size - len; 140 - } else if (begin + e->header.size == base + buffer_size) { 141 - begin = base; 142 - } else { 143 - begin += e->header.size; 144 - } 145 - 146 - if (e->header.type == PERF_RECORD_SAMPLE) { 147 - ret = fn(e->data, e->size); 148 - if (ret != PERF_EVENT_CONT) 149 - return ret; 150 - } else if (e->header.type == PERF_RECORD_LOST) { 151 - struct { 152 - struct perf_event_header header; 153 - __u64 id; 154 - __u64 lost; 155 - } *lost = (void *) e; 156 - printf("lost %lld events\n", lost->lost); 157 - } else { 158 - printf("unknown event type=%d size=%d\n", 159 - e->header.type, e->header.size); 160 - } 116 
+ if (e->header.type == PERF_RECORD_SAMPLE) { 117 + ret = fn(e->data, e->size); 118 + if (ret != LIBBPF_PERF_EVENT_CONT) 119 + return ret; 120 + } else if (e->header.type == PERF_RECORD_LOST) { 121 + struct { 122 + struct perf_event_header header; 123 + __u64 id; 124 + __u64 lost; 125 + } *lost = (void *) e; 126 + printf("lost %lld events\n", lost->lost); 127 + } else { 128 + printf("unknown event type=%d size=%d\n", 129 + e->header.type, e->header.size); 161 130 } 162 131 163 - __sync_synchronize(); /* smp_mb() */ 164 - header->data_tail = data_head; 165 - return PERF_EVENT_CONT; 132 + return LIBBPF_PERF_EVENT_CONT; 166 133 } 167 134 168 135 int perf_event_poller(int fd, perf_event_print_fn output_fn) 169 136 { 170 - int ret; 137 + enum bpf_perf_event_ret ret; 138 + void *buf = NULL; 139 + size_t len = 0; 171 140 172 141 for (;;) { 173 142 perf_event_poll(fd); 174 - ret = perf_event_read(output_fn); 175 - if (ret != PERF_EVENT_CONT) 176 - return ret; 143 + ret = bpf_perf_event_read_simple(header, page_cnt * page_size, 144 + page_size, &buf, &len, 145 + bpf_perf_event_print, 146 + output_fn); 147 + if (ret != LIBBPF_PERF_EVENT_CONT) 148 + break; 177 149 } 150 + free(buf); 178 151 179 - return PERF_EVENT_DONE; 152 + return ret; 180 153 }
tools/testing/selftests/bpf/trace_helpers.h (+4, -7)
··· 2 2 #ifndef __TRACE_HELPER_H 3 3 #define __TRACE_HELPER_H 4 4 5 + #include <libbpf.h> 6 + 5 7 struct ksym { 6 8 long addr; 7 9 char *name; ··· 12 10 int load_kallsyms(void); 13 11 struct ksym *ksym_search(long key); 14 12 15 - typedef int (*perf_event_print_fn)(void *data, int size); 16 - 17 - /* return code for perf_event_print_fn */ 18 - #define PERF_EVENT_DONE 0 19 - #define PERF_EVENT_ERROR -1 20 - #define PERF_EVENT_CONT -2 13 + typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size); 21 14 22 15 int perf_event_mmap(int fd); 23 - /* return PERF_EVENT_DONE or PERF_EVENT_ERROR */ 16 + /* return LIBBPF_PERF_EVENT_DONE or LIBBPF_PERF_EVENT_ERROR */ 24 17 int perf_event_poller(int fd, perf_event_print_fn output_fn); 25 18 #endif
tools/testing/selftests/bpf/urandom_read.c (+8, -2)
··· 6 6 #include <stdlib.h> 7 7 8 8 #define BUF_SIZE 256 9 - int main(void) 9 + 10 + int main(int argc, char *argv[]) 10 11 { 11 12 int fd = open("/dev/urandom", O_RDONLY); 12 13 int i; 13 14 char buf[BUF_SIZE]; 15 + int count = 4; 14 16 15 17 if (fd < 0) 16 18 return 1; 17 - for (i = 0; i < 4; ++i) 19 + 20 + if (argc == 2) 21 + count = atoi(argv[1]); 22 + 23 + for (i = 0; i < count; ++i) 18 24 read(fd, buf, BUF_SIZE); 19 25 20 26 close(fd);