Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'bpf-trampoline-support-jmp-mode'

Menglong Dong says:

====================
bpf trampoline support "jmp" mode

Currently, the bpf trampoline is entered with a "call" instruction. However,
this breaks the RSB (Return Stack Buffer) and introduces extra overhead on
the x86_64 arch.

For example, if we hook the function "foo" with fexit, the call and return
logic looks like this:
call foo -> call trampoline -> call foo-body ->
return foo-body -> return foo

As we can see above, there are 3 calls but only 2 returns, which breaks the
RSB balance. We could insert a dummy "return" here, but that is not the best
choice, as it still causes one RSB miss:
call foo -> call trampoline -> call foo-body ->
return foo-body -> return dummy -> return foo

The "return dummy" doesn't pair with the "call trampoline", which still
causes an RSB miss.

Therefore, we introduce a "jmp" mode for the bpf trampoline, as suggested by
Alexei in [1]. The logic then becomes:
call foo -> jmp trampoline -> call foo-body ->
return foo-body -> return foo

As we can see above, the RSB is totally balanced after this series.

In this series, we introduce the FTRACE_OPS_FL_JMP flag for ftrace to make it
use a "jmp" instruction instead of a "call".

We also adjust bpf_arch_text_poke() to allow callers to specify both the old
and the new poke_type.

For the BPF_TRAMP_F_SHARE_IPMODIFY case, we fall back to "call" mode, as it
needs to get the function address from the stack, which is not supported in
"jmp" mode.

Before this series, the bpf benchmarks show the following performance:

$ cd tools/testing/selftests/bpf
$ ./benchs/run_bench_trigger.sh
usermode-count : 890.171 ± 1.522M/s
kernel-count : 409.184 ± 0.330M/s
syscall-count : 26.792 ± 0.010M/s
fentry : 171.242 ± 0.322M/s
fexit : 80.544 ± 0.045M/s
fmodret : 78.301 ± 0.065M/s
rawtp : 192.906 ± 0.900M/s
tp : 81.883 ± 0.209M/s
kprobe : 52.029 ± 0.113M/s
kprobe-multi : 62.237 ± 0.060M/s
kprobe-multi-all: 4.761 ± 0.014M/s
kretprobe : 23.779 ± 0.046M/s
kretprobe-multi: 29.134 ± 0.012M/s
kretprobe-multi-all: 3.822 ± 0.003M/s

And after this series, we have the following performance:

usermode-count : 890.443 ± 0.307M/s
kernel-count : 416.139 ± 0.055M/s
syscall-count : 31.037 ± 0.813M/s
fentry : 169.549 ± 0.519M/s
fexit : 136.540 ± 0.518M/s
fmodret : 159.248 ± 0.188M/s
rawtp : 194.475 ± 0.144M/s
tp : 84.505 ± 0.041M/s
kprobe : 59.951 ± 0.071M/s
kprobe-multi : 63.153 ± 0.177M/s
kprobe-multi-all: 4.699 ± 0.012M/s
kretprobe : 23.740 ± 0.015M/s
kretprobe-multi: 29.301 ± 0.022M/s
kretprobe-multi-all: 3.869 ± 0.005M/s

As we can see above, fexit performance increases from 80.544M/s to
136.540M/s, and "fmodret" increases from 78.301M/s to 159.248M/s.

Link: https://lore.kernel.org/bpf/20251117034906.32036-1-dongml2@chinatelecom.cn/
Changes since v2:
* reject the addr if it is already marked "jmp" in register_ftrace_direct()
  and __modify_ftrace_direct() in the 1st patch.
* fix a compile error on powerpc in the 5th patch.
* changes in the 6th patch:
- fix the compile error by wrapping the write to tr->fops->flags with
CONFIG_DYNAMIC_FTRACE_WITH_JMP
  - reset BPF_TRAMP_F_SKIP_FRAME on the second try of modify_fentry() in
    bpf_trampoline_update()

Link: https://lore.kernel.org/bpf/20251114092450.172024-1-dongml2@chinatelecom.cn/
Changes since v1:
* change the bool parameter that we add to save_args() to "u32 flags"
* rename bpf_trampoline_need_jmp() to bpf_trampoline_use_jmp()
* add a new function parameter to bpf_arch_text_poke() instead of introducing
  bpf_arch_text_poke_type()
* rename bpf_text_poke to bpf_trampoline_update_fentry
* remove BPF_TRAMP_F_JMPED and check the current mode against the original
  flags instead.

[1] https://lore.kernel.org/bpf/CAADnVQLX54sVi1oaHrkSiLqjJaJdm3TQjoVrgU-LZimK6iDcSA@mail.gmail.com/
====================

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20251118123639.688444-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

15 files changed, 225 insertions(+), 65 deletions(-)
arch/arm64/net/bpf_jit_comp.c (+7 -7):

@@ ... @@
  * The dummy_tramp is used to prevent another CPU from jumping to unknown
  * locations during the patching process, making the patching process easier.
  */
-int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
-		       void *old_addr, void *new_addr)
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
+		       enum bpf_text_poke_type new_t, void *old_addr,
+		       void *new_addr)
 {
 	int ret;
 	u32 old_insn;
@@ ... @@
 		    !poking_bpf_entry))
 		return -EINVAL;
 
-	if (poke_type == BPF_MOD_CALL)
-		branch_type = AARCH64_INSN_BRANCH_LINK;
-	else
-		branch_type = AARCH64_INSN_BRANCH_NOLINK;
-
+	branch_type = old_t == BPF_MOD_CALL ? AARCH64_INSN_BRANCH_LINK :
+					      AARCH64_INSN_BRANCH_NOLINK;
 	if (gen_branch_or_nop(branch_type, ip, old_addr, plt, &old_insn) < 0)
 		return -EFAULT;
 
+	branch_type = new_t == BPF_MOD_CALL ? AARCH64_INSN_BRANCH_LINK :
+					      AARCH64_INSN_BRANCH_NOLINK;
 	if (gen_branch_or_nop(branch_type, ip, new_addr, plt, &new_insn) < 0)
 		return -EFAULT;
arch/loongarch/net/bpf_jit.c (+6 -3):

@@ ... @@
 	return ret ? ERR_PTR(-EINVAL) : dst;
 }
 
-int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
-		       void *old_addr, void *new_addr)
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
+		       enum bpf_text_poke_type new_t, void *old_addr,
+		       void *new_addr)
 {
 	int ret;
-	bool is_call = (poke_type == BPF_MOD_CALL);
+	bool is_call;
 	u32 old_insns[LOONGARCH_LONG_JUMP_NINSNS] = {[0 ... 4] = INSN_NOP};
 	u32 new_insns[LOONGARCH_LONG_JUMP_NINSNS] = {[0 ... 4] = INSN_NOP};
 
@@ ... @@
 	if (!is_bpf_text_address((unsigned long)ip))
 		return -ENOTSUPP;
 
+	is_call = old_t == BPF_MOD_CALL;
 	ret = emit_jump_or_nops(old_addr, ip, old_insns, is_call);
 	if (ret)
 		return ret;
@@ ... @@
 	if (memcmp(ip, old_insns, LOONGARCH_LONG_JUMP_NBYTES))
 		return -EFAULT;
 
+	is_call = new_t == BPF_MOD_CALL;
 	ret = emit_jump_or_nops(new_addr, ip, new_insns, is_call);
 	if (ret)
 		return ret;
arch/powerpc/net/bpf_jit_comp.c (+6 -4):

@@ ... @@
  * execute isync (or some CSI) so that they don't go back into the
  * trampoline again.
  */
-int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
-		       void *old_addr, void *new_addr)
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
+		       enum bpf_text_poke_type new_t, void *old_addr,
+		       void *new_addr)
 {
 	unsigned long bpf_func, bpf_func_end, size, offset;
 	ppc_inst_t old_inst, new_inst;
@@ ... @@
 		return -EOPNOTSUPP;
 
 	bpf_func = (unsigned long)ip;
-	branch_flags = poke_type == BPF_MOD_CALL ? BRANCH_SET_LINK : 0;
 
 	/* We currently only support poking bpf programs */
 	if (!__bpf_address_lookup(bpf_func, &size, &offset, name)) {
@@ ... @@
 	 * an unconditional branch instruction at im->ip_after_call
 	 */
 	if (offset) {
-		if (poke_type != BPF_MOD_JUMP) {
+		if (old_t == BPF_MOD_CALL || new_t == BPF_MOD_CALL) {
 			pr_err("%s (0x%lx): calls are not supported in bpf prog body\n", __func__,
 			       bpf_func);
 			return -EOPNOTSUPP;
@@ ... @@
 	}
 
 	old_inst = ppc_inst(PPC_RAW_NOP());
+	branch_flags = old_t == BPF_MOD_CALL ? BRANCH_SET_LINK : 0;
 	if (old_addr) {
 		if (is_offset_in_branch_range(ip - old_addr))
 			create_branch(&old_inst, ip, (unsigned long)old_addr, branch_flags);
@@ ... @@
 			       branch_flags);
 	}
 	new_inst = ppc_inst(PPC_RAW_NOP());
+	branch_flags = new_t == BPF_MOD_CALL ? BRANCH_SET_LINK : 0;
 	if (new_addr) {
 		if (is_offset_in_branch_range(ip - new_addr))
 			create_branch(&new_inst, ip, (unsigned long)new_addr, branch_flags);
arch/riscv/net/bpf_jit_comp64.c (+7 -4):

@@ ... @@
 	return emit_jump_and_link(is_call ? RV_REG_T0 : RV_REG_ZERO, rvoff, false, &ctx);
 }
 
-int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
-		       void *old_addr, void *new_addr)
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
+		       enum bpf_text_poke_type new_t, void *old_addr,
+		       void *new_addr)
 {
 	u32 old_insns[RV_FENTRY_NINSNS], new_insns[RV_FENTRY_NINSNS];
-	bool is_call = poke_type == BPF_MOD_CALL;
+	bool is_call;
 	int ret;
 
 	if (!is_kernel_text((unsigned long)ip) &&
 	    !is_bpf_text_address((unsigned long)ip))
 		return -ENOTSUPP;
 
+	is_call = old_t == BPF_MOD_CALL;
 	ret = gen_jump_or_nops(old_addr, ip, old_insns, is_call);
 	if (ret)
 		return ret;
@@ ... @@
 	if (memcmp(ip, old_insns, RV_FENTRY_NBYTES))
 		return -EFAULT;
 
+	is_call = new_t == BPF_MOD_CALL;
 	ret = gen_jump_or_nops(new_addr, ip, new_insns, is_call);
 	if (ret)
 		return ret;
@@ ... @@
 	store_args(nr_arg_slots, args_off, ctx);
 
 	/* skip to actual body of traced function */
-	if (flags & BPF_TRAMP_F_SKIP_FRAME)
+	if (flags & BPF_TRAMP_F_ORIG_STACK)
 		orig_call += RV_FENTRY_NINSNS * 4;
 
 	if (flags & BPF_TRAMP_F_CALL_ORIG) {
arch/s390/net/bpf_jit_comp.c (+4 -3):

@@ ... @@
 	return true;
 }
 
-int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
-		       void *old_addr, void *new_addr)
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
+		       enum bpf_text_poke_type new_t, void *old_addr,
+		       void *new_addr)
 {
 	struct bpf_plt expected_plt, current_plt, new_plt, *plt;
 	struct {
@@ ... @@
 	if (insn.opc != (0xc004 | (old_addr ? 0xf0 : 0)))
 		return -EINVAL;
 
-	if (t == BPF_MOD_JUMP &&
+	if ((new_t == BPF_MOD_JUMP || old_t == BPF_MOD_JUMP) &&
 	    insn.disp == ((char *)new_addr - (char *)ip) >> 1) {
 		/*
 		 * The branch already points to the destination,
arch/x86/Kconfig (+1):

@@ ... @@
 	select HAVE_DYNAMIC_FTRACE_WITH_ARGS		if X86_64
 	select HAVE_FTRACE_REGS_HAVING_PT_REGS		if X86_64
 	select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+	select HAVE_DYNAMIC_FTRACE_WITH_JMP		if X86_64
 	select HAVE_SAMPLE_FTRACE_DIRECT		if X86_64
 	select HAVE_SAMPLE_FTRACE_DIRECT_MULTI		if X86_64
 	select HAVE_EBPF_JIT
arch/x86/kernel/ftrace.c (+6 -1):

@@ ... @@
 	 * No need to translate into a callthunk. The trampoline does
 	 * the depth accounting itself.
 	 */
-	return text_gen_insn(CALL_INSN_OPCODE, (void *)ip, (void *)addr);
+	if (ftrace_is_jmp(addr)) {
+		addr = ftrace_jmp_get(addr);
+		return text_gen_insn(JMP32_INSN_OPCODE, (void *)ip, (void *)addr);
+	} else {
+		return text_gen_insn(CALL_INSN_OPCODE, (void *)ip, (void *)addr);
+	}
 }
 
 static int ftrace_verify_code(unsigned long ip, const char *old_code)
arch/x86/kernel/ftrace_64.S (+11 -1):

@@ ... @@
 	ANNOTATE_NOENDBR
 	RET
 
+1:
+	testb	$1, %al
+	jz	2f
+	andq	$0xfffffffffffffffe, %rax
+	movq	%rax, MCOUNT_REG_SIZE+8(%rsp)
+	restore_mcount_regs
+	/* Restore flags */
+	popfq
+	RET
+
 	/* Swap the flags with orig_rax */
-1:	movq MCOUNT_REG_SIZE(%rsp), %rdi
+2:	movq MCOUNT_REG_SIZE(%rsp), %rdi
 	movq %rdi, MCOUNT_REG_SIZE-8(%rsp)
 	movq %rax, MCOUNT_REG_SIZE(%rsp)
arch/x86/net/bpf_jit_comp.c (+33 -22):

@@ ... @@
 	return emit_patch(pprog, func, ip, 0xE9);
 }
 
-static int __bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
+static int __bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
+				enum bpf_text_poke_type new_t,
 				void *old_addr, void *new_addr)
 {
 	const u8 *nop_insn = x86_nops[5];
@@ ... @@
 	int ret;
 
 	memcpy(old_insn, nop_insn, X86_PATCH_SIZE);
-	if (old_addr) {
+	if (old_t != BPF_MOD_NOP && old_addr) {
 		prog = old_insn;
-		ret = t == BPF_MOD_CALL ?
+		ret = old_t == BPF_MOD_CALL ?
 		      emit_call(&prog, old_addr, ip) :
 		      emit_jump(&prog, old_addr, ip);
 		if (ret)
@@ ... @@
 	}
 
 	memcpy(new_insn, nop_insn, X86_PATCH_SIZE);
-	if (new_addr) {
+	if (new_t != BPF_MOD_NOP && new_addr) {
 		prog = new_insn;
-		ret = t == BPF_MOD_CALL ?
+		ret = new_t == BPF_MOD_CALL ?
 		      emit_call(&prog, new_addr, ip) :
 		      emit_jump(&prog, new_addr, ip);
 		if (ret)
@@ ... @@
 	return ret;
 }
 
-int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
-		       void *old_addr, void *new_addr)
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
+		       enum bpf_text_poke_type new_t, void *old_addr,
+		       void *new_addr)
 {
 	if (!is_kernel_text((long)ip) &&
 	    !is_bpf_text_address((long)ip))
@@ ... @@
 	if (is_endbr(ip))
 		ip += ENDBR_INSN_SIZE;
 
-	return __bpf_arch_text_poke(ip, t, old_addr, new_addr);
+	return __bpf_arch_text_poke(ip, old_t, new_t, old_addr, new_addr);
 }
 
 #define EMIT_LFENCE()	EMIT3(0x0F, 0xAE, 0xE8)
@@ ... @@
 		target = array->ptrs[poke->tail_call.key];
 		if (target) {
 			ret = __bpf_arch_text_poke(poke->tailcall_target,
-						   BPF_MOD_JUMP, NULL,
+						   BPF_MOD_NOP, BPF_MOD_JUMP,
+						   NULL,
 						   (u8 *)target->bpf_func +
 						   poke->adj_off);
 			BUG_ON(ret < 0);
 			ret = __bpf_arch_text_poke(poke->tailcall_bypass,
-						   BPF_MOD_JUMP,
+						   BPF_MOD_JUMP, BPF_MOD_NOP,
 						   (u8 *)poke->tailcall_target +
 						   X86_PATCH_SIZE, NULL);
 			BUG_ON(ret < 0);
@@ ... @@
 }
 
 static void save_args(const struct btf_func_model *m, u8 **prog,
-		      int stack_size, bool for_call_origin)
+		      int stack_size, bool for_call_origin, u32 flags)
 {
 	int arg_regs, first_off = 0, nr_regs = 0, nr_stack_slots = 0;
+	bool use_jmp = bpf_trampoline_use_jmp(flags);
 	int i, j;
 
 	/* Store function arguments to stack.
@@ ... @@
 		 */
 		for (j = 0; j < arg_regs; j++) {
 			emit_ldx(prog, BPF_DW, BPF_REG_0, BPF_REG_FP,
-				 nr_stack_slots * 8 + 0x18);
+				 nr_stack_slots * 8 + 16 + (!use_jmp) * 8);
 			emit_stx(prog, BPF_DW, BPF_REG_FP, BPF_REG_0,
 				 -stack_size);
 
@@ ... @@
 		 * should be 16-byte aligned. Following code depend on
 		 * that stack_size is already 8-byte aligned.
 		 */
-		stack_size += (stack_size % 16) ? 0 : 8;
+		if (bpf_trampoline_use_jmp(flags)) {
+			/* no rip in the "jmp" case */
+			stack_size += (stack_size % 16) ? 8 : 0;
+		} else {
+			stack_size += (stack_size % 16) ? 0 : 8;
+		}
 	}
 
 	arg_stack_off = stack_size;
 
-	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
 		/* skip patched call instruction and point orig_call to actual
 		 * body of the kernel function.
 		 */
@@ ... @@
 		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -ip_off);
 	}
 
-	save_args(m, &prog, regs_off, false);
+	save_args(m, &prog, regs_off, false, flags);
 
 	if (flags & BPF_TRAMP_F_CALL_ORIG) {
 		/* arg1: mov rdi, im */
@@ ... @@
 
 	if (flags & BPF_TRAMP_F_CALL_ORIG) {
 		restore_regs(m, &prog, regs_off);
-		save_args(m, &prog, arg_stack_off, true);
+		save_args(m, &prog, arg_stack_off, true, flags);
 
 		if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
 			/* Before calling the original function, load the
@@ ... @@
 			       struct bpf_prog *new, struct bpf_prog *old)
 {
 	u8 *old_addr, *new_addr, *old_bypass_addr;
+	enum bpf_text_poke_type t;
 	int ret;
 
 	old_bypass_addr = old ? NULL : poke->bypass_addr;
@@ ... @@
 	 * the kallsyms check.
 	 */
 	if (new) {
+		t = old_addr ? BPF_MOD_JUMP : BPF_MOD_NOP;
 		ret = __bpf_arch_text_poke(poke->tailcall_target,
-					   BPF_MOD_JUMP,
+					   t, BPF_MOD_JUMP,
 					   old_addr, new_addr);
 		BUG_ON(ret < 0);
 		if (!old) {
 			ret = __bpf_arch_text_poke(poke->tailcall_bypass,
-						   BPF_MOD_JUMP,
+						   BPF_MOD_JUMP, BPF_MOD_NOP,
 						   poke->bypass_addr,
 						   NULL);
 			BUG_ON(ret < 0);
 		}
 	} else {
+		t = old_bypass_addr ? BPF_MOD_JUMP : BPF_MOD_NOP;
 		ret = __bpf_arch_text_poke(poke->tailcall_bypass,
-					   BPF_MOD_JUMP,
-					   old_bypass_addr,
+					   t, BPF_MOD_JUMP, old_bypass_addr,
 					   poke->bypass_addr);
 		BUG_ON(ret < 0);
 		/* let other CPUs finish the execution of program
@@ ... @@
 		 */
 		if (!ret)
 			synchronize_rcu();
+		t = old_addr ? BPF_MOD_JUMP : BPF_MOD_NOP;
 		ret = __bpf_arch_text_poke(poke->tailcall_target,
-					   BPF_MOD_JUMP,
-					   old_addr, NULL);
+					   t, BPF_MOD_NOP, old_addr, NULL);
 		BUG_ON(ret < 0);
 	}
 }
include/linux/bpf.h (+16 -2):

@@ ... @@
 bpf_trampoline_enter_t bpf_trampoline_enter(const struct bpf_prog *prog);
 bpf_trampoline_exit_t bpf_trampoline_exit(const struct bpf_prog *prog);
 
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_JMP
+static inline bool bpf_trampoline_use_jmp(u64 flags)
+{
+	return flags & BPF_TRAMP_F_CALL_ORIG && !(flags & BPF_TRAMP_F_SKIP_FRAME);
+}
+#else
+static inline bool bpf_trampoline_use_jmp(u64 flags)
+{
+	return false;
+}
+#endif
+
 struct bpf_ksym {
 	unsigned long	 start;
 	unsigned long	 end;
@@ ... @@
 #endif /* CONFIG_INET */
 
 enum bpf_text_poke_type {
+	BPF_MOD_NOP,
 	BPF_MOD_CALL,
 	BPF_MOD_JUMP,
 };
 
-int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
-		       void *addr1, void *addr2);
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
+		       enum bpf_text_poke_type new_t, void *old_addr,
+		       void *new_addr);
 
 void bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
 			       struct bpf_prog *new, struct bpf_prog *old);
include/linux/ftrace.h (+33):

@@ ... @@
 	FTRACE_OPS_FL_DIRECT			= BIT(17),
 	FTRACE_OPS_FL_SUBOP			= BIT(18),
 	FTRACE_OPS_FL_GRAPH			= BIT(19),
+	FTRACE_OPS_FL_JMP			= BIT(20),
 };
 
 #ifndef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
@@ ... @@
 static inline void arch_ftrace_set_direct_caller(struct ftrace_regs *fregs,
 						 unsigned long addr) { }
 #endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_JMP
+static inline bool ftrace_is_jmp(unsigned long addr)
+{
+	return addr & 1;
+}
+
+static inline unsigned long ftrace_jmp_set(unsigned long addr)
+{
+	return addr | 1UL;
+}
+
+static inline unsigned long ftrace_jmp_get(unsigned long addr)
+{
+	return addr & ~1UL;
+}
+#else
+static inline bool ftrace_is_jmp(unsigned long addr)
+{
+	return false;
+}
+
+static inline unsigned long ftrace_jmp_set(unsigned long addr)
+{
+	return addr;
+}
+
+static inline unsigned long ftrace_jmp_get(unsigned long addr)
+{
+	return addr;
+}
+#endif /* CONFIG_DYNAMIC_FTRACE_WITH_JMP */
 
 #ifdef CONFIG_STACK_TRACER
 
kernel/bpf/core.c (+3 -2):

@@ ... @@
 	return -EFAULT;
 }
 
-int __weak bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
-			      void *addr1, void *addr2)
+int __weak bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
+			      enum bpf_text_poke_type new_t, void *old_addr,
+			      void *new_addr)
 {
 	return -ENOTSUPP;
 }
kernel/bpf/trampoline.c (+64 -15):

@@ ... @@
 	return tr;
 }
 
-static int unregister_fentry(struct bpf_trampoline *tr, void *old_addr)
+static int bpf_trampoline_update_fentry(struct bpf_trampoline *tr, u32 orig_flags,
+					void *old_addr, void *new_addr)
 {
+	enum bpf_text_poke_type new_t = BPF_MOD_CALL, old_t = BPF_MOD_CALL;
 	void *ip = tr->func.addr;
+
+	if (!new_addr)
+		new_t = BPF_MOD_NOP;
+	else if (bpf_trampoline_use_jmp(tr->flags))
+		new_t = BPF_MOD_JUMP;
+
+	if (!old_addr)
+		old_t = BPF_MOD_NOP;
+	else if (bpf_trampoline_use_jmp(orig_flags))
+		old_t = BPF_MOD_JUMP;
+
+	return bpf_arch_text_poke(ip, old_t, new_t, old_addr, new_addr);
+}
+
+static int unregister_fentry(struct bpf_trampoline *tr, u32 orig_flags,
+			     void *old_addr)
+{
 	int ret;
 
 	if (tr->func.ftrace_managed)
 		ret = unregister_ftrace_direct(tr->fops, (long)old_addr, false);
 	else
-		ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, old_addr, NULL);
+		ret = bpf_trampoline_update_fentry(tr, orig_flags, old_addr, NULL);
 
 	return ret;
 }
 
-static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_addr,
+static int modify_fentry(struct bpf_trampoline *tr, u32 orig_flags,
+			 void *old_addr, void *new_addr,
 			 bool lock_direct_mutex)
 {
-	void *ip = tr->func.addr;
 	int ret;
 
 	if (tr->func.ftrace_managed) {
@@ ... @@
 		else
 			ret = modify_ftrace_direct_nolock(tr->fops, (long)new_addr);
 	} else {
-		ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, old_addr, new_addr);
+		ret = bpf_trampoline_update_fentry(tr, orig_flags, old_addr,
+						   new_addr);
 	}
 	return ret;
 }
@@ ... @@
 			return ret;
 		ret = register_ftrace_direct(tr->fops, (long)new_addr);
 	} else {
-		ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
+		ret = bpf_trampoline_update_fentry(tr, 0, NULL, new_addr);
 	}
 
 	return ret;
@@ ... @@
 	 * call_rcu_tasks() is not necessary.
 	 */
 	if (im->ip_after_call) {
-		int err = bpf_arch_text_poke(im->ip_after_call, BPF_MOD_JUMP,
-					     NULL, im->ip_epilogue);
+		int err = bpf_arch_text_poke(im->ip_after_call, BPF_MOD_NOP,
+					     BPF_MOD_JUMP, NULL,
+					     im->ip_epilogue);
 		WARN_ON(err);
 		if (IS_ENABLED(CONFIG_TASKS_RCU))
 			call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
@@ ... @@
 		return PTR_ERR(tlinks);
 
 	if (total == 0) {
-		err = unregister_fentry(tr, tr->cur_image->image);
+		err = unregister_fentry(tr, orig_flags, tr->cur_image->image);
 		bpf_tramp_image_put(tr->cur_image);
 		tr->cur_image = NULL;
 		goto out;
@@ ... @@
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 again:
-	if ((tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY) &&
-	    (tr->flags & BPF_TRAMP_F_CALL_ORIG))
-		tr->flags |= BPF_TRAMP_F_ORIG_STACK;
+	if (tr->flags & BPF_TRAMP_F_CALL_ORIG) {
+		if (tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY) {
+			/* The BPF_TRAMP_F_SKIP_FRAME can be cleared in the
+			 * first try, reset it in the second try.
+			 */
+			tr->flags |= BPF_TRAMP_F_ORIG_STACK | BPF_TRAMP_F_SKIP_FRAME;
+		} else if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_JMP)) {
+			/* Use "jmp" instead of "call" for the trampoline
+			 * in the origin call case, and we don't need to
+			 * skip the frame.
+			 */
+			tr->flags &= ~BPF_TRAMP_F_SKIP_FRAME;
+		}
+	}
 #endif
 
 	size = arch_bpf_trampoline_size(&tr->func.model, tr->flags,
@@ ... @@
 	if (err)
 		goto out_free;
 
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_JMP
+	if (bpf_trampoline_use_jmp(tr->flags))
+		tr->fops->flags |= FTRACE_OPS_FL_JMP;
+	else
+		tr->fops->flags &= ~FTRACE_OPS_FL_JMP;
+#endif
+
 	WARN_ON(tr->cur_image && total == 0);
 	if (tr->cur_image)
 		/* progs already running at this address */
-		err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
+		err = modify_fentry(tr, orig_flags, tr->cur_image->image,
+				    im->image, lock_direct_mutex);
 	else
 		/* first time registering */
 		err = register_fentry(tr, im->image);
@@ ... @@
 	tr->cur_image = im;
 out:
 	/* If any error happens, restore previous flags */
-	if (err)
+	if (err) {
 		tr->flags = orig_flags;
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_JMP
+		if (bpf_trampoline_use_jmp(tr->flags))
+			tr->fops->flags |= FTRACE_OPS_FL_JMP;
+		else
+			tr->fops->flags &= ~FTRACE_OPS_FL_JMP;
+#endif
+	}
 	kfree(tlinks);
 	return err;
@@ ... @@
 		if (err)
 			return err;
 		tr->extension_prog = link->link.prog;
-		return bpf_arch_text_poke(tr->func.addr, BPF_MOD_JUMP, NULL,
+		return bpf_arch_text_poke(tr->func.addr, BPF_MOD_NOP,
+					  BPF_MOD_JUMP, NULL,
 					  link->link.prog->bpf_func);
 	}
 	if (cnt >= BPF_MAX_TRAMP_LINKS)
@@ ... @@
 	if (kind == BPF_TRAMP_REPLACE) {
 		WARN_ON_ONCE(!tr->extension_prog);
 		err = bpf_arch_text_poke(tr->func.addr, BPF_MOD_JUMP,
+					 BPF_MOD_NOP,
 					 tr->extension_prog->bpf_func, NULL);
 		tr->extension_prog = NULL;
 		guard(mutex)(&tgt_prog->aux->ext_mutex);
kernel/trace/Kconfig (+12):

@@ ... @@
 	  If the architecture generates __patchable_function_entries sections
 	  but does not want them included in the ftrace locations.
 
+config HAVE_DYNAMIC_FTRACE_WITH_JMP
+	bool
+	help
+	  If the architecture supports to replace the __fentry__ with a
+	  "jmp" instruction.
+
 config HAVE_SYSCALL_TRACEPOINTS
 	bool
 	help
@@ ... @@
 	def_bool y
 	depends on DYNAMIC_FTRACE
 	depends on HAVE_DYNAMIC_FTRACE_WITH_ARGS
+
+config DYNAMIC_FTRACE_WITH_JMP
+	def_bool y
+	depends on DYNAMIC_FTRACE
+	depends on DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+	depends on HAVE_DYNAMIC_FTRACE_WITH_JMP
 
 config FPROBE
 	bool "Kernel Function Probe (fprobe)"
kernel/trace/ftrace.c (+16 -1):

@@ ... @@
 	for (i = 0; i < size; i++) {
 		hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
 			del = __ftrace_lookup_ip(direct_functions, entry->ip);
-			if (del && del->direct == addr) {
+			if (del && ftrace_jmp_get(del->direct) ==
+				   ftrace_jmp_get(addr)) {
 				remove_hash_entry(direct_functions, del);
 				kfree(del);
 			}
@@ ... @@
 	if (ftrace_hash_empty(hash))
 		return -EINVAL;
 
+	/* This is a "raw" address, and this should never happen. */
+	if (WARN_ON_ONCE(ftrace_is_jmp(addr)))
+		return -EINVAL;
+
 	mutex_lock(&direct_mutex);
 
+	if (ops->flags & FTRACE_OPS_FL_JMP)
+		addr = ftrace_jmp_set(addr);
+
 	/* Make sure requested entries are not already registered.. */
 	size = 1 << hash->size_bits;
@@ ... @@
 	int err;
 
 	lockdep_assert_held_once(&direct_mutex);
+
+	/* This is a "raw" address, and this should never happen. */
+	if (WARN_ON_ONCE(ftrace_is_jmp(addr)))
+		return -EINVAL;
+
+	if (ops->flags & FTRACE_OPS_FL_JMP)
+		addr = ftrace_jmp_set(addr);
 
 	/* Enable the tmp_ops to have the same functions as the direct ops */
 	ftrace_ops_init(&tmp_ops);