Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS

The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value
is a kernel struct with its func ptrs implemented by bpf progs.
This new map is the interface to register/unregister/introspect
a bpf implemented kernel struct.

The kernel struct is actually embedded inside another new struct
(called the "value" struct in the code). For example,
"struct tcp_congestion_ops" is embedded in:

struct bpf_struct_ops_tcp_congestion_ops {
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
};
The map value is "struct bpf_struct_ops_tcp_congestion_ops".
"bpftool map dump" will then be able to show the
state ("inuse"/"tobefree") and the subsystem's refcnt (e.g. the
number of tcp_socks in the tcp_congestion_ops case). This "value" struct
is created automatically by a macro. Having a separate "value" struct
will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
"void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
initialization work before registering the struct_ops to the kernel
subsystem). libbpf will take care of finding and populating the
"struct bpf_struct_ops_XYZ" from "struct XYZ".

Register a struct_ops to a kernel subsystem:
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
running kernel.
Instead of reusing attr->btf_value_type_id,
btf_vmlinux_value_type_id is added so that attr->btf_fd can still be
used as the "user" btf, which could store other useful sysadmin/debug
info that may be introduced in the future,
e.g. creation-date/compiler-details/map-creator...etc.
3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
in the running kernel btf. Populate the value of this object.
The function ptr should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as
the map value. The key is always "0".

During BPF_MAP_UPDATE, the trampoline code that saves the kernel
func ptr's args as an array of u64 is generated. BPF_MAP_UPDATE also
allows the specific struct_ops to do some final checks in
"st_ops->init_member()" (e.g. ensure all mandatory func ptrs are
implemented). If everything looks good, it will register this kernel
struct to the kernel subsystem. The map will not allow further updates
from this point.

Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".

Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will
have the prog _id_ populated in place of each func ptr.

The map value state (enum bpf_struct_ops_state) transitions from:
INIT (map created) =>
INUSE (map updated, i.e. reg) =>
TOBEFREE (map value deleted, i.e. unreg)

The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in the
"struct bpf_struct_ops_XYZ". This patch uses a separate refcnt
for the purpose of tracking the subsystem usage. Another approach
is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
the subsystem's usage by doing map->refcnt - map->usercnt to filter out
the map-fd/pinned-map usage. However, that would also tie down the
future semantics of map->refcnt and map->usercnt.

The very first subsystem refcnt (taken during reg()) holds one
count on map->refcnt. When the very last subsystem refcnt
is gone, it also releases the map->refcnt. All bpf_progs will be
freed when the map->refcnt reaches 0 (i.e. during map_free()).

Here is what the bpftool map commands look like:
[root@arch-fb-vm1 bpf]# bpftool map show
6: struct_ops name dctcp flags 0x0
key 4B value 256B max_entries 1 memlock 4096B
btf_id 6
[root@arch-fb-vm1 bpf]# bpftool map dump id 6
[{
        "value": {
            "refcnt": {
                "refs": {
                    "counter": 1
                }
            },
            "state": 1,
            "data": {
                "list": {
                    "next": 0,
                    "prev": 0
                },
                "key": 0,
                "flags": 2,
                "init": 24,
                "release": 0,
                "ssthresh": 25,
                "cong_avoid": 30,
                "set_state": 27,
                "cwnd_event": 28,
                "in_ack_event": 26,
                "undo_cwnd": 29,
                "pkts_acked": 0,
                "min_tso_segs": 0,
                "sndbuf_expand": 0,
                "cong_control": 0,
                "get_info": 0,
                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                ],
                "owner": 0
            }
        }
    }
]

Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
  It does an in-place update on "*value" instead of returning a pointer
  to syscall.c. Otherwise, it would need a separate copy of a "zero"
  value for the BPF_STRUCT_OPS_STATE_INIT state to avoid races.

* bpf_struct_ops_map_delete_elem() is also called without
  preempt_disable() from map_delete_elem(). This is because
  the "->unreg()" may require a sleepable context, e.g.
  "tcp_unregister_congestion_control()".

* "const" is added to some of the existing "struct btf_func_model *"
  function args to avoid a compiler warning caused by this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com

Authored by Martin KaFai Lau, committed by Alexei Starovoitov
85d33df3 27ae7997

+642 -47
+8 -10
arch/x86/net/bpf_jit_comp.c
···
 	return proglen;
 }

-static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void save_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 		      int stack_size)
 {
 	int i;
···
 			 -(stack_size - i * 8));
 }

-static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 			 int stack_size)
 {
 	int i;
···
 			 -(stack_size - i * 8));
 }

-static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
+static int invoke_bpf(const struct btf_func_model *m, u8 **pprog,
 		      struct bpf_prog **progs, int prog_cnt, int stack_size)
 {
 	u8 *prog = *pprog;
···
  * add rsp, 8 // skip eth_type_trans's frame
  * ret // return to its caller
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image, void *image_end,
+				const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call)
···
 	/* skip our return address and return to parent */
 	EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
 	EMIT1(0xC3); /* ret */
-	/* One half of the page has active running trampoline.
-	 * Another half is an area for next trampoline.
-	 * Make sure the trampoline generation logic doesn't overflow.
-	 */
-	if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE / 2 - BPF_INSN_SAFETY))
+	/* Make sure the trampoline generation logic doesn't overflow */
+	if (WARN_ON_ONCE(prog > (u8 *)image_end - BPF_INSN_SAFETY))
 		return -EFAULT;
-	return 0;
+	return prog - (u8 *)image;
 }

 static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond)
+47 -2
include/linux/bpf.h
···
 #include <linux/u64_stats_sync.h>
 #include <linux/refcount.h>
 #include <linux/mutex.h>
+#include <linux/module.h>

 struct bpf_verifier_env;
 struct bpf_verifier_log;
···
 	struct btf *btf;
 	struct bpf_map_memory memory;
 	char name[BPF_OBJ_NAME_LEN];
+	u32 btf_vmlinux_value_type_id;
 	bool unpriv_array;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
 	/* 22 bytes hole */
···
 static inline bool bpf_map_support_seq_show(const struct bpf_map *map)
 {
-	return map->btf && map->ops->map_seq_show_elem;
+	return (map->btf_value_type_id || map->btf_vmlinux_value_type_id) &&
+		map->ops->map_seq_show_elem;
 }

 int map_check_no_btf(const struct bpf_map *map,
···
  * fentry = a set of program to run before calling original function
  * fexit = a set of program to run after original function
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image, void *image_end,
+				const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call);
···
 	struct work_struct work;
 };

+struct bpf_struct_ops_value;
 struct btf_type;
 struct btf_member;
···
 	int (*init)(struct btf *btf);
 	int (*check_member)(const struct btf_type *t,
 			    const struct btf_member *member);
+	int (*init_member)(const struct btf_type *t,
+			   const struct btf_member *member,
+			   void *kdata, const void *udata);
+	int (*reg)(void *kdata);
+	void (*unreg)(void *kdata);
 	const struct btf_type *type;
+	const struct btf_type *value_type;
 	const char *name;
 	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
 	u32 type_id;
+	u32 value_id;
 };

 #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
+#define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
 void bpf_struct_ops_init(struct btf *btf);
+bool bpf_struct_ops_get(const void *kdata);
+void bpf_struct_ops_put(const void *kdata);
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value);
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		return bpf_struct_ops_get(data);
+	else
+		return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		bpf_struct_ops_put(data);
+	else
+		module_put(owner);
+}
 #else
 static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
 	return NULL;
 }
 static inline void bpf_struct_ops_init(struct btf *btf) { }
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	module_put(owner);
+}
+static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map,
+						     void *key,
+						     void *value)
+{
+	return -EINVAL;
+}
 #endif

 struct bpf_array {
+3
include/linux/bpf_types.h
···
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
+#if defined(CONFIG_BPF_JIT)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
+#endif
+13
include/linux/btf.h
···
 #include <linux/types.h>
 #include <uapi/linux/btf.h>

+#define BTF_TYPE_EMIT(type) ((void)(type *)0)
+
 struct btf;
 struct btf_member;
 struct btf_type;
···
 		      u32 id, u32 *res_id);
 const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
 						 u32 id, u32 *res_id);
+const struct btf_type *
+btf_resolve_size(const struct btf *btf, const struct btf_type *type,
+		 u32 *type_size, const struct btf_type **elem_type,
+		 u32 *total_nelems);

 #define for_each_member(i, struct_type, member)			\
 	for (i = 0, member = btf_type_member(struct_type);	\
···
 static inline bool btf_type_kflag(const struct btf_type *t)
 {
 	return BTF_INFO_KFLAG(t->info);
+}
+
+static inline u32 btf_member_bit_offset(const struct btf_type *struct_type,
+					const struct btf_member *member)
+{
+	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
+					   : member->offset;
 }

 static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
+6 -1
include/uapi/linux/bpf.h
···
 	BPF_MAP_TYPE_STACK,
 	BPF_MAP_TYPE_SK_STORAGE,
 	BPF_MAP_TYPE_DEVMAP_HASH,
+	BPF_MAP_TYPE_STRUCT_OPS,
 };

 /* Note that tracing related programs such as
···
 	__u32	btf_fd;		/* fd pointing to a BTF type data */
 	__u32	btf_key_type_id;	/* BTF type_id of the key */
 	__u32	btf_value_type_id;	/* BTF type_id of the value */
+	__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
+					   * struct stored as the
+					   * map value
+					   */
 };

 struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
···
 	__u32 map_flags;
 	char  name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
-	__u32 :32;
+	__u32 btf_vmlinux_value_type_id;
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 btf_id;
+509 -2
kernel/bpf/bpf_struct_ops.c
···
 #include <linux/numa.h>
 #include <linux/seq_file.h>
 #include <linux/refcount.h>
+#include <linux/mutex.h>

+enum bpf_struct_ops_state {
+	BPF_STRUCT_OPS_STATE_INIT,
+	BPF_STRUCT_OPS_STATE_INUSE,
+	BPF_STRUCT_OPS_STATE_TOBEFREE,
+};
+
+#define BPF_STRUCT_OPS_COMMON_VALUE			\
+	refcount_t refcnt;				\
+	enum bpf_struct_ops_state state
+
+struct bpf_struct_ops_value {
+	BPF_STRUCT_OPS_COMMON_VALUE;
+	char data[0] ____cacheline_aligned_in_smp;
+};
+
+struct bpf_struct_ops_map {
+	struct bpf_map map;
+	const struct bpf_struct_ops *st_ops;
+	/* protect map_update */
+	struct mutex lock;
+	/* progs has all the bpf_prog that is populated
+	 * to the func ptr of the kernel's struct
+	 * (in kvalue.data).
+	 */
+	struct bpf_prog **progs;
+	/* image is a page that has all the trampolines
+	 * that stores the func args before calling the bpf_prog.
+	 * A PAGE_SIZE "image" is enough to store all trampoline for
+	 * "progs[]".
+	 */
+	void *image;
+	/* uvalue->data stores the kernel struct
+	 * (e.g. tcp_congestion_ops) that is more useful
+	 * to userspace than the kvalue.  For example,
+	 * the bpf_prog's id is stored instead of the kernel
+	 * address of a func ptr.
+	 */
+	struct bpf_struct_ops_value *uvalue;
+	/* kvalue.data stores the actual kernel's struct
+	 * (e.g. tcp_congestion_ops) that will be
+	 * registered to the kernel subsystem.
+	 */
+	struct bpf_struct_ops_value kvalue;
+};
+
+#define VALUE_PREFIX "bpf_struct_ops_"
+#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
+
+/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
+ * the map's value exposed to the userspace and its btf-type-id is
+ * stored at the map->btf_vmlinux_value_type_id.
+ *
+ */
 #define BPF_STRUCT_OPS_TYPE(_name)				\
-extern struct bpf_struct_ops bpf_##_name;
+extern struct bpf_struct_ops bpf_##_name;			\
+								\
+struct bpf_struct_ops_##_name {					\
+	BPF_STRUCT_OPS_COMMON_VALUE;				\
+	struct _name data ____cacheline_aligned_in_smp;		\
+};
 #include "bpf_struct_ops_types.h"
 #undef BPF_STRUCT_OPS_TYPE
···
 const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
 };

+static const struct btf_type *module_type;
+
 void bpf_struct_ops_init(struct btf *btf)
 {
+	s32 type_id, value_id, module_id;
 	const struct btf_member *member;
 	struct bpf_struct_ops *st_ops;
 	struct bpf_verifier_log log = {};
 	const struct btf_type *t;
+	char value_name[128];
 	const char *mname;
-	s32 type_id;
 	u32 i, j;
+
+	/* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
+#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name);
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+
+	module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
+	if (module_id < 0) {
+		pr_warn("Cannot find struct module in btf_vmlinux\n");
+		return;
+	}
+	module_type = btf_type_by_id(btf, module_id);

 	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
 		st_ops = bpf_struct_ops[i];
+
+		if (strlen(st_ops->name) + VALUE_PREFIX_LEN >=
+		    sizeof(value_name)) {
+			pr_warn("struct_ops name %s is too long\n",
+				st_ops->name);
+			continue;
+		}
+		sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name);
+
+		value_id = btf_find_by_name_kind(btf, value_name,
+						 BTF_KIND_STRUCT);
+		if (value_id < 0) {
+			pr_warn("Cannot find struct %s in btf_vmlinux\n",
+				value_name);
+			continue;
+		}

 		type_id = btf_find_by_name_kind(btf, st_ops->name,
 						BTF_KIND_STRUCT);
···
 			} else {
 				st_ops->type_id = type_id;
 				st_ops->type = t;
+				st_ops->value_id = value_id;
+				st_ops->value_type = btf_type_by_id(btf,
+								    value_id);
 			}
 		}
 	}
 }

 extern struct btf *btf_vmlinux;
+
+static const struct bpf_struct_ops *
+bpf_struct_ops_find_value(u32 value_id)
+{
+	unsigned int i;
+
+	if (!value_id || !btf_vmlinux)
+		return NULL;
+
+	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
+		if (bpf_struct_ops[i]->value_id == value_id)
+			return bpf_struct_ops[i];
+	}
+
+	return NULL;
+}

 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
···
 	}

 	return NULL;
+}
+
+static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key,
+					   void *next_key)
+{
+	if (key && *(u32 *)key == 0)
+		return -ENOENT;
+
+	*(u32 *)next_key = 0;
+	return 0;
+}
+
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+	struct bpf_struct_ops_value *uvalue, *kvalue;
+	enum bpf_struct_ops_state state;
+
+	if (unlikely(*(u32 *)key != 0))
+		return -ENOENT;
+
+	kvalue = &st_map->kvalue;
+	/* Pair with smp_store_release() during map_update */
+	state = smp_load_acquire(&kvalue->state);
+	if (state == BPF_STRUCT_OPS_STATE_INIT) {
+		memset(value, 0, map->value_size);
+		return 0;
+	}
+
+	/* No lock is needed.  state and refcnt do not need
+	 * to be updated together under atomic context.
+	 */
+	uvalue = (struct bpf_struct_ops_value *)value;
+	memcpy(uvalue, st_map->uvalue, map->value_size);
+	uvalue->state = state;
+	refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
+
+	return 0;
+}
+
+static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map)
+{
+	const struct btf_type *t = st_map->st_ops->type;
+	u32 i;
+
+	for (i = 0; i < btf_type_vlen(t); i++) {
+		if (st_map->progs[i]) {
+			bpf_prog_put(st_map->progs[i]);
+			st_map->progs[i] = NULL;
+		}
+	}
+}
+
+static int check_zero_holes(const struct btf_type *t, void *data)
+{
+	const struct btf_member *member;
+	u32 i, moff, msize, prev_mend = 0;
+	const struct btf_type *mtype;
+
+	for_each_member(i, t, member) {
+		moff = btf_member_bit_offset(t, member) / 8;
+		if (moff > prev_mend &&
+		    memchr_inv(data + prev_mend, 0, moff - prev_mend))
+			return -EINVAL;
+
+		mtype = btf_type_by_id(btf_vmlinux, member->type);
+		mtype = btf_resolve_size(btf_vmlinux, mtype, &msize,
+					 NULL, NULL);
+		if (IS_ERR(mtype))
+			return PTR_ERR(mtype);
+		prev_mend = moff + msize;
+	}
+
+	if (t->size > prev_mend &&
+	    memchr_inv(data + prev_mend, 0, t->size - prev_mend))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
+					  void *value, u64 flags)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+	const struct bpf_struct_ops *st_ops = st_map->st_ops;
+	struct bpf_struct_ops_value *uvalue, *kvalue;
+	const struct btf_member *member;
+	const struct btf_type *t = st_ops->type;
+	void *udata, *kdata;
+	int prog_fd, err = 0;
+	void *image;
+	u32 i;
+
+	if (flags)
+		return -EINVAL;
+
+	if (*(u32 *)key != 0)
+		return -E2BIG;
+
+	err = check_zero_holes(st_ops->value_type, value);
+	if (err)
+		return err;
+
+	uvalue = (struct bpf_struct_ops_value *)value;
+	err = check_zero_holes(t, uvalue->data);
+	if (err)
+		return err;
+
+	if (uvalue->state || refcount_read(&uvalue->refcnt))
+		return -EINVAL;
+
+	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
+	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
+
+	mutex_lock(&st_map->lock);
+
+	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
+		err = -EBUSY;
+		goto unlock;
+	}
+
+	memcpy(uvalue, value, map->value_size);
+
+	udata = &uvalue->data;
+	kdata = &kvalue->data;
+	image = st_map->image;
+
+	for_each_member(i, t, member) {
+		const struct btf_type *mtype, *ptype;
+		struct bpf_prog *prog;
+		u32 moff;
+
+		moff = btf_member_bit_offset(t, member) / 8;
+		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
+		if (ptype == module_type) {
+			if (*(void **)(udata + moff))
+				goto reset_unlock;
+			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
+			continue;
+		}
+
+		err = st_ops->init_member(t, member, kdata, udata);
+		if (err < 0)
+			goto reset_unlock;
+
+		/* The ->init_member() has handled this member */
+		if (err > 0)
+			continue;
+
+		/* If st_ops->init_member does not handle it,
+		 * we will only handle func ptrs and zero-ed members
+		 * here.  Reject everything else.
+		 */
+
+		/* All non func ptr member must be 0 */
+		if (!ptype || !btf_type_is_func_proto(ptype)) {
+			u32 msize;
+
+			mtype = btf_type_by_id(btf_vmlinux, member->type);
+			mtype = btf_resolve_size(btf_vmlinux, mtype, &msize,
+						 NULL, NULL);
+			if (IS_ERR(mtype)) {
+				err = PTR_ERR(mtype);
+				goto reset_unlock;
+			}
+
+			if (memchr_inv(udata + moff, 0, msize)) {
+				err = -EINVAL;
+				goto reset_unlock;
+			}
+
+			continue;
+		}
+
+		prog_fd = (int)(*(unsigned long *)(udata + moff));
+		/* Similar check as the attr->attach_prog_fd */
+		if (!prog_fd)
+			continue;
+
+		prog = bpf_prog_get(prog_fd);
+		if (IS_ERR(prog)) {
+			err = PTR_ERR(prog);
+			goto reset_unlock;
+		}
+		st_map->progs[i] = prog;
+
+		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
+		    prog->aux->attach_btf_id != st_ops->type_id ||
+		    prog->expected_attach_type != i) {
+			err = -EINVAL;
+			goto reset_unlock;
+		}
+
+		err = arch_prepare_bpf_trampoline(image,
+						  st_map->image + PAGE_SIZE,
+						  &st_ops->func_models[i], 0,
+						  &prog, 1, NULL, 0, NULL);
+		if (err < 0)
+			goto reset_unlock;
+
+		*(void **)(kdata + moff) = image;
+		image += err;
+
+		/* put prog_id to udata */
+		*(unsigned long *)(udata + moff) = prog->aux->id;
+	}
+
+	refcount_set(&kvalue->refcnt, 1);
+	bpf_map_inc(map);
+
+	set_memory_ro((long)st_map->image, 1);
+	set_memory_x((long)st_map->image, 1);
+	err = st_ops->reg(kdata);
+	if (likely(!err)) {
+		/* Pair with smp_load_acquire() during lookup_elem().
+		 * It ensures the above udata updates (e.g. prog->aux->id)
+		 * can be seen once BPF_STRUCT_OPS_STATE_INUSE is set.
+		 */
+		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
+		goto unlock;
+	}
+
+	/* Error during st_ops->reg().  It is very unlikely since
+	 * the above init_member() should have caught it earlier
+	 * before reg().  The only possibility is if there was a race
+	 * in registering the struct_ops (under the same name) to
+	 * a sub-system through different struct_ops's maps.
+	 */
+	set_memory_nx((long)st_map->image, 1);
+	set_memory_rw((long)st_map->image, 1);
+	bpf_map_put(map);
+
+reset_unlock:
+	bpf_struct_ops_map_put_progs(st_map);
+	memset(uvalue, 0, map->value_size);
+	memset(kvalue, 0, map->value_size);
+unlock:
+	mutex_unlock(&st_map->lock);
+	return err;
+}
+
+static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
+{
+	enum bpf_struct_ops_state prev_state;
+	struct bpf_struct_ops_map *st_map;
+
+	st_map = (struct bpf_struct_ops_map *)map;
+	prev_state = cmpxchg(&st_map->kvalue.state,
+			     BPF_STRUCT_OPS_STATE_INUSE,
+			     BPF_STRUCT_OPS_STATE_TOBEFREE);
+	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
+		st_map->st_ops->unreg(&st_map->kvalue.data);
+		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
+			bpf_map_put(map);
+	}
+
+	return 0;
+}
+
+static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
+					     struct seq_file *m)
+{
+	void *value;
+
+	value = bpf_struct_ops_map_lookup_elem(map, key);
+	if (!value)
+		return;
+
+	btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id,
+			  value, m);
+	seq_puts(m, "\n");
+}
+
+static void bpf_struct_ops_map_free(struct bpf_map *map)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+
+	if (st_map->progs)
+		bpf_struct_ops_map_put_progs(st_map);
+	bpf_map_area_free(st_map->progs);
+	bpf_jit_free_exec(st_map->image);
+	bpf_map_area_free(st_map->uvalue);
+	bpf_map_area_free(st_map);
+}
+
+static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
+{
+	if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
+	    attr->map_flags || !attr->btf_vmlinux_value_type_id)
+		return -EINVAL;
+	return 0;
+}
+
+static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
+{
+	const struct bpf_struct_ops *st_ops;
+	size_t map_total_size, st_map_size;
+	struct bpf_struct_ops_map *st_map;
+	const struct btf_type *t, *vt;
+	struct bpf_map_memory mem;
+	struct bpf_map *map;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
+	if (!st_ops)
+		return ERR_PTR(-ENOTSUPP);
+
+	vt = st_ops->value_type;
+	if (attr->value_size != vt->size)
+		return ERR_PTR(-EINVAL);
+
+	t = st_ops->type;
+
+	st_map_size = sizeof(*st_map) +
+		/* kvalue stores the
+		 * struct bpf_struct_ops_tcp_congestions_ops
+		 */
+		(vt->size - sizeof(struct bpf_struct_ops_value));
+	map_total_size = st_map_size +
+		/* uvalue */
+		sizeof(vt->size) +
+		/* struct bpf_progs **progs */
+		btf_type_vlen(t) * sizeof(struct bpf_prog *);
+	err = bpf_map_charge_init(&mem, map_total_size);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
+	if (!st_map) {
+		bpf_map_charge_finish(&mem);
+		return ERR_PTR(-ENOMEM);
+	}
+	st_map->st_ops = st_ops;
+	map = &st_map->map;
+
+	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
+	st_map->progs =
+		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
+				   NUMA_NO_NODE);
+	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
+	if (!st_map->uvalue || !st_map->progs || !st_map->image) {
+		bpf_struct_ops_map_free(map);
+		bpf_map_charge_finish(&mem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	mutex_init(&st_map->lock);
+	set_vm_flush_reset_perms(st_map->image);
+	bpf_map_init_from_attr(map, attr);
+	bpf_map_charge_move(&map->memory, &mem);
+
+	return map;
+}
+
+const struct bpf_map_ops bpf_struct_ops_map_ops = {
+	.map_alloc_check = bpf_struct_ops_map_alloc_check,
+	.map_alloc = bpf_struct_ops_map_alloc,
+	.map_free = bpf_struct_ops_map_free,
+	.map_get_next_key = bpf_struct_ops_map_get_next_key,
+	.map_lookup_elem = bpf_struct_ops_map_lookup_elem,
+	.map_delete_elem = bpf_struct_ops_map_delete_elem,
+	.map_update_elem = bpf_struct_ops_map_update_elem,
+	.map_seq_show_elem = bpf_struct_ops_map_seq_show_elem,
+};
+
+/* "const void *" because some subsystem is
+ * passing a const (e.g. const struct tcp_congestion_ops *)
+ */
+bool bpf_struct_ops_get(const void *kdata)
+{
+	struct bpf_struct_ops_value *kvalue;
+
+	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+
+	return refcount_inc_not_zero(&kvalue->refcnt);
+}
+
+void bpf_struct_ops_put(const void *kdata)
+{
+	struct bpf_struct_ops_value *kvalue;
+
+	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+	if (refcount_dec_and_test(&kvalue->refcnt)) {
+		struct bpf_struct_ops_map *st_map;
+
+		st_map = container_of(kvalue, struct bpf_struct_ops_map,
+				      kvalue);
+		bpf_map_put(&st_map->map);
+	}
 }
+9 -11
kernel/bpf/btf.c
···
 	return "UNKN";
 }

-static u32 btf_member_bit_offset(const struct btf_type *struct_type,
-				 const struct btf_member *member)
-{
-	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
-					   : member->offset;
-}
-
 static u32 btf_type_int(const struct btf_type *t)
 {
 	return *(u32 *)(t + 1);
···
  * *elem_type: same as return type ("struct X")
  * *total_nelems: 1
  */
-static const struct btf_type *
+const struct btf_type *
 btf_resolve_size(const struct btf *btf, const struct btf_type *type,
 		 u32 *type_size, const struct btf_type **elem_type,
 		 u32 *total_nelems)
···
 		return ERR_PTR(-EINVAL);

 	*type_size = nelems * size;
-	*total_nelems = nelems;
-	*elem_type = type;
+	if (total_nelems)
+		*total_nelems = nelems;
+	if (elem_type)
+		*elem_type = type;

 	return array_type ? : type;
 }
···
 		       u32 type_id, void *data,
 		       u8 bits_offset, struct seq_file *m)
 {
-	t = btf_type_id_resolve(btf, &type_id);
+	if (btf->resolved_ids)
+		t = btf_type_id_resolve(btf, &type_id);
+	else
+		t = btf_type_skip_modifiers(btf, type_id, NULL);

 	btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m);
 }
+2 -1
kernel/bpf/map_in_map.c
···
  */
 if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
     inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE ||
-    inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+    inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ||
+    inner_map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
 	fdput(f);
 	return ERR_PTR(-ENOTSUPP);
 }
+35 -17
kernel/bpf/syscall.c
··· 628 628 return ret; 629 629 } 630 630 631 - #define BPF_MAP_CREATE_LAST_FIELD btf_value_type_id 631 + #define BPF_MAP_CREATE_LAST_FIELD btf_vmlinux_value_type_id 632 632 /* called via syscall */ 633 633 static int map_create(union bpf_attr *attr) 634 634 { ··· 641 641 err = CHECK_ATTR(BPF_MAP_CREATE); 642 642 if (err) 643 643 return -EINVAL; 644 + 645 + if (attr->btf_vmlinux_value_type_id) { 646 + if (attr->map_type != BPF_MAP_TYPE_STRUCT_OPS || 647 + attr->btf_key_type_id || attr->btf_value_type_id) 648 + return -EINVAL; 649 + } else if (attr->btf_key_type_id && !attr->btf_value_type_id) { 650 + return -EINVAL; 651 + } 644 652 645 653 f_flags = bpf_get_file_flag(attr->map_flags); 646 654 if (f_flags < 0) ··· 672 664 atomic64_set(&map->usercnt, 1); 673 665 mutex_init(&map->freeze_mutex); 674 666 675 - if (attr->btf_key_type_id || attr->btf_value_type_id) { 667 + map->spin_lock_off = -EINVAL; 668 + if (attr->btf_key_type_id || attr->btf_value_type_id || 669 + /* Even the map's value is a kernel's struct, 670 + * the bpf_prog.o must have BTF to begin with 671 + * to figure out the corresponding kernel's 672 + * counter part. Thus, attr->btf_fd has 673 + * to be valid also. 
674 + */ 675 + attr->btf_vmlinux_value_type_id) { 676 676 struct btf *btf; 677 - 678 - if (!attr->btf_value_type_id) { 679 - err = -EINVAL; 680 - goto free_map; 681 - } 682 677 683 678 btf = btf_get_by_fd(attr->btf_fd); 684 679 if (IS_ERR(btf)) { 685 680 err = PTR_ERR(btf); 686 681 goto free_map; 687 682 } 683 + map->btf = btf; 688 684 689 - err = map_check_btf(map, btf, attr->btf_key_type_id, 690 - attr->btf_value_type_id); 691 - if (err) { 692 - btf_put(btf); 693 - goto free_map; 685 + if (attr->btf_value_type_id) { 686 + err = map_check_btf(map, btf, attr->btf_key_type_id, 687 + attr->btf_value_type_id); 688 + if (err) 689 + goto free_map; 694 690 } 695 691 696 - map->btf = btf; 697 692 map->btf_key_type_id = attr->btf_key_type_id; 698 693 map->btf_value_type_id = attr->btf_value_type_id; 699 - } else { 700 - map->spin_lock_off = -EINVAL; 694 + map->btf_vmlinux_value_type_id = 695 + attr->btf_vmlinux_value_type_id; 701 696 } 702 697 703 698 err = security_bpf_map_alloc(map); ··· 899 888 } else if (map->map_type == BPF_MAP_TYPE_QUEUE || 900 889 map->map_type == BPF_MAP_TYPE_STACK) { 901 890 err = map->ops->map_peek_elem(map, value); 891 + } else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { 892 + /* struct_ops map requires directly updating "value" */ 893 + err = bpf_struct_ops_map_sys_lookup_elem(map, key, value); 902 894 } else { 903 895 rcu_read_lock(); 904 896 if (map->ops->map_lookup_elem_sys_only) ··· 1017 1003 goto out; 1018 1004 } else if (map->map_type == BPF_MAP_TYPE_CPUMAP || 1019 1005 map->map_type == BPF_MAP_TYPE_SOCKHASH || 1020 - map->map_type == BPF_MAP_TYPE_SOCKMAP) { 1006 + map->map_type == BPF_MAP_TYPE_SOCKMAP || 1007 + map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { 1021 1008 err = map->ops->map_update_elem(map, key, value, attr->flags); 1022 1009 goto out; 1023 1010 } else if (IS_FD_PROG_ARRAY(map)) { ··· 1107 1092 if (bpf_map_is_dev_bound(map)) { 1108 1093 err = bpf_map_offload_delete_elem(map, key); 1109 1094 goto out; 1110 - } else if (IS_FD_PROG_ARRAY(map)) { 1095 + } else if (IS_FD_PROG_ARRAY(map) || 1096 + map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { 1097 + /* These maps require sleepable context */ 1111 1098 err = map->ops->map_delete_elem(map, key); 1112 1099 goto out; 1113 1100 } ··· 2839 2822 info.btf_key_type_id = map->btf_key_type_id; 2840 2823 info.btf_value_type_id = map->btf_value_type_id; 2841 2824 } 2825 + info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id; 2842 2826 2843 2827 if (bpf_map_is_dev_bound(map)) { 2844 2828 err = bpf_map_offload_info_fill(&info, map);
+5 -3
kernel/bpf/trampoline.c
··· 160 160 if (fexit_cnt) 161 161 flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME; 162 162 163 - err = arch_prepare_bpf_trampoline(new_image, &tr->func.model, flags, 163 + err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2, 164 + &tr->func.model, flags, 164 165 fentry, fentry_cnt, 165 166 fexit, fexit_cnt, 166 167 tr->func.addr); 167 - if (err) 168 + if (err < 0) 168 169 goto out; 169 170 170 171 if (tr->selector) ··· 297 296 } 298 297 299 298 int __weak 300 - arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags, 299 + arch_prepare_bpf_trampoline(void *image, void *image_end, 300 + const struct btf_func_model *m, u32 flags, 301 301 struct bpf_prog **fentry_progs, int fentry_cnt, 302 302 struct bpf_prog **fexit_progs, int fexit_cnt, 303 303 void *orig_call)
+5
kernel/bpf/verifier.c
··· 8155 8155 return -EINVAL; 8156 8156 } 8157 8157 8158 + if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { 8159 + verbose(env, "bpf_struct_ops map cannot be used in prog\n"); 8160 + return -EINVAL; 8161 + } 8162 + 8158 8163 return 0; 8159 8164 } 8160 8165