Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf: Add netns cookie and enable it for bpf cgroup hooks

In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy-free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium-managed nodes. While this works in its current shape and avoids
packet-level NAT for traffic between Cilium-managed nodes, there is one major
limitation we're facing today: the lack of netns awareness.

In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also need to differentiate between
the initial network namespace and non-initial ones. For example, ExternalIP
services mandate that non-local service IPs are not to be translated from the
host (initial) network namespace. Right now, we have an ugly work-around in
place where non-local service IPs for ExternalIP services are not translated
from the connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.

On top of determining whether we're in the initial or a non-initial network
namespace, we also need a socket-cookie-like mechanism scoped to network
namespaces. Socket cookies have the nice property that they can be combined as
part of the key structure, e.g. for BPF LRU maps, without having to worry that
the cookie could be recycled. We are planning to use this for our
sessionAffinity implementation for services. Therefore, add a new
bpf_get_netns_cookie() helper which resolves both use cases at once:
bpf_get_netns_cookie(NULL) provides the cookie for the initial network
namespace, while passing the context instead of NULL provides the cookie from
the application's network namespace. The cookie is placed in an existing hole
in struct net, so there is no size increase; the assignment happens only once.
This allows for a comparison against the initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as the need arises.

(*) Both externalTrafficPolicy={Local|Cluster} types
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net

Authored by Daniel Borkmann and committed by Alexei Starovoitov
f318903c fcf752ea

+103 -8
+1
include/linux/bpf.h
···
 	ARG_CONST_SIZE_OR_ZERO,	/* number of bytes accessed from memory or 0 */

 	ARG_PTR_TO_CTX,		/* pointer to context */
+	ARG_PTR_TO_CTX_OR_NULL,	/* pointer to context or NULL */
 	ARG_ANYTHING,		/* any (initialized) argument is ok */
 	ARG_PTR_TO_SPIN_LOCK,	/* pointer to bpf_spin_lock */
 	ARG_PTR_TO_SOCK_COMMON,	/* pointer to sock_common */
+10
include/net/net_namespace.h
···
 #ifdef CONFIG_XFRM
 	struct netns_xfrm xfrm;
 #endif
+
+	atomic64_t net_cookie; /* written once */
+
 #if IS_ENABLED(CONFIG_IP_VS)
 	struct netns_ipvs *ipvs;
 #endif
···

 void net_drop_ns(void *);

+u64 net_gen_cookie(struct net *net);
+
 #else

 static inline struct net *get_net(struct net *net)
···
 static inline int check_net(const struct net *net)
 {
 	return 1;
+}
+
+static inline u64 net_gen_cookie(struct net *net)
+{
+	return 0;
 }

 #define net_drop_ns NULL
+15 -1
include/uapi/linux/bpf.h
···
  *		restricted to raw_tracepoint bpf programs.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_get_netns_cookie(void *ctx)
+ *	Description
+ *		Retrieve the cookie (generated by the kernel) of the network
+ *		namespace the input *ctx* is associated with. The network
+ *		namespace cookie remains stable for its lifetime and provides
+ *		a global identifier that can be assumed unique. If *ctx* is
+ *		NULL, then the helper returns the cookie for the initial
+ *		network namespace. The cookie itself is very similar to that
+ *		of bpf_get_socket_cookie() helper, but for network namespaces
+ *		instead of sockets.
+ *	Return
+ *		A 8-byte long opaque number.
  */
 #define __BPF_FUNC_MAPPER(FN) \
 	FN(unspec), \
···
 	FN(jiffies64), \
 	FN(read_branch_records), \
 	FN(get_ns_current_pid_tgid), \
-	FN(xdp_output),
+	FN(xdp_output), \
+	FN(get_netns_cookie),

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
+10 -6
kernel/bpf/verifier.c
···
 		expected_type = CONST_PTR_TO_MAP;
 		if (type != expected_type)
 			goto err_type;
-	} else if (arg_type == ARG_PTR_TO_CTX) {
+	} else if (arg_type == ARG_PTR_TO_CTX ||
+		   arg_type == ARG_PTR_TO_CTX_OR_NULL) {
 		expected_type = PTR_TO_CTX;
-		if (type != expected_type)
-			goto err_type;
-		err = check_ctx_reg(env, reg, regno);
-		if (err < 0)
-			return err;
+		if (!(register_is_null(reg) &&
+		      arg_type == ARG_PTR_TO_CTX_OR_NULL)) {
+			if (type != expected_type)
+				goto err_type;
+			err = check_ctx_reg(env, reg, regno);
+			if (err < 0)
+				return err;
+		}
 	} else if (arg_type == ARG_PTR_TO_SOCK_COMMON) {
 		expected_type = PTR_TO_SOCK_COMMON;
 		/* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */
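The verifier change above boils down to a small decision table: a NULL register is accepted only when the helper declares ARG_PTR_TO_CTX_OR_NULL, and anything else must still be a context pointer. The following is a minimal userspace sketch of that table, not verifier code. The enum names and check_ctx_arg_model() are hypothetical, the -EACCES return value is an assumption about what the err_type path yields, and the additional check_ctx_reg() validation is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two argument types touched by the patch. */
enum arg_type_model { ARG_CTX, ARG_CTX_OR_NULL };

/* Hypothetical, heavily simplified view of what a register may hold. */
enum reg_type_model { REG_NULL, REG_PTR_TO_CTX, REG_OTHER };

/*
 * Returns 0 if the register is acceptable for the argument type and
 * -EACCES otherwise (assumed to match the verifier's err_type path).
 */
int check_ctx_arg_model(enum arg_type_model arg, enum reg_type_model reg)
{
	bool null_ok;

	assert(arg == ARG_CTX || arg == ARG_CTX_OR_NULL);

	/* NULL is tolerated only for the new *_OR_NULL argument type. */
	null_ok = (reg == REG_NULL && arg == ARG_CTX_OR_NULL);
	if (null_ok)
		return 0;	/* type and ctx checks are skipped entirely */

	if (reg != REG_PTR_TO_CTX)
		return -EACCES;	/* wrong register type */

	/* The real code would additionally run check_ctx_reg() here. */
	return 0;
}
```

In this model a NULL register passed to a plain ARG_CTX helper is still rejected, which matches the patch leaving existing helpers unchanged.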
+37
net/core/filter.c
···
 	.arg1_type = ARG_PTR_TO_CTX,
 };

+static u64 __bpf_get_netns_cookie(struct sock *sk)
+{
+#ifdef CONFIG_NET_NS
+	return net_gen_cookie(sk ? sk->sk_net.net : &init_net);
+#else
+	return 0;
+#endif
+}
+
+BPF_CALL_1(bpf_get_netns_cookie_sock, struct sock *, ctx)
+{
+	return __bpf_get_netns_cookie(ctx);
+}
+
+static const struct bpf_func_proto bpf_get_netns_cookie_sock_proto = {
+	.func = bpf_get_netns_cookie_sock,
+	.gpl_only = false,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_PTR_TO_CTX_OR_NULL,
+};
+
+BPF_CALL_1(bpf_get_netns_cookie_sock_addr, struct bpf_sock_addr_kern *, ctx)
+{
+	return __bpf_get_netns_cookie(ctx ? ctx->sk : NULL);
+}
+
+static const struct bpf_func_proto bpf_get_netns_cookie_sock_addr_proto = {
+	.func = bpf_get_netns_cookie_sock_addr,
+	.gpl_only = false,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_PTR_TO_CTX_OR_NULL,
+};
+
 BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb)
 {
 	struct sock *sk = sk_to_full_sk(skb->sk);
···
 		return &bpf_get_local_storage_proto;
 	case BPF_FUNC_get_socket_cookie:
 		return &bpf_get_socket_cookie_sock_proto;
+	case BPF_FUNC_get_netns_cookie:
+		return &bpf_get_netns_cookie_sock_proto;
 	case BPF_FUNC_perf_event_output:
 		return &bpf_event_output_data_proto;
 	default:
···
 		}
 	case BPF_FUNC_get_socket_cookie:
 		return &bpf_get_socket_cookie_sock_addr_proto;
+	case BPF_FUNC_get_netns_cookie:
+		return &bpf_get_netns_cookie_sock_addr_proto;
 	case BPF_FUNC_get_local_storage:
 		return &bpf_get_local_storage_proto;
 	case BPF_FUNC_perf_event_output:
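To make the NULL-versus-context semantics of the helper concrete, here is a minimal userspace sketch of the selection logic in __bpf_get_netns_cookie() and of the "are we in the initial namespace?" comparison the commit message describes. It is a model under stated assumptions, not kernel or BPF code: struct net_model, struct sock_model, init_net_model, and both functions are hypothetical stand-ins, and the initial namespace's cookie value of 1 is assumed purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct net; only the cookie is modeled. */
struct net_model {
	uint64_t net_cookie;
};

/* Hypothetical stand-in for struct sock; only the netns link is modeled. */
struct sock_model {
	struct net_model *net;	/* namespace the socket lives in */
};

/* Stand-in for init_net; the cookie value 1 is an assumption. */
struct net_model init_net_model = { .net_cookie = 1 };

/* Mirrors the selection in the patch: NULL picks the initial namespace. */
uint64_t get_netns_cookie_model(const struct sock_model *sk)
{
	assert(sk == NULL || sk->net != NULL);
	return sk ? sk->net->net_cookie : init_net_model.net_cookie;
}

/* Comparison a program could use to tell host sockets from Pod sockets. */
bool in_init_netns(const struct sock_model *sk)
{
	return get_netns_cookie_model(sk) == get_netns_cookie_model(NULL);
}
```

A socket whose namespace is init_net_model compares equal to the NULL lookup, while a socket in any other namespace does not. That is the property the ExternalIP use case relies on.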
+15
net/core/net_namespace.c
···

 static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;

+static atomic64_t cookie_gen;
+
+u64 net_gen_cookie(struct net *net)
+{
+	while (1) {
+		u64 res = atomic64_read(&net->net_cookie);
+
+		if (res)
+			return res;
+		res = atomic64_inc_return(&cookie_gen);
+		atomic64_cmpxchg(&net->net_cookie, 0, res);
+	}
+}
+
 static struct net_generic *net_alloc_generic(void)
 {
 	struct net_generic *ng;
···
 		panic("Could not allocate generic netns");

 	rcu_assign_pointer(init_net.gen, ng);
+	net_gen_cookie(&init_net);

 	down_write(&pernet_ops_rwsem);
 	if (setup_net(&init_net, &init_user_ns))
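The lazy, write-once assignment in net_gen_cookie() can be tried out in userspace with C11 atomics. The sketch below is an approximation rather than the kernel implementation: struct net_model and net_gen_cookie_model() are hypothetical names, and the kernel's atomic64_* operations are swapped for their <stdatomic.h> counterparts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct net; 0 means "no cookie assigned yet". */
struct net_model {
	_Atomic uint64_t net_cookie;
};

/* Global generator; the first cookie handed out is 1, so 0 is never valid. */
_Atomic uint64_t cookie_gen_model;

uint64_t net_gen_cookie_model(struct net_model *net)
{
	assert(net != NULL);

	for (;;) {
		uint64_t res = atomic_load(&net->net_cookie);
		uint64_t expected = 0;

		if (res)
			return res;	/* already assigned, possibly by a racing caller */

		/* fetch_add returns the old value; +1 mirrors atomic64_inc_return(). */
		res = atomic_fetch_add(&cookie_gen_model, 1) + 1;

		/*
		 * Only the first writer wins. A loser's value is discarded and
		 * the next loop iteration reads back the winner's cookie.
		 */
		atomic_compare_exchange_strong(&net->net_cookie, &expected, res);
	}
}
```

Calling the function repeatedly on the same struct net_model returns the same non-zero value, while distinct instances get distinct values. A caller that loses the compare-and-exchange race burns one counter value, which is harmless because cookies only need to be unique, not dense.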
+15 -1
tools/include/uapi/linux/bpf.h
···
  *		restricted to raw_tracepoint bpf programs.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_get_netns_cookie(void *ctx)
+ *	Description
+ *		Retrieve the cookie (generated by the kernel) of the network
+ *		namespace the input *ctx* is associated with. The network
+ *		namespace cookie remains stable for its lifetime and provides
+ *		a global identifier that can be assumed unique. If *ctx* is
+ *		NULL, then the helper returns the cookie for the initial
+ *		network namespace. The cookie itself is very similar to that
+ *		of bpf_get_socket_cookie() helper, but for network namespaces
+ *		instead of sockets.
+ *	Return
+ *		A 8-byte long opaque number.
  */
 #define __BPF_FUNC_MAPPER(FN) \
 	FN(unspec), \
···
 	FN(jiffies64), \
 	FN(read_branch_records), \
 	FN(get_ns_current_pid_tgid), \
-	FN(xdp_output),
+	FN(xdp_output), \
+	FN(get_netns_cookie),

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call