Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: Keep TCP_CLOSE sockets in the reuseport group.

When we close a listening socket and want to migrate its connections to
another listener in the same reuseport group, we have to handle two kinds
of child sockets: those the listening socket holds a reference to, and
those it does not.

The former are the TCP_ESTABLISHED/TCP_SYN_RECV sockets sitting in the
listening socket's accept queue, so we can pop them out and push them into
another listener's queue at close() or shutdown() time. The latter, the
TCP_NEW_SYN_RECV sockets, are still in the three-way handshake and not in
the accept queue, so we cannot reach them from close() or shutdown().
Accordingly, we have to migrate such immature sockets after their
listening socket has been closed.

Currently, once their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed when the final ACK arrives or a SYN+ACK is
retransmitted. If we could select a new listener from the same reuseport
group at that point, no connection would be aborted. However, we cannot,
because reuseport_detach_sock() sets sk_reuseport_cb to NULL and thereby
forbids closed sockets from accessing the reuseport group.

This patch allows TCP_CLOSE sockets to remain in the reuseport group and
to access it while any child socket still references them. The key
observation is that reuseport_detach_sock() is called twice, from
inet_unhash() and from sk_destruct(). This patch replaces the first call
with reuseport_stop_listen_sock(), which checks whether the reuseport
group is capable of migration. If it is, the function decrements
num_socks, moves the socket backwards in socks[], and increments
num_closed_socks. Once all connections have been migrated, sk_destruct()
calls reuseport_detach_sock() to remove the socket from socks[], decrement
num_closed_socks, and set sk_reuseport_cb to NULL.

With this change, closed or shutdown()ed sockets can keep their
sk_reuseport_cb. Consequently, calling listen() after shutdown() could
trigger EADDRINUSE or EBUSY in inet_csk_bind_conflict() or
reuseport_add_sock(), which expect such sockets not to belong to a
reuseport group. Therefore, this patch also loosens those validation rules
so that a socket can listen again if its reuseport group has
num_closed_socks greater than zero.

When such sockets listen again, we handle them in reuseport_resurrect().
If there is an existing reuseport group (the reuseport_add_sock() path),
we move the socket from the old group to the new one and free the old
group if necessary. If there is no existing group (the reuseport_alloc()
path), we allocate a new reuseport group, detach sk from the old one, and
free it if necessary, so as not to break the current shutdown behaviour:

- we cannot carry over the eBPF prog of shutdown()ed sockets
- we cannot attach/detach an eBPF prog to/from listening sockets via
  shutdown()ed sockets

Note that when the number of sockets would exceed U16_MAX, we try to
detach a closed socket in reuseport_grow() to make room for the new
listening socket.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp

Authored by Kuniyuki Iwashima, committed by Daniel Borkmann
333bb73f 5c040eaf

4 files changed: +186 -11
include/net/sock_reuseport.h (+1 -0)

···
 extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
			      bool bind_inany);
 extern void reuseport_detach_sock(struct sock *sk);
+void reuseport_stop_listen_sock(struct sock *sk);
 extern struct sock *reuseport_select_sock(struct sock *sk,
					  u32 hash,
					  struct sk_buff *skb,
net/core/sock_reuseport.c (+174 -8)

···
 DEFINE_SPINLOCK(reuseport_lock);

 static DEFINE_IDA(reuseport_ida);
+static int reuseport_resurrect(struct sock *sk, struct sock_reuseport *old_reuse,
+			       struct sock_reuseport *reuse, bool bind_inany);

 static int reuseport_sock_index(struct sock *sk,
				const struct sock_reuseport *reuse,
···
	return true;
 }

+static void __reuseport_add_closed_sock(struct sock *sk,
+					struct sock_reuseport *reuse)
+{
+	reuse->socks[reuse->max_socks - reuse->num_closed_socks - 1] = sk;
+	/* paired with READ_ONCE() in inet_csk_bind_conflict() */
+	WRITE_ONCE(reuse->num_closed_socks, reuse->num_closed_socks + 1);
+}
+
+static bool __reuseport_detach_closed_sock(struct sock *sk,
+					   struct sock_reuseport *reuse)
+{
+	int i = reuseport_sock_index(sk, reuse, true);
+
+	if (i == -1)
+		return false;
+
+	reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+	/* paired with READ_ONCE() in inet_csk_bind_conflict() */
+	WRITE_ONCE(reuse->num_closed_socks, reuse->num_closed_socks - 1);
+
+	return true;
+}
+
 static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
 {
	unsigned int size = sizeof(struct sock_reuseport) +
···
	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
					  lockdep_is_held(&reuseport_lock));
	if (reuse) {
+		if (reuse->num_closed_socks) {
+			/* sk was shutdown()ed before */
+			ret = reuseport_resurrect(sk, reuse, NULL, bind_inany);
+			goto out;
+		}
+
		/* Only set reuse->bind_inany if the bind_inany is true.
		 * Otherwise, it will overwrite the reuse->bind_inany
		 * which was set by the bind/hash path.
···
	u32 more_socks_size, i;

	more_socks_size = reuse->max_socks * 2U;
-	if (more_socks_size > U16_MAX)
+	if (more_socks_size > U16_MAX) {
+		if (reuse->num_closed_socks) {
+			/* Make room by removing a closed sk.
+			 * The child has already been migrated.
+			 * Only reqsk left at this point.
+			 */
+			struct sock *sk;
+
+			sk = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+			RCU_INIT_POINTER(sk->sk_reuseport_cb, NULL);
+			__reuseport_detach_closed_sock(sk, reuse);
+
+			return reuse;
+		}
+
		return NULL;
+	}

	more_reuse = __reuseport_alloc(more_socks_size);
	if (!more_reuse)
···
	reuse = rcu_dereference_protected(sk2->sk_reuseport_cb,
					  lockdep_is_held(&reuseport_lock));
	old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
-					     lockdep_is_held(&reuseport_lock));
+					      lockdep_is_held(&reuseport_lock));
+	if (old_reuse && old_reuse->num_closed_socks) {
+		/* sk was shutdown()ed before */
+		int err = reuseport_resurrect(sk, old_reuse, reuse, reuse->bind_inany);
+
+		spin_unlock_bh(&reuseport_lock);
+		return err;
+	}
+
	if (old_reuse && old_reuse->num_socks != 1) {
		spin_unlock_bh(&reuseport_lock);
		return -EBUSY;
···
 }
 EXPORT_SYMBOL(reuseport_add_sock);

+static int reuseport_resurrect(struct sock *sk, struct sock_reuseport *old_reuse,
+			       struct sock_reuseport *reuse, bool bind_inany)
+{
+	if (old_reuse == reuse) {
+		/* If sk was in the same reuseport group, just pop sk out of
+		 * the closed section and push sk into the listening section.
+		 */
+		__reuseport_detach_closed_sock(sk, old_reuse);
+		__reuseport_add_sock(sk, old_reuse);
+		return 0;
+	}
+
+	if (!reuse) {
+		/* In bind()/listen() path, we cannot carry over the eBPF prog
+		 * for the shutdown()ed socket. In setsockopt() path, we should
+		 * not change the eBPF prog of listening sockets by attaching a
+		 * prog to the shutdown()ed socket. Thus, we will allocate a new
+		 * reuseport group and detach sk from the old group.
+		 */
+		int id;
+
+		reuse = __reuseport_alloc(INIT_SOCKS);
+		if (!reuse)
+			return -ENOMEM;
+
+		id = ida_alloc(&reuseport_ida, GFP_ATOMIC);
+		if (id < 0) {
+			kfree(reuse);
+			return id;
+		}
+
+		reuse->reuseport_id = id;
+		reuse->bind_inany = bind_inany;
+	} else {
+		/* Move sk from the old group to the new one if
+		 * - all the other listeners in the old group were close()d or
+		 *   shutdown()ed, and then sk2 has listen()ed on the same port
+		 * OR
+		 * - sk listen()ed without bind() (or with autobind), was
+		 *   shutdown()ed, and then listen()s on another port which
+		 *   sk2 listen()s on.
+		 */
+		if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) {
+			reuse = reuseport_grow(reuse);
+			if (!reuse)
+				return -ENOMEM;
+		}
+	}
+
+	__reuseport_detach_closed_sock(sk, old_reuse);
+	__reuseport_add_sock(sk, reuse);
+	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
+
+	if (old_reuse->num_socks + old_reuse->num_closed_socks == 0)
+		call_rcu(&old_reuse->rcu, reuseport_free_rcu);
+
+	return 0;
+}
+
 void reuseport_detach_sock(struct sock *sk)
 {
	struct sock_reuseport *reuse;
···
	spin_lock_bh(&reuseport_lock);
	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
					  lockdep_is_held(&reuseport_lock));
+
+	/* reuseport_grow() has detached a closed sk */
+	if (!reuse)
+		goto out;

	/* Notify the bpf side. The sk may be added to a sockarray
	 * map. If so, sockarray logic will remove it from the map.
···
	bpf_sk_reuseport_detach(sk);

	rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
-	__reuseport_detach_sock(sk, reuse);
+
+	if (!__reuseport_detach_closed_sock(sk, reuse))
+		__reuseport_detach_sock(sk, reuse);

	if (reuse->num_socks + reuse->num_closed_socks == 0)
		call_rcu(&reuse->rcu, reuseport_free_rcu);

+out:
	spin_unlock_bh(&reuseport_lock);
 }
 EXPORT_SYMBOL(reuseport_detach_sock);
+
+void reuseport_stop_listen_sock(struct sock *sk)
+{
+	if (sk->sk_protocol == IPPROTO_TCP) {
+		struct sock_reuseport *reuse;
+
+		spin_lock_bh(&reuseport_lock);
+
+		reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
+						  lockdep_is_held(&reuseport_lock));
+
+		if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req) {
+			/* Migration capable, move sk from the listening section
+			 * to the closed section.
+			 */
+			bpf_sk_reuseport_detach(sk);
+
+			__reuseport_detach_sock(sk, reuse);
+			__reuseport_add_closed_sock(sk, reuse);
+
+			spin_unlock_bh(&reuseport_lock);
+			return;
+		}
+
+		spin_unlock_bh(&reuseport_lock);
+	}
+
+	/* Not capable to do migration, detach immediately */
+	reuseport_detach_sock(sk);
+}
+EXPORT_SYMBOL(reuseport_stop_listen_sock);

 static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks,
				   struct bpf_prog *prog, struct sk_buff *skb,
···
	struct sock_reuseport *reuse;
	struct bpf_prog *old_prog;

-	if (sk_unhashed(sk) && sk->sk_reuseport) {
-		int err = reuseport_alloc(sk, false);
+	if (sk_unhashed(sk)) {
+		int err;

+		if (!sk->sk_reuseport)
+			return -EINVAL;
+
+		err = reuseport_alloc(sk, false);
		if (err)
			return err;
	} else if (!rcu_access_pointer(sk->sk_reuseport_cb)) {
···
	struct sock_reuseport *reuse;
	struct bpf_prog *old_prog;

-	if (!rcu_access_pointer(sk->sk_reuseport_cb))
-		return sk->sk_reuseport ? -ENOENT : -EINVAL;
-
	old_prog = NULL;
	spin_lock_bh(&reuseport_lock);
	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
					  lockdep_is_held(&reuseport_lock));
+
+	/* reuse must be checked after acquiring the reuseport_lock
+	 * because reuseport_grow() can detach a closed sk.
+	 */
+	if (!reuse) {
+		spin_unlock_bh(&reuseport_lock);
+		return sk->sk_reuseport ? -ENOENT : -EINVAL;
+	}
+
+	if (sk_unhashed(sk) && reuse->num_closed_socks) {
+		spin_unlock_bh(&reuseport_lock);
+		return -ENOENT;
+	}
+
	old_prog = rcu_replace_pointer(reuse->prog, old_prog,
				       lockdep_is_held(&reuseport_lock));
	spin_unlock_bh(&reuseport_lock);
net/ipv4/inet_connection_sock.c (+10 -2)

···
			  bool relax, bool reuseport_ok)
 {
	struct sock *sk2;
+	bool reuseport_cb_ok;
	bool reuse = sk->sk_reuse;
	bool reuseport = !!sk->sk_reuseport;
+	struct sock_reuseport *reuseport_cb;
	kuid_t uid = sock_i_uid((struct sock *)sk);
+
+	rcu_read_lock();
+	reuseport_cb = rcu_dereference(sk->sk_reuseport_cb);
+	/* paired with WRITE_ONCE() in __reuseport_(add|detach)_closed_sock */
+	reuseport_cb_ok = !reuseport_cb || READ_ONCE(reuseport_cb->num_closed_socks);
+	rcu_read_unlock();

	/*
	 * Unlike other sk lookup places we do not check
···
			if ((!relax ||
			     (!reuseport_ok &&
			      reuseport && sk2->sk_reuseport &&
-			      !rcu_access_pointer(sk->sk_reuseport_cb) &&
+			      reuseport_cb_ok &&
			      (sk2->sk_state == TCP_TIME_WAIT ||
			       uid_eq(uid, sock_i_uid(sk2))))) &&
			    inet_rcv_saddr_equal(sk, sk2, true))
				break;
		} else if (!reuseport_ok ||
			   !reuseport || !sk2->sk_reuseport ||
-			   rcu_access_pointer(sk->sk_reuseport_cb) ||
+			   !reuseport_cb_ok ||
			   (sk2->sk_state != TCP_TIME_WAIT &&
			    !uid_eq(uid, sock_i_uid(sk2)))) {
			if (inet_rcv_saddr_equal(sk, sk2, true))
net/ipv4/inet_hashtables.c (+1 -1)

···
		goto unlock;

	if (rcu_access_pointer(sk->sk_reuseport_cb))
-		reuseport_detach_sock(sk);
+		reuseport_stop_listen_sock(sk);
	if (ilb) {
		inet_unhash2(hashinfo, sk);
		ilb->count--;