Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.

Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction that forbids binding
SO_REUSEADDR enabled sockets to the same (addr, port) tuple, so that ports
are assigned dispersedly and we can connect to the same remote host.

That change accelerates port depletion, so we fail to bind sockets to the
same local port even when we want to connect to different remote hosts.

You can reproduce this issue by following the instructions below.

1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
2. Set SO_REUSEADDR on two sockets.
3. Bind both sockets to (localhost, 0); the second bind() fails.
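
The reproduction steps can be sketched in C roughly as follows. This is an illustrative sketch, not part of the patch: the helper names `bind_reuseaddr_any_port()` and `local_port()` are hypothetical, and the sysctl from step 1 is assumed to have been set beforehand. With the port range narrowed to a single port and without this patch, the second call is expected to return -EADDRINUSE.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket, enable SO_REUSEADDR (step 2)
 * and bind it to (127.0.0.1, 0) (step 3).
 * Returns the socket fd on success, or -errno on failure. */
static int bind_reuseaddr_any_port(void)
{
	struct sockaddr_in addr;
	int one = 1, err;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -errno;
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
		goto fail;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(0);	/* port 0: let the kernel pick one */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto fail;
	return fd;
fail:
	err = errno;
	close(fd);
	return -err;
}

/* Hypothetical helper: return the local port the kernel assigned to fd,
 * or -errno on failure. */
static int local_port(int fd)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);

	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		return -errno;
	return ntohs(addr.sin_port);
}
```

Calling `bind_reuseaddr_any_port()` twice from a small `main()` and printing both return values is enough to observe the behaviour described above.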

Therefore, when ephemeral ports are exhausted, bind(0) should fall back to
the legacy behaviour, honouring the SO_REUSEADDR option so that it is
possible to connect to different remote (addr, port) tuples.

This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set to 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way, which may break some applications.
Therefore ip_autobind_reuse defaults to 0, which disables the feature.

1. setsockopt(sk1, SO_REUSEADDR)
2. setsockopt(sk2, SO_REUSEADDR)
3. bind(sk1, saddr, 0)
4. bind(sk2, saddr, 0)
5. connect(sk1, daddr)
6. listen(sk2)
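
The sequence above might look like this in userspace C. It is a sketch under assumptions: `reuseaddr_bind()` and `connect_and_listen()` are hypothetical names, and for brevity the calls run in the order 1, 3, 2, 4, 5, 6, which performs the same operations on each socket. Whether sk1 and sk2 end up on the same port depends on ip_autobind_reuse and on how many ephemeral ports are free.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: setsockopt(SO_REUSEADDR) followed by bind(saddr, 0).
 * Returns the socket fd, or -errno on failure. */
static int reuseaddr_bind(const struct in_addr *saddr)
{
	struct sockaddr_in local;
	int one = 1, err;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -errno;
	memset(&local, 0, sizeof(local));
	local.sin_family = AF_INET;
	local.sin_addr = *saddr;
	local.sin_port = htons(0);
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) == 0 &&
	    bind(fd, (struct sockaddr *)&local, sizeof(local)) == 0)
		return fd;
	err = errno;
	close(fd);
	return -err;
}

/* Hypothetical driver for steps 1-6. Returns 0 on success and hands both
 * sockets to the caller through *sk1 and *sk2; returns -errno on failure. */
static int connect_and_listen(const struct in_addr *saddr,
			      const struct sockaddr_in *daddr,
			      int *sk1, int *sk2)
{
	int err;

	*sk1 = reuseaddr_bind(saddr);		/* steps 1 and 3 */
	if (*sk1 < 0)
		return *sk1;
	*sk2 = reuseaddr_bind(saddr);		/* steps 2 and 4 */
	if (*sk2 < 0) {
		err = *sk2;
		goto close_sk1;
	}
	if (connect(*sk1, (const struct sockaddr *)daddr,
		    sizeof(*daddr)) < 0 ||	/* step 5 */
	    listen(*sk2, 1) < 0) {		/* step 6 */
		err = -errno;
		goto close_sk2;
	}
	return 0;
close_sk2:
	close(*sk2);
close_sk1:
	close(*sk1);
	return err;
}
```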

If it is set to 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() whenever possible.
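
For reference, a minimal sketch of the bind()+connect() pattern with IP_BIND_ADDRESS_NO_PORT. The function name `bind_no_port_connect()` is hypothetical, and the fallback `#define` assumes the value 24 from the Linux UAPI header for libc headers that predate the option. With this option, bind() fixes only the source address and the port is chosen at connect() time, when the destination is known.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24	/* from linux/in.h, for older libc headers */
#endif

/* Hypothetical helper: bind to (saddr, 0) without reserving a port, then
 * connect to daddr. Returns the connected fd, or -errno on failure. */
static int bind_no_port_connect(const struct in_addr *saddr,
				const struct sockaddr_in *daddr)
{
	struct sockaddr_in local;
	int one = 1, err;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -errno;

	/* Defer port selection from bind() to connect(). */
	if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
		       &one, sizeof(one)) < 0)
		goto fail;

	memset(&local, 0, sizeof(local));
	local.sin_family = AF_INET;
	local.sin_addr = *saddr;
	local.sin_port = htons(0);
	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
		goto fail;
	if (connect(fd, (const struct sockaddr *)daddr, sizeof(*daddr)) < 0)
		goto fail;
	return fd;
fail:
	err = errno;
	close(fd);
	return -err;
}
```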

The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Kuniyuki Iwashima and committed by David S. Miller
4b01a967 16f6c251

+28 -1
+9
Documentation/networking/ip-sysctl.txt
@@ -958,6 +958,15 @@
 	which can be quite useful - but may break some applications.
 	Default: 0
 
+ip_autobind_reuse - BOOLEAN
+	By default, bind() does not select the ports automatically even if
+	the new socket and all sockets bound to the port have SO_REUSEADDR.
+	ip_autobind_reuse allows bind() to reuse the port and this is useful
+	when you use bind()+connect(), but may break some applications.
+	The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this
+	option should only be set by experts.
+	Default: 0
+
 ip_dynaddr - BOOLEAN
 	If set non-zero, enables support for dynamic addresses.
 	If set to a non-zero value larger than 1, a kernel log
+1
include/net/netns/ipv4.h
@@ -101,6 +101,7 @@
 	int sysctl_ip_fwd_use_pmtu;
 	int sysctl_ip_fwd_update_priority;
 	int sysctl_ip_nonlocal_bind;
+	int sysctl_ip_autobind_reuse;
 	/* Shall we try to damage output packets if routing dev changes? */
 	int sysctl_ip_dynaddr;
 	int sysctl_ip_early_demux;
+9 -1
net/ipv4/inet_connection_sock.c
@@ -174,12 +174,14 @@
 	int port = 0;
 	struct inet_bind_hashbucket *head;
 	struct net *net = sock_net(sk);
+	bool relax = false;
 	int i, low, high, attempt_half;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
 	int l3mdev;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
+ports_exhausted:
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
 	inet_get_local_port_range(net, &low, &high);
@@ -217,7 +219,7 @@
 		inet_bind_bucket_for_each(tb, &head->chain)
 			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
 			    tb->port == port) {
-				if (!inet_csk_bind_conflict(sk, tb, false, false))
+				if (!inet_csk_bind_conflict(sk, tb, relax, false))
 					goto success;
 				goto next_port;
 			}
@@ -236,6 +238,12 @@
 		/* OK we now try the upper half of the range */
 		attempt_half = 2;
 		goto other_half_scan;
+	}
+
+	if (net->ipv4.sysctl_ip_autobind_reuse && !relax) {
+		/* We still have a chance to connect to different destinations */
+		relax = true;
+		goto ports_exhausted;
 	}
 	return NULL;
 success:
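
The control flow added above (one strict scan, then at most one relaxed rescan when the sysctl is set) can be modelled in a few lines of userspace C. This is a deliberately simplified toy, not kernel code: `struct toy_port`, `toy_conflict()` and `toy_find_open_port()` are hypothetical stand-ins, and the model ignores addresses, SO_REUSEPORT, listening state details and the half-range scan that the real function performs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a bind bucket; NOT the kernel's struct inet_bind_bucket. */
struct toy_port {
	bool in_use;		/* at least one socket is already bound here */
	bool all_reuseaddr;	/* every owner has SO_REUSEADDR and none listens */
};

/* Toy stand-in for inet_csk_bind_conflict(): in strict mode any used port
 * conflicts; in relaxed mode a used port is acceptable when both the new
 * socket and all current owners allow address reuse. */
static bool toy_conflict(const struct toy_port *p, bool sk_reuse, bool relax)
{
	if (!p->in_use)
		return false;
	return !(relax && sk_reuse && p->all_reuseaddr);
}

/* Returns the index of the chosen port, or -1 if none can be used. */
static int toy_find_open_port(const struct toy_port *ports, size_t n,
			      bool sk_reuse, bool autobind_reuse)
{
	bool relax = false;
	size_t i;

ports_exhausted:
	for (i = 0; i < n; i++)
		if (!toy_conflict(&ports[i], sk_reuse, relax))
			return (int)i;

	if (autobind_reuse && !relax) {
		/* Strict scan failed: retry once with the relaxed check. */
		relax = true;
		goto ports_exhausted;
	}
	return -1;
}
```

The `!relax` test in the retry condition is what keeps the rescan from looping: the second pass either finds a port or falls through to the failure return.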
+9
net/ipv4/sysctl_net_ipv4.c
@@ -764,6 +764,15 @@
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "ip_autobind_reuse",
+		.data		= &init_net.ipv4.sysctl_ip_autobind_reuse,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+	{
 		.procname	= "fwmark_reflect",
 		.data		= &init_net.ipv4.sysctl_fwmark_reflect,
 		.maxlen		= sizeof(int),
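
Given the procname above, the knob should appear as /proc/sys/net/ipv4/ip_autobind_reuse on kernels that carry this patch. A small sketch for reading it from userspace follows; the function name `read_ip_autobind_reuse()` is hypothetical, and a return value of -1 is this sketch's own convention for "not available".

```c
#include <stdio.h>

/* Hypothetical helper: read the current ip_autobind_reuse value.
 * Returns 0 or 1 as exposed by the sysctl, or -1 if the file is missing
 * (for example on a kernel without this patch) or cannot be parsed. */
static int read_ip_autobind_reuse(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/ip_autobind_reuse", "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

Because the table entry uses proc_dointvec_minmax with SYSCTL_ZERO and SYSCTL_ONE as bounds, writes outside the range 0 to 1 are rejected, so a successful read should only ever yield 0 or 1.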