Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: abort orphan sockets stalling on zero window probes

Currently we have two different policies for orphan sockets
that repeatedly stall on zero window ACKs. If a socket gets
a zero window ACK when it is transmitting data, the RTO is
used to probe the window. The socket is aborted after roughly
tcp_orphan_retries() retries (as in tcp_write_timeout()).

But if the socket was idle when it received the zero window ACK,
and later wants to send more data, we use the probe timer to
probe the window. If the receiver always returns zero window ACKs,
icsk_probes keeps getting reset in tcp_ack() and the orphan socket
can stall forever until the system reaches the orphan limit (as
commented in tcp_probe_timer()). This opens up a simple attack
to create lots of hanging orphan sockets to burn the memory
and the CPU, as demonstrated in the recent netdev post "TCP
connection will hang in FIN_WAIT1 after closing if zero window is
advertised." http://www.spinics.net/lists/netdev/msg296539.html

This patch follows the design in RTO-based probe: we abort an orphan
socket stalling on zero window when the probe timer reaches both
the maximum backoff and the maximum RTO. For example, a 100ms RTT
connection will time out after roughly 153 seconds (0.3 + 0.6 +
.... + 76.8) if the receiver keeps the window shut. If the orphan
socket passes this check, but the system already has too many orphans
(as in tcp_out_of_resources()), we still abort it but we'll also
send an RST packet as the connection may still be active.
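
The 153-second figure above can be checked with a small user-space
sketch. This is not kernel code: the helper name is hypothetical, the
300 ms starting interval is inferred from the series in the message,
and the 120 s cap is assumed to match the kernel's TCP_RTO_MAX. It
only sums the doubling probe intervals while the backed-off interval
stays below the cap.

```c
/* Assumption: the RTO cap is 120 s, matching TCP_RTO_MAX in the kernel. */
#define RTO_MAX_MS 120000UL

/* Hypothetical helper (not a kernel function): add up the probe
 * intervals, doubling after each probe, for as long as the backed-off
 * interval is still below RTO_MAX_MS. The result approximates how long
 * an orphan socket keeps probing a closed window before it is aborted.
 */
unsigned long orphan_probe_budget_ms(unsigned long initial_rto_ms)
{
	unsigned long total = 0;
	unsigned long interval = initial_rto_ms;

	/* interval > 0 guards against looping forever on a zero input. */
	while (interval > 0 && interval < RTO_MAX_MS) {
		total += interval;
		interval *= 2;
	}
	return total;
}
```

With an initial interval of 300 ms the loop adds 0.3 s through 76.8 s
(nine intervals) and returns 153300 ms. The next interval, 153.6 s,
would exceed the cap, which is where the socket is aborted.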

In addition, we change TCP_USER_TIMEOUT to cover (live or dead)
sockets stalled on zero-window probes. This changes the semantics
of TCP_USER_TIMEOUT slightly because it previously applied only
when the socket had pending transmission.
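
For reference, an application opts into this limit through the
TCP_USER_TIMEOUT socket option. The sketch below shows one way to set
and read it back from user space on Linux. The two helper names are
hypothetical; only setsockopt(), getsockopt() and the option constant
come from the system headers.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: cap, in milliseconds, how long TCP may retry or
 * probe the window before aborting the connection with ETIMEDOUT.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_user_timeout_ms(int fd, unsigned int timeout_ms)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  &timeout_ms, sizeof(timeout_ms));
}

/* Hypothetical helper: read the current limit back into *timeout_ms. */
int get_user_timeout_ms(int fd, unsigned int *timeout_ms)
{
	socklen_t len = sizeof(*timeout_ms);

	return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  timeout_ms, &len);
}
```

A caller would typically invoke set_user_timeout_ms(fd, 30000) right
after creating the socket. A value of 0 leaves the system defaults in
effect.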

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Yuchung Cheng and committed by David S. Miller
b248230c cb57659a

+22 -21
+1 -1
net/ipv4/tcp.c
···
 		break;
 #endif
 	case TCP_USER_TIMEOUT:
-		/* Cap the max timeout in ms TCP will retry/retrans
+		/* Cap the max time in ms TCP will retry or probe the window
 		 * before giving up and aborting (ETIMEDOUT) a connection.
 		 */
 		if (val < 0)
+21 -20
net/ipv4/tcp_timer.c
···
  * limit.
  * 2. If we have strong memory pressure.
  */
-static int tcp_out_of_resources(struct sock *sk, int do_reset)
+static int tcp_out_of_resources(struct sock *sk, bool do_reset)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int shift = 0;
···
 		if ((s32)(tcp_time_stamp - tp->lsndtime) <= TCP_TIMEWAIT_LEN ||
 		    /* 2. Window is closed. */
 		    (!tp->snd_wnd && !tp->packets_out))
-			do_reset = 1;
+			do_reset = true;
 		if (do_reset)
 			tcp_send_active_reset(sk, GFP_ATOMIC);
 		tcp_done(sk);
···
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int max_probes;
+	u32 start_ts;
 
 	if (tp->packets_out || !tcp_send_head(sk)) {
 		icsk->icsk_probes_out = 0;
 		return;
 	}
 
-	/* *WARNING* RFC 1122 forbids this
-	 *
-	 * It doesn't AFAIK, because we kill the retransmit timer -AK
-	 *
-	 * FIXME: We ought not to do it, Solaris 2.5 actually has fixing
-	 * this behaviour in Solaris down as a bug fix. [AC]
-	 *
-	 * Let me to explain. icsk_probes_out is zeroed by incoming ACKs
-	 * even if they advertise zero window. Hence, connection is killed only
-	 * if we received no ACKs for normal connection timeout. It is not killed
-	 * only because window stays zero for some time, window may be zero
-	 * until armageddon and even later. We are in full accordance
-	 * with RFCs, only probe timer combines both retransmission timeout
-	 * and probe timeout in one bottle.			--ANK
+	/* RFC 1122 4.2.2.17 requires the sender to stay open indefinitely as
+	 * long as the receiver continues to respond probes. We support this by
+	 * default and reset icsk_probes_out with incoming ACKs. But if the
+	 * socket is orphaned or the user specifies TCP_USER_TIMEOUT, we
+	 * kill the socket when the retry count and the time exceeds the
+	 * corresponding system limit. We also implement similar policy when
+	 * we use RTO to probe window in tcp_retransmit_timer().
 	 */
-	max_probes = sysctl_tcp_retries2;
+	start_ts = tcp_skb_timestamp(tcp_send_head(sk));
+	if (!start_ts)
+		skb_mstamp_get(&tcp_send_head(sk)->skb_mstamp);
+	else if (icsk->icsk_user_timeout &&
+		 (s32)(tcp_time_stamp - start_ts) > icsk->icsk_user_timeout)
+		goto abort;
 
+	max_probes = sysctl_tcp_retries2;
 	if (sock_flag(sk, SOCK_DEAD)) {
 		const int alive = inet_csk_rto_backoff(icsk, TCP_RTO_MAX) < TCP_RTO_MAX;
 
 		max_probes = tcp_orphan_retries(sk, alive);
-
-		if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes))
+		if (!alive && icsk->icsk_backoff >= max_probes)
+			goto abort;
+		if (tcp_out_of_resources(sk, true))
 			return;
 	}
 
 	if (icsk->icsk_probes_out > max_probes) {
-		tcp_write_err(sk);
+abort:		tcp_write_err(sk);
 	} else {
 		/* Only send another probe if we didn't close things up. */
 		tcp_send_probe0(sk);