Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: implement TSQ for retransmits

We saw sch_fq drops caused by its per-flow limit of 100 packets when TCP
was dealing with large cwnd and bursts of retransmits.

Even after increasing the limit to 1000, and even after commit
10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time"),
we can still have these drops.

Under certain conditions, TCP can spend a considerable amount of
time queuing thousands of skbs in a single tcp_xmit_retransmit_queue()
invocation, incurring latency spikes and stalls of other softirq
handlers.

This patch implements TSQ for retransmits, limiting the number of packets
queued at any one time and giving both regular transmits and retransmits
a better chance of being scheduled.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Eric Dumazet, committed by David S. Miller
f9616c35 0f1100c1

+47 -25
net/ipv4/tcp_output.c
···
 {
 	if ((1 << sk->sk_state) &
 	    (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
-	     TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
-		tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+	     TCPF_CLOSE_WAIT | TCPF_LAST_ACK)) {
+		struct tcp_sock *tp = tcp_sk(sk);
+
+		if (tp->lost_out > tp->retrans_out &&
+		    tp->snd_cwnd > tcp_packets_in_flight(tp))
+			tcp_xmit_retransmit_queue(sk);
+
+		tcp_write_xmit(sk, tcp_current_mss(sk), tp->nonagle,
 			       0, GFP_ATOMIC);
+	}
 }
 /*
  * One tasklet per cpu tries to send more skbs.
···
 	return -1;
 }
 
+/* TCP Small Queues :
+ * Control number of packets in qdisc/devices to two packets / or ~1 ms.
+ * (These limits are doubled for retransmits)
+ * This allows for :
+ *  - better RTT estimation and ACK scheduling
+ *  - faster recovery
+ *  - high rates
+ * Alas, some drivers / subsystems require a fair amount
+ * of queued bytes to ensure line rate.
+ * One example is wifi aggregation (802.11 AMPDU)
+ */
+static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
+				  unsigned int factor)
+{
+	unsigned int limit;
+
+	limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+	limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+	limit <<= factor;
+
+	if (atomic_read(&sk->sk_wmem_alloc) > limit) {
+		set_bit(TSQ_THROTTLED, &tcp_sk(sk)->tsq_flags);
+		/* It is possible TX completion already happened
+		 * before we set TSQ_THROTTLED, so we must
+		 * test again the condition.
+		 */
+		smp_mb__after_atomic();
+		if (atomic_read(&sk->sk_wmem_alloc) > limit)
+			return true;
+	}
+	return false;
+}
+
 /* This routine writes packets to the network.  It advances the
  * send_head.  This happens as incoming acks open up the remote
  * window for us.
···
 		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
 			break;
 
-		/* TCP Small Queues :
-		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
-		 * This allows for :
-		 *  - better RTT estimation and ACK scheduling
-		 *  - faster recovery
-		 *  - high rates
-		 * Alas, some drivers / subsystems require a fair amount
-		 * of queued bytes to ensure line rate.
-		 * One example is wifi aggregation (802.11 AMPDU)
-		 */
-		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
-		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
-
-		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
-			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
-			/* It is possible TX completion already happened
-			 * before we set TSQ_THROTTLED, so we must
-			 * test again the condition.
-			 */
-			smp_mb__after_atomic();
-			if (atomic_read(&sk->sk_wmem_alloc) > limit)
-				break;
-		}
+		if (tcp_small_queue_check(sk, skb, 0))
+			break;
 
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
···
 		if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
 			continue;
+
+		if (tcp_small_queue_check(sk, skb, 1))
+			return;
 
 		if (tcp_retransmit_skb(sk, skb, segs))
 			return;