Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: fix cwnd limited checking to improve congestion control

Yuchung discovered that tcp_is_cwnd_limited() was returning false in
the slow start phase even when the application had filled the socket
write queue.

All congestion control modules consult tcp_is_cwnd_limited() before
increasing cwnd, so this behavior prevents slow start from probing
the bandwidth at full speed.

The problem is that even if the write queue is full (i.e. we are _not_
application limited), cwnd can be under-utilized if TSO auto-defers
sends or TCP Small Queues decides to hold packets.

So in_flight can be kept at a smaller value, and we can reach the
point where tcp_is_cwnd_limited() returns false.

With TCP Small Queues and FQ/pacing, this issue is more visible.

We fix this by having tcp_cwnd_validate(), which is supposed to track
such things, take into account unsent_segs: the number of segments that
we are not sending at the moment due to TSO or TSQ, but intend to send
very soon. Then, when we are cwnd-limited, we remember this fact while
processing the window of ACKs that comes back.

For example, suppose we have a brand new connection with cwnd=10; we
are in slow start, and we send a flight of 9 packets. By the time we
have received ACKs for all 9 packets we want our cwnd to be 18.
We implement this by setting tp->lsnd_pending to 9, and
considering ourselves to be cwnd-limited while cwnd is less than
twice tp->lsnd_pending (2*9 -> 18).

This makes tcp_is_cwnd_limited() more understandable, by removing
the GSO/TSO kludge that tried to work around the issue.

Note that the in_flight parameter can be removed in a follow-up
cleanup patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

authored by Eric Dumazet, committed by David S. Miller
e114a710 4e8bbb81
+36 -28
include/linux/tcp.h (+1)
@@ -230,6 +230,7 @@
 	u32	snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
 	u32	snd_cwnd_used;
 	u32	snd_cwnd_stamp;
+	u32	lsnd_pending;	/* packets inflight or unsent since last xmit */
 	u32	prior_cwnd;	/* Congestion window at start of Recovery. */
 	u32	prr_delivered;	/* Number of newly delivered packets to
 				 * receiver in Recovery. */
include/net/tcp.h (+21 -1)
@@ -974,7 +974,27 @@
 {
 	return tp->snd_una + tp->snd_wnd;
 }
-bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
+
+/* We follow the spirit of RFC2861 to validate cwnd but implement a more
+ * flexible approach. The RFC suggests cwnd should not be raised unless
+ * it was fully used previously. But we allow cwnd to grow as long as the
+ * application has used half the cwnd.
+ * Example :
+ * cwnd is 10 (IW10), but application sends 9 frames.
+ * We allow cwnd to reach 18 when all frames are ACKed.
+ * This check is safe because it's as aggressive as slow start which already
+ * risks 100% overshoot. The advantage is that we discourage application to
+ * either send more filler packets or data to artificially blow up the cwnd
+ * usage, and allow application-limited process to probe bw more aggressively.
+ *
+ * TODO: remove in_flight once we can fix all callers, and their callers...
+ */
+static inline bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	return tp->snd_cwnd < 2 * tp->lsnd_pending;
+}
 
 static inline void tcp_check_probe_timer(struct sock *sk)
 {
net/ipv4/tcp_cong.c (-20)
@@ -276,26 +276,6 @@
 	return err;
 }
 
-/* RFC2861 Check whether we are limited by application or congestion window
- * This is the inverse of cwnd check in tcp_tso_should_defer
- */
-bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
-{
-	const struct tcp_sock *tp = tcp_sk(sk);
-	u32 left;
-
-	if (in_flight >= tp->snd_cwnd)
-		return true;
-
-	left = tp->snd_cwnd - in_flight;
-	if (sk_can_gso(sk) &&
-	    left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
-	    left < tp->xmit_size_goal_segs)
-		return true;
-	return left <= tcp_max_tso_deferred_mss(tp);
-}
-EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited);
-
 /* Slow start is used when congestion window is no greater than the slow start
  * threshold. We base on RFC2581 and also handle stretch ACKs properly.
  * We do not implement RFC3465 Appropriate Byte Counting (ABC) per se but
net/ipv4/tcp_output.c (+14 -7)
@@ -1402,12 +1402,13 @@
 	tp->snd_cwnd_stamp = tcp_time_stamp;
 }
 
-/* Congestion window validation. (RFC2861) */
-static void tcp_cwnd_validate(struct sock *sk)
+static void tcp_cwnd_validate(struct sock *sk, u32 unsent_segs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tp->packets_out >= tp->snd_cwnd) {
+	tp->lsnd_pending = tp->packets_out + unsent_segs;
+
+	if (tcp_is_cwnd_limited(sk, 0)) {
 		/* Network is feed fully. */
 		tp->snd_cwnd_used = 0;
 		tp->snd_cwnd_stamp = tcp_time_stamp;
@@ -1881,7 +1880,7 @@
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
-	unsigned int tso_segs, sent_pkts;
+	unsigned int tso_segs, sent_pkts, unsent_segs = 0;
 	int cwnd_quota;
 	int result;
 
@@ -1925,7 +1924,7 @@
 			break;
 		} else {
 			if (!push_one && tcp_tso_should_defer(sk, skb))
-				break;
+				goto compute_unsent_segs;
 		}
 
 		/* TCP Small Queues :
@@ -1950,8 +1949,14 @@
 		 * there is no smp_mb__after_set_bit() yet
 		 */
 		smp_mb__after_clear_bit();
-		if (atomic_read(&sk->sk_wmem_alloc) > limit)
+		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
+			u32 unsent_bytes;
+
+compute_unsent_segs:
+			unsent_bytes = tp->write_seq - tp->snd_nxt;
+			unsent_segs = DIV_ROUND_UP(unsent_bytes, mss_now);
 			break;
+		}
 	}
 
 	limit = mss_now;
@@ -1997,7 +1990,7 @@
 		/* Send one loss probe per tail loss episode. */
 		if (push_one != 2)
 			tcp_schedule_loss_probe(sk);
-		tcp_cwnd_validate(sk);
+		tcp_cwnd_validate(sk, unsent_segs);
 		return false;
 	}
 	return (push_one == 2) || (!tp->packets_out && tcp_send_head(sk));