
net-tcp: Disable TCP ssthresh metrics cache by default

This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
that disables the TCP ssthresh metrics cache by default. Other parts of the
TCP metrics cache, e.g. rtt and cwnd, remain unchanged.

As modern networks become more and more dynamic, the TCP metrics cache
today often causes more harm than benefit. For example, the same IP
address is often shared by different subscribers behind NAT in residential
networks. Even when an IP address is not shared by different users,
caching the slow-start threshold of a previous short flow that used
loss-based congestion control (e.g. CUBIC) often causes later, longer
flows on the same network path to exit slow start prematurely, with
abysmal throughput.

Caching ssthresh is very risky and can lead to terrible performance.
Therefore it makes sense to disable ssthresh caching by default and let
administrators opt in for specific networks. This practice has also
worked well over several years of deployment with CUBIC congestion
control at Google.

Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Kevin(Yudong) Yang and committed by David S. Miller
(65e6d901, 4e7696d9)

5 files changed, 24 insertions(+), 4 deletions(-)
Documentation/networking/ip-sysctl.txt (+4)

@@ -479,6 +479,10 @@
 	degradation. If set, TCP will not cache metrics on closing
 	connections.
 
+tcp_no_ssthresh_metrics_save - BOOLEAN
+	Controls whether TCP saves ssthresh metrics in the route cache.
+	Default is 1, which disables ssthresh metrics.
+
 tcp_orphan_retries - INTEGER
 	This value influences the timeout of a locally closed TCP connection,
 	when RTO retransmissions remain unacknowledged.
include/net/netns/ipv4.h (+1)

@@ -154,6 +154,7 @@
 	int sysctl_tcp_adv_win_scale;
 	int sysctl_tcp_frto;
 	int sysctl_tcp_nometrics_save;
+	int sysctl_tcp_no_ssthresh_metrics_save;
 	int sysctl_tcp_moderate_rcvbuf;
 	int sysctl_tcp_tso_win_divisor;
 	int sysctl_tcp_workaround_signed_windows;
net/ipv4/sysctl_net_ipv4.c (+9)

@@ -1193,6 +1193,15 @@
 		.proc_handler = proc_dointvec,
 	},
 	{
+		.procname = "tcp_no_ssthresh_metrics_save",
+		.data = &init_net.ipv4.sysctl_tcp_no_ssthresh_metrics_save,
+		.maxlen = sizeof(int),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_ONE,
+	},
+	{
 		.procname = "tcp_moderate_rcvbuf",
 		.data = &init_net.ipv4.sysctl_tcp_moderate_rcvbuf,
 		.maxlen = sizeof(int),
net/ipv4/tcp_ipv4.c (+1)

@@ -2674,6 +2674,7 @@
 	net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
 	net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX;
 	net->ipv4.sysctl_tcp_tw_reuse = 2;
+	net->ipv4.sysctl_tcp_no_ssthresh_metrics_save = 1;
 
 	cnt = tcp_hashinfo.ehash_mask + 1;
 	net->ipv4.tcp_death_row.sysctl_max_tw_buckets = cnt / 2;
net/ipv4/tcp_metrics.c (+9 -4)

@@ -385,7 +385,8 @@
 
 	if (tcp_in_initial_slowstart(tp)) {
 		/* Slow start still did not finish. */
-		if (!tcp_metric_locked(tm, TCP_METRIC_SSTHRESH)) {
+		if (!net->ipv4.sysctl_tcp_no_ssthresh_metrics_save &&
+		    !tcp_metric_locked(tm, TCP_METRIC_SSTHRESH)) {
 			val = tcp_metric_get(tm, TCP_METRIC_SSTHRESH);
 			if (val && (tp->snd_cwnd >> 1) > val)
 				tcp_metric_set(tm, TCP_METRIC_SSTHRESH,
@@ -401,7 +400,8 @@
 	} else if (!tcp_in_slow_start(tp) &&
 		   icsk->icsk_ca_state == TCP_CA_Open) {
 		/* Cong. avoidance phase, cwnd is reliable. */
-		if (!tcp_metric_locked(tm, TCP_METRIC_SSTHRESH))
+		if (!net->ipv4.sysctl_tcp_no_ssthresh_metrics_save &&
+		    !tcp_metric_locked(tm, TCP_METRIC_SSTHRESH))
 			tcp_metric_set(tm, TCP_METRIC_SSTHRESH,
 				       max(tp->snd_cwnd >> 1, tp->snd_ssthresh));
 		if (!tcp_metric_locked(tm, TCP_METRIC_CWND)) {
@@ -418,7 +416,8 @@
 			tcp_metric_set(tm, TCP_METRIC_CWND,
 				       (val + tp->snd_ssthresh) >> 1);
 		}
-		if (!tcp_metric_locked(tm, TCP_METRIC_SSTHRESH)) {
+		if (!net->ipv4.sysctl_tcp_no_ssthresh_metrics_save &&
+		    !tcp_metric_locked(tm, TCP_METRIC_SSTHRESH)) {
 			val = tcp_metric_get(tm, TCP_METRIC_SSTHRESH);
 			if (val && tp->snd_ssthresh > val)
 				tcp_metric_set(tm, TCP_METRIC_SSTHRESH,
@@ -444,6 +441,7 @@
 {
 	struct dst_entry *dst = __sk_dst_get(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
 	struct tcp_metrics_block *tm;
 	u32 val, crtt = 0; /* cached RTT scaled by 8 */
@@ -462,7 +458,8 @@
 	if (tcp_metric_locked(tm, TCP_METRIC_CWND))
 		tp->snd_cwnd_clamp = tcp_metric_get(tm, TCP_METRIC_CWND);
 
-	val = tcp_metric_get(tm, TCP_METRIC_SSTHRESH);
+	val = net->ipv4.sysctl_tcp_no_ssthresh_metrics_save ?
+	      0 : tcp_metric_get(tm, TCP_METRIC_SSTHRESH);
 	if (val) {
 		tp->snd_ssthresh = val;
 		if (tp->snd_ssthresh > tp->snd_cwnd_clamp)