Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: fix tcp_tso_should_defer() vs large RTT

Neal reported that using neper tcp_stream with TCP_TX_DELAY
set to 50ms would often lead to flows stuck in a small cwnd mode,
regardless of the congestion control.

While tcp_stream sets TCP_TX_DELAY too late (only after the connect()),
it highlighted two kernel bugs.
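For context, a minimal userspace sketch of how TCP_TX_DELAY is enabled on a socket (the helper name is ours, not from the patch; the option takes a delay in microseconds, and 50000 matches the 50ms used in the neper report):

```c
/* Hedged sketch: enabling TCP_TX_DELAY (Linux 5.3+) on a TCP socket.
 * The helper name set_tcp_tx_delay() is ours, for illustration only.
 */
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_TX_DELAY
#define TCP_TX_DELAY 37		/* from include/uapi/linux/tcp.h */
#endif

/* Returns 0 on success, errno on failure. Best done before (or at)
 * connect() time, so the very first data segments are already delayed.
 */
static int set_tcp_tx_delay(int fd, int delay_us)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_TX_DELAY,
		       &delay_us, sizeof(delay_us)) != 0)
		return errno;
	return 0;
}
```

Kernels without the option report ENOPROTOOPT; as noted above, tcp_stream sets this only after connect(), which is too late for the initial segments.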

The following heuristic in tcp_tso_should_defer() seems wrong
for large RTT:

delta = tp->tcp_clock_cache - head->tstamp;
/* If next ACK is likely to come too late (half srtt), do not defer */
if ((s64)(delta - (u64)NSEC_PER_USEC * (tp->srtt_us >> 4)) < 0)
goto send_now;

If the next ACK is expected to come in more than 1 ms, we should
not defer, because we prefer a smooth ACK clocking.
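A standalone model of the corrected check may make this clearer. The names mirror the patch, but this is a hedged userspace sketch, not kernel code; srtt_us is the kernel's fixed-point smoothed RTT (8 * srtt, in microseconds), hence the NSEC_PER_USEC >> 3 scaling:

```c
/* Userspace model of the fixed deferral heuristic. */
#include <stdint.h>

#define NSEC_PER_USEC	1000ULL
#define NSEC_PER_MSEC	1000000ULL

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Return nonzero when deferring should be abandoned (send now).
 * now_ns / head_tstamp_ns model tp->tcp_clock_cache / head->tstamp.
 */
static int ack_too_late(uint64_t now_ns, uint64_t head_tstamp_ns,
			uint32_t srtt_us)
{
	/* srtt_us holds 8 * srtt in usec, so scale by NSEC_PER_USEC / 8 */
	uint64_t srtt_in_ns = (NSEC_PER_USEC >> 3) * srtt_us;
	/* When is the ACK expected? */
	uint64_t expected_ack = head_tstamp_ns + srtt_in_ns;
	/* How far from now is the ACK expected? */
	uint64_t how_far_is_the_ack = expected_ack - now_ns;
	/* Do not defer if the ACK is more than min(1ms, srtt/2) away */
	uint64_t threshold = min_u64(srtt_in_ns >> 1, NSEC_PER_MSEC);

	return (int64_t)(how_far_is_the_ack - threshold) > 0;
}
```

With the reported 50ms RTT (srtt_us = 8 * 50000), the threshold clamps to 1 ms, so a just-sent head whose ACK is ~50ms away is sent immediately instead of being deferred.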

While the blamed commit was a step in the right direction, it was not
generic enough.

Another patch fixing TCP_TX_DELAY for established flows
will be proposed when net-next reopens.

Fixes: 50c8339e9299 ("tcp: tso: restore IW10 after TSO autosizing")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251011115742.1245771-1-edumazet@google.com
[pabeni@redhat.com: fixed whitespace issue]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Authored by Eric Dumazet, committed by Paolo Abeni
295ce1eb 75527d61

+15 -4
net/ipv4/tcp_output.c
@@ -2369,7 +2369,8 @@
 				  u32 max_segs)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
-	u32 send_win, cong_win, limit, in_flight;
+	u32 send_win, cong_win, limit, in_flight, threshold;
+	u64 srtt_in_ns, expected_ack, how_far_is_the_ack;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *head;
 	int win_divisor;
@@ -2432,9 +2433,19 @@
 	head = tcp_rtx_queue_head(sk);
 	if (!head)
 		goto send_now;
-	delta = tp->tcp_clock_cache - head->tstamp;
-	/* If next ACK is likely to come too late (half srtt), do not defer */
-	if ((s64)(delta - (u64)NSEC_PER_USEC * (tp->srtt_us >> 4)) < 0)
+
+	srtt_in_ns = (u64)(NSEC_PER_USEC >> 3) * tp->srtt_us;
+	/* When is the ACK expected ? */
+	expected_ack = head->tstamp + srtt_in_ns;
+	/* How far from now is the ACK expected ? */
+	how_far_is_the_ack = expected_ack - tp->tcp_clock_cache;
+
+	/* If next ACK is likely to come too late,
+	 * ie in more than min(1ms, half srtt), do not defer.
+	 */
+	threshold = min(srtt_in_ns >> 1, NSEC_PER_MSEC);
+
+	if ((s64)(how_far_is_the_ack - threshold) > 0)
 		goto send_now;
 
 	/* Ok, it looks like it is advisable to defer.
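A side note on the final comparison in the hunk above: comparing u64 timestamps via a signed cast of their difference is the usual kernel idiom for wrap-safe ordering (as in the time_after64() family). A hedged standalone sketch, with a name of our choosing:

```c
/* Wrap-safe "is a later than b?" for u64 clocks: correct even across
 * counter wraparound, provided the two values are within 2^63 of each
 * other. ts_after() is our illustrative name, not a kernel API.
 */
#include <stdint.h>

static int ts_after(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}
```

This is why the patch can subtract threshold from how_far_is_the_ack and test the sign, rather than comparing the raw u64 values directly.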