Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: introduce a per-route knob for quick ack

In previous discussions, I tried to find some reasonable heuristics
for delayed ACK, but this does not seem to be possible; according to Eric:

"ACKs might also be delayed because of bidirectional
traffic, and is more controlled by the application
response time. TCP stack can not easily estimate it."

"ACK can be incredibly useful to recover from losses in
a short time.

The vast majority of TCP sessions are short lived, and we
send one ACK per received segment anyway at beginning or
retransmits to let the sender smoothly increase its cwnd,
so an auto-tuning facility won't help them that much."

and according to David:

"ACKs are the only information we have to detect loss.

And, for the same reasons that TCP VEGAS is fundamentally
broken, we cannot measure the pipe or some other
receiver-side-visible piece of information to determine
when it's "safe" to stretch ACK.

And even if it's "safe", we should not do it so that losses are
accurately detected and we don't spuriously retransmit.

The only way to know when the bandwidth increases is to
"test" it, by sending more and more packets until drops happen.
That's why all successful congestion control algorithms must
operate on explicitly tested pieces of information.

Similarly, it's not really possible to universally know if
it's safe to stretch ACK or not."

It still makes sense to be able to enable or disable quick ack
mode the way the TCP_QUICKACK socket option does.

This knob is similar to the TCP_QUICKACK option, but is for people
who cannot modify the application's source code and still want to
control TCP delayed ACK behavior. As David suggested, it belongs in
per-path scope, since different paths may want different behaviors.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Cong Wang, committed by David S. Miller (bcefe17c, 2c0740e4)

10 insertions, 3 deletions

include/uapi/linux/rtnetlink.h (+2)

@@ -386,6 +386,8 @@
 #define RTAX_RTO_MIN RTAX_RTO_MIN
 	RTAX_INITRWND,
 #define RTAX_INITRWND RTAX_INITRWND
+	RTAX_QUICKACK,
+#define RTAX_QUICKACK RTAX_QUICKACK
 	__RTAX_MAX
 };
net/ipv4/tcp_input.c (+4 -1)

@@ -3717,6 +3717,7 @@
 static void tcp_fin(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	const struct dst_entry *dst;
 
 	inet_csk_schedule_ack(sk);
@@ -3728,7 +3729,9 @@
 	case TCP_ESTABLISHED:
 		/* Move to CLOSE_WAIT */
 		tcp_set_state(sk, TCP_CLOSE_WAIT);
-		inet_csk(sk)->icsk_ack.pingpong = 1;
+		dst = __sk_dst_get(sk);
+		if (!dst || !dst_metric(dst, RTAX_QUICKACK))
+			inet_csk(sk)->icsk_ack.pingpong = 1;
 		break;
 
 	case TCP_CLOSE_WAIT:
net/ipv4/tcp_output.c (+4 -2)

@@ -160,6 +160,7 @@
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	const u32 now = tcp_time_stamp;
+	const struct dst_entry *dst = __sk_dst_get(sk);
 
 	if (sysctl_tcp_slow_start_after_idle &&
 	    (!tp->packets_out && (s32)(now - tp->lsndtime) > icsk->icsk_rto))
@@ -170,8 +171,9 @@
 	/* If it is a reply for ato after last received
	 * packet, enter pingpong mode.
	 */
-	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
-		icsk->icsk_ack.pingpong = 1;
+	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato &&
+	    (!dst || !dst_metric(dst, RTAX_QUICKACK)))
+		icsk->icsk_ack.pingpong = 1;
 }
 
 /* Account for an ACK we sent. */