Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: add rfc3168, section 6.1.1.1. fallback

This work is a follow-up to commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet on the first timeout, as suggested by the RFC:

[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

1) Normal ECN-capable path:

    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->

2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->

Without the fallback, every retransmitted ECN-setup SYN would be dropped at
the same middlebox, so the connection attempt would end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
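
The two schematics above can be played through with a small user-space model. This is only a sketch, not kernel code: syn_flags_for_attempt(), middlebox_drops() and first_delivered_attempt() are hypothetical helpers, and the middlebox is assumed to drop every SYN that carries ECE or CWR. The flag values for ECE and CWR are the ones visible in include/net/tcp.h in this patch; the SYN bit is the standard TCP header value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits (ECE/CWR as in include/net/tcp.h). */
#define TCPHDR_SYN 0x02
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)

/*
 * Hypothetical helper: flags carried by the n-th transmission of the SYN.
 * Attempt 0 is the initial ECN-setup SYN; with the fallback enabled, every
 * retransmission (attempt > 0) is sent with ECE and CWR cleared.
 */
static uint8_t syn_flags_for_attempt(unsigned int attempt, bool ecn_fallback)
{
	uint8_t flags = TCPHDR_SYN_ECN;

	if (attempt > 0 && ecn_fallback)
		flags &= (uint8_t)~(TCPHDR_ECE | TCPHDR_CWR);
	return flags;
}

/* Assumed middlebox behavior: drop any SYN that has ECE or CWR set. */
static bool middlebox_drops(uint8_t flags)
{
	return (flags & (TCPHDR_ECE | TCPHDR_CWR)) != 0;
}

/*
 * Index of the first attempt that passes the middlebox, or -1 if all
 * max_attempts transmissions are dropped.
 */
static int first_delivered_attempt(unsigned int max_attempts, bool ecn_fallback)
{
	assert(max_attempts > 0);
	for (unsigned int i = 0; i < max_attempts; i++)
		if (!middlebox_drops(syn_flags_for_attempt(i, ecn_fallback)))
			return (int)i;
	return -1;
}
```

In this model, first_delivered_attempt(4, true) yields 1 (the first retransmission gets through as a plain SYN), while first_delivered_attempt(4, false) yields -1, matching the third schematic.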

In any case, only a small percentage of sites would see this additional
setup latency: at the end of 2014, ~56% of IPv4 and 65% of IPv6 servers on
the Alexa 1 million list were found to negotiate ECN (i.e. the tcp_ecn=2
default), while 0.42% of these web servers failed to connect due to
timeouts when the client tried to negotiate ECN (tcp_ecn=1), which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:

Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is set, the patch performs the RFC3168,
section 6.1.1.1. fallback on timeout. For users who explicitly do not want
this, which can be the case in data center (DC) deployments, we add a
net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback.

tp->ecn_flags are not cleared in tcp_ecn_clear_syn() on the output path;
instead, tcp_ecn_rcv_synack() takes care of that on the input path, in case
a SYN ACK ECE was merely delayed. Thus, a spurious SYN retransmission will
not prevent ECN from eventually being negotiated in that case.
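
The split between output and input path can be illustrated with a simplified user-space model. This is a sketch under assumptions, not the kernel implementation: struct conn_model and the model_*() helpers are hypothetical, TCP_ECN_OK is assumed to be a single flag bit, and the SYN ACK check (ECN stays on only if ECE is set and CWR is clear) reflects a reading of tcp_ecn_rcv_synack() that should be verified against the source tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TCPHDR_SYN 0x02
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80
#define TCP_ECN_OK 1	/* assumed flag bit for "ECN negotiated/requested" */

/* Hypothetical stand-in for the relevant per-connection state. */
struct conn_model {
	uint8_t ecn_flags;	/* models tp->ecn_flags */
	uint8_t syn_flags;	/* flags on the queued SYN */
};

/* Initial ECN-setup SYN: request ECN and remember that we did. */
static void model_send_syn(struct conn_model *c)
{
	assert(c != NULL);
	c->ecn_flags = TCP_ECN_OK;
	c->syn_flags = TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR;
}

/* Fallback on retransmit: clear header bits only, keep ecn_flags. */
static void model_retransmit_syn(struct conn_model *c)
{
	assert(c != NULL);
	c->syn_flags &= (uint8_t)~(TCPHDR_ECE | TCPHDR_CWR);
}

/* Input path: a SYN ACK without ECE (or with CWR) turns ECN off. */
static void model_rcv_synack(struct conn_model *c, bool ece, bool cwr)
{
	assert(c != NULL);
	if ((c->ecn_flags & TCP_ECN_OK) && (!ece || cwr))
		c->ecn_flags &= (uint8_t)~TCP_ECN_OK;
}

/* Runs one handshake through the model and reports whether ECN is on. */
static bool ecn_negotiated(bool syn_retransmitted, bool synack_ece,
			   bool synack_cwr)
{
	struct conn_model c;

	model_send_syn(&c);
	if (syn_retransmitted)
		model_retransmit_syn(&c);
	model_rcv_synack(&c, synack_ece, synack_cwr);
	return (c.ecn_flags & TCP_ECN_OK) != 0;
}
```

With this model, ecn_negotiated(true, true, false) is true: a delayed SYN ACK ECE that answers the original ECN-setup SYN still enables ECN even though a non-ECN SYN was retransmitted in the meantime. A plain SYN ACK, ecn_negotiated(true, false, false), leaves ECN off.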

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Daniel Borkmann and committed by David S. Miller
49213555 134e0dbe

+38 -1
+1
Documentation/networking/dctcp.txt
···
 To enable it on end hosts:
 
   sysctl -w net.ipv4.tcp_congestion_control=dctcp
+  sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional)
 
 All switches in the data center network running DCTCP must support ECN
 marking and be configured for marking when reaching defined switch buffer
+9
Documentation/networking/ip-sysctl.txt
···
 	but do not request ECN on outgoing connections.
 	Default: 2
 
+tcp_ecn_fallback - BOOLEAN
+	If the kernel detects that ECN connection misbehaves, enable fall
+	back to non-ECN. Currently, this knob implements the fallback
+	from RFC3168, section 6.1.1.1., but we reserve that in future,
+	additional detection mechanisms could be implemented under this
+	knob. The value is not used, if tcp_ecn or per route (or congestion
+	control) ECN settings are disabled.
+	Default: 1 (fallback enabled)
+
 tcp_fack - BOOLEAN
 	Enable FACK congestion avoidance and fast retransmission.
 	The value is not used, if tcp_sack is not enabled.
+2
include/net/netns/ipv4.h
···
 	struct local_ports ip_local_ports;
 
 	int sysctl_tcp_ecn;
+	int sysctl_tcp_ecn_fallback;
+
 	int sysctl_ip_no_pmtu_disc;
 	int sysctl_ip_fwd_use_pmtu;
 	int sysctl_ip_nonlocal_bind;
+2
include/net/tcp.h
···
 #define TCPHDR_ECE 0x40
 #define TCPHDR_CWR 0x80
 
+#define TCPHDR_SYN_ECN	(TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
+
 /* This is what the send packet queuing engine uses to pass
  * TCP per-packet control information to the transmission code.
  * We also store the host-order sequence numbers in here too.
+7
net/ipv4/sysctl_net_ipv4.c
···
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "tcp_ecn_fallback",
+		.data		= &init_net.ipv4.sysctl_tcp_ecn_fallback,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "ip_local_port_range",
 		.maxlen		= sizeof(init_net.ipv4.ip_local_ports.range),
 		.data		= &init_net.ipv4.ip_local_ports.range,
+4 -1
net/ipv4/tcp_ipv4.c
···
 			goto fail;
 		*per_cpu_ptr(net->ipv4.tcp_sk, cpu) = sk;
 	}
+
 	net->ipv4.sysctl_tcp_ecn = 2;
+	net->ipv4.sysctl_tcp_ecn_fallback = 1;
+
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
-	return 0;
 
+	return 0;
 fail:
 	tcp_sk_exit(net);
 
+13
net/ipv4/tcp_output.c
···
 	}
 }
 
+static void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb)
+{
+	if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)
+		/* tp->ecn_flags are cleared at a later point in time when
+		 * SYN ACK is ultimatively being received.
+		 */
+		TCP_SKB_CB(skb)->tcp_flags &= ~(TCPHDR_ECE | TCPHDR_CWR);
+}
+
 static void
 tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th,
 		    struct sock *sk)
···
 			tcp_adjust_pcount(sk, skb, oldpcount - tcp_skb_pcount(skb));
 		}
 	}
+
+	/* RFC3168, section 6.1.1.1. ECN fallback */
+	if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
+		tcp_ecn_clear_syn(sk, skb);
 
 	tcp_retrans_try_collapse(sk, skb, cur_mss);
 
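
The guard in the retransmit hunk compares the masked flags against the full TCPHDR_SYN_ECN value instead of testing for any overlap, so only a segment with SYN, ECE and CWR all set is rewritten. A minimal user-space sketch of that predicate follows; is_ecn_setup_syn() and flags_on_retransmit() are hypothetical names, the flag byte is passed by value instead of living in an skb, and TCPHDR_SYN and TCPHDR_ACK use the standard TCP header bit values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TCPHDR_SYN 0x02
#define TCPHDR_ACK 0x10
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)

/* True only if SYN, ECE and CWR are all set (an ECN-setup SYN). */
static bool is_ecn_setup_syn(uint8_t flags)
{
	return (flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN;
}

/*
 * Flags a segment would carry when retransmitted: ECE and CWR are
 * cleared only for an ECN-setup SYN and only if the fallback is enabled.
 */
static uint8_t flags_on_retransmit(uint8_t flags, bool ecn_fallback)
{
	uint8_t result = flags;

	if (is_ecn_setup_syn(flags) && ecn_fallback)
		result &= (uint8_t)~(TCPHDR_ECE | TCPHDR_CWR);

	/* Every bit other than ECE and CWR is left as it was. */
	assert((result & (uint8_t)~(TCPHDR_ECE | TCPHDR_CWR)) ==
	       (flags & (uint8_t)~(TCPHDR_ECE | TCPHDR_CWR)));
	return result;
}
```

A data segment carrying ACK and CWR, for example, passes through flags_on_retransmit() unchanged, because a plain "flags & TCPHDR_SYN_ECN" truth test would have matched it while the equality test does not.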