Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'accecn-protocol-patch-series'

Chia-Yu Chang says:

====================
AccECN protocol patch series

Please find v19 of the AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found at
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28 and
will be published as RFC 9768.

This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
---
Chia-Yu Chang (3):
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK

Ilpo Järvinen (7):
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics

Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 12 +
include/linux/tcp.h | 28 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 33 ++
include/net/tcp_ecn.h | 554 +++++++++++++++++-
include/uapi/linux/tcp.h | 9 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 30 +-
net/ipv4/tcp_input.c | 318 +++++++++-
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 239 +++++++-
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1278 insertions(+), 76 deletions(-)
====================

Link: https://patch.msgid.link/20250916082434.100722-1-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Documentation/networking/ip-sysctl.rst
···
 tcp_ecn - INTEGER
 	Control use of Explicit Congestion Notification (ECN) by TCP.
-	ECN is used only when both ends of the TCP connection indicate
-	support for it. This feature is useful in avoiding losses due
-	to congestion by allowing supporting routers to signal
-	congestion before having to drop packets.
+	ECN is used only when both ends of the TCP connection indicate support
+	for it. This feature is useful in avoiding losses due to congestion by
+	allowing supporting routers to signal congestion before having to drop
+	packets. A host that supports ECN both sends ECN at the IP layer and
+	feeds back ECN at the TCP layer. The highest variant of ECN feedback
+	that both peers support is chosen by the ECN negotiation (Accurate ECN,
+	ECN, or no ECN).
+
+	The highest negotiated variant for incoming connection requests
+	and the highest variant requested by outgoing connection
+	attempts:
+
+	===== ==================== ====================
+	Value Incoming connections Outgoing connections
+	===== ==================== ====================
+	0     No ECN               No ECN
+	1     ECN                  ECN
+	2     ECN                  No ECN
+	3     AccECN               AccECN
+	4     AccECN               ECN
+	5     AccECN               No ECN
+	===== ==================== ====================
+
+	Default: 2
+
+tcp_ecn_option - INTEGER
+	Control Accurate ECN (AccECN) option sending when AccECN has been
+	successfully negotiated during handshake. Send logic inhibits
+	sending AccECN options regardless of this setting when no AccECN
+	option has been seen for the reverse direction.

 	Possible values are:

-	= =====================================================
-	0 Disable ECN.  Neither initiate nor accept ECN.
-	1 Enable ECN when requested by incoming connections and
-	  also request ECN on outgoing connection attempts.
-	2 Enable ECN when requested by incoming connections
-	  but do not request ECN on outgoing connections.
-	= =====================================================
+	= ============================================================
+	0 Never send AccECN option. This also disables sending AccECN
+	  option in SYN/ACK during handshake.
+	1 Send AccECN option sparingly according to the minimum option
+	  rules outlined in draft-ietf-tcpm-accurate-ecn.
+	2 Send AccECN option on every packet whenever it fits into TCP
+	  option space.
+	= ============================================================

 	Default: 2
+
+tcp_ecn_option_beacon - INTEGER
+	Control Accurate ECN (AccECN) option sending frequency per RTT; it
+	takes effect only when tcp_ecn_option is set to 2.
+
+	Default: 3 (AccECN option will be sent at least 3 times per RTT)

 tcp_ecn_fallback - BOOLEAN
 	If the kernel detects that ECN connection misbehaves, enable fall
Documentation/networking/net_cachelines/tcp_sock.rst
···
 u32     prr_out                  read_mostly   read_mostly   tcp_rate_skb_sent,tcp_newly_delivered(tx);tcp_ack,tcp_rate_gen,tcp_clean_rtx_queue(rx)
 u32     delivered                read_mostly   read_write    tcp_rate_skb_sent, tcp_newly_delivered(tx);tcp_ack, tcp_rate_gen, tcp_clean_rtx_queue (rx)
 u32     delivered_ce             read_mostly   read_write    tcp_rate_skb_sent(tx);tcp_rate_gen(rx)
+u32     received_ce              read_mostly   read_write
+u32[3]  received_ecn_bytes       read_mostly   read_write
+u8:4    received_ce_pending      read_mostly   read_write
+u32[3]  delivered_ecn_bytes                    read_write
+u8:2    syn_ect_snt              write_mostly  read_write
+u8:2    syn_ect_rcv              read_mostly   read_write
+u8:2    accecn_minlen            write_mostly  read_write
+u8:2    est_ecnfield                           read_write
+u8:2    accecn_opt_demand        read_mostly   read_write
+u8:2    prev_ecnfield                          read_write
+u64     accecn_opt_tstamp                      read_write
+u8:4    accecn_fail_mode
 u32     lost                     read_mostly                 tcp_ack
 u32     app_limited              read_write    read_mostly   tcp_rate_check_app_limited,tcp_rate_skb_sent(tx);tcp_rate_gen(rx)
 u64     first_tx_mstamp          read_write                  tcp_rate_skb_sent
include/linux/tcp.h
···
 		smc_ok : 1,	/* SMC seen on SYN packet		*/
 		snd_wscale : 4,	/* Window scaling received from sender	*/
 		rcv_wscale : 4;	/* Window scaling to send to receiver	*/
-	u8	saw_unknown:1,	/* Received unknown option		*/
-		unused:7;
+	u8	accecn:6,	/* AccECN index in header, 0=no options	*/
+		saw_unknown:1,	/* Received unknown option		*/
+		unused:1;
 	u8	num_sacks;	/* Number of SACK blocks		*/
 	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
···
 					  * after data-in-SYN.
 					  */
 	u8	syn_tos;
+	bool	accecn_ok;
+	u8	syn_ect_snt: 2,
+		syn_ect_rcv: 2,
+		accecn_fail_mode:4;
+	u8	saw_accecn_opt:2;
 #ifdef CONFIG_TCP_AO
 	u8	ao_keyid;
 	u8	ao_rcv_next;
···
 	u32	mdev_us;	/* medium deviation			*/
 	u32	rtt_seq;	/* sequence number to update rttvar	*/
 	u64	tcp_wstamp_ns;	/* departure time for next sent data packet */
+	u64	accecn_opt_tstamp;	/* Last AccECN option sent timestamp */
 	struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
 	struct sk_buff *highest_sack;   /* skb just after the highest
 					 * skb with SACKed bit set
···
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		rate_app_limited:1;  /* rate_{delivered,interval_us} limited? */
+	u8	received_ce_pending:4, /* Not yet transmit cnt of received_ce */
+		unused2:4;
+	u8	accecn_minlen:2,/* Minimum length of AccECN option sent */
+		est_ecnfield:2,/* ECN field for AccECN delivered estimates */
+		accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
+		prev_ecnfield:2; /* ECN bits from the previous segment */
 	__be32	pred_flags;
 	u64	tcp_clock_cache; /* cache last tcp_clock_ns() (see tcp_mstamp_refresh()) */
 	u64	tcp_mstamp;	/* most recent packet received/sent */
···
 	u32	snd_up;		/* Urgent pointer */
 	u32	delivered;	/* Total data packets delivered incl. rexmits */
 	u32	delivered_ce;	/* Like the above but only ECE marked packets */
+	u32	received_ce;	/* Like the above but for rcvd CE marked pkts */
+	u32	received_ecn_bytes[3];	/* received byte counters for three ECN
+					 * types: INET_ECN_ECT_1, INET_ECN_ECT_0,
+					 * and INET_ECN_CE
+					 */
 	u32	app_limited;	/* limited until "delivered" reaches this val */
 	u32	rcv_wnd;	/* Current receiver window */
 	/*
···
 	u32	rate_delivered;    /* saved rate sample: packets delivered */
 	u32	rate_interval_us;  /* saved rate sample: time elapsed */
 	u32	rcv_rtt_last_tsecr;
+	u32	delivered_ecn_bytes[3];
 	u64	first_tx_mstamp;  /* start of window send phase */
 	u64	delivered_mstamp; /* time we reached "delivered" */
 	u64	bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
···
 	u8	compressed_ack;
 	u8	dup_ack_counter:2,
 		tlp_retrans:1,	/* TLP is a retransmission */
-		unused:5;
+		syn_ect_snt:2,	/* AccECN ECT memory, only */
+		syn_ect_rcv:2;	/* ... needed during 3WHS + first seqno */
 	u8	thin_lto    : 1,/* Use linear timeouts for thin streams */
 		fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
 		fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
···
 		syn_fastopen_child:1; /* created TFO passive child socket */

 	u8	keepalive_probes; /* num of allowed keep alive probes */
+	u8	accecn_fail_mode:4, /* AccECN failure handling */
+		saw_accecn_opt:2;   /* An AccECN option was seen */
 	u32	tcp_tx_delay;	/* delay (in usec) added to TX packets */

 	/* RTT measurement */
include/net/netns/ipv4.h
···
 	struct local_ports ip_local_ports;

 	u8 sysctl_tcp_ecn;
+	u8 sysctl_tcp_ecn_option;
+	u8 sysctl_tcp_ecn_option_beacon;
 	u8 sysctl_tcp_ecn_fallback;

 	u8 sysctl_ip_default_ttl;
include/net/tcp.h
···
 /* Maximal number of window scale according to RFC1323 */
 #define TCP_MAX_WSCALE		14U

+/* Default sending frequency of accurate ECN option per RTT */
+#define TCP_ACCECN_OPTION_BEACON	3
+
 /* urg_data states */
 #define TCP_URG_VALID	0x0100
 #define TCP_URG_NOTYET	0x0200
···
 #define TCPOPT_AO		29	/* Authentication Option (RFC5925) */
 #define TCPOPT_MPTCP		30	/* Multipath TCP (RFC6824) */
 #define TCPOPT_FASTOPEN		34	/* Fast open (RFC7413) */
+#define TCPOPT_ACCECN0		172	/* 0xAC: Accurate ECN Order 0 */
+#define TCPOPT_ACCECN1		174	/* 0xAE: Accurate ECN Order 1 */
 #define TCPOPT_EXP		254	/* Experimental */
 /* Magic number to be after the option value for sharing TCP
  * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
···
 #define TCPOLEN_TIMESTAMP	10
 #define TCPOLEN_MD5SIG		18
 #define TCPOLEN_FASTOPEN_BASE	2
+#define TCPOLEN_ACCECN_BASE	2
 #define TCPOLEN_EXP_FASTOPEN_BASE	4
 #define TCPOLEN_EXP_SMC_BASE	6
···
 #define TCPOLEN_MD5SIG_ALIGNED		20
 #define TCPOLEN_MSS_ALIGNED		4
 #define TCPOLEN_EXP_SMC_BASE_ALIGNED	8
+#define TCPOLEN_ACCECN_PERFIELD		3
+
+/* Maximum number of byte counters in AccECN option + size */
+#define TCP_ACCECN_NUMFIELDS		3
+#define TCP_ACCECN_MAXSIZE		(TCPOLEN_ACCECN_BASE + \
+					 TCPOLEN_ACCECN_PERFIELD * \
+					 TCP_ACCECN_NUMFIELDS)
+#define TCP_ACCECN_SAFETY_SHIFT		1 /* SAFETY_FACTOR in accecn draft */

 /* Flags in tp->nonagle */
 #define TCP_NAGLE_OFF		1	/* Nagle's algo is disabled */
···
 #define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
 #define TCPHDR_SYN_ECN	(TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
+#define TCPHDR_SYNACK_ACCECN	(TCPHDR_SYN | TCPHDR_ACK | TCPHDR_CWR)
+
+#define TCP_ACCECN_CEP_ACE_MASK 0x7
+#define TCP_ACCECN_ACE_MAX_DELTA 6
+
+/* To avoid/detect middlebox interference, not all counters start at 0.
+ * See draft-ietf-tcpm-accurate-ecn for the latest values.
+ */
+#define TCP_ACCECN_CEP_INIT_OFFSET 5
+#define TCP_ACCECN_E1B_INIT_OFFSET 1
+#define TCP_ACCECN_E0B_INIT_OFFSET 1
+#define TCP_ACCECN_CEB_INIT_OFFSET 0

 /* State flags for sacked in struct tcp_skb_cb */
 enum tcp_skb_cb_sacked_flags {
···
 static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
 {
+	u32 ace;
+
 	/* mptcp hooks are only on the slow path */
 	if (sk_is_mptcp((struct sock *)tp))
 		return;

+	ace = tcp_ecn_mode_accecn(tp) ?
+	      ((tp->delivered_ce + TCP_ACCECN_CEP_INIT_OFFSET) &
+	       TCP_ACCECN_CEP_ACE_MASK) : 0;
+
 	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+			       (ace << 22) |
 			       ntohl(TCP_FLAG_ACK) |
 			       snd_wnd);
 }
include/net/tcp_ecn.h
···
 #include <linux/tcp.h>
 #include <linux/skbuff.h>
+#include <linux/bitfield.h>

 #include <net/inet_connection_sock.h>
 #include <net/sock.h>
 #include <net/tcp.h>
 #include <net/inet_ecn.h>

+/* The highest ECN variant (Accurate ECN, ECN, or no ECN) that is
+ * attempted to be negotiated and requested for incoming connection
+ * and outgoing connection, respectively.
+ */
+enum tcp_ecn_mode {
+	TCP_ECN_IN_NOECN_OUT_NOECN = 0,
+	TCP_ECN_IN_ECN_OUT_ECN = 1,
+	TCP_ECN_IN_ECN_OUT_NOECN = 2,
+	TCP_ECN_IN_ACCECN_OUT_ACCECN = 3,
+	TCP_ECN_IN_ACCECN_OUT_ECN = 4,
+	TCP_ECN_IN_ACCECN_OUT_NOECN = 5,
+};
+
+/* AccECN option sending when AccECN has been successfully negotiated */
+enum tcp_accecn_option {
+	TCP_ACCECN_OPTION_DISABLED = 0,
+	TCP_ACCECN_OPTION_MINIMUM = 1,
+	TCP_ACCECN_OPTION_FULL = 2,
+};
+
 static inline void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 {
+	/* Do not set CWR if in AccECN mode! */
 	if (tcp_ecn_mode_rfc3168(tp))
 		tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
 }
···
 static inline void tcp_ecn_accept_cwr(struct sock *sk,
 				      const struct sk_buff *skb)
 {
-	if (tcp_hdr(skb)->cwr) {
-		tcp_sk(sk)->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (tcp_ecn_mode_rfc3168(tp) && tcp_hdr(skb)->cwr) {
+		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;

 		/* If the sender is telling us it has entered CWR, then its
 		 * cwnd may be very low (even just 1 packet), so we should ACK
···
 	tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
 }

-static inline void tcp_ecn_rcv_synack(struct tcp_sock *tp,
-				      const struct tcphdr *th)
+/* tp->accecn_fail_mode */
+#define TCP_ACCECN_ACE_FAIL_SEND	BIT(0)
+#define TCP_ACCECN_ACE_FAIL_RECV	BIT(1)
+#define TCP_ACCECN_OPT_FAIL_SEND	BIT(2)
+#define TCP_ACCECN_OPT_FAIL_RECV	BIT(3)
+
+static inline bool tcp_accecn_ace_fail_send(const struct tcp_sock *tp)
 {
-	if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || th->cwr))
-		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
+	return tp->accecn_fail_mode & TCP_ACCECN_ACE_FAIL_SEND;
 }

-static inline void tcp_ecn_rcv_syn(struct tcp_sock *tp,
-				   const struct tcphdr *th)
+static inline bool tcp_accecn_ace_fail_recv(const struct tcp_sock *tp)
 {
+	return tp->accecn_fail_mode & TCP_ACCECN_ACE_FAIL_RECV;
+}
+
+static inline bool tcp_accecn_opt_fail_send(const struct tcp_sock *tp)
+{
+	return tp->accecn_fail_mode & TCP_ACCECN_OPT_FAIL_SEND;
+}
+
+static inline bool tcp_accecn_opt_fail_recv(const struct tcp_sock *tp)
+{
+	return tp->accecn_fail_mode & TCP_ACCECN_OPT_FAIL_RECV;
+}
+
+static inline void tcp_accecn_fail_mode_set(struct tcp_sock *tp, u8 mode)
+{
+	tp->accecn_fail_mode |= mode;
+}
+
+#define TCP_ACCECN_OPT_NOT_SEEN		0x0
+#define TCP_ACCECN_OPT_EMPTY_SEEN	0x1
+#define TCP_ACCECN_OPT_COUNTER_SEEN	0x2
+#define TCP_ACCECN_OPT_FAIL_SEEN	0x3
+
+static inline u8 tcp_accecn_ace(const struct tcphdr *th)
+{
+	return (th->ae << 2) | (th->cwr << 1) | th->ece;
+}
+
+/* Infer the ECT value our SYN arrived with from the echoed ACE field */
+static inline int tcp_accecn_extract_syn_ect(u8 ace)
+{
+	/* Below is an excerpt from the 1st block of Table 2 of AccECN spec */
+	static const int ace_to_ecn[8] = {
+		INET_ECN_ECT_0,		/* 0b000 (Undefined) */
+		INET_ECN_ECT_1,		/* 0b001 (Undefined) */
+		INET_ECN_NOT_ECT,	/* 0b010 (Not-ECT is received) */
+		INET_ECN_ECT_1,		/* 0b011 (ECT-1 is received) */
+		INET_ECN_ECT_0,		/* 0b100 (ECT-0 is received) */
+		INET_ECN_ECT_1,		/* 0b101 (Reserved) */
+		INET_ECN_CE,		/* 0b110 (CE is received) */
+		INET_ECN_ECT_1		/* 0b111 (Undefined) */
+	};
+
+	return ace_to_ecn[ace & 0x7];
+}
+
+/* Check ECN field transition to detect invalid transitions */
+static inline bool tcp_ect_transition_valid(u8 snt, u8 rcv)
+{
+	if (rcv == snt)
+		return true;
+
+	/* Non-ECT altered to something or something became non-ECT */
+	if (snt == INET_ECN_NOT_ECT || rcv == INET_ECN_NOT_ECT)
+		return false;
+	/* CE -> ECT(0/1)? */
+	if (snt == INET_ECN_CE)
+		return false;
+	return true;
+}
+
+static inline bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace,
+						    u8 sent_ect)
+{
+	u8 ect = tcp_accecn_extract_syn_ect(ace);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback))
+		return true;
+
+	if (!tcp_ect_transition_valid(sent_ect, ect)) {
+		tcp_accecn_fail_mode_set(tp, TCP_ACCECN_ACE_FAIL_RECV);
+		return false;
+	}
+
+	return true;
+}
+
+static inline void tcp_accecn_saw_opt_fail_recv(struct tcp_sock *tp,
+						u8 saw_opt)
+{
+	tp->saw_accecn_opt = saw_opt;
+	if (tp->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN)
+		tcp_accecn_fail_mode_set(tp, TCP_ACCECN_OPT_FAIL_RECV);
+}
+
+/* Validate the 3rd ACK based on the ACE field, see Table 4 of AccECN spec */
+static inline void tcp_accecn_third_ack(struct sock *sk,
+					const struct sk_buff *skb, u8 sent_ect)
+{
+	u8 ace = tcp_accecn_ace(tcp_hdr(skb));
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	switch (ace) {
+	case 0x0:
+		/* Invalid value */
+		tcp_accecn_fail_mode_set(tp, TCP_ACCECN_ACE_FAIL_RECV);
+		break;
+	case 0x7:
+	case 0x5:
+	case 0x1:
+		/* Unused but legal values */
+		break;
+	default:
+		/* Validation only applies to first non-data packet */
+		if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
+		    !TCP_SKB_CB(skb)->sacked &&
+		    tcp_accecn_validate_syn_feedback(sk, ace, sent_ect)) {
+			if ((tcp_accecn_extract_syn_ect(ace) == INET_ECN_CE) &&
+			    !tp->delivered_ce)
+				tp->delivered_ce++;
+		}
+		break;
+	}
+}
+
+/* Demand the minimum # to send the AccECN option */
+static inline void tcp_accecn_opt_demand_min(struct sock *sk,
+					     u8 opt_demand_min)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u8 opt_demand;
+
+	opt_demand = max_t(u8, opt_demand_min, tp->accecn_opt_demand);
+	tp->accecn_opt_demand = opt_demand;
+}
+
+/* Maps IP ECN field ECT/CE code point to AccECN option field number, given
+ * we are sending fields with Accurate ECN Order 1: ECT(1), CE, ECT(0).
+ */
+static inline u8 tcp_ecnfield_to_accecn_optfield(u8 ecnfield)
+{
+	switch (ecnfield & INET_ECN_MASK) {
+	case INET_ECN_NOT_ECT:
+		return 0;	/* AccECN does not send counts of NOT_ECT */
+	case INET_ECN_ECT_1:
+		return 1;
+	case INET_ECN_CE:
+		return 2;
+	case INET_ECN_ECT_0:
+		return 3;
+	}
+	return 0;
+}
+
+/* Maps IP ECN field ECT/CE code point to AccECN option field value offset.
+ * Some fields do not start from zero, to detect zeroing by middleboxes.
+ */
+static inline u32 tcp_accecn_field_init_offset(u8 ecnfield)
+{
+	switch (ecnfield & INET_ECN_MASK) {
+	case INET_ECN_NOT_ECT:
+		return 0;	/* AccECN does not send counts of NOT_ECT */
+	case INET_ECN_ECT_1:
+		return TCP_ACCECN_E1B_INIT_OFFSET;
+	case INET_ECN_CE:
+		return TCP_ACCECN_CEB_INIT_OFFSET;
+	case INET_ECN_ECT_0:
+		return TCP_ACCECN_E0B_INIT_OFFSET;
+	}
+	return 0;
+}
+
+/* Maps AccECN option field #nr to IP ECN field ECT/CE bits */
+static inline unsigned int tcp_accecn_optfield_to_ecnfield(unsigned int option,
+							   bool order)
+{
+	/* Based on Table 5 of the AccECN spec to map (option, order) to
+	 * the corresponding ECN counters (ECT-1, ECT-0, or CE).
+	 */
+	static const u8 optfield_lookup[2][3] = {
+		/* order = 0: 1st field ECT-0, 2nd field CE, 3rd field ECT-1 */
+		{ INET_ECN_ECT_0, INET_ECN_CE, INET_ECN_ECT_1 },
+		/* order = 1: 1st field ECT-1, 2nd field CE, 3rd field ECT-0 */
+		{ INET_ECN_ECT_1, INET_ECN_CE, INET_ECN_ECT_0 }
+	};
+
+	return optfield_lookup[order][option % 3];
+}
+
+/* Handles AccECN option ECT and CE 24-bit byte counters update into
+ * the u32 value in tcp_sock. As we're processing TCP options, it is
+ * safe to access from - 1.
+ */
+static inline s32 tcp_update_ecn_bytes(u32 *cnt, const char *from,
+				       u32 init_offset)
+{
+	u32 truncated = (get_unaligned_be32(from - 1) - init_offset) &
+			0xFFFFFFU;
+	u32 delta = (truncated - *cnt) & 0xFFFFFFU;
+
+	/* If delta has the highest bit set (24th bit) indicating
+	 * negative, sign extend to correct an estimation using
+	 * sign_extend32(delta, 24 - 1)
+	 */
+	delta = sign_extend32(delta, 23);
+	*cnt += delta;
+	return (s32)delta;
+}
+
+/* Updates Accurate ECN received counters from the received IP ECN field */
+static inline void tcp_ecn_received_counters(struct sock *sk,
+					     const struct sk_buff *skb, u32 len)
+{
+	u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+	u8 is_ce = INET_ECN_is_ce(ecnfield);
+	struct tcp_sock *tp = tcp_sk(sk);
+	bool ecn_edge;
+
+	if (!INET_ECN_is_not_ect(ecnfield)) {
+		u32 pcount = is_ce * max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+		/* As for accurate ECN, the TCP_ECN_SEEN flag is set by
+		 * tcp_ecn_received_counters() when the ECN codepoint of
+		 * received TCP data or ACK contains ECT(0), ECT(1), or CE.
+		 */
+		if (!tcp_ecn_mode_rfc3168(tp))
+			tp->ecn_flags |= TCP_ECN_SEEN;
+
+		/* ACE counter tracks *all* segments including pure ACKs */
+		tp->received_ce += pcount;
+		tp->received_ce_pending = min(tp->received_ce_pending + pcount,
+					      0xfU);
+
+		if (len > 0) {
+			u8 minlen = tcp_ecnfield_to_accecn_optfield(ecnfield);
+			u32 oldbytes = tp->received_ecn_bytes[ecnfield - 1];
+			u32 bytes_mask = GENMASK_U32(31, 22);
+
+			tp->received_ecn_bytes[ecnfield - 1] += len;
+			tp->accecn_minlen = max_t(u8, tp->accecn_minlen,
+						  minlen);
+
+			/* Send AccECN option at least once per 2^22-byte
+			 * increase in any ECN byte counter.
+			 */
+			if ((tp->received_ecn_bytes[ecnfield - 1] ^ oldbytes) &
+			    bytes_mask) {
+				tcp_accecn_opt_demand_min(sk, 1);
+			}
+		}
+	}
+
+	ecn_edge = tp->prev_ecnfield != ecnfield;
+	if (ecn_edge || is_ce) {
+		tp->prev_ecnfield = ecnfield;
+		/* Demand Accurate ECN change-triggered ACKs. Two ACKs are
+		 * demanded to indicate unambiguously the ecnfield value
+		 * in the latter ACK.
+		 */
+		if (tcp_ecn_mode_accecn(tp)) {
+			if (ecn_edge)
+				inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+			tp->accecn_opt_demand = 2;
+		}
+	}
+}
+
+/* AccECN specification, 2.2: [...] A Data Receiver maintains four counters
+ * initialized at the start of the half-connection. [...] These byte counters
+ * reflect only the TCP payload length, excluding TCP header and TCP options.
+ */
+static inline void tcp_ecn_received_counters_payload(struct sock *sk,
+						     const struct sk_buff *skb)
+{
+	const struct tcphdr *th = (const struct tcphdr *)skb->data;
+
+	tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4);
+}
+
+/* AccECN specification, 5.1: [...] a server can determine that it
+ * negotiated AccECN as [...] if the ACK contains an ACE field with
+ * the value 0b010 to 0b111 (decimal 2 to 7).
+ */
+static inline bool cookie_accecn_ok(const struct tcphdr *th)
+{
+	return tcp_accecn_ace(th) > 0x1;
+}
+
+/* Used to form the ACE flags for SYN/ACK */
+static inline u16 tcp_accecn_reflector_flags(u8 ect)
+{
+	/* TCP ACE flags of SYN/ACK are set based on IP-ECN received from SYN.
+	 * Below is an excerpt from the 1st block of Table 2 of AccECN spec,
+	 * in which TCP ACE flags are encoded as: (AE << 2) | (CWR << 1) | ECE
+	 */
+	static const u8 ecn_to_ace_flags[4] = {
+		0b010,	/* Not-ECT is received */
+		0b011,	/* ECT(1) is received */
+		0b100,	/* ECT(0) is received */
+		0b110	/* CE is received */
+	};
+
+	return FIELD_PREP(TCPHDR_ACE, ecn_to_ace_flags[ect & 0x3]);
+}
+
+/* AccECN specification, 3.1.2: If a TCP server that implements AccECN
+ * receives a SYN with the three TCP header flags (AE, CWR and ECE) set
+ * to any combination other than 000, 011 or 111, it MUST negotiate the
+ * use of AccECN as if they had been set to 111.
+ */
+static inline bool tcp_accecn_syn_requested(const struct tcphdr *th)
+{
+	u8 ace = tcp_accecn_ace(th);
+
+	return ace && ace != 0x3;
+}
+
+static inline void __tcp_accecn_init_bytes_counters(int *counter_array)
+{
+	BUILD_BUG_ON(INET_ECN_ECT_1 != 0x1);
+	BUILD_BUG_ON(INET_ECN_ECT_0 != 0x2);
+	BUILD_BUG_ON(INET_ECN_CE != 0x3);
+
+	counter_array[INET_ECN_ECT_1 - 1] = 0;
+	counter_array[INET_ECN_ECT_0 - 1] = 0;
+	counter_array[INET_ECN_CE - 1] = 0;
+}
+
+static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
+{
+	tp->received_ce = 0;
+	tp->received_ce_pending = 0;
+	__tcp_accecn_init_bytes_counters(tp->received_ecn_bytes);
+	__tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes);
+	tp->accecn_minlen = 0;
+	tp->accecn_opt_demand = 0;
+	tp->est_ecnfield = 0;
+}
+
+/* Used for make_synack to form the ACE flags */
+static inline void tcp_accecn_echo_syn_ect(struct tcphdr *th, u8 ect)
+{
+	/* TCP ACE flags of SYN/ACK are set based on IP-ECN codepoint received
+	 * from SYN. Below is an excerpt from Table 2 of the AccECN spec:
+	 * +====================+====================================+
+	 * |  IP-ECN codepoint  |   Respective ACE flags on SYN/ACK  |
+	 * |  received on SYN   |          AE   CWR   ECE            |
+	 * +====================+====================================+
+	 * |      Not-ECT       |          0    1     0              |
+	 * |      ECT(1)        |          0    1     1              |
+	 * |      ECT(0)        |          1    0     0              |
+	 * |        CE          |          1    1     0              |
+	 * +====================+====================================+
+	 */
+	th->ae = !!(ect & INET_ECN_ECT_0);
+	th->cwr = ect != INET_ECN_ECT_0;
+	th->ece = ect == INET_ECN_ECT_1;
+}
+
+static inline void tcp_accecn_set_ace(struct tcp_sock *tp, struct sk_buff *skb,
+				      struct tcphdr *th)
+{
+	u32 wire_ace;
+
+	/* The final packet of the 3WHS or anything like it must reflect
+	 * the SYN/ACK ECT instead of putting CEP into ACE field, such
+	 * cases show up in tcp_flags.
+	 */
+	if (likely(!(TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACE))) {
+		wire_ace = tp->received_ce + TCP_ACCECN_CEP_INIT_OFFSET;
+		th->ece = !!(wire_ace & 0x1);
+		th->cwr = !!(wire_ace & 0x2);
+		th->ae = !!(wire_ace & 0x4);
+		tp->received_ce_pending = 0;
+	}
+}
+
+static inline u8 tcp_accecn_option_init(const struct sk_buff *skb,
+					u8 opt_offset)
+{
+	u8 *ptr = skb_transport_header(skb) + opt_offset;
+	unsigned int optlen = ptr[1] - 2;
+
+	if (WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != TCPOPT_ACCECN1))
+		return TCP_ACCECN_OPT_FAIL_SEEN;
+	ptr += 2;
+
+	/* Detect option zeroing: an AccECN connection "MAY check that the
+	 * initial value of the EE0B field or the EE1B field is non-zero"
+	 */
+	if (optlen < TCPOLEN_ACCECN_PERFIELD)
+		return TCP_ACCECN_OPT_EMPTY_SEEN;
+	if (get_unaligned_be24(ptr) == 0)
+		return TCP_ACCECN_OPT_FAIL_SEEN;
+	if (optlen < TCPOLEN_ACCECN_PERFIELD * 3)
+		return TCP_ACCECN_OPT_COUNTER_SEEN;
+	ptr += TCPOLEN_ACCECN_PERFIELD * 2;
+	if (get_unaligned_be24(ptr) == 0)
+		return TCP_ACCECN_OPT_FAIL_SEEN;
+
+	return TCP_ACCECN_OPT_COUNTER_SEEN;
+}
+
+/* See Table 2 of the AccECN draft */
+static inline void tcp_ecn_rcv_synack(struct sock *sk, const struct sk_buff *skb,
+				      const struct tcphdr *th, u8 ip_dsfield)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u8 ace = tcp_accecn_ace(th);
+
+	switch (ace) {
+	case 0x0:
+	case 0x7:
+		/* +========+========+============+=============+
+		 * |   A    |   B    |  SYN/ACK   |  Feedback   |
+		 * |        |        |    B->A    |  Mode of A  |
+		 * |        |        | AE CWR ECE |             |
+		 * +========+========+============+=============+
+		 * | AccECN | No ECN | 0   0   0  |   Not ECN   |
+		 * | AccECN | Broken | 1   1   1  |   Not ECN   |
+		 * +========+========+============+=============+
+		 */
+		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
+		break;
+	case 0x1:
+	case 0x5:
+		/* +========+========+============+=============+
+		 * |   A    |   B    |  SYN/ACK   |  Feedback   |
+		 * |        |        |    B->A    |  Mode of A  |
+		 * |        |        | AE CWR ECE |             |
+		 * +========+========+============+=============+
+		 * | AccECN | Nonce  | 1   0   1  | (Reserved)  |
+		 * | AccECN | ECN    | 0   0   1  | Classic ECN |
+		 * | Nonce  | AccECN | 0   0   1  | Classic ECN |
+		 * | ECN    | AccECN | 0   0   1  | Classic ECN |
+		 * +========+========+============+=============+
+		 */
+		if (tcp_ecn_mode_pending(tp))
+			/* Downgrade from AccECN, or requested initially */
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+		break;
+	default:
+		tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+		tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK;
+		if (tp->rx_opt.accecn &&
+		    tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+			u8 saw_opt = tcp_accecn_option_init(skb, tp->rx_opt.accecn);
+
+			tcp_accecn_saw_opt_fail_recv(tp, saw_opt);
+			tp->accecn_opt_demand = 2;
+		}
+		if (INET_ECN_is_ce(ip_dsfield) &&
+		    tcp_accecn_validate_syn_feedback(sk, ace,
+						     tp->syn_ect_snt)) {
+			tp->received_ce++;
+			tp->received_ce_pending++;
+		}
+		break;
+	}
+}
+
+static inline void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th,
+				   const struct sk_buff *skb)
+{
+	if (tcp_ecn_mode_pending(tp)) {
+		if (!tcp_accecn_syn_requested(th)) {
+			/* Downgrade to classic ECN feedback */
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+		} else {
+			tp->syn_ect_rcv = TCP_SKB_CB(skb)->ip_dsfield &
+					  INET_ECN_MASK;
+			tp->prev_ecnfield = tp->syn_ect_rcv;
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+		}
+	}
 	if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr))
 		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
 }
···
 /* Packet ECN state for a SYN-ACK */
 static inline void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 {
-	const struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);

 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
 	if (tcp_ecn_disabled(tp))
···
 	else if (tcp_ca_needs_ecn(sk) ||
 		 tcp_bpf_ca_needs_ecn(sk))
 		INET_ECN_xmit(sk);
+
+	if (tp->ecn_flags & TCP_ECN_MODE_ACCECN) {
+		TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ACE;
+		TCP_SKB_CB(skb)->tcp_flags |=
+			tcp_accecn_reflector_flags(tp->syn_ect_rcv);
+		tp->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK;
+	}
 }

 /* Packet ECN state for a SYN. */
···
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk);
-	bool use_ecn = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn) == 1 ||
-		tcp_ca_needs_ecn(sk) || bpf_needs_ecn;
+	bool use_ecn, use_accecn;
+	u8 tcp_ecn = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn);
+
+	use_accecn = tcp_ecn == TCP_ECN_IN_ACCECN_OUT_ACCECN;
+	use_ecn = tcp_ecn == TCP_ECN_IN_ECN_OUT_ECN ||
+		  tcp_ecn == TCP_ECN_IN_ACCECN_OUT_ECN ||
+		  tcp_ca_needs_ecn(sk) || bpf_needs_ecn || use_accecn;

 	if (!use_ecn) {
 		const struct dst_entry *dst = __sk_dst_get(sk);
···
 		INET_ECN_xmit(sk);

 		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
-		tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+		if (use_accecn) {
+			TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_AE;
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_PENDING);
+			tp->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK;
+		} else {
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+		}
 	}
 }

 static inline void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb)
 {
-	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback))
+	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)) {
 		/* tp->ecn_flags are cleared at a later point in time when
 		 * SYN ACK is ultimately being received.
 		 */
-		TCP_SKB_CB(skb)->tcp_flags &= ~(TCPHDR_ECE | TCPHDR_CWR);
+		TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ACE;
+	}
 }

 static inline void
 tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th)
 {
-	if (inet_rsk(req)->ecn_ok)
+	if (tcp_rsk(req)->accecn_ok)
+		tcp_accecn_echo_syn_ect(th, tcp_rsk(req)->syn_ect_rcv);
+	else if (inet_rsk(req)->ecn_ok)
 		th->ece = 1;
+}
+
+static inline bool tcp_accecn_option_beacon_check(const struct sock *sk)
+{
+	u32 ecn_beacon = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn_option_beacon);
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!ecn_beacon)
+		return false;
+
+	return tcp_stamp_us_delta(tp->tcp_mstamp, tp->accecn_opt_tstamp) * ecn_beacon >=
+	       (tp->srtt_us >> 3);
 }

 #endif	/* _LINUX_TCP_ECN_H */
+9
include/uapi/linux/tcp.h
···
 					 * in milliseconds, including any
 					 * unfinished recovery.
 					 */
+	__u32	tcpi_received_ce;	 /* # of CE marks received */
+	__u32	tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */
+	__u32	tcpi_delivered_e0_bytes;
+	__u32	tcpi_delivered_ce_bytes;
+	__u32	tcpi_received_e1_bytes;
+	__u32	tcpi_received_e0_bytes;
+	__u32	tcpi_received_ce_bytes;
+	__u16	tcpi_accecn_fail_mode;
+	__u16	tcpi_accecn_opt_seen;
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
+4
net/ipv4/syncookies.c
···
 #include <linux/export.h>
 #include <net/secure_seq.h>
 #include <net/tcp.h>
+#include <net/tcp_ecn.h>
 #include <net/route.h>
 
 static siphash_aligned_key_t syncookie_secret[2];
···
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_request_sock *ireq;
 	struct net *net = sock_net(sk);
+	struct tcp_request_sock *treq;
 	struct request_sock *req;
 	struct sock *ret = sk;
 	struct flowi4 fl4;
···
 	}
 
 	ireq = inet_rsk(req);
+	treq = tcp_rsk(req);
 
 	sk_rcv_saddr_set(req_to_sk(req), ip_hdr(skb)->daddr);
 	sk_daddr_set(req_to_sk(req), ip_hdr(skb)->saddr);
···
 	if (!req->syncookie)
 		ireq->rcv_wscale = rcv_wscale;
 	ireq->ecn_ok &= cookie_ecn_ok(net, &rt->dst);
+	treq->accecn_ok = ireq->ecn_ok && cookie_accecn_ok(th);
 
 	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst);
 	/* ip_queue_xmit() depends on our flow being setup
+19
net/ipv4/sysctl_net_ipv4.c
···
 static int tcp_plb_max_rounds = 31;
 static int tcp_plb_max_cong_thresh = 256;
 static unsigned int tcp_tw_reuse_delay_max = TCP_PAWS_MSL * MSEC_PER_SEC;
+static int tcp_ecn_mode_max = 5;
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
···
 		.mode		= 0644,
 		.proc_handler	= proc_dou8vec_minmax,
 		.extra1		= SYSCTL_ZERO,
+		.extra2		= &tcp_ecn_mode_max,
+	},
+	{
+		.procname	= "tcp_ecn_option",
+		.data		= &init_net.ipv4.sysctl_tcp_ecn_option,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_TWO,
+	},
+	{
+		.procname	= "tcp_ecn_option_beacon",
+		.data		= &init_net.ipv4.sysctl_tcp_ecn_option_beacon,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_THREE,
 	},
 	{
 		.procname	= "tcp_ecn_fallback",
+27 -3
net/ipv4/tcp.c
···
 
 #include <net/icmp.h>
 #include <net/inet_common.h>
+#include <net/inet_ecn.h>
 #include <net/tcp.h>
+#include <net/tcp_ecn.h>
 #include <net/mptcp.h>
 #include <net/proto_memory.h>
 #include <net/xfrm.h>
···
 	tp->window_clamp = 0;
 	tp->delivered = 0;
 	tp->delivered_ce = 0;
+	tp->accecn_fail_mode = 0;
+	tp->saw_accecn_opt = TCP_ACCECN_OPT_NOT_SEEN;
+	tcp_accecn_init_counters(tp);
+	tp->prev_ecnfield = 0;
+	tp->accecn_opt_tstamp = 0;
 	if (icsk->icsk_ca_initialized && icsk->icsk_ca_ops->release)
 		icsk->icsk_ca_ops->release(sk);
 	memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
···
 {
 	const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */
 	const struct inet_connection_sock *icsk = inet_csk(sk);
+	const u8 ect1_idx = INET_ECN_ECT_1 - 1;
+	const u8 ect0_idx = INET_ECN_ECT_0 - 1;
+	const u8 ce_idx = INET_ECN_CE - 1;
 	unsigned long rate;
 	u32 now;
 	u64 rate64;
···
 	info->tcpi_total_rto_time = tp->total_rto_time;
 	if (tp->rto_stamp)
 		info->tcpi_total_rto_time += tcp_clock_ms() - tp->rto_stamp;
+
+	info->tcpi_accecn_fail_mode = tp->accecn_fail_mode;
+	info->tcpi_accecn_opt_seen = tp->saw_accecn_opt;
+	info->tcpi_received_ce = tp->received_ce;
+	info->tcpi_delivered_e1_bytes = tp->delivered_ecn_bytes[ect1_idx];
+	info->tcpi_delivered_e0_bytes = tp->delivered_ecn_bytes[ect0_idx];
+	info->tcpi_delivered_ce_bytes = tp->delivered_ecn_bytes[ce_idx];
+	info->tcpi_received_e1_bytes = tp->received_ecn_bytes[ect1_idx];
+	info->tcpi_received_e0_bytes = tp->received_ecn_bytes[ect0_idx];
+	info->tcpi_received_ce_bytes = tp->received_ecn_bytes[ce_idx];
 
 	unlock_sock_fast(sk, slow);
 }
···
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_tx, lsndtime);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_tx, mdev_us);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_tx, tcp_wstamp_ns);
+	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_tx, accecn_opt_tstamp);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_tx, rtt_seq);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_tx, tsorted_sent_queue);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_tx, highest_sack);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_tx, ecn_flags);
-	CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_tx, 89);
+	CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_tx, 97);
 
 	/* TXRX read-write hotpath cache lines */
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, pred_flags);
···
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, snd_up);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered_ce);
+	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce);
+	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ecn_bytes);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rx_opt);
···
 	/* 32bit arches with 8byte alignment on u64 fields might need padding
 	 * before tcp_clock_cache.
 	 */
-	CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 91 + 4);
+	CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 107 + 4);
 
 	/* RX read-write hotpath cache lines */
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, bytes_received);
···
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, rate_delivered);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, rate_interval_us);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, rcv_rtt_last_tsecr);
+	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, delivered_ecn_bytes);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, first_tx_mstamp);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, delivered_mstamp);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, bytes_acked);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, rcv_rtt_est);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, rcvq_space);
-	CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_rx, 96);
+	CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_rx, 112);
 }
 
 void __init tcp_init(void)
+297 -21
net/ipv4/tcp_input.c
···
 #include <linux/sysctl.h>
 #include <linux/kernel.h>
 #include <linux/prefetch.h>
+#include <linux/bitops.h>
 #include <net/dst.h>
 #include <net/tcp.h>
 #include <net/tcp_ecn.h>
···
 		if (tcp_ca_needs_ecn(sk))
 			tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
 
-		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
+		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR) &&
+		    tcp_ecn_mode_rfc3168(tp)) {
 			/* Better not delay acks, sender can have a very low cwnd */
 			tcp_enter_quickack_mode(sk, 2);
 			tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 		}
+		/* As for RFC3168 ECN, the TCP_ECN_SEEN flag is set by
+		 * tcp_data_ecn_check() when the ECN codepoint of
+		 * received TCP data contains ECT(0), ECT(1), or CE.
+		 */
+		if (!tcp_ecn_mode_rfc3168(tp))
+			break;
 		tp->ecn_flags |= TCP_ECN_SEEN;
 		break;
 	default:
 		if (tcp_ca_needs_ecn(sk))
 			tcp_ca_event(sk, CA_EVENT_ECN_NO_CE);
+		if (!tcp_ecn_mode_rfc3168(tp))
+			break;
 		tp->ecn_flags |= TCP_ECN_SEEN;
 		break;
 	}
+}
+
+/* Returns true if the byte counters can be used */
+static bool tcp_accecn_process_option(struct tcp_sock *tp,
+				      const struct sk_buff *skb,
+				      u32 delivered_bytes, int flag)
+{
+	u8 estimate_ecnfield = tp->est_ecnfield;
+	bool ambiguous_ecn_bytes_incr = false;
+	bool first_changed = false;
+	unsigned int optlen;
+	bool order1, res;
+	unsigned int i;
+	u8 *ptr;
+
+	if (tcp_accecn_opt_fail_recv(tp))
+		return false;
+
+	if (!(flag & FLAG_SLOWPATH) || !tp->rx_opt.accecn) {
+		if (!tp->saw_accecn_opt) {
+			/* Too late to enable after this point due to
+			 * potential counter wraps
+			 */
+			if (tp->bytes_sent >= (1 << 23) - 1) {
+				u8 saw_opt = TCP_ACCECN_OPT_FAIL_SEEN;
+
+				tcp_accecn_saw_opt_fail_recv(tp, saw_opt);
+			}
+			return false;
+		}
+
+		if (estimate_ecnfield) {
+			u8 ecnfield = estimate_ecnfield - 1;
+
+			tp->delivered_ecn_bytes[ecnfield] += delivered_bytes;
+			return true;
+		}
+		return false;
+	}
+
+	ptr = skb_transport_header(skb) + tp->rx_opt.accecn;
+	optlen = ptr[1] - 2;
+	if (WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != TCPOPT_ACCECN1))
+		return false;
+	order1 = (ptr[0] == TCPOPT_ACCECN1);
+	ptr += 2;
+
+	if (tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+		tp->saw_accecn_opt = tcp_accecn_option_init(skb,
+							    tp->rx_opt.accecn);
+		if (tp->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN)
+			tcp_accecn_fail_mode_set(tp, TCP_ACCECN_OPT_FAIL_RECV);
+	}
+
+	res = !!estimate_ecnfield;
+	for (i = 0; i < 3; i++) {
+		u32 init_offset;
+		u8 ecnfield;
+		s32 delta;
+		u32 *cnt;
+
+		if (optlen < TCPOLEN_ACCECN_PERFIELD)
+			break;
+
+		ecnfield = tcp_accecn_optfield_to_ecnfield(i, order1);
+		init_offset = tcp_accecn_field_init_offset(ecnfield);
+		cnt = &tp->delivered_ecn_bytes[ecnfield - 1];
+		delta = tcp_update_ecn_bytes(cnt, ptr, init_offset);
+		if (delta && delta < 0) {
+			res = false;
+			ambiguous_ecn_bytes_incr = true;
+		}
+		if (delta && ecnfield != estimate_ecnfield) {
+			if (!first_changed) {
+				tp->est_ecnfield = ecnfield;
+				first_changed = true;
+			} else {
+				res = false;
+				ambiguous_ecn_bytes_incr = true;
+			}
+		}
+
+		optlen -= TCPOLEN_ACCECN_PERFIELD;
+		ptr += TCPOLEN_ACCECN_PERFIELD;
+	}
+	if (ambiguous_ecn_bytes_incr)
+		tp->est_ecnfield = 0;
+
+	return res;
 }
 
 static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count)
···
 			    bool ece_ack)
 {
 	tp->delivered += delivered;
-	if (ece_ack)
+	if (tcp_ecn_mode_rfc3168(tp) && ece_ack)
 		tcp_count_delivered_ce(tp, delivered);
+}
+
+/* Returns the ECN CE delta */
+static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
+				u32 delivered_pkts, u32 delivered_bytes,
+				int flag)
+{
+	u32 old_ceb = tcp_sk(sk)->delivered_ecn_bytes[INET_ECN_CE - 1];
+	const struct tcphdr *th = tcp_hdr(skb);
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 delta, safe_delta, d_ceb;
+	bool opt_deltas_valid;
+	u32 corrected_ace;
+
+	/* Reordered ACK or uncertain due to lack of data to send and ts */
+	if (!(flag & (FLAG_FORWARD_PROGRESS | FLAG_TS_PROGRESS)))
+		return 0;
+
+	opt_deltas_valid = tcp_accecn_process_option(tp, skb,
+						     delivered_bytes, flag);
+
+	if (!(flag & FLAG_SLOWPATH)) {
+		/* AccECN counter might overflow on large ACKs */
+		if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK)
+			return 0;
+	}
+
+	/* ACE field is not available during handshake */
+	if (flag & FLAG_SYN_ACKED)
+		return 0;
+
+	if (tp->received_ce_pending >= TCP_ACCECN_ACE_MAX_DELTA)
+		inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+
+	corrected_ace = tcp_accecn_ace(th) - TCP_ACCECN_CEP_INIT_OFFSET;
+	delta = (corrected_ace - tp->delivered_ce) & TCP_ACCECN_CEP_ACE_MASK;
+	if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK)
+		return delta;
+
+	safe_delta = delivered_pkts -
+		     ((delivered_pkts - delta) & TCP_ACCECN_CEP_ACE_MASK);
+
+	if (opt_deltas_valid) {
+		d_ceb = tp->delivered_ecn_bytes[INET_ECN_CE - 1] - old_ceb;
+		if (!d_ceb)
+			return delta;
+
+		if ((delivered_pkts >= (TCP_ACCECN_CEP_ACE_MASK + 1) * 2) &&
+		    (tcp_is_sack(tp) ||
+		     ((1 << inet_csk(sk)->icsk_ca_state) &
+		      (TCPF_CA_Open | TCPF_CA_CWR)))) {
+			u32 est_d_cep;
+
+			if (delivered_bytes <= d_ceb)
+				return safe_delta;
+
+			est_d_cep = DIV_ROUND_UP_ULL((u64)d_ceb *
+						     delivered_pkts,
+						     delivered_bytes);
+			return min(safe_delta,
+				   delta +
+				   (est_d_cep & ~TCP_ACCECN_CEP_ACE_MASK));
+		}
+
+		if (d_ceb > delta * tp->mss_cache)
+			return safe_delta;
+		if (d_ceb <
+		    safe_delta * tp->mss_cache >> TCP_ACCECN_SAFETY_SHIFT)
+			return delta;
+	}
+
+	return safe_delta;
+}
+
+static u32 tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
+			      u32 delivered_pkts, u32 delivered_bytes,
+			      int *flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 delta;
+
+	delta = __tcp_accecn_process(sk, skb, delivered_pkts,
+				     delivered_bytes, *flag);
+	if (delta > 0) {
+		tcp_count_delivered_ce(tp, delta);
+		*flag |= FLAG_ECE;
+		/* Recalculate header predictor */
+		if (tp->pred_flags)
+			tcp_fast_path_on(tp);
+	}
+	return delta;
 }
 
 /* Buffer size and advertised window tuning.
···
 	u64	last_sackt;
 	u32	reord;
 	u32	sack_delivered;
+	u32	delivered_bytes;
 	int	flag;
 	unsigned int mss_now;
 	struct rate_sample *rate;
···
 static u8 tcp_sacktag_one(struct sock *sk,
 			  struct tcp_sacktag_state *state, u8 sacked,
 			  u32 start_seq, u32 end_seq,
-			  int dup_sack, int pcount,
+			  int dup_sack, int pcount, u32 plen,
 			  u64 xmit_time)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
···
 		tp->sacked_out += pcount;
 		/* Out-of-order packets delivered */
 		state->sack_delivered += pcount;
+		state->delivered_bytes += plen;
 	}
 
 	/* D-SACK. We can detect redundant retransmission in S|R and plain R
···
 	 * tcp_highest_sack_seq() when skb is highest_sack.
 	 */
 	tcp_sacktag_one(sk, state, TCP_SKB_CB(skb)->sacked,
-			start_seq, end_seq, dup_sack, pcount,
+			start_seq, end_seq, dup_sack, pcount, skb->len,
 			tcp_skb_timestamp_us(skb));
 	tcp_rate_skb_delivered(sk, skb, state->rate);
 
···
 						TCP_SKB_CB(skb)->end_seq,
 						dup_sack,
 						tcp_skb_pcount(skb),
+						skb->len,
 						tcp_skb_timestamp_us(skb));
 			tcp_rate_skb_delivered(sk, skb, state->rate);
 			if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
···
 
 		if (sacked & TCPCB_SACKED_ACKED) {
 			tp->sacked_out -= acked_pcount;
+			/* snd_una delta covers these skbs */
+			sack->delivered_bytes -= skb->len;
 		} else if (tcp_is_sack(tp)) {
 			tcp_count_delivered(tp, acked_pcount, ece_ack);
 			if (!tcp_skb_spurious_retrans(tp, skb))
···
 			if (before(reord, prior_fack))
 				tcp_check_sack_reordering(sk, reord, 0);
 		}
+
+		sack->delivered_bytes = (skb ?
+					 TCP_SKB_CB(skb)->seq : tp->snd_una) -
+					prior_snd_una;
 	} else if (skb && rtt_update && sack_rtt_us >= 0 &&
 		   sack_rtt_us > tcp_stamp_us_delta(tp->tcp_mstamp,
 						    tcp_skb_timestamp_us(skb))) {
···
 	return __tcp_oow_rate_limited(net, mib_idx, last_oow_ack_time);
 }
 
+static void tcp_send_ack_reflect_ect(struct sock *sk, bool accecn_reflector)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u16 flags = 0;
+
+	if (accecn_reflector)
+		flags = tcp_accecn_reflector_flags(tp->syn_ect_rcv);
+	__tcp_send_ack(sk, tp->rcv_nxt, flags);
+}
+
 /* RFC 5961 7 [ACK Throttling] */
-static void tcp_send_challenge_ack(struct sock *sk)
+static void tcp_send_challenge_ack(struct sock *sk, bool accecn_reflector)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct net *net = sock_net(sk);
···
 		WRITE_ONCE(net->ipv4.tcp_challenge_count, count - 1);
 send_ack:
 		NET_INC_STATS(net, LINUX_MIB_TCPCHALLENGEACK);
-		tcp_send_ack(sk);
+		tcp_send_ack_reflect_ect(sk, accecn_reflector);
 	}
 }
···
 }
 
 /* Returns the number of packets newly acked or sacked by the current ACK */
-static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
+static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered,
+			       u32 ecn_count, int flag)
 {
 	const struct net *net = sock_net(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
···
 
 	delivered = tp->delivered - prior_delivered;
 	NET_ADD_STATS(net, LINUX_MIB_TCPDELIVERED, delivered);
-	if (flag & FLAG_ECE)
-		NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, delivered);
+
+	if (flag & FLAG_ECE) {
+		if (tcp_ecn_mode_rfc3168(tp))
+			ecn_count = delivered;
+		NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, ecn_count);
+	}
 
 	return delivered;
 }
···
 	u32 delivered = tp->delivered;
 	u32 lost = tp->lost;
 	int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover losses */
+	u32 ecn_count = 0; /* Did we receive ECE/an AccECN ACE update? */
 	u32 prior_fack;
 
 	sack_state.first_sackt = 0;
 	sack_state.rate = &rs;
 	sack_state.sack_delivered = 0;
+	sack_state.delivered_bytes = 0;
 
 	/* We very likely will need to access rtx queue. */
 	prefetch(sk->tcp_rtx_queue.rb_node);
···
 	/* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
 	if (before(ack, prior_snd_una - max_window)) {
 		if (!(flag & FLAG_NO_CHALLENGE_ACK))
-			tcp_send_challenge_ack(sk);
+			tcp_send_challenge_ack(sk, false);
 		return -SKB_DROP_REASON_TCP_TOO_OLD_ACK;
 	}
 	goto old_ack;
···
 
 	tcp_rack_update_reo_wnd(sk, &rs);
 
+	if (tcp_ecn_mode_accecn(tp))
+		ecn_count = tcp_accecn_process(sk, skb,
+					       tp->delivered - delivered,
+					       sack_state.delivered_bytes,
+					       &flag);
+
 	tcp_in_ack_event(sk, flag);
 
 	if (tp->tlp_high_seq)
···
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
 		sk_dst_confirm(sk);
 
-	delivered = tcp_newly_delivered(sk, delivered, flag);
+	delivered = tcp_newly_delivered(sk, delivered, ecn_count, flag);
+
 	lost = tp->lost - lost;	/* freshly marked lost */
 	rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
 	tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate);
···
 	return 1;
 
 no_queue:
+	if (tcp_ecn_mode_accecn(tp))
+		ecn_count = tcp_accecn_process(sk, skb,
+					       tp->delivered - delivered,
+					       sack_state.delivered_bytes,
+					       &flag);
 	tcp_in_ack_event(sk, flag);
 	/* If data was DSACKed, see if we can undo a cwnd reduction. */
 	if (flag & FLAG_DSACKING_ACK) {
 		tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
 				      &rexmit);
-		tcp_newly_delivered(sk, delivered, flag);
+		tcp_newly_delivered(sk, delivered, ecn_count, flag);
 	}
 	/* If this ack opens up a zero window, clear backoff.  It was
 	 * being used to time the probes, and is probably far higher than
···
 					&sack_state);
 		tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
 				      &rexmit);
-		tcp_newly_delivered(sk, delivered, flag);
+		tcp_newly_delivered(sk, delivered, ecn_count, flag);
 		tcp_xmit_recovery(sk, rexmit);
 	}
···
 
 	ptr = (const unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
+	opt_rx->accecn = 0;
 	opt_rx->saw_unknown = 0;
 
 	while (length > 0) {
···
 						ptr, th->syn, foc, false);
 			break;
 
+		case TCPOPT_ACCECN0:
+		case TCPOPT_ACCECN1:
+			/* Save offset of AccECN option in TCP header */
+			opt_rx->accecn = (ptr - 2) - (__u8 *)th;
+			break;
+
 		case TCPOPT_EXP:
 			/* Fast Open option shares code 254 using a
 			 * 16 bits magic number.
···
 	 */
 	if (th->doff == (sizeof(*th) / 4)) {
 		tp->rx_opt.saw_tstamp = 0;
+		tp->rx_opt.accecn = 0;
 		return false;
 	} else if (tp->rx_opt.tstamp_ok &&
 		   th->doff == ((sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) / 4)) {
-		if (tcp_parse_aligned_timestamp(tp, th))
+		if (tcp_parse_aligned_timestamp(tp, th)) {
+			tp->rx_opt.accecn = 0;
 			return true;
+		}
 	}
 
 	tcp_parse_options(net, skb, &tp->rx_opt, 1, NULL);
···
 				  const struct tcphdr *th, int syn_inerr)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	bool accecn_reflector = false;
 	SKB_DR(reason);
 
 	/* RFC1323: H1. Apply PAWS check first. */
···
 		if (tp->syn_fastopen && !tp->data_segs_in &&
 		    sk->sk_state == TCP_ESTABLISHED)
 			tcp_fastopen_active_disable(sk);
-		tcp_send_challenge_ack(sk);
+		tcp_send_challenge_ack(sk, false);
 		SKB_DR_SET(reason, TCP_RESET);
 		goto discard;
 	}
···
 	 *     RFC 5961 4.2 : Send a challenge ack
 	 */
 	if (th->syn) {
+		if (tcp_ecn_mode_accecn(tp)) {
+			accecn_reflector = true;
+			if (tp->rx_opt.accecn &&
+			    tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+				u8 saw_opt = tcp_accecn_option_init(skb, tp->rx_opt.accecn);
+
+				tcp_accecn_saw_opt_fail_recv(tp, saw_opt);
+				tcp_accecn_opt_demand_min(sk, 1);
+			}
+		}
 		if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
 		    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
 		    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
···
 		if (syn_inerr)
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);
-		tcp_send_challenge_ack(sk);
+		tcp_send_challenge_ack(sk, accecn_reflector);
 		SKB_DR_SET(reason, TCP_INVALID_SYN);
 		goto discard;
 	}
···
 	 */
 
 	tp->rx_opt.saw_tstamp = 0;
+	tp->rx_opt.accecn = 0;
 
 	/* pred_flags is 0xS?10 << 16 + snd_wnd
 	 * if header_prediction is to be made
···
 				flag |= __tcp_replace_ts_recent(tp,
 								delta);
 
+				tcp_ecn_received_counters(sk, skb, 0);
+
 				/* We know that such packets are checksummed
 				 * on entry.
 				 */
···
 			/* Bulk data transfer: receiver */
 			tcp_cleanup_skb(skb);
 			__skb_pull(skb, tcp_header_len);
+			tcp_ecn_received_counters(sk, skb,
+						  len - tcp_header_len);
 			eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 
 			tcp_event_data_recv(sk, skb);
···
 		return;
 
 step5:
+	tcp_ecn_received_counters_payload(sk, skb);
+
 	reason = tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT);
 	if ((int)reason < 0) {
 		reason = -reason;
···
 		 *    state to ESTABLISHED..."
 		 */
 
-		tcp_ecn_rcv_synack(tp, th);
+		if (tcp_ecn_mode_any(tp))
+			tcp_ecn_rcv_synack(sk, skb, th,
+					   TCP_SKB_CB(skb)->ip_dsfield);
 
 		tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
 		tcp_try_undo_spurious_syn(sk);
···
 					  TCP_DELACK_MAX, false);
 			goto consume;
 		}
-		tcp_send_ack(sk);
+		tcp_send_ack_reflect_ect(sk, tcp_ecn_mode_accecn(tp));
 		return -1;
 	}
···
 		tp->snd_wl1 = TCP_SKB_CB(skb)->seq;
 		tp->max_window = tp->snd_wnd;
 
-		tcp_ecn_rcv_syn(tp, th);
+		tcp_ecn_rcv_syn(tp, th, skb);
 
 		tcp_mtup_init(sk);
 		tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
···
 		}
 		/* accept old ack during closing */
 		if ((int)reason < 0) {
-			tcp_send_challenge_ack(sk);
+			tcp_send_challenge_ack(sk, false);
 			reason = -reason;
 			goto discard;
 		}
···
 		tp->lsndtime = tcp_jiffies32;
 
 		tcp_initialize_rcv_mss(sk);
+		if (tcp_ecn_mode_accecn(tp))
+			tcp_accecn_third_ack(sk, skb, tp->syn_ect_snt);
 		tcp_fast_path_on(tp);
 		if (sk->sk_shutdown & SEND_SHUTDOWN)
 			tcp_shutdown(sk, SEND_SHUTDOWN);
+
 		break;
 
 	case TCP_FIN_WAIT1: {
···
 	bool ect, ecn_ok;
 	u32 ecn_ok_dst;
 
+	if (tcp_accecn_syn_requested(th) &&
+	    READ_ONCE(net->ipv4.sysctl_tcp_ecn) >= 3) {
+		inet_rsk(req)->ecn_ok = 1;
+		tcp_rsk(req)->accecn_ok = 1;
+		tcp_rsk(req)->syn_ect_rcv = TCP_SKB_CB(skb)->ip_dsfield &
+					    INET_ECN_MASK;
+		return;
+	}
+
 	if (!th_ecn)
 		return;
···
 	ecn_ok_dst = dst_feature(dst, DST_FEATURE_ECN_MASK);
 	ecn_ok = READ_ONCE(net->ipv4.sysctl_tcp_ecn) || ecn_ok_dst;
 
-	if (((!ect || th->res1) && ecn_ok) || tcp_ca_needs_ecn(listen_sk) ||
+	if (((!ect || th->res1 || th->ae) && ecn_ok) ||
+	    tcp_ca_needs_ecn(listen_sk) ||
 	    (ecn_ok_dst & DST_FEATURE_ECN_CA) ||
 	    tcp_bpf_ca_needs_ecn((struct sock *)req))
 		inet_rsk(req)->ecn_ok = 1;
···
 	tcp_rsk(req)->snt_synack = 0;
 	tcp_rsk(req)->snt_tsval_first = 0;
 	tcp_rsk(req)->last_oow_ack_time = 0;
+	tcp_rsk(req)->accecn_ok = 0;
+	tcp_rsk(req)->saw_accecn_opt = TCP_ACCECN_OPT_NOT_SEEN;
+	tcp_rsk(req)->accecn_fail_mode = 0;
+	tcp_rsk(req)->syn_ect_rcv = 0;
+	tcp_rsk(req)->syn_ect_snt = 0;
 	req->mss = rx_opt->mss_clamp;
 	req->ts_recent = rx_opt->saw_tstamp ? rx_opt->rcv_tsval : 0;
 	ireq->tstamp_ok = rx_opt->tstamp_ok;
+6 -2
net/ipv4/tcp_ipv4.c
···
 #include <net/icmp.h>
 #include <net/inet_hashtables.h>
 #include <net/tcp.h>
+#include <net/tcp_ecn.h>
 #include <net/transp_v6.h>
 #include <net/ipv6.h>
 #include <net/inet_common.h>
···
 			      enum tcp_synack_type synack_type,
 			      struct sk_buff *syn_skb)
 {
-	const struct inet_request_sock *ireq = inet_rsk(req);
+	struct inet_request_sock *ireq = inet_rsk(req);
 	struct flowi4 fl4;
 	int err = -1;
 	struct sk_buff *skb;
···
 	skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb);
 
 	if (skb) {
+		tcp_rsk(req)->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK;
 		__tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);
 
 		tos = READ_ONCE(inet_sk(sk)->tos);
···
 
 static int __net_init tcp_sk_init(struct net *net)
 {
-	net->ipv4.sysctl_tcp_ecn = 2;
+	net->ipv4.sysctl_tcp_ecn = TCP_ECN_IN_ECN_OUT_NOECN;
+	net->ipv4.sysctl_tcp_ecn_option = TCP_ACCECN_OPTION_FULL;
+	net->ipv4.sysctl_tcp_ecn_option_beacon = TCP_ACCECN_OPTION_BEACON;
 	net->ipv4.sysctl_tcp_ecn_fallback = 1;
 
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
+34 -6
net/ipv4/tcp_minisocks.c
···
 */
 
 #include <net/tcp.h>
+#include <net/tcp_ecn.h>
 #include <net/xfrm.h>
 #include <net/busy_poll.h>
 #include <net/rstreason.h>
···
 	ireq->rcv_wscale = rcv_wscale;
 }
 
-static void tcp_ecn_openreq_child(struct tcp_sock *tp,
-				  const struct request_sock *req)
+static void tcp_ecn_openreq_child(struct sock *sk,
+				  const struct request_sock *req,
+				  const struct sk_buff *skb)
 {
-	tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
-			     TCP_ECN_MODE_RFC3168 :
-			     TCP_ECN_DISABLED);
+	const struct tcp_request_sock *treq = tcp_rsk(req);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (treq->accecn_ok) {
+		tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+		tp->syn_ect_snt = treq->syn_ect_snt;
+		tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
+		tp->saw_accecn_opt = treq->saw_accecn_opt;
+		tp->prev_ecnfield = treq->syn_ect_rcv;
+		tp->accecn_opt_demand = 1;
+		tcp_ecn_received_counters_payload(sk, skb);
+	} else {
+		tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
+				     TCP_ECN_MODE_RFC3168 :
+				     TCP_ECN_DISABLED);
+	}
 }
 
 void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
···
 	if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
 		newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len;
 	newtp->rx_opt.mss_clamp = req->mss;
-	tcp_ecn_openreq_child(newtp, req);
+	tcp_ecn_openreq_child(newsk, req, skb);
 	newtp->fastopen_req = NULL;
 	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
···
 	bool own_req;
 
 	tmp_opt.saw_tstamp = 0;
+	tmp_opt.accecn = 0;
 	if (th->doff > (sizeof(struct tcphdr)>>2)) {
 		tcp_parse_options(sock_net(sk), skb, &tmp_opt, 0, NULL);
···
 	 */
 	if (!(flg & TCP_FLAG_ACK))
 		return NULL;
+
+	if (tcp_rsk(req)->accecn_ok && tmp_opt.accecn &&
+	    tcp_rsk(req)->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+		u8 saw_opt = tcp_accecn_option_init(skb, tmp_opt.accecn);
+
+		tcp_rsk(req)->saw_accecn_opt = saw_opt;
+		if (tcp_rsk(req)->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN) {
+			u8 fail_mode = TCP_ACCECN_OPT_FAIL_RECV;
+
+			tcp_rsk(req)->accecn_fail_mode |= fail_mode;
+		}
+	}
 
 	/* For Fast Open no more processing is needed (sk is the
 	 * child socket).
+223 -16
net/ipv4/tcp_output.c
··· 328 328 { 329 329 struct tcp_sock *tp = tcp_sk(sk); 330 330 331 - if (tcp_ecn_mode_rfc3168(tp)) { 331 + if (!tcp_ecn_mode_any(tp)) 332 + return; 333 + 334 + if (tcp_ecn_mode_accecn(tp)) { 335 + if (!tcp_accecn_ace_fail_recv(tp)) 336 + INET_ECN_xmit(sk); 337 + tcp_accecn_set_ace(tp, skb, th); 338 + skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN; 339 + } else { 332 340 /* Not-retransmitted data segment: set ECT and inject CWR. */ 333 341 if (skb->len != tcp_header_len && 334 342 !before(TCP_SKB_CB(skb)->seq, tp->snd_nxt)) { ··· 385 377 #define OPTION_SMC BIT(9) 386 378 #define OPTION_MPTCP BIT(10) 387 379 #define OPTION_AO BIT(11) 380 + #define OPTION_ACCECN BIT(12) 388 381 389 382 static void smc_options_write(__be32 *ptr, u16 *options) 390 383 { ··· 407 398 u16 mss; /* 0 to disable */ 408 399 u8 ws; /* window scale, 0 to disable */ 409 400 u8 num_sack_blocks; /* number of SACK blocks to include */ 401 + u8 num_accecn_fields:7, /* number of AccECN fields needed */ 402 + use_synack_ecn_bytes:1; /* Use synack_ecn_bytes or not */ 410 403 u8 hash_size; /* bytes in hash_location */ 411 404 u8 bpf_opt_len; /* length of BPF hdr option */ 412 405 __u8 *hash_location; /* temporary pointer, overloaded */ ··· 606 595 return ptr; 607 596 } 608 597 598 + /* Initial values for AccECN option, order is based on ECN field bits 599 + * similar to received_ecn_bytes. Used for SYN/ACK AccECN option. 600 + */ 601 + static const u32 synack_ecn_bytes[3] = { 0, 0, 0 }; 602 + 609 603 /* Write previously computed TCP options to the packet.
610 604 * 611 605 * Beware: Something in the Internet is very sensitive to the ordering of ··· 629 613 struct tcp_out_options *opts, 630 614 struct tcp_key *key) 631 615 { 616 + u8 leftover_highbyte = TCPOPT_NOP; /* replace 1st NOP if avail */ 617 + u8 leftover_lowbyte = TCPOPT_NOP; /* replace 2nd NOP in succession */ 632 618 __be32 *ptr = (__be32 *)(th + 1); 633 619 u16 options = opts->options; /* mungable copy */ 634 620 ··· 666 648 *ptr++ = htonl(opts->tsecr); 667 649 } 668 650 651 + if (OPTION_ACCECN & options) { 652 + const u32 *ecn_bytes = opts->use_synack_ecn_bytes ? 653 + synack_ecn_bytes : 654 + tp->received_ecn_bytes; 655 + const u8 ect0_idx = INET_ECN_ECT_0 - 1; 656 + const u8 ect1_idx = INET_ECN_ECT_1 - 1; 657 + const u8 ce_idx = INET_ECN_CE - 1; 658 + u32 e0b; 659 + u32 e1b; 660 + u32 ceb; 661 + u8 len; 662 + 663 + e0b = ecn_bytes[ect0_idx] + TCP_ACCECN_E0B_INIT_OFFSET; 664 + e1b = ecn_bytes[ect1_idx] + TCP_ACCECN_E1B_INIT_OFFSET; 665 + ceb = ecn_bytes[ce_idx] + TCP_ACCECN_CEB_INIT_OFFSET; 666 + len = TCPOLEN_ACCECN_BASE + 667 + opts->num_accecn_fields * TCPOLEN_ACCECN_PERFIELD; 668 + 669 + if (opts->num_accecn_fields == 2) { 670 + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | 671 + ((e1b >> 8) & 0xffff)); 672 + *ptr++ = htonl(((e1b & 0xff) << 24) | 673 + (ceb & 0xffffff)); 674 + } else if (opts->num_accecn_fields == 1) { 675 + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | 676 + ((e1b >> 8) & 0xffff)); 677 + leftover_highbyte = e1b & 0xff; 678 + leftover_lowbyte = TCPOPT_NOP; 679 + } else if (opts->num_accecn_fields == 0) { 680 + leftover_highbyte = TCPOPT_ACCECN1; 681 + leftover_lowbyte = len; 682 + } else if (opts->num_accecn_fields == 3) { 683 + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | 684 + ((e1b >> 8) & 0xffff)); 685 + *ptr++ = htonl(((e1b & 0xff) << 24) | 686 + (ceb & 0xffffff)); 687 + *ptr++ = htonl(((e0b & 0xffffff) << 8) | 688 + TCPOPT_NOP); 689 + } 690 + if (tp) { 691 + tp->accecn_minlen = 0; 692 + 
tp->accecn_opt_tstamp = tp->tcp_mstamp; 693 + if (tp->accecn_opt_demand) 694 + tp->accecn_opt_demand--; 695 + } 696 + } 697 + 669 698 if (unlikely(OPTION_SACK_ADVERTISE & options)) { 670 - *ptr++ = htonl((TCPOPT_NOP << 24) | 671 - (TCPOPT_NOP << 16) | 699 + *ptr++ = htonl((leftover_highbyte << 24) | 700 + (leftover_lowbyte << 16) | 672 701 (TCPOPT_SACK_PERM << 8) | 673 702 TCPOLEN_SACK_PERM); 703 + leftover_highbyte = TCPOPT_NOP; 704 + leftover_lowbyte = TCPOPT_NOP; 674 705 } 675 706 676 707 if (unlikely(OPTION_WSCALE & options)) { 677 - *ptr++ = htonl((TCPOPT_NOP << 24) | 708 + u8 highbyte = TCPOPT_NOP; 709 + 710 + /* Do not split the leftover 2-byte to fit into a single 711 + * NOP, i.e., replace this NOP only when 1 byte is leftover 712 + * within leftover_highbyte. 713 + */ 714 + if (unlikely(leftover_highbyte != TCPOPT_NOP && 715 + leftover_lowbyte == TCPOPT_NOP)) { 716 + highbyte = leftover_highbyte; 717 + leftover_highbyte = TCPOPT_NOP; 718 + } 719 + *ptr++ = htonl((highbyte << 24) | 678 720 (TCPOPT_WINDOW << 16) | 679 721 (TCPOLEN_WINDOW << 8) | 680 722 opts->ws); ··· 745 667 tp->duplicate_sack : tp->selective_acks; 746 668 int this_sack; 747 669 748 - *ptr++ = htonl((TCPOPT_NOP << 24) | 749 - (TCPOPT_NOP << 16) | 670 + *ptr++ = htonl((leftover_highbyte << 24) | 671 + (leftover_lowbyte << 16) | 750 672 (TCPOPT_SACK << 8) | 751 673 (TCPOLEN_SACK_BASE + (opts->num_sack_blocks * 752 674 TCPOLEN_SACK_PERBLOCK))); 675 + leftover_highbyte = TCPOPT_NOP; 676 + leftover_lowbyte = TCPOPT_NOP; 753 677 754 678 for (this_sack = 0; this_sack < opts->num_sack_blocks; 755 679 ++this_sack) { ··· 760 680 } 761 681 762 682 tp->rx_opt.dsack = 0; 683 + } else if (unlikely(leftover_highbyte != TCPOPT_NOP || 684 + leftover_lowbyte != TCPOPT_NOP)) { 685 + *ptr++ = htonl((leftover_highbyte << 24) | 686 + (leftover_lowbyte << 16) | 687 + (TCPOPT_NOP << 8) | 688 + TCPOPT_NOP); 689 + leftover_highbyte = TCPOPT_NOP; 690 + leftover_lowbyte = TCPOPT_NOP; 763 691 } 764 692 765 693 if 
(unlikely(OPTION_FAST_OPEN_COOKIE & options)) { ··· 846 758 } 847 759 } 848 760 } 761 + } 762 + 763 + static u32 tcp_synack_options_combine_saving(struct tcp_out_options *opts) 764 + { 765 + /* How much there's room for combining with the alignment padding? */ 766 + if ((opts->options & (OPTION_SACK_ADVERTISE | OPTION_TS)) == 767 + OPTION_SACK_ADVERTISE) 768 + return 2; 769 + else if (opts->options & OPTION_WSCALE) 770 + return 1; 771 + return 0; 772 + } 773 + 774 + /* Calculates how long AccECN option will fit to @remaining option space. 775 + * 776 + * AccECN option can sometimes replace NOPs used for alignment of other 777 + * TCP options (up to @max_combine_saving available). 778 + * 779 + * Only solutions with at least @required AccECN fields are accepted. 780 + * 781 + * Returns: The size of the AccECN option excluding space repurposed from 782 + * the alignment of the other options. 783 + */ 784 + static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required, 785 + int remaining) 786 + { 787 + int size = TCP_ACCECN_MAXSIZE; 788 + int sack_blocks_reduce = 0; 789 + int max_combine_saving; 790 + int rem = remaining; 791 + int align_size; 792 + 793 + if (opts->use_synack_ecn_bytes) 794 + max_combine_saving = tcp_synack_options_combine_saving(opts); 795 + else 796 + max_combine_saving = opts->num_sack_blocks > 0 ? 
2 : 0; 797 + opts->num_accecn_fields = TCP_ACCECN_NUMFIELDS; 798 + while (opts->num_accecn_fields >= required) { 799 + /* Pad to dword if cannot combine */ 800 + if ((size & 0x3) > max_combine_saving) 801 + align_size = ALIGN(size, 4); 802 + else 803 + align_size = ALIGN_DOWN(size, 4); 804 + 805 + if (rem >= align_size) { 806 + size = align_size; 807 + break; 808 + } else if (opts->num_accecn_fields == required && 809 + opts->num_sack_blocks > 2 && 810 + required > 0) { 811 + /* Try to fit the option by removing one SACK block */ 812 + opts->num_sack_blocks--; 813 + sack_blocks_reduce++; 814 + rem = rem + TCPOLEN_SACK_PERBLOCK; 815 + 816 + opts->num_accecn_fields = TCP_ACCECN_NUMFIELDS; 817 + size = TCP_ACCECN_MAXSIZE; 818 + continue; 819 + } 820 + 821 + opts->num_accecn_fields--; 822 + size -= TCPOLEN_ACCECN_PERFIELD; 823 + } 824 + if (sack_blocks_reduce > 0) { 825 + if (opts->num_accecn_fields >= required) 826 + size -= sack_blocks_reduce * TCPOLEN_SACK_PERBLOCK; 827 + else 828 + opts->num_sack_blocks += sack_blocks_reduce; 829 + } 830 + if (opts->num_accecn_fields < required) 831 + return 0; 832 + 833 + opts->options |= OPTION_ACCECN; 834 + return size; 849 835 } 850 836 851 837 /* Compute TCP options for SYN packets. This is not the final ··· 1004 842 } 1005 843 } 1006 844 845 + /* Simultaneous open SYN/ACK needs AccECN option but not SYN. 846 + * It is attempted to negotiate the use of AccECN also on the first 847 + * retransmitted SYN, as mentioned in "3.1.4.1. Retransmitted SYNs" 848 + * of AccECN draft. 
849 + */ 850 + if (unlikely((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACK) && 851 + tcp_ecn_mode_accecn(tp) && 852 + inet_csk(sk)->icsk_retransmits < 2 && 853 + READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn_option) && 854 + remaining >= TCPOLEN_ACCECN_BASE)) { 855 + opts->use_synack_ecn_bytes = 1; 856 + remaining -= tcp_options_fit_accecn(opts, 0, remaining); 857 + } 858 + 1007 859 bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining); 1008 860 1009 861 return MAX_TCP_OPTION_SPACE - remaining; ··· 1035 859 { 1036 860 struct inet_request_sock *ireq = inet_rsk(req); 1037 861 unsigned int remaining = MAX_TCP_OPTION_SPACE; 862 + struct tcp_request_sock *treq = tcp_rsk(req); 1038 863 1039 864 if (tcp_key_is_md5(key)) { 1040 865 opts->options |= OPTION_MD5; ··· 1098 921 1099 922 smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining); 1100 923 924 + if (treq->accecn_ok && 925 + READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn_option) && 926 + req->num_timeout < 1 && remaining >= TCPOLEN_ACCECN_BASE) { 927 + opts->use_synack_ecn_bytes = 1; 928 + remaining -= tcp_options_fit_accecn(opts, 0, remaining); 929 + } 930 + 1101 931 bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb, 1102 932 synack_type, opts, &remaining); 1103 933 ··· 1161 977 eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack; 1162 978 if (unlikely(eff_sacks)) { 1163 979 const unsigned int remaining = MAX_TCP_OPTION_SPACE - size; 1164 - if (unlikely(remaining < TCPOLEN_SACK_BASE_ALIGNED + 1165 - TCPOLEN_SACK_PERBLOCK)) 1166 - return size; 980 + if (likely(remaining >= TCPOLEN_SACK_BASE_ALIGNED + 981 + TCPOLEN_SACK_PERBLOCK)) { 982 + opts->num_sack_blocks = 983 + min_t(unsigned int, eff_sacks, 984 + (remaining - TCPOLEN_SACK_BASE_ALIGNED) / 985 + TCPOLEN_SACK_PERBLOCK); 1167 986 1168 - opts->num_sack_blocks = 1169 - min_t(unsigned int, eff_sacks, 1170 - (remaining - TCPOLEN_SACK_BASE_ALIGNED) / 1171 - TCPOLEN_SACK_PERBLOCK); 987 + size += TCPOLEN_SACK_BASE_ALIGNED + 988 + opts->num_sack_blocks * 
TCPOLEN_SACK_PERBLOCK; 989 + } else { 990 + opts->num_sack_blocks = 0; 991 + } 992 + } else { 993 + opts->num_sack_blocks = 0; 994 + } 1172 995 1173 - size += TCPOLEN_SACK_BASE_ALIGNED + 1174 - opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK; 996 + if (tcp_ecn_mode_accecn(tp)) { 997 + int ecn_opt = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn_option); 998 + 999 + if (ecn_opt && tp->saw_accecn_opt && !tcp_accecn_opt_fail_send(tp) && 1000 + (ecn_opt >= TCP_ACCECN_OPTION_FULL || tp->accecn_opt_demand || 1001 + tcp_accecn_option_beacon_check(sk))) { 1002 + opts->use_synack_ecn_bytes = 0; 1003 + size += tcp_options_fit_accecn(opts, tp->accecn_minlen, 1004 + MAX_TCP_OPTION_SPACE - size); 1005 + } 1175 1006 } 1176 1007 1177 1008 if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp, ··· 2896 2697 sent_pkts = 0; 2897 2698 2898 2699 tcp_mstamp_refresh(tp); 2700 + 2701 + /* AccECN option beacon depends on mstamp, it may change mss */ 2702 + if (tcp_ecn_mode_accecn(tp) && tcp_accecn_option_beacon_check(sk)) 2703 + mss_now = tcp_current_mss(sk); 2704 + 2899 2705 if (!push_one) { 2900 2706 /* Do MTU probing. */ 2901 2707 result = tcp_mtu_probe(sk); ··· 3553 3349 tcp_retrans_try_collapse(sk, skb, avail_wnd); 3554 3350 } 3555 3351 3556 - /* RFC3168, section 6.1.1.1. ECN fallback */ 3352 + /* RFC3168, section 6.1.1.1. ECN fallback 3353 + * As AccECN uses the same SYN flags (+ AE), this check covers both 3354 + * cases. 3355 + */ 3557 3356 if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN) 3558 3357 tcp_ecn_clear_syn(sk, skb); 3559 3358
+2
net/ipv6/syncookies.c
··· 16 16 #include <net/secure_seq.h> 17 17 #include <net/ipv6.h> 18 18 #include <net/tcp.h> 19 + #include <net/tcp_ecn.h> 19 20 20 21 #define COOKIEBITS 24 /* Upper bits store count */ 21 22 #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1) ··· 265 264 if (!req->syncookie) 266 265 ireq->rcv_wscale = rcv_wscale; 267 266 ireq->ecn_ok &= cookie_ecn_ok(net, dst); 267 + tcp_rsk(req)->accecn_ok = ireq->ecn_ok && cookie_accecn_ok(th); 268 268 269 269 ret = tcp_get_cookie_sock(sk, skb, req, dst); 270 270 if (!ret) {
+1
net/ipv6/tcp_ipv6.c
··· 544 544 skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb); 545 545 546 546 if (skb) { 547 + tcp_rsk(req)->syn_ect_snt = np->tclass & INET_ECN_MASK; 547 548 __tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr, 548 549 &ireq->ir_v6_rmt_addr); 549 550