Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: detect loss above high_seq in recovery

Correctly implement a loss detection heuristic: new sequences (above
high_seq) sent during fast recovery are deemed lost when higher
sequences are SACKed.

Current code does not catch these losses, because tcp_mark_head_lost()
does not check packets beyond high_seq. The fix is straightforward:
check packets up to the highest SACKed packet. In addition, all the
FLAG_DATA_LOST logic is ineffective and redundant and can be removed.

Update the loss heuristic comments. The algorithm above is documented
as heuristic B, but it too is redundant, because heuristic A already
covers it.

Note that this change only marks some forward-retransmitted packets LOST.
It does NOT forbid TCP from performing further CWR on new losses. A
potential follow-up patch under preparation is to perform another CWR on
"new" losses such as
1) a sequence above high_seq is lost (by resetting high_seq to snd_nxt)
2) a retransmission is lost.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by: Yuchung Cheng
Committed by: David S. Miller
974c1236 d0249e44

+15 -28

include/linux/snmp.h (-1)
@@ -192,7 +192,6 @@
 	LINUX_MIB_TCPPARTIALUNDO,	/* TCPPartialUndo */
 	LINUX_MIB_TCPDSACKUNDO,		/* TCPDSACKUndo */
 	LINUX_MIB_TCPLOSSUNDO,		/* TCPLossUndo */
-	LINUX_MIB_TCPLOSS,		/* TCPLoss */
 	LINUX_MIB_TCPLOSTRETRANSMIT,	/* TCPLostRetransmit */
 	LINUX_MIB_TCPRENOFAILURES,	/* TCPRenoFailures */
 	LINUX_MIB_TCPSACKFAILURES,	/* TCPSackFailures */
net/ipv4/proc.c (-1)
@@ -216,7 +216,6 @@
 	SNMP_MIB_ITEM("TCPPartialUndo", LINUX_MIB_TCPPARTIALUNDO),
 	SNMP_MIB_ITEM("TCPDSACKUndo", LINUX_MIB_TCPDSACKUNDO),
 	SNMP_MIB_ITEM("TCPLossUndo", LINUX_MIB_TCPLOSSUNDO),
-	SNMP_MIB_ITEM("TCPLoss", LINUX_MIB_TCPLOSS),
 	SNMP_MIB_ITEM("TCPLostRetransmit", LINUX_MIB_TCPLOSTRETRANSMIT),
 	SNMP_MIB_ITEM("TCPRenoFailures", LINUX_MIB_TCPRENOFAILURES),
 	SNMP_MIB_ITEM("TCPSackFailures", LINUX_MIB_TCPSACKFAILURES),
net/ipv4/tcp_input.c (+15 -26)
@@ -105,7 +105,6 @@
 #define FLAG_SYN_ACKED		0x10 /* This ACK acknowledged SYN.		*/
 #define FLAG_DATA_SACKED	0x20 /* New SACK.				*/
 #define FLAG_ECE		0x40 /* ECE in this ACK				*/
-#define FLAG_DATA_LOST		0x80 /* SACK detected data lossage.		*/
 #define FLAG_SLOWPATH		0x100 /* Do not skip RFC checks for window update.*/
 #define FLAG_ONLY_ORIG_SACKED	0x200 /* SACKs only non-rexmit sent before RTO */
 #define FLAG_SND_UNA_ADVANCED	0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
@@ -1039,13 +1040,11 @@
  * These 6 states form finite state machine, controlled by the following events:
  * 1. New ACK (+SACK) arrives. (tcp_sacktag_write_queue())
  * 2. Retransmission. (tcp_retransmit_skb(), tcp_xmit_retransmit_queue())
- * 3. Loss detection event of one of three flavors:
+ * 3. Loss detection event of two flavors:
  *	A. Scoreboard estimator decided the packet is lost.
  *	   A'. Reno "three dupacks" marks head of queue lost.
- *	   A''. Its FACK modfication, head until snd.fack is lost.
- *	B. SACK arrives sacking data transmitted after never retransmitted
- *	   hole was sent out.
- *	C. SACK arrives sacking SND.NXT at the moment, when the
+ *	   A''. Its FACK modification, head until snd.fack is lost.
+ *	B. SACK arrives sacking SND.NXT at the moment, when the
  *	   segment was retransmitted.
  * 4. D-SACK added new rule: D-SACK changes any tag to S.
@@ -1150,7 +1153,7 @@
 }
 
 /* Check for lost retransmit. This superb idea is borrowed from "ratehalving".
- * Event "C". Later note: FACK people cheated me again 8), we have to account
+ * Event "B". Later note: FACK people cheated me again 8), we have to account
  * for reordering! Ugly, but should help.
  *
  * Search retransmitted skbs from write_queue that were sent when snd_nxt was
@@ -1841,10 +1844,6 @@
 		if (found_dup_sack && ((i + 1) == first_sack_index))
 			next_dup = &sp[i + 1];
 
-		/* Event "B" in the comment above. */
-		if (after(end_seq, tp->high_seq))
-			state.flag |= FLAG_DATA_LOST;
-
 		/* Skip too early cached blocks */
 		while (tcp_sack_cache_ok(tp, cache) &&
 		       !before(start_seq, cache->end_seq))
@@ -2508,8 +2515,11 @@
 	tcp_verify_left_out(tp);
 }
 
-/* Mark head of queue up as lost. With RFC3517 SACK, the packets is
- * is against sacked "cnt", otherwise it's against facked "cnt"
+/* Detect loss in event "A" above by marking head of queue up as lost.
+ * For FACK or non-SACK(Reno) senders, the first "packets" number of segments
+ * are considered lost. For RFC3517 SACK, a segment is considered lost if it
+ * has at least tp->reordering SACKed seqments above it; "packets" refers to
+ * the maximum SACKed segments to pass before reaching this limit.
  */
 static void tcp_mark_head_lost(struct sock *sk, int packets, int mark_head)
 {
@@ -2521,6 +2525,8 @@
 	int cnt, oldcnt;
 	int err;
 	unsigned int mss;
+	/* Use SACK to deduce losses of new sequences sent during recovery */
+	const u32 loss_high = tcp_is_sack(tp) ?  tp->snd_nxt : tp->high_seq;
 
 	WARN_ON(packets > tp->packets_out);
 	if (tp->lost_skb_hint) {
@@ -2544,7 +2546,7 @@
 		tp->lost_skb_hint = skb;
 		tp->lost_cnt_hint = cnt;
 
-		if (after(TCP_SKB_CB(skb)->end_seq, tp->high_seq))
+		if (after(TCP_SKB_CB(skb)->end_seq, loss_high))
 			break;
 
 		oldcnt = cnt;
@@ -3031,19 +3033,10 @@
 	if (tcp_check_sack_reneging(sk, flag))
 		return;
 
-	/* C. Process data loss notification, provided it is valid. */
-	if (tcp_is_fack(tp) && (flag & FLAG_DATA_LOST) &&
-	    before(tp->snd_una, tp->high_seq) &&
-	    icsk->icsk_ca_state != TCP_CA_Open &&
-	    tp->fackets_out > tp->reordering) {
-		tcp_mark_head_lost(sk, tp->fackets_out - tp->reordering, 0);
-		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSS);
-	}
-
-	/* D. Check consistency of the current state. */
+	/* C. Check consistency of the current state. */
 	tcp_verify_left_out(tp);
 
-	/* E. Check state exit conditions. State can be terminated
+	/* D. Check state exit conditions. State can be terminated
 	 * when high_seq is ACKed. */
 	if (icsk->icsk_ca_state == TCP_CA_Open) {
 		WARN_ON(tp->retrans_out != 0);
@@ -3066,7 +3077,7 @@
 		}
 	}
 
-	/* F. Process state. */
+	/* E. Process state. */
 	switch (icsk->icsk_ca_state) {
 	case TCP_CA_Recovery:
 		if (!(flag & FLAG_SND_UNA_ADVANCED)) {