Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: tcp_vegas cong avoid fix

This patch addresses a book-keeping issue in tcp_vegas.c. At present
tcp_vegas does separate book-keeping of cwnd based on packet sequence
numbers. A mismatch can develop between this book-keeping and
tp->snd_cwnd due, for example, to delayed acks acking multiple
packets. When vegas transitions to reno operation (e.g. following
loss), this mismatch leads to incorrect behaviour (akin to a cwnd
backoff). This seems mostly to affect operation at low cwnds, where
delayed acking can lead to a significant fraction of cwnd being
covered by a single ack, leading to the book-keeping mismatch. This
patch modifies the congestion avoidance update to avoid the need for
separate book-keeping while leaving vegas congestion avoidance
functionally unchanged. A secondary advantage of this modification is
that the use of fixed-point (via V_PARAM_SHIFT) and 64 bit arithmetic
is no longer necessary, simplifying the code.

Some example test measurements with the patched code (confirming no functional
change in the congestion avoidance algorithm) can be seen at:

http://www.hamilton.ie/doug/vegaspatch/

Signed-off-by: Doug Leith <doug.leith@nuim.ie>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Doug Leith, committed by David S. Miller
8d3a564d 8c83f80b

+10 -70
net/ipv4/tcp_vegas.c
···
 #include "tcp_vegas.h"

-/* Default values of the Vegas variables, in fixed-point representation
- * with V_PARAM_SHIFT bits to the right of the binary point.
- */
-#define V_PARAM_SHIFT 1
-static int alpha = 2<<V_PARAM_SHIFT;
-static int beta = 4<<V_PARAM_SHIFT;
-static int gamma = 1<<V_PARAM_SHIFT;
+static int alpha = 2;
+static int beta = 4;
+static int gamma = 1;

 module_param(alpha, int, 0644);
-MODULE_PARM_DESC(alpha, "lower bound of packets in network (scale by 2)");
+MODULE_PARM_DESC(alpha, "lower bound of packets in network");
 module_param(beta, int, 0644);
-MODULE_PARM_DESC(beta, "upper bound of packets in network (scale by 2)");
+MODULE_PARM_DESC(beta, "upper bound of packets in network");
 module_param(gamma, int, 0644);
 MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
···
 		return;
 	}

-	/* The key players are v_beg_snd_una and v_beg_snd_nxt.
-	 *
-	 * These are so named because they represent the approximate values
-	 * of snd_una and snd_nxt at the beginning of the current RTT. More
-	 * precisely, they represent the amount of data sent during the RTT.
-	 * At the end of the RTT, when we receive an ACK for v_beg_snd_nxt,
-	 * we will calculate that (v_beg_snd_nxt - v_beg_snd_una) outstanding
-	 * bytes of data have been ACKed during the course of the RTT, giving
-	 * an "actual" rate of:
-	 *
-	 *     (v_beg_snd_nxt - v_beg_snd_una) / (rtt duration)
-	 *
-	 * Unfortunately, v_beg_snd_una is not exactly equal to snd_una,
-	 * because delayed ACKs can cover more than one segment, so they
-	 * don't line up nicely with the boundaries of RTTs.
-	 *
-	 * Another unfortunate fact of life is that delayed ACKs delay the
-	 * advance of the left edge of our send window, so that the number
-	 * of bytes we send in an RTT is often less than our cwnd will allow.
-	 * So we keep track of our cwnd separately, in v_beg_snd_cwnd.
-	 */
-
 	if (after(ack, vegas->beg_snd_nxt)) {
 		/* Do the Vegas once-per-RTT cwnd adjustment. */
-		u32 old_wnd, old_snd_cwnd;
-
-
-		/* Here old_wnd is essentially the window of data that was
-		 * sent during the previous RTT, and has all
-		 * been acknowledged in the course of the RTT that ended
-		 * with the ACK we just received. Likewise, old_snd_cwnd
-		 * is the cwnd during the previous RTT.
-		 */
-		old_wnd = (vegas->beg_snd_nxt - vegas->beg_snd_una) /
-			tp->mss_cache;
-		old_snd_cwnd = vegas->beg_snd_cwnd;

 		/* Save the extent of the current window so we can use this
 		 * at the end of the next RTT.
 		 */
-		vegas->beg_snd_una = vegas->beg_snd_nxt;
 		vegas->beg_snd_nxt = tp->snd_nxt;
-		vegas->beg_snd_cwnd = tp->snd_cwnd;

 		/* We do the Vegas calculations only if we got enough RTT
 		 * samples that we can be reasonably sure that we got
···
 			 *
 			 * This is:
 			 *     (actual rate in segments) * baseRTT
-			 * We keep it as a fixed point number with
-			 * V_PARAM_SHIFT bits to the right of the binary point.
 			 */
-			target_cwnd = ((u64)old_wnd * vegas->baseRTT);
-			target_cwnd <<= V_PARAM_SHIFT;
-			do_div(target_cwnd, rtt);
+			target_cwnd = tp->snd_cwnd * vegas->baseRTT / rtt;

 			/* Calculate the difference between the window we had,
 			 * and the window we would like to have. This quantity
 			 * is the "Diff" from the Arizona Vegas papers.
-			 *
-			 * Again, this is a fixed point number with
-			 * V_PARAM_SHIFT bits to the right of the binary
-			 * point.
 			 */
-			diff = (old_wnd << V_PARAM_SHIFT) - target_cwnd;
+			diff = tp->snd_cwnd * (rtt-vegas->baseRTT) / vegas->baseRTT;

 			if (diff > gamma && tp->snd_ssthresh > 2 ) {
 				/* Going too fast. Time to slow down
···
 				 * truncation robs us of full link
 				 * utilization.
 				 */
-				tp->snd_cwnd = min(tp->snd_cwnd,
-						   ((u32)target_cwnd >>
-						    V_PARAM_SHIFT)+1);
+				tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);

 			} else if (tp->snd_cwnd <= tp->snd_ssthresh) {
 				/* Slow start. */
 				tcp_slow_start(tp);
 			} else {
 				/* Congestion avoidance. */
-				u32 next_snd_cwnd;

 				/* Figure out where we would like cwnd
 				 * to be.
 				 */
···
 				/* The old window was too fast, so
 				 * we slow down.
 				 */
-				next_snd_cwnd = old_snd_cwnd - 1;
+				tp->snd_cwnd--;
 			} else if (diff < alpha) {
 				/* We don't have enough extra packets
 				 * in the network, so speed up.
 				 */
-				next_snd_cwnd = old_snd_cwnd + 1;
+				tp->snd_cwnd++;
 			} else {
 				/* Sending just as fast as we
 				 * should be.
 				 */
-				next_snd_cwnd = old_snd_cwnd;
 			}
-
-			/* Adjust cwnd upward or downward, toward the
-			 * desired value.
-			 */
-			if (next_snd_cwnd > tp->snd_cwnd)
-				tp->snd_cwnd++;
-			else if (next_snd_cwnd < tp->snd_cwnd)
-				tp->snd_cwnd--;
 		}

 		if (tp->snd_cwnd < 2)