Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: tweak len/truesize ratio for coalesce candidates

tcp_grow_window() uses the skb->len/skb->truesize ratio to increase
tp->rcv_ssthresh, which has a direct impact on advertised window sizes.
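
To make the len/truesize test concrete, here is a minimal userspace sketch of the comparison performed in "Check #2" of tcp_grow_window(). The model_* names are hypothetical stand-ins, not kernel symbols, and the scaling formula assumes the classic tcp_win_from_space() behaviour driven by the tcp_adv_win_scale sysctl (assumed to be 1 here). The byte counts in the comments are illustrative only.

```c
#include <stdio.h>

/*
 * Hypothetical model of tcp_win_from_space(): converts buffer space into
 * window space. Assumes the classic formula keyed on tcp_adv_win_scale.
 */
static int model_win_from_space(int space, int adv_win_scale)
{
	return adv_win_scale <= 0 ? space >> (-adv_win_scale)
				  : space - (space >> adv_win_scale);
}

/*
 * Hypothetical model of Check #2: returns 1 when the skb's memory cost,
 * scaled to window space, does not exceed its payload length. In that case
 * the kernel grows rcv_ssthresh by 2 * advmss without entering the slow path.
 */
static int model_fast_path(int truesize, int len, int adv_win_scale)
{
	return model_win_from_space(truesize, adv_win_scale) <= len;
}

/* Print the outcome for one illustrative (truesize, len) pair. */
static void model_report(int truesize, int len)
{
	printf("truesize=%d len=%d fast_path=%d\n",
	       truesize, len, model_fast_path(truesize, len, 1));
}
```

Under these assumptions, a 1448-byte payload charged 3008 bytes of truesize fails the test (1504 > 1448), while the same payload charged 2048 bytes passes it (1024 <= 1448).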

We added TCP coalescing in linux-3.4 & linux-3.5:

Instead of storing skbs with one or two MSS in receive queue (or OFO queue),
we try to append segments together to reduce memory overhead.

High-performance network drivers tend to build skbs from three parts:

1) sk_buff structure (256 bytes)
2) skb->head contains room to copy headers as needed, and skb_shared_info
3) page fragment(s) containing the ~1514-byte frame (or more depending on MTU)

Once coalesced into a previous skb, 1) and 2) are freed.

We can therefore tweak the way we compute len/truesize ratio knowing
that skb->truesize is inflated by 1) and 2) soon to be freed.

This is done only for in-order skbs, or skbs coalesced into the OFO queue.
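
The adjustment described above can be sketched in userspace as follows. This is a simplified model, not the kernel code: struct model_skb and the model_* helpers are hypothetical, the 256-byte sk_buff size comes from the list above, and the 320-byte skb_shared_info size is an assumption. The real SKB_TRUESIZE() also applies alignment, which is omitted here.

```c
#include <stdio.h>

#define MODEL_SKB_STRUCT_SIZE 256u /* sk_buff size quoted in the text above */
#define MODEL_SHINFO_SIZE     320u /* assumed skb_shared_info size */

/* Hypothetical stand-in for the sk_buff fields used by the adjustment. */
struct model_skb {
	unsigned int len;        /* payload bytes */
	unsigned int headlen;    /* payload bytes living in skb->head */
	unsigned int truesize;   /* total memory charged for this skb */
	unsigned int end_offset; /* size of the skb->head data area */
};

/* Simplified SKB_TRUESIZE(): head area plus the two fixed structures. */
static unsigned int model_skb_truesize(unsigned int head_size)
{
	return head_size + MODEL_SKB_STRUCT_SIZE + MODEL_SHINFO_SIZE;
}

/*
 * Discount parts 1) and 2) when the caller allows it and no payload sits in
 * skb->head. Fall back to the full truesize if the result would be smaller
 * than the payload, which would indicate inconsistent driver accounting.
 */
static unsigned int model_truesize_adjust(int adjust,
					  const struct model_skb *skb)
{
	unsigned int truesize = skb->truesize;

	if (adjust && !skb->headlen) {
		truesize -= model_skb_truesize(skb->end_offset);
		if ((int)truesize < (int)skb->len)
			truesize = skb->truesize;
	}
	return truesize;
}
```

For example, with a hypothetical 192-byte head area and a 2048-byte page fragment, an skb charged 2816 bytes is treated as costing 2048 bytes once the 768 bytes that coalescing will free are discounted.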

The result is that low-rate flows no longer pay the memory price of a
low GRO aggregation factor. The same holds for drivers not using GRO.

This is critical to allow a big enough receiver window,
typically tcp_rmem[2] / 2.

We have been using this at Google for about 5 years; it is high time
to upstream it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Eric Dumazet and committed by David S. Miller
240bfd13 54cb4319

+30 -8
net/ipv4/tcp_input.c
···
  */
 
 /* Slow part of check#2. */
-static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
+static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb,
+			     unsigned int skbtruesize)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	/* Optimize this! */
-	int truesize = tcp_win_from_space(sk, skb->truesize) >> 1;
+	int truesize = tcp_win_from_space(sk, skbtruesize) >> 1;
 	int window = tcp_win_from_space(sk, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]) >> 1;
 
 	while (tp->rcv_ssthresh <= window) {
···
 	return 0;
 }
 
-static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+/* Even if skb appears to have a bad len/truesize ratio, TCP coalescing
+ * can play nice with us, as sk_buff and skb->head might be either
+ * freed or shared with up to MAX_SKB_FRAGS segments.
+ * Only give a boost to drivers using page frag(s) to hold the frame(s),
+ * and if no payload was pulled in skb->head before reaching us.
+ */
+static u32 truesize_adjust(bool adjust, const struct sk_buff *skb)
+{
+	u32 truesize = skb->truesize;
+
+	if (adjust && !skb_headlen(skb)) {
+		truesize -= SKB_TRUESIZE(skb_end_offset(skb));
+		/* paranoid check, some drivers might be buggy */
+		if (unlikely((int)truesize < (int)skb->len))
+			truesize = skb->truesize;
+	}
+	return truesize;
+}
+
+static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb,
+			    bool adjust)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int room;
···
 
 	/* Check #1 */
 	if (room > 0 && !tcp_under_memory_pressure(sk)) {
+		unsigned int truesize = truesize_adjust(adjust, skb);
 		int incr;
 
 		/* Check #2. Increase window, if skb with such overhead
 		 * will fit to rcvbuf in future.
 		 */
-		if (tcp_win_from_space(sk, skb->truesize) <= skb->len)
+		if (tcp_win_from_space(sk, truesize) <= skb->len)
 			incr = 2 * tp->advmss;
 		else
-			incr = __tcp_grow_window(sk, skb);
+			incr = __tcp_grow_window(sk, skb, truesize);
 
 		if (incr) {
 			incr = max_t(int, incr, 2 * skb->len);
···
 	tcp_ecn_check_ce(sk, skb);
 
 	if (skb->len >= 128)
-		tcp_grow_window(sk, skb);
+		tcp_grow_window(sk, skb, true);
 }
 
 /* Called to compute a smoothed rtt estimate. The data fed to this
···
 			 * and trigger fast retransmit.
 			 */
 			if (tcp_is_sack(tp))
-				tcp_grow_window(sk, skb);
+				tcp_grow_window(sk, skb, true);
 			kfree_skb_partial(skb, fragstolen);
 			skb = NULL;
 			goto add_sack;
···
 		 * and trigger fast retransmit.
 		 */
 		if (tcp_is_sack(tp))
-			tcp_grow_window(sk, skb);
+			tcp_grow_window(sk, skb, false);
 		skb_condense(skb);
 		skb_set_owner_r(skb, sk);
 	}