Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: reduce out_of_order memory use

Receive windows keep growing while the speed of light does not improve,
so the out-of-order queue can hold a huge number of skbs waiting to be
moved to the receive queue once the missing packets fill the holes.

Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
sk_buff)) to store regular (MTU <= 1500) frames. This makes it highly
probable that sk_rmem_alloc hits the sk_rcvbuf limit, which is 4 Mbytes
in many cases.
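To make the overhead concrete, here is a rough back-of-the-envelope sketch in C. The 256-byte figure standing in for sizeof(struct sk_buff) and the 1448-byte payload per segment are assumptions chosen for illustration; the real values depend on the kernel build, the driver and the TCP options in use.

```c
#include <assert.h>

/* Assumed values, for illustration only. */
#define ASSUMED_SKB_STRUCT_SIZE 256u                  /* stand-in for sizeof(struct sk_buff) */
#define FAT_DATA_SIZE           4096u                 /* data area cited in the commit message */
#define RCVBUF_LIMIT            (4u * 1024u * 1024u)  /* the 4 Mbytes sk_rcvbuf case */

/* Number of skbs of the given truesize that fit under the limit. */
static unsigned int skbs_until_limit(unsigned int limit, unsigned int truesize)
{
	assert(truesize > 0);
	return limit / truesize;
}

/* Payload bytes actually queued by the time the limit is reached. */
static unsigned int payload_at_limit(unsigned int limit, unsigned int truesize,
				     unsigned int payload_per_skb)
{
	return skbs_until_limit(limit, truesize) * payload_per_skb;
}
```

Under these assumptions, skbs_until_limit(RCVBUF_LIMIT, FAT_DATA_SIZE + ASSUMED_SKB_STRUCT_SIZE) gives 963 skbs, and payload_at_limit() with 1448 bytes per skb gives 1394424 bytes, so only about a third of the accounted memory carries data.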

When the limit is hit, the TCP stack calls tcp_collapse_ofo_queue(), a
true latency killer and cpu cache blower.

Attempting to coalesce each time we add a frame to the ofo queue keeps
memory use tight and in many cases avoids the tcp_collapse() work
later.
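The idea can be sketched in userspace C as follows. The struct toy_seg type and the toy_* helpers are hypothetical stand-ins, not kernel structures or APIs; they only mirror the conditions the patch checks (contiguous sequence numbers, enough tailroom in the previous buffer, no FIN on the incoming segment).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUF_SIZE 4096u

/* Hypothetical userspace stand-in for an skb; not the kernel structure. */
struct toy_seg {
	uint32_t seq;      /* sequence number of the first byte */
	uint32_t end_seq;  /* sequence number one past the last byte */
	bool fin;          /* segment carries FIN */
	size_t len;        /* bytes used in data[] */
	unsigned char data[TOY_BUF_SIZE];
};

/* Unused space left at the end of the segment's buffer. */
static size_t toy_tailroom(const struct toy_seg *s)
{
	assert(s->len <= TOY_BUF_SIZE);
	return TOY_BUF_SIZE - s->len;
}

/* Try to append next to prev. Returns true when the payload was merged,
 * in which case the caller can release next instead of queueing it. */
static bool toy_try_coalesce(struct toy_seg *prev, const struct toy_seg *next)
{
	if (next->seq != prev->end_seq)
		return false;   /* not contiguous with prev */
	if (next->fin || next->len > toy_tailroom(prev))
		return false;   /* FIN segment, or payload does not fit */

	memcpy(prev->data + prev->len, next->data, next->len);
	prev->len += next->len;
	prev->end_seq = next->end_seq;
	return true;
}
```

When toy_try_coalesce() returns false, the caller would fall back to queueing the segment separately, which corresponds to the unchanged behaviour before the patch.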

Tested on various wireless setups (b43, ath9k, ...) known to use a big
skb truesize. This patch eliminated the "packets collapsed in receive
queue due to low socket buffer" events I had before.

It also reduced the average memory used by TCP sockets.

With help from Neal Cardwell.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Eric Dumazet and committed by David S. Miller
c8628155 e86b2919

+20 -1
+1
include/linux/snmp.h
@@ -233,6 +233,7 @@
 	LINUX_MIB_TCPREQQFULLDOCOOKIES,		/* TCPReqQFullDoCookies */
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
+	LINUX_MIB_TCPRCVCOALESCE,		/* TCPRcvCoalesce */
 	__LINUX_MIB_MAX
 };
 
+1
net/ipv4/proc.c
@@ -257,6 +257,7 @@
 	SNMP_MIB_ITEM("TCPReqQFullDoCookies", LINUX_MIB_TCPREQQFULLDOCOOKIES),
 	SNMP_MIB_ITEM("TCPReqQFullDrop", LINUX_MIB_TCPREQQFULLDROP),
 	SNMP_MIB_ITEM("TCPRetransFail", LINUX_MIB_TCPRETRANSFAIL),
+	SNMP_MIB_ITEM("TCPRcvCoalesce", LINUX_MIB_TCPRCVCOALESCE),
 	SNMP_MIB_SENTINEL
 };
 
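The SNMP_MIB_ITEM entry exposes the new counter by name. On a running system the TcpExt counters are printed in /proc/net/netstat as a line of names followed by a line of values, so the counter can be read by matching positions between the two lines. The helper below is a hypothetical userspace sketch (lookup_counter() is not a kernel or libc function) that works on two lines already read from that file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Find `wanted` in a line of counter names and return the value found at
 * the same position in the matching line of values. Both lines are
 * expected in the "Prefix: a b c" layout; the prefix token is skipped. */
static bool lookup_counter(const char *names, const char *values,
			   const char *wanted, unsigned long long *out)
{
	size_t wlen;
	bool first = true;

	assert(names && values && wanted && out);
	wlen = strlen(wanted);

	for (;;) {
		size_t nlen, vlen;

		/* Skip separators, then measure the next token on each line. */
		names += strspn(names, " \n");
		values += strspn(values, " \n");
		nlen = strcspn(names, " \n");
		vlen = strcspn(values, " \n");
		if (nlen == 0 || vlen == 0)
			return false;   /* ran out of tokens: name not present */

		if (!first && nlen == wlen && memcmp(names, wanted, wlen) == 0) {
			*out = strtoull(values, NULL, 10);
			return true;
		}
		first = false;
		names += nlen;
		values += vlen;
	}
}
```

A caller would read the two "TcpExt:" lines with fgets() and pass them in along with "TCPRcvCoalesce"; a false return means the running kernel does not export that name.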
+18 -1
net/ipv4/tcp_input.c
@@ -4484,7 +4484,24 @@
 	end_seq = TCP_SKB_CB(skb)->end_seq;
 
 	if (seq == TCP_SKB_CB(skb1)->end_seq) {
-		__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+		/* Packets in ofo can stay in queue a long time.
+		 * Better try to coalesce them right now
+		 * to avoid future tcp_collapse_ofo_queue(),
+		 * probably the most expensive function in tcp stack.
+		 */
+		if (skb->len <= skb_tailroom(skb1) && !tcp_hdr(skb)->fin) {
+			NET_INC_STATS_BH(sock_net(sk),
+					 LINUX_MIB_TCPRCVCOALESCE);
+			BUG_ON(skb_copy_bits(skb, 0,
+					     skb_put(skb1, skb->len),
+					     skb->len));
+			TCP_SKB_CB(skb1)->end_seq = end_seq;
+			TCP_SKB_CB(skb1)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+			__kfree_skb(skb);
+			skb = NULL;
+		} else {
+			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+		}
 
 		if (!tp->rx_opt.num_sacks ||
 		    tp->selective_acks[0].end_seq != seq)