Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: fix SO_RCVLOWAT hangs with fat skbs

We autotune rcvbuf whenever SO_RCVLOWAT is set, to account for 100%
overhead in tcp_set_rcvlowat().

This works well as long as the skb->len/skb->truesize ratio is bigger than 0.5.

But if we receive packets with a small MSS, we can end up in a situation
where not enough bytes are available in the receive queue to satisfy the
RCVLOWAT setting.
Once our sk_rcvbuf limit is hit, we send zero windows in ACK packets,
preventing the remote peer from sending more data.

Even autotuning does not help, because it only triggers when the
user process drains the queue. If no EPOLLIN is generated, that
cannot happen.

Note that poll() has had a similar issue since commit
c7004482e8dc ("tcp: Respect SO_RCVLOWAT in tcp_poll().")

Fixes: 03f45c883c6f ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Eric Dumazet; committed by David S. Miller (24adbc16 92db978f).

+26 -4

include/net/tcp.h (+13)

 	return tcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf));
 }
 
+/* We provision sk_rcvbuf around 200% of sk_rcvlowat.
+ * If 87.5 % (7/8) of the space has been consumed, we want to override
+ * SO_RCVLOWAT constraint, since we are receiving skbs with too small
+ * len/truesize ratio.
+ */
+static inline bool tcp_rmem_pressure(const struct sock *sk)
+{
+	int rcvbuf = READ_ONCE(sk->sk_rcvbuf);
+	int threshold = rcvbuf - (rcvbuf >> 3);
+
+	return atomic_read(&sk->sk_rmem_alloc) > threshold;
+}
+
 extern void tcp_openreq_init_rwin(struct request_sock *req,
 				  const struct sock *sk_listener,
 				  const struct dst_entry *dst);
net/ipv4/tcp.c (+11 -3)

 static inline bool tcp_stream_is_readable(const struct tcp_sock *tp,
 					  int target, struct sock *sk)
 {
-	return (READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq) >= target) ||
-		(sk->sk_prot->stream_memory_read ?
-		sk->sk_prot->stream_memory_read(sk) : false);
+	int avail = READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq);
+
+	if (avail > 0) {
+		if (avail >= target)
+			return true;
+		if (tcp_rmem_pressure(sk))
+			return true;
+	}
+	if (sk->sk_prot->stream_memory_read)
+		return sk->sk_prot->stream_memory_read(sk);
+	return false;
 }
 
 /*
net/ipv4/tcp_input.c (+2 -1)

 	const struct tcp_sock *tp = tcp_sk(sk);
 	int avail = tp->rcv_nxt - tp->copied_seq;
 
-	if (avail < sk->sk_rcvlowat && !sock_flag(sk, SOCK_DONE))
+	if (avail < sk->sk_rcvlowat && !tcp_rmem_pressure(sk) &&
+	    !sock_flag(sk, SOCK_DONE))
 		return;
 
 	sk->sk_data_ready(sk);