Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: TCP_NOTSENT_LOWAT socket option

The idea of this patch is to add an optional limit on the number of
unsent bytes in TCP sockets, to reduce kernel memory usage.

A TCP receiver might announce a big window, and TCP sender autotuning
might allow a large number of bytes in the write queue, but this has little
performance impact if a large part of this buffering is wasted:

The write queue needs to be large only to deal with a large BDP, not
necessarily to cope with scheduling delays (incoming ACKs make room
for the application to queue more bytes).

For most workloads, a value of 128 KB or less is enough to give
applications time to react to POLLOUT events
(or to be woken up in a blocking sendmsg()).

This patch adds two ways to set the limit:

1) A per-socket option, TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using the TCP_NOTSENT_LOWAT socket option (or setting it to zero).
The default value is UINT_MAX (0xFFFFFFFF), meaning the sysctl has no effect.
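
As a rough userspace sketch of option 1 (not part of the patch): the helper
names set_notsent_lowat() and get_notsent_lowat() are hypothetical, and the
fallback #define assumes the value 25 that this patch adds to
include/uapi/linux/tcp.h, for libc headers that do not yet know the option.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25	/* value taken from this patch's uapi header */
#endif

/*
 * Hypothetical helper: cap the number of unsent bytes for one TCP socket.
 * Passing 0 makes the socket fall back to the sysctl value.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_notsent_lowat(int fd, unsigned int bytes)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &bytes, sizeof(bytes));
}

/*
 * Hypothetical helper: read the per-socket value back.
 * A result of 0 means the socket follows the sysctl.
 */
static int get_notsent_lowat(int fd, unsigned int *bytes)
{
	socklen_t len = sizeof(*bytes);

	return getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, bytes, &len);
}
```

A caller would typically invoke set_notsent_lowat(fd, 131072) right after
creating the socket, matching the 128 KB value discussed above.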

This changes poll()/select()/epoll() to report POLLOUT
only if the number of unsent bytes is below tp->notsent_lowat

Note that this might increase the number of sendmsg()/sendfile() calls
when using non-blocking sockets,
and the number of context switches for blocking sockets.
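
The sender-side pattern this affects can be sketched as a poll()-driven write
loop. This is only an illustration: send_all() is a hypothetical helper, and
nothing in it is specific to TCP_NOTSENT_LOWAT. With a low limit, each POLLOUT
wakeup simply lets the loop queue fewer bytes, so it iterates more often.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: write len bytes to a stream socket, sleeping in
 * poll() until the kernel reports the socket as writable.
 * Returns 0 once everything is queued, -1 on error (errno set).
 */
static int send_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		ssize_t n;

		/* Block until writable (or an error condition is flagged). */
		if (poll(&pfd, 1, -1) < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		/* Never block in send(): poll() is the only sleeping point. */
		n = send(fd, buf, len, MSG_DONTWAIT | MSG_NOSIGNAL);
		if (n < 0) {
			if (errno == EAGAIN || errno == EINTR)
				continue;
			return -1;
		}

		buf += n;
		len -= (size_t)n;
	}
	return 0;
}
```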

Note that this is not related to SO_SNDLOWAT, which is
defined as:
"Specify the minimum number of bytes in the buffer until
the socket layer will pass the data to the protocol"

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, the amount of kernel memory
used by TCP buffers shrinks by ~55% (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
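
The ~55% figure can be re-derived from the two "memory" readings above. The
sketch below is only arithmetic; the function names are hypothetical, and
converting pages to bytes needs a page size, which the transcript does not
state (4096 is an assumption).

```c
/* Percentage by which `after` is smaller than `before`. */
static double reduction_percent(unsigned long before, unsigned long after)
{
	return 100.0 * (double)(before - after) / (double)before;
}

/* Convert a page count to MiB; page_size is an assumption (often 4096). */
static double pages_to_mib(unsigned long pages, unsigned long page_size)
{
	return (double)pages * (double)page_size / (1024.0 * 1024.0);
}
```

reduction_percent(45458, 20567) evaluates to about 54.8. Assuming 4 KB pages,
the footprint drops from roughly 178 MiB to roughly 80 MiB.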

Using 128KB has no bad effect on the throughput or CPU usage
of a single flow, although context switches increase
(412,514 to 2,675,818 in the runs below).

A bonus is that we hold the socket lock for a shorter amount
of time, which should improve the latency of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units      CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                         Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                            %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s 3.13  S      -1.00  U      0.353   -1.000  usec/KB

Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

412,514 context-switches

200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units      CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                         Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                            %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s 3.35  S      -1.00  U      0.381   -1.000  usec/KB

Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

2,675,818 context-switches

200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

authored by Eric Dumazet and committed by David S. Miller
c9bee3b7 64dc6130

+61 -6
+13
Documentation/networking/ip-sysctl.txt
···
 	this value is ignored.
 	Default: between 64K and 4MB, depending on RAM size.
 
+tcp_notsent_lowat - UNSIGNED INTEGER
+	A TCP socket can control the amount of unsent bytes in its write queue,
+	thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
+	reports POLLOUT events if the amount of unsent bytes is below a per
+	socket value, and if the write queue is not full. sendmsg() will
+	also not add new buffers if the limit is hit.
+
+	This global variable controls the amount of unsent data for
+	sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
+	to the global variable has immediate effect.
+
+	Default: UINT_MAX (0xFFFFFFFF)
+
 tcp_workaround_signed_windows - BOOLEAN
 	If set, assume no receipt of a window scaling option means the
 	remote TCP is broken and treats the window as a signed quantity.
+1
include/linux/tcp.h
···
 
 	u32	rcv_wnd;	/* Current receiver window */
 	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
+	u32	notsent_lowat;	/* TCP_NOTSENT_LOWAT */
 	u32	pushed_seq;	/* Last pushed seq, required to talk to windows */
 	u32	lost_out;	/* Lost packets */
 	u32	sacked_out;	/* SACK'd packets */
+13 -6
include/net/sock.h
···
 
 extern void sk_stream_write_space(struct sock *sk);
 
-static inline bool sk_stream_memory_free(const struct sock *sk)
-{
-	return sk->sk_wmem_queued < sk->sk_sndbuf;
-}
-
 /* OOB backlog add */
 static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
···
 	unsigned int		inuse_idx;
 #endif
 
+	bool			(*stream_memory_free)(const struct sock *sk);
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(struct sock *sk);
 	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
···
 }
 #endif
 
+static inline bool sk_stream_memory_free(const struct sock *sk)
+{
+	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
+		return false;
+
+	return sk->sk_prot->stream_memory_free ?
+		sk->sk_prot->stream_memory_free(sk) : true;
+}
+
 static inline bool sk_stream_is_writeable(const struct sock *sk)
 {
-	return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
+	return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
+	       sk_stream_memory_free(sk);
 }
+
 
 static inline bool sk_has_memory_pressure(const struct sock *sk)
 {
+14
include/net/tcp.h
···
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern unsigned int sysctl_tcp_notsent_lowat;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
···
 
 extern void __tcp_v4_send_check(struct sk_buff *skb, __be32 saddr,
 				__be32 daddr);
+
+static inline u32 tcp_notsent_lowat(const struct tcp_sock *tp)
+{
+	return tp->notsent_lowat ?: sysctl_tcp_notsent_lowat;
+}
+
+static inline bool tcp_stream_memory_free(const struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	u32 notsent_bytes = tp->write_seq - tp->snd_nxt;
+
+	return notsent_bytes < tcp_notsent_lowat(tp);
+}
 
 #ifdef CONFIG_PROC_FS
 extern int tcp4_proc_init(void);
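
The writeability check added in sock.h and tcp.h can be modelled in userspace
to see how the pieces combine. This is a simplified sketch, not kernel code:
struct model_sock and the model_* names are hypothetical stand-ins that keep
only the fields the check reads. It also shows why the unsigned subtraction
stays correct when sequence numbers wrap around 2^32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few socket fields the check reads. */
struct model_sock {
	uint32_t write_seq;	/* tail (+1) of data queued by the application */
	uint32_t snd_nxt;	/* next sequence number to be sent */
	uint32_t notsent_lowat;	/* per-socket limit, 0 means "not set" */
	int wmem_queued;	/* bytes charged to the send buffer */
	int sndbuf;		/* send buffer limit */
};

/* Models the sysctl; UINT32_MAX mirrors the UINT_MAX default. */
static uint32_t model_sysctl_notsent_lowat = UINT32_MAX;

/* Per-socket value wins; 0 falls back to the global value. */
static uint32_t model_notsent_lowat(const struct model_sock *sk)
{
	return sk->notsent_lowat ? sk->notsent_lowat
				 : model_sysctl_notsent_lowat;
}

/* Send buffer check first, then the unsent-bytes check. */
static bool model_stream_memory_free(const struct model_sock *sk)
{
	uint32_t notsent_bytes;

	if (sk->wmem_queued >= sk->sndbuf)
		return false;

	/* Modulo-2^32 arithmetic: valid even if write_seq has wrapped. */
	notsent_bytes = sk->write_seq - sk->snd_nxt;
	return notsent_bytes < model_notsent_lowat(sk);
}
```

With write_seq = 100 and snd_nxt = 0xFFFFFF00, the subtraction yields 356
unsent bytes, so a limit of 356 reports "not free" and a limit of 357 reports
"free".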
+1
include/uapi/linux/tcp.h
···
 #define TCP_REPAIR_OPTIONS	22
 #define TCP_FASTOPEN		23	/* Enable FastOpen on listeners */
 #define TCP_TIMESTAMP		24
+#define TCP_NOTSENT_LOWAT	25	/* limit number of unsent bytes in write queue */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
+7
net/ipv4/sysctl_net_ipv4.c
···
 		.extra1		= &one,
 	},
 	{
+		.procname	= "tcp_notsent_lowat",
+		.data		= &sysctl_tcp_notsent_lowat,
+		.maxlen		= sizeof(sysctl_tcp_notsent_lowat),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "tcp_rmem",
 		.data		= &sysctl_tcp_rmem,
 		.maxlen		= sizeof(sysctl_tcp_rmem),
+7
net/ipv4/tcp.c
···
 		else
 			tp->tsoffset = val - tcp_time_stamp;
 		break;
+	case TCP_NOTSENT_LOWAT:
+		tp->notsent_lowat = val;
+		sk->sk_write_space(sk);
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
···
 		break;
 	case TCP_TIMESTAMP:
 		val = tcp_time_stamp + tp->tsoffset;
+		break;
+	case TCP_NOTSENT_LOWAT:
+		val = tp->notsent_lowat;
 		break;
 	default:
 		return -ENOPROTOOPT;
+1
net/ipv4/tcp_ipv4.c
···
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
+	.stream_memory_free	= tcp_stream_memory_free,
 	.sockets_allocated	= &tcp_sockets_allocated,
 	.orphan_count		= &tcp_orphan_count,
 	.memory_allocated	= &tcp_memory_allocated,
+3
net/ipv4/tcp_output.c
···
 /* By default, RFC2861 behavior. */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
+EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
+
 static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			   int push_one, gfp_t gfp);
 
+1
net/ipv6/tcp_ipv6.c
···
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
+	.stream_memory_free	= tcp_stream_memory_free,
 	.sockets_allocated	= &tcp_sockets_allocated,
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,