
net: Introduce net.core.bypass_prot_mem sysctl.

If a socket has sk->sk_bypass_prot_mem flagged, the socket opts out
of the global protocol memory accounting.

Let's control the flag by a new sysctl knob.

The flag is written once during socket(2) and is inherited by child
sockets.
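
The write-once and inheritance semantics can be illustrated with a toy model. This is not kernel code: the names Netns, Sock and accept_child are hypothetical, and the sketch only mirrors the behaviour stated in this message, namely that the flag is sampled at socket creation and copied to children.

```python
from dataclasses import dataclass


@dataclass
class Netns:
    """Hypothetical stand-in for a network namespace holding the sysctl value."""
    sysctl_bypass_prot_mem: int = 0


class Sock:
    """Toy socket: the flag is sampled once, when the socket is created."""

    def __init__(self, net: Netns):
        self.net = net
        # Sampled at creation time only; later sysctl writes do not touch it.
        self.bypass_prot_mem = 1 if net.sysctl_bypass_prot_mem else 0

    def accept_child(self) -> "Sock":
        # A child copies the parent's flag instead of re-reading the sysctl.
        child = Sock.__new__(Sock)
        child.net = self.net
        child.bypass_prot_mem = self.bypass_prot_mem
        return child


net = Netns()
listener = Sock(net)               # created while the sysctl is 0
net.sysctl_bypass_prot_mem = 1     # the knob is flipped afterwards
late = Sock(net)                   # created while the sysctl is 1
child = listener.accept_child()    # inherits from the listener

print(listener.bypass_prot_mem, late.bypass_prot_mem, child.bypass_prot_mem)
# prints: 0 1 0
```

In this model, flipping the knob affects only sockets created afterwards; a listener created earlier keeps handing its old value to accepted children.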

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.
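
The script itself is not included in this commit. A minimal sketch of what such a script might look like follows; the function name build_pressure, the pair count, the chunk size and the per-pair cap are all assumptions chosen for illustration, not the values used in the test below.

```python
import socket


def build_pressure(pairs=4, chunk=1000, max_bytes=64 * 1024):
    """Create `pairs` loopback TCP connections and fill them without reading.

    Returns (sockets, total_bytes_sent). The sockets must stay referenced,
    otherwise they are closed and the queued data is freed.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))        # port 0: let the kernel pick one
    listener.listen(pairs)
    addr = listener.getsockname()

    socks, total = [listener], 0
    for _ in range(pairs):
        client = socket.create_connection(addr)
        server, _peer = listener.accept()
        client.setblocking(False)
        sent = 0
        try:
            while sent < max_bytes:
                # The peer (`server`) never calls recv(), so data piles up.
                sent += client.send(b"x" * chunk)
        except BlockingIOError:
            pass                           # send buffer is full; stop here
        total += sent
        socks += [client, server]
    return socks, total


if __name__ == "__main__":
    held, queued = build_pressure()
    print(f"holding {len(held)} sockets, {queued} bytes queued and unread")
```

To reproduce numbers on the scale shown below, the pair count would need to be in the thousands and the process kept alive (for example with a final sleep) so the sockets stay open while the counters are read.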

Setup:

# mkdir /sys/fs/cgroup/test
# echo $$ >> /sys/fs/cgroup/test/cgroup.procs
# sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
# ulimit -n 524288

Without net.core.bypass_prot_mem, charged to tcp_mem & memcg:

# python3 pressure.py &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 22642688 <-------------------------------------- charged to memcg
# cat /proc/net/sockstat | grep TCP
TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53188
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:49972
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53868
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53554
# nstat | grep Pressure || echo no pressure
TcpExtTCPMemoryPressures 1 0.0
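
When comparing the two counters, note that the mem field of /proc/net/sockstat is counted in pages, while the memcg sock counter is in bytes. The sketch below parses the TCP line and converts it; the helper name tcp_mem_pages is hypothetical and the 4 KiB page size is an assumption.

```python
def tcp_mem_pages(sockstat_text):
    """Return the `mem` field (in pages) from the TCP line of sockstat output."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]
            # Fields alternate name, value: inuse 2006 orphan 0 ...
            stats = dict(zip(fields[0::2], map(int, fields[1::2])))
            return stats["mem"]
    raise ValueError("no TCP line found")


# Assumption: 4 KiB pages. On a live system, os.sysconf("SC_PAGE_SIZE")
# reports the real value.
PAGE_SIZE = 4096

# The TCP line from the run above, without the trailing annotation.
sample = "TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376"
pages = tcp_mem_pages(sample)
print(pages, pages * PAGE_SIZE)
# prints: 5376 22020096
```

Under that assumption, 5376 pages is about 22 MB, in line with the 22642688 bytes that memcg reports for the same run.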

With net.core.bypass_prot_mem=1, charged to memcg only:

# sysctl -q net.core.bypass_prot_mem=1
# python3 pressure.py &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 2757468160 <------------------------------------ charged to memcg
# cat /proc/net/sockstat | grep TCP
TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:49026
ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:45630
ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:44870
ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:45274
# nstat | grep Pressure || echo no pressure
no pressure

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://patch.msgid.link/20251014235604.3057003-4-kuniyu@google.com

Authored by Kuniyuki Iwashima and committed by Martin KaFai Lau
b46ab631 7c268eae

+23
+8
Documentation/admin-guide/sysctl/net.rst
@@ -212,6 +212,14 @@
 
 Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
 
+bypass_prot_mem
+---------------
+
+Skip charging socket buffers to the global per-protocol memory
+accounting controlled by net.ipv4.tcp_mem, net.ipv4.udp_mem, etc.
+
+Default: 0 (off)
+
 rmem_default
 ------------
 
+1
include/net/netns/core.h
@@ -17,6 +17,7 @@
 	int sysctl_optmem_max;
 	u8 sysctl_txrehash;
 	u8 sysctl_tstamp_allow_data;
+	u8 sysctl_bypass_prot_mem;
 
 #ifdef CONFIG_PROC_FS
 	struct prot_inuse __percpu *prot_inuse;
+5
net/core/sock.c
@@ -2306,8 +2306,13 @@
 		 * why we need sk_prot_creator -acme
 		 */
 		sk->sk_prot = sk->sk_prot_creator = prot;
+
+		if (READ_ONCE(net->core.sysctl_bypass_prot_mem))
+			sk->sk_bypass_prot_mem = 1;
+
 		sk->sk_kern_sock = kern;
 		sock_lock_init(sk);
+
 		sk->sk_net_refcnt = kern ? 0 : 1;
 		if (likely(sk->sk_net_refcnt)) {
 			get_net_track(net, &sk->ns_tracker, priority);
+9
net/core/sysctl_net_core.c
@@ -683,6 +683,15 @@
 		.extra1 = SYSCTL_ZERO,
 		.extra2 = SYSCTL_ONE
 	},
+	{
+		.procname = "bypass_prot_mem",
+		.data = &init_net.core.sysctl_bypass_prot_mem,
+		.maxlen = sizeof(u8),
+		.mode = 0644,
+		.proc_handler = proc_dou8vec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_ONE
+	},
 	/* sysctl_core_net_init() will set the values after this
 	 * to readonly in network namespaces
 	 */
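
The table entry above bounds the knob to 0..1 (SYSCTL_ZERO to SYSCTL_ONE) and, following the usual sysctl naming, exposes it at /proc/sys/net/core/bypass_prot_mem. A small userspace sketch for reading and setting it is shown below; the helper names read_bypass and write_bypass are hypothetical, and the proc_sys parameter exists only so the demonstration can run against a scratch directory instead of the real /proc/sys.

```python
import tempfile
from pathlib import Path

SYSCTL = "net/core/bypass_prot_mem"


def read_bypass(proc_sys="/proc/sys"):
    """Return the current value of the knob as an int."""
    return int((Path(proc_sys) / SYSCTL).read_text().strip())


def write_bypass(value, proc_sys="/proc/sys"):
    """Set the knob; mirrors the 0..1 range enforced by the sysctl table."""
    if value not in (0, 1):
        raise ValueError("bypass_prot_mem accepts only 0 or 1")
    (Path(proc_sys) / SYSCTL).write_text(f"{value}\n")


# Demonstration against a scratch directory, so no root privileges and no
# kernel with this patch are needed to run the example.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "net/core").mkdir(parents=True)
    write_bypass(1, root)
    print(read_bypass(root))
# prints: 1
```

On a kernel carrying this patch, write_bypass(1) with the default path should be equivalent to `sysctl -q net.core.bypass_prot_mem=1` and requires root.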