Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

tcp: properly terminate timers for kernel sockets

We had various syzbot reports about tcp timers firing after
the corresponding netns has been dismantled.

Fortunately Josef Bacik could trigger the issue more often,
and could test a patch I wrote two years ago.

When TCP sockets are closed, we call inet_csk_clear_xmit_timers()
to 'stop' the timers.

inet_csk_clear_xmit_timers() can be called from any context,
including when socket lock is held.
This is the reason it uses sk_stop_timer(), aka del_timer().
This means that ongoing timers might finish much later.

For user sockets, this is fine because each running timer
holds a reference on the socket, and the user socket holds
a reference on the netns.

For kernel sockets, we risk that the netns is freed before
timer can complete, because kernel sockets do not hold
reference on the netns.

This patch adds inet_csk_clear_xmit_timers_sync() function
that using sk_stop_timer_sync() to make sure all timers
are terminated before the kernel socket is released.
Modules using kernel sockets close them in their netns exit()
handler.

Also add sock_not_owned_by_me() helper to get LOCKDEP
support : inet_csk_clear_xmit_timers_sync() must not be called
while socket lock is held.

It is very possible we can revert in the future commit
3a58f13a881e ("net: rds: acquire refcount on TCP sockets")
which attempted to solve the issue in rds only.
(net/smc/af_smc.c and net/mptcp/subflow.c have similar code)

We probably can remove the check_net() tests from
tcp_out_of_resources() and __tcp_close() in the future.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Closes: https://lore.kernel.org/netdev/20240314210740.GA2823176@perftesting/
Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Fixes: 8a68173691f0 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
Link: https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Josef Bacik <josef@toxicpanda.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/20240322135732.1535772-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

authored by

Eric Dumazet and committed by
Jakub Kicinski
151c9c72 b11c8173

+24
+1
include/net/inet_connection_sock.h
··· 175 175 void (*delack_handler)(struct timer_list *), 176 176 void (*keepalive_handler)(struct timer_list *)); 177 177 void inet_csk_clear_xmit_timers(struct sock *sk); 178 + void inet_csk_clear_xmit_timers_sync(struct sock *sk); 178 179 179 180 static inline void inet_csk_schedule_ack(struct sock *sk) 180 181 {
+7
include/net/sock.h
··· 1759 1759 #endif 1760 1760 } 1761 1761 1762 + static inline void sock_not_owned_by_me(const struct sock *sk) 1763 + { 1764 + #ifdef CONFIG_LOCKDEP 1765 + WARN_ON_ONCE(lockdep_sock_is_held(sk) && debug_locks); 1766 + #endif 1767 + } 1768 + 1762 1769 static inline bool sock_owned_by_user(const struct sock *sk) 1763 1770 { 1764 1771 sock_owned_by_me(sk);
+14
net/ipv4/inet_connection_sock.c
··· 771 771 } 772 772 EXPORT_SYMBOL(inet_csk_clear_xmit_timers); 773 773 774 + void inet_csk_clear_xmit_timers_sync(struct sock *sk) 775 + { 776 + struct inet_connection_sock *icsk = inet_csk(sk); 777 + 778 + /* ongoing timer handlers need to acquire socket lock. */ 779 + sock_not_owned_by_me(sk); 780 + 781 + icsk->icsk_pending = icsk->icsk_ack.pending = 0; 782 + 783 + sk_stop_timer_sync(sk, &icsk->icsk_retransmit_timer); 784 + sk_stop_timer_sync(sk, &icsk->icsk_delack_timer); 785 + sk_stop_timer_sync(sk, &sk->sk_timer); 786 + } 787 + 774 788 void inet_csk_delete_keepalive_timer(struct sock *sk) 775 789 { 776 790 sk_stop_timer(sk, &sk->sk_timer);
+2
net/ipv4/tcp.c
··· 2931 2931 lock_sock(sk); 2932 2932 __tcp_close(sk, timeout); 2933 2933 release_sock(sk); 2934 + if (!sk->sk_net_refcnt) 2935 + inet_csk_clear_xmit_timers_sync(sk); 2934 2936 sock_put(sk); 2935 2937 } 2936 2938 EXPORT_SYMBOL(tcp_close);