Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'report-rcu-qs-for-busy-network-kthreads'

Yan Zhai says:

====================
Report RCU QS for busy network kthreads

This changeset fixes a common problem for busy networking kthreads.
These threads, e.g. NAPI threads, typically loop as follows:

* poll a batch of packets
* if there is more work, call cond_resched() to allow scheduling
* continue to poll more packets while the rx queue is not empty

We observed this being a problem in production, since it can block RCU
tasks from making progress under heavy load. Investigation indicates
that merely calling cond_resched() is insufficient for RCU tasks to
reach quiescent states. Worse, on voluntary-preempt kernels
cond_resched() frequently clears the TIF_NEED_RESCHED flag, so
schedule() is never called in these circumstances, even though
schedule() would in fact provide the required quiescent states. This
affects at least NAPI threads, napi_busy_loop, and the cpumap kthread.

By periodically reporting RCU QSes in these kthreads before calling
cond_resched(), the blocked RCU waiters can make progress. Rather than
reporting a QS for RCU tasks alone, this code shares the concern noted in
commit d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt
when safe"), so it reports a consolidated QS for safety.

It is worth noting that, although this problem is reproducible in
napi_busy_loop, it only shows up when the polling interval is set as high
as 2ms, far above the 50us-100us recommended in the documentation. So
napi_busy_loop is left untouched.

Lastly, this does not affect RT kernels, which do not enter the scheduler
through cond_resched(). Without the side effect mentioned above, schedule()
will be called from time to time, clearing the RCU-Tasks holdouts.

V4: https://lore.kernel.org/bpf/cover.1710525524.git.yan@cloudflare.com/
V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@kernel.org/t/
V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/
V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t
====================

Link: https://lore.kernel.org/r/cover.1710877680.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

 include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++
 kernel/bpf/cpumap.c      |  3 +++
 net/core/dev.c           |  3 +++
 3 files changed, 37 insertions(+)
include/linux/rcupdate.h
···
 		cond_resched(); \
 	} while (0)
 
+/**
+ * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
+ * @old_ts: jiffies at start of processing.
+ *
+ * This helper is for long-running softirq handlers, such as NAPI threads in
+ * networking. The caller should initialize the variable passed in as @old_ts
+ * at the beginning of the softirq handler. When invoked frequently, this macro
+ * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
+ * provide both RCU and RCU-Tasks quiescent states. Note that this macro
+ * modifies its old_ts argument.
+ *
+ * Because regions of code that have disabled softirq act as RCU read-side
+ * critical sections, this macro should be invoked with softirq (and
+ * preemption) enabled.
+ *
+ * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
+ * have more chance to invoke schedule() calls and provide necessary quiescent
+ * states. As a contrast, calling cond_resched() only won't achieve the same
+ * effect because cond_resched() does not provide RCU-Tasks quiescent states.
+ */
+#define rcu_softirq_qs_periodic(old_ts) \
+do { \
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
+	    time_after(jiffies, (old_ts) + HZ / 10)) { \
+		preempt_disable(); \
+		rcu_softirq_qs(); \
+		preempt_enable(); \
+		(old_ts) = jiffies; \
+	} \
+} while (0)
+
 /*
  * Infrastructure to implement the synchronize_() primitives in
  * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
kernel/bpf/cpumap.c
···
 static int cpu_map_kthread_run(void *data)
 {
 	struct bpf_cpu_map_entry *rcpu = data;
+	unsigned long last_qs = jiffies;
 
 	complete(&rcpu->kthread_running);
 	set_current_state(TASK_INTERRUPTIBLE);
···
 		if (__ptr_ring_empty(rcpu->queue)) {
 			schedule();
 			sched = 1;
+			last_qs = jiffies;
 		} else {
 			__set_current_state(TASK_RUNNING);
 		}
 	} else {
+		rcu_softirq_qs_periodic(last_qs);
 		sched = cond_resched();
 	}
net/core/dev.c
···
 	void *have;
 
 	while (!napi_thread_wait(napi)) {
+		unsigned long last_qs = jiffies;
+
 		for (;;) {
 			bool repoll = false;
···
 			if (!repoll)
 				break;
 
+			rcu_softirq_qs_periodic(last_qs);
 			cond_resched();
 		}
 	}