Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

workqueue: Implement BH workqueues to eventually replace tasklets

The only generic interface to execute asynchronously in the BH (bottom half)
context is tasklet; however, it's marked deprecated and has some design flaws:
the execution code accesses the tasklet item after the execution is complete,
which can lead to subtle use-after-free bugs in certain usage scenarios, and
its flush and cancel mechanisms are less developed.

This patch implements BH workqueues, which share the same semantics and
features as regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms apply, and a BH workqueue can be
thought of as a convenience wrapper around softirq.
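As an illustrative sketch (kernel code, not buildable standalone; the module
and identifiers "my_bh_wq", "my_work_fn" are hypothetical, while WQ_BH and its
constraints come from this patch), a dedicated BH workqueue is allocated by
passing WQ_BH to alloc_workqueue() with a max_active of 0:

```c
#include <linux/workqueue.h>

static struct workqueue_struct *my_bh_wq;	/* hypothetical example wq */

static void my_work_fn(struct work_struct *work)
{
	/* executes in softirq context on the queueing CPU - must not sleep */
}

static DECLARE_WORK(my_work, my_work_fn);

static int __init my_init(void)
{
	/*
	 * Per this patch, BH workqueues must pass 0 max_active and may only
	 * add WQ_HIGHPRI to the flags.
	 */
	my_bh_wq = alloc_workqueue("my_bh", WQ_BH, 0);
	if (!my_bh_wq)
		return -ENOMEM;

	queue_work(my_bh_wq, &my_work);
	return 0;
}
```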

Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.

Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.

system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own workqueues.
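For illustration, a minimal tasklet-to-BH-workqueue conversion on the shared
system BH workqueue might look like the following sketch (hypothetical
"foo_*" names; kernel code, not buildable standalone):

```c
#include <linux/interrupt.h>
#include <linux/workqueue.h>

/* Before: the deferred handler is a tasklet. */
static void foo_handler(struct tasklet_struct *t);
static DECLARE_TASKLET(foo_tasklet, foo_handler);
/* ... tasklet_schedule(&foo_tasklet); */

/*
 * After: the same handler as a BH work item on system_bh_wq - it still
 * executes in the queueing CPU's softirq context in queueing order, but
 * gains the regular workqueue flush and cancel semantics.
 */
static void foo_work_fn(struct work_struct *work);
static DECLARE_WORK(foo_work, foo_work_fn);
/* ... queue_work(system_bh_wq, &foo_work); */

/* Teardown can then use the common workqueue APIs: */
/* cancel_work_sync(&foo_work); */
```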

v3: - Add missing interrupt.h include.

v2: - Instead of using tasklets, hook directly into its softirq action
functions - tasklet[_hi]_action(). This is slightly cheaper and closer
to the eventual code structure we want to arrive at. Suggested by Lai.

- Lai also pointed out several places which need NULL worker->task
handling or can use clarification. Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
Tested-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

+284 -59
+24 -5
Documentation/core-api/workqueue.rst
···
 item pointing to that function and queue that work item on a
 workqueue.

-Special purpose threads, called worker threads, execute the functions
-off of the queue, one after the other. If no work is queued, the
-worker threads become idle. These worker threads are managed in so
-called worker-pools.
+A work item can be executed in either a thread or the BH (softirq) context.
+
+For threaded workqueues, special purpose threads, called [k]workers, execute
+the functions off of the queue, one after the other. If no work is queued,
+the worker threads become idle. These worker threads are managed in
+worker-pools.

 The cmwq design differentiates between the user-facing workqueues that
 subsystems and drivers queue work items on and the backend mechanism
···
 for high priority ones, for each possible CPU and some extra
 worker-pools to serve work items queued on unbound workqueues - the
 number of these backing pools is dynamic.
+
+BH workqueues use the same framework. However, as there can only be one
+concurrent execution context, there's no need to worry about concurrency.
+Each per-CPU BH worker pool contains only one pseudo worker which represents
+the BH execution context. A BH workqueue can be considered a convenience
+interface to softirq.

 Subsystems and drivers can create and queue work items through special
 workqueue API functions as they see fit. They can influence some
···
 be queued on the worklist of either normal or highpri worker-pool that
 is associated to the CPU the issuer is running on.

-For any worker pool implementation, managing the concurrency level
+For any thread pool implementation, managing the concurrency level
 (how many execution contexts are active) is an important issue. cmwq
 tries to keep the concurrency at a minimal but sufficient level.
 Minimal to save resources and sufficient in that the system is used at
···

 ``flags``
 ---------
+
+``WQ_BH``
+  BH workqueues can be considered a convenience interface to softirq. BH
+  workqueues are always per-CPU and all BH work items are executed in the
+  queueing CPU's softirq context in the queueing order.
+
+  All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
+  only allowed additional flag.
+
+  BH work items cannot sleep. All other features such as delayed queueing,
+  flushing and canceling are supported.

 ``WQ_UNBOUND``
   Work items queued to an unbound wq are served by the special
+11
include/linux/workqueue.h
···
  * Documentation/core-api/workqueue.rst.
  */
 enum wq_flags {
+	WQ_BH			= 1 << 0, /* execute in bottom half (softirq) context */
 	WQ_UNBOUND		= 1 << 1, /* not bound to any cpu */
 	WQ_FREEZABLE		= 1 << 2, /* freeze during suspend */
 	WQ_MEM_RECLAIM		= 1 << 3, /* may be used for memory reclaim */
···
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
 	__WQ_ORDERED_EXPLICIT	= 1 << 19, /* internal: alloc_ordered_workqueue() */
+
+	/* BH wq only allows the following flags */
+	__WQ_BH_ALLOWS		= WQ_BH | WQ_HIGHPRI,
 };

 enum wq_consts {
···
  * they are same as their non-power-efficient counterparts - e.g.
  * system_power_efficient_wq is identical to system_wq if
  * 'wq_power_efficient' is disabled. See WQ_POWER_EFFICIENT for more info.
+ *
+ * system_bh[_highpri]_wq are convenience interface to softirq. BH work items
+ * are executed in the queueing CPU's BH context in the queueing order.
  */
 extern struct workqueue_struct *system_wq;
 extern struct workqueue_struct *system_highpri_wq;
···
 extern struct workqueue_struct *system_freezable_wq;
 extern struct workqueue_struct *system_power_efficient_wq;
 extern struct workqueue_struct *system_freezable_power_efficient_wq;
+extern struct workqueue_struct *system_bh_wq;
+extern struct workqueue_struct *system_bh_highpri_wq;
+
+void workqueue_softirq_action(bool highpri);

 /**
  * alloc_workqueue - allocate a workqueue
+3
kernel/softirq.c
···
 #include <linux/tick.h>
 #include <linux/irq.h>
 #include <linux/wait_bit.h>
+#include <linux/workqueue.h>

 #include <asm/softirq_stack.h>

···
 static __latent_entropy void tasklet_action(struct softirq_action *a)
 {
+	workqueue_softirq_action(false);
 	tasklet_action_common(a, this_cpu_ptr(&tasklet_vec), TASKLET_SOFTIRQ);
 }

 static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
 {
+	workqueue_softirq_action(true);
 	tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
 }
+237 -52
kernel/workqueue.c
···
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/init.h>
+#include <linux/interrupt.h>
 #include <linux/signal.h>
 #include <linux/completion.h>
 #include <linux/workqueue.h>
···
 	 * Note that DISASSOCIATED should be flipped only while holding
 	 * wq_pool_attach_mutex to avoid changing binding state while
 	 * worker_attach_to_pool() is in progress.
+	 *
+	 * As there can only be one concurrent BH execution context per CPU, a
+	 * BH pool is per-CPU and always DISASSOCIATED.
 	 */
-	POOL_MANAGER_ACTIVE	= 1 << 0,	/* being managed */
+	POOL_BH			= 1 << 0,	/* is a BH pool */
+	POOL_MANAGER_ACTIVE	= 1 << 1,	/* being managed */
 	POOL_DISASSOCIATED	= 1 << 2,	/* cpu can't serve workers */
 };
···

 	WQ_NAME_LEN		= 32,
 };
+
+/*
+ * We don't want to trap softirq for too long. See MAX_SOFTIRQ_TIME and
+ * MAX_SOFTIRQ_RESTART in kernel/softirq.c. These are macros because
+ * msecs_to_jiffies() can't be an initializer.
+ */
+#define BH_WORKER_JIFFIES	msecs_to_jiffies(2)
+#define BH_WORKER_RESTARTS	10

 /*
  * Structure fields follow one of the following exclusion rules.
···
 #endif
 module_param_named(debug_force_rr_cpu, wq_debug_force_rr_cpu, bool, 0644);

+/* the BH worker pools */
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
+				     bh_worker_pools);
+
 /* the per-cpu worker pools */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS], cpu_worker_pools);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
+				     cpu_worker_pools);

 static DEFINE_IDR(worker_pool_idr);	/* PR: idr of all pools */
···
 EXPORT_SYMBOL_GPL(system_power_efficient_wq);
 struct workqueue_struct *system_freezable_power_efficient_wq __ro_after_init;
 EXPORT_SYMBOL_GPL(system_freezable_power_efficient_wq);
+struct workqueue_struct *system_bh_wq;
+EXPORT_SYMBOL_GPL(system_bh_wq);
+struct workqueue_struct *system_bh_highpri_wq;
+EXPORT_SYMBOL_GPL(system_bh_highpri_wq);

 static int worker_thread(void *__worker);
 static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
···
 			 !lockdep_is_held(&wq->mutex) &&		\
 			 !lockdep_is_held(&wq_pool_mutex),		\
 			 "RCU, wq->mutex or wq_pool_mutex should be held")
+
+#define for_each_bh_worker_pool(pool, cpu)				\
+	for ((pool) = &per_cpu(bh_worker_pools, cpu)[0];		\
+	     (pool) < &per_cpu(bh_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
+	     (pool)++)

 #define for_each_cpu_worker_pool(pool, cpu)				\
 	for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0];		\
···
 	if (!need_more_worker(pool) || !worker)
 		return false;

+	if (pool->flags & POOL_BH) {
+		if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
+			raise_softirq_irqoff(HI_SOFTIRQ);
+		else
+			raise_softirq_irqoff(TASKLET_SOFTIRQ);
+		return true;
+	}
+
 	p = worker->task;

 #ifdef CONFIG_SMP
···
 	lockdep_assert_held(&pool->lock);

 	if (!nna) {
-		/* per-cpu workqueue, pwq->nr_active is sufficient */
+		/* BH or per-cpu workqueue, pwq->nr_active is sufficient */
 		obtained = pwq->nr_active < READ_ONCE(wq->max_active);
 		goto out;
 	}
···
  * cpu-[un]hotplugs.
  */
 static void worker_attach_to_pool(struct worker *worker,
-				   struct worker_pool *pool)
+				  struct worker_pool *pool)
 {
 	mutex_lock(&wq_pool_attach_mutex);

 	/*
-	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
-	 * stable across this function. See the comments above the flag
-	 * definition for details.
+	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains stable
+	 * across this function. See the comments above the flag definition for
+	 * details. BH workers are, while per-CPU, always DISASSOCIATED.
 	 */
-	if (pool->flags & POOL_DISASSOCIATED)
+	if (pool->flags & POOL_DISASSOCIATED) {
 		worker->flags |= WORKER_UNBOUND;
-	else
+	} else {
+		WARN_ON_ONCE(pool->flags & POOL_BH);
 		kthread_set_per_cpu(worker->task, pool->cpu);
+	}

 	if (worker->rescue_wq)
 		set_cpus_allowed_ptr(worker->task, pool_allowed_cpus(pool));
···
 {
 	struct worker_pool *pool = worker->pool;
 	struct completion *detach_completion = NULL;
+
+	/* there is one permanent BH worker per CPU which should never detach */
+	WARN_ON_ONCE(pool->flags & POOL_BH);

 	mutex_lock(&wq_pool_attach_mutex);
···
 	worker->id = id;

-	if (pool->cpu >= 0)
-		snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
-			 pool->attrs->nice < 0 ? "H" : "");
-	else
-		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+	if (!(pool->flags & POOL_BH)) {
+		if (pool->cpu >= 0)
+			snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
+				 pool->attrs->nice < 0 ? "H" : "");
+		else
+			snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);

-	worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
-					      "kworker/%s", id_buf);
-	if (IS_ERR(worker->task)) {
-		if (PTR_ERR(worker->task) == -EINTR) {
-			pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
-			       id_buf);
-		} else {
-			pr_err_once("workqueue: Failed to create a worker thread: %pe",
-				    worker->task);
-		}
-		goto fail;
-	}
+		worker->task = kthread_create_on_node(worker_thread, worker,
+					pool->node, "kworker/%s", id_buf);
+		if (IS_ERR(worker->task)) {
+			if (PTR_ERR(worker->task) == -EINTR) {
+				pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
+				       id_buf);
+			} else {
+				pr_err_once("workqueue: Failed to create a worker thread: %pe",
+					    worker->task);
+			}
+			goto fail;
+		}

-	set_user_nice(worker->task, pool->attrs->nice);
-	kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
+		set_user_nice(worker->task, pool->attrs->nice);
+		kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
+	}

 	/* successful, attach the worker to the pool */
 	worker_attach_to_pool(worker, pool);
···
 	 * check if not woken up soon. As kick_pool() is noop if @pool is empty,
 	 * wake it up explicitly.
 	 */
-	wake_up_process(worker->task);
+	if (worker->task)
+		wake_up_process(worker->task);

 	raw_spin_unlock_irq(&pool->lock);
···
 	worker->current_work = work;
 	worker->current_func = work->func;
 	worker->current_pwq = pwq;
-	worker->current_at = worker->task->se.sum_exec_runtime;
+	if (worker->task)
+		worker->current_at = worker->task->se.sum_exec_runtime;
 	work_data = *work_data_bits(work);
 	worker->current_color = get_work_color(work_data);
···
 	 * stop_machine. At the same time, report a quiescent RCU state so
 	 * the same condition doesn't freeze RCU.
 	 */
-	cond_resched();
+	if (worker->task)
+		cond_resched();

 	raw_spin_lock_irq(&pool->lock);
···
 	goto repeat;
 }

+static void bh_worker(struct worker *worker)
+{
+	struct worker_pool *pool = worker->pool;
+	int nr_restarts = BH_WORKER_RESTARTS;
+	unsigned long end = jiffies + BH_WORKER_JIFFIES;
+
+	raw_spin_lock_irq(&pool->lock);
+	worker_leave_idle(worker);
+
+	/*
+	 * This function follows the structure of worker_thread(). See there for
+	 * explanations on each step.
+	 */
+	if (!need_more_worker(pool))
+		goto done;
+
+	WARN_ON_ONCE(!list_empty(&worker->scheduled));
+	worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
+
+	do {
+		struct work_struct *work =
+			list_first_entry(&pool->worklist,
+					 struct work_struct, entry);
+
+		if (assign_work(work, worker, NULL))
+			process_scheduled_works(worker);
+	} while (keep_working(pool) &&
+		 --nr_restarts && time_before(jiffies, end));
+
+	worker_set_flags(worker, WORKER_PREP);
+done:
+	worker_enter_idle(worker);
+	kick_pool(pool);
+	raw_spin_unlock_irq(&pool->lock);
+}
+
+/*
+ * TODO: Convert all tasklet users to workqueue and use softirq directly.
+ *
+ * This is currently called from tasklet[_hi]_action() and thus is also called
+ * whenever there are tasklets to run. Let's do an early exit if there's nothing
+ * queued. Once conversion from tasklet is complete, the need_more_worker() test
+ * can be dropped.
+ *
+ * After full conversion, we'll add worker->softirq_action, directly use the
+ * softirq action and obtain the worker pointer from the softirq_action pointer.
+ */
+void workqueue_softirq_action(bool highpri)
+{
+	struct worker_pool *pool =
+		&per_cpu(bh_worker_pools, smp_processor_id())[highpri];
+	if (need_more_worker(pool))
+		bh_worker(list_first_entry(&pool->workers, struct worker, node));
+}
+
 /**
  * check_flush_dependency - check for flush dependency sanity
  * @target_wq: workqueue being flushed
···
 			      struct wq_barrier *barr,
 			      struct work_struct *target, struct worker *worker)
 {
+	static __maybe_unused struct lock_class_key bh_key, thr_key;
 	unsigned int work_flags = 0;
 	unsigned int work_color;
 	struct list_head *head;
···
 	 * as we know for sure that this will not trigger any of the
 	 * checks and call back into the fixup functions where we
 	 * might deadlock.
+	 *
+	 * BH and threaded workqueues need separate lockdep keys to avoid
+	 * spuriously triggering "inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W}
+	 * usage".
 	 */
-	INIT_WORK_ONSTACK(&barr->work, wq_barrier_func);
+	INIT_WORK_ONSTACK_KEY(&barr->work, wq_barrier_func,
+			      (pwq->wq->flags & WQ_BH) ? &bh_key : &thr_key);
 	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));

 	init_completion_map(&barr->done, &target->lockdep_map);
···

 static void touch_wq_lockdep_map(struct workqueue_struct *wq)
 {
+#ifdef CONFIG_LOCKDEP
+	if (wq->flags & WQ_BH)
+		local_bh_disable();
+
 	lock_map_acquire(&wq->lockdep_map);
 	lock_map_release(&wq->lockdep_map);
+
+	if (wq->flags & WQ_BH)
+		local_bh_enable();
+#endif
 }

 static void touch_work_lockdep_map(struct work_struct *work,
 				   struct workqueue_struct *wq)
 {
+#ifdef CONFIG_LOCKDEP
+	if (wq->flags & WQ_BH)
+		local_bh_disable();
+
 	lock_map_acquire(&work->lockdep_map);
 	lock_map_release(&work->lockdep_map);
+
+	if (wq->flags & WQ_BH)
+		local_bh_enable();
+#endif
 }

 /**
···

 	if (!(wq->flags & WQ_UNBOUND)) {
 		for_each_possible_cpu(cpu) {
-			struct pool_workqueue **pwq_p =
-				per_cpu_ptr(wq->cpu_pwq, cpu);
-			struct worker_pool *pool =
-				&(per_cpu_ptr(cpu_worker_pools, cpu)[highpri]);
+			struct pool_workqueue **pwq_p;
+			struct worker_pool __percpu *pools;
+			struct worker_pool *pool;
+
+			if (wq->flags & WQ_BH)
+				pools = bh_worker_pools;
+			else
+				pools = cpu_worker_pools;
+
+			pool = &(per_cpu_ptr(pools, cpu)[highpri]);
+			pwq_p = per_cpu_ptr(wq->cpu_pwq, cpu);

 			*pwq_p = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL,
 						       pool->node);
···
 	size_t wq_size;
 	int name_len;

+	if (flags & WQ_BH) {
+		if (WARN_ON_ONCE(flags & ~__WQ_BH_ALLOWS))
+			return NULL;
+		if (WARN_ON_ONCE(max_active))
+			return NULL;
+	}
+
 	/*
 	 * Unbound && max_active == 1 used to imply ordered, which is no longer
 	 * the case on many machines due to per-pod pools. While
···
 		pr_warn_once("workqueue: name exceeds WQ_NAME_LEN. Truncating to: %s\n",
 			     wq->name);

-	max_active = max_active ?: WQ_DFL_ACTIVE;
-	max_active = wq_clamp_max_active(max_active, flags, wq->name);
+	if (flags & WQ_BH) {
+		/*
+		 * BH workqueues always share a single execution context per CPU
+		 * and don't impose any max_active limit.
+		 */
+		max_active = INT_MAX;
+	} else {
+		max_active = max_active ?: WQ_DFL_ACTIVE;
+		max_active = wq_clamp_max_active(max_active, flags, wq->name);
+	}

 	/* init wq */
 	wq->flags = flags;
···
  */
 void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
 {
+	/* max_active doesn't mean anything for BH workqueues */
+	if (WARN_ON(wq->flags & WQ_BH))
+		return;
 	/* disallow meddling with max_active for ordered workqueues */
 	if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
 		return;
···
 	pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
 	if (pool->node != NUMA_NO_NODE)
 		pr_cont(" node=%d", pool->node);
-	pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
+	pr_cont(" flags=0x%x", pool->flags);
+	if (pool->flags & POOL_BH)
+		pr_cont(" bh%s",
+			pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
+	else
+		pr_cont(" nice=%d", pool->attrs->nice);
+}
+
+static void pr_cont_worker_id(struct worker *worker)
+{
+	struct worker_pool *pool = worker->pool;
+
+	if (pool->flags & WQ_BH)
+		pr_cont("bh%s",
+			pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
+	else
+		pr_cont("%d%s", task_pid_nr(worker->task),
+			worker->rescue_wq ? "(RESCUER)" : "");
 }

 struct pr_cont_work_struct {
···
 			if (worker->current_pwq != pwq)
 				continue;

-			pr_cont("%s %d%s:%ps", comma ? "," : "",
-				task_pid_nr(worker->task),
-				worker->rescue_wq ? "(RESCUER)" : "",
-				worker->current_func);
+			pr_cont(" %s", comma ? "," : "");
+			pr_cont_worker_id(worker);
+			pr_cont(":%ps", worker->current_func);
 			list_for_each_entry(work, &worker->scheduled, entry)
 				pr_cont_work(false, work, &pcws);
 			pr_cont_work_flush(comma, (work_func_t)-1L, &pcws);
···
 		pr_cont(" manager: %d",
 			task_pid_nr(pool->manager->task));
 	list_for_each_entry(worker, &pool->idle_list, entry) {
-		pr_cont(" %s%d", first ? "idle: " : "",
-			task_pid_nr(worker->task));
+		pr_cont(" %s", first ? "idle: " : "");
+		pr_cont_worker_id(worker);
 		first = false;
 	}
 	pr_cont("\n");
···
 	mutex_lock(&wq_pool_mutex);

 	for_each_pool(pool, pi) {
-		mutex_lock(&wq_pool_attach_mutex);
+		/* BH pools aren't affected by hotplug */
+		if (pool->flags & POOL_BH)
+			continue;

+		mutex_lock(&wq_pool_attach_mutex);
 		if (pool->cpu == cpu)
 			rebind_workers(pool);
 		else if (pool->cpu < 0)
 			restore_unbound_workers_cpumask(pool, cpu);
-
 		mutex_unlock(&wq_pool_attach_mutex);
 	}
···
 		/* did we stall? */
 		if (time_after(now, ts + thresh)) {
 			lockup_detected = true;
-			if (pool->cpu >= 0) {
+			if (pool->cpu >= 0 && !(pool->flags & POOL_BH)) {
 				pool->cpu_stall = true;
 				cpu_pool_stall = true;
 			}
···
 	pt->pod_node[0] = NUMA_NO_NODE;
 	pt->cpu_pod[0] = 0;

-	/* initialize CPU pools */
+	/* initialize BH and CPU pools */
 	for_each_possible_cpu(cpu) {
 		struct worker_pool *pool;
+
+		i = 0;
+		for_each_bh_worker_pool(pool, cpu) {
+			init_cpu_worker_pool(pool, cpu, std_nice[i++]);
+			pool->flags |= POOL_BH;
+		}

 		i = 0;
 		for_each_cpu_worker_pool(pool, cpu)
···
 	system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_pwr_efficient",
 					      WQ_FREEZABLE | WQ_POWER_EFFICIENT,
 					      0);
+	system_bh_wq = alloc_workqueue("events_bh", WQ_BH, 0);
+	system_bh_highpri_wq = alloc_workqueue("events_bh_highpri",
+					       WQ_BH | WQ_HIGHPRI, 0);
 	BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
 	       !system_unbound_wq || !system_freezable_wq ||
 	       !system_power_efficient_wq ||
-	       !system_freezable_power_efficient_wq);
+	       !system_freezable_power_efficient_wq ||
+	       !system_bh_wq || !system_bh_highpri_wq);
 }

 static void __init wq_cpu_intensive_thresh_init(void)
···
 	 * up. Also, create a rescuer for workqueues that requested it.
 	 */
 	for_each_possible_cpu(cpu) {
-		for_each_cpu_worker_pool(pool, cpu) {
+		for_each_bh_worker_pool(pool, cpu)
 			pool->node = cpu_to_node(cpu);
-		}
+		for_each_cpu_worker_pool(pool, cpu)
+			pool->node = cpu_to_node(cpu);
 	}

 	list_for_each_entry(wq, &workqueues, list) {
···
 	mutex_unlock(&wq_pool_mutex);

-	/* create the initial workers */
+	/*
+	 * Create the initial workers. A BH pool has one pseudo worker that
+	 * represents the shared BH execution context and thus doesn't get
+	 * affected by hotplug events. Create the BH pseudo workers for all
+	 * possible CPUs here.
+	 */
+	for_each_possible_cpu(cpu)
+		for_each_bh_worker_pool(pool, cpu)
+			BUG_ON(!create_worker(pool));
+
 	for_each_online_cpu(cpu) {
 		for_each_cpu_worker_pool(pool, cpu) {
 			pool->flags &= ~POOL_DISASSOCIATED;
+9 -2
tools/workqueue/wq_dump.py
···
 wq_type_len = 9

 def wq_type_str(wq):
-    if wq.flags & WQ_UNBOUND:
+    if wq.flags & WQ_BH:
+        return f'{"bh":{wq_type_len}}'
+    elif wq.flags & WQ_UNBOUND:
         if wq.flags & WQ_ORDERED:
             return f'{"ordered":{wq_type_len}}'
         else:
···
 wq_affn_dfl = prog['wq_affn_dfl']
 wq_affn_names = prog['wq_affn_names']

+WQ_BH = prog['WQ_BH']
 WQ_UNBOUND = prog['WQ_UNBOUND']
 WQ_ORDERED = prog['__WQ_ORDERED']
 WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
···
 WQ_AFFN_CACHE = prog['WQ_AFFN_CACHE']
 WQ_AFFN_NUMA = prog['WQ_AFFN_NUMA']
 WQ_AFFN_SYSTEM = prog['WQ_AFFN_SYSTEM']
+
+POOL_BH = prog['POOL_BH']

 WQ_NAME_LEN = prog['WQ_NAME_LEN'].value_()
 cpumask_str_len = len(cpumask_str(wq_unbound_cpumask))
···

 for pi, pool in idr_for_each(worker_pool_idr):
     pool = drgn.Object(prog, 'struct worker_pool', address=pool)
-    print(f'pool[{pi:0{max_pool_id_len}}] ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
+    print(f'pool[{pi:0{max_pool_id_len}}] flags=0x{pool.flags.value_():02x} ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
     print(f'idle/workers={pool.nr_idle.value_():3}/{pool.nr_workers.value_():3} ', end='')
     if pool.cpu >= 0:
         print(f'cpu={pool.cpu.value_():3}', end='')
+        if pool.flags & POOL_BH:
+            print(' bh', end='')
     else:
         print(f'cpus={cpumask_str(pool.attrs.cpumask)}', end='')
     print(f' pod_cpus={cpumask_str(pool.attrs.__pod_cpumask)}', end='')