Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:

- Unbound workqueues now support more flexible affinity scopes.

The default behavior is to soft-affine according to last level cache
boundaries. A work item queued from a given LLC is executed by a
worker running on the same LLC but the worker may be moved across
cache boundaries as the scheduler sees fit. On machines with
multiple L3 caches, which are becoming more popular along with
chiplet designs, this improves cache locality while not harming work
conservation too much.

Unbound workqueues are now also a lot more flexible in terms of
execution affinity. Differing levels of affinity scopes are
supported and both the default and per-workqueue affinity settings
can be modified dynamically. This should help work around many of the
sub-optimal behaviors observed recently with asymmetric ARM CPUs.

This involved significant restructuring of workqueue code. Nothing has
been reported yet, but there is some risk of subtle regressions, so
keep an eye out.
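The grouping the default "cache" scope performs can be sketched outside the kernel. The following is a hypothetical Python illustration (not kernel code; the function name and the `{cpu: llc_id}` input format are invented for this sketch) of how CPUs sharing a last level cache collapse into a single affinity pod:

```python
from collections import defaultdict

def group_cpus_by_llc(cpu_to_llc):
    """Group CPU IDs into pods by last-level-cache ID, mirroring the
    "cache" affinity scope: CPUs sharing an LLC share a worker pool."""
    pods = defaultdict(list)
    for cpu in sorted(cpu_to_llc):
        pods[cpu_to_llc[cpu]].append(cpu)
    # one pod (sorted CPU list) per distinct LLC, in LLC-id order
    return [pods[llc] for llc in sorted(pods)]

# A 4-CPU machine with two L3 caches: CPUs 0-1 on LLC 0, CPUs 2-3 on LLC 1.
print(group_cpus_by_llc({0: 0, 1: 0, 2: 1, 3: 1}))  # → [[0, 1], [2, 3]]
```

A work item issued from CPU 1 would then be handled by a worker from the `[0, 1]` pod, though with non-strict affinity the scheduler may later migrate that worker elsewhere.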

- Rescuer workers now have more identifiable comms.

- workqueue.unbound_cpus added so that the CPUs usable by workqueues
can be constrained early during boot.
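The parameter takes the usual kernel `<cpu-list>` syntax (e.g. `0-3,8`; the kernel parses it with its own cpulist helpers). A minimal Python sketch of that format, for illustration only (helper name invented):

```python
def parse_cpu_list(spec):
    """Parse a kernel-style cpu-list such as '0-3,8' into a set of CPU IDs."""
    cpus = set()
    for part in spec.split(','):
        if '-' in part:
            # an inclusive range like '0-3'
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            # a single CPU like '8'
            cpus.add(int(part))
    return cpus

print(sorted(parse_cpu_list("0-3,8")))  # → [0, 1, 2, 3, 8]
```

So booting with `workqueue.unbound_cpus=0-3,8` would restrict unbound workqueue workers to CPUs 0, 1, 2, 3 and 8.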

- Now that all the in-tree users have been flushed out, trigger a warning
if system-wide workqueues are flushed.

* tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (31 commits)
workqueue: fix data race with the pwq->stats[] increment
workqueue: Rename rescuer kworker
workqueue: Make default affinity_scope dynamically updatable
workqueue: Add "Affinity Scopes and Performance" section to documentation
workqueue: Implement non-strict affinity scope for unbound workqueues
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue: Factor out need_more_worker() check and worker wake-up
workqueue: Factor out work to worker assignment and collision handling
workqueue: Add multiple affinity scopes and interface to select them
workqueue: Modularize wq_pod_type initialization
workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration
workqueue: Generalize unbound CPU pods
workqueue: Factor out clearing of workqueue-only attrs fields
workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()
workqueue: Initialize unbound CPU pods later in the boot
workqueue: Move wq_pod_init() below workqueue_init()
workqueue: Rename NUMA related names to use pod instead
workqueue: Rename workqueue_attrs->no_numa to ->ordered
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug
...

+1512 -804
+19 -9
Documentation/admin-guide/kernel-parameters.txt
···
 			disables both lockup detectors. Default is 10
 			seconds.
 
+	workqueue.unbound_cpus=
+			[KNL,SMP] Specify to constrain one or some CPUs
+			to use in unbound workqueues.
+			Format: <cpu-list>
+			By default, all online CPUs are available for
+			unbound workqueues.
+
 	workqueue.watchdog_thresh=
 			If CONFIG_WQ_WATCHDOG is configured, workqueue can
 			warn stall conditions and dump internal state to
···
 			threshold repeatedly. They are likely good
 			candidates for using WQ_UNBOUND workqueues instead.
 
-	workqueue.disable_numa
-			By default, all work items queued to unbound
-			workqueues are affine to the NUMA nodes they're
-			issued on, which results in better behavior in
-			general. If NUMA affinity needs to be disabled for
-			whatever reason, this option can be used. Note
-			that this also can be controlled per-workqueue for
-			workqueues visible under /sys/bus/workqueue/.
-
 	workqueue.power_efficient
 			Per-cpu workqueues are generally preferred because
 			they show better performance thanks to cache
···
 			The default value of this parameter is determined by
 			the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
+
+	workqueue.default_affinity_scope=
+			Select the default affinity scope to use for unbound
+			workqueues. Can be one of "cpu", "smt", "cache",
+			"numa" and "system". Default is "cache". For more
+			information, see the Affinity Scopes section in
+			Documentation/core-api/workqueue.rst.
+
+			This can be changed after boot by writing to the
+			matching /sys/module/workqueue/parameters file. All
+			workqueues with the "default" affinity scope will be
+			updated accordingly.
 
 	workqueue.debug_force_rr_cpu
 			Workqueue used to implicitly guarantee that work
+337 -19
Documentation/core-api/workqueue.rst
···
-====================================
-Concurrency Managed Workqueue (cmwq)
-====================================
+=========
+Workqueue
+=========
 
 :Date: September, 2010
 :Author: Tejun Heo <tj@kernel.org>
···
 When a new work item gets queued, the worker begins executing again.
 
 
-Why cmwq?
-=========
+Why Concurrency Managed Workqueue?
+==================================
 
 In the original wq implementation, a multi threaded (MT) wq had one
 worker thread per CPU and a single threaded (ST) wq had one worker
···
 ``max_active``
 --------------
 
-``@max_active`` determines the maximum number of execution contexts
-per CPU which can be assigned to the work items of a wq. For example,
-with ``@max_active`` of 16, at most 16 work items of the wq can be
-executing at the same time per CPU.
+``@max_active`` determines the maximum number of execution contexts per
+CPU which can be assigned to the work items of a wq. For example, with
+``@max_active`` of 16, at most 16 work items of the wq can be executing
+at the same time per CPU. This is always a per-CPU attribute, even for
+unbound workqueues.
 
-Currently, for a bound wq, the maximum limit for ``@max_active`` is
-512 and the default value used when 0 is specified is 256. For an
-unbound wq, the limit is higher of 512 and 4 *
-``num_possible_cpus()``. These values are chosen sufficiently high
-such that they are not the limiting factor while providing protection
-in runaway cases.
+The maximum limit for ``@max_active`` is 512 and the default value used
+when 0 is specified is 256. These values are chosen sufficiently high
+such that they are not the limiting factor while providing protection in
+runaway cases.
 
 The number of active work items of a wq is usually regulated by the
 users of the wq, more specifically, by how many work items the users
···
 level of locality in wq operations and work item execution.
 
 
+Affinity Scopes
+===============
+
+An unbound workqueue groups CPUs according to its affinity scope to improve
+cache locality. For example, if a workqueue is using the default affinity
+scope of "cache", it will group CPUs according to last level cache
+boundaries. A work item queued on the workqueue will be assigned to a worker
+on one of the CPUs which share the last level cache with the issuing CPU.
+Once started, the worker may or may not be allowed to move outside the scope
+depending on the ``affinity_strict`` setting of the scope.
+
+Workqueue currently supports the following affinity scopes.
+
+``default``
+  Use the scope in module parameter ``workqueue.default_affinity_scope``
+  which is always set to one of the scopes below.
+
+``cpu``
+  CPUs are not grouped. A work item issued on one CPU is processed by a
+  worker on the same CPU. This makes unbound workqueues behave as per-cpu
+  workqueues without concurrency management.
+
+``smt``
+  CPUs are grouped according to SMT boundaries. This usually means that the
+  logical threads of each physical CPU core are grouped together.
+
+``cache``
+  CPUs are grouped according to cache boundaries. Which specific cache
+  boundary is used is determined by the arch code. L3 is used in a lot of
+  cases. This is the default affinity scope.
+
+``numa``
+  CPUs are grouped according to NUMA boundaries.
+
+``system``
+  All CPUs are put in the same group. Workqueue makes no effort to process a
+  work item on a CPU close to the issuing CPU.
+
+The default affinity scope can be changed with the module parameter
+``workqueue.default_affinity_scope`` and a specific workqueue's affinity
+scope can be changed using ``apply_workqueue_attrs()``.
+
+If ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope
+related interface files under its ``/sys/devices/virtual/workqueue/WQ_NAME/``
+directory.
+
+``affinity_scope``
+  Read to see the current affinity scope. Write to change.
+
+  When default is the current scope, reading this file will also show the
+  current effective scope in parentheses, for example, ``default (cache)``.
+
+``affinity_strict``
+  0 by default indicating that affinity scopes are not strict. When a work
+  item starts execution, workqueue makes a best-effort attempt to ensure
+  that the worker is inside its affinity scope, which is called
+  repatriation. Once started, the scheduler is free to move the worker
+  anywhere in the system as it sees fit. This enables benefiting from scope
+  locality while still being able to utilize other CPUs if necessary and
+  available.
+
+  If set to 1, all workers of the scope are guaranteed always to be in the
+  scope. This may be useful when crossing affinity scopes has other
+  implications, for example, in terms of power consumption or workload
+  isolation. Strict NUMA scope can also be used to match the workqueue
+  behavior of older kernels.
+
+
+Affinity Scopes and Performance
+===============================
+
+It'd be ideal if an unbound workqueue's behavior is optimal for the vast
+majority of use cases without further tuning. Unfortunately, in the current
+kernel, there exists a pronounced trade-off between locality and utilization
+necessitating explicit configurations when workqueues are heavily used.
+
+Higher locality leads to higher efficiency where more work is performed for
+the same number of consumed CPU cycles. However, higher locality may also
+cause lower overall system utilization if the work items are not spread
+enough across the affinity scopes by the issuers. The following performance
+testing with dm-crypt clearly illustrates this trade-off.
+
+The tests are run on a CPU with 12-cores/24-threads split across four L3
+caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency.
+``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and
+opened with ``cryptsetup`` with default settings.
+
+
+Scenario 1: Enough issuers and work spread across the machine
+-------------------------------------------------------------
+
+The command used: ::
+
+  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
+    --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
+    --name=iops-test-job --verify=sha512
+
+There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
+makes ``fio`` generate and read back the content each time which makes
+execution locality matter between the issuer and ``kcryptd``. The following
+are the read bandwidths and CPU utilizations depending on different affinity
+scope settings on ``kcryptd`` measured over five runs. Bandwidths are in
+MiBps, and CPU util in percents.
+
+.. list-table::
+   :widths: 16 20 20
+   :header-rows: 1
+
+   * - Affinity
+     - Bandwidth (MiBps)
+     - CPU util (%)
+
+   * - system
+     - 1159.40 ±1.34
+     - 99.31 ±0.02
+
+   * - cache
+     - 1166.40 ±0.89
+     - 99.34 ±0.01
+
+   * - cache (strict)
+     - 1166.00 ±0.71
+     - 99.35 ±0.01
+
+With enough issuers spread across the system, there is no downside to
+"cache", strict or otherwise. All three configurations saturate the whole
+machine but the cache-affine ones outperform by 0.6% thanks to improved
+locality.
+
+
+Scenario 2: Fewer issuers, enough work for saturation
+-----------------------------------------------------
+
+The command used: ::
+
+  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
+    --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+The only difference from the previous scenario is ``--numjobs=8``. There are
+a third as many issuers but still enough total work to saturate the system.
+
+.. list-table::
+   :widths: 16 20 20
+   :header-rows: 1
+
+   * - Affinity
+     - Bandwidth (MiBps)
+     - CPU util (%)
+
+   * - system
+     - 1155.40 ±0.89
+     - 97.41 ±0.05
+
+   * - cache
+     - 1154.40 ±1.14
+     - 96.15 ±0.09
+
+   * - cache (strict)
+     - 1112.00 ±4.64
+     - 93.26 ±0.35
+
+This is more than enough work to saturate the system. Both "system" and
+"cache" are nearly saturating the machine but not fully. "cache" is using
+less CPU but the better efficiency puts it at the same bandwidth as
+"system".
+
+Eight issuers moving around over four L3 cache scopes still allow "cache
+(strict)" to mostly saturate the machine but the loss of work conservation
+is now starting to hurt with 3.7% bandwidth loss.
+
+
+Scenario 3: Even fewer issuers, not enough work to saturate
+-----------------------------------------------------------
+
+The command used: ::
+
+  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
+    --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+Again, the only difference is ``--numjobs=4``. With the number of issuers
+reduced to four, there now isn't enough work to saturate the whole system
+and the bandwidth becomes dependent on completion latencies.
+
+.. list-table::
+   :widths: 16 20 20
+   :header-rows: 1
+
+   * - Affinity
+     - Bandwidth (MiBps)
+     - CPU util (%)
+
+   * - system
+     - 993.60 ±1.82
+     - 75.49 ±0.06
+
+   * - cache
+     - 973.40 ±1.52
+     - 74.90 ±0.07
+
+   * - cache (strict)
+     - 828.20 ±4.49
+     - 66.84 ±0.29
+
+Now, the trade-off between locality and utilization is clearer. "cache"
+shows 2% bandwidth loss compared to "system" and "cache (strict)" a
+whopping 20%.
+
+
+Conclusion and Recommendations
+------------------------------
+
+In the above experiments, the efficiency advantage of the "cache" affinity
+scope over "system" is, while consistent and noticeable, small. However, the
+impact is dependent on the distances between the scopes and may be more
+pronounced in processors with more complex topologies.
+
+While the loss of work-conservation in certain scenarios hurts, it is a lot
+better than "cache (strict)" and maximizing workqueue utilization is
+unlikely to be the common case anyway. As such, "cache" is the default
+affinity scope for unbound pools.
+
+* As there is no one option which is great for most cases, workqueue usages
+  that may consume a significant amount of CPU are recommended to configure
+  the workqueues using ``apply_workqueue_attrs()`` and/or enable
+  ``WQ_SYSFS``.
+
+* An unbound workqueue with strict "cpu" affinity scope behaves the same as
+  a ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to
+  the latter and an unbound workqueue provides a lot more flexibility.
+
+* Affinity scopes are introduced in Linux v6.5. To emulate the previous
+  behavior, use strict "numa" affinity scope.
+
+* The loss of work-conservation in non-strict affinity scopes is likely
+  originating from the scheduler. There is no theoretical reason why the
+  kernel wouldn't be able to do the right thing and maintain
+  work-conservation in most cases. As such, it is possible that future
+  scheduler improvements may make most of these tunables unnecessary.
+
+
+Examining Configuration
+=======================
+
+Use tools/workqueue/wq_dump.py to examine unbound CPU affinity
+configuration, worker pools and how workqueues map to the pools: ::
+
+  $ tools/workqueue/wq_dump.py
+  Affinity Scopes
+  ===============
+  wq_unbound_cpumask=0000000f
+
+  CPU
+    nr_pods  4
+    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
+    pod_node [0]=0 [1]=0 [2]=1 [3]=1
+    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3
+
+  SMT
+    nr_pods  4
+    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
+    pod_node [0]=0 [1]=0 [2]=1 [3]=1
+    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3
+
+  CACHE (default)
+    nr_pods  2
+    pod_cpus [0]=00000003 [1]=0000000c
+    pod_node [0]=0 [1]=1
+    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1
+
+  NUMA
+    nr_pods  2
+    pod_cpus [0]=00000003 [1]=0000000c
+    pod_node [0]=0 [1]=1
+    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1
+
+  SYSTEM
+    nr_pods  1
+    pod_cpus [0]=0000000f
+    pod_node [0]=-1
+    cpu_pod  [0]=0 [1]=0 [2]=0 [3]=0
+
+  Worker Pools
+  ============
+  pool[00] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  0
+  pool[01] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  0
+  pool[02] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  1
+  pool[03] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  1
+  pool[04] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  2
+  pool[05] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  2
+  pool[06] ref= 1 nice=  0 idle/workers=  3/  3 cpu=  3
+  pool[07] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  3
+  pool[08] ref=42 nice=  0 idle/workers=  6/  6 cpus=0000000f
+  pool[09] ref=28 nice=  0 idle/workers=  3/  3 cpus=00000003
+  pool[10] ref=28 nice=  0 idle/workers= 17/ 17 cpus=0000000c
+  pool[11] ref= 1 nice=-20 idle/workers=  1/  1 cpus=0000000f
+  pool[12] ref= 2 nice=-20 idle/workers=  1/  1 cpus=00000003
+  pool[13] ref= 2 nice=-20 idle/workers=  1/  1 cpus=0000000c
+
+  Workqueue CPU -> pool
+  =====================
+  [    workqueue \ CPU              0  1  2  3 dfl]
+  events                   percpu   0  2  4  6
+  events_highpri           percpu   1  3  5  7
+  events_long              percpu   0  2  4  6
+  events_unbound           unbound  9  9 10 10  8
+  events_freezable         percpu   0  2  4  6
+  events_power_efficient   percpu   0  2  4  6
+  events_freezable_power_  percpu   0  2  4  6
+  rcu_gp                   percpu   0  2  4  6
+  rcu_par_gp               percpu   0  2  4  6
+  slub_flushwq             percpu   0  2  4  6
+  netns                    ordered  8  8  8  8  8
+  ...
+
+See the command's help message for more info.
+
+
 Monitoring
 ==========
 
 Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
 
   $ tools/workqueue/wq_monitor.py events
-                           total  infl CPUtime  CPUhog CMwake  mayday rescued
+                           total  infl CPUtime  CPUhog CMW/RPR mayday rescued
 events                     18545     0      6.1       0      5       -       -
 events_highpri                 8     0      0.0       0      0       -       -
 events_long                    3     0      0.0       0      0       -       -
-events_unbound             38306     0      0.1       -      -       -       -
+events_unbound             38306     0      0.1       -      7       -       -
 events_freezable               0     0      0.0       0      0       -       -
 events_power_efficient     29598     0      0.2       0      0       -       -
 events_freezable_power_       10     0      0.0       0      0       -       -
 sock_diag_events               0     0      0.0       0      0       -       -
 
-                           total  infl CPUtime  CPUhog CMwake  mayday rescued
+                           total  infl CPUtime  CPUhog CMW/RPR mayday rescued
 events                     18548     0      6.1       0      5       -       -
 events_highpri                 8     0      0.0       0      0       -       -
 events_long                    3     0      0.0       0      0       -       -
-events_unbound             38322     0      0.1       -      -       -       -
+events_unbound             38322     0      0.1       -      7       -       -
 events_freezable               0     0      0.0       0      0       -       -
 events_power_efficient     29603     0      0.2       0      0       -       -
 events_freezable_power_       10     0      0.0       0      0       -       -
+63 -52
include/linux/workqueue.h
···
 	struct workqueue_struct *wq;
 };
 
+enum wq_affn_scope {
+	WQ_AFFN_DFL,			/* use system default */
+	WQ_AFFN_CPU,			/* one pod per CPU */
+	WQ_AFFN_SMT,			/* one pod per SMT */
+	WQ_AFFN_CACHE,			/* one pod per LLC */
+	WQ_AFFN_NUMA,			/* one pod per NUMA node */
+	WQ_AFFN_SYSTEM,			/* one pod across the whole system */
+
+	WQ_AFFN_NR_TYPES,
+};
+
 /**
  * struct workqueue_attrs - A struct for workqueue attributes.
  *
···
 
 	/**
 	 * @cpumask: allowed CPUs
+	 *
+	 * Work items in this workqueue are affine to these CPUs and not allowed
+	 * to execute on other CPUs. A pool serving a workqueue must have the
+	 * same @cpumask.
 	 */
 	cpumask_var_t cpumask;
 
 	/**
-	 * @no_numa: disable NUMA affinity
+	 * @__pod_cpumask: internal attribute used to create per-pod pools
 	 *
-	 * Unlike other fields, ``no_numa`` isn't a property of a worker_pool. It
-	 * only modifies how :c:func:`apply_workqueue_attrs` select pools and thus
-	 * doesn't participate in pool hash calculations or equality comparisons.
+	 * Internal use only.
+	 *
+	 * Per-pod unbound worker pools are used to improve locality. Always a
+	 * subset of ->cpumask. A workqueue can be associated with multiple
+	 * worker pools with disjoint @__pod_cpumask's. Whether the enforcement
+	 * of a pool's @__pod_cpumask is strict depends on @affn_strict.
 	 */
-	bool no_numa;
+	cpumask_var_t __pod_cpumask;
+
+	/**
+	 * @affn_strict: affinity scope is strict
+	 *
+	 * If clear, workqueue will make a best-effort attempt at starting the
+	 * worker inside @__pod_cpumask but the scheduler is free to migrate it
+	 * outside.
+	 *
+	 * If set, workers are only allowed to run inside @__pod_cpumask.
+	 */
+	bool affn_strict;
+
+	/*
+	 * Below fields aren't properties of a worker_pool. They only modify how
+	 * :c:func:`apply_workqueue_attrs` select pools and thus don't
+	 * participate in pool hash calculations or equality comparisons.
+	 */
+
+	/**
+	 * @affn_scope: unbound CPU affinity scope
+	 *
+	 * CPU pods are used to improve execution locality of unbound work
+	 * items. There are multiple pod types, one for each wq_affn_scope, and
+	 * every CPU in the system belongs to one pod in every pod type. CPUs
+	 * that belong to the same pod share the worker pool. For example,
+	 * selecting %WQ_AFFN_NUMA makes the workqueue use a separate worker
+	 * pool for each NUMA node.
+	 */
+	enum wq_affn_scope affn_scope;
+
+	/**
+	 * @ordered: work items must be executed one by one in queueing order
+	 */
+	bool ordered;
 };
 
 static inline struct delayed_work *to_delayed_work(struct work_struct *work)
···
 	__WQ_ORDERED_EXPLICIT	= 1 << 19, /* internal: alloc_ordered_workqueue() */
 
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
-	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
+	WQ_UNBOUND_MAX_ACTIVE	= WQ_MAX_ACTIVE,
 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
 };
-
-/* unbound wq's aren't per-cpu, scale max_active according to #cpus */
-#define WQ_UNBOUND_MAX_ACTIVE	\
-	max_t(int, WQ_MAX_ACTIVE, num_possible_cpus() * WQ_MAX_UNBOUND_PER_CPU)
 
 /*
  * System-wide workqueues which are always present.
···
 * alloc_workqueue - allocate a workqueue
 * @fmt: printf format for the name of the workqueue
 * @flags: WQ_* flags
- * @max_active: max in-flight work items, 0 for default
+ * @max_active: max in-flight work items per CPU, 0 for default
 * remaining args: args for @fmt
 *
 * Allocate a workqueue with the specified parameters. For detailed
···
 
 /*
  * Detect attempt to flush system-wide workqueues at compile time when possible.
+ * Warn attempt to flush system-wide workqueues at runtime.
  *
  * See https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp
  * for reasons and steps for converting system-wide workqueues into local workqueues.
···
 extern void __warn_flushing_systemwide_wq(void)
 	__compiletime_warning("Please avoid flushing system-wide workqueues.");
 
-/**
- * flush_scheduled_work - ensure that any scheduled work has run to completion.
- *
- * Forces execution of the kernel-global workqueue and blocks until its
- * completion.
- *
- * It's very easy to get into trouble if you don't take great care.
- * Either of the following situations will lead to deadlock:
- *
- *	One of the work items currently on the workqueue needs to acquire
- *	a lock held by your code or its caller.
- *
- *	Your code is running in the context of a work routine.
- *
- * They will be detected by lockdep when they occur, but the first might not
- * occur very often. It depends on what work items are on the workqueue and
- * what locks they need, which you have no control over.
- *
- * In most situations flushing the entire workqueue is overkill; you merely
- * need to know that a particular work item isn't queued and isn't running.
- * In such cases you should use cancel_delayed_work_sync() or
- * cancel_work_sync() instead.
- *
- * Please stop calling this function! A conversion to stop flushing system-wide
- * workqueues is in progress. This function will be removed after all in-tree
- * users stopped calling this function.
- */
-/*
- * The background of commit 771c035372a036f8 ("deprecate the
- * '__deprecated' attribute warnings entirely and for good") is that,
- * since Linus builds all modules between every single pull he does,
- * the standard kernel build needs to be _clean_ in order to be able to
- * notice when new problems happen. Therefore, don't emit warning while
- * there are in-tree users.
- */
+/* Please stop using this function, for this function will be removed in near future. */
 #define flush_scheduled_work()						\
 ({									\
-	if (0)								\
-		__warn_flushing_systemwide_wq();			\
+	__warn_flushing_systemwide_wq();				\
 	__flush_workqueue(system_wq);					\
 })
 
-/*
- * Although there is no longer in-tree caller, for now just emit warning
- * in order to give out-of-tree callers time to update.
- */
 #define flush_workqueue(wq)						\
 ({									\
 	struct workqueue_struct *_wq = (wq);				\
···
 
 void __init workqueue_init_early(void);
 void __init workqueue_init(void);
+void __init workqueue_init_topology(void);
 
 #endif
+1
init/main.c
···
 	smp_init();
 	sched_init_smp();
 
+	workqueue_init_topology();
 	padata_init();
 	page_alloc_init_late();
+900 -716
kernel/workqueue.c
···
  *
  * L: pool->lock protected. Access with pool->lock held.
  *
- * X: During normal operation, modification requires pool->lock and should
- *    be done only from local cpu. Either disabling preemption on local
- *    cpu or grabbing pool->lock is enough for read access. If
- *    POOL_DISASSOCIATED is set, it's identical to L.
- *
  * K: Only modified by worker while holding pool->lock. Can be safely read by
  *    self, while holding pool->lock or from IRQ context if %current is the
  *    kworker.
···
 	int			cpu;		/* I: the associated cpu */
 	int			node;		/* I: the associated node ID */
 	int			id;		/* I: pool ID */
-	unsigned int		flags;		/* X: flags */
+	unsigned int		flags;		/* L: flags */
 
 	unsigned long		watchdog_ts;	/* L: watchdog timestamp */
 	bool			cpu_stall;	/* WD: stalled cpu bound pool */
···
 	PWQ_STAT_CPU_TIME,	/* total CPU time consumed */
 	PWQ_STAT_CPU_INTENSIVE,	/* wq_cpu_intensive_thresh_us violations */
 	PWQ_STAT_CM_WAKEUP,	/* concurrency-management worker wakeups */
+	PWQ_STAT_REPATRIATED,	/* unbound workers brought back into scope */
 	PWQ_STAT_MAYDAY,	/* maydays to rescuer */
 	PWQ_STAT_RESCUED,	/* linked work items executed by rescuer */
···
 	u64			stats[PWQ_NR_STATS];
 
 	/*
-	 * Release of unbound pwq is punted to system_wq. See put_pwq()
-	 * and pwq_unbound_release_workfn() for details. pool_workqueue
-	 * itself is also RCU protected so that the first pwq can be
-	 * determined without grabbing wq->mutex.
+	 * Release of unbound pwq is punted to a kthread_worker. See put_pwq()
+	 * and pwq_release_workfn() for details. pool_workqueue itself is also
+	 * RCU protected so that the first pwq can be determined without
+	 * grabbing wq->mutex.
 	 */
-	struct work_struct	unbound_release_work;
+	struct kthread_work	release_work;
 	struct rcu_head		rcu;
 } __aligned(1 << WORK_STRUCT_FLAG_BITS);
···
 
 	/* hot fields used during command issue, aligned to cacheline */
 	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
-	struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
-	struct pool_workqueue __rcu *numa_pwq_tbl[]; /* PWR: unbound pwqs indexed by node */
+	struct pool_workqueue __percpu __rcu **cpu_pwq; /* I: per-cpu pwqs */
 };
 
 static struct kmem_cache *pwq_cache;
 
-static cpumask_var_t *wq_numa_possible_cpumask;
-					/* possible CPUs of each node */
+/*
+ * Each pod type describes how CPUs should be grouped for unbound workqueues.
+ * See the comment above workqueue_attrs->affn_scope.
+ */
+struct wq_pod_type {
+	int			nr_pods;	/* number of pods */
+	cpumask_var_t		*pod_cpus;	/* pod -> cpus */
+	int			*pod_node;	/* pod -> node */
+	int			*cpu_pod;	/* cpu -> pod */
+};
+
+static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
+static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE;
+
+static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
+	[WQ_AFFN_DFL]		= "default",
+	[WQ_AFFN_CPU]		= "cpu",
+	[WQ_AFFN_SMT]		= "smt",
+	[WQ_AFFN_CACHE]		= "cache",
+	[WQ_AFFN_NUMA]		= "numa",
+	[WQ_AFFN_SYSTEM]	= "system",
+};
 
 /*
  * Per-cpu work items which run for longer than the following threshold are
···
 static unsigned long wq_cpu_intensive_thresh_us = ULONG_MAX;
 module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
 
-static bool wq_disable_numa;
-module_param_named(disable_numa, wq_disable_numa, bool, 0444);
-
 /* see the comment above the definition of WQ_POWER_EFFICIENT */
 static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
 module_param_named(power_efficient, wq_power_efficient, bool, 0444);
 
 static bool wq_online;			/* can kworkers be created yet? */
 
-static bool wq_numa_enabled;		/* unbound NUMA affinity enabled */
-
-/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
-static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
+/* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
+static struct workqueue_attrs *wq_update_pod_attrs_buf;
 
 static DEFINE_MUTEX(wq_pool_mutex);	/* protects pools and workqueues list */
 static DEFINE_MUTEX(wq_pool_attach_mutex); /* protects worker attach/detach */
···
 
 /* PL&A: allowable cpus for unbound wqs and work items */
 static cpumask_var_t wq_unbound_cpumask;
+
+/* to further constrain wq_unbound_cpumask by the cmdline parameter */
+static struct cpumask wq_cmdline_cpumask __initdata;
 
 /* CPU where unbound work was last round robin scheduled from this CPU */
 static DEFINE_PER_CPU(int, wq_rr_cpu_last);
···
 
 /* I: attributes used when instantiating ordered pools on demand */
 static struct workqueue_attrs *ordered_wq_attrs[NR_STD_WORKER_POOLS];
+
+/*
+ * I: kthread_worker to release pwq's. pwq release needs to be bounced to a
+ * process context while holding a pool lock. Bounce to a dedicated kthread
+ * worker to avoid A-A deadlocks.
+ */
+static struct kthread_worker *pwq_release_worker;
 
 struct workqueue_struct *system_wq __read_mostly;
 EXPORT_SYMBOL(system_wq);
···
 	return ret;
 }
 
-/**
- * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
- * @wq: the target workqueue
- * @node: the node ID
- *
- * This must be called with any of wq_pool_mutex, wq->mutex or RCU
- * read locked.
- * If the pwq needs to be used beyond the locking in effect, the caller is
- * responsible for guaranteeing that the pwq stays online.
- *
- * Return: The unbound pool_workqueue for @node.
- */
-static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
-						  int node)
-{
-	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
-
-	/*
-	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
-	 * delayed item is pending. The plan is to keep CPU -> NODE
-	 * mapping valid and stable across CPU on/offlines. Once that
-	 * happens, this workaround can be removed.
-	 */
-	if (unlikely(node == NUMA_NO_NODE))
-		return wq->dfl_pwq;
-
-	return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
-}
-
 static unsigned int work_color_to_flags(int color)
 {
 	return color << WORK_STRUCT_COLOR_SHIFT;
···
  * they're being called with pool->lock held.
  */
 
-static bool __need_more_worker(struct worker_pool *pool)
-{
-	return !pool->nr_running;
-}
-
 /*
  * Need to wake up a worker? Called from anything but currently
  * running workers.
···
  */
 static bool need_more_worker(struct worker_pool *pool)
 {
-	return !list_empty(&pool->worklist) && __need_more_worker(pool);
+	return !list_empty(&pool->worklist) && !pool->nr_running;
 }
 
 /* Can I start working? Called from busy but !running workers. */
···
 	return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
 }
 
-/*
- * Wake up functions.
- */
-
-/* Return the first idle worker. Called with pool->lock held. */
-static struct worker *first_idle_worker(struct worker_pool *pool)
-{
-	if (unlikely(list_empty(&pool->idle_list)))
-		return NULL;
-
-	return list_first_entry(&pool->idle_list, struct worker, entry);
-}
-
-/**
- * wake_up_worker - wake up an idle worker
- * @pool: worker pool to wake worker from
- *
- * Wake up the first idle worker of @pool.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock).
- */
-static void wake_up_worker(struct worker_pool *pool)
-{
-	struct worker *worker = first_idle_worker(pool);
-
-	if (likely(worker))
-		wake_up_process(worker->task);
-}
-
 /**
  * worker_set_flags - set worker flags and adjust nr_running accordingly
  * @worker: self
  * @flags: flags to set
  *
  * Set @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
  */
 static inline void worker_set_flags(struct worker *worker, unsigned int flags)
 {
 	struct worker_pool *pool = worker->pool;
 
-	WARN_ON_ONCE(worker->task != current);
+	lockdep_assert_held(&pool->lock);
 
 	/* If transitioning into NOT_RUNNING, adjust nr_running. */
 	if ((flags & WORKER_NOT_RUNNING) &&
···
  * @flags: flags to clear
  *
  * Clear @flags in @worker->flags and adjust nr_running accordingly.
888 - * 889 - * CONTEXT: 890 - * raw_spin_lock_irq(pool->lock) 891 935 */ 892 936 static inline void worker_clr_flags(struct worker *worker, unsigned int flags) 893 937 { 894 938 struct worker_pool *pool = worker->pool; 895 939 unsigned int oflags = worker->flags; 896 940 897 - WARN_ON_ONCE(worker->task != current); 941 + lockdep_assert_held(&pool->lock); 898 942 899 943 worker->flags &= ~flags; 900 944 ··· 903 953 if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING)) 904 954 if (!(worker->flags & WORKER_NOT_RUNNING)) 905 955 pool->nr_running++; 956 + } 957 + 958 + /* Return the first idle worker. Called with pool->lock held. */ 959 + static struct worker *first_idle_worker(struct worker_pool *pool) 960 + { 961 + if (unlikely(list_empty(&pool->idle_list))) 962 + return NULL; 963 + 964 + return list_first_entry(&pool->idle_list, struct worker, entry); 965 + } 966 + 967 + /** 968 + * worker_enter_idle - enter idle state 969 + * @worker: worker which is entering idle state 970 + * 971 + * @worker is entering idle state. Update stats and idle timer if 972 + * necessary. 973 + * 974 + * LOCKING: 975 + * raw_spin_lock_irq(pool->lock). 976 + */ 977 + static void worker_enter_idle(struct worker *worker) 978 + { 979 + struct worker_pool *pool = worker->pool; 980 + 981 + if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) || 982 + WARN_ON_ONCE(!list_empty(&worker->entry) && 983 + (worker->hentry.next || worker->hentry.pprev))) 984 + return; 985 + 986 + /* can't use worker_set_flags(), also called from create_worker() */ 987 + worker->flags |= WORKER_IDLE; 988 + pool->nr_idle++; 989 + worker->last_active = jiffies; 990 + 991 + /* idle_list is LIFO */ 992 + list_add(&worker->entry, &pool->idle_list); 993 + 994 + if (too_many_workers(pool) && !timer_pending(&pool->idle_timer)) 995 + mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT); 996 + 997 + /* Sanity check nr_running. 
*/ 998 + WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running); 999 + } 1000 + 1001 + /** 1002 + * worker_leave_idle - leave idle state 1003 + * @worker: worker which is leaving idle state 1004 + * 1005 + * @worker is leaving idle state. Update stats. 1006 + * 1007 + * LOCKING: 1008 + * raw_spin_lock_irq(pool->lock). 1009 + */ 1010 + static void worker_leave_idle(struct worker *worker) 1011 + { 1012 + struct worker_pool *pool = worker->pool; 1013 + 1014 + if (WARN_ON_ONCE(!(worker->flags & WORKER_IDLE))) 1015 + return; 1016 + worker_clr_flags(worker, WORKER_IDLE); 1017 + pool->nr_idle--; 1018 + list_del_init(&worker->entry); 1019 + } 1020 + 1021 + /** 1022 + * find_worker_executing_work - find worker which is executing a work 1023 + * @pool: pool of interest 1024 + * @work: work to find worker for 1025 + * 1026 + * Find a worker which is executing @work on @pool by searching 1027 + * @pool->busy_hash which is keyed by the address of @work. For a worker 1028 + * to match, its current execution should match the address of @work and 1029 + * its work function. This is to avoid unwanted dependency between 1030 + * unrelated work executions through a work item being recycled while still 1031 + * being executed. 1032 + * 1033 + * This is a bit tricky. A work item may be freed once its execution 1034 + * starts and nothing prevents the freed area from being recycled for 1035 + * another work item. If the same work item address ends up being reused 1036 + * before the original execution finishes, workqueue will identify the 1037 + * recycled work item as currently executing and make it wait until the 1038 + * current execution finishes, introducing an unwanted dependency. 1039 + * 1040 + * This function checks the work item address and work function to avoid 1041 + * false positives. Note that this isn't complete as one may construct a 1042 + * work function which can introduce dependency onto itself through a 1043 + * recycled work item. 
Well, if somebody wants to shoot oneself in the 1044 + * foot that badly, there's only so much we can do, and if such deadlock 1045 + * actually occurs, it should be easy to locate the culprit work function. 1046 + * 1047 + * CONTEXT: 1048 + * raw_spin_lock_irq(pool->lock). 1049 + * 1050 + * Return: 1051 + * Pointer to worker which is executing @work if found, %NULL 1052 + * otherwise. 1053 + */ 1054 + static struct worker *find_worker_executing_work(struct worker_pool *pool, 1055 + struct work_struct *work) 1056 + { 1057 + struct worker *worker; 1058 + 1059 + hash_for_each_possible(pool->busy_hash, worker, hentry, 1060 + (unsigned long)work) 1061 + if (worker->current_work == work && 1062 + worker->current_func == work->func) 1063 + return worker; 1064 + 1065 + return NULL; 1066 + } 1067 + 1068 + /** 1069 + * move_linked_works - move linked works to a list 1070 + * @work: start of series of works to be scheduled 1071 + * @head: target list to append @work to 1072 + * @nextp: out parameter for nested worklist walking 1073 + * 1074 + * Schedule linked works starting from @work to @head. Work series to be 1075 + * scheduled starts at @work and includes any consecutive work with 1076 + * WORK_STRUCT_LINKED set in its predecessor. See assign_work() for details on 1077 + * @nextp. 1078 + * 1079 + * CONTEXT: 1080 + * raw_spin_lock_irq(pool->lock). 1081 + */ 1082 + static void move_linked_works(struct work_struct *work, struct list_head *head, 1083 + struct work_struct **nextp) 1084 + { 1085 + struct work_struct *n; 1086 + 1087 + /* 1088 + * Linked worklist will always end before the end of the list, 1089 + * use NULL for list head. 
1090 + */ 1091 + list_for_each_entry_safe_from(work, n, NULL, entry) { 1092 + list_move_tail(&work->entry, head); 1093 + if (!(*work_data_bits(work) & WORK_STRUCT_LINKED)) 1094 + break; 1095 + } 1096 + 1097 + /* 1098 + * If we're already inside safe list traversal and have moved 1099 + * multiple works to the scheduled queue, the next position 1100 + * needs to be updated. 1101 + */ 1102 + if (nextp) 1103 + *nextp = n; 1104 + } 1105 + 1106 + /** 1107 + * assign_work - assign a work item and its linked work items to a worker 1108 + * @work: work to assign 1109 + * @worker: worker to assign to 1110 + * @nextp: out parameter for nested worklist walking 1111 + * 1112 + * Assign @work and its linked work items to @worker. If @work is already being 1113 + * executed by another worker in the same pool, it'll be punted there. 1114 + * 1115 + * If @nextp is not NULL, it's updated to point to the next work of the last 1116 + * scheduled work. This allows assign_work() to be nested inside 1117 + * list_for_each_entry_safe(). 1118 + * 1119 + * Returns %true if @work was successfully assigned to @worker. %false if @work 1120 + * was punted to another worker already executing it. 1121 + */ 1122 + static bool assign_work(struct work_struct *work, struct worker *worker, 1123 + struct work_struct **nextp) 1124 + { 1125 + struct worker_pool *pool = worker->pool; 1126 + struct worker *collision; 1127 + 1128 + lockdep_assert_held(&pool->lock); 1129 + 1130 + /* 1131 + * A single work shouldn't be executed concurrently by multiple workers. 1132 + * __queue_work() ensures that @work doesn't jump to a different pool 1133 + * while still running in the previous pool. Here, we should ensure that 1134 + * @work is not executed concurrently by multiple workers from the same 1135 + * pool. Check whether anyone is already processing the work. If so, 1136 + * defer the work to the currently executing one. 
1137 + */ 1138 + collision = find_worker_executing_work(pool, work); 1139 + if (unlikely(collision)) { 1140 + move_linked_works(work, &collision->scheduled, nextp); 1141 + return false; 1142 + } 1143 + 1144 + move_linked_works(work, &worker->scheduled, nextp); 1145 + return true; 1146 + } 1147 + 1148 + /** 1149 + * kick_pool - wake up an idle worker if necessary 1150 + * @pool: pool to kick 1151 + * 1152 + * @pool may have pending work items. Wake up worker if necessary. Returns 1153 + * whether a worker was woken up. 1154 + */ 1155 + static bool kick_pool(struct worker_pool *pool) 1156 + { 1157 + struct worker *worker = first_idle_worker(pool); 1158 + struct task_struct *p; 1159 + 1160 + lockdep_assert_held(&pool->lock); 1161 + 1162 + if (!need_more_worker(pool) || !worker) 1163 + return false; 1164 + 1165 + p = worker->task; 1166 + 1167 + #ifdef CONFIG_SMP 1168 + /* 1169 + * Idle @worker is about to execute @work and waking up provides an 1170 + * opportunity to migrate @worker at a lower cost by setting the task's 1171 + * wake_cpu field. Let's see if we want to move @worker to improve 1172 + * execution locality. 1173 + * 1174 + * We're waking the worker that went idle the latest and there's some 1175 + * chance that @worker is marked idle but hasn't gone off CPU yet. If 1176 + * so, setting the wake_cpu won't do anything. As this is a best-effort 1177 + * optimization and the race window is narrow, let's leave as-is for 1178 + * now. If this becomes pronounced, we can skip over workers which are 1179 + * still on cpu when picking an idle worker. 1180 + * 1181 + * If @pool has non-strict affinity, @worker might have ended up outside 1182 + * its affinity scope. Repatriate. 
1183 + */ 1184 + if (!pool->attrs->affn_strict && 1185 + !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) { 1186 + struct work_struct *work = list_first_entry(&pool->worklist, 1187 + struct work_struct, entry); 1188 + p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask); 1189 + get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++; 1190 + } 1191 + #endif 1192 + wake_up_process(p); 1193 + return true; 906 1194 } 907 1195 908 1196 #ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT ··· 1308 1120 } 1309 1121 1310 1122 pool->nr_running--; 1311 - if (need_more_worker(pool)) { 1123 + if (kick_pool(pool)) 1312 1124 worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++; 1313 - wake_up_worker(pool); 1314 - } 1125 + 1315 1126 raw_spin_unlock_irq(&pool->lock); 1316 1127 } 1317 1128 ··· 1358 1171 wq_cpu_intensive_report(worker->current_func); 1359 1172 pwq->stats[PWQ_STAT_CPU_INTENSIVE]++; 1360 1173 1361 - if (need_more_worker(pool)) { 1174 + if (kick_pool(pool)) 1362 1175 pwq->stats[PWQ_STAT_CM_WAKEUP]++; 1363 - wake_up_worker(pool); 1364 - } 1365 1176 1366 1177 raw_spin_unlock(&pool->lock); 1367 1178 } ··· 1396 1211 } 1397 1212 1398 1213 /** 1399 - * find_worker_executing_work - find worker which is executing a work 1400 - * @pool: pool of interest 1401 - * @work: work to find worker for 1402 - * 1403 - * Find a worker which is executing @work on @pool by searching 1404 - * @pool->busy_hash which is keyed by the address of @work. For a worker 1405 - * to match, its current execution should match the address of @work and 1406 - * its work function. This is to avoid unwanted dependency between 1407 - * unrelated work executions through a work item being recycled while still 1408 - * being executed. 1409 - * 1410 - * This is a bit tricky. A work item may be freed once its execution 1411 - * starts and nothing prevents the freed area from being recycled for 1412 - * another work item. 
If the same work item address ends up being reused 1413 - * before the original execution finishes, workqueue will identify the 1414 - * recycled work item as currently executing and make it wait until the 1415 - * current execution finishes, introducing an unwanted dependency. 1416 - * 1417 - * This function checks the work item address and work function to avoid 1418 - * false positives. Note that this isn't complete as one may construct a 1419 - * work function which can introduce dependency onto itself through a 1420 - * recycled work item. Well, if somebody wants to shoot oneself in the 1421 - * foot that badly, there's only so much we can do, and if such deadlock 1422 - * actually occurs, it should be easy to locate the culprit work function. 1423 - * 1424 - * CONTEXT: 1425 - * raw_spin_lock_irq(pool->lock). 1426 - * 1427 - * Return: 1428 - * Pointer to worker which is executing @work if found, %NULL 1429 - * otherwise. 1430 - */ 1431 - static struct worker *find_worker_executing_work(struct worker_pool *pool, 1432 - struct work_struct *work) 1433 - { 1434 - struct worker *worker; 1435 - 1436 - hash_for_each_possible(pool->busy_hash, worker, hentry, 1437 - (unsigned long)work) 1438 - if (worker->current_work == work && 1439 - worker->current_func == work->func) 1440 - return worker; 1441 - 1442 - return NULL; 1443 - } 1444 - 1445 - /** 1446 - * move_linked_works - move linked works to a list 1447 - * @work: start of series of works to be scheduled 1448 - * @head: target list to append @work to 1449 - * @nextp: out parameter for nested worklist walking 1450 - * 1451 - * Schedule linked works starting from @work to @head. Work series to 1452 - * be scheduled starts at @work and includes any consecutive work with 1453 - * WORK_STRUCT_LINKED set in its predecessor. 1454 - * 1455 - * If @nextp is not NULL, it's updated to point to the next work of 1456 - * the last scheduled work. 
This allows move_linked_works() to be 1457 - * nested inside outer list_for_each_entry_safe(). 1458 - * 1459 - * CONTEXT: 1460 - * raw_spin_lock_irq(pool->lock). 1461 - */ 1462 - static void move_linked_works(struct work_struct *work, struct list_head *head, 1463 - struct work_struct **nextp) 1464 - { 1465 - struct work_struct *n; 1466 - 1467 - /* 1468 - * Linked worklist will always end before the end of the list, 1469 - * use NULL for list head. 1470 - */ 1471 - list_for_each_entry_safe_from(work, n, NULL, entry) { 1472 - list_move_tail(&work->entry, head); 1473 - if (!(*work_data_bits(work) & WORK_STRUCT_LINKED)) 1474 - break; 1475 - } 1476 - 1477 - /* 1478 - * If we're already inside safe list traversal and have moved 1479 - * multiple works to the scheduled queue, the next position 1480 - * needs to be updated. 1481 - */ 1482 - if (nextp) 1483 - *nextp = n; 1484 - } 1485 - 1486 - /** 1487 1214 * get_pwq - get an extra reference on the specified pool_workqueue 1488 1215 * @pwq: pool_workqueue to get 1489 1216 * ··· 1421 1324 lockdep_assert_held(&pwq->pool->lock); 1422 1325 if (likely(--pwq->refcnt)) 1423 1326 return; 1424 - if (WARN_ON_ONCE(!(pwq->wq->flags & WQ_UNBOUND))) 1425 - return; 1426 1327 /* 1427 - * @pwq can't be released under pool->lock, bounce to 1428 - * pwq_unbound_release_workfn(). This never recurses on the same 1429 - * pool->lock as this path is taken only for unbound workqueues and 1430 - * the release work item is scheduled on a per-cpu workqueue. To 1431 - * avoid lockdep warning, unbound pool->locks are given lockdep 1432 - * subclass of 1 in get_unbound_pool(). 1328 + * @pwq can't be released under pool->lock, bounce to a dedicated 1329 + * kthread_worker to avoid A-A deadlocks. 
1433 1330 */ 1434 - schedule_work(&pwq->unbound_release_work); 1331 + kthread_queue_work(pwq_release_worker, &pwq->release_work); 1435 1332 } 1436 1333 1437 1334 /** ··· 1641 1550 static void insert_work(struct pool_workqueue *pwq, struct work_struct *work, 1642 1551 struct list_head *head, unsigned int extra_flags) 1643 1552 { 1644 - struct worker_pool *pool = pwq->pool; 1553 + debug_work_activate(work); 1645 1554 1646 1555 /* record the work call stack in order to print it in KASAN reports */ 1647 1556 kasan_record_aux_stack_noalloc(work); ··· 1650 1559 set_work_pwq(work, pwq, extra_flags); 1651 1560 list_add_tail(&work->entry, head); 1652 1561 get_pwq(pwq); 1653 - 1654 - if (__need_more_worker(pool)) 1655 - wake_up_worker(pool); 1656 1562 } 1657 1563 1658 1564 /* ··· 1703 1615 struct work_struct *work) 1704 1616 { 1705 1617 struct pool_workqueue *pwq; 1706 - struct worker_pool *last_pool; 1707 - struct list_head *worklist; 1618 + struct worker_pool *last_pool, *pool; 1708 1619 unsigned int work_flags; 1709 1620 unsigned int req_cpu = cpu; 1710 1621 ··· 1727 1640 rcu_read_lock(); 1728 1641 retry: 1729 1642 /* pwq which will be used unless @work is executing elsewhere */ 1730 - if (wq->flags & WQ_UNBOUND) { 1731 - if (req_cpu == WORK_CPU_UNBOUND) 1643 + if (req_cpu == WORK_CPU_UNBOUND) { 1644 + if (wq->flags & WQ_UNBOUND) 1732 1645 cpu = wq_select_unbound_cpu(raw_smp_processor_id()); 1733 - pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); 1734 - } else { 1735 - if (req_cpu == WORK_CPU_UNBOUND) 1646 + else 1736 1647 cpu = raw_smp_processor_id(); 1737 - pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); 1738 1648 } 1649 + 1650 + pwq = rcu_dereference(*per_cpu_ptr(wq->cpu_pwq, cpu)); 1651 + pool = pwq->pool; 1739 1652 1740 1653 /* 1741 1654 * If @work was previously on a different pool, it might still be ··· 1743 1656 * pool to guarantee non-reentrancy. 
1744 1657 */ 1745 1658 last_pool = get_work_pool(work); 1746 - if (last_pool && last_pool != pwq->pool) { 1659 + if (last_pool && last_pool != pool) { 1747 1660 struct worker *worker; 1748 1661 1749 1662 raw_spin_lock(&last_pool->lock); ··· 1752 1665 1753 1666 if (worker && worker->current_pwq->wq == wq) { 1754 1667 pwq = worker->current_pwq; 1668 + pool = pwq->pool; 1669 + WARN_ON_ONCE(pool != last_pool); 1755 1670 } else { 1756 1671 /* meh... not running there, queue here */ 1757 1672 raw_spin_unlock(&last_pool->lock); 1758 - raw_spin_lock(&pwq->pool->lock); 1673 + raw_spin_lock(&pool->lock); 1759 1674 } 1760 1675 } else { 1761 - raw_spin_lock(&pwq->pool->lock); 1676 + raw_spin_lock(&pool->lock); 1762 1677 } 1763 1678 1764 1679 /* 1765 - * pwq is determined and locked. For unbound pools, we could have 1766 - * raced with pwq release and it could already be dead. If its 1767 - * refcnt is zero, repeat pwq selection. Note that pwqs never die 1768 - * without another pwq replacing it in the numa_pwq_tbl or while 1769 - * work items are executing on it, so the retrying is guaranteed to 1770 - * make forward-progress. 1680 + * pwq is determined and locked. For unbound pools, we could have raced 1681 + * with pwq release and it could already be dead. If its refcnt is zero, 1682 + * repeat pwq selection. Note that unbound pwqs never die without 1683 + * another pwq replacing it in cpu_pwq or while work items are executing 1684 + * on it, so the retrying is guaranteed to make forward-progress. 
1771 1685 */ 1772 1686 if (unlikely(!pwq->refcnt)) { 1773 1687 if (wq->flags & WQ_UNBOUND) { 1774 - raw_spin_unlock(&pwq->pool->lock); 1688 + raw_spin_unlock(&pool->lock); 1775 1689 cpu_relax(); 1776 1690 goto retry; 1777 1691 } ··· 1791 1703 work_flags = work_color_to_flags(pwq->work_color); 1792 1704 1793 1705 if (likely(pwq->nr_active < pwq->max_active)) { 1706 + if (list_empty(&pool->worklist)) 1707 + pool->watchdog_ts = jiffies; 1708 + 1794 1709 trace_workqueue_activate_work(work); 1795 1710 pwq->nr_active++; 1796 - worklist = &pwq->pool->worklist; 1797 - if (list_empty(worklist)) 1798 - pwq->pool->watchdog_ts = jiffies; 1711 + insert_work(pwq, work, &pool->worklist, work_flags); 1712 + kick_pool(pool); 1799 1713 } else { 1800 1714 work_flags |= WORK_STRUCT_INACTIVE; 1801 - worklist = &pwq->inactive_works; 1715 + insert_work(pwq, work, &pwq->inactive_works, work_flags); 1802 1716 } 1803 1717 1804 - debug_work_activate(work); 1805 - insert_work(pwq, work, worklist, work_flags); 1806 - 1807 1718 out: 1808 - raw_spin_unlock(&pwq->pool->lock); 1719 + raw_spin_unlock(&pool->lock); 1809 1720 rcu_read_unlock(); 1810 1721 } 1811 1722 ··· 1841 1754 EXPORT_SYMBOL(queue_work_on); 1842 1755 1843 1756 /** 1844 - * workqueue_select_cpu_near - Select a CPU based on NUMA node 1757 + * select_numa_node_cpu - Select a CPU based on NUMA node 1845 1758 * @node: NUMA node ID that we want to select a CPU from 1846 1759 * 1847 1760 * This function will attempt to find a "random" cpu available on a given ··· 1849 1762 * WORK_CPU_UNBOUND indicating that we should just schedule to any 1850 1763 * available CPU if we need to schedule this work. 
1851 1764 */ 1852 - static int workqueue_select_cpu_near(int node) 1765 + static int select_numa_node_cpu(int node) 1853 1766 { 1854 1767 int cpu; 1855 - 1856 - /* No point in doing this if NUMA isn't enabled for workqueues */ 1857 - if (!wq_numa_enabled) 1858 - return WORK_CPU_UNBOUND; 1859 1768 1860 1769 /* Delay binding to CPU if node is not valid or online */ 1861 1770 if (node < 0 || node >= MAX_NUMNODES || !node_online(node)) ··· 1909 1826 local_irq_save(flags); 1910 1827 1911 1828 if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { 1912 - int cpu = workqueue_select_cpu_near(node); 1829 + int cpu = select_numa_node_cpu(node); 1913 1830 1914 1831 __queue_work(cpu, wq, work); 1915 1832 ret = true; ··· 2064 1981 } 2065 1982 EXPORT_SYMBOL(queue_rcu_work); 2066 1983 2067 - /** 2068 - * worker_enter_idle - enter idle state 2069 - * @worker: worker which is entering idle state 2070 - * 2071 - * @worker is entering idle state. Update stats and idle timer if 2072 - * necessary. 2073 - * 2074 - * LOCKING: 2075 - * raw_spin_lock_irq(pool->lock). 2076 - */ 2077 - static void worker_enter_idle(struct worker *worker) 2078 - { 2079 - struct worker_pool *pool = worker->pool; 2080 - 2081 - if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) || 2082 - WARN_ON_ONCE(!list_empty(&worker->entry) && 2083 - (worker->hentry.next || worker->hentry.pprev))) 2084 - return; 2085 - 2086 - /* can't use worker_set_flags(), also called from create_worker() */ 2087 - worker->flags |= WORKER_IDLE; 2088 - pool->nr_idle++; 2089 - worker->last_active = jiffies; 2090 - 2091 - /* idle_list is LIFO */ 2092 - list_add(&worker->entry, &pool->idle_list); 2093 - 2094 - if (too_many_workers(pool) && !timer_pending(&pool->idle_timer)) 2095 - mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT); 2096 - 2097 - /* Sanity check nr_running. 
*/ 2098 - WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running); 2099 - } 2100 - 2101 - /** 2102 - * worker_leave_idle - leave idle state 2103 - * @worker: worker which is leaving idle state 2104 - * 2105 - * @worker is leaving idle state. Update stats. 2106 - * 2107 - * LOCKING: 2108 - * raw_spin_lock_irq(pool->lock). 2109 - */ 2110 - static void worker_leave_idle(struct worker *worker) 2111 - { 2112 - struct worker_pool *pool = worker->pool; 2113 - 2114 - if (WARN_ON_ONCE(!(worker->flags & WORKER_IDLE))) 2115 - return; 2116 - worker_clr_flags(worker, WORKER_IDLE); 2117 - pool->nr_idle--; 2118 - list_del_init(&worker->entry); 2119 - } 2120 - 2121 1984 static struct worker *alloc_worker(int node) 2122 1985 { 2123 1986 struct worker *worker; ··· 2077 2048 worker->flags = WORKER_PREP; 2078 2049 } 2079 2050 return worker; 2051 + } 2052 + 2053 + static cpumask_t *pool_allowed_cpus(struct worker_pool *pool) 2054 + { 2055 + if (pool->cpu < 0 && pool->attrs->affn_strict) 2056 + return pool->attrs->__pod_cpumask; 2057 + else 2058 + return pool->attrs->cpumask; 2080 2059 } 2081 2060 2082 2061 /** ··· 2112 2075 kthread_set_per_cpu(worker->task, pool->cpu); 2113 2076 2114 2077 if (worker->rescue_wq) 2115 - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask); 2078 + set_cpus_allowed_ptr(worker->task, pool_allowed_cpus(pool)); 2116 2079 2117 2080 list_add_tail(&worker->node, &pool->workers); 2118 2081 worker->pool = pool; ··· 2204 2167 } 2205 2168 2206 2169 set_user_nice(worker->task, pool->attrs->nice); 2207 - kthread_bind_mask(worker->task, pool->attrs->cpumask); 2170 + kthread_bind_mask(worker->task, pool_allowed_cpus(pool)); 2208 2171 2209 2172 /* successful, attach the worker to the pool */ 2210 2173 worker_attach_to_pool(worker, pool); 2211 2174 2212 2175 /* start the newly created worker */ 2213 2176 raw_spin_lock_irq(&pool->lock); 2177 + 2214 2178 worker->pool->nr_workers++; 2215 2179 worker_enter_idle(worker); 2180 + kick_pool(pool); 2181 + 2182 + /* 
2183 + * @worker is waiting on a completion in kthread() and will trigger hung 2184 + check if not woken up soon. As kick_pool() might not have woken it 2185 + up, wake it up explicitly once more. 2186 + */ 2216 2187 wake_up_process(worker->task); 2188 + 2217 2189 raw_spin_unlock_irq(&pool->lock); 2218 2190 2219 2191 return worker; ··· 2350 2304 static void idle_cull_fn(struct work_struct *work) 2351 2305 { 2352 2306 struct worker_pool *pool = container_of(work, struct worker_pool, idle_cull_work); 2353 - struct list_head cull_list; 2307 + LIST_HEAD(cull_list); 2354 2308 2355 - INIT_LIST_HEAD(&cull_list); 2356 2309 /* 2357 2310 * Grabbing wq_pool_attach_mutex here ensures an already-running worker 2358 2311 * cannot proceed beyond worker_detach_from_pool() in its self-destruct ··· 2540 2495 struct pool_workqueue *pwq = get_work_pwq(work); 2541 2496 struct worker_pool *pool = worker->pool; 2542 2497 unsigned long work_data; 2543 - struct worker *collision; 2544 2498 #ifdef CONFIG_LOCKDEP 2545 2499 /* 2546 2500 * It is permissible to free the struct work_struct from ··· 2555 2511 /* ensure we're on the correct CPU */ 2556 2512 WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) && 2557 2513 raw_smp_processor_id() != pool->cpu); 2558 - 2559 - /* 2560 - * A single work shouldn't be executed concurrently by 2561 - * multiple workers on a single cpu. Check whether anyone is 2562 - * already processing the work. If so, defer the work to the 2563 - * currently executing one. 2564 - */ 2565 - collision = find_worker_executing_work(pool, work); 2566 - if (unlikely(collision)) { 2567 - move_linked_works(work, &collision->scheduled, NULL); 2568 - return; 2569 - } 2570 2514 2571 2515 /* claim and dequeue */ 2572 2516 debug_work_deactivate(work); ··· 2584 2552 worker_set_flags(worker, WORKER_CPU_INTENSIVE); 2585 2553 2586 2554 /* 2587 - * Wake up another worker if necessary. 
The condition is always 2588 - * false for normal per-cpu workers since nr_running would always 2589 - * be >= 1 at this point. This is used to chain execution of the 2590 - * pending work items for WORKER_NOT_RUNNING workers such as the 2591 - * UNBOUND and CPU_INTENSIVE ones. 2555 + * Kick @pool if necessary. It's always noop for per-cpu worker pools 2556 + * since nr_running would always be >= 1 at this point. This is used to 2557 + * chain execution of the pending work items for WORKER_NOT_RUNNING 2558 + * workers such as the UNBOUND and CPU_INTENSIVE ones. 2592 2559 */ 2593 - if (need_more_worker(pool)) 2594 - wake_up_worker(pool); 2560 + kick_pool(pool); 2595 2561 2596 2562 /* 2597 2563 * Record the last pool and clear PENDING which should be the last ··· 2599 2569 */ 2600 2570 set_work_pool_and_clear_pending(work, pool->id); 2601 2571 2572 + pwq->stats[PWQ_STAT_STARTED]++; 2602 2573 raw_spin_unlock_irq(&pool->lock); 2603 2574 2604 2575 lock_map_acquire(&pwq->wq->lockdep_map); ··· 2626 2595 * workqueues), so hiding them isn't a problem. 
2627 2596 */ 2628 2597 lockdep_invariant_state(true); 2629 - pwq->stats[PWQ_STAT_STARTED]++; 2630 2598 trace_workqueue_execute_start(work); 2631 2599 worker->current_func(work); 2632 2600 /* ··· 2691 2661 */ 2692 2662 static void process_scheduled_works(struct worker *worker) 2693 2663 { 2694 - while (!list_empty(&worker->scheduled)) { 2695 - struct work_struct *work = list_first_entry(&worker->scheduled, 2696 - struct work_struct, entry); 2664 + struct work_struct *work; 2665 + bool first = true; 2666 + 2667 + while ((work = list_first_entry_or_null(&worker->scheduled, 2668 + struct work_struct, entry))) { 2669 + if (first) { 2670 + worker->pool->watchdog_ts = jiffies; 2671 + first = false; 2672 + } 2697 2673 process_one_work(worker, work); 2698 2674 } 2699 2675 } ··· 2780 2744 list_first_entry(&pool->worklist, 2781 2745 struct work_struct, entry); 2782 2746 2783 - pool->watchdog_ts = jiffies; 2784 - 2785 - if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) { 2786 - /* optimization path, not strictly necessary */ 2787 - process_one_work(worker, work); 2788 - if (unlikely(!list_empty(&worker->scheduled))) 2789 - process_scheduled_works(worker); 2790 - } else { 2791 - move_linked_works(work, &worker->scheduled, NULL); 2747 + if (assign_work(work, worker, NULL)) 2792 2748 process_scheduled_works(worker); 2793 - } 2794 2749 } while (keep_working(pool)); 2795 2750 2796 2751 worker_set_flags(worker, WORKER_PREP); ··· 2825 2798 { 2826 2799 struct worker *rescuer = __rescuer; 2827 2800 struct workqueue_struct *wq = rescuer->rescue_wq; 2828 - struct list_head *scheduled = &rescuer->scheduled; 2829 2801 bool should_stop; 2830 2802 2831 2803 set_user_nice(current, RESCUER_NICE_LEVEL); ··· 2855 2829 struct pool_workqueue, mayday_node); 2856 2830 struct worker_pool *pool = pwq->pool; 2857 2831 struct work_struct *work, *n; 2858 - bool first = true; 2859 2832 2860 2833 __set_current_state(TASK_RUNNING); 2861 2834 list_del_init(&pwq->mayday_node); ··· 2869 2844 * Slurp 
in all works issued via this workqueue and 2870 2845 * process'em. 2871 2846 */ 2872 - WARN_ON_ONCE(!list_empty(scheduled)); 2847 + WARN_ON_ONCE(!list_empty(&rescuer->scheduled)); 2873 2848 list_for_each_entry_safe(work, n, &pool->worklist, entry) { 2874 - if (get_work_pwq(work) == pwq) { 2875 - if (first) 2876 - pool->watchdog_ts = jiffies; 2877 - move_linked_works(work, scheduled, &n); 2849 + if (get_work_pwq(work) == pwq && 2850 + assign_work(work, rescuer, &n)) 2878 2851 pwq->stats[PWQ_STAT_RESCUED]++; 2879 - } 2880 - first = false; 2881 2852 } 2882 2853 2883 - if (!list_empty(scheduled)) { 2854 + if (!list_empty(&rescuer->scheduled)) { 2884 2855 process_scheduled_works(rescuer); 2885 2856 2886 2857 /* ··· 2909 2888 put_pwq(pwq); 2910 2889 2911 2890 /* 2912 - * Leave this pool. If need_more_worker() is %true, notify a 2913 - * regular worker; otherwise, we end up with 0 concurrency 2914 - * and stalling the execution. 2891 + * Leave this pool. Notify regular workers; otherwise, we end up 2892 + * with 0 concurrency and stalling the execution. 
2915 2893 */ 2916 - if (need_more_worker(pool)) 2917 - wake_up_worker(pool); 2894 + kick_pool(pool); 2918 2895 2919 2896 raw_spin_unlock_irq(&pool->lock); 2920 2897 ··· 3047 3028 pwq->nr_in_flight[work_color]++; 3048 3029 work_flags |= work_color_to_flags(work_color); 3049 3030 3050 - debug_work_activate(&barr->work); 3051 3031 insert_work(pwq, &barr->work, head, work_flags); 3052 3032 } 3053 3033 ··· 3709 3691 { 3710 3692 if (attrs) { 3711 3693 free_cpumask_var(attrs->cpumask); 3694 + free_cpumask_var(attrs->__pod_cpumask); 3712 3695 kfree(attrs); 3713 3696 } 3714 3697 } ··· 3731 3712 goto fail; 3732 3713 if (!alloc_cpumask_var(&attrs->cpumask, GFP_KERNEL)) 3733 3714 goto fail; 3715 + if (!alloc_cpumask_var(&attrs->__pod_cpumask, GFP_KERNEL)) 3716 + goto fail; 3734 3717 3735 3718 cpumask_copy(attrs->cpumask, cpu_possible_mask); 3719 + attrs->affn_scope = WQ_AFFN_DFL; 3736 3720 return attrs; 3737 3721 fail: 3738 3722 free_workqueue_attrs(attrs); ··· 3747 3725 { 3748 3726 to->nice = from->nice; 3749 3727 cpumask_copy(to->cpumask, from->cpumask); 3728 + cpumask_copy(to->__pod_cpumask, from->__pod_cpumask); 3729 + to->affn_strict = from->affn_strict; 3730 + 3750 3731 /* 3751 - * Unlike hash and equality test, this function doesn't ignore 3752 - * ->no_numa as it is used for both pool and wq attrs. Instead, 3753 - * get_unbound_pool() explicitly clears ->no_numa after copying. 3732 + * Unlike hash and equality test, copying shouldn't ignore wq-only 3733 + * fields as copying is used for both pool and wq attrs. Instead, 3734 + * get_unbound_pool() explicitly clears the fields. 3754 3735 */ 3755 - to->no_numa = from->no_numa; 3736 + to->affn_scope = from->affn_scope; 3737 + to->ordered = from->ordered; 3738 + } 3739 + 3740 + /* 3741 + * Some attrs fields are workqueue-only. Clear them for worker_pool's. See the 3742 + * comments in 'struct workqueue_attrs' definition. 
3743 + */ 3744 + static void wqattrs_clear_for_pool(struct workqueue_attrs *attrs) 3745 + { 3746 + attrs->affn_scope = WQ_AFFN_NR_TYPES; 3747 + attrs->ordered = false; 3756 3748 } 3757 3749 3758 3750 /* hash value of the content of @attr */ ··· 3777 3741 hash = jhash_1word(attrs->nice, hash); 3778 3742 hash = jhash(cpumask_bits(attrs->cpumask), 3779 3743 BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash); 3744 + hash = jhash(cpumask_bits(attrs->__pod_cpumask), 3745 + BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash); 3746 + hash = jhash_1word(attrs->affn_strict, hash); 3780 3747 return hash; 3781 3748 } 3782 3749 ··· 3791 3752 return false; 3792 3753 if (!cpumask_equal(a->cpumask, b->cpumask)) 3793 3754 return false; 3755 + if (!cpumask_equal(a->__pod_cpumask, b->__pod_cpumask)) 3756 + return false; 3757 + if (a->affn_strict != b->affn_strict) 3758 + return false; 3794 3759 return true; 3760 + } 3761 + 3762 + /* Update @attrs with actually available CPUs */ 3763 + static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs, 3764 + const cpumask_t *unbound_cpumask) 3765 + { 3766 + /* 3767 + * Calculate the effective CPU mask of @attrs given @unbound_cpumask. If 3768 + * @attrs->cpumask doesn't overlap with @unbound_cpumask, we fallback to 3769 + * @unbound_cpumask. 
3770 + */ 3771 + cpumask_and(attrs->cpumask, attrs->cpumask, unbound_cpumask); 3772 + if (unlikely(cpumask_empty(attrs->cpumask))) 3773 + cpumask_copy(attrs->cpumask, unbound_cpumask); 3774 + } 3775 + 3776 + /* find wq_pod_type to use for @attrs */ 3777 + static const struct wq_pod_type * 3778 + wqattrs_pod_type(const struct workqueue_attrs *attrs) 3779 + { 3780 + enum wq_affn_scope scope; 3781 + struct wq_pod_type *pt; 3782 + 3783 + /* to synchronize access to wq_affn_dfl */ 3784 + lockdep_assert_held(&wq_pool_mutex); 3785 + 3786 + if (attrs->affn_scope == WQ_AFFN_DFL) 3787 + scope = wq_affn_dfl; 3788 + else 3789 + scope = attrs->affn_scope; 3790 + 3791 + pt = &wq_pod_types[scope]; 3792 + 3793 + if (!WARN_ON_ONCE(attrs->affn_scope == WQ_AFFN_NR_TYPES) && 3794 + likely(pt->nr_pods)) 3795 + return pt; 3796 + 3797 + /* 3798 + * Before workqueue_init_topology(), only SYSTEM is available which is 3799 + * initialized in workqueue_init_early(). 3800 + */ 3801 + pt = &wq_pod_types[WQ_AFFN_SYSTEM]; 3802 + BUG_ON(!pt->nr_pods); 3803 + return pt; 3795 3804 } 3796 3805 3797 3806 /** ··· 3880 3793 pool->attrs = alloc_workqueue_attrs(); 3881 3794 if (!pool->attrs) 3882 3795 return -ENOMEM; 3796 + 3797 + wqattrs_clear_for_pool(pool->attrs); 3798 + 3883 3799 return 0; 3884 3800 } 3885 3801 ··· 3930 3840 container_of(rcu, struct workqueue_struct, rcu); 3931 3841 3932 3842 wq_free_lockdep(wq); 3933 - 3934 - if (!(wq->flags & WQ_UNBOUND)) 3935 - free_percpu(wq->cpu_pwqs); 3936 - else 3937 - free_workqueue_attrs(wq->unbound_attrs); 3938 - 3843 + free_percpu(wq->cpu_pwq); 3844 + free_workqueue_attrs(wq->unbound_attrs); 3939 3845 kfree(wq); 3940 3846 } 3941 3847 ··· 3958 3872 static void put_unbound_pool(struct worker_pool *pool) 3959 3873 { 3960 3874 DECLARE_COMPLETION_ONSTACK(detach_completion); 3961 - struct list_head cull_list; 3962 3875 struct worker *worker; 3963 - 3964 - INIT_LIST_HEAD(&cull_list); 3876 + LIST_HEAD(cull_list); 3965 3877 3966 3878 
lockdep_assert_held(&wq_pool_mutex); 3967 3879 ··· 4043 3959 */ 4044 3960 static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) 4045 3961 { 3962 + struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_NUMA]; 4046 3963 u32 hash = wqattrs_hash(attrs); 4047 3964 struct worker_pool *pool; 4048 - int node; 4049 - int target_node = NUMA_NO_NODE; 3965 + int pod, node = NUMA_NO_NODE; 4050 3966 4051 3967 lockdep_assert_held(&wq_pool_mutex); 4052 3968 ··· 4058 3974 } 4059 3975 } 4060 3976 4061 - /* if cpumask is contained inside a NUMA node, we belong to that node */ 4062 - if (wq_numa_enabled) { 4063 - for_each_node(node) { 4064 - if (cpumask_subset(attrs->cpumask, 4065 - wq_numa_possible_cpumask[node])) { 4066 - target_node = node; 4067 - break; 4068 - } 3977 + /* If __pod_cpumask is contained inside a NUMA pod, that's our node */ 3978 + for (pod = 0; pod < pt->nr_pods; pod++) { 3979 + if (cpumask_subset(attrs->__pod_cpumask, pt->pod_cpus[pod])) { 3980 + node = pt->pod_node[pod]; 3981 + break; 4069 3982 } 4070 3983 } 4071 3984 4072 3985 /* nope, create a new one */ 4073 - pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, target_node); 3986 + pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, node); 4074 3987 if (!pool || init_worker_pool(pool) < 0) 4075 3988 goto fail; 4076 3989 4077 - lockdep_set_subclass(&pool->lock, 1); /* see put_pwq() */ 3990 + pool->node = node; 4078 3991 copy_workqueue_attrs(pool->attrs, attrs); 4079 - pool->node = target_node; 4080 - 4081 - /* 4082 - * no_numa isn't a worker_pool attribute, always clear it. See 4083 - * 'struct workqueue_attrs' comments for detail. 4084 - */ 4085 - pool->attrs->no_numa = false; 3992 + wqattrs_clear_for_pool(pool->attrs); 4086 3993 4087 3994 if (worker_pool_assign_id(pool) < 0) 4088 3995 goto fail; ··· 4099 4024 } 4100 4025 4101 4026 /* 4102 - * Scheduled on system_wq by put_pwq() when an unbound pwq hits zero refcnt 4103 - * and needs to be destroyed. 
4027 + * Scheduled on pwq_release_worker by put_pwq() when an unbound pwq hits zero 4028 + * refcnt and needs to be destroyed. 4104 4029 */ 4105 - static void pwq_unbound_release_workfn(struct work_struct *work) 4030 + static void pwq_release_workfn(struct kthread_work *work) 4106 4031 { 4107 4032 struct pool_workqueue *pwq = container_of(work, struct pool_workqueue, 4108 - unbound_release_work); 4033 + release_work); 4109 4034 struct workqueue_struct *wq = pwq->wq; 4110 4035 struct worker_pool *pool = pwq->pool; 4111 4036 bool is_last = false; 4112 4037 4113 4038 /* 4114 - * when @pwq is not linked, it doesn't hold any reference to the 4039 + * When @pwq is not linked, it doesn't hold any reference to the 4115 4040 * @wq, and @wq is invalid to access. 4116 4041 */ 4117 4042 if (!list_empty(&pwq->pwqs_node)) { 4118 - if (WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND))) 4119 - return; 4120 - 4121 4043 mutex_lock(&wq->mutex); 4122 4044 list_del_rcu(&pwq->pwqs_node); 4123 4045 is_last = list_empty(&wq->pwqs); 4124 4046 mutex_unlock(&wq->mutex); 4125 4047 } 4126 4048 4127 - mutex_lock(&wq_pool_mutex); 4128 - put_unbound_pool(pool); 4129 - mutex_unlock(&wq_pool_mutex); 4049 + if (wq->flags & WQ_UNBOUND) { 4050 + mutex_lock(&wq_pool_mutex); 4051 + put_unbound_pool(pool); 4052 + mutex_unlock(&wq_pool_mutex); 4053 + } 4130 4054 4131 4055 call_rcu(&pwq->rcu, rcu_free_pwq); 4132 4056 ··· 4169 4095 * is updated and visible. 4170 4096 */ 4171 4097 if (!freezable || !workqueue_freezing) { 4172 - bool kick = false; 4173 - 4174 4098 pwq->max_active = wq->saved_max_active; 4175 4099 4176 4100 while (!list_empty(&pwq->inactive_works) && 4177 - pwq->nr_active < pwq->max_active) { 4101 + pwq->nr_active < pwq->max_active) 4178 4102 pwq_activate_first_inactive(pwq); 4179 - kick = true; 4180 - } 4181 4103 4182 - /* 4183 - * Need to kick a worker after thawed or an unbound wq's 4184 - * max_active is bumped. 
In realtime scenarios, always kicking a 4185 - * worker will cause interference on the isolated cpu cores, so 4186 - * let's kick iff work items were activated. 4187 - */ 4188 - if (kick) 4189 - wake_up_worker(pwq->pool); 4104 + kick_pool(pwq->pool); 4190 4105 } else { 4191 4106 pwq->max_active = 0; 4192 4107 } ··· 4198 4135 INIT_LIST_HEAD(&pwq->inactive_works); 4199 4136 INIT_LIST_HEAD(&pwq->pwqs_node); 4200 4137 INIT_LIST_HEAD(&pwq->mayday_node); 4201 - INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn); 4138 + kthread_init_work(&pwq->release_work, pwq_release_workfn); 4202 4139 } 4203 4140 4204 4141 /* sync @pwq with the current state of its associated wq and link it */ ··· 4246 4183 } 4247 4184 4248 4185 /** 4249 - * wq_calc_node_cpumask - calculate a wq_attrs' cpumask for the specified node 4186 + * wq_calc_pod_cpumask - calculate a wq_attrs' cpumask for a pod 4250 4187 * @attrs: the wq_attrs of the default pwq of the target workqueue 4251 - * @node: the target NUMA node 4188 + * @cpu: the target CPU 4252 4189 * @cpu_going_down: if >= 0, the CPU to consider as offline 4253 - * @cpumask: outarg, the resulting cpumask 4254 4190 * 4255 - * Calculate the cpumask a workqueue with @attrs should use on @node. If 4256 - * @cpu_going_down is >= 0, that cpu is considered offline during 4257 - * calculation. The result is stored in @cpumask. 4191 + * Calculate the cpumask a workqueue with @attrs should use on @pod. If 4192 + * @cpu_going_down is >= 0, that cpu is considered offline during calculation. 4193 + * The result is stored in @attrs->__pod_cpumask. 4258 4194 * 4259 - * If NUMA affinity is not enabled, @attrs->cpumask is always used. If 4260 - * enabled and @node has online CPUs requested by @attrs, the returned 4261 - * cpumask is the intersection of the possible CPUs of @node and 4262 - * @attrs->cpumask. 4195 + * If pod affinity is not enabled, @attrs->cpumask is always used. 
If enabled 4196 + * and @pod has online CPUs requested by @attrs, the returned cpumask is the 4197 + * intersection of the possible CPUs of @pod and @attrs->cpumask. 4263 4198 * 4264 - * The caller is responsible for ensuring that the cpumask of @node stays 4265 - * stable. 4266 - * 4267 - * Return: %true if the resulting @cpumask is different from @attrs->cpumask, 4268 - * %false if equal. 4199 + * The caller is responsible for ensuring that the cpumask of @pod stays stable. 4269 4200 */ 4270 - static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node, 4271 - int cpu_going_down, cpumask_t *cpumask) 4201 + static void wq_calc_pod_cpumask(struct workqueue_attrs *attrs, int cpu, 4202 + int cpu_going_down) 4272 4203 { 4273 - if (!wq_numa_enabled || attrs->no_numa) 4274 - goto use_dfl; 4204 + const struct wq_pod_type *pt = wqattrs_pod_type(attrs); 4205 + int pod = pt->cpu_pod[cpu]; 4275 4206 4276 - /* does @node have any online CPUs @attrs wants? */ 4277 - cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask); 4207 + /* does @pod have any online CPUs @attrs wants? 
*/ 4208 + cpumask_and(attrs->__pod_cpumask, pt->pod_cpus[pod], attrs->cpumask); 4209 + cpumask_and(attrs->__pod_cpumask, attrs->__pod_cpumask, cpu_online_mask); 4278 4210 if (cpu_going_down >= 0) 4279 - cpumask_clear_cpu(cpu_going_down, cpumask); 4211 + cpumask_clear_cpu(cpu_going_down, attrs->__pod_cpumask); 4280 4212 4281 - if (cpumask_empty(cpumask)) 4282 - goto use_dfl; 4283 - 4284 - /* yeap, return possible CPUs in @node that @attrs wants */ 4285 - cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]); 4286 - 4287 - if (cpumask_empty(cpumask)) { 4288 - pr_warn_once("WARNING: workqueue cpumask: online intersect > " 4289 - "possible intersect\n"); 4290 - return false; 4213 + if (cpumask_empty(attrs->__pod_cpumask)) { 4214 + cpumask_copy(attrs->__pod_cpumask, attrs->cpumask); 4215 + return; 4291 4216 } 4292 4217 4293 - return !cpumask_equal(cpumask, attrs->cpumask); 4218 + /* yeap, return possible CPUs in @pod that @attrs wants */ 4219 + cpumask_and(attrs->__pod_cpumask, attrs->cpumask, pt->pod_cpus[pod]); 4294 4220 4295 - use_dfl: 4296 - cpumask_copy(cpumask, attrs->cpumask); 4297 - return false; 4221 + if (cpumask_empty(attrs->__pod_cpumask)) 4222 + pr_warn_once("WARNING: workqueue cpumask: online intersect > " 4223 + "possible intersect\n"); 4298 4224 } 4299 4225 4300 - /* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */ 4301 - static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq, 4302 - int node, 4303 - struct pool_workqueue *pwq) 4226 + /* install @pwq into @wq's cpu_pwq and return the old pwq */ 4227 + static struct pool_workqueue *install_unbound_pwq(struct workqueue_struct *wq, 4228 + int cpu, struct pool_workqueue *pwq) 4304 4229 { 4305 4230 struct pool_workqueue *old_pwq; 4306 4231 ··· 4298 4247 /* link_pwq() can handle duplicate calls */ 4299 4248 link_pwq(pwq); 4300 4249 4301 - old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]); 4302 - rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq); 
4250 + old_pwq = rcu_access_pointer(*per_cpu_ptr(wq->cpu_pwq, cpu)); 4251 + rcu_assign_pointer(*per_cpu_ptr(wq->cpu_pwq, cpu), pwq); 4303 4252 return old_pwq; 4304 4253 } 4305 4254 ··· 4316 4265 static void apply_wqattrs_cleanup(struct apply_wqattrs_ctx *ctx) 4317 4266 { 4318 4267 if (ctx) { 4319 - int node; 4268 + int cpu; 4320 4269 4321 - for_each_node(node) 4322 - put_pwq_unlocked(ctx->pwq_tbl[node]); 4270 + for_each_possible_cpu(cpu) 4271 + put_pwq_unlocked(ctx->pwq_tbl[cpu]); 4323 4272 put_pwq_unlocked(ctx->dfl_pwq); 4324 4273 4325 4274 free_workqueue_attrs(ctx->attrs); ··· 4335 4284 const cpumask_var_t unbound_cpumask) 4336 4285 { 4337 4286 struct apply_wqattrs_ctx *ctx; 4338 - struct workqueue_attrs *new_attrs, *tmp_attrs; 4339 - int node; 4287 + struct workqueue_attrs *new_attrs; 4288 + int cpu; 4340 4289 4341 4290 lockdep_assert_held(&wq_pool_mutex); 4342 4291 4343 - ctx = kzalloc(struct_size(ctx, pwq_tbl, nr_node_ids), GFP_KERNEL); 4292 + if (WARN_ON(attrs->affn_scope < 0 || 4293 + attrs->affn_scope >= WQ_AFFN_NR_TYPES)) 4294 + return ERR_PTR(-EINVAL); 4295 + 4296 + ctx = kzalloc(struct_size(ctx, pwq_tbl, nr_cpu_ids), GFP_KERNEL); 4344 4297 4345 4298 new_attrs = alloc_workqueue_attrs(); 4346 - tmp_attrs = alloc_workqueue_attrs(); 4347 - if (!ctx || !new_attrs || !tmp_attrs) 4299 + if (!ctx || !new_attrs) 4348 4300 goto out_free; 4349 - 4350 - /* 4351 - * Calculate the attrs of the default pwq with unbound_cpumask 4352 - * which is wq_unbound_cpumask or to set to wq_unbound_cpumask. 4353 - * If the user configured cpumask doesn't overlap with the 4354 - * wq_unbound_cpumask, we fallback to the wq_unbound_cpumask. 4355 - */ 4356 - copy_workqueue_attrs(new_attrs, attrs); 4357 - cpumask_and(new_attrs->cpumask, new_attrs->cpumask, unbound_cpumask); 4358 - if (unlikely(cpumask_empty(new_attrs->cpumask))) 4359 - cpumask_copy(new_attrs->cpumask, unbound_cpumask); 4360 - 4361 - /* 4362 - * We may create multiple pwqs with differing cpumasks. 
Make a 4363 - * copy of @new_attrs which will be modified and used to obtain 4364 - * pools. 4365 - */ 4366 - copy_workqueue_attrs(tmp_attrs, new_attrs); 4367 4301 4368 4302 /* 4369 4303 * If something goes wrong during CPU up/down, we'll fall back to 4370 4304 * the default pwq covering whole @attrs->cpumask. Always create 4371 4305 * it even if we don't use it immediately. 4372 4306 */ 4307 + copy_workqueue_attrs(new_attrs, attrs); 4308 + wqattrs_actualize_cpumask(new_attrs, unbound_cpumask); 4309 + cpumask_copy(new_attrs->__pod_cpumask, new_attrs->cpumask); 4373 4310 ctx->dfl_pwq = alloc_unbound_pwq(wq, new_attrs); 4374 4311 if (!ctx->dfl_pwq) 4375 4312 goto out_free; 4376 4313 4377 - for_each_node(node) { 4378 - if (wq_calc_node_cpumask(new_attrs, node, -1, tmp_attrs->cpumask)) { 4379 - ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs); 4380 - if (!ctx->pwq_tbl[node]) 4381 - goto out_free; 4382 - } else { 4314 + for_each_possible_cpu(cpu) { 4315 + if (new_attrs->ordered) { 4383 4316 ctx->dfl_pwq->refcnt++; 4384 - ctx->pwq_tbl[node] = ctx->dfl_pwq; 4317 + ctx->pwq_tbl[cpu] = ctx->dfl_pwq; 4318 + } else { 4319 + wq_calc_pod_cpumask(new_attrs, cpu, -1); 4320 + ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, new_attrs); 4321 + if (!ctx->pwq_tbl[cpu]) 4322 + goto out_free; 4385 4323 } 4386 4324 } 4387 4325 4388 4326 /* save the user configured attrs and sanitize it. 
*/ 4389 4327 copy_workqueue_attrs(new_attrs, attrs); 4390 4328 cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask); 4329 + cpumask_copy(new_attrs->__pod_cpumask, new_attrs->cpumask); 4391 4330 ctx->attrs = new_attrs; 4392 4331 4393 4332 ctx->wq = wq; 4394 - free_workqueue_attrs(tmp_attrs); 4395 4333 return ctx; 4396 4334 4397 4335 out_free: 4398 - free_workqueue_attrs(tmp_attrs); 4399 4336 free_workqueue_attrs(new_attrs); 4400 4337 apply_wqattrs_cleanup(ctx); 4401 - return NULL; 4338 + return ERR_PTR(-ENOMEM); 4402 4339 } 4403 4340 4404 4341 /* set attrs and install prepared pwqs, @ctx points to old pwqs on return */ 4405 4342 static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx) 4406 4343 { 4407 - int node; 4344 + int cpu; 4408 4345 4409 4346 /* all pwqs have been created successfully, let's install'em */ 4410 4347 mutex_lock(&ctx->wq->mutex); ··· 4400 4361 copy_workqueue_attrs(ctx->wq->unbound_attrs, ctx->attrs); 4401 4362 4402 4363 /* save the previous pwq and install the new one */ 4403 - for_each_node(node) 4404 - ctx->pwq_tbl[node] = numa_pwq_tbl_install(ctx->wq, node, 4405 - ctx->pwq_tbl[node]); 4364 + for_each_possible_cpu(cpu) 4365 + ctx->pwq_tbl[cpu] = install_unbound_pwq(ctx->wq, cpu, 4366 + ctx->pwq_tbl[cpu]); 4406 4367 4407 4368 /* @dfl_pwq might not have been used, ensure it's linked */ 4408 4369 link_pwq(ctx->dfl_pwq); ··· 4442 4403 } 4443 4404 4444 4405 ctx = apply_wqattrs_prepare(wq, attrs, wq_unbound_cpumask); 4445 - if (!ctx) 4446 - return -ENOMEM; 4406 + if (IS_ERR(ctx)) 4407 + return PTR_ERR(ctx); 4447 4408 4448 4409 /* the ctx has been prepared successfully, let's commit it */ 4449 4410 apply_wqattrs_commit(ctx); ··· 4457 4418 * @wq: the target workqueue 4458 4419 * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs() 4459 4420 * 4460 - * Apply @attrs to an unbound workqueue @wq. 
Unless disabled, on NUMA 4461 - * machines, this function maps a separate pwq to each NUMA node with 4462 - * possibles CPUs in @attrs->cpumask so that work items are affine to the 4463 - * NUMA node it was issued on. Older pwqs are released as in-flight work 4464 - * items finish. Note that a work item which repeatedly requeues itself 4465 - * back-to-back will stay on its current pwq. 4421 + * Apply @attrs to an unbound workqueue @wq. Unless disabled, this function maps 4422 + * a separate pwq to each CPU pod with possibles CPUs in @attrs->cpumask so that 4423 + * work items are affine to the pod it was issued on. Older pwqs are released as 4424 + * in-flight work items finish. Note that a work item which repeatedly requeues 4425 + * itself back-to-back will stay on its current pwq. 4466 4426 * 4467 4427 * Performs GFP_KERNEL allocations. 4468 4428 * ··· 4484 4446 } 4485 4447 4486 4448 /** 4487 - * wq_update_unbound_numa - update NUMA affinity of a wq for CPU hot[un]plug 4449 + * wq_update_pod - update pod affinity of a wq for CPU hot[un]plug 4488 4450 * @wq: the target workqueue 4489 - * @cpu: the CPU coming up or going down 4451 + * @cpu: the CPU to update pool association for 4452 + * @hotplug_cpu: the CPU coming up or going down 4490 4453 * @online: whether @cpu is coming up or going down 4491 4454 * 4492 4455 * This function is to be called from %CPU_DOWN_PREPARE, %CPU_ONLINE and 4493 - * %CPU_DOWN_FAILED. @cpu is being hot[un]plugged, update NUMA affinity of 4456 + * %CPU_DOWN_FAILED. @cpu is being hot[un]plugged, update pod affinity of 4494 4457 * @wq accordingly. 4495 4458 * 4496 - * If NUMA affinity can't be adjusted due to memory allocation failure, it 4497 - * falls back to @wq->dfl_pwq which may not be optimal but is always 4498 - * correct. 
4499 4459 * 4500 - * Note that when the last allowed CPU of a NUMA node goes offline for a 4501 - * workqueue with a cpumask spanning multiple nodes, the workers which were 4502 - * already executing the work items for the workqueue will lose their CPU 4503 - * affinity and may execute on any CPU. This is similar to how per-cpu 4504 - * workqueues behave on CPU_DOWN. If a workqueue user wants strict 4505 - * affinity, it's the user's responsibility to flush the work item from 4506 - * CPU_DOWN_PREPARE. 4460 + * If pod affinity can't be adjusted due to memory allocation failure, it falls 4461 + * back to @wq->dfl_pwq which may not be optimal but is always correct. 4462 + * 4463 + * Note that when the last allowed CPU of a pod goes offline for a workqueue 4464 + * with a cpumask spanning multiple pods, the workers which were already 4465 + * executing the work items for the workqueue will lose their CPU affinity and 4466 + * may execute on any CPU. This is similar to how per-cpu workqueues behave on 4467 + * CPU_DOWN. If a workqueue user wants strict affinity, it's the user's 4468 + * responsibility to flush the work item from CPU_DOWN_PREPARE. 4507 4469 */ 4508 - static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu, 4509 - bool online) 4470 + static void wq_update_pod(struct workqueue_struct *wq, int cpu, 4471 + int hotplug_cpu, bool online) 4510 4472 { 4511 - int node = cpu_to_node(cpu); 4512 - int cpu_off = online ? -1 : cpu; 4473 + int off_cpu = online ? -1 : hotplug_cpu; 4513 4474 struct pool_workqueue *old_pwq = NULL, *pwq; 4514 4475 struct workqueue_attrs *target_attrs; 4515 - cpumask_t *cpumask; 4516 4476 4517 4477 lockdep_assert_held(&wq_pool_mutex); 4518 4478 4519 - if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) || 4520 - wq->unbound_attrs->no_numa) 4479 + if (!(wq->flags & WQ_UNBOUND) || wq->unbound_attrs->ordered) 4521 4480 return; 4522 4481 4523 4482 /* ··· 4522 4487 * Let's use a preallocated one. 
The following buf is protected by 4523 4488 * CPU hotplug exclusion. 4524 4489 */ 4525 - target_attrs = wq_update_unbound_numa_attrs_buf; 4526 - cpumask = target_attrs->cpumask; 4490 + target_attrs = wq_update_pod_attrs_buf; 4527 4491 4528 4492 copy_workqueue_attrs(target_attrs, wq->unbound_attrs); 4529 - pwq = unbound_pwq_by_node(wq, node); 4493 + wqattrs_actualize_cpumask(target_attrs, wq_unbound_cpumask); 4530 4494 4531 - /* 4532 - * Let's determine what needs to be done. If the target cpumask is 4533 - * different from the default pwq's, we need to compare it to @pwq's 4534 - * and create a new one if they don't match. If the target cpumask 4535 - * equals the default pwq's, the default pwq should be used. 4536 - */ 4537 - if (wq_calc_node_cpumask(wq->dfl_pwq->pool->attrs, node, cpu_off, cpumask)) { 4538 - if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask)) 4539 - return; 4540 - } else { 4541 - goto use_dfl_pwq; 4542 - } 4495 + /* nothing to do if the target cpumask matches the current pwq */ 4496 + wq_calc_pod_cpumask(target_attrs, cpu, off_cpu); 4497 + pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu), 4498 + lockdep_is_held(&wq_pool_mutex)); 4499 + if (wqattrs_equal(target_attrs, pwq->pool->attrs)) 4500 + return; 4543 4501 4544 4502 /* create a new pwq */ 4545 4503 pwq = alloc_unbound_pwq(wq, target_attrs); 4546 4504 if (!pwq) { 4547 - pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n", 4505 + pr_warn("workqueue: allocation failed while updating CPU pod affinity of \"%s\"\n", 4548 4506 wq->name); 4549 4507 goto use_dfl_pwq; 4550 4508 } 4551 4509 4552 4510 /* Install the new pwq. 
*/ 4553 4511 mutex_lock(&wq->mutex); 4554 - old_pwq = numa_pwq_tbl_install(wq, node, pwq); 4512 + old_pwq = install_unbound_pwq(wq, cpu, pwq); 4555 4513 goto out_unlock; 4556 4514 4557 4515 use_dfl_pwq: ··· 4552 4524 raw_spin_lock_irq(&wq->dfl_pwq->pool->lock); 4553 4525 get_pwq(wq->dfl_pwq); 4554 4526 raw_spin_unlock_irq(&wq->dfl_pwq->pool->lock); 4555 - old_pwq = numa_pwq_tbl_install(wq, node, wq->dfl_pwq); 4527 + old_pwq = install_unbound_pwq(wq, cpu, wq->dfl_pwq); 4556 4528 out_unlock: 4557 4529 mutex_unlock(&wq->mutex); 4558 4530 put_pwq_unlocked(old_pwq); ··· 4563 4535 bool highpri = wq->flags & WQ_HIGHPRI; 4564 4536 int cpu, ret; 4565 4537 4538 + wq->cpu_pwq = alloc_percpu(struct pool_workqueue *); 4539 + if (!wq->cpu_pwq) 4540 + goto enomem; 4541 + 4566 4542 if (!(wq->flags & WQ_UNBOUND)) { 4567 - wq->cpu_pwqs = alloc_percpu(struct pool_workqueue); 4568 - if (!wq->cpu_pwqs) 4569 - return -ENOMEM; 4570 - 4571 4543 for_each_possible_cpu(cpu) { 4572 - struct pool_workqueue *pwq = 4573 - per_cpu_ptr(wq->cpu_pwqs, cpu); 4574 - struct worker_pool *cpu_pools = 4575 - per_cpu(cpu_worker_pools, cpu); 4544 + struct pool_workqueue **pwq_p = 4545 + per_cpu_ptr(wq->cpu_pwq, cpu); 4546 + struct worker_pool *pool = 4547 + &(per_cpu_ptr(cpu_worker_pools, cpu)[highpri]); 4576 4548 4577 - init_pwq(pwq, wq, &cpu_pools[highpri]); 4549 + *pwq_p = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, 4550 + pool->node); 4551 + if (!*pwq_p) 4552 + goto enomem; 4553 + 4554 + init_pwq(*pwq_p, wq, pool); 4578 4555 4579 4556 mutex_lock(&wq->mutex); 4580 - link_pwq(pwq); 4557 + link_pwq(*pwq_p); 4581 4558 mutex_unlock(&wq->mutex); 4582 4559 } 4583 4560 return 0; ··· 4601 4568 cpus_read_unlock(); 4602 4569 4603 4570 return ret; 4571 + 4572 + enomem: 4573 + if (wq->cpu_pwq) { 4574 + for_each_possible_cpu(cpu) 4575 + kfree(*per_cpu_ptr(wq->cpu_pwq, cpu)); 4576 + free_percpu(wq->cpu_pwq); 4577 + wq->cpu_pwq = NULL; 4578 + } 4579 + return -ENOMEM; 4604 4580 } 4605 4581 4606 4582 static int 
wq_clamp_max_active(int max_active, unsigned int flags, 4607 4583 const char *name) 4608 4584 { 4609 - int lim = flags & WQ_UNBOUND ? WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE; 4610 - 4611 - if (max_active < 1 || max_active > lim) 4585 + if (max_active < 1 || max_active > WQ_MAX_ACTIVE) 4612 4586 pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n", 4613 - max_active, name, 1, lim); 4587 + max_active, name, 1, WQ_MAX_ACTIVE); 4614 4588 4615 - return clamp_val(max_active, 1, lim); 4589 + return clamp_val(max_active, 1, WQ_MAX_ACTIVE); 4616 4590 } 4617 4591 4618 4592 /* ··· 4642 4602 } 4643 4603 4644 4604 rescuer->rescue_wq = wq; 4645 - rescuer->task = kthread_create(rescuer_thread, rescuer, "%s", wq->name); 4605 + rescuer->task = kthread_create(rescuer_thread, rescuer, "kworker/R-%s", wq->name); 4646 4606 if (IS_ERR(rescuer->task)) { 4647 4607 ret = PTR_ERR(rescuer->task); 4648 4608 pr_err("workqueue: Failed to create a rescuer kthread for wq \"%s\": %pe", ··· 4663 4623 unsigned int flags, 4664 4624 int max_active, ...) 4665 4625 { 4666 - size_t tbl_size = 0; 4667 4626 va_list args; 4668 4627 struct workqueue_struct *wq; 4669 4628 struct pool_workqueue *pwq; 4670 4629 4671 4630 /* 4672 - * Unbound && max_active == 1 used to imply ordered, which is no 4673 - * longer the case on NUMA machines due to per-node pools. While 4631 + * Unbound && max_active == 1 used to imply ordered, which is no longer 4632 + * the case on many machines due to per-pod pools. While 4674 4633 * alloc_ordered_workqueue() is the right way to create an ordered 4675 - * workqueue, keep the previous behavior to avoid subtle breakages 4676 - * on NUMA. 4634 + * workqueue, keep the previous behavior to avoid subtle breakages. 
	 */
	if ((flags & WQ_UNBOUND) && max_active == 1)
		flags |= __WQ_ORDERED;
···
		flags |= WQ_UNBOUND;

	/* allocate wq and format name */
-	if (flags & WQ_UNBOUND)
-		tbl_size = nr_node_ids * sizeof(wq->numa_pwq_tbl[0]);
-
-	wq = kzalloc(sizeof(*wq) + tbl_size, GFP_KERNEL);
+	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
	if (!wq)
		return NULL;
···
void destroy_workqueue(struct workqueue_struct *wq)
{
	struct pool_workqueue *pwq;
-	int node;
+	int cpu;

	/*
	 * Remove it from sysfs first so that sanity check failure doesn't
···
	list_del_rcu(&wq->list);
	mutex_unlock(&wq_pool_mutex);

-	if (!(wq->flags & WQ_UNBOUND)) {
-		wq_unregister_lockdep(wq);
-		/*
-		 * The base ref is never dropped on per-cpu pwqs. Directly
-		 * schedule RCU free.
-		 */
-		call_rcu(&wq->rcu, rcu_free_wq);
-	} else {
-		/*
-		 * We're the sole accessor of @wq at this point. Directly
-		 * access numa_pwq_tbl[] and dfl_pwq to put the base refs.
-		 * @wq will be freed when the last pwq is released.
-		 */
-		for_each_node(node) {
-			pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
-			RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
-			put_pwq_unlocked(pwq);
-		}
+	/*
+	 * We're the sole accessor of @wq. Directly access cpu_pwq and dfl_pwq
+	 * to put the base refs. @wq will be auto-destroyed from the last
+	 * pwq_put. RCU read lock prevents @wq from going away from under us.
+	 */
+	rcu_read_lock();

-		/*
-		 * Put dfl_pwq. @wq may be freed any time after dfl_pwq is
-		 * put. Don't access it afterwards.
-		 */
-		pwq = wq->dfl_pwq;
-		wq->dfl_pwq = NULL;
+	for_each_possible_cpu(cpu) {
+		pwq = rcu_access_pointer(*per_cpu_ptr(wq->cpu_pwq, cpu));
+		RCU_INIT_POINTER(*per_cpu_ptr(wq->cpu_pwq, cpu), NULL);
		put_pwq_unlocked(pwq);
	}
+
+	put_pwq_unlocked(wq->dfl_pwq);
+	wq->dfl_pwq = NULL;
+
+	rcu_read_unlock();
}
EXPORT_SYMBOL_GPL(destroy_workqueue);
···
 * unreliable and only useful as advisory hints or for debugging.
 *
 * If @cpu is WORK_CPU_UNBOUND, the test is performed on the local CPU.
- * Note that both per-cpu and unbound workqueues may be associated with
- * multiple pool_workqueues which have separate congested states. A
- * workqueue being congested on one CPU doesn't mean the workqueue is also
- * contested on other CPUs / NUMA nodes.
+ *
+ * With the exception of ordered workqueues, all workqueues have per-cpu
+ * pool_workqueues, each with its own congested state. A workqueue being
+ * congested on one CPU doesn't mean that the workqueue is contested on any
+ * other CPUs.
 *
 * Return:
 * %true if congested, %false otherwise.
···
	if (cpu == WORK_CPU_UNBOUND)
		cpu = smp_processor_id();

-	if (!(wq->flags & WQ_UNBOUND))
-		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
-	else
-		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
-
+	pwq = *per_cpu_ptr(wq->cpu_pwq, cpu);
	ret = !list_empty(&pwq->inactive_works);
+
	preempt_enable();
	rcu_read_unlock();
···
		 * worker blocking could lead to lengthy stalls. Kick off
		 * unbound chain execution of currently pending work items.
		 */
-		wake_up_worker(pool);
+		kick_pool(pool);

		raw_spin_unlock_irq(&pool->lock);
···
	for_each_pool_worker(worker, pool) {
		kthread_set_per_cpu(worker->task, pool->cpu);
		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
-						  pool->attrs->cpumask) < 0);
+						  pool_allowed_cpus(pool)) < 0);
	}

	raw_spin_lock_irq(&pool->lock);
···
		mutex_unlock(&wq_pool_attach_mutex);
	}

-	/* update NUMA affinity of unbound workqueues */
-	list_for_each_entry(wq, &workqueues, list)
-		wq_update_unbound_numa(wq, cpu, true);
+	/* update pod affinity of unbound workqueues */
+	list_for_each_entry(wq, &workqueues, list) {
+		struct workqueue_attrs *attrs = wq->unbound_attrs;
+
+		if (attrs) {
+			const struct wq_pod_type *pt = wqattrs_pod_type(attrs);
+			int tcpu;
+
+			for_each_cpu(tcpu, pt->pod_cpus[pt->cpu_pod[cpu]])
+				wq_update_pod(wq, tcpu, cpu, true);
+		}
+	}

	mutex_unlock(&wq_pool_mutex);
	return 0;
···

	unbind_workers(cpu);

-	/* update NUMA affinity of unbound workqueues */
+	/* update pod affinity of unbound workqueues */
	mutex_lock(&wq_pool_mutex);
-	list_for_each_entry(wq, &workqueues, list)
-		wq_update_unbound_numa(wq, cpu, false);
+	list_for_each_entry(wq, &workqueues, list) {
+		struct workqueue_attrs *attrs = wq->unbound_attrs;
+
+		if (attrs) {
+			const struct wq_pod_type *pt = wqattrs_pod_type(attrs);
+			int tcpu;
+
+			for_each_cpu(tcpu, pt->pod_cpus[pt->cpu_pod[cpu]])
+				wq_update_pod(wq, tcpu, cpu, false);
+		}
+	}
	mutex_unlock(&wq_pool_mutex);

	return 0;
···
			continue;

		ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs, unbound_cpumask);
-		if (!ctx) {
-			ret = -ENOMEM;
+		if (IS_ERR(ctx)) {
+			ret = PTR_ERR(ctx);
			break;
		}
···
	return ret;
}

+static int parse_affn_scope(const char *val)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(wq_affn_names); i++) {
+		if (!strncasecmp(val, wq_affn_names[i], strlen(wq_affn_names[i])))
+			return i;
+	}
+	return -EINVAL;
+}
+
+static int wq_affn_dfl_set(const char *val, const struct kernel_param *kp)
+{
+	struct workqueue_struct *wq;
+	int affn, cpu;
+
+	affn = parse_affn_scope(val);
+	if (affn < 0)
+		return affn;
+	if (affn == WQ_AFFN_DFL)
+		return -EINVAL;
+
+	cpus_read_lock();
+	mutex_lock(&wq_pool_mutex);
+
+	wq_affn_dfl = affn;
+
+	list_for_each_entry(wq, &workqueues, list) {
+		for_each_online_cpu(cpu) {
+			wq_update_pod(wq, cpu, cpu, true);
+		}
+	}
+
+	mutex_unlock(&wq_pool_mutex);
+	cpus_read_unlock();
+
+	return 0;
+}
+
+static int wq_affn_dfl_get(char *buffer, const struct kernel_param *kp)
+{
+	return scnprintf(buffer, PAGE_SIZE, "%s\n", wq_affn_names[wq_affn_dfl]);
+}
+
+static const struct kernel_param_ops wq_affn_dfl_ops = {
+	.set	= wq_affn_dfl_set,
+	.get	= wq_affn_dfl_get,
+};
+
+module_param_cb(default_affinity_scope, &wq_affn_dfl_ops, NULL, 0644);
+
#ifdef CONFIG_SYSFS
/*
 * Workqueues with WQ_SYSFS flag set is visible to userland via
 * /sys/bus/workqueue/devices/WQ_NAME. All visible workqueues have the
 * following attributes.
 *
- * per_cpu	RO bool	: whether the workqueue is per-cpu or unbound
- * max_active	RW int	: maximum number of in-flight work items
+ * per_cpu		RO bool	: whether the workqueue is per-cpu or unbound
+ * max_active		RW int	: maximum number of in-flight work items
 *
 * Unbound workqueues have the following extra attributes.
 *
- * pool_ids	RO int	: the associated pool IDs for each node
- * nice	RW int	: nice value of the workers
- * cpumask	RW mask	: bitmask of allowed CPUs for the workers
- * numa	RW bool	: whether enable NUMA affinity
+ * nice			RW int	: nice value of the workers
+ * cpumask		RW mask	: bitmask of allowed CPUs for the workers
+ * affinity_scope	RW str	: worker CPU affinity scope (cache, numa, none)
+ * affinity_strict	RW bool	: worker CPU affinity is strict
 */
struct wq_device {
	struct workqueue_struct		*wq;
···
	NULL,
};
ATTRIBUTE_GROUPS(wq_sysfs);
-
-static ssize_t wq_pool_ids_show(struct device *dev,
-				struct device_attribute *attr, char *buf)
-{
-	struct workqueue_struct *wq = dev_to_wq(dev);
-	const char *delim = "";
-	int node, written = 0;
-
-	cpus_read_lock();
-	rcu_read_lock();
-	for_each_node(node) {
-		written += scnprintf(buf + written, PAGE_SIZE - written,
-				     "%s%d:%d", delim, node,
-				     unbound_pwq_by_node(wq, node)->pool->id);
-		delim = " ";
-	}
-	written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
-	rcu_read_unlock();
-	cpus_read_unlock();
-
-	return written;
-}

static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
			    char *buf)
···
	return ret ?: count;
}

-static ssize_t wq_numa_show(struct device *dev, struct device_attribute *attr,
-			    char *buf)
+static ssize_t wq_affn_scope_show(struct device *dev,
+				  struct device_attribute *attr, char *buf)
{
	struct workqueue_struct *wq = dev_to_wq(dev);
	int written;

	mutex_lock(&wq->mutex);
-	written = scnprintf(buf, PAGE_SIZE, "%d\n",
-			    !wq->unbound_attrs->no_numa);
+	if (wq->unbound_attrs->affn_scope == WQ_AFFN_DFL)
+		written = scnprintf(buf, PAGE_SIZE, "%s (%s)\n",
+				    wq_affn_names[WQ_AFFN_DFL],
+				    wq_affn_names[wq_affn_dfl]);
+	else
+		written = scnprintf(buf, PAGE_SIZE, "%s\n",
+				    wq_affn_names[wq->unbound_attrs->affn_scope]);
	mutex_unlock(&wq->mutex);

	return written;
}

-static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
-			     const char *buf, size_t count)
+static ssize_t wq_affn_scope_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	struct workqueue_attrs *attrs;
+	int affn, ret = -ENOMEM;
+
+	affn = parse_affn_scope(buf);
+	if (affn < 0)
+		return affn;
+
+	apply_wqattrs_lock();
+	attrs = wq_sysfs_prep_attrs(wq);
+	if (attrs) {
+		attrs->affn_scope = affn;
+		ret = apply_workqueue_attrs_locked(wq, attrs);
+	}
+	apply_wqattrs_unlock();
+	free_workqueue_attrs(attrs);
+	return ret ?: count;
+}
+
+static ssize_t wq_affinity_strict_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n",
+			 wq->unbound_attrs->affn_strict);
+}
+
+static ssize_t wq_affinity_strict_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
{
	struct workqueue_struct *wq = dev_to_wq(dev);
	struct workqueue_attrs *attrs;
	int v, ret = -ENOMEM;

+	if (sscanf(buf, "%d", &v) != 1)
+		return -EINVAL;
+
	apply_wqattrs_lock();
-
	attrs = wq_sysfs_prep_attrs(wq);
-	if (!attrs)
-		goto out_unlock;
-
-	ret = -EINVAL;
-	if (sscanf(buf, "%d", &v) == 1) {
-		attrs->no_numa = !v;
+	if (attrs) {
+		attrs->affn_strict = (bool)v;
		ret = apply_workqueue_attrs_locked(wq, attrs);
	}
-
-out_unlock:
	apply_wqattrs_unlock();
	free_workqueue_attrs(attrs);
	return ret ?: count;
}

static struct device_attribute wq_sysfs_unbound_attrs[] = {
-	__ATTR(pool_ids, 0444, wq_pool_ids_show, NULL),
	__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
	__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
-	__ATTR(numa, 0644, wq_numa_show, wq_numa_store),
+	__ATTR(affinity_scope, 0644, wq_affn_scope_show, wq_affn_scope_store),
+	__ATTR(affinity_strict, 0644, wq_affinity_strict_show, wq_affinity_strict_store),
	__ATTR_NULL,
};
···

#endif /* CONFIG_WQ_WATCHDOG */

-static void __init wq_numa_init(void)
-{
-	cpumask_var_t *tbl;
-	int node, cpu;
-
-	if (num_possible_nodes() <= 1)
-		return;
-
-	if (wq_disable_numa) {
-		pr_info("workqueue: NUMA affinity support disabled\n");
-		return;
-	}
-
-	for_each_possible_cpu(cpu) {
-		if (WARN_ON(cpu_to_node(cpu) == NUMA_NO_NODE)) {
-			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
-			return;
-		}
-	}
-
-	wq_update_unbound_numa_attrs_buf = alloc_workqueue_attrs();
-	BUG_ON(!wq_update_unbound_numa_attrs_buf);
-
-	/*
-	 * We want masks of possible CPUs of each node which isn't readily
-	 * available. Build one from cpu_to_node() which should have been
-	 * fully initialized by now.
-	 */
-	tbl = kcalloc(nr_node_ids, sizeof(tbl[0]), GFP_KERNEL);
-	BUG_ON(!tbl);
-
-	for_each_node(node)
-		BUG_ON(!zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
-				node_online(node) ? node : NUMA_NO_NODE));
-
-	for_each_possible_cpu(cpu) {
-		node = cpu_to_node(cpu);
-		cpumask_set_cpu(cpu, tbl[node]);
-	}
-
-	wq_numa_possible_cpumask = tbl;
-	wq_numa_enabled = true;
-}
-
/**
 * workqueue_init_early - early init for workqueue subsystem
 *
- * This is the first half of two-staged workqueue subsystem initialization
- * and invoked as soon as the bare basics - memory allocation, cpumasks and
- * idr are up. It sets up all the data structures and system workqueues
- * and allows early boot code to create workqueues and queue/cancel work
- * items. Actual work item execution starts only after kthreads can be
- * created and scheduled right before early initcalls.
+ * This is the first step of three-staged workqueue subsystem initialization and
+ * invoked as soon as the bare basics - memory allocation, cpumasks and idr are
+ * up. It sets up all the data structures and system workqueues and allows early
+ * boot code to create workqueues and queue/cancel work items. Actual work item
+ * execution starts only after kthreads can be created and scheduled right
+ * before early initcalls.
 */
void __init workqueue_init_early(void)
{
+	struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_SYSTEM];
	int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
	int i, cpu;
···
	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_WQ));
	cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));

+	if (!cpumask_empty(&wq_cmdline_cpumask))
+		cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, &wq_cmdline_cpumask);
+
	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
+
+	wq_update_pod_attrs_buf = alloc_workqueue_attrs();
+	BUG_ON(!wq_update_pod_attrs_buf);
+
+	/* initialize WQ_AFFN_SYSTEM pods */
+	pt->pod_cpus = kcalloc(1, sizeof(pt->pod_cpus[0]), GFP_KERNEL);
+	pt->pod_node = kcalloc(1, sizeof(pt->pod_node[0]), GFP_KERNEL);
+	pt->cpu_pod = kcalloc(nr_cpu_ids, sizeof(pt->cpu_pod[0]), GFP_KERNEL);
+	BUG_ON(!pt->pod_cpus || !pt->pod_node || !pt->cpu_pod);
+
+	BUG_ON(!zalloc_cpumask_var_node(&pt->pod_cpus[0], GFP_KERNEL, NUMA_NO_NODE));
+
+	wq_update_pod_attrs_buf = alloc_workqueue_attrs();
+	BUG_ON(!wq_update_pod_attrs_buf);
+
+	pt->nr_pods = 1;
+	cpumask_copy(pt->pod_cpus[0], cpu_possible_mask);
+	pt->pod_node[0] = NUMA_NO_NODE;
+	pt->cpu_pod[0] = 0;

	/* initialize CPU pools */
	for_each_possible_cpu(cpu) {
···
			BUG_ON(init_worker_pool(pool));
			pool->cpu = cpu;
			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
+			cpumask_copy(pool->attrs->__pod_cpumask, cpumask_of(cpu));
			pool->attrs->nice = std_nice[i++];
+			pool->attrs->affn_strict = true;
			pool->node = cpu_to_node(cpu);

			/* alloc pool ID */
···
		/*
		 * An ordered wq should have only one pwq as ordering is
		 * guaranteed by max_active which is enforced by pwqs.
-		 * Turn off NUMA so that dfl_pwq is used for all nodes.
		 */
		BUG_ON(!(attrs = alloc_workqueue_attrs()));
		attrs->nice = std_nice[i];
-		attrs->no_numa = true;
+		attrs->ordered = true;
		ordered_wq_attrs[i] = attrs;
	}
···
	system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
	system_long_wq = alloc_workqueue("events_long", 0, 0);
	system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
-					    WQ_UNBOUND_MAX_ACTIVE);
+					    WQ_MAX_ACTIVE);
	system_freezable_wq = alloc_workqueue("events_freezable",
					      WQ_FREEZABLE, 0);
	system_power_efficient_wq = alloc_workqueue("events_power_efficient",
···
	/* if the user set it to a specific value, keep it */
	if (wq_cpu_intensive_thresh_us != ULONG_MAX)
		return;
+
+	pwq_release_worker = kthread_create_worker(0, "pool_workqueue_release");
+	BUG_ON(IS_ERR(pwq_release_worker));

	/*
	 * The default of 10ms is derived from the fact that most modern (as of
···
/**
 * workqueue_init - bring workqueue subsystem fully online
 *
- * This is the latter half of two-staged workqueue subsystem initialization
- * and invoked as soon as kthreads can be created and scheduled.
- * Workqueues have been created and work items queued on them, but there
- * are no kworkers executing the work items yet. Populate the worker pools
- * with the initial workers and enable future kworker creations.
+ * This is the second step of three-staged workqueue subsystem initialization
+ * and invoked as soon as kthreads can be created and scheduled. Workqueues have
+ * been created and work items queued on them, but there are no kworkers
+ * executing the work items yet. Populate the worker pools with the initial
+ * workers and enable future kworker creations.
 */
void __init workqueue_init(void)
{
···

	wq_cpu_intensive_thresh_init();

-	/*
-	 * It'd be simpler to initialize NUMA in workqueue_init_early() but
-	 * CPU to node mapping may not be available that early on some
-	 * archs such as power and arm64. As per-cpu pools created
-	 * previously could be missing node hint and unbound pools NUMA
-	 * affinity, fix them up.
-	 *
-	 * Also, while iterating workqueues, create rescuers if requested.
-	 */
-	wq_numa_init();
-
	mutex_lock(&wq_pool_mutex);

+	/*
+	 * Per-cpu pools created earlier could be missing node hint. Fix them
+	 * up. Also, create a rescuer for workqueues that requested it.
+	 */
	for_each_possible_cpu(cpu) {
		for_each_cpu_worker_pool(pool, cpu) {
			pool->node = cpu_to_node(cpu);
···
	}

	list_for_each_entry(wq, &workqueues, list) {
-		wq_update_unbound_numa(wq, smp_processor_id(), true);
		WARN(init_rescuer(wq),
		     "workqueue: failed to create early rescuer for %s",
		     wq->name);
···
}

/*
- * Despite the naming, this is a no-op function which is here only for avoiding
- * link error. Since compile-time warning may fail to catch, we will need to
- * emit run-time warning from __flush_workqueue().
+ * Initialize @pt by first initializing @pt->cpu_pod[] with pod IDs according to
+ * @cpus_share_pod(). Each subset of CPUs that share a pod is assigned a unique
+ * and consecutive pod ID. The rest of @pt is initialized accordingly.
 */
-void __warn_flushing_systemwide_wq(void) { }
+static void __init init_pod_type(struct wq_pod_type *pt,
+				 bool (*cpus_share_pod)(int, int))
+{
+	int cur, pre, cpu, pod;
+
+	pt->nr_pods = 0;
+
+	/* init @pt->cpu_pod[] according to @cpus_share_pod() */
+	pt->cpu_pod = kcalloc(nr_cpu_ids, sizeof(pt->cpu_pod[0]), GFP_KERNEL);
+	BUG_ON(!pt->cpu_pod);
+
+	for_each_possible_cpu(cur) {
+		for_each_possible_cpu(pre) {
+			if (pre >= cur) {
+				pt->cpu_pod[cur] = pt->nr_pods++;
+				break;
+			}
+			if (cpus_share_pod(cur, pre)) {
+				pt->cpu_pod[cur] = pt->cpu_pod[pre];
+				break;
+			}
+		}
+	}
+
+	/* init the rest to match @pt->cpu_pod[] */
+	pt->pod_cpus = kcalloc(pt->nr_pods, sizeof(pt->pod_cpus[0]), GFP_KERNEL);
+	pt->pod_node = kcalloc(pt->nr_pods, sizeof(pt->pod_node[0]), GFP_KERNEL);
+	BUG_ON(!pt->pod_cpus || !pt->pod_node);
+
+	for (pod = 0; pod < pt->nr_pods; pod++)
+		BUG_ON(!zalloc_cpumask_var(&pt->pod_cpus[pod], GFP_KERNEL));
+
+	for_each_possible_cpu(cpu) {
+		cpumask_set_cpu(cpu, pt->pod_cpus[pt->cpu_pod[cpu]]);
+		pt->pod_node[pt->cpu_pod[cpu]] = cpu_to_node(cpu);
+	}
+}
+
+static bool __init cpus_dont_share(int cpu0, int cpu1)
+{
+	return false;
+}
+
+static bool __init cpus_share_smt(int cpu0, int cpu1)
+{
+#ifdef CONFIG_SCHED_SMT
+	return cpumask_test_cpu(cpu0, cpu_smt_mask(cpu1));
+#else
+	return false;
+#endif
+}
+
+static bool __init cpus_share_numa(int cpu0, int cpu1)
+{
+	return cpu_to_node(cpu0) == cpu_to_node(cpu1);
+}
+
+/**
+ * workqueue_init_topology - initialize CPU pods for unbound workqueues
+ *
+ * This is the third step of three-staged workqueue subsystem initialization and
+ * invoked after SMP and topology information are fully initialized. It
+ * initializes the unbound CPU pods accordingly.
+ */
+void __init workqueue_init_topology(void)
+{
+	struct workqueue_struct *wq;
+	int cpu;
+
+	init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
+	init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
+	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
+	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
+
+	mutex_lock(&wq_pool_mutex);
+
+	/*
+	 * Workqueues allocated earlier would have all CPUs sharing the default
+	 * worker pool. Explicitly call wq_update_pod() on all workqueue and CPU
+	 * combinations to apply per-pod sharing.
+	 */
+	list_for_each_entry(wq, &workqueues, list) {
+		for_each_online_cpu(cpu) {
+			wq_update_pod(wq, cpu, cpu, true);
+		}
+	}
+
+	mutex_unlock(&wq_pool_mutex);
+}
+
+void __warn_flushing_systemwide_wq(void)
+{
+	pr_warn("WARNING: Flushing system-wide workqueues will be prohibited in near future.\n");
+	dump_stack();
+}
EXPORT_SYMBOL(__warn_flushing_systemwide_wq);
+
+static int __init workqueue_unbound_cpus_setup(char *str)
+{
+	if (cpulist_parse(str, &wq_cmdline_cpumask) < 0) {
+		cpumask_clear(&wq_cmdline_cpumask);
+		pr_warn("workqueue.unbound_cpus: incorrect CPU range, using default\n");
+	}
+
+	return 1;
+}
+__setup("workqueue.unbound_cpus=", workqueue_unbound_cpus_setup);
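The pod-ID assignment done by init_pod_type() above is the core of the new affinity scopes: CPUs are scanned in order, and each CPU either joins the pod of the first earlier CPU it shares the scope with or opens a new, consecutively numbered pod. A minimal pure-Python sketch of that assignment (illustrative only; `build_pods` and its `share` callback are made-up stand-ins for `init_pod_type()` and `cpus_share_pod()`):

```python
def build_pods(cpus, share):
    """Mirror init_pod_type()'s cpu_pod[] assignment. `share(a, b)` plays
    the role of the kernel's @cpus_share_pod() callback."""
    cpus = list(cpus)
    cpu_pod = {}
    nr_pods = 0
    for cur in cpus:
        for pre in cpus:
            if pre >= cur:
                # no earlier CPU shares a pod with @cur; open a new pod
                cpu_pod[cur] = nr_pods
                nr_pods += 1
                break
            if share(cur, pre):
                cpu_pod[cur] = cpu_pod[pre]
                break
    # reverse mapping, like the kernel's pt->pod_cpus[]
    pod_cpus = [set() for _ in range(nr_pods)]
    for cpu, pod in cpu_pod.items():
        pod_cpus[pod].add(cpu)
    return cpu_pod, pod_cpus

# e.g. four CPUs where {0,1} and {2,3} each share an L3 cache
cpu_pod, pod_cpus = build_pods(range(4), lambda a, b: a // 2 == b // 2)
```

With the cache-sharing callback above this yields two pods, `{0, 1}` and `{2, 3}`, matching what WQ_AFFN_CACHE would build on such a machine.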
+1 -1
kernel/workqueue_internal.h
···
						/* A: runs through worker->node */

	unsigned long		last_active;	/* K: last active timestamp */
-	unsigned int		flags;		/* X: flags */
+	unsigned int		flags;		/* L: flags */
	int			id;		/* I: worker id */

	/*
+177
tools/workqueue/wq_dump.py
···
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2023 Tejun Heo <tj@kernel.org>
+# Copyright (C) 2023 Meta Platforms, Inc. and affiliates.
+
+desc = """
+This is a drgn script to show the current workqueue configuration. For more
+info on drgn, visit https://github.com/osandov/drgn.
+
+Affinity Scopes
+===============
+
+Shows the CPUs that can be used for unbound workqueues and how they will be
+grouped by each available affinity type. For each type:
+
+  nr_pods   number of CPU pods in the affinity type
+  pod_cpus  CPUs in each pod
+  pod_node  NUMA node for memory allocation for each pod
+  cpu_pod   pod that each CPU is associated to
+
+Worker Pools
+============
+
+Lists all worker pools indexed by their ID. For each pool:
+
+  ref       number of pool_workqueue's associated with this pool
+  nice      nice value of the worker threads in the pool
+  idle      number of idle workers
+  workers   number of all workers
+  cpu       CPU the pool is associated with (per-cpu pool)
+  cpus      CPUs the workers in the pool can run on (unbound pool)
+
+Workqueue CPU -> pool
+=====================
+
+Lists all workqueues along with their type and worker pool association. For
+each workqueue:
+
+  NAME TYPE[,FLAGS] POOL_ID...
+
+  NAME      name of the workqueue
+  TYPE      percpu, unbound or ordered
+  FLAGS     S: strict affinity scope
+  POOL_ID   worker pool ID associated with each possible CPU
+"""
+
+import sys
+
+import drgn
+from drgn.helpers.linux.list import list_for_each_entry,list_empty
+from drgn.helpers.linux.percpu import per_cpu_ptr
+from drgn.helpers.linux.cpumask import for_each_cpu,for_each_possible_cpu
+from drgn.helpers.linux.idr import idr_for_each
+
+import argparse
+parser = argparse.ArgumentParser(description=desc,
+                                 formatter_class=argparse.RawTextHelpFormatter)
+args = parser.parse_args()
+
+def err(s):
+    print(s, file=sys.stderr, flush=True)
+    sys.exit(1)
+
+def cpumask_str(cpumask):
+    output = ""
+    base = 0
+    v = 0
+    for cpu in for_each_cpu(cpumask[0]):
+        while cpu - base >= 32:
+            output += f'{hex(v)} '
+            base += 32
+            v = 0
+        v |= 1 << (cpu - base)
+    if v > 0:
+        output += f'{v:08x}'
+    return output.strip()
+
+worker_pool_idr    = prog['worker_pool_idr']
+workqueues         = prog['workqueues']
+wq_unbound_cpumask = prog['wq_unbound_cpumask']
+wq_pod_types       = prog['wq_pod_types']
+wq_affn_dfl        = prog['wq_affn_dfl']
+wq_affn_names      = prog['wq_affn_names']
+
+WQ_UNBOUND         = prog['WQ_UNBOUND']
+WQ_ORDERED         = prog['__WQ_ORDERED']
+WQ_MEM_RECLAIM     = prog['WQ_MEM_RECLAIM']
+
+WQ_AFFN_CPU        = prog['WQ_AFFN_CPU']
+WQ_AFFN_SMT        = prog['WQ_AFFN_SMT']
+WQ_AFFN_CACHE      = prog['WQ_AFFN_CACHE']
+WQ_AFFN_NUMA       = prog['WQ_AFFN_NUMA']
+WQ_AFFN_SYSTEM     = prog['WQ_AFFN_SYSTEM']
+
+print('Affinity Scopes')
+print('===============')
+
+print(f'wq_unbound_cpumask={cpumask_str(wq_unbound_cpumask)}')
+
+def print_pod_type(pt):
+    print(f'  nr_pods  {pt.nr_pods.value_()}')
+
+    print('  pod_cpus', end='')
+    for pod in range(pt.nr_pods):
+        print(f' [{pod}]={cpumask_str(pt.pod_cpus[pod])}', end='')
+    print('')
+
+    print('  pod_node', end='')
+    for pod in range(pt.nr_pods):
+        print(f' [{pod}]={pt.pod_node[pod].value_()}', end='')
+    print('')
+
+    print(f'  cpu_pod ', end='')
+    for cpu in for_each_possible_cpu(prog):
+        print(f' [{cpu}]={pt.cpu_pod[cpu].value_()}', end='')
+    print('')
+
+for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
+    print('')
+    print(f'{wq_affn_names[affn].string_().decode().upper()}{" (default)" if affn == wq_affn_dfl else ""}')
+    print_pod_type(wq_pod_types[affn])
+
+print('')
+print('Worker Pools')
+print('============')
+
+max_pool_id_len = 0
+max_ref_len = 0
+for pi, pool in idr_for_each(worker_pool_idr):
+    pool = drgn.Object(prog, 'struct worker_pool', address=pool)
+    max_pool_id_len = max(max_pool_id_len, len(f'{pi}'))
+    max_ref_len = max(max_ref_len, len(f'{pool.refcnt.value_()}'))
+
+for pi, pool in idr_for_each(worker_pool_idr):
+    pool = drgn.Object(prog, 'struct worker_pool', address=pool)
+    print(f'pool[{pi:0{max_pool_id_len}}] ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
+    print(f'idle/workers={pool.nr_idle.value_():3}/{pool.nr_workers.value_():3} ', end='')
+    if pool.cpu >= 0:
+        print(f'cpu={pool.cpu.value_():3}', end='')
+    else:
+        print(f'cpus={cpumask_str(pool.attrs.cpumask)}', end='')
+        print(f' pod_cpus={cpumask_str(pool.attrs.__pod_cpumask)}', end='')
+        if pool.attrs.affn_strict:
+            print(' strict', end='')
+    print('')
+
+print('')
+print('Workqueue CPU -> pool')
+print('=====================')
+
+print('[    workqueue \ type   CPU', end='')
+for cpu in for_each_possible_cpu(prog):
+    print(f' {cpu:{max_pool_id_len}}', end='')
+print(' dfl]')
+
+for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(), 'list'):
+    print(f'{wq.name.string_().decode()[-24:]:24}', end='')
+    if wq.flags & WQ_UNBOUND:
+        if wq.flags & WQ_ORDERED:
+            print(' ordered   ', end='')
+        else:
+            print(' unbound', end='')
+            if wq.unbound_attrs.affn_strict:
+                print(',S ', end='')
+            else:
+                print('   ', end='')
+    else:
+        print(' percpu    ', end='')
+
+    for cpu in for_each_possible_cpu(prog):
+        pool_id = per_cpu_ptr(wq.cpu_pwq, cpu)[0].pool.id.value_()
+        field_len = max(len(str(cpu)), max_pool_id_len)
+        print(f' {pool_id:{field_len}}', end='')
+
+    if wq.flags & WQ_UNBOUND:
+        print(f' {wq.dfl_pwq.pool.id.value_():{max_pool_id_len}}', end='')
+    print('')
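wq_dump.py's cpumask_str() packs CPU numbers into 32-bit hex chunks, low chunk first, emitting a completed chunk only once a CPU beyond it is reached. A drgn-free sketch of the same chunking over a plain iterable of CPU numbers (illustrative, not part of the patch; note the script's own quirk that completed chunks use `hex()` while the final chunk is zero-padded):

```python
def cpumask_str(cpus):
    """Format a set of CPU numbers the way wq_dump.py does:
    32-bit chunks, lowest chunk first."""
    output = ""
    base = 0
    v = 0
    for cpu in sorted(cpus):          # for_each_cpu() yields ascending CPUs
        while cpu - base >= 32:       # flush chunks below this CPU
            output += f'{hex(v)} '
            base += 32
            v = 0
        v |= 1 << (cpu - base)        # set the bit within the current chunk
    if v > 0:
        output += f'{v:08x}'          # final chunk, zero-padded
    return output.strip()
```

For example, CPUs {0, 1} format as `00000003`, while {0, 33} spans two chunks and formats as `0x1 00000002`.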
+14 -7
tools/workqueue/wq_monitor.py
···
            and got excluded from concurrency management to avoid stalling
            other work items.

-  CMwake   The number of concurrency-management wake-ups while executing a
-           work item of the workqueue.
+  CMW/RPR  For per-cpu workqueues, the number of concurrency-management
+           wake-ups while executing a work item of the workqueue. For
+           unbound workqueues, the number of times a worker was repatriated
+           to its affinity scope after being migrated to an off-scope CPU by
+           the scheduler.

  mayday   The number of times the rescuer was requested while waiting for
           new worker creation.
···
PWQ_STAT_CPU_TIME      = prog['PWQ_STAT_CPU_TIME']      # total CPU time consumed
PWQ_STAT_CPU_INTENSIVE = prog['PWQ_STAT_CPU_INTENSIVE'] # wq_cpu_intensive_thresh_us violations
PWQ_STAT_CM_WAKEUP     = prog['PWQ_STAT_CM_WAKEUP']     # concurrency-management worker wakeups
+PWQ_STAT_REPATRIATED   = prog['PWQ_STAT_REPATRIATED']   # unbound workers brought back into scope
PWQ_STAT_MAYDAY        = prog['PWQ_STAT_MAYDAY']        # maydays to rescuer
PWQ_STAT_RESCUED       = prog['PWQ_STAT_RESCUED']       # linked work items executed by rescuer
PWQ_NR_STATS           = prog['PWQ_NR_STATS']
···
                'cpu_time'      : self.stats[PWQ_STAT_CPU_TIME],
                'cpu_intensive' : self.stats[PWQ_STAT_CPU_INTENSIVE],
                'cm_wakeup'     : self.stats[PWQ_STAT_CM_WAKEUP],
+               'repatriated'   : self.stats[PWQ_STAT_REPATRIATED],
                'mayday'        : self.stats[PWQ_STAT_MAYDAY],
                'rescued'       : self.stats[PWQ_STAT_RESCUED], }

    def table_header_str():
        return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} '\
-              f'{"CPUitsv":>7} {"CMwake":>7} {"mayday":>7} {"rescued":>7}'
+              f'{"CPUitsv":>7} {"CMW/RPR":>7} {"mayday":>7} {"rescued":>7}'

    def table_row_str(self):
        cpu_intensive = '-'
-       cm_wakeup = '-'
+       cmw_rpr = '-'
        mayday = '-'
        rescued = '-'

-       if not self.unbound:
+       if self.unbound:
+           cmw_rpr = str(self.stats[PWQ_STAT_REPATRIATED]);
+       else:
            cpu_intensive = str(self.stats[PWQ_STAT_CPU_INTENSIVE])
-           cm_wakeup = str(self.stats[PWQ_STAT_CM_WAKEUP])
+           cmw_rpr = str(self.stats[PWQ_STAT_CM_WAKEUP])

        if self.mem_reclaim:
            mayday = str(self.stats[PWQ_STAT_MAYDAY])
···
              f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
              f'{self.stats[PWQ_STAT_CPU_TIME] / 1000000:8.1f} ' \
              f'{cpu_intensive:>7} ' \
-             f'{cm_wakeup:>7} ' \
+             f'{cmw_rpr:>7} ' \
              f'{mayday:>7} ' \
              f'{rescued:>7} '
        return out.rstrip(':')
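The CMW/RPR column introduced above is dual-purpose: per-cpu workqueues report concurrency-management wake-ups, unbound ones report repatriations. A standalone sketch of that selection (the helper name is made up; the dict keys mirror the stats dict wq_monitor.py builds):

```python
def cmw_rpr_column(unbound, stats):
    """Pick what wq_monitor.py shows in the CMW/RPR column: repatriation
    count for unbound workqueues, CM wake-ups for per-cpu ones."""
    if unbound:
        return str(stats['repatriated'])
    return str(stats['cm_wakeup'])
```

So a row for an unbound workqueue with `{'repatriated': 3, 'cm_wakeup': 9}` shows `3` in that column, while a per-cpu workqueue with the same stats shows `9`.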