Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

workqueue: Implement non-strict affinity scope for unbound workqueues

An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how
the CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs
and the system has two L3 caches. The workqueue would be mapped to two
worker_pools, each serving one L3 cache domain.
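The pod grouping described above can be sketched as a toy model (illustrative
Python, not kernel code; the CPU-to-L3 mapping is an assumed example):

```python
# Toy model of affinity-scope pods: CPUs that share a last-level cache
# (as cpus_share_cache() would report) land in the same pod, and each
# pod gets its own worker_pool.
from collections import defaultdict

def build_pods(cpu_to_llc):
    """Group CPU IDs into pods keyed by their L3 (LLC) ID."""
    pods = defaultdict(list)
    for cpu in sorted(cpu_to_llc):
        pods[cpu_to_llc[cpu]].append(cpu)
    return [pods[llc] for llc in sorted(pods)]

# A machine with 8 CPUs and two L3 caches -> two pods, two worker_pools.
cpu_to_llc = {cpu: 0 if cpu < 4 else 1 for cpu in range(8)}
print(build_pods(cpu_to_llc))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```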

While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.

While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to always be by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.

This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring the task back inside the pod if the worker
is outside, i.e. work items start executing within their affinity scope but
can be migrated outside as the scheduler sees fit. This removes the hard cap
on
utilization while maintaining the benefits of affinity scopes.

After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:

* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.

* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
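In toy form, the two changes above amount to the following (illustrative
Python, not the kernel implementation; dicts stand in for worker_pool and
task_struct, and random.choice stands in for cpumask_any_distribute()):

```python
import random

def pool_allowed_cpus(pool):
    # Strict pools confine workers to the pod; non-strict pools let the
    # workers run anywhere the associated workqueues' cpumask allows.
    if pool["affn_strict"]:
        return pool["pod_cpumask"]
    return pool["cpumask"]

def kick_pool(pool, worker):
    # Repatriation: before waking an idle worker, steer its wake CPU
    # back into the pod if the scheduler migrated it outside.
    if not pool["affn_strict"] and worker["wake_cpu"] not in pool["pod_cpumask"]:
        worker["wake_cpu"] = random.choice(sorted(pool["pod_cpumask"]))

pool = {"affn_strict": False, "cpumask": set(range(8)), "pod_cpumask": {0, 1, 2, 3}}
worker = {"wake_cpu": 6}            # worker drifted outside its pod
kick_pool(pool, worker)
assert worker["wake_cpu"] in pool["pod_cpumask"]
assert pool_allowed_cpus(pool) == set(range(8))
```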

This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.

There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.

While non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is too complicated
to discuss fully in this patch, so the behavior is made easily selectable
through wqattrs and sysfs, and the next patch will add documentation
discussing the performance implications.
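For reference, on a kernel with this patch the knob would be exercised
through sysfs roughly as follows (a sketch; assumes a WQ_SYSFS workqueue
named "writeback" is present):

```shell
# Read the current scope and strictness (0 = non-strict, the new default).
cat /sys/devices/virtual/workqueue/writeback/affinity_scope
cat /sys/devices/virtual/workqueue/writeback/affinity_strict

# Enforce the scope strictly, e.g. to approximate the old NUMA-bound behavior.
echo 1 > /sys/devices/virtual/workqueue/writeback/affinity_strict
```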

v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>

+132 -20
+23 -7
Documentation/core-api/workqueue.rst
···
 An unbound workqueue groups CPUs according to its affinity scope to improve
 cache locality. For example, if a workqueue is using the default affinity
 scope of "cache", it will group CPUs according to last level cache
-boundaries. A work item queued on the workqueue will be processed by a
-worker running on one of the CPUs which share the last level cache with the
-issuing CPU.
+boundaries. A work item queued on the workqueue will be assigned to a worker
+on one of the CPUs which share the last level cache with the issuing CPU.
+Once started, the worker may or may not be allowed to move outside the scope
+depending on the ``affinity_strict`` setting of the scope.

 Workqueue currently supports the following five affinity scopes.
···

 ``affinity_scope``
   Read to see the current affinity scope. Write to change.
+
+``affinity_strict``
+  0 by default indicating that affinity scopes are not strict. When a work
+  item starts execution, workqueue makes a best-effort attempt to ensure
+  that the worker is inside its affinity scope, which is called
+  repatriation. Once started, the scheduler is free to move the worker
+  anywhere in the system as it sees fit. This enables benefiting from scope
+  locality while still being able to utilize other CPUs if necessary and
+  available.
+
+  If set to 1, all workers of the scope are guaranteed always to be in the
+  scope. This may be useful when crossing affinity scopes has other
+  implications, for example, in terms of power consumption or workload
+  isolation. Strict NUMA scope can also be used to match the workqueue
+  behavior of older kernels.

···
 Examining Configuration
···
 Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::

   $ tools/workqueue/wq_monitor.py events
-                           total  infl CPUtime  CPUhog  CMwake  mayday rescued
+                           total  infl CPUtime  CPUhog CMW/RPR  mayday rescued
   events                   18545     0     6.1       0       5       -       -
   events_highpri               8     0     0.0       0       0       -       -
   events_long                  3     0     0.0       0       0       -       -
-  events_unbound           38306     0     0.1       -       -       -       -
+  events_unbound           38306     0     0.1       -       7       -       -
   events_freezable             0     0     0.0       0       0       -       -
   events_power_efficient   29598     0     0.2       0       0       -       -
   events_freezable_power_     10     0     0.0       0       0       -       -
   sock_diag_events             0     0     0.0       0       0       -       -

-                           total  infl CPUtime  CPUhog  CMwake  mayday rescued
+                           total  infl CPUtime  CPUhog CMW/RPR  mayday rescued
   events                   18548     0     6.1       0       5       -       -
   events_highpri               8     0     0.0       0       0       -       -
   events_long                  3     0     0.0       0       0       -       -
-  events_unbound           38322     0     0.1       -       -       -       -
+  events_unbound           38322     0     0.1       -       7       -       -
   events_freezable             0     0     0.0       0       0       -       -
   events_power_efficient   29603     0     0.2       0       0       -       -
   events_freezable_power_     10     0     0.0       0       0       -       -
+11
include/linux/workqueue.h
···
 	 */
 	cpumask_var_t __pod_cpumask;

+	/**
+	 * @affn_strict: affinity scope is strict
+	 *
+	 * If clear, workqueue will make a best-effort attempt at starting the
+	 * worker inside @__pod_cpumask but the scheduler is free to migrate it
+	 * outside.
+	 *
+	 * If set, workers are only allowed to run inside @__pod_cpumask.
+	 */
+	bool affn_strict;
+
 	/*
 	 * Below fields aren't properties of a worker_pool. They only modify how
 	 * :c:func:`apply_workqueue_attrs` select pools and thus don't
+72 -2
kernel/workqueue.c
···
 	PWQ_STAT_CPU_TIME,	/* total CPU time consumed */
 	PWQ_STAT_CPU_INTENSIVE,	/* wq_cpu_intensive_thresh_us violations */
 	PWQ_STAT_CM_WAKEUP,	/* concurrency-management worker wakeups */
+	PWQ_STAT_REPATRIATED,	/* unbound workers brought back into scope */
 	PWQ_STAT_MAYDAY,	/* maydays to rescuer */
 	PWQ_STAT_RESCUED,	/* linked work items executed by rescuer */
···
 static bool kick_pool(struct worker_pool *pool)
 {
 	struct worker *worker = first_idle_worker(pool);
+	struct task_struct *p;

 	lockdep_assert_held(&pool->lock);

 	if (!need_more_worker(pool) || !worker)
 		return false;

-	wake_up_process(worker->task);
+	p = worker->task;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Idle @worker is about to execute @work and waking up provides an
+	 * opportunity to migrate @worker at a lower cost by setting the task's
+	 * wake_cpu field. Let's see if we want to move @worker to improve
+	 * execution locality.
+	 *
+	 * We're waking the worker that went idle the latest and there's some
+	 * chance that @worker is marked idle but hasn't gone off CPU yet. If
+	 * so, setting the wake_cpu won't do anything. As this is a best-effort
+	 * optimization and the race window is narrow, let's leave as-is for
+	 * now. If this becomes pronounced, we can skip over workers which are
+	 * still on cpu when picking an idle worker.
+	 *
+	 * If @pool has non-strict affinity, @worker might have ended up outside
+	 * its affinity scope. Repatriate.
+	 */
+	if (!pool->attrs->affn_strict &&
+	    !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
+		struct work_struct *work = list_first_entry(&pool->worklist,
+						struct work_struct, entry);
+		p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
+		get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
+	}
+#endif
+	wake_up_process(p);
 	return true;
 }
···
 static cpumask_t *pool_allowed_cpus(struct worker_pool *pool)
 {
-	return pool->attrs->__pod_cpumask;
+	if (pool->cpu < 0 && pool->attrs->affn_strict)
+		return pool->attrs->__pod_cpumask;
+	else
+		return pool->attrs->cpumask;
 }

 /**
···
 	to->nice = from->nice;
 	cpumask_copy(to->cpumask, from->cpumask);
 	cpumask_copy(to->__pod_cpumask, from->__pod_cpumask);
+	to->affn_strict = from->affn_strict;

 	/*
 	 * Unlike hash and equality test, copying shouldn't ignore wq-only
···
 		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
 	hash = jhash(cpumask_bits(attrs->__pod_cpumask),
 		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
+	hash = jhash_1word(attrs->affn_strict, hash);
 	return hash;
 }
···
 	if (!cpumask_equal(a->cpumask, b->cpumask))
 		return false;
 	if (!cpumask_equal(a->__pod_cpumask, b->__pod_cpumask))
+		return false;
+	if (a->affn_strict != b->affn_strict)
 		return false;
 	return true;
 }
···
  *  nice		RW int	: nice value of the workers
  *  cpumask		RW mask	: bitmask of allowed CPUs for the workers
  *  affinity_scope	RW str	: worker CPU affinity scope (cache, numa, none)
+ *  affinity_strict	RW bool	: worker CPU affinity is strict
  */
 struct wq_device {
 	struct workqueue_struct *wq;
···
 	return ret ?: count;
 }

+static ssize_t wq_affinity_strict_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n",
+			 wq->unbound_attrs->affn_strict);
+}
+
+static ssize_t wq_affinity_strict_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	struct workqueue_attrs *attrs;
+	int v, ret = -ENOMEM;
+
+	if (sscanf(buf, "%d", &v) != 1)
+		return -EINVAL;
+
+	apply_wqattrs_lock();
+	attrs = wq_sysfs_prep_attrs(wq);
+	if (attrs) {
+		attrs->affn_strict = (bool)v;
+		ret = apply_workqueue_attrs_locked(wq, attrs);
+	}
+	apply_wqattrs_unlock();
+	free_workqueue_attrs(attrs);
+	return ret ?: count;
+}
+
 static struct device_attribute wq_sysfs_unbound_attrs[] = {
 	__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
 	__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
 	__ATTR(affinity_scope, 0644, wq_affn_scope_show, wq_affn_scope_store),
+	__ATTR(affinity_strict, 0644, wq_affinity_strict_show, wq_affinity_strict_store),
 	__ATTR_NULL,
 };
···
 		cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
 		cpumask_copy(pool->attrs->__pod_cpumask, cpumask_of(cpu));
 		pool->attrs->nice = std_nice[i++];
+		pool->attrs->affn_strict = true;
 		pool->node = cpu_to_node(cpu);

 		/* alloc pool ID */
+12 -4
tools/workqueue/wq_dump.py
···
 Lists all workqueues along with their type and worker pool association. For
 each workqueue:

-  NAME TYPE POOL_ID...
+  NAME TYPE[,FLAGS] POOL_ID...

   NAME      name of the workqueue
   TYPE      percpu, unbound or ordered
+  FLAGS     S: strict affinity scope
   POOL_ID   worker pool ID associated with each possible CPU
 """
···
         print(f'cpu={pool.cpu.value_():3}', end='')
     else:
         print(f'cpus={cpumask_str(pool.attrs.cpumask)}', end='')
+        print(f' pod_cpus={cpumask_str(pool.attrs.__pod_cpumask)}', end='')
+        if pool.attrs.affn_strict:
+            print(' strict', end='')
     print('')

 print('')
 print('Workqueue CPU -> pool')
 print('=====================')

-print('[    workqueue \ CPU              ', end='')
+print('[    workqueue \ type   CPU', end='')
 for cpu in for_each_possible_cpu(prog):
     print(f' {cpu:{max_pool_id_len}}', end='')
 print(' dfl]')
···
     print(f'{wq.name.string_().decode()[-24:]:24}', end='')
     if wq.flags & WQ_UNBOUND:
         if wq.flags & WQ_ORDERED:
-            print(' ordered', end='')
+            print(' ordered   ', end='')
         else:
             print(' unbound', end='')
+            if wq.unbound_attrs.affn_strict:
+                print(',S ', end='')
+            else:
+                print('   ', end='')
     else:
-        print(' percpu ', end='')
+        print(' percpu    ', end='')

     for cpu in for_each_possible_cpu(prog):
         pool_id = per_cpu_ptr(wq.cpu_pwq, cpu)[0].pool.id.value_()
+14 -7
tools/workqueue/wq_monitor.py
···
             and got excluded from concurrency management to avoid stalling
             other work items.

-  CMwake    The number of concurrency-management wake-ups while executing a
-            work item of the workqueue.
+  CMW/RPR   For per-cpu workqueues, the number of concurrency-management
+            wake-ups while executing a work item of the workqueue. For
+            unbound workqueues, the number of times a worker was repatriated
+            to its affinity scope after being migrated to an off-scope CPU by
+            the scheduler.

   mayday    The number of times the rescuer was requested while waiting for
             new worker creation.
···
 PWQ_STAT_CPU_TIME        = prog['PWQ_STAT_CPU_TIME']        # total CPU time consumed
 PWQ_STAT_CPU_INTENSIVE   = prog['PWQ_STAT_CPU_INTENSIVE']   # wq_cpu_intensive_thresh_us violations
 PWQ_STAT_CM_WAKEUP       = prog['PWQ_STAT_CM_WAKEUP']       # concurrency-management worker wakeups
+PWQ_STAT_REPATRIATED     = prog['PWQ_STAT_REPATRIATED']     # unbound workers brought back into scope
 PWQ_STAT_MAYDAY          = prog['PWQ_STAT_MAYDAY']          # maydays to rescuer
 PWQ_STAT_RESCUED         = prog['PWQ_STAT_RESCUED']         # linked work items executed by rescuer
 PWQ_NR_STATS             = prog['PWQ_NR_STATS']
···
             'cpu_time'      : self.stats[PWQ_STAT_CPU_TIME],
             'cpu_intensive' : self.stats[PWQ_STAT_CPU_INTENSIVE],
             'cm_wakeup'     : self.stats[PWQ_STAT_CM_WAKEUP],
+            'repatriated'   : self.stats[PWQ_STAT_REPATRIATED],
             'mayday'        : self.stats[PWQ_STAT_MAYDAY],
             'rescued'       : self.stats[PWQ_STAT_RESCUED], }

     def table_header_str():
         return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} '\
-               f'{"CPUitsv":>7} {"CMwake":>7} {"mayday":>7} {"rescued":>7}'
+               f'{"CPUitsv":>7} {"CMW/RPR":>7} {"mayday":>7} {"rescued":>7}'

     def table_row_str(self):
         cpu_intensive = '-'
-        cm_wakeup = '-'
+        cmw_rpr = '-'
         mayday = '-'
         rescued = '-'

-        if not self.unbound:
+        if self.unbound:
+            cmw_rpr = str(self.stats[PWQ_STAT_REPATRIATED])
+        else:
             cpu_intensive = str(self.stats[PWQ_STAT_CPU_INTENSIVE])
-            cm_wakeup = str(self.stats[PWQ_STAT_CM_WAKEUP])
+            cmw_rpr = str(self.stats[PWQ_STAT_CM_WAKEUP])

         if self.mem_reclaim:
             mayday = str(self.stats[PWQ_STAT_MAYDAY])
···
               f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
               f'{self.stats[PWQ_STAT_CPU_TIME] / 1000000:8.1f} ' \
               f'{cpu_intensive:>7} ' \
-              f'{cm_wakeup:>7} ' \
+              f'{cmw_rpr:>7} ' \
               f'{mayday:>7} ' \
               f'{rescued:>7} '
         return out.rstrip(':')