Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
throttled periods while simultaneously not consuming their allocated
quota. This use case is typical of user-interactive, non-cpu-bound
applications, such as those running in Kubernetes or Mesos, when run
across multiple cpu cores.

This has been root-caused to cpu-local run queues being allocated per-cpu
bandwidth slices and then not fully using those slices within the period,
at which point the slice and quota expire. This expiration of unused
slices results in applications not being able to utilize the quota they
have been allocated.

The non-expiration of per-cpu slices was recently fixed by
commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition"). Prior to that, expiration appears to have been broken since
at least commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period"), which was introduced in v3.16-rc1 in 2014. That
commit added the following conditional, which resulted in slices never
being expired:

        if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
                /* extend local deadline, drift is bounded above by 2 ticks */
                cfs_rq->runtime_expires += TICK_NSEC;

Because expiration was effectively broken for nearly 5 years, and because
the recent fix is now causing throttling to be noticed by many users
running Kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion
that the mechanisms around expiring runtime should be removed altogether.

This allows quota already allocated to per-cpu run queues to live longer
than the period boundary: threads on run queues that do not use much CPU
can continue to use their remaining slice over a longer period of time
than cpu.cfs_period_us. More importantly, this prevents the above
condition of hitting throttling while not fully utilizing the cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow is bounded by the remaining
quota left on each per-cpu runqueue, which is typically no more than
min_cfs_rq_runtime=1ms per cpu. For cpu-bound tasks this changes nothing,
as they should fully utilize all of their quota in each period anyway. For
the user-interactive tasks described above, it provides a much better
user/application experience, as their cpu utilization more closely matches
the amount they requested when they previously hit throttling. This means
that cpu limits no longer strictly apply per period for non-cpu-bound
applications, but they remain accurate over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu-bound
applications with low cfs_quota_us allocation on high-core-count machines.
In the case of an artificial testcase (10ms/100ms of quota on an 80-CPU
machine), this commit resulted in an almost 30x performance improvement,
while still maintaining correct cpu quota restrictions. That testcase is
available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com

Authored by Dave Chiluk; committed by Peter Zijlstra
de53fd7a 139d025c
+66 -82 total

Documentation/scheduler/sched-bwc.rst (+59 -13)

···
 specification of the maximum CPU bandwidth available to a group or hierarchy.

 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.

-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".

 Management
···

 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
 CFS.

 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.
···
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
 is described as the "slice".

 This is tunable via procfs::
···

 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
+
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire. However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period. This
+also limits the burst ability to no more than 1ms per cpu. This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultanously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.

 Examples
 --------
kernel/sched/fair.c (+7 -65)

···
        now = sched_clock_cpu(smp_processor_id());
        cfs_b->runtime = cfs_b->quota;
-       cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-       cfs_b->expires_seq++;
 }

 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
···
 {
        struct task_group *tg = cfs_rq->tg;
        struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-       u64 amount = 0, min_amount, expires;
-       int expires_seq;
+       u64 amount = 0, min_amount;

        /* note: this is a positive sum as runtime_remaining <= 0 */
        min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
···
                        cfs_b->idle = 0;
                }
        }
-       expires_seq = cfs_b->expires_seq;
-       expires = cfs_b->runtime_expires;
        raw_spin_unlock(&cfs_b->lock);

        cfs_rq->runtime_remaining += amount;
-       /*
-        * we may have advanced our local expiration to account for allowed
-        * spread between our sched_clock and the one on which runtime was
-        * issued.
-        */
-       if (cfs_rq->expires_seq != expires_seq) {
-               cfs_rq->expires_seq = expires_seq;
-               cfs_rq->runtime_expires = expires;
-       }

        return cfs_rq->runtime_remaining > 0;
-}
-
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-       /* if the deadline is ahead of our clock, nothing to do */
-       if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-               return;
-
-       if (cfs_rq->runtime_remaining < 0)
-               return;
-
-       /*
-        * If the local deadline has passed we have to consider the
-        * possibility that our sched_clock is 'fast' and the global deadline
-        * has not truly expired.
-        *
-        * Fortunately we can check determine whether this the case by checking
-        * whether the global deadline(cfs_b->expires_seq) has advanced.
-        */
-       if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-               /* extend local deadline, drift is bounded above by 2 ticks */
-               cfs_rq->runtime_expires += TICK_NSEC;
-       } else {
-               /* global deadline is ahead, expiration has passed */
-               cfs_rq->runtime_remaining = 0;
-       }
 }

 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
        /* dock delta_exec before expiring quota (as it could span periods) */
        cfs_rq->runtime_remaining -= delta_exec;
-       expire_cfs_rq_runtime(cfs_rq);

        if (likely(cfs_rq->runtime_remaining > 0))
                return;
···
                resched_curr(rq);
 }

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-               u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
        struct cfs_rq *cfs_rq;
        u64 runtime;
···
                remaining -= runtime;

                cfs_rq->runtime_remaining += runtime;
-               cfs_rq->runtime_expires = expires;

                /* we check whether we're throttled above */
                if (cfs_rq->runtime_remaining > 0)
···
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-       u64 runtime, runtime_expires;
+       u64 runtime;
        int throttled;

        /* no need to continue the timer with no bandwidth constraint */
···
        /* account preceding periods in which throttling occurred */
        cfs_b->nr_throttled += overrun;

-       runtime_expires = cfs_b->runtime_expires;
-
        /*
         * This check is repeated as we are holding onto the new bandwidth while
         * we unthrottle. This can potentially race with an unthrottled group
···
                cfs_b->distribute_running = 1;
                raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
                /* we can't nest cfs_b->lock while distributing bandwidth */
-               runtime = distribute_cfs_runtime(cfs_b, runtime,
-                                                runtime_expires);
+               runtime = distribute_cfs_runtime(cfs_b, runtime);
                raw_spin_lock_irqsave(&cfs_b->lock, flags);

                cfs_b->distribute_running = 0;
···
                return;

        raw_spin_lock(&cfs_b->lock);
-       if (cfs_b->quota != RUNTIME_INF &&
-           cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+       if (cfs_b->quota != RUNTIME_INF) {
                cfs_b->runtime += slack_runtime;

                /* we are under rq->lock, defer unthrottling using a timer */
···
 {
        u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
        unsigned long flags;
-       u64 expires;

        /* confirm we're still not at a refresh boundary */
        raw_spin_lock_irqsave(&cfs_b->lock, flags);
···
        if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
                runtime = cfs_b->runtime;

-       expires = cfs_b->runtime_expires;
        if (runtime)
                cfs_b->distribute_running = 1;
···
        if (!runtime)
                return;

-       runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+       runtime = distribute_cfs_runtime(cfs_b, runtime);

        raw_spin_lock_irqsave(&cfs_b->lock, flags);
-       if (expires == cfs_b->runtime_expires)
-               lsub_positive(&cfs_b->runtime, runtime);
+       lsub_positive(&cfs_b->runtime, runtime);
        cfs_b->distribute_running = 0;
        raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
···
        cfs_b->period_active = 1;
        overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-       cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-       cfs_b->expires_seq++;
        hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
kernel/sched/sched.h (-4)

···
        u64                     quota;
        u64                     runtime;
        s64                     hierarchical_quota;
-       u64                     runtime_expires;
-       int                     expires_seq;

        u8                      idle;
        u8                      period_active;
···
 #ifdef CONFIG_CFS_BANDWIDTH
        int                     runtime_enabled;
-       int                     expires_seq;
-       u64                     runtime_expires;
        s64                     runtime_remaining;

        u64                     throttled_clock;