Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/deadline: Fix bandwidth reclaim equation in GRUB

According to the GRUB[1] rule, the runtime is depleted as:
"dq = -max{u, (1 - Uinact - Uextra)} dt" (1)

To guarantee that deadline tasks don't starve lower class tasks,
we do not allocate the full bandwidth of the CPU to deadline tasks.
Maximum bandwidth usable by deadline tasks is denoted by "Umax".
Considering Umax, equation (1) becomes:
"dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt" (2)

The current implementation instead computes
"dq = -max{u / Umax, (1 - Uinact - Uextra)} dt", which does not match
equation (2). This patch fixes that.
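
Equation (2) can be checked by hand on a simple case. The derivation below is only a sketch: it assumes a single CPU-bound task with runtime Q, period P and u = Q/P, no inactive bandwidth (Uinact = 0), and all unreserved bandwidth counted as extra (Uextra = Umax - u). None of these assumptions are stated in the commit message.

```latex
\begin{align*}
\frac{dq}{dt} &= -\frac{\max\{u,\; U_{max} - 0 - (U_{max} - u)\}}{U_{max}}
               = -\frac{u}{U_{max}} \\
t_{run} &= \frac{Q}{u / U_{max}} = \frac{Q}{u}\, U_{max} = P \cdot U_{max}
\end{align*}
```

Under these assumptions a budget Q lasts for P * Umax of execution in every period P, so the task occupies exactly the fraction Umax of the CPU, whatever its own u is.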

The reclamation logic is verified by a sample program which creates
multiple deadline threads and observes their utilization. The tests
were run on an isolated CPU (isolcpus=3) on a 4-CPU system.

Tests on 6.3.0
==============

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.33
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69

RUN 3: 2 tasks
Task 1: runtime=1ms, deadline=period=10ms
Task 2: runtime=1ms, deadline=period=100ms
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.67
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.37
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.38
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.23

As seen above, the unpatched kernel does not reclaim the maximum
allowed bandwidth, and the smaller the bandwidth of a task, the less
bandwidth it reclaims.

Tests with this patch applied
=============================

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.19
TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.16

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.27
TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.21

RUN 3: 2 tasks
Task 1: runtime=1ms, deadline=period=10ms
Task 2: runtime=1ms, deadline=period=100ms
TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.64
TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.66
TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.45
TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.73

Running tasks on all CPUs with migration allowed also showed that
the utilization is reclaimed to the maximum. With 10
SCHED_FLAG_RECLAIM tasks running on 3 CPUs, top shows:
%Cpu0 : 94.6 us, 0.0 sy, 0.0 ni, 5.4 id, 0.0 wa
%Cpu1 : 95.2 us, 0.0 sy, 0.0 ni, 4.8 id, 0.0 wa
%Cpu2 : 95.8 us, 0.0 sy, 0.0 ni, 4.2 id, 0.0 wa

[1]: Abeni, Luca & Lipari, Giuseppe & Parri, Andrea & Sun, Youcheng.
(2015). Parallel and sequential reclaiming in multicore
real-time global scheduling.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230530135526.2385378-1-vineeth@bitbyteword.org

Authored by Vineeth Pillai and committed by Peter Zijlstra
6a9d623a ef73d6a4

+29 -27
+23 -27
kernel/sched/deadline.c
···
 }
 
 /*
- * This function implements the GRUB accounting rule:
- * according to the GRUB reclaiming algorithm, the runtime is
- * not decreased as "dq = -dt", but as
- * "dq = -max{u / Umax, (1 - Uinact - Uextra)} dt",
+ * This function implements the GRUB accounting rule. According to the
+ * GRUB reclaiming algorithm, the runtime is not decreased as "dq = -dt",
+ * but as "dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt",
  * where u is the utilization of the task, Umax is the maximum reclaimable
  * utilization, Uinact is the (per-runqueue) inactive utilization, computed
  * as the difference between the "total runqueue utilization" and the
- * runqueue active utilization, and Uextra is the (per runqueue) extra
+ * "runqueue active utilization", and Uextra is the (per runqueue) extra
  * reclaimable utilization.
- * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
- * multiplied by 2^BW_SHIFT, the result has to be shifted right by
- * BW_SHIFT.
- * Since rq->dl.bw_ratio contains 1 / Umax multiplied by 2^RATIO_SHIFT,
- * dl_bw is multiped by rq->dl.bw_ratio and shifted right by RATIO_SHIFT.
- * Since delta is a 64 bit variable, to have an overflow its value
- * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
- * So, overflow is not an issue here.
+ * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations multiplied
+ * by 2^BW_SHIFT, the result has to be shifted right by BW_SHIFT.
+ * Since rq->dl.bw_ratio contains 1 / Umax multiplied by 2^RATIO_SHIFT, dl_bw
+ * is multiped by rq->dl.bw_ratio and shifted right by RATIO_SHIFT.
+ * Since delta is a 64 bit variable, to have an overflow its value should be
+ * larger than 2^(64 - 20 - 8), which is more than 64 seconds. So, overflow is
+ * not an issue here.
  */
 static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 {
-	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
 	u64 u_act;
-	u64 u_act_min = (dl_se->dl_bw * rq->dl.bw_ratio) >> RATIO_SHIFT;
+	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
 
 	/*
-	 * Instead of computing max{u * bw_ratio, (1 - u_inact - u_extra)},
-	 * we compare u_inact + rq->dl.extra_bw with
-	 * 1 - (u * rq->dl.bw_ratio >> RATIO_SHIFT), because
-	 * u_inact + rq->dl.extra_bw can be larger than
-	 * 1 * (so, 1 - u_inact - rq->dl.extra_bw would be negative
-	 * leading to wrong results)
+	 * Instead of computing max{u, (u_max - u_inact - u_extra)}, we
+	 * compare u_inact + u_extra with u_max - u, because u_inact + u_extra
+	 * can be larger than u_max. So, u_max - u_inact - u_extra would be
+	 * negative leading to wrong results.
 	 */
-	if (u_inact + rq->dl.extra_bw > BW_UNIT - u_act_min)
-		u_act = u_act_min;
+	if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
+		u_act = dl_se->dl_bw;
 	else
-		u_act = BW_UNIT - u_inact - rq->dl.extra_bw;
+		u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;
 
+	u_act = (u_act * rq->dl.bw_ratio) >> RATIO_SHIFT;
 	return (delta * u_act) >> BW_SHIFT;
 }
 
···
 {
 	if (global_rt_runtime() == RUNTIME_INF) {
 		dl_rq->bw_ratio = 1 << RATIO_SHIFT;
-		dl_rq->extra_bw = 1 << BW_SHIFT;
+		dl_rq->max_bw = dl_rq->extra_bw = 1 << BW_SHIFT;
 	} else {
 		dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
 			  global_rt_period()) >> (BW_SHIFT - RATIO_SHIFT);
-		dl_rq->extra_bw = to_ratio(global_rt_period(),
-						    global_rt_runtime());
+		dl_rq->max_bw = dl_rq->extra_bw =
+			to_ratio(global_rt_period(), global_rt_runtime());
 	}
 }
 
+6
kernel/sched/sched.h
···
 	u64			extra_bw;
 
 	/*
+	 * Maximum available bandwidth for reclaiming by SCHED_FLAG_RECLAIM
+	 * tasks of this rq. Used in calculation of reclaimable bandwidth(GRUB).
+	 */
+	u64			max_bw;
+
+	/*
 	 * Inverse of the fraction of CPU utilization that can be reclaimed
 	 * by the GRUB algorithm.
 	 */