
sched/uclamp: Add a new sysctl to control RT default boost value

RT tasks by default run at the highest capacity/performance level. When
uclamp is selected, this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE, the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, for a mobile device manufacturer where the big.LITTLE
architecture is dominant, the performance of the little cores varies
across SoCs, and on high-end ones the big cores could be too power
hungry.

Given the diversity of SoCs, the new knob allows manufacturers to tune
the best performance/power trade-off for RT tasks on the particular
hardware they run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android, for instance, I can see over 50 RT tasks, only
a handful of which are created by the Android framework.

To let system admins and device integrators control this default
behavior globally, introduce the new sysctl_sched_uclamp_util_min_rt_default
knob to change the default boost value of RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.
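As an illustration, such an init-script fragment could look like the
sketch below. The chosen value 128 (out of SCHED_CAPACITY_SCALE == 1024)
is an arbitrary example, not a recommendation:

```shell
# Hypothetical init-script fragment: lower the default RT boost to 128.
# The sysctl file name comes from the kernel/sysctl.c hunk in this patch;
# the value 128 is an arbitrary example.
echo 128 > /proc/sys/kernel/sched_util_clamp_min_rt_default

# Equivalently, via sysctl(8):
sysctl -w kernel.sched_util_clamp_min_rt_default=128
```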

To avoid polluting the fast path with unnecessary code, the approach
taken is to do the update synchronously by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing a new sched_post_fork() function that ensures
the racy fork gets the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com

Authored by Qais Yousef, committed by Peter Zijlstra
13685c4a e65855a5

5 files changed, +131 -8

include/linux/sched.h (+8 -2)

···
 	struct sched_dl_entity		dl;

 #ifdef CONFIG_UCLAMP_TASK
-	/* Clamp values requested for a scheduling entity */
+	/*
+	 * Clamp values requested for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se		uclamp_req[UCLAMP_CNT];
-	/* Effective clamp values used for a scheduling entity */
+	/*
+	 * Effective clamp values used for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se		uclamp[UCLAMP_CNT];
 #endif
include/linux/sched/sysctl.h (+1)

···
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 #endif

 #ifdef CONFIG_CFS_BANDWIDTH
include/linux/sched/task.h (+1)

···
 extern void init_idle(struct task_struct *idle, int cpu);

 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);

 void __noreturn do_task_dead(void);
kernel/fork.c (+1)

···
 	write_unlock_irq(&tasklist_lock);

 	proc_fork_connector(p);
+	sched_post_fork(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
kernel/sched/core.c (+113 -6)

···
 /* Max allowed maximum utilization */
 unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;

+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects RT tasks that their uclamp_se->user_defined == false.
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ */
+unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
+
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
···
 	/* No tasks -- default clamp values */
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
+}
+
+static void __uclamp_update_util_min_rt_default(struct task_struct *p)
+{
+	unsigned int default_util_min;
+	struct uclamp_se *uc_se;
+
+	lockdep_assert_held(&p->pi_lock);
+
+	uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+	/* Only sync if user didn't override the default */
+	if (uc_se->user_defined)
+		return;
+
+	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
+	uclamp_se_set(uc_se, default_util_min, false);
+}
+
+static void uclamp_update_util_min_rt_default(struct task_struct *p)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (!rt_task(p))
+		return;
+
+	/* Protect updates to p->uclamp_* */
+	rq = task_rq_lock(p, &rf);
+	__uclamp_update_util_min_rt_default(p);
+	task_rq_unlock(rq, p, &rf);
+}
+
+static void uclamp_sync_util_min_rt_default(void)
+{
+	struct task_struct *g, *p;
+
+	/*
+	 * copy_process()			sysctl_uclamp
+	 *					  uclamp_min_rt = X;
+	 *   write_lock(&tasklist_lock)		  read_lock(&tasklist_lock)
+	 *   // link thread			  smp_mb__after_spinlock()
+	 *   write_unlock(&tasklist_lock)	  read_unlock(&tasklist_lock);
+	 *   sched_post_fork()			  for_each_process_thread()
+	 *     __uclamp_sync_rt()		    __uclamp_sync_rt()
+	 *
+	 * Ensures that either sched_post_fork() will observe the new
+	 * uclamp_min_rt or for_each_process_thread() will observe the new
+	 * task.
+	 */
+	read_lock(&tasklist_lock);
+	smp_mb__after_spinlock();
+	read_unlock(&tasklist_lock);
+
+	rcu_read_lock();
+	for_each_process_thread(g, p)
+		uclamp_update_util_min_rt_default(p);
+	rcu_read_unlock();
 }

 static inline struct uclamp_se
···
 		void *buffer, size_t *lenp, loff_t *ppos)
 {
 	bool update_root_tg = false;
-	int old_min, old_max;
+	int old_min, old_max, old_min_rt;
 	int result;

 	mutex_lock(&uclamp_mutex);
 	old_min = sysctl_sched_uclamp_util_min;
 	old_max = sysctl_sched_uclamp_util_max;
+	old_min_rt = sysctl_sched_uclamp_util_min_rt_default;

 	result = proc_dointvec(table, write, buffer, lenp, ppos);
 	if (result)
···
 		goto done;

 	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
-	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE ||
+	    sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
+
 		result = -EINVAL;
 		goto undo;
 	}
···
 		uclamp_update_root_tg();
 	}

+	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default) {
+		static_branch_enable(&sched_uclamp_used);
+		uclamp_sync_util_min_rt_default();
+	}
+
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
 	 * Otherwise, keep it simple and do just a lazy update at each next
···
 undo:
 	sysctl_sched_uclamp_util_min = old_min;
 	sysctl_sched_uclamp_util_max = old_max;
+	sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
 done:
 	mutex_unlock(&uclamp_mutex);
···
 	 */
 	for_each_clamp_id(clamp_id) {
 		struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
-		unsigned int clamp_value = uclamp_none(clamp_id);

 		/* Keep using defined clamps across class changes */
 		if (uc_se->user_defined)
 			continue;

-		/* By default, RT tasks always get 100% boost */
+		/*
+		 * RT by default have a 100% boost value that could be modified
+		 * at runtime.
+		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
+			__uclamp_update_util_min_rt_default(p);
+		else
+			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);

-		uclamp_se_set(uc_se, clamp_value, false);
 	}

 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
···
 {
 	enum uclamp_id clamp_id;

+	/*
+	 * We don't need to hold task_rq_lock() when updating p->uclamp_* here
+	 * as the task is still at its early fork stages.
+	 */
 	for_each_clamp_id(clamp_id)
 		p->uclamp[clamp_id].active = false;
···
 		uclamp_se_set(&p->uclamp_req[clamp_id],
 			      uclamp_none(clamp_id), false);
 	}
+}
+
+static void uclamp_post_fork(struct task_struct *p)
+{
+	uclamp_update_util_min_rt_default(p);
 }

 static void __init init_uclamp_rq(struct rq *rq)
···
 static void __setscheduler_uclamp(struct task_struct *p,
 				  const struct sched_attr *attr) { }
 static inline void uclamp_fork(struct task_struct *p) { }
+static inline void uclamp_post_fork(struct task_struct *p) { }
 static inline void init_uclamp(void) { }
 #endif /* CONFIG_UCLAMP_TASK */
···
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 #endif
 	return 0;
+}
+
+void sched_post_fork(struct task_struct *p)
+{
+	uclamp_post_fork(p);
 }

 unsigned long to_ratio(u64 period, u64 runtime)
···
 	kattr.sched_nice = task_nice(p);

 #ifdef CONFIG_UCLAMP_TASK
+	/*
+	 * This could race with another potential updater, but this is fine
+	 * because it'll correctly read the old or the new value. We don't need
+	 * to guarantee who wins the race as long as it doesn't return garbage.
+	 */
 	kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
 	kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
 #endif
kernel/sysctl.c (+7)

···
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_util_clamp_min_rt_default",
+		.data		= &sysctl_sched_uclamp_util_min_rt_default,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_handler,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{