Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Revert "sched/core: Reduce cost of sched_move_task when config autogroup"

This reverts commit eff6c8ce8d4d7faef75f66614dd20bb50595d261.

Hazem reported a 30% drop in the UnixBench spawn test with commit
eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
autogroup") on an m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
(aarch64) (single-level MC sched domain):

https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com

There is an early bail from sched_move_task() if p->sched_task_group is
equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
(Ubuntu '22.04.5 LTS').

So in:

do_exit()

  sched_autogroup_exit_task()

    sched_move_task()

      if sched_get_task_group(p) == p->sched_task_group
        return

      /* p is enqueued */
      dequeue_task()              \
      sched_change_group()        |
        task_change_group_fair()  |
          detach_task_cfs_rq()    | (1)
          set_task_rq()           |
          attach_task_cfs_rq()    |
      enqueue_task()              /

(1) isn't called for p anymore.

Turns out that the regression is related to sgs->group_util in
group_is_overloaded() and group_has_capacity(). If (1) isn't called for
all the 'spawn' tasks then sgs->group_util is ~900 and
sgs->group_capacity = 1024 (single CPU sched domain) and this leads to
group_is_overloaded() returning true (2) and group_has_capacity() false
(3) much more often compared to the case when (1) is called.

I.e. there are many more 'group_is_overloaded' and 'group_fully_busy'
cases in the WF_FORK wakeup path sched_balance_find_dst_cpu(), which then
much more often returns a CPU != smp_processor_id() (5).

This isn't good for these extremely short-running tasks (FORK + EXIT)
and also involves calling sched_balance_find_dst_group_cpu() unnecessarily
(single CPU sched domain).

If instead (1) is called for p ('p->flags & PF_EXITING'), the path
(4),(6) is taken much more often.

select_task_rq_fair(..., wake_flags = WF_FORK)

  cpu = smp_processor_id()

  new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)

    group = sched_balance_find_dst_group(..., cpu)

      do {
        update_sg_wakeup_stats()

          sgs->group_type = group_classify()

            if group_is_overloaded()                     (2)
              return group_overloaded

            if !group_has_capacity()                     (3)
              return group_fully_busy

            return group_has_spare                       (4)
      } while group

      if local_sgs.group_type > idlest_sgs.group_type
        return idlest                                    (5)

      case group_has_spare:

        if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
          return NULL                                    (6)

UnixBench tests './Run -c 4 spawn' on:

(a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4')
and Ubuntu 22.04.5 LTS (aarch64).

Shell & test run in '/user.slice/user-1000.slice/session-1.scope'.

  w/o patch   w/ patch
      21005      27120

(b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and
Ubuntu 22.04.5 LTS (x86_64).

Shell & test run in '/A'.

  w/o patch   w/ patch
      67675      88806

CONFIG_SCHED_AUTOGROUP=y with /proc/sys/kernel/sched_autogroup_enabled
set to either 0 or 1.

Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Hagar Hemdan <hagarhem@amazon.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250314151345.275739-1-dietmar.eggemann@arm.com

Authored by Dietmar Eggemann, committed by Ingo Molnar
76f970ce f3fa0e40

+3 -18
kernel/sched/core.c
···
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }

-static struct task_group *sched_get_task_group(struct task_struct *tsk)
+static void sched_change_group(struct task_struct *tsk)
 {
 	struct task_group *tg;
···
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
-
-	return tg;
-}
-
-static void sched_change_group(struct task_struct *tsk, struct task_group *group)
-{
-	tsk->sched_task_group = group;
+	tsk->sched_task_group = tg;

 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_change_group)
···
 {
 	int queued, running, queue_flags =
 		DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
-	struct task_group *group;
 	struct rq *rq;

 	CLASS(task_rq_lock, rq_guard)(tsk);
 	rq = rq_guard.rq;
-
-	/*
-	 * Esp. with SCHED_AUTOGROUP enabled it is possible to get superfluous
-	 * group changes.
-	 */
-	group = sched_get_task_group(tsk);
-	if (group == tsk->sched_task_group)
-		return;

 	update_rq_clock(rq);
···
 	if (running)
 		put_prev_task(rq, tsk);

-	sched_change_group(tsk, group);
+	sched_change_group(tsk);
 	if (!for_autogroup)
 		scx_cgroup_move_task(tsk);