Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/fair: Use the busiest group to set prefer_sibling

The prefer_sibling setting acts on the busiest group to move excess tasks
to the local group. This should be done as per request of the child of the
busiest group's sched domain, not the local group's.

Using the flags of the child domain of the local group works fortuitously
if both groups have child domains.

There are cases, however, in which the busiest group's sched domain has a
child but the local group's does not. Consider, for instance, a non-SMT
core (or an SMT core with only one online sibling) doing load balancing with
an SMT core at the MC level. SD_PREFER_SIBLING of the busiest group's child
domain will not be honored. We are left with a fully busy SMT core and an
idle non-SMT core.

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-7-ricardo.neri-calderon@linux.intel.com

Authored by Ricardo Neri and committed by Peter Zijlstra
43726bde 5fd6d7f4

+11 -4
kernel/sched/fair.c
···

 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
···
 		sg = sg->next;
 	} while (sg != env->sd->groups);

-	/* Tag domain that child domain prefers tasks go to siblings first */
-	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+	/*
+	 * Indicate that the child domain of the busiest group prefers tasks
+	 * go to a child's sibling domains first. NB the flags of a sched group
+	 * are those of the child domain.
+	 */
+	if (sds->busiest)
+		sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);


 	if (env->sd->flags & SD_NUMA)
···
 		goto out_balanced;
 	}

-	/* Try to move all excess tasks to child's sibling domain */
+	/*
+	 * Try to move all excess tasks to a sibling domain of the busiest
+	 * group's child domain.
+	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
 	    busiest->sum_nr_running > local->sum_nr_running + 1)
 		goto force_balance;