Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/topology: Assert non-NUMA topology masks don't (partially) overlap

topology.c::get_group() relies on the assumption that non-NUMA domains do
not partially overlap. Zeng Tao pointed out in [1] that such topology
descriptions, while completely bogus, can end up being exposed to the
scheduler.

In his example (8 CPUs, 2-node system), we end up with:
MC span for CPU3 == 3-7
MC span for CPU4 == 4-7

The first pass through get_group(3, sdd@MC) will result in the following
sched_group list:

3 -> 4 -> 5 -> 6 -> 7
^                  /
 `----------------'

And a later pass through get_group(4, sdd@MC) will "corrupt" that to:

3 -> 4 -> 5 -> 6 -> 7
     ^             /
      `-----------'

which will completely break things like 'while (sg != sd->groups)' when
using CPU3's base sched_domain.
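
The corruption above can be reproduced in userspace with a minimal sketch.
This is not the kernel code: next_group[], link_span() and ring_returns_to()
are hypothetical names, and each sched_group is reduced to an index holding
its ->next, assuming one group per CPU at the MC level.

```c
#include <stdbool.h>

#define NR_CPUS 8

/* next_group[i] stands in for the ->next pointer of CPU i's sched_group. */
static int next_group[NR_CPUS];

/*
 * Link the groups of CPUs first..last into a ring, overwriting any
 * earlier links, much like a pass over one span would.
 */
static void link_span(int first, int last)
{
	for (int cpu = first; cpu < last; cpu++)
		next_group[cpu] = cpu + 1;
	next_group[last] = first;
}

/*
 * Walk the ring from 'start', giving up after NR_CPUS steps.
 * Returns true if the walk gets back to 'start', i.e. if a loop such as
 * 'while (sg != sd->groups)' would terminate.
 */
static bool ring_returns_to(int start)
{
	int sg = next_group[start];

	for (int steps = 0; steps < NR_CPUS; steps++) {
		if (sg == start)
			return true;
		sg = next_group[sg];
	}
	return false;
}
```

Calling link_span(3, 7) and then link_span(4, 7) leaves group 7 pointing at
group 4, so a walk starting from group 3 never gets back to its starting
point, while a walk starting from group 4 still does.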

There already are some architecture-specific checks in place such as
x86/kernel/smpboot.c::topology_sane(), but this is something we can detect
in the core scheduler, so it seems worthwhile to do so.

Warn and abort the construction of the sched domains if such a broken
topology description is detected. Note that this is somewhat
expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
gated under SCHED_DEBUG if deemed necessary.
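
As a rough userspace illustration of the check (a sketch only, not the
kernel implementation), the spans can be modelled as 8-bit bitmasks. The
span_t type and both function names are assumptions made for this example
and stand in for the cpumask helpers. The nested loop shows where the c²
factor for a single topology level comes from.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8

/* Hypothetical stand-in for a topology mask: bit i set means CPU i is in the span. */
typedef uint8_t span_t;

/* Two spans partially overlap if they share CPUs without being identical. */
static bool spans_partially_overlap(span_t a, span_t b)
{
	return a != b && (a & b) != 0;
}

/*
 * Return true if, for every pair of CPUs, the spans at this level are
 * either equal or disjoint. Every CPU is compared against every other
 * CPU, hence the quadratic cost per level.
 */
static bool level_spans_sane(const span_t span[NR_CPUS])
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		for (int i = 0; i < NR_CPUS; i++) {
			if (i == cpu)
				continue;
			if (spans_partially_overlap(span[cpu], span[i]))
				return false;
		}
	}
	return true;
}
```

With CPU3 spanning 3-7 (0xF8) and CPU4 spanning 4-7 (0xF0), the two masks
intersect without being equal, so level_spans_sane() returns false.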

Testing
=======

Dietmar managed to reproduce this using the following qemu incantation:

$ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
        -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
        cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
        node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1

alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
needed if '-smp cores=X, sockets=Y' would work with qemu):

8<---
@@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;

+		if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
+			continue;
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

8<---

[1]: https://lkml.kernel.org/r/1577088979-8545-1-git-send-email-prime.zeng@hisilicon.com

Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com

Authored by Valentin Schneider and committed by Peter Zijlstra
ccf74128 3e0de271

kernel/sched/topology.c: +39
···
 }

 /*
+ * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
+ * any two given CPUs at this (non-NUMA) topology level.
+ */
+static bool topology_span_sane(struct sched_domain_topology_level *tl,
+			      const struct cpumask *cpu_map, int cpu)
+{
+	int i;
+
+	/* NUMA levels are allowed to overlap */
+	if (tl->flags & SDTL_OVERLAP)
+		return true;
+
+	/*
+	 * Non-NUMA levels cannot partially overlap - they must be either
+	 * completely equal or completely disjoint. Otherwise we can end up
+	 * breaking the sched_group lists - i.e. a later get_group() pass
+	 * breaks the linking done for an earlier span.
+	 */
+	for_each_cpu(i, cpu_map) {
+		if (i == cpu)
+			continue;
+		/*
+		 * We should 'and' all those masks with 'cpu_map' to exactly
+		 * match the topology we're about to build, but that can only
+		 * remove CPUs, which only lessens our ability to detect
+		 * overlaps
+		 */
+		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
+		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
+			return false;
+	}
+
+	return true;
+}
+
+/*
  * Find the sched_domain_topology_level where all CPU capacities are visible
  * for all CPUs.
  */
···
 				dflags |= SD_ASYM_CPUCAPACITY;
 				has_asym = true;
 			}
+
+			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
+				goto error;

 			sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
