Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"

This reverts commit 7347fc87dfe6b7315e74310ee1243dc222c68086.

Srikar Dronamraju pointed out that while the commit in question did show
a performance improvement on ppc64, it did so at the cost of disabling
active CPU migration by automatic NUMA balancing, which was not the intent.
The issue was a serious flaw in the logic: active balancing never happened
if SD_WAKE_AFFINE was disabled on the scheduler domains. Even when it's
enabled, the logic is still bizarre and works against the original intent.
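
For context, this is the check being reverted, as it appears in the second
fair.c hunk below; the comment here is an annotation explaining the misfire
rather than part of the original patch:

        numa_migrate_retry = jiffies + interval;

        /*
         * In practice the freshly computed "jiffies + interval" is larger
         * than any stale p->numa_migrate_retry, so this returns early on
         * almost every call and the task is never actively migrated. When
         * wake_affine() has just pushed p->numa_migrate_retry into the
         * future, the check does not fire and placement proceeds at once,
         * defeating the intended backoff.
         */
        if (numa_migrate_retry > p->numa_migrate_retry)
                return;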

Investigation showed that fixing the patch in the way he suggested, using
the correct comparison for jiffies values, or introducing a new
numa_migrate_deferred variable in task_struct all performed similarly to a
revert, with a mix of gains and losses depending on the workload, machine
and socket count.
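
As an illustration only, and not the exact patch that was benchmarked, the
"correct comparison" option amounts to keeping the backoff check in
numa_migrate_preferred() but expressing it with the wraparound-safe
time_after() helper:

        /*
         * Back off only while wake_affine() has deferred placement to a
         * point later than the normal retry time.
         */
        if (time_after(p->numa_migrate_retry, jiffies + interval))
                return;

        p->numa_migrate_retry = jiffies + interval;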

The original intent of the commit was to handle a problem whereby
wake_affine, idle balancing and automatic NUMA balancing disagree on the
appropriate placement for a task. This was particularly true for cases where
a single task was a massive waker of tasks but where the wake_wide logic did
not apply. It was most noticeable when a futex (a barrier) woke all worker
threads and the wakees were then pulled towards the waker's node. That
specific case could be handled by tuning MPI or OpenMP appropriately, but
the behavior is not illogical and was worth attempting to fix. However,
the approach was wrong. Given that we're at rc4 and a fix is not obvious,
it's better to play it safe, revert this commit and retry later.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: ggherdovich@suse.cz
Cc: hpa@zytor.com
Cc: matt@codeblueprint.co.uk
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

 kernel/sched/fair.c | +1 -56
 1 file changed, 1 insertion(+), 56 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
···
 static void numa_migrate_preferred(struct task_struct *p)
 {
        unsigned long interval = HZ;
-       unsigned long numa_migrate_retry;
 
        /* This task has no NUMA fault statistics yet */
        if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
···
 
        /* Periodically retry migrating the task to the preferred node */
        interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
-       numa_migrate_retry = jiffies + interval;
-
-       /*
-        * Check that the new retry threshold is after the current one. If
-        * the retry is in the future, it implies that wake_affine has
-        * temporarily asked NUMA balancing to backoff from placement.
-        */
-       if (numa_migrate_retry > p->numa_migrate_retry)
-               return;
-
-       /* Safe to try placing the task on the preferred node */
-       p->numa_migrate_retry = numa_migrate_retry;
+       p->numa_migrate_retry = jiffies + interval;
 
        /* Success if task is already running on preferred CPU */
        if (task_node(p) == p->numa_preferred_nid)
···
        return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-static void
-update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
-{
-       unsigned long interval;
-
-       if (!static_branch_likely(&sched_numa_balancing))
-               return;
-
-       /* If balancing has no preference then continue gathering data */
-       if (p->numa_preferred_nid == -1)
-               return;
-
-       /*
-        * If the wakeup is not affecting locality then it is neutral from
-        * the perspective of NUMA balacing so continue gathering data.
-        */
-       if (cpu_to_node(prev_cpu) == cpu_to_node(target))
-               return;
-
-       /*
-        * Temporarily prevent NUMA balancing trying to place waker/wakee after
-        * wakee has been moved by wake_affine. This will potentially allow
-        * related tasks to converge and update their data placement. The
-        * 4 * numa_scan_period is to allow the two-pass filter to migrate
-        * hot data to the wakers node.
-        */
-       interval = max(sysctl_numa_balancing_scan_delay,
-                       p->numa_scan_period << 2);
-       p->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
-
-       interval = max(sysctl_numa_balancing_scan_delay,
-                       current->numa_scan_period << 2);
-       current->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
-}
-#else
-static void
-update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
-{
-}
-#endif
-
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                        int this_cpu, int prev_cpu, int sync)
 {
···
        if (target == nr_cpumask_bits)
                return prev_cpu;
 
-       update_wa_numa_placement(p, prev_cpu, target);
        schedstat_inc(sd->ttwu_move_affine);
        schedstat_inc(p->se.statistics.nr_wakeups_affine);
        return target;