commits

This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382.

The reverted commit introduced per-memcg/task NUMA balance statistics, but
it also introduced a NULL pointer dereference due to the following race
condition: after a swap task candidate was chosen, its mm_struct pointer
was set to NULL because the task exited. Later, when the actual task swap
was performed, dereferencing p->mm caused the crash.

CPU0                                    CPU1
:
...
task_numa_migrate
  task_numa_find_cpu
    task_numa_compare
      # a normal task p is chosen
      env->best_task = p

                                        # p exit:
                                        exit_signals(p);
                                          p->flags |= PF_EXITING
                                        exit_mm
                                          p->mm = NULL;

migrate_swap_stop
  __migrate_swap_task(arg->src_task, arg->dst_cpu)
    count_memcg_event_mm(p->mm, NUMA_TASK_SWAP) # p->mm is NULL

task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening. After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations. Revert
the change and rely on the tracepoint for this purpose.
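
The guarded variant that the discussion decided against can be sketched in user space as follows. This is only a model under stated assumptions, not kernel code: `struct model_task`, `struct model_mm`, `MODEL_PF_EXITING`, `model_task_exit()` and `model_count_swap_event()` are hypothetical names, a pthread mutex stands in for task_lock(), and the exit path is simplified so that the flag and the mm pointer change under the same lock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects named in the text. */
#define MODEL_PF_EXITING 0x4u

struct model_mm {
	unsigned long numa_task_swap;	/* models the NUMA_TASK_SWAP count */
};

struct model_task {
	pthread_mutex_t lock;		/* plays the role of task_lock() */
	unsigned int flags;
	struct model_mm *mm;		/* becomes NULL once the task exits */
};

/* Simplified exit path: mark the task as exiting, then drop its mm. */
static void model_task_exit(struct model_task *p)
{
	pthread_mutex_lock(&p->lock);
	p->flags |= MODEL_PF_EXITING;
	p->mm = NULL;
	pthread_mutex_unlock(&p->lock);
}

/*
 * Count a swap event only while the task still owns an mm.
 * Returns true if the event was counted, false if it was skipped.
 */
static bool model_count_swap_event(struct model_task *p)
{
	bool counted = false;

	pthread_mutex_lock(&p->lock);
	if (!(p->flags & MODEL_PF_EXITING) && p->mm != NULL) {
		p->mm->numa_task_swap++;
		counted = true;
	}
	pthread_mutex_unlock(&p->lock);

	return counted;
}
```

The sketch shows the cost that the revert avoids: every statistics update would have to take a lock on the swap path just to keep a counter correct.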

Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

6mo ago

Linus Torvalds

d7b8f8e2

Linux 6.16-rc5 v6.16-rc5

6mo ago

Mikhail Paulyshka

5b937a1e

x86/rdrand: Disable RDSEED on AMD Cyan Skillfish

6mo ago

Linus Torvalds

939f15e6

Merge tag 'turbostat-2025.06.08' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux

7mo ago

Linus Torvalds

4412b8b2

Merge tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs

6mo ago

Gao Xiang

b44686c8

erofs: fix large fragment handling

6mo ago

Baolin Wang

82241a83

mm: fix the inaccurate memory statistics issue for users

On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the top command always displays 0 in the 'RES' field
for some processes, which causes a lot of confusion for users.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
      1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd

The main reason is that the batch size of the percpu counter is quite
large on these machines, so a significant value stays cached per CPU. This
has been the case since mm's rss stats were converted into percpu_counter
by commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter").
Intuitively, the batch number should be optimized, but on some paths,
performance may take precedence over statistical accuracy. Therefore,
introduce a new interface that adds in the percpu statistical count and
displays the result to users, which removes the confusion. In addition,
this change is not expected to be on a performance-critical path, so the
modification should be acceptable.
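
The effect of batch caching on a cheap read can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel's percpu_counter: `struct model_counter`, `MODEL_NR_CPUS` and the `model_counter_*()` helpers are hypothetical names, and locking is left out entirely.

```c
#include <stdint.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4

/* Hypothetical model of a batched per-CPU counter. */
struct model_counter {
	int64_t count;			/* shared value, cheap to read */
	int64_t pcpu[MODEL_NR_CPUS];	/* per-CPU deltas not yet folded in */
	int64_t batch;			/* fold threshold */
};

/* Add on one CPU; fold into the shared value once |delta| >= batch. */
static void model_counter_add(struct model_counter *c, int cpu, int64_t amount)
{
	int64_t delta = c->pcpu[cpu] + amount;

	if (llabs((long long)delta) >= c->batch) {
		c->count += delta;
		c->pcpu[cpu] = 0;
	} else {
		c->pcpu[cpu] = delta;
	}
}

/* Fast, approximate read: ignores whatever is still cached per CPU. */
static int64_t model_counter_read(const struct model_counter *c)
{
	return c->count;
}

/* Slow, precise read: shared value plus every per-CPU delta. */
static int64_t model_counter_sum(const struct model_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += c->pcpu[cpu];
	return sum;
}
```

With a batch of 64, adding 10 on each of the four modelled CPUs leaves `model_counter_read()` at 0 while `model_counter_sum()` returns 40. A larger batch and more CPUs widen that gap, which matches the zero 'RES' values described above.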

In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch(). In percpu_counter_add_batch(), there is
percpu batch caching to avoid 'fbc->lock' contention. This patch changes
task_mem() and task_statm() to get the accurate mm counters under the
'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock
contention, because the percpu batch caching keeps most mm counter updates
off the lock. The following test also confirms the theoretical analysis.

I ran stress-ng, stressing anon page faults in 32 threads on my 32-core
machine, while simultaneously running a script that starts 32 threads to
busy-loop pread each stress-ng thread's /proc/pid/status interface. From
the following data, I did not observe any obvious impact of this patch on
the stress-ng tests.

w/o patch:
stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles 67.327 B/sec
stress-ng: info: [6848] 1,616,524,844,832 Instructions 24.740 B/sec (0.367 instr. per cycle)
stress-ng: info: [6848] 39,529,792 Page Faults Total 0.605 M/sec
stress-ng: info: [6848] 39,529,792 Page Faults Minor 0.605 M/sec

w/patch:
stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles 68.382 B/sec
stress-ng: info: [2485] 1,615,101,503,296 Instructions 24.750 B/sec (0.362 instr. per cycle)
stress-ng: info: [2485] 39,439,232 Page Faults Total 0.604 M/sec
stress-ng: info: [2485] 39,439,232 Page Faults Minor 0.604 M/sec

On comparing a very simple app which just allocates & touches some
memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus
tree (4c06e63b9203) I can see that on latest Linus tree the values for
VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while
they do report values on v6.1 and a Linus tree with this patch.

Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

6mo ago

Linus Torvalds

bab5cac6

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

6mo ago

Linus Torvalds

be54f8c5

Merge tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7mo ago

Len Brown

42fd37dc

tools/power turbostat: version 2025.06.08

7mo ago

Linus Torvalds

2632d81f

Merge tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd

6mo ago

Kent Overstreet

fec5e6f9

bcachefs: Don't set BCH_FS_error on transaction restart

6mo ago

Chao Yu

d31fbdc4

erofs: allow readdir() to be interrupted

6mo ago

Honggyu Kim

bd225b95

mm/damon: fix divide by zero in damon_get_intervals_score()

6mo ago

Linus Torvalds

772b78c2

Merge tag 'sched_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

6mo ago

Al Viro

b969f961

fix proc_sys_compare() handling of in-lookup dentries

6mo ago

Linus Torvalds

0529ef8c

Merge tag 'x86-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7mo ago

Ingo Molnar

41cb0855

treewide, timers: Rename from_timer() to timer_container_of()

7mo ago

Zhang Rui

d8c0f5d9

tools/power turbostat: Add initial support for BartlettLake

7mo ago

Linus Torvalds

379f604c

Merge tag 'pci-v6.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

6mo ago

Namjae Jeon

50f930db

ksmbd: fix potential use-after-free in oplock/lease break ack

6mo ago

Kent Overstreet

74f3931a

bcachefs: Fix additional misalignment in journal space calculations

6mo ago

Gao Xiang

27917e81

erofs: address D-cache aliasing

6mo ago

Honggyu Kim

ddba1b6c

samples/damon: fix damon sample mtier for start failure

6mo ago

Linus Torvalds

95eb0d38

Merge tag 'objtool_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

6mo ago

kuyo chang

fc975cfb

sched/deadline: Fix dl_server runtime calculation formula

6mo ago

Linus Torvalds

d0b3b7b2

Linux 6.16-rc4 v6.16-rc4

6mo ago

Linus Torvalds

4710eacf

Merge tag 'timers-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7mo ago

Zeng Heng

dd2922dc

fs/resctrl: Restore the rdt_last_cmd_clear() calls after acquiring rdtgroup_mutex

7mo ago

Linus Torvalds

8630c59e

Merge tag 'kbuild-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

7mo ago

Zhang Rui

83075bd5

tools/power turbostat: Add initial support for DMR

7mo ago

Linus Torvalds

3c2fe279

Merge tag 'drm-fixes-2025-07-12' of https://gitlab.freedesktop.org/drm/kernel

6mo ago

Marc Zyngier

ba74278c

Revert "PCI: ecam: Allow cfg->priv to be pre-populated from the root port device"

6mo ago

Al Viro

277627b4

ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked()

6mo ago

Kent Overstreet

7de3c8b4

bcachefs: Don't schedule non persistent passes persistently

6mo ago

Gao Xiang

f5443d0d

erofs: use memcpy_to_folio() to replace copy_to_iter()

6mo ago

Honggyu Kim

f1221c84

samples/damon: fix damon sample wsse for start failure

6mo ago

Linus Torvalds

a1639ce5

Merge tag 'perf_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

6mo ago

Heiko Carstens

ccdd09e0

objtool: Add missing endian conversion to read_annotate()

6mo ago

Peter Zijlstra

009836b4

sched/core: Fix migrate_swap() vs. hotplug

On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:

> So, the potential race scenario is:
>
>   CPU0                                      CPU1
>   // doing migrate_swap(cpu0/cpu1)
>   stop_two_cpus()
>                                             ...
>                                             // doing _cpu_down()
>                                             sched_cpu_deactivate()
>                                               set_cpu_active(cpu, false);
>                                               balance_push_set(cpu, true);
>   cpu_stop_queue_two_works
>     __cpu_stop_queue_work(stopper1,...);
>     __cpu_stop_queue_work(stopper2,..);
>   stop_cpus_in_progress -> true
>   preempt_enable();
>                                             ...
>                                             1st balance_push
>                                             stop_one_cpu_nowait
>                                             cpu_stop_queue_work
>                                             __cpu_stop_queue_work
>                                             list_add_tail -> 1st add push_work
>                                             wake_up_q(&wakeq); -> "wakeq is empty.
>                                               This implies that the stopper is at wakeq@migrate_swap."
>   preempt_disable
>   wake_up_q(&wakeq);
>     wake_up_process // wakeup migrate/0
>       try_to_wake_up
>         ttwu_queue
>           ttwu_queue_cond ->meet below case
>             if (cpu == smp_processor_id())
>               return false;
>           ttwu_do_activate
>           //migrate/0 wakeup done
>     wake_up_process // wakeup migrate/1
>       try_to_wake_up
>         ttwu_queue
>           ttwu_queue_cond
>           ttwu_queue_wakelist
>             __ttwu_queue_wakelist
>               __smp_call_single_queue
>   preempt_enable();
>
>                                             2nd balance_push
>                                             stop_one_cpu_nowait
>                                             cpu_stop_queue_work
>                                             __cpu_stop_queue_work
>                                             list_add_tail -> 2nd add push_work, so the double list add is detected
>                                             ...
>                                             ...
>                                             cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
>

So this balance_push() is part of schedule(), and schedule() is supposed
to switch to the stopper task, but because of this race condition, the
stopper task is stuck in WAKING state and not actually visible to be picked.

Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.

This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.

Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.

Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.

Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.

Fixes: 2558aacff858 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net

6mo ago

Linus Torvalds

afa9a6f4

Merge tag 'staging-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

6mo ago

Linus Torvalds

d9864e7d

Merge tag 'perf-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

7mo ago

Herbert Xu

434d7f9b

timens: Add struct seq_file forward declaration

7mo ago

Linux 6.16-rc6 v6.16-rc6

347e9f50

Linus Torvalds

6mo

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

3cd75219

Linus Torvalds

6mo

Merge tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

5d5d6229

Linus Torvalds

6mo

dt-bindings: clock: mediatek: Add #reset-cells property for MT8188

a42b4dcc

Julien Massot

7mo

Merge tag 'irq_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

41998eeb

Linus Torvalds

6mo

MAINTAINERS: Update Kirill Shutemov's email address for TDX

cb73e53f

Kirill A. Shutemov

6mo

clk: imx: Fix an out-of-bounds access in dispmix_csr_clk_dev_data

aacc875a

Xiaolei Wang

7mo

Merge tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

0a197b75

Linus Torvalds

6mo

irqchip/irq-msi-lib: Fix build with PCI disabled

a8b289f0

Arnd Bergmann

6mo

x86/mm: Disable hugetlb page table sharing on 32-bit

76303ee8

Jann Horn

6mo

clk: scmi: Handle case where child clocks are initialized before their parents

6306e0c5

Sascha Hauer

7mo

Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

3f31a806

Linus Torvalds

6mo

perf/core: Fix WARN in perf_sigtrap()

3da6bb41

Tetsuo Handa

6mo

PCI/MSI: Prevent recursive locking in pci_msix_write_tph_tag()

68ea85df

Himanshu Madhani

6mo

x86/CPU/AMD: Disable INVLPGB on Zen2

a74bb5f2

Mikhail Paulyshka

6mo

Linux 6.16-rc1 v6.16-rc1

19272b37

Linus Torvalds

7mo

Merge tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

3b428e1c

Linus Torvalds

6mo

Revert "sched/numa: add statistics of numa balance task"

db6cc3f4

Chen Yu

6mo

sched/deadline: Fix dl_server runtime calculation formula

In our testing with a 6.12-based kernel on a big.LITTLE system, we were
seeing instances of RT tasks being blocked from running on the LITTLE
cpus for multiple seconds, apparently by the dl_server. This far exceeds
the default configured 50ms per second runtime.

This is due to the fair dl_server runtime calculation being scaled
for frequency & capacity of the cpu.

Consider the following case under a big.LITTLE architecture:
Assume the runtime is 50,000,000 ns, and the frequency/capacity
scale-invariance is defined as below:
  Frequency scale-invariance: 100
  Capacity scale-invariance: 50
First, by frequency scale-invariance,
the runtime is scaled to 50,000,000 * 100 >> 10 = 4,882,812.
Then, by capacity scale-invariance,
it is further scaled to 4,882,812 * 50 >> 10 = 238,418.
So it will be scaled to 238,418 ns.

This smaller "accounted runtime" value is what ends up being
subtracted from the fair server's runtime for the current period.
Thus after 50ms of real time, we've only accounted ~238us against the
fair server's runtime. The 209:1 ratio in this example means that on
the smaller cpu the fair server is allowed to continue running,
blocking RT tasks, for over 10 seconds before it exhausts its supposed
50ms of runtime. And on other hardware configurations it can be even
worse.
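
The arithmetic above can be reproduced with a few lines of C. This is a sketch that only mirrors the numbers in the example. The helper names `model_scale()`, `model_accounted_ns()` and `model_wall_time_ns()` are hypothetical, and the shift of 10 is taken from the ">> 10" in the text.

```c
#include <stdint.h>

#define MODEL_SCALE_SHIFT 10	/* the ">> 10" used in the example */

/* Scale a value by a factor expressed in 1/1024ths. */
static uint64_t model_scale(uint64_t value, uint64_t factor)
{
	return (value * factor) >> MODEL_SCALE_SHIFT;
}

/* Runtime accounted after frequency scaling, then capacity scaling. */
static uint64_t model_accounted_ns(uint64_t delta_ns, uint64_t freq,
				   uint64_t cap)
{
	return model_scale(model_scale(delta_ns, freq), cap);
}

/*
 * Wall-clock time that passes before budget_ns of accounted runtime is
 * consumed, given that each budget_ns of real time is charged as only
 * model_accounted_ns(budget_ns, ...) nanoseconds.
 */
static uint64_t model_wall_time_ns(uint64_t budget_ns, uint64_t freq,
				   uint64_t cap)
{
	uint64_t accounted = model_accounted_ns(budget_ns, freq, cap);

	if (accounted == 0)
		return UINT64_MAX;	/* the budget would never run out */

	return budget_ns * budget_ns / accounted;
}
```

With the example's values, `model_accounted_ns(50000000, 100, 50)` returns 238418, and `model_wall_time_ns(50000000, 100, 50)` comes out at a little over 10 seconds expressed in nanoseconds, matching the figures quoted above.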

For the fair deadline_server, to prevent realtime tasks from being
unexpectedly delayed, we really do want to use fixed time, and not
scaled time for smaller capacity/frequency cpus. So remove the scaling
from the fair server's accounting to fix this.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: John Stultz <jstultz@google.com>
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: John Stultz <jstultz@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250702021440.2594736-1-kuyo.chang@mediatek.com