commits

[Problem Description]
When running the hackbench program of LTP, the following memory leak is
reported by kmemleak.

# /opt/ltp/testcases/bin/hackbench 20 thread 1000
Running with 20*40 (== 800) tasks.

# dmesg | grep kmemleak
...
kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888cd8ca2c40 (size 64):
comm "hackbench", pid 17142, jiffies 4299780315
hex dump (first 32 bytes):
ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00 .tI.....L.I.....
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc bff18fd4):
[<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0
[<ffffffff8113f715>] task_numa_work+0x725/0xa00
[<ffffffff8110f878>] task_work_run+0x58/0x90
[<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0
[<ffffffff81dd78d5>] do_syscall_64+0x85/0x150
[<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
...

This issue can be consistently reproduced on three different servers:
* a 448-core server
* a 256-core server
* a 192-core server

[Root Cause]
Since multiple threads are created by the hackbench program (along with
the command argument 'thread'), a shared vma might be accessed by two or
more cores simultaneously. When two or more cores observe that
vma->numab_state is NULL at the same time, each allocates its own state
and the later assignment overwrites the earlier one, leaking it.

Although the current code ensures that only one thread scans the VMAs in
a single 'numa_scan_period', another thread may still enter in the next
'numa_scan_period' before the first thread has reached the numab_state
allocation [1].

Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000`
cannot reproduce the issue. This was verified with 200+ test runs.

[Solution]
Use the cmpxchg atomic operation to ensure that only one thread executes
the vma->numab_state assignment.
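
The idea can be sketched in user space as follows. This is only an
illustration, assuming C11 atomics in place of the kernel's cmpxchg();
the struct and function names are hypothetical and this is not the
kernel code:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures. */
struct numab_state_sketch {
	unsigned long next_scan;
};

struct vma_sketch {
	_Atomic(struct numab_state_sketch *) numab_state;
};

/*
 * Returns 1 if this caller published its allocation, 0 if the state was
 * already set (or another thread won the race), -1 on allocation failure.
 */
static int vma_init_numab_state(struct vma_sketch *vma)
{
	struct numab_state_sketch *expected = NULL;
	struct numab_state_sketch *ptr;

	/* Fast path: somebody already installed the state. */
	if (atomic_load(&vma->numab_state))
		return 0;

	ptr = calloc(1, sizeof(*ptr));
	if (!ptr)
		return -1;

	/* Publish only if the field is still NULL; the loser frees its copy. */
	if (!atomic_compare_exchange_strong(&vma->numab_state, &expected, ptr)) {
		free(ptr);
		return 0;
	}
	return 1;
}
```

With a plain store in place of the compare-and-exchange, two threads
that both saw NULL would each store their pointer and the first
allocation would be lost.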

[1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/

Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com
Fixes: ef6a22b70f6d ("sched/numa: apply the scan delay to every new vma")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Jiwei Sun <sunjw10@lenovo.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

1y ago

Ricardo Neri

b3fce429

cacheinfo: Allocate memory during CPU hotplug if not done from the primary CPU

Commit

5944ce092b97 ("arch_topology: Build cacheinfo from primary CPU")

adds functionality that architectures can use to optionally allocate and
build cacheinfo early during boot. Commit

6539cffa9495 ("cacheinfo: Add arch specific early level initializer")

lets secondary CPUs correct (and reallocate memory) cacheinfo data if
needed.

If the early build functionality is not used and cacheinfo does not need
correction, memory for cacheinfo is never allocated. x86 does not use
the early build functionality. Consequently, during the cacheinfo CPU
hotplug callback, last_level_cache_is_valid() attempts to dereference
a NULL pointer:

BUG: kernel NULL pointer dereference, address: 0000000000000100
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID 19 Comm: cpuhp/0 Not tainted 6.4.0-rc2 #1
RIP: 0010: last_level_cache_is_valid+0x95/0xe0a

Allocate memory for cacheinfo during the cacheinfo CPU hotplug callback
if not done earlier.

Moreover, before determining the validity of the last-level cache info,
ensure that it has been allocated. Simply checking for non-zero
cache_leaves() is not sufficient, as some architectures (e.g., Intel
processors) have non-zero cache_leaves() before allocation.
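
The check described above might look roughly like the following. This
is a simplified sketch with hypothetical types and names, not the
kernel's cacheinfo structures or helpers:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the per-CPU cacheinfo data. */
struct cacheinfo_sketch {
	unsigned int id_valid;	/* non-zero once the cache ID is populated */
};

struct cpu_cacheinfo_sketch {
	struct cacheinfo_sketch *info_list;	/* NULL until allocated */
	unsigned int num_leaves;		/* may be non-zero before allocation */
};

static bool llc_is_valid_sketch(const struct cpu_cacheinfo_sketch *ci)
{
	/* Check both: a non-zero leaf count alone does not imply allocation. */
	if (!ci->num_leaves || !ci->info_list)
		return false;

	/* The last leaf is treated as the last-level cache in this sketch. */
	return ci->info_list[ci->num_leaves - 1].id_valid != 0;
}
```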

Dereferencing NULL cacheinfo can occur in update_per_cpu_data_slice_size().
This function iterates over all online CPUs. However, a CPU may have come
online recently, but its cacheinfo may not have been allocated yet.

While here, remove an unnecessary indentation in allocate_cache_info().

[ bp: Massage. ]

Fixes: 6539cffa9495 ("cacheinfo: Add arch specific early level initializer")
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Radu Rendec <rrendec@redhat.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: stable@vger.kernel.org # 6.3+
Link: https://lore.kernel.org/r/20241128002247.26726-2-ricardo.neri-calderon@linux.intel.com

1y ago

Uwe Kleine-König

cc47268c

irqchip: Switch back to struct platform_driver::remove()

1y ago

Linus Torvalds

f788b5ef

Merge tag 'timers_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

1y ago

Dr. David Alan Gilbert

f69e6375

printf: Remove unused 'bprintf'

1y ago

Chen-Yu Tsai

0d40daa1

of: base: Document prefix argument for of_get_next_child_with_prefix()

1y ago

Linus Torvalds

7503345a

Merge tag 'block-6.13-20241207' of git://git.kernel.dk/linux

1y ago

John Garry

6918141d

scsi: scsi_debug: Fix hrtimer support for ndelay

1y ago

Steve French

ddca5023

smb3.1.1: fix posix mounts to older servers

1y ago

Akinobu Mita

6535b866

mm/damon: fix order of arguments in damos_before_apply tracepoint

1y ago

David Woodhouse

07fa619f

x86/kexec: Restore GDT on return from ::preserve_context kexec

1y ago

Zhou Wang

f82e62d4

irqchip/gicv3-its: Add workaround for hip09 ITS erratum 162100801

1y ago

Linus Torvalds

63f4993b

Merge tag 'irq_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

1y ago

Frederic Weisbecker

63dffecf

posix-timers: Target group sigqueue to current task only if not exiting

1y ago

Linus Torvalds

7af08b57

Merge tag 'trace-v6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

1y ago

Liam Zuiderhoek

44b68269

i2c: Fix whitespace style issue

1y ago

Linus Torvalds

aa0274d2

Merge tag 'io_uring-6.13-20241207' of git://git.kernel.dk/linux

1y ago

Ming Lei

22465bba

blk-mq: move cpuhp callback registering out of q->sysfs_lock

1y ago

Cathy Avery

b1aee7f0

scsi: storvsc: Do not flag MAINTENANCE_IN return of SRB_STATUS_DATA_OVERRUN as an error

1y ago

Ralph Boehme

8cb0bc54

fs/smb/client: cifs_prime_dcache() for SMB3 POSIX reparse points

1y ago

Kees Cook

5c379360

lib: stackinit: hide never-taken branch from compiler

1y ago

Fernando Fernandez Mancera

73da582a

x86/cpu/topology: Remove limit of CPUs due to disabled IO/APIC

1y ago

Russell King (Oracle)

12aaf675

irqchip/irq-mvebu-sei: Move misplaced select() callback to SEI CP domain

1y ago

Linus Torvalds

58ac609b

Merge tag 'x86_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

1y ago

Frederic Weisbecker

4d17c25e

delay: Fix ndelay() spuriously treated as udelay()

1y ago

Linus Torvalds

65ae975e

Merge tag 'net-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth.

Current release - regressions:

- rtnetlink: fix rtnl_dump_ifinfo() error path

- bluetooth: remove the redundant sco_conn_put

Previous releases - regressions:

- netlink: fix false positive warning in extack during dumps

- sched: sch_fq: don't follow the fast path if Tx is behind now

- ipv6: delete temporary address if mngtmpaddr is removed or
unmanaged

- tcp: fix use-after-free of nreq in reqsk_timer_handler().

- bluetooth: fix slab-use-after-free Read in set_powered_sync

- l2tp: fix warning in l2tp_exit_net found

- eth:
- bnxt_en: fix receive ring space parameters when XDP is active
- lan78xx: fix double free issue with interrupt buffer allocation
- tg3: set coherent DMA mask bits to 31 for BCM57766 chipsets

Previous releases - always broken:

- ipmr: fix tables suspicious RCU usage

- iucv: MSG_PEEK causes memory leak in iucv_sock_destruct()

- eth:
- octeontx2-af: fix low network performance
- stmmac: dwmac-socfpga: set RX watchdog interrupt as broken
- rtase: correct the speed for RTL907XD-V1

Misc:

- some documentation fixup"

* tag 'net-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits)
ipmr: fix build with clang and DEBUG_NET disabled.
Documentation: tls_offload: fix typos and grammar
Fix spelling mistake
ipmr: fix tables suspicious RCU usage
ip6mr: fix tables suspicious RCU usage
ipmr: add debug check for mr table cleanup
selftests: rds: move test.py to TEST_FILES
net_sched: sch_fq: don't follow the fast path if Tx is behind now
tcp: Fix use-after-free of nreq in reqsk_timer_handler().
net: phy: fix phy_ethtool_set_eee() incorrectly enabling LPI
net: Comment copy_from_sockptr() explaining its behaviour
rxrpc: Improve setsockopt() handling of malformed user input
llc: Improve setsockopt() handling of malformed user input
Bluetooth: SCO: remove the redundant sco_conn_put
Bluetooth: MGMT: Fix possible deadlocks
Bluetooth: MGMT: Fix slab-use-after-free Read in set_powered_sync
bnxt_en: Unregister PTP during PCI shutdown and suspend
bnxt_en: Refactor bnxt_ptp_init()
bnxt_en: Fix receive ring space parameters when XDP is active
bnxt_en: Fix queue start to update vnic RSS table
...

1y ago

Mathieu Desnoyers

2bd9b57d

tracing: Use guard() rather than scoped_guard()

1y ago

Chen-Yu Tsai

aac9e2af

arm64: dts: mediatek: mt8173-elm-hana: Mark touchscreens and trackpads as fail

1y ago

Linus Torvalds

a6db2a5d

Merge tag 'ubifs-for-linus-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

1y ago

Bernd Schubert

a07d2d79

io_uring: Change res2 parameter type in io_uring_cmd_done

1y ago

Ming Lei

4bf485a7

blk-mq: register cpuhp callback after hctx is added to xarray table

1y ago

Peter Wang

7f45ed5f

scsi: ufs: core: Add missing post notify for power mode change

1y ago

Ralph Boehme

6a832bc8

fs/smb/client: Implement new SMB3 POSIX type

1y ago

David Hildenbrand

3203b3ab

mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()

1y ago

David Woodhouse

d0ceea66

x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables

1y ago

Linus Torvalds

7eef7e30

Merge tag 'for-6.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

1y ago

Linux 6.13-rc2 v6.13-rc2

fac04efc

Linus Torvalds

Merge tag 'kbuild-fixes-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

0b6809a7

Linus Torvalds

Merge tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

eadaac4d

Linus Torvalds

kbuild: deb-pkg: fix build error with O=

d8d326d6

Masahiro Yamada

Merge tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

c25ca0c2

Linus Torvalds

irqchip/stm32mp-exti: CONFIG_STM32MP_EXTI should not default to y when compile-testing

9151299e

Geert Uytterhoeven

modpost: Add .irqentry.text to OTHER_SECTIONS

79124056

Thomas Gleixner

Merge tag 'x86_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

84262262

Linus Torvalds

clocksource: Make negative motion detection more robust

Guenter reported boot stalls on an emulated ARM 32-bit platform, which has a
24-bit wide clocksource.

It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.

max_idle_ns: 597268854 ns
negative motion tripping point: 671088640 ns

If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock from advancing, and in the worst case the system
stalls completely if the consecutive sleeps based on the stale clock are
delayed as well.

Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.

For the case at hand this results in a tripping point of 1174405120ns.
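
The two tripping points quoted above can be reproduced with a small
calculation. This is a sketch of the arithmetic only, not the kernel
implementation; the 80 ns cycle time is an assumption inferred from the
quoted half-width value (671088640 ns / 2^23 cycles) and the function
names are hypothetical:

```c
#include <stdint.h>

/* Old cut-off: half of the counter range. Valid for bits in 1..63. */
static uint64_t half_width_cutoff_ns(unsigned int bits, uint64_t ns_per_cycle)
{
	return (UINT64_C(1) << (bits - 1)) * ns_per_cycle;
}

/* New cut-off: 87.5% (7/8) of the counter range. Valid for bits in 3..63. */
static uint64_t robust_cutoff_ns(unsigned int bits, uint64_t ns_per_cycle)
{
	uint64_t range = UINT64_C(1) << bits;

	return (range - (range >> 3)) * ns_per_cycle;
}
```

For a 24-bit counter at 80 ns per cycle, half_width_cutoff_ns(24, 80)
yields 671088640 and robust_cutoff_ns(24, 80) yields 1174405120,
matching the values in this message.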

Note that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that is no different from the previous unchecked version which
allowed arbitrary time jumps.

Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.

Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.net

76031d95

Thomas Gleixner

genirq/proc: Add missing space separator back

9d9f204b

Thomas Gleixner

Linux 6.13-rc1 v6.13-rc1

40384c84

Linus Torvalds

Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

553c89ec

Linus Torvalds

x86/CPU/AMD: WARN when setting EFER.AUTOIBRS if and only if the WRMSR fails

49207766

Sean Christopherson

Get rid of 'remove_new' relic from platform driver struct

e70140ba

Linus Torvalds

irqchip/bcm2836: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND

ee3878b8

Stefan Wahren

Merge tag 'i2c-for-6.13-rc1-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

a14bf463

Linus Torvalds

Merge tag '6.13-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

62b5a469

Linus Torvalds

iio: magnetometer: yas530: use signed integer type for clamp limits

f1ee5483

Jakob Hauser

x86/cacheinfo: Delete global num_cache_leaves

9677be09

Ricardo Neri

irqchip/gic-v3: Fix irq_complete_ack() comment

f58326c7

Lorenzo Pieralisi

Merge tag 'trace-printf-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

88862eeb

Linus Torvalds

MAINTAINERS: fix typo in I2C OF COMPONENT PROBER

caf4bdb5

Lukas Bulwahn

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

c94cd024

Linus Torvalds

smb: client: fix potential race in cifs_put_tcon()

c32b624f

Paulo Alcantara

sched/numa: fix memory leak due to the overwritten vma->numab_state

5f1b64e9

Adrian Huang

cacheinfo: Allocate memory during CPU hotplug if not done from the primary CPU

b3fce429

Ricardo Neri

irqchip: Switch back to struct platform_driver::remove()

cc47268c

Uwe Kleine-König

Merge tag 'timers_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

f788b5ef

Linus Torvalds

printf: Remove unused 'bprintf'

f69e6375

Dr. David Alan Gilbert

of: base: Document prefix argument for of_get_next_child_with_prefix()

0d40daa1

Chen-Yu Tsai

Merge tag 'block-6.13-20241207' of git://git.kernel.dk/linux

7503345a

Linus Torvalds

scsi: scsi_debug: Fix hrtimer support for ndelay

6918141d

John Garry

smb3.1.1: fix posix mounts to older servers

ddca5023

Steve French

mm/damon: fix order of arguments in damos_before_apply tracepoint

6535b866

Akinobu Mita

x86/kexec: Restore GDT on return from ::preserve_context kexec

07fa619f

David Woodhouse

irqchip/gicv3-its: Add workaround for hip09 ITS erratum 162100801

f82e62d4

Zhou Wang

Merge tag 'irq_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

63f4993b

Linus Torvalds

posix-timers: Target group sigqueue to current task only if not exiting

63dffecf

Frederic Weisbecker

Merge tag 'trace-v6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull more tracing updates from Steven Rostedt:

- Add trace flag for NEED_RESCHED_LAZY

Now that NEED_RESCHED_LAZY is upstream, add it to the status bits of
the common_flags. Traces will now show when the NEED_RESCHED_LAZY
flag is set, which is useful for debugging latency issues in the
kernel.

- Remove leftover "__idx" variable when SRCU was removed from the
tracepoint code

- Add rcu_tasks_trace guard

To add a guard() around the tracepoint code, a rcu_tasks_trace guard
needs to be created first.

- Remove __DO_TRACE() macro and just call __DO_TRACE_CALL() directly

The __DO_TRACE() macro has conditional locking depending on what was
passed into the macro parameters. As the guts of the macro have been
moved to __DO_TRACE_CALL() to handle static call logic, there's no
reason to keep the __DO_TRACE() macro around.

It is better to just do the locking in place without the conditionals
and call __DO_TRACE_CALL() from those locations. The "cond" passed in
can also be moved out of that macro. This simplifies the code.

- Remove the "cond" from the system call tracepoint macros

The "cond" variable was added to allow some tracepoints to check a
condition within the static_branch (jump/nop) logic. The system calls
do not need this. Removing it simplifies the code.

- Replace scoped_guard() with just guard() in the tracepoint logic

guard() works just as well as scoped_guard() in the tracepoint logic
and the scoped_guard() causes some issues.

* tag 'trace-v6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Use guard() rather than scoped_guard()
tracing: Remove cond argument from __DECLARE_TRACE_SYSCALL
tracing: Remove conditional locking from __DO_TRACE()
rcupdate_trace: Define rcu_tasks_trace lock guard
tracing: Remove __idx variable from __DO_TRACE
tracing: Move it_func[0] comment to the relevant context
tracing: Record task flag NEED_RESCHED_LAZY.