commits
Pull i2c fixes from Wolfram Sang:
- subsystem: convert drivers to use recent callbacks of struct
i2c_algorithm A typical after-rc1 cleanup, which I couldn't send in
time for rc2
- tegra: fix YAML conversion of device tree bindings
- k1: re-add a check which got lost during upstreaming
* tag 'i2c-for-6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: k1: check for transfer error
i2c: use inclusive callbacks in struct i2c_algorithm
dt-bindings: i2c: nvidia,tegra20-i2c: Specify the required properties
Pull x86 fixes from Borislav Petkov:
- Make sure the array tracking which kernel text positions need to be
alternatives-patched doesn't get mishandled by out-of-order
modifications, leading to it overflowing and causing page faults when
patching
- Avoid an infinite loop when early code does a ranged TLB invalidation
before the broadcast TLB invalidation count of how many pages it can
flush, has been read from CPUID
- Fix a CONFIG_MODULES typo
- Disable broadcast TLB invalidation when PTI is enabled to avoid an
overflow of the bitmap tracking dynamic ASIDs which need to be
flushed when the kernel switches between the user and kernel address
space
- Handle the case of a CPU going offline and thus reporting zeroes when
reading top-level events in the resctrl code
* tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Fix int3 handling failure from broken text_poke array
x86/mm: Fix early boot use of INVPLGB
x86/its: Fix an ifdef typo in its_alloc()
x86/mm: Disable INVLPGB when PTI is enabled
x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl subsystem
If spacemit_i2c_xfer_msg() times out waiting for a message transfer to
complete, or if the hardware reports an error, it returns a negative
error code (-ETIMEDOUT, -EAGAIN, -ENXIO. or -EIO).
The sole caller of spacemit_i2c_xfer_msg() is spacemit_i2c_xfer(),
which is the i2c_algorithm->xfer callback function. It currently
does not save the value returned by spacemit_i2c_xfer_msg().
The result is that transfer errors go unreported, and a caller
has no indication anything is wrong.
When this code was out for review, the return value *was* checked
in early versions. But for some reason, that assignment got dropped
between versions 5 and 6 of the series, perhaps related to reworking
the code to merge spacemit_i2c_xfer_core() into spacemit_i2c_xfer().
Simply assigning the value returned to "ret" fixes the problem.
Fixes: 5ea558473fa31 ("i2c: spacemit: add support for SpacemiT K1 SoC")
Signed-off-by: Alex Elder <elder@riscstar.com>
Cc: <stable@vger.kernel.org> # v6.15+
Reviewed-by: Troy Mitchell <troymitchell988@gmail.com>
Link: https://lore.kernel.org/r/20250616125137.1555453-1-elder@riscstar.com
Signed-off-by: Andi Shyti <andi@smida.it>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Pull irq fixes from Borislav Petkov:
- Fix missing prototypes warnings
- Properly initialize work context when allocating it
- Remove a method tracking when managed interrupts are suspended during
hotplug, in favor of the code using a IRQ disable depth tracking now,
and have interrupts get properly enabled again on restore
- Make sure multiple CPUs getting hotplugged don't cause wrong tracking
of the managed IRQ disable depth
* tag 'irq_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/ath79-misc: Fix missing prototypes warnings
genirq/irq_sim: Initialize work context pointers properly
genirq/cpuhotplug: Restore affinity even for suspended IRQ
genirq/cpuhotplug: Rebalance managed interrupts across multi-CPU hotplug
Since smp_text_poke_single() does not expect there is another
text_poke request is queued, it can make text_poke_array not
sorted or cause a buffer overflow on the text_poke_array.vec[].
This will cause an Oops in int3 because of bsearch failing;
CPU 0 CPU 1 CPU 2
----- ----- -----
smp_text_poke_batch_add()
smp_text_poke_single() <<-- Adds out of order
<int3>
[Fails o find address
in text_poke_array ]
OOPS!
Or unhandled page fault because of a buffer overflow;
CPU 0 CPU 1
----- -----
smp_text_poke_batch_add() <<+
... |
smp_text_poke_batch_add() <<-- Adds TEXT_POKE_ARRAY_MAX times.
smp_text_poke_single() {
__smp_text_poke_batch_add() <<-- Adds entry at
TEXT_POKE_ARRAY_MAX + 1
smp_text_poke_batch_finish()
[Unhandled page fault because
text_poke_array.nr_entries is
overwritten]
BUG!
}
Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add()
so that it correctly flush the queue if needed.
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Link: https://lkml.kernel.org/r/\ 175020512308.3582717.13631440385506146631.stgit@mhiramat.tok.corp.google.com
i2c-host-fixes for v6.16-rc2
tegra: fix YAML conversion of device tree bindings
Pull perf fixes from Borislav Petkov:
- Avoid a crash on a heterogeneous machine where not all cores support
the same hw events features
- Avoid a deadlock when throttling events
- Document the perf event states more
- Make sure a number of perf paths switching off or rescheduling events
call perf_cgroup_event_disable()
- Make sure perf does task sampling before its userspace mapping is
torn down, and not after
* tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix crash in icl_update_topdown_event()
perf: Fix the throttle error of some clock events
perf: Add comment to enum perf_event_state
perf/core: Fix WARN in perf_cgroup_switch()
perf: Fix dangling cgroup pointer in cpuctx
perf: Fix cgroup state vs ERROR
perf: Fix sample vs do_exit()
ath79_misc_irq_init() was defined but unused since commit 51fa4f8912c0
("MIPS: ath79: drop legacy IRQ code"), so it's time to drop it.
The build also warns about a missing prototype of get_c0_perfcount_int().
Remove the stale leftover function and add the missing include.
Signed-off-by: Shiji Yang <yangshiji66@outlook.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/OSBPR01MB167032D2017645200787AAEBBC72A@OSBPR01MB1670.jpnprd01.prod.outlook.com
The INVLPGB instruction has limits on how many pages it can invalidate
at once. That limit is enumerated in CPUID, read by the kernel, and
stored in 'invpgb_count_max'. Ranged invalidation, like
invlpgb_kernel_range_flush() break up their invalidations so
that they do not exceed the limit.
However, early boot code currently attempts to do ranged
invalidation before populating 'invlpgb_count_max'. There is a
for loop which is basically:
for (...; addr < end; addr += invlpgb_count_max*PAGE_SIZE)
If invlpgb_kernel_range_flush is called before the kernel has read
the value of invlpgb_count_max from the hardware, the normally
bounded loop can become an infinite loop if invlpgb_count_max is
initialized to zero.
Fix that issue by initializing invlpgb_count_max to 1.
This way INVPLGB at early boot time will be a little bit slower
than normal (with initialized invplgb_count_max), and not an
instant hang at bootup time.
Fixes: b7aa05cbdc52 ("x86/mm: Add INVLPGB support code")
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250606171112.4013261-3-riel%40surriel.com
Convert the I2C subsystem to drop using the 'master_'-prefixed callbacks
in favor of the simplified ones. Fix alignment of '=' while here.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Specify the properties which are essential and which are not for the
Tegra I2C driver to function correctly. This was not added correctly when
the TXT binding was converted to yaml. All the existing DT nodes have
these properties already and hence this does not break the ABI.
dmas and dma-names which were specified as a must in the TXT binding
is now made optional since the driver can work in PIO mode if dmas are
missing.
Fixes: f10a9b722f80 ("dt-bindings: i2c: tegra: Convert to json-schema”)
Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Cc: <stable@vger.kernel.org> # v5.17+
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Andi Shyti <andi@smida.it>
Link: https://lore.kernel.org/r/20250603153022.39434-1-akhilrajeev@nvidia.com
Pull locking fixes from Borislav Petkov:
- Make sure the switch to the global hash is requested always under a
lock so that two threads requesting that simultaneously cannot get to
inconsistent state
- Reject negative NUMA nodes earlier in the futex NUMA interface
handling code
- Selftests fixes
* tag 'locking_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Verify under the lock if hash can be replaced
futex: Handle invalid node numbers supplied by user
selftests/futex: Set the home_node in futex_numa_mpol
selftests/futex: getopt() requires int as return value.
The perf_fuzzer found a hard-lockup crash on a RaptorLake machine:
Oops: general protection fault, maybe for address 0xffff89aeceab400: 0000
CPU: 23 UID: 0 PID: 0 Comm: swapper/23
Tainted: [W]=WARN
Hardware name: Dell Inc. Precision 9660/0VJ762
RIP: 0010:native_read_pmc+0x7/0x40
Code: cc e8 8d a9 01 00 48 89 03 5b cd cc cc cc cc 0f 1f ...
RSP: 000:fffb03100273de8 EFLAGS: 00010046
....
Call Trace:
<TASK>
icl_update_topdown_event+0x165/0x190
? ktime_get+0x38/0xd0
intel_pmu_read_event+0xf9/0x210
__perf_event_read+0xf9/0x210
CPUs 16-23 are E-core CPUs that don't support the perf metrics feature.
The icl_update_topdown_event() should not be invoked on these CPUs.
It's a regression of commit:
f9bdf1f95339 ("perf/x86/intel: Avoid disable PMU if !cpuc->enabled in sample read")
The bug introduced by that commit is that the is_topdown_event() function
is mistakenly used to replace the is_topdown_count() call to check if the
topdown functions for the perf metrics feature should be invoked.
Fix it.
Fixes: f9bdf1f95339 ("perf/x86/intel: Avoid disable PMU if !cpuc->enabled in sample read")
Closes: https://lore.kernel.org/lkml/352f0709-f026-cd45-e60c-60dfd97f73f3@maine.edu/
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Cc: stable@vger.kernel.org # v6.15+
Link: https://lore.kernel.org/r/20250612143818.2889040-1-kan.liang@linux.intel.com
Initialize `ops` member's pointers properly by using kzalloc() instead of
kmalloc() when allocating the simulation work context. Otherwise the
pointers contain random content leading to invalid dereferencing.
Signed-off-by: Gyeyoung Baek <gye976@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250612124827.63259-1-gye976@gmail.com
Commit a82b26451de1 ("x86/its: explicitly manage permissions for ITS
pages") reworks its_alloc() and introduces a typo in an ifdef
conditional, referring to CONFIG_MODULE instead of CONFIG_MODULES.
Fix this typo in its_alloc().
Fixes: a82b26451de1 ("x86/its: explicitly manage permissions for ITS pages")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250616100432.22941-1-lukas.bulwahn%40redhat.com
Pull EDAC fixes from Borislav Petkov:
- amd64: Correct the number of memory controllers on some AMD Zen
clients
- igen6: Handle firmware-disabled memory controllers properly
* tag 'edac_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/igen6: Fix NULL pointer dereference
EDAC/amd64: Correct number of UMCs for family 19h models 70h-7fh
Once the global hash is requested there is no way back to switch back to
the per-task private hash. This is checked at the begin of the function.
It is possible that two threads simultaneously request the global hash
and both pass the initial check and block later on the
mm::futex_hash_lock. In this case the first thread performs the switch
to the global hash. The second thread will also attempt to switch to the
global hash and while doing so, accessing the nonexisting slot 1 of the
struct futex_private_hash.
The same applies if the hash is made immutable: There is no reference
counting and the hash must not be replaced.
Verify under mm_struct::futex_phash that neither the global hash nor an
immutable hash in use.
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aDwDw9Aygqo6oAx+@ly-workstation/
Fixes: bd54df5ea7cad ("futex: Allow to resize the private local hash")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250610104400.1077266-5-bigeasy@linutronix.de/
Both ARM and IBM CI reports RCU stall, which can be reproduced by the
below perf command.
perf record -a -e cpu-clock -- sleep 2
The issue is introduced by the generic throttle patch set, which
unconditionally invoke the event_stop() when throttle is triggered.
The cpu-clock and task-clock are two special SW events, which rely on
the hrtimer. The throttle is invoked in the hrtimer handler. The
event_stop()->hrtimer_cancel() waits for the handler to finish, which is
a deadlock. Instead of invoking the stop(), the HRTIMER_NORESTART should
be used to stop the timer.
There may be two ways to fix it:
- Introduce a PMU flag to track the case. Avoid the event_stop in
perf_event_throttle() if the flag is detected.
It has been implemented in the
https://lore.kernel.org/lkml/20250528175832.2999139-1-kan.liang@linux.intel.com/
The new flag was thought to be an overkill for the issue.
- Add a check in the event_stop. Return immediately if the throttle is
invoked in the hrtimer handler. Rely on the existing HRTIMER_NORESTART
method to stop the timer.
The latter is implemented here.
Move event->hw.interrupts = MAX_INTERRUPTS before the stop(). It makes
the order the same as perf_event_unthrottle(). Except the patch, no one
checks the hw.interrupts in the stop(). There is no impact from the
order change.
When stops in the throttle, the event should not be updated,
stop(event, 0). But the cpu_clock_event_stop() doesn't handle the flag.
In logic, it's wrong. But it didn't bring any problems with the old
code, because the stop() was not invoked when handling the throttle.
Checking the flag before updating the event.
Fixes: 9734e25fbf5a ("perf: Fix the throttle logic for a group")
Closes: https://lore.kernel.org/lkml/20250527161656.GJ2566836@e132581.arm.com/
Closes: https://lore.kernel.org/lkml/djxlh5fx326gcenwrr52ry3pk4wxmugu4jccdjysza7tlc5fef@ktp4rffawgcw/
Closes: https://lore.kernel.org/lkml/8e8f51d8-af64-4d9e-934b-c0ee9f131293@linux.ibm.com/
Closes: https://lore.kernel.org/lkml/4ce106d0-950c-aadc-0b6a-f0215cd39913@maine.edu/
Reported-by: Leo Yan <leo.yan@arm.com>
Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20250606192546.915765-1-kan.liang@linux.intel.com
Commit 788019eb559f ("genirq: Retain disable depth for managed interrupts
across CPU hotplug") tried to make managed shutdown/startup properly
reference counted, but it missed the fact that the unplug and hotplug code
has an intentional imbalance by skipping IRQS_SUSPENDED interrupts on
the "restore" path.
This means that if a managed-affinity interrupt was both suspended and
managed-shutdown (such as may happen during system suspend / S3), resume
skips calling irq_startup_managed(), and would again have an unbalanced
depth this time, with a positive value (i.e., remaining unexpectedly
masked).
This IRQS_SUSPENDED check was introduced in commit a60dd06af674
("genirq/cpuhotplug: Skip suspended interrupts when restoring affinity")
for essentially the same reason as commit 788019eb559f, to prevent that
irq_startup() would unconditionally re-enable an interrupt too early.
Because irq_startup_managed() now respsects the disable-depth count, the
IRQS_SUSPENDED check is not longer needed, and instead, it causes harm.
Thus, drop the IRQS_SUSPENDED check, and restore balance.
This effectively reverts commit a60dd06af674 ("genirq/cpuhotplug: Skip
suspended interrupts when restoring affinity"), because it is replaced
by commit 788019eb559f ("genirq: Retain disable depth for managed
interrupts across CPU hotplug").
Fixes: 788019eb559f ("genirq: Retain disable depth for managed interrupts across CPU hotplug")
Reported-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Link: https://lore.kernel.org/all/20250612183303.3433234-3-briannorris@chromium.org
Closes: https://lore.kernel.org/lkml/24ec4adc-7c80-49e9-93ee-19908a97ab84@gmail.com/
PTI uses separate ASIDs (aka. PCIDs) for kernel and user address
spaces. When the kernel needs to flush the user address space, it
just sets a bit in a bitmap and then flushes the entire PCID on
the next switch to userspace.
This bitmap is a single 'unsigned long' which is plenty for all 6
dynamic ASIDs. But, unfortunately, the INVLPGB support brings along a
bunch more user ASIDs, as many as ~2k more. The bitmap can't address
that many.
Fortunately, the bitmap is only needed for PTI and all the CPUs
with INVLPGB are AMD CPUs that aren't vulnerable to Meltdown and
don't need PTI. The only way someone can run into an issue in
practice is by booting with pti=on on a newer AMD CPU.
Disable INVLPGB if PTI is enabled. Avoid overrunning the small
bitmap.
Note: this will be fixed up properly by making the bitmap bigger.
For now, just avoid the mostly theoretical bug.
Fixes: 4afeb0ed1753 ("x86/mm: Enable broadcast TLB invalidation for multi-threaded processes")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250610222420.E8CBF472%40davehans-spike.ostc.intel.com
Pull turbostat updates from Len Brown:
- Add initial DMR support, which required smarter RAPL probe
- Fix AMD MSR RAPL energy reporting
- Add RAPL power limit configuration output
- Minor fixes
* tag 'turbostat-2025.06.08' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: version 2025.06.08
tools/power turbostat: Add initial support for BartlettLake
tools/power turbostat: Add initial support for DMR
tools/power turbostat: Dump RAPL sysfs info
tools/power turbostat: Avoid probing the same perf counters
tools/power turbostat: Allow probing RAPL with platform_features->rapl_msrs cleared
tools/power turbostat: Clean up add perf/msr counter logic
tools/power turbostat: Introduce add_msr_counter()
tools/power turbostat: Remove add_msr_perf_counter_()
tools/power turbostat: Remove add_cstate_perf_counter_()
tools/power turbostat: Remove add_rapl_perf_counter_()
tools/power turbostat: Quit early for unsupported RAPL counters
tools/power turbostat: Always check rapl_joules flag
tools/power turbostat: Fix AMD package-energy reporting
tools/power turbostat: Fix RAPL_GFX_ALL typo
tools/power turbostat: Add Android support for MSR device handling
tools/power turbostat.8: pm_domain wording fix
tools/power turbostat.8: fix typo: idle_pct should be pct_idle
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some
missing synchronisation
- A small fix for the irqbypass hook fixes, tightening the check and
ensuring that we only deal with MSI for both the old and the new
route entry
- Rework the way the shadow LRs are addressed in a nesting
configuration, plugging an embarrassing bug as well as simplifying
the whole process
- Add yet another fix for the dreaded arch_timer_edge_cases selftest
RISC-V:
- Fix the size parameter check in SBI SFENCE calls
- Don't treat SBI HFENCE calls as NOPs
x86 TDX:
- Complete API for handling complex TDVMCALLs in userspace.
This was delayed because the spec lacked a way for userspace to
deny supporting these calls; the new exit code is now approved"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: TDX: Exit to userspace for GetTdVmCallInfo
KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs
KVM: arm64: VHE: Centralize ISBs when returning to host
KVM: arm64: Remove cpacr_clear_set()
KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd()
KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync()
KVM: arm64: Reorganise CPTR trap manipulation
KVM: arm64: VHE: Synchronize CPTR trap deactivation
KVM: arm64: VHE: Synchronize restore of host debug registers
KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases
KVM: arm64: Explicitly treat routing entry type changes as changes
KVM: arm64: nv: Fix tracking of shadow list registers
RISC-V: KVM: Don't treat SBI HFENCE calls as NOPs
RISC-V: KVM: Fix the size parameter check in SBI SFENCE calls
A kernel panic was reported with the following kernel log:
EDAC igen6: Expected 2 mcs, but only 1 detected.
BUG: unable to handle page fault for address: 000000000000d570
...
Hardware name: Notebook V54x_6x_TU/V54x_6x_TU, BIOS Dasharo (coreboot+UEFI) v0.9.0 07/17/2024
RIP: e030:ecclog_handler+0x7e/0xf0 [igen6_edac]
...
igen6_probe+0x2a0/0x343 [igen6_edac]
...
igen6_init+0xc5/0xff0 [igen6_edac]
...
This issue occurred because one memory controller was disabled by
the BIOS but the igen6_edac driver still checked all the memory
controllers, including this absent one, to identify the source of
the error. Accessing the null MMIO for the absent memory controller
resulted in the oops above.
Fix this issue by reverting the configuration structure to non-const
and updating the field 'res_cfg->num_imc' to reflect the number of
detected memory controllers.
Fixes: 20e190b1c1fd ("EDAC/igen6: Skip absent memory controllers")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Closes: https://lore.kernel.org/all/aFFN7RlXkaK_loQb@mail-itl/
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Link: https://lore.kernel.org/r/20250618162307.1523736-1-qiuxu.zhuo@intel.com
syzbot used a negative node number which was not rejected early and led
to invalid memory access in node_possible().
Reject negative node numbers except for FUTEX_NO_NODE.
[bigeasy: Keep the FUTEX_NO_NODE check]
Closes: https://lore.kernel.org/all/6835bfe3.a70a0220.253bc2.00b5.GAE@google.com/
Fixes: cec199c5e39bd ("futex: Implement FUTEX2_NUMA")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: syzbot+9afaf6749e3a7aa1bdf3@syzkaller.appspotmail.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250528085521.1938355-4-bigeasy@linutronix.de
Better describe the event states.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Link: https://lkml.kernel.org/r/20250604135801.GK38114@noisy.programming.kicks-ass.net
Commit 788019eb559f ("genirq: Retain disable depth for managed interrupts
across CPU hotplug") intended to only decrement the disable depth once per
managed shutdown, but instead it decrements for each CPU hotplug in the
affinity mask, until its depth reaches a point where it finally gets
re-started.
For example, consider:
1. Interrupt is affine to CPU {M,N}
2. disable_irq() -> depth is 1
3. CPU M goes offline -> interrupt migrates to CPU N / depth is still 1
4. CPU N goes offline -> irq_shutdown() / depth is 2
5. CPU N goes online
-> irq_restore_affinity_of_irq()
-> irqd_is_managed_and_shutdown()==true
-> irq_startup_managed() -> depth is 1
6. CPU M goes online
-> irq_restore_affinity_of_irq()
-> irqd_is_managed_and_shutdown()==true
-> irq_startup_managed() -> depth is 0
*** BUG: driver expects the interrupt is still disabled ***
-> irq_startup() -> irqd_clr_managed_shutdown()
7. enable_irq() -> depth underflow / unbalanced enable_irq() warning
This should clear the managed-shutdown flag at step 6, so that further
hotplugs don't cause further imbalance.
Note: It might be cleaner to also remove the irqd_clr_managed_shutdown()
invocation from __irq_startup_managed(). But this is currently not possible
because of irq_update_affinity_desc() as it sets IRQD_MANAGED_SHUTDOWN and
expects irq_startup() to clear it.
Fixes: 788019eb559f ("genirq: Retain disable depth for managed interrupts across CPU hotplug")
Reported-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Link: https://lore.kernel.org/all/20250612183303.3433234-2-briannorris@chromium.org
In the resctrl subsystem's Sub-NUMA Cluster (SNC) mode, the rdt_mon_domain
structure representing a NUMA node relies on the cacheinfo interface
(rdt_mon_domain::ci) to store L3 cache information (e.g., shared_cpu_map)
for monitoring. The L3 cache information of a SNC NUMA node determines
which domains are summed for the "top level" L3-scoped events.
rdt_mon_domain::ci is initialized using the first online CPU of a NUMA
node. When this CPU goes offline, its shared_cpu_map is cleared to contain
only the offline CPU itself. Subsequently, attempting to read counters
via smp_call_on_cpu(offline_cpu) fails (and error ignored), returning
zero values for "top-level events" without any error indication.
Replace the cacheinfo references in struct rdt_mon_domain and struct
rmid_read with the cacheinfo ID (a unique identifier for the L3 cache).
rdt_domain_hdr::cpu_mask contains the online CPUs associated with that
domain. When reading "top-level events", select a CPU from
rdt_domain_hdr::cpu_mask and utilize its L3 shared_cpu_map to determine
valid CPUs for reading RMID counter via the MSR interface.
Considering all CPUs associated with the L3 cache improves the chances
of picking a housekeeping CPU on which the counter reading work can be
queued, avoiding an unnecessary IPI.
Fixes: 328ea68874642 ("x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files")
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250530182053.37502-2-qinyuntan@linux.alibaba.com
Pull timer cleanup from Thomas Gleixner:
"The delayed from_timer() API cleanup:
The renaming to the timer_*() namespace was delayed due massive
conflicts against Linux-next. Now that everything is upstream finish
the conversion"
* tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide, timers: Rename from_timer() to timer_container_of()
Add initial DMR support, which required smarter RAPL probe
Fix AMD MSR RAPL energy reporting
Add RAPL power limit configuration output
Minor fixes
Signed-off-by: Len Brown <len.brown@intel.com>
Pull smb client fixes from Steve French:
- Multichannel channel allocation fix for Kerberos mounts
- Two reconnect fixes
- Fix netfs_writepages crash with smbdirect/RDMA
- Directory caching fix
- Three minor cleanup fixes
- Log error when close cached dirs fails
* tag 'v6.16-rc2-smb3-client-fixes-v2' of git://git.samba.org/sfrench/cifs-2.6:
smb: minor fix to use SMB2_NTLMV2_SESSKEY_SIZE for auth_key size
smb: minor fix to use sizeof to initialize flags_string buffer
smb: Use loff_t for directory position in cached_dirents
smb: Log an error when close_all_cached_dirs fails
cifs: Fix prepare_write to negotiate wsize if needed
smb: client: fix max_sge overflow in smb_extract_folioq_to_rdma()
smb: client: fix first command failure during re-negotiation
cifs: Remove duplicate fattr->cf_dtype assignment from wsl_to_fattr() function
smb: fix secondary channel creation issue with kerberos by populating hostname when adding channels
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX,
to allow userspace to provide information about the support of
TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API.
GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>,
<ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>,
<Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and
<Instruction.WRMSR>. They must be supported by VMM to support TDX guests.
For GetTdVmCallInfo
- When leaf (r12) to enumerate TDVMCALL functionality is set to 0,
successful execution indicates all GHCI base TDVMCALLs listed above are
supported.
Update the KVM TDX document with the set of the GHCI base APIs.
- When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it
indicates the TDX guest is querying the supported TDVMCALLs beyond
the GHCI base TDVMCALLs.
Exit to userspace to let userspace set the TDVMCALL sub-function bit(s)
accordingly to the leaf outputs. KVM could set the TDVMCALL bit(s)
supported by itself when the TDVMCALLs don't need support from userspace
after returning from userspace and before entering guest. Currently, no
such TDVMCALLs implemented, KVM just sets the values returned from
userspace.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD's Family 19h-based Models 70h-7fh support 4 unified memory controllers
(UMC) per processor die.
The amd64_edac driver, however, assumes only 2 UMCs are supported since
max_mcs variable for the models has not been explicitly set to 4. The same
results in incomplete or incorrect memory information being logged to dmesg by
the module during initialization in some instances.
Fixes: 6c79e42169fe ("EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh")
Closes: https://lore.kernel.org/all/27dc093f-ce27-4c71-9e81-786150a040b6@reox.at/
Reported-by: reox <mailinglist@reox.at>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@kernel.org
Link: https://lore.kernel.org/20250613005233.2330627-1-avadhut.naik@amd.com
The test fails at the MPOL step if multiple nodes are available. The
reason is that mbind() sets the policy but the home_node, which is
retrieved by the futex code, is not set. This causes to retrieve the
current node and with multiple nodes it fails on one of the iterations.
Use numa_set_mempolicy_home_node() to set the expected node.
Use ksft_exit_fail_msg() to fail and exit in order not to confuse ktap.
Fixes: 3163369407baf ("selftests/futex: Add futex_numa_mpol")
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250528085521.1938355-3-bigeasy@linutronix.de
There may be concurrency between perf_cgroup_switch and
perf_cgroup_event_disable. Consider the following scenario: after a new
perf cgroup event is created on CPU0, the new event may not trigger
a reprogramming, causing ctx->is_active to be 0. In this case, when CPU1
disables this perf event, it executes __perf_remove_from_context->
list _del_event->perf_cgroup_event_disable on CPU1, which causes a race
with perf_cgroup_switch running on CPU0.
The following describes the details of this concurrency scenario:
CPU0 CPU1
perf_cgroup_switch:
...
# cpuctx->cgrp is not NULL here
if (READ_ONCE(cpuctx->cgrp) == NULL)
return;
perf_remove_from_context:
...
raw_spin_lock_irq(&ctx->lock);
...
# ctx->is_active == 0 because reprogramm is not
# tigger, so CPU1 can do __perf_remove_from_context
# for CPU0
__perf_remove_from_context:
perf_cgroup_event_disable:
...
if (--ctx->nr_cgroups)
...
# this warning will happened because CPU1 changed
# ctx.nr_cgroups to 0.
WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
[peterz: use guard instead of goto unlock]
Fixes: db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250604033924.3914647-3-luogengkun@huaweicloud.com
Pull x86 fixes from Dave Hansen:
"This is a pretty scattered set of fixes. The majority of them are
further fixups around the recent ITS mitigations.
The rest don't really have a coherent story:
- Some flavors of Xen PV guests don't support large pages, but the
set_memory.c code assumes all CPUs support them.
Avoid problems with a quick CPU feature check.
- The TDX code has some wrappers to help retry calls to the TDX
module. They use function pointers to assembly functions and the
compiler usually generates direct CALLs. But some new compilers,
plus -Os turned them in to indirect CALLs and the assembly code was
not annotated for indirect calls.
Force inlining of the helper to fix it up.
- Last, a FRED issue showed up when single-stepping. It's fine when
using an external debugger, but was getting stuck returning from a
SIGTRAP handler otherwise.
Clear the FRED 'swevent' bit to ensure that forward progress is
made"
* tag 'x86_urgent_for_6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "mm/execmem: Unify early execmem_cache behaviour"
x86/its: explicitly manage permissions for ITS pages
x86/its: move its_pages array to struct mod_arch_specific
x86/Kconfig: only enable ROX cache in execmem when STRICT_MODULE_RWX is set
x86/mm/pat: don't collapse pages without PSE set
x86/virt/tdx: Avoid indirect calls to TDX assembly functions
selftests/x86: Add a test to detect infinite SIGTRAP handler loop
x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler
Pull x86 fixes from Thomas Gleixner:
"A small set of x86 fixes:
- Cure IO bitmap inconsistencies
A failed fork cleans up all resources of the newly created thread
via exit_thread(). exit_thread() invokes io_bitmap_exit() which
does the IO bitmap cleanups, which unfortunately assume that the
cleanup is related to the current task, which is obviously bogus.
Make it work correctly
- A lockdep fix in the resctrl code removed the clearing of the
command buffer in two places, which keeps stale error messages
around. Bring them back.
- Remove unused trace events"
* tag 'x86-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Restore the rdt_last_cmd_clear() calls after acquiring rdtgroup_mutex
x86/iopl: Cure TIF_IO_BITMAP inconsistencies
x86/fpu: Remove unused trace events
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
Add initial support for BartlettLake.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Pull nfsd fixes from Chuck Lever:
- Two fixes for commits in the nfsd-6.16 merge
- One fix for the recently-added NFSD netlink facility
- One fix for a remote SunRPC crasher
* tag 'nfsd-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
sunrpc: handle SVC_GARBAGE during svc auth processing as auth error
nfsd: use threads array as-is in netlink interface
SUNRPC: Cleanup/fix initial rq_pages allocation
NFSD: Avoid corruption of a referring call list
Replaced hardcoded value 16 with SMB2_NTLMV2_SESSKEY_SIZE
in the auth_key definition and memcpy call.
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Handle TDVMCALL for GetQuote to generate a TD-Quote.
GetQuote is a doorbell-like interface used by TDX guests to request VMM
to generate a TD-Quote signed by a service hosting TD-Quoting Enclave
operating on the host. A TDX guest passes a TD Report (TDREPORT_STRUCT) in
a shared-memory area as parameter. Host VMM can access it and queue the
operation for a service hosting TD-Quoting enclave. When completed, the
Quote is returned via the same shared-memory area.
KVM only checks the GPA from the TDX guest has the shared-bit set and drops
the shared-bit before exiting to userspace to avoid bleeding the shared-bit
into KVM's exit ABI. KVM forwards the request to userspace VMM (e.g. QEMU)
and userspace VMM queues the operation asynchronously. KVM sets the return
code according to the 'ret' field set by userspace to notify the TDX guest
whether the request has been queued successfully or not. When the request
has been queued successfully, the TDX guest can poll the status field in
the shared-memory area to check whether the Quote generation is completed
or not. When completed, the generated Quote is returned via the same
buffer.
Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is
required to handle the KVM exit reason as the initial support for TDX,
by reentering KVM to ensure that the TDVMCALL is complete. While at it,
add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mark reported that futex_priv_hash fails on ARM64.
It turns out that the command line parsing does not terminate properly
and ends in the default case assuming an invalid option was passed.
Use an int as the return type for getopt().
Closes: https://lore.kernel.org/all/31869a69-063f-44a3-a079-ba71b2506cce@sirena.org.uk/
Fixes: 3163369407baf ("selftests/futex: Add futex_numa_mpol")
Fixes: cda95faef7bcf ("selftests/futex: Add futex_priv_hash")
Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250528085521.1938355-2-bigeasy@linutronix.de
Commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting
bug at task exit") moves the event->state update to before
list_del_event(). This makes the event->state test in list_del_event()
always false; never calling perf_cgroup_event_disable().
As a result, cpuctx->cgrp won't be cleared properly; causing havoc.
Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: David Wang <00107082@163.com>
Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/
Pull powerpc fixes from Madhavan Srinivasan:
- Fix to handle VDSO32 with pcrel
- Couple of dts fixes in microwatt and mpc8315erdb
- Fix to handle PE bridge reconfiguration in VFIO EEH recovery path
- Fix ioctl macros related to struct termio
Thanks to Christophe Leroy, Ganesh Goudar, J. Neuschäfer, Justin M.
Forbes, Michael Ellerman, Narayana Murty N, Tulio Magno, and Vaibhav
Jain
* tag 'powerpc-6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc: Fix struct termio related ioctl macros
powerpc: dts: mpc8315erdb: Add GPIO controller node
powerpc/microwatt: Fix model property in device tree
powerpc/eeh: Fix missing PE bridge reconfiguration during VFIO EEH recovery
powerpc/vdso: Fix build of VDSO32 with pcrel
The commit d6d1e3e6580c ("mm/execmem: Unify early execmem_cache
behaviour") changed early behaviour of execemem ROX cache to allow its
usage in early x86 code that allocates text pages when
CONFIG_MITGATION_ITS is enabled.
The permission management of the pages allocated from execmem for ITS
mitigation is now completely contained in arch/x86/kernel/alternatives.c
and therefore there is no need to special case early allocations in
execmem.
This reverts commit d6d1e3e6580ca35071ad474381f053cbf1fb6414.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250603111446.2609381-6-rppt@kernel.org
Pull timer fix from Thomas Gleixner:
"Add the missing seq_file forward declaration in the timer namespace
header"
* tag 'timers-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timens: Add struct seq_file forward declaration
A lockdep fix removed two rdt_last_cmd_clear() calls that were used to
clear the last_cmd_status buffer but called without holding the required
rdtgroup_mutex.
The impacted resctrl commands are writing to the cpus or cpus_list files
and creating a new monitor or control group. With stale data in the
last_cmd_status buffer the impacted resctrl commands report the stale error
on success, or append its own failure message to the stale error on
failure.
Consequently, restore the rdt_last_cmd_clear() calls after acquiring
rdtgroup_mutex.
Fixes: c8eafe149530 ("x86/resctrl: Fix potential lockdep warning")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/all/20250603125828.1590067-1-zengheng4@huawei.com
Pull Kbuild updates from Masahiro Yamada:
- Add support for the EXPORT_SYMBOL_GPL_FOR_MODULES() macro, which
exports a symbol only to specified modules
- Improve ABI handling in gendwarfksyms
- Forcibly link lib-y objects to vmlinux even if CONFIG_MODULES=n
- Add checkers for redundant or missing <linux/export.h> inclusion
- Deprecate the extra-y syntax
- Fix a genksyms bug when including enum constants from *.symref files
* tag 'kbuild-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (28 commits)
genksyms: Fix enum consts from a reference affecting new values
arch: use always-$(KBUILD_BUILTIN) for vmlinux.lds
kbuild: set y instead of 1 to KBUILD_{BUILTIN,MODULES}
efi/libstub: use 'targets' instead of extra-y in Makefile
module: make __mod_device_table__* symbols static
scripts/misc-check: check unnecessary #include <linux/export.h> when W=1
scripts/misc-check: check missing #include <linux/export.h> when W=1
scripts/misc-check: add double-quotes to satisfy shellcheck
kbuild: move W=1 check for scripts/misc-check to top-level Makefile
scripts/tags.sh: allow to use alternative ctags implementation
kconfig: introduce menu type enum
docs: symbol-namespaces: fix reST warning with literal block
kbuild: link lib-y objects to vmlinux forcibly even when CONFIG_MODULES=n
tinyconfig: enable CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
docs/core-api/symbol-namespaces: drop table of contents and section numbering
modpost: check forbidden MODULE_IMPORT_NS("module:") at compile time
kbuild: move kbuild syntax processing to scripts/Makefile.build
Makefile: remove dependency on archscripts for header installation
Documentation/kbuild: Add new gendwarfksyms kABI rules
Documentation/kbuild: Drop section numbers
...
Add initial support for DMR.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Pull erofs fixes from Gao Xiang:
- Use the mounter’s credentials for file-backed mounts to resolve
Android SELinux permission issues
- Remove the unused trace event `erofs_destroy_inode`
- Error out on crafted out-of-file-range encoded extents
- Remove an incorrect check for encoded extents
* tag 'erofs-for-6.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: remove a superfluous check for encoded extents
erofs: refuse crafted out-of-file-range encoded extents
erofs: remove unused trace event erofs_destroy_inode
erofs: impersonate the opener's credentials when accessing backing file
tianshuo han reported a remotely-triggerable crash if the client sends a
kernel RPC server a specially crafted packet. If decoding the RPC reply
fails in such a way that SVC_GARBAGE is returned without setting the
rq_accept_statp pointer, then that pointer can be dereferenced and a
value stored there.
If it's the first time the thread has processed an RPC, then that
pointer will be set to NULL and the kernel will crash. In other cases,
it could create a memory scribble.
The server sunrpc code treats a SVC_GARBAGE return from svc_authenticate
or pg_authenticate as if it should send a GARBAGE_ARGS reply. RFC 5531
says that if authentication fails that the RPC should be rejected
instead with a status of AUTH_ERR.
Handle a SVC_GARBAGE return as an AUTH_ERROR, with a reason of
AUTH_BADCRED instead of returning GARBAGE_ARGS in that case. This
sidesteps the whole problem of touching the rpc_accept_statp pointer in
this situation and avoids the crash.
Cc: stable@kernel.org
Fixes: 29cd2927fb91 ("SUNRPC: Fix encoding of accepted but unsuccessful RPC replies")
Reported-by: tianshuo han <hantianshuo233@gmail.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Replaced hardcoded length with sizeof(flags_string).
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and
return it for unimplemented TDVMCALL subfunctions.
Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not
implemented is vague because TDX guests can't tell the error is due to
the subfunction is not supported or an invalid input of the subfunction.
New GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the
ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND.
Before the change, for common guest implementations, when a TDX guest
receives TDVMCALL_STATUS_INVALID_OPERAND, it has two cases:
1. Some operand is invalid. It could change the operand to another value
retry.
2. The subfunction is not supported.
For case 1, an invalid operand usually means the guest implementation bug.
Since the TDX guest can't tell which case is, the best practice for
handling TDVMCALL_STATUS_INVALID_OPERAND is stopping calling such leaf,
treating the failure as fatal if the TDVMCALL is essential or ignoring
it if the TDVMCALL is optional.
With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to
old TDX guest that do not know about it, but it is expected that the
guest will make the same action as TDVMCALL_STATUS_INVALID_OPERAND.
Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND
specifically; for example Linux just checks for success.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull Kbuild fixes from Masahiro Yamada:
- Move warnings about linux/export.h from W=1 to W=2
- Fix structure type overrides in gendwarfksyms
* tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
gendwarfksyms: Fix structure type overrides
kbuild: move warnings about linux/export.h from W=1 to W=2
Pull Rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- KUnit '#[test]'s:
- Support KUnit-mapped 'assert!' macros.
The support that landed last cycle was very basic, and the
'assert!' macros panicked since they were the standard library
ones. Now, they are mapped to the KUnit ones in a similar way to
how is done for doctests, reusing the infrastructure there.
With this, a failing test like:
#[test]
fn my_first_test() {
assert_eq!(42, 43);
}
will report:
# my_first_test: ASSERTION FAILED at rust/kernel/lib.rs:251
Expected 42 == 43 to be true, but is false
# my_first_test.speed: normal
not ok 1 my_first_test
- Support tests with checked 'Result' return types.
The return value of test functions that return a 'Result' will
be checked, thus one can now easily catch errors when e.g. using
the '?' operator in tests.
With this, a failing test like:
#[test]
fn my_test() -> Result {
f()?;
Ok(())
}
will report:
# my_test: ASSERTION FAILED at rust/kernel/lib.rs:321
Expected is_test_result_ok(my_test()) to be true, but is false
# my_test.speed: normal
not ok 1 my_test
- Add 'kunit_tests' to the prelude.
- Clarify the remaining language unstable features in use.
- Compile 'core' with edition 2024 for Rust >= 1.87.
- Workaround 'bindgen' issue with forward references to 'enum' types.
- objtool: relax slice condition to cover more 'noreturn' functions.
- Use absolute paths in macros referencing 'core' and 'kernel'
crates.
- Skip '-mno-fdpic' flag for bindgen in GCC 32-bit arm builds.
- Clean some 'doc_markdown' lint hits -- we may enable it later on.
'kernel' crate:
- 'alloc' module:
- 'Box': support for type coercion, e.g. 'Box<T>' to 'Box<dyn U>'
if 'T' implements 'U'.
- 'Vec': implement new methods (prerequisites for nova-core and
binder): 'truncate', 'resize', 'clear', 'pop',
'push_within_capacity' (with new error type 'PushError'),
'drain_all', 'retain', 'remove' (with new error type
'RemoveError'), insert_within_capacity' (with new error type
'InsertError').
In addition, simplify 'push' using 'spare_capacity_mut', split
'set_len' into 'inc_len' and 'dec_len', add type invariant 'len
<= capacity' and simplify 'truncate' using 'dec_len'.
- 'time' module:
- Morph the Rust hrtimer subsystem into the Rust timekeeping
subsystem, covering delay, sleep, timekeeping, timers. This new
subsystem has all the relevant timekeeping C maintainers listed
in the entry.
- Replace 'Ktime' with 'Delta' and 'Instant' types to represent a
duration of time and a point in time.
- Temporarily add 'Ktime' to 'hrtimer' module to allow 'hrtimer'
to delay converting to 'Instant' and 'Delta'.
- 'xarray' module:
- Add a Rust abstraction for the 'xarray' data structure. This
abstraction allows Rust code to leverage the 'xarray' to store
types that implement 'ForeignOwnable'. This support is a
dependency for memory backing feature of the Rust null block
driver, which is waiting to be merged.
- Set up an entry in 'MAINTAINERS' for the XArray Rust support.
Patches will go to the new Rust XArray tree and then via the
Rust subsystem tree for now.
- Allow 'ForeignOwnable' to carry information about the pointed-to
type. This helps asserting alignment requirements for the
pointer passed to the foreign language.
- 'container_of!': retain pointer mut-ness and add a compile-time
check of the type of the first parameter ('$field_ptr').
- Support optional message in 'static_assert!'.
- Add C FFI types (e.g. 'c_int') to the prelude.
- 'str' module: simplify KUnit tests 'format!' macro, convert
'rusttest' tests into KUnit, take advantage of the '-> Result'
support in KUnit '#[test]'s.
- 'list' module: add examples for 'List', fix path of
'assert_pinned!' (so far unused macro rule).
- 'workqueue' module: remove 'HasWork::OFFSET'.
- 'page' module: add 'inline' attribute.
'macros' crate:
- 'module' macro: place 'cleanup_module()' in '.exit.text' section.
'pin-init' crate:
- Add 'Wrapper<T>' trait for creating pin-initializers for wrapper
structs with a structurally pinned value such as 'UnsafeCell<T>' or
'MaybeUninit<T>'.
- Add 'MaybeZeroable' derive macro to try to derive 'Zeroable', but
not error if not all fields implement it. This is needed to derive
'Zeroable' for all bindgen-generated structs.
- Add 'unsafe fn cast_[pin_]init()' functions to unsafely change the
initialized type of an initializer. These are utilized by the
'Wrapper<T>' implementations.
- Add support for visibility in 'Zeroable' derive macro.
- Add support for 'union's in 'Zeroable' derive macro.
- Upstream dev news: streamline CI, fix some bugs. Add new workflows
to check if the user-space version and the one in the kernel tree
have diverged. Use the issues tab [1] to track them, which should
help folks report and diagnose issues w.r.t. 'pin-init' better.
[1] https://github.com/rust-for-linux/pin-init/issues
Documentation:
- Testing: add docs on the new KUnit '#[test]' tests.
- Coding guidelines: explain that '///' vs. '//' applies to private
items too. Add section on C FFI types.
- Quick Start guide: update Ubuntu instructions and split them into
"25.04" and "24.04 LTS and older".
And a few other cleanups and improvements"
* tag 'rust-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (78 commits)
rust: list: Fix typo `much` in arc.rs
rust: check type of `$ptr` in `container_of!`
rust: workqueue: remove HasWork::OFFSET
rust: retain pointer mut-ness in `container_of!`
Documentation: rust: testing: add docs on the new KUnit `#[test]` tests
Documentation: rust: rename `#[test]`s to "`rusttest` host tests"
rust: str: take advantage of the `-> Result` support in KUnit `#[test]`'s
rust: str: simplify KUnit tests `format!` macro
rust: str: convert `rusttest` tests into KUnit
rust: add `kunit_tests` to the prelude
rust: kunit: support checked `-> Result`s in KUnit `#[test]`s
rust: kunit: support KUnit-mapped `assert!` macros in `#[test]`s
rust: make section names plural
rust: list: fix path of `assert_pinned!`
rust: compile libcore with edition 2024 for 1.87+
rust: dma: add missing Markdown code span
rust: task: add missing Markdown code spans and intra-doc links
rust: pci: fix docs related to missing Markdown code spans
rust: alloc: add missing Markdown code span
rust: alloc: add missing Markdown code spans
...
While chasing down a missing perf_cgroup_event_disable() elsewhere,
Leo Yan found that both perf_put_aux_event() and
perf_remove_sibling_event() were also missing one.
Specifically, the rule is that events that switch to OFF,ERROR need to
call perf_cgroup_event_disable().
Unify the disable paths to ensure this.
Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data")
Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605123343.GD35970@noisy.programming.kicks-ass.net
Pull vfs fixes from Christian Brauner:
- Fix a regression in overlayfs caused by reworking the lookup_one*()
set of helpers
- Make sure that the name of the dentry is printed in overlayfs'
mkdir() helper
- Add missing iocb values to TRACE_IOCB_STRINGS define
- Unlock the superblock during iterate_supers_type(). This was an
accidental internal api change
- Drop a misleading assert in file_seek_cur_needs_f_lock() helper
- Never refuse to return PIDFD_GET_INGO when parent pid is zero
That can trivially happen in container scenarios where the parent
process might be located in an ancestor pid namespace
- Don't revalidate in try_lookup_noperm() as that causes regression for
filesystems such as cifs
- Fix simple_xattr_list() and reset the err variable after
security_inode_listsecurity() got called so as not to confuse
userspace about the length of the xattr
* tag 'vfs-6.16-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: drop assert in file_seek_cur_needs_f_lock
fs: unlock the superblock during iterate_supers_type
ovl: fix debug print in case of mkdir error
VFS: change try_lookup_noperm() to skip revalidation
fs: add missing values to TRACE_IOCB_STRINGS
fs/xattr.c: fix simple_xattr_list()
ovl: fix regression caused by lookup helpers API changes
pidfs: never refuse ppid == 0 in PIDFD_GET_INFO
Pull i2c fixes from Wolfram Sang:
- subsystem: convert drivers to use recent callbacks of struct
i2c_algorithm A typical after-rc1 cleanup, which I couldn't send in
time for rc2
- tegra: fix YAML conversion of device tree bindings
- k1: re-add a check which got lost during upstreaming
* tag 'i2c-for-6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: k1: check for transfer error
i2c: use inclusive callbacks in struct i2c_algorithm
dt-bindings: i2c: nvidia,tegra20-i2c: Specify the required properties
Pull x86 fixes from Borislav Petkov:
- Make sure the array tracking which kernel text positions need to be
alternatives-patched doesn't get mishandled by out-of-order
modifications, leading to it overflowing and causing page faults when
patching
- Avoid an infinite loop when early code does a ranged TLB invalidation
before the broadcast TLB invalidation count of how many pages it can
flush, has been read from CPUID
- Fix a CONFIG_MODULES typo
- Disable broadcast TLB invalidation when PTI is enabled to avoid an
overflow of the bitmap tracking dynamic ASIDs which need to be
flushed when the kernel switches between the user and kernel address
space
- Handle the case of a CPU going offline and thus reporting zeroes when
reading top-level events in the resctrl code
* tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Fix int3 handling failure from broken text_poke array
x86/mm: Fix early boot use of INVPLGB
x86/its: Fix an ifdef typo in its_alloc()
x86/mm: Disable INVLPGB when PTI is enabled
x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl subsystem
If spacemit_i2c_xfer_msg() times out waiting for a message transfer to
complete, or if the hardware reports an error, it returns a negative
error code (-ETIMEDOUT, -EAGAIN, -ENXIO. or -EIO).
The sole caller of spacemit_i2c_xfer_msg() is spacemit_i2c_xfer(),
which is the i2c_algorithm->xfer callback function. It currently
does not save the value returned by spacemit_i2c_xfer_msg().
The result is that transfer errors go unreported, and a caller
has no indication anything is wrong.
When this code was out for review, the return value *was* checked
in early versions. But for some reason, that assignment got dropped
between versions 5 and 6 of the series, perhaps related to reworking
the code to merge spacemit_i2c_xfer_core() into spacemit_i2c_xfer().
Simply assigning the value returned to "ret" fixes the problem.
Fixes: 5ea558473fa31 ("i2c: spacemit: add support for SpacemiT K1 SoC")
Signed-off-by: Alex Elder <elder@riscstar.com>
Cc: <stable@vger.kernel.org> # v6.15+
Reviewed-by: Troy Mitchell <troymitchell988@gmail.com>
Link: https://lore.kernel.org/r/20250616125137.1555453-1-elder@riscstar.com
Signed-off-by: Andi Shyti <andi@smida.it>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Pull irq fixes from Borislav Petkov:
- Fix missing prototypes warnings
- Properly initialize work context when allocating it
- Remove a method tracking when managed interrupts are suspended during
hotplug, in favor of the code using a IRQ disable depth tracking now,
and have interrupts get properly enabled again on restore
- Make sure multiple CPUs getting hotplugged don't cause wrong tracking
of the managed IRQ disable depth
* tag 'irq_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/ath79-misc: Fix missing prototypes warnings
genirq/irq_sim: Initialize work context pointers properly
genirq/cpuhotplug: Restore affinity even for suspended IRQ
genirq/cpuhotplug: Rebalance managed interrupts across multi-CPU hotplug
Since smp_text_poke_single() does not expect there is another
text_poke request is queued, it can make text_poke_array not
sorted or cause a buffer overflow on the text_poke_array.vec[].
This will cause an Oops in int3 because of bsearch failing;
CPU 0 CPU 1 CPU 2
----- ----- -----
smp_text_poke_batch_add()
smp_text_poke_single() <<-- Adds out of order
<int3>
[Fails o find address
in text_poke_array ]
OOPS!
Or unhandled page fault because of a buffer overflow;
CPU 0 CPU 1
----- -----
smp_text_poke_batch_add() <<+
... |
smp_text_poke_batch_add() <<-- Adds TEXT_POKE_ARRAY_MAX times.
smp_text_poke_single() {
__smp_text_poke_batch_add() <<-- Adds entry at
TEXT_POKE_ARRAY_MAX + 1
smp_text_poke_batch_finish()
[Unhandled page fault because
text_poke_array.nr_entries is
overwritten]
BUG!
}
Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add()
so that it correctly flush the queue if needed.
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Link: https://lkml.kernel.org/r/\ 175020512308.3582717.13631440385506146631.stgit@mhiramat.tok.corp.google.com
Pull perf fixes from Borislav Petkov:
- Avoid a crash on a heterogeneous machine where not all cores support
the same hw events features
- Avoid a deadlock when throttling events
- Document the perf event states more
- Make sure a number of perf paths switching off or rescheduling events
call perf_cgroup_event_disable()
- Make sure perf does task sampling before its userspace mapping is
torn down, and not after
* tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix crash in icl_update_topdown_event()
perf: Fix the throttle error of some clock events
perf: Add comment to enum perf_event_state
perf/core: Fix WARN in perf_cgroup_switch()
perf: Fix dangling cgroup pointer in cpuctx
perf: Fix cgroup state vs ERROR
perf: Fix sample vs do_exit()
ath79_misc_irq_init() was defined but unused since commit 51fa4f8912c0
("MIPS: ath79: drop legacy IRQ code"), so it's time to drop it.
The build also warns about a missing prototype of get_c0_perfcount_int().
Remove the stale leftover function and add the missing include.
Signed-off-by: Shiji Yang <yangshiji66@outlook.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/OSBPR01MB167032D2017645200787AAEBBC72A@OSBPR01MB1670.jpnprd01.prod.outlook.com
The INVLPGB instruction has limits on how many pages it can invalidate
at once. That limit is enumerated in CPUID, read by the kernel, and
stored in 'invpgb_count_max'. Ranged invalidation, like
invlpgb_kernel_range_flush() break up their invalidations so
that they do not exceed the limit.
However, early boot code currently attempts to do ranged
invalidation before populating 'invlpgb_count_max'. There is a
for loop which is basically:
for (...; addr < end; addr += invlpgb_count_max*PAGE_SIZE)
If invlpgb_kernel_range_flush is called before the kernel has read
the value of invlpgb_count_max from the hardware, the normally
bounded loop can become an infinite loop if invlpgb_count_max is
initialized to zero.
Fix that issue by initializing invlpgb_count_max to 1.
This way INVPLGB at early boot time will be a little bit slower
than normal (with initialized invplgb_count_max), and not an
instant hang at bootup time.
Fixes: b7aa05cbdc52 ("x86/mm: Add INVLPGB support code")
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250606171112.4013261-3-riel%40surriel.com
Specify the properties which are essential and which are not for the
Tegra I2C driver to function correctly. This was not added correctly when
the TXT binding was converted to yaml. All the existing DT nodes have
these properties already and hence this does not break the ABI.
dmas and dma-names which were specified as a must in the TXT binding
is now made optional since the driver can work in PIO mode if dmas are
missing.
Fixes: f10a9b722f80 ("dt-bindings: i2c: tegra: Convert to json-schema”)
Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Cc: <stable@vger.kernel.org> # v5.17+
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Andi Shyti <andi@smida.it>
Link: https://lore.kernel.org/r/20250603153022.39434-1-akhilrajeev@nvidia.com
Pull locking fixes from Borislav Petkov:
- Make sure the switch to the global hash is requested always under a
lock so that two threads requesting that simultaneously cannot get to
inconsistent state
- Reject negative NUMA nodes earlier in the futex NUMA interface
handling code
- Selftests fixes
* tag 'locking_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Verify under the lock if hash can be replaced
futex: Handle invalid node numbers supplied by user
selftests/futex: Set the home_node in futex_numa_mpol
selftests/futex: getopt() requires int as return value.
The perf_fuzzer found a hard-lockup crash on a RaptorLake machine:
Oops: general protection fault, maybe for address 0xffff89aeceab400: 0000
CPU: 23 UID: 0 PID: 0 Comm: swapper/23
Tainted: [W]=WARN
Hardware name: Dell Inc. Precision 9660/0VJ762
RIP: 0010:native_read_pmc+0x7/0x40
Code: cc e8 8d a9 01 00 48 89 03 5b cd cc cc cc cc 0f 1f ...
RSP: 000:fffb03100273de8 EFLAGS: 00010046
....
Call Trace:
<TASK>
icl_update_topdown_event+0x165/0x190
? ktime_get+0x38/0xd0
intel_pmu_read_event+0xf9/0x210
__perf_event_read+0xf9/0x210
CPUs 16-23 are E-core CPUs that don't support the perf metrics feature.
The icl_update_topdown_event() should not be invoked on these CPUs.
It's a regression of commit:
f9bdf1f95339 ("perf/x86/intel: Avoid disable PMU if !cpuc->enabled in sample read")
The bug introduced by that commit is that the is_topdown_event() function
is mistakenly used to replace the is_topdown_count() call to check if the
topdown functions for the perf metrics feature should be invoked.
Fix it.
Fixes: f9bdf1f95339 ("perf/x86/intel: Avoid disable PMU if !cpuc->enabled in sample read")
Closes: https://lore.kernel.org/lkml/352f0709-f026-cd45-e60c-60dfd97f73f3@maine.edu/
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Cc: stable@vger.kernel.org # v6.15+
Link: https://lore.kernel.org/r/20250612143818.2889040-1-kan.liang@linux.intel.com
Initialize `ops` member's pointers properly by using kzalloc() instead of
kmalloc() when allocating the simulation work context. Otherwise the
pointers contain random content leading to invalid dereferencing.
Signed-off-by: Gyeyoung Baek <gye976@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250612124827.63259-1-gye976@gmail.com
Commit a82b26451de1 ("x86/its: explicitly manage permissions for ITS
pages") reworks its_alloc() and introduces a typo in an ifdef
conditional, referring to CONFIG_MODULE instead of CONFIG_MODULES.
Fix this typo in its_alloc().
Fixes: a82b26451de1 ("x86/its: explicitly manage permissions for ITS pages")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250616100432.22941-1-lukas.bulwahn%40redhat.com
Pull EDAC fixes from Borislav Petkov:
- amd64: Correct the number of memory controllers on some AMD Zen
clients
- igen6: Handle firmware-disabled memory controllers properly
* tag 'edac_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/igen6: Fix NULL pointer dereference
EDAC/amd64: Correct number of UMCs for family 19h models 70h-7fh
Once the global hash is requested there is no way back to switch back to
the per-task private hash. This is checked at the begin of the function.
It is possible that two threads simultaneously request the global hash
and both pass the initial check and block later on the
mm::futex_hash_lock. In this case the first thread performs the switch
to the global hash. The second thread will also attempt to switch to the
global hash and while doing so, accessing the nonexisting slot 1 of the
struct futex_private_hash.
The same applies if the hash is made immutable: There is no reference
counting and the hash must not be replaced.
Verify under mm_struct::futex_phash that neither the global hash nor an
immutable hash in use.
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aDwDw9Aygqo6oAx+@ly-workstation/
Fixes: bd54df5ea7cad ("futex: Allow to resize the private local hash")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250610104400.1077266-5-bigeasy@linutronix.de/
Both ARM and IBM CI reports RCU stall, which can be reproduced by the
below perf command.
perf record -a -e cpu-clock -- sleep 2
The issue is introduced by the generic throttle patch set, which
unconditionally invoke the event_stop() when throttle is triggered.
The cpu-clock and task-clock are two special SW events, which rely on
the hrtimer. The throttle is invoked in the hrtimer handler. The
event_stop()->hrtimer_cancel() waits for the handler to finish, which is
a deadlock. Instead of invoking the stop(), the HRTIMER_NORESTART should
be used to stop the timer.
There may be two ways to fix it:
- Introduce a PMU flag to track the case. Avoid the event_stop in
perf_event_throttle() if the flag is detected.
It has been implemented in the
https://lore.kernel.org/lkml/20250528175832.2999139-1-kan.liang@linux.intel.com/
The new flag was thought to be an overkill for the issue.
- Add a check in the event_stop. Return immediately if the throttle is
invoked in the hrtimer handler. Rely on the existing HRTIMER_NORESTART
method to stop the timer.
The latter is implemented here.
Move event->hw.interrupts = MAX_INTERRUPTS before the stop(). It makes
the order the same as perf_event_unthrottle(). Except the patch, no one
checks the hw.interrupts in the stop(). There is no impact from the
order change.
When stops in the throttle, the event should not be updated,
stop(event, 0). But the cpu_clock_event_stop() doesn't handle the flag.
In logic, it's wrong. But it didn't bring any problems with the old
code, because the stop() was not invoked when handling the throttle.
Checking the flag before updating the event.
Fixes: 9734e25fbf5a ("perf: Fix the throttle logic for a group")
Closes: https://lore.kernel.org/lkml/20250527161656.GJ2566836@e132581.arm.com/
Closes: https://lore.kernel.org/lkml/djxlh5fx326gcenwrr52ry3pk4wxmugu4jccdjysza7tlc5fef@ktp4rffawgcw/
Closes: https://lore.kernel.org/lkml/8e8f51d8-af64-4d9e-934b-c0ee9f131293@linux.ibm.com/
Closes: https://lore.kernel.org/lkml/4ce106d0-950c-aadc-0b6a-f0215cd39913@maine.edu/
Reported-by: Leo Yan <leo.yan@arm.com>
Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20250606192546.915765-1-kan.liang@linux.intel.com
Commit 788019eb559f ("genirq: Retain disable depth for managed interrupts
across CPU hotplug") tried to make managed shutdown/startup properly
reference counted, but it missed the fact that the unplug and hotplug code
has an intentional imbalance by skipping IRQS_SUSPENDED interrupts on
the "restore" path.
This means that if a managed-affinity interrupt was both suspended and
managed-shutdown (such as may happen during system suspend / S3), resume
skips calling irq_startup_managed(), and would again have an unbalanced
depth this time, with a positive value (i.e., remaining unexpectedly
masked).
This IRQS_SUSPENDED check was introduced in commit a60dd06af674
("genirq/cpuhotplug: Skip suspended interrupts when restoring affinity")
for essentially the same reason as commit 788019eb559f, to prevent that
irq_startup() would unconditionally re-enable an interrupt too early.
Because irq_startup_managed() now respsects the disable-depth count, the
IRQS_SUSPENDED check is not longer needed, and instead, it causes harm.
Thus, drop the IRQS_SUSPENDED check, and restore balance.
This effectively reverts commit a60dd06af674 ("genirq/cpuhotplug: Skip
suspended interrupts when restoring affinity"), because it is replaced
by commit 788019eb559f ("genirq: Retain disable depth for managed
interrupts across CPU hotplug").
Fixes: 788019eb559f ("genirq: Retain disable depth for managed interrupts across CPU hotplug")
Reported-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Link: https://lore.kernel.org/all/20250612183303.3433234-3-briannorris@chromium.org
Closes: https://lore.kernel.org/lkml/24ec4adc-7c80-49e9-93ee-19908a97ab84@gmail.com/
PTI uses separate ASIDs (aka. PCIDs) for kernel and user address
spaces. When the kernel needs to flush the user address space, it
just sets a bit in a bitmap and then flushes the entire PCID on
the next switch to userspace.
This bitmap is a single 'unsigned long' which is plenty for all 6
dynamic ASIDs. But, unfortunately, the INVLPGB support brings along a
bunch more user ASIDs, as many as ~2k more. The bitmap can't address
that many.
Fortunately, the bitmap is only needed for PTI and all the CPUs
with INVLPGB are AMD CPUs that aren't vulnerable to Meltdown and
don't need PTI. The only way someone can run into an issue in
practice is by booting with pti=on on a newer AMD CPU.
Disable INVLPGB if PTI is enabled. Avoid overrunning the small
bitmap.
Note: this will be fixed up properly by making the bitmap bigger.
For now, just avoid the mostly theoretical bug.
Fixes: 4afeb0ed1753 ("x86/mm: Enable broadcast TLB invalidation for multi-threaded processes")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250610222420.E8CBF472%40davehans-spike.ostc.intel.com
Pull turbostat updates from Len Brown:
- Add initial DMR support, which required smarter RAPL probe
- Fix AMD MSR RAPL energy reporting
- Add RAPL power limit configuration output
- Minor fixes
* tag 'turbostat-2025.06.08' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: version 2025.06.08
tools/power turbostat: Add initial support for BartlettLake
tools/power turbostat: Add initial support for DMR
tools/power turbostat: Dump RAPL sysfs info
tools/power turbostat: Avoid probing the same perf counters
tools/power turbostat: Allow probing RAPL with platform_features->rapl_msrs cleared
tools/power turbostat: Clean up add perf/msr counter logic
tools/power turbostat: Introduce add_msr_counter()
tools/power turbostat: Remove add_msr_perf_counter_()
tools/power turbostat: Remove add_cstate_perf_counter_()
tools/power turbostat: Remove add_rapl_perf_counter_()
tools/power turbostat: Quit early for unsupported RAPL counters
tools/power turbostat: Always check rapl_joules flag
tools/power turbostat: Fix AMD package-energy reporting
tools/power turbostat: Fix RAPL_GFX_ALL typo
tools/power turbostat: Add Android support for MSR device handling
tools/power turbostat.8: pm_domain wording fix
tools/power turbostat.8: fix typo: idle_pct should be pct_idle
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some
missing synchronisation
- A small fix for the irqbypass hook fixes, tightening the check and
ensuring that we only deal with MSI for both the old and the new
route entry
- Rework the way the shadow LRs are addressed in a nesting
configuration, plugging an embarrassing bug as well as simplifying
the whole process
- Add yet another fix for the dreaded arch_timer_edge_cases selftest
RISC-V:
- Fix the size parameter check in SBI SFENCE calls
- Don't treat SBI HFENCE calls as NOPs
x86 TDX:
- Complete API for handling complex TDVMCALLs in userspace.
This was delayed because the spec lacked a way for userspace to
deny supporting these calls; the new exit code is now approved"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: TDX: Exit to userspace for GetTdVmCallInfo
KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs
KVM: arm64: VHE: Centralize ISBs when returning to host
KVM: arm64: Remove cpacr_clear_set()
KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd()
KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync()
KVM: arm64: Reorganise CPTR trap manipulation
KVM: arm64: VHE: Synchronize CPTR trap deactivation
KVM: arm64: VHE: Synchronize restore of host debug registers
KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases
KVM: arm64: Explicitly treat routing entry type changes as changes
KVM: arm64: nv: Fix tracking of shadow list registers
RISC-V: KVM: Don't treat SBI HFENCE calls as NOPs
RISC-V: KVM: Fix the size parameter check in SBI SFENCE calls
A kernel panic was reported with the following kernel log:
EDAC igen6: Expected 2 mcs, but only 1 detected.
BUG: unable to handle page fault for address: 000000000000d570
...
Hardware name: Notebook V54x_6x_TU/V54x_6x_TU, BIOS Dasharo (coreboot+UEFI) v0.9.0 07/17/2024
RIP: e030:ecclog_handler+0x7e/0xf0 [igen6_edac]
...
igen6_probe+0x2a0/0x343 [igen6_edac]
...
igen6_init+0xc5/0xff0 [igen6_edac]
...
This issue occurred because one memory controller was disabled by
the BIOS but the igen6_edac driver still checked all the memory
controllers, including this absent one, to identify the source of
the error. Accessing the null MMIO for the absent memory controller
resulted in the oops above.
Fix this issue by reverting the configuration structure to non-const
and updating the field 'res_cfg->num_imc' to reflect the number of
detected memory controllers.
Fixes: 20e190b1c1fd ("EDAC/igen6: Skip absent memory controllers")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Closes: https://lore.kernel.org/all/aFFN7RlXkaK_loQb@mail-itl/
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Link: https://lore.kernel.org/r/20250618162307.1523736-1-qiuxu.zhuo@intel.com
syzbot used a negative node number which was not rejected early and led
to invalid memory access in node_possible().
Reject negative node numbers except for FUTEX_NO_NODE.
[bigeasy: Keep the FUTEX_NO_NODE check]
Closes: https://lore.kernel.org/all/6835bfe3.a70a0220.253bc2.00b5.GAE@google.com/
Fixes: cec199c5e39bd ("futex: Implement FUTEX2_NUMA")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: syzbot+9afaf6749e3a7aa1bdf3@syzkaller.appspotmail.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250528085521.1938355-4-bigeasy@linutronix.de
Commit 788019eb559f ("genirq: Retain disable depth for managed interrupts
across CPU hotplug") intended to only decrement the disable depth once per
managed shutdown, but instead it decrements for each CPU hotplug in the
affinity mask, until its depth reaches a point where it finally gets
re-started.
For example, consider:
1. Interrupt is affine to CPU {M,N}
2. disable_irq() -> depth is 1
3. CPU M goes offline -> interrupt migrates to CPU N / depth is still 1
4. CPU N goes offline -> irq_shutdown() / depth is 2
5. CPU N goes online
-> irq_restore_affinity_of_irq()
-> irqd_is_managed_and_shutdown()==true
-> irq_startup_managed() -> depth is 1
6. CPU M goes online
-> irq_restore_affinity_of_irq()
-> irqd_is_managed_and_shutdown()==true
-> irq_startup_managed() -> depth is 0
*** BUG: driver expects the interrupt is still disabled ***
-> irq_startup() -> irqd_clr_managed_shutdown()
7. enable_irq() -> depth underflow / unbalanced enable_irq() warning
This should clear the managed-shutdown flag at step 6, so that further
hotplugs don't cause further imbalance.
Note: It might be cleaner to also remove the irqd_clr_managed_shutdown()
invocation from __irq_startup_managed(). But this is currently not possible
because of irq_update_affinity_desc() as it sets IRQD_MANAGED_SHUTDOWN and
expects irq_startup() to clear it.
Fixes: 788019eb559f ("genirq: Retain disable depth for managed interrupts across CPU hotplug")
Reported-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Link: https://lore.kernel.org/all/20250612183303.3433234-2-briannorris@chromium.org
In the resctrl subsystem's Sub-NUMA Cluster (SNC) mode, the rdt_mon_domain
structure representing a NUMA node relies on the cacheinfo interface
(rdt_mon_domain::ci) to store L3 cache information (e.g., shared_cpu_map)
for monitoring. The L3 cache information of a SNC NUMA node determines
which domains are summed for the "top level" L3-scoped events.
rdt_mon_domain::ci is initialized using the first online CPU of a NUMA
node. When this CPU goes offline, its shared_cpu_map is cleared to contain
only the offline CPU itself. Subsequently, attempting to read counters
via smp_call_on_cpu(offline_cpu) fails (and error ignored), returning
zero values for "top-level events" without any error indication.
Replace the cacheinfo references in struct rdt_mon_domain and struct
rmid_read with the cacheinfo ID (a unique identifier for the L3 cache).
rdt_domain_hdr::cpu_mask contains the online CPUs associated with that
domain. When reading "top-level events", select a CPU from
rdt_domain_hdr::cpu_mask and utilize its L3 shared_cpu_map to determine
valid CPUs for reading RMID counter via the MSR interface.
Considering all CPUs associated with the L3 cache improves the chances
of picking a housekeeping CPU on which the counter reading work can be
queued, avoiding an unnecessary IPI.
Fixes: 328ea68874642 ("x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files")
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250530182053.37502-2-qinyuntan@linux.alibaba.com
Pull timer cleanup from Thomas Gleixner:
"The delayed from_timer() API cleanup:
The renaming to the timer_*() namespace was delayed due massive
conflicts against Linux-next. Now that everything is upstream finish
the conversion"
* tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide, timers: Rename from_timer() to timer_container_of()
Pull smb client fixes from Steve French:
- Multichannel channel allocation fix for Kerberos mounts
- Two reconnect fixes
- Fix netfs_writepages crash with smbdirect/RDMA
- Directory caching fix
- Three minor cleanup fixes
- Log error when close cached dirs fails
* tag 'v6.16-rc2-smb3-client-fixes-v2' of git://git.samba.org/sfrench/cifs-2.6:
smb: minor fix to use SMB2_NTLMV2_SESSKEY_SIZE for auth_key size
smb: minor fix to use sizeof to initialize flags_string buffer
smb: Use loff_t for directory position in cached_dirents
smb: Log an error when close_all_cached_dirs fails
cifs: Fix prepare_write to negotiate wsize if needed
smb: client: fix max_sge overflow in smb_extract_folioq_to_rdma()
smb: client: fix first command failure during re-negotiation
cifs: Remove duplicate fattr->cf_dtype assignment from wsl_to_fattr() function
smb: fix secondary channel creation issue with kerberos by populating hostname when adding channels
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX,
to allow userspace to provide information about the support of
TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API.
GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>,
<ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>,
<Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and
<Instruction.WRMSR>. They must be supported by VMM to support TDX guests.
For GetTdVmCallInfo
- When leaf (r12) to enumerate TDVMCALL functionality is set to 0,
successful execution indicates all GHCI base TDVMCALLs listed above are
supported.
Update the KVM TDX document with the set of the GHCI base APIs.
- When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it
indicates the TDX guest is querying the supported TDVMCALLs beyond
the GHCI base TDVMCALLs.
Exit to userspace to let userspace set the TDVMCALL sub-function bit(s)
accordingly to the leaf outputs. KVM could set the TDVMCALL bit(s)
supported by itself when the TDVMCALLs don't need support from userspace
after returning from userspace and before entering guest. Currently, no
such TDVMCALLs implemented, KVM just sets the values returned from
userspace.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD's Family 19h-based Models 70h-7fh support 4 unified memory controllers
(UMC) per processor die.
The amd64_edac driver, however, assumes only 2 UMCs are supported since
max_mcs variable for the models has not been explicitly set to 4. The same
results in incomplete or incorrect memory information being logged to dmesg by
the module during initialization in some instances.
Fixes: 6c79e42169fe ("EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh")
Closes: https://lore.kernel.org/all/27dc093f-ce27-4c71-9e81-786150a040b6@reox.at/
Reported-by: reox <mailinglist@reox.at>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@kernel.org
Link: https://lore.kernel.org/20250613005233.2330627-1-avadhut.naik@amd.com
The test fails at the MPOL step if multiple nodes are available. The
reason is that mbind() sets the policy but the home_node, which is
retrieved by the futex code, is not set. This causes to retrieve the
current node and with multiple nodes it fails on one of the iterations.
Use numa_set_mempolicy_home_node() to set the expected node.
Use ksft_exit_fail_msg() to fail and exit in order not to confuse ktap.
Fixes: 3163369407baf ("selftests/futex: Add futex_numa_mpol")
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250528085521.1938355-3-bigeasy@linutronix.de
There may be concurrency between perf_cgroup_switch and
perf_cgroup_event_disable. Consider the following scenario: after a new
perf cgroup event is created on CPU0, the new event may not trigger
a reprogramming, causing ctx->is_active to be 0. In this case, when CPU1
disables this perf event, it executes __perf_remove_from_context->
list _del_event->perf_cgroup_event_disable on CPU1, which causes a race
with perf_cgroup_switch running on CPU0.
The following describes the details of this concurrency scenario:
CPU0 CPU1
perf_cgroup_switch:
...
# cpuctx->cgrp is not NULL here
if (READ_ONCE(cpuctx->cgrp) == NULL)
return;
perf_remove_from_context:
...
raw_spin_lock_irq(&ctx->lock);
...
# ctx->is_active == 0 because reprogramm is not
# tigger, so CPU1 can do __perf_remove_from_context
# for CPU0
__perf_remove_from_context:
perf_cgroup_event_disable:
...
if (--ctx->nr_cgroups)
...
# this warning will happened because CPU1 changed
# ctx.nr_cgroups to 0.
WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
[peterz: use guard instead of goto unlock]
Fixes: db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250604033924.3914647-3-luogengkun@huaweicloud.com
Pull x86 fixes from Dave Hansen:
"This is a pretty scattered set of fixes. The majority of them are
further fixups around the recent ITS mitigations.
The rest don't really have a coherent story:
- Some flavors of Xen PV guests don't support large pages, but the
set_memory.c code assumes all CPUs support them.
Avoid problems with a quick CPU feature check.
- The TDX code has some wrappers to help retry calls to the TDX
module. They use function pointers to assembly functions and the
compiler usually generates direct CALLs. But some new compilers,
plus -Os turned them in to indirect CALLs and the assembly code was
not annotated for indirect calls.
Force inlining of the helper to fix it up.
- Last, a FRED issue showed up when single-stepping. It's fine when
using an external debugger, but was getting stuck returning from a
SIGTRAP handler otherwise.
Clear the FRED 'swevent' bit to ensure that forward progress is
made"
* tag 'x86_urgent_for_6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "mm/execmem: Unify early execmem_cache behaviour"
x86/its: explicitly manage permissions for ITS pages
x86/its: move its_pages array to struct mod_arch_specific
x86/Kconfig: only enable ROX cache in execmem when STRICT_MODULE_RWX is set
x86/mm/pat: don't collapse pages without PSE set
x86/virt/tdx: Avoid indirect calls to TDX assembly functions
selftests/x86: Add a test to detect infinite SIGTRAP handler loop
x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler
Pull x86 fixes from Thomas Gleixner:
"A small set of x86 fixes:
- Cure IO bitmap inconsistencies
A failed fork cleans up all resources of the newly created thread
via exit_thread(). exit_thread() invokes io_bitmap_exit() which
does the IO bitmap cleanups, which unfortunately assume that the
cleanup is related to the current task, which is obviously bogus.
Make it work correctly
- A lockdep fix in the resctrl code removed the clearing of the
command buffer in two places, which keeps stale error messages
around. Bring them back.
- Remove unused trace events"
* tag 'x86-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Restore the rdt_last_cmd_clear() calls after acquiring rdtgroup_mutex
x86/iopl: Cure TIF_IO_BITMAP inconsistencies
x86/fpu: Remove unused trace events
Pull nfsd fixes from Chuck Lever:
- Two fixes for commits in the nfsd-6.16 merge
- One fix for the recently-added NFSD netlink facility
- One fix for a remote SunRPC crasher
* tag 'nfsd-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
sunrpc: handle SVC_GARBAGE during svc auth processing as auth error
nfsd: use threads array as-is in netlink interface
SUNRPC: Cleanup/fix initial rq_pages allocation
NFSD: Avoid corruption of a referring call list
Handle TDVMCALL for GetQuote to generate a TD-Quote.
GetQuote is a doorbell-like interface used by TDX guests to request VMM
to generate a TD-Quote signed by a service hosting TD-Quoting Enclave
operating on the host. A TDX guest passes a TD Report (TDREPORT_STRUCT) in
a shared-memory area as parameter. Host VMM can access it and queue the
operation for a service hosting TD-Quoting enclave. When completed, the
Quote is returned via the same shared-memory area.
KVM only checks the GPA from the TDX guest has the shared-bit set and drops
the shared-bit before exiting to userspace to avoid bleeding the shared-bit
into KVM's exit ABI. KVM forwards the request to userspace VMM (e.g. QEMU)
and userspace VMM queues the operation asynchronously. KVM sets the return
code according to the 'ret' field set by userspace to notify the TDX guest
whether the request has been queued successfully or not. When the request
has been queued successfully, the TDX guest can poll the status field in
the shared-memory area to check whether the Quote generation is completed
or not. When completed, the generated Quote is returned via the same
buffer.
Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is
required to handle the KVM exit reason as the initial support for TDX,
by reentering KVM to ensure that the TDVMCALL is complete. While at it,
add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mark reported that futex_priv_hash fails on ARM64.
It turns out that the command line parsing does not terminate properly
and ends in the default case assuming an invalid option was passed.
Use an int as the return type for getopt().
Closes: https://lore.kernel.org/all/31869a69-063f-44a3-a079-ba71b2506cce@sirena.org.uk/
Fixes: 3163369407baf ("selftests/futex: Add futex_numa_mpol")
Fixes: cda95faef7bcf ("selftests/futex: Add futex_priv_hash")
Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250528085521.1938355-2-bigeasy@linutronix.de
Commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting
bug at task exit") moves the event->state update to before
list_del_event(). This makes the event->state test in list_del_event()
always false; never calling perf_cgroup_event_disable().
As a result, cpuctx->cgrp won't be cleared properly; causing havoc.
Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: David Wang <00107082@163.com>
Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/
Pull powerpc fixes from Madhavan Srinivasan:
- Fix to handle VDSO32 with pcrel
- Couple of dts fixes in microwatt and mpc8315erdb
- Fix to handle PE bridge reconfiguration in VFIO EEH recovery path
- Fix ioctl macros related to struct termio
Thanks to Christophe Leroy, Ganesh Goudar, J. Neuschäfer, Justin M.
Forbes, Michael Ellerman, Narayana Murty N, Tulio Magno, and Vaibhav
Jain
* tag 'powerpc-6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc: Fix struct termio related ioctl macros
powerpc: dts: mpc8315erdb: Add GPIO controller node
powerpc/microwatt: Fix model property in device tree
powerpc/eeh: Fix missing PE bridge reconfiguration during VFIO EEH recovery
powerpc/vdso: Fix build of VDSO32 with pcrel
The commit d6d1e3e6580c ("mm/execmem: Unify early execmem_cache
behaviour") changed early behaviour of execemem ROX cache to allow its
usage in early x86 code that allocates text pages when
CONFIG_MITGATION_ITS is enabled.
The permission management of the pages allocated from execmem for ITS
mitigation is now completely contained in arch/x86/kernel/alternatives.c
and therefore there is no need to special case early allocations in
execmem.
This reverts commit d6d1e3e6580ca35071ad474381f053cbf1fb6414.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250603111446.2609381-6-rppt@kernel.org
A lockdep fix removed two rdt_last_cmd_clear() calls that were used to
clear the last_cmd_status buffer but called without holding the required
rdtgroup_mutex.
The impacted resctrl commands are writing to the cpus or cpus_list files
and creating a new monitor or control group. With stale data in the
last_cmd_status buffer the impacted resctrl commands report the stale error
on success, or append its own failure message to the stale error on
failure.
Consequently, restore the rdt_last_cmd_clear() calls after acquiring
rdtgroup_mutex.
Fixes: c8eafe149530 ("x86/resctrl: Fix potential lockdep warning")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/all/20250603125828.1590067-1-zengheng4@huawei.com
Pull Kbuild updates from Masahiro Yamada:
- Add support for the EXPORT_SYMBOL_GPL_FOR_MODULES() macro, which
exports a symbol only to specified modules
- Improve ABI handling in gendwarfksyms
- Forcibly link lib-y objects to vmlinux even if CONFIG_MODULES=n
- Add checkers for redundant or missing <linux/export.h> inclusion
- Deprecate the extra-y syntax
- Fix a genksyms bug when including enum constants from *.symref files
* tag 'kbuild-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (28 commits)
genksyms: Fix enum consts from a reference affecting new values
arch: use always-$(KBUILD_BUILTIN) for vmlinux.lds
kbuild: set y instead of 1 to KBUILD_{BUILTIN,MODULES}
efi/libstub: use 'targets' instead of extra-y in Makefile
module: make __mod_device_table__* symbols static
scripts/misc-check: check unnecessary #include <linux/export.h> when W=1
scripts/misc-check: check missing #include <linux/export.h> when W=1
scripts/misc-check: add double-quotes to satisfy shellcheck
kbuild: move W=1 check for scripts/misc-check to top-level Makefile
scripts/tags.sh: allow to use alternative ctags implementation
kconfig: introduce menu type enum
docs: symbol-namespaces: fix reST warning with literal block
kbuild: link lib-y objects to vmlinux forcibly even when CONFIG_MODULES=n
tinyconfig: enable CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
docs/core-api/symbol-namespaces: drop table of contents and section numbering
modpost: check forbidden MODULE_IMPORT_NS("module:") at compile time
kbuild: move kbuild syntax processing to scripts/Makefile.build
Makefile: remove dependency on archscripts for header installation
Documentation/kbuild: Add new gendwarfksyms kABI rules
Documentation/kbuild: Drop section numbers
...
Pull erofs fixes from Gao Xiang:
- Use the mounter’s credentials for file-backed mounts to resolve
Android SELinux permission issues
- Remove the unused trace event `erofs_destroy_inode`
- Error out on crafted out-of-file-range encoded extents
- Remove an incorrect check for encoded extents
* tag 'erofs-for-6.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: remove a superfluous check for encoded extents
erofs: refuse crafted out-of-file-range encoded extents
erofs: remove unused trace event erofs_destroy_inode
erofs: impersonate the opener's credentials when accessing backing file
tianshuo han reported a remotely-triggerable crash if the client sends a
kernel RPC server a specially crafted packet. If decoding the RPC reply
fails in such a way that SVC_GARBAGE is returned without setting the
rq_accept_statp pointer, then that pointer can be dereferenced and a
value stored there.
If it's the first time the thread has processed an RPC, then that
pointer will be set to NULL and the kernel will crash. In other cases,
it could create a memory scribble.
The server sunrpc code treats a SVC_GARBAGE return from svc_authenticate
or pg_authenticate as if it should send a GARBAGE_ARGS reply. RFC 5531
says that if authentication fails that the RPC should be rejected
instead with a status of AUTH_ERR.
Handle a SVC_GARBAGE return as an AUTH_ERROR, with a reason of
AUTH_BADCRED instead of returning GARBAGE_ARGS in that case. This
sidesteps the whole problem of touching the rpc_accept_statp pointer in
this situation and avoids the crash.
Cc: stable@kernel.org
Fixes: 29cd2927fb91 ("SUNRPC: Fix encoding of accepted but unsuccessful RPC replies")
Reported-by: tianshuo han <hantianshuo233@gmail.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and
return it for unimplemented TDVMCALL subfunctions.
Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not
implemented is vague because TDX guests can't tell the error is due to
the subfunction is not supported or an invalid input of the subfunction.
New GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the
ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND.
Before the change, for common guest implementations, when a TDX guest
receives TDVMCALL_STATUS_INVALID_OPERAND, it has two cases:
1. Some operand is invalid. It could change the operand to another value
retry.
2. The subfunction is not supported.
For case 1, an invalid operand usually means the guest implementation bug.
Since the TDX guest can't tell which case is, the best practice for
handling TDVMCALL_STATUS_INVALID_OPERAND is stopping calling such leaf,
treating the failure as fatal if the TDVMCALL is essential or ignoring
it if the TDVMCALL is optional.
With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to
old TDX guest that do not know about it, but it is expected that the
guest will make the same action as TDVMCALL_STATUS_INVALID_OPERAND.
Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND
specifically; for example Linux just checks for success.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull Kbuild fixes from Masahiro Yamada:
- Move warnings about linux/export.h from W=1 to W=2
- Fix structure type overrides in gendwarfksyms
* tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
gendwarfksyms: Fix structure type overrides
kbuild: move warnings about linux/export.h from W=1 to W=2
Pull Rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- KUnit '#[test]'s:
- Support KUnit-mapped 'assert!' macros.
The support that landed last cycle was very basic, and the
'assert!' macros panicked since they were the standard library
ones. Now, they are mapped to the KUnit ones in a similar way to
how is done for doctests, reusing the infrastructure there.
With this, a failing test like:
#[test]
fn my_first_test() {
assert_eq!(42, 43);
}
will report:
# my_first_test: ASSERTION FAILED at rust/kernel/lib.rs:251
Expected 42 == 43 to be true, but is false
# my_first_test.speed: normal
not ok 1 my_first_test
- Support tests with checked 'Result' return types.
The return value of test functions that return a 'Result' will
be checked, thus one can now easily catch errors when e.g. using
the '?' operator in tests.
With this, a failing test like:
#[test]
fn my_test() -> Result {
f()?;
Ok(())
}
will report:
# my_test: ASSERTION FAILED at rust/kernel/lib.rs:321
Expected is_test_result_ok(my_test()) to be true, but is false
# my_test.speed: normal
not ok 1 my_test
- Add 'kunit_tests' to the prelude.
- Clarify the remaining language unstable features in use.
- Compile 'core' with edition 2024 for Rust >= 1.87.
- Workaround 'bindgen' issue with forward references to 'enum' types.
- objtool: relax slice condition to cover more 'noreturn' functions.
- Use absolute paths in macros referencing 'core' and 'kernel'
crates.
- Skip '-mno-fdpic' flag for bindgen in GCC 32-bit arm builds.
- Clean some 'doc_markdown' lint hits -- we may enable it later on.
'kernel' crate:
- 'alloc' module:
- 'Box': support for type coercion, e.g. 'Box<T>' to 'Box<dyn U>'
if 'T' implements 'U'.
- 'Vec': implement new methods (prerequisites for nova-core and
binder): 'truncate', 'resize', 'clear', 'pop',
'push_within_capacity' (with new error type 'PushError'),
'drain_all', 'retain', 'remove' (with new error type
'RemoveError'), insert_within_capacity' (with new error type
'InsertError').
In addition, simplify 'push' using 'spare_capacity_mut', split
'set_len' into 'inc_len' and 'dec_len', add type invariant 'len
<= capacity' and simplify 'truncate' using 'dec_len'.
- 'time' module:
- Morph the Rust hrtimer subsystem into the Rust timekeeping
subsystem, covering delay, sleep, timekeeping, timers. This new
subsystem has all the relevant timekeeping C maintainers listed
in the entry.
- Replace 'Ktime' with 'Delta' and 'Instant' types to represent a
duration of time and a point in time.
- Temporarily add 'Ktime' to 'hrtimer' module to allow 'hrtimer'
to delay converting to 'Instant' and 'Delta'.
- 'xarray' module:
- Add a Rust abstraction for the 'xarray' data structure. This
abstraction allows Rust code to leverage the 'xarray' to store
types that implement 'ForeignOwnable'. This support is a
dependency for memory backing feature of the Rust null block
driver, which is waiting to be merged.
- Set up an entry in 'MAINTAINERS' for the XArray Rust support.
Patches will go to the new Rust XArray tree and then via the
Rust subsystem tree for now.
- Allow 'ForeignOwnable' to carry information about the pointed-to
type. This helps asserting alignment requirements for the
pointer passed to the foreign language.
- 'container_of!': retain pointer mut-ness and add a compile-time
check of the type of the first parameter ('$field_ptr').
- Support optional message in 'static_assert!'.
- Add C FFI types (e.g. 'c_int') to the prelude.
- 'str' module: simplify KUnit tests 'format!' macro, convert
'rusttest' tests into KUnit, take advantage of the '-> Result'
support in KUnit '#[test]'s.
- 'list' module: add examples for 'List', fix path of
'assert_pinned!' (so far unused macro rule).
- 'workqueue' module: remove 'HasWork::OFFSET'.
- 'page' module: add 'inline' attribute.
'macros' crate:
- 'module' macro: place 'cleanup_module()' in '.exit.text' section.
'pin-init' crate:
- Add 'Wrapper<T>' trait for creating pin-initializers for wrapper
structs with a structurally pinned value such as 'UnsafeCell<T>' or
'MaybeUninit<T>'.
- Add 'MaybeZeroable' derive macro to try to derive 'Zeroable', but
not error if not all fields implement it. This is needed to derive
'Zeroable' for all bindgen-generated structs.
- Add 'unsafe fn cast_[pin_]init()' functions to unsafely change the
initialized type of an initializer. These are utilized by the
'Wrapper<T>' implementations.
- Add support for visibility in 'Zeroable' derive macro.
- Add support for 'union's in 'Zeroable' derive macro.
- Upstream dev news: streamline CI, fix some bugs. Add new workflows
to check if the user-space version and the one in the kernel tree
have diverged. Use the issues tab [1] to track them, which should
help folks report and diagnose issues w.r.t. 'pin-init' better.
[1] https://github.com/rust-for-linux/pin-init/issues
Documentation:
- Testing: add docs on the new KUnit '#[test]' tests.
- Coding guidelines: explain that '///' vs. '//' applies to private
items too. Add section on C FFI types.
- Quick Start guide: update Ubuntu instructions and split them into
"25.04" and "24.04 LTS and older".
And a few other cleanups and improvements"
* tag 'rust-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (78 commits)
rust: list: Fix typo `much` in arc.rs
rust: check type of `$ptr` in `container_of!`
rust: workqueue: remove HasWork::OFFSET
rust: retain pointer mut-ness in `container_of!`
Documentation: rust: testing: add docs on the new KUnit `#[test]` tests
Documentation: rust: rename `#[test]`s to "`rusttest` host tests"
rust: str: take advantage of the `-> Result` support in KUnit `#[test]`'s
rust: str: simplify KUnit tests `format!` macro
rust: str: convert `rusttest` tests into KUnit
rust: add `kunit_tests` to the prelude
rust: kunit: support checked `-> Result`s in KUnit `#[test]`s
rust: kunit: support KUnit-mapped `assert!` macros in `#[test]`s
rust: make section names plural
rust: list: fix path of `assert_pinned!`
rust: compile libcore with edition 2024 for 1.87+
rust: dma: add missing Markdown code span
rust: task: add missing Markdown code spans and intra-doc links
rust: pci: fix docs related to missing Markdown code spans
rust: alloc: add missing Markdown code span
rust: alloc: add missing Markdown code spans
...
While chasing down a missing perf_cgroup_event_disable() elsewhere,
Leo Yan found that both perf_put_aux_event() and
perf_remove_sibling_event() were also missing one.
Specifically, the rule is that events that switch to OFF,ERROR need to
call perf_cgroup_event_disable().
Unify the disable paths to ensure this.
Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data")
Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605123343.GD35970@noisy.programming.kicks-ass.net
Pull vfs fixes from Christian Brauner:
- Fix a regression in overlayfs caused by reworking the lookup_one*()
set of helpers
- Make sure that the name of the dentry is printed in overlayfs'
mkdir() helper
- Add missing iocb values to TRACE_IOCB_STRINGS define
- Unlock the superblock during iterate_supers_type(). This was an
accidental internal api change
- Drop a misleading assert in file_seek_cur_needs_f_lock() helper
- Never refuse to return PIDFD_GET_INGO when parent pid is zero
That can trivially happen in container scenarios where the parent
process might be located in an ancestor pid namespace
- Don't revalidate in try_lookup_noperm() as that causes regression for
filesystems such as cifs
- Fix simple_xattr_list() and reset the err variable after
security_inode_listsecurity() got called so as not to confuse
userspace about the length of the xattr
* tag 'vfs-6.16-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: drop assert in file_seek_cur_needs_f_lock
fs: unlock the superblock during iterate_supers_type
ovl: fix debug print in case of mkdir error
VFS: change try_lookup_noperm() to skip revalidation
fs: add missing values to TRACE_IOCB_STRINGS
fs/xattr.c: fix simple_xattr_list()
ovl: fix regression caused by lookup helpers API changes
pidfs: never refuse ppid == 0 in PIDFD_GET_INFO