commits
Pull i2c fix from Wolfram Sang:
"One revert because of a regression in the I2C core which has sadly not
showed up during its time in -next"
* tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
Revert "i2c: boardinfo: Annotate code used in init phase only"
Pull irq fixes from Borislav Petkov:
- Skip interrupt ID 0 in sifive-plic during suspend/resume because
ID 0 is reserved and accessing reserved register space could result
in undefined behavior
- Fix a function's retval check in aspeed-scu-ic
* tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume
irqchip/aspeed-scu-ic: Fix an IS_ERR() vs NULL check
This reverts commit 1a2b423be6a89dd07d5fc27ea042be68697a6a49 because we
got a regression report and need time to find out the details.
Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Closes: https://lore.kernel.org/r/29ec0082-4dd4-4120-acd2-44b35b4b9487@oss.qualcomm.com
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Pull tracing fixes from Steven Rostedt:
"The previous fix to trace_marker required updating trace_marker_raw as
well. The difference between trace_marker_raw from trace_marker is
that the raw version is for applications to write binary structures
directly into the ring buffer instead of writing ASCII strings. This
is for applications that will read the raw data from the ring buffer
and get the data structures directly. It's a bit quicker than using
the ASCII version.
Unfortunately, it appears that our test suite has several tests that
test writes to the trace_marker file, but lacks any tests to the
trace_marker_raw file (this needs to be remedied). Two issues came
about the update to the trace_marker_raw file that syzbot found:
- Fix tracing_mark_raw_write() to use per CPU buffer
The fix to use the per CPU buffer to copy from user space was
needed for both the trace_maker and trace_maker_raw file.
The fix for reading from user space into per CPU buffers properly
fixed the trace_marker write function, but the trace_marker_raw
file wasn't fixed properly. The user space data was correctly
written into the per CPU buffer, but the code that wrote into the
ring buffer still used the user space pointer and not the per CPU
buffer that had the user space data already written.
- Stop the fortify string warning from writing into trace_marker_raw
After converting the copy_from_user_nofault() into a memcpy(),
another issue appeared. As writes to the trace_marker_raw expects
binary data, the first entry is a 4 byte identifier. The entry
structure is defined as:
struct {
struct trace_entry ent;
int id;
char buf[];
};
The size of this structure is reserved on the ring buffer with:
size = sizeof(*entry) + cnt;
Then it is copied from the buffer into the ring buffer with:
memcpy(&entry->id, buf, cnt);
This use to be a copy_from_user_nofault(), but now converting it to
a memcpy() triggers the fortify-string code, and causes a warning.
The allocated space is actually more than what is copied, as the
cnt used also includes the entry->id portion. Allocating
sizeof(*entry) plus cnt is actually allocating 4 bytes more than
what is needed.
Change the size function to:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
And update the memcpy() to unsafe_memcpy()"
* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Stop fortify-string from warning in tracing_mark_raw_write()
tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
According to the PLIC specification[1], global interrupt sources are
assigned small unsigned integer identifiers beginning at the value 1.
An interrupt ID of 0 is reserved to mean "no interrupt".
The current plic_irq_resume() and plic_irq_suspend() functions incorrectly
start the loop from index 0, which accesses the register space for the
reserved interrupt ID 0.
Change the loop to start from index 1, skipping the reserved
interrupt ID 0 as per the PLIC specification.
This prevents potential undefined behavior when accessing the reserved
register space during suspend/resume cycles.
Fixes: e80f0b6a2cf3 ("irqchip/irq-sifive-plic: Add syscore callbacks for hibernation")
Co-developed-by: Jia Wang <wangjia@ultrarisc.com>
Signed-off-by: Jia Wang <wangjia@ultrarisc.com>
Co-developed-by: Charles Mirabile <cmirabil@redhat.com>
Signed-off-by: Charles Mirabile <cmirabil@redhat.com>
Signed-off-by: Lucas Zampieri <lzampier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://github.com/riscv/riscv-plic-spec/releases/tag/1.0.0
Pull RTC updates from Alexandre Belloni:
"This cycle, we have a new RTC driver, for the SpacemiT P1. The optee
driver gets alarm support. We also get a fix for a race condition that
was fairly rare unless while stress testing the alarms.
Subsystem:
- Fix race when setting alarm
- Ensure alarm irq is enabled when UIE is enabled
- remove unneeded 'fast_io' parameter in regmap_config
New driver:
- SpacemiT P1 RTC
Drivers:
- efi: Remove wakeup functionality
- optee: add alarms support
- s3c: Drop support for S3C2410
- zynqmp: Restore alarm functionality after kexec transition"
* tag 'rtc-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (29 commits)
rtc: interface: Ensure alarm irq is enabled when UIE is enabled
rtc: tps6586x: Fix initial enable_irq/disable_irq balance
rtc: cpcap: Fix initial enable_irq/disable_irq balance
rtc: isl12022: Fix initial enable_irq/disable_irq balance
rtc: interface: Fix long-standing race when setting alarm
rtc: pcf2127: fix watchdog interrupt mask on pcf2131
rtc: zynqmp: Restore alarm functionality after kexec transition
rtc: amlogic-a4: Optimize global variables
rtc: sd2405al: Add I2C address.
rtc: Kconfig: move symbols to proper section
rtc: optee: make optee_rtc_pm_ops static
rtc: optee: Fix error code in optee_rtc_read_alarm()
rtc: optee: fix error code in probe()
dt-bindings: rtc: Convert apm,xgene-rtc to DT schema
rtc: spacemit: support the SpacemiT P1 RTC
rtc: optee: add alarm related rtc ops to optee rtc driver
rtc: optee: remove unnecessary memory operations
rtc: optee: fix memory leak on driver removal
rtc: x1205: Fix Xicor X1205 vendor prefix
dt-bindings: rtc: Fix Xicor X1205 vendor prefix
...
Pull Kbuild fixes from Nathan Chancellor:
- Fix UAPI types check in headers_check.pl
- Only enable -Werror for hostprogs with CONFIG_WERROR / W=e
- Ignore fsync() error when output of gen_init_cpio is a pipe
- Several little build fixes for recent modules.builtin.modinfo series
* tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kbuild: Use '--strip-unneeded-symbol' for removing module device table symbols
s390/vmlinux.lds.S: Move .vmlinux.info to end of allocatable sections
kbuild: Add '.rel.*' strip pattern for vmlinux
kbuild: Restore pattern to avoid stripping .rela.dyn from vmlinux
gen_init_cpio: Ignore fsync() returning EINVAL on pipes
scripts/Makefile.extrawarn: Respect CONFIG_WERROR / W=e for hostprogs
kbuild: uapi: Strip comments before size type check
The way tracing_mark_raw_write() records its data is that it has the
following structure:
struct {
struct trace_entry;
int id;
char buf[];
};
But memcpy(&entry->id, buf, size) triggers the following warning when the
size is greater than the id:
------------[ cut here ]------------
memcpy: detected field-spanning write (size 6) of single field "&entry->id" at kernel/trace/trace.c:7458 (size 4)
WARNING: CPU: 7 PID: 995 at kernel/trace/trace.c:7458 write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
Modules linked in:
CPU: 7 UID: 0 PID: 995 Comm: bash Not tainted 6.17.0-test-00007-g60b82183e78a-dirty #211 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
Code: 04 00 75 a7 b9 04 00 00 00 48 89 de 48 89 04 24 48 c7 c2 e0 b1 d1 b2 48 c7 c7 40 b2 d1 b2 c6 05 2d 88 6a 04 01 e8 f7 e8 bd ff <0f> 0b 48 8b 04 24 e9 76 ff ff ff 49 8d 7c 24 04 49 8d 5c 24 08 48
RSP: 0018:ffff888104c3fc78 EFLAGS: 00010292
RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 1ffffffff6b363b4 RDI: 0000000000000001
RBP: ffff888100058a00 R08: ffffffffb041d459 R09: ffffed1020987f40
R10: 0000000000000007 R11: 0000000000000001 R12: ffff888100bb9010
R13: 0000000000000000 R14: 00000000000003e3 R15: ffff888134800000
FS: 00007fa61d286740(0000) GS:ffff888286cad000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560d28d509f1 CR3: 00000001047a4006 CR4: 0000000000172ef0
Call Trace:
<TASK>
tracing_mark_raw_write+0x1fe/0x290
? __pfx_tracing_mark_raw_write+0x10/0x10
? security_file_permission+0x50/0xf0
? rw_verify_area+0x6f/0x4b0
vfs_write+0x1d8/0xdd0
? __pfx_vfs_write+0x10/0x10
? __pfx_css_rstat_updated+0x10/0x10
? count_memcg_events+0xd9/0x410
? fdget_pos+0x53/0x5e0
ksys_write+0x182/0x200
? __pfx_ksys_write+0x10/0x10
? do_user_addr_fault+0x4af/0xa30
do_syscall_64+0x63/0x350
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fa61d318687
Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
RSP: 002b:00007ffd87fe0120 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fa61d286740 RCX: 00007fa61d318687
RDX: 0000000000000006 RSI: 0000560d28d509f0 RDI: 0000000000000001
RBP: 0000560d28d509f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000006
R13: 00007fa61d4715c0 R14: 00007fa61d46ee80 R15: 0000000000000000
</TASK>
---[ end trace 0000000000000000 ]---
This is because fortify string sees that the size of entry->id is only 4
bytes, but it is writing more than that. But this is OK as the
dynamic_array is allocated to handle that copy.
The size allocated on the ring buffer was actually a bit too big:
size = sizeof(*entry) + cnt;
But cnt includes the 'id' and the buffer data, so adding cnt to the size
of *entry actually allocates too much on the ring buffer.
Change the allocation to:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
and the memcpy() to unsafe_memcpy() with an added justification.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011112032.77be18e4@gandalf.local.home
Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
of_iomap() doesn't return error pointers, it returns NULL. Fix the error
checking to check for NULL pointers.
Fixes: 86cd4301c285 ("irqchip/aspeed-scu-ic: Refactor driver to support variant-based initialization")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull SCSI fixes from James Bottomley:
"Fixes only in drivers (ufs, mvsas, qla2xxx, target) that came in just
before or during the merge window.
The most important one is the qla2xxx which reverts a conversion to
fix flexible array member warnings, that went up in this merge window
but which turned out on further testing to be causing data corruption"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Include UTP error in INT_FATAL_ERRORS
scsi: ufs: sysfs: Make HID attributes visible
scsi: mvsas: Fix use-after-free bugs in mvs_work_queue
scsi: ufs: core: Fix PM QoS mutex initialization
scsi: ufs: core: Fix runtime suspend error deadlock
Revert "scsi: qla2xxx: Fix memcpy() field-spanning write issue"
scsi: target: target_core_configfs: Add length check to avoid buffer overflow
When setting a normal alarm, user-space is responsible for using
RTC_AIE_ON/RTC_AIE_OFF to control if alarm irq should be enabled.
But when RTC_UIE_ON is used, interrupts must be enabled so that the
requested irq events are generated.
When RTC_UIE_OFF is used, alarm irq is disabled if there are no other
alarms queued, so this commit brings symmetry to that.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-5-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
After commit 5ab23c7923a1 ("modpost: Create modalias for builtin
modules"), relocatable RISC-V kernels with CONFIG_KASAN=y start failing
when attempting to strip the module device table symbols:
riscv64-linux-objcopy: not stripping symbol `__mod_device_table__kmod_irq_starfive_jh8100_intc__of__starfive_intc_irqchip_match_table' because it is named in a relocation
make[4]: *** [scripts/Makefile.vmlinux:97: vmlinux] Error 1
The relocation appears to come from .LASANLOC5 in .data.rel.local:
$ llvm-objdump --disassemble-symbols=.LASANLOC5 --disassemble-all -r drivers/irqchip/irq-starfive-jh8100-intc.o
drivers/irqchip/irq-starfive-jh8100-intc.o: file format elf64-littleriscv
Disassembly of section .data.rel.local:
0000000000000180 <.LASANLOC5>:
...
1d0: 0000 unimp
00000000000001d0: R_RISCV_64 __mod_device_table__kmod_irq_starfive_jh8100_intc__of__starfive_intc_irqchip_match_table
...
This section appears to come from GCC for including additional
information about global variables that may be protected by KASAN.
There appears to be no way to opt out of the generation of these symbols
through either a flag or attribute. Attempting to remove '.LASANLOC*'
with '--strip-symbol' results in the same error as above because these
symbols may refer to (thus have relocation between) each other.
Avoid this build breakage by switching to '--strip-unneeded-symbol' for
removing __mod_device_table__ symbols, as it will only remove the symbol
when there is no relocation pointing to it. While this may result in a
little more bloat in the symbol table in certain configurations, it is
not as bad as outright build failures.
Fixes: 5ab23c7923a1 ("modpost: Create modalias for builtin modules")
Reported-by: Charles Mirabile <cmirabil@redhat.com>
Closes: https://lore.kernel.org/20251007011637.2512413-1-cmirabil@redhat.com/
Suggested-by: Alexey Gladkov <legion@kernel.org>
Tested-by: Nicolas Schier <nsc@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The fix to use a per CPU buffer to read user space tested only the writes
to trace_marker. But it appears that the selftests are missing tests to
the trace_maker_raw file. The trace_maker_raw file is used by applications
that writes data structures and not strings into the file, and the tools
read the raw ring buffer to process the structures it writes.
The fix that reads the per CPU buffers passes the new per CPU buffer to
the trace_marker file writes, but the update to the trace_marker_raw write
read the data from user space into the per CPU buffer, but then still used
then passed the user space address to the function that records the data.
Pass in the per CPU buffer and not the user space address.
TODO: Add a test to better test trace_marker_raw.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011035243.386098147@kernel.org
Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull btrfs fixes from David Sterba:
"Two short fixes that would be good to have before rc1"
* tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix PAGE_SIZE format specifier in open_ctree()
btrfs: avoid potential out-of-bounds in btrfs_encode_fh()
Pull more x86 updates from Borislav Petkov:
- Remove a bunch of asm implementing condition flags testing in KVM's
emulator in favor of int3_emulate_jcc() which is written in C
- Replace KVM fastops with C-based stubs which avoids problems with the
fastop infra related to latter not adhering to the C ABI due to their
special calling convention and, more importantly, bypassing compiler
control-flow integrity checking because they're written in asm
- Remove wrongly used static branches and other ugliness accumulated
over time in hyperv's hypercall implementation with a proper static
function call to the correct hypervisor call variant
- Add some fixes and modifications to allow running FRED-enabled
kernels in KVM even on non-FRED hardware
- Add kCFI improvements like validating indirect calls and prepare for
enabling kCFI with GCC. Add cmdline params documentation and other
code cleanups
- Use the single-byte 0xd6 insn as the official #UD single-byte
undefined opcode instruction as agreed upon by both x86 vendors
- Other smaller cleanups and touchups all over the place
* tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86,retpoline: Optimize patch_retpoline()
x86,ibt: Use UDB instead of 0xEA
x86/cfi: Remove __noinitretpoline and __noretpoline
x86/cfi: Add "debug" option to "cfi=" bootparam
x86/cfi: Standardize on common "CFI:" prefix for CFI reports
x86/cfi: Document the "cfi=" bootparam options
x86/traps: Clarify KCFI instruction layout
compiler_types.h: Move __nocfi out of compiler-specific header
objtool: Validate kCFI calls
x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y
x86/fred: Play nice with invoking asm_fred_entry_from_kvm() on non-FRED hardware
x86/fred: Install system vector handlers even if FRED isn't fully enabled
x86/hyperv: Use direct call to hypercall-page
x86/hyperv: Clean up hv_do_hypercall()
KVM: x86: Remove fastops
KVM: x86: Convert em_salc() to C
KVM: x86: Introduce EM_ASM_3WCL
KVM: x86: Introduce EM_ASM_1SRC2
KVM: x86: Introduce EM_ASM_2CL
KVM: x86: Introduce EM_ASM_2W
...
When a UTP error occurs in isolation, UFS is not currently recoverable.
This is because the UTP error is not considered fatal in the error
handling code, leading to either an I/O timeout or an OCS error.
Add the UTP error flag to INT_FATAL_ERRORS so the controller will be
reset in this situation.
sd 0:0:0:0: [sda] tag#38 UNKNOWN(0x2003) Result: hostbyte=0x07
driverbyte=DRIVER_OK cmd_age=0s
sd 0:0:0:0: [sda] tag#38 CDB: opcode=0x28 28 00 00 51 24 e2 00 00 08 00
I/O error, dev sda, sector 42542864 op 0x0:(READ) flags 0x80700 phys_seg
8 prio class 2
OCS error from controller = 9 for tag 39
pa_err[1] = 0x80000010 at 2667224756 us
pa_err: total cnt=2
dl_err[0] = 0x80000002 at 2667148060 us
dl_err[1] = 0x80002000 at 2667282844 us
No record of nl_err
No record of tl_err
No record of dme_err
No record of auto_hibern8_err
fatal_err[0] = 0x804 at 2667282836 us
---------------------------------------------------
REGISTER
---------------------------------------------------
NAME OFFSET VALUE
STD HCI SFR 0xfffffff0 0x0
AHIT 0x18 0x814
INTERRUPT STATUS 0x20 0x1000
INTERRUPT ENABLE 0x24 0x70ef5
[mkp: commit desc]
Signed-off-by: Hoyoung Seo <hy50.seo@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Message-Id: <20250930061428.617955-1-hy50.seo@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Interrupts are automatically enabled when requested, so we need to
initialize irq_en accordingly to avoid causing an unbalanced enable
warning.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-4-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
This is a series to address some problems that were exposed by the
recent modules.builtin.modinfo series that landed in commit c7d3dd9163e6
("Merge patch series "Add generated modalias to
modules.builtin.modinfo"").
The third patch is not directly related to the aforementioned series, as
the warning it fixes happens prior to the series but commit 8d18ef04f940
("s390: vmlinux.lds.S: Reorder sections") from the series creates
conflicts in this area, so I included it here.
Link: https://patch.msgid.link/20251008-kbuild-fix-modinfo-regressions-v1-0-9fc776c5887c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
It was reported that using __copy_from_user_inatomic() can actually
schedule. Which is bad when preemption is disabled. Even though there's
logic to check in_atomic() is set, but this is a nop when the kernel is
configured with PREEMPT_NONE. This is due to page faulting and the code
could schedule with preemption disabled.
Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/
The solution was to change the __copy_from_user_inatomic() to
copy_from_user_nofault(). But then it was reported that this caused a
regression in Android. There's several applications writing into
trace_marker() in Android, but now instead of showing the expected data,
it is showing:
tracing_mark_write: <faulted>
After reverting the conversion to copy_from_user_nofault(), Android was
able to get the data again.
Writes to the trace_marker is a way to efficiently and quickly enter data
into the Linux tracing buffer. It takes no locks and was designed to be as
non-intrusive as possible. This means it cannot allocate memory, and must
use pre-allocated data.
A method that is actively being worked on to have faultable system call
tracepoints read user space data is to allocate per CPU buffers, and use
them in the callback. The method uses a technique similar to seqcount.
That is something like this:
preempt_disable();
cpu = smp_processor_id();
buffer = this_cpu_ptr(&pre_allocated_cpu_buffers, cpu);
do {
cnt = nr_context_switches_cpu(cpu);
migrate_disable();
preempt_enable();
ret = copy_from_user(buffer, ptr, size);
preempt_disable();
migrate_enable();
} while (!ret && cnt != nr_context_switches_cpu(cpu));
if (!ret)
ring_buffer_write(buffer);
preempt_enable();
It's a little more involved than that, but the above is the basic logic.
The idea is to acquire the current CPU buffer, disable migration, and then
enable preemption. At this moment, it can safely use copy_from_user().
After reading the data from user space, it disables preemption again. It
then checks to see if there was any new scheduling on this CPU. If there
was, it must assume that the buffer was corrupted by another task. If
there wasn't, then the buffer is still valid as only tasks in preemptable
context can write to this buffer and only those that are running on the
CPU.
By using this method, where trace_marker open allocates the per CPU
buffers, trace_marker writes can access user space and even fault it in,
without having to allocate or take any locks of its own.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Wattson CI <wattson-external@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home
Fixes: 3d62ab32df065 ("tracing: Fix tracing_marker may trigger page fault during preempt_disable")
Reported-by: Runping Lai <runpinglai@google.com>
Tested-by: Runping Lai <runpinglai@google.com>
Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull nfsd updates from Chuck Lever:
"Mike Snitzer has prototyped a mechanism for disabling I/O caching in
NFSD. This is introduced in v6.18 as an experimental feature. This
enables scaling NFSD in /both/ directions:
- NFS service can be supported on systems with small memory
footprints, such as low-cost cloud instances
- Large NFS workloads will be less likely to force the eviction of
server-local activity, helping it avoid thrashing
Jeff Layton contributed a number of fixes to the new attribute
delegation implementation (based on a pending Internet RFC) that we
hope will make attribute delegation reliable enough to enable by
default, as it is on the Linux NFS client.
The remaining patches in this pull request are clean-ups and minor
optimizations. Many thanks to the contributors, reviewers, testers,
and bug reporters who participated during the v6.18 NFSD development
cycle"
* tag 'nfsd-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (42 commits)
nfsd: discard nfserr_dropit
SUNRPC: Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it
NFSD: Add io_cache_{read,write} controls to debugfs
NFSD: Do the grace period check in ->proc_layoutget
nfsd: delete unnecessary NULL check in __fh_verify()
NFSD: Allow layoutcommit during grace period
NFSD: Disallow layoutget during grace period
sunrpc: fix "occurence"->"occurrence"
nfsd: Don't force CRYPTO_LIB_SHA256 to be built-in
nfsd: nfserr_jukebox in nlm_fopen should lead to a retry
NFSD: Reduce DRC bucket size
NFSD: Delay adding new entries to LRU
SUNRPC: Move the svc_rpcb_cleanup() call sites
NFS: Remove rpcbind cleanup for NFSv4.0 callback
nfsd: unregister with rpcbind when deleting a transport
NFSD: Drop redundant conversion to bool
sunrpc: eliminate return pointer in svc_tcp_sendmsg()
sunrpc: fix pr_notice in svc_tcp_sendto() to show correct length
nfsd: decouple the xprtsec policy check from check_nfsd_access()
NFSD: Fix destination buffer size in nfsd4_ssc_setup_dul()
...
There is an instance of -Wformat when targeting 32-bit architectures due
to using a 'size_t' specifier (which is 'unsigned int' for 32-bit
platforms) to print PAGE_SIZE:
In file included from fs/btrfs/compression.h:17,
from fs/btrfs/extent_io.h:15,
from fs/btrfs/locking.h:13,
from fs/btrfs/ctree.h:19,
from fs/btrfs/disk-io.c:22:
fs/btrfs/disk-io.c: In function 'open_ctree':
include/linux/kern_levels.h:5:25: error: format '%zu' expects argument of type 'size_t', but argument 4 has type 'long unsigned int' [-Werror=format=]
...
fs/btrfs/disk-io.c:3398:17: note: in expansion of macro 'btrfs_warn'
3398 | btrfs_warn(fs_info,
| ^~~~~~~~~~
PAGE_SIZE is consistently defined as an 'unsigned long' in
include/vsdo/page.h so use '%lu' to clear up the warning.
Fixes: 98077f7f2180 ("btrfs: enable experimental bs > ps support")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull x86 cleanups from Borislav Petkov:
- Simplify inline asm flag output operands now that the minimum
compiler version supports the =@ccCOND syntax
- Remove a bunch of AS_* Kconfig symbols which detect assembler support
for various instruction mnemonics now that the minimum assembler
version supports them all
- The usual cleanups all over the place
* tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Remove code depending on __GCC_ASM_FLAG_OUTPUTS__
x86/sgx: Use ENCLS mnemonic in <kernel/cpu/sgx/encls.h>
x86/mtrr: Remove license boilerplate text with bad FSF address
x86/asm: Use RDPKRU and WRPKRU mnemonics in <asm/special_insns.h>
x86/idle: Use MONITORX and MWAITX mnemonics in <asm/mwait.h>
x86/entry/fred: Push __KERNEL_CS directly
x86/kconfig: Remove CONFIG_AS_AVX512
crypto: x86 - Remove CONFIG_AS_VPCLMULQDQ
crypto: X86 - Remove CONFIG_AS_VAES
crypto: x86 - Remove CONFIG_AS_GFNI
x86/kconfig: Drop unused and needless config X86_64_SMP
Currently the very common retpoline: "CS CALL __x86_indirect_thunk_r11"
is transformed into "CALL *R11; NOP3" for eIBRS/BHI_NO parts.
Similarly, paranoid fineibt has: "CALL *R11; NOP".
Recognise that CS stuffing can avoid the extra NOP. However, due to
prefix decode penalties, make sure to not emit too many CS prefixes.
Notably: "CS CALL __x86_indirect_thunk_rax" must not become "CS CS CS
CS CALL *RAX". Prefix decode penalties are typically many more cycles
than decoding an extra NOP.
Additionally, if the retpoline is a tail-call, the "JMP *%\reg" should
be followed by INT3 for straight-line-speculation mitigation, since
emit_indirect() now has a length argument, move this into
emit_indirect() such that other users (paranoid-fineibt) also do this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250902104627.GM4068168@noisy.programming.kicks-ass.net
Call sysfs_update_group() after reading the device descriptor to ensure
the HID sysfs attributes are visible when the feature is supported.
Fixes: ae7795a8c258 ("scsi: ufs: core: Add HID support")
Signed-off-by: Daniel Lee <chullee@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Interrupts are automatically enabled when requested, so we need to
initialize alarm_enabled accordingly to avoid causing an unbalanced enable
warning.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-3-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The reproducer:
echo | ./usr/gen_init_cpio /dev/stdin > /dev/null
fsync() on a pipe fd returns -EINVAL, which makes gen_init_cpio fail.
Ignore -EINVAL from fsync().
Fixes: ae18b94099b0 ("gen_init_cpio: support -o <output_file> parameter")
Cc: David Disseldorp <ddiss@suse.de>
Cc: Nicolas Schier <nsc@kernel.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Link: https://patch.msgid.link/20251007-gen_init_cpio-pipe-v2-1-b098ab94b58a@arista.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
When building s390 defconfig with binutils older than 2.32, there are
several warnings during the final linking stage:
s390-linux-ld: .tmp_vmlinux1: warning: allocated section `.got.plt' not in segment
s390-linux-ld: .tmp_vmlinux2: warning: allocated section `.got.plt' not in segment
s390-linux-ld: vmlinux.unstripped: warning: allocated section `.got.plt' not in segment
s390-linux-objcopy: vmlinux: warning: allocated section `.got.plt' not in segment
s390-linux-objcopy: st7afZyb: warning: allocated section `.got.plt' not in segment
binutils commit afca762f598 ("S/390: Improve partial relro support for
64 bit") [1] in 2.32 changed where .got.plt is emitted, avoiding the
warning.
The :NONE in the .vmlinux.info output section description changes the
segment for subsequent allocated sections. Move .vmlinux.info right
above the discards section to place all other sections in the previously
defined segment, .data.
Fixes: 30226853d6ec ("s390: vmlinux.lds.S: explicitly handle '.got' and '.plt' sections")
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=afca762f598d453c563f244cd3777715b1a0cb72 [1]
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20251008-kbuild-fix-modinfo-regressions-v1-3-9fc776c5887c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The return value from `__rb_map_vma()`, which rejects writable or
executable mappings (VM_WRITE, VM_EXEC, or !VM_MAYSHARE), was being
ignored. As a result the caller of `__rb_map_vma` always returned 0
even when the mapping had actually failed, allowing it to proceed
with an invalid VMA.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008172516.20697-1-ankitkhushwaha.linux@gmail.com
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Reported-by: syzbot+ddc001b92c083dbf2b97@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=194151be8eaebd826005329b2e123aecae714bdb
Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull x86 kvm updates from Paolo Bonzini:
"Generic:
- Rework almost all of KVM's exports to expose symbols only to KVM's
x86 vendor modules (kvm-{amd,intel}.ko and PPC's kvm-{pr,hv}.ko
x86:
- Rework almost all of KVM x86's exports to expose symbols only to
KVM's vendor modules, i.e. to kvm-{amd,intel}.ko
- Add support for virtualizing Control-flow Enforcement Technology
(CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD
(Shadow Stacks).
It is worth noting that while SHSTK and IBT can be enabled
separately in CPUID, it is not really possible to virtualize them
separately. Therefore, Intel processors will really allow both
SHSTK and IBT under the hood if either is made visible in the
guest's CPUID. The alternative would be to intercept
XSAVES/XRSTORS, which is not feasible for performance reasons
- Fix a variety of fuzzing WARNs all caused by checking L1 intercepts
when completing userspace I/O. KVM has already committed to
allowing L2 to to perform I/O at that point
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the
MSR is supposed to exist for v2 PMUs
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside
of forced emulation and other contrived testing scenarios)
- Clean up the MSR APIs in preparation for CET and FRED
virtualization, as well as mediated vPMU support
- Clean up a pile of PMU code in anticipation of adding support for
mediated vPMUs
- Reject in-kernel IOAPIC/PIT for TDX VMs, as KVM can't obtain EOI
vmexits needed to faithfully emulate an I/O APIC for such guests
- Many cleanups and minor fixes
- Recover possible NX huge pages within the TDP MMU under read lock
to reduce guest jitter when restoring NX huge pages
- Return -EAGAIN during prefault if userspace concurrently
deletes/moves the relevant memslot, to fix an issue where
prefaulting could deadlock with the memslot update
x86 (AMD):
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is
supported
- Require a minimum GHCB version of 2 when starting SEV-SNP guests
via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate
errors instead of latent guest failures
- Add support for SEV-SNP's CipherText Hiding, an opt-in feature that
prevents unauthorized CPU accesses from reading the ciphertext of
SNP guest private memory, e.g. to attempt an offline attack. This
feature splits the shared SEV-ES/SEV-SNP ASID space into separate
ranges for SEV-ES and SEV-SNP guests, therefore a new module
parameter is needed to control the number of ASIDs that can be used
for VMs with CipherText Hiding vs. how many can be used to run
SEV-ES guests
- Add support for Secure TSC for SEV-SNP guests, which prevents the
untrusted host from tampering with the guest's TSC frequency, while
still allowing the the VMM to configure the guest's TSC frequency
prior to launch
- Validate the XCR0 provided by the guest (via the GHCB) to avoid
bugs resulting from bogus XCR0 values
- Save an SEV guest's policy if and only if LAUNCH_START fully
succeeds to avoid leaving behind stale state (thankfully not
consumed in KVM)
- Explicitly reject non-positive effective lengths during SNP's
LAUNCH_UPDATE instead of subtly relying on guest_memfd to deal with
them
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the
host's desired TSC_AUX, to fix a bug where KVM was keeping a
different vCPU's TSC_AUX in the host MSR until return to userspace
KVM (Intel):
- Preparation for FRED support
- Don't retry in TDX's anti-zero-step mitigation if the target
memslot is invalid, i.e. is being deleted or moved, to fix a
deadlock scenario similar to the aforementioned prefaulting case
- Misc bugfixes and minor cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (142 commits)
KVM: x86: Export KVM-internal symbols for sub-modules only
KVM: x86: Drop pointless exports of kvm_arch_xxx() hooks
KVM: x86: Move kvm_intr_is_single_vcpu() to lapic.c
KVM: Export KVM-internal symbols for sub-modules only
KVM: s390/vfio-ap: Use kvm_is_gpa_in_memslot() instead of open coded equivalent
KVM: VMX: Make CR4.CET a guest owned bit
KVM: selftests: Verify MSRs are (not) in save/restore list when (un)supported
KVM: selftests: Add coverage for KVM-defined registers in MSRs test
KVM: selftests: Add KVM_{G,S}ET_ONE_REG coverage to MSRs test
KVM: selftests: Extend MSRs test to validate vCPUs without supported features
KVM: selftests: Add support for MSR_IA32_{S,U}_CET to MSRs test
KVM: selftests: Add an MSR test to exercise guest/host and read/write
KVM: x86: Define AMD's #HV, #VC, and #SX exception vectors
KVM: x86: Define Control Protection Exception (#CP) vector
KVM: x86: Add human friendly formatting for #XM, and #VE
KVM: SVM: Enable shadow stack virtualization for SVM
KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's valid
KVM: SVM: Pass through shadow stack MSRs as appropriate
KVM: SVM: Update dump_vmcb with shadow stack save area additions
KVM: nSVM: Save/load CET Shadow Stack state to/from vmcb12/vmcb02
...
nfserr_dropit hasn't been used for over a decade, since rq_dropme and
the RQ_DROPME were introduced.
Time to get rid of it completely.
Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The function btrfs_encode_fh() does not properly account for the three
cases it handles.
Before writing to the file handle (fh), the function only returns to the
user BTRFS_FID_SIZE_NON_CONNECTABLE (5 dwords, 20 bytes) or
BTRFS_FID_SIZE_CONNECTABLE (8 dwords, 32 bytes).
However, when a parent exists and the root ID of the parent and the
inode are different, the function writes BTRFS_FID_SIZE_CONNECTABLE_ROOT
(10 dwords, 40 bytes).
If *max_len is not large enough, this write goes out of bounds because
BTRFS_FID_SIZE_CONNECTABLE_ROOT is greater than
BTRFS_FID_SIZE_CONNECTABLE originally returned.
This results in an 8-byte out-of-bounds write at
fid->parent_root_objectid = parent_root_id.
A previous attempt to fix this issue was made but was lost.
https://lore.kernel.org/all/4CADAEEC020000780001B32C@vpn.id2.novell.com/
Although this issue does not seem to be easily triggerable, it is a
potential memory corruption bug that should be fixed. This patch
resolves the issue by ensuring the function returns the appropriate size
for all three cases and validates that *max_len is large enough before
writing any data.
Fixes: be6e8dc0ba84 ("NFS support for btrfs - v3")
CC: stable@vger.kernel.org # 3.0+
Signed-off-by: Anderson Nascimento <anderson@allelesecurity.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull slab fix from Vlastimil Babka:
"A NULL pointer deref hotfix"
* tag 'slab-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: fix barn NULL pointer dereference on memoryless nodes
The minimum supported GCC version is 8.1, which supports flag output operands
and always defines __GCC_ASM_FLAG_OUTPUTS__ macro.
Remove code depending on __GCC_ASM_FLAG_OUTPUTS__ and use the "=@ccCOND" flag
output operand directly.
Use the equivalent "=@ccz" instead of "=@cce" flag output operand for
CMPXCHG8B and CMPXCHG16B instructions. These instructions set a single flag
bit - the Zero flag - and "=@ccz" is used to distinguish the CC user from
comparison instructions, where set ZERO flag indeed means that the values are
equal.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250905121723.GCaLrU04lP2A50PT-B@fat_crate.local
A while ago [0] FineIBT started using the 0xEA instruction to raise #UD.
All existing parts will generate #UD in 64bit mode on that instruction.
However; Intel/AMD have not blessed using this instruction, it is on
their 'reserved' opcode list for future use.
Peter Anvin worked the committees and got use of 0xD6 blessed, it
shall be called UDB (per the next SDM or so), and it being a single
byte instruction is easy to slip into a single byte immediate -- as
is done by this very patch.
Reworking the FineIBT code to use UDB wasn't entirely trivial. Notably
the FineIBT-BHI1 case ran out of bytes. In order to condense the
encoding some it was required to move the hash register from R10D to
EAX (thanks hpa!).
Per the x86_64 ABI, RAX is used to pass the number of vector registers
for vararg function calls -- something that should not happen in the
kernel. More so, the kernel is built with -mskip-rax-setup, which
should leave RAX completely unused, allowing its re-use.
[ For BPF; while the bpf2bpf tail-call uses RAX in its calling
convention, that does not use CFI and is unaffected. Only the
'regular' C->BPF transition is covered by CFI. ]
The ENDBR poison value is changed from 'OSP NOP3' to 'NOPL -42(%RAX)',
this is basically NOP4 but with UDB as its immediate. As such it is
still a non-standard NOP value unique to prior ENDBR sites, but now
also provides UDB.
Per Agner Fog's optimization guide, Jcc is assumed not-taken. That is,
the expected path should be the fallthrough case for improved
throughput.
Since the preamble now relies on the ENDBR poison to provide UDB, the
code is changed to write the poison right along with the initial
preamble -- this is possible because the ITS mitigation already
disabled IBT over rewriting the CFI scheme.
The scheme in detail:
Preamble:
FineIBT FineIBT-BHI1 FineIBT-BHI
__cfi_\func: __cfi_\func: __cfi_\func:
endbr endbr endbr
subl $0x12345678, %eax subl $0x12345678, %eax subl $0x12345678, %eax
jne.d32,np \func+3 cmovne %rax, %rdi cs cs call __bhi_args_N
jne.d8,np \func+3
\func: \func: \func:
nopl -42(%rax) nopl -42(%rax) nopl -42(%rax)
Notably there are 7 bytes available after the SUBL; this enables the
BHI1 case to fit without the nasty overlapping case it had previously.
The !BHI case uses Jcc.d32,np to consume all 7 bytes without the need
for an additional NOP, while the BHI case uses CS padding to align the
CALL with the end of the preamble such that it returns to \func+0.
Caller:
FineIBT Paranoid-FineIBT
fineibt_caller: fineibt_caller:
mov $0x12345678, %eax mov $0x12345678, %eax
lea -10(%r11), %r11 cmp -0x11(%r11), %eax
nop5 cs lea -0x10(%r11), %r11
retpoline: retpoline:
cs call __x86_indirect_thunk_r11 jne fineibt_caller+0xd
call *%r11
nop
Notably this is before apply_retpolines() which will fix up the
retpoline call -- since all parts with IBT also have eIBRS (lets
ignore ITS). Typically the retpoline site is rewritten (when still
intact) into:
call *%r11
nop3
[0] 06926c6cdb95 ("x86/ibt: Optimize the FineIBT instruction sequence")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250901191307.GI4067720@noisy.programming.kicks-ass.net
During the detaching of Marvell's SAS/SATA controller, the original code
calls cancel_delayed_work() in mvs_free() to cancel the delayed work
item mwq->work_q. However, if mwq->work_q is already running, the
cancel_delayed_work() may fail to cancel it. This can lead to
use-after-free scenarios where mvs_free() frees the mvs_info while
mvs_work_queue() is still executing and attempts to access the
already-freed mvs_info.
A typical race condition is illustrated below:
CPU 0 (remove) | CPU 1 (delayed work callback)
mvs_pci_remove() |
mvs_free() | mvs_work_queue()
cancel_delayed_work() |
kfree(mvi) |
| mvi-> // UAF
Replace cancel_delayed_work() with cancel_delayed_work_sync() to ensure
that the delayed work item is properly canceled and any executing
delayed work item completes before the mvs_info is deallocated.
This bug was found by static analysis.
Fixes: 20b09c2992fe ("[SCSI] mvsas: add support for 94xx; layout change; bug fixes")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Interrupts are automatically enabled when requested, so we need to
initialize irq_enabled accordingly to avoid causing an unbalanced enable
warning.
Fixes: c62d658e5253 ("rtc: isl12022: Add alarm support")
Signed-off-by: Esben Haabendal <esben@geanix.com>
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-2-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Commit 27758d8c2583 ("kbuild: enable -Werror for hostprogs")
unconditionally enabled -Werror for the compiler, assembler, and linker
when building the host programs, as the build footprint of the host
programs is small (thus risk of build failures from warnings are low)
and that stage of the build may not have Kconfig values (thus
CONFIG_WERROR could not be used as a precondition).
While turning warnings into errors unconditionally happens in a few
places within the kernel, it can be disruptive to people who may be
building with newer compilers, such as while doing a bisect. While it is
possible to avoid this behavior by passing HOSTCFLAGS=-w or
HOSTCFLAGS=-Wno-error, it may not be the most intuitive for regular
users not intimately familiar with Kbuild.
Avoid being disruptive to the entire build by depending on the explicit
opt-in of CONFIG_WERROR or W=e to enable -Werror and the like while
building the host programs. While this means there is a small portion of
the build that does not have -Werror enabled (namely scripts/kconfig/*
and scripts/basic/fixdep), it is better than not having it altogether.
Fixes: 27758d8c2583 ("kbuild: enable -Werror for hostprogs")
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Reported-by: Askar Safin <safinaskar@gmail.com>
Closes: https://lore.kernel.org/20251005011100.1035272-1-safinaskar@gmail.com/
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Tested-by: Miguel Ojeda <ojeda@kernel.org> # Rust
Link: https://patch.msgid.link/20251006-kbuild-hostprogs-werror-fix-v1-1-23cf1ffced5c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Prior to binutils commit c12d9fa2afe ("Support objcopy
--remove-section=.relaFOO") [1] in 2.32, stripping relocation sections
required the trailing period (i.e., '.rel.*') to work properly.
After commit 3e86e4d74c04 ("kbuild: keep .modinfo section in
vmlinux.unstripped"), there is an error with binutils 2.31.1 or earlier
because these sections are not properly removed:
s390-linux-objcopy: st6tO8Ev: symbol `.modinfo' required but not present
s390-linux-objcopy:st6tO8Ev: no symbols
Add the old pattern to resolve this issue (along with a comment to allow
cleaning this when binutils 2.32 or newer is the minimum supported
version). While the aforementioned kbuild change exposes this, the
pattern was originally changed by commit 71d815bf5dfd ("kbuild: Strip
runtime const RELA sections correctly"), where it would still be
incorrect with binutils older than 2.32.
Fixes: 71d815bf5dfd ("kbuild: Strip runtime const RELA sections correctly")
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=c12d9fa2afe7abcbe407a00e15719e1a1350c2a7 [1]
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/CA+G9fYvVktRhFtZXdNgVOL8j+ArsJDpvMLgCitaQvQmCx=hwOQ@mail.gmail.com/
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20251008-kbuild-fix-modinfo-regressions-v1-2-9fc776c5887c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The functions irqsoff_graph_entry() and irqsoff_graph_return() both call
func_prolog_dec() that will test if the data->disable is already set and
if not, increment it and return. If it was set, it returns false and the
caller exits.
The caller of this function must decrement the disable counter, but misses
doing so if the calltime fails to be acquired.
Instead of exiting out when calltime is NULL, change the logic to do the
work if it is not NULL and still do the clean up at the end of the
function if it is NULL.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008114943.6f60f30f@gandalf.local.home
Fixes: a485ea9e3ef3 ("tracing: Fix irqsoff and wakeup latency tracers when using function graph")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-2-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull LoongArch updates from Huacai Chen:
- Initialize acpi_gbl_use_global_lock to false
- Allow specify SIMD width via kernel parameters
- Add kexec_file (both EFI & ELF format) support
- Add PER_VMA_LOCK for page fault handling support
- Improve BPF trampoline support
- Update the default config file
- Some bug fixes and other small changes
* tag 'loongarch-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: (23 commits)
LoongArch: Update Loongson-3 default config file
LoongArch: BPF: Sign-extend struct ops return values properly
LoongArch: BPF: Make error handling robust in arch_prepare_bpf_trampoline()
LoongArch: BPF: Make trampoline size stable
LoongArch: BPF: Don't align trampoline size
LoongArch: BPF: No support of struct argument in trampoline programs
LoongArch: BPF: No text_poke() for kernel text
LoongArch: BPF: Remove duplicated bpf_flush_icache()
LoongArch: BPF: Remove duplicated flags check
LoongArch: BPF: Fix uninitialized symbol 'retval_off'
LoongArch: BPF: Optimize sign-extention mov instructions
LoongArch: Handle new atomic instructions for probes
LoongArch: Try VMA lock-based page fault handling first
LoongArch: Automatically disable kaslr if boot from kexec_file
LoongArch: Add crash dump support for kexec_file
LoongArch: Add ELF binary support for kexec_file
LoongArch: Add EFI binary support for kexec_file
LoongArch: Add preparatory infrastructure for kexec_file
LoongArch: Add struct loongarch_image_header for kernel
LoongArch: Allow specify SIMD width via kernel parameters
...
Rework almost all of KVM x86's exports to expose symbols only to KVM's
vendor modules, i.e. to kvm-{amd,intel}.ko. Keep the generic exports that
are guarded by CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, as they're explicitly
designed/intended for external usage.
Link: https://lore.kernel.org/r/20250919003303.1355064-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it. This
unblocks the eventual removal of the selection of CRYPTO from NFSD_V4,
which will no longer be needed by nfsd itself due to switching to the
crypto library functions. But NFSD_V4 selects RPCSEC_GSS_KRB5, which
still needs CRYPTO. It makes more sense for RPCSEC_GSS_KRB5 to select
CRYPTO itself, like most other kconfig options that need CRYPTO do.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
After setting the BTRFS_ROOT_FORCE_COW flag on the root we are doing a
full write barrier, smp_wmb(), but we don't need to, all we need is a
smp_mb__after_atomic(). The use of the smp_wmb() is from the old days
when we didn't use a bit and used instead an int field in the root to
signal if cow is forced. After the int field was changed to a bit in
the root's state (flags field), we forgot to update the memory barrier
in create_pending_snapshot() to smp_mb__after_atomic(), but we did the
change in commit_fs_roots() after clearing BTRFS_ROOT_FORCE_COW. That
happened in commit 27cdeb7096b8 ("Btrfs: use bitfield instead of integer
data type for the some variants in btrfs_root"). On the reader side, in
should_cow_block(), we also use the counterpart smp_mb__before_atomic()
which generates further confusion.
So change the smp_wmb() to smp_mb__after_atomic(). In fact we don't
even need any barrier at all since create_pending_snapshot() is called
in the critical section of a transaction commit and therefore no one
can concurrently join/attach the transaction, or start a new one, until
the transaction is unblocked. By the time someone starts a new transaction
and enters should_cow_block(), a lot of implicit memory barriers already
took place by having acquired several locks such as fs_info->trans_lock
and extent buffer locks on the root node at least. Nevertlheless, for
consistency use smp_mb__after_atomic() after setting the force cow bit
in create_pending_snapshot().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull bpf fixes from Alexei Starovoitov:
- Finish constification of 1st parameter of bpf_d_path() (Rong Tao)
- Harden userspace-supplied xdp_desc validation (Alexander Lobakin)
- Fix metadata_dst leak in __bpf_redirect_neigh_v{4,6}() (Daniel
Borkmann)
- Fix undefined behavior in {get,put}_unaligned_be32() (Eric Biggers)
- Use correct context to unpin bpf hash map with special types (KaFai
Wan)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add test for unpinning htab with internal timer struct
bpf: Avoid RCU context warning when unpinning htab with internal structs
xsk: Harden userspace-supplied xdp_desc validation
bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
libbpf: Fix undefined behavior in {get,put}_unaligned_be32()
bpf: Finish constification of 1st parameter of bpf_d_path()
Phil reported a boot failure once sheaves become used in commits
59faa4da7cd4 ("maple_tree: use percpu sheaves for maple_node_cache") and
3accabda4da1 ("mm, vma: use percpu sheaves for vm_area_struct cache"):
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Call Trace:
<TASK>
? srso_return_thunk+0x5/0x5f
? vm_area_alloc+0x1e/0x60
kmem_cache_alloc_noprof+0x4ec/0x5b0
vm_area_alloc+0x1e/0x60
create_init_stack_vma+0x26/0x210
alloc_bprm+0x139/0x200
kernel_execve+0x4a/0x140
call_usermodehelper_exec_async+0xd0/0x190
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork+0xf0/0x110
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in:
CR2: 0000000000000040
---[ end trace 0000000000000000 ]---
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---
And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such
that memory is only on 2 of them."
# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
node 1 size: 31584 MB
node 1 free: 30397 MB
node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
node 5 size: 32214 MB
node 5 free: 31625 MB
node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
node 7 size: 0 MB
node 7 free: 0 MB
Linus decoded the stacktrace to get_barn() and get_node() and determined
that kmem_cache->node[numa_mem_id()] is NULL.
The problem is due to a wrong assumption that memoryless nodes only
exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id()
points to the nearest node that has memory. SLUB has been allocating its
kmem_cache_node structures only on nodes with memory and so it does with
struct node_barn.
For kmem_cache_node, get_partial_node() checks if get_node() result is
not NULL, which I assumed was for protection from a bogus node id passed
to kmalloc_node() but apparently it's also for systems where
numa_mem_id() (used when no specific node is given) might return a
memoryless node.
Fix the sheaves code the same way by checking the result of get_node()
and bailing out if it's NULL. Note that cpus on such memoryless nodes
will have degraded sheaves performance, which can be improved later,
preferably by making numa_mem_id() work properly on such systems.
Fixes: 2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves")
Reported-and-tested-by: Phil Auld <pauld@redhat.com>
Closes: https://lore.kernel.org/all/20251010151116.GA436967@pauld.westford.csb/
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Current minimum required version of binutils is 2.30, which supports ENCLS
instruction mnemonic.
Replace the byte-wise specification of ENCLS with this proper mnemonic.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/20250616085716.158942-1-ubizjak@gmail.com
Commit 66f793099a63 ("x86/retpoline: Avoid retpolines for built-in __init
functions") disabled retpolines in __init sections (__noinitretpoline)
as a precaution against potential issues with retpolines in early boot,
but it has not been a problem in practice (i.e. see Clang below).
Commit 87358710c1fb ("x86/retpoline: Support retpoline builds with Clang")
narrowed this to only GCC, as Clang doesn't have per-function control
over retpoline emission. As such, Clang has been booting with retpolines
in __init since retpoline support was introduced.
Clang KCFI has been instrumenting __init since CFI was introduced.
With the introduction of KCFI for GCC, KCFI instrumentation with
retpolines disabled means that objtool does not construct .retpoline_sites
section entries for the non-retpoline KCFI calls. At boot, the KCFI
rehashing code, via __apply_fineibt(), misses all __init KCFI calls
(since they are not retpolines), resulting in immediate hash mismatches:
all preambles are rehashed (via .cfi_sites) and none of the __init call
sites are rehashed.
Remove __noinitretpoline since it provides no meaningful utility and
creates problems with CFI. Additionally remove __noretpoline since it
is now unused.
Alternatively, cfi_rand_callers() could walk the .kcfi_traps section which
is exactly the list of KCFI instrumentation sites. But it seems better to
have as few differences in common instruction sequences between compilers
as possible, so better to remove the special handling of retpolines in
__init for GCC.
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250904034656.3670313-6-kees@kernel.org
hba->pm_qos_mutex is used very early as a part of ufshcd_init(), so it
need to be initialized before that call. This fixes the following
warning:
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: kernel/locking/mutex.c:577 at __mutex_lock+0x268/0x894, CPU#4: kworker/u32:4/72
Modules linked in:
CPU: 4 UID: 0 PID: 72 Comm: kworker/u32:4 Not tainted 6.17.0-rc7-next-20250926+ #11223 PREEMPT
Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
Workqueue: events_unbound deferred_probe_work_func
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __mutex_lock+0x268/0x894
lr : __mutex_lock+0x268/0x894
...
Call trace:
__mutex_lock+0x268/0x894 (P)
mutex_lock_nested+0x24/0x30
ufshcd_pm_qos_update+0x30/0x78
ufshcd_setup_clocks+0x2d4/0x3c4
ufshcd_init+0x234/0x126c
ufshcd_pltfrm_init+0x62c/0x82c
ufs_qcom_probe+0x20/0x58
platform_probe+0x5c/0xac
really_probe+0xbc/0x298
__driver_probe_device+0x78/0x12c
driver_probe_device+0x40/0x164
__device_attach_driver+0xb8/0x138
bus_for_each_drv+0x80/0xdc
__device_attach+0xa8/0x1b0
device_initial_probe+0x14/0x20
bus_probe_device+0xb0/0xb4
deferred_probe_work_func+0x8c/0xc8
process_one_work+0x208/0x60c
worker_thread+0x244/0x388
kthread+0x150/0x228
ret_from_fork+0x10/0x20
irq event stamp: 57267
hardirqs last enabled at (57267): [<ffffd761485e868c>] _raw_spin_unlock_irqrestore+0x74/0x78
hardirqs last disabled at (57266): [<ffffd76147b13c44>] clk_enable_lock+0x7c/0xf0
softirqs last enabled at (56270): [<ffffd7614734446c>] handle_softirqs+0x4c4/0x4dc
softirqs last disabled at (56265): [<ffffd76147290690>] __do_softirq+0x14/0x20
---[ end trace 0000000000000000 ]---
Fixes: 79dde5f7dc7c ("scsi: ufs: core: Fix data race in CPU latency PM QoS request handling")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Message-Id: <20250929112730.3782765-1-m.szyprowski@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
As described in the old comment dating back to
commit 6610e0893b8b ("RTC: Rework RTC code to use timerqueue for events")
from 2010, we have been living with a race window when setting alarm
with an expiry in the near future (i.e. next second).
With 1 second resolution, it can happen that the second ticks after the
check for the timer having expired, but before the alarm is actually set.
When this happen, no alarm IRQ is generated, at least not with some RTC
chips (isl12022 is an example of this).
With UIE RTC timer being implemented on top of alarm irq, being re-armed
every second, UIE will occasionally fail to work, as an alarm irq lost
due to this race will stop the re-arming loop.
For now, I have limited the additional expiry check to only be done for
alarms set to next seconds. I expect it should be good enough, although I
don't know if we can now for sure that systems with loads could end up
causing the same problems for alarms set 2 seconds or even longer in the
future.
I haven't been able to reproduce the problem with this check in place.
Cc: stable@vger.kernel.org
Signed-off-by: Esben Haabendal <esben@geanix.com>
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-1-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
On m68k, check_sizetypes in headers_check reports:
./usr/include/asm/bootinfo-amiga.h:17: found __[us]{8,16,32,64} type without #include <linux/types.h>
This header file does not use any of the Linux-specific integer types,
but merely refers to them from comments, so this is a false positive.
As of commit c3a9d74ee413bdb3 ("kbuild: uapi: upgrade check_sizetypes()
warning to error"), this check was promoted to an error, breaking m68k
all{mod,yes}config builds.
Fix this by stripping simple comments before looking for Linux-specific
integer types.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/949f096337e28d50510e970ae3ba3ec9c1342ec0.1759753998.git.geert@linux-m68k.org
[nathan: Adjust comment and remove unnecessary escaping from slashes in
regex]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Commit 0ce5139fd96e ("kbuild: always create intermediate
vmlinux.unstripped") removed the pattern to avoid stripping .rela.dyn
sections added by commit e9d86b8e17e7 ("scripts: Do not strip .rela.dyn
section"). Restore it so that .rela.dyn sections remain in the final
vmlinux.
Fixes: 0ce5139fd96e ("kbuild: always create intermediate vmlinux.unstripped")
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20251008-kbuild-fix-modinfo-regressions-v1-1-9fc776c5887c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The functions wakeup_graph_entry() and wakeup_graph_return() both call
func_prolog_preempt_disable() that will test if the data->disable is
already set and if not, increment it and disable preemption. If it was
set, it returns false and the caller exits.
The caller of this function must decrement the disable counter, but misses
doing so if the calltime fails to be acquired.
Instead of exiting out when calltime is NULL, change the logic to do the
work if it is not NULL and still do the clean up at the end of the
function if it is NULL.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008114835.027b878a@gandalf.local.home
Fixes: a485ea9e3ef3 ("tracing: Fix irqsoff and wakeup latency tracers when using function graph")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull uml updates from Johannes Berg:
- minor preparations for SMP support
- SPARSE_IRQ support for kunit
- help output cleanups
* tag 'uml-for-linux-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
um: Remove unused ipi_pipe field from cpuinfo_um
um: Remove unused cpu_data and current_cpu_data macros
um: Stop tracking virtual CPUs via mm_cpumask()
um: Centralize stub size calculations
um: Remove outdated comment about STUB_DATA_PAGES
um: Remove unused offset and child_err fields from stub_data
um: Indent time-travel help messages
um: Fix help message for ssl-non-raw
um: vector: Fix indentation for help message
um: Add missing trailing newline to help messages
um: virtio-pci: implement .shutdown()
um: Support SPARSE_IRQ
1. Increase NR_CPUS to 2048.
2. Enable KEXEC_FILE options.
3. Enable some EXT4/XFS options.
4. Enable Loongson PWM/MMC/BMC/SE drivers.
5. Enable simple DRM driver.
6. Enable some Intel ethernet drivers.
7. Enable some gpio drivers.
8. Enable some network options.
9. Enable SM3/SM4 cryptographic algorithms.
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Drop the exporting of several kvm_arch_xxx() hooks that are only called
from arch-neutral code, i.e. that are only called from kvm.ko.
Link: https://lore.kernel.org/r/20250919003303.1355064-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add 'io_cache_read' to NFSD's debugfs interface so that any data
read by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=1).
io_cache_read may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_read
Add 'io_cache_write' to NFSD's debugfs interface so that any data
written by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=1).
io_cache_write may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_write
The default value for both settings is NFSD_IO_BUFFERED, which is
NFSD's existing behavior for both read and write. Changes to these
settings take immediate effect for all exports and NFS versions.
Currently only xfs and ext4 implement RWF_DONTCACHE. For file
systems that do not implement RWF_DONTCACHE, NFSD use only buffered
I/O when the io_cache setting is NFSD_IO_DONTCACHE.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.
Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.
The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull more updates from Andrew Morton:
"Just one series here - Mike Rappoport has taught KEXEC handover to
preserve vmalloc allocations across handover"
* tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt
kho: add support for preserving vmalloc allocations
kho: replace kho_preserve_phys() with kho_preserve_pages()
kho: check if kho is finalized in __kho_preserve_order()
MAINTAINERS, .mailmap: update Umang's email address
KaFai Wan says:
====================
bpf: Avoid RCU context warning when unpinning htab with internal structs
This small patchset is about avoid RCU context warning when unpinning
htab with internal structs (timer, workqueue, or task_work).
v3:
- fix nit (Yonghong Song)
- add Acked-by: Yonghong Song <yonghong.song@linux.dev>
v2:
- rename bpf_free_inode() to bpf_destroy_inode() (Andrii)
https://lore.kernel.org/all/20251007012235.755853-1-kafai.wan@linux.dev/
v1:
https://lore.kernel.org/all/20251003084528.502518-1-kafai.wan@linux.dev/
---
====================
Link: https://patch.msgid.link/20251008102628.808045-1-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull irq fixes from Borislav Petkov:
- Skip interrupt ID 0 in sifive-plic during suspend/resume because
ID 0 is reserved and accessing reserved register space could result
in undefined behavior
- Fix a function's retval check in aspeed-scu-ic
* tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume
irqchip/aspeed-scu-ic: Fix an IS_ERR() vs NULL check
This reverts commit 1a2b423be6a89dd07d5fc27ea042be68697a6a49 because we
got a regression report and need time to find out the details.
Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Closes: https://lore.kernel.org/r/29ec0082-4dd4-4120-acd2-44b35b4b9487@oss.qualcomm.com
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Pull tracing fixes from Steven Rostedt:
"The previous fix to trace_marker required updating trace_marker_raw as
well. The difference between trace_marker_raw from trace_marker is
that the raw version is for applications to write binary structures
directly into the ring buffer instead of writing ASCII strings. This
is for applications that will read the raw data from the ring buffer
and get the data structures directly. It's a bit quicker than using
the ASCII version.
Unfortunately, it appears that our test suite has several tests that
test writes to the trace_marker file, but lacks any tests to the
trace_marker_raw file (this needs to be remedied). Two issues came
about the update to the trace_marker_raw file that syzbot found:
- Fix tracing_mark_raw_write() to use per CPU buffer
The fix to use the per CPU buffer to copy from user space was
needed for both the trace_maker and trace_maker_raw file.
The fix for reading from user space into per CPU buffers properly
fixed the trace_marker write function, but the trace_marker_raw
file wasn't fixed properly. The user space data was correctly
written into the per CPU buffer, but the code that wrote into the
ring buffer still used the user space pointer and not the per CPU
buffer that had the user space data already written.
- Stop the fortify string warning from writing into trace_marker_raw
After converting the copy_from_user_nofault() into a memcpy(),
another issue appeared. As writes to the trace_marker_raw expects
binary data, the first entry is a 4 byte identifier. The entry
structure is defined as:
struct {
struct trace_entry ent;
int id;
char buf[];
};
The size of this structure is reserved on the ring buffer with:
size = sizeof(*entry) + cnt;
Then it is copied from the buffer into the ring buffer with:
memcpy(&entry->id, buf, cnt);
This use to be a copy_from_user_nofault(), but now converting it to
a memcpy() triggers the fortify-string code, and causes a warning.
The allocated space is actually more than what is copied, as the
cnt used also includes the entry->id portion. Allocating
sizeof(*entry) plus cnt is actually allocating 4 bytes more than
what is needed.
Change the size function to:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
And update the memcpy() to unsafe_memcpy()"
* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Stop fortify-string from warning in tracing_mark_raw_write()
tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
According to the PLIC specification[1], global interrupt sources are
assigned small unsigned integer identifiers beginning at the value 1.
An interrupt ID of 0 is reserved to mean "no interrupt".
The current plic_irq_resume() and plic_irq_suspend() functions incorrectly
start the loop from index 0, which accesses the register space for the
reserved interrupt ID 0.
Change the loop to start from index 1, skipping the reserved
interrupt ID 0 as per the PLIC specification.
This prevents potential undefined behavior when accessing the reserved
register space during suspend/resume cycles.
Fixes: e80f0b6a2cf3 ("irqchip/irq-sifive-plic: Add syscore callbacks for hibernation")
Co-developed-by: Jia Wang <wangjia@ultrarisc.com>
Signed-off-by: Jia Wang <wangjia@ultrarisc.com>
Co-developed-by: Charles Mirabile <cmirabil@redhat.com>
Signed-off-by: Charles Mirabile <cmirabil@redhat.com>
Signed-off-by: Lucas Zampieri <lzampier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://github.com/riscv/riscv-plic-spec/releases/tag/1.0.0
Pull RTC updates from Alexandre Belloni:
"This cycle, we have a new RTC driver, for the SpacemiT P1. The optee
driver gets alarm support. We also get a fix for a race condition that
was fairly rare unless while stress testing the alarms.
Subsystem:
- Fix race when setting alarm
- Ensure alarm irq is enabled when UIE is enabled
- remove unneeded 'fast_io' parameter in regmap_config
New driver:
- SpacemiT P1 RTC
Drivers:
- efi: Remove wakeup functionality
- optee: add alarms support
- s3c: Drop support for S3C2410
- zynqmp: Restore alarm functionality after kexec transition"
* tag 'rtc-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (29 commits)
rtc: interface: Ensure alarm irq is enabled when UIE is enabled
rtc: tps6586x: Fix initial enable_irq/disable_irq balance
rtc: cpcap: Fix initial enable_irq/disable_irq balance
rtc: isl12022: Fix initial enable_irq/disable_irq balance
rtc: interface: Fix long-standing race when setting alarm
rtc: pcf2127: fix watchdog interrupt mask on pcf2131
rtc: zynqmp: Restore alarm functionality after kexec transition
rtc: amlogic-a4: Optimize global variables
rtc: sd2405al: Add I2C address.
rtc: Kconfig: move symbols to proper section
rtc: optee: make optee_rtc_pm_ops static
rtc: optee: Fix error code in optee_rtc_read_alarm()
rtc: optee: fix error code in probe()
dt-bindings: rtc: Convert apm,xgene-rtc to DT schema
rtc: spacemit: support the SpacemiT P1 RTC
rtc: optee: add alarm related rtc ops to optee rtc driver
rtc: optee: remove unnecessary memory operations
rtc: optee: fix memory leak on driver removal
rtc: x1205: Fix Xicor X1205 vendor prefix
dt-bindings: rtc: Fix Xicor X1205 vendor prefix
...
Pull Kbuild fixes from Nathan Chancellor:
- Fix UAPI types check in headers_check.pl
- Only enable -Werror for hostprogs with CONFIG_WERROR / W=e
- Ignore fsync() error when output of gen_init_cpio is a pipe
- Several little build fixes for recent modules.builtin.modinfo series
* tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kbuild: Use '--strip-unneeded-symbol' for removing module device table symbols
s390/vmlinux.lds.S: Move .vmlinux.info to end of allocatable sections
kbuild: Add '.rel.*' strip pattern for vmlinux
kbuild: Restore pattern to avoid stripping .rela.dyn from vmlinux
gen_init_cpio: Ignore fsync() returning EINVAL on pipes
scripts/Makefile.extrawarn: Respect CONFIG_WERROR / W=e for hostprogs
kbuild: uapi: Strip comments before size type check
The way tracing_mark_raw_write() records its data is that it has the
following structure:
struct {
struct trace_entry;
int id;
char buf[];
};
But memcpy(&entry->id, buf, size) triggers the following warning when the
size is greater than the id:
------------[ cut here ]------------
memcpy: detected field-spanning write (size 6) of single field "&entry->id" at kernel/trace/trace.c:7458 (size 4)
WARNING: CPU: 7 PID: 995 at kernel/trace/trace.c:7458 write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
Modules linked in:
CPU: 7 UID: 0 PID: 995 Comm: bash Not tainted 6.17.0-test-00007-g60b82183e78a-dirty #211 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
Code: 04 00 75 a7 b9 04 00 00 00 48 89 de 48 89 04 24 48 c7 c2 e0 b1 d1 b2 48 c7 c7 40 b2 d1 b2 c6 05 2d 88 6a 04 01 e8 f7 e8 bd ff <0f> 0b 48 8b 04 24 e9 76 ff ff ff 49 8d 7c 24 04 49 8d 5c 24 08 48
RSP: 0018:ffff888104c3fc78 EFLAGS: 00010292
RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 1ffffffff6b363b4 RDI: 0000000000000001
RBP: ffff888100058a00 R08: ffffffffb041d459 R09: ffffed1020987f40
R10: 0000000000000007 R11: 0000000000000001 R12: ffff888100bb9010
R13: 0000000000000000 R14: 00000000000003e3 R15: ffff888134800000
FS: 00007fa61d286740(0000) GS:ffff888286cad000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560d28d509f1 CR3: 00000001047a4006 CR4: 0000000000172ef0
Call Trace:
<TASK>
tracing_mark_raw_write+0x1fe/0x290
? __pfx_tracing_mark_raw_write+0x10/0x10
? security_file_permission+0x50/0xf0
? rw_verify_area+0x6f/0x4b0
vfs_write+0x1d8/0xdd0
? __pfx_vfs_write+0x10/0x10
? __pfx_css_rstat_updated+0x10/0x10
? count_memcg_events+0xd9/0x410
? fdget_pos+0x53/0x5e0
ksys_write+0x182/0x200
? __pfx_ksys_write+0x10/0x10
? do_user_addr_fault+0x4af/0xa30
do_syscall_64+0x63/0x350
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fa61d318687
Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
RSP: 002b:00007ffd87fe0120 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fa61d286740 RCX: 00007fa61d318687
RDX: 0000000000000006 RSI: 0000560d28d509f0 RDI: 0000000000000001
RBP: 0000560d28d509f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000006
R13: 00007fa61d4715c0 R14: 00007fa61d46ee80 R15: 0000000000000000
</TASK>
---[ end trace 0000000000000000 ]---
This is because fortify string sees that the size of entry->id is only 4
bytes, but it is writing more than that. But this is OK as the
dynamic_array is allocated to handle that copy.
The size allocated on the ring buffer was actually a bit too big:
size = sizeof(*entry) + cnt;
But cnt includes the 'id' and the buffer data, so adding cnt to the size
of *entry actually allocates too much on the ring buffer.
Change the allocation to:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
and the memcpy() to unsafe_memcpy() with an added justification.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011112032.77be18e4@gandalf.local.home
Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
of_iomap() doesn't return error pointers, it returns NULL. Fix the error
checking to check for NULL pointers.
Fixes: 86cd4301c285 ("irqchip/aspeed-scu-ic: Refactor driver to support variant-based initialization")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull SCSI fixes from James Bottomley:
"Fixes only in drivers (ufs, mvsas, qla2xxx, target) that came in just
before or during the merge window.
The most important one is the qla2xxx which reverts a conversion to
fix flexible array member warnings, that went up in this merge window
but which turned out on further testing to be causing data corruption"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Include UTP error in INT_FATAL_ERRORS
scsi: ufs: sysfs: Make HID attributes visible
scsi: mvsas: Fix use-after-free bugs in mvs_work_queue
scsi: ufs: core: Fix PM QoS mutex initialization
scsi: ufs: core: Fix runtime suspend error deadlock
Revert "scsi: qla2xxx: Fix memcpy() field-spanning write issue"
scsi: target: target_core_configfs: Add length check to avoid buffer overflow
When setting a normal alarm, user-space is responsible for using
RTC_AIE_ON/RTC_AIE_OFF to control if alarm irq should be enabled.
But when RTC_UIE_ON is used, interrupts must be enabled so that the
requested irq events are generated.
When RTC_UIE_OFF is used, alarm irq is disabled if there are no other
alarms queued, so this commit brings symmetry to that.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-5-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
After commit 5ab23c7923a1 ("modpost: Create modalias for builtin
modules"), relocatable RISC-V kernels with CONFIG_KASAN=y start failing
when attempting to strip the module device table symbols:
riscv64-linux-objcopy: not stripping symbol `__mod_device_table__kmod_irq_starfive_jh8100_intc__of__starfive_intc_irqchip_match_table' because it is named in a relocation
make[4]: *** [scripts/Makefile.vmlinux:97: vmlinux] Error 1
The relocation appears to come from .LASANLOC5 in .data.rel.local:
$ llvm-objdump --disassemble-symbols=.LASANLOC5 --disassemble-all -r drivers/irqchip/irq-starfive-jh8100-intc.o
drivers/irqchip/irq-starfive-jh8100-intc.o: file format elf64-littleriscv
Disassembly of section .data.rel.local:
0000000000000180 <.LASANLOC5>:
...
1d0: 0000 unimp
00000000000001d0: R_RISCV_64 __mod_device_table__kmod_irq_starfive_jh8100_intc__of__starfive_intc_irqchip_match_table
...
This section appears to come from GCC for including additional
information about global variables that may be protected by KASAN.
There appears to be no way to opt out of the generation of these symbols
through either a flag or attribute. Attempting to remove '.LASANLOC*'
with '--strip-symbol' results in the same error as above because these
symbols may refer to (thus have relocation between) each other.
Avoid this build breakage by switching to '--strip-unneeded-symbol' for
removing __mod_device_table__ symbols, as it will only remove the symbol
when there is no relocation pointing to it. While this may result in a
little more bloat in the symbol table in certain configurations, it is
not as bad as outright build failures.
Fixes: 5ab23c7923a1 ("modpost: Create modalias for builtin modules")
Reported-by: Charles Mirabile <cmirabil@redhat.com>
Closes: https://lore.kernel.org/20251007011637.2512413-1-cmirabil@redhat.com/
Suggested-by: Alexey Gladkov <legion@kernel.org>
Tested-by: Nicolas Schier <nsc@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The fix to use a per CPU buffer to read user space tested only the writes
to trace_marker. But it appears that the selftests are missing tests to
the trace_maker_raw file. The trace_maker_raw file is used by applications
that writes data structures and not strings into the file, and the tools
read the raw ring buffer to process the structures it writes.
The fix that reads the per CPU buffers passes the new per CPU buffer to
the trace_marker file writes, but the update to the trace_marker_raw write
read the data from user space into the per CPU buffer, but then still used
then passed the user space address to the function that records the data.
Pass in the per CPU buffer and not the user space address.
TODO: Add a test to better test trace_marker_raw.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011035243.386098147@kernel.org
Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull more x86 updates from Borislav Petkov:
- Remove a bunch of asm implementing condition flags testing in KVM's
emulator in favor of int3_emulate_jcc() which is written in C
- Replace KVM fastops with C-based stubs which avoids problems with the
fastop infra related to latter not adhering to the C ABI due to their
special calling convention and, more importantly, bypassing compiler
control-flow integrity checking because they're written in asm
- Remove wrongly used static branches and other ugliness accumulated
over time in hyperv's hypercall implementation with a proper static
function call to the correct hypervisor call variant
- Add some fixes and modifications to allow running FRED-enabled
kernels in KVM even on non-FRED hardware
- Add kCFI improvements like validating indirect calls and prepare for
enabling kCFI with GCC. Add cmdline params documentation and other
code cleanups
- Use the single-byte 0xd6 insn as the official #UD single-byte
undefined opcode instruction as agreed upon by both x86 vendors
- Other smaller cleanups and touchups all over the place
* tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86,retpoline: Optimize patch_retpoline()
x86,ibt: Use UDB instead of 0xEA
x86/cfi: Remove __noinitretpoline and __noretpoline
x86/cfi: Add "debug" option to "cfi=" bootparam
x86/cfi: Standardize on common "CFI:" prefix for CFI reports
x86/cfi: Document the "cfi=" bootparam options
x86/traps: Clarify KCFI instruction layout
compiler_types.h: Move __nocfi out of compiler-specific header
objtool: Validate kCFI calls
x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y
x86/fred: Play nice with invoking asm_fred_entry_from_kvm() on non-FRED hardware
x86/fred: Install system vector handlers even if FRED isn't fully enabled
x86/hyperv: Use direct call to hypercall-page
x86/hyperv: Clean up hv_do_hypercall()
KVM: x86: Remove fastops
KVM: x86: Convert em_salc() to C
KVM: x86: Introduce EM_ASM_3WCL
KVM: x86: Introduce EM_ASM_1SRC2
KVM: x86: Introduce EM_ASM_2CL
KVM: x86: Introduce EM_ASM_2W
...
When a UTP error occurs in isolation, UFS is not currently recoverable.
This is because the UTP error is not considered fatal in the error
handling code, leading to either an I/O timeout or an OCS error.
Add the UTP error flag to INT_FATAL_ERRORS so the controller will be
reset in this situation.
sd 0:0:0:0: [sda] tag#38 UNKNOWN(0x2003) Result: hostbyte=0x07
driverbyte=DRIVER_OK cmd_age=0s
sd 0:0:0:0: [sda] tag#38 CDB: opcode=0x28 28 00 00 51 24 e2 00 00 08 00
I/O error, dev sda, sector 42542864 op 0x0:(READ) flags 0x80700 phys_seg
8 prio class 2
OCS error from controller = 9 for tag 39
pa_err[1] = 0x80000010 at 2667224756 us
pa_err: total cnt=2
dl_err[0] = 0x80000002 at 2667148060 us
dl_err[1] = 0x80002000 at 2667282844 us
No record of nl_err
No record of tl_err
No record of dme_err
No record of auto_hibern8_err
fatal_err[0] = 0x804 at 2667282836 us
---------------------------------------------------
REGISTER
---------------------------------------------------
NAME OFFSET VALUE
STD HCI SFR 0xfffffff0 0x0
AHIT 0x18 0x814
INTERRUPT STATUS 0x20 0x1000
INTERRUPT ENABLE 0x24 0x70ef5
[mkp: commit desc]
Signed-off-by: Hoyoung Seo <hy50.seo@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Message-Id: <20250930061428.617955-1-hy50.seo@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Interrupts are automatically enabled when requested, so we need to
initialize irq_en accordingly to avoid causing an unbalanced enable
warning.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-4-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
This is a series to address some problems that were exposed by the
recent modules.builtin.modinfo series that landed in commit c7d3dd9163e6
("Merge patch series "Add generated modalias to
modules.builtin.modinfo"").
The third patch is not directly related to the aforementioned series, as
the warning it fixes happens prior to the series but commit 8d18ef04f940
("s390: vmlinux.lds.S: Reorder sections") from the series creates
conflicts in this area, so I included it here.
Link: https://patch.msgid.link/20251008-kbuild-fix-modinfo-regressions-v1-0-9fc776c5887c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
It was reported that using __copy_from_user_inatomic() can actually
schedule. Which is bad when preemption is disabled. Even though there's
logic to check in_atomic() is set, but this is a nop when the kernel is
configured with PREEMPT_NONE. This is due to page faulting and the code
could schedule with preemption disabled.
Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/
The solution was to change the __copy_from_user_inatomic() to
copy_from_user_nofault(). But then it was reported that this caused a
regression in Android. There's several applications writing into
trace_marker() in Android, but now instead of showing the expected data,
it is showing:
tracing_mark_write: <faulted>
After reverting the conversion to copy_from_user_nofault(), Android was
able to get the data again.
Writes to the trace_marker is a way to efficiently and quickly enter data
into the Linux tracing buffer. It takes no locks and was designed to be as
non-intrusive as possible. This means it cannot allocate memory, and must
use pre-allocated data.
A method that is actively being worked on to have faultable system call
tracepoints read user space data is to allocate per CPU buffers, and use
them in the callback. The method uses a technique similar to seqcount.
That is something like this:
preempt_disable();
cpu = smp_processor_id();
buffer = this_cpu_ptr(&pre_allocated_cpu_buffers, cpu);
do {
cnt = nr_context_switches_cpu(cpu);
migrate_disable();
preempt_enable();
ret = copy_from_user(buffer, ptr, size);
preempt_disable();
migrate_enable();
} while (!ret && cnt != nr_context_switches_cpu(cpu));
if (!ret)
ring_buffer_write(buffer);
preempt_enable();
It's a little more involved than that, but the above is the basic logic.
The idea is to acquire the current CPU buffer, disable migration, and then
enable preemption. At this moment, it can safely use copy_from_user().
After reading the data from user space, it disables preemption again. It
then checks to see if there was any new scheduling on this CPU. If there
was, it must assume that the buffer was corrupted by another task. If
there wasn't, then the buffer is still valid as only tasks in preemptable
context can write to this buffer and only those that are running on the
CPU.
By using this method, where trace_marker open allocates the per CPU
buffers, trace_marker writes can access user space and even fault it in,
without having to allocate or take any locks of its own.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Wattson CI <wattson-external@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home
Fixes: 3d62ab32df065 ("tracing: Fix tracing_marker may trigger page fault during preempt_disable")
Reported-by: Runping Lai <runpinglai@google.com>
Tested-by: Runping Lai <runpinglai@google.com>
Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull nfsd updates from Chuck Lever:
"Mike Snitzer has prototyped a mechanism for disabling I/O caching in
NFSD. This is introduced in v6.18 as an experimental feature. This
enables scaling NFSD in /both/ directions:
- NFS service can be supported on systems with small memory
footprints, such as low-cost cloud instances
- Large NFS workloads will be less likely to force the eviction of
server-local activity, helping it avoid thrashing
Jeff Layton contributed a number of fixes to the new attribute
delegation implementation (based on a pending Internet RFC) that we
hope will make attribute delegation reliable enough to enable by
default, as it is on the Linux NFS client.
The remaining patches in this pull request are clean-ups and minor
optimizations. Many thanks to the contributors, reviewers, testers,
and bug reporters who participated during the v6.18 NFSD development
cycle"
* tag 'nfsd-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (42 commits)
nfsd: discard nfserr_dropit
SUNRPC: Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it
NFSD: Add io_cache_{read,write} controls to debugfs
NFSD: Do the grace period check in ->proc_layoutget
nfsd: delete unnecessary NULL check in __fh_verify()
NFSD: Allow layoutcommit during grace period
NFSD: Disallow layoutget during grace period
sunrpc: fix "occurence"->"occurrence"
nfsd: Don't force CRYPTO_LIB_SHA256 to be built-in
nfsd: nfserr_jukebox in nlm_fopen should lead to a retry
NFSD: Reduce DRC bucket size
NFSD: Delay adding new entries to LRU
SUNRPC: Move the svc_rpcb_cleanup() call sites
NFS: Remove rpcbind cleanup for NFSv4.0 callback
nfsd: unregister with rpcbind when deleting a transport
NFSD: Drop redundant conversion to bool
sunrpc: eliminate return pointer in svc_tcp_sendmsg()
sunrpc: fix pr_notice in svc_tcp_sendto() to show correct length
nfsd: decouple the xprtsec policy check from check_nfsd_access()
NFSD: Fix destination buffer size in nfsd4_ssc_setup_dul()
...
There is an instance of -Wformat when targeting 32-bit architectures due
to using a 'size_t' specifier (which is 'unsigned int' for 32-bit
platforms) to print PAGE_SIZE:
In file included from fs/btrfs/compression.h:17,
from fs/btrfs/extent_io.h:15,
from fs/btrfs/locking.h:13,
from fs/btrfs/ctree.h:19,
from fs/btrfs/disk-io.c:22:
fs/btrfs/disk-io.c: In function 'open_ctree':
include/linux/kern_levels.h:5:25: error: format '%zu' expects argument of type 'size_t', but argument 4 has type 'long unsigned int' [-Werror=format=]
...
fs/btrfs/disk-io.c:3398:17: note: in expansion of macro 'btrfs_warn'
3398 | btrfs_warn(fs_info,
| ^~~~~~~~~~
PAGE_SIZE is consistently defined as an 'unsigned long' in
include/vsdo/page.h so use '%lu' to clear up the warning.
Fixes: 98077f7f2180 ("btrfs: enable experimental bs > ps support")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull x86 cleanups from Borislav Petkov:
- Simplify inline asm flag output operands now that the minimum
compiler version supports the =@ccCOND syntax
- Remove a bunch of AS_* Kconfig symbols which detect assembler support
for various instruction mnemonics now that the minimum assembler
version supports them all
- The usual cleanups all over the place
* tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Remove code depending on __GCC_ASM_FLAG_OUTPUTS__
x86/sgx: Use ENCLS mnemonic in <kernel/cpu/sgx/encls.h>
x86/mtrr: Remove license boilerplate text with bad FSF address
x86/asm: Use RDPKRU and WRPKRU mnemonics in <asm/special_insns.h>
x86/idle: Use MONITORX and MWAITX mnemonics in <asm/mwait.h>
x86/entry/fred: Push __KERNEL_CS directly
x86/kconfig: Remove CONFIG_AS_AVX512
crypto: x86 - Remove CONFIG_AS_VPCLMULQDQ
crypto: X86 - Remove CONFIG_AS_VAES
crypto: x86 - Remove CONFIG_AS_GFNI
x86/kconfig: Drop unused and needless config X86_64_SMP
Currently the very common retpoline: "CS CALL __x86_indirect_thunk_r11"
is transformed into "CALL *R11; NOP3" for eIBRS/BHI_NO parts.
Similarly, paranoid fineibt has: "CALL *R11; NOP".
Recognise that CS stuffing can avoid the extra NOP. However, due to
prefix decode penalties, make sure to not emit too many CS prefixes.
Notably: "CS CALL __x86_indirect_thunk_rax" must not become "CS CS CS
CS CALL *RAX". Prefix decode penalties are typically many more cycles
than decoding an extra NOP.
Additionally, if the retpoline is a tail-call, the "JMP *%\reg" should
be followed by INT3 for straight-line-speculation mitigation, since
emit_indirect() now has a length argument, move this into
emit_indirect() such that other users (paranoid-fineibt) also do this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250902104627.GM4068168@noisy.programming.kicks-ass.net
Call sysfs_update_group() after reading the device descriptor to ensure
the HID sysfs attributes are visible when the feature is supported.
Fixes: ae7795a8c258 ("scsi: ufs: core: Add HID support")
Signed-off-by: Daniel Lee <chullee@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Interrupts are automatically enabled when requested, so we need to
initialize alarm_enabled accordingly to avoid causing an unbalanced enable
warning.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-3-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The reproducer:
echo | ./usr/gen_init_cpio /dev/stdin > /dev/null
fsync() on a pipe fd returns -EINVAL, which makes gen_init_cpio fail.
Ignore -EINVAL from fsync().
Fixes: ae18b94099b0 ("gen_init_cpio: support -o <output_file> parameter")
Cc: David Disseldorp <ddiss@suse.de>
Cc: Nicolas Schier <nsc@kernel.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Link: https://patch.msgid.link/20251007-gen_init_cpio-pipe-v2-1-b098ab94b58a@arista.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
When building s390 defconfig with binutils older than 2.32, there are
several warnings during the final linking stage:
s390-linux-ld: .tmp_vmlinux1: warning: allocated section `.got.plt' not in segment
s390-linux-ld: .tmp_vmlinux2: warning: allocated section `.got.plt' not in segment
s390-linux-ld: vmlinux.unstripped: warning: allocated section `.got.plt' not in segment
s390-linux-objcopy: vmlinux: warning: allocated section `.got.plt' not in segment
s390-linux-objcopy: st7afZyb: warning: allocated section `.got.plt' not in segment
binutils commit afca762f598 ("S/390: Improve partial relro support for
64 bit") [1] in 2.32 changed where .got.plt is emitted, avoiding the
warning.
The :NONE in the .vmlinux.info output section description changes the
segment for subsequent allocated sections. Move .vmlinux.info right
above the discards section to place all other sections in the previously
defined segment, .data.
Fixes: 30226853d6ec ("s390: vmlinux.lds.S: explicitly handle '.got' and '.plt' sections")
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=afca762f598d453c563f244cd3777715b1a0cb72 [1]
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20251008-kbuild-fix-modinfo-regressions-v1-3-9fc776c5887c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The return value from `__rb_map_vma()`, which rejects writable or
executable mappings (VM_WRITE, VM_EXEC, or !VM_MAYSHARE), was being
ignored. As a result the caller of `__rb_map_vma` always returned 0
even when the mapping had actually failed, allowing it to proceed
with an invalid VMA.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008172516.20697-1-ankitkhushwaha.linux@gmail.com
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Reported-by: syzbot+ddc001b92c083dbf2b97@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=194151be8eaebd826005329b2e123aecae714bdb
Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull x86 kvm updates from Paolo Bonzini:
"Generic:
- Rework almost all of KVM's exports to expose symbols only to KVM's
x86 vendor modules (kvm-{amd,intel}.ko and PPC's kvm-{pr,hv}.ko
x86:
- Rework almost all of KVM x86's exports to expose symbols only to
KVM's vendor modules, i.e. to kvm-{amd,intel}.ko
- Add support for virtualizing Control-flow Enforcement Technology
(CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD
(Shadow Stacks).
It is worth noting that while SHSTK and IBT can be enabled
separately in CPUID, it is not really possible to virtualize them
separately. Therefore, Intel processors will really allow both
SHSTK and IBT under the hood if either is made visible in the
guest's CPUID. The alternative would be to intercept
XSAVES/XRSTORS, which is not feasible for performance reasons
- Fix a variety of fuzzing WARNs all caused by checking L1 intercepts
when completing userspace I/O. KVM has already committed to
allowing L2 to to perform I/O at that point
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the
MSR is supposed to exist for v2 PMUs
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside
of forced emulation and other contrived testing scenarios)
- Clean up the MSR APIs in preparation for CET and FRED
virtualization, as well as mediated vPMU support
- Clean up a pile of PMU code in anticipation of adding support for
mediated vPMUs
- Reject in-kernel IOAPIC/PIT for TDX VMs, as KVM can't obtain EOI
vmexits needed to faithfully emulate an I/O APIC for such guests
- Many cleanups and minor fixes
- Recover possible NX huge pages within the TDP MMU under read lock
to reduce guest jitter when restoring NX huge pages
- Return -EAGAIN during prefault if userspace concurrently
deletes/moves the relevant memslot, to fix an issue where
prefaulting could deadlock with the memslot update
x86 (AMD):
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is
supported
- Require a minimum GHCB version of 2 when starting SEV-SNP guests
via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate
errors instead of latent guest failures
- Add support for SEV-SNP's CipherText Hiding, an opt-in feature that
prevents unauthorized CPU accesses from reading the ciphertext of
SNP guest private memory, e.g. to attempt an offline attack. This
feature splits the shared SEV-ES/SEV-SNP ASID space into separate
ranges for SEV-ES and SEV-SNP guests, therefore a new module
parameter is needed to control the number of ASIDs that can be used
for VMs with CipherText Hiding vs. how many can be used to run
SEV-ES guests
- Add support for Secure TSC for SEV-SNP guests, which prevents the
untrusted host from tampering with the guest's TSC frequency, while
still allowing the the VMM to configure the guest's TSC frequency
prior to launch
- Validate the XCR0 provided by the guest (via the GHCB) to avoid
bugs resulting from bogus XCR0 values
- Save an SEV guest's policy if and only if LAUNCH_START fully
succeeds to avoid leaving behind stale state (thankfully not
consumed in KVM)
- Explicitly reject non-positive effective lengths during SNP's
LAUNCH_UPDATE instead of subtly relying on guest_memfd to deal with
them
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the
host's desired TSC_AUX, to fix a bug where KVM was keeping a
different vCPU's TSC_AUX in the host MSR until return to userspace
KVM (Intel):
- Preparation for FRED support
- Don't retry in TDX's anti-zero-step mitigation if the target
memslot is invalid, i.e. is being deleted or moved, to fix a
deadlock scenario similar to the aforementioned prefaulting case
- Misc bugfixes and minor cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (142 commits)
KVM: x86: Export KVM-internal symbols for sub-modules only
KVM: x86: Drop pointless exports of kvm_arch_xxx() hooks
KVM: x86: Move kvm_intr_is_single_vcpu() to lapic.c
KVM: Export KVM-internal symbols for sub-modules only
KVM: s390/vfio-ap: Use kvm_is_gpa_in_memslot() instead of open coded equivalent
KVM: VMX: Make CR4.CET a guest owned bit
KVM: selftests: Verify MSRs are (not) in save/restore list when (un)supported
KVM: selftests: Add coverage for KVM-defined registers in MSRs test
KVM: selftests: Add KVM_{G,S}ET_ONE_REG coverage to MSRs test
KVM: selftests: Extend MSRs test to validate vCPUs without supported features
KVM: selftests: Add support for MSR_IA32_{S,U}_CET to MSRs test
KVM: selftests: Add an MSR test to exercise guest/host and read/write
KVM: x86: Define AMD's #HV, #VC, and #SX exception vectors
KVM: x86: Define Control Protection Exception (#CP) vector
KVM: x86: Add human friendly formatting for #XM, and #VE
KVM: SVM: Enable shadow stack virtualization for SVM
KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's valid
KVM: SVM: Pass through shadow stack MSRs as appropriate
KVM: SVM: Update dump_vmcb with shadow stack save area additions
KVM: nSVM: Save/load CET Shadow Stack state to/from vmcb12/vmcb02
...
The function btrfs_encode_fh() does not properly account for the three
cases it handles.
Before writing to the file handle (fh), the function only returns to the
user BTRFS_FID_SIZE_NON_CONNECTABLE (5 dwords, 20 bytes) or
BTRFS_FID_SIZE_CONNECTABLE (8 dwords, 32 bytes).
However, when a parent exists and the root ID of the parent and the
inode are different, the function writes BTRFS_FID_SIZE_CONNECTABLE_ROOT
(10 dwords, 40 bytes).
If *max_len is not large enough, this write goes out of bounds because
BTRFS_FID_SIZE_CONNECTABLE_ROOT is greater than
BTRFS_FID_SIZE_CONNECTABLE originally returned.
This results in an 8-byte out-of-bounds write at
fid->parent_root_objectid = parent_root_id.
A previous attempt to fix this issue was made but was lost.
https://lore.kernel.org/all/4CADAEEC020000780001B32C@vpn.id2.novell.com/
Although this issue does not seem to be easily triggerable, it is a
potential memory corruption bug that should be fixed. This patch
resolves the issue by ensuring the function returns the appropriate size
for all three cases and validates that *max_len is large enough before
writing any data.
Fixes: be6e8dc0ba84 ("NFS support for btrfs - v3")
CC: stable@vger.kernel.org # 3.0+
Signed-off-by: Anderson Nascimento <anderson@allelesecurity.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The minimum supported GCC version is 8.1, which supports flag output operands
and always defines __GCC_ASM_FLAG_OUTPUTS__ macro.
Remove code depending on __GCC_ASM_FLAG_OUTPUTS__ and use the "=@ccCOND" flag
output operand directly.
Use the equivalent "=@ccz" instead of "=@cce" flag output operand for
CMPXCHG8B and CMPXCHG16B instructions. These instructions set a single flag
bit - the Zero flag - and "=@ccz" is used to distinguish the CC user from
comparison instructions, where set ZERO flag indeed means that the values are
equal.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250905121723.GCaLrU04lP2A50PT-B@fat_crate.local
A while ago [0] FineIBT started using the 0xEA instruction to raise #UD.
All existing parts will generate #UD in 64bit mode on that instruction.
However; Intel/AMD have not blessed using this instruction, it is on
their 'reserved' opcode list for future use.
Peter Anvin worked the committees and got use of 0xD6 blessed, it
shall be called UDB (per the next SDM or so), and it being a single
byte instruction is easy to slip into a single byte immediate -- as
is done by this very patch.
Reworking the FineIBT code to use UDB wasn't entirely trivial. Notably
the FineIBT-BHI1 case ran out of bytes. In order to condense the
encoding some it was required to move the hash register from R10D to
EAX (thanks hpa!).
Per the x86_64 ABI, RAX is used to pass the number of vector registers
for vararg function calls -- something that should not happen in the
kernel. More so, the kernel is built with -mskip-rax-setup, which
should leave RAX completely unused, allowing its re-use.
[ For BPF; while the bpf2bpf tail-call uses RAX in its calling
convention, that does not use CFI and is unaffected. Only the
'regular' C->BPF transition is covered by CFI. ]
The ENDBR poison value is changed from 'OSP NOP3' to 'NOPL -42(%RAX)',
this is basically NOP4 but with UDB as its immediate. As such it is
still a non-standard NOP value unique to prior ENDBR sites, but now
also provides UDB.
Per Agner Fog's optimization guide, Jcc is assumed not-taken. That is,
the expected path should be the fallthrough case for improved
throughput.
Since the preamble now relies on the ENDBR poison to provide UDB, the
code is changed to write the poison right along with the initial
preamble -- this is possible because the ITS mitigation already
disabled IBT over rewriting the CFI scheme.
The scheme in detail:
Preamble:
FineIBT FineIBT-BHI1 FineIBT-BHI
__cfi_\func: __cfi_\func: __cfi_\func:
endbr endbr endbr
subl $0x12345678, %eax subl $0x12345678, %eax subl $0x12345678, %eax
jne.d32,np \func+3 cmovne %rax, %rdi cs cs call __bhi_args_N
jne.d8,np \func+3
\func: \func: \func:
nopl -42(%rax) nopl -42(%rax) nopl -42(%rax)
Notably there are 7 bytes available after the SUBL; this enables the
BHI1 case to fit without the nasty overlapping case it had previously.
The !BHI case uses Jcc.d32,np to consume all 7 bytes without the need
for an additional NOP, while the BHI case uses CS padding to align the
CALL with the end of the preamble such that it returns to \func+0.
Caller:
FineIBT Paranoid-FineIBT
fineibt_caller: fineibt_caller:
mov $0x12345678, %eax mov $0x12345678, %eax
lea -10(%r11), %r11 cmp -0x11(%r11), %eax
nop5 cs lea -0x10(%r11), %r11
retpoline: retpoline:
cs call __x86_indirect_thunk_r11 jne fineibt_caller+0xd
call *%r11
nop
Notably this is before apply_retpolines() which will fix up the
retpoline call -- since all parts with IBT also have eIBRS (lets
ignore ITS). Typically the retpoline site is rewritten (when still
intact) into:
call *%r11
nop3
[0] 06926c6cdb95 ("x86/ibt: Optimize the FineIBT instruction sequence")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250901191307.GI4067720@noisy.programming.kicks-ass.net
During the detaching of Marvell's SAS/SATA controller, the original code
calls cancel_delayed_work() in mvs_free() to cancel the delayed work
item mwq->work_q. However, if mwq->work_q is already running, the
cancel_delayed_work() may fail to cancel it. This can lead to
use-after-free scenarios where mvs_free() frees the mvs_info while
mvs_work_queue() is still executing and attempts to access the
already-freed mvs_info.
A typical race condition is illustrated below:
CPU 0 (remove) | CPU 1 (delayed work callback)
mvs_pci_remove() |
mvs_free() | mvs_work_queue()
cancel_delayed_work() |
kfree(mvi) |
| mvi-> // UAF
Replace cancel_delayed_work() with cancel_delayed_work_sync() to ensure
that the delayed work item is properly canceled and any executing
delayed work item completes before the mvs_info is deallocated.
This bug was found by static analysis.
Fixes: 20b09c2992fe ("[SCSI] mvsas: add support for 94xx; layout change; bug fixes")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Interrupts are automatically enabled when requested, so we need to
initialize irq_enabled accordingly to avoid causing an unbalanced enable
warning.
Fixes: c62d658e5253 ("rtc: isl12022: Add alarm support")
Signed-off-by: Esben Haabendal <esben@geanix.com>
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-2-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Commit 27758d8c2583 ("kbuild: enable -Werror for hostprogs")
unconditionally enabled -Werror for the compiler, assembler, and linker
when building the host programs, as the build footprint of the host
programs is small (thus risk of build failures from warnings are low)
and that stage of the build may not have Kconfig values (thus
CONFIG_WERROR could not be used as a precondition).
While turning warnings into errors unconditionally happens in a few
places within the kernel, it can be disruptive to people who may be
building with newer compilers, such as while doing a bisect. While it is
possible to avoid this behavior by passing HOSTCFLAGS=-w or
HOSTCFLAGS=-Wno-error, it may not be the most intuitive for regular
users not intimately familiar with Kbuild.
Avoid being disruptive to the entire build by depending on the explicit
opt-in of CONFIG_WERROR or W=e to enable -Werror and the like while
building the host programs. While this means there is a small portion of
the build that does not have -Werror enabled (namely scripts/kconfig/*
and scripts/basic/fixdep), it is better than not having it altogether.
Fixes: 27758d8c2583 ("kbuild: enable -Werror for hostprogs")
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Reported-by: Askar Safin <safinaskar@gmail.com>
Closes: https://lore.kernel.org/20251005011100.1035272-1-safinaskar@gmail.com/
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Tested-by: Miguel Ojeda <ojeda@kernel.org> # Rust
Link: https://patch.msgid.link/20251006-kbuild-hostprogs-werror-fix-v1-1-23cf1ffced5c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Prior to binutils commit c12d9fa2afe ("Support objcopy
--remove-section=.relaFOO") [1] in 2.32, stripping relocation sections
required the trailing period (i.e., '.rel.*') to work properly.
After commit 3e86e4d74c04 ("kbuild: keep .modinfo section in
vmlinux.unstripped"), there is an error with binutils 2.31.1 or earlier
because these sections are not properly removed:
s390-linux-objcopy: st6tO8Ev: symbol `.modinfo' required but not present
s390-linux-objcopy:st6tO8Ev: no symbols
Add the old pattern to resolve this issue (along with a comment to allow
cleaning this when binutils 2.32 or newer is the minimum supported
version). While the aforementioned kbuild change exposes this, the
pattern was originally changed by commit 71d815bf5dfd ("kbuild: Strip
runtime const RELA sections correctly"), where it would still be
incorrect with binutils older than 2.32.
Fixes: 71d815bf5dfd ("kbuild: Strip runtime const RELA sections correctly")
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=c12d9fa2afe7abcbe407a00e15719e1a1350c2a7 [1]
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/CA+G9fYvVktRhFtZXdNgVOL8j+ArsJDpvMLgCitaQvQmCx=hwOQ@mail.gmail.com/
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20251008-kbuild-fix-modinfo-regressions-v1-2-9fc776c5887c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The functions irqsoff_graph_entry() and irqsoff_graph_return() both call
func_prolog_dec() that will test if the data->disable is already set and
if not, increment it and return. If it was set, it returns false and the
caller exits.
The caller of this function must decrement the disable counter, but misses
doing so if the calltime fails to be acquired.
Instead of exiting out when calltime is NULL, change the logic to do the
work if it is not NULL and still do the clean up at the end of the
function if it is NULL.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008114943.6f60f30f@gandalf.local.home
Fixes: a485ea9e3ef3 ("tracing: Fix irqsoff and wakeup latency tracers when using function graph")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-2-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull LoongArch updates from Huacai Chen:
- Initialize acpi_gbl_use_global_lock to false
- Allow specify SIMD width via kernel parameters
- Add kexec_file (both EFI & ELF format) support
- Add PER_VMA_LOCK for page fault handling support
- Improve BPF trampoline support
- Update the default config file
- Some bug fixes and other small changes
* tag 'loongarch-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: (23 commits)
LoongArch: Update Loongson-3 default config file
LoongArch: BPF: Sign-extend struct ops return values properly
LoongArch: BPF: Make error handling robust in arch_prepare_bpf_trampoline()
LoongArch: BPF: Make trampoline size stable
LoongArch: BPF: Don't align trampoline size
LoongArch: BPF: No support of struct argument in trampoline programs
LoongArch: BPF: No text_poke() for kernel text
LoongArch: BPF: Remove duplicated bpf_flush_icache()
LoongArch: BPF: Remove duplicated flags check
LoongArch: BPF: Fix uninitialized symbol 'retval_off'
LoongArch: BPF: Optimize sign-extention mov instructions
LoongArch: Handle new atomic instructions for probes
LoongArch: Try VMA lock-based page fault handling first
LoongArch: Automatically disable kaslr if boot from kexec_file
LoongArch: Add crash dump support for kexec_file
LoongArch: Add ELF binary support for kexec_file
LoongArch: Add EFI binary support for kexec_file
LoongArch: Add preparatory infrastructure for kexec_file
LoongArch: Add struct loongarch_image_header for kernel
LoongArch: Allow specify SIMD width via kernel parameters
...
Rework almost all of KVM x86's exports to expose symbols only to KVM's
vendor modules, i.e. to kvm-{amd,intel}.ko. Keep the generic exports that
are guarded by CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, as they're explicitly
designed/intended for external usage.
Link: https://lore.kernel.org/r/20250919003303.1355064-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it. This
unblocks the eventual removal of the selection of CRYPTO from NFSD_V4,
which will no longer be needed by nfsd itself due to switching to the
crypto library functions. But NFSD_V4 selects RPCSEC_GSS_KRB5, which
still needs CRYPTO. It makes more sense for RPCSEC_GSS_KRB5 to select
CRYPTO itself, like most other kconfig options that need CRYPTO do.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
After setting the BTRFS_ROOT_FORCE_COW flag on the root we are doing a
full write barrier, smp_wmb(), but we don't need to, all we need is a
smp_mb__after_atomic(). The use of the smp_wmb() is from the old days
when we didn't use a bit and used instead an int field in the root to
signal if cow is forced. After the int field was changed to a bit in
the root's state (flags field), we forgot to update the memory barrier
in create_pending_snapshot() to smp_mb__after_atomic(), but we did the
change in commit_fs_roots() after clearing BTRFS_ROOT_FORCE_COW. That
happened in commit 27cdeb7096b8 ("Btrfs: use bitfield instead of integer
data type for the some variants in btrfs_root"). On the reader side, in
should_cow_block(), we also use the counterpart smp_mb__before_atomic()
which generates further confusion.
So change the smp_wmb() to smp_mb__after_atomic(). In fact we don't
even need any barrier at all since create_pending_snapshot() is called
in the critical section of a transaction commit and therefore no one
can concurrently join/attach the transaction, or start a new one, until
the transaction is unblocked. By the time someone starts a new transaction
and enters should_cow_block(), a lot of implicit memory barriers already
took place by having acquired several locks such as fs_info->trans_lock
and extent buffer locks on the root node at least. Nevertlheless, for
consistency use smp_mb__after_atomic() after setting the force cow bit
in create_pending_snapshot().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull bpf fixes from Alexei Starovoitov:
- Finish constification of 1st parameter of bpf_d_path() (Rong Tao)
- Harden userspace-supplied xdp_desc validation (Alexander Lobakin)
- Fix metadata_dst leak in __bpf_redirect_neigh_v{4,6}() (Daniel
Borkmann)
- Fix undefined behavior in {get,put}_unaligned_be32() (Eric Biggers)
- Use correct context to unpin bpf hash map with special types (KaFai
Wan)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add test for unpinning htab with internal timer struct
bpf: Avoid RCU context warning when unpinning htab with internal structs
xsk: Harden userspace-supplied xdp_desc validation
bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
libbpf: Fix undefined behavior in {get,put}_unaligned_be32()
bpf: Finish constification of 1st parameter of bpf_d_path()
Phil reported a boot failure once sheaves become used in commits
59faa4da7cd4 ("maple_tree: use percpu sheaves for maple_node_cache") and
3accabda4da1 ("mm, vma: use percpu sheaves for vm_area_struct cache"):
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Call Trace:
<TASK>
? srso_return_thunk+0x5/0x5f
? vm_area_alloc+0x1e/0x60
kmem_cache_alloc_noprof+0x4ec/0x5b0
vm_area_alloc+0x1e/0x60
create_init_stack_vma+0x26/0x210
alloc_bprm+0x139/0x200
kernel_execve+0x4a/0x140
call_usermodehelper_exec_async+0xd0/0x190
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork+0xf0/0x110
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in:
CR2: 0000000000000040
---[ end trace 0000000000000000 ]---
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---
And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such
that memory is only on 2 of them."
# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
node 1 size: 31584 MB
node 1 free: 30397 MB
node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
node 5 size: 32214 MB
node 5 free: 31625 MB
node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
node 7 size: 0 MB
node 7 free: 0 MB
Linus decoded the stacktrace to get_barn() and get_node() and determined
that kmem_cache->node[numa_mem_id()] is NULL.
The problem is due to a wrong assumption that memoryless nodes only
exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id()
points to the nearest node that has memory. SLUB has been allocating its
kmem_cache_node structures only on nodes with memory and so it does with
struct node_barn.
For kmem_cache_node, get_partial_node() checks if get_node() result is
not NULL, which I assumed was for protection from a bogus node id passed
to kmalloc_node() but apparently it's also for systems where
numa_mem_id() (used when no specific node is given) might return a
memoryless node.
Fix the sheaves code the same way by checking the result of get_node()
and bailing out if it's NULL. Note that cpus on such memoryless nodes
will have degraded sheaves performance, which can be improved later,
preferably by making numa_mem_id() work properly on such systems.
Fixes: 2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves")
Reported-and-tested-by: Phil Auld <pauld@redhat.com>
Closes: https://lore.kernel.org/all/20251010151116.GA436967@pauld.westford.csb/
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Current minimum required version of binutils is 2.30, which supports ENCLS
instruction mnemonic.
Replace the byte-wise specification of ENCLS with this proper mnemonic.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/20250616085716.158942-1-ubizjak@gmail.com
Commit 66f793099a63 ("x86/retpoline: Avoid retpolines for built-in __init
functions") disabled retpolines in __init sections (__noinitretpoline)
as a precaution against potential issues with retpolines in early boot,
but it has not been a problem in practice (i.e. see Clang below).
Commit 87358710c1fb ("x86/retpoline: Support retpoline builds with Clang")
narrowed this to only GCC, as Clang doesn't have per-function control
over retpoline emission. As such, Clang has been booting with retpolines
in __init since retpoline support was introduced.
Clang KCFI has been instrumenting __init since CFI was introduced.
With the introduction of KCFI for GCC, KCFI instrumentation with
retpolines disabled means that objtool does not construct .retpoline_sites
section entries for the non-retpoline KCFI calls. At boot, the KCFI
rehashing code, via __apply_fineibt(), misses all __init KCFI calls
(since they are not retpolines), resulting in immediate hash mismatches:
all preambles are rehashed (via .cfi_sites) and none of the __init call
sites are rehashed.
Remove __noinitretpoline since it provides no meaningful utility and
creates problems with CFI. Additionally remove __noretpoline since it
is now unused.
Alternatively, cfi_rand_callers() could walk the .kcfi_traps section which
is exactly the list of KCFI instrumentation sites. But it seems better to
have as few differences in common instruction sequences between compilers
as possible, so better to remove the special handling of retpolines in
__init for GCC.
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250904034656.3670313-6-kees@kernel.org
hba->pm_qos_mutex is used very early as a part of ufshcd_init(), so it
need to be initialized before that call. This fixes the following
warning:
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: kernel/locking/mutex.c:577 at __mutex_lock+0x268/0x894, CPU#4: kworker/u32:4/72
Modules linked in:
CPU: 4 UID: 0 PID: 72 Comm: kworker/u32:4 Not tainted 6.17.0-rc7-next-20250926+ #11223 PREEMPT
Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
Workqueue: events_unbound deferred_probe_work_func
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __mutex_lock+0x268/0x894
lr : __mutex_lock+0x268/0x894
...
Call trace:
__mutex_lock+0x268/0x894 (P)
mutex_lock_nested+0x24/0x30
ufshcd_pm_qos_update+0x30/0x78
ufshcd_setup_clocks+0x2d4/0x3c4
ufshcd_init+0x234/0x126c
ufshcd_pltfrm_init+0x62c/0x82c
ufs_qcom_probe+0x20/0x58
platform_probe+0x5c/0xac
really_probe+0xbc/0x298
__driver_probe_device+0x78/0x12c
driver_probe_device+0x40/0x164
__device_attach_driver+0xb8/0x138
bus_for_each_drv+0x80/0xdc
__device_attach+0xa8/0x1b0
device_initial_probe+0x14/0x20
bus_probe_device+0xb0/0xb4
deferred_probe_work_func+0x8c/0xc8
process_one_work+0x208/0x60c
worker_thread+0x244/0x388
kthread+0x150/0x228
ret_from_fork+0x10/0x20
irq event stamp: 57267
hardirqs last enabled at (57267): [<ffffd761485e868c>] _raw_spin_unlock_irqrestore+0x74/0x78
hardirqs last disabled at (57266): [<ffffd76147b13c44>] clk_enable_lock+0x7c/0xf0
softirqs last enabled at (56270): [<ffffd7614734446c>] handle_softirqs+0x4c4/0x4dc
softirqs last disabled at (56265): [<ffffd76147290690>] __do_softirq+0x14/0x20
---[ end trace 0000000000000000 ]---
Fixes: 79dde5f7dc7c ("scsi: ufs: core: Fix data race in CPU latency PM QoS request handling")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Message-Id: <20250929112730.3782765-1-m.szyprowski@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
As described in the old comment dating back to
commit 6610e0893b8b ("RTC: Rework RTC code to use timerqueue for events")
from 2010, we have been living with a race window when setting alarm
with an expiry in the near future (i.e. next second).
With 1 second resolution, it can happen that the second ticks after the
check for the timer having expired, but before the alarm is actually set.
When this happen, no alarm IRQ is generated, at least not with some RTC
chips (isl12022 is an example of this).
With UIE RTC timer being implemented on top of alarm irq, being re-armed
every second, UIE will occasionally fail to work, as an alarm irq lost
due to this race will stop the re-arming loop.
For now, I have limited the additional expiry check to only be done for
alarms set to next seconds. I expect it should be good enough, although I
don't know if we can now for sure that systems with loads could end up
causing the same problems for alarms set 2 seconds or even longer in the
future.
I haven't been able to reproduce the problem with this check in place.
Cc: stable@vger.kernel.org
Signed-off-by: Esben Haabendal <esben@geanix.com>
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-1-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
On m68k, check_sizetypes in headers_check reports:
./usr/include/asm/bootinfo-amiga.h:17: found __[us]{8,16,32,64} type without #include <linux/types.h>
This header file does not use any of the Linux-specific integer types,
but merely refers to them from comments, so this is a false positive.
As of commit c3a9d74ee413bdb3 ("kbuild: uapi: upgrade check_sizetypes()
warning to error"), this check was promoted to an error, breaking m68k
all{mod,yes}config builds.
Fix this by stripping simple comments before looking for Linux-specific
integer types.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/949f096337e28d50510e970ae3ba3ec9c1342ec0.1759753998.git.geert@linux-m68k.org
[nathan: Adjust comment and remove unnecessary escaping from slashes in
regex]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Commit 0ce5139fd96e ("kbuild: always create intermediate
vmlinux.unstripped") removed the pattern to avoid stripping .rela.dyn
sections added by commit e9d86b8e17e7 ("scripts: Do not strip .rela.dyn
section"). Restore it so that .rela.dyn sections remain in the final
vmlinux.
Fixes: 0ce5139fd96e ("kbuild: always create intermediate vmlinux.unstripped")
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20251008-kbuild-fix-modinfo-regressions-v1-1-9fc776c5887c@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The functions wakeup_graph_entry() and wakeup_graph_return() both call
func_prolog_preempt_disable() that will test if the data->disable is
already set and if not, increment it and disable preemption. If it was
set, it returns false and the caller exits.
The caller of this function must decrement the disable counter, but misses
doing so if the calltime fails to be acquired.
Instead of exiting out when calltime is NULL, change the logic to do the
work if it is not NULL and still do the clean up at the end of the
function if it is NULL.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008114835.027b878a@gandalf.local.home
Fixes: a485ea9e3ef3 ("tracing: Fix irqsoff and wakeup latency tracers when using function graph")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull uml updates from Johannes Berg:
- minor preparations for SMP support
- SPARSE_IRQ support for kunit
- help output cleanups
* tag 'uml-for-linux-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
um: Remove unused ipi_pipe field from cpuinfo_um
um: Remove unused cpu_data and current_cpu_data macros
um: Stop tracking virtual CPUs via mm_cpumask()
um: Centralize stub size calculations
um: Remove outdated comment about STUB_DATA_PAGES
um: Remove unused offset and child_err fields from stub_data
um: Indent time-travel help messages
um: Fix help message for ssl-non-raw
um: vector: Fix indentation for help message
um: Add missing trailing newline to help messages
um: virtio-pci: implement .shutdown()
um: Support SPARSE_IRQ
1. Increase NR_CPUS to 2048.
2. Enable KEXEC_FILE options.
3. Enable some EXT4/XFS options.
4. Enable Loongson PWM/MMC/BMC/SE drivers.
5. Enable simple DRM driver.
6. Enable some Intel ethernet drivers.
7. Enable some gpio drivers.
8. Enable some network options.
9. Enable SM3/SM4 cryptographic algorithms.
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Drop the exporting of several kvm_arch_xxx() hooks that are only called
from arch-neutral code, i.e. that are only called from kvm.ko.
Link: https://lore.kernel.org/r/20250919003303.1355064-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add 'io_cache_read' to NFSD's debugfs interface so that any data
read by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=1).
io_cache_read may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_read
Add 'io_cache_write' to NFSD's debugfs interface so that any data
written by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=1).
io_cache_write may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_write
The default value for both settings is NFSD_IO_BUFFERED, which is
NFSD's existing behavior for both read and write. Changes to these
settings take immediate effect for all exports and NFS versions.
Currently only xfs and ext4 implement RWF_DONTCACHE. For file
systems that do not implement RWF_DONTCACHE, NFSD use only buffered
I/O when the io_cache setting is NFSD_IO_DONTCACHE.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.
Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.
The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull more updates from Andrew Morton:
"Just one series here - Mike Rappoport has taught KEXEC handover to
preserve vmalloc allocations across handover"
* tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt
kho: add support for preserving vmalloc allocations
kho: replace kho_preserve_phys() with kho_preserve_pages()
kho: check if kho is finalized in __kho_preserve_order()
MAINTAINERS, .mailmap: update Umang's email address
KaFai Wan says:
====================
bpf: Avoid RCU context warning when unpinning htab with internal structs
This small patchset is about avoid RCU context warning when unpinning
htab with internal structs (timer, workqueue, or task_work).
v3:
- fix nit (Yonghong Song)
- add Acked-by: Yonghong Song <yonghong.song@linux.dev>
v2:
- rename bpf_free_inode() to bpf_destroy_inode() (Andrii)
https://lore.kernel.org/all/20251007012235.755853-1-kafai.wan@linux.dev/
v1:
https://lore.kernel.org/all/20251003084528.502518-1-kafai.wan@linux.dev/
---
====================
Link: https://patch.msgid.link/20251008102628.808045-1-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>