commits

Pull tracing fixes from Steven Rostedt:
"The previous fix to trace_marker required updating trace_marker_raw as
well. The difference between trace_marker_raw and trace_marker is
that the raw version is for applications to write binary structures
directly into the ring buffer instead of writing ASCII strings. This
is for applications that will read the raw data from the ring buffer
and get the data structures directly. It's a bit quicker than using
the ASCII version.

Unfortunately, it appears that our test suite has several tests that
test writes to the trace_marker file, but lacks any tests to the
trace_marker_raw file (this needs to be remedied). The update to the
trace_marker_raw file introduced two issues that syzbot found:

- Fix tracing_mark_raw_write() to use per CPU buffer

The fix to use the per CPU buffer to copy from user space was
needed for both the trace_marker and trace_marker_raw files.

The fix for reading from user space into per CPU buffers properly
fixed the trace_marker write function, but the trace_marker_raw
file wasn't fixed properly. The user space data was correctly
written into the per CPU buffer, but the code that wrote into the
ring buffer still used the user space pointer and not the per CPU
buffer that had the user space data already written.

- Stop the fortify string warning from writing into trace_marker_raw

After converting the copy_from_user_nofault() into a memcpy(),
another issue appeared. As writes to the trace_marker_raw expects
binary data, the first entry is a 4 byte identifier. The entry
structure is defined as:

    struct {
        struct trace_entry ent;
        int id;
        char buf[];
    };

The size of this structure is reserved on the ring buffer with:

size = sizeof(*entry) + cnt;

Then it is copied from the buffer into the ring buffer with:

memcpy(&entry->id, buf, cnt);

This used to be a copy_from_user_nofault(), but now converting it to
a memcpy() triggers the fortify-string code, and causes a warning.

The allocated space is actually more than what is copied, as the
cnt used also includes the entry->id portion. Allocating
sizeof(*entry) plus cnt is actually allocating 4 bytes more than
what is needed.

Change the size function to:

size = struct_size(entry, buf, cnt - sizeof(entry->id));

And update the memcpy() to unsafe_memcpy()"
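
As a rough illustration of the record format described above (a 4-byte identifier followed by binary data), a user-space writer might pack and submit a record as sketched below. The helper names pack_raw_marker() and write_raw_marker() are hypothetical, the 256-byte record limit is arbitrary, and the path passed in (commonly /sys/kernel/tracing/trace_marker_raw) depends on where tracefs is mounted.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Pack one raw marker record: a 4-byte id followed by the payload bytes.
 * Returns the number of bytes to write, or 0 if the record does not fit.
 * (Hypothetical helper, not part of any kernel or libc API.)
 */
size_t pack_raw_marker(unsigned char *out, size_t cap,
                       int32_t id, const void *payload, size_t len)
{
    assert(out != NULL);
    if (cap < sizeof(id) || len > cap - sizeof(id))
        return 0;
    memcpy(out, &id, sizeof(id));
    if (len)
        memcpy(out + sizeof(id), payload, len);
    return sizeof(id) + len;
}

/*
 * Submit one record with a single write(2), so that the kernel sees the
 * id and the payload together. Returns 0 on success, -1 on failure.
 */
int write_raw_marker(const char *path, int32_t id,
                     const void *payload, size_t len)
{
    unsigned char rec[256];          /* arbitrary limit for this sketch */
    size_t n = pack_raw_marker(rec, sizeof(rec), id, payload, len);
    ssize_t ret;
    int fd;

    if (n == 0)
        return -1;
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ret = write(fd, rec, n);
    close(fd);
    return ret == (ssize_t)n ? 0 : -1;
}
```

A 2-byte payload gives a 6-byte write, which is the same shape as the "size 6" copy reported in the fortify warning for the second fix.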

* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Stop fortify-string from warning in tracing_mark_raw_write()
tracing: Fix tracing_mark_raw_write() to use buf and not ubuf

3mo ago

Lucas Zampieri

f75e07bf

irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume

3mo ago

Linus Torvalds

98906f9d

Merge tag 'rtc-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

3mo ago

Linus Torvalds

c04022dc

Merge tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux

3mo ago

Steven Rostedt

54b91e54

tracing: Stop fortify-string from warning in tracing_mark_raw_write()

The way tracing_mark_raw_write() records its data is that it has the
following structure:

    struct {
        struct trace_entry ent;
        int id;
        char buf[];
    };

But memcpy(&entry->id, buf, size) triggers the following warning when the
size is greater than the id:

------------[ cut here ]------------
memcpy: detected field-spanning write (size 6) of single field "&entry->id" at kernel/trace/trace.c:7458 (size 4)
WARNING: CPU: 7 PID: 995 at kernel/trace/trace.c:7458 write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
Modules linked in:
CPU: 7 UID: 0 PID: 995 Comm: bash Not tainted 6.17.0-test-00007-g60b82183e78a-dirty #211 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
Code: 04 00 75 a7 b9 04 00 00 00 48 89 de 48 89 04 24 48 c7 c2 e0 b1 d1 b2 48 c7 c7 40 b2 d1 b2 c6 05 2d 88 6a 04 01 e8 f7 e8 bd ff <0f> 0b 48 8b 04 24 e9 76 ff ff ff 49 8d 7c 24 04 49 8d 5c 24 08 48
RSP: 0018:ffff888104c3fc78 EFLAGS: 00010292
RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 1ffffffff6b363b4 RDI: 0000000000000001
RBP: ffff888100058a00 R08: ffffffffb041d459 R09: ffffed1020987f40
R10: 0000000000000007 R11: 0000000000000001 R12: ffff888100bb9010
R13: 0000000000000000 R14: 00000000000003e3 R15: ffff888134800000
FS: 00007fa61d286740(0000) GS:ffff888286cad000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560d28d509f1 CR3: 00000001047a4006 CR4: 0000000000172ef0
Call Trace:
<TASK>
tracing_mark_raw_write+0x1fe/0x290
? __pfx_tracing_mark_raw_write+0x10/0x10
? security_file_permission+0x50/0xf0
? rw_verify_area+0x6f/0x4b0
vfs_write+0x1d8/0xdd0
? __pfx_vfs_write+0x10/0x10
? __pfx_css_rstat_updated+0x10/0x10
? count_memcg_events+0xd9/0x410
? fdget_pos+0x53/0x5e0
ksys_write+0x182/0x200
? __pfx_ksys_write+0x10/0x10
? do_user_addr_fault+0x4af/0xa30
do_syscall_64+0x63/0x350
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fa61d318687
Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
RSP: 002b:00007ffd87fe0120 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fa61d286740 RCX: 00007fa61d318687
RDX: 0000000000000006 RSI: 0000560d28d509f0 RDI: 0000000000000001
RBP: 0000560d28d509f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000006
R13: 00007fa61d4715c0 R14: 00007fa61d46ee80 R15: 0000000000000000
</TASK>
---[ end trace 0000000000000000 ]---

This is because fortify string sees that the size of entry->id is only 4
bytes, but it is writing more than that. But this is OK as the
dynamic_array is allocated to handle that copy.

The size allocated on the ring buffer was actually a bit too big:

size = sizeof(*entry) + cnt;

But cnt includes the 'id' and the buffer data, so adding cnt to the size
of *entry actually allocates too much on the ring buffer.

Change the allocation to:

size = struct_size(entry, buf, cnt - sizeof(entry->id));

and the memcpy() to unsafe_memcpy() with an added justification.
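
The size arithmetic can be checked in user space with a small model. This is only a sketch: struct trace_entry_model is a stand-in whose layout is an assumption, new_reserve() mirrors struct_size() without the overflow checking of the kernel macro, and cnt is assumed to be at least sizeof(id).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct trace_entry; the real layout may differ. */
struct trace_entry_model {
    unsigned short type;
    unsigned char flags;
    unsigned char preempt_count;
    int pid;
};

/* Same shape as the entry described in the commit message. */
struct raw_entry_model {
    struct trace_entry_model ent;
    int id;
    char buf[];
};

/* Old reservation: the whole struct plus cnt, although cnt already counts id. */
size_t old_reserve(size_t cnt)
{
    return sizeof(struct raw_entry_model) + cnt;
}

/*
 * New reservation, modelled on
 * struct_size(entry, buf, cnt - sizeof(entry->id)):
 * the struct header plus only the bytes that land in buf[].
 */
size_t new_reserve(size_t cnt)
{
    assert(cnt >= sizeof(int));
    return sizeof(struct raw_entry_model) + (cnt - sizeof(int)) * sizeof(char);
}
```

For any valid cnt the two results differ by exactly sizeof(id), which is the over-allocation the commit removes.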

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011112032.77be18e4@gandalf.local.home
Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

3mo ago

Dan Carpenter

196754c2

irqchip/aspeed-scu-ic: Fix an IS_ERR() vs NULL check

3mo ago

Linus Torvalds

2a6edd86

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

3mo ago

Esben Haabendal

9db26d58

rtc: interface: Ensure alarm irq is enabled when UIE is enabled

3mo ago

Nathan Chancellor

b0f2942a

kbuild: Use '--strip-unneeded-symbol' for removing module device table symbols

3mo ago

Steven Rostedt

bda745ee

tracing: Fix tracing_mark_raw_write() to use buf and not ubuf

3mo ago

Linus Torvalds

c746c3b5

Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

3mo ago

Linus Torvalds

9591fdb0

Merge tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull more x86 updates from Borislav Petkov:

- Remove a bunch of asm implementing condition flags testing in KVM's
emulator in favor of int3_emulate_jcc() which is written in C

- Replace KVM fastops with C-based stubs which avoids problems with the
fastop infra related to latter not adhering to the C ABI due to their
special calling convention and, more importantly, bypassing compiler
control-flow integrity checking because they're written in asm

- Remove wrongly used static branches and other ugliness accumulated
over time in hyperv's hypercall implementation with a proper static
function call to the correct hypervisor call variant

- Add some fixes and modifications to allow running FRED-enabled
kernels in KVM even on non-FRED hardware

- Add kCFI improvements like validating indirect calls and prepare for
enabling kCFI with GCC. Add cmdline params documentation and other
code cleanups

- Use the single-byte 0xd6 insn as the official #UD single-byte
undefined opcode instruction as agreed upon by both x86 vendors

- Other smaller cleanups and touchups all over the place

* tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86,retpoline: Optimize patch_retpoline()
x86,ibt: Use UDB instead of 0xEA
x86/cfi: Remove __noinitretpoline and __noretpoline
x86/cfi: Add "debug" option to "cfi=" bootparam
x86/cfi: Standardize on common "CFI:" prefix for CFI reports
x86/cfi: Document the "cfi=" bootparam options
x86/traps: Clarify KCFI instruction layout
compiler_types.h: Move __nocfi out of compiler-specific header
objtool: Validate kCFI calls
x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y
x86/fred: Play nice with invoking asm_fred_entry_from_kvm() on non-FRED hardware
x86/fred: Install system vector handlers even if FRED isn't fully enabled
x86/hyperv: Use direct call to hypercall-page
x86/hyperv: Clean up hv_do_hypercall()
KVM: x86: Remove fastops
KVM: x86: Convert em_salc() to C
KVM: x86: Introduce EM_ASM_3WCL
KVM: x86: Introduce EM_ASM_1SRC2
KVM: x86: Introduce EM_ASM_2CL
KVM: x86: Introduce EM_ASM_2W
...

3mo ago

Hoyoung Seo

558ae457

scsi: ufs: core: Include UTP error in INT_FATAL_ERRORS

3mo ago

Esben Haabendal

1502fe0e

rtc: tps6586x: Fix initial enable_irq/disable_irq balance

3mo ago

Nathan Chancellor

cfc58453

Merge patch series "kbuild: Fixes for fallout from recent modules.builtin.modinfo series"

3mo ago

Steven Rostedt

64cf7d05

tracing: Have trace_marker use per-cpu data to read user space

It was reported that using __copy_from_user_inatomic() can actually
schedule, which is bad when preemption is disabled. There is logic to
check whether in_atomic() is set, but this is a nop when the kernel is
configured with PREEMPT_NONE. The scheduling is due to page faulting, so
the code could schedule with preemption disabled.

Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/

The solution was to change the __copy_from_user_inatomic() to
copy_from_user_nofault(). But then it was reported that this caused a
regression in Android. There are several applications writing into
trace_marker in Android, but now instead of showing the expected data,
it is showing:

tracing_mark_write: <faulted>

After reverting the conversion to copy_from_user_nofault(), Android was
able to get the data again.

Writing to trace_marker is a way to efficiently and quickly enter data
into the Linux tracing buffer. It takes no locks and was designed to be as
non-intrusive as possible. This means it cannot allocate memory, and must
use pre-allocated data.

A method that is actively being worked on to have faultable system call
tracepoints read user space data is to allocate per CPU buffers, and use
them in the callback. The method uses a technique similar to seqcount.
That is something like this:

    preempt_disable();
    cpu = smp_processor_id();
    buffer = this_cpu_ptr(&pre_allocated_cpu_buffers, cpu);
    do {
        cnt = nr_context_switches_cpu(cpu);
        migrate_disable();
        preempt_enable();
        ret = copy_from_user(buffer, ptr, size);
        preempt_disable();
        migrate_enable();
    } while (!ret && cnt != nr_context_switches_cpu(cpu));

    if (!ret)
        ring_buffer_write(buffer);
    preempt_enable();

It's a little more involved than that, but the above is the basic logic.
The idea is to acquire the current CPU buffer, disable migration, and then
enable preemption. At this moment, it can safely use copy_from_user().
After reading the data from user space, it disables preemption again. It
then checks to see if there was any new scheduling on this CPU. If there
was, it must assume that the buffer was corrupted by another task. If
there wasn't, then the buffer is still valid as only tasks in preemptable
context can write to this buffer and only those that are running on the
CPU.

By using this method, where trace_marker open allocates the per CPU
buffers, trace_marker writes can access user space and even fault it in,
without having to allocate or take any locks of its own.
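
The retry idea can be modelled in plain user-space C. The sketch below is a single-threaded simulation, not kernel code: struct cpu_model, preemptible_copy() and copy_stable() are hypothetical names, and a "context switch" is simulated by bumping a counter and scribbling over the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one CPU: a switch counter and its scratch buffer. */
struct cpu_model {
    unsigned long nr_switches;   /* stands in for nr_context_switches_cpu() */
    char buffer[64];             /* stands in for the per CPU buffer */
    int interruptions_left;      /* how many copies get "preempted" */
};

/*
 * Simulated copy with preemption enabled: while interruptions remain,
 * another task runs afterwards, bumps the counter and corrupts the buffer.
 */
void preemptible_copy(struct cpu_model *cpu, const char *src, size_t size)
{
    memcpy(cpu->buffer, src, size);
    if (cpu->interruptions_left > 0) {
        cpu->interruptions_left--;
        cpu->nr_switches++;
        memset(cpu->buffer, 'X', size);
    }
}

/*
 * Snapshot the counter, copy, then compare. If the counter moved, the
 * buffer may have been overwritten, so copy again. Returns the number
 * of attempts that were needed.
 */
int copy_stable(struct cpu_model *cpu, const char *src, size_t size)
{
    unsigned long cnt;
    int attempts = 0;

    assert(size <= sizeof(cpu->buffer));
    do {
        cnt = cpu->nr_switches;
        preemptible_copy(cpu, src, size);
        attempts++;
    } while (cnt != cpu->nr_switches);

    return attempts;
}
```

With two simulated interruptions the third attempt is the first one that sees an unchanged counter, and only then is the buffer content trusted.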

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Wattson CI <wattson-external@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home
Fixes: 3d62ab32df065 ("tracing: Fix tracing_marker may trigger page fault during preempt_disable")
Reported-by: Runping Lai <runpinglai@google.com>
Tested-by: Runping Lai <runpinglai@google.com>
Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

3mo ago

Linus Torvalds

81538c8e

Merge tag 'nfsd-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
"Mike Snitzer has prototyped a mechanism for disabling I/O caching in
NFSD. This is introduced in v6.18 as an experimental feature. This
enables scaling NFSD in /both/ directions:

- NFS service can be supported on systems with small memory
footprints, such as low-cost cloud instances

- Large NFS workloads will be less likely to force the eviction of
server-local activity, helping it avoid thrashing

Jeff Layton contributed a number of fixes to the new attribute
delegation implementation (based on a pending Internet RFC) that we
hope will make attribute delegation reliable enough to enable by
default, as it is on the Linux NFS client.

The remaining patches in this pull request are clean-ups and minor
optimizations. Many thanks to the contributors, reviewers, testers,
and bug reporters who participated during the v6.18 NFSD development
cycle"

* tag 'nfsd-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (42 commits)
nfsd: discard nfserr_dropit
SUNRPC: Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it
NFSD: Add io_cache_{read,write} controls to debugfs
NFSD: Do the grace period check in ->proc_layoutget
nfsd: delete unnecessary NULL check in __fh_verify()
NFSD: Allow layoutcommit during grace period
NFSD: Disallow layoutget during grace period
sunrpc: fix "occurence"->"occurrence"
nfsd: Don't force CRYPTO_LIB_SHA256 to be built-in
nfsd: nfserr_jukebox in nlm_fopen should lead to a retry
NFSD: Reduce DRC bucket size
NFSD: Delay adding new entries to LRU
SUNRPC: Move the svc_rpcb_cleanup() call sites
NFS: Remove rpcbind cleanup for NFSv4.0 callback
nfsd: unregister with rpcbind when deleting a transport
NFSD: Drop redundant conversion to bool
sunrpc: eliminate return pointer in svc_tcp_sendmsg()
sunrpc: fix pr_notice in svc_tcp_sendto() to show correct length
nfsd: decouple the xprtsec policy check from check_nfsd_access()
NFSD: Fix destination buffer size in nfsd4_ssc_setup_dul()
...

3mo ago

Nathan Chancellor

4335c449

btrfs: fix PAGE_SIZE format specifier in open_ctree()

3mo ago

Linus Torvalds

2f0a7504

Merge tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

3mo ago

Peter Zijlstra

4a1e02b1

x86,retpoline: Optimize patch_retpoline()

4mo ago

Daniel Lee

bb7663de

scsi: ufs: sysfs: Make HID attributes visible

3mo ago

Esben Haabendal

e0762fd2

rtc: cpcap: Fix initial enable_irq/disable_irq balance

3mo ago

Dmitry Safonov

38492c57

gen_init_cpio: Ignore fsync() returning EINVAL on pipes

3mo ago

Nathan Chancellor

9338d660

s390/vmlinux.lds.S: Move .vmlinux.info to end of allocatable sections

3mo ago

Ankit Khushwaha

de4cbd70

ring buffer: Propagate __rb_map_vma return value to caller

3mo ago

Linus Torvalds

256e3417

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull x86 kvm updates from Paolo Bonzini:
"Generic:

- Rework almost all of KVM's exports to expose symbols only to KVM's
vendor modules (x86's kvm-{amd,intel}.ko and PPC's kvm-{pr,hv}.ko)

x86:

- Rework almost all of KVM x86's exports to expose symbols only to
KVM's vendor modules, i.e. to kvm-{amd,intel}.ko

- Add support for virtualizing Control-flow Enforcement Technology
(CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD
(Shadow Stacks).

It is worth noting that while SHSTK and IBT can be enabled
separately in CPUID, it is not really possible to virtualize them
separately. Therefore, Intel processors will really allow both
SHSTK and IBT under the hood if either is made visible in the
guest's CPUID. The alternative would be to intercept
XSAVES/XRSTORS, which is not feasible for performance reasons

- Fix a variety of fuzzing WARNs all caused by checking L1 intercepts
when completing userspace I/O. KVM has already committed to
allowing L2 to perform I/O at that point

- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the
MSR is supposed to exist for v2 PMUs

- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs

- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside
of forced emulation and other contrived testing scenarios)

- Clean up the MSR APIs in preparation for CET and FRED
virtualization, as well as mediated vPMU support

- Clean up a pile of PMU code in anticipation of adding support for
mediated vPMUs

- Reject in-kernel IOAPIC/PIT for TDX VMs, as KVM can't obtain EOI
vmexits needed to faithfully emulate an I/O APIC for such guests

- Many cleanups and minor fixes

- Recover possible NX huge pages within the TDP MMU under read lock
to reduce guest jitter when restoring NX huge pages

- Return -EAGAIN during prefault if userspace concurrently
deletes/moves the relevant memslot, to fix an issue where
prefaulting could deadlock with the memslot update

x86 (AMD):

- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is
supported

- Require a minimum GHCB version of 2 when starting SEV-SNP guests
via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate
errors instead of latent guest failures

- Add support for SEV-SNP's CipherText Hiding, an opt-in feature that
prevents unauthorized CPU accesses from reading the ciphertext of
SNP guest private memory, e.g. to attempt an offline attack. This
feature splits the shared SEV-ES/SEV-SNP ASID space into separate
ranges for SEV-ES and SEV-SNP guests, therefore a new module
parameter is needed to control the number of ASIDs that can be used
for VMs with CipherText Hiding vs. how many can be used to run
SEV-ES guests

- Add support for Secure TSC for SEV-SNP guests, which prevents the
untrusted host from tampering with the guest's TSC frequency, while
still allowing the VMM to configure the guest's TSC frequency
prior to launch

- Validate the XCR0 provided by the guest (via the GHCB) to avoid
bugs resulting from bogus XCR0 values

- Save an SEV guest's policy if and only if LAUNCH_START fully
succeeds to avoid leaving behind stale state (thankfully not
consumed in KVM)

- Explicitly reject non-positive effective lengths during SNP's
LAUNCH_UPDATE instead of subtly relying on guest_memfd to deal with
them

- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the
host's desired TSC_AUX, to fix a bug where KVM was keeping a
different vCPU's TSC_AUX in the host MSR until return to userspace

x86 (Intel):

- Preparation for FRED support

- Don't retry in TDX's anti-zero-step mitigation if the target
memslot is invalid, i.e. is being deleted or moved, to fix a
deadlock scenario similar to the aforementioned prefaulting case

- Misc bugfixes and minor cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (142 commits)
KVM: x86: Export KVM-internal symbols for sub-modules only
KVM: x86: Drop pointless exports of kvm_arch_xxx() hooks
KVM: x86: Move kvm_intr_is_single_vcpu() to lapic.c
KVM: Export KVM-internal symbols for sub-modules only
KVM: s390/vfio-ap: Use kvm_is_gpa_in_memslot() instead of open coded equivalent
KVM: VMX: Make CR4.CET a guest owned bit
KVM: selftests: Verify MSRs are (not) in save/restore list when (un)supported
KVM: selftests: Add coverage for KVM-defined registers in MSRs test
KVM: selftests: Add KVM_{G,S}ET_ONE_REG coverage to MSRs test
KVM: selftests: Extend MSRs test to validate vCPUs without supported features
KVM: selftests: Add support for MSR_IA32_{S,U}_CET to MSRs test
KVM: selftests: Add an MSR test to exercise guest/host and read/write
KVM: x86: Define AMD's #HV, #VC, and #SX exception vectors
KVM: x86: Define Control Protection Exception (#CP) vector
KVM: x86: Add human friendly formatting for #XM, and #VE
KVM: SVM: Enable shadow stack virtualization for SVM
KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's valid
KVM: SVM: Pass through shadow stack MSRs as appropriate
KVM: SVM: Update dump_vmcb with shadow stack save area additions
KVM: nSVM: Save/load CET Shadow Stack state to/from vmcb12/vmcb02
...

3mo ago

NeilBrown

73cc6ec1

nfsd: discard nfserr_dropit

3mo ago

Anderson Nascimento

dff4f9ff

btrfs: avoid potential out-of-bounds in btrfs_encode_fh()

4mo ago

Linus Torvalds

6bb71f0f

Merge tag 'slab-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

3mo ago

Uros Bizjak

c6c973db

x86/asm: Remove code depending on __GCC_ASM_FLAG_OUTPUTS__

4mo ago

Peter Zijlstra

85a2d4a8

x86,ibt: Use UDB instead of 0xEA

A while ago [0] FineIBT started using the 0xEA instruction to raise #UD.
All existing parts will generate #UD in 64bit mode on that instruction.

However, Intel/AMD have not blessed using this instruction, it is on
their 'reserved' opcode list for future use.

Peter Anvin worked the committees and got use of 0xD6 blessed, it
shall be called UDB (per the next SDM or so), and it being a single
byte instruction is easy to slip into a single byte immediate -- as
is done by this very patch.

Reworking the FineIBT code to use UDB wasn't entirely trivial. Notably
the FineIBT-BHI1 case ran out of bytes. In order to condense the
encoding some it was required to move the hash register from R10D to
EAX (thanks hpa!).

Per the x86_64 ABI, RAX is used to pass the number of vector registers
for vararg function calls -- something that should not happen in the
kernel. More so, the kernel is built with -mskip-rax-setup, which
should leave RAX completely unused, allowing its re-use.

[ For BPF; while the bpf2bpf tail-call uses RAX in its calling
convention, that does not use CFI and is unaffected. Only the
'regular' C->BPF transition is covered by CFI. ]

The ENDBR poison value is changed from 'OSP NOP3' to 'NOPL -42(%RAX)',
this is basically NOP4 but with UDB as its immediate. As such it is
still a non-standard NOP value unique to prior ENDBR sites, but now
also provides UDB.
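
To see why the poison "also provides UDB", it helps to look at the bytes. The sketch below assumes the usual 4-byte encoding of NOPL with an 8-bit displacement off %rax (0f 1f 40 followed by disp8); the array and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed encoding of "nopl -42(%rax)": opcode 0f 1f, ModRM 0x40
 * (memory operand [rax + disp8]), then the one-byte displacement.
 */
const uint8_t endbr_poison_model[4] = { 0x0f, 0x1f, 0x40, 0xd6 };

/* The displacement byte read as a signed 8-bit value. */
int poison_disp8(void)
{
    return (int8_t)endbr_poison_model[3];
}

/* The byte a jump to \func+3 would land on. */
uint8_t byte_at_func_plus_3(void)
{
    return endbr_poison_model[3];
}
```

The displacement -42 is the byte 0xd6, so a branch to \func+3 executes the single-byte UDB opcode, while falling through from \func+0 executes a harmless 4-byte NOP.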

Per Agner Fog's optimization guide, Jcc is assumed not-taken. That is,
the expected path should be the fallthrough case for improved
throughput.

Since the preamble now relies on the ENDBR poison to provide UDB, the
code is changed to write the poison right along with the initial
preamble -- this is possible because the ITS mitigation already
disabled IBT over rewriting the CFI scheme.

The scheme in detail:

Preamble:

  FineIBT                         FineIBT-BHI1                    FineIBT-BHI

  __cfi_\func:                    __cfi_\func:                    __cfi_\func:
    endbr                           endbr                           endbr
    subl $0x12345678, %eax          subl $0x12345678, %eax          subl $0x12345678, %eax
    jne.d32,np \func+3              cmovne %rax, %rdi               cs cs call __bhi_args_N
                                    jne.d8,np \func+3
  \func:                          \func:                          \func:
    nopl -42(%rax)                  nopl -42(%rax)                  nopl -42(%rax)

Notably there are 7 bytes available after the SUBL; this enables the
BHI1 case to fit without the nasty overlapping case it had previously.
The !BHI case uses Jcc.d32,np to consume all 7 bytes without the need
for an additional NOP, while the BHI case uses CS padding to align the
CALL with the end of the preamble such that it returns to \func+0.

Caller:

  FineIBT                                 Paranoid-FineIBT

  fineibt_caller:                         fineibt_caller:
    mov $0x12345678, %eax                   mov $0x12345678, %eax
    lea -10(%r11), %r11                     cmp -0x11(%r11), %eax
    nop5                                    cs lea -0x10(%r11), %r11
  retpoline:                              retpoline:
    cs call __x86_indirect_thunk_r11        jne fineibt_caller+0xd
                                            call *%r11
                                            nop

Notably this is before apply_retpolines() which will fix up the
retpoline call -- since all parts with IBT also have eIBRS (let's
ignore ITS). Typically the retpoline site is rewritten (when still
intact) into:

call *%r11
nop3

[0] 06926c6cdb95 ("x86/ibt: Optimize the FineIBT instruction sequence")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250901191307.GI4067720@noisy.programming.kicks-ass.net

4mo ago

Duoming Zhou

60cd16a3

scsi: mvsas: Fix use-after-free bugs in mvs_work_queue

3mo ago

Esben Haabendal

9ffe06b6

rtc: isl12022: Fix initial enable_irq/disable_irq balance

3mo ago

Nathan Chancellor

7ded7d37

scripts/Makefile.extrawarn: Respect CONFIG_WERROR / W=e for hostprogs

3mo ago

Nathan Chancellor

8ec3af91

kbuild: Add '.rel.*' strip pattern for vmlinux

3mo ago

Steven Rostedt

c834a979

tracing: Fix irqoff tracers on failure of acquiring calltime

3mo ago

Linus Torvalds

fb5bc347

Merge tag 'loongarch-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

3mo ago

Sean Christopherson

6b36119b

KVM: x86: Export KVM-internal symbols for sub-modules only

3mo ago

Eric Biggers

d8e97cc4

SUNRPC: Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it

3mo ago

Filipe Manana

45c22246

btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()

4mo ago

Linus Torvalds

fbde105f

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

3mo ago

Vlastimil Babka

fd6db588

slab: fix barn NULL pointer dereference on memoryless nodes

Phil reported a boot failure once sheaves became used in commits
59faa4da7cd4 ("maple_tree: use percpu sheaves for maple_node_cache") and
3accabda4da1 ("mm, vma: use percpu sheaves for vm_area_struct cache"):

BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Call Trace:
<TASK>
? srso_return_thunk+0x5/0x5f
? vm_area_alloc+0x1e/0x60
kmem_cache_alloc_noprof+0x4ec/0x5b0
vm_area_alloc+0x1e/0x60
create_init_stack_vma+0x26/0x210
alloc_bprm+0x139/0x200
kernel_execve+0x4a/0x140
call_usermodehelper_exec_async+0xd0/0x190
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork+0xf0/0x110
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in:
CR2: 0000000000000040
---[ end trace 0000000000000000 ]---
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---

And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such
that memory is only on 2 of them."

# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
node 1 size: 31584 MB
node 1 free: 30397 MB
node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
node 5 size: 32214 MB
node 5 free: 31625 MB
node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
node 7 size: 0 MB
node 7 free: 0 MB

Linus decoded the stacktrace to get_barn() and get_node() and determined
that kmem_cache->node[numa_mem_id()] is NULL.

The problem is due to a wrong assumption that memoryless nodes only
exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id()
points to the nearest node that has memory. SLUB has been allocating its
kmem_cache_node structures only on nodes with memory, and it does the
same with struct node_barn.

For kmem_cache_node, get_partial_node() checks if get_node() result is
not NULL, which I assumed was for protection from a bogus node id passed
to kmalloc_node() but apparently it's also for systems where
numa_mem_id() (used when no specific node is given) might return a
memoryless node.

Fix the sheaves code the same way by checking the result of get_node()
and bailing out if it's NULL. Note that cpus on such memoryless nodes
will have degraded sheaves performance, which can be improved later,
preferably by making numa_mem_id() work properly on such systems.
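
The shape of the fix can be sketched with a simplified user-space model. All type and function names below are hypothetical and the structure layout is an assumption; the only point is that the per-node lookup may return NULL on a memoryless node and the caller falls back instead of dereferencing it.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_NODES 8

/* Simplified stand-ins for the per-node slab structures. */
struct barn_model { int nr_full; };
struct node_model { struct barn_model barn; };
struct cache_model { struct node_model *node[MODEL_MAX_NODES]; };

/* Return the node's barn, or NULL when the node has no per-node data. */
struct barn_model *get_barn_model(struct cache_model *s, int nid)
{
    struct node_model *n = s->node[nid];

    if (!n)
        return NULL;    /* memoryless node: nothing was allocated here */
    return &n->barn;
}

/*
 * Try to take a full sheaf from the node's barn. Returns 1 on success
 * and 0 when the caller has to fall back to the slower path.
 */
int take_full_sheaf_model(struct cache_model *s, int nid)
{
    struct barn_model *barn = get_barn_model(s, nid);

    if (!barn || barn->nr_full == 0)
        return 0;       /* bail out instead of dereferencing NULL */
    barn->nr_full--;
    return 1;
}
```

A CPU on a node without memory always takes the fallback branch in this model, which corresponds to the degraded sheaves performance the message mentions.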

Fixes: 2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves")
Reported-and-tested-by: Phil Auld <pauld@redhat.com>
Closes: https://lore.kernel.org/all/20251010151116.GA436967@pauld.westford.csb/
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

3mo ago

Uros Bizjak

13bdfb53

x86/sgx: Use ENCLS mnemonic in <kernel/cpu/sgx/encls.h>

5mo ago

Kees Cook

0b815825

x86/cfi: Remove __noinitretpoline and __noretpoline

4mo ago

Marek Szyprowski

0ba7a254

scsi: ufs: core: Fix PM QoS mutex initialization

3mo ago

Esben Haabendal

795cda83

rtc: interface: Fix long-standing race when setting alarm

3mo ago

Geert Uytterhoeven

66128f42

kbuild: uapi: Strip comments before size type check

3mo ago

Nathan Chancellor

4b47a3ae

kbuild: Restore pattern to avoid stripping .rela.dyn from vmlinux

3mo ago

Steven Rostedt

4f7bf54b

tracing: Fix wakeup tracers on failure of acquiring calltime

3mo ago

Linus Torvalds

fc282d17

Merge tag 'uml-for-linux-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux

3mo ago

Huacai Chen

032676ff

LoongArch: Update Loongson-3 default config file

3mo ago

Sean Christopherson

65604683

KVM: x86: Drop pointless exports of kvm_arch_xxx() hooks

3mo ago

Mike Snitzer

6304affe

NFSD: Add io_cache_{read,write} controls to debugfs

3mo ago

David Sterba

a929904c

btrfs: add unlikely annotations to branches leading to transaction abort

4mo ago

Linus Torvalds

ae13bd23

Merge tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

3mo ago

Alexei Starovoitov

ffce84bc

Merge branch 'bpf-avoid-rcu-context-warning-when-unpinning-htab-with-internal-structs'

3mo ago

Linux 6.18-rc1 v6.18-rc1

3a866087

Linus Torvalds

3mo

Merge tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

3dd7b812

Linus Torvalds

3mo

Merge tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

8765f467

Linus Torvalds

3mo

Revert "i2c: boardinfo: Annotate code used in init phase only"

a8482d2c

Wolfram Sang

3mo

Merge tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

67029a49

Linus Torvalds

3mo

irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume

f75e07bf

Lucas Zampieri

3mo

Merge tag 'rtc-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

98906f9d

Linus Torvalds

3mo

Merge tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux

c04022dc

Linus Torvalds

3mo

tracing: Stop fortify-string from warning in tracing_mark_raw_write()

54b91e54

Steven Rostedt

3mo

irqchip/aspeed-scu-ic: Fix an IS_ERR() vs NULL check

196754c2

Dan Carpenter

3mo

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

2a6edd86

Linus Torvalds

3mo

rtc: interface: Ensure alarm irq is enabled when UIE is enabled

9db26d58

Esben Haabendal

3mo

kbuild: Use '--strip-unneeded-symbol' for removing module device table symbols

After commit 5ab23c7923a1 ("modpost: Create modalias for builtin
modules"), relocatable RISC-V kernels with CONFIG_KASAN=y start failing
when attempting to strip the module device table symbols:

riscv64-linux-objcopy: not stripping symbol `__mod_device_table__kmod_irq_starfive_jh8100_intc__of__starfive_intc_irqchip_match_table' because it is named in a relocation
make[4]: *** [scripts/Makefile.vmlinux:97: vmlinux] Error 1

The relocation appears to come from .LASANLOC5 in .data.rel.local:

$ llvm-objdump --disassemble-symbols=.LASANLOC5 --disassemble-all -r drivers/irqchip/irq-starfive-jh8100-intc.o

drivers/irqchip/irq-starfive-jh8100-intc.o: file format elf64-littleriscv

Disassembly of section .data.rel.local:

0000000000000180 <.LASANLOC5>:
...
1d0: 0000 unimp
00000000000001d0: R_RISCV_64 __mod_device_table__kmod_irq_starfive_jh8100_intc__of__starfive_intc_irqchip_match_table
...

This section appears to be emitted by GCC to hold additional
information about global variables that may be protected by KASAN.

There appears to be no way to opt out of the generation of these symbols
through either a flag or attribute. Attempting to remove '.LASANLOC*'
with '--strip-symbol' results in the same error as above because these
symbols may refer to (and thus have relocations between) each other.

Avoid this build breakage by switching to '--strip-unneeded-symbol' for
removing __mod_device_table__ symbols, as it will only remove the symbol
when there is no relocation pointing to it. While this may result in a
little more bloat in the symbol table in certain configurations, it is
not as bad as outright build failures.

Fixes: 5ab23c7923a1 ("modpost: Create modalias for builtin modules")
Reported-by: Charles Mirabile <cmirabil@redhat.com>
Closes: https://lore.kernel.org/20251007011637.2512413-1-cmirabil@redhat.com/
Suggested-by: Alexey Gladkov <legion@kernel.org>
Tested-by: Nicolas Schier <nsc@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>

b0f2942a

Nathan Chancellor

3mo

tracing: Fix tracing_mark_raw_write() to use buf and not ubuf

bda745ee

Steven Rostedt

3mo

Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

c746c3b5

Linus Torvalds

3mo

Merge tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

9591fdb0

Linus Torvalds

3mo

scsi: ufs: core: Include UTP error in INT_FATAL_ERRORS

558ae457

Hoyoung Seo

3mo

rtc: tps6586x: Fix initial enable_irq/disable_irq balance

1502fe0e

Esben Haabendal

3mo

Merge patch series "kbuild: Fixes for fallout from recent modules.builtin.modinfo series"

cfc58453

Nathan Chancellor

3mo

tracing: Have trace_marker use per-cpu data to read user space

64cf7d05

Steven Rostedt

3mo

Merge tag 'nfsd-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

81538c8e

Linus Torvalds

3mo

btrfs: fix PAGE_SIZE format specifier in open_ctree()

4335c449

Nathan Chancellor

3mo

Merge tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2f0a7504

Linus Torvalds

3mo

x86,retpoline: Optimize patch_retpoline()

4a1e02b1

Peter Zijlstra

4mo

scsi: ufs: sysfs: Make HID attributes visible

bb7663de

Daniel Lee

3mo

rtc: cpcap: Fix initial enable_irq/disable_irq balance

e0762fd2

Esben Haabendal

3mo

gen_init_cpio: Ignore fsync() returning EINVAL on pipes

38492c57

Dmitry Safonov

3mo

s390/vmlinux.lds.S: Move .vmlinux.info to end of allocatable sections

9338d660

Nathan Chancellor

3mo

ring buffer: Propagate __rb_map_vma return value to caller

de4cbd70

Ankit Khushwaha

3mo

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

256e3417

Linus Torvalds

3mo

nfsd: discard nfserr_dropit

73cc6ec1

NeilBrown

3mo

btrfs: avoid potential out-of-bounds in btrfs_encode_fh()

dff4f9ff

Anderson Nascimento

4mo

Merge tag 'slab-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

6bb71f0f

Linus Torvalds

3mo

x86/asm: Remove code depending on __GCC_ASM_FLAG_OUTPUTS__

c6c973db

Uros Bizjak

4mo

x86,ibt: Use UDB instead of 0xEA

85a2d4a8

Peter Zijlstra

4mo

scsi: mvsas: Fix use-after-free bugs in mvs_work_queue

60cd16a3

Duoming Zhou

3mo

rtc: isl12022: Fix initial enable_irq/disable_irq balance

9ffe06b6

Esben Haabendal

3mo

scripts/Makefile.extrawarn: Respect CONFIG_WERROR / W=e for hostprogs

7ded7d37

Nathan Chancellor

3mo

kbuild: Add '.rel.*' strip pattern for vmlinux

8ec3af91

Nathan Chancellor

3mo

tracing: Fix irqoff tracers on failure of acquiring calltime

c834a979

Steven Rostedt

3mo

Merge tag 'loongarch-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

fb5bc347

Linus Torvalds

3mo

KVM: x86: Export KVM-internal symbols for sub-modules only

6b36119b

Sean Christopherson

3mo

SUNRPC: Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it

d8e97cc4

Eric Biggers

3mo

btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()

45c22246

Filipe Manana

4mo

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

fbde105f

Linus Torvalds

3mo

slab: fix barn NULL pointer dereference on memoryless nodes

fd6db588

Vlastimil Babka

3mo

x86/sgx: Use ENCLS mnemonic in <kernel/cpu/sgx/encls.h>

13bdfb53

Uros Bizjak

5mo

x86/cfi: Remove __noinitretpoline and __noretpoline

0b815825

Kees Cook

4mo

scsi: ufs: core: Fix PM QoS mutex initialization

hba->pm_qos_mutex is used very early as a part of ufshcd_init(), so it
needs to be initialized before that call. This fixes the following
warning:

------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: kernel/locking/mutex.c:577 at __mutex_lock+0x268/0x894, CPU#4: kworker/u32:4/72
Modules linked in:
CPU: 4 UID: 0 PID: 72 Comm: kworker/u32:4 Not tainted 6.17.0-rc7-next-20250926+ #11223 PREEMPT
Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
Workqueue: events_unbound deferred_probe_work_func
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __mutex_lock+0x268/0x894
lr : __mutex_lock+0x268/0x894
...
Call trace:
__mutex_lock+0x268/0x894 (P)
mutex_lock_nested+0x24/0x30
ufshcd_pm_qos_update+0x30/0x78
ufshcd_setup_clocks+0x2d4/0x3c4
ufshcd_init+0x234/0x126c
ufshcd_pltfrm_init+0x62c/0x82c
ufs_qcom_probe+0x20/0x58
platform_probe+0x5c/0xac
really_probe+0xbc/0x298
__driver_probe_device+0x78/0x12c
driver_probe_device+0x40/0x164
__device_attach_driver+0xb8/0x138
bus_for_each_drv+0x80/0xdc
__device_attach+0xa8/0x1b0
device_initial_probe+0x14/0x20
bus_probe_device+0xb0/0xb4
deferred_probe_work_func+0x8c/0xc8
process_one_work+0x208/0x60c
worker_thread+0x244/0x388
kthread+0x150/0x228
ret_from_fork+0x10/0x20
irq event stamp: 57267
hardirqs last enabled at (57267): [<ffffd761485e868c>] _raw_spin_unlock_irqrestore+0x74/0x78
hardirqs last disabled at (57266): [<ffffd76147b13c44>] clk_enable_lock+0x7c/0xf0
softirqs last enabled at (56270): [<ffffd7614734446c>] handle_softirqs+0x4c4/0x4dc
softirqs last disabled at (56265): [<ffffd76147290690>] __do_softirq+0x14/0x20
---[ end trace 0000000000000000 ]---
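
The ordering problem behind this warning can be illustrated with a small userspace model. It is only a sketch under stated assumptions: struct dbg_mutex, struct toy_hba and every toy_* and dbg_* function are hypothetical, and the magic field merely imitates the lock->magic != lock condition shown in the warning rather than the real mutex debugging code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy debug mutex: 'magic' points at the lock itself once initialized. */
struct dbg_mutex {
    struct dbg_mutex *magic;
    bool locked;
};

static void dbg_mutex_init(struct dbg_mutex *lock)
{
    lock->magic = lock;
    lock->locked = false;
}

/* Returns 0 on success, -1 if the lock was never initialized. */
static int dbg_mutex_lock(struct dbg_mutex *lock)
{
    if (lock->magic != lock)
        return -1; /* models DEBUG_LOCKS_WARN_ON(lock->magic != lock) */
    assert(!lock->locked);
    lock->locked = true;
    return 0;
}

static void dbg_mutex_unlock(struct dbg_mutex *lock)
{
    lock->locked = false;
}

/* Hypothetical host controller state holding the lock. */
struct toy_hba {
    struct dbg_mutex pm_qos_mutex;
    int qos_updates;
};

/* Stand-in for a PM QoS update that takes the mutex. */
static int toy_pm_qos_update(struct toy_hba *hba)
{
    if (dbg_mutex_lock(&hba->pm_qos_mutex))
        return -1;
    hba->qos_updates++;
    dbg_mutex_unlock(&hba->pm_qos_mutex);
    return 0;
}

/*
 * Stand-in for the init path. With init_lock_first == true the mutex is
 * set up before its first use (the fixed ordering); with false it is set
 * up only afterwards (the buggy ordering).
 */
static int toy_hba_init(struct toy_hba *hba, bool init_lock_first)
{
    int ret;

    if (init_lock_first)
        dbg_mutex_init(&hba->pm_qos_mutex);
    ret = toy_pm_qos_update(hba);
    if (!init_lock_first)
        dbg_mutex_init(&hba->pm_qos_mutex);
    return ret;
}
```

Calling toy_hba_init() on a zeroed struct toy_hba with init_lock_first set to false makes the first lock attempt fail, while passing true lets the update go through.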

Fixes: 79dde5f7dc7c ("scsi: ufs: core: Fix data race in CPU latency PM QoS request handling")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Message-Id: <20250929112730.3782765-1-m.szyprowski@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>