Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
code
Clone this repository
https://tangled.org/tjh.dev/kernel
git@gordian.tjh.dev:tjh.dev/kernel
For self-hosted knots, clone URLs may differ based on your setup.
Pull tracing/probes fixes from Steven Rostedt:
- Fix possible NULL pointer dereference on trace_event_file in
kprobe_event_gen_test_exit()
- Fix NULL pointer dereference for trace_array in
kprobe_event_gen_test_exit()
- Fix memory leak of filter string for eprobes
- Fix a possible memory leak in rethook_alloc()
- Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case which
can cause a possible use-after-free
- Fix warning in eprobe filter creation
- Fix eprobe filter creation as it picked the wrong event for the
fields
* tag 'trace-probes-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/eprobe: Fix eprobe filter to make a filter correctly
tracing/eprobe: Fix warning in filter creation
kprobes: Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
rethook: fix a potential memleak in rethook_alloc()
tracing/eprobe: Fix memory leak of filter string
tracing: kprobe: Fix potential null-ptr-deref on trace_array in kprobe_event_gen_test_exit()
tracing: kprobe: Fix potential null-ptr-deref on trace_event_file in kprobe_event_gen_test_exit()
Pull tracing fixes from Steven Rostedt:
- Fix polling to block on watermark like the reads do, as user space
applications get confused when the select says read is available, and
then the read blocks
- Fix accounting of ring buffer dropped pages as it is what is used to
determine if the buffer is empty or not
- Fix memory leak in tracing_read_pipe()
- Fix struct trace_array warning about being declared in parameters
- Fix accounting of ftrace pages used in output at start up.
- Fix allocation of dyn_ftrace pages by subtracting one from order
instead of diving it by 2
- Static analyzer found a case were a pointer being used outside of a
NULL check (rb_head_page_deactivate())
- Fix possible NULL pointer dereference if kstrdup() fails in
ftrace_add_mod()
- Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
- Fix bad pointer dereference in register_synth_event() on error path
- Remove unused __bad_type_size() method
- Fix possible NULL pointer dereference of entry in list 'tr->err_log'
- Fix NULL pointer deference race if eprobe is called before the event
setup
* tag 'trace-v6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix race where eprobes can be called before the event
tracing: Fix potential null-pointer-access of entry in list 'tr->err_log'
tracing: Remove unused __bad_type_size() method
tracing: Fix wild-memory-access in register_synth_event()
tracing: Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
ftrace: Fix null pointer dereference in ftrace_add_mod()
ring_buffer: Do not deactivate non-existant pages
ftrace: Optimize the allocation for mcount entries
ftrace: Fix the possible incorrect kernel message
tracing: Fix warning on variable 'struct trace_array'
tracing: Fix memory leak in tracing_read_pipe()
ring-buffer: Include dropped pages in counting dirty patches
tracing/ring-buffer: Have polling block on watermark
Since the eprobe filter was defined based on the eprobe's trace event
itself, it doesn't work correctly. Use the original trace event of
the eprobe when making the filter so that the filter works correctly.
Without this fix:
# echo 'e syscalls/sys_enter_openat \
flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
# echo 1 > events/eprobes/sys_enter_openat/enable
[ 114.551550] event trace: Could not enable event sys_enter_openat
-bash: echo: write error: Invalid argument
With this fix:
# echo 'e syscalls/sys_enter_openat \
flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
# echo 1 > events/eprobes/sys_enter_openat/enable
# tail trace
cat-241 [000] ...1. 266.498449: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0
cat-242 [000] ...1. 266.977640: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0
Link: https://lore.kernel.org/all/166823166395.1385292.8931770640212414483.stgit@devnote3/
Fixes: 752be5c5c910 ("tracing/eprobe: Add eprobe filter support")
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Tested-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Pull x86 fixes from Borislav Petkov:
- Do not hold fpregs lock when inheriting FPU permissions because the
fpregs lock disables preemption on RT but fpu_inherit_perms() does
spin_lock_irq(), which, on RT, uses rtmutexes and they need to be
preemptible.
- Check the page offset and the length of the data supplied by
userspace for overflow when specifying a set of pages to add to an
SGX enclave
* tag 'x86_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Drop fpregs lock before inheriting FPU permissions
x86/sgx: Add overflow check in sgx_validate_offset_length()
The flag that tells the event to call its triggers after reading the event
is set for eprobes after the eprobe is enabled. This leads to a race where
the eprobe may be triggered at the beginning of the event where the record
information is NULL. The eprobe then dereferences the NULL record causing
a NULL kernel pointer bug.
Test for a NULL record to keep this from happening.
Link: https://lore.kernel.org/linux-trace-kernel/20221116192552.1066630-1-rafaelmendsr@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221117214249.2addbe10@gandalf.local.home
Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The filter pointer (filterp) passed to create_filter() function must be a
pointer that references a NULL pointer, otherwise, we get a warning when
adding a filter option to the event probe:
root@localhost:/sys/kernel/tracing# echo 'e:egroup/stat_runtime_4core sched/sched_stat_runtime \
runtime=$runtime:u32 if cpu < 4' >> dynamic_events
[ 5034.340439] ------------[ cut here ]------------
[ 5034.341258] WARNING: CPU: 0 PID: 223 at kernel/trace/trace_events_filter.c:1939 create_filter+0x1db/0x250
[...] stripped
[ 5034.345518] RIP: 0010:create_filter+0x1db/0x250
[...] stripped
[ 5034.351604] Call Trace:
[ 5034.351803] <TASK>
[ 5034.351959] ? process_preds+0x1b40/0x1b40
[ 5034.352241] ? rcu_read_lock_bh_held+0xd0/0xd0
[ 5034.352604] ? kasan_set_track+0x29/0x40
[ 5034.352904] ? kasan_save_alloc_info+0x1f/0x30
[ 5034.353264] create_event_filter+0x38/0x50
[ 5034.353573] __trace_eprobe_create+0x16f4/0x1d20
[ 5034.353964] ? eprobe_dyn_event_release+0x360/0x360
[ 5034.354363] ? mark_held_locks+0xa6/0xf0
[ 5034.354684] ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 5034.355105] ? trace_hardirqs_on+0x41/0x120
[ 5034.355417] ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 5034.355751] ? __create_object+0x5b7/0xcf0
[ 5034.356027] ? lock_is_held_type+0xaf/0x120
[ 5034.356362] ? rcu_read_lock_bh_held+0xb0/0xd0
[ 5034.356716] ? rcu_read_lock_bh_held+0xd0/0xd0
[ 5034.357084] ? kasan_set_track+0x29/0x40
[ 5034.357411] ? kasan_save_alloc_info+0x1f/0x30
[ 5034.357715] ? __kasan_kmalloc+0xb8/0xc0
[ 5034.357985] ? write_comp_data+0x2f/0x90
[ 5034.358302] ? __sanitizer_cov_trace_pc+0x25/0x60
[ 5034.358691] ? argv_split+0x381/0x460
[ 5034.358949] ? write_comp_data+0x2f/0x90
[ 5034.359240] ? eprobe_dyn_event_release+0x360/0x360
[ 5034.359620] trace_probe_create+0xf6/0x110
[ 5034.359940] ? trace_probe_match_command_args+0x240/0x240
[ 5034.360376] eprobe_dyn_event_create+0x21/0x30
[ 5034.360709] create_dyn_event+0xf3/0x1a0
[ 5034.360983] trace_parse_run_command+0x1a9/0x2e0
[ 5034.361297] ? dyn_event_release+0x500/0x500
[ 5034.361591] dyn_event_write+0x39/0x50
[ 5034.361851] vfs_write+0x311/0xe50
[ 5034.362091] ? dyn_event_seq_next+0x40/0x40
[ 5034.362376] ? kernel_write+0x5b0/0x5b0
[ 5034.362637] ? write_comp_data+0x2f/0x90
[ 5034.362937] ? __sanitizer_cov_trace_pc+0x25/0x60
[ 5034.363258] ? ftrace_syscall_enter+0x544/0x840
[ 5034.363563] ? write_comp_data+0x2f/0x90
[ 5034.363837] ? __sanitizer_cov_trace_pc+0x25/0x60
[ 5034.364156] ? write_comp_data+0x2f/0x90
[ 5034.364468] ? write_comp_data+0x2f/0x90
[ 5034.364770] ksys_write+0x158/0x2a0
[ 5034.365022] ? __ia32_sys_read+0xc0/0xc0
[ 5034.365344] __x64_sys_write+0x7c/0xc0
[ 5034.365669] ? syscall_enter_from_user_mode+0x53/0x70
[ 5034.366084] do_syscall_64+0x60/0x90
[ 5034.366356] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 5034.366767] RIP: 0033:0x7ff0b43938f3
[...] stripped
[ 5034.371892] </TASK>
[ 5034.374720] ---[ end trace 0000000000000000 ]---
Link: https://lore.kernel.org/all/20221108202148.1020111-1-rafaelmendsr@gmail.com/
Fixes: 752be5c5c910 ("tracing/eprobe: Add eprobe filter support")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Pull scheduler fixes from Borislav Petkov:
- Fix a small race on the task's exit path where there's a
misunderstanding whether the task holds rq->lock or not
- Prevent processes from getting killed when using deprecated or
unknown rseq ABI flags in order to be able to fuzz the rseq() syscall
with syzkaller
* tag 'sched_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix race in task_call_func()
rseq: Use pr_warn_once() when deprecated/unknown ABI flags are encountered
Mike Galbraith reported the following against an old fork of preempt-rt
but the same issue also applies to the current preempt-rt tree.
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: systemd
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
fpu_clone
CPU: 6 PID: 1 Comm: systemd Tainted: G E (unreleased)
Call Trace:
<TASK>
dump_stack_lvl
? fpu_clone
__might_resched
rt_spin_lock
fpu_clone
? copy_thread
? copy_process
? shmem_alloc_inode
? kmem_cache_alloc
? kernel_clone
? __do_sys_clone
? do_syscall_64
? __x64_sys_rt_sigprocmask
? syscall_exit_to_user_mode
? do_syscall_64
? syscall_exit_to_user_mode
? do_syscall_64
? syscall_exit_to_user_mode
? do_syscall_64
? exc_page_fault
? entry_SYSCALL_64_after_hwframe
</TASK>
Mike says:
The splat comes from fpu_inherit_perms() being called under fpregs_lock(),
and us reaching the spin_lock_irq() therein due to fpu_state_size_dynamic()
returning true despite static key __fpu_state_size_dynamic having never
been enabled.
Mike's assessment looks correct. fpregs_lock on a PREEMPT_RT kernel disables
preemption so calling spin_lock_irq() in fpu_inherit_perms() is unsafe. This
problem exists since commit
9e798e9aa14c ("x86/fpu: Prepare fpu_clone() for dynamically enabled features").
Even though the original bug report should not have enabled the paths at
all, the bug still exists.
fpregs_lock is necessary when editing the FPU registers or a task's FP
state but it is not necessary for fpu_inherit_perms(). The only write
of any FP state in fpu_inherit_perms() is for the new child which is
not running yet and cannot context switch or be borrowed by a kernel
thread yet. Hence, fpregs_lock is not protecting anything in the new
child until clone() completes and can be dropped earlier. The siglock
still needs to be acquired by fpu_inherit_perms() as the read of the
parent's permissions has to be serialised.
[ bp: Cleanup splat. ]
Fixes: 9e798e9aa14c ("x86/fpu: Prepare fpu_clone() for dynamically enabled features")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20221110124400.zgymc2lnwqjukgfh@techsingularity.net
Entries in list 'tr->err_log' will be reused after entry number
exceed TRACING_LOG_ERRS_MAX.
The cmd string of the to be reused entry will be freed first then
allocated a new one. If the allocation failed, then the entry will
still be in list 'tr->err_log' but its 'cmd' field is set to be NULL,
later access of 'cmd' is risky.
Currently above problem can cause the loss of 'cmd' information of first
entry in 'tr->err_log'. When execute `cat /sys/kernel/tracing/error_log`,
reproduce logs like:
[ 37.495100] trace_kprobe: error: Maxactive is not for kprobe(null) ^
[ 38.412517] trace_kprobe: error: Maxactive is not for kprobe
Command: p4:myprobe2 do_sys_openat2
^
Link: https://lore.kernel.org/linux-trace-kernel/20221114104632.3547266-1-zhengyejian1@huawei.com
Fixes: 1581a884b7ca ("tracing: Remove size restriction on tracing_log_err cmd strings")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In __unregister_kprobe_top(), if the currently unregistered probe has
post_handler but other child probes of the aggrprobe do not have
post_handler, the post_handler of the aggrprobe is cleared. If this is
a ftrace-based probe, there is a problem. In later calls to
disarm_kprobe(), we will use kprobe_ftrace_ops because post_handler is
NULL. But we're armed with kprobe_ipmodify_ops. This triggers a WARN in
__disarm_kprobe_ftrace() and may even cause use-after-free:
Failed to disarm kprobe-ftrace at kernel_clone+0x0/0x3c0 (error -2)
WARNING: CPU: 5 PID: 137 at kernel/kprobes.c:1135 __disarm_kprobe_ftrace.isra.21+0xcf/0xe0
Modules linked in: testKprobe_007(-)
CPU: 5 PID: 137 Comm: rmmod Not tainted 6.1.0-rc4-dirty #18
[...]
Call Trace:
<TASK>
__disable_kprobe+0xcd/0xe0
__unregister_kprobe_top+0x12/0x150
? mutex_lock+0xe/0x30
unregister_kprobes.part.23+0x31/0xa0
unregister_kprobe+0x32/0x40
__x64_sys_delete_module+0x15e/0x260
? do_user_addr_fault+0x2cd/0x6b0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
[...]
For the kprobe-on-ftrace case, we keep the post_handler setting to
identify this aggrprobe armed with kprobe_ipmodify_ops. This way we
can disarm it correctly.
Link: https://lore.kernel.org/all/20221112070000.35299-1-lihuafei1@huawei.com/
Fixes: 0bc11ed5ab60 ("kprobes: Allow kprobes coexist with livepatch")
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Pull perf fixes from Borislav Petkov:
- Fix an intel PT erratum where CPUs do not support single range output
for more than 4K
- Fix a NULL ptr dereference which can happen after an NMI interferes
with the event enabling dance in amd_pmu_enable_all()
- Free the events array too when freeing uncore contexts on CPU online,
thereby fixing a memory leak
- Improve the pending SIGTRAP check
* tag 'perf_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix sampling using single range output
perf/x86/amd: Fix crash due to race between amd_pmu_enable_all, perf NMI and throttling
perf/x86/amd/uncore: Fix memory leak for events array
perf: Improve missing SIGTRAP checking
There is a very narrow race between schedule() and task_call_func().
CPU0 CPU1
__schedule()
rq_lock();
prev_state = READ_ONCE(prev->__state);
if (... && prev_state) {
deactivate_tasl(rq, prev, ...)
prev->on_rq = 0;
task_call_func()
raw_spin_lock_irqsave(p->pi_lock);
state = READ_ONCE(p->__state);
smp_rmb();
if (... || p->on_rq) // false!!!
rq = __task_rq_lock()
ret = func();
next = pick_next_task();
rq = context_switch(prev, next)
prepare_lock_switch()
spin_release(&__rq_lockp(rq)->dep_map...)
So while the task is on it's way out, it still holds rq->lock for a
little while, and right then task_call_func() comes in and figures it
doesn't need rq->lock anymore (because the task is already dequeued --
but still running there) and then the __set_task_frozen() thing observes
it's holding rq->lock and yells murder.
Avoid this by waiting for p->on_cpu to get cleared, which guarantees
the task is fully finished on the old CPU.
( While arguably the fixes tag is 'wrong' -- none of the previous
task_call_func() users appears to care for this case. )
Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic")
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://lkml.kernel.org/r/Y1kdRNNfUeAU+FNl@hirez.programming.kicks-ass.net
sgx_validate_offset_length() function verifies "offset" and "length"
arguments provided by userspace, but was missing an overflow check on
their addition. Add it.
Fixes: c6d26d370767 ("x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES")
Signed-off-by: Borys Popławski <borysp@invisiblethingslab.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Cc: stable@vger.kernel.org # v5.11+
Link: https://lore.kernel.org/r/0d91ac79-6d84-abed-5821-4dbe59fa1a38@invisiblethingslab.com
__bad_type_size() is unused after
commit 04ae87a52074("ftrace: Rework event_create_dir()").
So, remove it.
Link: https://lkml.kernel.org/r/D062EC2E-7DB7-4402-A67E-33C3577F551E@gmail.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In rethook_alloc(), the variable rh is not freed or passed out
if handler is NULL, which could lead to a memleak, fix it.
Link: https://lore.kernel.org/all/20221110104438.88099-1-yiyang13@huawei.com/
[Masami: Add "rethook:" tag to the title.]
Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
Cc: stable@vger.kernel.org
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Acke-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>