Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

- Fix thresh_return of function graph tracer

The update to store data on the shadow stack removed the abuse of
the task recursion word as a way to keep track of which functions
to ignore. trace_graph_return() was updated to handle this, but when
the function_graph tracer uses a threshold (only tracing functions
that took longer than a specified time), it uses
trace_graph_thresh_return() instead.

This function was still incorrectly using the task struct recursion
word, causing the function graph tracer to permanently set all
functions to "notrace".
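The distinction can be illustrated with a small userspace sketch (toy names: `toy_task` and `TOY_GRAPH_NOTRACE` are illustrative, not kernel symbols). The notrace state lives in a dedicated per-task word handed out by fgraph, and the return path does a test-and-clear on that same word; the bug was testing a different word than the one that was set, so the flag was never cleared.

```c
#include <assert.h>

#define TOY_GRAPH_NOTRACE (1UL << 0)

/* Stand-in for the per-task data word that fgraph now hands out. */
struct toy_task {
	unsigned long graph_word;
};

/* Function entry decides this task's current function is ignored. */
static void toy_set_notrace(struct toy_task *t)
{
	t->graph_word |= TOY_GRAPH_NOTRACE;
}

/*
 * Function return does a test-and-clear on the per-task word.  The
 * broken version tested the (unrelated) recursion word instead, so
 * the flag set here stayed set forever.
 */
static int toy_test_and_clear_notrace(struct toy_task *t)
{
	if (t->graph_word & TOY_GRAPH_NOTRACE) {
		t->graph_word &= ~TOY_GRAPH_NOTRACE;
		return 1;
	}
	return 0;
}
```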

- Fix thresh_return nosleep accounting

When the calltime was moved to the shadow stack storage instead of
being on the fgraph descriptor, the calculation of the sleep time
was updated. The calculation was done in the
trace_graph_thresh_return() function, which also called
trace_graph_return(), which did the calculation again, causing the
adjustment to be applied twice.

Remove the call to trace_graph_return(), as what it needed to do
wasn't much, and do the work directly in
trace_graph_thresh_return().
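A toy model of the double accounting (all names are made up; the real code excludes sleep time by advancing the saved calltime): if the threshold handler applies the sleep adjustment and then calls the plain return handler, which applies it again, the sleep time is subtracted twice from the reported duration.

```c
#include <assert.h>

/* Hide sleep time by pushing calltime forward (toy model). */
static long adjust_for_sleep(long calltime, long sleeptime)
{
	return calltime + sleeptime;
}

/* Bug: thresh_return adjusted, then trace_graph_return() adjusted again. */
static long buggy_duration(long rettime, long calltime, long sleeptime)
{
	calltime = adjust_for_sleep(calltime, sleeptime);
	calltime = adjust_for_sleep(calltime, sleeptime);
	return rettime - calltime;
}

/* Fix: do the adjustment exactly once in trace_graph_thresh_return(). */
static long fixed_duration(long rettime, long calltime, long sleeptime)
{
	calltime = adjust_for_sleep(calltime, sleeptime);
	return rettime - calltime;
}
```

With a 90-unit span containing 20 units of sleep, the fixed version reports 70 while the buggy version reports 50.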

- Fix syscall trace event activation on boot up

The syscall trace events are pseudo events attached to the
raw_syscall tracepoints. When the first syscall event is enabled, it
enables the raw_syscall tracepoint; nothing more needs to be done
when a second syscall event is enabled.

When events are enabled via the kernel command line, syscall events
are only partially enabled, as the enabling is called before
rcu_init(). This is done to allow early events to be enabled
immediately. Because kernel command line events do not distinguish
between different types of events, the syscall events are enabled
here but are not fully functioning. After rcu_init(), they are
disabled and re-enabled so that they can be fully enabled.

The problem is that this "disable-enable" is done one event at a
time. If more than one syscall event is specified on the command
line, disabling them one at a time means the counter never reaches
zero, so the raw_syscall tracepoint is never disabled and
re-enabled, leaving the syscall events in their non-fully-functional
state.

Instead, disable all events and then re-enable them all, which
ensures the raw_syscall tracepoint is also disabled and re-enabled.
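The refcount behavior can be sketched in plain C (toy counters, not the kernel's implementation): re-enabling events one at a time never lets the refcount reach zero, so the tracepoint is never cycled, while the two-phase disable-all/enable-all does cycle it exactly once.

```c
#include <assert.h>

/* Toy model of the raw_syscall tracepoint registration refcount. */
static int raw_refcount;	/* how many syscall events are enabled    */
static int tracepoint_cycles;	/* times refcount dropped to 0 and back   */

static void toy_enable(void)
{
	if (raw_refcount++ == 0)
		tracepoint_cycles++;	/* tracepoint (re)registered */
}

static void toy_disable(void)
{
	raw_refcount--;
}

/* Buggy: disable+enable one event at a time. */
static int reenable_one_at_a_time(int nr_events)
{
	raw_refcount = nr_events;
	tracepoint_cycles = 0;
	for (int i = 0; i < nr_events; i++) {
		toy_disable();
		toy_enable();
	}
	return tracepoint_cycles;
}

/* Fixed: two-phase -- disable all events, then enable them all. */
static int reenable_all(int nr_events)
{
	raw_refcount = nr_events;
	tracepoint_cycles = 0;
	for (int i = 0; i < nr_events; i++)
		toy_disable();
	for (int i = 0; i < nr_events; i++)
		toy_enable();
	return tracepoint_cycles;
}
```

With a single event both schemes cycle the tracepoint; with two or more, only the two-phase version does, which mirrors why the bug showed up only when multiple syscall events were on the command line.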

- Disable preemption in ftrace pid filtering

The ftrace pid filtering attaches to the fork and exit tracepoints
to add or remove pids that should be traced. It accesses variables
protected by RCU (which requires preemption to be disabled). Now
that tracepoint callbacks are called with preemption enabled, this
protection needs to be added explicitly rather than depending on the
callbacks being invoked with preemption disabled.
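How a scope-based guard pairs the disable with an automatic re-enable can be sketched in userspace with the GCC/Clang cleanup attribute (the toy counter stands in for the kernel's preempt count; this is an illustration, not the kernel's guard() implementation):

```c
#include <assert.h>

static int toy_preempt_count;

static void toy_preempt_disable(void)
{
	toy_preempt_count++;
}

/* Cleanup callback: runs automatically when the guarded scope exits. */
static void toy_preempt_enable(int *unused)
{
	(void)unused;
	toy_preempt_count--;
}

/* Scope guard built on the GCC/Clang cleanup attribute. */
#define TOY_GUARD_PREEMPT \
	int toy_guard __attribute__((cleanup(toy_preempt_enable))) = \
		(toy_preempt_disable(), 0)

static int lookup_under_guard(void)
{
	TOY_GUARD_PREEMPT;
	/* The RCU-protected pid-list access would happen here. */
	return toy_preempt_count;
}
```

The guard cannot be forgotten on an early return, which is the point of using guard(preempt)() in the tracepoint callbacks rather than manual disable/enable pairs.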

- Disable preemption in event pid filtering

The event pid filtering needs the same preemption disabling guards as
ftrace pid filtering.

- Fix accounting of the memory mapped ring buffer on fork

Memory mapping the ftrace ring buffer sets VM_DONTCOPY in the
vm_flags. But this does not prevent the application from calling
madvise(MADV_DOFORK), which causes the mapping to be copied on
fork. After the first task exits, the mapping is considered unmapped
by everyone. But when the second task exits, the counter goes below
zero and triggers a WARN_ON.

Since nothing prevents two separate tasks from mmapping the ftrace
ring buffer (although two mappings may mess each other up), there's
no reason to stop the memory from being copied on fork.

Update the vm_operations to have an ".open" handler to update the
accounting and let the ring buffer know someone else has it mapped.
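A toy model of the accounting (illustrative only): without an ".open" handler the counter is bumped only at mmap() time, so after a fork the two ".close" calls drive it negative, which is the WARN_ON case; with ".open" wired up, each VMA duplication bumps the counter and the close calls balance it to zero.

```c
#include <assert.h>

static int user_mapped;		/* toy mapping refcount */

static void toy_mmap(void)  { user_mapped++; }	/* initial mmap()      */
static void toy_open(void)  { user_mapped++; }	/* VMA duplicated      */
static void toy_close(void) { user_mapped--; }	/* a mapping goes away */

/* Returns the final counter; negative means the WARN_ON would fire. */
static int simulate_fork_then_exits(int have_open_handler)
{
	user_mapped = 0;
	toy_mmap();			/* task maps the ring buffer     */
	if (have_open_handler)
		toy_open();		/* fork() duplicates the mapping */
	toy_close();			/* first task exits              */
	toy_close();			/* second task exits             */
	return user_mapped;
}
```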

- Add all ftrace headers in MAINTAINERS file

The MAINTAINERS file only specifies include/linux/ftrace.h, but
misses ftrace_irq.h and ftrace_regs.h. Make the file use wildcards
to match all *ftrace* files.
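The coverage of the new wildcard pattern can be sanity-checked with fnmatch(3) (a quick userspace check, not part of the patch):

```c
#include <assert.h>
#include <fnmatch.h>

/* Returns 1 if path matches the MAINTAINERS-style glob pattern. */
static int covered(const char *pattern, const char *path)
{
	return fnmatch(pattern, path, 0) == 0;
}
```

The old pattern include/*/ftrace.h matches only the one header, while the new include/*/*ftrace* pattern also picks up ftrace_irq.h and ftrace_regs.h.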

* tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Add MAINTAINERS entries for all ftrace headers
tracing: Fix WARN_ON in tracing_buffers_mmap_close
tracing: Disable preemption in the tracepoint callbacks handling filtered pids
ftrace: Disable preemption in the tracepoint callbacks handling filtered pids
tracing: Fix syscall events activation by ensuring refcount hits zero
fgraph: Fix thresh_return nosleeptime double-adjust
fgraph: Fix thresh_return clear per-task notrace

+90 -22
MAINTAINERS | +1 -1
@@ -10484,7 +10484,7 @@
 F: Documentation/trace/ftrace*
 F: arch/*/*/*/*ftrace*
 F: arch/*/*/*ftrace*
-F: include/*/ftrace.h
+F: include/*/*ftrace*
 F: kernel/trace/fgraph.c
 F: kernel/trace/ftrace*
 F: samples/ftrace
include/linux/ring_buffer.h | +1
@@ -248,6 +248,7 @@
 
 int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 		    struct vm_area_struct *vma);
+void ring_buffer_map_dup(struct trace_buffer *buffer, int cpu);
 int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
kernel/trace/ftrace.c | +2
@@ -8611,6 +8611,7 @@
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = data;
 
+	guard(preempt)();
 	pid_list = rcu_dereference_sched(tr->function_pids);
 	trace_filter_add_remove_task(pid_list, self, task);
 
@@ -8624,6 +8625,7 @@
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = data;
 
+	guard(preempt)();
 	pid_list = rcu_dereference_sched(tr->function_pids);
 	trace_filter_add_remove_task(pid_list, NULL, task);
 
kernel/trace/ring_buffer.c | +21
@@ -7310,6 +7310,27 @@
 	return err;
 }
 
+/*
+ * This is called when a VMA is duplicated (e.g., on fork()) to increment
+ * the user_mapped counter without remapping pages.
+ */
+void ring_buffer_map_dup(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (WARN_ON(!cpumask_test_cpu(cpu, buffer->cpumask)))
+		return;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	guard(mutex)(&cpu_buffer->mapping_lock);
+
+	if (cpu_buffer->user_mapped)
+		__rb_inc_dec_mapped(cpu_buffer, true);
+	else
+		WARN(1, "Unexpected buffer stat, it should be mapped");
+}
+
 int ring_buffer_unmap(struct trace_buffer *buffer, int cpu)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
kernel/trace/trace.c | +13
@@ -8213,6 +8213,18 @@
 static inline void put_snapshot_map(struct trace_array *tr) { }
 #endif
 
+/*
+ * This is called when a VMA is duplicated (e.g., on fork()) to increment
+ * the user_mapped counter without remapping pages.
+ */
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = vma->vm_file->private_data;
+	struct trace_iterator *iter = &info->iter;
+
+	ring_buffer_map_dup(iter->array_buffer->buffer, iter->cpu_file);
+}
+
 static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
 {
 	struct ftrace_buffer_info *info = vma->vm_file->private_data;
@@
 }
 
 static const struct vm_operations_struct tracing_buffers_vmops = {
+	.open		= tracing_buffers_mmap_open,
 	.close		= tracing_buffers_mmap_close,
 	.may_split	= tracing_buffers_may_split,
 };
kernel/trace/trace_events.c | +39 -15
@@ -1039,6 +1039,7 @@
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = data;
 
+	guard(preempt)();
 	pid_list = rcu_dereference_raw(tr->filtered_pids);
 	trace_filter_add_remove_task(pid_list, NULL, task);
 
@@ -1054,6 +1055,7 @@
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = data;
 
+	guard(preempt)();
 	pid_list = rcu_dereference_sched(tr->filtered_pids);
 	trace_filter_add_remove_task(pid_list, self, task);
 
@@
 	return 0;
 }
 
-__init void
-early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
+/*
+ * Helper function to enable or disable a comma-separated list of events
+ * from the bootup buffer.
+ */
+static __init void __early_set_events(struct trace_array *tr, char *buf, bool enable)
 {
 	char *token;
-	int ret;
 
-	while (true) {
-		token = strsep(&buf, ",");
-
-		if (!token)
-			break;
-
+	while ((token = strsep(&buf, ","))) {
 		if (*token) {
-			/* Restarting syscalls requires that we stop them first */
-			if (disable_first)
+			if (enable) {
+				if (ftrace_set_clr_event(tr, token, 1))
+					pr_warn("Failed to enable trace event: %s\n", token);
+			} else {
 				ftrace_set_clr_event(tr, token, 0);
-
-			ret = ftrace_set_clr_event(tr, token, 1);
-			if (ret)
-				pr_warn("Failed to enable trace event: %s\n", token);
+			}
 		}
 
 		/* Put back the comma to allow this to be called again */
 		if (buf)
 			*(buf - 1) = ',';
 	}
+}
+
+/**
+ * early_enable_events - enable events from the bootup buffer
+ * @tr: The trace array to enable the events in
+ * @buf: The buffer containing the comma separated list of events
+ * @disable_first: If true, disable all events in @buf before enabling them
+ *
+ * This function enables events from the bootup buffer. If @disable_first
+ * is true, it will first disable all events in the buffer before enabling
+ * them.
+ *
+ * For syscall events, which rely on a global refcount to register the
+ * SYSCALL_WORK_SYSCALL_TRACEPOINT flag (especially for pid 1), we must
+ * ensure the refcount hits zero before re-enabling them. A simple
+ * "disable then enable" per-event is not enough if multiple syscalls are
+ * used, as the refcount will stay above zero. Thus, we need a two-phase
+ * approach: disable all, then enable all.
+ */
+__init void
+early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
+{
+	if (disable_first)
+		__early_set_events(tr, buf, false);
+
+	__early_set_events(tr, buf, true);
 }
 
 static __init int event_trace_enable(void)
kernel/trace/trace_functions_graph.c | +13 -6
@@
 				      struct fgraph_ops *gops,
 				      struct ftrace_regs *fregs)
 {
+	unsigned long *task_var = fgraph_get_task_var(gops);
 	struct fgraph_times *ftimes;
 	struct trace_array *tr;
+	unsigned int trace_ctx;
+	u64 calltime, rettime;
 	int size;
+
+	rettime = trace_clock_local();
 
 	ftrace_graph_addr_finish(gops, trace);
 
-	if (trace_recursion_test(TRACE_GRAPH_NOTRACE_BIT)) {
-		trace_recursion_clear(TRACE_GRAPH_NOTRACE_BIT);
+	if (*task_var & TRACE_GRAPH_NOTRACE) {
+		*task_var &= ~TRACE_GRAPH_NOTRACE;
 		return;
 	}
 
@@
 	tr = gops->private;
 	handle_nosleeptime(tr, trace, ftimes, size);
 
-	if (tracing_thresh &&
-	    (trace_clock_local() - ftimes->calltime < tracing_thresh))
+	calltime = ftimes->calltime;
+
+	if (tracing_thresh && (rettime - calltime < tracing_thresh))
 		return;
-	else
-		trace_graph_return(trace, gops, fregs);
+
+	trace_ctx = tracing_gen_ctx();
+	__trace_graph_return(tr, trace, trace_ctx, calltime, rettime);
 }
 
 static struct fgraph_ops funcgraph_ops = {