Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

- Fix thresh_return of function graph tracer

The update to store data on the shadow stack removed the abuse of
using the task recursion word as a way to keep track of what
functions to ignore. The trace_graph_return() was updated to handle
this, but when function_graph tracer is using a threshold (only trace
functions that took longer than a specified time), it uses
trace_graph_thresh_return() instead.

This function was still incorrectly using the task struct recursion
word, causing the function graph tracer to permanently set all
functions to "notrace".

- Fix thresh_return nosleep accounting

When the calltime was moved to the shadow stack storage instead of
being on the fgraph descriptor, the calculation of the sleep time was
updated. The calculation was done in the trace_graph_thresh_return()
function, which also called trace_graph_return(), which did the
calculation again, causing the time to be doubled.

Remove the call to trace_graph_return(), as what it needed to do was
minimal, and do the work directly in trace_graph_thresh_return().

- Fix syscall trace event activation on boot up

The syscall trace events are pseudo events attached to the
raw_syscall tracepoints. When the first syscall event is enabled, it
enables the raw_syscall tracepoint and doesn't need to do anything
when a second syscall event is also enabled.

When events are enabled via the kernel command line, syscall events
are only partially enabled, as the enabling is called before rcu_init.
This is done to allow early events to be enabled immediately. Because
kernel command line events do not distinguish between different types
of events, the syscall events are enabled here but are not fully
functional. After rcu_init, they are disabled and re-enabled so that
they can be fully enabled.

The problem is that this "disable-enable" is done one event at a
time. If more than one syscall event is specified on the command
line, disabling them one at a time means the counter never reaches
zero, so the raw_syscall tracepoint is never disabled and re-enabled,
leaving the syscall events in their non-fully-functional state.

Instead, disable all events and then re-enable them all, as that
ensures the raw_syscall tracepoint is also disabled and re-enabled.

- Disable preemption in ftrace pid filtering

The ftrace pid filtering attaches to the fork and exit tracepoints to
add or remove pids that should be traced. They access variables
protected by RCU (preemption disabled). Now that tracepoint callbacks
are called with preemption enabled, this protection needs to be added
explicitly, and not depend on the functions being called with
preemption disabled.

- Disable preemption in event pid filtering

The event pid filtering needs the same preemption disabling guards as
ftrace pid filtering.

- Fix accounting of the memory mapped ring buffer on fork

Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY.
But this does not prevent the application from calling
madvise(MADV_DOFORK). This causes the mapping to be copied on
fork. After the first task exits, the mapping is considered unmapped
by everyone. But when the second task exits, the counter goes below
zero and triggers a WARN_ON.

Since nothing prevents two separate tasks from mmapping the ftrace
ring buffer (although two mappings may mess each other up), there's
no reason to stop the memory from being copied on fork.

Update the vm_operations to have an ".open" handler to update the
accounting and let the ring buffer know someone else has it mapped.

- Add all ftrace headers in MAINTAINERS file

The MAINTAINERS file only specifies include/linux/ftrace.h, but
misses ftrace_irq.h and ftrace_regs.h. Make the file use wildcards to
pick up all *ftrace* files.

* tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Add MAINTAINERS entries for all ftrace headers
tracing: Fix WARN_ON in tracing_buffers_mmap_close
tracing: Disable preemption in the tracepoint callbacks handling filtered pids
ftrace: Disable preemption in the tracepoint callbacks handling filtered pids
tracing: Fix syscall events activation by ensuring refcount hits zero
fgraph: Fix thresh_return nosleeptime double-adjust
fgraph: Fix thresh_return clear per-task notrace

+90 -22
+1 -1
MAINTAINERS
···
 F: Documentation/trace/ftrace*
 F: arch/*/*/*/*ftrace*
 F: arch/*/*/*ftrace*
-F: include/*/ftrace.h
+F: include/*/*ftrace*
 F: kernel/trace/fgraph.c
 F: kernel/trace/ftrace*
 F: samples/ftrace
+1
include/linux/ring_buffer.h
···
 
 int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 		    struct vm_area_struct *vma);
+void ring_buffer_map_dup(struct trace_buffer *buffer, int cpu);
 int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
+2
kernel/trace/ftrace.c
···
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = data;
 
+	guard(preempt)();
 	pid_list = rcu_dereference_sched(tr->function_pids);
 	trace_filter_add_remove_task(pid_list, self, task);
 
···
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = data;
 
+	guard(preempt)();
 	pid_list = rcu_dereference_sched(tr->function_pids);
 	trace_filter_add_remove_task(pid_list, NULL, task);
 
+21
kernel/trace/ring_buffer.c
···
 	return err;
 }
 
+/*
+ * This is called when a VMA is duplicated (e.g., on fork()) to increment
+ * the user_mapped counter without remapping pages.
+ */
+void ring_buffer_map_dup(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (WARN_ON(!cpumask_test_cpu(cpu, buffer->cpumask)))
+		return;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	guard(mutex)(&cpu_buffer->mapping_lock);
+
+	if (cpu_buffer->user_mapped)
+		__rb_inc_dec_mapped(cpu_buffer, true);
+	else
+		WARN(1, "Unexpected buffer stat, it should be mapped");
+}
+
 int ring_buffer_unmap(struct trace_buffer *buffer, int cpu)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
+13
kernel/trace/trace.c
···
 static inline void put_snapshot_map(struct trace_array *tr) { }
 #endif
 
+/*
+ * This is called when a VMA is duplicated (e.g., on fork()) to increment
+ * the user_mapped counter without remapping pages.
+ */
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = vma->vm_file->private_data;
+	struct trace_iterator *iter = &info->iter;
+
+	ring_buffer_map_dup(iter->array_buffer->buffer, iter->cpu_file);
+}
+
 static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
 {
 	struct ftrace_buffer_info *info = vma->vm_file->private_data;
···
 }
 
 static const struct vm_operations_struct tracing_buffers_vmops = {
+	.open = tracing_buffers_mmap_open,
 	.close = tracing_buffers_mmap_close,
 	.may_split = tracing_buffers_may_split,
 };
+39 -15
kernel/trace/trace_events.c
···
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = data;
 
+	guard(preempt)();
 	pid_list = rcu_dereference_raw(tr->filtered_pids);
 	trace_filter_add_remove_task(pid_list, NULL, task);
 
···
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = data;
 
+	guard(preempt)();
 	pid_list = rcu_dereference_sched(tr->filtered_pids);
 	trace_filter_add_remove_task(pid_list, self, task);
 
···
 	return 0;
 }
 
-__init void
-early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
+/*
+ * Helper function to enable or disable a comma-separated list of events
+ * from the bootup buffer.
+ */
+static __init void __early_set_events(struct trace_array *tr, char *buf, bool enable)
 {
 	char *token;
-	int ret;
 
-	while (true) {
-		token = strsep(&buf, ",");
-
-		if (!token)
-			break;
-
+	while ((token = strsep(&buf, ","))) {
 		if (*token) {
-			/* Restarting syscalls requires that we stop them first */
-			if (disable_first)
+			if (enable) {
+				if (ftrace_set_clr_event(tr, token, 1))
+					pr_warn("Failed to enable trace event: %s\n", token);
+			} else {
 				ftrace_set_clr_event(tr, token, 0);
-
-			ret = ftrace_set_clr_event(tr, token, 1);
-			if (ret)
-				pr_warn("Failed to enable trace event: %s\n", token);
+			}
 		}
 
 		/* Put back the comma to allow this to be called again */
 		if (buf)
 			*(buf - 1) = ',';
 	}
+}
+
+/**
+ * early_enable_events - enable events from the bootup buffer
+ * @tr: The trace array to enable the events in
+ * @buf: The buffer containing the comma separated list of events
+ * @disable_first: If true, disable all events in @buf before enabling them
+ *
+ * This function enables events from the bootup buffer. If @disable_first
+ * is true, it will first disable all events in the buffer before enabling
+ * them.
+ *
+ * For syscall events, which rely on a global refcount to register the
+ * SYSCALL_WORK_SYSCALL_TRACEPOINT flag (especially for pid 1), we must
+ * ensure the refcount hits zero before re-enabling them. A simple
+ * "disable then enable" per-event is not enough if multiple syscalls are
+ * used, as the refcount will stay above zero. Thus, we need a two-phase
+ * approach: disable all, then enable all.
+ */
+__init void
+early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
+{
+	if (disable_first)
+		__early_set_events(tr, buf, false);
+
+	__early_set_events(tr, buf, true);
 }
 
 static __init int event_trace_enable(void)
+13 -6
kernel/trace/trace_functions_graph.c
···
 				 struct fgraph_ops *gops,
 				 struct ftrace_regs *fregs)
 {
+	unsigned long *task_var = fgraph_get_task_var(gops);
 	struct fgraph_times *ftimes;
 	struct trace_array *tr;
+	unsigned int trace_ctx;
+	u64 calltime, rettime;
 	int size;
+
+	rettime = trace_clock_local();
 
 	ftrace_graph_addr_finish(gops, trace);
 
-	if (trace_recursion_test(TRACE_GRAPH_NOTRACE_BIT)) {
-		trace_recursion_clear(TRACE_GRAPH_NOTRACE_BIT);
+	if (*task_var & TRACE_GRAPH_NOTRACE) {
+		*task_var &= ~TRACE_GRAPH_NOTRACE;
 		return;
 	}
···
 	tr = gops->private;
 	handle_nosleeptime(tr, trace, ftimes, size);
 
-	if (tracing_thresh &&
-	    (trace_clock_local() - ftimes->calltime < tracing_thresh))
+	calltime = ftimes->calltime;
+
+	if (tracing_thresh && (rettime - calltime < tracing_thresh))
 		return;
-	else
-		trace_graph_return(trace, gops, fregs);
+
+	trace_ctx = tracing_gen_ctx();
+	__trace_graph_return(tr, trace, trace_ctx, calltime, rettime);
 }
 
 static struct fgraph_ops funcgraph_ops = {