Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'trace-v6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

- Fix eventfs to check for names longer than NAME_MAX when creating new
files for events. The eventfs lookup needs to check the return value
of simple_lookup().

- Fix the ring buffer to check against the proper max data size. An
event must be able to fit on a ring buffer sub-buffer; if it cannot,
the write fails and the logic to add the event is skipped. The code
that checks whether an event fits failed to account for the absolute
timestamp that may be added, which can make the event too big to fit.
This caused the ring buffer to go into an infinite loop trying to
find a sub-buffer that would fit the event. Luckily, there's a check
that bails out after looping 1000 times, and it also warns.

The real fix is to not add the absolute timestamp to an event that
starts at the beginning of a sub-buffer, because such an event uses
the sub-buffer timestamp.

Avoiding the timestamp at the start of the sub-buffer allows events
that pass the first check to always find a sub-buffer they can fit
on.

- Have large events that do not fit in a trace_seq print "[LINE TOO
BIG]", as trace_pipe already does, instead of silently dropping the
output.

- Fix a memory leak of forgetting to free the spare page that is saved
by a trace instance.

- Update the size of the snapshot buffer when the main buffer is
updated if the snapshot buffer is allocated.

- Fix ring buffer timestamp logic by removing all the places that tried
to put the before_stamp back to the write_stamp so that the next
event doesn't add an absolute timestamp. Each of these updates
introduced a race: by making the two timestamps equal, it validated
the write_stamp so that it could be incorrectly used for calculating
the delta of an event.

- The temp buffer used for printing an event was allocated with the
event data size when it needed to use the size of the entire event
(meta-data and payload data).

- For hardening, use "%.*s" when printing the trace_marker output, to
limit the amount printed to the size of the event. This was
discovered during development, when a bug that truncated the '\0'
caused a crash.

- Fix a use-after-free bug in the use of the histogram files when an
instance is being removed.

- Remove a useless update of the write_stamp in rb_try_to_discard().
The before_stamp was already changed to force the next event to add
an absolute timestamp, so the write_stamp is not used. Yet the
write_stamp was still being modified with an unneeded 64-bit cmpxchg.

- Fix several races in the 32-bit implementation of rb_time_cmpxchg(),
which emulates a 64-bit cmpxchg.

- While looking at fixing the 64-bit cmpxchg, I noticed that because
the ring buffer uses a normal cmpxchg, and this can be done in NMI
context, there are architectures that do not have a working cmpxchg
in NMI context. For these architectures, fail recording events that
happen in NMI context.

* tag 'trace-v6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI
ring-buffer: Have rb_time_cmpxchg() set the msb counter too
ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()
ring-buffer: Do not try to put back write_stamp
tracing: Fix uaf issue when open the hist or hist_debug file
tracing: Add size check when printing trace_marker output
ring-buffer: Have saved event hold the entire event
ring-buffer: Do not update before stamp when switching sub-buffers
tracing: Update snapshot buffer on resize if it is allocated
ring-buffer: Fix memory leak of free page
eventfs: Fix events beyond NAME_MAX blocking tasks
tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
ring-buffer: Fix writing to the buffer with max_data_size

+72 -82
+4
fs/tracefs/event_inode.c
···
 		if (strcmp(ei_child->name, name) != 0)
 			continue;
 		ret = simple_lookup(dir, dentry, flags);
+		if (IS_ERR(ret))
+			goto out;
 		create_dir_dentry(ei, ei_child, ei_dentry, true);
 		created = true;
 		break;
···
 		if (r <= 0)
 			continue;
 		ret = simple_lookup(dir, dentry, flags);
+		if (IS_ERR(ret))
+			goto out;
 		create_file_dentry(ei, i, ei_dentry, name, mode, cdata,
 				   fops, true);
 		break;
+42 -73
kernel/trace/ring_buffer.c
···
 
 	*cnt = rb_time_cnt(top);
 
-	/* If top and msb counts don't match, this interrupted a write */
-	if (*cnt != rb_time_cnt(msb))
+	/* If top, msb or bottom counts don't match, this interrupted a write */
+	if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom))
 		return false;
 
 	/* The shift to msb will lose its cnt bits */
···
 	unsigned long cnt2, top2, bottom2, msb2;
 	u64 val;
 
+	/* Any interruptions in this function should cause a failure */
+	cnt = local_read(&t->cnt);
+
 	/* The cmpxchg always fails if it interrupted an update */
 	if (!__rb_time_read(t, &val, &cnt2))
 		return false;
···
 	if (val != expect)
 		return false;
 
-	cnt = local_read(&t->cnt);
 	if ((cnt & 3) != cnt2)
 		return false;
 
 	cnt2 = cnt + 1;
 
 	rb_time_split(val, &top, &bottom, &msb);
+	msb = rb_time_val_cnt(msb, cnt);
 	top = rb_time_val_cnt(top, cnt);
 	bottom = rb_time_val_cnt(bottom, cnt);
 
 	rb_time_split(set, &top2, &bottom2, &msb2);
+	msb2 = rb_time_val_cnt(msb2, cnt);
 	top2 = rb_time_val_cnt(top2, cnt2);
 	bottom2 = rb_time_val_cnt(bottom2, cnt2);
···
 		free_buffer_page(bpage);
 	}
 
+	free_page((unsigned long)cpu_buffer->free_page);
+
 	kfree(cpu_buffer);
 }
···
 	 */
 	barrier();
 
-	if ((iter->head + length) > commit || length > BUF_MAX_DATA_SIZE)
+	if ((iter->head + length) > commit || length > BUF_PAGE_SIZE)
 		/* Writer corrupted the read? */
 		goto reset;
···
 	return length;
 }
 
-static u64 rb_time_delta(struct ring_buffer_event *event)
-{
-	switch (event->type_len) {
-	case RINGBUF_TYPE_PADDING:
-		return 0;
-
-	case RINGBUF_TYPE_TIME_EXTEND:
-		return rb_event_time_stamp(event);
-
-	case RINGBUF_TYPE_TIME_STAMP:
-		return 0;
-
-	case RINGBUF_TYPE_DATA:
-		return event->time_delta;
-	default:
-		return 0;
-	}
-}
-
 static inline bool
 rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 		  struct ring_buffer_event *event)
···
 	unsigned long new_index, old_index;
 	struct buffer_page *bpage;
 	unsigned long addr;
-	u64 write_stamp;
-	u64 delta;
 
 	new_index = rb_event_index(event);
 	old_index = new_index + rb_event_ts_length(event);
···
 	bpage = READ_ONCE(cpu_buffer->tail_page);
 
-	delta = rb_time_delta(event);
-
-	if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp))
-		return false;
-
-	/* Make sure the write stamp is read before testing the location */
-	barrier();
-
+	/*
+	 * Make sure the tail_page is still the same and
+	 * the next write location is the end of this event
+	 */
 	if (bpage->page == (void *)addr && rb_page_write(bpage) == old_index) {
 		unsigned long write_mask =
 			local_read(&bpage->write) & ~RB_WRITE_MASK;
···
 		 * to make sure that the next event adds an absolute
 		 * value and does not rely on the saved write stamp, which
 		 * is now going to be bogus.
+		 *
+		 * By setting the before_stamp to zero, the next event
+		 * is not going to use the write_stamp and will instead
+		 * create an absolute timestamp. This means there's no
+		 * reason to update the wirte_stamp!
 		 */
 		rb_time_set(&cpu_buffer->before_stamp, 0);
-
-		/* Something came in, can't discard */
-		if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
-				     write_stamp, write_stamp - delta))
-			return false;
 
 		/*
 		 * If an event were to come in now, it would see that the
 		 * write_stamp and the before_stamp are different, and assume
 		 * that this event just added itself before updating
 		 * the write stamp. The interrupting event will fix the
-		 * write stamp for us, and use the before stamp as its delta.
+		 * write stamp for us, and use an absolute timestamp.
 		 */
 
 		/*
···
 		return;
 
 	/*
-	 * If this interrupted another event, 
+	 * If this interrupted another event,
 	 */
 	if (atomic_inc_return(this_cpu_ptr(&checking)) != 1)
 		goto out;
···
 	 * absolute timestamp.
 	 * Don't bother if this is the start of a new page (w == 0).
 	 */
-	if (unlikely(!a_ok || !b_ok || (info->before != info->after && w))) {
+	if (!w) {
+		/* Use the sub-buffer timestamp */
+		info->delta = 0;
+	} else if (unlikely(!a_ok || !b_ok || info->before != info->after)) {
 		info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
 		info->length += RB_LEN_TIME_EXTEND;
 	} else {
···
 	/* See if we shot pass the end of this buffer page */
 	if (unlikely(write > BUF_PAGE_SIZE)) {
-		/* before and after may now different, fix it up*/
-		b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
-		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
-		if (a_ok && b_ok && info->before != info->after)
-			(void)rb_time_cmpxchg(&cpu_buffer->before_stamp,
-					      info->before, info->after);
-		if (a_ok && b_ok)
-			check_buffer(cpu_buffer, info, CHECK_FULL_PAGE);
+		check_buffer(cpu_buffer, info, CHECK_FULL_PAGE);
 		return rb_move_tail(cpu_buffer, tail, info);
 	}
 
 	if (likely(tail == w)) {
-		u64 save_before;
-		bool s_ok;
-
 		/* Nothing interrupted us between A and C */
 /*D*/		rb_time_set(&cpu_buffer->write_stamp, info->ts);
-		barrier();
-/*E*/		s_ok = rb_time_read(&cpu_buffer->before_stamp, &save_before);
-		RB_WARN_ON(cpu_buffer, !s_ok);
+		/*
+		 * If something came in between C and D, the write stamp
+		 * may now not be in sync. But that's fine as the before_stamp
+		 * will be different and then next event will just be forced
+		 * to use an absolute timestamp.
+		 */
 		if (likely(!(info->add_timestamp &
 			     (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE))))
 			/* This did not interrupt any time update */
···
 		else
 			/* Just use full timestamp for interrupting event */
 			info->delta = info->ts;
-		barrier();
 		check_buffer(cpu_buffer, info, tail);
-		if (unlikely(info->ts != save_before)) {
-			/* SLOW PATH - Interrupted between C and E */
-
-			a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
-			RB_WARN_ON(cpu_buffer, !a_ok);
-
-			/* Write stamp must only go forward */
-			if (save_before > info->after) {
-				/*
-				 * We do not care about the result, only that
-				 * it gets updated atomically.
-				 */
-				(void)rb_time_cmpxchg(&cpu_buffer->write_stamp,
-						      info->after, save_before);
-			}
-		}
 	} else {
 		u64 ts;
 		/* SLOW PATH - Interrupted between A and C */
···
 	int nr_loops = 0;
 	int add_ts_default;
 
+	/* ring buffer does cmpxchg, make sure it is safe in NMI context */
+	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
+	    (unlikely(in_nmi()))) {
+		return NULL;
+	}
+
 	rb_start_commit(cpu_buffer);
 	/* The commit page can not change after this */
···
 	if (ring_buffer_time_stamp_abs(cpu_buffer->buffer)) {
 		add_ts_default = RB_ADD_STAMP_ABSOLUTE;
 		info.length += RB_LEN_TIME_EXTEND;
+		if (info.length > BUF_MAX_DATA_SIZE)
+			goto out_fail;
 	} else {
 		add_ts_default = RB_ADD_STAMP_NONE;
 	}
···
 	if (!iter)
 		return NULL;
 
-	iter->event = kmalloc(BUF_MAX_DATA_SIZE, flags);
+	/* Holds the entire event: data and meta data */
+	iter->event = kmalloc(BUF_PAGE_SIZE, flags);
 	if (!iter->event) {
 		kfree(iter);
 		return NULL;
+13 -3
kernel/trace/trace.c
···
 		iter->leftover = ret;
 
 	} else {
-		print_trace_line(iter);
+		ret = print_trace_line(iter);
+		if (ret == TRACE_TYPE_PARTIAL_LINE) {
+			iter->seq.full = 0;
+			trace_seq_puts(&iter->seq, "[LINE TOO BIG]\n");
+		}
 		ret = trace_print_seq(m, &iter->seq);
 		/*
 		 * If we overflow the seq_file buffer, then it will
···
 	event_file_put(file);
 
 	return 0;
+}
+
+int tracing_single_release_file_tr(struct inode *inode, struct file *filp)
+{
+	tracing_release_file_tr(inode, filp);
+	return single_release(inode, filp);
 }
 
 static int tracing_mark_open(struct inode *inode, struct file *filp)
···
 	if (!tr->array_buffer.buffer)
 		return 0;
 
-	/* Do not allow tracing while resizng ring buffer */
+	/* Do not allow tracing while resizing ring buffer */
 	tracing_stop_tr(tr);
 
 	ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
···
 		goto out_start;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
-	if (!tr->current_trace->use_max_tr)
+	if (!tr->allocated_snapshot)
 		goto out;
 
 	ret = ring_buffer_resize(tr->max_buffer.buffer, size, cpu);
+1
kernel/trace/trace.h
···
 int tracing_open_generic_tr(struct inode *inode, struct file *filp);
 int tracing_open_file_tr(struct inode *inode, struct file *filp);
 int tracing_release_file_tr(struct inode *inode, struct file *filp);
+int tracing_single_release_file_tr(struct inode *inode, struct file *filp);
 bool tracing_is_disabled(void);
 bool tracer_tracing_is_on(struct trace_array *tr);
 void tracer_tracing_on(struct trace_array *tr);
+8 -4
kernel/trace/trace_events_hist.c
···
 {
 	int ret;
 
-	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	ret = tracing_open_file_tr(inode, file);
 	if (ret)
 		return ret;
 
+	/* Clear private_data to avoid warning in single_open() */
+	file->private_data = NULL;
 	return single_open(file, hist_show, file);
 }
···
 	.open = event_hist_open,
 	.read = seq_read,
 	.llseek = seq_lseek,
-	.release = single_release,
+	.release = tracing_single_release_file_tr,
 };
 
 #ifdef CONFIG_HIST_TRIGGERS_DEBUG
···
 {
 	int ret;
 
-	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	ret = tracing_open_file_tr(inode, file);
 	if (ret)
 		return ret;
 
+	/* Clear private_data to avoid warning in single_open() */
+	file->private_data = NULL;
 	return single_open(file, hist_debug_show, file);
 }
···
 	.open = event_hist_debug_open,
 	.read = seq_read,
 	.llseek = seq_lseek,
-	.release = single_release,
+	.release = tracing_single_release_file_tr,
 };
 #endif
+4 -2
kernel/trace/trace_output.c
···
 {
 	struct print_entry *field;
 	struct trace_seq *s = &iter->seq;
+	int max = iter->ent_size - offsetof(struct print_entry, buf);
 
 	trace_assign_type(field, iter->ent);
 
 	seq_print_ip_sym(s, field->ip, flags);
-	trace_seq_printf(s, ": %s", field->buf);
+	trace_seq_printf(s, ": %.*s", max, field->buf);
 
 	return trace_handle_return(s);
 }
···
 			      struct trace_event *event)
 {
 	struct print_entry *field;
+	int max = iter->ent_size - offsetof(struct print_entry, buf);
 
 	trace_assign_type(field, iter->ent);
 
-	trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf);
+	trace_seq_printf(&iter->seq, "# %lx %.*s", field->ip, max, field->buf);
 
 	return trace_handle_return(&iter->seq);
 }