Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

ftrace: document updates

The following updates were recommended by Elias Oltmanns and Randy Dunlap.

[ updates based on Andrew Morton's comments are still to come. ]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by

Steven Rostedt and committed by
Linus Torvalds
a41eebab 17489c05

+71 -63
+71 -63
Documentation/ftrace.txt
··· 2 2 ======================== 3 3 4 4 Copyright 2008 Red Hat Inc. 5 - Author: Steven Rostedt <srostedt@redhat.com> 5 + Author: Steven Rostedt <srostedt@redhat.com> 6 + License: The GNU Free Documentation License, Version 1.2 7 + Reviewers: Elias Oltmanns and Randy Dunlap 6 8 9 + Writen for: 2.6.26-rc8 linux-2.6-tip.git tip/tracing/ftrace branch 7 10 8 11 Introduction 9 12 ------------ ··· 49 46 that is configured. 50 47 51 48 available_tracers : This holds the different types of tracers that 52 - has been compiled into the kernel. The tracers 49 + have been compiled into the kernel. The tracers 53 50 listed here can be configured by echoing in their 54 51 name into current_tracer. 55 52 ··· 93 90 trace_entries : This sets or displays the number of trace 94 91 entries each CPU buffer can hold. The tracer buffers 95 92 are the same size for each CPU, so care must be 96 - taken when modifying the trace_entries. The number 97 - of actually entries will be the number given 98 - times the number of possible CPUS. The buffers 99 - are saved as individual pages, and the actual entries 100 - will always be rounded up to entries per page. 93 + taken when modifying the trace_entries. The trace 94 + buffers are allocated in pages (blocks of memory that 95 + the kernel uses for allocation, usually 4 KB in size). 96 + Since each entry is smaller than a page, if the last 97 + allocated page has room for more entries than were 98 + requested, the rest of the page is used to allocate 99 + entries. 101 100 102 101 This can only be updated when the current_tracer 103 102 is set to "none". ··· 119 114 in performance. This also has a side effect of 120 115 enabling or disabling specific functions to be 121 116 traced. Echoing in names of functions into this 122 - file will limit the trace to only those files. 117 + file will limit the trace to only these functions. 123 118 124 119 set_ftrace_notrace: This has the opposite effect that 125 120 set_ftrace_filter has. Any function that is added 126 121 here will not be traced. If a function exists 127 - in both set_ftrace_filter and set_ftrace_notrace 128 - the function will _not_ bet traced. 122 + in both set_ftrace_filter and set_ftrace_notrace, 123 + the function will _not_ be traced. 129 124 130 125 available_filter_functions : When a function is encountered the first 131 126 time by the dynamic tracer, it is recorded and ··· 143 138 144 139 ftrace - function tracer that uses mcount to trace all functions. 145 140 It is possible to filter out which functions that are 146 - traced when dynamic ftrace is configured in. 141 + to be traced when dynamic ftrace is configured in. 147 142 148 143 sched_switch - traces the context switches between tasks. 149 144 ··· 302 297 303 298 The above is mostly meaningful for kernel developers. 304 299 305 - time: This differs from the trace output where as the trace output 306 - contained a absolute timestamp. This timestamp is relative 307 - to the start of the first entry in the the trace. 300 + time: This differs from the trace file output. The trace file output 301 + included an absolute timestamp. The timestamp used by the 302 + latency_trace file is relative to the start of the trace. 308 303 309 304 delay: This is just to help catch your eye a bit better. And 310 305 needs to be fixed to be only relative to the same CPU. 311 - The marks is determined by the difference between this 306 + The marks are determined by the difference between this 312 307 current trace and the next trace. 313 308 '!' - greater than preempt_mark_thresh (default 100) 314 309 '+' - greater than 1 microsecond ··· 327 322 print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ 328 323 noblock nostacktrace nosched-tree 329 324 330 - To disable one of the options, echo in the option appended with "no". 325 + To disable one of the options, echo in the option prepended with "no". 331 326 332 327 echo noprint-parent > /debug/tracing/iter_ctrl 333 328 334 329 To enable an option, leave off the "no". 335 330 336 - echo sym-offest > /debug/tracing/iter_ctrl 331 + echo sym-offset > /debug/tracing/iter_ctrl 337 332 338 333 Here are the available options: 339 334 ··· 349 344 350 345 sym-offset - Display not only the function name, but also the offset 351 346 in the function. For example, instead of seeing just 352 - "ktime_get" you will see "ktime_get+0xb/0x20" 347 + "ktime_get", you will see "ktime_get+0xb/0x20". 353 348 354 349 sym-offset: 355 350 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 ··· 369 364 user applications that can translate the raw numbers better than 370 365 having it done in the kernel. 371 366 372 - hex - similar to raw, but the numbers will be in a hexadecimal format. 367 + hex - Similar to raw, but the numbers will be in a hexadecimal format. 373 368 374 369 bin - This will print out the formats in raw binary. 375 370 ··· 386 381 ------------ 387 382 388 383 This tracer simply records schedule switches. Here's an example 389 - on how to implement it. 384 + of how to use it. 390 385 391 386 # echo sched_switch > /debug/tracing/current_tracer 392 387 # echo 1 > /debug/tracing/tracing_enabled ··· 475 470 kernel know of a new mouse event. The result is a latency with the 476 471 reaction time. 477 472 478 - The irqsoff tracer tracks the time interrupts are disabled and when 473 + The irqsoff tracer tracks the time interrupts are disabled to the time 479 474 they are re-enabled. When a new maximum latency is hit, it saves off 480 475 the trace so that it may be retrieved at a later time. Every time a 481 476 new maximum in reached, the old saved trace is discarded and the new ··· 524 519 because the clock must have incremented between the time of recording 525 520 the max latency and recording the function that had that latency. 526 521 527 - Note the above had ftrace_enabled not set. If we set the ftrace_enabled 522 + Note the above had ftrace_enabled not set. If we set the ftrace_enabled, 528 523 we get a much larger output: 529 524 530 525 # tracer: irqsoff ··· 575 570 576 571 577 572 Here we traced a 50 microsecond latency. But we also see all the 578 - functions that were called during that time. Note that enabling 579 - function tracing we endure an added overhead. This overhead may 580 - extend the latency times. But never the less, this trace has provided 581 - some very helpful debugging. 573 + functions that were called during that time. Note that by enabling 574 + function tracing, we endure an added overhead. This overhead may 575 + extend the latency times. But nevertheless, this trace has provided 576 + some very helpful debugging information. 582 577 583 578 584 579 preemptoff 585 580 ---------- 586 581 587 - When preemption is disabled we may be able to receive interrupts but 588 - the task can not be preempted and a higher priority task must wait 582 + When preemption is disabled, we may be able to receive interrupts but 583 + the task cannot be preempted and a higher priority task must wait 589 584 for preemption to be enabled again before it can preempt a lower 590 585 priority task. 591 586 592 - The preemptoff tracer traces the places that disables preemption. 587 + The preemptoff tracer traces the places that disable preemption. 593 588 Like the irqsoff, it records the maximum latency that preemption 594 589 was disabled. The control of preemptoff is much like the irqsoff. 595 590 ··· 701 696 It may seem that we missed a preempt enabled. What really happened 702 697 is that the preempt count is held on the threads stack and we 703 698 switched to the softirq stack (4K stacks in effect). The code 704 - does not copy the preempt count, but because interrupts are disabled 699 + does not copy the preempt count, but because interrupts are disabled, 705 700 we don't need to worry about it. Having a tracer like this is good 706 701 to let people know what really happens inside the kernel. 707 702 ··· 737 732 738 733 Again, using this trace is much like the irqsoff and preemptoff tracers. 739 734 740 - # echo preemptoff > /debug/tracing/current_tracer 735 + # echo preemptirqsoff > /debug/tracing/current_tracer 741 736 # echo 0 > /debug/tracing/tracing_max_latency 742 737 # echo 1 > /debug/tracing/tracing_enabled 743 738 # ls -ltr ··· 867 862 the ls task. We see that the task had the "need_resched" bit set 868 863 with the 'N' in the trace. Interrupts are disabled in the spin_lock 869 864 and the trace started. We see that a schedule took place to run 870 - sshd. When the interrupts were enabled we took an interrupt. 871 - On return of the interrupt the softirq ran. We took another interrupt 872 - while running the softirq as we see with the capital 'H'. 865 + sshd. When the interrupts were enabled, we took an interrupt. 866 + On return from the interrupt handler, the softirq ran. We took another 867 + interrupt while running the softirq as we see with the capital 'H'. 873 868 874 869 875 870 wakeup ··· 881 876 I stress the point that this is about RT tasks. It is also important 882 877 to know the scheduling latency of non-RT tasks, but the average 883 878 schedule latency is better for non-RT tasks. Tools like 884 - LatencyTop is more appropriate for such measurements. 879 + LatencyTop are more appropriate for such measurements. 885 880 886 - Real-Time environments is interested in the worst case latency. 881 + Real-Time environments are interested in the worst case latency. 887 882 That is the longest latency it takes for something to happen, and 888 883 not the average. We can have a very fast scheduler that may only 889 884 have a large latency once in a while, but that would not work well ··· 894 889 of RT tasks. 895 890 896 891 Since this tracer only deals with RT tasks, we will run this slightly 897 - different than we did with the previous tracers. Instead of performing 898 - an 'ls' we will run 'sleep 1' under 'chrt' which changes the 892 + differently than we did with the previous tracers. Instead of performing 893 + an 'ls', we will run 'sleep 1' under 'chrt' which changes the 899 894 priority of the task. 900 895 901 896 # echo wakeup > /debug/tracing/current_tracer ··· 929 924 vim:ft=help 930 925 931 926 932 - Running this on an idle system we see that it only took 4 microseconds 927 + Running this on an idle system, we see that it only took 4 microseconds 933 928 to perform the task switch. Note, since the trace marker in the 934 - schedule is before the actual "switch" we stop the tracing when 929 + schedule is before the actual "switch", we stop the tracing when 935 930 the recorded task is about to schedule in. This may change if 936 931 we add a new marker at the end of the scheduler. 937 932 ··· 997 992 998 993 The interrupt went off while running ksoftirqd. This task runs at 999 994 SCHED_OTHER. Why didn't we see the 'N' set early? This may be 1000 - a harmless bug with x86_32 and 4K stacks. The need_reched() function 1001 - that tests if we need to reschedule looks on the actual stack. 1002 - Where as the setting of the NEED_RESCHED bit happens on the 1003 - task's stack. But because we are in a hard interrupt, the test 1004 - is with the interrupts stack which has that to be false. We don't 1005 - see the 'N' until we switch back to the task's stack. 995 + a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks 996 + configured, the interrupt and softirq runs with their own stack. 997 + Some information is held on the top of the task's stack (need_resched 998 + and preempt_count are both stored there). The setting of the NEED_RESCHED 999 + bit is done directly to the task's stack, but the reading of the 1000 + NEED_RESCHED is done by looking at the current stack, which in this case 1001 + is the stack for the hard interrupt. This hides the fact that NEED_RESCHED 1002 + has been set. We don't see the 'N' until we switch back to the task's 1003 + assigned stack. 1006 1004 1007 1005 ftrace 1008 1006 ------ ··· 1075 1067 every kernel function, produced by the -pg switch in gcc), starts 1076 1068 of pointing to a simple return. 1077 1069 1078 - When dynamic ftrace is initialized, it calls kstop_machine to make it 1079 - act like a uniprocessor so that it can freely modify code without 1080 - worrying about other processors executing that same code. At 1081 - initialization, the mcount calls are change to call a "record_ip" 1070 + When dynamic ftrace is initialized, it calls kstop_machine to make 1071 + the machine act like a uniprocessor so that it can freely modify code 1072 + without worrying about other processors executing that same code. At 1073 + initialization, the mcount calls are changed to call a "record_ip" 1082 1074 function. After this, the first time a kernel function is called, 1083 1075 it has the calling address saved in a hash table. 1084 1076 ··· 1093 1085 want to trace and which ones we want the mcount calls to remain as 1094 1086 nops. 1095 1087 1096 - Two files that contain to the enabling and disabling of recorded 1097 - functions are: 1088 + Two files are used, one for enabling and one for disabling the tracing 1089 + of recorded functions. They are: 1098 1090 1099 1091 set_ftrace_filter 1100 1092 ··· 1102 1094 1103 1095 set_ftrace_notrace 1104 1096 1105 - A list of available functions that you can add to this files is listed 1097 + A list of available functions that you can add to these files is listed 1106 1098 in: 1107 1099 1108 1100 available_filter_functions ··· 1141 1133 1142 1134 1143 1135 Perhaps this isn't enough. The filters also allow simple wild cards. 1144 - Only the following is currently available 1136 + Only the following are currently available 1145 1137 1146 - <match>* - will match functions that begins with <match> 1138 + <match>* - will match functions that begin with <match> 1147 1139 *<match> - will match functions that end with <match> 1148 1140 *<match>* - will match functions that have <match> in it 1149 1141 ··· 1195 1187 To rewrite the filters, use '>' 1196 1188 To append to the filters, use '>>' 1197 1189 1198 - To clear out a filter so that all functions will be recorded again. 1190 + To clear out a filter so that all functions will be recorded again: 1199 1191 1200 1192 # echo > /debug/tracing/set_ftrace_filter 1201 1193 # cat /debug/tracing/set_ftrace_filter ··· 1254 1246 1255 1247 As mentioned above, when dynamic ftrace is configured in, a kernel 1256 1248 thread wakes up once a second and checks to see if there are mcount 1257 - calls that need to be converted into nops. If there is not, then 1258 - it simply goes back to sleep. But if there is, it will call 1249 + calls that need to be converted into nops. If there are not any, then 1250 + it simply goes back to sleep. But if there are some, it will call 1259 1251 kstop_machine to convert the calls to nops. 1260 1252 1261 1253 There may be a case that you do not want this added latency. ··· 1270 1262 to calling mcount. Without this kernel thread, that overhead will 1271 1263 exist. 1272 1264 1273 - Any write to the ftraced_enabled file will cause the kstop_machine 1274 - to run if there are recorded calls to mcount. This means that a 1265 + If there are recorded calls to mcount, any write to the ftraced_enabled 1266 + file will cause the kstop_machine to run. This means that a 1275 1267 user can manually perform the updates when they want to by simply 1276 1268 echoing a '0' into the ftraced_enabled file. 1277 1269 ··· 1323 1315 1324 1316 Having too much or not enough data can be troublesome in diagnosing 1325 1317 some issue in the kernel. The file trace_entries is used to modify 1326 - the size of the internal trace buffers. The numbers listed 1318 + the size of the internal trace buffers. The number listed 1327 1319 is the number of entries that can be recorded per CPU. To know 1328 1320 the full size, multiply the number of possible CPUS with the 1329 1321 number of entries. ··· 1331 1323 # cat /debug/tracing/trace_entries 1332 1324 65620 1333 1325 1334 - Note, to modify this you must have tracing fulling disabled. To do that, 1326 + Note, to modify this, you must have tracing completely disabled. To do that, 1335 1327 echo "none" into the current_tracer. 1336 1328 1337 1329 # echo none > /debug/tracing/current_tracer ··· 1352 1344 This shows us that 85 entries can fit on a single page. 1353 1345 1354 1346 The number of pages that will be allocated is a percentage of available 1355 - memory. Allocating too much will produces an error. 1347 + memory. Allocating too much will produce an error. 1356 1348 1357 1349 # echo 1000000000000 > /debug/tracing/trace_entries 1358 1350 -bash: echo: write error: Cannot allocate memory