Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

perf stat: Show percore counts in per CPU output

perf stat already supports the "percore" event modifier, which sums the
event counts for all hardware threads in a core and shows the counts per core.

For example,

# perf stat -e cpu/event=cpu-cycles,percore/ -a -A -- sleep 1

Performance counter stats for 'system wide':

S0-D0-C0 395,072 cpu/event=cpu-cycles,percore/
S0-D0-C1 851,248 cpu/event=cpu-cycles,percore/
S0-D0-C2 954,226 cpu/event=cpu-cycles,percore/
S0-D0-C3 1,233,659 cpu/event=cpu-cycles,percore/

This patch provides a new option, "--percore-show-thread". Used together
with the "percore" event modifier, it still sums the event counts for all
hardware threads in a core, but shows that sum for each hardware thread.

This is essentially a replacement for the any bit (which is gone in
Icelake). Per core counts are useful for some formulas, e.g. CoreIPC.
The original percore version was inconvenient to post process. This
variant matches the output of the any bit.
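
As a rough illustration of why a per-thread row carrying the core-wide count
is handy: a CoreIPC-style ratio can then be computed from a single row. The
sketch below assumes CoreIPC is a thread's retired instructions divided by the
cycles summed over its core; the helper name core_ipc is hypothetical and is
not taken from perf's metric code.

```c
#include <stdint.h>

/*
 * Hypothetical helper, assuming CoreIPC = instructions retired by one
 * hardware thread / cycles summed over all threads of its core.
 * Returns 0.0 when the core cycle count is zero, to avoid dividing by zero.
 */
double core_ipc(uint64_t thread_instructions, uint64_t core_cycles)
{
	if (core_cycles == 0)
		return 0.0;
	return (double)thread_instructions / (double)core_cycles;
}
```

With "--percore-show-thread", the core_cycles argument can be read from the
same CPU row as the thread's own counters, with no separate per-core lookup.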

With this patch, for example,

# perf stat -e cpu/event=cpu-cycles,percore/ -a -A --percore-show-thread -- sleep 1

Performance counter stats for 'system wide':

CPU0 2,453,061 cpu/event=cpu-cycles,percore/
CPU1 1,823,921 cpu/event=cpu-cycles,percore/
CPU2 1,383,166 cpu/event=cpu-cycles,percore/
CPU3 1,102,652 cpu/event=cpu-cycles,percore/
CPU4 2,453,061 cpu/event=cpu-cycles,percore/
CPU5 1,823,921 cpu/event=cpu-cycles,percore/
CPU6 1,383,166 cpu/event=cpu-cycles,percore/
CPU7 1,102,652 cpu/event=cpu-cycles,percore/

We can see counts are duplicated in CPU pairs (CPU0/CPU4, CPU1/CPU5,
CPU2/CPU6, CPU3/CPU7).
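
The duplication follows from how the option works: counts are summed per core,
and every hardware thread then reports its core's sum. A minimal sketch of that
aggregation, written outside of perf; the function name percore_show_thread,
the MAX_CORES limit and the cpu-to-core table are all hypothetical and only
mirror the CPU0/CPU4 style pairing in the output above.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CORES 64	/* hypothetical limit for this sketch */

/*
 * For each CPU i, store in out[i] the sum of counts[] over all CPUs that
 * share core_of[i]. Returns 0 on success, or -1 if a core id is outside
 * [0, MAX_CORES).
 */
int percore_show_thread(const int *core_of, const uint64_t *counts,
			size_t nr_cpus, uint64_t *out)
{
	uint64_t core_sum[MAX_CORES] = { 0 };
	size_t i;

	/* First pass: accumulate one sum per core. */
	for (i = 0; i < nr_cpus; i++) {
		if (core_of[i] < 0 || core_of[i] >= MAX_CORES)
			return -1;
		core_sum[core_of[i]] += counts[i];
	}

	/* Second pass: every hardware thread reports its core's sum. */
	for (i = 0; i < nr_cpus; i++)
		out[i] = core_sum[core_of[i]];

	return 0;
}
```

With core_of = {0, 1, 2, 3, 0, 1, 2, 3}, entries 0 and 4 of out receive the
same value, which is the pairing seen in the output above.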

The interval mode also works. For example,

# perf stat -e cpu/event=cpu-cycles,percore/ -a -A --percore-show-thread -I 1000
# time CPU counts unit events
1.000425421 CPU0 925,032 cpu/event=cpu-cycles,percore/
1.000425421 CPU1 430,202 cpu/event=cpu-cycles,percore/
1.000425421 CPU2 436,843 cpu/event=cpu-cycles,percore/
1.000425421 CPU3 1,192,504 cpu/event=cpu-cycles,percore/
1.000425421 CPU4 925,032 cpu/event=cpu-cycles,percore/
1.000425421 CPU5 430,202 cpu/event=cpu-cycles,percore/
1.000425421 CPU6 436,843 cpu/event=cpu-cycles,percore/
1.000425421 CPU7 1,192,504 cpu/event=cpu-cycles,percore/

If we offline CPU5, the result is:

# perf stat -e cpu/event=cpu-cycles,percore/ -a -A --percore-show-thread -- sleep 1

Performance counter stats for 'system wide':

CPU0 2,752,148 cpu/event=cpu-cycles,percore/
CPU1 1,009,312 cpu/event=cpu-cycles,percore/
CPU2 2,784,072 cpu/event=cpu-cycles,percore/
CPU3 2,427,922 cpu/event=cpu-cycles,percore/
CPU4 2,752,148 cpu/event=cpu-cycles,percore/
CPU6 2,784,072 cpu/event=cpu-cycles,percore/
CPU7 2,427,922 cpu/event=cpu-cycles,percore/

1.001416041 seconds time elapsed

v4:
---
Ravi Bangoria reported an issue in v3: once a CPU is offlined, the
output is not correct. The fix is to use the cpu idx in
print_percore_thread rather than the cpu value.
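
The distinction only shows up once the CPU map has holes. A small sketch, not
perf code: with CPU5 offline the map of online CPUs is {0, 1, 2, 3, 4, 6, 7},
so map index 5 names CPU6 and CPU number 5 has no index at all. The array
online_cpus and the helpers cpu_at_idx and idx_of_cpu are hypothetical
stand-ins for a CPU map lookup.

```c
#include <stddef.h>

/* Online CPUs after offlining CPU5 (assumed layout for this sketch). */
const int online_cpus[] = { 0, 1, 2, 3, 4, 6, 7 };
const size_t nr_online = sizeof(online_cpus) / sizeof(online_cpus[0]);

/* Return the CPU number stored at map index idx, or -1 if out of range. */
int cpu_at_idx(size_t idx)
{
	return idx < nr_online ? online_cpus[idx] : -1;
}

/* Return the map index of CPU number cpu, or -1 if it is not in the map. */
int idx_of_cpu(int cpu)
{
	size_t i;

	for (i = 0; i < nr_online; i++)
		if (online_cpus[i] == cpu)
			return (int)i;
	return -1;
}
```

Here cpu_at_idx(5) yields 6 while idx_of_cpu(5) yields -1, so code that passes
a CPU number where an index is expected reads the wrong slot or runs off the
end of the map.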

v3:
---
1. Fix the interval mode output error
2. Use cpu value (not cpu index) in config->aggr_get_id().
3. Refine the code according to Jiri's comments.

v2:
---
Add the explanation in change log. This is essentially a replacement
for the any bit. No code change.

Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20200214080452.26402-1-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Authored by Jin Yao and committed by Arnaldo Carvalho de Melo
1af62ce6 7982a898

+42 -5
+9
tools/perf/Documentation/perf-stat.txt
···
 --all-user::
 Configure all used events to run in user space.
 
+--percore-show-thread::
+The event modifier "percore" has supported to sum up the event counts
+for all hardware threads in a core and show the counts per core.
+
+This option with event modifier "percore" enabled also sums up the event
+counts for all hardware threads in a core but show the sum counts per
+hardware thread. This is essentially a replacement for the any bit and
+convenient for post processing.
+
 EXAMPLES
 --------
+4
tools/perf/builtin-stat.c
···
 	OPT_BOOLEAN_FLAG(0, "all-user", &stat_config.all_user,
 			 "Configure all used events to run in user space.",
 			 PARSE_OPT_EXCLUSIVE),
+	OPT_BOOLEAN(0, "percore-show-thread", &stat_config.percore_show_thread,
+		    "Use with 'percore' event qualifier to show the event "
+		    "counts of one hardware thread by sum up total hardware "
+		    "threads of same physical core"),
 	OPT_END()
 };
+28 -5
tools/perf/util/stat-display.c
···
 					config->csv_sep);
 		break;
 	case AGGR_NONE:
-		if (evsel->percore) {
+		if (evsel->percore && !config->percore_show_thread) {
 			fprintf(config->output, "S%d-D%d-C%*d%s",
 				cpu_map__id_to_socket(id),
 				cpu_map__id_to_die(id),
···
 static void print_counter_aggrdata(struct perf_stat_config *config,
 				   struct evsel *counter, int s,
 				   char *prefix, bool metric_only,
-				   bool *first)
+				   bool *first, int cpu)
 {
 	struct aggr_data ad;
 	FILE *output = config->output;
···
 		fprintf(output, "%s", prefix);
 
 	uval = val * counter->scale;
-	printout(config, id, nr, counter, uval, prefix,
+	printout(config, cpu != -1 ? cpu : id, nr, counter, uval, prefix,
 		 run, ena, 1.0, &rt_stat);
 	if (!metric_only)
 		fputc('\n', output);
···
 		evlist__for_each_entry(evlist, counter) {
 			print_counter_aggrdata(config, counter, s,
 					       prefix, metric_only,
-					       &first);
+					       &first, -1);
 		}
 		if (metric_only)
 			fputc('\n', output);
···
 			"the same PMU. Try reorganizing the group.\n");
 }
 
+static void print_percore_thread(struct perf_stat_config *config,
+				 struct evsel *counter, char *prefix)
+{
+	int s, s2, id;
+	bool first = true;
+
+	for (int i = 0; i < perf_evsel__nr_cpus(counter); i++) {
+		s2 = config->aggr_get_id(config, evsel__cpus(counter), i);
+		for (s = 0; s < config->aggr_map->nr; s++) {
+			id = config->aggr_map->map[s];
+			if (s2 == id)
+				break;
+		}
+
+		print_counter_aggrdata(config, counter, s,
+				       prefix, false,
+				       &first, i);
+	}
+}
+
 static void print_percore(struct perf_stat_config *config,
 			  struct evsel *counter, char *prefix)
 {
···
 	if (!(config->aggr_map || config->aggr_get_id))
 		return;
 
+	if (config->percore_show_thread)
+		return print_percore_thread(config, counter, prefix);
+
 	for (s = 0; s < config->aggr_map->nr; s++) {
 		if (prefix && metric_only)
 			fprintf(output, "%s", prefix);
 
 		print_counter_aggrdata(config, counter, s,
 				       prefix, metric_only,
-				       &first);
+				       &first, -1);
 	}
 
 	if (metric_only)
+1
tools/perf/util/stat.h
···
 	bool walltime_run_table;
 	bool all_kernel;
 	bool all_user;
+	bool percore_show_thread;
 	FILE *output;
 	unsigned int interval;
 	unsigned int timeout;