Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

perf doc: Add AMD IBS usage document

Add a perf man page document that describes how to exploit AMD IBS with
Linux perf. A brief intro to IBS and simple one-liner examples will help
new users get started. This is not meant to be an exhaustive IBS
guide. Users should refer to the latest AMD64 Architecture Programmer's
Manual for a detailed description of IBS.

Usage:

$ man perf-amd-ibs

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: ananth.narayan@amd.com
Cc: sandipan.das@amd.com
Cc: santosh.shukla@amd.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240620054104.815-1-ravi.bangoria@amd.com

Authored by Ravi Bangoria and committed by Namhyung Kim
b739759c 90d32e92

+191 -1
+189
tools/perf/Documentation/perf-amd-ibs.txt
@@ -0,0 +1,189 @@
+perf-amd-ibs(1)
+===============
+
+NAME
+----
+perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool
+
+SYNOPSIS
+--------
+[verse]
+'perf record' -e ibs_op//
+'perf record' -e ibs_fetch//
+
+DESCRIPTION
+-----------
+
+Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
+profiling support on AMD platforms. IBS has two independent components: IBS
+Op and IBS Fetch. IBS Op sampling provides information about instruction
+execution (micro-op execution to be precise) with details like d-cache
+hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
+behavior etc. IBS Fetch sampling provides information about instruction fetch
+with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
+per-smt-thread i.e. each SMT hardware thread contains standalone IBS units.
+
+Both, IBS Op and IBS Fetch, are exposed as PMUs by Linux and can be exploited
+using the Linux perf utility. The following files will be created at boot time
+if IBS is supported by the hardware and kernel.
+
+	/sys/bus/event_source/devices/ibs_op/
+	/sys/bus/event_source/devices/ibs_fetch/
+
+IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
+one event: fetch ops.
+
+IBS PMUs do not have user/kernel filtering capability and thus it requires
+CAP_SYS_ADMIN or CAP_PERFMON privilege.
+
+IBS VS. REGULAR CORE PMU
+------------------------
+
+IBS gives samples with precise IP, i.e. the IP recorded with IBS sample has
+no skid. Whereas the IP recorded by regular core PMU will have some skid
+(sample was generated at IP X but perf would record it at IP X+n). Hence,
+regular core PMU might not help for profiling with instruction level
+precision. Further, IBS provides additional information about the sample in
+question. On the other hand, regular core PMU has it's own advantages like
+plethora of events, counting mode (less interference), up to 6 parallel
+counters, event grouping support, filtering capabilities etc.
+
+Three regular core PMU events are internally forwarded to IBS Op PMU when
+precise_ip attribute is set:
+
+	-e cpu-cycles:p becomes -e ibs_op//
+	-e r076:p       becomes -e ibs_op//
+	-e r0C1:p       becomes -e ibs_op/cnt_ctl=1/
+
+EXAMPLES
+--------
+
+IBS Op PMU
+~~~~~~~~~~
+
+System-wide profile, cycles event, sampling period: 100000
+
+	# perf record -e ibs_op// -c 100000 -a
+
+Per-cpu profile (cpu10), cycles event, sampling period: 100000
+
+	# perf record -e ibs_op// -c 100000 -C 10
+
+Per-cpu profile (cpu10), cycles event, sampling freq: 1000
+
+	# perf record -e ibs_op// -F 1000 -C 10
+
+System-wide profile, uOps event, sampling period: 100000
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a
+
+Same command, but also capture IBS register raw dump along with perf sample:
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples
+
+System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)
+
+	# perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a
+
+Per process(upstream v6.2 onward), uOps event, sampling period: 100000
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234
+
+Per process(upstream v6.2 onward), uOps event, sampling period: 100000
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls
+
+To analyse recorded profile in aggregate mode
+
+	# perf report
+	/* Select a line and press 'a' to drill down at instruction level. */
+
+To go over each sample
+
+	# perf script
+
+Raw dump of IBS registers when profiled with --raw-samples
+
+	# perf report -D
+	/* Look for PERF_RECORD_SAMPLE */
+
+	Example register raw dump:
+
+	ibs_op_ctl:	000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
+		Val 1 CntCtl 0=cycles CurCnt 707
+	IbsOpRip:	ffffffff8204aea7
+	ibs_op_data:	0000010002550001 CompToRetCtr 1 TagToRetCtr 597
+		BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
+	ibs_op_data2:	0000000000000013 RmtNode 1 DataSrc 3=DRAM
+	ibs_op_data3:	0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
+		DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
+		DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
+		DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
+		DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
+		OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
+	IbsDCLinAd:	ff110008a5398920
+	IbsDCPhysAd:	00000008a5398920
+
+IBS applied in a real world usecase
+
+	~90% regression was observed in tbench with specific scheduler hint
+	which was counter intuitive. IBS profile of good and bad run captured
+	using perf helped in identifying exact cause of the problem:
+
+	https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com
+
+IBS Fetch PMU
+~~~~~~~~~~~~~
+
+Similar commands can be used with Fetch PMU as well.
+
+System-wide profile, fetch ops event, sampling period: 100000
+
+	# perf record -e ibs_fetch// -c 100000 -a
+
+System-wide profile, fetch ops event, sampling period: 100000, Random enable
+
+	# perf record -e ibs_fetch/rand_en=1/ -c 100000 -a
+
+	Random enable adds small degree of variability to sample period. This
+	helps in cases like long running loops where PMU is tagging the same
+	instruction over and over because of fixed sample period.
+
+etc.
+
+PERF MEM AND PERF C2C
+---------------------
+
+perf mem is a memory access profiler tool and perf c2c is a shared data
+cacheline analyser tool. Both of them internally uses IBS Op PMU on AMD.
+Below is a simple example of the perf mem tool.
+
+	# perf mem record -c 100000 -- make
+	# perf mem report
+
+A normal perf mem report output will provide detailed memory access profile.
+However, it can also be aggregated based on output fields. For example:
+
+	# perf mem report -F mem,sample,snoop
+	Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
+	Memory access                                 Samples  Snoop
+	N/A                                           1903343  N/A
+	L1 hit                                        1056754  N/A
+	L2 hit                                          75231  N/A
+	L3 hit                                           9496  HitM
+	L3 hit                                           2270  N/A
+	RAM hit                                          8710  N/A
+	Remote node, same socket RAM hit                 3241  N/A
+	Remote core, same node Any cache hit             1572  HitM
+	Remote core, same node Any cache hit              514  N/A
+	Remote node, same socket Any cache hit           1216  HitM
+	Remote node, same socket Any cache hit            350  N/A
+	Uncached hit                                       18  N/A
+
+Please refer to their man page for more detail.
+
+SEE ALSO
+--------
+
+linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
+linkperf:perf-mem[1], linkperf:perf-c2c[1]
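
The example register raw dump in the new man page shows perf decoding the raw ibs_op_ctl value 000002c30006186a into named fields. As a rough illustration of that decoding, the sketch below splits the same value by hand. The bit positions are an assumption, based on my reading of the kernel's `union ibs_op_ctl` layout, and should be checked against the AMD64 Architecture Programmer's Manual. The function name `decode_ibs_op_ctl` is hypothetical and not part of perf.

```python
def decode_ibs_op_ctl(value: int) -> dict:
    """Split a raw 64-bit ibs_op_ctl value into a few named fields.

    Bit positions are assumed, not taken from the commit:
      0-15  MaxCnt[19:4]    16 L3MissOnly   17 En   18 Val
      19    CntCtl          20-26 MaxCnt[26:20]     32-58 CurCnt
    """
    max_cnt_lo = value & 0xFFFF          # low part of the period, in units of 16
    max_cnt_ext = (value >> 20) & 0x7F   # upper 7 bits of the period
    return {
        "MaxCnt": ((max_cnt_ext << 16) | max_cnt_lo) << 4,
        "L3MissOnly": (value >> 16) & 1,
        "En": (value >> 17) & 1,
        "Val": (value >> 18) & 1,
        "CntCtl": (value >> 19) & 1,     # 0 = cycles, 1 = micro-ops
        "CurCnt": (value >> 32) & 0x7FFFFFF,
    }


# Raw value copied from the example dump in the man page.
fields = decode_ibs_op_ctl(0x000002C30006186A)
print(fields)
```

Under these assumptions the decoded fields agree with the dump: MaxCnt 100000, L3MissOnly 0, En 1, Val 1, CntCtl 0 and CurCnt 707.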
+2 -1
tools/perf/Documentation/perf.txt
@@ -82,7 +82,8 @@
 linkperf:perf-record[1], linkperf:perf-report[1],
 linkperf:perf-list[1]
 
-linkperf:perf-annotate[1],linkperf:perf-archive[1],linkperf:perf-arm-spe[1],
+linkperf:perf-amd-ibs[1], linkperf:perf-annotate[1],
+linkperf:perf-archive[1], linkperf:perf-arm-spe[1],
 linkperf:perf-bench[1], linkperf:perf-buildid-cache[1],
 linkperf:perf-buildid-list[1], linkperf:perf-c2c[1],
 linkperf:perf-config[1], linkperf:perf-data[1], linkperf:perf-diff[1],
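
The man page added by this commit states that the directories /sys/bus/event_source/devices/ibs_op/ and /sys/bus/event_source/devices/ibs_fetch/ exist when both the hardware and the kernel support IBS. A minimal sketch of a pre-flight check built on that statement might look like the following. The helper name `available_ibs_pmus` and its `base` parameter are hypothetical, introduced here only so the check can be pointed at a different directory.

```python
import os

# Directory where the kernel exposes PMUs, as named in the man page.
SYSFS_PMU_DIR = "/sys/bus/event_source/devices"


def available_ibs_pmus(base: str = SYSFS_PMU_DIR) -> list:
    """Return the IBS PMU names that have a directory under `base`."""
    return [
        name
        for name in ("ibs_op", "ibs_fetch")
        if os.path.isdir(os.path.join(base, name))
    ]


# An empty list suggests IBS is not available on this machine.
print(available_ibs_pmus())
```

A non-empty result only indicates that the PMU is registered. Recording with it still needs the CAP_SYS_ADMIN or CAP_PERFMON privilege mentioned in the man page.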