Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

powerpc/perf: Add per-event excludes on Power8

Power8 has a new register (MMCR2), which contains individual freeze bits
for each counter. This is an improvement on previous chips as it means
we can have multiple events on the PMU at the same time with different
exclude_{user,kernel,hv} settings. Previously we had to ensure all
events on the PMU had the same exclude settings.
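The layout of those freeze bits is regular: on Power8 each PMC owns a 9-bit field in MMCR2, counted down from bit 63, holding separate freeze-in-supervisor (FCS), freeze-in-problem-state (FCP) and freeze-in-hypervisor (FCH) bits. A minimal userspace C sketch of how a per-event MMCR2 image is built (mirroring the MMCR2_FCS/FCP/FCH macros and the exclude_* handling this patch adds to power8-pmu.c; `hv_mode` stands in for `cpu_has_feature(CPU_FTR_HVMODE)`):

```c
#include <assert.h>
#include <stdint.h>

/* Per-counter freeze bits in MMCR2 on Power8: each PMC owns a 9-bit
 * field counted down from bit 63. */
static uint64_t mmcr2_fcs(int pmc) { return 1ull << (63 - ((pmc - 1) * 9)); } /* freeze in supervisor */
static uint64_t mmcr2_fcp(int pmc) { return 1ull << (62 - ((pmc - 1) * 9)); } /* freeze in problem state */
static uint64_t mmcr2_fch(int pmc) { return 1ull << (57 - ((pmc - 1) * 9)); } /* freeze in hypervisor */

/* Build the MMCR2 contribution for one event on a given PMC, following
 * the exclude_* handling added to power8_compute_mmcr(). When the kernel
 * runs in hypervisor mode, "kernel" means hypervisor state, so
 * exclude_kernel maps to FCH rather than FCS. */
static uint64_t event_mmcr2(int pmc, int excl_user, int excl_kernel,
			    int excl_hv, int hv_mode)
{
	uint64_t mmcr2 = 0;

	if (excl_user)
		mmcr2 |= mmcr2_fcp(pmc);
	if (excl_hv)
		mmcr2 |= mmcr2_fch(pmc);
	if (excl_kernel)
		mmcr2 |= hv_mode ? mmcr2_fch(pmc) : mmcr2_fcs(pmc);

	return mmcr2;
}
```

Because the fields are per-PMC, images for events on different counters can simply be OR-ed together, which is what lets each event carry its own excludes.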

The core of the patch is fairly simple. We use the 207S feature flag to
indicate that the PMU backend supports per-event excludes; if it's set
we skip the generic logic that enforces the equality of excludes between
events. We also use that flag to skip setting the freeze bits in MMCR0;
the PMU backend is expected to have set them in MMCR2 instead.
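As a sketch of that generic logic, here is a simplified, standalone version of the equality check (modelled on check_excludes() in core-book3s.c; the struct and the boolean flag parameter are illustrative stand-ins for the kernel's event and `ppmu->flags & PPMU_ARCH_207S` test):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the exclude_* bits of perf_event_attr. */
struct ev {
	bool exclude_user, exclude_kernel, exclude_hv;
};

/* Simplified check_excludes(): on a PMU with per-event excludes
 * (PPMU_ARCH_207S) mixed settings are fine and we return early;
 * otherwise every event must carry the same exclude bits as the first. */
static int check_excludes(const struct ev *evs, int n, bool arch_207s)
{
	if (arch_207s)
		return 0;

	for (int i = 1; i < n; i++)
		if (evs[i].exclude_user   != evs[0].exclude_user ||
		    evs[i].exclude_kernel != evs[0].exclude_kernel ||
		    evs[i].exclude_hv     != evs[0].exclude_hv)
			return -1;

	return 0;
}
```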

The complication arises with EBB. The FCxP bits in MMCR2 are readable
and writable by a task using EBB, which means such a task can see that
we are using MMCR2 for freezing, whereas the old logic using MMCR0 was
not user visible.

The task cannot see or affect exclude_kernel & exclude_hv, so we only
need to consider exclude_user.

The table below summarises the behaviour both before and after this
commit is applied:

                           exclude_user
                           true    false
        ---------------------------------
        | User visible |    N        N
 Before | Can freeze   |    Y        Y
        | Can unfreeze |    N        Y
        ---------------------------------
        | User visible |    Y        Y
  After | Can freeze   |    Y        Y
        | Can unfreeze |    Y/N      Y
        ---------------------------------

So firstly I assert that the simple visibility of the exclude_user
setting in MMCR2 is a non-issue. The event belongs to the task, and
was most likely created by the task. So the exclude_user setting is not
privileged information in any way.

Secondly, the behaviour in the exclude_user = false case is unchanged.
This is important as it is the case that is actually useful, ie. the
event is created with no exclude setting and the task uses MMCR2 to
implement exclusion manually.

For exclude_user = true there is no meaningful change to freezing the
event. Previously the task could use MMCR2 to freeze the event, though
it was already frozen with MMCR0. With the new code the task can use
MMCR2 to freeze the event, though it was already frozen with MMCR2.

The only real change is when exclude_user = true and the task tries to
use MMCR2 to unfreeze the event. Previously this had no effect, because
the event was already frozen in MMCR0. With the new code the task can
unfreeze the event in MMCR2, but at some indeterminate time in the
future the kernel will overwrite its setting and refreeze the event.
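That behaviour falls out of how ebb_switch_in() writes the register: the kernel's exclude-derived MMCR2 bits are OR-ed with the task's saved MMCR2 image, so the task can add freeze bits but can never clear one the kernel set. A toy model of the merge:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the MMCR2 write in ebb_switch_in(): the value that reaches
 * the register is kernel bits OR user bits. The user side can set bits
 * (freeze counters) but cannot clear a kernel-owned freeze bit, and any
 * user clear of such a bit is undone at the next switch-in. */
static uint64_t merged_mmcr2(uint64_t kernel_mmcr2, uint64_t user_mmcr2)
{
	return kernel_mmcr2 | user_mmcr2;
}
```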

Therefore my final assertion is that any task using exclude_user = true
and also fiddling with MMCR2 was deeply confused before this change, and
remains so after it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Authored by Michael Ellerman, committed by Benjamin Herrenschmidt
9de5cb0f 8abd818f

+67 -24
+45 -22
arch/powerpc/perf/core-book3s.c
···
 	struct perf_event *event[MAX_HWEVENTS];
 	u64 events[MAX_HWEVENTS];
 	unsigned int flags[MAX_HWEVENTS];
-	unsigned long mmcr[3];
+	/*
+	 * The order of the MMCR array is:
+	 *  - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2
+	 *  - 32-bit, MMCR0, MMCR1, MMCR2
+	 */
+	unsigned long mmcr[4];
 	struct perf_event *limited_counter[MAX_LIMITED_HWCOUNTERS];
 	u8 limited_hwidx[MAX_LIMITED_HWCOUNTERS];
 	u64 alternatives[MAX_HWEVENTS][MAX_EVENT_ALTERNATIVES];
···
 static int ebb_event_check(struct perf_event *event) { return 0; }
 static void ebb_event_add(struct perf_event *event) { }
 static void ebb_switch_out(unsigned long mmcr0) { }
-static unsigned long ebb_switch_in(bool ebb, unsigned long mmcr0)
+static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
 {
-	return mmcr0;
+	return cpuhw->mmcr[0];
 }
 
 static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
···
 	current->thread.mmcr2 = mfspr(SPRN_MMCR2) & MMCR2_USER_MASK;
 }
 
-static unsigned long ebb_switch_in(bool ebb, unsigned long mmcr0)
+static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
 {
+	unsigned long mmcr0 = cpuhw->mmcr[0];
+
 	if (!ebb)
 		goto out;
···
 	mtspr(SPRN_SIAR, current->thread.siar);
 	mtspr(SPRN_SIER, current->thread.sier);
 	mtspr(SPRN_SDAR, current->thread.sdar);
-	mtspr(SPRN_MMCR2, current->thread.mmcr2);
+
+	/*
+	 * Merge the kernel & user values of MMCR2. The semantics we implement
+	 * are that the user MMCR2 can set bits, ie. cause counters to freeze,
+	 * but not clear bits. If a task wants to be able to clear bits, ie.
+	 * unfreeze counters, it should not set exclude_xxx in its events and
+	 * instead manage the MMCR2 entirely by itself.
+	 */
+	mtspr(SPRN_MMCR2, cpuhw->mmcr[3] | current->thread.mmcr2);
 out:
 	return mmcr0;
 }
···
 	int i, n, first;
 	struct perf_event *event;
 
+	/*
+	 * If the PMU we're on supports per event exclude settings then we
+	 * don't need to do any of this logic. NB. This assumes no PMU has both
+	 * per event exclude and limited PMCs.
+	 */
+	if (ppmu->flags & PPMU_ARCH_207S)
+		return 0;
+
 	n = n_prev + n_new;
 	if (n <= 1)
 		return 0;
···
 		goto out;
 	}
 
-	/*
-	 * Add in MMCR0 freeze bits corresponding to the
-	 * attr.exclude_* bits for the first event.
-	 * We have already checked that all events have the
-	 * same values for these bits as the first event.
-	 */
-	event = cpuhw->event[0];
-	if (event->attr.exclude_user)
-		cpuhw->mmcr[0] |= MMCR0_FCP;
-	if (event->attr.exclude_kernel)
-		cpuhw->mmcr[0] |= freeze_events_kernel;
-	if (event->attr.exclude_hv)
-		cpuhw->mmcr[0] |= MMCR0_FCHV;
+	if (!(ppmu->flags & PPMU_ARCH_207S)) {
+		/*
+		 * Add in MMCR0 freeze bits corresponding to the attr.exclude_*
+		 * bits for the first event. We have already checked that all
+		 * events have the same value for these bits as the first event.
+		 */
+		event = cpuhw->event[0];
+		if (event->attr.exclude_user)
+			cpuhw->mmcr[0] |= MMCR0_FCP;
+		if (event->attr.exclude_kernel)
+			cpuhw->mmcr[0] |= freeze_events_kernel;
+		if (event->attr.exclude_hv)
+			cpuhw->mmcr[0] |= MMCR0_FCHV;
+	}
 
 	/*
 	 * Write the new configuration to MMCR* with the freeze
···
 	mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
 	mtspr(SPRN_MMCR0, (cpuhw->mmcr[0] & ~(MMCR0_PMC1CE | MMCR0_PMCjCE))
 				| MMCR0_FC);
+	if (ppmu->flags & PPMU_ARCH_207S)
+		mtspr(SPRN_MMCR2, cpuhw->mmcr[3]);
 
 	/*
 	 * Read off any pre-existing events that need to move
···
 out_enable:
 	pmao_restore_workaround(ebb);
 
-	if (ppmu->flags & PPMU_ARCH_207S)
-		mtspr(SPRN_MMCR2, 0);
-
-	mmcr0 = ebb_switch_in(ebb, cpuhw->mmcr[0]);
+	mmcr0 = ebb_switch_in(ebb, cpuhw);
 
 	mb();
 	if (cpuhw->bhrb_users)
+22 -2
arch/powerpc/perf/power8-pmu.c
···
 #include <linux/kernel.h>
 #include <linux/perf_event.h>
 #include <asm/firmware.h>
+#include <asm/cputable.h>
 
 
 /*
···
 #define MMCRA_SDAR_MODE_TLB	(1ull << 42)
 #define MMCRA_IFM_SHIFT		30
 
+/* Bits in MMCR2 for POWER8 */
+#define MMCR2_FCS(pmc)		(1ull << (63 - (((pmc) - 1) * 9)))
+#define MMCR2_FCP(pmc)		(1ull << (62 - (((pmc) - 1) * 9)))
+#define MMCR2_FCH(pmc)		(1ull << (57 - (((pmc) - 1) * 9)))
+
 
 static inline bool event_is_fab_match(u64 event)
 {
···
 			       unsigned int hwc[], unsigned long mmcr[],
 			       struct perf_event *pevents[])
 {
-	unsigned long mmcra, mmcr1, unit, combine, psel, cache, val;
+	unsigned long mmcra, mmcr1, mmcr2, unit, combine, psel, cache, val;
 	unsigned int pmc, pmc_inuse;
 	int i;
···
 	/* In continous sampling mode, update SDAR on TLB miss */
 	mmcra = MMCRA_SDAR_MODE_TLB;
-	mmcr1 = 0;
+	mmcr1 = mmcr2 = 0;
 
 	/* Second pass: assign PMCs, set all MMCR1 fields */
 	for (i = 0; i < n_ev; ++i) {
···
 			mmcra |= val << MMCRA_IFM_SHIFT;
 		}
 
+		if (pevents[i]->attr.exclude_user)
+			mmcr2 |= MMCR2_FCP(pmc);
+
+		if (pevents[i]->attr.exclude_hv)
+			mmcr2 |= MMCR2_FCH(pmc);
+
+		if (pevents[i]->attr.exclude_kernel) {
+			if (cpu_has_feature(CPU_FTR_HVMODE))
+				mmcr2 |= MMCR2_FCH(pmc);
+			else
+				mmcr2 |= MMCR2_FCS(pmc);
+		}
+
 		hwc[i] = pmc - 1;
 	}
···
 	mmcr[1] = mmcr1;
 	mmcr[2] = mmcra;
+	mmcr[3] = mmcr2;
 
 	return 0;
 }