Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'perf-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
"Core:
- Move perf_event sysctls into kernel/events/ (Joel Granados)
- Use POLLHUP for pinned events in error (Namhyung Kim)
- Avoid the read if the count is already updated (Peter Zijlstra)
- Allow the EPOLLRDNORM flag for poll (Tao Chen)
- locking/percpu-rwsem: Add guard support [ NOTE: this got
(mis-)merged into the perf tree due to related work ] (Peter
Zijlstra)

perf_pmu_unregister() related improvements: (Peter Zijlstra)
- Simplify the perf_event_alloc() error path
- Simplify the perf_pmu_register() error path
- Simplify perf_pmu_register()
- Simplify perf_init_event()
- Simplify perf_event_alloc()
- Merge struct pmu::pmu_disable_count into struct
perf_cpu_pmu_context::pmu_disable_count
- Add this_cpc() helper
- Introduce perf_free_addr_filters()
- Robustify perf_event_free_bpf_prog()
- Simplify the perf_mmap() control flow
- Further simplify perf_mmap()
- Remove retry loop from perf_mmap()
- Lift event->mmap_mutex in perf_mmap()
- Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes
- Fix perf_mmap() failure path

Uprobes:
- Harden x86 uretprobe syscall trampoline check (Jiri Olsa)
- Remove redundant spinlock in uprobe_deny_signal() (Liao Chang)
- Remove the spinlock within handle_singlestep() (Liao Chang)

x86 Intel PMU enhancements:
- Support PEBS counters snapshotting (Kan Liang)
- Fix intel_pmu_read_event() (Kan Liang)
- Extend per event callchain limit to branch stack (Kan Liang)
- Fix system-wide LBR profiling (Kan Liang)
- Allocate bts_ctx only if necessary (Li RongQing)
- Apply static call for drain_pebs (Peter Zijlstra)

x86 AMD PMU enhancements: (Ravi Bangoria)
- Remove pointless sample period check
- Fix ->config to sample period calculation for OP PMU
- Fix perf_ibs_op.cnt_mask for CurCnt
- Don't allow freq mode event creation through ->config interface
- Add PMU specific minimum period
- Add ->check_period() callback
- Ceil sample_period to min_period
- Add support for OP Load Latency Filtering
- Update DTLB/PageSize decode logic

Hardware breakpoints:
- Return EOPNOTSUPP for unsupported breakpoint type (Saket Kumar
Bhaskar)

Hardlockup detector improvements: (Li Huafei)
- Fix perf_event memory leak
- Warn if watchdog_ev is leaked

Fixes and cleanups:
- Misc fixes and cleanups (Andy Shevchenko, Kan Liang, Peter
Zijlstra, Ravi Bangoria, Thorsten Blum, XieLudan)"

* tag 'perf-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
perf: Fix __percpu annotation
perf: Clean up pmu specific data
perf/x86: Remove swap_task_ctx()
perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode
perf: Supply task information to sched_task()
perf: attach/detach PMU specific data
locking/percpu-rwsem: Add guard support
perf: Save PMU specific data in task_struct
perf: Extend per event callchain limit to branch stack
perf/ring_buffer: Allow the EPOLLRDNORM flag for poll
perf/core: Use POLLHUP for pinned events in error
perf/core: Use sysfs_emit() instead of scnprintf()
perf/core: Remove optional 'size' arguments from strscpy() calls
perf/x86/intel/bts: Check if bts_ctx is allocated when calling BTS functions
uprobes/x86: Harden uretprobe syscall trampoline check
watchdog/hardlockup/perf: Warn if watchdog_ev is leaked
watchdog/hardlockup/perf: Fix perf_event memory leak
perf/x86: Annotate struct bts_buffer::buf with __counted_by()
perf/core: Clean up perf_try_init_event()
perf/core: Fix perf_mmap() failure path
...

+1425 -746
+6 -2
arch/powerpc/perf/core-book3s.c
 
 static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
 static inline void power_pmu_bhrb_disable(struct perf_event *event) {}
-static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in) {}
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
+                                 struct task_struct *task, bool sched_in)
+{
+}
 static inline void power_pmu_bhrb_read(struct perf_event *event, struct cpu_hw_events *cpuhw) {}
 static void pmao_restore_workaround(bool ebb) { }
 #endif /* CONFIG_PPC32 */
···
 /* Called from ctxsw to prevent one process's branch entries to
  * mingle with the other process's entries during context switch.
  */
-static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
+                                 struct task_struct *task, bool sched_in)
 {
     if (!ppmu->bhrb_nr)
         return;
+2 -1
arch/s390/kernel/perf_pai_crypto.c
 /* Called on schedule-in and schedule-out. No access to event structure,
  * but for sampling only event CRYPTO_ALL is allowed.
  */
-static void paicrypt_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+static void paicrypt_sched_task(struct perf_event_pmu_context *pmu_ctx,
+                                struct task_struct *task, bool sched_in)
 {
     /* We started with a clean page on event installation. So read out
      * results on schedule_out and if page was dirty, save old values.
+2 -1
arch/s390/kernel/perf_pai_ext.c
 /* Called on schedule-in and schedule-out. No access to event structure,
  * but for sampling only event NNPA_ALL is allowed.
  */
-static void paiext_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+static void paiext_sched_task(struct perf_event_pmu_context *pmu_ctx,
+                              struct task_struct *task, bool sched_in)
 {
     /* We started with a clean page on event installation. So read out
      * results on schedule_out and if page was dirty, save old values.
+2 -1
arch/x86/events/amd/brs.c
  * On ctxswin, sched_in = true, called after the PMU has started
  * On ctxswout, sched_in = false, called before the PMU is stopped
  */
-void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx,
+                            struct task_struct *task, bool sched_in)
 {
     struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
+175 -31
arch/x86/events/amd/ibs.c
 #include <asm/nmi.h>
 #include <asm/amd-ibs.h>
 
-#define IBS_FETCH_CONFIG_MASK	(IBS_FETCH_RAND_EN | IBS_FETCH_MAX_CNT)
-#define IBS_OP_CONFIG_MASK	IBS_OP_MAX_CNT
-
 /* attr.config2 */
 #define IBS_SW_FILTER_MASK 1
 
···
     u64 cnt_mask;
     u64 enable_mask;
     u64 valid_mask;
+    u16 min_period;
     u64 max_period;
     unsigned long offset_mask[1];
     int offset_max;
···
     return 0;
 }
 
+static bool perf_ibs_ldlat_event(struct perf_ibs *perf_ibs,
+                                 struct perf_event *event)
+{
+    return perf_ibs == &perf_ibs_op &&
+           (ibs_caps & IBS_CAPS_OPLDLAT) &&
+           (event->attr.config1 & 0xFFF);
+}
+
 static int perf_ibs_init(struct perf_event *event)
 {
     struct hw_perf_event *hwc = &event->hw;
     struct perf_ibs *perf_ibs;
-    u64 max_cnt, config;
+    u64 config;
     int ret;
 
     perf_ibs = get_ibs_pmu(event->attr.type);
···
         if (config & perf_ibs->cnt_mask)
             /* raw max_cnt may not be set */
             return -EINVAL;
-        if (!event->attr.sample_freq && hwc->sample_period & 0x0f)
-            /*
-             * lower 4 bits can not be set in ibs max cnt,
-             * but allowing it in case we adjust the
-             * sample period to set a frequency.
-             */
-            return -EINVAL;
-        hwc->sample_period &= ~0x0FULL;
-        if (!hwc->sample_period)
-            hwc->sample_period = 0x10;
+
+        if (event->attr.freq) {
+            hwc->sample_period = perf_ibs->min_period;
+        } else {
+            /* Silently mask off lower nibble. IBS hw mandates it. */
+            hwc->sample_period &= ~0x0FULL;
+            if (hwc->sample_period < perf_ibs->min_period)
+                return -EINVAL;
+        }
     } else {
-        max_cnt = config & perf_ibs->cnt_mask;
+        u64 period = 0;
+
+        if (event->attr.freq)
+            return -EINVAL;
+
+        if (perf_ibs == &perf_ibs_op) {
+            period = (config & IBS_OP_MAX_CNT) << 4;
+            if (ibs_caps & IBS_CAPS_OPCNTEXT)
+                period |= config & IBS_OP_MAX_CNT_EXT_MASK;
+        } else {
+            period = (config & IBS_FETCH_MAX_CNT) << 4;
+        }
+
         config &= ~perf_ibs->cnt_mask;
-        event->attr.sample_period = max_cnt << 4;
-        hwc->sample_period = event->attr.sample_period;
+        event->attr.sample_period = period;
+        hwc->sample_period = period;
+
+        if (hwc->sample_period < perf_ibs->min_period)
+            return -EINVAL;
     }
 
-    if (!hwc->sample_period)
-        return -EINVAL;
+    if (perf_ibs_ldlat_event(perf_ibs, event)) {
+        u64 ldlat = event->attr.config1 & 0xFFF;
+
+        if (ldlat < 128 || ldlat > 2048)
+            return -EINVAL;
+        ldlat >>= 7;
+
+        config |= (ldlat - 1) << 59;
+        config |= IBS_OP_L3MISSONLY | IBS_OP_LDLAT_EN;
+    }
 
     /*
      * If we modify hwc->sample_period, we also need to update
···
     int overflow;
 
     /* ignore lower 4 bits in min count: */
-    overflow = perf_event_set_period(hwc, 1<<4, perf_ibs->max_period, period);
+    overflow = perf_event_set_period(hwc, perf_ibs->min_period,
+                                     perf_ibs->max_period, period);
     local64_set(&hwc->prev_count, 0);
 
     return overflow;
···
 
     WARN_ON_ONCE(!(hwc->state & PERF_HES_UPTODATE));
     hwc->state = 0;
+
+    if (event->attr.freq && hwc->sample_period < perf_ibs->min_period)
+        hwc->sample_period = perf_ibs->min_period;
 
     perf_ibs_set_period(perf_ibs, hwc, &period);
     if (perf_ibs == &perf_ibs_op && (ibs_caps & IBS_CAPS_OPCNTEXT)) {
···
 
 static void perf_ibs_read(struct perf_event *event) { }
 
+static int perf_ibs_check_period(struct perf_event *event, u64 value)
+{
+    struct perf_ibs *perf_ibs;
+    u64 low_nibble;
+
+    if (event->attr.freq)
+        return 0;
+
+    perf_ibs = container_of(event->pmu, struct perf_ibs, pmu);
+    low_nibble = value & 0xFULL;
+
+    /*
+     * This contradicts with perf_ibs_init() which allows sample period
+     * with lower nibble bits set but silently masks them off. Whereas
+     * this returns error.
+     */
+    if (low_nibble || value < perf_ibs->min_period)
+        return -EINVAL;
+
+    return 0;
+}
+
 /*
  * We need to initialize with empty group if all attributes in the
  * group are dynamic.
···
 PMU_FORMAT_ATTR(swfilt, "config2:0");
 PMU_EVENT_ATTR_STRING(l3missonly, fetch_l3missonly, "config:59");
 PMU_EVENT_ATTR_STRING(l3missonly, op_l3missonly, "config:16");
+PMU_EVENT_ATTR_STRING(ldlat, ibs_op_ldlat_format, "config1:0-11");
 PMU_EVENT_ATTR_STRING(zen4_ibs_extensions, zen4_ibs_extensions, "1");
+PMU_EVENT_ATTR_STRING(ldlat, ibs_op_ldlat_cap, "1");
+PMU_EVENT_ATTR_STRING(dtlb_pgsize, ibs_op_dtlb_pgsize_cap, "1");
 
 static umode_t
 zen4_ibs_extensions_is_visible(struct kobject *kobj, struct attribute *attr, int i)
 {
     return ibs_caps & IBS_CAPS_ZEN4 ? attr->mode : 0;
+}
+
+static umode_t
+ibs_op_ldlat_is_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+    return ibs_caps & IBS_CAPS_OPLDLAT ? attr->mode : 0;
+}
+
+static umode_t
+ibs_op_dtlb_pgsize_is_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+    return ibs_caps & IBS_CAPS_OPDTLBPGSIZE ? attr->mode : 0;
 }
 
 static struct attribute *fetch_attrs[] = {
···
     NULL,
 };
 
+static struct attribute *ibs_op_ldlat_cap_attrs[] = {
+    &ibs_op_ldlat_cap.attr.attr,
+    NULL,
+};
+
+static struct attribute *ibs_op_dtlb_pgsize_cap_attrs[] = {
+    &ibs_op_dtlb_pgsize_cap.attr.attr,
+    NULL,
+};
+
 static struct attribute_group group_fetch_formats = {
     .name = "format",
     .attrs = fetch_attrs,
···
     .name = "caps",
     .attrs = zen4_ibs_extensions_attrs,
     .is_visible = zen4_ibs_extensions_is_visible,
+};
+
+static struct attribute_group group_ibs_op_ldlat_cap = {
+    .name = "caps",
+    .attrs = ibs_op_ldlat_cap_attrs,
+    .is_visible = ibs_op_ldlat_is_visible,
+};
+
+static struct attribute_group group_ibs_op_dtlb_pgsize_cap = {
+    .name = "caps",
+    .attrs = ibs_op_dtlb_pgsize_cap_attrs,
+    .is_visible = ibs_op_dtlb_pgsize_is_visible,
 };
 
 static const struct attribute_group *fetch_attr_groups[] = {
···
     .attrs = op_attrs,
 };
 
+static struct attribute *ibs_op_ldlat_format_attrs[] = {
+    &ibs_op_ldlat_format.attr.attr,
+    NULL,
+};
+
 static struct attribute_group group_cnt_ctl = {
     .name = "format",
     .attrs = cnt_ctl_attrs,
···
     NULL,
 };
 
+static struct attribute_group group_ibs_op_ldlat_format = {
+    .name = "format",
+    .attrs = ibs_op_ldlat_format_attrs,
+    .is_visible = ibs_op_ldlat_is_visible,
+};
+
 static const struct attribute_group *op_attr_update[] = {
     &group_cnt_ctl,
     &group_op_l3missonly,
     &group_zen4_ibs_extensions,
+    &group_ibs_op_ldlat_cap,
+    &group_ibs_op_ldlat_format,
+    &group_ibs_op_dtlb_pgsize_cap,
     NULL,
 };
···
         .start		= perf_ibs_start,
         .stop		= perf_ibs_stop,
         .read		= perf_ibs_read,
+        .check_period	= perf_ibs_check_period,
     },
     .msr		= MSR_AMD64_IBSFETCHCTL,
-    .config_mask	= IBS_FETCH_CONFIG_MASK,
+    .config_mask	= IBS_FETCH_MAX_CNT | IBS_FETCH_RAND_EN,
     .cnt_mask		= IBS_FETCH_MAX_CNT,
     .enable_mask	= IBS_FETCH_ENABLE,
     .valid_mask		= IBS_FETCH_VAL,
+    .min_period		= 0x10,
     .max_period		= IBS_FETCH_MAX_CNT << 4,
     .offset_mask	= { MSR_AMD64_IBSFETCH_REG_MASK },
     .offset_max		= MSR_AMD64_IBSFETCH_REG_COUNT,
···
         .start		= perf_ibs_start,
         .stop		= perf_ibs_stop,
         .read		= perf_ibs_read,
+        .check_period	= perf_ibs_check_period,
     },
     .msr		= MSR_AMD64_IBSOPCTL,
-    .config_mask	= IBS_OP_CONFIG_MASK,
+    .config_mask	= IBS_OP_MAX_CNT,
     .cnt_mask		= IBS_OP_MAX_CNT | IBS_OP_CUR_CNT |
			  IBS_OP_CUR_CNT_RAND,
     .enable_mask	= IBS_OP_ENABLE,
     .valid_mask		= IBS_OP_VAL,
+    .min_period		= 0x90,
     .max_period		= IBS_OP_MAX_CNT << 4,
     .offset_mask	= { MSR_AMD64_IBSOP_REG_MASK },
     .offset_max		= MSR_AMD64_IBSOP_REG_COUNT,
···
     if (!op_data3->dc_lin_addr_valid)
         return;
 
+    if ((ibs_caps & IBS_CAPS_OPDTLBPGSIZE) &&
+        !op_data3->dc_phy_addr_valid)
+        return;
+
     if (!op_data3->dc_l1tlb_miss) {
         data_src->mem_dtlb = PERF_MEM_TLB_L1 | PERF_MEM_TLB_HIT;
         return;
···
     }
 }
 
-static int perf_ibs_get_offset_max(struct perf_ibs *perf_ibs, u64 sample_type,
+static bool perf_ibs_is_mem_sample_type(struct perf_ibs *perf_ibs,
+                                        struct perf_event *event)
+{
+    u64 sample_type = event->attr.sample_type;
+
+    return perf_ibs == &perf_ibs_op &&
+           sample_type & (PERF_SAMPLE_DATA_SRC |
+                          PERF_SAMPLE_WEIGHT_TYPE |
+                          PERF_SAMPLE_ADDR |
+                          PERF_SAMPLE_PHYS_ADDR);
+}
+
+static int perf_ibs_get_offset_max(struct perf_ibs *perf_ibs,
+                                   struct perf_event *event,
                                    int check_rip)
 {
-    if (sample_type & PERF_SAMPLE_RAW ||
-        (perf_ibs == &perf_ibs_op &&
-         (sample_type & PERF_SAMPLE_DATA_SRC ||
-          sample_type & PERF_SAMPLE_WEIGHT_TYPE ||
-          sample_type & PERF_SAMPLE_ADDR ||
-          sample_type & PERF_SAMPLE_PHYS_ADDR)))
+    if (event->attr.sample_type & PERF_SAMPLE_RAW ||
+        perf_ibs_is_mem_sample_type(perf_ibs, event) ||
+        perf_ibs_ldlat_event(perf_ibs, event))
         return perf_ibs->offset_max;
     else if (check_rip)
         return 3;
···
     offset = 1;
     check_rip = (perf_ibs == &perf_ibs_op && (ibs_caps & IBS_CAPS_RIPINVALIDCHK));
 
-    offset_max = perf_ibs_get_offset_max(perf_ibs, event->attr.sample_type, check_rip);
+    offset_max = perf_ibs_get_offset_max(perf_ibs, event, check_rip);
 
     do {
         rdmsrl(msr + offset, *buf++);
···
                        perf_ibs->offset_max,
                        offset + 1);
     } while (offset < offset_max);
+
+    if (perf_ibs_ldlat_event(perf_ibs, event)) {
+        union ibs_op_data3 op_data3;
+
+        op_data3.val = ibs_data.regs[ibs_op_msr_idx(MSR_AMD64_IBSOPDATA3)];
+        /*
+         * Opening event is errored out if load latency threshold is
+         * outside of [128, 2048] range. Since the event has reached
+         * interrupt handler, we can safely assume the threshold is
+         * within [128, 2048] range.
+         */
+        if (!op_data3.ld_op || !op_data3.dc_miss ||
+            op_data3.dc_miss_lat <= (event->attr.config1 & 0xFFF))
+            goto out;
+    }
+
     /*
      * Read IbsBrTarget, IbsOpData4, and IbsExtdCtl separately
      * depending on their availability.
···
     perf_sample_save_callchain(&data, event, iregs);
 
     throttle = perf_event_overflow(event, &data, &regs);
+
+    if (event->attr.freq && hwc->sample_period < perf_ibs->min_period)
+        hwc->sample_period = perf_ibs->min_period;
+
 out:
     if (throttle) {
         perf_ibs_stop(event, 0);
···
     if (ibs_caps & IBS_CAPS_OPCNTEXT) {
         perf_ibs_op.max_period  |= IBS_OP_MAX_CNT_EXT_MASK;
         perf_ibs_op.config_mask |= IBS_OP_MAX_CNT_EXT_MASK;
-        perf_ibs_op.cnt_mask    |= IBS_OP_MAX_CNT_EXT_MASK;
+        perf_ibs_op.cnt_mask    |= (IBS_OP_MAX_CNT_EXT_MASK |
+                                    IBS_OP_CUR_CNT_EXT_MASK);
     }
 
     if (ibs_caps & IBS_CAPS_ZEN4)
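The load-latency filtering above validates and encodes the 12-bit threshold from attr.config1 in a few bit operations. As a stand-alone sketch of just that encoding (the helper name and error value are illustrative; the extra IBS_OP_L3MISSONLY/IBS_OP_LDLAT_EN flags the kernel also sets are left out):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative userspace version of the ldlat handling in perf_ibs_init():
 * the threshold must be in [128, 2048]; it is scaled by 128 and stored,
 * minus one, in bits 59-62 of the IBS op control value. */
static int ibs_encode_ldlat(uint64_t config1, uint64_t *config)
{
	uint64_t ldlat = config1 & 0xFFF;

	if (ldlat < 128 || ldlat > 2048)
		return -1;	/* the kernel returns -EINVAL here */
	ldlat >>= 7;		/* 128 -> 1, ..., 2048 -> 16 */
	*config |= (ldlat - 1) << 59;
	return 0;
}
```

Note the divide-by-128 truncates, so a threshold of, say, 200 encodes the same as 128; only multiples of 128 are exactly representable in the 4-bit field.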
+1 -1
arch/x86/events/amd/iommu.c
 #define GET_DOMID_MASK(x)	(((x)->conf1 >> 16) & 0xFFFFULL)
 #define GET_PASID_MASK(x)	(((x)->conf1 >> 32) & 0xFFFFFULL)
 
-#define IOMMU_NAME_SIZE 16
+#define IOMMU_NAME_SIZE 24
 
 struct perf_amd_iommu {
     struct list_head list;
+2 -1
arch/x86/events/amd/lbr.c
     perf_sched_cb_dec(event->pmu);
 }
 
-void amd_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+void amd_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx,
+                            struct task_struct *task, bool sched_in)
 {
     struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
+16 -11
arch/x86/events/core.c
 DEFINE_STATIC_CALL_NULL(x86_pmu_stop_scheduling,  *x86_pmu.stop_scheduling);
 
 DEFINE_STATIC_CALL_NULL(x86_pmu_sched_task,    *x86_pmu.sched_task);
-DEFINE_STATIC_CALL_NULL(x86_pmu_swap_task_ctx, *x86_pmu.swap_task_ctx);
 
 DEFINE_STATIC_CALL_NULL(x86_pmu_drain_pebs,   *x86_pmu.drain_pebs);
 DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_aliases, *x86_pmu.pebs_aliases);
 
 DEFINE_STATIC_CALL_NULL(x86_pmu_filter, *x86_pmu.filter);
+
+DEFINE_STATIC_CALL_NULL(x86_pmu_late_setup, *x86_pmu.late_setup);
 
 /*
  * This one is magic, it will get called even when PMU init fails (because
···
 
     if (cpuc->n_added) {
         int n_running = cpuc->n_events - cpuc->n_added;
+
+        /*
+         * The late setup (after counters are scheduled)
+         * is required for some cases, e.g., PEBS counters
+         * snapshotting. Because an accurate counter index
+         * is needed.
+         */
+        static_call_cond(x86_pmu_late_setup)();
+
         /*
          * apply assignment obtained either from
          * hw_perf_group_sched_in() or x86_pmu_enable()
···
     static_call_update(x86_pmu_stop_scheduling, x86_pmu.stop_scheduling);
 
     static_call_update(x86_pmu_sched_task, x86_pmu.sched_task);
-    static_call_update(x86_pmu_swap_task_ctx, x86_pmu.swap_task_ctx);
 
     static_call_update(x86_pmu_drain_pebs, x86_pmu.drain_pebs);
     static_call_update(x86_pmu_pebs_aliases, x86_pmu.pebs_aliases);
 
     static_call_update(x86_pmu_guest_get_msrs, x86_pmu.guest_get_msrs);
     static_call_update(x86_pmu_filter, x86_pmu.filter);
+
+    static_call_update(x86_pmu_late_setup, x86_pmu.late_setup);
 }
 
 static void _x86_pmu_read(struct perf_event *event)
···
     NULL,
 };
 
-static void x86_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+static void x86_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
+                               struct task_struct *task, bool sched_in)
 {
-    static_call_cond(x86_pmu_sched_task)(pmu_ctx, sched_in);
-}
-
-static void x86_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
-                                  struct perf_event_pmu_context *next_epc)
-{
-    static_call_cond(x86_pmu_swap_task_ctx)(prev_epc, next_epc);
+    static_call_cond(x86_pmu_sched_task)(pmu_ctx, task, sched_in);
 }
 
 void perf_check_microcode(void)
···
 
     .event_idx		= x86_pmu_event_idx,
     .sched_task		= x86_pmu_sched_task,
-    .swap_task_ctx	= x86_pmu_swap_task_ctx,
     .check_period	= x86_pmu_check_period,
 
     .aux_output_match	= x86_pmu_aux_output_match,
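The new x86_pmu_late_setup hook above follows the kernel's conditional static-call pattern: a slot declared with DEFINE_STATIC_CALL_NULL is a no-op until static_call_update() installs an implementation. A userspace analogue using a plain function pointer, ignoring the code-patching aspect (all names here are illustrative, not kernel APIs):

```c
#include <stddef.h>

/* Sketch of the static_call_cond(x86_pmu_late_setup) pattern: the slot
 * starts out NULL and the call site degrades to a no-op until a PMU
 * registers an implementation. */
typedef void (*late_setup_fn)(int *counter_idx);

static late_setup_fn late_setup_slot;	/* NULL: nothing registered */

static void call_late_setup(int *counter_idx)
{
	if (late_setup_slot)		/* what static_call_cond() amounts to */
		late_setup_slot(counter_idx);
}

/* A hypothetical PMU hook that needs the post-scheduling callback,
 * e.g. to record the final counter index for PEBS snapshotting. */
static void pebs_late_setup(int *counter_idx)
{
	*counter_idx = 3;
}
```

The real mechanism patches the call site directly rather than testing a pointer, which is why it is preferred over indirect calls on hot PMU paths.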
+31 -12
arch/x86/events/intel/bts.c
     BTS_STATE_ACTIVE,
 };
 
-static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
+static struct bts_ctx __percpu *bts_ctx;
 
 #define BTS_RECORD_SIZE		24
 #define BTS_SAFETY_MARGIN	4080
···
     local_t		head;
     unsigned long	end;
     void		**data_pages;
-    struct bts_phys	buf[];
+    struct bts_phys	buf[] __counted_by(nr_bufs);
 };
 
 static struct pmu bts_pmu;
···
 
 static void __bts_event_start(struct perf_event *event)
 {
-    struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+    struct bts_ctx *bts = this_cpu_ptr(bts_ctx);
     struct bts_buffer *buf = perf_get_aux(&bts->handle);
     u64 config = 0;
 
···
 static void bts_event_start(struct perf_event *event, int flags)
 {
     struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-    struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+    struct bts_ctx *bts = this_cpu_ptr(bts_ctx);
     struct bts_buffer *buf;
 
     buf = perf_aux_output_begin(&bts->handle, event);
···
 
 static void __bts_event_stop(struct perf_event *event, int state)
 {
-    struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+    struct bts_ctx *bts = this_cpu_ptr(bts_ctx);
 
     /* ACTIVE -> INACTIVE(PMI)/STOPPED(->stop()) */
     WRITE_ONCE(bts->state, state);
···
 static void bts_event_stop(struct perf_event *event, int flags)
 {
     struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-    struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+    struct bts_ctx *bts = this_cpu_ptr(bts_ctx);
     struct bts_buffer *buf = NULL;
     int state = READ_ONCE(bts->state);
···
 
 void intel_bts_enable_local(void)
 {
-    struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
-    int state = READ_ONCE(bts->state);
+    struct bts_ctx *bts;
+    int state;
 
+    if (!bts_ctx)
+        return;
+
+    bts = this_cpu_ptr(bts_ctx);
+    state = READ_ONCE(bts->state);
     /*
      * Here we transition from INACTIVE to ACTIVE;
      * if we instead are STOPPED from the interrupt handler,
···
 
 void intel_bts_disable_local(void)
 {
-    struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+    struct bts_ctx *bts;
+
+    if (!bts_ctx)
+        return;
+
+    bts = this_cpu_ptr(bts_ctx);
 
     /*
      * Here we transition from ACTIVE to INACTIVE;
···
 int intel_bts_interrupt(void)
 {
     struct debug_store *ds = this_cpu_ptr(&cpu_hw_events)->ds;
-    struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
-    struct perf_event *event = bts->handle.event;
+    struct bts_ctx *bts;
+    struct perf_event *event;
     struct bts_buffer *buf;
     s64 old_head;
     int err = -ENOSPC, handled = 0;
 
+    if (!bts_ctx)
+        return 0;
+
+    bts = this_cpu_ptr(bts_ctx);
+    event = bts->handle.event;
     /*
      * The only surefire way of knowing if this NMI is ours is by checking
      * the write ptr against the PMI threshold.
···
 
 static int bts_event_add(struct perf_event *event, int mode)
 {
-    struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+    struct bts_ctx *bts = this_cpu_ptr(bts_ctx);
     struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
     struct hw_perf_event *hwc = &event->hw;
 
···
          */
         return -ENODEV;
     }
+
+    bts_ctx = alloc_percpu(struct bts_ctx);
+    if (!bts_ctx)
+        return -ENOMEM;
 
     bts_pmu.capabilities	= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE |
				  PERF_PMU_CAP_EXCLUSIVE;
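The bts.c change above converts a statically defined per-CPU context into one allocated only when BTS hardware is actually present, so every entry point that can be reached regardless (NMI handler, enable/disable paths) must now tolerate a NULL bts_ctx. A minimal userspace analogue of that lazy-allocation-plus-guard pattern, with calloc() standing in for alloc_percpu() (all names are illustrative):

```c
#include <stdlib.h>

/* Sketch of the lazy allocation pattern from the bts.c hunk: the context
 * pointer stays NULL unless the feature is detected, and handlers bail
 * out early instead of touching storage that was never allocated. */
struct bts_ctx_sketch {
	int state;
};

static struct bts_ctx_sketch *bts_ctx_sketch;	/* NULL until init succeeds */

static int bts_init_sketch(int hw_present, int ncpus)
{
	if (!hw_present)
		return -1;	/* -ENODEV in the kernel */
	bts_ctx_sketch = calloc((size_t)ncpus, sizeof(*bts_ctx_sketch));
	if (!bts_ctx_sketch)
		return -2;	/* -ENOMEM */
	return 0;
}

static int bts_interrupt_sketch(int cpu)
{
	if (!bts_ctx_sketch)	/* mirrors the new NULL guard */
		return 0;	/* not our interrupt */
	bts_ctx_sketch[cpu].state++;
	return 1;		/* handled */
}
```

The design saves sizeof(struct bts_ctx) per CPU on the large majority of machines where BTS is absent, at the cost of one extra NULL check on paths that were previously unconditional.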
+85 -42
arch/x86/events/intel/core.c
  * modify by a NMI. PMU has to be disabled before calling this function.
  */
 
-static u64 intel_update_topdown_event(struct perf_event *event, int metric_end)
+static u64 intel_update_topdown_event(struct perf_event *event, int metric_end, u64 *val)
 {
     struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
     struct perf_event *other;
···
     bool reset = true;
     int idx;
 
-    /* read Fixed counter 3 */
-    rdpmcl((3 | INTEL_PMC_FIXED_RDPMC_BASE), slots);
-    if (!slots)
-        return 0;
+    if (!val) {
+        /* read Fixed counter 3 */
+        rdpmcl((3 | INTEL_PMC_FIXED_RDPMC_BASE), slots);
+        if (!slots)
+            return 0;
 
-    /* read PERF_METRICS */
-    rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics);
+        /* read PERF_METRICS */
+        rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics);
+    } else {
+        slots = val[0];
+        metrics = val[1];
+        /*
+         * Don't reset the PERF_METRICS and Fixed counter 3
+         * for each PEBS record read. Utilize the RDPMC metrics
+         * clear mode.
+         */
+        reset = false;
+    }
 
     for_each_set_bit(idx, cpuc->active_mask, metric_end + 1) {
         if (!is_topdown_idx(idx))
···
     return slots;
 }
 
-static u64 icl_update_topdown_event(struct perf_event *event)
+static u64 icl_update_topdown_event(struct perf_event *event, u64 *val)
 {
     return intel_update_topdown_event(event, INTEL_PMC_IDX_METRIC_BASE +
-                                      x86_pmu.num_topdown_events - 1);
+                                      x86_pmu.num_topdown_events - 1,
+                                      val);
 }
 
-DEFINE_STATIC_CALL(intel_pmu_update_topdown_event, x86_perf_event_update);
-
-static void intel_pmu_read_topdown_event(struct perf_event *event)
-{
-    struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-
-    /* Only need to call update_topdown_event() once for group read. */
-    if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
-        !is_slots_event(event))
-        return;
-
-    perf_pmu_disable(event->pmu);
-    static_call(intel_pmu_update_topdown_event)(event);
-    perf_pmu_enable(event->pmu);
-}
+DEFINE_STATIC_CALL(intel_pmu_update_topdown_event, intel_pmu_topdown_event_update);
 
 static void intel_pmu_read_event(struct perf_event *event)
 {
-    if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
-        intel_pmu_auto_reload_read(event);
-    else if (is_topdown_count(event))
-        intel_pmu_read_topdown_event(event);
-    else
-        x86_perf_event_update(event);
+    if (event->hw.flags & (PERF_X86_EVENT_AUTO_RELOAD | PERF_X86_EVENT_TOPDOWN) ||
+        is_pebs_counter_event_group(event)) {
+        struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+        bool pmu_enabled = cpuc->enabled;
+
+        /* Only need to call update_topdown_event() once for group read. */
+        if (is_metric_event(event) && (cpuc->txn_flags & PERF_PMU_TXN_READ))
+            return;
+
+        cpuc->enabled = 0;
+        if (pmu_enabled)
+            intel_pmu_disable_all();
+
+        /*
+         * If the PEBS counters snapshotting is enabled,
+         * the topdown event is available in PEBS records.
+         */
+        if (is_topdown_event(event) && !is_pebs_counter_event_group(event))
+            static_call(intel_pmu_update_topdown_event)(event, NULL);
+        else
+            intel_pmu_drain_pebs_buffer();
+
+        cpuc->enabled = pmu_enabled;
+        if (pmu_enabled)
+            intel_pmu_enable_all(0);
+
+        return;
+    }
+
+    x86_perf_event_update(event);
 }
 
 static void intel_pmu_enable_fixed(struct perf_event *event)
···
 static u64 intel_pmu_update(struct perf_event *event)
 {
     if (unlikely(is_topdown_count(event)))
-        return static_call(intel_pmu_update_topdown_event)(event);
+        return static_call(intel_pmu_update_topdown_event)(event, NULL);
 
     return x86_perf_event_update(event);
 }
···
 
     handled++;
     x86_pmu_handle_guest_pebs(regs, &data);
-    x86_pmu.drain_pebs(regs, &data);
+    static_call(x86_pmu_drain_pebs)(regs, &data);
     status &= intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
 
     /*
···
      */
     if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) {
         handled++;
-        static_call(intel_pmu_update_topdown_event)(NULL);
+        static_call(intel_pmu_update_topdown_event)(NULL, NULL);
     }
 
     /*
···
 
     if (!test_bit(bit, cpuc->active_mask))
         continue;
+
+    /*
+     * There may be unprocessed PEBS records in the PEBS buffer,
+     * which still stores the previous values.
+     * Process those records first before handling the latest value.
+     * For example,
+     * A is a regular counter
+     * B is a PEBS event which reads A
+     * C is a PEBS event
+     *
+     * The following can happen:
+     * B-assist			A=1
+     * C			A=2
+     * B-assist			A=3
+     * A-overflow-PMI		A=4
+     * C-assist-PMI (PEBS buffer)	A=5
+     *
+     * The PEBS buffer has to be drained before handling the A-PMI
+     */
+    if (is_pebs_counter_event_group(event))
+        x86_pmu.drain_pebs(regs, &data);
 
     if (!intel_pmu_save_and_restart(event))
         continue;
···
 
     event->hw.flags |= PERF_X86_EVENT_PEBS_VIA_PT;
 }
+
+if ((event->attr.sample_type & PERF_SAMPLE_READ) &&
+    (x86_pmu.intel_cap.pebs_format >= 6) &&
+    x86_pmu.intel_cap.pebs_baseline &&
+    is_sampling_event(event) &&
+    event->attr.precise_ip)
+    event->group_leader->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
 
 if ((event->attr.type == PERF_TYPE_HARDWARE) ||
     (event->attr.type == PERF_TYPE_HW_CACHE))
···
 }
 
 static void intel_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
-                                 bool sched_in)
+                                 struct task_struct *task, bool sched_in)
 {
     intel_pmu_pebs_sched_task(pmu_ctx, sched_in);
-    intel_pmu_lbr_sched_task(pmu_ctx, sched_in);
-}
-
-static void intel_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
-                                    struct perf_event_pmu_context *next_epc)
-{
-    intel_pmu_lbr_swap_task_ctx(prev_epc, next_epc);
+    intel_pmu_lbr_sched_task(pmu_ctx, task, sched_in);
 }
 
 static int intel_pmu_check_period(struct perf_event *event, u64 value)
···
 
     .guest_get_msrs	= intel_guest_get_msrs,
     .sched_task		= intel_pmu_sched_task,
-    .swap_task_ctx	= intel_pmu_swap_task_ctx,
 
     .check_period	= intel_pmu_check_period,
 
+184 -20
arch/x86/events/intel/ds.c
··· 953 953 return 1; 954 954 } 955 955 956 - static inline void intel_pmu_drain_pebs_buffer(void) 956 + void intel_pmu_drain_pebs_buffer(void) 957 957 { 958 958 struct perf_sample_data data; 959 959 960 - x86_pmu.drain_pebs(NULL, &data); 960 + static_call(x86_pmu_drain_pebs)(NULL, &data); 961 961 } 962 962 963 963 /* ··· 1294 1294 ds->pebs_interrupt_threshold = threshold; 1295 1295 } 1296 1296 1297 + #define PEBS_DATACFG_CNTRS(x) \ 1298 + ((x >> PEBS_DATACFG_CNTR_SHIFT) & PEBS_DATACFG_CNTR_MASK) 1299 + 1300 + #define PEBS_DATACFG_CNTR_BIT(x) \ 1301 + (((1ULL << x) & PEBS_DATACFG_CNTR_MASK) << PEBS_DATACFG_CNTR_SHIFT) 1302 + 1303 + #define PEBS_DATACFG_FIX(x) \ 1304 + ((x >> PEBS_DATACFG_FIX_SHIFT) & PEBS_DATACFG_FIX_MASK) 1305 + 1306 + #define PEBS_DATACFG_FIX_BIT(x) \ 1307 + (((1ULL << (x)) & PEBS_DATACFG_FIX_MASK) \ 1308 + << PEBS_DATACFG_FIX_SHIFT) 1309 + 1297 1310 static void adaptive_pebs_record_size_update(void) 1298 1311 { 1299 1312 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); ··· 1321 1308 sz += sizeof(struct pebs_xmm); 1322 1309 if (pebs_data_cfg & PEBS_DATACFG_LBRS) 1323 1310 sz += x86_pmu.lbr_nr * sizeof(struct lbr_entry); 1311 + if (pebs_data_cfg & (PEBS_DATACFG_METRICS | PEBS_DATACFG_CNTR)) { 1312 + sz += sizeof(struct pebs_cntr_header); 1313 + 1314 + /* Metrics base and Metrics Data */ 1315 + if (pebs_data_cfg & PEBS_DATACFG_METRICS) 1316 + sz += 2 * sizeof(u64); 1317 + 1318 + if (pebs_data_cfg & PEBS_DATACFG_CNTR) { 1319 + sz += (hweight64(PEBS_DATACFG_CNTRS(pebs_data_cfg)) + 1320 + hweight64(PEBS_DATACFG_FIX(pebs_data_cfg))) * 1321 + sizeof(u64); 1322 + } 1323 + } 1324 1324 1325 1325 cpuc->pebs_record_size = sz; 1326 + } 1327 + 1328 + static void __intel_pmu_pebs_update_cfg(struct perf_event *event, 1329 + int idx, u64 *pebs_data_cfg) 1330 + { 1331 + if (is_metric_event(event)) { 1332 + *pebs_data_cfg |= PEBS_DATACFG_METRICS; 1333 + return; 1334 + } 1335 + 1336 + *pebs_data_cfg |= PEBS_DATACFG_CNTR; 1337 + 1338 + if (idx >= 
INTEL_PMC_IDX_FIXED) 1339 + *pebs_data_cfg |= PEBS_DATACFG_FIX_BIT(idx - INTEL_PMC_IDX_FIXED); 1340 + else 1341 + *pebs_data_cfg |= PEBS_DATACFG_CNTR_BIT(idx); 1342 + } 1343 + 1344 + 1345 + static void intel_pmu_late_setup(void) 1346 + { 1347 + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); 1348 + struct perf_event *event; 1349 + u64 pebs_data_cfg = 0; 1350 + int i; 1351 + 1352 + for (i = 0; i < cpuc->n_events; i++) { 1353 + event = cpuc->event_list[i]; 1354 + if (!is_pebs_counter_event_group(event)) 1355 + continue; 1356 + __intel_pmu_pebs_update_cfg(event, cpuc->assign[i], &pebs_data_cfg); 1357 + } 1358 + 1359 + if (pebs_data_cfg & ~cpuc->pebs_data_cfg) 1360 + cpuc->pebs_data_cfg |= pebs_data_cfg | PEBS_UPDATE_DS_SW; 1326 1361 } 1327 1362 1328 1363 #define PERF_PEBS_MEMINFO_TYPE (PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC | \ ··· 1975 1914 #endif 1976 1915 } 1977 1916 1917 + static void intel_perf_event_update_pmc(struct perf_event *event, u64 pmc) 1918 + { 1919 + int shift = 64 - x86_pmu.cntval_bits; 1920 + struct hw_perf_event *hwc; 1921 + u64 delta, prev_pmc; 1922 + 1923 + /* 1924 + * A recorded counter may not have an assigned event in the 1925 + * following cases. The value should be dropped. 1926 + * - An event is deleted. There is still an active PEBS event. 1927 + * The PEBS record doesn't shrink on pmu::del(). 1928 + * If the counter of the deleted event once occurred in a PEBS 1929 + * record, PEBS still records the counter until the counter is 1930 + * reassigned. 1931 + * - An event is stopped for some reason, e.g., throttled. 1932 + * During this period, another event is added and takes the 1933 + * counter of the stopped event. The stopped event is assigned 1934 + * to another new and uninitialized counter, since the 1935 + * x86_pmu_start(RELOAD) is not invoked for a stopped event. 1936 + * The PEBS_DATA_CFG is updated regardless of the event state. 1937 + * The uninitialized counter can be recorded in a PEBS record. 
1938 + * But the cpuc->events[uninitialized_counter] is always NULL, 1939 + * because the event is stopped. The uninitialized value is 1940 + * safely dropped. 1941 + */ 1942 + if (!event) 1943 + return; 1944 + 1945 + hwc = &event->hw; 1946 + prev_pmc = local64_read(&hwc->prev_count); 1947 + 1948 + /* Only update the count when the PMU is disabled */ 1949 + WARN_ON(this_cpu_read(cpu_hw_events.enabled)); 1950 + local64_set(&hwc->prev_count, pmc); 1951 + 1952 + delta = (pmc << shift) - (prev_pmc << shift); 1953 + delta >>= shift; 1954 + 1955 + local64_add(delta, &event->count); 1956 + local64_sub(delta, &hwc->period_left); 1957 + } 1958 + 1959 + static inline void __setup_pebs_counter_group(struct cpu_hw_events *cpuc, 1960 + struct perf_event *event, 1961 + struct pebs_cntr_header *cntr, 1962 + void *next_record) 1963 + { 1964 + int bit; 1965 + 1966 + for_each_set_bit(bit, (unsigned long *)&cntr->cntr, INTEL_PMC_MAX_GENERIC) { 1967 + intel_perf_event_update_pmc(cpuc->events[bit], *(u64 *)next_record); 1968 + next_record += sizeof(u64); 1969 + } 1970 + 1971 + for_each_set_bit(bit, (unsigned long *)&cntr->fixed, INTEL_PMC_MAX_FIXED) { 1972 + /* The slots event will be handled with perf_metric later */ 1973 + if ((cntr->metrics == INTEL_CNTR_METRICS) && 1974 + (bit + INTEL_PMC_IDX_FIXED == INTEL_PMC_IDX_FIXED_SLOTS)) { 1975 + next_record += sizeof(u64); 1976 + continue; 1977 + } 1978 + intel_perf_event_update_pmc(cpuc->events[bit + INTEL_PMC_IDX_FIXED], 1979 + *(u64 *)next_record); 1980 + next_record += sizeof(u64); 1981 + } 1982 + 1983 + /* HW will reload the value right after the overflow. 
*/ 1984 + if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD) 1985 + local64_set(&event->hw.prev_count, (u64)-event->hw.sample_period); 1986 + 1987 + if (cntr->metrics == INTEL_CNTR_METRICS) { 1988 + static_call(intel_pmu_update_topdown_event) 1989 + (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS], 1990 + (u64 *)next_record); 1991 + next_record += 2 * sizeof(u64); 1992 + } 1993 + } 1994 + 1978 1995 #define PEBS_LATENCY_MASK 0xffff 1979 1996 1980 1997 /* 1981 1998 * With adaptive PEBS the layout depends on what fields are configured. 1982 1999 */ 1983 - 1984 2000 static void setup_pebs_adaptive_sample_data(struct perf_event *event, 1985 2001 struct pt_regs *iregs, void *__pebs, 1986 2002 struct perf_sample_data *data, ··· 2187 2049 } 2188 2050 } 2189 2051 2052 + if (format_group & (PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS)) { 2053 + struct pebs_cntr_header *cntr = next_record; 2054 + unsigned int nr; 2055 + 2056 + next_record += sizeof(struct pebs_cntr_header); 2057 + /* 2058 + * The PEBS_DATA_CFG is a global register, which is the 2059 + * superset configuration for all PEBS events. 2060 + * For the PEBS record of non-sample-read group, ignore 2061 + * the counter snapshot fields. 
2062 + */ 2063 + if (is_pebs_counter_event_group(event)) { 2064 + __setup_pebs_counter_group(cpuc, event, cntr, next_record); 2065 + data->sample_flags |= PERF_SAMPLE_READ; 2066 + } 2067 + 2068 + nr = hweight32(cntr->cntr) + hweight32(cntr->fixed); 2069 + if (cntr->metrics == INTEL_CNTR_METRICS) 2070 + nr += 2; 2071 + next_record += nr * sizeof(u64); 2072 + } 2073 + 2190 2074 WARN_ONCE(next_record != __pebs + basic->format_size, 2191 2075 "PEBS record size %u, expected %llu, config %llx\n", 2192 2076 basic->format_size, ··· 2252 2092 } 2253 2093 } 2254 2094 return NULL; 2255 - } 2256 - 2257 - void intel_pmu_auto_reload_read(struct perf_event *event) 2258 - { 2259 - WARN_ON(!(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)); 2260 - 2261 - perf_pmu_disable(event->pmu); 2262 - intel_pmu_drain_pebs_buffer(); 2263 - perf_pmu_enable(event->pmu); 2264 2095 } 2265 2096 2266 2097 /* ··· 2362 2211 } 2363 2212 2364 2213 if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) { 2365 - /* 2366 - * Now, auto-reload is only enabled in fixed period mode. 2367 - * The reload value is always hwc->sample_period. 2368 - * May need to change it, if auto-reload is enabled in 2369 - * freq mode later. 2370 - */ 2371 - intel_pmu_save_and_restart_reload(event, count); 2214 + if ((is_pebs_counter_event_group(event))) { 2215 + /* 2216 + * The value of each sample has been updated when setting up 2217 + * the corresponding sample data. 2218 + */ 2219 + perf_event_update_userpage(event); 2220 + } else { 2221 + /* 2222 + * Now, auto-reload is only enabled in fixed period mode. 2223 + * The reload value is always hwc->sample_period. 2224 + * May need to change it, if auto-reload is enabled in 2225 + * freq mode later. 
2226 + */ 2227 + intel_pmu_save_and_restart_reload(event, count); 2228 + } 2372 2229 } else 2373 2230 intel_pmu_save_and_restart(event); 2374 2231 } ··· 2711 2552 break; 2712 2553 2713 2554 case 6: 2555 + if (x86_pmu.intel_cap.pebs_baseline) { 2556 + x86_pmu.large_pebs_flags |= PERF_SAMPLE_READ; 2557 + x86_pmu.late_setup = intel_pmu_late_setup; 2558 + } 2559 + fallthrough; 2714 2560 case 5: 2715 2561 x86_pmu.pebs_ept = 1; 2716 2562 fallthrough; ··· 2740 2576 PERF_SAMPLE_REGS_USER | 2741 2577 PERF_SAMPLE_REGS_INTR); 2742 2578 } 2743 - pr_cont("PEBS fmt4%c%s, ", pebs_type, pebs_qual); 2579 + pr_cont("PEBS fmt%d%c%s, ", format, pebs_type, pebs_qual); 2744 2580 2745 2581 /* 2746 2582 * The PEBS-via-PT is not supported on hybrid platforms,
+41 -32
arch/x86/events/intel/lbr.c
··· 422 422 return !rdlbr_from(((struct x86_perf_task_context *)ctx)->tos, NULL); 423 423 } 424 424 425 + static inline bool has_lbr_callstack_users(void *ctx) 426 + { 427 + return task_context_opt(ctx)->lbr_callstack_users || 428 + x86_pmu.lbr_callstack_users; 429 + } 430 + 425 431 static void __intel_pmu_lbr_restore(void *ctx) 426 432 { 427 433 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); 428 434 429 - if (task_context_opt(ctx)->lbr_callstack_users == 0 || 435 + if (!has_lbr_callstack_users(ctx) || 430 436 task_context_opt(ctx)->lbr_stack_state == LBR_NONE) { 431 437 intel_pmu_lbr_reset(); 432 438 return; ··· 509 503 { 510 504 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); 511 505 512 - if (task_context_opt(ctx)->lbr_callstack_users == 0) { 506 + if (!has_lbr_callstack_users(ctx)) { 513 507 task_context_opt(ctx)->lbr_stack_state = LBR_NONE; 514 508 return; 515 509 } ··· 522 516 cpuc->last_log_id = ++task_context_opt(ctx)->log_id; 523 517 } 524 518 525 - void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc, 526 - struct perf_event_pmu_context *next_epc) 527 - { 528 - void *prev_ctx_data, *next_ctx_data; 529 - 530 - swap(prev_epc->task_ctx_data, next_epc->task_ctx_data); 531 - 532 - /* 533 - * Architecture specific synchronization makes sense in case 534 - * both prev_epc->task_ctx_data and next_epc->task_ctx_data 535 - * pointers are allocated. 
536 - */ 537 - 538 - prev_ctx_data = next_epc->task_ctx_data; 539 - next_ctx_data = prev_epc->task_ctx_data; 540 - 541 - if (!prev_ctx_data || !next_ctx_data) 542 - return; 543 - 544 - swap(task_context_opt(prev_ctx_data)->lbr_callstack_users, 545 - task_context_opt(next_ctx_data)->lbr_callstack_users); 546 - } 547 - 548 - void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in) 519 + void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, 520 + struct task_struct *task, bool sched_in) 549 521 { 550 522 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); 523 + struct perf_ctx_data *ctx_data; 551 524 void *task_ctx; 552 525 553 526 if (!cpuc->lbr_users) ··· 537 552 * the task was scheduled out, restore the stack. Otherwise flush 538 553 * the LBR stack. 539 554 */ 540 - task_ctx = pmu_ctx ? pmu_ctx->task_ctx_data : NULL; 555 + rcu_read_lock(); 556 + ctx_data = rcu_dereference(task->perf_ctx_data); 557 + task_ctx = ctx_data ? ctx_data->data : NULL; 541 558 if (task_ctx) { 542 559 if (sched_in) 543 560 __intel_pmu_lbr_restore(task_ctx); 544 561 else 545 562 __intel_pmu_lbr_save(task_ctx); 563 + rcu_read_unlock(); 546 564 return; 547 565 } 566 + rcu_read_unlock(); 548 567 549 568 /* 550 569 * Since a context switch can flip the address space and LBR entries ··· 577 588 578 589 cpuc->br_sel = event->hw.branch_reg.reg; 579 590 580 - if (branch_user_callstack(cpuc->br_sel) && event->pmu_ctx->task_ctx_data) 581 - task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users++; 591 + if (branch_user_callstack(cpuc->br_sel)) { 592 + if (event->attach_state & PERF_ATTACH_TASK) { 593 + struct task_struct *task = event->hw.target; 594 + struct perf_ctx_data *ctx_data; 582 595 596 + rcu_read_lock(); 597 + ctx_data = rcu_dereference(task->perf_ctx_data); 598 + if (ctx_data) 599 + task_context_opt(ctx_data->data)->lbr_callstack_users++; 600 + rcu_read_unlock(); 601 + } else 602 + x86_pmu.lbr_callstack_users++; 603 + } 583 604 
/* 584 605 * Request pmu::sched_task() callback, which will fire inside the 585 606 * regular perf event scheduling, so that call will: ··· 663 664 if (!x86_pmu.lbr_nr) 664 665 return; 665 666 666 - if (branch_user_callstack(cpuc->br_sel) && 667 - event->pmu_ctx->task_ctx_data) 668 - task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users--; 667 + if (branch_user_callstack(cpuc->br_sel)) { 668 + if (event->attach_state & PERF_ATTACH_TASK) { 669 + struct task_struct *task = event->hw.target; 670 + struct perf_ctx_data *ctx_data; 671 + 672 + rcu_read_lock(); 673 + ctx_data = rcu_dereference(task->perf_ctx_data); 674 + if (ctx_data) 675 + task_context_opt(ctx_data->data)->lbr_callstack_users--; 676 + rcu_read_unlock(); 677 + } else 678 + x86_pmu.lbr_callstack_users--; 679 + } 669 680 670 681 if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT) 671 682 cpuc->lbr_select = 0;
+25 -17
arch/x86/events/perf_event.h
··· 115 115 return event->group_leader->hw.flags & PERF_X86_EVENT_BRANCH_COUNTERS; 116 116 } 117 117 118 + static inline bool is_pebs_counter_event_group(struct perf_event *event) 119 + { 120 + return event->group_leader->hw.flags & PERF_X86_EVENT_PEBS_CNTR; 121 + } 122 + 118 123 struct amd_nb { 119 124 int nb_id; /* NorthBridge id */ 120 125 int refcnt; /* reference count */ ··· 805 800 u64 (*update)(struct perf_event *event); 806 801 int (*hw_config)(struct perf_event *event); 807 802 int (*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign); 803 + void (*late_setup)(void); 808 804 unsigned eventsel; 809 805 unsigned perfctr; 810 806 unsigned fixedctr; ··· 875 869 876 870 void (*check_microcode)(void); 877 871 void (*sched_task)(struct perf_event_pmu_context *pmu_ctx, 878 - bool sched_in); 872 + struct task_struct *task, bool sched_in); 879 873 880 874 /* 881 875 * Intel Arch Perfmon v2+ ··· 920 914 const int *lbr_sel_map; /* lbr_select mappings */ 921 915 int *lbr_ctl_map; /* LBR_CTL mappings */ 922 916 }; 917 + u64 lbr_callstack_users; /* lbr callstack system wide users */ 923 918 bool lbr_double_abort; /* duplicated lbr aborts */ 924 919 bool lbr_pt_coexist; /* (LBR|BTS) may coexist with PT */ 925 920 ··· 957 950 * Intel perf metrics 958 951 */ 959 952 int num_topdown_events; 960 - 961 - /* 962 - * perf task context (i.e. struct perf_event_pmu_context::task_ctx_data) 963 - * switch helper to bridge calls from perf/core to perf/x86. 
964 - * See struct pmu::swap_task_ctx() usage for examples; 965 - */ 966 - void (*swap_task_ctx)(struct perf_event_pmu_context *prev_epc, 967 - struct perf_event_pmu_context *next_epc); 968 953 969 954 /* 970 955 * AMD bits ··· 1106 1107 1107 1108 DECLARE_STATIC_CALL(x86_pmu_set_period, *x86_pmu.set_period); 1108 1109 DECLARE_STATIC_CALL(x86_pmu_update, *x86_pmu.update); 1110 + DECLARE_STATIC_CALL(x86_pmu_drain_pebs, *x86_pmu.drain_pebs); 1111 + DECLARE_STATIC_CALL(x86_pmu_late_setup, *x86_pmu.late_setup); 1109 1112 1110 1113 static __always_inline struct x86_perf_task_context_opt *task_context_opt(void *ctx) 1111 1114 { ··· 1148 1147 [PERF_COUNT_HW_CACHE_RESULT_MAX]; 1149 1148 1150 1149 u64 x86_perf_event_update(struct perf_event *event); 1150 + 1151 + static inline u64 intel_pmu_topdown_event_update(struct perf_event *event, u64 *val) 1152 + { 1153 + return x86_perf_event_update(event); 1154 + } 1155 + DECLARE_STATIC_CALL(intel_pmu_update_topdown_event, intel_pmu_topdown_event_update); 1151 1156 1152 1157 static inline unsigned int x86_pmu_config_addr(int index) 1153 1158 { ··· 1401 1394 void amd_pmu_lbr_read(void); 1402 1395 void amd_pmu_lbr_add(struct perf_event *event); 1403 1396 void amd_pmu_lbr_del(struct perf_event *event); 1404 - void amd_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in); 1397 + void amd_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, 1398 + struct task_struct *task, bool sched_in); 1405 1399 void amd_pmu_lbr_enable_all(void); 1406 1400 void amd_pmu_lbr_disable_all(void); 1407 1401 int amd_pmu_lbr_hw_config(struct perf_event *event); ··· 1456 1448 perf_sched_cb_dec(event->pmu); 1457 1449 } 1458 1450 1459 - void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in); 1451 + void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, 1452 + struct task_struct *task, bool sched_in); 1460 1453 #else 1461 1454 static inline int amd_brs_init(void) 1462 1455 { ··· 1482 1473 { 1483 
1474 } 1484 1475 1485 - static inline void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in) 1476 + static inline void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, 1477 + struct task_struct *task, bool sched_in) 1486 1478 { 1487 1479 } 1488 1480 ··· 1653 1643 1654 1644 void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in); 1655 1645 1656 - void intel_pmu_auto_reload_read(struct perf_event *event); 1646 + void intel_pmu_drain_pebs_buffer(void); 1657 1647 1658 1648 void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr); 1659 1649 ··· 1663 1653 struct cpu_hw_events *cpuc, 1664 1654 struct perf_event *event); 1665 1655 1666 - void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc, 1667 - struct perf_event_pmu_context *next_epc); 1668 - 1669 - void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in); 1656 + void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, 1657 + struct task_struct *task, bool sched_in); 1670 1658 1671 1659 u64 lbr_from_signext_quirk_wr(u64 val); 1672 1660
+1 -1
arch/x86/events/perf_event_flags.h
··· 9 9 PERF_ARCH(PEBS_NA_HSW, 0x00010) /* haswell style datala, unknown */ 10 10 PERF_ARCH(EXCL, 0x00020) /* HT exclusivity on counter */ 11 11 PERF_ARCH(DYNAMIC, 0x00040) /* dynamic alloc'd constraint */ 12 - /* 0x00080 */ 12 + PERF_ARCH(PEBS_CNTR, 0x00080) /* PEBS counters snapshot */ 13 13 PERF_ARCH(EXCL_ACCT, 0x00100) /* accounted EXCL event */ 14 14 PERF_ARCH(AUTO_RELOAD, 0x00200) /* use PEBS auto-reload */ 15 15 PERF_ARCH(LARGE_PEBS, 0x00400) /* use large PEBS */
+2 -1
arch/x86/include/asm/amd-ibs.h
··· 64 64 opmaxcnt_ext:7, /* 20-26: upper 7 bits of periodic op maximum count */ 65 65 reserved0:5, /* 27-31: reserved */ 66 66 opcurcnt:27, /* 32-58: periodic op counter current count */ 67 - reserved1:5; /* 59-63: reserved */ 67 + ldlat_thrsh:4, /* 59-62: Load Latency threshold */ 68 + ldlat_en:1; /* 63: Load Latency enabled */ 68 69 }; 69 70 }; 70 71
+20
arch/x86/include/asm/perf_event.h
··· 141 141 #define PEBS_DATACFG_XMMS BIT_ULL(2) 142 142 #define PEBS_DATACFG_LBRS BIT_ULL(3) 143 143 #define PEBS_DATACFG_LBR_SHIFT 24 144 + #define PEBS_DATACFG_CNTR BIT_ULL(4) 145 + #define PEBS_DATACFG_CNTR_SHIFT 32 146 + #define PEBS_DATACFG_CNTR_MASK GENMASK_ULL(15, 0) 147 + #define PEBS_DATACFG_FIX_SHIFT 48 148 + #define PEBS_DATACFG_FIX_MASK GENMASK_ULL(7, 0) 149 + #define PEBS_DATACFG_METRICS BIT_ULL(5) 144 150 145 151 /* Steal the highest bit of pebs_data_cfg for SW usage */ 146 152 #define PEBS_UPDATE_DS_SW BIT_ULL(63) ··· 488 482 u64 xmm[16*2]; /* two entries for each register */ 489 483 }; 490 484 485 + struct pebs_cntr_header { 486 + u32 cntr; 487 + u32 fixed; 488 + u32 metrics; 489 + u32 reserved; 490 + }; 491 + 492 + #define INTEL_CNTR_METRICS 0x3 493 + 491 494 /* 492 495 * AMD Extended Performance Monitoring and Debug cpuid feature detection 493 496 */ ··· 524 509 #define IBS_CAPS_FETCHCTLEXTD (1U<<9) 525 510 #define IBS_CAPS_OPDATA4 (1U<<10) 526 511 #define IBS_CAPS_ZEN4 (1U<<11) 512 + #define IBS_CAPS_OPLDLAT (1U<<12) 513 + #define IBS_CAPS_OPDTLBPGSIZE (1U<<19) 527 514 528 515 #define IBS_CAPS_DEFAULT (IBS_CAPS_AVAIL \ 529 516 | IBS_CAPS_FETCHSAM \ ··· 551 534 * The lower 7 bits of the current count are random bits 552 535 * preloaded by hardware and ignored in software 553 536 */ 537 + #define IBS_OP_LDLAT_EN (1ULL<<63) 538 + #define IBS_OP_LDLAT_THRSH (0xFULL<<59) 554 539 #define IBS_OP_CUR_CNT (0xFFF80ULL<<32) 555 540 #define IBS_OP_CUR_CNT_RAND (0x0007FULL<<32) 541 + #define IBS_OP_CUR_CNT_EXT_MASK (0x7FULL<<52) 556 542 #define IBS_OP_CNT_CTL (1ULL<<19) 557 543 #define IBS_OP_VAL (1ULL<<18) 558 544 #define IBS_OP_ENABLE (1ULL<<17)
+9 -5
arch/x86/kernel/uprobes.c
··· 357 357 return &insn; 358 358 } 359 359 360 - static unsigned long trampoline_check_ip(void) 360 + static unsigned long trampoline_check_ip(unsigned long tramp) 361 361 { 362 - unsigned long tramp = uprobe_get_trampoline_vaddr(); 363 - 364 362 return tramp + (uretprobe_syscall_check - uretprobe_trampoline_entry); 365 363 } 366 364 367 365 SYSCALL_DEFINE0(uretprobe) 368 366 { 369 367 struct pt_regs *regs = task_pt_regs(current); 370 - unsigned long err, ip, sp, r11_cx_ax[3]; 368 + unsigned long err, ip, sp, r11_cx_ax[3], tramp; 371 369 372 - if (regs->ip != trampoline_check_ip()) 370 + /* If there's no trampoline, we are called from wrong place. */ 371 + tramp = uprobe_get_trampoline_vaddr(); 372 + if (unlikely(tramp == UPROBE_NO_TRAMPOLINE_VADDR)) 373 + goto sigill; 374 + 375 + /* Make sure the ip matches the only allowed sys_uretprobe caller. */ 376 + if (unlikely(regs->ip != trampoline_check_ip(tramp))) 373 377 goto sigill; 374 378 375 379 err = copy_from_user(r11_cx_ax, (void __user *)regs->sp, sizeof(r11_cx_ax));
+17
include/linux/idr.h
··· 15 15 #include <linux/radix-tree.h> 16 16 #include <linux/gfp.h> 17 17 #include <linux/percpu.h> 18 + #include <linux/cleanup.h> 18 19 19 20 struct idr { 20 21 struct radix_tree_root idr_rt; ··· 124 123 void *idr_get_next_ul(struct idr *, unsigned long *nextid); 125 124 void *idr_replace(struct idr *, void *, unsigned long id); 126 125 void idr_destroy(struct idr *); 126 + 127 + struct __class_idr { 128 + struct idr *idr; 129 + int id; 130 + }; 131 + 132 + #define idr_null ((struct __class_idr){ NULL, -1 }) 133 + #define take_idr_id(id) __get_and_null(id, idr_null) 134 + 135 + DEFINE_CLASS(idr_alloc, struct __class_idr, 136 + if (_T.id >= 0) idr_remove(_T.idr, _T.id), 137 + ((struct __class_idr){ 138 + .idr = idr, 139 + .id = idr_alloc(idr, ptr, start, end, gfp), 140 + }), 141 + struct idr *idr, void *ptr, int start, int end, gfp_t gfp); 127 142 128 143 /** 129 144 * idr_init_base() - Initialise an IDR.
-4
include/linux/nmi.h
··· 17 17 void lockup_detector_init(void); 18 18 void lockup_detector_retry_init(void); 19 19 void lockup_detector_soft_poweroff(void); 20 - void lockup_detector_cleanup(void); 21 20 22 21 extern int watchdog_user_enabled; 23 22 extern int watchdog_thresh; ··· 36 37 static inline void lockup_detector_init(void) { } 37 38 static inline void lockup_detector_retry_init(void) { } 38 39 static inline void lockup_detector_soft_poweroff(void) { } 39 - static inline void lockup_detector_cleanup(void) { } 40 40 #endif /* !CONFIG_LOCKUP_DETECTOR */ 41 41 42 42 #ifdef CONFIG_SOFTLOCKUP_DETECTOR ··· 102 104 #if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF) 103 105 extern void hardlockup_detector_perf_stop(void); 104 106 extern void hardlockup_detector_perf_restart(void); 105 - extern void hardlockup_detector_perf_cleanup(void); 106 107 extern void hardlockup_config_perf_event(const char *str); 107 108 #else 108 109 static inline void hardlockup_detector_perf_stop(void) { } 109 110 static inline void hardlockup_detector_perf_restart(void) { } 110 - static inline void hardlockup_detector_perf_cleanup(void) { } 111 111 static inline void hardlockup_config_perf_event(const char *str) { } 112 112 #endif 113 113
+8
include/linux/percpu-rwsem.h
··· 8 8 #include <linux/wait.h> 9 9 #include <linux/rcu_sync.h> 10 10 #include <linux/lockdep.h> 11 + #include <linux/cleanup.h> 11 12 12 13 struct percpu_rw_semaphore { 13 14 struct rcu_sync rss; ··· 125 124 extern bool percpu_is_read_locked(struct percpu_rw_semaphore *); 126 125 extern void percpu_down_write(struct percpu_rw_semaphore *); 127 126 extern void percpu_up_write(struct percpu_rw_semaphore *); 127 + 128 + DEFINE_GUARD(percpu_read, struct percpu_rw_semaphore *, 129 + percpu_down_read(_T), percpu_up_read(_T)) 130 + DEFINE_GUARD_COND(percpu_read, _try, percpu_down_read_trylock(_T)) 131 + 132 + DEFINE_GUARD(percpu_write, struct percpu_rw_semaphore *, 133 + percpu_down_write(_T), percpu_up_write(_T)) 128 134 129 135 static inline bool percpu_is_write_locked(struct percpu_rw_semaphore *sem) 130 136 {
+59 -33
include/linux/perf_event.h
··· 343 343 */ 344 344 unsigned int scope; 345 345 346 - int __percpu *pmu_disable_count; 347 - struct perf_cpu_pmu_context __percpu *cpu_pmu_context; 346 + struct perf_cpu_pmu_context * __percpu *cpu_pmu_context; 348 347 atomic_t exclusive_cnt; /* < 0: cpu; > 0: tsk */ 349 348 int task_ctx_nr; 350 349 int hrtimer_interval_ms; ··· 494 495 * context-switches callback 495 496 */ 496 497 void (*sched_task) (struct perf_event_pmu_context *pmu_ctx, 497 - bool sched_in); 498 + struct task_struct *task, bool sched_in); 498 499 499 500 /* 500 501 * Kmem cache of PMU specific data 501 502 */ 502 503 struct kmem_cache *task_ctx_cache; 503 - 504 - /* 505 - * PMU specific parts of task perf event context (i.e. ctx->task_ctx_data) 506 - * can be synchronized using this function. See Intel LBR callstack support 507 - * implementation and Perf core context switch handling callbacks for usage 508 - * examples. 509 - */ 510 - void (*swap_task_ctx) (struct perf_event_pmu_context *prev_epc, 511 - struct perf_event_pmu_context *next_epc); 512 - /* optional */ 513 504 514 505 /* 515 506 * Set up pmu-private data structures for an AUX area ··· 662 673 struct rcu_head rcu_head; 663 674 }; 664 675 665 - #define PERF_ATTACH_CONTEXT 0x01 666 - #define PERF_ATTACH_GROUP 0x02 667 - #define PERF_ATTACH_TASK 0x04 668 - #define PERF_ATTACH_TASK_DATA 0x08 669 - #define PERF_ATTACH_ITRACE 0x10 670 - #define PERF_ATTACH_SCHED_CB 0x20 671 - #define PERF_ATTACH_CHILD 0x40 676 + #define PERF_ATTACH_CONTEXT 0x0001 677 + #define PERF_ATTACH_GROUP 0x0002 678 + #define PERF_ATTACH_TASK 0x0004 679 + #define PERF_ATTACH_TASK_DATA 0x0008 680 + #define PERF_ATTACH_GLOBAL_DATA 0x0010 681 + #define PERF_ATTACH_SCHED_CB 0x0020 682 + #define PERF_ATTACH_CHILD 0x0040 683 + #define PERF_ATTACH_EXCLUSIVE 0x0080 684 + #define PERF_ATTACH_CALLCHAIN 0x0100 685 + #define PERF_ATTACH_ITRACE 0x0200 672 686 673 687 struct bpf_prog; 674 688 struct perf_cgroup; ··· 913 921 struct list_head pinned_active; 914 922 struct 
list_head flexible_active; 915 923 916 - /* Used to avoid freeing per-cpu perf_event_pmu_context */ 924 + /* Used to identify the per-cpu perf_event_pmu_context */ 917 925 unsigned int embedded : 1; 918 926 919 927 unsigned int nr_events; ··· 923 931 atomic_t refcount; /* event <-> epc */ 924 932 struct rcu_head rcu_head; 925 933 926 - void *task_ctx_data; /* pmu specific data */ 927 934 /* 928 935 * Set when one or more (plausibly active) event can't be scheduled 929 936 * due to pmu overcommit or pmu constraints, except tolerant to ··· 970 979 int nr_user; 971 980 int is_active; 972 981 973 - int nr_task_data; 974 982 int nr_stat; 975 983 int nr_freq; 976 984 int rotate_disable; ··· 1010 1020 local_t nr_no_switch_fast; 1011 1021 }; 1012 1022 1023 + /** 1024 + * struct perf_ctx_data - PMU specific data for a task 1025 + * @rcu_head: To avoid the race on free PMU specific data 1026 + * @refcount: To track users 1027 + * @global: To track system-wide users 1028 + * @ctx_cache: Kmem cache of PMU specific data 1029 + * @data: PMU specific data 1030 + * 1031 + * Currently, the struct is only used in Intel LBR call stack mode to 1032 + * save/restore the call stack of a task on context switches. 1033 + * 1034 + * The rcu_head is used to prevent the race on free the data. 1035 + * The data only be allocated when Intel LBR call stack mode is enabled. 1036 + * The data will be freed when the mode is disabled. 1037 + * The content of the data will only be accessed in context switch, which 1038 + * should be protected by rcu_read_lock(). 1039 + * 1040 + * Because of the alignment requirement of Intel Arch LBR, the Kmem cache 1041 + * is used to allocate the PMU specific data. The ctx_cache is to track 1042 + * the Kmem cache. 1043 + * 1044 + * Careful: Struct perf_ctx_data is added as a pointer in struct task_struct. 1045 + * When system-wide Intel LBR call stack mode is enabled, a buffer with 1046 + * constant size will be allocated for each task. 
1047 + * Also, system memory consumption can further grow when the size of 1048 + * struct perf_ctx_data enlarges. 1049 + */ 1050 + struct perf_ctx_data { 1051 + struct rcu_head rcu_head; 1052 + refcount_t refcount; 1053 + int global; 1054 + struct kmem_cache *ctx_cache; 1055 + void *data; 1056 + }; 1057 + 1013 1058 struct perf_cpu_pmu_context { 1014 1059 struct perf_event_pmu_context epc; 1015 1060 struct perf_event_pmu_context *task_epc; ··· 1054 1029 1055 1030 int active_oncpu; 1056 1031 int exclusive; 1032 + int pmu_disable_count; 1057 1033 1058 1034 raw_spinlock_t hrtimer_lock; 1059 1035 struct hrtimer hrtimer; ··· 1088 1062 struct perf_buffer *rb; 1089 1063 unsigned long wakeup; 1090 1064 unsigned long size; 1091 - u64 aux_flags; 1065 + union { 1066 + u64 flags; /* perf_output*() */ 1067 + u64 aux_flags; /* perf_aux_output*() */ 1068 + struct { 1069 + u64 skip_read : 1; 1070 + }; 1071 + }; 1092 1072 union { 1093 1073 void *addr; 1094 1074 unsigned long head; ··· 1371 1339 1372 1340 if (branch_sample_hw_index(event)) 1373 1341 size += sizeof(u64); 1342 + 1343 + brs->nr = min_t(u16, event->attr.sample_max_stack, brs->nr); 1344 + 1374 1345 size += brs->nr * sizeof(struct perf_branch_entry); 1375 1346 1376 1347 /* ··· 1681 1646 } 1682 1647 1683 1648 extern int sysctl_perf_event_paranoid; 1684 - extern int sysctl_perf_event_mlock; 1685 1649 extern int sysctl_perf_event_sample_rate; 1686 - extern int sysctl_perf_cpu_time_max_percent; 1687 1650 1688 1651 extern void perf_sample_event_took(u64 sample_len_ns); 1689 - 1690 - int perf_event_max_sample_rate_handler(const struct ctl_table *table, int write, 1691 - void *buffer, size_t *lenp, loff_t *ppos); 1692 - int perf_cpu_time_max_percent_handler(const struct ctl_table *table, int write, 1693 - void *buffer, size_t *lenp, loff_t *ppos); 1694 - int perf_event_max_stack_handler(const struct ctl_table *table, int write, 1695 - void *buffer, size_t *lenp, loff_t *ppos); 1696 1652 1697 1653 /* Access to perf_event_open(2) 
syscall. */ 1698 1654 #define PERF_SECURITY_OPEN 0
+2
include/linux/sched.h
··· 65 65 struct nameidata; 66 66 struct nsproxy; 67 67 struct perf_event_context; 68 + struct perf_ctx_data; 68 69 struct pid_namespace; 69 70 struct pipe_inode_info; 70 71 struct rcu_node; ··· 1317 1316 struct perf_event_context *perf_event_ctxp; 1318 1317 struct mutex perf_event_mutex; 1319 1318 struct list_head perf_event_list; 1319 + struct perf_ctx_data __rcu *perf_ctx_data; 1320 1320 #endif 1321 1321 #ifdef CONFIG_DEBUG_PREEMPT 1322 1322 unsigned long preempt_disable_ip;
+3
include/linux/uprobes.h
··· 39 39 40 40 #define MAX_URETPROBE_DEPTH 64 41 41 42 + #define UPROBE_NO_TRAMPOLINE_VADDR (~0UL) 43 + 42 44 struct uprobe_consumer { 43 45 /* 44 46 * handler() can return UPROBE_HANDLER_REMOVE to signal the need to ··· 145 143 146 144 struct uprobe *active_uprobe; 147 145 unsigned long xol_vaddr; 146 + bool signal_denied; 148 147 149 148 struct arch_uprobe *auprobe; 150 149 };
+2
include/uapi/linux/perf_event.h
··· 385 385 * 386 386 * @sample_max_stack: Max number of frame pointers in a callchain, 387 387 * should be < /proc/sys/kernel/perf_event_max_stack 388 + * Max number of entries of branch stack 389 + * should be < hardware limit 388 390 */ 389 391 struct perf_event_attr { 390 392
kernel/cpu.c | -5
···
out:
    cpus_write_unlock();
-   /*
-    * Do post unplug cleanup. This is still protected against
-    * concurrent CPU hotplug via cpu_add_remove_lock.
-    */
-   lockup_detector_cleanup();
    arch_smt_update();
    return ret;
}
kernel/events/callchain.c | +32 -6
···
int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
int sysctl_perf_event_max_contexts_per_stack __read_mostly = PERF_MAX_CONTEXTS_PER_STACK;
+static const int six_hundred_forty_kb = 640 * 1024;

static inline size_t perf_callchain_entry__sizeof(void)
{
···
    return entry;
}

-/*
- * Used for sysctl_perf_event_max_stack and
- * sysctl_perf_event_max_contexts_per_stack.
- */
-int perf_event_max_stack_handler(const struct ctl_table *table, int write,
-                                void *buffer, size_t *lenp, loff_t *ppos)
+static int perf_event_max_stack_handler(const struct ctl_table *table, int write,
+                                       void *buffer, size_t *lenp, loff_t *ppos)
{
    int *value = table->data;
    int new_value = *value, ret;
···
    return ret;
}
+
+static const struct ctl_table callchain_sysctl_table[] = {
+   {
+       .procname       = "perf_event_max_stack",
+       .data           = &sysctl_perf_event_max_stack,
+       .maxlen         = sizeof(sysctl_perf_event_max_stack),
+       .mode           = 0644,
+       .proc_handler   = perf_event_max_stack_handler,
+       .extra1         = SYSCTL_ZERO,
+       .extra2         = (void *)&six_hundred_forty_kb,
+   },
+   {
+       .procname       = "perf_event_max_contexts_per_stack",
+       .data           = &sysctl_perf_event_max_contexts_per_stack,
+       .maxlen         = sizeof(sysctl_perf_event_max_contexts_per_stack),
+       .mode           = 0644,
+       .proc_handler   = perf_event_max_stack_handler,
+       .extra1         = SYSCTL_ZERO,
+       .extra2         = SYSCTL_ONE_THOUSAND,
+   },
+};
+
+static int __init init_callchain_sysctls(void)
+{
+   register_sysctl_init("kernel", callchain_sysctl_table);
+   return 0;
+}
+core_initcall(init_callchain_sysctls);
+
kernel/events/core.c | +683 -393
···
#include <linux/pgtable.h>
#include <linux/buildid.h>
#include <linux/task_work.h>
+#include <linux/percpu-rwsem.h>

#include "internal.h"
···
 */
int sysctl_perf_event_paranoid __read_mostly = 2;

-/* Minimum for 512 kiB + 1 user control page */
-int sysctl_perf_event_mlock __read_mostly = 512 + (PAGE_SIZE / 1024); /* 'free' kiB per user */
+/* Minimum for 512 kiB + 1 user control page. 'free' kiB per user. */
+static int sysctl_perf_event_mlock __read_mostly = 512 + (PAGE_SIZE / 1024);

/*
 * max perf event sample rate
···
#define DEFAULT_CPU_TIME_MAX_PERCENT    25

int sysctl_perf_event_sample_rate __read_mostly = DEFAULT_MAX_SAMPLE_RATE;
+static int sysctl_perf_cpu_time_max_percent __read_mostly = DEFAULT_CPU_TIME_MAX_PERCENT;

static int max_samples_per_tick __read_mostly = DIV_ROUND_UP(DEFAULT_MAX_SAMPLE_RATE, HZ);
static int perf_sample_period_ns __read_mostly = DEFAULT_SAMPLE_PERIOD_NS;
···
static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc);

-int perf_event_max_sample_rate_handler(const struct ctl_table *table, int write,
-       void *buffer, size_t *lenp, loff_t *ppos)
+static int perf_event_max_sample_rate_handler(const struct ctl_table *table, int write,
+       void *buffer, size_t *lenp, loff_t *ppos)
{
    int ret;
···
    return 0;
}

-int sysctl_perf_cpu_time_max_percent __read_mostly = DEFAULT_CPU_TIME_MAX_PERCENT;
-
-int perf_cpu_time_max_percent_handler(const struct ctl_table *table, int write,
-       void *buffer, size_t *lenp, loff_t *ppos)
+static int perf_cpu_time_max_percent_handler(const struct ctl_table *table, int write,
+       void *buffer, size_t *lenp, loff_t *ppos)
{
    int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
···
    return 0;
}
+
+static const struct ctl_table events_core_sysctl_table[] = {
+   /*
+    * User-space relies on this file as a feature check for
+    * perf_events being enabled. It's an ABI, do not remove!
+    */
+   {
+       .procname       = "perf_event_paranoid",
+       .data           = &sysctl_perf_event_paranoid,
+       .maxlen         = sizeof(sysctl_perf_event_paranoid),
+       .mode           = 0644,
+       .proc_handler   = proc_dointvec,
+   },
+   {
+       .procname       = "perf_event_mlock_kb",
+       .data           = &sysctl_perf_event_mlock,
+       .maxlen         = sizeof(sysctl_perf_event_mlock),
+       .mode           = 0644,
+       .proc_handler   = proc_dointvec,
+   },
+   {
+       .procname       = "perf_event_max_sample_rate",
+       .data           = &sysctl_perf_event_sample_rate,
+       .maxlen         = sizeof(sysctl_perf_event_sample_rate),
+       .mode           = 0644,
+       .proc_handler   = perf_event_max_sample_rate_handler,
+       .extra1         = SYSCTL_ONE,
+   },
+   {
+       .procname       = "perf_cpu_time_max_percent",
+       .data           = &sysctl_perf_cpu_time_max_percent,
+       .maxlen         = sizeof(sysctl_perf_cpu_time_max_percent),
+       .mode           = 0644,
+       .proc_handler   = perf_cpu_time_max_percent_handler,
+       .extra1         = SYSCTL_ZERO,
+       .extra2         = SYSCTL_ONE_HUNDRED,
+   },
+};
+
+static int __init init_events_core_sysctls(void)
+{
+   register_sysctl_init("kernel", events_core_sysctl_table);
+   return 0;
+}
+core_initcall(init_events_core_sysctls);
+

/*
 * perf samples are done in some very critical code paths (NMIs).
···
    return perf_mux_hrtimer_restart(arg);
}

+static __always_inline struct perf_cpu_pmu_context *this_cpc(struct pmu *pmu)
+{
+   return *this_cpu_ptr(pmu->cpu_pmu_context);
+}
+
void perf_pmu_disable(struct pmu *pmu)
{
-   int *count = this_cpu_ptr(pmu->pmu_disable_count);
+   int *count = &this_cpc(pmu)->pmu_disable_count;
    if (!(*count)++)
        pmu->pmu_disable(pmu);
}

void perf_pmu_enable(struct pmu *pmu)
{
-   int *count = this_cpu_ptr(pmu->pmu_disable_count);
+   int *count = &this_cpc(pmu)->pmu_disable_count;
    if (!--(*count))
        pmu->pmu_enable(pmu);
}

static void perf_assert_pmu_disabled(struct pmu *pmu)
{
-   WARN_ON_ONCE(*this_cpu_ptr(pmu->pmu_disable_count) == 0);
+   int *count = &this_cpc(pmu)->pmu_disable_count;
+   WARN_ON_ONCE(*count == 0);
+}
+
+static inline void perf_pmu_read(struct perf_event *event)
+{
+   if (event->state == PERF_EVENT_STATE_ACTIVE)
+       event->pmu->read(event);
}

static void get_ctx(struct perf_event_context *ctx)
{
    refcount_inc(&ctx->refcount);
-}
-
-static void *alloc_task_ctx_data(struct pmu *pmu)
-{
-   if (pmu->task_ctx_cache)
-       return kmem_cache_zalloc(pmu->task_ctx_cache, GFP_KERNEL);
-
-   return NULL;
-}
-
-static void free_task_ctx_data(struct pmu *pmu, void *task_ctx_data)
-{
-   if (pmu->task_ctx_cache && task_ctx_data)
-       kmem_cache_free(pmu->task_ctx_cache, task_ctx_data);
}

static void free_ctx(struct rcu_head *head)
···
event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
{
    struct perf_event_pmu_context *epc = event->pmu_ctx;
-   struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+   struct perf_cpu_pmu_context *cpc = this_cpc(epc->pmu);
    enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;

    // XXX cpc serialization, probably per-cpu IRQ disabled
···
    pmu_ctx->rotate_necessary = 0;

    if (ctx->task && ctx->is_active) {
-       struct perf_cpu_pmu_context *cpc;
+       struct perf_cpu_pmu_context *cpc = this_cpc(pmu_ctx->pmu);

-       cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
        WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
        cpc->task_epc = NULL;
    }
···
event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
{
    struct perf_event_pmu_context *epc = event->pmu_ctx;
-   struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+   struct perf_cpu_pmu_context *cpc = this_cpc(epc->pmu);
    int ret = 0;

    WARN_ON_ONCE(event->ctx != ctx);
···
static int group_can_go_on(struct perf_event *event, int can_add_hw)
{
    struct perf_event_pmu_context *epc = event->pmu_ctx;
-   struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+   struct perf_cpu_pmu_context *cpc = this_cpc(epc->pmu);

    /*
     * Groups consisting entirely of software events can always go on.
···
    struct pmu *pmu = pmu_ctx->pmu;

    if (ctx->task && !(ctx->is_active & EVENT_ALL)) {
-       struct perf_cpu_pmu_context *cpc;
+       struct perf_cpu_pmu_context *cpc = this_cpc(pmu);

-       cpc = this_cpu_ptr(pmu->cpu_pmu_context);
        WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
        cpc->task_epc = NULL;
    }
···
     * we know the event must be on the current CPU, therefore we
     * don't need to use it.
     */
-   if (event->state == PERF_EVENT_STATE_ACTIVE)
-       event->pmu->read(event);
+   perf_pmu_read(event);

    perf_event_update_time(event);
···
    }
}

-#define double_list_for_each_entry(pos1, pos2, head1, head2, member)   \
-   for (pos1 = list_first_entry(head1, typeof(*pos1), member),         \
-        pos2 = list_first_entry(head2, typeof(*pos2), member);         \
-        !list_entry_is_head(pos1, head1, member) &&                    \
-        !list_entry_is_head(pos2, head2, member);                      \
-        pos1 = list_next_entry(pos1, member),                          \
-        pos2 = list_next_entry(pos2, member))
-
-static void perf_event_swap_task_ctx_data(struct perf_event_context *prev_ctx,
-                                         struct perf_event_context *next_ctx)
-{
-   struct perf_event_pmu_context *prev_epc, *next_epc;
-
-   if (!prev_ctx->nr_task_data)
-       return;
-
-   double_list_for_each_entry(prev_epc, next_epc,
-                              &prev_ctx->pmu_ctx_list, &next_ctx->pmu_ctx_list,
-                              pmu_ctx_entry) {
-
-       if (WARN_ON_ONCE(prev_epc->pmu != next_epc->pmu))
-           continue;
-
-       /*
-        * PMU specific parts of task perf context can require
-        * additional synchronization. As an example of such
-        * synchronization see implementation details of Intel
-        * LBR call stack data profiling;
-        */
-       if (prev_epc->pmu->swap_task_ctx)
-           prev_epc->pmu->swap_task_ctx(prev_epc, next_epc);
-       else
-           swap(prev_epc->task_ctx_data, next_epc->task_ctx_data);
-   }
-}
-
-static void perf_ctx_sched_task_cb(struct perf_event_context *ctx, bool sched_in)
+static void perf_ctx_sched_task_cb(struct perf_event_context *ctx,
+                                  struct task_struct *task, bool sched_in)
{
    struct perf_event_pmu_context *pmu_ctx;
    struct perf_cpu_pmu_context *cpc;

    list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-       cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+       cpc = this_cpc(pmu_ctx->pmu);

        if (cpc->sched_cb_usage && pmu_ctx->pmu->sched_task)
-           pmu_ctx->pmu->sched_task(pmu_ctx, sched_in);
+           pmu_ctx->pmu->sched_task(pmu_ctx, task, sched_in);
    }
}
···
        WRITE_ONCE(ctx->task, next);
        WRITE_ONCE(next_ctx->task, task);

-       perf_ctx_sched_task_cb(ctx, false);
-       perf_event_swap_task_ctx_data(ctx, next_ctx);
+       perf_ctx_sched_task_cb(ctx, task, false);

        perf_ctx_enable(ctx, false);

        /*
         * RCU_INIT_POINTER here is safe because we've not
         * modified the ctx and the above modification of
-        * ctx->task and ctx->task_ctx_data are immaterial
-        * since those values are always verified under
-        * ctx->lock which we're now holding.
+        * ctx->task is immaterial since this value is
+        * always verified under ctx->lock which we're now
+        * holding.
         */
        RCU_INIT_POINTER(task->perf_event_ctxp, next_ctx);
        RCU_INIT_POINTER(next->perf_event_ctxp, ctx);
···
    perf_ctx_disable(ctx, false);

inside_switch:
-   perf_ctx_sched_task_cb(ctx, false);
+   perf_ctx_sched_task_cb(ctx, task, false);
    task_ctx_sched_out(ctx, NULL, EVENT_ALL);

    perf_ctx_enable(ctx, false);
···
void perf_sched_cb_dec(struct pmu *pmu)
{
-   struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
+   struct perf_cpu_pmu_context *cpc = this_cpc(pmu);

    this_cpu_dec(perf_sched_cb_usages);
    barrier();
···
void perf_sched_cb_inc(struct pmu *pmu)
{
-   struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
+   struct perf_cpu_pmu_context *cpc = this_cpc(pmu);

    if (!cpc->sched_cb_usage++)
        list_add(&cpc->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
···
 * PEBS requires this to provide PID/TID information. This requires we flush
 * all queued PEBS records before we context switch to a new task.
 */
-static void __perf_pmu_sched_task(struct perf_cpu_pmu_context *cpc, bool sched_in)
+static void __perf_pmu_sched_task(struct perf_cpu_pmu_context *cpc,
+                                 struct task_struct *task, bool sched_in)
{
    struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
    struct pmu *pmu;
···
    perf_ctx_lock(cpuctx, cpuctx->task_ctx);
    perf_pmu_disable(pmu);

-   pmu->sched_task(cpc->task_epc, sched_in);
+   pmu->sched_task(cpc->task_epc, task, sched_in);

    perf_pmu_enable(pmu);
    perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
···
        return;

    list_for_each_entry(cpc, this_cpu_ptr(&sched_cb_list), sched_cb_entry)
-       __perf_pmu_sched_task(cpc, sched_in);
+       __perf_pmu_sched_task(cpc, sched_in ? next : prev, sched_in);
}

static void perf_event_switch(struct task_struct *task,
···
    if (!pmu_ctx->ctx->task)
        return;

-   cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+   cpc = this_cpc(pmu_ctx->pmu);
    WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
    cpc->task_epc = pmu_ctx;
}
···
    if (event->attr.pinned) {
        perf_cgroup_event_disable(event, ctx);
        perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+
+       if (*perf_event_fasync(event))
+           event->pending_kill = POLL_HUP;
+
+       perf_event_wakeup(event);
    } else {
-       struct perf_cpu_pmu_context *cpc;
+       struct perf_cpu_pmu_context *cpc = this_cpc(event->pmu_ctx->pmu);

        event->pmu_ctx->rotate_necessary = 1;
-       cpc = this_cpu_ptr(event->pmu_ctx->pmu->cpu_pmu_context);
        perf_mux_hrtimer_restart(cpc);
        group_update_userpage(event);
    }
···
    perf_ctx_lock(cpuctx, ctx);
    perf_ctx_disable(ctx, false);

-   perf_ctx_sched_task_cb(ctx, true);
+   perf_ctx_sched_task_cb(ctx, task, true);

    perf_ctx_enable(ctx, false);
    perf_ctx_unlock(cpuctx, ctx);
···

    perf_event_sched_in(cpuctx, ctx, NULL);

-   perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
+   perf_ctx_sched_task_cb(cpuctx->task_ctx, task, true);

    if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
        perf_ctx_enable(&cpuctx->ctx, false);
···

    pmu->read(event);

-   for_each_sibling_event(sub, event) {
-       if (sub->state == PERF_EVENT_STATE_ACTIVE) {
-           /*
-            * Use sibling's PMU rather than @event's since
-            * sibling could be on different (eg: software) PMU.
-            */
-           sub->pmu->read(sub);
-       }
-   }
+   for_each_sibling_event(sub, event)
+       perf_pmu_read(sub);

    data->ret = pmu->commit_txn(pmu);
···
                struct perf_event *event)
{
    struct perf_event_pmu_context *new = NULL, *pos = NULL, *epc;
-   void *task_ctx_data = NULL;

    if (!ctx->task) {
        /*
···
         */
        struct perf_cpu_pmu_context *cpc;

-       cpc = per_cpu_ptr(pmu->cpu_pmu_context, event->cpu);
+       cpc = *per_cpu_ptr(pmu->cpu_pmu_context, event->cpu);
        epc = &cpc->epc;
        raw_spin_lock_irq(&ctx->lock);
        if (!epc->ctx) {
-           atomic_set(&epc->refcount, 1);
+           /*
+            * One extra reference for the pmu; see perf_pmu_free().
+            */
+           atomic_set(&epc->refcount, 2);
            epc->embedded = 1;
            list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
            epc->ctx = ctx;
···
    new = kzalloc(sizeof(*epc), GFP_KERNEL);
    if (!new)
        return ERR_PTR(-ENOMEM);
-
-   if (event->attach_state & PERF_ATTACH_TASK_DATA) {
-       task_ctx_data = alloc_task_ctx_data(pmu);
-       if (!task_ctx_data) {
-           kfree(new);
-           return ERR_PTR(-ENOMEM);
-       }
-   }

    __perf_init_event_pmu_context(new, pmu);
···
    epc->ctx = ctx;

found_epc:
-   if (task_ctx_data && !epc->task_ctx_data) {
-       epc->task_ctx_data = task_ctx_data;
-       task_ctx_data = NULL;
-       ctx->nr_task_data++;
-   }
    raw_spin_unlock_irq(&ctx->lock);
-
-   free_task_ctx_data(pmu, task_ctx_data);
    kfree(new);

    return epc;
···
    WARN_ON_ONCE(!atomic_inc_not_zero(&epc->refcount));
}

+static void free_cpc_rcu(struct rcu_head *head)
+{
+   struct perf_cpu_pmu_context *cpc =
+       container_of(head, typeof(*cpc), epc.rcu_head);
+
+   kfree(cpc);
+}
+
static void free_epc_rcu(struct rcu_head *head)
{
    struct perf_event_pmu_context *epc = container_of(head, typeof(*epc), rcu_head);

-   kfree(epc->task_ctx_data);
    kfree(epc);
}
···
    raw_spin_unlock_irqrestore(&ctx->lock, flags);

-   if (epc->embedded)
+   if (epc->embedded) {
+       call_rcu(&epc->rcu_head, free_cpc_rcu);
        return;
+   }

    call_rcu(&epc->rcu_head, free_epc_rcu);
}
···
        unaccount_freq_event_nohz();
    else
        atomic_dec(&nr_freq_events);
+}
+
+static struct perf_ctx_data *
+alloc_perf_ctx_data(struct kmem_cache *ctx_cache, bool global)
+{
+   struct perf_ctx_data *cd;
+
+   cd = kzalloc(sizeof(*cd), GFP_KERNEL);
+   if (!cd)
+       return NULL;
+
+   cd->data = kmem_cache_zalloc(ctx_cache, GFP_KERNEL);
+   if (!cd->data) {
+       kfree(cd);
+       return NULL;
+   }
+
+   cd->global = global;
+   cd->ctx_cache = ctx_cache;
+   refcount_set(&cd->refcount, 1);
+
+   return cd;
+}
+
+static void free_perf_ctx_data(struct perf_ctx_data *cd)
+{
+   kmem_cache_free(cd->ctx_cache, cd->data);
+   kfree(cd);
+}
+
+static void __free_perf_ctx_data_rcu(struct rcu_head *rcu_head)
+{
+   struct perf_ctx_data *cd;
+
+   cd = container_of(rcu_head, struct perf_ctx_data, rcu_head);
+   free_perf_ctx_data(cd);
+}
+
+static inline void perf_free_ctx_data_rcu(struct perf_ctx_data *cd)
+{
+   call_rcu(&cd->rcu_head, __free_perf_ctx_data_rcu);
+}
+
+static int
+attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache,
+                    bool global)
+{
+   struct perf_ctx_data *cd, *old = NULL;
+
+   cd = alloc_perf_ctx_data(ctx_cache, global);
+   if (!cd)
+       return -ENOMEM;
+
+   for (;;) {
+       if (try_cmpxchg((struct perf_ctx_data **)&task->perf_ctx_data, &old, cd)) {
+           if (old)
+               perf_free_ctx_data_rcu(old);
+           return 0;
+       }
+
+       if (!old) {
+           /*
+            * After seeing a dead @old, we raced with
+            * removal and lost, try again to install @cd.
+            */
+           continue;
+       }
+
+       if (refcount_inc_not_zero(&old->refcount)) {
+           free_perf_ctx_data(cd); /* unused */
+           return 0;
+       }
+
+       /*
+        * @old is a dead object, refcount==0 is stable, try and
+        * replace it with @cd.
+        */
+   }
+   return 0;
+}
+
+static void __detach_global_ctx_data(void);
+DEFINE_STATIC_PERCPU_RWSEM(global_ctx_data_rwsem);
+static refcount_t global_ctx_data_ref;
+
+static int
+attach_global_ctx_data(struct kmem_cache *ctx_cache)
+{
+   struct task_struct *g, *p;
+   struct perf_ctx_data *cd;
+   int ret;
+
+   if (refcount_inc_not_zero(&global_ctx_data_ref))
+       return 0;
+
+   guard(percpu_write)(&global_ctx_data_rwsem);
+   if (refcount_inc_not_zero(&global_ctx_data_ref))
+       return 0;
+again:
+   /* Allocate everything */
+   scoped_guard (rcu) {
+       for_each_process_thread(g, p) {
+           cd = rcu_dereference(p->perf_ctx_data);
+           if (cd && !cd->global) {
+               cd->global = 1;
+               if (!refcount_inc_not_zero(&cd->refcount))
+                   cd = NULL;
+           }
+           if (!cd) {
+               get_task_struct(p);
+               goto alloc;
+           }
+       }
+   }
+
+   refcount_set(&global_ctx_data_ref, 1);
+
+   return 0;
+alloc:
+   ret = attach_task_ctx_data(p, ctx_cache, true);
+   put_task_struct(p);
+   if (ret) {
+       __detach_global_ctx_data();
+       return ret;
+   }
+   goto again;
+}
+
+static int
+attach_perf_ctx_data(struct perf_event *event)
+{
+   struct task_struct *task = event->hw.target;
+   struct kmem_cache *ctx_cache = event->pmu->task_ctx_cache;
+   int ret;
+
+   if (!ctx_cache)
+       return -ENOMEM;
+
+   if (task)
+       return attach_task_ctx_data(task, ctx_cache, false);
+
+   ret = attach_global_ctx_data(ctx_cache);
+   if (ret)
+       return ret;
+
+   event->attach_state |= PERF_ATTACH_GLOBAL_DATA;
+   return 0;
+}
+
+static void
+detach_task_ctx_data(struct task_struct *p)
+{
+   struct perf_ctx_data *cd;
+
+   scoped_guard (rcu) {
+       cd = rcu_dereference(p->perf_ctx_data);
+       if (!cd || !refcount_dec_and_test(&cd->refcount))
+           return;
+   }
+
+   /*
+    * The old ctx_data may be lost because of the race.
+    * Nothing is required to do for the case.
+    * See attach_task_ctx_data().
+    */
+   if (try_cmpxchg((struct perf_ctx_data **)&p->perf_ctx_data, &cd, NULL))
+       perf_free_ctx_data_rcu(cd);
+}
+
+static void __detach_global_ctx_data(void)
+{
+   struct task_struct *g, *p;
+   struct perf_ctx_data *cd;
+
+again:
+   scoped_guard (rcu) {
+       for_each_process_thread(g, p) {
+           cd = rcu_dereference(p->perf_ctx_data);
+           if (!cd || !cd->global)
+               continue;
+           cd->global = 0;
+           get_task_struct(p);
+           goto detach;
+       }
+   }
+   return;
+detach:
+   detach_task_ctx_data(p);
+   put_task_struct(p);
+   goto again;
+}
+
+static void detach_global_ctx_data(void)
+{
+   if (refcount_dec_not_one(&global_ctx_data_ref))
+       return;
+
+   guard(percpu_write)(&global_ctx_data_rwsem);
+   if (!refcount_dec_and_test(&global_ctx_data_ref))
+       return;
+
+   /* remove everything */
+   __detach_global_ctx_data();
+}
+
+static void detach_perf_ctx_data(struct perf_event *event)
+{
+   struct task_struct *task = event->hw.target;
+
+   event->attach_state &= ~PERF_ATTACH_TASK_DATA;
+
+   if (task)
+       return detach_task_ctx_data(task);
+
+   if (event->attach_state & PERF_ATTACH_GLOBAL_DATA) {
+       detach_global_ctx_data();
+       event->attach_state &= ~PERF_ATTACH_GLOBAL_DATA;
+   }
}

static void unaccount_event(struct perf_event *event)
···
        return -EBUSY;
    }

+   event->attach_state |= PERF_ATTACH_EXCLUSIVE;
+
    return 0;
}
···
{
    struct pmu *pmu = event->pmu;

-   if (!is_exclusive_pmu(pmu))
-       return;
-
    /* see comment in exclusive_event_init() */
    if (event->attach_state & PERF_ATTACH_TASK)
        atomic_dec(&pmu->exclusive_cnt);
    else
        atomic_inc(&pmu->exclusive_cnt);
+
+   event->attach_state &= ~PERF_ATTACH_EXCLUSIVE;
}

static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
···
    return true;
}

-static void perf_addr_filters_splice(struct perf_event *event,
-                                    struct list_head *head);
+static void perf_free_addr_filters(struct perf_event *event);

static void perf_pending_task_sync(struct perf_event *event)
{
···
    rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
}

+/* vs perf_event_alloc() error */
+static void __free_event(struct perf_event *event)
+{
+   if (event->attach_state & PERF_ATTACH_CALLCHAIN)
+       put_callchain_buffers();
+
+   kfree(event->addr_filter_ranges);
+
+   if (event->attach_state & PERF_ATTACH_EXCLUSIVE)
+       exclusive_event_destroy(event);
+
+   if (is_cgroup_event(event))
+       perf_detach_cgroup(event);
+
+   if (event->attach_state & PERF_ATTACH_TASK_DATA)
+       detach_perf_ctx_data(event);
+
+   if (event->destroy)
+       event->destroy(event);
+
+   /*
+    * Must be after ->destroy(), due to uprobe_perf_close() using
+    * hw.target.
+    */
+   if (event->hw.target)
+       put_task_struct(event->hw.target);
+
+   if (event->pmu_ctx) {
+       /*
+        * put_pmu_ctx() needs an event->ctx reference, because of
+        * epc->ctx.
+        */
+       WARN_ON_ONCE(!event->ctx);
+       WARN_ON_ONCE(event->pmu_ctx->ctx != event->ctx);
+       put_pmu_ctx(event->pmu_ctx);
+   }
+
+   /*
+    * perf_event_free_task() relies on put_ctx() being 'last', in
+    * particular all task references must be cleaned up.
+    */
+   if (event->ctx)
+       put_ctx(event->ctx);
+
+   if (event->pmu)
+       module_put(event->pmu->module);
+
+   call_rcu(&event->rcu_head, free_event_rcu);
+}
+
+DEFINE_FREE(__free_event, struct perf_event *, if (_T) __free_event(_T))
+
+/* vs perf_event_alloc() success */
static void _free_event(struct perf_event *event)
{
    irq_work_sync(&event->pending_irq);
···
        mutex_unlock(&event->mmap_mutex);
    }

-   if (is_cgroup_event(event))
-       perf_detach_cgroup(event);
-
-   if (!event->parent) {
-       if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
-           put_callchain_buffers();
-   }
-
    perf_event_free_bpf_prog(event);
-   perf_addr_filters_splice(event, NULL);
-   kfree(event->addr_filter_ranges);
+   perf_free_addr_filters(event);

-   if (event->destroy)
-       event->destroy(event);
-
-   /*
-    * Must be after ->destroy(), due to uprobe_perf_close() using
-    * hw.target.
-    */
-   if (event->hw.target)
-       put_task_struct(event->hw.target);
-
-   if (event->pmu_ctx)
-       put_pmu_ctx(event->pmu_ctx);
-
-   /*
-    * perf_event_free_task() relies on put_ctx() being 'last', in particular
-    * all task references must be cleaned up.
-    */
-   if (event->ctx)
-       put_ctx(event->ctx);
-
-   exclusive_event_destroy(event);
-   module_put(event->pmu->module);
-
-   call_rcu(&event->rcu_head, free_event_rcu);
+   __free_event(event);
}

/*
···
    poll_wait(file, &event->waitq, wait);

    if (is_event_hup(event))
+       return events;
+
+   if (unlikely(READ_ONCE(event->state) == PERF_EVENT_STATE_ERROR &&
+                event->attr.pinned))
        return events;

    /*
···
    unsigned long vma_size;
    unsigned long nr_pages;
    long user_extra = 0, extra = 0;
-   int ret = 0, flags = 0;
+   int ret, flags = 0;

    /*
     * Don't allow mmap() of inherited per-task counters. This would
···
        return ret;

    vma_size = vma->vm_end - vma->vm_start;
+   nr_pages = vma_size / PAGE_SIZE;
+
+   if (nr_pages > INT_MAX)
+       return -ENOMEM;
+
+   if (vma_size != PAGE_SIZE * nr_pages)
+       return -EINVAL;
+
+   user_extra = nr_pages;
+
+   mutex_lock(&event->mmap_mutex);
+   ret = -EINVAL;

    if (vma->vm_pgoff == 0) {
-       nr_pages = (vma_size / PAGE_SIZE) - 1;
+       nr_pages -= 1;
+
+       /*
+        * If we have rb pages ensure they're a power-of-two number, so we
+        * can do bitmasks instead of modulo.
+        */
+       if (nr_pages != 0 && !is_power_of_2(nr_pages))
+           goto unlock;
+
+       WARN_ON_ONCE(event->ctx->parent_ctx);
+
+       if (event->rb) {
+           if (data_page_nr(event->rb) != nr_pages)
+               goto unlock;
+
+           if (atomic_inc_not_zero(&event->rb->mmap_count)) {
+               /*
+                * Success -- managed to mmap() the same buffer
+                * multiple times.
+                */
+               ret = 0;
+               /* We need the rb to map pages. */
+               rb = event->rb;
+               goto unlock;
+           }
+
+           /*
+            * Raced against perf_mmap_close()'s
+            * atomic_dec_and_mutex_lock() remove the
+            * event and continue as if !event->rb
+            */
+           ring_buffer_attach(event, NULL);
+       }
+
    } else {
        /*
         * AUX area mapping: if rb->aux_nr_pages != 0, it's already
···
         * and offset. Must be above the normal perf buffer.
         */
        u64 aux_offset, aux_size;
-
-       if (!event->rb)
-           return -EINVAL;
-
-       nr_pages = vma_size / PAGE_SIZE;
-       if (nr_pages > INT_MAX)
-           return -ENOMEM;
-
-       mutex_lock(&event->mmap_mutex);
-       ret = -EINVAL;

        rb = event->rb;
        if (!rb)
···
        }

        atomic_set(&rb->aux_mmap_count, 1);
-       user_extra = nr_pages;
-
-       goto accounting;
    }

-   /*
-    * If we have rb pages ensure they're a power-of-two number, so we
-    * can do bitmasks instead of modulo.
-    */
-   if (nr_pages != 0 && !is_power_of_2(nr_pages))
-       return -EINVAL;
-
-   if (vma_size != PAGE_SIZE * (1 + nr_pages))
-       return -EINVAL;
-
-   WARN_ON_ONCE(event->ctx->parent_ctx);
-again:
-   mutex_lock(&event->mmap_mutex);
-   if (event->rb) {
-       if (data_page_nr(event->rb) != nr_pages) {
-           ret = -EINVAL;
-           goto unlock;
-       }
-
-       if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
-           /*
-            * Raced against perf_mmap_close(); remove the
-            * event and try again.
-            */
-           ring_buffer_attach(event, NULL);
-           mutex_unlock(&event->mmap_mutex);
-           goto again;
-       }
-
-       /* We need the rb to map pages. */
-       rb = event->rb;
-       goto unlock;
-   }
-
-   user_extra = nr_pages + 1;
-
-accounting:
    user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);

    /*
···
        rb->aux_mmap_locked = extra;
    }

+   ret = 0;
+
unlock:
    if (!ret) {
        atomic_long_add(user_extra, &user->locked_vm);
···
    if (!ret)
        ret = map_range(rb, vma);

-   if (event->pmu->event_mapped)
+   if (!ret && event->pmu->event_mapped)
        event->pmu->event_mapped(event, vma->vm_mm);

    return ret;
···
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        values[n++] = running;

-   if ((leader != event) &&
-       (leader->state == PERF_EVENT_STATE_ACTIVE))
-       leader->pmu->read(leader);
+   if ((leader != event) && !handle->skip_read)
+       perf_pmu_read(leader);

    values[n++] = perf_event_count(leader, self);
    if (read_format & PERF_FORMAT_ID)
···
    for_each_sibling_event(sub, leader) {
        n = 0;

-       if ((sub != event) &&
-           (sub->state == PERF_EVENT_STATE_ACTIVE))
-           sub->pmu->read(sub);
+       if ((sub != event) && !handle->skip_read)
+           perf_pmu_read(sub);

        values[n++] = perf_event_count(sub, self);
        if (read_format & PERF_FORMAT_ID)
···
              struct perf_event *event)
{
    u64 sample_type = data->type;
+
+   if (data->sample_flags & PERF_SAMPLE_READ)
+       handle->skip_read = 1;

    perf_output_put(handle, *header);
···
              task_ctx);
}

+/*
+ * Allocate data for a new task when profiling system-wide
+ * events which require PMU specific data
+ */
+static void
+perf_event_alloc_task_data(struct task_struct *child,
+                          struct task_struct *parent)
+{
+   struct kmem_cache *ctx_cache = NULL;
+   struct perf_ctx_data *cd;
+
+   if (!refcount_read(&global_ctx_data_ref))
+       return;
+
+   scoped_guard (rcu) {
+       cd = rcu_dereference(parent->perf_ctx_data);
+       if (cd)
+           ctx_cache = cd->ctx_cache;
+   }
+
+   if (!ctx_cache)
+       return;
+
+   guard(percpu_read)(&global_ctx_data_rwsem);
+   scoped_guard (rcu) {
+       cd = rcu_dereference(child->perf_ctx_data);
+       if (!cd) {
+           /*
+            * A system-wide event may be unaccount,
+            * when attaching the perf_ctx_data.
+            */
+           if (!refcount_read(&global_ctx_data_ref))
+               return;
+           goto attach;
+       }
+
+       if (!cd->global) {
+           cd->global = 1;
+           refcount_inc(&cd->refcount);
+       }
+   }
+
+   return;
+attach:
+   attach_task_ctx_data(child, ctx_cache, true);
+}
+
void perf_event_fork(struct task_struct *task)
{
    perf_event_task(task, NULL, 1);
    perf_event_namespaces(task);
+   perf_event_alloc_task_data(task, current);
}
···
    unsigned int size;

    memset(comm, 0, sizeof(comm));
-   strscpy(comm, comm_event->task->comm, sizeof(comm));
+   strscpy(comm, comm_event->task->comm);
    size = ALIGN(strlen(comm)+1, sizeof(u64));

    comm_event->comm = comm;
···
    }

cpy_name:
-   strscpy(tmp, name, sizeof(tmp));
+   strscpy(tmp, name);
    name = tmp;
got_name:
    /*
···
        ksym_type == PERF_RECORD_KSYMBOL_TYPE_UNKNOWN)
        goto err;

-   strscpy(name, sym, KSYM_NAME_LEN);
+   strscpy(name, sym);
    name_len = strlen(name) + 1;
    while (!IS_ALIGNED(name_len, sizeof(u64)))
        name[name_len++] = '\0';
···

void perf_event_free_bpf_prog(struct perf_event *event)
{
+   if (!event->prog)
return; 10845 + 11132 10846 if (!perf_event_is_tracing(event)) { 11133 10847 perf_event_free_bpf_handler(event); 11134 10848 return; ··· 11228 10936 raw_spin_unlock_irqrestore(&event->addr_filters.lock, flags); 11229 10937 11230 10938 free_filters_list(&list); 10939 + } 10940 + 10941 + static void perf_free_addr_filters(struct perf_event *event) 10942 + { 10943 + /* 10944 + * Used during free paths, there is no concurrency. 10945 + */ 10946 + if (list_empty(&event->addr_filters.list)) 10947 + return; 10948 + 10949 + perf_addr_filters_splice(event, NULL); 11231 10950 } 11232 10951 11233 10952 /* ··· 11917 11614 return 0; 11918 11615 } 11919 11616 11920 - static void free_pmu_context(struct pmu *pmu) 11921 - { 11922 - free_percpu(pmu->cpu_pmu_context); 11923 - } 11924 - 11925 11617 /* 11926 11618 * Let userspace know that this PMU supports address range filtering: 11927 11619 */ ··· 11926 11628 { 11927 11629 struct pmu *pmu = dev_get_drvdata(dev); 11928 11630 11929 - return scnprintf(page, PAGE_SIZE - 1, "%d\n", pmu->nr_addr_filters); 11631 + return sysfs_emit(page, "%d\n", pmu->nr_addr_filters); 11930 11632 } 11931 11633 DEVICE_ATTR_RO(nr_addr_filters); 11932 11634 ··· 11937 11639 { 11938 11640 struct pmu *pmu = dev_get_drvdata(dev); 11939 11641 11940 - return scnprintf(page, PAGE_SIZE - 1, "%d\n", pmu->type); 11642 + return sysfs_emit(page, "%d\n", pmu->type); 11941 11643 } 11942 11644 static DEVICE_ATTR_RO(type); 11943 11645 ··· 11948 11650 { 11949 11651 struct pmu *pmu = dev_get_drvdata(dev); 11950 11652 11951 - return scnprintf(page, PAGE_SIZE - 1, "%d\n", pmu->hrtimer_interval_ms); 11653 + return sysfs_emit(page, "%d\n", pmu->hrtimer_interval_ms); 11952 11654 } 11953 11655 11954 11656 static DEFINE_MUTEX(mux_interval_mutex); ··· 11979 11681 cpus_read_lock(); 11980 11682 for_each_online_cpu(cpu) { 11981 11683 struct perf_cpu_pmu_context *cpc; 11982 - cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu); 11684 + cpc = *per_cpu_ptr(pmu->cpu_pmu_context, cpu); 11983 11685 
cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer); 11984 11686 11985 11687 cpu_function_call(cpu, perf_mux_hrtimer_restart_ipi, cpc); ··· 12122 11824 12123 11825 free_dev: 12124 11826 put_device(pmu->dev); 11827 + pmu->dev = NULL; 12125 11828 goto out; 12126 11829 } 12127 11830 ··· 12144 11845 return true; 12145 11846 } 12146 11847 12147 - int perf_pmu_register(struct pmu *pmu, const char *name, int type) 11848 + static void perf_pmu_free(struct pmu *pmu) 12148 11849 { 12149 - int cpu, ret, max = PERF_TYPE_MAX; 12150 - 12151 - mutex_lock(&pmus_lock); 12152 - ret = -ENOMEM; 12153 - pmu->pmu_disable_count = alloc_percpu(int); 12154 - if (!pmu->pmu_disable_count) 12155 - goto unlock; 12156 - 12157 - pmu->type = -1; 12158 - if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) { 12159 - ret = -EINVAL; 12160 - goto free_pdc; 11850 + if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) { 11851 + if (pmu->nr_addr_filters) 11852 + device_remove_file(pmu->dev, &dev_attr_nr_addr_filters); 11853 + device_del(pmu->dev); 11854 + put_device(pmu->dev); 12161 11855 } 12162 11856 12163 - if (WARN_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE, "Can not register a pmu with an invalid scope.\n")) { 12164 - ret = -EINVAL; 12165 - goto free_pdc; 11857 + if (pmu->cpu_pmu_context) { 11858 + int cpu; 11859 + 11860 + for_each_possible_cpu(cpu) { 11861 + struct perf_cpu_pmu_context *cpc; 11862 + 11863 + cpc = *per_cpu_ptr(pmu->cpu_pmu_context, cpu); 11864 + if (!cpc) 11865 + continue; 11866 + if (cpc->epc.embedded) { 11867 + /* refcount managed */ 11868 + put_pmu_ctx(&cpc->epc); 11869 + continue; 11870 + } 11871 + kfree(cpc); 11872 + } 11873 + free_percpu(pmu->cpu_pmu_context); 12166 11874 } 11875 + } 11876 + 11877 + DEFINE_FREE(pmu_unregister, struct pmu *, if (_T) perf_pmu_free(_T)) 11878 + 11879 + int perf_pmu_register(struct pmu *_pmu, const char *name, int type) 11880 + { 11881 + int cpu, max = PERF_TYPE_MAX; 11882 + 11883 + struct pmu *pmu __free(pmu_unregister) = _pmu; 
11884 + guard(mutex)(&pmus_lock); 11885 + 11886 + if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) 11887 + return -EINVAL; 11888 + 11889 + if (WARN_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE, 11890 + "Can not register a pmu with an invalid scope.\n")) 11891 + return -EINVAL; 12167 11892 12168 11893 pmu->name = name; 12169 11894 12170 11895 if (type >= 0) 12171 11896 max = type; 12172 11897 12173 - ret = idr_alloc(&pmu_idr, NULL, max, 0, GFP_KERNEL); 12174 - if (ret < 0) 12175 - goto free_pdc; 11898 + CLASS(idr_alloc, pmu_type)(&pmu_idr, NULL, max, 0, GFP_KERNEL); 11899 + if (pmu_type.id < 0) 11900 + return pmu_type.id; 12176 11901 12177 - WARN_ON(type >= 0 && ret != type); 11902 + WARN_ON(type >= 0 && pmu_type.id != type); 12178 11903 12179 - type = ret; 12180 - pmu->type = type; 11904 + pmu->type = pmu_type.id; 12181 11905 atomic_set(&pmu->exclusive_cnt, 0); 12182 11906 12183 11907 if (pmu_bus_running && !pmu->dev) { 12184 - ret = pmu_dev_alloc(pmu); 11908 + int ret = pmu_dev_alloc(pmu); 12185 11909 if (ret) 12186 - goto free_idr; 11910 + return ret; 12187 11911 } 12188 11912 12189 - ret = -ENOMEM; 12190 - pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context); 11913 + pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context *); 12191 11914 if (!pmu->cpu_pmu_context) 12192 - goto free_dev; 11915 + return -ENOMEM; 12193 11916 12194 11917 for_each_possible_cpu(cpu) { 12195 - struct perf_cpu_pmu_context *cpc; 11918 + struct perf_cpu_pmu_context *cpc = 11919 + kmalloc_node(sizeof(struct perf_cpu_pmu_context), 11920 + GFP_KERNEL | __GFP_ZERO, 11921 + cpu_to_node(cpu)); 12196 11922 12197 - cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu); 11923 + if (!cpc) 11924 + return -ENOMEM; 11925 + 11926 + *per_cpu_ptr(pmu->cpu_pmu_context, cpu) = cpc; 12198 11927 __perf_init_event_pmu_context(&cpc->epc, pmu); 12199 11928 __perf_mux_hrtimer_init(cpc, cpu); 12200 11929 } ··· 12259 11932 * Now that the PMU is complete, make it visible to perf_try_init_event(). 
12260 11933 */ 12261 11934 if (!idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu)) 12262 - goto free_context; 11935 + return -EINVAL; 12263 11936 list_add_rcu(&pmu->entry, &pmus); 12264 11937 12265 - ret = 0; 12266 - unlock: 12267 - mutex_unlock(&pmus_lock); 12268 - 12269 - return ret; 12270 - 12271 - free_context: 12272 - free_percpu(pmu->cpu_pmu_context); 12273 - 12274 - free_dev: 12275 - if (pmu->dev && pmu->dev != PMU_NULL_DEV) { 12276 - device_del(pmu->dev); 12277 - put_device(pmu->dev); 12278 - } 12279 - 12280 - free_idr: 12281 - idr_remove(&pmu_idr, pmu->type); 12282 - 12283 - free_pdc: 12284 - free_percpu(pmu->pmu_disable_count); 12285 - goto unlock; 11938 + take_idr_id(pmu_type); 11939 + _pmu = no_free_ptr(pmu); // let it rip 11940 + return 0; 12286 11941 } 12287 11942 EXPORT_SYMBOL_GPL(perf_pmu_register); 12288 11943 12289 11944 void perf_pmu_unregister(struct pmu *pmu) 12290 11945 { 12291 - mutex_lock(&pmus_lock); 12292 - list_del_rcu(&pmu->entry); 12293 - idr_remove(&pmu_idr, pmu->type); 12294 - mutex_unlock(&pmus_lock); 11946 + scoped_guard (mutex, &pmus_lock) { 11947 + list_del_rcu(&pmu->entry); 11948 + idr_remove(&pmu_idr, pmu->type); 11949 + } 12295 11950 12296 11951 /* 12297 11952 * We dereference the pmu list under both SRCU and regular RCU, so ··· 12282 11973 synchronize_srcu(&pmus_srcu); 12283 11974 synchronize_rcu(); 12284 11975 12285 - free_percpu(pmu->pmu_disable_count); 12286 - if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) { 12287 - if (pmu->nr_addr_filters) 12288 - device_remove_file(pmu->dev, &dev_attr_nr_addr_filters); 12289 - device_del(pmu->dev); 12290 - put_device(pmu->dev); 12291 - } 12292 - free_pmu_context(pmu); 11976 + perf_pmu_free(pmu); 12293 11977 } 12294 11978 EXPORT_SYMBOL_GPL(perf_pmu_unregister); 12295 11979 ··· 12322 12020 if (ctx) 12323 12021 perf_event_ctx_unlock(event->group_leader, ctx); 12324 12022 12325 - if (!ret) { 12326 - if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) && 12327 - 
has_extended_regs(event)) 12328 - ret = -EOPNOTSUPP; 12023 + if (ret) 12024 + goto err_pmu; 12329 12025 12330 - if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE && 12331 - event_has_any_exclude_flag(event)) 12332 - ret = -EINVAL; 12333 - 12334 - if (pmu->scope != PERF_PMU_SCOPE_NONE && event->cpu >= 0) { 12335 - const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(pmu->scope, event->cpu); 12336 - struct cpumask *pmu_cpumask = perf_scope_cpumask(pmu->scope); 12337 - int cpu; 12338 - 12339 - if (pmu_cpumask && cpumask) { 12340 - cpu = cpumask_any_and(pmu_cpumask, cpumask); 12341 - if (cpu >= nr_cpu_ids) 12342 - ret = -ENODEV; 12343 - else 12344 - event->event_caps |= PERF_EV_CAP_READ_SCOPE; 12345 - } else { 12346 - ret = -ENODEV; 12347 - } 12348 - } 12349 - 12350 - if (ret && event->destroy) 12351 - event->destroy(event); 12026 + if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) && 12027 + has_extended_regs(event)) { 12028 + ret = -EOPNOTSUPP; 12029 + goto err_destroy; 12352 12030 } 12353 12031 12354 - if (ret) 12355 - module_put(pmu->module); 12032 + if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE && 12033 + event_has_any_exclude_flag(event)) { 12034 + ret = -EINVAL; 12035 + goto err_destroy; 12036 + } 12356 12037 12038 + if (pmu->scope != PERF_PMU_SCOPE_NONE && event->cpu >= 0) { 12039 + const struct cpumask *cpumask; 12040 + struct cpumask *pmu_cpumask; 12041 + int cpu; 12042 + 12043 + cpumask = perf_scope_cpu_topology_cpumask(pmu->scope, event->cpu); 12044 + pmu_cpumask = perf_scope_cpumask(pmu->scope); 12045 + 12046 + ret = -ENODEV; 12047 + if (!pmu_cpumask || !cpumask) 12048 + goto err_destroy; 12049 + 12050 + cpu = cpumask_any_and(pmu_cpumask, cpumask); 12051 + if (cpu >= nr_cpu_ids) 12052 + goto err_destroy; 12053 + 12054 + event->event_caps |= PERF_EV_CAP_READ_SCOPE; 12055 + } 12056 + 12057 + return 0; 12058 + 12059 + err_destroy: 12060 + if (event->destroy) { 12061 + event->destroy(event); 12062 + event->destroy = NULL; 12063 + } 12064 + 12065 
+ err_pmu: 12066 + event->pmu = NULL; 12067 + module_put(pmu->module); 12357 12068 return ret; 12358 12069 } 12359 12070 12360 12071 static struct pmu *perf_init_event(struct perf_event *event) 12361 12072 { 12362 12073 bool extended_type = false; 12363 - int idx, type, ret; 12364 12074 struct pmu *pmu; 12075 + int type, ret; 12365 12076 12366 - idx = srcu_read_lock(&pmus_srcu); 12077 + guard(srcu)(&pmus_srcu); 12367 12078 12368 12079 /* 12369 12080 * Save original type before calling pmu->event_init() since certain ··· 12389 12074 pmu = event->parent->pmu; 12390 12075 ret = perf_try_init_event(pmu, event); 12391 12076 if (!ret) 12392 - goto unlock; 12077 + return pmu; 12393 12078 } 12394 12079 12395 12080 /* ··· 12408 12093 } 12409 12094 12410 12095 again: 12411 - rcu_read_lock(); 12412 - pmu = idr_find(&pmu_idr, type); 12413 - rcu_read_unlock(); 12096 + scoped_guard (rcu) 12097 + pmu = idr_find(&pmu_idr, type); 12414 12098 if (pmu) { 12415 12099 if (event->attr.type != type && type != PERF_TYPE_RAW && 12416 12100 !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE)) 12417 - goto fail; 12101 + return ERR_PTR(-ENOENT); 12418 12102 12419 12103 ret = perf_try_init_event(pmu, event); 12420 12104 if (ret == -ENOENT && event->attr.type != type && !extended_type) { ··· 12422 12108 } 12423 12109 12424 12110 if (ret) 12425 - pmu = ERR_PTR(ret); 12111 + return ERR_PTR(ret); 12426 12112 12427 - goto unlock; 12113 + return pmu; 12428 12114 } 12429 12115 12430 12116 list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) { 12431 12117 ret = perf_try_init_event(pmu, event); 12432 12118 if (!ret) 12433 - goto unlock; 12119 + return pmu; 12434 12120 12435 - if (ret != -ENOENT) { 12436 - pmu = ERR_PTR(ret); 12437 - goto unlock; 12438 - } 12121 + if (ret != -ENOENT) 12122 + return ERR_PTR(ret); 12439 12123 } 12440 - fail: 12441 - pmu = ERR_PTR(-ENOENT); 12442 - unlock: 12443 - srcu_read_unlock(&pmus_srcu, idx); 12444 12124 12445 - return pmu; 12125 + return 
ERR_PTR(-ENOENT); 12446 12126 } 12447 12127 12448 12128 static void attach_sb_event(struct perf_event *event) ··· 12563 12255 void *context, int cgroup_fd) 12564 12256 { 12565 12257 struct pmu *pmu; 12566 - struct perf_event *event; 12567 12258 struct hw_perf_event *hwc; 12568 12259 long err = -EINVAL; 12569 12260 int node; ··· 12577 12270 } 12578 12271 12579 12272 node = (cpu >= 0) ? cpu_to_node(cpu) : -1; 12580 - event = kmem_cache_alloc_node(perf_event_cache, GFP_KERNEL | __GFP_ZERO, 12581 - node); 12273 + struct perf_event *event __free(__free_event) = 12274 + kmem_cache_alloc_node(perf_event_cache, GFP_KERNEL | __GFP_ZERO, node); 12582 12275 if (!event) 12583 12276 return ERR_PTR(-ENOMEM); 12584 12277 ··· 12685 12378 * See perf_output_read(). 12686 12379 */ 12687 12380 if (has_inherit_and_sample_read(attr) && !(attr->sample_type & PERF_SAMPLE_TID)) 12688 - goto err_ns; 12381 + return ERR_PTR(-EINVAL); 12689 12382 12690 12383 if (!has_branch_stack(event)) 12691 12384 event->attr.branch_sample_type = 0; 12692 12385 12693 12386 pmu = perf_init_event(event); 12694 - if (IS_ERR(pmu)) { 12695 - err = PTR_ERR(pmu); 12696 - goto err_ns; 12387 + if (IS_ERR(pmu)) 12388 + return (void*)pmu; 12389 + 12390 + /* 12391 + * The PERF_ATTACH_TASK_DATA is set in the event_init()->hw_config(). 12392 + * The attach should be right after the perf_init_event(). 12393 + * Otherwise, the __free_event() would mistakenly detach the non-exist 12394 + * perf_ctx_data because of the other errors between them. 12395 + */ 12396 + if (event->attach_state & PERF_ATTACH_TASK_DATA) { 12397 + err = attach_perf_ctx_data(event); 12398 + if (err) 12399 + return ERR_PTR(err); 12697 12400 } 12698 12401 12699 12402 /* ··· 12711 12394 * events (they don't make sense as the cgroup will be different 12712 12395 * on other CPUs in the uncore mask). 
12713 12396 */ 12714 - if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1)) { 12715 - err = -EINVAL; 12716 - goto err_pmu; 12717 - } 12397 + if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1)) 12398 + return ERR_PTR(-EINVAL); 12718 12399 12719 12400 if (event->attr.aux_output && 12720 12401 (!(pmu->capabilities & PERF_PMU_CAP_AUX_OUTPUT) || 12721 - event->attr.aux_pause || event->attr.aux_resume)) { 12722 - err = -EOPNOTSUPP; 12723 - goto err_pmu; 12724 - } 12402 + event->attr.aux_pause || event->attr.aux_resume)) 12403 + return ERR_PTR(-EOPNOTSUPP); 12725 12404 12726 - if (event->attr.aux_pause && event->attr.aux_resume) { 12727 - err = -EINVAL; 12728 - goto err_pmu; 12729 - } 12405 + if (event->attr.aux_pause && event->attr.aux_resume) 12406 + return ERR_PTR(-EINVAL); 12730 12407 12731 12408 if (event->attr.aux_start_paused) { 12732 - if (!(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE)) { 12733 - err = -EOPNOTSUPP; 12734 - goto err_pmu; 12735 - } 12409 + if (!(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE)) 12410 + return ERR_PTR(-EOPNOTSUPP); 12736 12411 event->hw.aux_paused = 1; 12737 12412 } 12738 12413 12739 12414 if (cgroup_fd != -1) { 12740 12415 err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader); 12741 12416 if (err) 12742 - goto err_pmu; 12417 + return ERR_PTR(err); 12743 12418 } 12744 12419 12745 12420 err = exclusive_event_init(event); 12746 12421 if (err) 12747 - goto err_pmu; 12422 + return ERR_PTR(err); 12748 12423 12749 12424 if (has_addr_filter(event)) { 12750 12425 event->addr_filter_ranges = kcalloc(pmu->nr_addr_filters, 12751 12426 sizeof(struct perf_addr_filter_range), 12752 12427 GFP_KERNEL); 12753 - if (!event->addr_filter_ranges) { 12754 - err = -ENOMEM; 12755 - goto err_per_task; 12756 - } 12428 + if (!event->addr_filter_ranges) 12429 + return ERR_PTR(-ENOMEM); 12757 12430 12758 12431 /* 12759 12432 * Clone the parent's vma offsets: they are valid until exec() ··· 12767 12460 if 
(event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) { 12768 12461 err = get_callchain_buffers(attr->sample_max_stack); 12769 12462 if (err) 12770 - goto err_addr_filters; 12463 + return ERR_PTR(err); 12464 + event->attach_state |= PERF_ATTACH_CALLCHAIN; 12771 12465 } 12772 12466 } 12773 12467 12774 12468 err = security_perf_event_alloc(event); 12775 12469 if (err) 12776 - goto err_callchain_buffer; 12470 + return ERR_PTR(err); 12777 12471 12778 12472 /* symmetric to unaccount_event() in _free_event() */ 12779 12473 account_event(event); 12780 12474 12781 - return event; 12782 - 12783 - err_callchain_buffer: 12784 - if (!event->parent) { 12785 - if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) 12786 - put_callchain_buffers(); 12787 - } 12788 - err_addr_filters: 12789 - kfree(event->addr_filter_ranges); 12790 - 12791 - err_per_task: 12792 - exclusive_event_destroy(event); 12793 - 12794 - err_pmu: 12795 - if (is_cgroup_event(event)) 12796 - perf_detach_cgroup(event); 12797 - if (event->destroy) 12798 - event->destroy(event); 12799 - module_put(pmu->module); 12800 - err_ns: 12801 - if (event->hw.target) 12802 - put_task_struct(event->hw.target); 12803 - call_rcu(&event->rcu_head, free_event_rcu); 12804 - 12805 - return ERR_PTR(err); 12475 + return_ptr(event); 12806 12476 } 12807 12477 12808 12478 static int perf_copy_attr(struct perf_event_attr __user *uattr, ··· 13853 13569 * At this point we need to send EXIT events to cpu contexts. 13854 13570 */ 13855 13571 perf_event_task(child, NULL, 0); 13572 + 13573 + /* 13574 + * Detach the perf_ctx_data for the system-wide event. 
13575 + */ 13576 + guard(percpu_read)(&global_ctx_data_rwsem); 13577 + detach_task_ctx_data(child); 13856 13578 } 13857 13579 13858 13580 static void perf_free_event(struct perf_event *event, ··· 14034 13744 if (is_orphaned_event(parent_event) || 14035 13745 !atomic_long_inc_not_zero(&parent_event->refcount)) { 14036 13746 mutex_unlock(&parent_event->child_mutex); 14037 - /* task_ctx_data is freed with child_ctx */ 14038 13747 free_event(child_event); 14039 13748 return NULL; 14040 13749 } ··· 14291 14002 child->perf_event_ctxp = NULL; 14292 14003 mutex_init(&child->perf_event_mutex); 14293 14004 INIT_LIST_HEAD(&child->perf_event_list); 14005 + child->perf_ctx_data = NULL; 14294 14006 14295 14007 ret = perf_event_init_context(child, clone_flags); 14296 14008 if (ret) {
+3 -2
kernel/events/hw_breakpoint.c
··· 950 950 return -ENOENT; 951 951 952 952 /* 953 - * no branch sampling for breakpoint events 953 + * Check if breakpoint type is supported before proceeding. 954 + * Also, no branch sampling for breakpoint events. 954 955 */ 955 - if (has_branch_stack(bp)) 956 + if (!hw_breakpoint_slots_cached(find_slot_idx(bp->attr.bp_type)) || has_branch_stack(bp)) 956 957 return -EOPNOTSUPP; 957 958 958 959 err = register_perf_hw_breakpoint(bp);
+2 -1
kernel/events/ring_buffer.c
··· 19 19 20 20 static void perf_output_wakeup(struct perf_output_handle *handle) 21 21 { 22 - atomic_set(&handle->rb->poll, EPOLLIN); 22 + atomic_set(&handle->rb->poll, EPOLLIN | EPOLLRDNORM); 23 23 24 24 handle->event->pending_wakeup = 1; 25 25 ··· 185 185 186 186 handle->rb = rb; 187 187 handle->event = event; 188 + handle->flags = 0; 188 189 189 190 have_lost = local_read(&rb->lost); 190 191 if (unlikely(have_lost)) {
+6 -6
kernel/events/uprobes.c
··· 2169 2169 */ 2170 2170 unsigned long uprobe_get_trampoline_vaddr(void) 2171 2171 { 2172 + unsigned long trampoline_vaddr = UPROBE_NO_TRAMPOLINE_VADDR; 2172 2173 struct xol_area *area; 2173 - unsigned long trampoline_vaddr = -1; 2174 2174 2175 2175 /* Pairs with xol_add_vma() smp_store_release() */ 2176 2176 area = READ_ONCE(current->mm->uprobes_state.xol_area); /* ^^^ */ ··· 2311 2311 WARN_ON_ONCE(utask->state != UTASK_SSTEP); 2312 2312 2313 2313 if (task_sigpending(t)) { 2314 - spin_lock_irq(&t->sighand->siglock); 2314 + utask->signal_denied = true; 2315 2315 clear_tsk_thread_flag(t, TIF_SIGPENDING); 2316 - spin_unlock_irq(&t->sighand->siglock); 2317 2316 2318 2317 if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) { 2319 2318 utask->state = UTASK_SSTEP_TRAPPED; ··· 2745 2746 utask->state = UTASK_RUNNING; 2746 2747 xol_free_insn_slot(utask); 2747 2748 2748 - spin_lock_irq(&current->sighand->siglock); 2749 - recalc_sigpending(); /* see uprobe_deny_signal() */ 2750 - spin_unlock_irq(&current->sighand->siglock); 2749 + if (utask->signal_denied) { 2750 + set_thread_flag(TIF_SIGPENDING); 2751 + utask->signal_denied = false; 2752 + } 2751 2753 2752 2754 if (unlikely(err)) { 2753 2755 uprobe_warn(current, "execute the probed insn, sending SIGILL.");
-64
kernel/sysctl.c
··· 54 54 #include <linux/acpi.h> 55 55 #include <linux/reboot.h> 56 56 #include <linux/ftrace.h> 57 - #include <linux/perf_event.h> 58 57 #include <linux/oom.h> 59 58 #include <linux/kmod.h> 60 59 #include <linux/capability.h> ··· 90 91 #if defined(CONFIG_SYSCTL) 91 92 92 93 /* Constants used for minimum and maximum */ 93 - 94 - #ifdef CONFIG_PERF_EVENTS 95 - static const int six_hundred_forty_kb = 640 * 1024; 96 - #endif 97 - 98 - 99 94 static const int ngroups_max = NGROUPS_MAX; 100 95 static const int cap_last_cap = CAP_LAST_CAP; 101 96 ··· 1924 1931 .maxlen = sizeof(int), 1925 1932 .mode = 0644, 1926 1933 .proc_handler = proc_dointvec, 1927 - }, 1928 - #endif 1929 - #ifdef CONFIG_PERF_EVENTS 1930 - /* 1931 - * User-space scripts rely on the existence of this file 1932 - * as a feature check for perf_events being enabled. 1933 - * 1934 - * So it's an ABI, do not remove! 1935 - */ 1936 - { 1937 - .procname = "perf_event_paranoid", 1938 - .data = &sysctl_perf_event_paranoid, 1939 - .maxlen = sizeof(sysctl_perf_event_paranoid), 1940 - .mode = 0644, 1941 - .proc_handler = proc_dointvec, 1942 - }, 1943 - { 1944 - .procname = "perf_event_mlock_kb", 1945 - .data = &sysctl_perf_event_mlock, 1946 - .maxlen = sizeof(sysctl_perf_event_mlock), 1947 - .mode = 0644, 1948 - .proc_handler = proc_dointvec, 1949 - }, 1950 - { 1951 - .procname = "perf_event_max_sample_rate", 1952 - .data = &sysctl_perf_event_sample_rate, 1953 - .maxlen = sizeof(sysctl_perf_event_sample_rate), 1954 - .mode = 0644, 1955 - .proc_handler = perf_event_max_sample_rate_handler, 1956 - .extra1 = SYSCTL_ONE, 1957 - }, 1958 - { 1959 - .procname = "perf_cpu_time_max_percent", 1960 - .data = &sysctl_perf_cpu_time_max_percent, 1961 - .maxlen = sizeof(sysctl_perf_cpu_time_max_percent), 1962 - .mode = 0644, 1963 - .proc_handler = perf_cpu_time_max_percent_handler, 1964 - .extra1 = SYSCTL_ZERO, 1965 - .extra2 = SYSCTL_ONE_HUNDRED, 1966 - }, 1967 - { 1968 - .procname = "perf_event_max_stack", 1969 - .data = 
&sysctl_perf_event_max_stack, 1970 - .maxlen = sizeof(sysctl_perf_event_max_stack), 1971 - .mode = 0644, 1972 - .proc_handler = perf_event_max_stack_handler, 1973 - .extra1 = SYSCTL_ZERO, 1974 - .extra2 = (void *)&six_hundred_forty_kb, 1975 - }, 1976 - { 1977 - .procname = "perf_event_max_contexts_per_stack", 1978 - .data = &sysctl_perf_event_max_contexts_per_stack, 1979 - .maxlen = sizeof(sysctl_perf_event_max_contexts_per_stack), 1980 - .mode = 0644, 1981 - .proc_handler = perf_event_max_stack_handler, 1982 - .extra1 = SYSCTL_ZERO, 1983 - .extra2 = SYSCTL_ONE_THOUSAND, 1984 1934 }, 1985 1935 #endif 1986 1936 {
-25
kernel/watchdog.c
··· 347 347 } 348 348 __setup("watchdog_thresh=", watchdog_thresh_setup); 349 349 350 - static void __lockup_detector_cleanup(void); 351 - 352 350 #ifdef CONFIG_SOFTLOCKUP_DETECTOR_INTR_STORM 353 351 enum stats_per_group { 354 352 STATS_SYSTEM, ··· 884 886 885 887 watchdog_hardlockup_start(); 886 888 cpus_read_unlock(); 887 - /* 888 - * Must be called outside the cpus locked section to prevent 889 - * recursive locking in the perf code. 890 - */ 891 - __lockup_detector_cleanup(); 892 889 } 893 890 894 891 void lockup_detector_reconfigure(void) ··· 932 939 __lockup_detector_reconfigure(); 933 940 } 934 941 #endif /* !CONFIG_SOFTLOCKUP_DETECTOR */ 935 - 936 - static void __lockup_detector_cleanup(void) 937 - { 938 - lockdep_assert_held(&watchdog_mutex); 939 - hardlockup_detector_perf_cleanup(); 940 - } 941 - 942 - /** 943 - * lockup_detector_cleanup - Cleanup after cpu hotplug or sysctl changes 944 - * 945 - * Caller must not hold the cpu hotplug rwsem. 946 - */ 947 - void lockup_detector_cleanup(void) 948 - { 949 - mutex_lock(&watchdog_mutex); 950 - __lockup_detector_cleanup(); 951 - mutex_unlock(&watchdog_mutex); 952 - } 953 942 954 943 /** 955 944 * lockup_detector_soft_poweroff - Interface to stop lockup detector(s)
+2 -27
kernel/watchdog_perf.c
··· 21 21 #include <linux/perf_event.h> 22 22 23 23 static DEFINE_PER_CPU(struct perf_event *, watchdog_ev); 24 - static DEFINE_PER_CPU(struct perf_event *, dead_event); 25 - static struct cpumask dead_events_mask; 26 24 27 25 static atomic_t watchdog_cpus = ATOMIC_INIT(0); 28 26 ··· 144 146 PTR_ERR(evt)); 145 147 return PTR_ERR(evt); 146 148 } 149 + WARN_ONCE(this_cpu_read(watchdog_ev), "unexpected watchdog_ev leak"); 147 150 this_cpu_write(watchdog_ev, evt); 148 151 return 0; 149 152 } ··· 180 181 181 182 if (event) { 182 183 perf_event_disable(event); 184 + perf_event_release_kernel(event); 183 185 this_cpu_write(watchdog_ev, NULL); 184 - this_cpu_write(dead_event, event); 185 - cpumask_set_cpu(smp_processor_id(), &dead_events_mask); 186 186 atomic_dec(&watchdog_cpus); 187 187 } 188 - } 189 - 190 - /** 191 - * hardlockup_detector_perf_cleanup - Cleanup disabled events and destroy them 192 - * 193 - * Called from lockup_detector_cleanup(). Serialized by the caller. 194 - */ 195 - void hardlockup_detector_perf_cleanup(void) 196 - { 197 - int cpu; 198 - 199 - for_each_cpu(cpu, &dead_events_mask) { 200 - struct perf_event *event = per_cpu(dead_event, cpu); 201 - 202 - /* 203 - * Required because for_each_cpu() reports unconditionally 204 - * CPU0 as set on UP kernels. Sigh. 205 - */ 206 - if (event) 207 - perf_event_release_kernel(event); 208 - per_cpu(dead_event, cpu) = NULL; 209 - } 210 - cpumask_clear(&dead_events_mask); 211 188 } 212 189 213 190 /**
+2 -1
tools/arch/x86/include/asm/amd-ibs.h
··· 64 64 opmaxcnt_ext:7, /* 20-26: upper 7 bits of periodic op maximum count */ 65 65 reserved0:5, /* 27-31: reserved */ 66 66 opcurcnt:27, /* 32-58: periodic op counter current count */ 67 - reserved1:5; /* 59-63: reserved */ 67 + ldlat_thrsh:4, /* 59-62: Load Latency threshold */ 68 + ldlat_en:1; /* 63: Load Latency enabled */ 68 69 }; 69 70 }; 70 71