Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cache quality monitoring update from Thomas Gleixner:
"This update provides a complete rewrite of the Cache Quality
Monitoring (CQM) facility.

The existing CQM support was duct-taped into perf with a lot of
issues, and the attempts to fix those turned out to be incomplete
and horrible.

After lengthy discussions it was decided to integrate the CQM support
into the Resource Director Technology (RDT) facility, which is the
obvious choice as in hardware CQM is part of RDT. This also made it
possible to add Memory Bandwidth Monitoring support on top.

As a result the mechanisms for allocating cache/memory bandwidth and
the corresponding monitoring mechanisms are integrated into a single
management facility with a consistent user interface"
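
A minimal sketch of that single interface, pieced together from the
resctrl documentation added below (the group name "p0", the mask values
and the monitoring files are illustrative and assume a system with both
L3 allocation and CQM/MBM monitoring support):

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0                                  (create a control group)
  # echo "L3:0=3;1=f" > p0/schemata           (give it part of L3 cache 0)
  # echo $$ > p0/tasks                        (move the current shell into it)
  # cat p0/mon_data/mon_L3_00/llc_occupancy   (read monitoring data, in bytes)
  # cat p0/mon_data/mon_L3_00/mbm_total_bytes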

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
x86/intel_rdt: Turn off most RDT features on Skylake
x86/intel_rdt: Add command line options for resource director technology
x86/intel_rdt: Move special case code for Haswell to a quirk function
x86/intel_rdt: Remove redundant ternary operator on return
x86/intel_rdt/cqm: Improve limbo list processing
x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug
x86/intel_rdt: Modify the intel_pqr_state for better performance
x86/intel_rdt/cqm: Clear the default RMID during hotcpu
x86/intel_rdt: Show bitmask of shareable resource with other executing units
x86/intel_rdt/mbm: Handle counter overflow
x86/intel_rdt/mbm: Add mbm counter initialization
x86/intel_rdt/mbm: Basic counting of MBM events (total and local)
x86/intel_rdt/cqm: Add CPU hotplug support
x86/intel_rdt/cqm: Add sched_in support
x86/intel_rdt: Introduce rdt_enable_key for scheduling
x86/intel_rdt/cqm: Add mount,umount support
x86/intel_rdt/cqm: Add rmdir support
x86/intel_rdt: Separate the ctrl bits from rmdir
x86/intel_rdt/cqm: Add mon_data
x86/intel_rdt: Prepare for RDT monitor data support
...

+2643 -2439
Documentation/admin-guide/kernel-parameters.rst | +1
···
   PPT   Parallel port support is enabled.
   PS2   Appropriate PS/2 support is enabled.
   RAM   RAM disk support is enabled.
+  RDT   Intel Resource Director Technology.
   S390  S390 architecture is enabled.
   SCSI  Appropriate SCSI support is enabled.
  A lot of drivers have their options described inside
Documentation/admin-guide/kernel-parameters.txt | +6
···
  		Run specified binary instead of /init from the ramdisk,
  		used for early userspace startup. See initrd.
 
+ 	rdt=		[HW,X86,RDT]
+ 			Turn on/off individual RDT features. List is:
+ 			cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, mba.
+ 			E.g. to turn on cmt and turn off mba use:
+ 				rdt=cmt,!mba
+
  	reboot=		[KNL]
  			Format (x86 or x86_64):
  			[w[arm] | c[old] | h[ard] | s[oft] | g[pio]] \
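
As a quick sanity check after booting with one of these options, the
command line and CPU feature flags can be inspected (flag names as used
in the intel_rdt_ui.txt text below; exact spellings vary between kernel
versions, so treat this as a sketch):

  # cat /proc/cmdline
  # grep -o -E 'rdt|cqm|cat_l3|cdp_l3|mba' /proc/cpuinfo | sort -u
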
Documentation/x86/intel_rdt_ui.txt | +285 -38
···
  	Tony Luck <tony.luck@intel.com>
  	Vikas Shivappa <vikas.shivappa@intel.com>
 
- This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
- X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
+ This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
+ X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3".
 
  To use the feature mount the file system:
···
 
  "cdp": Enable code/data prioritization in L3 cache allocations.
 
+ RDT features are orthogonal. A particular system may support only
+ monitoring, only control, or both monitoring and control.
+
+ The mount succeeds if either of allocation or monitoring is present, but
+ only those files and directories supported by the system will be created.
+ For more details on the behavior of the interface during monitoring
+ and allocation, see the "Resource alloc and monitor groups" section.
 
  Info directory
  --------------
···
  The 'info' directory contains information about the enabled
  resources. Each resource has its own subdirectory. The subdirectory
  names reflect the resource names.
- Cache resource(L3/L2) subdirectory contains the following files:
+
+ Each subdirectory contains the following files with respect to
+ allocation:
+
+ Cache resource(L3/L2) subdirectory contains the following files
+ related to allocation:
 
  "num_closids":  The number of CLOSIDs which are valid for this
                  resource. The kernel uses the smallest number of
···
  "min_cbm_bits": The minimum number of consecutive bits which
                  must be set when writing a mask.
 
- Memory bandwitdh(MB) subdirectory contains the following files:
+ "shareable_bits": Bitmask of shareable resource with other executing
+                   entities (e.g. I/O). User can use this when
+                   setting up exclusive cache partitions. Note that
+                   some platforms support devices that have their
+                   own settings for cache use which can over-ride
+                   these bits.
+
+ Memory bandwitdh(MB) subdirectory contains the following files
+ with respect to allocation:
 
  "min_bandwidth": The minimum memory bandwidth percentage which
                   user can request.
···
                   non-linear. This field is purely informational
                   only.
 
- Resource groups
- ---------------
+ If RDT monitoring is available there will be an "L3_MON" directory
+ with the following files:
+
+ "num_rmids":     The number of RMIDs available. This is the
+                  upper bound for how many "CTRL_MON" + "MON"
+                  groups can be created.
+
+ "mon_features":  Lists the monitoring events if
+                  monitoring is enabled for the resource.
+
+ "max_threshold_occupancy":
+                  Read/write file provides the largest value (in
+                  bytes) at which a previously used LLC_occupancy
+                  counter can be considered for re-use.
+
+
+ Resource alloc and monitor groups
+ ---------------------------------
+
  Resource groups are represented as directories in the resctrl file
- system. The default group is the root directory. Other groups may be
- created as desired by the system administrator using the "mkdir(1)"
- command, and removed using "rmdir(1)".
+ system. The default group is the root directory which, immediately
+ after mounting, owns all the tasks and cpus in the system and can make
+ full use of all resources.
 
- There are three files associated with each group:
+ On a system with RDT control features additional directories can be
+ created in the root directory that specify different amounts of each
+ resource (see "schemata" below). The root and these additional top level
+ directories are referred to as "CTRL_MON" groups below.
 
- "tasks": A list of tasks that belongs to this group. Tasks can be
-          added to a group by writing the task ID to the "tasks" file
-          (which will automatically remove them from the previous
-          group to which they belonged). New tasks created by fork(2)
-          and clone(2) are added to the same group as their parent.
-          If a pid is not in any sub partition, it is in root partition
-          (i.e. default partition).
+ On a system with RDT monitoring the root directory and other top level
+ directories contain a directory named "mon_groups" in which additional
+ directories can be created to monitor subsets of tasks in the CTRL_MON
+ group that is their ancestor. These are called "MON" groups in the rest
+ of this document.
 
- "cpus": A bitmask of logical CPUs assigned to this group. Writing
-         a new mask can add/remove CPUs from this group. Added CPUs
-         are removed from their previous group. Removed ones are
-         given to the default (root) group. You cannot remove CPUs
-         from the default group.
+ Removing a directory will move all tasks and cpus owned by the group it
+ represents to the parent. Removing one of the created CTRL_MON groups
+ will automatically remove all MON groups below it.
 
- "cpus_list": One or more CPU ranges of logical CPUs assigned to this
-              group. Same rules apply like for the "cpus" file.
+ All groups contain the following files:
 
- "schemata": A list of all the resources available to this group.
-             Each resource has its own line and format - see below for
-             details.
+ "tasks":
+ 	Reading this file shows the list of all tasks that belong to
+ 	this group. Writing a task id to the file will add a task to the
+ 	group. If the group is a CTRL_MON group the task is removed from
+ 	whichever previous CTRL_MON group owned the task and also from
+ 	any MON group that owned the task. If the group is a MON group,
+ 	then the task must already belong to the CTRL_MON parent of this
+ 	group. The task is removed from any previous MON group.
 
- When a task is running the following rules define which resources
- are available to it:
+
+ "cpus":
+ 	Reading this file shows a bitmask of the logical CPUs owned by
+ 	this group. Writing a mask to this file will add and remove
+ 	CPUs to/from this group. As with the tasks file a hierarchy is
+ 	maintained where MON groups may only include CPUs owned by the
+ 	parent CTRL_MON group.
+
+
+ "cpus_list":
+ 	Just like "cpus", only using ranges of CPUs instead of bitmasks.
+
+
+ When control is enabled all CTRL_MON groups will also contain:
+
+ "schemata":
+ 	A list of all the resources available to this group.
+ 	Each resource has its own line and format - see below for details.
+
+ When monitoring is enabled all MON groups will also contain:
+
+ "mon_data":
+ 	This contains a set of files organized by L3 domain and by
+ 	RDT event. E.g. on a system with two L3 domains there will
+ 	be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ 	directories have one file per event (e.g. "llc_occupancy",
+ 	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
+ 	files provide a read out of the current value of the event for
+ 	all tasks in the group. In CTRL_MON groups these files provide
+ 	the sum for all tasks in the CTRL_MON group and all tasks in
+ 	MON groups. Please see example section for more details on usage.
+
+ Resource allocation rules
+ -------------------------
+ When a task is running the following rules define which resources are
+ available to it:
 
  1) If the task is a member of a non-default group, then the schemata
-    for that group is used.
+ for that group is used.
 
  2) Else if the task belongs to the default group, but is running on a
-    CPU that is assigned to some specific group, then the schemata for
-    the CPU's group is used.
+ CPU that is assigned to some specific group, then the schemata for the
+ CPU's group is used.
 
  3) Otherwise the schemata for the default group is used.
 
+ Resource monitoring rules
+ -------------------------
+ 1) If a task is a member of a MON group, or non-default CTRL_MON group
+ then RDT events for the task will be reported in that group.
+
+ 2) If a task is a member of the default CTRL_MON group, but is running
+ on a CPU that is assigned to some specific group, then the RDT events
+ for the task will be reported in that group.
+
+ 3) Otherwise RDT events for the task will be reported in the root level
+ "mon_data" group.
+
+
+ Notes on cache occupancy monitoring and control
+ -----------------------------------------------
+ When moving a task from one group to another you should remember that
+ this only affects *new* cache allocations by the task. E.g. you may have
+ a task in a monitor group showing 3 MB of cache occupancy. If you move
+ to a new group and immediately check the occupancy of the old and new
+ groups you will likely see that the old group is still showing 3 MB and
+ the new group zero. When the task accesses locations still in cache from
+ before the move, the h/w does not update any counters. On a busy system
+ you will likely see the occupancy in the old group go down as cache lines
+ are evicted and re-used while the occupancy in the new group rises as
+ the task accesses memory and loads into the cache are counted based on
+ membership in the new group.
+
+ The same applies to cache allocation control. Moving a task to a group
+ with a smaller cache partition will not evict any cache lines. The
+ process may continue to use them from the old partition.
+
+ Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID)
+ to identify a control group and a monitoring group respectively. Each of
+ the resource groups are mapped to these IDs based on the kind of group. The
+ number of CLOSid and RMID are limited by the hardware and hence the creation of
+ a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID
+ and creation of "MON" group may fail if we run out of RMIDs.
+
+ max_threshold_occupancy - generic concepts
+ ------------------------------------------
+
+ Note that an RMID once freed may not be immediately available for use as
+ the RMID is still tagged the cache lines of the previous user of RMID.
+ Hence such RMIDs are placed on limbo list and checked back if the cache
+ occupancy has gone down. If there is a time when system has a lot of
+ limbo RMIDs but which are not ready to be used, user may see an -EBUSY
+ during mkdir.
+
+ max_threshold_occupancy is a user configurable value to determine the
+ occupancy at which an RMID can be freed.
 
  Schemata files - general concepts
  ---------------------------------
···
  sharing a core will result in both threads being throttled to use the
  low bandwidth.
 
- L3 details (code and data prioritization disabled)
- --------------------------------------------------
+ L3 schemata file details (code and data prioritization disabled)
+ ----------------------------------------------------------------
  With CDP disabled the L3 schemata format is:
 
  	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
 
- L3 details (CDP enabled via mount option to resctrl)
- ----------------------------------------------------
+ L3 schemata file details (CDP enabled via mount option to resctrl)
+ ------------------------------------------------------------------
  When CDP is enabled L3 control is split into two separate resources
  so you can specify independent masks for code and data like this:
 
  	L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
  	L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
 
- L2 details
- ----------
+ L2 schemata file details
+ ------------------------
  L2 cache does not support code and data prioritization, so the
  schemata format is always:
···
  	# cat schemata
  	L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
  	L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+
+ Examples for RDT allocation usage:
 
  Example 1
  ---------
···
  	/* code to read and write directory contents */
  	resctrl_release_lock(fd);
  }
+
+ Examples for RDT Monitoring along with allocation usage:
+
+ Reading monitored data
+ ----------------------
+ Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
+ show the current snapshot of LLC occupancy of the corresponding MON
+ group or CTRL_MON group.
+
+
+ Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
+ ---------
+ On a two socket machine (one L3 cache per socket) with just four bits
+ for cache bit masks
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+ # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+ # echo 5678 > p1/tasks
+ # echo 5679 > p1/tasks
+
+ The default resource group is unmodified, so we have access to all parts
+ of all caches (its schemata file reads "L3:0=f;1=f").
+
+ Tasks that are under the control of group "p0" may only allocate from the
+ "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+ Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+ Create monitor groups and assign a subset of tasks to each monitor group.
+
+ # cd /sys/fs/resctrl/p1/mon_groups
+ # mkdir m11 m12
+ # echo 5678 > m11/tasks
+ # echo 5679 > m12/tasks
+
+ fetch data (data shown in bytes)
+
+ # cat m11/mon_data/mon_L3_00/llc_occupancy
+ 16234000
+ # cat m11/mon_data/mon_L3_01/llc_occupancy
+ 14789000
+ # cat m12/mon_data/mon_L3_00/llc_occupancy
+ 16789000
+
+ The parent ctrl_mon group shows the aggregated data.
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
+ 31234000
+
+ Example 2 (Monitor a task from its creation)
+ ---------
+ On a two socket machine (one L3 cache per socket)
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+
+ An RMID is allocated to the group once its created and hence the <cmd>
+ below is monitored from its creation.
+
+ # echo $$ > /sys/fs/resctrl/p1/tasks
+ # <cmd>
+
+ Fetch the data
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
+ 31789000
+
+ Example 3 (Monitor without CAT support or before creating CAT groups)
+ ---------
+
+ Assume a system like HSW has only CQM and no CAT support. In this case
+ the resctrl will still mount but cannot create CTRL_MON directories.
+ But user can create different MON groups within the root group thereby
+ able to monitor all tasks including kernel threads.
+
+ This can also be used to profile jobs cache size footprint before being
+ able to allocate them to different allocation groups.
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir mon_groups/m01
+ # mkdir mon_groups/m02
+
+ # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
+ # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
+
+ Monitor the groups separately and also get per domain data. From the
+ below its apparent that the tasks are mostly doing work on
+ domain(socket) 0.
+
+ # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy
+ 31234000
+ # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy
+ 34555
+ # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy
+ 31234000
+ # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy
+ 32789
+
+
+ Example 4 (Monitor real time tasks)
+ -----------------------------------
+
+ A single socket system which has real time tasks running on cores 4-7
+ and non real time tasks on other cpus. We want to monitor the cache
+ occupancy of the real time threads on these cores.
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p1
+
+ Move the cpus 4-7 over to p1
+ # echo f0 > p0/cpus
+
+ View the llc occupancy snapshot
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 11234000
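
The mbm_total_bytes and mbm_local_bytes files above are cumulative byte
counters, so a bandwidth figure has to be derived by sampling them over
an interval. A small illustrative sketch (the one second interval and
the group path "p1" are assumptions, not part of the documented
interface):

  # F=/sys/fs/resctrl/p1/mon_data/mon_L3_00/mbm_total_bytes
  # a=$(cat $F); sleep 1; b=$(cat $F)
  # echo "approx. total memory bandwidth: $((b - a)) bytes/s"
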
MAINTAINERS | +1 -1
···
  L:	linux-kernel@vger.kernel.org
  S:	Supported
  F:	arch/x86/kernel/cpu/intel_rdt*
- F:	arch/x86/include/asm/intel_rdt*
+ F:	arch/x86/include/asm/intel_rdt_sched.h
  F:	Documentation/x86/intel_rdt*
 
  READ-COPY UPDATE (RCU)
arch/x86/Kconfig | +6 -6
···
  	def_bool y
  	depends on X86_GOLDFISH
 
- config INTEL_RDT_A
- 	bool "Intel Resource Director Technology Allocation support"
+ config INTEL_RDT
+ 	bool "Intel Resource Director Technology support"
  	default n
  	depends on X86 && CPU_SUP_INTEL
  	select KERNFS
  	help
- 	  Select to enable resource allocation which is a sub-feature of
- 	  Intel Resource Director Technology(RDT). More information about
- 	  RDT can be found in the Intel x86 Architecture Software
- 	  Developer Manual.
+ 	  Select to enable resource allocation and monitoring which are
+ 	  sub-features of Intel Resource Director Technology(RDT). More
+ 	  information about RDT can be found in the Intel x86
+ 	  Architecture Software Developer Manual.
 
  	  Say N if unsure.
arch/x86/events/intel/Makefile | +1 -1
···
- obj-$(CONFIG_CPU_SUP_INTEL)		+= core.o bts.o cqm.o
+ obj-$(CONFIG_CPU_SUP_INTEL)		+= core.o bts.o
  obj-$(CONFIG_CPU_SUP_INTEL)		+= ds.o knc.o
  obj-$(CONFIG_CPU_SUP_INTEL)		+= lbr.o p4.o p6.o pt.o
  obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)	+= intel-rapl-perf.o
arch/x86/events/intel/cqm.c | -1766
··· 1 - /* 2 - * Intel Cache Quality-of-Service Monitoring (CQM) support. 3 - * 4 - * Based very, very heavily on work by Peter Zijlstra. 5 - */ 6 - 7 - #include <linux/perf_event.h> 8 - #include <linux/slab.h> 9 - #include <asm/cpu_device_id.h> 10 - #include <asm/intel_rdt_common.h> 11 - #include "../perf_event.h" 12 - 13 - #define MSR_IA32_QM_CTR 0x0c8e 14 - #define MSR_IA32_QM_EVTSEL 0x0c8d 15 - 16 - #define MBM_CNTR_WIDTH 24 17 - /* 18 - * Guaranteed time in ms as per SDM where MBM counters will not overflow. 19 - */ 20 - #define MBM_CTR_OVERFLOW_TIME 1000 21 - 22 - static u32 cqm_max_rmid = -1; 23 - static unsigned int cqm_l3_scale; /* supposedly cacheline size */ 24 - static bool cqm_enabled, mbm_enabled; 25 - unsigned int mbm_socket_max; 26 - 27 - /* 28 - * The cached intel_pqr_state is strictly per CPU and can never be 29 - * updated from a remote CPU. Both functions which modify the state 30 - * (intel_cqm_event_start and intel_cqm_event_stop) are called with 31 - * interrupts disabled, which is sufficient for the protection. 32 - */ 33 - DEFINE_PER_CPU(struct intel_pqr_state, pqr_state); 34 - static struct hrtimer *mbm_timers; 35 - /** 36 - * struct sample - mbm event's (local or total) data 37 - * @total_bytes #bytes since we began monitoring 38 - * @prev_msr previous value of MSR 39 - */ 40 - struct sample { 41 - u64 total_bytes; 42 - u64 prev_msr; 43 - }; 44 - 45 - /* 46 - * samples profiled for total memory bandwidth type events 47 - */ 48 - static struct sample *mbm_total; 49 - /* 50 - * samples profiled for local memory bandwidth type events 51 - */ 52 - static struct sample *mbm_local; 53 - 54 - #define pkg_id topology_physical_package_id(smp_processor_id()) 55 - /* 56 - * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array. 57 - * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of 58 - * rmids per socket, an example is given below 59 - * RMID1 of Socket0: vrmid = 1 60 - * RMID1 of Socket1: vrmid = 1 * (cqm_max_rmid + 1) + 1 61 - * RMID1 of Socket2: vrmid = 2 * (cqm_max_rmid + 1) + 1 62 - */ 63 - #define rmid_2_index(rmid) ((pkg_id * (cqm_max_rmid + 1)) + rmid) 64 - /* 65 - * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru. 66 - * Also protects event->hw.cqm_rmid 67 - * 68 - * Hold either for stability, both for modification of ->hw.cqm_rmid. 69 - */ 70 - static DEFINE_MUTEX(cache_mutex); 71 - static DEFINE_RAW_SPINLOCK(cache_lock); 72 - 73 - /* 74 - * Groups of events that have the same target(s), one RMID per group. 75 - */ 76 - static LIST_HEAD(cache_groups); 77 - 78 - /* 79 - * Mask of CPUs for reading CQM values. We only need one per-socket. 80 - */ 81 - static cpumask_t cqm_cpumask; 82 - 83 - #define RMID_VAL_ERROR (1ULL << 63) 84 - #define RMID_VAL_UNAVAIL (1ULL << 62) 85 - 86 - /* 87 - * Event IDs are used to program IA32_QM_EVTSEL before reading event 88 - * counter from IA32_QM_CTR 89 - */ 90 - #define QOS_L3_OCCUP_EVENT_ID 0x01 91 - #define QOS_MBM_TOTAL_EVENT_ID 0x02 92 - #define QOS_MBM_LOCAL_EVENT_ID 0x03 93 - 94 - /* 95 - * This is central to the rotation algorithm in __intel_cqm_rmid_rotate(). 96 - * 97 - * This rmid is always free and is guaranteed to have an associated 98 - * near-zero occupancy value, i.e. no cachelines are tagged with this 99 - * RMID, once __intel_cqm_rmid_rotate() returns. 100 - */ 101 - static u32 intel_cqm_rotation_rmid; 102 - 103 - #define INVALID_RMID (-1) 104 - 105 - /* 106 - * Is @rmid valid for programming the hardware? 
107 - * 108 - * rmid 0 is reserved by the hardware for all non-monitored tasks, which 109 - * means that we should never come across an rmid with that value. 110 - * Likewise, an rmid value of -1 is used to indicate "no rmid currently 111 - * assigned" and is used as part of the rotation code. 112 - */ 113 - static inline bool __rmid_valid(u32 rmid) 114 - { 115 - if (!rmid || rmid == INVALID_RMID) 116 - return false; 117 - 118 - return true; 119 - } 120 - 121 - static u64 __rmid_read(u32 rmid) 122 - { 123 - u64 val; 124 - 125 - /* 126 - * Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt, 127 - * it just says that to increase confusion. 128 - */ 129 - wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid); 130 - rdmsrl(MSR_IA32_QM_CTR, val); 131 - 132 - /* 133 - * Aside from the ERROR and UNAVAIL bits, assume this thing returns 134 - * the number of cachelines tagged with @rmid. 135 - */ 136 - return val; 137 - } 138 - 139 - enum rmid_recycle_state { 140 - RMID_YOUNG = 0, 141 - RMID_AVAILABLE, 142 - RMID_DIRTY, 143 - }; 144 - 145 - struct cqm_rmid_entry { 146 - u32 rmid; 147 - enum rmid_recycle_state state; 148 - struct list_head list; 149 - unsigned long queue_time; 150 - }; 151 - 152 - /* 153 - * cqm_rmid_free_lru - A least recently used list of RMIDs. 154 - * 155 - * Oldest entry at the head, newest (most recently used) entry at the 156 - * tail. This list is never traversed, it's only used to keep track of 157 - * the lru order. That is, we only pick entries of the head or insert 158 - * them on the tail. 159 - * 160 - * All entries on the list are 'free', and their RMIDs are not currently 161 - * in use. To mark an RMID as in use, remove its entry from the lru 162 - * list. 163 - * 164 - * 165 - * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs. 166 - * 167 - * This list is contains RMIDs that no one is currently using but that 168 - * may have a non-zero occupancy value associated with them. The 169 - * rotation worker moves RMIDs from the limbo list to the free list once 170 - * the occupancy value drops below __intel_cqm_threshold. 171 - * 172 - * Both lists are protected by cache_mutex. 173 - */ 174 - static LIST_HEAD(cqm_rmid_free_lru); 175 - static LIST_HEAD(cqm_rmid_limbo_lru); 176 - 177 - /* 178 - * We use a simple array of pointers so that we can lookup a struct 179 - * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid() 180 - * and __put_rmid() from having to worry about dealing with struct 181 - * cqm_rmid_entry - they just deal with rmids, i.e. integers. 182 - * 183 - * Once this array is initialized it is read-only. No locks are required 184 - * to access it. 185 - * 186 - * All entries for all RMIDs can be looked up in the this array at all 187 - * times. 188 - */ 189 - static struct cqm_rmid_entry **cqm_rmid_ptrs; 190 - 191 - static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid) 192 - { 193 - struct cqm_rmid_entry *entry; 194 - 195 - entry = cqm_rmid_ptrs[rmid]; 196 - WARN_ON(entry->rmid != rmid); 197 - 198 - return entry; 199 - } 200 - 201 - /* 202 - * Returns < 0 on fail. 203 - * 204 - * We expect to be called with cache_mutex held. 
205 - */ 206 - static u32 __get_rmid(void) 207 - { 208 - struct cqm_rmid_entry *entry; 209 - 210 - lockdep_assert_held(&cache_mutex); 211 - 212 - if (list_empty(&cqm_rmid_free_lru)) 213 - return INVALID_RMID; 214 - 215 - entry = list_first_entry(&cqm_rmid_free_lru, struct cqm_rmid_entry, list); 216 - list_del(&entry->list); 217 - 218 - return entry->rmid; 219 - } 220 - 221 - static void __put_rmid(u32 rmid) 222 - { 223 - struct cqm_rmid_entry *entry; 224 - 225 - lockdep_assert_held(&cache_mutex); 226 - 227 - WARN_ON(!__rmid_valid(rmid)); 228 - entry = __rmid_entry(rmid); 229 - 230 - entry->queue_time = jiffies; 231 - entry->state = RMID_YOUNG; 232 - 233 - list_add_tail(&entry->list, &cqm_rmid_limbo_lru); 234 - } 235 - 236 - static void cqm_cleanup(void) 237 - { 238 - int i; 239 - 240 - if (!cqm_rmid_ptrs) 241 - return; 242 - 243 - for (i = 0; i < cqm_max_rmid; i++) 244 - kfree(cqm_rmid_ptrs[i]); 245 - 246 - kfree(cqm_rmid_ptrs); 247 - cqm_rmid_ptrs = NULL; 248 - cqm_enabled = false; 249 - } 250 - 251 - static int intel_cqm_setup_rmid_cache(void) 252 - { 253 - struct cqm_rmid_entry *entry; 254 - unsigned int nr_rmids; 255 - int r = 0; 256 - 257 - nr_rmids = cqm_max_rmid + 1; 258 - cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) * 259 - nr_rmids, GFP_KERNEL); 260 - if (!cqm_rmid_ptrs) 261 - return -ENOMEM; 262 - 263 - for (; r <= cqm_max_rmid; r++) { 264 - struct cqm_rmid_entry *entry; 265 - 266 - entry = kmalloc(sizeof(*entry), GFP_KERNEL); 267 - if (!entry) 268 - goto fail; 269 - 270 - INIT_LIST_HEAD(&entry->list); 271 - entry->rmid = r; 272 - cqm_rmid_ptrs[r] = entry; 273 - 274 - list_add_tail(&entry->list, &cqm_rmid_free_lru); 275 - } 276 - 277 - /* 278 - * RMID 0 is special and is always allocated. It's used for all 279 - * tasks that are not monitored. 280 - */ 281 - entry = __rmid_entry(0); 282 - list_del(&entry->list); 283 - 284 - mutex_lock(&cache_mutex); 285 - intel_cqm_rotation_rmid = __get_rmid(); 286 - mutex_unlock(&cache_mutex); 287 - 288 - return 0; 289 - 290 - fail: 291 - cqm_cleanup(); 292 - return -ENOMEM; 293 - } 294 - 295 - /* 296 - * Determine if @a and @b measure the same set of tasks. 297 - * 298 - * If @a and @b measure the same set of tasks then we want to share a 299 - * single RMID. 300 - */ 301 - static bool __match_event(struct perf_event *a, struct perf_event *b) 302 - { 303 - /* Per-cpu and task events don't mix */ 304 - if ((a->attach_state & PERF_ATTACH_TASK) != 305 - (b->attach_state & PERF_ATTACH_TASK)) 306 - return false; 307 - 308 - #ifdef CONFIG_CGROUP_PERF 309 - if (a->cgrp != b->cgrp) 310 - return false; 311 - #endif 312 - 313 - /* If not task event, we're machine wide */ 314 - if (!(b->attach_state & PERF_ATTACH_TASK)) 315 - return true; 316 - 317 - /* 318 - * Events that target same task are placed into the same cache group. 319 - * Mark it as a multi event group, so that we update ->count 320 - * for every event rather than just the group leader later. 321 - */ 322 - if (a->hw.target == b->hw.target) { 323 - b->hw.is_group_event = true; 324 - return true; 325 - } 326 - 327 - /* 328 - * Are we an inherited event? 
329 - */ 330 - if (b->parent == a) 331 - return true; 332 - 333 - return false; 334 - } 335 - 336 - #ifdef CONFIG_CGROUP_PERF 337 - static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event) 338 - { 339 - if (event->attach_state & PERF_ATTACH_TASK) 340 - return perf_cgroup_from_task(event->hw.target, event->ctx); 341 - 342 - return event->cgrp; 343 - } 344 - #endif 345 - 346 - /* 347 - * Determine if @a's tasks intersect with @b's tasks 348 - * 349 - * There are combinations of events that we explicitly prohibit, 350 - * 351 - * PROHIBITS 352 - * system-wide -> cgroup and task 353 - * cgroup -> system-wide 354 - * -> task in cgroup 355 - * task -> system-wide 356 - * -> task in cgroup 357 - * 358 - * Call this function before allocating an RMID. 359 - */ 360 - static bool __conflict_event(struct perf_event *a, struct perf_event *b) 361 - { 362 - #ifdef CONFIG_CGROUP_PERF 363 - /* 364 - * We can have any number of cgroups but only one system-wide 365 - * event at a time. 366 - */ 367 - if (a->cgrp && b->cgrp) { 368 - struct perf_cgroup *ac = a->cgrp; 369 - struct perf_cgroup *bc = b->cgrp; 370 - 371 - /* 372 - * This condition should have been caught in 373 - * __match_event() and we should be sharing an RMID. 374 - */ 375 - WARN_ON_ONCE(ac == bc); 376 - 377 - if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) || 378 - cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup)) 379 - return true; 380 - 381 - return false; 382 - } 383 - 384 - if (a->cgrp || b->cgrp) { 385 - struct perf_cgroup *ac, *bc; 386 - 387 - /* 388 - * cgroup and system-wide events are mutually exclusive 389 - */ 390 - if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) || 391 - (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK))) 392 - return true; 393 - 394 - /* 395 - * Ensure neither event is part of the other's cgroup 396 - */ 397 - ac = event_to_cgroup(a); 398 - bc = event_to_cgroup(b); 399 - if (ac == bc) 400 - return true; 401 - 402 - /* 403 - * Must have cgroup and non-intersecting task events. 404 - */ 405 - if (!ac || !bc) 406 - return false; 407 - 408 - /* 409 - * We have cgroup and task events, and the task belongs 410 - * to a cgroup. Check for for overlap. 411 - */ 412 - if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) || 413 - cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup)) 414 - return true; 415 - 416 - return false; 417 - } 418 - #endif 419 - /* 420 - * If one of them is not a task, same story as above with cgroups. 421 - */ 422 - if (!(a->attach_state & PERF_ATTACH_TASK) || 423 - !(b->attach_state & PERF_ATTACH_TASK)) 424 - return true; 425 - 426 - /* 427 - * Must be non-overlapping. 428 - */ 429 - return false; 430 - } 431 - 432 - struct rmid_read { 433 - u32 rmid; 434 - u32 evt_type; 435 - atomic64_t value; 436 - }; 437 - 438 - static void __intel_cqm_event_count(void *info); 439 - static void init_mbm_sample(u32 rmid, u32 evt_type); 440 - static void __intel_mbm_event_count(void *info); 441 - 442 - static bool is_cqm_event(int e) 443 - { 444 - return (e == QOS_L3_OCCUP_EVENT_ID); 445 - } 446 - 447 - static bool is_mbm_event(int e) 448 - { 449 - return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_EVENT_ID); 450 - } 451 - 452 - static void cqm_mask_call(struct rmid_read *rr) 453 - { 454 - if (is_mbm_event(rr->evt_type)) 455 - on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, rr, 1); 456 - else 457 - on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, rr, 1); 458 - } 459 - 460 - /* 461 - * Exchange the RMID of a group of events. 
462 - */ 463 - static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid) 464 - { 465 - struct perf_event *event; 466 - struct list_head *head = &group->hw.cqm_group_entry; 467 - u32 old_rmid = group->hw.cqm_rmid; 468 - 469 - lockdep_assert_held(&cache_mutex); 470 - 471 - /* 472 - * If our RMID is being deallocated, perform a read now. 473 - */ 474 - if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) { 475 - struct rmid_read rr = { 476 - .rmid = old_rmid, 477 - .evt_type = group->attr.config, 478 - .value = ATOMIC64_INIT(0), 479 - }; 480 - 481 - cqm_mask_call(&rr); 482 - local64_set(&group->count, atomic64_read(&rr.value)); 483 - } 484 - 485 - raw_spin_lock_irq(&cache_lock); 486 - 487 - group->hw.cqm_rmid = rmid; 488 - list_for_each_entry(event, head, hw.cqm_group_entry) 489 - event->hw.cqm_rmid = rmid; 490 - 491 - raw_spin_unlock_irq(&cache_lock); 492 - 493 - /* 494 - * If the allocation is for mbm, init the mbm stats. 495 - * Need to check if each event in the group is mbm event 496 - * because there could be multiple type of events in the same group. 497 - */ 498 - if (__rmid_valid(rmid)) { 499 - event = group; 500 - if (is_mbm_event(event->attr.config)) 501 - init_mbm_sample(rmid, event->attr.config); 502 - 503 - list_for_each_entry(event, head, hw.cqm_group_entry) { 504 - if (is_mbm_event(event->attr.config)) 505 - init_mbm_sample(rmid, event->attr.config); 506 - } 507 - } 508 - 509 - return old_rmid; 510 - } 511 - 512 - /* 513 - * If we fail to assign a new RMID for intel_cqm_rotation_rmid because 514 - * cachelines are still tagged with RMIDs in limbo, we progressively 515 - * increment the threshold until we find an RMID in limbo with <= 516 - * __intel_cqm_threshold lines tagged. This is designed to mitigate the 517 - * problem where cachelines tagged with an RMID are not steadily being 518 - * evicted. 519 - * 520 - * On successful rotations we decrease the threshold back towards zero. 521 - * 522 - * __intel_cqm_max_threshold provides an upper bound on the threshold, 523 - * and is measured in bytes because it's exposed to userland. 524 - */ 525 - static unsigned int __intel_cqm_threshold; 526 - static unsigned int __intel_cqm_max_threshold; 527 - 528 - /* 529 - * Test whether an RMID has a zero occupancy value on this cpu. 530 - */ 531 - static void intel_cqm_stable(void *arg) 532 - { 533 - struct cqm_rmid_entry *entry; 534 - 535 - list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) { 536 - if (entry->state != RMID_AVAILABLE) 537 - break; 538 - 539 - if (__rmid_read(entry->rmid) > __intel_cqm_threshold) 540 - entry->state = RMID_DIRTY; 541 - } 542 - } 543 - 544 - /* 545 - * If we have group events waiting for an RMID that don't conflict with 546 - * events already running, assign @rmid. 547 - */ 548 - static bool intel_cqm_sched_in_event(u32 rmid) 549 - { 550 - struct perf_event *leader, *event; 551 - 552 - lockdep_assert_held(&cache_mutex); 553 - 554 - leader = list_first_entry(&cache_groups, struct perf_event, 555 - hw.cqm_groups_entry); 556 - event = leader; 557 - 558 - list_for_each_entry_continue(event, &cache_groups, 559 - hw.cqm_groups_entry) { 560 - if (__rmid_valid(event->hw.cqm_rmid)) 561 - continue; 562 - 563 - if (__conflict_event(event, leader)) 564 - continue; 565 - 566 - intel_cqm_xchg_rmid(event, rmid); 567 - return true; 568 - } 569 - 570 - return false; 571 - } 572 - 573 - /* 574 - * Initially use this constant for both the limbo queue time and the 575 - * rotation timer interval, pmu::hrtimer_interval_ms. 
576 - * 577 - * They don't need to be the same, but the two are related since if you 578 - * rotate faster than you recycle RMIDs, you may run out of available 579 - * RMIDs. 580 - */ 581 - #define RMID_DEFAULT_QUEUE_TIME 250 /* ms */ 582 - 583 - static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME; 584 - 585 - /* 586 - * intel_cqm_rmid_stabilize - move RMIDs from limbo to free list 587 - * @nr_available: number of freeable RMIDs on the limbo list 588 - * 589 - * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no 590 - * cachelines are tagged with those RMIDs. After this we can reuse them 591 - * and know that the current set of active RMIDs is stable. 592 - * 593 - * Return %true or %false depending on whether stabilization needs to be 594 - * reattempted. 595 - * 596 - * If we return %true then @nr_available is updated to indicate the 597 - * number of RMIDs on the limbo list that have been queued for the 598 - * minimum queue time (RMID_AVAILABLE), but whose data occupancy values 599 - * are above __intel_cqm_threshold. 600 - */ 601 - static bool intel_cqm_rmid_stabilize(unsigned int *available) 602 - { 603 - struct cqm_rmid_entry *entry, *tmp; 604 - 605 - lockdep_assert_held(&cache_mutex); 606 - 607 - *available = 0; 608 - list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) { 609 - unsigned long min_queue_time; 610 - unsigned long now = jiffies; 611 - 612 - /* 613 - * We hold RMIDs placed into limbo for a minimum queue 614 - * time. Before the minimum queue time has elapsed we do 615 - * not recycle RMIDs. 616 - * 617 - * The reasoning is that until a sufficient time has 618 - * passed since we stopped using an RMID, any RMID 619 - * placed onto the limbo list will likely still have 620 - * data tagged in the cache, which means we'll probably 621 - * fail to recycle it anyway. 622 - * 623 - * We can save ourselves an expensive IPI by skipping 624 - * any RMIDs that have not been queued for the minimum 625 - * time. 626 - */ 627 - min_queue_time = entry->queue_time + 628 - msecs_to_jiffies(__rmid_queue_time_ms); 629 - 630 - if (time_after(min_queue_time, now)) 631 - break; 632 - 633 - entry->state = RMID_AVAILABLE; 634 - (*available)++; 635 - } 636 - 637 - /* 638 - * Fast return if none of the RMIDs on the limbo list have been 639 - * sitting on the queue for the minimum queue time. 640 - */ 641 - if (!*available) 642 - return false; 643 - 644 - /* 645 - * Test whether an RMID is free for each package. 646 - */ 647 - on_each_cpu_mask(&cqm_cpumask, intel_cqm_stable, NULL, true); 648 - 649 - list_for_each_entry_safe(entry, tmp, &cqm_rmid_limbo_lru, list) { 650 - /* 651 - * Exhausted all RMIDs that have waited min queue time. 652 - */ 653 - if (entry->state == RMID_YOUNG) 654 - break; 655 - 656 - if (entry->state == RMID_DIRTY) 657 - continue; 658 - 659 - list_del(&entry->list); /* remove from limbo */ 660 - 661 - /* 662 - * The rotation RMID gets priority if it's 663 - * currently invalid. In which case, skip adding 664 - * the RMID to the the free lru. 665 - */ 666 - if (!__rmid_valid(intel_cqm_rotation_rmid)) { 667 - intel_cqm_rotation_rmid = entry->rmid; 668 - continue; 669 - } 670 - 671 - /* 672 - * If we have groups waiting for RMIDs, hand 673 - * them one now provided they don't conflict. 674 - */ 675 - if (intel_cqm_sched_in_event(entry->rmid)) 676 - continue; 677 - 678 - /* 679 - * Otherwise place it onto the free list. 
680 - */ 681 - list_add_tail(&entry->list, &cqm_rmid_free_lru); 682 - } 683 - 684 - 685 - return __rmid_valid(intel_cqm_rotation_rmid); 686 - } 687 - 688 - /* 689 - * Pick a victim group and move it to the tail of the group list. 690 - * @next: The first group without an RMID 691 - */ 692 - static void __intel_cqm_pick_and_rotate(struct perf_event *next) 693 - { 694 - struct perf_event *rotor; 695 - u32 rmid; 696 - 697 - lockdep_assert_held(&cache_mutex); 698 - 699 - rotor = list_first_entry(&cache_groups, struct perf_event, 700 - hw.cqm_groups_entry); 701 - 702 - /* 703 - * The group at the front of the list should always have a valid 704 - * RMID. If it doesn't then no groups have RMIDs assigned and we 705 - * don't need to rotate the list. 706 - */ 707 - if (next == rotor) 708 - return; 709 - 710 - rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID); 711 - __put_rmid(rmid); 712 - 713 - list_rotate_left(&cache_groups); 714 - } 715 - 716 - /* 717 - * Deallocate the RMIDs from any events that conflict with @event, and 718 - * place them on the back of the group list. 719 - */ 720 - static void intel_cqm_sched_out_conflicting_events(struct perf_event *event) 721 - { 722 - struct perf_event *group, *g; 723 - u32 rmid; 724 - 725 - lockdep_assert_held(&cache_mutex); 726 - 727 - list_for_each_entry_safe(group, g, &cache_groups, hw.cqm_groups_entry) { 728 - if (group == event) 729 - continue; 730 - 731 - rmid = group->hw.cqm_rmid; 732 - 733 - /* 734 - * Skip events that don't have a valid RMID. 735 - */ 736 - if (!__rmid_valid(rmid)) 737 - continue; 738 - 739 - /* 740 - * No conflict? No problem! Leave the event alone. 741 - */ 742 - if (!__conflict_event(group, event)) 743 - continue; 744 - 745 - intel_cqm_xchg_rmid(group, INVALID_RMID); 746 - __put_rmid(rmid); 747 - } 748 - } 749 - 750 - /* 751 - * Attempt to rotate the groups and assign new RMIDs. 752 - * 753 - * We rotate for two reasons, 754 - * 1. To handle the scheduling of conflicting events 755 - * 2. To recycle RMIDs 756 - * 757 - * Rotating RMIDs is complicated because the hardware doesn't give us 758 - * any clues. 759 - * 760 - * There's problems with the hardware interface; when you change the 761 - * task:RMID map cachelines retain their 'old' tags, giving a skewed 762 - * picture. In order to work around this, we must always keep one free 763 - * RMID - intel_cqm_rotation_rmid. 764 - * 765 - * Rotation works by taking away an RMID from a group (the old RMID), 766 - * and assigning the free RMID to another group (the new RMID). We must 767 - * then wait for the old RMID to not be used (no cachelines tagged). 768 - * This ensure that all cachelines are tagged with 'active' RMIDs. At 769 - * this point we can start reading values for the new RMID and treat the 770 - * old RMID as the free RMID for the next rotation. 771 - * 772 - * Return %true or %false depending on whether we did any rotating. 773 - */ 774 - static bool __intel_cqm_rmid_rotate(void) 775 - { 776 - struct perf_event *group, *start = NULL; 777 - unsigned int threshold_limit; 778 - unsigned int nr_needed = 0; 779 - unsigned int nr_available; 780 - bool rotated = false; 781 - 782 - mutex_lock(&cache_mutex); 783 - 784 - again: 785 - /* 786 - * Fast path through this function if there are no groups and no 787 - * RMIDs that need cleaning. 
788 - */ 789 - if (list_empty(&cache_groups) && list_empty(&cqm_rmid_limbo_lru)) 790 - goto out; 791 - 792 - list_for_each_entry(group, &cache_groups, hw.cqm_groups_entry) { 793 - if (!__rmid_valid(group->hw.cqm_rmid)) { 794 - if (!start) 795 - start = group; 796 - nr_needed++; 797 - } 798 - } 799 - 800 - /* 801 - * We have some event groups, but they all have RMIDs assigned 802 - * and no RMIDs need cleaning. 803 - */ 804 - if (!nr_needed && list_empty(&cqm_rmid_limbo_lru)) 805 - goto out; 806 - 807 - if (!nr_needed) 808 - goto stabilize; 809 - 810 - /* 811 - * We have more event groups without RMIDs than available RMIDs, 812 - * or we have event groups that conflict with the ones currently 813 - * scheduled. 814 - * 815 - * We force deallocate the rmid of the group at the head of 816 - * cache_groups. The first event group without an RMID then gets 817 - * assigned intel_cqm_rotation_rmid. This ensures we always make 818 - * forward progress. 819 - * 820 - * Rotate the cache_groups list so the previous head is now the 821 - * tail. 822 - */ 823 - __intel_cqm_pick_and_rotate(start); 824 - 825 - /* 826 - * If the rotation is going to succeed, reduce the threshold so 827 - * that we don't needlessly reuse dirty RMIDs. 828 - */ 829 - if (__rmid_valid(intel_cqm_rotation_rmid)) { 830 - intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid); 831 - intel_cqm_rotation_rmid = __get_rmid(); 832 - 833 - intel_cqm_sched_out_conflicting_events(start); 834 - 835 - if (__intel_cqm_threshold) 836 - __intel_cqm_threshold--; 837 - } 838 - 839 - rotated = true; 840 - 841 - stabilize: 842 - /* 843 - * We now need to stablize the RMID we freed above (if any) to 844 - * ensure that the next time we rotate we have an RMID with zero 845 - * occupancy value. 846 - * 847 - * Alternatively, if we didn't need to perform any rotation, 848 - * we'll have a bunch of RMIDs in limbo that need stabilizing. 849 - */ 850 - threshold_limit = __intel_cqm_max_threshold / cqm_l3_scale; 851 - 852 - while (intel_cqm_rmid_stabilize(&nr_available) && 853 - __intel_cqm_threshold < threshold_limit) { 854 - unsigned int steal_limit; 855 - 856 - /* 857 - * Don't spin if nobody is actively waiting for an RMID, 858 - * the rotation worker will be kicked as soon as an 859 - * event needs an RMID anyway. 860 - */ 861 - if (!nr_needed) 862 - break; 863 - 864 - /* Allow max 25% of RMIDs to be in limbo. */ 865 - steal_limit = (cqm_max_rmid + 1) / 4; 866 - 867 - /* 868 - * We failed to stabilize any RMIDs so our rotation 869 - * logic is now stuck. In order to make forward progress 870 - * we have a few options: 871 - * 872 - * 1. rotate ("steal") another RMID 873 - * 2. increase the threshold 874 - * 3. do nothing 875 - * 876 - * We do both of 1. and 2. until we hit the steal limit. 877 - * 878 - * The steal limit prevents all RMIDs ending up on the 879 - * limbo list. This can happen if every RMID has a 880 - * non-zero occupancy above threshold_limit, and the 881 - * occupancy values aren't dropping fast enough. 882 - * 883 - * Note that there is prioritisation at work here - we'd 884 - * rather increase the number of RMIDs on the limbo list 885 - * than increase the threshold, because increasing the 886 - * threshold skews the event data (because we reuse 887 - * dirty RMIDs) - threshold bumps are a last resort. 
888 - */ 889 - if (nr_available < steal_limit) 890 - goto again; 891 - 892 - __intel_cqm_threshold++; 893 - } 894 - 895 - out: 896 - mutex_unlock(&cache_mutex); 897 - return rotated; 898 - } 899 - 900 - static void intel_cqm_rmid_rotate(struct work_struct *work); 901 - 902 - static DECLARE_DELAYED_WORK(intel_cqm_rmid_work, intel_cqm_rmid_rotate); 903 - 904 - static struct pmu intel_cqm_pmu; 905 - 906 - static void intel_cqm_rmid_rotate(struct work_struct *work) 907 - { 908 - unsigned long delay; 909 - 910 - __intel_cqm_rmid_rotate(); 911 - 912 - delay = msecs_to_jiffies(intel_cqm_pmu.hrtimer_interval_ms); 913 - schedule_delayed_work(&intel_cqm_rmid_work, delay); 914 - } 915 - 916 - static u64 update_sample(unsigned int rmid, u32 evt_type, int first) 917 - { 918 - struct sample *mbm_current; 919 - u32 vrmid = rmid_2_index(rmid); 920 - u64 val, bytes, shift; 921 - u32 eventid; 922 - 923 - if (evt_type == QOS_MBM_LOCAL_EVENT_ID) { 924 - mbm_current = &mbm_local[vrmid]; 925 - eventid = QOS_MBM_LOCAL_EVENT_ID; 926 - } else { 927 - mbm_current = &mbm_total[vrmid]; 928 - eventid = QOS_MBM_TOTAL_EVENT_ID; 929 - } 930 - 931 - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid); 932 - rdmsrl(MSR_IA32_QM_CTR, val); 933 - if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) 934 - return mbm_current->total_bytes; 935 - 936 - if (first) { 937 - mbm_current->prev_msr = val; 938 - mbm_current->total_bytes = 0; 939 - return mbm_current->total_bytes; 940 - } 941 - 942 - /* 943 - * The h/w guarantees that counters will not overflow 944 - * so long as we poll them at least once per second. 945 - */ 946 - shift = 64 - MBM_CNTR_WIDTH; 947 - bytes = (val << shift) - (mbm_current->prev_msr << shift); 948 - bytes >>= shift; 949 - 950 - bytes *= cqm_l3_scale; 951 - 952 - mbm_current->total_bytes += bytes; 953 - mbm_current->prev_msr = val; 954 - 955 - return mbm_current->total_bytes; 956 - } 957 - 958 - static u64 rmid_read_mbm(unsigned int rmid, u32 evt_type) 959 - { 960 - return update_sample(rmid, evt_type, 0); 961 - } 962 - 963 - static void __intel_mbm_event_init(void *info) 964 - { 965 - struct rmid_read *rr = info; 966 - 967 - update_sample(rr->rmid, rr->evt_type, 1); 968 - } 969 - 970 - static void init_mbm_sample(u32 rmid, u32 evt_type) 971 - { 972 - struct rmid_read rr = { 973 - .rmid = rmid, 974 - .evt_type = evt_type, 975 - .value = ATOMIC64_INIT(0), 976 - }; 977 - 978 - /* on each socket, init sample */ 979 - on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1); 980 - } 981 - 982 - /* 983 - * Find a group and setup RMID. 984 - * 985 - * If we're part of a group, we use the group's RMID. 986 - */ 987 - static void intel_cqm_setup_event(struct perf_event *event, 988 - struct perf_event **group) 989 - { 990 - struct perf_event *iter; 991 - bool conflict = false; 992 - u32 rmid; 993 - 994 - event->hw.is_group_event = false; 995 - list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) { 996 - rmid = iter->hw.cqm_rmid; 997 - 998 - if (__match_event(iter, event)) { 999 - /* All tasks in a group share an RMID */ 1000 - event->hw.cqm_rmid = rmid; 1001 - *group = iter; 1002 - if (is_mbm_event(event->attr.config) && __rmid_valid(rmid)) 1003 - init_mbm_sample(rmid, event->attr.config); 1004 - return; 1005 - } 1006 - 1007 - /* 1008 - * We only care about conflicts for events that are 1009 - * actually scheduled in (and hence have a valid RMID). 
1010 - */ 1011 - if (__conflict_event(iter, event) && __rmid_valid(rmid)) 1012 - conflict = true; 1013 - } 1014 - 1015 - if (conflict) 1016 - rmid = INVALID_RMID; 1017 - else 1018 - rmid = __get_rmid(); 1019 - 1020 - if (is_mbm_event(event->attr.config) && __rmid_valid(rmid)) 1021 - init_mbm_sample(rmid, event->attr.config); 1022 - 1023 - event->hw.cqm_rmid = rmid; 1024 - } 1025 - 1026 - static void intel_cqm_event_read(struct perf_event *event) 1027 - { 1028 - unsigned long flags; 1029 - u32 rmid; 1030 - u64 val; 1031 - 1032 - /* 1033 - * Task events are handled by intel_cqm_event_count(). 1034 - */ 1035 - if (event->cpu == -1) 1036 - return; 1037 - 1038 - raw_spin_lock_irqsave(&cache_lock, flags); 1039 - rmid = event->hw.cqm_rmid; 1040 - 1041 - if (!__rmid_valid(rmid)) 1042 - goto out; 1043 - 1044 - if (is_mbm_event(event->attr.config)) 1045 - val = rmid_read_mbm(rmid, event->attr.config); 1046 - else 1047 - val = __rmid_read(rmid); 1048 - 1049 - /* 1050 - * Ignore this reading on error states and do not update the value. 1051 - */ 1052 - if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) 1053 - goto out; 1054 - 1055 - local64_set(&event->count, val); 1056 - out: 1057 - raw_spin_unlock_irqrestore(&cache_lock, flags); 1058 - } 1059 - 1060 - static void __intel_cqm_event_count(void *info) 1061 - { 1062 - struct rmid_read *rr = info; 1063 - u64 val; 1064 - 1065 - val = __rmid_read(rr->rmid); 1066 - 1067 - if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) 1068 - return; 1069 - 1070 - atomic64_add(val, &rr->value); 1071 - } 1072 - 1073 - static inline bool cqm_group_leader(struct perf_event *event) 1074 - { 1075 - return !list_empty(&event->hw.cqm_groups_entry); 1076 - } 1077 - 1078 - static void __intel_mbm_event_count(void *info) 1079 - { 1080 - struct rmid_read *rr = info; 1081 - u64 val; 1082 - 1083 - val = rmid_read_mbm(rr->rmid, rr->evt_type); 1084 - if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) 1085 - return; 1086 - atomic64_add(val, &rr->value); 1087 - } 1088 - 1089 - static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer) 1090 - { 1091 - struct perf_event *iter, *iter1; 1092 - int ret = HRTIMER_RESTART; 1093 - struct list_head *head; 1094 - unsigned long flags; 1095 - u32 grp_rmid; 1096 - 1097 - /* 1098 - * Need to cache_lock as the timer Event Select MSR reads 1099 - * can race with the mbm/cqm count() and mbm_init() reads. 
1100 - */ 1101 - raw_spin_lock_irqsave(&cache_lock, flags); 1102 - 1103 - if (list_empty(&cache_groups)) { 1104 - ret = HRTIMER_NORESTART; 1105 - goto out; 1106 - } 1107 - 1108 - list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) { 1109 - grp_rmid = iter->hw.cqm_rmid; 1110 - if (!__rmid_valid(grp_rmid)) 1111 - continue; 1112 - if (is_mbm_event(iter->attr.config)) 1113 - update_sample(grp_rmid, iter->attr.config, 0); 1114 - 1115 - head = &iter->hw.cqm_group_entry; 1116 - if (list_empty(head)) 1117 - continue; 1118 - list_for_each_entry(iter1, head, hw.cqm_group_entry) { 1119 - if (!iter1->hw.is_group_event) 1120 - break; 1121 - if (is_mbm_event(iter1->attr.config)) 1122 - update_sample(iter1->hw.cqm_rmid, 1123 - iter1->attr.config, 0); 1124 - } 1125 - } 1126 - 1127 - hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME)); 1128 - out: 1129 - raw_spin_unlock_irqrestore(&cache_lock, flags); 1130 - 1131 - return ret; 1132 - } 1133 - 1134 - static void __mbm_start_timer(void *info) 1135 - { 1136 - hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME), 1137 - HRTIMER_MODE_REL_PINNED); 1138 - } 1139 - 1140 - static void __mbm_stop_timer(void *info) 1141 - { 1142 - hrtimer_cancel(&mbm_timers[pkg_id]); 1143 - } 1144 - 1145 - static void mbm_start_timers(void) 1146 - { 1147 - on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1); 1148 - } 1149 - 1150 - static void mbm_stop_timers(void) 1151 - { 1152 - on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1); 1153 - } 1154 - 1155 - static void mbm_hrtimer_init(void) 1156 - { 1157 - struct hrtimer *hr; 1158 - int i; 1159 - 1160 - for (i = 0; i < mbm_socket_max; i++) { 1161 - hr = &mbm_timers[i]; 1162 - hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL); 1163 - hr->function = mbm_hrtimer_handle; 1164 - } 1165 - } 1166 - 1167 - static u64 intel_cqm_event_count(struct perf_event *event) 1168 - { 1169 - unsigned long flags; 1170 - struct rmid_read rr = { 1171 - .evt_type = event->attr.config, 1172 - .value = ATOMIC64_INIT(0), 1173 - }; 1174 - 1175 - /* 1176 - * We only need to worry about task events. System-wide events 1177 - * are handled like usual, i.e. entirely with 1178 - * intel_cqm_event_read(). 1179 - */ 1180 - if (event->cpu != -1) 1181 - return __perf_event_count(event); 1182 - 1183 - /* 1184 - * Only the group leader gets to report values except in case of 1185 - * multiple events in the same group, we still need to read the 1186 - * other events.This stops us 1187 - * reporting duplicate values to userspace, and gives us a clear 1188 - * rule for which task gets to report the values. 1189 - * 1190 - * Note that it is impossible to attribute these values to 1191 - * specific packages - we forfeit that ability when we create 1192 - * task events. 1193 - */ 1194 - if (!cqm_group_leader(event) && !event->hw.is_group_event) 1195 - return 0; 1196 - 1197 - /* 1198 - * Getting up-to-date values requires an SMP IPI which is not 1199 - * possible if we're being called in interrupt context. Return 1200 - * the cached values instead. 1201 - */ 1202 - if (unlikely(in_interrupt())) 1203 - goto out; 1204 - 1205 - /* 1206 - * Notice that we don't perform the reading of an RMID 1207 - * atomically, because we can't hold a spin lock across the 1208 - * IPIs. 1209 - * 1210 - * Speculatively perform the read, since @event might be 1211 - * assigned a different (possibly invalid) RMID while we're 1212 - * busying performing the IPI calls. 
It's therefore necessary to 1213 - * check @event's RMID afterwards, and if it has changed, 1214 - * discard the result of the read. 1215 - */ 1216 - rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid); 1217 - 1218 - if (!__rmid_valid(rr.rmid)) 1219 - goto out; 1220 - 1221 - cqm_mask_call(&rr); 1222 - 1223 - raw_spin_lock_irqsave(&cache_lock, flags); 1224 - if (event->hw.cqm_rmid == rr.rmid) 1225 - local64_set(&event->count, atomic64_read(&rr.value)); 1226 - raw_spin_unlock_irqrestore(&cache_lock, flags); 1227 - out: 1228 - return __perf_event_count(event); 1229 - } 1230 - 1231 - static void intel_cqm_event_start(struct perf_event *event, int mode) 1232 - { 1233 - struct intel_pqr_state *state = this_cpu_ptr(&pqr_state); 1234 - u32 rmid = event->hw.cqm_rmid; 1235 - 1236 - if (!(event->hw.cqm_state & PERF_HES_STOPPED)) 1237 - return; 1238 - 1239 - event->hw.cqm_state &= ~PERF_HES_STOPPED; 1240 - 1241 - if (state->rmid_usecnt++) { 1242 - if (!WARN_ON_ONCE(state->rmid != rmid)) 1243 - return; 1244 - } else { 1245 - WARN_ON_ONCE(state->rmid); 1246 - } 1247 - 1248 - state->rmid = rmid; 1249 - wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid); 1250 - } 1251 - 1252 - static void intel_cqm_event_stop(struct perf_event *event, int mode) 1253 - { 1254 - struct intel_pqr_state *state = this_cpu_ptr(&pqr_state); 1255 - 1256 - if (event->hw.cqm_state & PERF_HES_STOPPED) 1257 - return; 1258 - 1259 - event->hw.cqm_state |= PERF_HES_STOPPED; 1260 - 1261 - intel_cqm_event_read(event); 1262 - 1263 - if (!--state->rmid_usecnt) { 1264 - state->rmid = 0; 1265 - wrmsr(MSR_IA32_PQR_ASSOC, 0, state->closid); 1266 - } else { 1267 - WARN_ON_ONCE(!state->rmid); 1268 - } 1269 - } 1270 - 1271 - static int intel_cqm_event_add(struct perf_event *event, int mode) 1272 - { 1273 - unsigned long flags; 1274 - u32 rmid; 1275 - 1276 - raw_spin_lock_irqsave(&cache_lock, flags); 1277 - 1278 - event->hw.cqm_state = PERF_HES_STOPPED; 1279 - rmid = event->hw.cqm_rmid; 1280 - 1281 - if (__rmid_valid(rmid) && (mode & PERF_EF_START)) 1282 - intel_cqm_event_start(event, mode); 1283 - 1284 - raw_spin_unlock_irqrestore(&cache_lock, flags); 1285 - 1286 - return 0; 1287 - } 1288 - 1289 - static void intel_cqm_event_destroy(struct perf_event *event) 1290 - { 1291 - struct perf_event *group_other = NULL; 1292 - unsigned long flags; 1293 - 1294 - mutex_lock(&cache_mutex); 1295 - /* 1296 - * Hold the cache_lock as mbm timer handlers could be 1297 - * scanning the list of events. 1298 - */ 1299 - raw_spin_lock_irqsave(&cache_lock, flags); 1300 - 1301 - /* 1302 - * If there's another event in this group... 1303 - */ 1304 - if (!list_empty(&event->hw.cqm_group_entry)) { 1305 - group_other = list_first_entry(&event->hw.cqm_group_entry, 1306 - struct perf_event, 1307 - hw.cqm_group_entry); 1308 - list_del(&event->hw.cqm_group_entry); 1309 - } 1310 - 1311 - /* 1312 - * And we're the group leader.. 1313 - */ 1314 - if (cqm_group_leader(event)) { 1315 - /* 1316 - * If there was a group_other, make that leader, otherwise 1317 - * destroy the group and return the RMID. 1318 - */ 1319 - if (group_other) { 1320 - list_replace(&event->hw.cqm_groups_entry, 1321 - &group_other->hw.cqm_groups_entry); 1322 - } else { 1323 - u32 rmid = event->hw.cqm_rmid; 1324 - 1325 - if (__rmid_valid(rmid)) 1326 - __put_rmid(rmid); 1327 - list_del(&event->hw.cqm_groups_entry); 1328 - } 1329 - } 1330 - 1331 - raw_spin_unlock_irqrestore(&cache_lock, flags); 1332 - 1333 - /* 1334 - * Stop the mbm overflow timers when the last event is destroyed. 
1335 - */ 1336 - if (mbm_enabled && list_empty(&cache_groups)) 1337 - mbm_stop_timers(); 1338 - 1339 - mutex_unlock(&cache_mutex); 1340 - } 1341 - 1342 - static int intel_cqm_event_init(struct perf_event *event) 1343 - { 1344 - struct perf_event *group = NULL; 1345 - bool rotate = false; 1346 - unsigned long flags; 1347 - 1348 - if (event->attr.type != intel_cqm_pmu.type) 1349 - return -ENOENT; 1350 - 1351 - if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) || 1352 - (event->attr.config > QOS_MBM_LOCAL_EVENT_ID)) 1353 - return -EINVAL; 1354 - 1355 - if ((is_cqm_event(event->attr.config) && !cqm_enabled) || 1356 - (is_mbm_event(event->attr.config) && !mbm_enabled)) 1357 - return -EINVAL; 1358 - 1359 - /* unsupported modes and filters */ 1360 - if (event->attr.exclude_user || 1361 - event->attr.exclude_kernel || 1362 - event->attr.exclude_hv || 1363 - event->attr.exclude_idle || 1364 - event->attr.exclude_host || 1365 - event->attr.exclude_guest || 1366 - event->attr.sample_period) /* no sampling */ 1367 - return -EINVAL; 1368 - 1369 - INIT_LIST_HEAD(&event->hw.cqm_group_entry); 1370 - INIT_LIST_HEAD(&event->hw.cqm_groups_entry); 1371 - 1372 - event->destroy = intel_cqm_event_destroy; 1373 - 1374 - mutex_lock(&cache_mutex); 1375 - 1376 - /* 1377 - * Start the mbm overflow timers when the first event is created. 1378 - */ 1379 - if (mbm_enabled && list_empty(&cache_groups)) 1380 - mbm_start_timers(); 1381 - 1382 - /* Will also set rmid */ 1383 - intel_cqm_setup_event(event, &group); 1384 - 1385 - /* 1386 - * Hold the cache_lock as mbm timer handlers be 1387 - * scanning the list of events. 1388 - */ 1389 - raw_spin_lock_irqsave(&cache_lock, flags); 1390 - 1391 - if (group) { 1392 - list_add_tail(&event->hw.cqm_group_entry, 1393 - &group->hw.cqm_group_entry); 1394 - } else { 1395 - list_add_tail(&event->hw.cqm_groups_entry, 1396 - &cache_groups); 1397 - 1398 - /* 1399 - * All RMIDs are either in use or have recently been 1400 - * used. Kick the rotation worker to clean/free some. 1401 - * 1402 - * We only do this for the group leader, rather than for 1403 - * every event in a group to save on needless work. 
1404 - */ 1405 - if (!__rmid_valid(event->hw.cqm_rmid)) 1406 - rotate = true; 1407 - } 1408 - 1409 - raw_spin_unlock_irqrestore(&cache_lock, flags); 1410 - mutex_unlock(&cache_mutex); 1411 - 1412 - if (rotate) 1413 - schedule_delayed_work(&intel_cqm_rmid_work, 0); 1414 - 1415 - return 0; 1416 - } 1417 - 1418 - EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01"); 1419 - EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1"); 1420 - EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes"); 1421 - EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL); 1422 - EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1"); 1423 - 1424 - EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02"); 1425 - EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1"); 1426 - EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB"); 1427 - EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6"); 1428 - 1429 - EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03"); 1430 - EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1"); 1431 - EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB"); 1432 - EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6"); 1433 - 1434 - static struct attribute *intel_cqm_events_attr[] = { 1435 - EVENT_PTR(intel_cqm_llc), 1436 - EVENT_PTR(intel_cqm_llc_pkg), 1437 - EVENT_PTR(intel_cqm_llc_unit), 1438 - EVENT_PTR(intel_cqm_llc_scale), 1439 - EVENT_PTR(intel_cqm_llc_snapshot), 1440 - NULL, 1441 - }; 1442 - 1443 - static struct attribute *intel_mbm_events_attr[] = { 1444 - EVENT_PTR(intel_cqm_total_bytes), 1445 - EVENT_PTR(intel_cqm_local_bytes), 1446 - EVENT_PTR(intel_cqm_total_bytes_pkg), 1447 - EVENT_PTR(intel_cqm_local_bytes_pkg), 1448 - EVENT_PTR(intel_cqm_total_bytes_unit), 1449 - EVENT_PTR(intel_cqm_local_bytes_unit), 1450 - EVENT_PTR(intel_cqm_total_bytes_scale), 1451 - EVENT_PTR(intel_cqm_local_bytes_scale), 1452 - NULL, 1453 - }; 1454 - 1455 - static struct attribute *intel_cmt_mbm_events_attr[] = { 1456 - EVENT_PTR(intel_cqm_llc), 1457 - EVENT_PTR(intel_cqm_total_bytes), 1458 - EVENT_PTR(intel_cqm_local_bytes), 1459 - EVENT_PTR(intel_cqm_llc_pkg), 1460 - EVENT_PTR(intel_cqm_total_bytes_pkg), 1461 - EVENT_PTR(intel_cqm_local_bytes_pkg), 1462 - EVENT_PTR(intel_cqm_llc_unit), 1463 - EVENT_PTR(intel_cqm_total_bytes_unit), 1464 - EVENT_PTR(intel_cqm_local_bytes_unit), 1465 - EVENT_PTR(intel_cqm_llc_scale), 1466 - EVENT_PTR(intel_cqm_total_bytes_scale), 1467 - EVENT_PTR(intel_cqm_local_bytes_scale), 1468 - EVENT_PTR(intel_cqm_llc_snapshot), 1469 - NULL, 1470 - }; 1471 - 1472 - static struct attribute_group intel_cqm_events_group = { 1473 - .name = "events", 1474 - .attrs = NULL, 1475 - }; 1476 - 1477 - PMU_FORMAT_ATTR(event, "config:0-7"); 1478 - static struct attribute *intel_cqm_formats_attr[] = { 1479 - &format_attr_event.attr, 1480 - NULL, 1481 - }; 1482 - 1483 - static struct attribute_group intel_cqm_format_group = { 1484 - .name = "format", 1485 - .attrs = intel_cqm_formats_attr, 1486 - }; 1487 - 1488 - static ssize_t 1489 - max_recycle_threshold_show(struct device *dev, struct device_attribute *attr, 1490 - char *page) 1491 - { 1492 - ssize_t rv; 1493 - 1494 - mutex_lock(&cache_mutex); 1495 - rv = snprintf(page, PAGE_SIZE-1, "%u\n", __intel_cqm_max_threshold); 1496 - mutex_unlock(&cache_mutex); 1497 - 1498 - return rv; 1499 - } 1500 - 1501 - static ssize_t 1502 - max_recycle_threshold_store(struct device *dev, 1503 - struct device_attribute 
*attr, 1504 - const char *buf, size_t count) 1505 - { 1506 - unsigned int bytes, cachelines; 1507 - int ret; 1508 - 1509 - ret = kstrtouint(buf, 0, &bytes); 1510 - if (ret) 1511 - return ret; 1512 - 1513 - mutex_lock(&cache_mutex); 1514 - 1515 - __intel_cqm_max_threshold = bytes; 1516 - cachelines = bytes / cqm_l3_scale; 1517 - 1518 - /* 1519 - * The new maximum takes effect immediately. 1520 - */ 1521 - if (__intel_cqm_threshold > cachelines) 1522 - __intel_cqm_threshold = cachelines; 1523 - 1524 - mutex_unlock(&cache_mutex); 1525 - 1526 - return count; 1527 - } 1528 - 1529 - static DEVICE_ATTR_RW(max_recycle_threshold); 1530 - 1531 - static struct attribute *intel_cqm_attrs[] = { 1532 - &dev_attr_max_recycle_threshold.attr, 1533 - NULL, 1534 - }; 1535 - 1536 - static const struct attribute_group intel_cqm_group = { 1537 - .attrs = intel_cqm_attrs, 1538 - }; 1539 - 1540 - static const struct attribute_group *intel_cqm_attr_groups[] = { 1541 - &intel_cqm_events_group, 1542 - &intel_cqm_format_group, 1543 - &intel_cqm_group, 1544 - NULL, 1545 - }; 1546 - 1547 - static struct pmu intel_cqm_pmu = { 1548 - .hrtimer_interval_ms = RMID_DEFAULT_QUEUE_TIME, 1549 - .attr_groups = intel_cqm_attr_groups, 1550 - .task_ctx_nr = perf_sw_context, 1551 - .event_init = intel_cqm_event_init, 1552 - .add = intel_cqm_event_add, 1553 - .del = intel_cqm_event_stop, 1554 - .start = intel_cqm_event_start, 1555 - .stop = intel_cqm_event_stop, 1556 - .read = intel_cqm_event_read, 1557 - .count = intel_cqm_event_count, 1558 - }; 1559 - 1560 - static inline void cqm_pick_event_reader(int cpu) 1561 - { 1562 - int reader; 1563 - 1564 - /* First online cpu in package becomes the reader */ 1565 - reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu)); 1566 - if (reader >= nr_cpu_ids) 1567 - cpumask_set_cpu(cpu, &cqm_cpumask); 1568 - } 1569 - 1570 - static int intel_cqm_cpu_starting(unsigned int cpu) 1571 - { 1572 - struct intel_pqr_state *state = &per_cpu(pqr_state, cpu); 1573 - struct cpuinfo_x86 *c = &cpu_data(cpu); 1574 - 1575 - state->rmid = 0; 1576 - state->closid = 0; 1577 - state->rmid_usecnt = 0; 1578 - 1579 - WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid); 1580 - WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale); 1581 - 1582 - cqm_pick_event_reader(cpu); 1583 - return 0; 1584 - } 1585 - 1586 - static int intel_cqm_cpu_exit(unsigned int cpu) 1587 - { 1588 - int target; 1589 - 1590 - /* Is @cpu the current cqm reader for this package ? 
*/ 1591 - if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask)) 1592 - return 0; 1593 - 1594 - /* Find another online reader in this package */ 1595 - target = cpumask_any_but(topology_core_cpumask(cpu), cpu); 1596 - 1597 - if (target < nr_cpu_ids) 1598 - cpumask_set_cpu(target, &cqm_cpumask); 1599 - 1600 - return 0; 1601 - } 1602 - 1603 - static const struct x86_cpu_id intel_cqm_match[] = { 1604 - { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_OCCUP_LLC }, 1605 - {} 1606 - }; 1607 - 1608 - static void mbm_cleanup(void) 1609 - { 1610 - if (!mbm_enabled) 1611 - return; 1612 - 1613 - kfree(mbm_local); 1614 - kfree(mbm_total); 1615 - mbm_enabled = false; 1616 - } 1617 - 1618 - static const struct x86_cpu_id intel_mbm_local_match[] = { 1619 - { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL }, 1620 - {} 1621 - }; 1622 - 1623 - static const struct x86_cpu_id intel_mbm_total_match[] = { 1624 - { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL }, 1625 - {} 1626 - }; 1627 - 1628 - static int intel_mbm_init(void) 1629 - { 1630 - int ret = 0, array_size, maxid = cqm_max_rmid + 1; 1631 - 1632 - mbm_socket_max = topology_max_packages(); 1633 - array_size = sizeof(struct sample) * maxid * mbm_socket_max; 1634 - mbm_local = kmalloc(array_size, GFP_KERNEL); 1635 - if (!mbm_local) 1636 - return -ENOMEM; 1637 - 1638 - mbm_total = kmalloc(array_size, GFP_KERNEL); 1639 - if (!mbm_total) { 1640 - ret = -ENOMEM; 1641 - goto out; 1642 - } 1643 - 1644 - array_size = sizeof(struct hrtimer) * mbm_socket_max; 1645 - mbm_timers = kmalloc(array_size, GFP_KERNEL); 1646 - if (!mbm_timers) { 1647 - ret = -ENOMEM; 1648 - goto out; 1649 - } 1650 - mbm_hrtimer_init(); 1651 - 1652 - out: 1653 - if (ret) 1654 - mbm_cleanup(); 1655 - 1656 - return ret; 1657 - } 1658 - 1659 - static int __init intel_cqm_init(void) 1660 - { 1661 - char *str = NULL, scale[20]; 1662 - int cpu, ret; 1663 - 1664 - if (x86_match_cpu(intel_cqm_match)) 1665 - cqm_enabled = true; 1666 - 1667 - if (x86_match_cpu(intel_mbm_local_match) && 1668 - x86_match_cpu(intel_mbm_total_match)) 1669 - mbm_enabled = true; 1670 - 1671 - if (!cqm_enabled && !mbm_enabled) 1672 - return -ENODEV; 1673 - 1674 - cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale; 1675 - 1676 - /* 1677 - * It's possible that not all resources support the same number 1678 - * of RMIDs. Instead of making scheduling much more complicated 1679 - * (where we have to match a task's RMID to a cpu that supports 1680 - * that many RMIDs) just find the minimum RMIDs supported across 1681 - * all cpus. 1682 - * 1683 - * Also, check that the scales match on all cpus. 1684 - */ 1685 - cpus_read_lock(); 1686 - for_each_online_cpu(cpu) { 1687 - struct cpuinfo_x86 *c = &cpu_data(cpu); 1688 - 1689 - if (c->x86_cache_max_rmid < cqm_max_rmid) 1690 - cqm_max_rmid = c->x86_cache_max_rmid; 1691 - 1692 - if (c->x86_cache_occ_scale != cqm_l3_scale) { 1693 - pr_err("Multiple LLC scale values, disabling\n"); 1694 - ret = -EINVAL; 1695 - goto out; 1696 - } 1697 - } 1698 - 1699 - /* 1700 - * A reasonable upper limit on the max threshold is the number 1701 - * of lines tagged per RMID if all RMIDs have the same number of 1702 - * lines tagged in the LLC. 1703 - * 1704 - * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC. 
1705 - */ 1706 - __intel_cqm_max_threshold = 1707 - boot_cpu_data.x86_cache_size * 1024 / (cqm_max_rmid + 1); 1708 - 1709 - snprintf(scale, sizeof(scale), "%u", cqm_l3_scale); 1710 - str = kstrdup(scale, GFP_KERNEL); 1711 - if (!str) { 1712 - ret = -ENOMEM; 1713 - goto out; 1714 - } 1715 - 1716 - event_attr_intel_cqm_llc_scale.event_str = str; 1717 - 1718 - ret = intel_cqm_setup_rmid_cache(); 1719 - if (ret) 1720 - goto out; 1721 - 1722 - if (mbm_enabled) 1723 - ret = intel_mbm_init(); 1724 - if (ret && !cqm_enabled) 1725 - goto out; 1726 - 1727 - if (cqm_enabled && mbm_enabled) 1728 - intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr; 1729 - else if (!cqm_enabled && mbm_enabled) 1730 - intel_cqm_events_group.attrs = intel_mbm_events_attr; 1731 - else if (cqm_enabled && !mbm_enabled) 1732 - intel_cqm_events_group.attrs = intel_cqm_events_attr; 1733 - 1734 - ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1); 1735 - if (ret) { 1736 - pr_err("Intel CQM perf registration failed: %d\n", ret); 1737 - goto out; 1738 - } 1739 - 1740 - if (cqm_enabled) 1741 - pr_info("Intel CQM monitoring enabled\n"); 1742 - if (mbm_enabled) 1743 - pr_info("Intel MBM enabled\n"); 1744 - 1745 - /* 1746 - * Setup the hot cpu notifier once we are sure cqm 1747 - * is enabled to avoid notifier leak. 1748 - */ 1749 - cpuhp_setup_state_cpuslocked(CPUHP_AP_PERF_X86_CQM_STARTING, 1750 - "perf/x86/cqm:starting", 1751 - intel_cqm_cpu_starting, NULL); 1752 - cpuhp_setup_state_cpuslocked(CPUHP_AP_PERF_X86_CQM_ONLINE, 1753 - "perf/x86/cqm:online", 1754 - NULL, intel_cqm_cpu_exit); 1755 - out: 1756 - cpus_read_unlock(); 1757 - 1758 - if (ret) { 1759 - kfree(str); 1760 - cqm_cleanup(); 1761 - mbm_cleanup(); 1762 - } 1763 - 1764 - return ret; 1765 - } 1766 - device_initcall(intel_cqm_init);
-286
arch/x86/include/asm/intel_rdt.h
··· 1 - #ifndef _ASM_X86_INTEL_RDT_H 2 - #define _ASM_X86_INTEL_RDT_H 3 - 4 - #ifdef CONFIG_INTEL_RDT_A 5 - 6 - #include <linux/sched.h> 7 - #include <linux/kernfs.h> 8 - #include <linux/jump_label.h> 9 - 10 - #include <asm/intel_rdt_common.h> 11 - 12 - #define IA32_L3_QOS_CFG 0xc81 13 - #define IA32_L3_CBM_BASE 0xc90 14 - #define IA32_L2_CBM_BASE 0xd10 15 - #define IA32_MBA_THRTL_BASE 0xd50 16 - 17 - #define L3_QOS_CDP_ENABLE 0x01ULL 18 - 19 - /** 20 - * struct rdtgroup - store rdtgroup's data in resctrl file system. 21 - * @kn: kernfs node 22 - * @rdtgroup_list: linked list for all rdtgroups 23 - * @closid: closid for this rdtgroup 24 - * @cpu_mask: CPUs assigned to this rdtgroup 25 - * @flags: status bits 26 - * @waitcount: how many cpus expect to find this 27 - * group when they acquire rdtgroup_mutex 28 - */ 29 - struct rdtgroup { 30 - struct kernfs_node *kn; 31 - struct list_head rdtgroup_list; 32 - int closid; 33 - struct cpumask cpu_mask; 34 - int flags; 35 - atomic_t waitcount; 36 - }; 37 - 38 - /* rdtgroup.flags */ 39 - #define RDT_DELETED 1 40 - 41 - /* rftype.flags */ 42 - #define RFTYPE_FLAGS_CPUS_LIST 1 43 - 44 - /* List of all resource groups */ 45 - extern struct list_head rdt_all_groups; 46 - 47 - extern int max_name_width, max_data_width; 48 - 49 - int __init rdtgroup_init(void); 50 - 51 - /** 52 - * struct rftype - describe each file in the resctrl file system 53 - * @name: File name 54 - * @mode: Access mode 55 - * @kf_ops: File operations 56 - * @flags: File specific RFTYPE_FLAGS_* flags 57 - * @seq_show: Show content of the file 58 - * @write: Write to the file 59 - */ 60 - struct rftype { 61 - char *name; 62 - umode_t mode; 63 - struct kernfs_ops *kf_ops; 64 - unsigned long flags; 65 - 66 - int (*seq_show)(struct kernfs_open_file *of, 67 - struct seq_file *sf, void *v); 68 - /* 69 - * write() is the generic write callback which maps directly to 70 - * kernfs write operation and overrides all other operations. 71 - * Maximum write size is determined by ->max_write_len. 72 - */ 73 - ssize_t (*write)(struct kernfs_open_file *of, 74 - char *buf, size_t nbytes, loff_t off); 75 - }; 76 - 77 - /** 78 - * struct rdt_domain - group of cpus sharing an RDT resource 79 - * @list: all instances of this resource 80 - * @id: unique id for this instance 81 - * @cpu_mask: which cpus share this resource 82 - * @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID) 83 - * @new_ctrl: new ctrl value to be loaded 84 - * @have_new_ctrl: did user provide new_ctrl for this domain 85 - */ 86 - struct rdt_domain { 87 - struct list_head list; 88 - int id; 89 - struct cpumask cpu_mask; 90 - u32 *ctrl_val; 91 - u32 new_ctrl; 92 - bool have_new_ctrl; 93 - }; 94 - 95 - /** 96 - * struct msr_param - set a range of MSRs from a domain 97 - * @res: The resource to use 98 - * @low: Beginning index from base MSR 99 - * @high: End index 100 - */ 101 - struct msr_param { 102 - struct rdt_resource *res; 103 - int low; 104 - int high; 105 - }; 106 - 107 - /** 108 - * struct rdt_cache - Cache allocation related data 109 - * @cbm_len: Length of the cache bit mask 110 - * @min_cbm_bits: Minimum number of consecutive bits to be set 111 - * @cbm_idx_mult: Multiplier of CBM index 112 - * @cbm_idx_offset: Offset of CBM index. 
CBM index is computed by: 113 - * closid * cbm_idx_multi + cbm_idx_offset 114 - * in a cache bit mask 115 - */ 116 - struct rdt_cache { 117 - unsigned int cbm_len; 118 - unsigned int min_cbm_bits; 119 - unsigned int cbm_idx_mult; 120 - unsigned int cbm_idx_offset; 121 - }; 122 - 123 - /** 124 - * struct rdt_membw - Memory bandwidth allocation related data 125 - * @max_delay: Max throttle delay. Delay is the hardware 126 - * representation for memory bandwidth. 127 - * @min_bw: Minimum memory bandwidth percentage user can request 128 - * @bw_gran: Granularity at which the memory bandwidth is allocated 129 - * @delay_linear: True if memory B/W delay is in linear scale 130 - * @mb_map: Mapping of memory B/W percentage to memory B/W delay 131 - */ 132 - struct rdt_membw { 133 - u32 max_delay; 134 - u32 min_bw; 135 - u32 bw_gran; 136 - u32 delay_linear; 137 - u32 *mb_map; 138 - }; 139 - 140 - /** 141 - * struct rdt_resource - attributes of an RDT resource 142 - * @enabled: Is this feature enabled on this machine 143 - * @capable: Is this feature available on this machine 144 - * @name: Name to use in "schemata" file 145 - * @num_closid: Number of CLOSIDs available 146 - * @cache_level: Which cache level defines scope of this resource 147 - * @default_ctrl: Specifies default cache cbm or memory B/W percent. 148 - * @msr_base: Base MSR address for CBMs 149 - * @msr_update: Function pointer to update QOS MSRs 150 - * @data_width: Character width of data when displaying 151 - * @domains: All domains for this resource 152 - * @cache: Cache allocation related data 153 - * @info_files: resctrl info files for the resource 154 - * @nr_info_files: Number of info files 155 - * @format_str: Per resource format string to show domain value 156 - * @parse_ctrlval: Per resource function pointer to parse control values 157 - */ 158 - struct rdt_resource { 159 - bool enabled; 160 - bool capable; 161 - char *name; 162 - int num_closid; 163 - int cache_level; 164 - u32 default_ctrl; 165 - unsigned int msr_base; 166 - void (*msr_update) (struct rdt_domain *d, struct msr_param *m, 167 - struct rdt_resource *r); 168 - int data_width; 169 - struct list_head domains; 170 - struct rdt_cache cache; 171 - struct rdt_membw membw; 172 - struct rftype *info_files; 173 - int nr_info_files; 174 - const char *format_str; 175 - int (*parse_ctrlval) (char *buf, struct rdt_resource *r, 176 - struct rdt_domain *d); 177 - }; 178 - 179 - void rdt_get_cache_infofile(struct rdt_resource *r); 180 - void rdt_get_mba_infofile(struct rdt_resource *r); 181 - int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d); 182 - int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d); 183 - 184 - extern struct mutex rdtgroup_mutex; 185 - 186 - extern struct rdt_resource rdt_resources_all[]; 187 - extern struct rdtgroup rdtgroup_default; 188 - DECLARE_STATIC_KEY_FALSE(rdt_enable_key); 189 - 190 - int __init rdtgroup_init(void); 191 - 192 - enum { 193 - RDT_RESOURCE_L3, 194 - RDT_RESOURCE_L3DATA, 195 - RDT_RESOURCE_L3CODE, 196 - RDT_RESOURCE_L2, 197 - RDT_RESOURCE_MBA, 198 - 199 - /* Must be the last */ 200 - RDT_NUM_RESOURCES, 201 - }; 202 - 203 - #define for_each_capable_rdt_resource(r) \ 204 - for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\ 205 - r++) \ 206 - if (r->capable) 207 - 208 - #define for_each_enabled_rdt_resource(r) \ 209 - for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\ 210 - r++) \ 211 - if (r->enabled) 212 - 213 - /* CPUID.(EAX=10H, ECX=ResID=1).EAX */ 214 - 
union cpuid_0x10_1_eax { 215 - struct { 216 - unsigned int cbm_len:5; 217 - } split; 218 - unsigned int full; 219 - }; 220 - 221 - /* CPUID.(EAX=10H, ECX=ResID=3).EAX */ 222 - union cpuid_0x10_3_eax { 223 - struct { 224 - unsigned int max_delay:12; 225 - } split; 226 - unsigned int full; 227 - }; 228 - 229 - /* CPUID.(EAX=10H, ECX=ResID).EDX */ 230 - union cpuid_0x10_x_edx { 231 - struct { 232 - unsigned int cos_max:16; 233 - } split; 234 - unsigned int full; 235 - }; 236 - 237 - DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid); 238 - 239 - void rdt_ctrl_update(void *arg); 240 - struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn); 241 - void rdtgroup_kn_unlock(struct kernfs_node *kn); 242 - ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, 243 - char *buf, size_t nbytes, loff_t off); 244 - int rdtgroup_schemata_show(struct kernfs_open_file *of, 245 - struct seq_file *s, void *v); 246 - 247 - /* 248 - * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR 249 - * 250 - * Following considerations are made so that this has minimal impact 251 - * on scheduler hot path: 252 - * - This will stay as no-op unless we are running on an Intel SKU 253 - * which supports resource control and we enable by mounting the 254 - * resctrl file system. 255 - * - Caches the per cpu CLOSid values and does the MSR write only 256 - * when a task with a different CLOSid is scheduled in. 257 - * 258 - * Must be called with preemption disabled. 259 - */ 260 - static inline void intel_rdt_sched_in(void) 261 - { 262 - if (static_branch_likely(&rdt_enable_key)) { 263 - struct intel_pqr_state *state = this_cpu_ptr(&pqr_state); 264 - int closid; 265 - 266 - /* 267 - * If this task has a closid assigned, use it. 268 - * Else use the closid assigned to this cpu. 269 - */ 270 - closid = current->closid; 271 - if (closid == 0) 272 - closid = this_cpu_read(cpu_closid); 273 - 274 - if (closid != state->closid) { 275 - state->closid = closid; 276 - wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid); 277 - } 278 - } 279 - } 280 - 281 - #else 282 - 283 - static inline void intel_rdt_sched_in(void) {} 284 - 285 - #endif /* CONFIG_INTEL_RDT_A */ 286 - #endif /* _ASM_X86_INTEL_RDT_H */
-27
arch/x86/include/asm/intel_rdt_common.h
··· 1 - #ifndef _ASM_X86_INTEL_RDT_COMMON_H 2 - #define _ASM_X86_INTEL_RDT_COMMON_H 3 - 4 - #define MSR_IA32_PQR_ASSOC 0x0c8f 5 - 6 - /** 7 - * struct intel_pqr_state - State cache for the PQR MSR 8 - * @rmid: The cached Resource Monitoring ID 9 - * @closid: The cached Class Of Service ID 10 - * @rmid_usecnt: The usage counter for rmid 11 - * 12 - * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the 13 - * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always 14 - * contains both parts, so we need to cache them. 15 - * 16 - * The cache also helps to avoid pointless updates if the value does 17 - * not change. 18 - */ 19 - struct intel_pqr_state { 20 - u32 rmid; 21 - u32 closid; 22 - int rmid_usecnt; 23 - }; 24 - 25 - DECLARE_PER_CPU(struct intel_pqr_state, pqr_state); 26 - 27 - #endif /* _ASM_X86_INTEL_RDT_COMMON_H */
+92
arch/x86/include/asm/intel_rdt_sched.h
··· 1 + #ifndef _ASM_X86_INTEL_RDT_SCHED_H 2 + #define _ASM_X86_INTEL_RDT_SCHED_H 3 + 4 + #ifdef CONFIG_INTEL_RDT 5 + 6 + #include <linux/sched.h> 7 + #include <linux/jump_label.h> 8 + 9 + #define IA32_PQR_ASSOC 0x0c8f 10 + 11 + /** 12 + * struct intel_pqr_state - State cache for the PQR MSR 13 + * @cur_rmid: The cached Resource Monitoring ID 14 + * @cur_closid: The cached Class Of Service ID 15 + * @default_rmid: The user assigned Resource Monitoring ID 16 + * @default_closid: The user assigned cached Class Of Service ID 17 + * 18 + * The upper 32 bits of IA32_PQR_ASSOC contain closid and the 19 + * lower 10 bits rmid. The update to IA32_PQR_ASSOC always 20 + * contains both parts, so we need to cache them. This also 21 + * stores the user configured per cpu CLOSID and RMID. 22 + * 23 + * The cache also helps to avoid pointless updates if the value does 24 + * not change. 25 + */ 26 + struct intel_pqr_state { 27 + u32 cur_rmid; 28 + u32 cur_closid; 29 + u32 default_rmid; 30 + u32 default_closid; 31 + }; 32 + 33 + DECLARE_PER_CPU(struct intel_pqr_state, pqr_state); 34 + 35 + DECLARE_STATIC_KEY_FALSE(rdt_enable_key); 36 + DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key); 37 + DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key); 38 + 39 + /* 40 + * __intel_rdt_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR 41 + * 42 + * Following considerations are made so that this has minimal impact 43 + * on scheduler hot path: 44 + * - This will stay as no-op unless we are running on an Intel SKU 45 + * which supports resource control or monitoring and we enable by 46 + * mounting the resctrl file system. 47 + * - Caches the per cpu CLOSid/RMID values and does the MSR write only 48 + * when a task with a different CLOSid/RMID is scheduled in. 49 + * - We allocate RMIDs/CLOSids globally in order to keep this as 50 + * simple as possible. 51 + * Must be called with preemption disabled. 52 + */ 53 + static void __intel_rdt_sched_in(void) 54 + { 55 + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state); 56 + u32 closid = state->default_closid; 57 + u32 rmid = state->default_rmid; 58 + 59 + /* 60 + * If this task has a closid/rmid assigned, use it. 61 + * Else use the closid/rmid assigned to this cpu. 62 + */ 63 + if (static_branch_likely(&rdt_alloc_enable_key)) { 64 + if (current->closid) 65 + closid = current->closid; 66 + } 67 + 68 + if (static_branch_likely(&rdt_mon_enable_key)) { 69 + if (current->rmid) 70 + rmid = current->rmid; 71 + } 72 + 73 + if (closid != state->cur_closid || rmid != state->cur_rmid) { 74 + state->cur_closid = closid; 75 + state->cur_rmid = rmid; 76 + wrmsr(IA32_PQR_ASSOC, rmid, closid); 77 + } 78 + } 79 + 80 + static inline void intel_rdt_sched_in(void) 81 + { 82 + if (static_branch_likely(&rdt_enable_key)) 83 + __intel_rdt_sched_in(); 84 + } 85 + 86 + #else 87 + 88 + static inline void intel_rdt_sched_in(void) {} 89 + 90 + #endif /* CONFIG_INTEL_RDT */ 91 + 92 + #endif /* _ASM_X86_INTEL_RDT_SCHED_H */
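The intel_pqr_state cache above exists to keep the context-switch path cheap: wrmsr(IA32_PQR_ASSOC, rmid, closid) places the RMID in the low half and the CLOSID in the high half of the MSR, and __intel_rdt_sched_in() skips the write whenever the incoming task resolves to the pair that is already programmed. A stand-alone sketch of that caching decision, assuming nothing beyond what the header shows (the struct, function and helper names below are illustrative placeholders, not kernel symbols, and the static-key gating of the task overrides is omitted):

/*
 * User-space model of the IA32_PQR_ASSOC write-avoidance done by
 * __intel_rdt_sched_in(); illustration only, not kernel code.
 */
#include <stdint.h>
#include <stdio.h>

struct pqr_model {
        uint32_t cur_rmid, cur_closid;          /* what the MSR holds now */
        uint32_t default_rmid, default_closid;  /* per-CPU defaults */
};

/* wrmsr(msr, lo, hi) writes hi:lo, so the CLOSID lands in the upper 32 bits. */
static uint64_t pack_pqr(uint32_t rmid, uint32_t closid)
{
        return ((uint64_t)closid << 32) | rmid;
}

/* Returns 1 if the incoming task would require an MSR write. */
static int sched_in_model(struct pqr_model *s, uint32_t task_closid,
                          uint32_t task_rmid)
{
        uint32_t closid = task_closid ? task_closid : s->default_closid;
        uint32_t rmid   = task_rmid   ? task_rmid   : s->default_rmid;

        if (closid == s->cur_closid && rmid == s->cur_rmid)
                return 0;                       /* cached pair matches: skip */

        s->cur_closid = closid;
        s->cur_rmid = rmid;
        printf("wrmsr(IA32_PQR_ASSOC, %#llx)\n",
               (unsigned long long)pack_pqr(rmid, closid));
        return 1;
}

int main(void)
{
        struct pqr_model s = { 0, 0, 0, 0 };

        sched_in_model(&s, 3, 7);       /* new CLOSID/RMID pair: write */
        sched_in_model(&s, 3, 7);       /* same pair again: no write */
        return 0;
}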
+1 -1
arch/x86/kernel/cpu/Makefile
··· 33 33 obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o 34 34 obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o 35 35 36 - obj-$(CONFIG_INTEL_RDT_A) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o 36 + obj-$(CONFIG_INTEL_RDT) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_monitor.o intel_rdt_ctrlmondata.o 37 37 38 38 obj-$(CONFIG_X86_MCE) += mcheck/ 39 39 obj-$(CONFIG_MTRR) += mtrr/
+314 -61
arch/x86/kernel/cpu/intel_rdt.c
··· 30 30 #include <linux/cpuhotplug.h> 31 31 32 32 #include <asm/intel-family.h> 33 - #include <asm/intel_rdt.h> 33 + #include <asm/intel_rdt_sched.h> 34 + #include "intel_rdt.h" 34 35 35 36 #define MAX_MBA_BW 100u 36 37 #define MBA_IS_LINEAR 0x4 ··· 39 38 /* Mutex to protect rdtgroup access. */ 40 39 DEFINE_MUTEX(rdtgroup_mutex); 41 40 42 - DEFINE_PER_CPU_READ_MOSTLY(int, cpu_closid); 41 + /* 42 + * The cached intel_pqr_state is strictly per CPU and can never be 43 + * updated from a remote CPU. Functions which modify the state 44 + * are called with interrupts disabled and no preemption, which 45 + * is sufficient for the protection. 46 + */ 47 + DEFINE_PER_CPU(struct intel_pqr_state, pqr_state); 43 48 44 49 /* 45 50 * Used to store the max resource name width and max resource data width 46 51 * to display the schemata in a tabular format 47 52 */ 48 53 int max_name_width, max_data_width; 54 + 55 + /* 56 + * Global boolean for rdt_alloc which is true if any 57 + * resource allocation is enabled. 58 + */ 59 + bool rdt_alloc_capable; 49 60 50 61 static void 51 62 mba_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r); ··· 67 54 #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].domains) 68 55 69 56 struct rdt_resource rdt_resources_all[] = { 57 + [RDT_RESOURCE_L3] = 70 58 { 59 + .rid = RDT_RESOURCE_L3, 71 60 .name = "L3", 72 61 .domains = domain_init(RDT_RESOURCE_L3), 73 62 .msr_base = IA32_L3_CBM_BASE, ··· 82 67 }, 83 68 .parse_ctrlval = parse_cbm, 84 69 .format_str = "%d=%0*x", 70 + .fflags = RFTYPE_RES_CACHE, 85 71 }, 72 + [RDT_RESOURCE_L3DATA] = 86 73 { 74 + .rid = RDT_RESOURCE_L3DATA, 87 75 .name = "L3DATA", 88 76 .domains = domain_init(RDT_RESOURCE_L3DATA), 89 77 .msr_base = IA32_L3_CBM_BASE, ··· 99 81 }, 100 82 .parse_ctrlval = parse_cbm, 101 83 .format_str = "%d=%0*x", 84 + .fflags = RFTYPE_RES_CACHE, 102 85 }, 86 + [RDT_RESOURCE_L3CODE] = 103 87 { 88 + .rid = RDT_RESOURCE_L3CODE, 104 89 .name = "L3CODE", 105 90 .domains = domain_init(RDT_RESOURCE_L3CODE), 106 91 .msr_base = IA32_L3_CBM_BASE, ··· 116 95 }, 117 96 .parse_ctrlval = parse_cbm, 118 97 .format_str = "%d=%0*x", 98 + .fflags = RFTYPE_RES_CACHE, 119 99 }, 100 + [RDT_RESOURCE_L2] = 120 101 { 102 + .rid = RDT_RESOURCE_L2, 121 103 .name = "L2", 122 104 .domains = domain_init(RDT_RESOURCE_L2), 123 105 .msr_base = IA32_L2_CBM_BASE, ··· 133 109 }, 134 110 .parse_ctrlval = parse_cbm, 135 111 .format_str = "%d=%0*x", 112 + .fflags = RFTYPE_RES_CACHE, 136 113 }, 114 + [RDT_RESOURCE_MBA] = 137 115 { 116 + .rid = RDT_RESOURCE_MBA, 138 117 .name = "MB", 139 118 .domains = domain_init(RDT_RESOURCE_MBA), 140 119 .msr_base = IA32_MBA_THRTL_BASE, ··· 145 118 .cache_level = 3, 146 119 .parse_ctrlval = parse_bw, 147 120 .format_str = "%d=%*d", 121 + .fflags = RFTYPE_RES_MB, 148 122 }, 149 123 }; 150 124 ··· 172 144 * is always 20 on hsw server parts. The minimum cache bitmask length 173 145 * allowed for HSW server is always 2 bits. Hardcode all of them. 
174 146 */ 175 - static inline bool cache_alloc_hsw_probe(void) 147 + static inline void cache_alloc_hsw_probe(void) 176 148 { 177 - if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL && 178 - boot_cpu_data.x86 == 6 && 179 - boot_cpu_data.x86_model == INTEL_FAM6_HASWELL_X) { 180 - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3]; 181 - u32 l, h, max_cbm = BIT_MASK(20) - 1; 149 + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3]; 150 + u32 l, h, max_cbm = BIT_MASK(20) - 1; 182 151 183 - if (wrmsr_safe(IA32_L3_CBM_BASE, max_cbm, 0)) 184 - return false; 185 - rdmsr(IA32_L3_CBM_BASE, l, h); 152 + if (wrmsr_safe(IA32_L3_CBM_BASE, max_cbm, 0)) 153 + return; 154 + rdmsr(IA32_L3_CBM_BASE, l, h); 186 155 187 - /* If all the bits were set in MSR, return success */ 188 - if (l != max_cbm) 189 - return false; 156 + /* If all the bits were set in MSR, return success */ 157 + if (l != max_cbm) 158 + return; 190 159 191 - r->num_closid = 4; 192 - r->default_ctrl = max_cbm; 193 - r->cache.cbm_len = 20; 194 - r->cache.min_cbm_bits = 2; 195 - r->capable = true; 196 - r->enabled = true; 160 + r->num_closid = 4; 161 + r->default_ctrl = max_cbm; 162 + r->cache.cbm_len = 20; 163 + r->cache.shareable_bits = 0xc0000; 164 + r->cache.min_cbm_bits = 2; 165 + r->alloc_capable = true; 166 + r->alloc_enabled = true; 197 167 198 - return true; 199 - } 200 - 201 - return false; 168 + rdt_alloc_capable = true; 202 169 } 203 170 204 171 /* ··· 236 213 return false; 237 214 } 238 215 r->data_width = 3; 239 - rdt_get_mba_infofile(r); 240 216 241 - r->capable = true; 242 - r->enabled = true; 217 + r->alloc_capable = true; 218 + r->alloc_enabled = true; 243 219 244 220 return true; 245 221 } 246 222 247 - static void rdt_get_cache_config(int idx, struct rdt_resource *r) 223 + static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r) 248 224 { 249 225 union cpuid_0x10_1_eax eax; 250 226 union cpuid_0x10_x_edx edx; ··· 253 231 r->num_closid = edx.split.cos_max + 1; 254 232 r->cache.cbm_len = eax.split.cbm_len + 1; 255 233 r->default_ctrl = BIT_MASK(eax.split.cbm_len + 1) - 1; 234 + r->cache.shareable_bits = ebx & r->default_ctrl; 256 235 r->data_width = (r->cache.cbm_len + 3) / 4; 257 - rdt_get_cache_infofile(r); 258 - r->capable = true; 259 - r->enabled = true; 236 + r->alloc_capable = true; 237 + r->alloc_enabled = true; 260 238 } 261 239 262 240 static void rdt_get_cdp_l3_config(int type) ··· 268 246 r->cache.cbm_len = r_l3->cache.cbm_len; 269 247 r->default_ctrl = r_l3->default_ctrl; 270 248 r->data_width = (r->cache.cbm_len + 3) / 4; 271 - r->capable = true; 249 + r->alloc_capable = true; 272 250 /* 273 251 * By default, CDP is disabled. CDP can be enabled by mount parameter 274 252 * "cdp" during resctrl file system mount time. 
275 253 */ 276 - r->enabled = false; 254 + r->alloc_enabled = false; 277 255 } 278 256 279 257 static int get_cache_id(int cpu, int level) ··· 322 300 wrmsrl(r->msr_base + cbm_idx(r, i), d->ctrl_val[i]); 323 301 } 324 302 303 + struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r) 304 + { 305 + struct rdt_domain *d; 306 + 307 + list_for_each_entry(d, &r->domains, list) { 308 + /* Find the domain that contains this CPU */ 309 + if (cpumask_test_cpu(cpu, &d->cpu_mask)) 310 + return d; 311 + } 312 + 313 + return NULL; 314 + } 315 + 325 316 void rdt_ctrl_update(void *arg) 326 317 { 327 318 struct msr_param *m = arg; ··· 342 307 int cpu = smp_processor_id(); 343 308 struct rdt_domain *d; 344 309 345 - list_for_each_entry(d, &r->domains, list) { 346 - /* Find the domain that contains this CPU */ 347 - if (cpumask_test_cpu(cpu, &d->cpu_mask)) { 348 - r->msr_update(d, m, r); 349 - return; 350 - } 310 + d = get_domain_from_cpu(cpu, r); 311 + if (d) { 312 + r->msr_update(d, m, r); 313 + return; 351 314 } 352 315 pr_warn_once("cpu %d not found in any domain for resource %s\n", 353 316 cpu, r->name); ··· 359 326 * caller, return the first domain whose id is bigger than the input id. 360 327 * The domain list is sorted by id in ascending order. 361 328 */ 362 - static struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, 363 - struct list_head **pos) 329 + struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, 330 + struct list_head **pos) 364 331 { 365 332 struct rdt_domain *d; 366 333 struct list_head *l; ··· 410 377 return 0; 411 378 } 412 379 380 + static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) 381 + { 382 + size_t tsize; 383 + 384 + if (is_llc_occupancy_enabled()) { 385 + d->rmid_busy_llc = kcalloc(BITS_TO_LONGS(r->num_rmid), 386 + sizeof(unsigned long), 387 + GFP_KERNEL); 388 + if (!d->rmid_busy_llc) 389 + return -ENOMEM; 390 + INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo); 391 + } 392 + if (is_mbm_total_enabled()) { 393 + tsize = sizeof(*d->mbm_total); 394 + d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL); 395 + if (!d->mbm_total) { 396 + kfree(d->rmid_busy_llc); 397 + return -ENOMEM; 398 + } 399 + } 400 + if (is_mbm_local_enabled()) { 401 + tsize = sizeof(*d->mbm_local); 402 + d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL); 403 + if (!d->mbm_local) { 404 + kfree(d->rmid_busy_llc); 405 + kfree(d->mbm_total); 406 + return -ENOMEM; 407 + } 408 + } 409 + 410 + if (is_mbm_enabled()) { 411 + INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow); 412 + mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL); 413 + } 414 + 415 + return 0; 416 + } 417 + 413 418 /* 414 419 * domain_add_cpu - Add a cpu to a resource's domain list. 415 420 * ··· 483 412 return; 484 413 485 414 d->id = id; 415 + cpumask_set_cpu(cpu, &d->cpu_mask); 486 416 487 - if (domain_setup_ctrlval(r, d)) { 417 + if (r->alloc_capable && domain_setup_ctrlval(r, d)) { 488 418 kfree(d); 489 419 return; 490 420 } 491 421 492 - cpumask_set_cpu(cpu, &d->cpu_mask); 422 + if (r->mon_capable && domain_setup_mon_state(r, d)) { 423 + kfree(d); 424 + return; 425 + } 426 + 493 427 list_add_tail(&d->list, add_pos); 428 + 429 + /* 430 + * If resctrl is mounted, add 431 + * per domain monitor data directories. 
432 + */ 433 + if (static_branch_unlikely(&rdt_mon_enable_key)) 434 + mkdir_mondata_subdir_allrdtgrp(r, d); 494 435 } 495 436 496 437 static void domain_remove_cpu(int cpu, struct rdt_resource *r) ··· 518 435 519 436 cpumask_clear_cpu(cpu, &d->cpu_mask); 520 437 if (cpumask_empty(&d->cpu_mask)) { 438 + /* 439 + * If resctrl is mounted, remove all the 440 + * per domain monitor data directories. 441 + */ 442 + if (static_branch_unlikely(&rdt_mon_enable_key)) 443 + rmdir_mondata_subdir_allrdtgrp(r, d->id); 521 444 kfree(d->ctrl_val); 445 + kfree(d->rmid_busy_llc); 446 + kfree(d->mbm_total); 447 + kfree(d->mbm_local); 522 448 list_del(&d->list); 449 + if (is_mbm_enabled()) 450 + cancel_delayed_work(&d->mbm_over); 451 + if (is_llc_occupancy_enabled() && has_busy_rmid(r, d)) { 452 + /* 453 + * When a package is going down, forcefully 454 + * decrement rmid->ebusy. There is no way to know 455 + * that the L3 was flushed and hence may lead to 456 + * incorrect counts in rare scenarios, but leaving 457 + * the RMID as busy creates RMID leaks if the 458 + * package never comes back. 459 + */ 460 + __check_limbo(d, true); 461 + cancel_delayed_work(&d->cqm_limbo); 462 + } 463 + 523 464 kfree(d); 465 + return; 466 + } 467 + 468 + if (r == &rdt_resources_all[RDT_RESOURCE_L3]) { 469 + if (is_mbm_enabled() && cpu == d->mbm_work_cpu) { 470 + cancel_delayed_work(&d->mbm_over); 471 + mbm_setup_overflow_handler(d, 0); 472 + } 473 + if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu && 474 + has_busy_rmid(r, d)) { 475 + cancel_delayed_work(&d->cqm_limbo); 476 + cqm_setup_limbo_handler(d, 0); 477 + } 524 478 } 525 479 } 526 480 527 - static void clear_closid(int cpu) 481 + static void clear_closid_rmid(int cpu) 528 482 { 529 483 struct intel_pqr_state *state = this_cpu_ptr(&pqr_state); 530 484 531 - per_cpu(cpu_closid, cpu) = 0; 532 - state->closid = 0; 533 - wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0); 485 + state->default_closid = 0; 486 + state->default_rmid = 0; 487 + state->cur_closid = 0; 488 + state->cur_rmid = 0; 489 + wrmsr(IA32_PQR_ASSOC, 0, 0); 534 490 } 535 491 536 492 static int intel_rdt_online_cpu(unsigned int cpu) ··· 581 459 domain_add_cpu(cpu, r); 582 460 /* The cpu is set in default rdtgroup after online. 
*/ 583 461 cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); 584 - clear_closid(cpu); 462 + clear_closid_rmid(cpu); 585 463 mutex_unlock(&rdtgroup_mutex); 586 464 587 465 return 0; 466 + } 467 + 468 + static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) 469 + { 470 + struct rdtgroup *cr; 471 + 472 + list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) { 473 + if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) { 474 + break; 475 + } 476 + } 588 477 } 589 478 590 479 static int intel_rdt_offline_cpu(unsigned int cpu) ··· 607 474 for_each_capable_rdt_resource(r) 608 475 domain_remove_cpu(cpu, r); 609 476 list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { 610 - if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) 477 + if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) { 478 + clear_childcpus(rdtgrp, cpu); 611 479 break; 480 + } 612 481 } 613 - clear_closid(cpu); 482 + clear_closid_rmid(cpu); 614 483 mutex_unlock(&rdtgroup_mutex); 615 484 616 485 return 0; ··· 627 492 struct rdt_resource *r; 628 493 int cl; 629 494 630 - for_each_capable_rdt_resource(r) { 495 + for_each_alloc_capable_rdt_resource(r) { 631 496 cl = strlen(r->name); 632 497 if (cl > max_name_width) 633 498 max_name_width = cl; ··· 637 502 } 638 503 } 639 504 640 - static __init bool get_rdt_resources(void) 505 + enum { 506 + RDT_FLAG_CMT, 507 + RDT_FLAG_MBM_TOTAL, 508 + RDT_FLAG_MBM_LOCAL, 509 + RDT_FLAG_L3_CAT, 510 + RDT_FLAG_L3_CDP, 511 + RDT_FLAG_L2_CAT, 512 + RDT_FLAG_MBA, 513 + }; 514 + 515 + #define RDT_OPT(idx, n, f) \ 516 + [idx] = { \ 517 + .name = n, \ 518 + .flag = f \ 519 + } 520 + 521 + struct rdt_options { 522 + char *name; 523 + int flag; 524 + bool force_off, force_on; 525 + }; 526 + 527 + static struct rdt_options rdt_options[] __initdata = { 528 + RDT_OPT(RDT_FLAG_CMT, "cmt", X86_FEATURE_CQM_OCCUP_LLC), 529 + RDT_OPT(RDT_FLAG_MBM_TOTAL, "mbmtotal", X86_FEATURE_CQM_MBM_TOTAL), 530 + RDT_OPT(RDT_FLAG_MBM_LOCAL, "mbmlocal", X86_FEATURE_CQM_MBM_LOCAL), 531 + RDT_OPT(RDT_FLAG_L3_CAT, "l3cat", X86_FEATURE_CAT_L3), 532 + RDT_OPT(RDT_FLAG_L3_CDP, "l3cdp", X86_FEATURE_CDP_L3), 533 + RDT_OPT(RDT_FLAG_L2_CAT, "l2cat", X86_FEATURE_CAT_L2), 534 + RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA), 535 + }; 536 + #define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options) 537 + 538 + static int __init set_rdt_options(char *str) 539 + { 540 + struct rdt_options *o; 541 + bool force_off; 542 + char *tok; 543 + 544 + if (*str == '=') 545 + str++; 546 + while ((tok = strsep(&str, ",")) != NULL) { 547 + force_off = *tok == '!'; 548 + if (force_off) 549 + tok++; 550 + for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) { 551 + if (strcmp(tok, o->name) == 0) { 552 + if (force_off) 553 + o->force_off = true; 554 + else 555 + o->force_on = true; 556 + break; 557 + } 558 + } 559 + } 560 + return 1; 561 + } 562 + __setup("rdt", set_rdt_options); 563 + 564 + static bool __init rdt_cpu_has(int flag) 565 + { 566 + bool ret = boot_cpu_has(flag); 567 + struct rdt_options *o; 568 + 569 + if (!ret) 570 + return ret; 571 + 572 + for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) { 573 + if (flag == o->flag) { 574 + if (o->force_off) 575 + ret = false; 576 + if (o->force_on) 577 + ret = true; 578 + break; 579 + } 580 + } 581 + return ret; 582 + } 583 + 584 + static __init bool get_rdt_alloc_resources(void) 641 585 { 642 586 bool ret = false; 643 587 644 - if (cache_alloc_hsw_probe()) 588 + if (rdt_alloc_capable) 645 589 return true; 646 590 647 591 if (!boot_cpu_has(X86_FEATURE_RDT_A)) 648 592 
return false; 649 593 650 - if (boot_cpu_has(X86_FEATURE_CAT_L3)) { 651 - rdt_get_cache_config(1, &rdt_resources_all[RDT_RESOURCE_L3]); 652 - if (boot_cpu_has(X86_FEATURE_CDP_L3)) { 594 + if (rdt_cpu_has(X86_FEATURE_CAT_L3)) { 595 + rdt_get_cache_alloc_cfg(1, &rdt_resources_all[RDT_RESOURCE_L3]); 596 + if (rdt_cpu_has(X86_FEATURE_CDP_L3)) { 653 597 rdt_get_cdp_l3_config(RDT_RESOURCE_L3DATA); 654 598 rdt_get_cdp_l3_config(RDT_RESOURCE_L3CODE); 655 599 } 656 600 ret = true; 657 601 } 658 - if (boot_cpu_has(X86_FEATURE_CAT_L2)) { 602 + if (rdt_cpu_has(X86_FEATURE_CAT_L2)) { 659 603 /* CPUID 0x10.2 fields are same format at 0x10.1 */ 660 - rdt_get_cache_config(2, &rdt_resources_all[RDT_RESOURCE_L2]); 604 + rdt_get_cache_alloc_cfg(2, &rdt_resources_all[RDT_RESOURCE_L2]); 661 605 ret = true; 662 606 } 663 607 664 - if (boot_cpu_has(X86_FEATURE_MBA)) { 608 + if (rdt_cpu_has(X86_FEATURE_MBA)) { 665 609 if (rdt_get_mem_config(&rdt_resources_all[RDT_RESOURCE_MBA])) 666 610 ret = true; 667 611 } 668 - 669 612 return ret; 613 + } 614 + 615 + static __init bool get_rdt_mon_resources(void) 616 + { 617 + if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) 618 + rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID); 619 + if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) 620 + rdt_mon_features |= (1 << QOS_L3_MBM_TOTAL_EVENT_ID); 621 + if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) 622 + rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID); 623 + 624 + if (!rdt_mon_features) 625 + return false; 626 + 627 + return !rdt_get_mon_l3_config(&rdt_resources_all[RDT_RESOURCE_L3]); 628 + } 629 + 630 + static __init void rdt_quirks(void) 631 + { 632 + switch (boot_cpu_data.x86_model) { 633 + case INTEL_FAM6_HASWELL_X: 634 + if (!rdt_options[RDT_FLAG_L3_CAT].force_off) 635 + cache_alloc_hsw_probe(); 636 + break; 637 + case INTEL_FAM6_SKYLAKE_X: 638 + if (boot_cpu_data.x86_mask <= 4) 639 + set_rdt_options("!cmt,!mbmtotal,!mbmlocal,!l3cat"); 640 + } 641 + } 642 + 643 + static __init bool get_rdt_resources(void) 644 + { 645 + rdt_quirks(); 646 + rdt_alloc_capable = get_rdt_alloc_resources(); 647 + rdt_mon_capable = get_rdt_mon_resources(); 648 + 649 + return (rdt_mon_capable || rdt_alloc_capable); 670 650 } 671 651 672 652 static int __init intel_rdt_late_init(void) ··· 806 556 return ret; 807 557 } 808 558 809 - for_each_capable_rdt_resource(r) 559 + for_each_alloc_capable_rdt_resource(r) 810 560 pr_info("Intel RDT %s allocation detected\n", r->name); 561 + 562 + for_each_mon_capable_rdt_resource(r) 563 + pr_info("Intel RDT %s monitoring detected\n", r->name); 811 564 812 565 return 0; 813 566 }
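set_rdt_options() above accepts a comma-separated feature list in which a leading '!' records a force-off request and a bare name a force-on request; since rdt_cpu_has() tests force_off before force_on, an explicit force-on takes precedence even when rdt_quirks() has force-disabled the same feature. A user-space model of that token grammar, fed an arbitrary option string (glibc strsep() is assumed; this is an illustration, not the kernel parser):

/*
 * Model of the "rdt=" option grammar parsed by set_rdt_options() above.
 * Illustration only.
 */
#define _DEFAULT_SOURCE                 /* for strsep() on glibc */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct opt_model {
        const char *name;
        bool force_off, force_on;
};

/* Same option names as rdt_options[] above. */
static struct opt_model opts[] = {
        { "cmt" }, { "mbmtotal" }, { "mbmlocal" }, { "l3cat" },
        { "l3cdp" }, { "l2cat" }, { "mba" },
};

#define NUM_OPTS (sizeof(opts) / sizeof(opts[0]))

static void parse_rdt(char *str)
{
        char *tok;

        while ((tok = strsep(&str, ",")) != NULL) {
                bool off = (*tok == '!');       /* '!' forces the feature off */
                size_t i;

                if (off)
                        tok++;
                for (i = 0; i < NUM_OPTS; i++) {
                        if (strcmp(tok, opts[i].name) == 0) {
                                if (off)
                                        opts[i].force_off = true;
                                else
                                        opts[i].force_on = true;
                                break;
                        }
                }
        }
}

int main(void)
{
        char arg[] = "l3cdp,!mbmlocal";         /* arbitrary example string */
        size_t i;

        parse_rdt(arg);
        for (i = 0; i < NUM_OPTS; i++)
                printf("%-9s force_on=%d force_off=%d\n", opts[i].name,
                       opts[i].force_on, opts[i].force_off);
        return 0;
}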
+440
arch/x86/kernel/cpu/intel_rdt.h
··· 1 + #ifndef _ASM_X86_INTEL_RDT_H 2 + #define _ASM_X86_INTEL_RDT_H 3 + 4 + #include <linux/sched.h> 5 + #include <linux/kernfs.h> 6 + #include <linux/jump_label.h> 7 + 8 + #define IA32_L3_QOS_CFG 0xc81 9 + #define IA32_L3_CBM_BASE 0xc90 10 + #define IA32_L2_CBM_BASE 0xd10 11 + #define IA32_MBA_THRTL_BASE 0xd50 12 + 13 + #define L3_QOS_CDP_ENABLE 0x01ULL 14 + 15 + /* 16 + * Event IDs are used to program IA32_QM_EVTSEL before reading event 17 + * counter from IA32_QM_CTR 18 + */ 19 + #define QOS_L3_OCCUP_EVENT_ID 0x01 20 + #define QOS_L3_MBM_TOTAL_EVENT_ID 0x02 21 + #define QOS_L3_MBM_LOCAL_EVENT_ID 0x03 22 + 23 + #define CQM_LIMBOCHECK_INTERVAL 1000 24 + 25 + #define MBM_CNTR_WIDTH 24 26 + #define MBM_OVERFLOW_INTERVAL 1000 27 + 28 + #define RMID_VAL_ERROR BIT_ULL(63) 29 + #define RMID_VAL_UNAVAIL BIT_ULL(62) 30 + 31 + DECLARE_STATIC_KEY_FALSE(rdt_enable_key); 32 + 33 + /** 34 + * struct mon_evt - Entry in the event list of a resource 35 + * @evtid: event id 36 + * @name: name of the event 37 + */ 38 + struct mon_evt { 39 + u32 evtid; 40 + char *name; 41 + struct list_head list; 42 + }; 43 + 44 + /** 45 + * struct mon_data_bits - Monitoring details for each event file 46 + * @rid: Resource id associated with the event file. 47 + * @evtid: Event id associated with the event file 48 + * @domid: The domain to which the event file belongs 49 + */ 50 + union mon_data_bits { 51 + void *priv; 52 + struct { 53 + unsigned int rid : 10; 54 + unsigned int evtid : 8; 55 + unsigned int domid : 14; 56 + } u; 57 + }; 58 + 59 + struct rmid_read { 60 + struct rdtgroup *rgrp; 61 + struct rdt_domain *d; 62 + int evtid; 63 + bool first; 64 + u64 val; 65 + }; 66 + 67 + extern unsigned int intel_cqm_threshold; 68 + extern bool rdt_alloc_capable; 69 + extern bool rdt_mon_capable; 70 + extern unsigned int rdt_mon_features; 71 + 72 + enum rdt_group_type { 73 + RDTCTRL_GROUP = 0, 74 + RDTMON_GROUP, 75 + RDT_NUM_GROUP, 76 + }; 77 + 78 + /** 79 + * struct mongroup - store mon group's data in resctrl fs. 80 + * @mon_data_kn kernlfs node for the mon_data directory 81 + * @parent: parent rdtgrp 82 + * @crdtgrp_list: child rdtgroup node list 83 + * @rmid: rmid for this rdtgroup 84 + */ 85 + struct mongroup { 86 + struct kernfs_node *mon_data_kn; 87 + struct rdtgroup *parent; 88 + struct list_head crdtgrp_list; 89 + u32 rmid; 90 + }; 91 + 92 + /** 93 + * struct rdtgroup - store rdtgroup's data in resctrl file system. 94 + * @kn: kernfs node 95 + * @rdtgroup_list: linked list for all rdtgroups 96 + * @closid: closid for this rdtgroup 97 + * @cpu_mask: CPUs assigned to this rdtgroup 98 + * @flags: status bits 99 + * @waitcount: how many cpus expect to find this 100 + * group when they acquire rdtgroup_mutex 101 + * @type: indicates type of this rdtgroup - either 102 + * monitor only or ctrl_mon group 103 + * @mon: mongroup related data 104 + */ 105 + struct rdtgroup { 106 + struct kernfs_node *kn; 107 + struct list_head rdtgroup_list; 108 + u32 closid; 109 + struct cpumask cpu_mask; 110 + int flags; 111 + atomic_t waitcount; 112 + enum rdt_group_type type; 113 + struct mongroup mon; 114 + }; 115 + 116 + /* rdtgroup.flags */ 117 + #define RDT_DELETED 1 118 + 119 + /* rftype.flags */ 120 + #define RFTYPE_FLAGS_CPUS_LIST 1 121 + 122 + /* 123 + * Define the file type flags for base and info directories. 
124 + */ 125 + #define RFTYPE_INFO BIT(0) 126 + #define RFTYPE_BASE BIT(1) 127 + #define RF_CTRLSHIFT 4 128 + #define RF_MONSHIFT 5 129 + #define RFTYPE_CTRL BIT(RF_CTRLSHIFT) 130 + #define RFTYPE_MON BIT(RF_MONSHIFT) 131 + #define RFTYPE_RES_CACHE BIT(8) 132 + #define RFTYPE_RES_MB BIT(9) 133 + #define RF_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) 134 + #define RF_MON_INFO (RFTYPE_INFO | RFTYPE_MON) 135 + #define RF_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL) 136 + 137 + /* List of all resource groups */ 138 + extern struct list_head rdt_all_groups; 139 + 140 + extern int max_name_width, max_data_width; 141 + 142 + int __init rdtgroup_init(void); 143 + 144 + /** 145 + * struct rftype - describe each file in the resctrl file system 146 + * @name: File name 147 + * @mode: Access mode 148 + * @kf_ops: File operations 149 + * @flags: File specific RFTYPE_FLAGS_* flags 150 + * @fflags: File specific RF_* or RFTYPE_* flags 151 + * @seq_show: Show content of the file 152 + * @write: Write to the file 153 + */ 154 + struct rftype { 155 + char *name; 156 + umode_t mode; 157 + struct kernfs_ops *kf_ops; 158 + unsigned long flags; 159 + unsigned long fflags; 160 + 161 + int (*seq_show)(struct kernfs_open_file *of, 162 + struct seq_file *sf, void *v); 163 + /* 164 + * write() is the generic write callback which maps directly to 165 + * kernfs write operation and overrides all other operations. 166 + * Maximum write size is determined by ->max_write_len. 167 + */ 168 + ssize_t (*write)(struct kernfs_open_file *of, 169 + char *buf, size_t nbytes, loff_t off); 170 + }; 171 + 172 + /** 173 + * struct mbm_state - status for each MBM counter in each domain 174 + * @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes) 175 + * @prev_msr Value of IA32_QM_CTR for this RMID last time we read it 176 + */ 177 + struct mbm_state { 178 + u64 chunks; 179 + u64 prev_msr; 180 + }; 181 + 182 + /** 183 + * struct rdt_domain - group of cpus sharing an RDT resource 184 + * @list: all instances of this resource 185 + * @id: unique id for this instance 186 + * @cpu_mask: which cpus share this resource 187 + * @rmid_busy_llc: 188 + * bitmap of which limbo RMIDs are above threshold 189 + * @mbm_total: saved state for MBM total bandwidth 190 + * @mbm_local: saved state for MBM local bandwidth 191 + * @mbm_over: worker to periodically read MBM h/w counters 192 + * @cqm_limbo: worker to periodically read CQM h/w counters 193 + * @mbm_work_cpu: 194 + * worker cpu for MBM h/w counters 195 + * @cqm_work_cpu: 196 + * worker cpu for CQM h/w counters 197 + * @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID) 198 + * @new_ctrl: new ctrl value to be loaded 199 + * @have_new_ctrl: did user provide new_ctrl for this domain 200 + */ 201 + struct rdt_domain { 202 + struct list_head list; 203 + int id; 204 + struct cpumask cpu_mask; 205 + unsigned long *rmid_busy_llc; 206 + struct mbm_state *mbm_total; 207 + struct mbm_state *mbm_local; 208 + struct delayed_work mbm_over; 209 + struct delayed_work cqm_limbo; 210 + int mbm_work_cpu; 211 + int cqm_work_cpu; 212 + u32 *ctrl_val; 213 + u32 new_ctrl; 214 + bool have_new_ctrl; 215 + }; 216 + 217 + /** 218 + * struct msr_param - set a range of MSRs from a domain 219 + * @res: The resource to use 220 + * @low: Beginning index from base MSR 221 + * @high: End index 222 + */ 223 + struct msr_param { 224 + struct rdt_resource *res; 225 + int low; 226 + int high; 227 + }; 228 + 229 + /** 230 + * struct rdt_cache - Cache allocation related data 231 + * @cbm_len: Length of the cache 
bit mask 232 + * @min_cbm_bits: Minimum number of consecutive bits to be set 233 + * @cbm_idx_mult: Multiplier of CBM index 234 + * @cbm_idx_offset: Offset of CBM index. CBM index is computed by: 235 + * closid * cbm_idx_multi + cbm_idx_offset 236 + * in a cache bit mask 237 + * @shareable_bits: Bitmask of shareable resource with other 238 + * executing entities 239 + */ 240 + struct rdt_cache { 241 + unsigned int cbm_len; 242 + unsigned int min_cbm_bits; 243 + unsigned int cbm_idx_mult; 244 + unsigned int cbm_idx_offset; 245 + unsigned int shareable_bits; 246 + }; 247 + 248 + /** 249 + * struct rdt_membw - Memory bandwidth allocation related data 250 + * @max_delay: Max throttle delay. Delay is the hardware 251 + * representation for memory bandwidth. 252 + * @min_bw: Minimum memory bandwidth percentage user can request 253 + * @bw_gran: Granularity at which the memory bandwidth is allocated 254 + * @delay_linear: True if memory B/W delay is in linear scale 255 + * @mb_map: Mapping of memory B/W percentage to memory B/W delay 256 + */ 257 + struct rdt_membw { 258 + u32 max_delay; 259 + u32 min_bw; 260 + u32 bw_gran; 261 + u32 delay_linear; 262 + u32 *mb_map; 263 + }; 264 + 265 + static inline bool is_llc_occupancy_enabled(void) 266 + { 267 + return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID)); 268 + } 269 + 270 + static inline bool is_mbm_total_enabled(void) 271 + { 272 + return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID)); 273 + } 274 + 275 + static inline bool is_mbm_local_enabled(void) 276 + { 277 + return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID)); 278 + } 279 + 280 + static inline bool is_mbm_enabled(void) 281 + { 282 + return (is_mbm_total_enabled() || is_mbm_local_enabled()); 283 + } 284 + 285 + static inline bool is_mbm_event(int e) 286 + { 287 + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && 288 + e <= QOS_L3_MBM_LOCAL_EVENT_ID); 289 + } 290 + 291 + /** 292 + * struct rdt_resource - attributes of an RDT resource 293 + * @rid: The index of the resource 294 + * @alloc_enabled: Is allocation enabled on this machine 295 + * @mon_enabled: Is monitoring enabled for this feature 296 + * @alloc_capable: Is allocation available on this machine 297 + * @mon_capable: Is monitor feature available on this machine 298 + * @name: Name to use in "schemata" file 299 + * @num_closid: Number of CLOSIDs available 300 + * @cache_level: Which cache level defines scope of this resource 301 + * @default_ctrl: Specifies default cache cbm or memory B/W percent. 
302 + * @msr_base: Base MSR address for CBMs 303 + * @msr_update: Function pointer to update QOS MSRs 304 + * @data_width: Character width of data when displaying 305 + * @domains: All domains for this resource 306 + * @cache: Cache allocation related data 307 + * @format_str: Per resource format string to show domain value 308 + * @parse_ctrlval: Per resource function pointer to parse control values 309 + * @evt_list: List of monitoring events 310 + * @num_rmid: Number of RMIDs available 311 + * @mon_scale: cqm counter * mon_scale = occupancy in bytes 312 + * @fflags: flags to choose base and info files 313 + */ 314 + struct rdt_resource { 315 + int rid; 316 + bool alloc_enabled; 317 + bool mon_enabled; 318 + bool alloc_capable; 319 + bool mon_capable; 320 + char *name; 321 + int num_closid; 322 + int cache_level; 323 + u32 default_ctrl; 324 + unsigned int msr_base; 325 + void (*msr_update) (struct rdt_domain *d, struct msr_param *m, 326 + struct rdt_resource *r); 327 + int data_width; 328 + struct list_head domains; 329 + struct rdt_cache cache; 330 + struct rdt_membw membw; 331 + const char *format_str; 332 + int (*parse_ctrlval) (char *buf, struct rdt_resource *r, 333 + struct rdt_domain *d); 334 + struct list_head evt_list; 335 + int num_rmid; 336 + unsigned int mon_scale; 337 + unsigned long fflags; 338 + }; 339 + 340 + int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d); 341 + int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d); 342 + 343 + extern struct mutex rdtgroup_mutex; 344 + 345 + extern struct rdt_resource rdt_resources_all[]; 346 + extern struct rdtgroup rdtgroup_default; 347 + DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key); 348 + 349 + int __init rdtgroup_init(void); 350 + 351 + enum { 352 + RDT_RESOURCE_L3, 353 + RDT_RESOURCE_L3DATA, 354 + RDT_RESOURCE_L3CODE, 355 + RDT_RESOURCE_L2, 356 + RDT_RESOURCE_MBA, 357 + 358 + /* Must be the last */ 359 + RDT_NUM_RESOURCES, 360 + }; 361 + 362 + #define for_each_capable_rdt_resource(r) \ 363 + for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\ 364 + r++) \ 365 + if (r->alloc_capable || r->mon_capable) 366 + 367 + #define for_each_alloc_capable_rdt_resource(r) \ 368 + for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\ 369 + r++) \ 370 + if (r->alloc_capable) 371 + 372 + #define for_each_mon_capable_rdt_resource(r) \ 373 + for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\ 374 + r++) \ 375 + if (r->mon_capable) 376 + 377 + #define for_each_alloc_enabled_rdt_resource(r) \ 378 + for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\ 379 + r++) \ 380 + if (r->alloc_enabled) 381 + 382 + #define for_each_mon_enabled_rdt_resource(r) \ 383 + for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\ 384 + r++) \ 385 + if (r->mon_enabled) 386 + 387 + /* CPUID.(EAX=10H, ECX=ResID=1).EAX */ 388 + union cpuid_0x10_1_eax { 389 + struct { 390 + unsigned int cbm_len:5; 391 + } split; 392 + unsigned int full; 393 + }; 394 + 395 + /* CPUID.(EAX=10H, ECX=ResID=3).EAX */ 396 + union cpuid_0x10_3_eax { 397 + struct { 398 + unsigned int max_delay:12; 399 + } split; 400 + unsigned int full; 401 + }; 402 + 403 + /* CPUID.(EAX=10H, ECX=ResID).EDX */ 404 + union cpuid_0x10_x_edx { 405 + struct { 406 + unsigned int cos_max:16; 407 + } split; 408 + unsigned int full; 409 + }; 410 + 411 + void rdt_ctrl_update(void *arg); 412 + struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn); 413 + void rdtgroup_kn_unlock(struct 
kernfs_node *kn); 414 + struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, 415 + struct list_head **pos); 416 + ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, 417 + char *buf, size_t nbytes, loff_t off); 418 + int rdtgroup_schemata_show(struct kernfs_open_file *of, 419 + struct seq_file *s, void *v); 420 + struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r); 421 + int alloc_rmid(void); 422 + void free_rmid(u32 rmid); 423 + int rdt_get_mon_l3_config(struct rdt_resource *r); 424 + void mon_event_count(void *info); 425 + int rdtgroup_mondata_show(struct seq_file *m, void *arg); 426 + void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, 427 + unsigned int dom_id); 428 + void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, 429 + struct rdt_domain *d); 430 + void mon_event_read(struct rmid_read *rr, struct rdt_domain *d, 431 + struct rdtgroup *rdtgrp, int evtid, int first); 432 + void mbm_setup_overflow_handler(struct rdt_domain *dom, 433 + unsigned long delay_ms); 434 + void mbm_handle_overflow(struct work_struct *work); 435 + void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms); 436 + void cqm_handle_limbo(struct work_struct *work); 437 + bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d); 438 + void __check_limbo(struct rdt_domain *d, bool force_free); 439 + 440 + #endif /* _ASM_X86_INTEL_RDT_H */
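The struct rdt_cache comment in the header above describes how a CLOSID is turned into a cache bit mask MSR index: closid * cbm_idx_mult + cbm_idx_offset, added on top of msr_base. Below is a minimal standalone sketch of that arithmetic. It is not part of the merged patch; the base address 0xc90 and the interleaved mult/offset pairs for the CDP data/code resources are assumptions used purely for illustration.

    /*
     * Standalone sketch, not part of the merged patch: the CBM index math
     * documented for struct rdt_cache.  The MSR written for a given CLOSID
     * is msr_base + closid * cbm_idx_mult + cbm_idx_offset.  The base
     * address 0xc90 and the interleaved mult/offset pairs for the CDP
     * data/code resources are assumptions used for illustration only.
     */
    #include <stdio.h>

    struct cache_map {
            unsigned int msr_base;          /* first cache mask MSR (assumed 0xc90) */
            unsigned int cbm_idx_mult;      /* stride between CLOSIDs */
            unsigned int cbm_idx_offset;    /* assumed: 0 for data, 1 for code */
    };

    static unsigned int cbm_msr(const struct cache_map *c, unsigned int closid)
    {
            return c->msr_base + closid * c->cbm_idx_mult + c->cbm_idx_offset;
    }

    int main(void)
    {
            struct cache_map l3data = { 0xc90, 2, 0 };
            struct cache_map l3code = { 0xc90, 2, 1 };
            unsigned int closid;

            for (closid = 0; closid < 4; closid++)
                    printf("closid %u: data MSR 0x%x, code MSR 0x%x\n", closid,
                           cbm_msr(&l3data, closid), cbm_msr(&l3code, closid));
            return 0;
    }

With CDP enabled, this kind of interleaving lets the split L3DATA/L3CODE resources share one MSR range while every CLOSID still gets an independent data mask and code mask.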
+499
arch/x86/kernel/cpu/intel_rdt_monitor.c
··· 1 + /* 2 + * Resource Director Technology(RDT) 3 + * - Monitoring code 4 + * 5 + * Copyright (C) 2017 Intel Corporation 6 + * 7 + * Author: 8 + * Vikas Shivappa <vikas.shivappa@intel.com> 9 + * 10 + * This replaces the cqm.c based on perf but we reuse a lot of 11 + * code and datastructures originally from Peter Zijlstra and Matt Fleming. 12 + * 13 + * This program is free software; you can redistribute it and/or modify it 14 + * under the terms and conditions of the GNU General Public License, 15 + * version 2, as published by the Free Software Foundation. 16 + * 17 + * This program is distributed in the hope it will be useful, but WITHOUT 18 + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 19 + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for 20 + * more details. 21 + * 22 + * More information about RDT be found in the Intel (R) x86 Architecture 23 + * Software Developer Manual June 2016, volume 3, section 17.17. 24 + */ 25 + 26 + #include <linux/module.h> 27 + #include <linux/slab.h> 28 + #include <asm/cpu_device_id.h> 29 + #include "intel_rdt.h" 30 + 31 + #define MSR_IA32_QM_CTR 0x0c8e 32 + #define MSR_IA32_QM_EVTSEL 0x0c8d 33 + 34 + struct rmid_entry { 35 + u32 rmid; 36 + int busy; 37 + struct list_head list; 38 + }; 39 + 40 + /** 41 + * @rmid_free_lru A least recently used list of free RMIDs 42 + * These RMIDs are guaranteed to have an occupancy less than the 43 + * threshold occupancy 44 + */ 45 + static LIST_HEAD(rmid_free_lru); 46 + 47 + /** 48 + * @rmid_limbo_count count of currently unused but (potentially) 49 + * dirty RMIDs. 50 + * This counts RMIDs that no one is currently using but that 51 + * may have a occupancy value > intel_cqm_threshold. User can change 52 + * the threshold occupancy value. 53 + */ 54 + unsigned int rmid_limbo_count; 55 + 56 + /** 57 + * @rmid_entry - The entry in the limbo and free lists. 58 + */ 59 + static struct rmid_entry *rmid_ptrs; 60 + 61 + /* 62 + * Global boolean for rdt_monitor which is true if any 63 + * resource monitoring is enabled. 64 + */ 65 + bool rdt_mon_capable; 66 + 67 + /* 68 + * Global to indicate which monitoring events are enabled. 69 + */ 70 + unsigned int rdt_mon_features; 71 + 72 + /* 73 + * This is the threshold cache occupancy at which we will consider an 74 + * RMID available for re-allocation. 75 + */ 76 + unsigned int intel_cqm_threshold; 77 + 78 + static inline struct rmid_entry *__rmid_entry(u32 rmid) 79 + { 80 + struct rmid_entry *entry; 81 + 82 + entry = &rmid_ptrs[rmid]; 83 + WARN_ON(entry->rmid != rmid); 84 + 85 + return entry; 86 + } 87 + 88 + static u64 __rmid_read(u32 rmid, u32 eventid) 89 + { 90 + u64 val; 91 + 92 + /* 93 + * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured 94 + * with a valid event code for supported resource type and the bits 95 + * IA32_QM_EVTSEL.RMID (bits 41:32) are configured with valid RMID, 96 + * IA32_QM_CTR.data (bits 61:0) reports the monitored data. 97 + * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62) 98 + * are error bits. 99 + */ 100 + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid); 101 + rdmsrl(MSR_IA32_QM_CTR, val); 102 + 103 + return val; 104 + } 105 + 106 + static bool rmid_dirty(struct rmid_entry *entry) 107 + { 108 + u64 val = __rmid_read(entry->rmid, QOS_L3_OCCUP_EVENT_ID); 109 + 110 + return val >= intel_cqm_threshold; 111 + } 112 + 113 + /* 114 + * Check the RMIDs that are marked as busy for this domain. 
If the 115 + * reported LLC occupancy is below the threshold clear the busy bit and 116 + * decrement the count. If the busy count gets to zero on an RMID, we 117 + * free the RMID 118 + */ 119 + void __check_limbo(struct rdt_domain *d, bool force_free) 120 + { 121 + struct rmid_entry *entry; 122 + struct rdt_resource *r; 123 + u32 crmid = 1, nrmid; 124 + 125 + r = &rdt_resources_all[RDT_RESOURCE_L3]; 126 + 127 + /* 128 + * Skip RMID 0 and start from RMID 1 and check all the RMIDs that 129 + * are marked as busy for occupancy < threshold. If the occupancy 130 + * is less than the threshold decrement the busy counter of the 131 + * RMID and move it to the free list when the counter reaches 0. 132 + */ 133 + for (;;) { 134 + nrmid = find_next_bit(d->rmid_busy_llc, r->num_rmid, crmid); 135 + if (nrmid >= r->num_rmid) 136 + break; 137 + 138 + entry = __rmid_entry(nrmid); 139 + if (force_free || !rmid_dirty(entry)) { 140 + clear_bit(entry->rmid, d->rmid_busy_llc); 141 + if (!--entry->busy) { 142 + rmid_limbo_count--; 143 + list_add_tail(&entry->list, &rmid_free_lru); 144 + } 145 + } 146 + crmid = nrmid + 1; 147 + } 148 + } 149 + 150 + bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d) 151 + { 152 + return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid; 153 + } 154 + 155 + /* 156 + * As of now the RMIDs allocation is global. 157 + * However we keep track of which packages the RMIDs 158 + * are used to optimize the limbo list management. 159 + */ 160 + int alloc_rmid(void) 161 + { 162 + struct rmid_entry *entry; 163 + 164 + lockdep_assert_held(&rdtgroup_mutex); 165 + 166 + if (list_empty(&rmid_free_lru)) 167 + return rmid_limbo_count ? -EBUSY : -ENOSPC; 168 + 169 + entry = list_first_entry(&rmid_free_lru, 170 + struct rmid_entry, list); 171 + list_del(&entry->list); 172 + 173 + return entry->rmid; 174 + } 175 + 176 + static void add_rmid_to_limbo(struct rmid_entry *entry) 177 + { 178 + struct rdt_resource *r; 179 + struct rdt_domain *d; 180 + int cpu; 181 + u64 val; 182 + 183 + r = &rdt_resources_all[RDT_RESOURCE_L3]; 184 + 185 + entry->busy = 0; 186 + cpu = get_cpu(); 187 + list_for_each_entry(d, &r->domains, list) { 188 + if (cpumask_test_cpu(cpu, &d->cpu_mask)) { 189 + val = __rmid_read(entry->rmid, QOS_L3_OCCUP_EVENT_ID); 190 + if (val <= intel_cqm_threshold) 191 + continue; 192 + } 193 + 194 + /* 195 + * For the first limbo RMID in the domain, 196 + * setup up the limbo worker. 
197 + */ 198 + if (!has_busy_rmid(r, d)) 199 + cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL); 200 + set_bit(entry->rmid, d->rmid_busy_llc); 201 + entry->busy++; 202 + } 203 + put_cpu(); 204 + 205 + if (entry->busy) 206 + rmid_limbo_count++; 207 + else 208 + list_add_tail(&entry->list, &rmid_free_lru); 209 + } 210 + 211 + void free_rmid(u32 rmid) 212 + { 213 + struct rmid_entry *entry; 214 + 215 + if (!rmid) 216 + return; 217 + 218 + lockdep_assert_held(&rdtgroup_mutex); 219 + 220 + entry = __rmid_entry(rmid); 221 + 222 + if (is_llc_occupancy_enabled()) 223 + add_rmid_to_limbo(entry); 224 + else 225 + list_add_tail(&entry->list, &rmid_free_lru); 226 + } 227 + 228 + static int __mon_event_count(u32 rmid, struct rmid_read *rr) 229 + { 230 + u64 chunks, shift, tval; 231 + struct mbm_state *m; 232 + 233 + tval = __rmid_read(rmid, rr->evtid); 234 + if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) { 235 + rr->val = tval; 236 + return -EINVAL; 237 + } 238 + switch (rr->evtid) { 239 + case QOS_L3_OCCUP_EVENT_ID: 240 + rr->val += tval; 241 + return 0; 242 + case QOS_L3_MBM_TOTAL_EVENT_ID: 243 + m = &rr->d->mbm_total[rmid]; 244 + break; 245 + case QOS_L3_MBM_LOCAL_EVENT_ID: 246 + m = &rr->d->mbm_local[rmid]; 247 + break; 248 + default: 249 + /* 250 + * Code would never reach here because 251 + * an invalid event id would fail the __rmid_read. 252 + */ 253 + return -EINVAL; 254 + } 255 + 256 + if (rr->first) { 257 + m->prev_msr = tval; 258 + m->chunks = 0; 259 + return 0; 260 + } 261 + 262 + shift = 64 - MBM_CNTR_WIDTH; 263 + chunks = (tval << shift) - (m->prev_msr << shift); 264 + chunks >>= shift; 265 + m->chunks += chunks; 266 + m->prev_msr = tval; 267 + 268 + rr->val += m->chunks; 269 + return 0; 270 + } 271 + 272 + /* 273 + * This is called via IPI to read the CQM/MBM counters 274 + * on a domain. 275 + */ 276 + void mon_event_count(void *info) 277 + { 278 + struct rdtgroup *rdtgrp, *entry; 279 + struct rmid_read *rr = info; 280 + struct list_head *head; 281 + 282 + rdtgrp = rr->rgrp; 283 + 284 + if (__mon_event_count(rdtgrp->mon.rmid, rr)) 285 + return; 286 + 287 + /* 288 + * For Ctrl groups read data from child monitor groups. 289 + */ 290 + head = &rdtgrp->mon.crdtgrp_list; 291 + 292 + if (rdtgrp->type == RDTCTRL_GROUP) { 293 + list_for_each_entry(entry, head, mon.crdtgrp_list) { 294 + if (__mon_event_count(entry->mon.rmid, rr)) 295 + return; 296 + } 297 + } 298 + } 299 + 300 + static void mbm_update(struct rdt_domain *d, int rmid) 301 + { 302 + struct rmid_read rr; 303 + 304 + rr.first = false; 305 + rr.d = d; 306 + 307 + /* 308 + * This is protected from concurrent reads from user 309 + * as both the user and we hold the global mutex. 310 + */ 311 + if (is_mbm_total_enabled()) { 312 + rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; 313 + __mon_event_count(rmid, &rr); 314 + } 315 + if (is_mbm_local_enabled()) { 316 + rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; 317 + __mon_event_count(rmid, &rr); 318 + } 319 + } 320 + 321 + /* 322 + * Handler to scan the limbo list and move the RMIDs 323 + * to free list whose occupancy < threshold_occupancy. 
324 + */ 325 + void cqm_handle_limbo(struct work_struct *work) 326 + { 327 + unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); 328 + int cpu = smp_processor_id(); 329 + struct rdt_resource *r; 330 + struct rdt_domain *d; 331 + 332 + mutex_lock(&rdtgroup_mutex); 333 + 334 + r = &rdt_resources_all[RDT_RESOURCE_L3]; 335 + d = get_domain_from_cpu(cpu, r); 336 + 337 + if (!d) { 338 + pr_warn_once("Failure to get domain for limbo worker\n"); 339 + goto out_unlock; 340 + } 341 + 342 + __check_limbo(d, false); 343 + 344 + if (has_busy_rmid(r, d)) 345 + schedule_delayed_work_on(cpu, &d->cqm_limbo, delay); 346 + 347 + out_unlock: 348 + mutex_unlock(&rdtgroup_mutex); 349 + } 350 + 351 + void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms) 352 + { 353 + unsigned long delay = msecs_to_jiffies(delay_ms); 354 + struct rdt_resource *r; 355 + int cpu; 356 + 357 + r = &rdt_resources_all[RDT_RESOURCE_L3]; 358 + 359 + cpu = cpumask_any(&dom->cpu_mask); 360 + dom->cqm_work_cpu = cpu; 361 + 362 + schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); 363 + } 364 + 365 + void mbm_handle_overflow(struct work_struct *work) 366 + { 367 + unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL); 368 + struct rdtgroup *prgrp, *crgrp; 369 + int cpu = smp_processor_id(); 370 + struct list_head *head; 371 + struct rdt_domain *d; 372 + 373 + mutex_lock(&rdtgroup_mutex); 374 + 375 + if (!static_branch_likely(&rdt_enable_key)) 376 + goto out_unlock; 377 + 378 + d = get_domain_from_cpu(cpu, &rdt_resources_all[RDT_RESOURCE_L3]); 379 + if (!d) 380 + goto out_unlock; 381 + 382 + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { 383 + mbm_update(d, prgrp->mon.rmid); 384 + 385 + head = &prgrp->mon.crdtgrp_list; 386 + list_for_each_entry(crgrp, head, mon.crdtgrp_list) 387 + mbm_update(d, crgrp->mon.rmid); 388 + } 389 + 390 + schedule_delayed_work_on(cpu, &d->mbm_over, delay); 391 + 392 + out_unlock: 393 + mutex_unlock(&rdtgroup_mutex); 394 + } 395 + 396 + void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms) 397 + { 398 + unsigned long delay = msecs_to_jiffies(delay_ms); 399 + int cpu; 400 + 401 + if (!static_branch_likely(&rdt_enable_key)) 402 + return; 403 + cpu = cpumask_any(&dom->cpu_mask); 404 + dom->mbm_work_cpu = cpu; 405 + schedule_delayed_work_on(cpu, &dom->mbm_over, delay); 406 + } 407 + 408 + static int dom_data_init(struct rdt_resource *r) 409 + { 410 + struct rmid_entry *entry = NULL; 411 + int i, nr_rmids; 412 + 413 + nr_rmids = r->num_rmid; 414 + rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL); 415 + if (!rmid_ptrs) 416 + return -ENOMEM; 417 + 418 + for (i = 0; i < nr_rmids; i++) { 419 + entry = &rmid_ptrs[i]; 420 + INIT_LIST_HEAD(&entry->list); 421 + 422 + entry->rmid = i; 423 + list_add_tail(&entry->list, &rmid_free_lru); 424 + } 425 + 426 + /* 427 + * RMID 0 is special and is always allocated. It's used for all 428 + * tasks that are not monitored. 
429 + */ 430 + entry = __rmid_entry(0); 431 + list_del(&entry->list); 432 + 433 + return 0; 434 + } 435 + 436 + static struct mon_evt llc_occupancy_event = { 437 + .name = "llc_occupancy", 438 + .evtid = QOS_L3_OCCUP_EVENT_ID, 439 + }; 440 + 441 + static struct mon_evt mbm_total_event = { 442 + .name = "mbm_total_bytes", 443 + .evtid = QOS_L3_MBM_TOTAL_EVENT_ID, 444 + }; 445 + 446 + static struct mon_evt mbm_local_event = { 447 + .name = "mbm_local_bytes", 448 + .evtid = QOS_L3_MBM_LOCAL_EVENT_ID, 449 + }; 450 + 451 + /* 452 + * Initialize the event list for the resource. 453 + * 454 + * Note that MBM events are also part of RDT_RESOURCE_L3 resource 455 + * because as per the SDM the total and local memory bandwidth 456 + * are enumerated as part of L3 monitoring. 457 + */ 458 + static void l3_mon_evt_init(struct rdt_resource *r) 459 + { 460 + INIT_LIST_HEAD(&r->evt_list); 461 + 462 + if (is_llc_occupancy_enabled()) 463 + list_add_tail(&llc_occupancy_event.list, &r->evt_list); 464 + if (is_mbm_total_enabled()) 465 + list_add_tail(&mbm_total_event.list, &r->evt_list); 466 + if (is_mbm_local_enabled()) 467 + list_add_tail(&mbm_local_event.list, &r->evt_list); 468 + } 469 + 470 + int rdt_get_mon_l3_config(struct rdt_resource *r) 471 + { 472 + int ret; 473 + 474 + r->mon_scale = boot_cpu_data.x86_cache_occ_scale; 475 + r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1; 476 + 477 + /* 478 + * A reasonable upper limit on the max threshold is the number 479 + * of lines tagged per RMID if all RMIDs have the same number of 480 + * lines tagged in the LLC. 481 + * 482 + * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC. 483 + */ 484 + intel_cqm_threshold = boot_cpu_data.x86_cache_size * 1024 / r->num_rmid; 485 + 486 + /* h/w works in units of "boot_cpu_data.x86_cache_occ_scale" */ 487 + intel_cqm_threshold /= r->mon_scale; 488 + 489 + ret = dom_data_init(r); 490 + if (ret) 491 + return ret; 492 + 493 + l3_mon_evt_init(r); 494 + 495 + r->mon_capable = true; 496 + r->mon_enabled = true; 497 + 498 + return 0; 499 + }
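intel_rdt_monitor.c reads the free-running MBM counters and accumulates deltas in struct mbm_state. The wrap handling in __mon_event_count() shifts both the current and previous MSR samples up by 64 - MBM_CNTR_WIDTH bits so that the subtraction wraps at the hardware counter width rather than at 64 bits. A minimal standalone sketch, assuming a 24-bit counter width (the constant itself is defined elsewhere in the series):

    /*
     * Standalone sketch, not part of the merged patch: the wrap-safe delta
     * used by __mon_event_count().  MBM_CNTR_WIDTH is assumed to be 24 bits
     * here; shifting both samples up by (64 - width) and back down makes
     * the subtraction wrap at the hardware counter width instead of at 64
     * bits.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define MBM_CNTR_WIDTH  24      /* assumed effective width of the MBM counters */

    static uint64_t mbm_delta(uint64_t cur_msr, uint64_t prev_msr)
    {
            unsigned int shift = 64 - MBM_CNTR_WIDTH;
            uint64_t chunks = (cur_msr << shift) - (prev_msr << shift);

            return chunks >> shift;
    }

    int main(void)
    {
            /* The counter wrapped: previous sample was near the 24-bit limit. */
            uint64_t prev = 0xfffff0, cur = 0x000010;

            /* Prints 32 (0x20 chunks moved across the wrap). */
            printf("chunks moved: %llu\n",
                   (unsigned long long)mbm_delta(cur, prev));
            return 0;
    }

mbm_handle_overflow() then only has to guarantee that every RMID is read at least once per MBM_OVERFLOW_INTERVAL, i.e. often enough that the counter cannot wrap more than once between two reads.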
+930 -211
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
··· 32 32 33 33 #include <uapi/linux/magic.h> 34 34 35 - #include <asm/intel_rdt.h> 36 - #include <asm/intel_rdt_common.h> 35 + #include <asm/intel_rdt_sched.h> 36 + #include "intel_rdt.h" 37 37 38 38 DEFINE_STATIC_KEY_FALSE(rdt_enable_key); 39 - struct kernfs_root *rdt_root; 39 + DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key); 40 + DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key); 41 + static struct kernfs_root *rdt_root; 40 42 struct rdtgroup rdtgroup_default; 41 43 LIST_HEAD(rdt_all_groups); 42 44 43 45 /* Kernel fs node for "info" directory under root */ 44 46 static struct kernfs_node *kn_info; 47 + 48 + /* Kernel fs node for "mon_groups" directory under root */ 49 + static struct kernfs_node *kn_mongrp; 50 + 51 + /* Kernel fs node for "mon_data" directory under root */ 52 + static struct kernfs_node *kn_mondata; 45 53 46 54 /* 47 55 * Trivial allocator for CLOSIDs. Since h/w only supports a small number, ··· 74 66 int rdt_min_closid = 32; 75 67 76 68 /* Compute rdt_min_closid across all resources */ 77 - for_each_enabled_rdt_resource(r) 69 + for_each_alloc_enabled_rdt_resource(r) 78 70 rdt_min_closid = min(rdt_min_closid, r->num_closid); 79 71 80 72 closid_free_map = BIT_MASK(rdt_min_closid) - 1; ··· 83 75 closid_free_map &= ~1; 84 76 } 85 77 86 - int closid_alloc(void) 78 + static int closid_alloc(void) 87 79 { 88 - int closid = ffs(closid_free_map); 80 + u32 closid = ffs(closid_free_map); 89 81 90 82 if (closid == 0) 91 83 return -ENOSPC; ··· 133 125 return 0; 134 126 } 135 127 136 - static int rdtgroup_add_files(struct kernfs_node *kn, struct rftype *rfts, 137 - int len) 138 - { 139 - struct rftype *rft; 140 - int ret; 141 - 142 - lockdep_assert_held(&rdtgroup_mutex); 143 - 144 - for (rft = rfts; rft < rfts + len; rft++) { 145 - ret = rdtgroup_add_file(kn, rft); 146 - if (ret) 147 - goto error; 148 - } 149 - 150 - return 0; 151 - error: 152 - pr_warn("Failed to add %s, err=%d\n", rft->name, ret); 153 - while (--rft >= rfts) 154 - kernfs_remove_by_name(kn, rft->name); 155 - return ret; 156 - } 157 - 158 128 static int rdtgroup_seqfile_show(struct seq_file *m, void *arg) 159 129 { 160 130 struct kernfs_open_file *of = m->private; ··· 158 172 .atomic_write_len = PAGE_SIZE, 159 173 .write = rdtgroup_file_write, 160 174 .seq_show = rdtgroup_seqfile_show, 175 + }; 176 + 177 + static struct kernfs_ops kf_mondata_ops = { 178 + .atomic_write_len = PAGE_SIZE, 179 + .seq_show = rdtgroup_mondata_show, 161 180 }; 162 181 163 182 static bool is_cpu_list(struct kernfs_open_file *of) ··· 194 203 /* 195 204 * This is safe against intel_rdt_sched_in() called from __switch_to() 196 205 * because __switch_to() is executed with interrupts disabled. A local call 197 - * from rdt_update_closid() is proteced against __switch_to() because 206 + * from update_closid_rmid() is proteced against __switch_to() because 198 207 * preemption is disabled. 199 208 */ 200 - static void rdt_update_cpu_closid(void *closid) 209 + static void update_cpu_closid_rmid(void *info) 201 210 { 202 - if (closid) 203 - this_cpu_write(cpu_closid, *(int *)closid); 211 + struct rdtgroup *r = info; 212 + 213 + if (r) { 214 + this_cpu_write(pqr_state.default_closid, r->closid); 215 + this_cpu_write(pqr_state.default_rmid, r->mon.rmid); 216 + } 217 + 204 218 /* 205 219 * We cannot unconditionally write the MSR because the current 206 220 * executing task might have its own closid selected. 
Just reuse ··· 217 221 /* 218 222 * Update the PGR_ASSOC MSR on all cpus in @cpu_mask, 219 223 * 220 - * Per task closids must have been set up before calling this function. 221 - * 222 - * The per cpu closids are updated with the smp function call, when @closid 223 - * is not NULL. If @closid is NULL then all affected percpu closids must 224 - * have been set up before calling this function. 224 + * Per task closids/rmids must have been set up before calling this function. 225 225 */ 226 226 static void 227 - rdt_update_closid(const struct cpumask *cpu_mask, int *closid) 227 + update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r) 228 228 { 229 229 int cpu = get_cpu(); 230 230 231 231 if (cpumask_test_cpu(cpu, cpu_mask)) 232 - rdt_update_cpu_closid(closid); 233 - smp_call_function_many(cpu_mask, rdt_update_cpu_closid, closid, 1); 232 + update_cpu_closid_rmid(r); 233 + smp_call_function_many(cpu_mask, update_cpu_closid_rmid, r, 1); 234 234 put_cpu(); 235 + } 236 + 237 + static int cpus_mon_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, 238 + cpumask_var_t tmpmask) 239 + { 240 + struct rdtgroup *prgrp = rdtgrp->mon.parent, *crgrp; 241 + struct list_head *head; 242 + 243 + /* Check whether cpus belong to parent ctrl group */ 244 + cpumask_andnot(tmpmask, newmask, &prgrp->cpu_mask); 245 + if (cpumask_weight(tmpmask)) 246 + return -EINVAL; 247 + 248 + /* Check whether cpus are dropped from this group */ 249 + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); 250 + if (cpumask_weight(tmpmask)) { 251 + /* Give any dropped cpus to parent rdtgroup */ 252 + cpumask_or(&prgrp->cpu_mask, &prgrp->cpu_mask, tmpmask); 253 + update_closid_rmid(tmpmask, prgrp); 254 + } 255 + 256 + /* 257 + * If we added cpus, remove them from previous group that owned them 258 + * and update per-cpu rmid 259 + */ 260 + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); 261 + if (cpumask_weight(tmpmask)) { 262 + head = &prgrp->mon.crdtgrp_list; 263 + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { 264 + if (crgrp == rdtgrp) 265 + continue; 266 + cpumask_andnot(&crgrp->cpu_mask, &crgrp->cpu_mask, 267 + tmpmask); 268 + } 269 + update_closid_rmid(tmpmask, rdtgrp); 270 + } 271 + 272 + /* Done pushing/pulling - update this group with new mask */ 273 + cpumask_copy(&rdtgrp->cpu_mask, newmask); 274 + 275 + return 0; 276 + } 277 + 278 + static void cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m) 279 + { 280 + struct rdtgroup *crgrp; 281 + 282 + cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m); 283 + /* update the child mon group masks as well*/ 284 + list_for_each_entry(crgrp, &r->mon.crdtgrp_list, mon.crdtgrp_list) 285 + cpumask_and(&crgrp->cpu_mask, &r->cpu_mask, &crgrp->cpu_mask); 286 + } 287 + 288 + static int cpus_ctrl_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, 289 + cpumask_var_t tmpmask, cpumask_var_t tmpmask1) 290 + { 291 + struct rdtgroup *r, *crgrp; 292 + struct list_head *head; 293 + 294 + /* Check whether cpus are dropped from this group */ 295 + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); 296 + if (cpumask_weight(tmpmask)) { 297 + /* Can't drop from default group */ 298 + if (rdtgrp == &rdtgroup_default) 299 + return -EINVAL; 300 + 301 + /* Give any dropped cpus to rdtgroup_default */ 302 + cpumask_or(&rdtgroup_default.cpu_mask, 303 + &rdtgroup_default.cpu_mask, tmpmask); 304 + update_closid_rmid(tmpmask, &rdtgroup_default); 305 + } 306 + 307 + /* 308 + * If we added cpus, remove them from previous group and 309 + * the prev group's child groups that owned 
them 310 + * and update per-cpu closid/rmid. 311 + */ 312 + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); 313 + if (cpumask_weight(tmpmask)) { 314 + list_for_each_entry(r, &rdt_all_groups, rdtgroup_list) { 315 + if (r == rdtgrp) 316 + continue; 317 + cpumask_and(tmpmask1, &r->cpu_mask, tmpmask); 318 + if (cpumask_weight(tmpmask1)) 319 + cpumask_rdtgrp_clear(r, tmpmask1); 320 + } 321 + update_closid_rmid(tmpmask, rdtgrp); 322 + } 323 + 324 + /* Done pushing/pulling - update this group with new mask */ 325 + cpumask_copy(&rdtgrp->cpu_mask, newmask); 326 + 327 + /* 328 + * Clear child mon group masks since there is a new parent mask 329 + * now and update the rmid for the cpus the child lost. 330 + */ 331 + head = &rdtgrp->mon.crdtgrp_list; 332 + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { 333 + cpumask_and(tmpmask, &rdtgrp->cpu_mask, &crgrp->cpu_mask); 334 + update_closid_rmid(tmpmask, rdtgrp); 335 + cpumask_clear(&crgrp->cpu_mask); 336 + } 337 + 338 + return 0; 235 339 } 236 340 237 341 static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of, 238 342 char *buf, size_t nbytes, loff_t off) 239 343 { 240 - cpumask_var_t tmpmask, newmask; 241 - struct rdtgroup *rdtgrp, *r; 344 + cpumask_var_t tmpmask, newmask, tmpmask1; 345 + struct rdtgroup *rdtgrp; 242 346 int ret; 243 347 244 348 if (!buf) ··· 348 252 return -ENOMEM; 349 253 if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) { 350 254 free_cpumask_var(tmpmask); 255 + return -ENOMEM; 256 + } 257 + if (!zalloc_cpumask_var(&tmpmask1, GFP_KERNEL)) { 258 + free_cpumask_var(tmpmask); 259 + free_cpumask_var(newmask); 351 260 return -ENOMEM; 352 261 } 353 262 ··· 377 276 goto unlock; 378 277 } 379 278 380 - /* Check whether cpus are dropped from this group */ 381 - cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); 382 - if (cpumask_weight(tmpmask)) { 383 - /* Can't drop from default group */ 384 - if (rdtgrp == &rdtgroup_default) { 385 - ret = -EINVAL; 386 - goto unlock; 387 - } 388 - /* Give any dropped cpus to rdtgroup_default */ 389 - cpumask_or(&rdtgroup_default.cpu_mask, 390 - &rdtgroup_default.cpu_mask, tmpmask); 391 - rdt_update_closid(tmpmask, &rdtgroup_default.closid); 392 - } 393 - 394 - /* 395 - * If we added cpus, remove them from previous group that owned them 396 - * and update per-cpu closid 397 - */ 398 - cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); 399 - if (cpumask_weight(tmpmask)) { 400 - list_for_each_entry(r, &rdt_all_groups, rdtgroup_list) { 401 - if (r == rdtgrp) 402 - continue; 403 - cpumask_andnot(&r->cpu_mask, &r->cpu_mask, tmpmask); 404 - } 405 - rdt_update_closid(tmpmask, &rdtgrp->closid); 406 - } 407 - 408 - /* Done pushing/pulling - update this group with new mask */ 409 - cpumask_copy(&rdtgrp->cpu_mask, newmask); 279 + if (rdtgrp->type == RDTCTRL_GROUP) 280 + ret = cpus_ctrl_write(rdtgrp, newmask, tmpmask, tmpmask1); 281 + else if (rdtgrp->type == RDTMON_GROUP) 282 + ret = cpus_mon_write(rdtgrp, newmask, tmpmask); 283 + else 284 + ret = -EINVAL; 410 285 411 286 unlock: 412 287 rdtgroup_kn_unlock(of->kn); 413 288 free_cpumask_var(tmpmask); 414 289 free_cpumask_var(newmask); 290 + free_cpumask_var(tmpmask1); 415 291 416 292 return ret ?: nbytes; 417 293 } ··· 414 336 if (atomic_dec_and_test(&rdtgrp->waitcount) && 415 337 (rdtgrp->flags & RDT_DELETED)) { 416 338 current->closid = 0; 339 + current->rmid = 0; 417 340 kfree(rdtgrp); 418 341 } 419 342 ··· 453 374 atomic_dec(&rdtgrp->waitcount); 454 375 kfree(callback); 455 376 } else { 456 - tsk->closid = rdtgrp->closid; 377 + /* 378 + * For 
ctrl_mon groups move both closid and rmid. 379 + * For monitor groups, can move the tasks only from 380 + * their parent CTRL group. 381 + */ 382 + if (rdtgrp->type == RDTCTRL_GROUP) { 383 + tsk->closid = rdtgrp->closid; 384 + tsk->rmid = rdtgrp->mon.rmid; 385 + } else if (rdtgrp->type == RDTMON_GROUP) { 386 + if (rdtgrp->mon.parent->closid == tsk->closid) 387 + tsk->rmid = rdtgrp->mon.rmid; 388 + else 389 + ret = -EINVAL; 390 + } 457 391 } 458 392 return ret; 459 393 } ··· 546 454 547 455 rcu_read_lock(); 548 456 for_each_process_thread(p, t) { 549 - if (t->closid == r->closid) 457 + if ((r->type == RDTCTRL_GROUP && t->closid == r->closid) || 458 + (r->type == RDTMON_GROUP && t->rmid == r->mon.rmid)) 550 459 seq_printf(s, "%d\n", t->pid); 551 460 } 552 461 rcu_read_unlock(); ··· 568 475 569 476 return ret; 570 477 } 571 - 572 - /* Files in each rdtgroup */ 573 - static struct rftype rdtgroup_base_files[] = { 574 - { 575 - .name = "cpus", 576 - .mode = 0644, 577 - .kf_ops = &rdtgroup_kf_single_ops, 578 - .write = rdtgroup_cpus_write, 579 - .seq_show = rdtgroup_cpus_show, 580 - }, 581 - { 582 - .name = "cpus_list", 583 - .mode = 0644, 584 - .kf_ops = &rdtgroup_kf_single_ops, 585 - .write = rdtgroup_cpus_write, 586 - .seq_show = rdtgroup_cpus_show, 587 - .flags = RFTYPE_FLAGS_CPUS_LIST, 588 - }, 589 - { 590 - .name = "tasks", 591 - .mode = 0644, 592 - .kf_ops = &rdtgroup_kf_single_ops, 593 - .write = rdtgroup_tasks_write, 594 - .seq_show = rdtgroup_tasks_show, 595 - }, 596 - { 597 - .name = "schemata", 598 - .mode = 0644, 599 - .kf_ops = &rdtgroup_kf_single_ops, 600 - .write = rdtgroup_schemata_write, 601 - .seq_show = rdtgroup_schemata_show, 602 - }, 603 - }; 604 478 605 479 static int rdt_num_closids_show(struct kernfs_open_file *of, 606 480 struct seq_file *seq, void *v) ··· 596 536 return 0; 597 537 } 598 538 539 + static int rdt_shareable_bits_show(struct kernfs_open_file *of, 540 + struct seq_file *seq, void *v) 541 + { 542 + struct rdt_resource *r = of->kn->parent->priv; 543 + 544 + seq_printf(seq, "%x\n", r->cache.shareable_bits); 545 + return 0; 546 + } 547 + 599 548 static int rdt_min_bw_show(struct kernfs_open_file *of, 600 549 struct seq_file *seq, void *v) 601 550 { 602 551 struct rdt_resource *r = of->kn->parent->priv; 603 552 604 553 seq_printf(seq, "%u\n", r->membw.min_bw); 554 + return 0; 555 + } 556 + 557 + static int rdt_num_rmids_show(struct kernfs_open_file *of, 558 + struct seq_file *seq, void *v) 559 + { 560 + struct rdt_resource *r = of->kn->parent->priv; 561 + 562 + seq_printf(seq, "%d\n", r->num_rmid); 563 + 564 + return 0; 565 + } 566 + 567 + static int rdt_mon_features_show(struct kernfs_open_file *of, 568 + struct seq_file *seq, void *v) 569 + { 570 + struct rdt_resource *r = of->kn->parent->priv; 571 + struct mon_evt *mevt; 572 + 573 + list_for_each_entry(mevt, &r->evt_list, list) 574 + seq_printf(seq, "%s\n", mevt->name); 575 + 605 576 return 0; 606 577 } 607 578 ··· 654 563 return 0; 655 564 } 656 565 566 + static int max_threshold_occ_show(struct kernfs_open_file *of, 567 + struct seq_file *seq, void *v) 568 + { 569 + struct rdt_resource *r = of->kn->parent->priv; 570 + 571 + seq_printf(seq, "%u\n", intel_cqm_threshold * r->mon_scale); 572 + 573 + return 0; 574 + } 575 + 576 + static ssize_t max_threshold_occ_write(struct kernfs_open_file *of, 577 + char *buf, size_t nbytes, loff_t off) 578 + { 579 + struct rdt_resource *r = of->kn->parent->priv; 580 + unsigned int bytes; 581 + int ret; 582 + 583 + ret = kstrtouint(buf, 0, &bytes); 584 + if (ret) 585 + return 
ret; 586 + 587 + if (bytes > (boot_cpu_data.x86_cache_size * 1024)) 588 + return -EINVAL; 589 + 590 + intel_cqm_threshold = bytes / r->mon_scale; 591 + 592 + return nbytes; 593 + } 594 + 657 595 /* rdtgroup information files for one cache resource. */ 658 - static struct rftype res_cache_info_files[] = { 596 + static struct rftype res_common_files[] = { 659 597 { 660 598 .name = "num_closids", 661 599 .mode = 0444, 662 600 .kf_ops = &rdtgroup_kf_single_ops, 663 601 .seq_show = rdt_num_closids_show, 602 + .fflags = RF_CTRL_INFO, 603 + }, 604 + { 605 + .name = "mon_features", 606 + .mode = 0444, 607 + .kf_ops = &rdtgroup_kf_single_ops, 608 + .seq_show = rdt_mon_features_show, 609 + .fflags = RF_MON_INFO, 610 + }, 611 + { 612 + .name = "num_rmids", 613 + .mode = 0444, 614 + .kf_ops = &rdtgroup_kf_single_ops, 615 + .seq_show = rdt_num_rmids_show, 616 + .fflags = RF_MON_INFO, 664 617 }, 665 618 { 666 619 .name = "cbm_mask", 667 620 .mode = 0444, 668 621 .kf_ops = &rdtgroup_kf_single_ops, 669 622 .seq_show = rdt_default_ctrl_show, 623 + .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE, 670 624 }, 671 625 { 672 626 .name = "min_cbm_bits", 673 627 .mode = 0444, 674 628 .kf_ops = &rdtgroup_kf_single_ops, 675 629 .seq_show = rdt_min_cbm_bits_show, 630 + .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE, 676 631 }, 677 - }; 678 - 679 - /* rdtgroup information files for memory bandwidth. */ 680 - static struct rftype res_mba_info_files[] = { 681 632 { 682 - .name = "num_closids", 633 + .name = "shareable_bits", 683 634 .mode = 0444, 684 635 .kf_ops = &rdtgroup_kf_single_ops, 685 - .seq_show = rdt_num_closids_show, 636 + .seq_show = rdt_shareable_bits_show, 637 + .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE, 686 638 }, 687 639 { 688 640 .name = "min_bandwidth", 689 641 .mode = 0444, 690 642 .kf_ops = &rdtgroup_kf_single_ops, 691 643 .seq_show = rdt_min_bw_show, 644 + .fflags = RF_CTRL_INFO | RFTYPE_RES_MB, 692 645 }, 693 646 { 694 647 .name = "bandwidth_gran", 695 648 .mode = 0444, 696 649 .kf_ops = &rdtgroup_kf_single_ops, 697 650 .seq_show = rdt_bw_gran_show, 651 + .fflags = RF_CTRL_INFO | RFTYPE_RES_MB, 698 652 }, 699 653 { 700 654 .name = "delay_linear", 701 655 .mode = 0444, 702 656 .kf_ops = &rdtgroup_kf_single_ops, 703 657 .seq_show = rdt_delay_linear_show, 658 + .fflags = RF_CTRL_INFO | RFTYPE_RES_MB, 659 + }, 660 + { 661 + .name = "max_threshold_occupancy", 662 + .mode = 0644, 663 + .kf_ops = &rdtgroup_kf_single_ops, 664 + .write = max_threshold_occ_write, 665 + .seq_show = max_threshold_occ_show, 666 + .fflags = RF_MON_INFO | RFTYPE_RES_CACHE, 667 + }, 668 + { 669 + .name = "cpus", 670 + .mode = 0644, 671 + .kf_ops = &rdtgroup_kf_single_ops, 672 + .write = rdtgroup_cpus_write, 673 + .seq_show = rdtgroup_cpus_show, 674 + .fflags = RFTYPE_BASE, 675 + }, 676 + { 677 + .name = "cpus_list", 678 + .mode = 0644, 679 + .kf_ops = &rdtgroup_kf_single_ops, 680 + .write = rdtgroup_cpus_write, 681 + .seq_show = rdtgroup_cpus_show, 682 + .flags = RFTYPE_FLAGS_CPUS_LIST, 683 + .fflags = RFTYPE_BASE, 684 + }, 685 + { 686 + .name = "tasks", 687 + .mode = 0644, 688 + .kf_ops = &rdtgroup_kf_single_ops, 689 + .write = rdtgroup_tasks_write, 690 + .seq_show = rdtgroup_tasks_show, 691 + .fflags = RFTYPE_BASE, 692 + }, 693 + { 694 + .name = "schemata", 695 + .mode = 0644, 696 + .kf_ops = &rdtgroup_kf_single_ops, 697 + .write = rdtgroup_schemata_write, 698 + .seq_show = rdtgroup_schemata_show, 699 + .fflags = RF_CTRL_BASE, 704 700 }, 705 701 }; 706 702 707 - void rdt_get_mba_infofile(struct rdt_resource *r) 703 + static int 
rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags) 708 704 { 709 - r->info_files = res_mba_info_files; 710 - r->nr_info_files = ARRAY_SIZE(res_mba_info_files); 705 + struct rftype *rfts, *rft; 706 + int ret, len; 707 + 708 + rfts = res_common_files; 709 + len = ARRAY_SIZE(res_common_files); 710 + 711 + lockdep_assert_held(&rdtgroup_mutex); 712 + 713 + for (rft = rfts; rft < rfts + len; rft++) { 714 + if ((fflags & rft->fflags) == rft->fflags) { 715 + ret = rdtgroup_add_file(kn, rft); 716 + if (ret) 717 + goto error; 718 + } 719 + } 720 + 721 + return 0; 722 + error: 723 + pr_warn("Failed to add %s, err=%d\n", rft->name, ret); 724 + while (--rft >= rfts) { 725 + if ((fflags & rft->fflags) == rft->fflags) 726 + kernfs_remove_by_name(kn, rft->name); 727 + } 728 + return ret; 711 729 } 712 730 713 - void rdt_get_cache_infofile(struct rdt_resource *r) 731 + static int rdtgroup_mkdir_info_resdir(struct rdt_resource *r, char *name, 732 + unsigned long fflags) 714 733 { 715 - r->info_files = res_cache_info_files; 716 - r->nr_info_files = ARRAY_SIZE(res_cache_info_files); 734 + struct kernfs_node *kn_subdir; 735 + int ret; 736 + 737 + kn_subdir = kernfs_create_dir(kn_info, name, 738 + kn_info->mode, r); 739 + if (IS_ERR(kn_subdir)) 740 + return PTR_ERR(kn_subdir); 741 + 742 + kernfs_get(kn_subdir); 743 + ret = rdtgroup_kn_set_ugid(kn_subdir); 744 + if (ret) 745 + return ret; 746 + 747 + ret = rdtgroup_add_files(kn_subdir, fflags); 748 + if (!ret) 749 + kernfs_activate(kn_subdir); 750 + 751 + return ret; 717 752 } 718 753 719 754 static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) 720 755 { 721 - struct kernfs_node *kn_subdir; 722 - struct rftype *res_info_files; 723 756 struct rdt_resource *r; 724 - int ret, len; 757 + unsigned long fflags; 758 + char name[32]; 759 + int ret; 725 760 726 761 /* create the directory */ 727 762 kn_info = kernfs_create_dir(parent_kn, "info", parent_kn->mode, NULL); ··· 855 638 return PTR_ERR(kn_info); 856 639 kernfs_get(kn_info); 857 640 858 - for_each_enabled_rdt_resource(r) { 859 - kn_subdir = kernfs_create_dir(kn_info, r->name, 860 - kn_info->mode, r); 861 - if (IS_ERR(kn_subdir)) { 862 - ret = PTR_ERR(kn_subdir); 863 - goto out_destroy; 864 - } 865 - kernfs_get(kn_subdir); 866 - ret = rdtgroup_kn_set_ugid(kn_subdir); 641 + for_each_alloc_enabled_rdt_resource(r) { 642 + fflags = r->fflags | RF_CTRL_INFO; 643 + ret = rdtgroup_mkdir_info_resdir(r, r->name, fflags); 867 644 if (ret) 868 645 goto out_destroy; 646 + } 869 647 870 - res_info_files = r->info_files; 871 - len = r->nr_info_files; 872 - 873 - ret = rdtgroup_add_files(kn_subdir, res_info_files, len); 648 + for_each_mon_enabled_rdt_resource(r) { 649 + fflags = r->fflags | RF_MON_INFO; 650 + sprintf(name, "%s_MON", r->name); 651 + ret = rdtgroup_mkdir_info_resdir(r, name, fflags); 874 652 if (ret) 875 653 goto out_destroy; 876 - kernfs_activate(kn_subdir); 877 654 } 878 655 879 656 /* ··· 889 678 return ret; 890 679 } 891 680 681 + static int 682 + mongroup_create_dir(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, 683 + char *name, struct kernfs_node **dest_kn) 684 + { 685 + struct kernfs_node *kn; 686 + int ret; 687 + 688 + /* create the directory */ 689 + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); 690 + if (IS_ERR(kn)) 691 + return PTR_ERR(kn); 692 + 693 + if (dest_kn) 694 + *dest_kn = kn; 695 + 696 + /* 697 + * This extra ref will be put in kernfs_remove() and guarantees 698 + * that @rdtgrp->kn is always accessible. 
699 + */ 700 + kernfs_get(kn); 701 + 702 + ret = rdtgroup_kn_set_ugid(kn); 703 + if (ret) 704 + goto out_destroy; 705 + 706 + kernfs_activate(kn); 707 + 708 + return 0; 709 + 710 + out_destroy: 711 + kernfs_remove(kn); 712 + return ret; 713 + } 892 714 static void l3_qos_cfg_update(void *arg) 893 715 { 894 716 bool *enable = arg; ··· 962 718 struct rdt_resource *r_l3 = &rdt_resources_all[RDT_RESOURCE_L3]; 963 719 int ret; 964 720 965 - if (!r_l3->capable || !r_l3data->capable || !r_l3code->capable) 721 + if (!r_l3->alloc_capable || !r_l3data->alloc_capable || 722 + !r_l3code->alloc_capable) 966 723 return -EINVAL; 967 724 968 725 ret = set_l3_qos_cfg(r_l3, true); 969 726 if (!ret) { 970 - r_l3->enabled = false; 971 - r_l3data->enabled = true; 972 - r_l3code->enabled = true; 727 + r_l3->alloc_enabled = false; 728 + r_l3data->alloc_enabled = true; 729 + r_l3code->alloc_enabled = true; 973 730 } 974 731 return ret; 975 732 } ··· 979 734 { 980 735 struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3]; 981 736 982 - r->enabled = r->capable; 737 + r->alloc_enabled = r->alloc_capable; 983 738 984 - if (rdt_resources_all[RDT_RESOURCE_L3DATA].enabled) { 985 - rdt_resources_all[RDT_RESOURCE_L3DATA].enabled = false; 986 - rdt_resources_all[RDT_RESOURCE_L3CODE].enabled = false; 739 + if (rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled) { 740 + rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled = false; 741 + rdt_resources_all[RDT_RESOURCE_L3CODE].alloc_enabled = false; 987 742 set_l3_qos_cfg(r, false); 988 743 } 989 744 } ··· 1068 823 } 1069 824 } 1070 825 826 + static int mkdir_mondata_all(struct kernfs_node *parent_kn, 827 + struct rdtgroup *prgrp, 828 + struct kernfs_node **mon_data_kn); 829 + 1071 830 static struct dentry *rdt_mount(struct file_system_type *fs_type, 1072 831 int flags, const char *unused_dev_name, 1073 832 void *data) 1074 833 { 834 + struct rdt_domain *dom; 835 + struct rdt_resource *r; 1075 836 struct dentry *dentry; 1076 837 int ret; 1077 838 ··· 1104 853 goto out_cdp; 1105 854 } 1106 855 856 + if (rdt_mon_capable) { 857 + ret = mongroup_create_dir(rdtgroup_default.kn, 858 + NULL, "mon_groups", 859 + &kn_mongrp); 860 + if (ret) { 861 + dentry = ERR_PTR(ret); 862 + goto out_info; 863 + } 864 + kernfs_get(kn_mongrp); 865 + 866 + ret = mkdir_mondata_all(rdtgroup_default.kn, 867 + &rdtgroup_default, &kn_mondata); 868 + if (ret) { 869 + dentry = ERR_PTR(ret); 870 + goto out_mongrp; 871 + } 872 + kernfs_get(kn_mondata); 873 + rdtgroup_default.mon.mon_data_kn = kn_mondata; 874 + } 875 + 1107 876 dentry = kernfs_mount(fs_type, flags, rdt_root, 1108 877 RDTGROUP_SUPER_MAGIC, NULL); 1109 878 if (IS_ERR(dentry)) 1110 - goto out_destroy; 879 + goto out_mondata; 1111 880 1112 - static_branch_enable(&rdt_enable_key); 881 + if (rdt_alloc_capable) 882 + static_branch_enable(&rdt_alloc_enable_key); 883 + if (rdt_mon_capable) 884 + static_branch_enable(&rdt_mon_enable_key); 885 + 886 + if (rdt_alloc_capable || rdt_mon_capable) 887 + static_branch_enable(&rdt_enable_key); 888 + 889 + if (is_mbm_enabled()) { 890 + r = &rdt_resources_all[RDT_RESOURCE_L3]; 891 + list_for_each_entry(dom, &r->domains, list) 892 + mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL); 893 + } 894 + 1113 895 goto out; 1114 896 1115 - out_destroy: 897 + out_mondata: 898 + if (rdt_mon_capable) 899 + kernfs_remove(kn_mondata); 900 + out_mongrp: 901 + if (rdt_mon_capable) 902 + kernfs_remove(kn_mongrp); 903 + out_info: 1116 904 kernfs_remove(kn_info); 1117 905 out_cdp: 1118 906 cdp_disable(); ··· 1199 909 
return 0; 1200 910 } 1201 911 912 + static bool is_closid_match(struct task_struct *t, struct rdtgroup *r) 913 + { 914 + return (rdt_alloc_capable && 915 + (r->type == RDTCTRL_GROUP) && (t->closid == r->closid)); 916 + } 917 + 918 + static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r) 919 + { 920 + return (rdt_mon_capable && 921 + (r->type == RDTMON_GROUP) && (t->rmid == r->mon.rmid)); 922 + } 923 + 1202 924 /* 1203 925 * Move tasks from one to the other group. If @from is NULL, then all tasks 1204 926 * in the systems are moved unconditionally (used for teardown). ··· 1226 924 1227 925 read_lock(&tasklist_lock); 1228 926 for_each_process_thread(p, t) { 1229 - if (!from || t->closid == from->closid) { 927 + if (!from || is_closid_match(t, from) || 928 + is_rmid_match(t, from)) { 1230 929 t->closid = to->closid; 930 + t->rmid = to->mon.rmid; 931 + 1231 932 #ifdef CONFIG_SMP 1232 933 /* 1233 934 * This is safe on x86 w/o barriers as the ordering ··· 1249 944 read_unlock(&tasklist_lock); 1250 945 } 1251 946 947 + static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp) 948 + { 949 + struct rdtgroup *sentry, *stmp; 950 + struct list_head *head; 951 + 952 + head = &rdtgrp->mon.crdtgrp_list; 953 + list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { 954 + free_rmid(sentry->mon.rmid); 955 + list_del(&sentry->mon.crdtgrp_list); 956 + kfree(sentry); 957 + } 958 + } 959 + 1252 960 /* 1253 961 * Forcibly remove all of subdirectories under root. 1254 962 */ ··· 1273 955 rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); 1274 956 1275 957 list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { 958 + /* Free any child rmids */ 959 + free_all_child_rdtgrp(rdtgrp); 960 + 1276 961 /* Remove each rdtgroup other than root */ 1277 962 if (rdtgrp == &rdtgroup_default) 1278 963 continue; ··· 1288 967 cpumask_or(&rdtgroup_default.cpu_mask, 1289 968 &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); 1290 969 970 + free_rmid(rdtgrp->mon.rmid); 971 + 1291 972 kernfs_remove(rdtgrp->kn); 1292 973 list_del(&rdtgrp->rdtgroup_list); 1293 974 kfree(rdtgrp); 1294 975 } 1295 976 /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */ 1296 977 get_online_cpus(); 1297 - rdt_update_closid(cpu_online_mask, &rdtgroup_default.closid); 978 + update_closid_rmid(cpu_online_mask, &rdtgroup_default); 1298 979 put_online_cpus(); 1299 980 1300 981 kernfs_remove(kn_info); 982 + kernfs_remove(kn_mongrp); 983 + kernfs_remove(kn_mondata); 1301 984 } 1302 985 1303 986 static void rdt_kill_sb(struct super_block *sb) ··· 1311 986 mutex_lock(&rdtgroup_mutex); 1312 987 1313 988 /*Put everything back to default values. 
*/ 1314 - for_each_enabled_rdt_resource(r) 989 + for_each_alloc_enabled_rdt_resource(r) 1315 990 reset_all_ctrls(r); 1316 991 cdp_disable(); 1317 992 rmdir_all_sub(); 993 + static_branch_disable(&rdt_alloc_enable_key); 994 + static_branch_disable(&rdt_mon_enable_key); 1318 995 static_branch_disable(&rdt_enable_key); 1319 996 kernfs_kill_sb(sb); 1320 997 mutex_unlock(&rdtgroup_mutex); ··· 1328 1001 .kill_sb = rdt_kill_sb, 1329 1002 }; 1330 1003 1331 - static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name, 1332 - umode_t mode) 1004 + static int mon_addfile(struct kernfs_node *parent_kn, const char *name, 1005 + void *priv) 1333 1006 { 1334 - struct rdtgroup *parent, *rdtgrp; 1335 1007 struct kernfs_node *kn; 1336 - int ret, closid; 1008 + int ret = 0; 1337 1009 1338 - /* Only allow mkdir in the root directory */ 1339 - if (parent_kn != rdtgroup_default.kn) 1340 - return -EPERM; 1010 + kn = __kernfs_create_file(parent_kn, name, 0444, 0, 1011 + &kf_mondata_ops, priv, NULL, NULL); 1012 + if (IS_ERR(kn)) 1013 + return PTR_ERR(kn); 1341 1014 1342 - /* Do not accept '\n' to avoid unparsable situation. */ 1343 - if (strchr(name, '\n')) 1344 - return -EINVAL; 1015 + ret = rdtgroup_kn_set_ugid(kn); 1016 + if (ret) { 1017 + kernfs_remove(kn); 1018 + return ret; 1019 + } 1345 1020 1346 - parent = rdtgroup_kn_lock_live(parent_kn); 1347 - if (!parent) { 1021 + return ret; 1022 + } 1023 + 1024 + /* 1025 + * Remove all subdirectories of mon_data of ctrl_mon groups 1026 + * and monitor groups with given domain id. 1027 + */ 1028 + void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, unsigned int dom_id) 1029 + { 1030 + struct rdtgroup *prgrp, *crgrp; 1031 + char name[32]; 1032 + 1033 + if (!r->mon_enabled) 1034 + return; 1035 + 1036 + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { 1037 + sprintf(name, "mon_%s_%02d", r->name, dom_id); 1038 + kernfs_remove_by_name(prgrp->mon.mon_data_kn, name); 1039 + 1040 + list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) 1041 + kernfs_remove_by_name(crgrp->mon.mon_data_kn, name); 1042 + } 1043 + } 1044 + 1045 + static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, 1046 + struct rdt_domain *d, 1047 + struct rdt_resource *r, struct rdtgroup *prgrp) 1048 + { 1049 + union mon_data_bits priv; 1050 + struct kernfs_node *kn; 1051 + struct mon_evt *mevt; 1052 + struct rmid_read rr; 1053 + char name[32]; 1054 + int ret; 1055 + 1056 + sprintf(name, "mon_%s_%02d", r->name, d->id); 1057 + /* create the directory */ 1058 + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); 1059 + if (IS_ERR(kn)) 1060 + return PTR_ERR(kn); 1061 + 1062 + /* 1063 + * This extra ref will be put in kernfs_remove() and guarantees 1064 + * that kn is always accessible. 
1065 + */ 1066 + kernfs_get(kn); 1067 + ret = rdtgroup_kn_set_ugid(kn); 1068 + if (ret) 1069 + goto out_destroy; 1070 + 1071 + if (WARN_ON(list_empty(&r->evt_list))) { 1072 + ret = -EPERM; 1073 + goto out_destroy; 1074 + } 1075 + 1076 + priv.u.rid = r->rid; 1077 + priv.u.domid = d->id; 1078 + list_for_each_entry(mevt, &r->evt_list, list) { 1079 + priv.u.evtid = mevt->evtid; 1080 + ret = mon_addfile(kn, mevt->name, priv.priv); 1081 + if (ret) 1082 + goto out_destroy; 1083 + 1084 + if (is_mbm_event(mevt->evtid)) 1085 + mon_event_read(&rr, d, prgrp, mevt->evtid, true); 1086 + } 1087 + kernfs_activate(kn); 1088 + return 0; 1089 + 1090 + out_destroy: 1091 + kernfs_remove(kn); 1092 + return ret; 1093 + } 1094 + 1095 + /* 1096 + * Add all subdirectories of mon_data for "ctrl_mon" groups 1097 + * and "monitor" groups with given domain id. 1098 + */ 1099 + void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, 1100 + struct rdt_domain *d) 1101 + { 1102 + struct kernfs_node *parent_kn; 1103 + struct rdtgroup *prgrp, *crgrp; 1104 + struct list_head *head; 1105 + 1106 + if (!r->mon_enabled) 1107 + return; 1108 + 1109 + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { 1110 + parent_kn = prgrp->mon.mon_data_kn; 1111 + mkdir_mondata_subdir(parent_kn, d, r, prgrp); 1112 + 1113 + head = &prgrp->mon.crdtgrp_list; 1114 + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { 1115 + parent_kn = crgrp->mon.mon_data_kn; 1116 + mkdir_mondata_subdir(parent_kn, d, r, crgrp); 1117 + } 1118 + } 1119 + } 1120 + 1121 + static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, 1122 + struct rdt_resource *r, 1123 + struct rdtgroup *prgrp) 1124 + { 1125 + struct rdt_domain *dom; 1126 + int ret; 1127 + 1128 + list_for_each_entry(dom, &r->domains, list) { 1129 + ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp); 1130 + if (ret) 1131 + return ret; 1132 + } 1133 + 1134 + return 0; 1135 + } 1136 + 1137 + /* 1138 + * This creates a directory mon_data which contains the monitored data. 1139 + * 1140 + * mon_data has one directory for each domain whic are named 1141 + * in the format mon_<domain_name>_<domain_id>. For ex: A mon_data 1142 + * with L3 domain looks as below: 1143 + * ./mon_data: 1144 + * mon_L3_00 1145 + * mon_L3_01 1146 + * mon_L3_02 1147 + * ... 1148 + * 1149 + * Each domain directory has one file per event: 1150 + * ./mon_L3_00/: 1151 + * llc_occupancy 1152 + * 1153 + */ 1154 + static int mkdir_mondata_all(struct kernfs_node *parent_kn, 1155 + struct rdtgroup *prgrp, 1156 + struct kernfs_node **dest_kn) 1157 + { 1158 + struct rdt_resource *r; 1159 + struct kernfs_node *kn; 1160 + int ret; 1161 + 1162 + /* 1163 + * Create the mon_data directory first. 1164 + */ 1165 + ret = mongroup_create_dir(parent_kn, NULL, "mon_data", &kn); 1166 + if (ret) 1167 + return ret; 1168 + 1169 + if (dest_kn) 1170 + *dest_kn = kn; 1171 + 1172 + /* 1173 + * Create the subdirectories for each domain. 
Note that all events 1174 + * in a domain like L3 are grouped into a resource whose domain is L3 1175 + */ 1176 + for_each_mon_enabled_rdt_resource(r) { 1177 + ret = mkdir_mondata_subdir_alldom(kn, r, prgrp); 1178 + if (ret) 1179 + goto out_destroy; 1180 + } 1181 + 1182 + return 0; 1183 + 1184 + out_destroy: 1185 + kernfs_remove(kn); 1186 + return ret; 1187 + } 1188 + 1189 + static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, 1190 + struct kernfs_node *prgrp_kn, 1191 + const char *name, umode_t mode, 1192 + enum rdt_group_type rtype, struct rdtgroup **r) 1193 + { 1194 + struct rdtgroup *prdtgrp, *rdtgrp; 1195 + struct kernfs_node *kn; 1196 + uint files = 0; 1197 + int ret; 1198 + 1199 + prdtgrp = rdtgroup_kn_lock_live(prgrp_kn); 1200 + if (!prdtgrp) { 1348 1201 ret = -ENODEV; 1349 1202 goto out_unlock; 1350 1203 } 1351 - 1352 - ret = closid_alloc(); 1353 - if (ret < 0) 1354 - goto out_unlock; 1355 - closid = ret; 1356 1204 1357 1205 /* allocate the rdtgroup. */ 1358 1206 rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL); 1359 1207 if (!rdtgrp) { 1360 1208 ret = -ENOSPC; 1361 - goto out_closid_free; 1209 + goto out_unlock; 1362 1210 } 1363 - rdtgrp->closid = closid; 1364 - list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups); 1211 + *r = rdtgrp; 1212 + rdtgrp->mon.parent = prdtgrp; 1213 + rdtgrp->type = rtype; 1214 + INIT_LIST_HEAD(&rdtgrp->mon.crdtgrp_list); 1365 1215 1366 1216 /* kernfs creates the directory for rdtgrp */ 1367 - kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp); 1217 + kn = kernfs_create_dir(parent_kn, name, mode, rdtgrp); 1368 1218 if (IS_ERR(kn)) { 1369 1219 ret = PTR_ERR(kn); 1370 - goto out_cancel_ref; 1220 + goto out_free_rgrp; 1371 1221 } 1372 1222 rdtgrp->kn = kn; 1373 1223 ··· 1560 1056 if (ret) 1561 1057 goto out_destroy; 1562 1058 1563 - ret = rdtgroup_add_files(kn, rdtgroup_base_files, 1564 - ARRAY_SIZE(rdtgroup_base_files)); 1059 + files = RFTYPE_BASE | RFTYPE_CTRL; 1060 + files = RFTYPE_BASE | BIT(RF_CTRLSHIFT + rtype); 1061 + ret = rdtgroup_add_files(kn, files); 1565 1062 if (ret) 1566 1063 goto out_destroy; 1567 1064 1065 + if (rdt_mon_capable) { 1066 + ret = alloc_rmid(); 1067 + if (ret < 0) 1068 + goto out_destroy; 1069 + rdtgrp->mon.rmid = ret; 1070 + 1071 + ret = mkdir_mondata_all(kn, rdtgrp, &rdtgrp->mon.mon_data_kn); 1072 + if (ret) 1073 + goto out_idfree; 1074 + } 1568 1075 kernfs_activate(kn); 1569 1076 1570 - ret = 0; 1571 - goto out_unlock; 1077 + /* 1078 + * The caller unlocks the prgrp_kn upon success. 1079 + */ 1080 + return 0; 1572 1081 1082 + out_idfree: 1083 + free_rmid(rdtgrp->mon.rmid); 1573 1084 out_destroy: 1574 1085 kernfs_remove(rdtgrp->kn); 1575 - out_cancel_ref: 1576 - list_del(&rdtgrp->rdtgroup_list); 1086 + out_free_rgrp: 1577 1087 kfree(rdtgrp); 1578 - out_closid_free: 1579 - closid_free(closid); 1580 1088 out_unlock: 1581 - rdtgroup_kn_unlock(parent_kn); 1089 + rdtgroup_kn_unlock(prgrp_kn); 1582 1090 return ret; 1091 + } 1092 + 1093 + static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp) 1094 + { 1095 + kernfs_remove(rgrp->kn); 1096 + free_rmid(rgrp->mon.rmid); 1097 + kfree(rgrp); 1098 + } 1099 + 1100 + /* 1101 + * Create a monitor group under "mon_groups" directory of a control 1102 + * and monitor group(ctrl_mon). This is a resource group 1103 + * to monitor a subset of tasks and cpus in its parent ctrl_mon group. 
1104 + */ 1105 + static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn, 1106 + struct kernfs_node *prgrp_kn, 1107 + const char *name, 1108 + umode_t mode) 1109 + { 1110 + struct rdtgroup *rdtgrp, *prgrp; 1111 + int ret; 1112 + 1113 + ret = mkdir_rdt_prepare(parent_kn, prgrp_kn, name, mode, RDTMON_GROUP, 1114 + &rdtgrp); 1115 + if (ret) 1116 + return ret; 1117 + 1118 + prgrp = rdtgrp->mon.parent; 1119 + rdtgrp->closid = prgrp->closid; 1120 + 1121 + /* 1122 + * Add the rdtgrp to the list of rdtgrps the parent 1123 + * ctrl_mon group has to track. 1124 + */ 1125 + list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list); 1126 + 1127 + rdtgroup_kn_unlock(prgrp_kn); 1128 + return ret; 1129 + } 1130 + 1131 + /* 1132 + * These are rdtgroups created under the root directory. Can be used 1133 + * to allocate and monitor resources. 1134 + */ 1135 + static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn, 1136 + struct kernfs_node *prgrp_kn, 1137 + const char *name, umode_t mode) 1138 + { 1139 + struct rdtgroup *rdtgrp; 1140 + struct kernfs_node *kn; 1141 + u32 closid; 1142 + int ret; 1143 + 1144 + ret = mkdir_rdt_prepare(parent_kn, prgrp_kn, name, mode, RDTCTRL_GROUP, 1145 + &rdtgrp); 1146 + if (ret) 1147 + return ret; 1148 + 1149 + kn = rdtgrp->kn; 1150 + ret = closid_alloc(); 1151 + if (ret < 0) 1152 + goto out_common_fail; 1153 + closid = ret; 1154 + 1155 + rdtgrp->closid = closid; 1156 + list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups); 1157 + 1158 + if (rdt_mon_capable) { 1159 + /* 1160 + * Create an empty mon_groups directory to hold the subset 1161 + * of tasks and cpus to monitor. 1162 + */ 1163 + ret = mongroup_create_dir(kn, NULL, "mon_groups", NULL); 1164 + if (ret) 1165 + goto out_id_free; 1166 + } 1167 + 1168 + goto out_unlock; 1169 + 1170 + out_id_free: 1171 + closid_free(closid); 1172 + list_del(&rdtgrp->rdtgroup_list); 1173 + out_common_fail: 1174 + mkdir_rdt_prepare_clean(rdtgrp); 1175 + out_unlock: 1176 + rdtgroup_kn_unlock(prgrp_kn); 1177 + return ret; 1178 + } 1179 + 1180 + /* 1181 + * We allow creating mon groups only with in a directory called "mon_groups" 1182 + * which is present in every ctrl_mon group. Check if this is a valid 1183 + * "mon_groups" directory. 1184 + * 1185 + * 1. The directory should be named "mon_groups". 1186 + * 2. The mon group itself should "not" be named "mon_groups". 1187 + * This makes sure "mon_groups" directory always has a ctrl_mon group 1188 + * as parent. 1189 + */ 1190 + static bool is_mon_groups(struct kernfs_node *kn, const char *name) 1191 + { 1192 + return (!strcmp(kn->name, "mon_groups") && 1193 + strcmp(name, "mon_groups")); 1194 + } 1195 + 1196 + static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name, 1197 + umode_t mode) 1198 + { 1199 + /* Do not accept '\n' to avoid unparsable situation. */ 1200 + if (strchr(name, '\n')) 1201 + return -EINVAL; 1202 + 1203 + /* 1204 + * If the parent directory is the root directory and RDT 1205 + * allocation is supported, add a control and monitoring 1206 + * subdirectory 1207 + */ 1208 + if (rdt_alloc_capable && parent_kn == rdtgroup_default.kn) 1209 + return rdtgroup_mkdir_ctrl_mon(parent_kn, parent_kn, name, mode); 1210 + 1211 + /* 1212 + * If RDT monitoring is supported and the parent directory is a valid 1213 + * "mon_groups" directory, add a monitoring subdirectory. 
1214 + */ 1215 + if (rdt_mon_capable && is_mon_groups(parent_kn, name)) 1216 + return rdtgroup_mkdir_mon(parent_kn, parent_kn->parent, name, mode); 1217 + 1218 + return -EPERM; 1219 + } 1220 + 1221 + static int rdtgroup_rmdir_mon(struct kernfs_node *kn, struct rdtgroup *rdtgrp, 1222 + cpumask_var_t tmpmask) 1223 + { 1224 + struct rdtgroup *prdtgrp = rdtgrp->mon.parent; 1225 + int cpu; 1226 + 1227 + /* Give any tasks back to the parent group */ 1228 + rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask); 1229 + 1230 + /* Update per cpu rmid of the moved CPUs first */ 1231 + for_each_cpu(cpu, &rdtgrp->cpu_mask) 1232 + per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid; 1233 + /* 1234 + * Update the MSR on moved CPUs and CPUs which have moved 1235 + * task running on them. 1236 + */ 1237 + cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); 1238 + update_closid_rmid(tmpmask, NULL); 1239 + 1240 + rdtgrp->flags = RDT_DELETED; 1241 + free_rmid(rdtgrp->mon.rmid); 1242 + 1243 + /* 1244 + * Remove the rdtgrp from the parent ctrl_mon group's list 1245 + */ 1246 + WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); 1247 + list_del(&rdtgrp->mon.crdtgrp_list); 1248 + 1249 + /* 1250 + * one extra hold on this, will drop when we kfree(rdtgrp) 1251 + * in rdtgroup_kn_unlock() 1252 + */ 1253 + kernfs_get(kn); 1254 + kernfs_remove(rdtgrp->kn); 1255 + 1256 + return 0; 1257 + } 1258 + 1259 + static int rdtgroup_rmdir_ctrl(struct kernfs_node *kn, struct rdtgroup *rdtgrp, 1260 + cpumask_var_t tmpmask) 1261 + { 1262 + int cpu; 1263 + 1264 + /* Give any tasks back to the default group */ 1265 + rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask); 1266 + 1267 + /* Give any CPUs back to the default group */ 1268 + cpumask_or(&rdtgroup_default.cpu_mask, 1269 + &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); 1270 + 1271 + /* Update per cpu closid and rmid of the moved CPUs first */ 1272 + for_each_cpu(cpu, &rdtgrp->cpu_mask) { 1273 + per_cpu(pqr_state.default_closid, cpu) = rdtgroup_default.closid; 1274 + per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid; 1275 + } 1276 + 1277 + /* 1278 + * Update the MSR on moved CPUs and CPUs which have moved 1279 + * task running on them. 1280 + */ 1281 + cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); 1282 + update_closid_rmid(tmpmask, NULL); 1283 + 1284 + rdtgrp->flags = RDT_DELETED; 1285 + closid_free(rdtgrp->closid); 1286 + free_rmid(rdtgrp->mon.rmid); 1287 + 1288 + /* 1289 + * Free all the child monitor group rmids. 
1290 + */ 1291 + free_all_child_rdtgrp(rdtgrp); 1292 + 1293 + list_del(&rdtgrp->rdtgroup_list); 1294 + 1295 + /* 1296 + * one extra hold on this, will drop when we kfree(rdtgrp) 1297 + * in rdtgroup_kn_unlock() 1298 + */ 1299 + kernfs_get(kn); 1300 + kernfs_remove(rdtgrp->kn); 1301 + 1302 + return 0; 1583 1303 } 1584 1304 1585 1305 static int rdtgroup_rmdir(struct kernfs_node *kn) 1586 1306 { 1587 - int ret, cpu, closid = rdtgroup_default.closid; 1307 + struct kernfs_node *parent_kn = kn->parent; 1588 1308 struct rdtgroup *rdtgrp; 1589 1309 cpumask_var_t tmpmask; 1310 + int ret = 0; 1590 1311 1591 1312 if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) 1592 1313 return -ENOMEM; ··· 1822 1093 goto out; 1823 1094 } 1824 1095 1825 - /* Give any tasks back to the default group */ 1826 - rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask); 1827 - 1828 - /* Give any CPUs back to the default group */ 1829 - cpumask_or(&rdtgroup_default.cpu_mask, 1830 - &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); 1831 - 1832 - /* Update per cpu closid of the moved CPUs first */ 1833 - for_each_cpu(cpu, &rdtgrp->cpu_mask) 1834 - per_cpu(cpu_closid, cpu) = closid; 1835 1096 /* 1836 - * Update the MSR on moved CPUs and CPUs which have moved 1837 - * task running on them. 1097 + * If the rdtgroup is a ctrl_mon group and parent directory 1098 + * is the root directory, remove the ctrl_mon group. 1099 + * 1100 + * If the rdtgroup is a mon group and parent directory 1101 + * is a valid "mon_groups" directory, remove the mon group. 1838 1102 */ 1839 - cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); 1840 - rdt_update_closid(tmpmask, NULL); 1103 + if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn) 1104 + ret = rdtgroup_rmdir_ctrl(kn, rdtgrp, tmpmask); 1105 + else if (rdtgrp->type == RDTMON_GROUP && 1106 + is_mon_groups(parent_kn, kn->name)) 1107 + ret = rdtgroup_rmdir_mon(kn, rdtgrp, tmpmask); 1108 + else 1109 + ret = -EPERM; 1841 1110 1842 - rdtgrp->flags = RDT_DELETED; 1843 - closid_free(rdtgrp->closid); 1844 - list_del(&rdtgrp->rdtgroup_list); 1845 - 1846 - /* 1847 - * one extra hold on this, will drop when we kfree(rdtgrp) 1848 - * in rdtgroup_kn_unlock() 1849 - */ 1850 - kernfs_get(kn); 1851 - kernfs_remove(rdtgrp->kn); 1852 - ret = 0; 1853 1111 out: 1854 1112 rdtgroup_kn_unlock(kn); 1855 1113 free_cpumask_var(tmpmask); ··· 1845 1129 1846 1130 static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf) 1847 1131 { 1848 - if (rdt_resources_all[RDT_RESOURCE_L3DATA].enabled) 1132 + if (rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled) 1849 1133 seq_puts(seq, ",cdp"); 1850 1134 return 0; 1851 1135 } ··· 1869 1153 mutex_lock(&rdtgroup_mutex); 1870 1154 1871 1155 rdtgroup_default.closid = 0; 1156 + rdtgroup_default.mon.rmid = 0; 1157 + rdtgroup_default.type = RDTCTRL_GROUP; 1158 + INIT_LIST_HEAD(&rdtgroup_default.mon.crdtgrp_list); 1159 + 1872 1160 list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups); 1873 1161 1874 - ret = rdtgroup_add_files(rdt_root->kn, rdtgroup_base_files, 1875 - ARRAY_SIZE(rdtgroup_base_files)); 1162 + ret = rdtgroup_add_files(rdt_root->kn, RF_CTRL_BASE); 1876 1163 if (ret) { 1877 1164 kernfs_destroy_root(rdt_root); 1878 1165 goto out;
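The mkdir/rmdir handlers above are what userspace reaches through the mounted resctrl filesystem: a directory created under the root becomes a CTRL_MON group, a directory created under a group's "mon_groups" subdirectory becomes a MON group, and rmdir hands the tasks, CPUs and RMIDs back to the parent or default group. A minimal userspace sketch of that flow, assuming the conventional /sys/fs/resctrl mount point; the group names "grp0" and "mon0" are only illustrative:

        /* Sketch: exercise rdtgroup_mkdir_ctrl_mon()/rdtgroup_mkdir_mon() and
         * the matching rmdir paths from userspace. Paths assume resctrl is
         * mounted at /sys/fs/resctrl; group names are made up.
         */
        #include <errno.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/types.h>
        #include <unistd.h>

        int main(void)
        {
                /* A directory under the root becomes a CTRL_MON group
                 * (the kernel allocates a closid and, if monitoring is
                 * available, an rmid plus the mon_groups/mon_data dirs). */
                if (mkdir("/sys/fs/resctrl/grp0", 0755) == -1 && errno != EEXIST) {
                        perror("mkdir ctrl_mon group");
                        return 1;
                }

                /* A directory under mon_groups/ of a CTRL_MON group becomes
                 * a MON group: rmid only, closid inherited from the parent. */
                if (mkdir("/sys/fs/resctrl/grp0/mon_groups/mon0", 0755) == -1) {
                        perror("mkdir mon group");
                        return 1;
                }

                /* rmdir of the MON group returns its tasks and rmid to grp0;
                 * rmdir of the CTRL_MON group returns everything to the
                 * default (root) group. */
                rmdir("/sys/fs/resctrl/grp0/mon_groups/mon0");
                rmdir("/sys/fs/resctrl/grp0");
                return 0;
        }

Creating a directory anywhere else, or naming a group "mon_groups", is rejected with -EPERM, matching the is_mon_groups() check above.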
+61 -6
arch/x86/kernel/cpu/{intel_rdt_schemata.c → intel_rdt_ctrlmondata.c}
··· 26 26 #include <linux/kernfs.h> 27 27 #include <linux/seq_file.h> 28 28 #include <linux/slab.h> 29 - #include <asm/intel_rdt.h> 29 + #include "intel_rdt.h" 30 30 31 31 /* 32 32 * Check whether MBA bandwidth percentage value is correct. The value is ··· 192 192 { 193 193 struct rdt_resource *r; 194 194 195 - for_each_enabled_rdt_resource(r) { 195 + for_each_alloc_enabled_rdt_resource(r) { 196 196 if (!strcmp(resname, r->name) && closid < r->num_closid) 197 197 return parse_line(tok, r); 198 198 } ··· 221 221 222 222 closid = rdtgrp->closid; 223 223 224 - for_each_enabled_rdt_resource(r) { 224 + for_each_alloc_enabled_rdt_resource(r) { 225 225 list_for_each_entry(dom, &r->domains, list) 226 226 dom->have_new_ctrl = false; 227 227 } ··· 237 237 goto out; 238 238 } 239 239 240 - for_each_enabled_rdt_resource(r) { 240 + for_each_alloc_enabled_rdt_resource(r) { 241 241 ret = update_domains(r, closid); 242 242 if (ret) 243 243 goto out; ··· 269 269 { 270 270 struct rdtgroup *rdtgrp; 271 271 struct rdt_resource *r; 272 - int closid, ret = 0; 272 + int ret = 0; 273 + u32 closid; 273 274 274 275 rdtgrp = rdtgroup_kn_lock_live(of->kn); 275 276 if (rdtgrp) { 276 277 closid = rdtgrp->closid; 277 - for_each_enabled_rdt_resource(r) { 278 + for_each_alloc_enabled_rdt_resource(r) { 278 279 if (closid < r->num_closid) 279 280 show_doms(s, r, closid); 280 281 } 281 282 } else { 282 283 ret = -ENOENT; 283 284 } 285 + rdtgroup_kn_unlock(of->kn); 286 + return ret; 287 + } 288 + 289 + void mon_event_read(struct rmid_read *rr, struct rdt_domain *d, 290 + struct rdtgroup *rdtgrp, int evtid, int first) 291 + { 292 + /* 293 + * setup the parameters to send to the IPI to read the data. 294 + */ 295 + rr->rgrp = rdtgrp; 296 + rr->evtid = evtid; 297 + rr->d = d; 298 + rr->val = 0; 299 + rr->first = first; 300 + 301 + smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1); 302 + } 303 + 304 + int rdtgroup_mondata_show(struct seq_file *m, void *arg) 305 + { 306 + struct kernfs_open_file *of = m->private; 307 + u32 resid, evtid, domid; 308 + struct rdtgroup *rdtgrp; 309 + struct rdt_resource *r; 310 + union mon_data_bits md; 311 + struct rdt_domain *d; 312 + struct rmid_read rr; 313 + int ret = 0; 314 + 315 + rdtgrp = rdtgroup_kn_lock_live(of->kn); 316 + 317 + md.priv = of->kn->priv; 318 + resid = md.u.rid; 319 + domid = md.u.domid; 320 + evtid = md.u.evtid; 321 + 322 + r = &rdt_resources_all[resid]; 323 + d = rdt_find_domain(r, domid, NULL); 324 + if (!d) { 325 + ret = -ENOENT; 326 + goto out; 327 + } 328 + 329 + mon_event_read(&rr, d, rdtgrp, evtid, false); 330 + 331 + if (rr.val & RMID_VAL_ERROR) 332 + seq_puts(m, "Error\n"); 333 + else if (rr.val & RMID_VAL_UNAVAIL) 334 + seq_puts(m, "Unavailable\n"); 335 + else 336 + seq_printf(m, "%llu\n", rr.val * r->mon_scale); 337 + 338 + out: 284 339 rdtgroup_kn_unlock(of->kn); 285 340 return ret; 286 341 }
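The renamed file now carries both halves of the interface: the schemata parser for allocation and rdtgroup_mondata_show() for monitoring. On the allocation side, a write to a group's "schemata" file is split per resource and per domain, and only the domains named in the line are updated. A hedged userspace sketch of that write/read-back cycle; the /sys/fs/resctrl path and "grp0" group are assumed, and "L3:0=f" presumes an L3 CAT capable system that has a cache domain with id 0:

        /* Sketch: drive rdtgroup_schemata_write()/show() for one group.
         * Assumes /sys/fs/resctrl is mounted and grp0 already exists.
         */
        #include <stdio.h>

        int main(void)
        {
                char line[256];
                FILE *f;

                /* Allocation: give grp0 the low four ways of L3 cache id 0.
                 * Domains not named in the line keep their current masks. */
                f = fopen("/sys/fs/resctrl/grp0/schemata", "w");
                if (!f)
                        return 1;
                fprintf(f, "L3:0=f\n");
                fclose(f);

                /* Read back: one line per alloc-enabled resource for this
                 * group's closid, as emitted by rdtgroup_schemata_show(). */
                f = fopen("/sys/fs/resctrl/grp0/schemata", "r");
                if (!f)
                        return 1;
                while (fgets(line, sizeof(line), f))
                        fputs(line, stdout);
                fclose(f);
                return 0;
        }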
+1 -1
arch/x86/kernel/process_32.c
··· 56 56 #include <asm/debugreg.h> 57 57 #include <asm/switch_to.h> 58 58 #include <asm/vm86.h> 59 - #include <asm/intel_rdt.h> 59 + #include <asm/intel_rdt_sched.h> 60 60 #include <asm/proto.h> 61 61 62 62 void __show_regs(struct pt_regs *regs, int all)
+1 -1
arch/x86/kernel/process_64.c
··· 52 52 #include <asm/switch_to.h> 53 53 #include <asm/xen/hypervisor.h> 54 54 #include <asm/vdso.h> 55 - #include <asm/intel_rdt.h> 55 + #include <asm/intel_rdt_sched.h> 56 56 #include <asm/unistd.h> 57 57 #ifdef CONFIG_IA32_EMULATION 58 58 /* Not included via unistd.h */
-18
include/linux/perf_event.h
··· 139 139 /* for tp_event->class */ 140 140 struct list_head tp_list; 141 141 }; 142 - struct { /* intel_cqm */ 143 - int cqm_state; 144 - u32 cqm_rmid; 145 - int is_group_event; 146 - struct list_head cqm_events_entry; 147 - struct list_head cqm_groups_entry; 148 - struct list_head cqm_group_entry; 149 - }; 150 142 struct { /* amd_power */ 151 143 u64 pwr_acc; 152 144 u64 ptsc; ··· 404 412 */ 405 413 size_t task_ctx_size; 406 414 407 - 408 - /* 409 - * Return the count value for a counter. 410 - */ 411 - u64 (*count) (struct perf_event *event); /*optional*/ 412 415 413 416 /* 414 417 * Set up pmu-private data structures for an AUX area ··· 1097 1110 1098 1111 if (static_branch_unlikely(&perf_sched_events)) 1099 1112 __perf_event_task_sched_out(prev, next); 1100 - } 1101 - 1102 - static inline u64 __perf_event_count(struct perf_event *event) 1103 - { 1104 - return local64_read(&event->count) + atomic64_read(&event->child_count); 1105 1113 } 1106 1114 1107 1115 extern void perf_event_mmap(struct vm_area_struct *vma);
+3 -2
include/linux/sched.h
··· 909 909 /* cg_list protected by css_set_lock and tsk->alloc_lock: */ 910 910 struct list_head cg_list; 911 911 #endif 912 - #ifdef CONFIG_INTEL_RDT_A 913 - int closid; 912 + #ifdef CONFIG_INTEL_RDT 913 + u32 closid; 914 + u32 rmid; 914 915 #endif 915 916 #ifdef CONFIG_FUTEX 916 917 struct robust_list_head __user *robust_list;
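The new closid/rmid pair in task_struct is filled in when a thread is moved into a resource group and is what the intel_rdt_sched.h hook (now included from process_32.c/process_64.c above) loads into the PQR_ASSOC MSR on context switch. From userspace the move is a write of a single PID to the group's "tasks" file; a small sketch, with the group path again only illustrative:

        /* Sketch: bind the calling task to an existing resource group so the
         * new task_struct closid/rmid fields follow it across switches.
         * Assumes /sys/fs/resctrl/grp0 exists.
         */
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                FILE *f = fopen("/sys/fs/resctrl/grp0/tasks", "w");

                if (!f)
                        return 1;
                /* One PID per write; the kernel updates that task's closid
                 * and rmid to the group's values. */
                fprintf(f, "%d\n", getpid());
                fclose(f);
                return 0;
        }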
+1 -13
kernel/events/core.c
··· 3673 3673 3674 3674 static inline u64 perf_event_count(struct perf_event *event) 3675 3675 { 3676 - if (event->pmu->count) 3677 - return event->pmu->count(event); 3678 - 3679 - return __perf_event_count(event); 3676 + return local64_read(&event->count) + atomic64_read(&event->child_count); 3680 3677 } 3681 3678 3682 3679 /* ··· 3700 3703 * all child counters from atomic context. 3701 3704 */ 3702 3705 if (event->attr.inherit) { 3703 - ret = -EOPNOTSUPP; 3704 - goto out; 3705 - } 3706 - 3707 - /* 3708 - * It must not have a pmu::count method, those are not 3709 - * NMI safe. 3710 - */ 3711 - if (event->pmu->count) { 3712 3706 ret = -EOPNOTSUPP; 3713 3707 goto out; 3714 3708 }
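With the intel_cqm perf plumbing gone (the cqm_* event fields, the pmu->count hook and __perf_event_count() removed above), perf_event_count() is again a plain sum of the local and child counts, and occupancy/bandwidth numbers are read from resctrl's per-group mon_data files rather than through a perf event. A hedged sketch of that read path; the mount point, "grp0", the "mon_L3_00" domain directory and the event file names all depend on the system and on what "mon_features" reports:

        /* Sketch: read monitoring counters the resctrl way. The kernel side,
         * rdtgroup_mondata_show(), prints the count already multiplied by
         * mon_scale (i.e. in bytes) or "Error"/"Unavailable".
         */
        #include <stdio.h>

        static unsigned long long read_counter(const char *path)
        {
                unsigned long long val = 0;
                FILE *f = fopen(path, "r");

                if (!f)
                        return 0;
                if (fscanf(f, "%llu", &val) != 1)
                        val = 0;        /* "Error" or "Unavailable" */
                fclose(f);
                return val;
        }

        int main(void)
        {
                printf("llc_occupancy: %llu bytes\n",
                       read_counter("/sys/fs/resctrl/grp0/mon_data/mon_L3_00/llc_occupancy"));
                printf("mbm_total:     %llu bytes\n",
                       read_counter("/sys/fs/resctrl/grp0/mon_data/mon_L3_00/mbm_total_bytes"));
                return 0;
        }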