Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
at v4.20-rc7, 1118 lines, 42 kB
User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>

This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
X86 /proc/cpuinfo flag bits:
RDT (Resource Director Technology) Allocation - "rdt_a"
CAT (Cache Allocation Technology) - "cat_l3", "cat_l2"
CDP (Code and Data Prioritization) - "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring) - "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring) - "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) - "mba"

To use the feature mount the file system:

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.
"cdpl2": Enable code/data prioritization in L2 cache allocations.
"mba_MBps": Enable the MBA Software Controller(mba_sc) to specify MBA
 bandwidth in MBps

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control. Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either of allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.
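
Since the 'info' directory is a plain tree of small text files, it can be
read with ordinary file operations. The following is a minimal sketch, not
part of the kernel interface: the helper name read_info() is hypothetical,
and it assumes the file system is mounted at /sys/fs/resctrl as in the mount
command above.

```python
from pathlib import Path


def read_info(root="/sys/fs/resctrl"):
    """Return {resource: {file_name: contents}} for each 'info' subdirectory.

    Hypothetical helper for illustration; 'root' is the assumed mount point.
    """
    info = Path(root) / "info"
    result = {}
    for sub in sorted(info.iterdir()):
        if not sub.is_dir():
            # Plain files at the top level of 'info' are not resources.
            continue
        result[sub.name] = {
            f.name: f.read_text().strip()
            for f in sorted(sub.iterdir())
            if f.is_file()
        }
    return result
```

On a mounted system, read_info() would return one dictionary entry per
resource subdirectory (for example "L3" or "MB"), keyed by file name.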

Each subdirectory contains the following files with respect to
allocation:

Cache resource(L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":          The number of CLOSIDs which are valid for this
                        resource. The kernel uses the smallest number of
                        CLOSIDs of all enabled resources as limit.

"cbm_mask":             The bitmask which is valid for this resource.
                        This mask is equivalent to 100%.

"min_cbm_bits":         The minimum number of consecutive bits which
                        must be set when writing a mask.

"shareable_bits":       Bitmask of shareable resource with other executing
                        entities (e.g. I/O). User can use this when
                        setting up exclusive cache partitions. Note that
                        some platforms support devices that have their
                        own settings for cache use which can over-ride
                        these bits.

"bit_usage":            Annotated capacity bitmasks showing how all
                        instances of the resource are used. The legend is:
                        "0" - Corresponding region is unused. When the system's
                              resources have been allocated and a "0" is found
                              in "bit_usage" it is a sign that resources are
                              wasted.
                        "H" - Corresponding region is used by hardware only
                              but available for software use. If a resource
                              has bits set in "shareable_bits" but not all
                              of these bits appear in the resource groups'
                              schematas then the bits appearing in
                              "shareable_bits" but no resource group will
                              be marked as "H".
                        "X" - Corresponding region is available for sharing and
                              used by hardware and software. These are the
                              bits that appear in "shareable_bits" as
                              well as a resource group's allocation.
                        "S" - Corresponding region is used by software
                              and available for sharing.
                        "E" - Corresponding region is used exclusively by
                              one resource group. No sharing allowed.
                        "P" - Corresponding region is pseudo-locked. No
                              sharing allowed.
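
The "bit_usage" legend above can be decoded mechanically. The sketch below is
illustrative only: the function names are hypothetical, and it assumes the
"<cache_id>=<symbols>;..." layout and the bit order (leftmost symbol is the
highest bit) that the "bit_usage" outputs in the examples of this document
suggest.

```python
# Legend symbols as listed for "bit_usage".
LEGEND = {
    "0": "unused",
    "H": "used by hardware only, available for software use",
    "X": "shared by hardware and software",
    "S": "used by software, available for sharing",
    "E": "used exclusively by one resource group",
    "P": "pseudo-locked",
}


def parse_bit_usage(text):
    """Parse e.g. '0=SSSSSSEE;1=SSSSSSEE' into {0: 'SSSSSSEE', 1: 'SSSSSSEE'}."""
    domains = {}
    for entry in text.strip().split(";"):
        if not entry:
            continue
        cache_id, _, bits = entry.partition("=")
        unknown = set(bits) - set(LEGEND)
        if unknown:
            raise ValueError(f"unknown bit_usage symbols: {sorted(unknown)}")
        domains[int(cache_id)] = bits
    return domains


def unused_mask(bits):
    """Return an integer mask of the positions marked '0'.

    Assumption: the leftmost symbol corresponds to the highest bit.
    """
    mask = 0
    for ch in bits:
        mask = (mask << 1) | (ch == "0")
    return mask
```

For instance, unused_mask("SSSSSS00") yields 0x3, i.e. the two lowest bits are
free to be allocated.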

Memory bandwidth(MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":        The minimum memory bandwidth percentage which
                        user can request.

"bandwidth_gran":       The granularity in which the memory bandwidth
                        percentage is allocated. The allocated
                        b/w percentage is rounded off to the next
                        control step available on the hardware. The
                        available bandwidth control steps are:
                        min_bandwidth + N * bandwidth_gran.

"delay_linear":         Indicates if the delay scale is linear or
                        non-linear. This field is purely informational.

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":            The number of RMIDs available. This is the
                        upper bound for how many "CTRL_MON" + "MON"
                        groups can be created.

"mon_features":         Lists the monitoring events if
                        monitoring is enabled for the resource.

"max_threshold_occupancy":
                        Read/write file provides the largest value (in
                        bytes) at which a previously used LLC_occupancy
                        counter can be considered for re-use.

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.

        # echo L3:0=f7 > schemata
        bash: echo: write error: Invalid argument
        # cat info/last_cmd_status
        mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
---------------------------------

Resource groups are represented as directories in the resctrl file
system.
The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
        Reading this file shows the list of all tasks that belong to
        this group. Writing a task id to the file will add a task to the
        group. If the group is a CTRL_MON group the task is removed from
        whichever previous CTRL_MON group owned the task and also from
        any MON group that owned the task. If the group is a MON group,
        then the task must already belong to the CTRL_MON parent of this
        group. The task is removed from any previous MON group.


"cpus":
        Reading this file shows a bitmask of the logical CPUs owned by
        this group. Writing a mask to this file will add and remove
        CPUs to/from this group. As with the tasks file a hierarchy is
        maintained where MON groups may only include CPUs owned by the
        parent CTRL_MON group.
        When the resource group is in pseudo-locked mode this file will
        only be readable, reflecting the CPUs associated with the
        pseudo-locked region.


"cpus_list":
        Just like "cpus", only using ranges of CPUs instead of bitmasks.


When control is enabled all CTRL_MON groups will also contain:

"schemata":
        A list of all the resources available to this group.
        Each resource has its own line and format - see below for details.

"size":
        Mirrors the display of the "schemata" file to display the size in
        bytes of each allocation instead of the bits representing the
        allocation.

"mode":
        The "mode" of the resource group dictates the sharing of its
        allocations. A "shareable" resource group allows sharing of its
        allocations while an "exclusive" resource group does not. A
        cache pseudo-locked region is created by first writing
        "pseudo-locksetup" to the "mode" file before writing the cache
        pseudo-locked region's schemata to the resource group's "schemata"
        file. On successful pseudo-locked region creation the mode will
        automatically change to "pseudo-locked".

When monitoring is enabled all MON groups will also contain:

"mon_data":
        This contains a set of files organized by L3 domain and by
        RDT event. E.g. on a system with two L3 domains there will
        be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
        directories has one file per event (e.g. "llc_occupancy",
        "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
        files provide a read out of the current value of the event for
        all tasks in the group. In CTRL_MON groups these files provide
        the sum for all tasks in the CTRL_MON group and all tasks in
        MON groups. Please see example section for more details on usage.

Resource allocation rules
-------------------------
When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
-----------------------------------------------
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
the task to a new group and immediately check the occupancy of the old and
new groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource Monitoring
ID) to identify a control group and a monitoring group respectively.
Each of the resource groups is mapped to these IDs based on the kind of
group. The number of CLOSIDs and RMIDs is limited by the hardware and hence
the creation of a "CTRL_MON" directory may fail if we run out of either
CLOSIDs or RMIDs and creation of a "MON" group may fail if we run out of
RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID, once freed, may not be immediately available for use, as
cache lines of the RMID's previous user are still tagged with it.
Hence such RMIDs are placed on a limbo list and re-checked to see whether
their cache occupancy has gone down. If at some point the system has many
limbo RMIDs that are not yet ready to be used, the user may see -EBUSY
during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).
To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

Memory bandwidth Allocation and monitoring
------------------------------------------

For Memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.

The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.
The fact that Memory bandwidth allocation(MBA) is a core
specific mechanism whereas memory bandwidth monitoring(MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external is 10GBps (hence aggregate L2 external bandwidth is
240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
bandwidth of 100GBps although the percentage value specified is only 50%
<< 100%. Hence increasing the bandwidth percentage will not yield any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on number of cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
thread, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well.
The kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure

        "actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
whereas user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is:

        L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

        L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
With L2 CDP disabled (no "cdpl2" mount option) the L2 schemata format is:

        L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

When L2 CDP is enabled with the "cdpl2" mount option, L2 control is split
into "L2DATA" and "L2CODE" resources in the same way as for L3.

Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.

        MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MBps
---------------------------------------------

Memory bandwidth domain is L3 cache.

        MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.

# cat schemata
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
# echo "L3DATA:2=3c0;" > schemata
# cat schemata
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

Cache Pseudo-Locking
--------------------
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:
- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this cache region is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation.
  Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running with
  any CLOS will be able to access the memory in the pseudo-locked region since
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
"locked" data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling, there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.

Pseudo-locking is accomplished in two stages:
1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.
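
The second stage, together with the affinity requirement described above, can
be sketched from user space as follows. This is only an illustration under
stated assumptions: the helper name map_pseudo_locked() is hypothetical, the
caller is expected to pass a CPU that is associated with the cache holding the
region, and the sketch relies on Linux-only calls in the Python standard
library.

```python
import mmap
import os


def map_pseudo_locked(path, length, cpu=None):
    """Pin the caller to 'cpu' (if given) and mmap 'length' bytes of 'path'.

    Hypothetical helper: 'path' would be the character device that exposes
    the pseudo-locked region.
    """
    if cpu is not None:
        # Affinity is set before mmap() because the kernel's sanity check
        # is only done during the initial mmap() handling.
        os.sched_setaffinity(0, {cpu})
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # The mapping stays valid after the descriptor is closed.
        os.close(fd)
```

A caller would pass the device path of its own region, a length of at most
the region size, and a CPU number valid for that cache; it remains the
application's job to stay on the right cores afterwards.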

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:
1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available.
   Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file:
1 - writing "1" to the pseudo_lock_measure file will trigger the latency
    measurement captured in the pseudo_lock_mem_latency tracepoint. See
    example below.
2 - writing "2" to the pseudo_lock_measure file will trigger the L2 cache
    residency (cache hits and misses) measurement captured in the
    pseudo_lock_l2 tracepoint. See example below.
3 - writing "3" to the pseudo_lock_measure file will trigger the L3 cache
    residency (cache hits and misses) measurement captured in the
    pseudo_lock_l3 tracepoint.

All measurements are recorded with the tracing infrastructure. This requires
the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface:
In this example a pseudo-locked region named "newlock" was created.
Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set:
# :> /sys/kernel/debug/tracing/trace
# echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
# echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
# echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
# cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist

# event histogram
#
# trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
#

{ latency:        456 } hitcount:          1
{ latency:         50 } hitcount:         83
{ latency:         36 } hitcount:         96
{ latency:         44 } hitcount:        174
{ latency:         48 } hitcount:        195
{ latency:         46 } hitcount:        262
{ latency:         42 } hitcount:        693
{ latency:         40 } hitcount:       3204
{ latency:         38 } hitcount:       3484

Totals:
    Hits: 8192
    Entries: 9
    Dropped: 0

Example of cache hits/misses debugging:
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.

# :> /sys/kernel/debug/tracing/trace
# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
# echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
# echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
# cat /sys/kernel/debug/tracing/trace

# tracer: nop
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
 pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0


Examples for RDT allocation usage:

Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If MBA is specified in MBps (the "mba_MBps" mount option) then the user can
enter the max b/w in MBps rather than the percentage values.

# echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MBps whereas on socket 1 they would use 500MBps.

Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine.
To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:

# echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like (assume min_bandwidth is 10 and bandwidth_gran is
10):

For our first real time task this would request 20% memory b/w on socket
0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.

# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

Example 3
---------

A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.
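
The "cpus" and "cpus_list" files accept the same set of CPUs in two
notations. For cores 4-7 as used in this example, the conversion can be
sketched as below. The helper names are hypothetical, and the mask helper
assumes fewer than 32 logical CPUs so that the mask fits in one hex word.

```python
def cpus_to_mask(cpus):
    """Return the hex bitmask string for an iterable of logical CPU numbers.

    Assumption: all CPU numbers are below 32 (single hex word).
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def cpus_to_list(cpus):
    """Return a range list such as '0-1,3' for an iterable of CPU numbers."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
```

For cores 4-7 the first helper gives "f0" (the value written to "cpus") and
the second gives "4-7" (the equivalent value for "cpus_list").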

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:

# echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.

# mkdir p0
# echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move core 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.

# echo F0 > p0/cpus

Example 4
---------

The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.
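
A 25% share of an 8-bit capacity bitmask is two contiguous bits. The
arithmetic, together with the contiguity rule from "Cache Bit Masks (CBM)",
can be sketched as follows. The helper names are hypothetical and the sketch
does not check "min_cbm_bits".

```python
def is_contiguous(mask):
    """True if all '1' bits of mask form a single contiguous block."""
    if mask <= 0:
        return False
    low = mask & -mask              # lowest set bit
    shifted = mask // low           # drop trailing zero bits
    return (shifted & (shifted + 1)) == 0


def cbm_for_fraction(cbm_len, fraction, shift=0):
    """Return a mask of round(cbm_len * fraction) bits starting at bit 'shift'."""
    nbits = round(cbm_len * fraction)
    if nbits < 1 or shift + nbits > cbm_len:
        raise ValueError("requested region does not fit in the mask")
    return ((1 << nbits) - 1) << shift
```

With an 8-bit mask, cbm_for_fraction(8, 0.25) gives 0x3. With a 20-bit mask
and shifts of 0, 5, 10 and 15 the same helper reproduces the four quarter
masks 0x1f, 0x3e0, 0x7c00 and 0xf8000.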

# mount -t resctrl resctrl /sys/fs/resctrl/
# cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache:

# cat schemata
L2:0=ff;1=ff

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group:
# mkdir p0
# echo 'L2:0=0x3;1=0x3' > p0/schemata
# cat p0/mode
shareable
# echo exclusive > p0/mode
-sh: echo: write error: Invalid argument
# cat info/last_cmd_status
schemata overlaps

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
# echo 'L2:0=0xfc;1=0xfc' > schemata
# echo exclusive > p0/mode
# grep . p0/*
p0/cpus:0
p0/mode:exclusive
p0/schemata:L2:0=03;1=03
p0/size:L2:0=262144;1=262144

A new resource group will on creation not overlap with an exclusive resource
group:
# mkdir p1
# grep . p1/*
p1/cpus:0
p1/mode:shareable
p1/schemata:L2:0=fc;1=fc
p1/size:L2:0=786432;1=786432

The bit_usage will reflect how the cache is used:
# cat info/L2/bit_usage
0=SSSSSSEE;1=SSSSSSEE

A resource group cannot be forced to overlap with an exclusive resource group:
# echo 'L2:0=0x1;1=0x1' > p1/schemata
-sh: echo: write error: Invalid argument
# cat info/last_cmd_status
overlaps with exclusive group

Example of Cache Pseudo-Locking
-------------------------------
Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The
pseudo-locked region is exposed at /dev/pseudo_lock/newlock, which an
application can open and pass to mmap().

# mount -t resctrl resctrl /sys/fs/resctrl/
# cd /sys/fs/resctrl

Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata:
# cat info/L2/bit_usage
0=SSSSSSSS;1=SSSSSSSS
# echo 'L2:1=0xfc' > schemata
# cat info/L2/bit_usage
0=SSSSSSSS;1=SSSSSS00

Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask:

# mkdir newlock
# echo pseudo-locksetup > newlock/mode
# echo 'L2:1=0x3' > newlock/schemata

On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist:

# cat newlock/mode
pseudo-locked
# cat info/L2/bit_usage
0=SSSSSSSS;1=SSSSSSPP
# ls -l /dev/pseudo_lock/newlock
crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock

/*
 * Example code to access one page of pseudo-locked cache region
 * from user space.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/*
 * It is required that the application runs with affinity to only
 * cores associated with the pseudo-locked region. Here the cpu
 * is hardcoded for convenience of example.
 */
static int cpuid = 2;

int main(int argc, char *argv[])
{
	cpu_set_t cpuset;
	long page_size;
	void *mapping;
	int dev_fd;
	int ret;

	page_size = sysconf(_SC_PAGESIZE);

	CPU_ZERO(&cpuset);
	CPU_SET(cpuid, &cpuset);
	ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
	if (ret < 0) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
	if (dev_fd < 0) {
		perror("open");
		exit(EXIT_FAILURE);
	}

	mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		       dev_fd, 0);
	if (mapping == MAP_FAILED) {
		perror("mmap");
		close(dev_fd);
		exit(EXIT_FAILURE);
	}

	/* Application interacts with pseudo-locked memory @mapping */

	ret = munmap(mapping, page_size);
	if (ret < 0) {
		perror("munmap");
		close(dev_fd);
		exit(EXIT_FAILURE);
	}

	close(dev_fd);
	exit(EXIT_SUCCESS);
}

Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of reads and writes
to multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.
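
The search in step 2 can be sketched as below. This is only an illustration
under stated assumptions: find_free_cbm is a hypothetical helper, not part of
resctrl, and the masks read in step 1 are assumed to have been OR-ed together
already. The sketch does no locking of its own, so two concurrent callers
could still pick the same bits.

```shell
# Hypothetical helper, not part of resctrl: print (in hex) the lowest
# contiguous run of "want" clear bits within a "width"-bit mask "used".
# Prints nothing and returns 1 if no such run exists.
find_free_cbm() {
	used=$1 width=$2 want=$3
	run=$(( (1 << want) - 1 ))
	pos=0
	while [ $(( pos + want )) -le "$width" ]; do
		if [ $(( used & (run << pos) )) -eq 0 ]; then
			printf '%x\n' $(( run << pos ))
			return 0
		fi
		pos=$(( pos + 1 ))
	done
	return 1
}

# Step 1 (assumed done): union of the masks held by existing groups.
# Here two groups hold 0x3 and 0x30 of an assumed 8-bit CBM.
in_use=$(( 0x3 | 0x30 ))

# Step 2: look for two free contiguous bits; this prints c.
find_free_cbm "$in_use" 8 2
```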

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If success read the directory structure.
 C) Release the lock with flock(LOCK_UN)

Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
find /sys/fs/resctrl/ > output.txt
mask=$(function-of output.txt)	# placeholder for the mask computation
mkdir /sys/fs/resctrl/newres/
echo "$mask" > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C:

/*
 * Example code to take advisory locks
 * before accessing resctrl filesystem
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>

void resctrl_take_shared_lock(int fd)
{
	int ret;

	/* take shared lock on resctrl filesystem */
	ret = flock(fd, LOCK_SH);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_take_exclusive_lock(int fd)
{
	int ret;

	/* take exclusive lock on resctrl filesystem */
	ret = flock(fd, LOCK_EX);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_release_lock(int fd)
{
	int ret;

	/* release lock on resctrl filesystem */
	ret = flock(fd, LOCK_UN);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

int main(void)
{
	int fd;

	fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
	if (fd == -1) {
		perror("open");
		exit(-1);
	}
	resctrl_take_shared_lock(fd);
	/* code to read directory contents */
	resctrl_release_lock(fd);

	resctrl_take_exclusive_lock(fd);
	/* code to read and write directory contents */
	resctrl_release_lock(fd);
}

Examples for RDT Monitoring along with allocation usage:

Reading monitored data
----------------------
Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
show the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
# echo 5678 > p1/tasks
# echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.

# cd /sys/fs/resctrl/p1/mon_groups
# mkdir m11 m12
# echo 5678 > m11/tasks
# echo 5679 > m12/tasks

Fetch data (data shown in bytes)

# cat m11/mon_data/mon_L3_00/llc_occupancy
16234000
# cat m11/mon_data/mon_L3_01/llc_occupancy
14789000
# cat m12/mon_data/mon_L3_00/llc_occupancy
16789000

The parent ctrl_mon group shows the aggregated data.

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31234000

Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.

# echo $$ > /sys/fs/resctrl/p1/tasks
# <cmd>

Fetch the data

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------

Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but cannot create CTRL_MON directories.
However, the user can create different MON groups within the root group
and thereby monitor all tasks including kernel threads.

This can also be used to profile a job's cache size footprint before
allocating it to different allocation groups.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir mon_groups/m01
# mkdir mon_groups/m02

# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
below it's apparent that the tasks are mostly doing work on
domain(socket) 0.

# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
32789


Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p1

Move the cpus 4-7 over to p1
# echo f0 > p1/cpus

View the llc occupancy snapshot

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000
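
On systems with more than one L3 domain, as in the earlier monitoring
examples, a group has one llc_occupancy file per domain. The sketch below
adds them up for one group. The helper name sum_llc_occupancy is hypothetical
and not part of resctrl; it assumes the mon_data/mon_L3_*/llc_occupancy
layout used in the examples above and prints 0 if no such files are readable.

```shell
# Hypothetical helper, not part of resctrl: sum llc_occupancy (bytes)
# over all L3 domains of the group directory given as the first argument.
sum_llc_occupancy() {
	total=0
	for f in "$1"/mon_data/mon_L3_*/llc_occupancy; do
		# Skip the unexpanded pattern or unreadable files.
		[ -r "$f" ] || continue
		read -r bytes < "$f"
		total=$(( total + bytes ))
	done
	printf '%s\n' "$total"
}

# Total occupancy of group p1 across all L3 domains.
sum_llc_occupancy /sys/fs/resctrl/p1
```

Since each file is a separate snapshot, the sum is only an approximation of
the group's total occupancy at any one instant.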