Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
"The biggest changes are MPAM enablement in drivers/resctrl and new PMU
support under drivers/perf.

On the core side, FEAT_LSUI lets futex atomic operations run with EL0
permissions, avoiding PAN toggling.

The rest is mostly TLB invalidation refactoring, further generic entry
work, sysreg updates and a few fixes.

Core features:

- Add support for FEAT_LSUI, allowing futex atomic operations without
toggling Privileged Access Never (PAN)

- Further refactor the arm64 exception handling code towards the
generic entry infrastructure

- Optimise __READ_ONCE() with CONFIG_LTO=y and allow alias analysis
through it

Memory management:

- Refactor the arm64 TLB invalidation API and implementation for
better control over barrier placement and level-hinted invalidation

- Enable batched TLB flushes during memory hot-unplug

- Fix rodata=full block mapping support for realm guests (when
BBML2_NOABORT is available)

Perf and PMU:

- Add support for a whole bunch of system PMUs featured in NVIDIA's
Tegra410 SoC (cspmu extensions for the fabric and PCIe, new drivers
for CPU/C2C memory latency PMUs)

- Clean up iomem resource handling in the Arm CMN driver

- Fix signedness handling of AA64DFR0.{PMUVer,PerfMon}

MPAM (Memory Partitioning And Monitoring):

- Add architectural context-switch support and hide the feature from KVM

- Add interface to allow MPAM to be exposed to user-space using
resctrl

- Add errata workaround for some existing platforms

- Add documentation for using MPAM and for the shapes of platform that
can use resctrl

Miscellaneous:

- Check DAIF (and PMR, where relevant) at task-switch time

- Skip TFSR_EL1 checks and barriers in synchronous MTE tag check mode
(only relevant to asynchronous or asymmetric tag check modes)

- Remove a duplicate allocation in the kexec code

- Remove redundant save/restore of SCS SP on entry to/from EL0

- Generate the KERNEL_HWCAP_ definitions from the arm64 hwcap
descriptions

- Add kselftest coverage for cmpbr_sigill()

- Update sysreg definitions"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (109 commits)
arm64: rsi: use linear-map alias for realm config buffer
arm64: Kconfig: fix duplicate word in CMDLINE help text
arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
arm64: kexec: Remove duplicate allocation for trans_pgd
ACPI: AGDI: fix missing newline in error message
arm64: Check DAIF (and PMR) at task-switch time
arm64: entry: Use split preemption logic
arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode()
arm64: entry: Consistently prefix arm64-specific wrappers
arm64: entry: Don't preempt with SError or Debug masked
entry: Split preemption from irqentry_exit_to_kernel_mode()
entry: Split kernel mode logic from irqentry_{enter,exit}()
entry: Move irqentry_enter() prototype later
entry: Remove local_irq_{enable,disable}_exit_to_user()
...

+7100 -943
+2 -1
Documentation/admin-guide/perf/index.rst
···
24 24    thunderx2-pmu
25 25    alibaba_pmu
26 26    dwc_pcie_pmu
27    -  nvidia-pmu
   27 +  nvidia-tegra241-pmu
   28 +  nvidia-tegra410-pmu
28 29    meson-ddr-pmu
29 30    cxl
30 31    ampere_cspmu
+4 -4
Documentation/admin-guide/perf/nvidia-pmu.rst Documentation/admin-guide/perf/nvidia-tegra241-pmu.rst
···
 1    - =========================================================
 2    - NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
 3    - =========================================================
    1 + ============================================================
    2 + NVIDIA Tegra241 SoC Uncore Performance Monitoring Unit (PMU)
    3 + ============================================================
 4  4
 5    - The NVIDIA Tegra SoC includes various system PMUs to measure key performance
    5 + The NVIDIA Tegra241 SoC includes various system PMUs to measure key performance
 6  6  metrics like memory bandwidth, latency, and utilization:
 7  7
 8  8  * Scalable Coherency Fabric (SCF)
+522
Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
··· 1 + ===================================================================== 2 + NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU) 3 + ===================================================================== 4 + 5 + The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance 6 + metrics like memory bandwidth, latency, and utilization: 7 + 8 + * Unified Coherence Fabric (UCF) 9 + * PCIE 10 + * PCIE-TGT 11 + * CPU Memory (CMEM) Latency 12 + * NVLink-C2C 13 + * NV-CLink 14 + * NV-DLink 15 + 16 + PMU Driver 17 + ---------- 18 + 19 + The PMU driver describes the available events and configuration of each PMU in 20 + sysfs. Please see the sections below to get the sysfs path of each PMU. Like 21 + other uncore PMU drivers, the driver provides "cpumask" sysfs attribute to show 22 + the CPU id used to handle the PMU event. There is also "associated_cpus" 23 + sysfs attribute, which contains a list of CPUs associated with the PMU instance. 24 + 25 + UCF PMU 26 + ------- 27 + 28 + The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a 29 + distributed cache, last level for CPU Memory and CXL Memory, and cache coherent 30 + interconnect that supports hardware coherence across multiple coherently caching 31 + agents, including: 32 + 33 + * CPU clusters 34 + * GPU 35 + * PCIe Ordering Controller Unit (OCU) 36 + * Other IO-coherent requesters 37 + 38 + The events and configuration options of this PMU device are described in sysfs, 39 + see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>. 40 + 41 + Some of the events available in this PMU can be used to measure bandwidth and 42 + utilization: 43 + 44 + * slc_access_rd: count the number of read requests to SLC. 45 + * slc_access_wr: count the number of write requests to SLC. 46 + * slc_bytes_rd: count the number of bytes transferred by slc_access_rd. 47 + * slc_bytes_wr: count the number of bytes transferred by slc_access_wr. 
48 + * mem_access_rd: count the number of read requests to local or remote memory. 49 + * mem_access_wr: count the number of write requests to local or remote memory. 50 + * mem_bytes_rd: count the number of bytes transferred by mem_access_rd. 51 + * mem_bytes_wr: count the number of bytes transferred by mem_access_wr. 52 + * cycles: counts the UCF cycles. 53 + 54 + The average bandwidth is calculated as:: 55 + 56 + AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS 57 + AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS 58 + AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS 59 + AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS 60 + 61 + The average request rate is calculated as:: 62 + 63 + AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES 64 + AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES 65 + AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES 66 + AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES 67 + 68 + More details about what other events are available can be found in the Tegra410 SoC 69 + technical reference manual. 70 + 71 + The events can be filtered based on source or destination. The source filter 72 + indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or 73 + remote socket. The destination filter specifies the destination memory type, 74 + e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The 75 + local/remote classification of the destination filter is based on the home 76 + socket of the address, not where the data actually resides. The available 77 + filters are described in 78 + /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/. 
79 + 80 + The list of UCF PMU event filters: 81 + 82 + * Source filter: 83 + 84 + * src_loc_cpu: if set, count events from local CPU 85 + * src_loc_noncpu: if set, count events from local non-CPU device 86 + * src_rem: if set, count events from CPU, GPU, PCIE devices of remote socket 87 + 88 + * Destination filter: 89 + 90 + * dst_loc_cmem: if set, count events to local system memory (CMEM) address 91 + * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address 92 + * dst_loc_other: if set, count events to local CXL memory address 93 + * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket 94 + 95 + If the source is not specified, the PMU will count events from all sources. If 96 + the destination is not specified, the PMU will count events to all destinations. 97 + 98 + Example usage: 99 + 100 + * Count event id 0x0 in socket 0 from all sources and to all destinations:: 101 + 102 + perf stat -a -e nvidia_ucf_pmu_0/event=0x0/ 103 + 104 + * Count event id 0x0 in socket 0 with source filter = local CPU and destination 105 + filter = local system memory (CMEM):: 106 + 107 + perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/ 108 + 109 + * Count event id 0x0 in socket 1 with source filter = local non-CPU device and 110 + destination filter = remote memory:: 111 + 112 + perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/ 113 + 114 + PCIE PMU 115 + -------- 116 + 117 + This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and 118 + the memory subsystem. It monitors all read/write traffic from the root port(s) 119 + or a particular BDF in a PCIE RC to local or remote memory. There is one PMU per 120 + PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into 121 + up to 8 root ports. The traffic from each root port can be filtered using RP or 122 + BDF filter. 
For example, specifying "src_rp_mask=0xFF" means the PMU counter will 123 + capture traffic from all RPs. Please see below for more details. 124 + 125 + The events and configuration options of this PMU device are described in sysfs, 126 + see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>. 127 + 128 + The events in this PMU can be used to measure bandwidth, utilization, and 129 + latency: 130 + 131 + * rd_req: count the number of read requests by PCIE device. 132 + * wr_req: count the number of write requests by PCIE device. 133 + * rd_bytes: count the number of bytes transferred by rd_req. 134 + * wr_bytes: count the number of bytes transferred by wr_req. 135 + * rd_cum_outs: count outstanding rd_req each cycle. 136 + * cycles: count the clock cycles of SOC fabric connected to the PCIE interface. 137 + 138 + The average bandwidth is calculated as:: 139 + 140 + AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS 141 + AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS 142 + 143 + The average request rate is calculated as:: 144 + 145 + AVG_RD_REQUEST_RATE = RD_REQ / CYCLES 146 + AVG_WR_REQUEST_RATE = WR_REQ / CYCLES 147 + 148 + 149 + The average latency is calculated as:: 150 + 151 + FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS 152 + AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ 153 + AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ 154 + 155 + The PMU events can be filtered based on the traffic source and destination. 156 + The source filter indicates the PCIE devices that will be monitored. The 157 + destination filter specifies the destination memory type, e.g. local system 158 + memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote 159 + classification of the destination filter is based on the home socket of the 160 + address, not where the data actually resides. These filters can be found in 161 + /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/. 
162 + 163 + The list of event filters: 164 + 165 + * Source filter: 166 + 167 + * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this 168 + bitmask represents the RP index in the RC. If the bit is set, all devices under 169 + the associated RP will be monitored. E.g. "src_rp_mask=0xF" will monitor 170 + devices in root ports 0 to 3. 171 + * src_bdf: the BDF that will be monitored. This is a 16-bit value that 172 + follows the formula: (bus << 8) + (device << 3) + (function). For example, the 173 + value of BDF 27:01.1 is 0x2709. 174 + * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in 175 + "src_bdf" is used to filter the traffic. 176 + 177 + Note that Root-Port and BDF filters are mutually exclusive and the PMU in 178 + each RC can only have one BDF filter shared by all of its counters. If the BDF filter 179 + is enabled, the BDF filter value will be applied to all events. 180 + 181 + * Destination filter: 182 + 183 + * dst_loc_cmem: if set, count events to local system memory (CMEM) address 184 + * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address 185 + * dst_loc_pcie_p2p: if set, count events to local PCIE peer address 186 + * dst_loc_pcie_cxl: if set, count events to local CXL memory address 187 + * dst_rem: if set, count events to remote memory address 188 + 189 + If the source filter is not specified, the PMU will count events from all root 190 + ports. If the destination filter is not specified, the PMU will count events 191 + to all destinations. 
192 + 193 + Example usage: 194 + 195 + * Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all 196 + destinations:: 197 + 198 + perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/ 199 + 200 + * Count event id 0x1 from root ports 0 and 1 of PCIE RC-1 on socket 0 and 201 + targeting just local CMEM of socket 0:: 202 + 203 + perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/ 204 + 205 + * Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all 206 + destinations:: 207 + 208 + perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/ 209 + 210 + * Count event id 0x3 from root ports 0 and 1 of PCIE RC-3 on socket 1 and 211 + targeting just local CMEM of socket 1:: 212 + 213 + perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/ 214 + 215 + * Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all 216 + destinations:: 217 + 218 + perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/ 219 + 220 + .. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section: 221 + 222 + Mapping the RC# to lspci segment number 223 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 224 + 225 + Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA 226 + Designated Vendor Specific Capability (DVSEC) register is added into the PCIE config space 227 + for each RP. This DVSEC has vendor id "10de" and DVSEC id of "0x4". 
The DVSEC register 228 + contains the following information to map PCIE devices under the RP back to its RC# : 229 + 230 + - Bus# (byte 0xc) : bus number as reported by the lspci output 231 + - Segment# (byte 0xd) : segment number as reported by the lspci output 232 + - RP# (byte 0xe) : port number as reported by LnkCap attribute from lspci for a device with Root Port capability 233 + - RC# (byte 0xf): root complex number associated with the RP 234 + - Socket# (byte 0x10): socket number associated with the RP 235 + 236 + Example script for mapping lspci BDF to RC# and socket#:: 237 + 238 + #!/bin/bash 239 + while read bdf rest; do 240 + dvsec4_reg=$(lspci -vv -s $bdf | awk ' 241 + /Designated Vendor-Specific: Vendor=10de ID=0004/ { 242 + match($0, /\[([0-9a-fA-F]+)/, arr); 243 + print "0x" arr[1]; 244 + exit 245 + } 246 + ') 247 + if [ -n "$dvsec4_reg" ]; then 248 + bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b) 249 + segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b) 250 + rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b) 251 + rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b) 252 + socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b) 253 + echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket" 254 + fi 255 + done < <(lspci -d 10de:) 256 + 257 + Example output:: 258 + 259 + 0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00 260 + 0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00 261 + 0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00 262 + 0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00 263 + 0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00 264 + 0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00 265 + 0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00 266 + 0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00 267 + 0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00 268 + 0005:c0:00.0: Bus=c0, 
Segment=05, RP=02, RC=04, Socket=00 269 + 0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00 270 + 0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01 271 + 000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01 272 + 000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01 273 + 000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01 274 + 000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01 275 + 000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01 276 + 000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01 277 + 000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01 278 + 000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01 279 + 000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01 280 + 281 + PCIE-TGT PMU 282 + ------------ 283 + 284 + This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and 285 + the memory subsystem. It monitors traffic targeting PCIE BAR and CXL HDM ranges. 286 + There is one PCIE-TGT PMU per PCIE RC in the SoC. Each RC in Tegra410 SoC can 287 + have up to 16 lanes that can be bifurcated into up to 8 root ports (RP). The PMU 288 + provides RP filter to count PCIE BAR traffic to each RP and address filter to 289 + count access to PCIE BAR or CXL HDM ranges. The details of the filters are 290 + described in the following sections. 291 + 292 + Mapping the RC# to lspci segment number is similar to the PCIE PMU. Please see 293 + :ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info. 294 + 295 + The events and configuration options of this PMU device are available in sysfs, 296 + see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>. 297 + 298 + The events in this PMU can be used to measure bandwidth and utilization: 299 + 300 + * rd_req: count the number of read requests to PCIE. 301 + * wr_req: count the number of write requests to PCIE. 302 + * rd_bytes: count the number of bytes transferred by rd_req. 
303 + * wr_bytes: count the number of bytes transferred by wr_req. 304 + * cycles: count the clock cycles of SOC fabric connected to the PCIE interface. 305 + 306 + The average bandwidth is calculated as:: 307 + 308 + AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS 309 + AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS 310 + 311 + The average request rate is calculated as:: 312 + 313 + AVG_RD_REQUEST_RATE = RD_REQ / CYCLES 314 + AVG_WR_REQUEST_RATE = WR_REQ / CYCLES 315 + 316 + The PMU events can be filtered based on the destination root port or target 317 + address range. Filtering based on RP is only available for PCIE BAR traffic. 318 + Address filter works for both PCIE BAR and CXL HDM ranges. These filters can be 319 + found in sysfs, see 320 + /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/. 321 + 322 + Destination filter settings: 323 + 324 + * dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF" 325 + corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is 326 + only available for PCIE BAR traffic. 327 + * dst_addr_base: BAR or CXL HDM filter base address. 328 + * dst_addr_mask: BAR or CXL HDM filter address mask. 329 + * dst_addr_en: enable BAR or CXL HDM address range filter. If this is set, the 330 + address range specified by "dst_addr_base" and "dst_addr_mask" will be used to filter 331 + the PCIE BAR and CXL HDM traffic address. The PMU uses the following comparison 332 + to determine if the traffic destination address falls within the filter range:: 333 + 334 + (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask) 335 + 336 + If the comparison succeeds, then the event will be counted. 337 + 338 + If the destination filter is not specified, the RP filter will be configured by default 339 + to count PCIE BAR traffic to all root ports. 
340 + 341 + Example usage: 342 + 343 + * Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0:: 344 + 345 + perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/ 346 + 347 + * Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range 348 + 0x10000 to 0x100FF on socket 0's PCIE RC-1:: 349 + 350 + perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/ 351 + 352 + CPU Memory (CMEM) Latency PMU 353 + ----------------------------- 354 + 355 + This PMU monitors latency events of memory read requests from the edge of the 356 + Unified Coherence Fabric (UCF) to local CPU DRAM: 357 + 358 + * RD_REQ counters: count read requests (32B per request). 359 + * RD_CUM_OUTS counters: accumulated outstanding request counter, which track 360 + how many cycles the read requests are in flight. 361 + * CYCLES counter: counts the number of elapsed cycles. 362 + 363 + The average latency is calculated as:: 364 + 365 + FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS 366 + AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ 367 + AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ 368 + 369 + The events and configuration options of this PMU device are described in sysfs, 370 + see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>. 371 + 372 + Example usage:: 373 + 374 + perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}' 375 + 376 + NVLink-C2C PMU 377 + -------------- 378 + 379 + This PMU monitors latency events of memory read/write requests that pass through 380 + the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available 381 + in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC). 382 + 383 + The events and configuration options of this PMU device are available in sysfs, 384 + see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>. 
385 + 386 + The list of events: 387 + 388 + * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests. 389 + * IN_RD_REQ: the number of incoming read requests. 390 + * IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests. 391 + * IN_WR_REQ: the number of incoming write requests. 392 + * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests. 393 + * OUT_RD_REQ: the number of outgoing read requests. 394 + * OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests. 395 + * OUT_WR_REQ: the number of outgoing write requests. 396 + * CYCLES: NVLink-C2C interface cycle counts. 397 + 398 + The incoming events count the reads/writes from remote device to the SoC. 399 + The outgoing events count the reads/writes from the SoC to remote device. 400 + 401 + The sysfs /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer 402 + contains the information about the connected device. 403 + 404 + When the C2C interface is connected to GPU(s), the user can use the 405 + "gpu_mask" parameter to filter traffic to/from specific GPU(s). Each bit represents the GPU 406 + index, e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" is for GPU 0 and 1. 407 + The PMU will monitor all GPUs by default if not specified. 408 + 409 + When connected to another SoC, only the read events are available. 
410 + 411 + The events can be used to calculate the average latency of the read/write requests:: 412 + 413 + C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS 414 + 415 + IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ 416 + IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ 417 + 418 + IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ 419 + IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ 420 + 421 + OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ 422 + OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ 423 + 424 + OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ 425 + OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ 426 + 427 + Example usage: 428 + 429 + * Count incoming traffic from all GPUs connected via NVLink-C2C:: 430 + 431 + perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/ 432 + 433 + * Count incoming traffic from GPU 0 connected via NVLink-C2C:: 434 + 435 + perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/ 436 + 437 + * Count incoming traffic from GPU 1 connected via NVLink-C2C:: 438 + 439 + perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/ 440 + 441 + * Count outgoing traffic to all GPUs connected via NVLink-C2C:: 442 + 443 + perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/ 444 + 445 + * Count outgoing traffic to GPU 0 connected via NVLink-C2C:: 446 + 447 + perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/ 448 + 449 + * Count outgoing traffic to GPU 1 connected via NVLink-C2C:: 450 + 451 + perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/ 452 + 453 + NV-CLink PMU 454 + ------------ 455 + 456 + This PMU monitors latency events of memory read requests that pass through 457 + the NV-CLINK interface. Bandwidth events are not available in this PMU. 
458 + In Tegra410 SoC, the NV-CLink interface is used to connect to another Tegra410 459 + SoC and this PMU only counts read traffic. 460 + 461 + The events and configuration options of this PMU device are available in sysfs, 462 + see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>. 463 + 464 + The list of events: 465 + 466 + * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests. 467 + * IN_RD_REQ: the number of incoming read requests. 468 + * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests. 469 + * OUT_RD_REQ: the number of outgoing read requests. 470 + * CYCLES: NV-CLINK interface cycle counts. 471 + 472 + The incoming events count the reads from remote device to the SoC. 473 + The outgoing events count the reads from the SoC to remote device. 474 + 475 + The events can be used to calculate the average latency of the read requests:: 476 + 477 + CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS 478 + 479 + IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ 480 + IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ 481 + 482 + OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ 483 + OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ 484 + 485 + Example usage: 486 + 487 + * Count incoming read traffic from remote SoC connected via NV-CLINK:: 488 + 489 + perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/ 490 + 491 + * Count outgoing read traffic to remote SoC connected via NV-CLINK:: 492 + 493 + perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/ 494 + 495 + NV-DLink PMU 496 + ------------ 497 + 498 + This PMU monitors latency events of memory read requests that pass through 499 + the NV-DLINK interface. Bandwidth events are not available in this PMU. 500 + In Tegra410 SoC, this PMU only counts CXL memory read traffic. 
501 + 502 + The events and configuration options of this PMU device are available in sysfs, 503 + see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>. 504 + 505 + The list of events: 506 + 507 + * IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory. 508 + * IN_RD_REQ: the number of read requests to CXL memory. 509 + * CYCLES: NV-DLINK interface cycle counts. 510 + 511 + The events can be used to calculate the average latency of the read requests:: 512 + 513 + DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS 514 + 515 + IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ 516 + IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ 517 + 518 + Example usage: 519 + 520 + * Count read events to CXL memory:: 521 + 522 + perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'
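The src_bdf encoding formula and the dst_addr_base/dst_addr_mask range check described in the new document above are plain bit arithmetic. A minimal shell sketch (the helper names are illustrative only, not part of any kernel or perf tooling):

```shell
#!/bin/sh
# Encode a bus/device/function triple into the 16-bit src_bdf value using
# the formula from the documentation: (bus << 8) + (device << 3) + function.
bdf_encode() {
    bus=$1; dev=$2; fn=$3
    printf '0x%04x\n' $(( (bus << 8) | (dev << 3) | fn ))
}

bdf_encode 0x27 0x01 0x1   # BDF 27:01.1 -> 0x2709

# The PCIE-TGT PMU address filter counts a transaction when
# (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask).
addr_matches() {
    addr=$1; base=$2; mask=$3
    if [ $(( addr & mask )) -eq $(( base & mask )) ]; then
        echo match
    else
        echo no-match
    fi
}

addr_matches 0x10080 0x10000 0xFFF00   # inside 0x10000..0x100FF -> match
addr_matches 0x20000 0x10000 0xFFF00   # outside the range -> no-match
```

The same mask comparison is why dst_addr_base=0x10000 with dst_addr_mask=0xFFF00 selects exactly the 0x10000 to 0x100FF range in the PCIE-TGT example above.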
+1
Documentation/arch/arm64/index.rst
···
23 23    memory
24 24    memory-tagging-extension
25 25    mops
   26 +  mpam
26 27    perf
27 28    pointer-authentication
28 29    ptdump
+72
Documentation/arch/arm64/mpam.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ==== 4 + MPAM 5 + ==== 6 + 7 + What is MPAM 8 + ============ 9 + MPAM (Memory Partitioning and Monitoring) is a feature in the CPUs and memory 10 + system components such as the caches or memory controllers that allows memory 11 + traffic to be labelled, partitioned and monitored. 12 + 13 + Traffic is labelled by the CPU, based on the control or monitor group the 14 + current task is assigned to using resctrl. Partitioning policy can be set 15 + using the schemata file in resctrl, and monitor values read via resctrl. 16 + See Documentation/filesystems/resctrl.rst for more details. 17 + 18 + This allows tasks that share memory system resources, such as caches, to be 19 + isolated from each other according to the partitioning policy (so-called noisy 20 + neighbours). 21 + 22 + Supported Platforms 23 + =================== 24 + Use of this feature requires CPU support, support in the memory system 25 + components, and a description from firmware of where the MPAM device controls 26 + are in the MMIO address space (e.g. the 'MPAM' ACPI table). 27 + 28 + The MMIO device that provides MPAM controls/monitors for a memory system 29 + component is called a memory system component (MSC). 30 + 31 + Because the user interface to MPAM is via resctrl, only MPAM features that are 32 + compatible with resctrl can be exposed to user-space. 33 + 34 + MSC are considered as a group based on the topology. MSC that correspond with 35 + the L3 cache are considered together; it is not possible to mix MSC between L2 36 + and L3 to 'cover' a resctrl schema. 37 + 38 + The supported features are: 39 + 40 + * Cache portion bitmap controls (CPOR) on the L2 or L3 caches. To expose 41 + CPOR at L2 or L3, every CPU must have a corresponding CPU cache at this 42 + level that also supports the feature. Mismatched big/little platforms are 43 + not supported as resctrl's controls would then also depend on task 44 + placement. 
45 + 46 + * Memory bandwidth maximum controls (MBW_MAX) on or after the L3 cache. 47 + resctrl uses the L3 cache-id to identify where the memory bandwidth 48 + control is applied. For this reason the platform must have an L3 cache 49 + with cache-id's supplied by firmware. (It doesn't need to support MPAM.) 50 + 51 + To be exported as the 'MB' schema, the topology of the group of MSC chosen 52 + must match the topology of the L3 cache so that the cache-id's can be 53 + repainted. For example: Platforms with Memory bandwidth maximum controls 54 + on CPU-less NUMA nodes cannot expose the 'MB' schema to resctrl as these 55 + nodes do not have a corresponding L3 cache. If the memory bandwidth 56 + control is on the memory rather than the L3 then there must be a single 57 + global L3 as otherwise it is unknown which L3 the traffic came from. There 58 + must be no caches between the L3 and the memory so that the two ends of 59 + the path have equivalent traffic. 60 + 61 + When the MPAM driver finds multiple groups of MSC it can use for the 'MB' 62 + schema, it prefers the group closest to the L3 cache. 63 + 64 + * Cache Storage Usage (CSU) counters can expose the 'llc_occupancy' provided 65 + there is at least one CSU monitor on each MSC that makes up the L3 group. 66 + Exposing CSU counters from other caches or devices is not supported. 67 + 68 + Reporting Bugs 69 + ============== 70 + If you are not seeing the counters or controls you expect please share the 71 + debug messages produced when enabling dynamic debug and booting with: 72 + dyndbg="file mpam_resctrl.c +pl"
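The CPOR bitmap controls described above surface through resctrl's schemata file as contiguous capacity bitmasks. As a rough sketch of the arithmetic (the 16-portion cache width is an assumed example; a real system reports its width in /sys/fs/resctrl/info/L3/cbm_mask):

```shell
#!/bin/sh
# Build the contiguous capacity bitmask (CBM) that a resctrl schemata line
# such as "L3:0=<cbm>" expects: nbits contiguous portions starting at a
# given position. Helper name and the 16-portion width are illustrative.
cbm_for_portion() {
    nbits=$1      # number of contiguous cache portions to grant
    shift_by=$2   # position of the lowest granted portion
    printf '%x\n' $(( ((1 << nbits) - 1) << shift_by ))
}

# Grant the lower half of a 16-portion cache:
cbm_for_portion 8 0    # -> ff
# Grant the upper half:
cbm_for_portion 8 8    # -> ff00
```

A hypothetical schemata line built from these values would then read "L3:0=ff;1=ff00", granting domain 0 the lower half and domain 1 the upper half.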
+9
Documentation/arch/arm64/silicon-errata.rst
··· 214 214 +----------------+-----------------+-----------------+-----------------------------+ 215 215 | ARM | SI L1 | #4311569 | ARM64_ERRATUM_4311569 | 216 216 +----------------+-----------------+-----------------+-----------------------------+ 217 + | ARM | CMN-650 | #3642720 | N/A | 218 + +----------------+-----------------+-----------------+-----------------------------+ 219 + +----------------+-----------------+-----------------+-----------------------------+ 217 220 | Broadcom | Brahma-B53 | N/A | ARM64_ERRATUM_845719 | 218 221 +----------------+-----------------+-----------------+-----------------------------+ 219 222 | Broadcom | Brahma-B53 | N/A | ARM64_ERRATUM_843419 | ··· 249 246 | NVIDIA | Carmel Core | N/A | NVIDIA_CARMEL_CNP_ERRATUM | 250 247 +----------------+-----------------+-----------------+-----------------------------+ 251 248 | NVIDIA | T241 GICv3/4.x | T241-FABRIC-4 | N/A | 249 + +----------------+-----------------+-----------------+-----------------------------+ 250 + | NVIDIA | T241 MPAM | T241-MPAM-1 | N/A | 251 + +----------------+-----------------+-----------------+-----------------------------+ 252 + | NVIDIA | T241 MPAM | T241-MPAM-4 | N/A | 253 + +----------------+-----------------+-----------------+-----------------------------+ 254 + | NVIDIA | T241 MPAM | T241-MPAM-6 | N/A | 252 255 +----------------+-----------------+-----------------+-----------------------------+ 253 256 +----------------+-----------------+-----------------+-----------------------------+ 254 257 | Freescale/NXP | LS2080A/LS1043A | A-008585 | FSL_ERRATUM_A008585 |
+7
arch/arm/include/asm/arm_pmuv3.h
··· 238 238 239 239 static inline bool pmuv3_implemented(int pmuver) 240 240 { 241 + /* 242 + * PMUVer follows the standard ID scheme for an unsigned field with the 243 + * exception of 0xF (IMP_DEF) which is treated specially and implies 244 + * FEAT_PMUv3 is not implemented. 245 + * 246 + * See DDI0487L.a D24.1.3.2 for more details. 247 + */ 241 248 return !(pmuver == ARMV8_PMU_DFR_VER_IMP_DEF || 242 249 pmuver == ARMV8_PMU_DFR_VER_NI); 243 250 }
+25 -29
arch/arm64/Kconfig
··· 61 61 select ARCH_HAVE_ELF_PROT 62 62 select ARCH_HAVE_NMI_SAFE_CMPXCHG 63 63 select ARCH_HAVE_TRACE_MMIO_ACCESS 64 - select ARCH_INLINE_READ_LOCK if !PREEMPTION 65 - select ARCH_INLINE_READ_LOCK_BH if !PREEMPTION 66 - select ARCH_INLINE_READ_LOCK_IRQ if !PREEMPTION 67 - select ARCH_INLINE_READ_LOCK_IRQSAVE if !PREEMPTION 68 - select ARCH_INLINE_READ_UNLOCK if !PREEMPTION 69 - select ARCH_INLINE_READ_UNLOCK_BH if !PREEMPTION 70 - select ARCH_INLINE_READ_UNLOCK_IRQ if !PREEMPTION 71 - select ARCH_INLINE_READ_UNLOCK_IRQRESTORE if !PREEMPTION 72 - select ARCH_INLINE_WRITE_LOCK if !PREEMPTION 73 - select ARCH_INLINE_WRITE_LOCK_BH if !PREEMPTION 74 - select ARCH_INLINE_WRITE_LOCK_IRQ if !PREEMPTION 75 - select ARCH_INLINE_WRITE_LOCK_IRQSAVE if !PREEMPTION 76 - select ARCH_INLINE_WRITE_UNLOCK if !PREEMPTION 77 - select ARCH_INLINE_WRITE_UNLOCK_BH if !PREEMPTION 78 - select ARCH_INLINE_WRITE_UNLOCK_IRQ if !PREEMPTION 79 - select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE if !PREEMPTION 80 - select ARCH_INLINE_SPIN_TRYLOCK if !PREEMPTION 81 - select ARCH_INLINE_SPIN_TRYLOCK_BH if !PREEMPTION 82 - select ARCH_INLINE_SPIN_LOCK if !PREEMPTION 83 - select ARCH_INLINE_SPIN_LOCK_BH if !PREEMPTION 84 - select ARCH_INLINE_SPIN_LOCK_IRQ if !PREEMPTION 85 - select ARCH_INLINE_SPIN_LOCK_IRQSAVE if !PREEMPTION 86 - select ARCH_INLINE_SPIN_UNLOCK if !PREEMPTION 87 - select ARCH_INLINE_SPIN_UNLOCK_BH if !PREEMPTION 88 - select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION 89 - select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION 90 64 select ARCH_KEEP_MEMBLOCK 91 65 select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE 92 66 select ARCH_USE_CMPXCHG_LOCKREF ··· 1983 2009 1984 2010 config ARM64_MPAM 1985 2011 bool "Enable support for MPAM" 1986 - select ARM64_MPAM_DRIVER if EXPERT # does nothing yet 1987 - select ACPI_MPAM if ACPI 2012 + select ARM64_MPAM_DRIVER 2013 + select ARCH_HAS_CPU_RESCTRL 1988 2014 help 1989 2015 Memory System Resource Partitioning and Monitoring (MPAM) is an 1990 2016 optional 
extension to the Arm architecture that allows each ··· 2005 2031 of where the MSCs are in the address space. 2006 2032 2007 2033 MPAM is exposed to user-space via the resctrl pseudo filesystem. 2034 + 2035 + This option enables the extra context switch code. 2008 2036 2009 2037 endmenu # "ARMv8.4 architectural features" 2010 2038 ··· 2184 2208 2185 2209 endmenu # "ARMv9.4 architectural features" 2186 2210 2211 + config AS_HAS_LSUI 2212 + def_bool $(as-instr,.arch_extension lsui) 2213 + help 2214 + Supported by LLVM 20+ and binutils 2.45+. 2215 + 2216 + menu "ARMv9.6 architectural features" 2217 + 2218 + config ARM64_LSUI 2219 + bool "Support Unprivileged Load Store Instructions (LSUI)" 2220 + default y 2221 + depends on AS_HAS_LSUI && !CPU_BIG_ENDIAN 2222 + help 2223 + The Unprivileged Load Store Instructions (LSUI) extension 2224 + provides variants of load/store instructions that access 2225 + user-space memory from the kernel without clearing the 2226 + PSTATE.PAN bit. 2227 + 2228 + This feature is supported by LLVM 20+ and binutils 2.45+. 2229 + 2230 + endmenu # "ARMv9.6 architectural features" 2231 + 2187 2231 config ARM64_SVE 2188 2232 bool "ARM Scalable Vector Extension support" 2189 2233 default y ··· 2361 2365 default "" 2362 2366 help 2363 2367 Provide a set of default command-line options at build time by 2364 - entering them here. As a minimum, you should specify the the 2368 + entering them here. As a minimum, you should specify the 2365 2369 root device (e.g. root=/dev/nfs). 2366 2370 2367 2371 choice
+1 -1
arch/arm64/include/asm/asm-uaccess.h
··· 15 15 #ifdef CONFIG_ARM64_SW_TTBR0_PAN 16 16 .macro __uaccess_ttbr0_disable, tmp1 17 17 mrs \tmp1, ttbr1_el1 // swapper_pg_dir 18 - bic \tmp1, \tmp1, #TTBR_ASID_MASK 18 + bic \tmp1, \tmp1, #TTBRx_EL1_ASID_MASK 19 19 sub \tmp1, \tmp1, #RESERVED_SWAPPER_OFFSET // reserved_pg_dir 20 20 msr ttbr0_el1, \tmp1 // set reserved TTBR0_EL1 21 21 add \tmp1, \tmp1, #RESERVED_SWAPPER_OFFSET
+2
arch/arm64/include/asm/cpucaps.h
··· 71 71 return true; 72 72 case ARM64_HAS_PMUV3: 73 73 return IS_ENABLED(CONFIG_HW_PERF_EVENTS); 74 + case ARM64_HAS_LSUI: 75 + return IS_ENABLED(CONFIG_ARM64_LSUI); 74 76 } 75 77 76 78 return true;
+2 -1
arch/arm64/include/asm/el2_setup.h
··· 513 513 check_override id_aa64pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT, .Linit_mpam_\@, .Lskip_mpam_\@, x1, x2 514 514 515 515 .Linit_mpam_\@: 516 - msr_s SYS_MPAM2_EL2, xzr // use the default partition 516 + mov x0, #MPAM2_EL2_EnMPAMSM_MASK 517 + msr_s SYS_MPAM2_EL2, x0 // use the default partition, 517 518 // and disable lower traps 518 519 mrs_s x0, SYS_MPAMIDR_EL1 519 520 tbz x0, #MPAMIDR_EL1_HAS_HCR_SHIFT, .Lskip_mpam_\@ // skip if no MPAMHCR reg
+253 -58
arch/arm64/include/asm/futex.h
··· 9 9 #include <linux/uaccess.h> 10 10 11 11 #include <asm/errno.h> 12 + #include <asm/lsui.h> 12 13 13 14 #define FUTEX_MAX_LOOPS 128 /* What's the largest number you can think of? */ 14 15 15 - #define __futex_atomic_op(insn, ret, oldval, uaddr, tmp, oparg) \ 16 - do { \ 16 + #define LLSC_FUTEX_ATOMIC_OP(op, insn) \ 17 + static __always_inline int \ 18 + __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval) \ 19 + { \ 17 20 unsigned int loops = FUTEX_MAX_LOOPS; \ 21 + int ret, oldval, newval; \ 18 22 \ 19 23 uaccess_enable_privileged(); \ 20 - asm volatile( \ 21 - " prfm pstl1strm, %2\n" \ 22 - "1: ldxr %w1, %2\n" \ 24 + asm volatile("// __llsc_futex_atomic_" #op "\n" \ 25 + " prfm pstl1strm, %[uaddr]\n" \ 26 + "1: ldxr %w[oldval], %[uaddr]\n" \ 23 27 insn "\n" \ 24 - "2: stlxr %w0, %w3, %2\n" \ 25 - " cbz %w0, 3f\n" \ 26 - " sub %w4, %w4, %w0\n" \ 27 - " cbnz %w4, 1b\n" \ 28 - " mov %w0, %w6\n" \ 28 + "2: stlxr %w[ret], %w[newval], %[uaddr]\n" \ 29 + " cbz %w[ret], 3f\n" \ 30 + " sub %w[loops], %w[loops], %w[ret]\n" \ 31 + " cbnz %w[loops], 1b\n" \ 32 + " mov %w[ret], %w[err]\n" \ 29 33 "3:\n" \ 30 34 " dmb ish\n" \ 31 - _ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0) \ 32 - _ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0) \ 33 - : "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp), \ 34 - "+r" (loops) \ 35 - : "r" (oparg), "Ir" (-EAGAIN) \ 35 + _ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w[ret]) \ 36 + _ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w[ret]) \ 37 + : [ret] "=&r" (ret), [oldval] "=&r" (oldval), \ 38 + [uaddr] "+Q" (*uaddr), [newval] "=&r" (newval), \ 39 + [loops] "+r" (loops) \ 40 + : [oparg] "r" (oparg), [err] "Ir" (-EAGAIN) \ 36 41 : "memory"); \ 37 42 uaccess_disable_privileged(); \ 38 - } while (0) 43 + \ 44 + if (!ret) \ 45 + *oval = oldval; \ 46 + \ 47 + return ret; \ 48 + } 49 + 50 + LLSC_FUTEX_ATOMIC_OP(add, "add %w[newval], %w[oldval], %w[oparg]") 51 + LLSC_FUTEX_ATOMIC_OP(or, "orr %w[newval], %w[oldval], %w[oparg]") 52 + LLSC_FUTEX_ATOMIC_OP(and, "and 
%w[newval], %w[oldval], %w[oparg]") 53 + LLSC_FUTEX_ATOMIC_OP(eor, "eor %w[newval], %w[oldval], %w[oparg]") 54 + LLSC_FUTEX_ATOMIC_OP(set, "mov %w[newval], %w[oparg]") 55 + 56 + static __always_inline int 57 + __llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval) 58 + { 59 + int ret = 0; 60 + unsigned int loops = FUTEX_MAX_LOOPS; 61 + u32 val, tmp; 62 + 63 + uaccess_enable_privileged(); 64 + asm volatile("//__llsc_futex_cmpxchg\n" 65 + " prfm pstl1strm, %[uaddr]\n" 66 + "1: ldxr %w[curval], %[uaddr]\n" 67 + " eor %w[tmp], %w[curval], %w[oldval]\n" 68 + " cbnz %w[tmp], 4f\n" 69 + "2: stlxr %w[tmp], %w[newval], %[uaddr]\n" 70 + " cbz %w[tmp], 3f\n" 71 + " sub %w[loops], %w[loops], %w[tmp]\n" 72 + " cbnz %w[loops], 1b\n" 73 + " mov %w[ret], %w[err]\n" 74 + "3:\n" 75 + " dmb ish\n" 76 + "4:\n" 77 + _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w[ret]) 78 + _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w[ret]) 79 + : [ret] "+r" (ret), [curval] "=&r" (val), 80 + [uaddr] "+Q" (*uaddr), [tmp] "=&r" (tmp), 81 + [loops] "+r" (loops) 82 + : [oldval] "r" (oldval), [newval] "r" (newval), 83 + [err] "Ir" (-EAGAIN) 84 + : "memory"); 85 + uaccess_disable_privileged(); 86 + 87 + if (!ret) 88 + *oval = val; 89 + 90 + return ret; 91 + } 92 + 93 + #ifdef CONFIG_ARM64_LSUI 94 + 95 + /* 96 + * Wrap LSUI instructions with uaccess_ttbr0_enable()/disable(), as 97 + * PAN toggling is not required. 
98 + */ 99 + 100 + #define LSUI_FUTEX_ATOMIC_OP(op, asm_op) \ 101 + static __always_inline int \ 102 + __lsui_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval) \ 103 + { \ 104 + int ret = 0; \ 105 + int oldval; \ 106 + \ 107 + uaccess_ttbr0_enable(); \ 108 + \ 109 + asm volatile("// __lsui_futex_atomic_" #op "\n" \ 110 + __LSUI_PREAMBLE \ 111 + "1: " #asm_op "al %w[oparg], %w[oldval], %[uaddr]\n" \ 112 + "2:\n" \ 113 + _ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret]) \ 114 + : [ret] "+r" (ret), [uaddr] "+Q" (*uaddr), \ 115 + [oldval] "=r" (oldval) \ 116 + : [oparg] "r" (oparg) \ 117 + : "memory"); \ 118 + \ 119 + uaccess_ttbr0_disable(); \ 120 + \ 121 + if (!ret) \ 122 + *oval = oldval; \ 123 + return ret; \ 124 + } 125 + 126 + LSUI_FUTEX_ATOMIC_OP(add, ldtadd) 127 + LSUI_FUTEX_ATOMIC_OP(or, ldtset) 128 + LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr) 129 + LSUI_FUTEX_ATOMIC_OP(set, swpt) 130 + 131 + static __always_inline int 132 + __lsui_cmpxchg64(u64 __user *uaddr, u64 *oldval, u64 newval) 133 + { 134 + int ret = 0; 135 + 136 + uaccess_ttbr0_enable(); 137 + 138 + asm volatile("// __lsui_cmpxchg64\n" 139 + __LSUI_PREAMBLE 140 + "1: casalt %[oldval], %[newval], %[uaddr]\n" 141 + "2:\n" 142 + _ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret]) 143 + : [ret] "+r" (ret), [uaddr] "+Q" (*uaddr), 144 + [oldval] "+r" (*oldval) 145 + : [newval] "r" (newval) 146 + : "memory"); 147 + 148 + uaccess_ttbr0_disable(); 149 + 150 + return ret; 151 + } 152 + 153 + static __always_inline int 154 + __lsui_cmpxchg32(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval) 155 + { 156 + u64 __user *uaddr64; 157 + bool futex_pos, other_pos; 158 + u32 other, orig_other; 159 + union { 160 + u32 futex[2]; 161 + u64 raw; 162 + } oval64, orig64, nval64; 163 + 164 + uaddr64 = (u64 __user *)PTR_ALIGN_DOWN(uaddr, sizeof(u64)); 165 + futex_pos = !IS_ALIGNED((unsigned long)uaddr, sizeof(u64)); 166 + other_pos = !futex_pos; 167 + 168 + oval64.futex[futex_pos] = oldval; 169 + if (get_user(oval64.futex[other_pos], (u32 
__user *)uaddr64 + other_pos)) 170 + return -EFAULT; 171 + 172 + orig64.raw = oval64.raw; 173 + 174 + nval64.futex[futex_pos] = newval; 175 + nval64.futex[other_pos] = oval64.futex[other_pos]; 176 + 177 + if (__lsui_cmpxchg64(uaddr64, &oval64.raw, nval64.raw)) 178 + return -EFAULT; 179 + 180 + oldval = oval64.futex[futex_pos]; 181 + other = oval64.futex[other_pos]; 182 + orig_other = orig64.futex[other_pos]; 183 + 184 + if (other != orig_other) 185 + return -EAGAIN; 186 + 187 + *oval = oldval; 188 + 189 + return 0; 190 + } 191 + 192 + static __always_inline int 193 + __lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval) 194 + { 195 + /* 196 + * Undo the bitwise negation applied to the oparg passed from 197 + * arch_futex_atomic_op_inuser() with FUTEX_OP_ANDN. 198 + */ 199 + return __lsui_futex_atomic_andnot(~oparg, uaddr, oval); 200 + } 201 + 202 + static __always_inline int 203 + __lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval) 204 + { 205 + u32 oldval, newval, val; 206 + int ret, i; 207 + 208 + if (get_user(oldval, uaddr)) 209 + return -EFAULT; 210 + 211 + /* 212 + * there are no ldteor/stteor instructions... 213 + */ 214 + for (i = 0; i < FUTEX_MAX_LOOPS; i++) { 215 + newval = oldval ^ oparg; 216 + 217 + ret = __lsui_cmpxchg32(uaddr, oldval, newval, &val); 218 + switch (ret) { 219 + case -EFAULT: 220 + return ret; 221 + case -EAGAIN: 222 + continue; 223 + } 224 + 225 + if (val == oldval) { 226 + *oval = val; 227 + return 0; 228 + } 229 + 230 + oldval = val; 231 + } 232 + 233 + return -EAGAIN; 234 + } 235 + 236 + static __always_inline int 237 + __lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval) 238 + { 239 + /* 240 + * Callers of futex_atomic_cmpxchg_inatomic() already retry on 241 + * -EAGAIN, no need for another loop of max retries. 
242 + */ 243 + return __lsui_cmpxchg32(uaddr, oldval, newval, oval); 244 + } 245 + #endif /* CONFIG_ARM64_LSUI */ 246 + 247 + 248 + #define FUTEX_ATOMIC_OP(op) \ 249 + static __always_inline int \ 250 + __futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval) \ 251 + { \ 252 + return __lsui_llsc_body(futex_atomic_##op, oparg, uaddr, oval); \ 253 + } 254 + 255 + FUTEX_ATOMIC_OP(add) 256 + FUTEX_ATOMIC_OP(or) 257 + FUTEX_ATOMIC_OP(and) 258 + FUTEX_ATOMIC_OP(eor) 259 + FUTEX_ATOMIC_OP(set) 260 + 261 + static __always_inline int 262 + __futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval) 263 + { 264 + return __lsui_llsc_body(futex_cmpxchg, uaddr, oldval, newval, oval); 265 + } 39 266 40 267 static inline int 41 268 arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *_uaddr) 42 269 { 43 - int oldval = 0, ret, tmp; 44 - u32 __user *uaddr = __uaccess_mask_ptr(_uaddr); 270 + int ret; 271 + u32 __user *uaddr; 45 272 46 273 if (!access_ok(_uaddr, sizeof(u32))) 47 274 return -EFAULT; 48 275 276 + uaddr = __uaccess_mask_ptr(_uaddr); 277 + 49 278 switch (op) { 50 279 case FUTEX_OP_SET: 51 - __futex_atomic_op("mov %w3, %w5", 52 - ret, oldval, uaddr, tmp, oparg); 280 + ret = __futex_atomic_set(oparg, uaddr, oval); 53 281 break; 54 282 case FUTEX_OP_ADD: 55 - __futex_atomic_op("add %w3, %w1, %w5", 56 - ret, oldval, uaddr, tmp, oparg); 283 + ret = __futex_atomic_add(oparg, uaddr, oval); 57 284 break; 58 285 case FUTEX_OP_OR: 59 - __futex_atomic_op("orr %w3, %w1, %w5", 60 - ret, oldval, uaddr, tmp, oparg); 286 + ret = __futex_atomic_or(oparg, uaddr, oval); 61 287 break; 62 288 case FUTEX_OP_ANDN: 63 - __futex_atomic_op("and %w3, %w1, %w5", 64 - ret, oldval, uaddr, tmp, ~oparg); 289 + ret = __futex_atomic_and(~oparg, uaddr, oval); 65 290 break; 66 291 case FUTEX_OP_XOR: 67 - __futex_atomic_op("eor %w3, %w1, %w5", 68 - ret, oldval, uaddr, tmp, oparg); 292 + ret = __futex_atomic_eor(oparg, uaddr, oval); 69 293 break; 70 294 default: 71 295 ret = 
-ENOSYS; 72 296 } 73 - 74 - if (!ret) 75 - *oval = oldval; 76 297 77 298 return ret; 78 299 } ··· 302 81 futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *_uaddr, 303 82 u32 oldval, u32 newval) 304 83 { 305 - int ret = 0; 306 - unsigned int loops = FUTEX_MAX_LOOPS; 307 - u32 val, tmp; 308 84 u32 __user *uaddr; 309 85 310 86 if (!access_ok(_uaddr, sizeof(u32))) 311 87 return -EFAULT; 312 88 313 89 uaddr = __uaccess_mask_ptr(_uaddr); 314 - uaccess_enable_privileged(); 315 - asm volatile("// futex_atomic_cmpxchg_inatomic\n" 316 - " prfm pstl1strm, %2\n" 317 - "1: ldxr %w1, %2\n" 318 - " sub %w3, %w1, %w5\n" 319 - " cbnz %w3, 4f\n" 320 - "2: stlxr %w3, %w6, %2\n" 321 - " cbz %w3, 3f\n" 322 - " sub %w4, %w4, %w3\n" 323 - " cbnz %w4, 1b\n" 324 - " mov %w0, %w7\n" 325 - "3:\n" 326 - " dmb ish\n" 327 - "4:\n" 328 - _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0) 329 - _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0) 330 - : "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops) 331 - : "r" (oldval), "r" (newval), "Ir" (-EAGAIN) 332 - : "memory"); 333 - uaccess_disable_privileged(); 334 90 335 - if (!ret) 336 - *uval = val; 337 - 338 - return ret; 91 + return __futex_cmpxchg(uaddr, oldval, newval, uval); 339 92 } 340 93 341 94 #endif /* __ASM_FUTEX_H */
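The `__lsui_cmpxchg32()` trick above can be illustrated outside the kernel. This is a user-space sketch (not the kernel code itself) using C11 atomics: a 32-bit compare-and-swap is emulated with a 64-bit CAS on the naturally aligned 64-bit word containing the futex, returning -EAGAIN when the neighbouring 32-bit half changed underneath. Little-endian layout is assumed, matching the `!CPU_BIG_ENDIAN` dependency of `CONFIG_ARM64_LSUI`.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Emulate cmpxchg on a u32 using a 64-bit CAS on the aligned containing
 * word. On -EAGAIN the neighbouring half changed and the caller retries,
 * as the kernel's __lsui_futex_atomic_eor() does.
 */
static int cmpxchg32_via_64(uint32_t *uaddr, uint32_t oldval,
			    uint32_t newval, uint32_t *oval)
{
	uintptr_t addr = (uintptr_t)uaddr;
	_Atomic uint64_t *uaddr64 = (_Atomic uint64_t *)(addr & ~(uintptr_t)7);
	unsigned int futex_pos = (addr & 7) ? 1 : 0; /* which half is the futex */
	unsigned int other_pos = !futex_pos;
	union { uint32_t futex[2]; uint64_t raw; } expect, desired, seen;
	uint32_t orig_other;
	uint64_t exp;

	/* Build 64-bit expected/new images around the untouched half. */
	expect.raw = atomic_load(uaddr64);
	orig_other = expect.futex[other_pos];
	expect.futex[futex_pos] = oldval;
	desired.raw = expect.raw;
	desired.futex[futex_pos] = newval;

	exp = expect.raw;
	if (!atomic_compare_exchange_strong(uaddr64, &exp, desired.raw)) {
		seen.raw = exp;
		if (seen.futex[other_pos] != orig_other)
			return -EAGAIN; /* neighbour changed: retry */
		*oval = seen.futex[futex_pos]; /* ordinary compare failure */
		return 0;
	}

	*oval = oldval; /* success: the old value matched */
	return 0;
}
```

The kernel version additionally pre-reads the neighbouring half with `get_user()` and goes through `casalt` under `uaccess_ttbr0_enable()`; the retry-on-neighbour-change structure is the same.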
+6 -6
arch/arm64/include/asm/hugetlb.h
··· 71 71 unsigned long start, 72 72 unsigned long end, 73 73 unsigned long stride, 74 - bool last_level) 74 + tlbf_t flags) 75 75 { 76 76 switch (stride) { 77 77 #ifndef __PAGETABLE_PMD_FOLDED 78 78 case PUD_SIZE: 79 - __flush_tlb_range(vma, start, end, PUD_SIZE, last_level, 1); 79 + __flush_tlb_range(vma, start, end, PUD_SIZE, 1, flags); 80 80 break; 81 81 #endif 82 82 case CONT_PMD_SIZE: 83 83 case PMD_SIZE: 84 - __flush_tlb_range(vma, start, end, PMD_SIZE, last_level, 2); 84 + __flush_tlb_range(vma, start, end, PMD_SIZE, 2, flags); 85 85 break; 86 86 case CONT_PTE_SIZE: 87 - __flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, 3); 87 + __flush_tlb_range(vma, start, end, PAGE_SIZE, 3, flags); 88 88 break; 89 89 default: 90 - __flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, TLBI_TTL_UNKNOWN); 90 + __flush_tlb_range(vma, start, end, PAGE_SIZE, TLBI_TTL_UNKNOWN, flags); 91 91 } 92 92 } 93 93 ··· 98 98 { 99 99 unsigned long stride = huge_page_size(hstate_vma(vma)); 100 100 101 - __flush_hugetlb_tlb_range(vma, start, end, stride, false); 101 + __flush_hugetlb_tlb_range(vma, start, end, stride, TLBF_NONE); 102 102 } 103 103 104 104 #endif /* __ASM_HUGETLB_H */
+2 -118
arch/arm64/include/asm/hwcap.h
··· 60 60 * of KERNEL_HWCAP_{feature}. 61 61 */ 62 62 #define __khwcap_feature(x) const_ilog2(HWCAP_ ## x) 63 - #define KERNEL_HWCAP_FP __khwcap_feature(FP) 64 - #define KERNEL_HWCAP_ASIMD __khwcap_feature(ASIMD) 65 - #define KERNEL_HWCAP_EVTSTRM __khwcap_feature(EVTSTRM) 66 - #define KERNEL_HWCAP_AES __khwcap_feature(AES) 67 - #define KERNEL_HWCAP_PMULL __khwcap_feature(PMULL) 68 - #define KERNEL_HWCAP_SHA1 __khwcap_feature(SHA1) 69 - #define KERNEL_HWCAP_SHA2 __khwcap_feature(SHA2) 70 - #define KERNEL_HWCAP_CRC32 __khwcap_feature(CRC32) 71 - #define KERNEL_HWCAP_ATOMICS __khwcap_feature(ATOMICS) 72 - #define KERNEL_HWCAP_FPHP __khwcap_feature(FPHP) 73 - #define KERNEL_HWCAP_ASIMDHP __khwcap_feature(ASIMDHP) 74 - #define KERNEL_HWCAP_CPUID __khwcap_feature(CPUID) 75 - #define KERNEL_HWCAP_ASIMDRDM __khwcap_feature(ASIMDRDM) 76 - #define KERNEL_HWCAP_JSCVT __khwcap_feature(JSCVT) 77 - #define KERNEL_HWCAP_FCMA __khwcap_feature(FCMA) 78 - #define KERNEL_HWCAP_LRCPC __khwcap_feature(LRCPC) 79 - #define KERNEL_HWCAP_DCPOP __khwcap_feature(DCPOP) 80 - #define KERNEL_HWCAP_SHA3 __khwcap_feature(SHA3) 81 - #define KERNEL_HWCAP_SM3 __khwcap_feature(SM3) 82 - #define KERNEL_HWCAP_SM4 __khwcap_feature(SM4) 83 - #define KERNEL_HWCAP_ASIMDDP __khwcap_feature(ASIMDDP) 84 - #define KERNEL_HWCAP_SHA512 __khwcap_feature(SHA512) 85 - #define KERNEL_HWCAP_SVE __khwcap_feature(SVE) 86 - #define KERNEL_HWCAP_ASIMDFHM __khwcap_feature(ASIMDFHM) 87 - #define KERNEL_HWCAP_DIT __khwcap_feature(DIT) 88 - #define KERNEL_HWCAP_USCAT __khwcap_feature(USCAT) 89 - #define KERNEL_HWCAP_ILRCPC __khwcap_feature(ILRCPC) 90 - #define KERNEL_HWCAP_FLAGM __khwcap_feature(FLAGM) 91 - #define KERNEL_HWCAP_SSBS __khwcap_feature(SSBS) 92 - #define KERNEL_HWCAP_SB __khwcap_feature(SB) 93 - #define KERNEL_HWCAP_PACA __khwcap_feature(PACA) 94 - #define KERNEL_HWCAP_PACG __khwcap_feature(PACG) 95 - #define KERNEL_HWCAP_GCS __khwcap_feature(GCS) 96 - #define KERNEL_HWCAP_CMPBR __khwcap_feature(CMPBR) 97 - 
#define KERNEL_HWCAP_FPRCVT __khwcap_feature(FPRCVT) 98 - #define KERNEL_HWCAP_F8MM8 __khwcap_feature(F8MM8) 99 - #define KERNEL_HWCAP_F8MM4 __khwcap_feature(F8MM4) 100 - #define KERNEL_HWCAP_SVE_F16MM __khwcap_feature(SVE_F16MM) 101 - #define KERNEL_HWCAP_SVE_ELTPERM __khwcap_feature(SVE_ELTPERM) 102 - #define KERNEL_HWCAP_SVE_AES2 __khwcap_feature(SVE_AES2) 103 - #define KERNEL_HWCAP_SVE_BFSCALE __khwcap_feature(SVE_BFSCALE) 104 - #define KERNEL_HWCAP_SVE2P2 __khwcap_feature(SVE2P2) 105 - #define KERNEL_HWCAP_SME2P2 __khwcap_feature(SME2P2) 106 - #define KERNEL_HWCAP_SME_SBITPERM __khwcap_feature(SME_SBITPERM) 107 - #define KERNEL_HWCAP_SME_AES __khwcap_feature(SME_AES) 108 - #define KERNEL_HWCAP_SME_SFEXPA __khwcap_feature(SME_SFEXPA) 109 - #define KERNEL_HWCAP_SME_STMOP __khwcap_feature(SME_STMOP) 110 - #define KERNEL_HWCAP_SME_SMOP4 __khwcap_feature(SME_SMOP4) 111 - 112 63 #define __khwcap2_feature(x) (const_ilog2(HWCAP2_ ## x) + 64) 113 - #define KERNEL_HWCAP_DCPODP __khwcap2_feature(DCPODP) 114 - #define KERNEL_HWCAP_SVE2 __khwcap2_feature(SVE2) 115 - #define KERNEL_HWCAP_SVEAES __khwcap2_feature(SVEAES) 116 - #define KERNEL_HWCAP_SVEPMULL __khwcap2_feature(SVEPMULL) 117 - #define KERNEL_HWCAP_SVEBITPERM __khwcap2_feature(SVEBITPERM) 118 - #define KERNEL_HWCAP_SVESHA3 __khwcap2_feature(SVESHA3) 119 - #define KERNEL_HWCAP_SVESM4 __khwcap2_feature(SVESM4) 120 - #define KERNEL_HWCAP_FLAGM2 __khwcap2_feature(FLAGM2) 121 - #define KERNEL_HWCAP_FRINT __khwcap2_feature(FRINT) 122 - #define KERNEL_HWCAP_SVEI8MM __khwcap2_feature(SVEI8MM) 123 - #define KERNEL_HWCAP_SVEF32MM __khwcap2_feature(SVEF32MM) 124 - #define KERNEL_HWCAP_SVEF64MM __khwcap2_feature(SVEF64MM) 125 - #define KERNEL_HWCAP_SVEBF16 __khwcap2_feature(SVEBF16) 126 - #define KERNEL_HWCAP_I8MM __khwcap2_feature(I8MM) 127 - #define KERNEL_HWCAP_BF16 __khwcap2_feature(BF16) 128 - #define KERNEL_HWCAP_DGH __khwcap2_feature(DGH) 129 - #define KERNEL_HWCAP_RNG __khwcap2_feature(RNG) 130 - #define 
KERNEL_HWCAP_BTI __khwcap2_feature(BTI) 131 - #define KERNEL_HWCAP_MTE __khwcap2_feature(MTE) 132 - #define KERNEL_HWCAP_ECV __khwcap2_feature(ECV) 133 - #define KERNEL_HWCAP_AFP __khwcap2_feature(AFP) 134 - #define KERNEL_HWCAP_RPRES __khwcap2_feature(RPRES) 135 - #define KERNEL_HWCAP_MTE3 __khwcap2_feature(MTE3) 136 - #define KERNEL_HWCAP_SME __khwcap2_feature(SME) 137 - #define KERNEL_HWCAP_SME_I16I64 __khwcap2_feature(SME_I16I64) 138 - #define KERNEL_HWCAP_SME_F64F64 __khwcap2_feature(SME_F64F64) 139 - #define KERNEL_HWCAP_SME_I8I32 __khwcap2_feature(SME_I8I32) 140 - #define KERNEL_HWCAP_SME_F16F32 __khwcap2_feature(SME_F16F32) 141 - #define KERNEL_HWCAP_SME_B16F32 __khwcap2_feature(SME_B16F32) 142 - #define KERNEL_HWCAP_SME_F32F32 __khwcap2_feature(SME_F32F32) 143 - #define KERNEL_HWCAP_SME_FA64 __khwcap2_feature(SME_FA64) 144 - #define KERNEL_HWCAP_WFXT __khwcap2_feature(WFXT) 145 - #define KERNEL_HWCAP_EBF16 __khwcap2_feature(EBF16) 146 - #define KERNEL_HWCAP_SVE_EBF16 __khwcap2_feature(SVE_EBF16) 147 - #define KERNEL_HWCAP_CSSC __khwcap2_feature(CSSC) 148 - #define KERNEL_HWCAP_RPRFM __khwcap2_feature(RPRFM) 149 - #define KERNEL_HWCAP_SVE2P1 __khwcap2_feature(SVE2P1) 150 - #define KERNEL_HWCAP_SME2 __khwcap2_feature(SME2) 151 - #define KERNEL_HWCAP_SME2P1 __khwcap2_feature(SME2P1) 152 - #define KERNEL_HWCAP_SME_I16I32 __khwcap2_feature(SME_I16I32) 153 - #define KERNEL_HWCAP_SME_BI32I32 __khwcap2_feature(SME_BI32I32) 154 - #define KERNEL_HWCAP_SME_B16B16 __khwcap2_feature(SME_B16B16) 155 - #define KERNEL_HWCAP_SME_F16F16 __khwcap2_feature(SME_F16F16) 156 - #define KERNEL_HWCAP_MOPS __khwcap2_feature(MOPS) 157 - #define KERNEL_HWCAP_HBC __khwcap2_feature(HBC) 158 - #define KERNEL_HWCAP_SVE_B16B16 __khwcap2_feature(SVE_B16B16) 159 - #define KERNEL_HWCAP_LRCPC3 __khwcap2_feature(LRCPC3) 160 - #define KERNEL_HWCAP_LSE128 __khwcap2_feature(LSE128) 161 - #define KERNEL_HWCAP_FPMR __khwcap2_feature(FPMR) 162 - #define KERNEL_HWCAP_LUT __khwcap2_feature(LUT) 163 - 
#define KERNEL_HWCAP_FAMINMAX __khwcap2_feature(FAMINMAX) 164 - #define KERNEL_HWCAP_F8CVT __khwcap2_feature(F8CVT) 165 - #define KERNEL_HWCAP_F8FMA __khwcap2_feature(F8FMA) 166 - #define KERNEL_HWCAP_F8DP4 __khwcap2_feature(F8DP4) 167 - #define KERNEL_HWCAP_F8DP2 __khwcap2_feature(F8DP2) 168 - #define KERNEL_HWCAP_F8E4M3 __khwcap2_feature(F8E4M3) 169 - #define KERNEL_HWCAP_F8E5M2 __khwcap2_feature(F8E5M2) 170 - #define KERNEL_HWCAP_SME_LUTV2 __khwcap2_feature(SME_LUTV2) 171 - #define KERNEL_HWCAP_SME_F8F16 __khwcap2_feature(SME_F8F16) 172 - #define KERNEL_HWCAP_SME_F8F32 __khwcap2_feature(SME_F8F32) 173 - #define KERNEL_HWCAP_SME_SF8FMA __khwcap2_feature(SME_SF8FMA) 174 - #define KERNEL_HWCAP_SME_SF8DP4 __khwcap2_feature(SME_SF8DP4) 175 - #define KERNEL_HWCAP_SME_SF8DP2 __khwcap2_feature(SME_SF8DP2) 176 - #define KERNEL_HWCAP_POE __khwcap2_feature(POE) 177 - 178 64 #define __khwcap3_feature(x) (const_ilog2(HWCAP3_ ## x) + 128) 179 - #define KERNEL_HWCAP_MTE_FAR __khwcap3_feature(MTE_FAR) 180 - #define KERNEL_HWCAP_MTE_STORE_ONLY __khwcap3_feature(MTE_STORE_ONLY) 181 - #define KERNEL_HWCAP_LSFE __khwcap3_feature(LSFE) 182 - #define KERNEL_HWCAP_LS64 __khwcap3_feature(LS64) 65 + 66 + #include "asm/kernel-hwcap.h" 183 67 184 68 /* 185 69 * This yields a mask that user programs can use to figure out what
+27
arch/arm64/include/asm/lsui.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef __ASM_LSUI_H 3 + #define __ASM_LSUI_H 4 + 5 + #include <linux/compiler_types.h> 6 + #include <linux/stringify.h> 7 + #include <asm/alternative.h> 8 + #include <asm/alternative-macros.h> 9 + #include <asm/cpucaps.h> 10 + 11 + #define __LSUI_PREAMBLE ".arch_extension lsui\n" 12 + 13 + #ifdef CONFIG_ARM64_LSUI 14 + 15 + #define __lsui_llsc_body(op, ...) \ 16 + ({ \ 17 + alternative_has_cap_unlikely(ARM64_HAS_LSUI) ? \ 18 + __lsui_##op(__VA_ARGS__) : __llsc_##op(__VA_ARGS__); \ 19 + }) 20 + 21 + #else /* CONFIG_ARM64_LSUI */ 22 + 23 + #define __lsui_llsc_body(op, ...) __llsc_##op(__VA_ARGS__) 24 + 25 + #endif /* CONFIG_ARM64_LSUI */ 26 + 27 + #endif /* __ASM_LSUI_H */
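The dispatch pattern behind `__lsui_llsc_body()` can be sketched in portable C: one front-end name expands to the LSUI or LL/SC back-end based on a CPU capability. The kernel resolves the check at patch time via the alternatives framework (`alternative_has_cap_unlikely()`); a plain runtime flag stands in for it here, and the back-ends are trivial placeholders that just record which path was taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool cpu_has_lsui;          /* stand-in for the patched capability check */
static const char *backend_used;   /* records which back-end ran (for the demo) */

#define lsui_llsc_body(op, ...) \
	(cpu_has_lsui ? __lsui_##op(__VA_ARGS__) : __llsc_##op(__VA_ARGS__))

static int __lsui_futex_atomic_add(int oparg, int *uaddr, int *oval)
{
	backend_used = "lsui";
	*oval = *uaddr;
	*uaddr += oparg;
	return 0;
}

static int __llsc_futex_atomic_add(int oparg, int *uaddr, int *oval)
{
	backend_used = "llsc";
	*oval = *uaddr;
	*uaddr += oparg;
	return 0;
}

/* Front-end: the name passed to the macro selects both back-ends. */
static int futex_atomic_add(int oparg, int *uaddr, int *oval)
{
	return lsui_llsc_body(futex_atomic_add, oparg, uaddr, oval);
}
```

Token pasting on the `op` argument is what lets one `FUTEX_ATOMIC_OP()` definition in futex.h generate both the `__lsui_` and `__llsc_` call sites.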
+2 -8
arch/arm64/include/asm/mmu.h
··· 10 10 #define MMCF_AARCH32 0x1 /* mm context flag for AArch32 executables */ 11 11 #define USER_ASID_BIT 48 12 12 #define USER_ASID_FLAG (UL(1) << USER_ASID_BIT) 13 - #define TTBR_ASID_MASK (UL(0xffff) << 48) 14 13 15 14 #ifndef __ASSEMBLER__ 16 15 17 16 #include <linux/refcount.h> 18 17 #include <asm/cpufeature.h> 19 - 20 - enum pgtable_type { 21 - TABLE_PTE, 22 - TABLE_PMD, 23 - TABLE_PUD, 24 - TABLE_P4D, 25 - }; 26 18 27 19 typedef struct { 28 20 atomic64_t id; ··· 103 111 #else 104 112 static inline void kpti_install_ng_mappings(void) {} 105 113 #endif 114 + 115 + extern bool page_alloc_available; 106 116 107 117 #endif /* !__ASSEMBLER__ */ 108 118 #endif
+2 -1
arch/arm64/include/asm/mmu_context.h
··· 210 210 if (mm == &init_mm) 211 211 ttbr = phys_to_ttbr(__pa_symbol(reserved_pg_dir)); 212 212 else 213 - ttbr = phys_to_ttbr(virt_to_phys(mm->pgd)) | ASID(mm) << 48; 213 + ttbr = phys_to_ttbr(virt_to_phys(mm->pgd)) | 214 + FIELD_PREP(TTBRx_EL1_ASID_MASK, ASID(mm)); 214 215 215 216 WRITE_ONCE(task_thread_info(tsk)->ttbr0, ttbr); 216 217 }
+96
arch/arm64/include/asm/mpam.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* Copyright (C) 2025 Arm Ltd. */ 3 + 4 + #ifndef __ASM__MPAM_H 5 + #define __ASM__MPAM_H 6 + 7 + #include <linux/arm_mpam.h> 8 + #include <linux/bitfield.h> 9 + #include <linux/jump_label.h> 10 + #include <linux/percpu.h> 11 + #include <linux/sched.h> 12 + 13 + #include <asm/sysreg.h> 14 + 15 + DECLARE_STATIC_KEY_FALSE(mpam_enabled); 16 + DECLARE_PER_CPU(u64, arm64_mpam_default); 17 + DECLARE_PER_CPU(u64, arm64_mpam_current); 18 + 19 + /* 20 + * The value of the MPAM0_EL1 sysreg when a task is in resctrl's default group. 21 + * This is used by the context switch code to use the resctrl CPU property 22 + * instead. The value is modified when CDP is enabled/disabled by mounting 23 + * the resctrl filesystem. 24 + */ 25 + extern u64 arm64_mpam_global_default; 26 + 27 + #ifdef CONFIG_ARM64_MPAM 28 + static inline u64 __mpam_regval(u16 partid_d, u16 partid_i, u8 pmg_d, u8 pmg_i) 29 + { 30 + return FIELD_PREP(MPAM0_EL1_PARTID_D, partid_d) | 31 + FIELD_PREP(MPAM0_EL1_PARTID_I, partid_i) | 32 + FIELD_PREP(MPAM0_EL1_PMG_D, pmg_d) | 33 + FIELD_PREP(MPAM0_EL1_PMG_I, pmg_i); 34 + } 35 + 36 + static inline void mpam_set_cpu_defaults(int cpu, u16 partid_d, u16 partid_i, 37 + u8 pmg_d, u8 pmg_i) 38 + { 39 + u64 default_val = __mpam_regval(partid_d, partid_i, pmg_d, pmg_i); 40 + 41 + WRITE_ONCE(per_cpu(arm64_mpam_default, cpu), default_val); 42 + } 43 + 44 + /* 45 + * The resctrl filesystem writes to the partid/pmg values for threads and CPUs, 46 + * which may race with reads in mpam_thread_switch(). Ensure only one of the old 47 + * or new values is used. Particular care should be taken with the pmg field, as 48 + * mpam_thread_switch() may read a partid and pmg that don't match, causing this 49 + * value to be stored with cache allocations, despite being considered 'free' by 50 + * resctrl.
51 + */ 52 + static inline u64 mpam_get_regval(struct task_struct *tsk) 53 + { 54 + return READ_ONCE(task_thread_info(tsk)->mpam_partid_pmg); 55 + } 56 + 57 + static inline void mpam_set_task_partid_pmg(struct task_struct *tsk, 58 + u16 partid_d, u16 partid_i, 59 + u8 pmg_d, u8 pmg_i) 60 + { 61 + u64 regval = __mpam_regval(partid_d, partid_i, pmg_d, pmg_i); 62 + 63 + WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval); 64 + } 65 + 66 + static inline void mpam_thread_switch(struct task_struct *tsk) 67 + { 68 + u64 oldregval; 69 + int cpu = smp_processor_id(); 70 + u64 regval = mpam_get_regval(tsk); 71 + 72 + if (!static_branch_likely(&mpam_enabled)) 73 + return; 74 + 75 + if (regval == READ_ONCE(arm64_mpam_global_default)) 76 + regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu)); 77 + 78 + oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); 79 + if (oldregval == regval) 80 + return; 81 + 82 + write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1); 83 + if (system_supports_sme()) 84 + write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D), SYS_MPAMSM_EL1); 85 + isb(); 86 + 87 + /* Synchronising the EL0 write is left until the ERET to EL0 */ 88 + write_sysreg_s(regval, SYS_MPAM0_EL1); 89 + 90 + WRITE_ONCE(per_cpu(arm64_mpam_current, cpu), regval); 91 + } 92 + #else 93 + static inline void mpam_thread_switch(struct task_struct *tsk) {} 94 + #endif /* CONFIG_ARM64_MPAM */ 95 + 96 + #endif /* __ASM__MPAM_H */
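The packing done by `__mpam_regval()` can be sketched with plain shifts in place of `FIELD_PREP()`. The bit positions used below (PARTID_I in bits [15:0], PARTID_D [31:16], PMG_I [39:32], PMG_D [47:40]) are an assumption based on the Arm MPAM register layout; the kernel takes the real masks from the generated sysreg definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the instruction/data partid and pmg fields into one MPAM0_EL1-style
 * register image, mirroring __mpam_regval() with open-coded shifts. */
static uint64_t mpam_regval(uint16_t partid_d, uint16_t partid_i,
			    uint8_t pmg_d, uint8_t pmg_i)
{
	return (uint64_t)partid_i |
	       ((uint64_t)partid_d << 16) |
	       ((uint64_t)pmg_i << 32) |
	       ((uint64_t)pmg_d << 40);
}
```

Storing the whole image as one u64 in thread_info is what lets `mpam_thread_switch()` read a task's partid and pmg atomically with a single `READ_ONCE()`, as the comment above requires.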
+6
arch/arm64/include/asm/mte.h
··· 252 252 if (!kasan_hw_tags_enabled()) 253 253 return; 254 254 255 + if (!system_uses_mte_async_or_asymm_mode()) 256 + return; 257 + 255 258 mte_check_tfsr_el1(); 256 259 } 257 260 258 261 static inline void mte_check_tfsr_exit(void) 259 262 { 260 263 if (!kasan_hw_tags_enabled()) 264 + return; 265 + 266 + if (!system_uses_mte_async_or_asymm_mode()) 261 267 return; 262 268 263 269 /*
+5 -4
arch/arm64/include/asm/pgtable-hwdef.h
··· 223 223 */ 224 224 #define S1_TABLE_AP (_AT(pmdval_t, 3) << 61) 225 225 226 - #define TTBR_CNP_BIT (UL(1) << 0) 227 - 228 226 /* 229 227 * TCR flags. 230 228 */ ··· 285 287 #endif 286 288 287 289 #ifdef CONFIG_ARM64_VA_BITS_52 290 + #define PTRS_PER_PGD_52_VA (UL(1) << (52 - PGDIR_SHIFT)) 291 + #define PTRS_PER_PGD_48_VA (UL(1) << (48 - PGDIR_SHIFT)) 292 + #define PTRS_PER_PGD_EXTRA (PTRS_PER_PGD_52_VA - PTRS_PER_PGD_48_VA) 293 + 288 294 /* Must be at least 64-byte aligned to prevent corruption of the TTBR */ 289 - #define TTBR1_BADDR_4852_OFFSET (((UL(1) << (52 - PGDIR_SHIFT)) - \ 290 - (UL(1) << (48 - PGDIR_SHIFT))) * 8) 295 + #define TTBR1_BADDR_4852_OFFSET (PTRS_PER_PGD_EXTRA << PTDESC_ORDER) 291 296 #endif 292 297 293 298 #endif
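The rewritten `TTBR1_BADDR_4852_OFFSET` should compute the same value as the old open-coded arithmetic, since shifting an entry count by `PTDESC_ORDER` is multiplying by the 8-byte descriptor size. A quick check, using `PGDIR_SHIFT = 42` (64K pages, 52-bit VA) and `PTDESC_ORDER = 3` as assumed sample values:

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT  42 /* assumed: 64K pages, 52-bit VA */
#define PTDESC_ORDER 3  /* log2(sizeof(pgd_t)): 8-byte descriptors */

#define PTRS_PER_PGD_52_VA (1ULL << (52 - PGDIR_SHIFT))
#define PTRS_PER_PGD_48_VA (1ULL << (48 - PGDIR_SHIFT))
#define PTRS_PER_PGD_EXTRA (PTRS_PER_PGD_52_VA - PTRS_PER_PGD_48_VA)

/* Old expression: (entries for 52-bit VA - entries for 48-bit VA) * 8. */
static uint64_t offset_old(void)
{
	return ((1ULL << (52 - PGDIR_SHIFT)) - (1ULL << (48 - PGDIR_SHIFT))) * 8;
}

/* New expression: extra entry count scaled by the descriptor size. */
static uint64_t offset_new(void)
{
	return PTRS_PER_PGD_EXTRA << PTDESC_ORDER;
}
```

Both forms give (1024 - 64) * 8 = 7680 bytes for these values, which also satisfies the 64-byte alignment the comment requires.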
+2
arch/arm64/include/asm/pgtable-prot.h
··· 25 25 */ 26 26 #define PTE_PRESENT_INVALID (PTE_NG) /* only when !PTE_VALID */ 27 27 28 + #define PTE_PRESENT_VALID_KERNEL (PTE_VALID | PTE_MAYBE_NG) 29 + 28 30 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP 29 31 #define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */ 30 32 #define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
+40 -20
arch/arm64/include/asm/pgtable.h
··· 89 89 90 90 /* Set stride and tlb_level in flush_*_tlb_range */ 91 91 #define flush_pmd_tlb_range(vma, addr, end) \ 92 - __flush_tlb_range(vma, addr, end, PMD_SIZE, false, 2) 92 + __flush_tlb_range(vma, addr, end, PMD_SIZE, 2, TLBF_NONE) 93 93 #define flush_pud_tlb_range(vma, addr, end) \ 94 - __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) 94 + __flush_tlb_range(vma, addr, end, PUD_SIZE, 1, TLBF_NONE) 95 95 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 96 96 97 97 /* ··· 101 101 * entries exist. 102 102 */ 103 103 #define flush_tlb_fix_spurious_fault(vma, address, ptep) \ 104 - local_flush_tlb_page_nonotify(vma, address) 104 + __flush_tlb_page(vma, address, TLBF_NOBROADCAST | TLBF_NONOTIFY) 105 105 106 - #define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ 107 - local_flush_tlb_page_nonotify(vma, address) 106 + #define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ 107 + __flush_tlb_range(vma, address, address + PMD_SIZE, PMD_SIZE, 2, \ 108 + TLBF_NOBROADCAST | TLBF_NONOTIFY | TLBF_NOWALKCACHE) 108 109 109 110 /* 110 111 * ZERO_PAGE is a global shared page that is always zero: used ··· 323 322 return clear_pte_bit(pte, __pgprot(PTE_CONT)); 324 323 } 325 324 326 - static inline pte_t pte_mkvalid(pte_t pte) 325 + static inline pte_t pte_mkvalid_k(pte_t pte) 327 326 { 328 - return set_pte_bit(pte, __pgprot(PTE_VALID)); 327 + pte = clear_pte_bit(pte, __pgprot(PTE_PRESENT_INVALID)); 328 + pte = set_pte_bit(pte, __pgprot(PTE_PRESENT_VALID_KERNEL)); 329 + return pte; 329 330 } 330 331 331 332 static inline pte_t pte_mkinvalid(pte_t pte) ··· 597 594 #define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd))) 598 595 #define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd))) 599 596 #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd))) 597 + #define pmd_mkvalid_k(pmd) pte_pmd(pte_mkvalid_k(pmd_pte(pmd))) 600 598 #define pmd_mkinvalid(pmd) pte_pmd(pte_mkinvalid(pmd_pte(pmd))) 601 599 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP 602 600 #define 
pmd_uffd_wp(pmd) pte_uffd_wp(pmd_pte(pmd)) ··· 639 635 640 636 #define pud_young(pud) pte_young(pud_pte(pud)) 641 637 #define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud))) 638 + #define pud_mkwrite_novma(pud) pte_pud(pte_mkwrite_novma(pud_pte(pud))) 639 + #define pud_mkvalid_k(pud) pte_pud(pte_mkvalid_k(pud_pte(pud))) 642 640 #define pud_write(pud) pte_write(pud_pte(pud)) 643 641 644 642 static inline pud_t pud_mkhuge(pud_t pud) ··· 785 779 786 780 #define pmd_table(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \ 787 781 PMD_TYPE_TABLE) 788 - #define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \ 789 - PMD_TYPE_SECT) 790 - #define pmd_leaf(pmd) (pmd_present(pmd) && !pmd_table(pmd)) 782 + 783 + #define pmd_leaf pmd_leaf 784 + static inline bool pmd_leaf(pmd_t pmd) 785 + { 786 + return pmd_present(pmd) && !pmd_table(pmd); 787 + } 788 + 791 789 #define pmd_bad(pmd) (!pmd_table(pmd)) 792 790 793 791 #define pmd_leaf_size(pmd) (pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE) ··· 809 799 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 810 800 811 801 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3 812 - static inline bool pud_sect(pud_t pud) { return false; } 813 802 static inline bool pud_table(pud_t pud) { return true; } 814 803 #else 815 - #define pud_sect(pud) ((pud_val(pud) & PUD_TYPE_MASK) == \ 816 - PUD_TYPE_SECT) 817 804 #define pud_table(pud) ((pud_val(pud) & PUD_TYPE_MASK) == \ 818 805 PUD_TYPE_TABLE) 819 806 #endif ··· 880 873 PUD_TYPE_TABLE) 881 874 #define pud_present(pud) pte_present(pud_pte(pud)) 882 875 #ifndef __PAGETABLE_PMD_FOLDED 883 - #define pud_leaf(pud) (pud_present(pud) && !pud_table(pud)) 876 + #define pud_leaf pud_leaf 877 + static inline bool pud_leaf(pud_t pud) 878 + { 879 + return pud_present(pud) && !pud_table(pud); 880 + } 884 881 #else 885 882 #define pud_leaf(pud) false 886 883 #endif ··· 1258 1247 return pte_pmd(pte_modify(pmd_pte(pmd), newprot)); 1259 1248 } 1260 1249 1261 - extern int __ptep_set_access_flags(struct 
vm_area_struct *vma, 1262 - unsigned long address, pte_t *ptep, 1263 - pte_t entry, int dirty); 1250 + extern int __ptep_set_access_flags_anysz(struct vm_area_struct *vma, 1251 + unsigned long address, pte_t *ptep, 1252 + pte_t entry, int dirty, 1253 + unsigned long pgsize); 1254 + 1255 + static inline int __ptep_set_access_flags(struct vm_area_struct *vma, 1256 + unsigned long address, pte_t *ptep, 1257 + pte_t entry, int dirty) 1258 + { 1259 + return __ptep_set_access_flags_anysz(vma, address, ptep, entry, dirty, 1260 + PAGE_SIZE); 1261 + } 1264 1262 1265 1263 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 1266 1264 #define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS ··· 1277 1257 unsigned long address, pmd_t *pmdp, 1278 1258 pmd_t entry, int dirty) 1279 1259 { 1280 - return __ptep_set_access_flags(vma, address, (pte_t *)pmdp, 1281 - pmd_pte(entry), dirty); 1260 + return __ptep_set_access_flags_anysz(vma, address, (pte_t *)pmdp, 1261 + pmd_pte(entry), dirty, PMD_SIZE); 1282 1262 } 1283 1263 #endif 1284 1264 ··· 1340 1320 * context-switch, which provides a DSB to complete the TLB 1341 1321 * invalidation. 1342 1322 */ 1343 - flush_tlb_page_nosync(vma, address); 1323 + __flush_tlb_page(vma, address, TLBF_NOSYNC); 1344 1324 } 1345 1325 1346 1326 return young;
+2
arch/arm64/include/asm/resctrl.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #include <linux/arm_mpam.h>
+8
arch/arm64/include/asm/scs.h
··· 10 10 #ifdef CONFIG_SHADOW_CALL_STACK 11 11 scs_sp .req x18 12 12 13 + .macro scs_load_current_base 14 + get_current_task scs_sp 15 + ldr scs_sp, [scs_sp, #TSK_TI_SCS_BASE] 16 + .endm 17 + 13 18 .macro scs_load_current 14 19 get_current_task scs_sp 15 20 ldr scs_sp, [scs_sp, #TSK_TI_SCS_SP] ··· 24 19 str scs_sp, [\tsk, #TSK_TI_SCS_SP] 25 20 .endm 26 21 #else 22 + .macro scs_load_current_base 23 + .endm 24 + 27 25 .macro scs_load_current 28 26 .endm 29 27
+3
arch/arm64/include/asm/thread_info.h
··· 42 42 void *scs_base; 43 43 void *scs_sp; 44 44 #endif 45 + #ifdef CONFIG_ARM64_MPAM 46 + u64 mpam_partid_pmg; 47 + #endif 45 48 u32 cpu; 46 49 }; 47 50
+3 -3
arch/arm64/include/asm/tlb.h
··· 53 53 static inline void tlb_flush(struct mmu_gather *tlb) 54 54 { 55 55 struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0); 56 - bool last_level = !tlb->freed_tables; 56 + tlbf_t flags = tlb->freed_tables ? TLBF_NONE : TLBF_NOWALKCACHE; 57 57 unsigned long stride = tlb_get_unmap_size(tlb); 58 58 int tlb_level = tlb_get_level(tlb); 59 59 ··· 63 63 * reallocate our ASID without invalidating the entire TLB. 64 64 */ 65 65 if (tlb->fullmm) { 66 - if (!last_level) 66 + if (tlb->freed_tables) 67 67 flush_tlb_mm(tlb->mm); 68 68 return; 69 69 } 70 70 71 71 __flush_tlb_range(&vma, tlb->start, tlb->end, stride, 72 - last_level, tlb_level); 72 + tlb_level, flags); 73 73 } 74 74 75 75 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
+263 -208
arch/arm64/include/asm/tlbflush.h
··· 97 97 98 98 #define TLBI_TTL_UNKNOWN INT_MAX 99 99 100 - #define __tlbi_level(op, addr, level) do { \ 101 - u64 arg = addr; \ 102 - \ 103 - if (alternative_has_cap_unlikely(ARM64_HAS_ARMv8_4_TTL) && \ 104 - level >= 0 && level <= 3) { \ 105 - u64 ttl = level & 3; \ 106 - ttl |= get_trans_granule() << 2; \ 107 - arg &= ~TLBI_TTL_MASK; \ 108 - arg |= FIELD_PREP(TLBI_TTL_MASK, ttl); \ 109 - } \ 110 - \ 111 - __tlbi(op, arg); \ 112 - } while(0) 100 + typedef void (*tlbi_op)(u64 arg); 113 101 114 - #define __tlbi_user_level(op, arg, level) do { \ 115 - if (arm64_kernel_unmapped_at_el0()) \ 116 - __tlbi_level(op, (arg | USER_ASID_FLAG), level); \ 117 - } while (0) 102 + static __always_inline void vae1is(u64 arg) 103 + { 104 + __tlbi(vae1is, arg); 105 + __tlbi_user(vae1is, arg); 106 + } 107 + 108 + static __always_inline void vae2is(u64 arg) 109 + { 110 + __tlbi(vae2is, arg); 111 + } 112 + 113 + static __always_inline void vale1(u64 arg) 114 + { 115 + __tlbi(vale1, arg); 116 + __tlbi_user(vale1, arg); 117 + } 118 + 119 + static __always_inline void vale1is(u64 arg) 120 + { 121 + __tlbi(vale1is, arg); 122 + __tlbi_user(vale1is, arg); 123 + } 124 + 125 + static __always_inline void vale2is(u64 arg) 126 + { 127 + __tlbi(vale2is, arg); 128 + } 129 + 130 + static __always_inline void vaale1is(u64 arg) 131 + { 132 + __tlbi(vaale1is, arg); 133 + } 134 + 135 + static __always_inline void ipas2e1(u64 arg) 136 + { 137 + __tlbi(ipas2e1, arg); 138 + } 139 + 140 + static __always_inline void ipas2e1is(u64 arg) 141 + { 142 + __tlbi(ipas2e1is, arg); 143 + } 144 + 145 + static __always_inline void __tlbi_level_asid(tlbi_op op, u64 addr, u32 level, 146 + u16 asid) 147 + { 148 + u64 arg = __TLBI_VADDR(addr, asid); 149 + 150 + if (alternative_has_cap_unlikely(ARM64_HAS_ARMv8_4_TTL) && level <= 3) { 151 + u64 ttl = level | (get_trans_granule() << 2); 152 + 153 + FIELD_MODIFY(TLBI_TTL_MASK, &arg, ttl); 154 + } 155 + 156 + op(arg); 157 + } 158 + 159 + static inline void 
__tlbi_level(tlbi_op op, u64 addr, u32 level) 160 + { 161 + __tlbi_level_asid(op, addr, level, 0); 162 + } 118 163 119 164 /* 120 165 * This macro creates a properly formatted VA operand for the TLB RANGE. The ··· 186 141 #define TLBIR_TTL_MASK GENMASK_ULL(38, 37) 187 142 #define TLBIR_BADDR_MASK GENMASK_ULL(36, 0) 188 143 189 - #define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \ 190 - ({ \ 191 - unsigned long __ta = 0; \ 192 - unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \ 193 - __ta |= FIELD_PREP(TLBIR_BADDR_MASK, baddr); \ 194 - __ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl); \ 195 - __ta |= FIELD_PREP(TLBIR_NUM_MASK, num); \ 196 - __ta |= FIELD_PREP(TLBIR_SCALE_MASK, scale); \ 197 - __ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); \ 198 - __ta |= FIELD_PREP(TLBIR_ASID_MASK, asid); \ 199 - __ta; \ 200 - }) 201 - 202 144 /* These macros are used by the TLBI RANGE feature. */ 203 145 #define __TLBI_RANGE_PAGES(num, scale) \ 204 146 ((unsigned long)((num) + 1) << (5 * (scale) + 1)) ··· 199 167 * range. 200 168 */ 201 169 #define __TLBI_RANGE_NUM(pages, scale) \ 202 - ({ \ 203 - int __pages = min((pages), \ 204 - __TLBI_RANGE_PAGES(31, (scale))); \ 205 - (__pages >> (5 * (scale) + 1)) - 1; \ 206 - }) 170 + (((pages) >> (5 * (scale) + 1)) - 1) 207 171 208 172 #define __repeat_tlbi_sync(op, arg...) \ 209 173 do { \ ··· 269 241 * unmapping pages from vmalloc/io space. 270 242 * 271 243 * flush_tlb_page(vma, addr) 272 - * Invalidate a single user mapping for address 'addr' in the 273 - * address space corresponding to 'vma->mm'. Note that this 274 - * operation only invalidates a single, last-level page-table 275 - * entry and therefore does not affect any walk-caches. 244 + * Equivalent to __flush_tlb_page(..., flags=TLBF_NONE) 276 245 * 277 246 * 278 247 * Next, we have some undocumented invalidation routines that you probably ··· 283 258 * CPUs, ensuring that any walk-cache entries associated with the 284 259 * translation are also invalidated. 
285 260 * 286 - * __flush_tlb_range(vma, start, end, stride, last_level, tlb_level) 261 + * __flush_tlb_range(vma, start, end, stride, tlb_level, flags) 287 262 * Invalidate the virtual-address range '[start, end)' on all 288 263 * CPUs for the user address space corresponding to 'vma->mm'. 289 264 * The invalidation operations are issued at a granularity 290 - * determined by 'stride' and only affect any walk-cache entries 291 - * if 'last_level' is equal to false. tlb_level is the level at 265 + * determined by 'stride'. tlb_level is the level at 292 266 * which the invalidation must take place. If the level is wrong, 293 267 * no invalidation may take place. In the case where the level 294 268 * cannot be easily determined, the value TLBI_TTL_UNKNOWN will 295 - * perform a non-hinted invalidation. 269 + * perform a non-hinted invalidation. flags may be TLBF_NONE (0) or 270 + * any combination of TLBF_NOWALKCACHE (elide eviction of walk 271 + * cache entries), TLBF_NONOTIFY (don't call mmu notifiers), 272 + * TLBF_NOSYNC (don't issue trailing dsb) and TLBF_NOBROADCAST 273 + * (only perform the invalidation for the local cpu). 296 274 * 297 - * local_flush_tlb_page(vma, addr) 298 - * Local variant of flush_tlb_page(). Stale TLB entries may 299 - * remain in remote CPUs. 300 - * 301 - * local_flush_tlb_page_nonotify(vma, addr) 302 - * Same as local_flush_tlb_page() except MMU notifier will not be 303 - * called. 304 - * 305 - * local_flush_tlb_contpte(vma, addr) 306 - * Invalidate the virtual-address range 307 - * '[addr, addr+CONT_PTE_SIZE)' mapped with contpte on local CPU 308 - * for the user address space corresponding to 'vma->mm'. Stale 309 - * TLB entries may remain in remote CPUs. 275 + * __flush_tlb_page(vma, addr, flags) 276 + * Invalidate a single user mapping for address 'addr' in the 277 + * address space corresponding to 'vma->mm'. 
Note that this 278 + * operation only invalidates a single level 3 page-table entry 279 + * and therefore does not affect any walk-caches. flags may contain 280 + * any combination of TLBF_NONOTIFY (don't call mmu notifiers), 281 + * TLBF_NOSYNC (don't issue trailing dsb) and TLBF_NOBROADCAST 282 + * (only perform the invalidation for the local cpu). 310 283 * 311 284 * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented 312 285 * on top of these routines, since that is our interface to the mmu_gather ··· 338 315 mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); 339 316 } 340 317 341 - static inline void __local_flush_tlb_page_nonotify_nosync(struct mm_struct *mm, 342 - unsigned long uaddr) 343 - { 344 - unsigned long addr; 345 - 346 - dsb(nshst); 347 - addr = __TLBI_VADDR(uaddr, ASID(mm)); 348 - __tlbi(vale1, addr); 349 - __tlbi_user(vale1, addr); 350 - } 351 - 352 - static inline void local_flush_tlb_page_nonotify(struct vm_area_struct *vma, 353 - unsigned long uaddr) 354 - { 355 - __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); 356 - dsb(nsh); 357 - } 358 - 359 - static inline void local_flush_tlb_page(struct vm_area_struct *vma, 360 - unsigned long uaddr) 361 - { 362 - __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); 363 - mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, 364 - (uaddr & PAGE_MASK) + PAGE_SIZE); 365 - dsb(nsh); 366 - } 367 - 368 - static inline void __flush_tlb_page_nosync(struct mm_struct *mm, 369 - unsigned long uaddr) 370 - { 371 - unsigned long addr; 372 - 373 - dsb(ishst); 374 - addr = __TLBI_VADDR(uaddr, ASID(mm)); 375 - __tlbi(vale1is, addr); 376 - __tlbi_user(vale1is, addr); 377 - mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK, 378 - (uaddr & PAGE_MASK) + PAGE_SIZE); 379 - } 380 - 381 - static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, 382 - unsigned long uaddr) 383 - { 384 - return __flush_tlb_page_nosync(vma->vm_mm, uaddr); 
385 - } 386 - 387 - static inline void flush_tlb_page(struct vm_area_struct *vma, 388 - unsigned long uaddr) 389 - { 390 - flush_tlb_page_nosync(vma, uaddr); 391 - __tlbi_sync_s1ish(); 392 - } 393 - 394 318 static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) 395 319 { 396 320 return true; ··· 367 397 /* 368 398 * __flush_tlb_range_op - Perform TLBI operation upon a range 369 399 * 370 - * @op: TLBI instruction that operates on a range (has 'r' prefix) 400 + * @lop: TLBI level operation to perform 401 + * @rop: TLBI range operation to perform 371 402 * @start: The start address of the range 372 403 * @pages: Range as the number of pages from 'start' 373 404 * @stride: Flush granularity 374 405 * @asid: The ASID of the task (0 for IPA instructions) 375 - * @tlb_level: Translation Table level hint, if known 376 - * @tlbi_user: If 'true', call an additional __tlbi_user() 377 - * (typically for user ASIDs). 'flase' for IPA instructions 406 + * @level: Translation Table level hint, if known 378 407 * @lpa2: If 'true', the lpa2 scheme is used as set out below 379 408 * 380 409 * When the CPU does not support TLB range operations, flush the TLB ··· 396 427 * operations can only span an even number of pages. We save this for last to 397 428 * ensure 64KB start alignment is maintained for the LPA2 case. 398 429 */ 399 - #define __flush_tlb_range_op(op, start, pages, stride, \ 400 - asid, tlb_level, tlbi_user, lpa2) \ 401 - do { \ 402 - typeof(start) __flush_start = start; \ 403 - typeof(pages) __flush_pages = pages; \ 404 - int num = 0; \ 405 - int scale = 3; \ 406 - int shift = lpa2 ? 
16 : PAGE_SHIFT; \ 407 - unsigned long addr; \ 408 - \ 409 - while (__flush_pages > 0) { \ 410 - if (!system_supports_tlb_range() || \ 411 - __flush_pages == 1 || \ 412 - (lpa2 && __flush_start != ALIGN(__flush_start, SZ_64K))) { \ 413 - addr = __TLBI_VADDR(__flush_start, asid); \ 414 - __tlbi_level(op, addr, tlb_level); \ 415 - if (tlbi_user) \ 416 - __tlbi_user_level(op, addr, tlb_level); \ 417 - __flush_start += stride; \ 418 - __flush_pages -= stride >> PAGE_SHIFT; \ 419 - continue; \ 420 - } \ 421 - \ 422 - num = __TLBI_RANGE_NUM(__flush_pages, scale); \ 423 - if (num >= 0) { \ 424 - addr = __TLBI_VADDR_RANGE(__flush_start >> shift, asid, \ 425 - scale, num, tlb_level); \ 426 - __tlbi(r##op, addr); \ 427 - if (tlbi_user) \ 428 - __tlbi_user(r##op, addr); \ 429 - __flush_start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; \ 430 - __flush_pages -= __TLBI_RANGE_PAGES(num, scale);\ 431 - } \ 432 - scale--; \ 433 - } \ 434 - } while (0) 435 - 436 - #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \ 437 - __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, kvm_lpa2_is_enabled()); 438 - 439 - static inline bool __flush_tlb_range_limit_excess(unsigned long start, 440 - unsigned long end, unsigned long pages, unsigned long stride) 430 + static __always_inline void rvae1is(u64 arg) 441 431 { 442 - /* 443 - * When the system does not support TLB range based flush 444 - * operation, (MAX_DVM_OPS - 1) pages can be handled. But 445 - * with TLB range based operation, MAX_TLBI_RANGE_PAGES 446 - * pages can be handled. 
447 - */ 448 - if ((!system_supports_tlb_range() && 449 - (end - start) >= (MAX_DVM_OPS * stride)) || 450 - pages > MAX_TLBI_RANGE_PAGES) 451 - return true; 452 - 453 - return false; 432 + __tlbi(rvae1is, arg); 433 + __tlbi_user(rvae1is, arg); 454 434 } 455 435 456 - static inline void __flush_tlb_range_nosync(struct mm_struct *mm, 457 - unsigned long start, unsigned long end, 458 - unsigned long stride, bool last_level, 459 - int tlb_level) 436 + static __always_inline void rvale1(u64 arg) 460 437 { 438 + __tlbi(rvale1, arg); 439 + __tlbi_user(rvale1, arg); 440 + } 441 + 442 + static __always_inline void rvale1is(u64 arg) 443 + { 444 + __tlbi(rvale1is, arg); 445 + __tlbi_user(rvale1is, arg); 446 + } 447 + 448 + static __always_inline void rvaale1is(u64 arg) 449 + { 450 + __tlbi(rvaale1is, arg); 451 + } 452 + 453 + static __always_inline void ripas2e1is(u64 arg) 454 + { 455 + __tlbi(ripas2e1is, arg); 456 + } 457 + 458 + static __always_inline void __tlbi_range(tlbi_op op, u64 addr, 459 + u16 asid, int scale, int num, 460 + u32 level, bool lpa2) 461 + { 462 + u64 arg = 0; 463 + 464 + arg |= FIELD_PREP(TLBIR_BADDR_MASK, addr >> (lpa2 ? 16 : PAGE_SHIFT)); 465 + arg |= FIELD_PREP(TLBIR_TTL_MASK, level > 3 ? 
0 : level); 466 + arg |= FIELD_PREP(TLBIR_NUM_MASK, num); 467 + arg |= FIELD_PREP(TLBIR_SCALE_MASK, scale); 468 + arg |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); 469 + arg |= FIELD_PREP(TLBIR_ASID_MASK, asid); 470 + 471 + op(arg); 472 + } 473 + 474 + static __always_inline void __flush_tlb_range_op(tlbi_op lop, tlbi_op rop, 475 + u64 start, size_t pages, 476 + u64 stride, u16 asid, 477 + u32 level, bool lpa2) 478 + { 479 + u64 addr = start, end = start + pages * PAGE_SIZE; 480 + int scale = 3; 481 + 482 + while (addr != end) { 483 + int num; 484 + 485 + pages = (end - addr) >> PAGE_SHIFT; 486 + 487 + if (!system_supports_tlb_range() || pages == 1) 488 + goto invalidate_one; 489 + 490 + if (lpa2 && !IS_ALIGNED(addr, SZ_64K)) 491 + goto invalidate_one; 492 + 493 + num = __TLBI_RANGE_NUM(pages, scale); 494 + if (num >= 0) { 495 + __tlbi_range(rop, addr, asid, scale, num, level, lpa2); 496 + addr += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; 497 + } 498 + 499 + scale--; 500 + continue; 501 + invalidate_one: 502 + __tlbi_level_asid(lop, addr, level, asid); 503 + addr += stride; 504 + } 505 + } 506 + 507 + #define __flush_s1_tlb_range_op(op, start, pages, stride, asid, tlb_level) \ 508 + __flush_tlb_range_op(op, r##op, start, pages, stride, asid, tlb_level, lpa2_is_enabled()) 509 + 510 + #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \ 511 + __flush_tlb_range_op(op, r##op, start, pages, stride, 0, tlb_level, kvm_lpa2_is_enabled()) 512 + 513 + static inline bool __flush_tlb_range_limit_excess(unsigned long pages, 514 + unsigned long stride) 515 + { 516 + /* 517 + * Assume that the worst case number of DVM ops required to flush a 518 + * given range on a system that supports tlb-range is 20 (4 scales, 1 519 + * final page, 15 for alignment on LPA2 systems), which is much smaller 520 + * than MAX_DVM_OPS. 
521 + */ 522 + if (system_supports_tlb_range()) 523 + return pages > MAX_TLBI_RANGE_PAGES; 524 + 525 + return pages >= (MAX_DVM_OPS * stride) >> PAGE_SHIFT; 526 + } 527 + 528 + typedef unsigned __bitwise tlbf_t; 529 + 530 + /* No special behaviour. */ 531 + #define TLBF_NONE ((__force tlbf_t)0) 532 + 533 + /* Invalidate tlb entries only, leaving the page table walk cache intact. */ 534 + #define TLBF_NOWALKCACHE ((__force tlbf_t)BIT(0)) 535 + 536 + /* Skip the trailing dsb after issuing tlbi. */ 537 + #define TLBF_NOSYNC ((__force tlbf_t)BIT(1)) 538 + 539 + /* Suppress tlb notifier callbacks for this flush operation. */ 540 + #define TLBF_NONOTIFY ((__force tlbf_t)BIT(2)) 541 + 542 + /* Perform the tlbi locally without broadcasting to other CPUs. */ 543 + #define TLBF_NOBROADCAST ((__force tlbf_t)BIT(3)) 544 + 545 + static __always_inline void __do_flush_tlb_range(struct vm_area_struct *vma, 546 + unsigned long start, unsigned long end, 547 + unsigned long stride, int tlb_level, 548 + tlbf_t flags) 549 + { 550 + struct mm_struct *mm = vma->vm_mm; 461 551 unsigned long asid, pages; 462 552 463 - start = round_down(start, stride); 464 - end = round_up(end, stride); 465 553 pages = (end - start) >> PAGE_SHIFT; 466 554 467 - if (__flush_tlb_range_limit_excess(start, end, pages, stride)) { 555 + if (__flush_tlb_range_limit_excess(pages, stride)) { 468 556 flush_tlb_mm(mm); 469 557 return; 470 558 } 471 559 472 - dsb(ishst); 560 + if (!(flags & TLBF_NOBROADCAST)) 561 + dsb(ishst); 562 + else 563 + dsb(nshst); 564 + 473 565 asid = ASID(mm); 474 566 475 - if (last_level) 476 - __flush_tlb_range_op(vale1is, start, pages, stride, asid, 477 - tlb_level, true, lpa2_is_enabled()); 478 - else 479 - __flush_tlb_range_op(vae1is, start, pages, stride, asid, 480 - tlb_level, true, lpa2_is_enabled()); 567 + switch (flags & (TLBF_NOWALKCACHE | TLBF_NOBROADCAST)) { 568 + case TLBF_NONE: 569 + __flush_s1_tlb_range_op(vae1is, start, pages, stride, 570 + asid, tlb_level); 571 + break; 572 
+ case TLBF_NOWALKCACHE: 573 + __flush_s1_tlb_range_op(vale1is, start, pages, stride, 574 + asid, tlb_level); 575 + break; 576 + case TLBF_NOBROADCAST: 577 + /* Combination unused */ 578 + BUG(); 579 + break; 580 + case TLBF_NOWALKCACHE | TLBF_NOBROADCAST: 581 + __flush_s1_tlb_range_op(vale1, start, pages, stride, 582 + asid, tlb_level); 583 + break; 584 + } 481 585 482 - mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end); 586 + if (!(flags & TLBF_NONOTIFY)) 587 + mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end); 588 + 589 + if (!(flags & TLBF_NOSYNC)) { 590 + if (!(flags & TLBF_NOBROADCAST)) 591 + __tlbi_sync_s1ish(); 592 + else 593 + dsb(nsh); 594 + } 483 595 } 484 596 485 597 static inline void __flush_tlb_range(struct vm_area_struct *vma, 486 598 unsigned long start, unsigned long end, 487 - unsigned long stride, bool last_level, 488 - int tlb_level) 599 + unsigned long stride, int tlb_level, 600 + tlbf_t flags) 489 601 { 490 - __flush_tlb_range_nosync(vma->vm_mm, start, end, stride, 491 - last_level, tlb_level); 492 - __tlbi_sync_s1ish(); 493 - } 494 - 495 - static inline void local_flush_tlb_contpte(struct vm_area_struct *vma, 496 - unsigned long addr) 497 - { 498 - unsigned long asid; 499 - 500 - addr = round_down(addr, CONT_PTE_SIZE); 501 - 502 - dsb(nshst); 503 - asid = ASID(vma->vm_mm); 504 - __flush_tlb_range_op(vale1, addr, CONT_PTES, PAGE_SIZE, asid, 505 - 3, true, lpa2_is_enabled()); 506 - mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, addr, 507 - addr + CONT_PTE_SIZE); 508 - dsb(nsh); 602 + start = round_down(start, stride); 603 + end = round_up(end, stride); 604 + __do_flush_tlb_range(vma, start, end, stride, tlb_level, flags); 509 605 } 510 606 511 607 static inline void flush_tlb_range(struct vm_area_struct *vma, ··· 582 548 * Set the tlb_level to TLBI_TTL_UNKNOWN because we can not get enough 583 549 * information here. 
584 550 */ 585 - __flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN); 551 + __flush_tlb_range(vma, start, end, PAGE_SIZE, TLBI_TTL_UNKNOWN, TLBF_NONE); 552 + } 553 + 554 + static inline void __flush_tlb_page(struct vm_area_struct *vma, 555 + unsigned long uaddr, tlbf_t flags) 556 + { 557 + unsigned long start = round_down(uaddr, PAGE_SIZE); 558 + unsigned long end = start + PAGE_SIZE; 559 + 560 + __do_flush_tlb_range(vma, start, end, PAGE_SIZE, 3, 561 + TLBF_NOWALKCACHE | flags); 562 + } 563 + 564 + static inline void flush_tlb_page(struct vm_area_struct *vma, 565 + unsigned long uaddr) 566 + { 567 + __flush_tlb_page(vma, uaddr, TLBF_NONE); 586 568 } 587 569 588 570 static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end) ··· 610 560 end = round_up(end, stride); 611 561 pages = (end - start) >> PAGE_SHIFT; 612 562 613 - if (__flush_tlb_range_limit_excess(start, end, pages, stride)) { 563 + if (__flush_tlb_range_limit_excess(pages, stride)) { 614 564 flush_tlb_all(); 615 565 return; 616 566 } 617 567 618 568 dsb(ishst); 619 - __flush_tlb_range_op(vaale1is, start, pages, stride, 0, 620 - TLBI_TTL_UNKNOWN, false, lpa2_is_enabled()); 569 + __flush_s1_tlb_range_op(vaale1is, start, pages, stride, 0, 570 + TLBI_TTL_UNKNOWN); 621 571 __tlbi_sync_s1ish(); 622 572 isb(); 623 573 } ··· 639 589 static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, 640 590 struct mm_struct *mm, unsigned long start, unsigned long end) 641 591 { 642 - __flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3); 592 + struct vm_area_struct vma = { .vm_mm = mm, .vm_flags = 0 }; 593 + 594 + __flush_tlb_range(&vma, start, end, PAGE_SIZE, 3, 595 + TLBF_NOWALKCACHE | TLBF_NOSYNC); 643 596 } 644 597 645 598 static inline bool __pte_flags_need_flush(ptdesc_t oldval, ptdesc_t newval) ··· 671 618 } 672 619 #define huge_pmd_needs_flush huge_pmd_needs_flush 673 620 621 + #undef __tlbi_user 622 + #undef __TLBI_VADDR 674 623 #endif 
675 624 676 625 #endif
+3 -3
arch/arm64/include/asm/uaccess.h
··· 62 62 63 63 local_irq_save(flags); 64 64 ttbr = read_sysreg(ttbr1_el1); 65 - ttbr &= ~TTBR_ASID_MASK; 65 + ttbr &= ~TTBRx_EL1_ASID_MASK; 66 66 /* reserved_pg_dir placed before swapper_pg_dir */ 67 67 write_sysreg(ttbr - RESERVED_SWAPPER_OFFSET, ttbr0_el1); 68 68 /* Set reserved ASID */ ··· 85 85 86 86 /* Restore active ASID */ 87 87 ttbr1 = read_sysreg(ttbr1_el1); 88 - ttbr1 &= ~TTBR_ASID_MASK; /* safety measure */ 89 - ttbr1 |= ttbr0 & TTBR_ASID_MASK; 88 + ttbr1 &= ~TTBRx_EL1_ASID_MASK; /* safety measure */ 89 + ttbr1 |= ttbr0 & TTBRx_EL1_ASID_MASK; 90 90 write_sysreg(ttbr1, ttbr1_el1); 91 91 92 92 /* Restore user page table */
+1
arch/arm64/kernel/Makefile
··· 68 68 obj-$(CONFIG_VMCORE_INFO) += vmcore_info.o 69 69 obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o 70 70 obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o 71 + obj-$(CONFIG_ARM64_MPAM) += mpam.o 71 72 obj-$(CONFIG_ARM64_MTE) += mte.o 72 73 obj-y += vdso-wrap.o 73 74 obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o
+14
arch/arm64/kernel/armv8_deprecated.c
··· 610 610 } 611 611 612 612 #endif 613 + 614 + #ifdef CONFIG_SWP_EMULATION 615 + /* 616 + * The purpose of supporting LSUI is to eliminate PAN toggling. CPUs 617 + * that support LSUI are unlikely to support a 32-bit runtime. Rather 618 + * than emulating the SWP instruction using LSUI instructions, simply 619 + * disable SWP emulation. 620 + */ 621 + if (cpus_have_final_cap(ARM64_HAS_LSUI)) { 622 + insn_swp.status = INSN_UNAVAILABLE; 623 + pr_info("swp/swpb instruction emulation is not supported on this system\n"); 624 + } 625 + #endif 626 + 613 627 for (int i = 0; i < ARRAY_SIZE(insn_emulations); i++) { 614 628 struct insn_emulation *ie = insn_emulations[i]; 615 629
+28 -19
arch/arm64/kernel/cpufeature.c
··· 77 77 #include <linux/percpu.h> 78 78 #include <linux/sched/isolation.h> 79 79 80 + #include <asm/arm_pmuv3.h> 80 81 #include <asm/cpu.h> 81 82 #include <asm/cpufeature.h> 82 83 #include <asm/cpu_ops.h> ··· 87 86 #include <asm/kvm_host.h> 88 87 #include <asm/mmu.h> 89 88 #include <asm/mmu_context.h> 89 + #include <asm/mpam.h> 90 90 #include <asm/mte.h> 91 91 #include <asm/hypervisor.h> 92 92 #include <asm/processor.h> ··· 283 281 284 282 static const struct arm64_ftr_bits ftr_id_aa64isar3[] = { 285 283 ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_FPRCVT_SHIFT, 4, 0), 284 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_LSUI_SHIFT, 4, ID_AA64ISAR3_EL1_LSUI_NI), 286 285 ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_LSFE_SHIFT, 4, 0), 287 286 ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_FAMINMAX_SHIFT, 4, 0), 288 287 ARM64_FTR_END, ··· 568 565 * We can instantiate multiple PMU instances with different levels 569 566 * of support. 
570 567 */ 571 - S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_AA64DFR0_EL1_PMUVer_SHIFT, 4, 0), 568 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_AA64DFR0_EL1_PMUVer_SHIFT, 4, 0), 572 569 ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_EXACT, ID_AA64DFR0_EL1_DebugVer_SHIFT, 4, 0x6), 573 570 ARM64_FTR_END, 574 571 }; ··· 712 709 713 710 static const struct arm64_ftr_bits ftr_id_dfr0[] = { 714 711 /* [31:28] TraceFilt */ 715 - S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_DFR0_EL1_PerfMon_SHIFT, 4, 0), 712 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_DFR0_EL1_PerfMon_SHIFT, 4, 0), 716 713 ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_MProfDbg_SHIFT, 4, 0), 717 714 ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_MMapTrc_SHIFT, 4, 0), 718 715 ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_CopTrc_SHIFT, 4, 0), ··· 1930 1927 u64 dfr0 = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1); 1931 1928 unsigned int pmuver; 1932 1929 1933 - /* 1934 - * PMUVer follows the standard ID scheme for an unsigned field with the 1935 - * exception of 0xF (IMP_DEF) which is treated specially and implies 1936 - * FEAT_PMUv3 is not implemented. 1937 - * 1938 - * See DDI0487L.a D24.1.3.2 for more details. 1939 - */ 1940 1930 pmuver = cpuid_feature_extract_unsigned_field(dfr0, 1941 1931 ID_AA64DFR0_EL1_PMUVer_SHIFT); 1942 - if (pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) 1943 - return false; 1944 1932 1945 - return pmuver >= ID_AA64DFR0_EL1_PMUVer_IMP; 1933 + return pmuv3_implemented(pmuver); 1946 1934 } 1947 1935 #endif 1948 1936 ··· 2495 2501 static void 2496 2502 cpu_enable_mpam(const struct arm64_cpu_capabilities *entry) 2497 2503 { 2498 - /* 2499 - * Access by the kernel (at EL1) should use the reserved PARTID 2500 - * which is configured unrestricted. 
This avoids priority-inversion 2501 - * where latency sensitive tasks have to wait for a task that has 2502 - * been throttled to release the lock. 2503 - */ 2504 - write_sysreg_s(0, SYS_MPAM1_EL1); 2504 + int cpu = smp_processor_id(); 2505 + u64 regval = 0; 2506 + 2507 + if (IS_ENABLED(CONFIG_ARM64_MPAM) && static_branch_likely(&mpam_enabled)) 2508 + regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); 2509 + 2510 + write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1); 2511 + if (cpus_have_cap(ARM64_SME)) 2512 + write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D), SYS_MPAMSM_EL1); 2513 + isb(); 2514 + 2515 + /* Synchronising the EL0 write is left until the ERET to EL0 */ 2516 + write_sysreg_s(regval, SYS_MPAM0_EL1); 2505 2517 } 2506 2518 2507 2519 static bool ··· 3178 3178 .cpu_enable = cpu_enable_ls64_v, 3179 3179 ARM64_CPUID_FIELDS(ID_AA64ISAR1_EL1, LS64, LS64_V) 3180 3180 }, 3181 + #ifdef CONFIG_ARM64_LSUI 3182 + { 3183 + .desc = "Unprivileged Load Store Instructions (LSUI)", 3184 + .capability = ARM64_HAS_LSUI, 3185 + .type = ARM64_CPUCAP_SYSTEM_FEATURE, 3186 + .matches = has_cpuid_feature, 3187 + ARM64_CPUID_FIELDS(ID_AA64ISAR3_EL1, LSUI, IMP) 3188 + }, 3189 + #endif 3181 3190 {}, 3182 3191 }; 3183 3192
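The PMUVer change above replaces an open-coded check with `pmuv3_implemented()`: PMUVer follows the standard unsigned ID-register scheme except that 0xF (IMP_DEF) means an implementation-defined PMU, i.e. FEAT_PMUv3 is *not* implemented. A minimal userspace sketch of that logic (the shift value and helper names mirror the kernel's, but this is an illustrative model, not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ID_AA64DFR0_EL1.PMUVer lives at bits [11:8]; all ID fields are 4 bits */
#define PMUVER_SHIFT   8
#define PMUVER_IMP_DEF 0xf

static unsigned int extract_unsigned_field(uint64_t reg, unsigned int shift)
{
	return (unsigned int)((reg >> shift) & 0xf);
}

static bool pmuv3_implemented(unsigned int pmuver)
{
	/* 0x0 = no PMU at all, 0xF = IMP_DEF PMU (also not PMUv3) */
	return pmuver != 0 && pmuver != PMUVER_IMP_DEF;
}
```

Treating the field as signed (as the old `cpuid_feature_extract_signed_field()` call in kvm/debug.c did) made 0xF look negative, which happened to reject IMP_DEF but for the wrong reason; the unsigned extraction plus explicit 0xF check matches the architected scheme.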
+24 -28
arch/arm64/kernel/entry-common.c
··· 35 35 * Before this function is called it is not safe to call regular kernel code, 36 36 * instrumentable code, or any code which may trigger an exception. 37 37 */ 38 - static noinstr irqentry_state_t enter_from_kernel_mode(struct pt_regs *regs) 38 + static noinstr irqentry_state_t arm64_enter_from_kernel_mode(struct pt_regs *regs) 39 39 { 40 40 irqentry_state_t state; 41 41 42 - state = irqentry_enter(regs); 42 + state = irqentry_enter_from_kernel_mode(regs); 43 43 mte_check_tfsr_entry(); 44 44 mte_disable_tco_entry(current); 45 45 ··· 51 51 * After this function returns it is not safe to call regular kernel code, 52 52 * instrumentable code, or any code which may trigger an exception. 53 53 */ 54 - static void noinstr exit_to_kernel_mode(struct pt_regs *regs, 55 - irqentry_state_t state) 54 + static void noinstr arm64_exit_to_kernel_mode(struct pt_regs *regs, 55 + irqentry_state_t state) 56 56 { 57 + local_irq_disable(); 58 + irqentry_exit_to_kernel_mode_preempt(regs, state); 59 + local_daif_mask(); 57 60 mte_check_tfsr_exit(); 58 - irqentry_exit(regs, state); 61 + irqentry_exit_to_kernel_mode_after_preempt(regs, state); 59 62 } 60 63 61 64 /* ··· 301 298 unsigned long far = read_sysreg(far_el1); 302 299 irqentry_state_t state; 303 300 304 - state = enter_from_kernel_mode(regs); 301 + state = arm64_enter_from_kernel_mode(regs); 305 302 local_daif_inherit(regs); 306 303 do_mem_abort(far, esr, regs); 307 - local_daif_mask(); 308 - exit_to_kernel_mode(regs, state); 304 + arm64_exit_to_kernel_mode(regs, state); 309 305 } 310 306 311 307 static void noinstr el1_pc(struct pt_regs *regs, unsigned long esr) ··· 312 310 unsigned long far = read_sysreg(far_el1); 313 311 irqentry_state_t state; 314 312 315 - state = enter_from_kernel_mode(regs); 313 + state = arm64_enter_from_kernel_mode(regs); 316 314 local_daif_inherit(regs); 317 315 do_sp_pc_abort(far, esr, regs); 318 - local_daif_mask(); 319 - exit_to_kernel_mode(regs, state); 316 + arm64_exit_to_kernel_mode(regs, 
state); 320 317 } 321 318 322 319 static void noinstr el1_undef(struct pt_regs *regs, unsigned long esr) 323 320 { 324 321 irqentry_state_t state; 325 322 326 - state = enter_from_kernel_mode(regs); 323 + state = arm64_enter_from_kernel_mode(regs); 327 324 local_daif_inherit(regs); 328 325 do_el1_undef(regs, esr); 329 - local_daif_mask(); 330 - exit_to_kernel_mode(regs, state); 326 + arm64_exit_to_kernel_mode(regs, state); 331 327 } 332 328 333 329 static void noinstr el1_bti(struct pt_regs *regs, unsigned long esr) 334 330 { 335 331 irqentry_state_t state; 336 332 337 - state = enter_from_kernel_mode(regs); 333 + state = arm64_enter_from_kernel_mode(regs); 338 334 local_daif_inherit(regs); 339 335 do_el1_bti(regs, esr); 340 - local_daif_mask(); 341 - exit_to_kernel_mode(regs, state); 336 + arm64_exit_to_kernel_mode(regs, state); 342 337 } 343 338 344 339 static void noinstr el1_gcs(struct pt_regs *regs, unsigned long esr) 345 340 { 346 341 irqentry_state_t state; 347 342 348 - state = enter_from_kernel_mode(regs); 343 + state = arm64_enter_from_kernel_mode(regs); 349 344 local_daif_inherit(regs); 350 345 do_el1_gcs(regs, esr); 351 - local_daif_mask(); 352 - exit_to_kernel_mode(regs, state); 346 + arm64_exit_to_kernel_mode(regs, state); 353 347 } 354 348 355 349 static void noinstr el1_mops(struct pt_regs *regs, unsigned long esr) 356 350 { 357 351 irqentry_state_t state; 358 352 359 - state = enter_from_kernel_mode(regs); 353 + state = arm64_enter_from_kernel_mode(regs); 360 354 local_daif_inherit(regs); 361 355 do_el1_mops(regs, esr); 362 - local_daif_mask(); 363 - exit_to_kernel_mode(regs, state); 356 + arm64_exit_to_kernel_mode(regs, state); 364 357 } 365 358 366 359 static void noinstr el1_breakpt(struct pt_regs *regs, unsigned long esr) ··· 417 420 { 418 421 irqentry_state_t state; 419 422 420 - state = enter_from_kernel_mode(regs); 423 + state = arm64_enter_from_kernel_mode(regs); 421 424 local_daif_inherit(regs); 422 425 do_el1_fpac(regs, esr); 423 - 
local_daif_mask(); 424 - exit_to_kernel_mode(regs, state); 426 + arm64_exit_to_kernel_mode(regs, state); 425 427 } 426 428 427 429 asmlinkage void noinstr el1h_64_sync_handler(struct pt_regs *regs) ··· 487 491 { 488 492 irqentry_state_t state; 489 493 490 - state = enter_from_kernel_mode(regs); 494 + state = arm64_enter_from_kernel_mode(regs); 491 495 492 496 irq_enter_rcu(); 493 497 do_interrupt_handler(regs, handler); 494 498 irq_exit_rcu(); 495 499 496 - exit_to_kernel_mode(regs, state); 500 + arm64_exit_to_kernel_mode(regs, state); 497 501 } 498 502 static void noinstr el1_interrupt(struct pt_regs *regs, 499 503 void (*handler)(struct pt_regs *))
+2 -4
arch/arm64/kernel/entry.S
··· 273 273 alternative_else_nop_endif 274 274 1: 275 275 276 - scs_load_current 276 + scs_load_current_base 277 277 .else 278 278 add x21, sp, #PT_REGS_SIZE 279 279 get_current_task tsk ··· 378 378 alternative_else_nop_endif 379 379 #endif 380 380 3: 381 - scs_save tsk 382 - 383 381 /* Ignore asynchronous tag check faults in the uaccess routines */ 384 382 ldr x0, [tsk, THREAD_SCTLR_USER] 385 383 clear_mte_async_tcf x0 ··· 471 473 */ 472 474 SYM_CODE_START_LOCAL(__swpan_entry_el1) 473 475 mrs x21, ttbr0_el1 474 - tst x21, #TTBR_ASID_MASK // Check for the reserved ASID 476 + tst x21, #TTBRx_EL1_ASID_MASK // Check for the reserved ASID 475 477 orr x23, x23, #PSR_PAN_BIT // Set the emulated PAN in the saved SPSR 476 478 b.eq 1f // TTBR0 access already disabled 477 479 and x23, x23, #~PSR_PAN_BIT // Clear the emulated PAN in the saved SPSR
-3
arch/arm64/kernel/machine_kexec.c
··· 129 129 } 130 130 131 131 /* Create a copy of the linear map */ 132 - trans_pgd = kexec_page_alloc(kimage); 133 - if (!trans_pgd) 134 - return -ENOMEM; 135 132 rc = trans_pgd_create_copy(&info, &trans_pgd, PAGE_OFFSET, PAGE_END); 136 133 if (rc) 137 134 return rc;
+62
arch/arm64/kernel/mpam.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Copyright (C) 2025 Arm Ltd. */ 3 + 4 + #include <asm/mpam.h> 5 + 6 + #include <linux/arm_mpam.h> 7 + #include <linux/cpu_pm.h> 8 + #include <linux/jump_label.h> 9 + #include <linux/percpu.h> 10 + 11 + DEFINE_STATIC_KEY_FALSE(mpam_enabled); 12 + DEFINE_PER_CPU(u64, arm64_mpam_default); 13 + DEFINE_PER_CPU(u64, arm64_mpam_current); 14 + 15 + u64 arm64_mpam_global_default; 16 + 17 + static int mpam_pm_notifier(struct notifier_block *self, 18 + unsigned long cmd, void *v) 19 + { 20 + u64 regval; 21 + int cpu = smp_processor_id(); 22 + 23 + switch (cmd) { 24 + case CPU_PM_EXIT: 25 + /* 26 + * Don't use mpam_thread_switch() as the system register 27 + * value has changed under our feet. 28 + */ 29 + regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); 30 + write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1); 31 + if (system_supports_sme()) { 32 + write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D), 33 + SYS_MPAMSM_EL1); 34 + } 35 + isb(); 36 + 37 + write_sysreg_s(regval, SYS_MPAM0_EL1); 38 + 39 + return NOTIFY_OK; 40 + default: 41 + return NOTIFY_DONE; 42 + } 43 + } 44 + 45 + static struct notifier_block mpam_pm_nb = { 46 + .notifier_call = mpam_pm_notifier, 47 + }; 48 + 49 + static int __init arm64_mpam_register_cpus(void) 50 + { 51 + u64 mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); 52 + u16 partid_max = FIELD_GET(MPAMIDR_EL1_PARTID_MAX, mpamidr); 53 + u8 pmg_max = FIELD_GET(MPAMIDR_EL1_PMG_MAX, mpamidr); 54 + 55 + if (!system_supports_mpam()) 56 + return 0; 57 + 58 + cpu_pm_register_notifier(&mpam_pm_nb); 59 + return mpam_register_requestor(partid_max, pmg_max); 60 + } 61 + /* Must occur before mpam_msc_driver_init() from subsys_initcall() */ 62 + arch_initcall(arm64_mpam_register_cpus)
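The per-CPU `arm64_mpam_current` value written to MPAM0_EL1/MPAM1_EL1 above packs a partition ID and performance-monitoring group for both data and instruction accesses. A sketch of how such a value might be composed, assuming the MPAMn_ELx field layout from the Arm MPAM spec (PARTID_D [15:0], PARTID_I [31:16], PMG_D [39:32], PMG_I [47:40], MPAMEN bit 63); the macro names here are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define MPAM_PARTID_D(x) ((uint64_t)(x) << 0)
#define MPAM_PARTID_I(x) ((uint64_t)(x) << 16)
#define MPAM_PMG_D(x)    ((uint64_t)(x) << 32)
#define MPAM_PMG_I(x)    ((uint64_t)(x) << 40)
#define MPAM_EN          (1ULL << 63)

/* Use the same PARTID/PMG for data and instruction accesses */
static uint64_t mpam_make_regval(uint16_t partid, uint8_t pmg)
{
	return MPAM_PARTID_D(partid) | MPAM_PARTID_I(partid) |
	       MPAM_PMG_D(pmg) | MPAM_PMG_I(pmg);
}
```

Note how the kernel code ORs MPAMEN only into the MPAM1_EL1 write (it is the EL1 enable bit) and masks the value down to the `_D` fields for MPAMSM_EL1, since SME streaming-mode accesses carry only data-side labels.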
+8 -2
arch/arm64/kernel/mte.c
··· 291 291 /* TCO may not have been disabled on exception entry for the current task. */ 292 292 mte_disable_tco_entry(next); 293 293 294 + if (!system_uses_mte_async_or_asymm_mode()) 295 + return; 296 + 294 297 /* 295 298 * Check if an async tag exception occurred at EL1. 296 299 * ··· 318 315 * CnP is not a boot feature so MTE gets enabled before CnP, but let's 319 316 * make sure that is the case. 320 317 */ 321 - BUG_ON(read_sysreg(ttbr0_el1) & TTBR_CNP_BIT); 322 - BUG_ON(read_sysreg(ttbr1_el1) & TTBR_CNP_BIT); 318 + BUG_ON(read_sysreg(ttbr0_el1) & TTBRx_EL1_CnP); 319 + BUG_ON(read_sysreg(ttbr1_el1) & TTBRx_EL1_CnP); 323 320 324 321 /* Normal Tagged memory type at the corresponding MAIR index */ 325 322 sysreg_clear_set(mair_el1, ··· 351 348 void mte_suspend_enter(void) 352 349 { 353 350 if (!system_supports_mte()) 351 + return; 352 + 353 + if (!system_uses_mte_async_or_asymm_mode()) 354 354 return; 355 355 356 356 /*
+32
arch/arm64/kernel/process.c
··· 51 51 #include <asm/fpsimd.h> 52 52 #include <asm/gcs.h> 53 53 #include <asm/mmu_context.h> 54 + #include <asm/mpam.h> 54 55 #include <asm/mte.h> 55 56 #include <asm/processor.h> 56 57 #include <asm/pointer_auth.h> ··· 700 699 isb(); 701 700 } 702 701 702 + static inline void debug_switch_state(void) 703 + { 704 + if (system_uses_irq_prio_masking()) { 705 + unsigned long daif_expected = 0; 706 + unsigned long daif_actual = read_sysreg(daif); 707 + unsigned long pmr_expected = GIC_PRIO_IRQOFF; 708 + unsigned long pmr_actual = read_sysreg_s(SYS_ICC_PMR_EL1); 709 + 710 + WARN_ONCE(daif_actual != daif_expected || 711 + pmr_actual != pmr_expected, 712 + "Unexpected DAIF + PMR: 0x%lx + 0x%lx (expected 0x%lx + 0x%lx)\n", 713 + daif_actual, pmr_actual, 714 + daif_expected, pmr_expected); 715 + } else { 716 + unsigned long daif_expected = DAIF_PROCCTX_NOIRQ; 717 + unsigned long daif_actual = read_sysreg(daif); 718 + 719 + WARN_ONCE(daif_actual != daif_expected, 720 + "Unexpected DAIF value: 0x%lx (expected 0x%lx)\n", 721 + daif_actual, daif_expected); 722 + } 723 + } 724 + 703 725 /* 704 726 * Thread switching. 705 727 */ ··· 731 707 struct task_struct *next) 732 708 { 733 709 struct task_struct *last; 710 + 711 + debug_switch_state(); 734 712 735 713 fpsimd_thread_switch(next); 736 714 tls_thread_switch(next); ··· 763 737 /* avoid expensive SCTLR_EL1 accesses if no change */ 764 738 if (prev->thread.sctlr_user != next->thread.sctlr_user) 765 739 update_sctlr_el1(next->thread.sctlr_user); 740 + 741 + /* 742 + * MPAM thread switch happens after the DSB to ensure prev's accesses 743 + * use prev's MPAM settings. 744 + */ 745 + mpam_thread_switch(next); 766 746 767 747 /* the actual thread switch */ 768 748 last = cpu_switch_to(prev, next);
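`debug_switch_state()` above asserts an invariant at every task switch: with pseudo-NMI (`system_uses_irq_prio_masking()`), IRQs are masked via the GIC priority mask so DAIF should be clear and PMR should hold the irqs-off priority; otherwise interrupts must be masked in DAIF itself. A compact model of that predicate, using the arm64 PSTATE bit positions (F=bit 6, I=bit 7) and an illustrative stand-in value for `GIC_PRIO_IRQOFF` (the real value depends on kernel configuration), and assuming `DAIF_PROCCTX_NOIRQ` masks both I and F:

```c
#include <assert.h>
#include <stdbool.h>

#define PSR_F_BIT       0x40UL
#define PSR_I_BIT       0x80UL
#define GIC_PRIO_IRQOFF 0x60UL	/* illustrative placeholder */

static bool switch_state_ok(bool prio_masking, unsigned long daif,
			    unsigned long pmr)
{
	if (prio_masking)
		/* interrupts masked by PMR, so DAIF must be fully clear */
		return daif == 0 && pmr == GIC_PRIO_IRQOFF;

	/* no PMR in play: IRQ and FIQ must be masked in DAIF */
	return daif == (PSR_I_BIT | PSR_F_BIT);
}
```

The kernel version only warns (once) rather than failing, since a stale DAIF/PMR at switch time indicates a bug elsewhere in entry or locking code, not something `__switch_to()` can repair.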
+1 -1
arch/arm64/kernel/rsi.c
··· 145 145 return; 146 146 if (!rsi_version_matches()) 147 147 return; 148 - if (WARN_ON(rsi_get_realm_config(&config))) 148 + if (WARN_ON(rsi_get_realm_config(lm_alias(&config)))) 149 149 return; 150 150 prot_ns_shared = __phys_to_pte_val(BIT(config.ipa_bits - 1)); 151 151
+1 -1
arch/arm64/kernel/sys_compat.c
··· 36 36 * The workaround requires an inner-shareable tlbi. 37 37 * We pick the reserved-ASID to minimise the impact. 38 38 */ 39 - __tlbi(aside1is, __TLBI_VADDR(0, 0)); 39 + __tlbi(aside1is, 0UL); 40 40 __tlbi_sync_s1ish(); 41 41 } 42 42
+33 -1
arch/arm64/kvm/at.c
··· 9 9 #include <asm/esr.h> 10 10 #include <asm/kvm_hyp.h> 11 11 #include <asm/kvm_mmu.h> 12 + #include <asm/lsui.h> 12 13 13 14 static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool s1ptw) 14 15 { ··· 1680 1679 } 1681 1680 } 1682 1681 1682 + static int __lsui_swap_desc(u64 __user *ptep, u64 old, u64 new) 1683 + { 1684 + u64 tmp = old; 1685 + int ret = 0; 1686 + 1687 + /* 1688 + * Wrap LSUI instructions with uaccess_ttbr0_enable()/disable(), 1689 + * as PAN toggling is not required. 1690 + */ 1691 + uaccess_ttbr0_enable(); 1692 + 1693 + asm volatile(__LSUI_PREAMBLE 1694 + "1: cast %[old], %[new], %[addr]\n" 1695 + "2:\n" 1696 + _ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret]) 1697 + : [old] "+r" (old), [addr] "+Q" (*ptep), [ret] "+r" (ret) 1698 + : [new] "r" (new) 1699 + : "memory"); 1700 + 1701 + uaccess_ttbr0_disable(); 1702 + 1703 + if (ret) 1704 + return ret; 1705 + if (tmp != old) 1706 + return -EAGAIN; 1707 + 1708 + return ret; 1709 + } 1710 + 1683 1711 static int __lse_swap_desc(u64 __user *ptep, u64 old, u64 new) 1684 1712 { 1685 1713 u64 tmp = old; ··· 1784 1754 return -EPERM; 1785 1755 1786 1756 ptep = (void __user *)hva + offset; 1787 - if (cpus_have_final_cap(ARM64_HAS_LSE_ATOMICS)) 1757 + if (cpus_have_final_cap(ARM64_HAS_LSUI)) 1758 + r = __lsui_swap_desc(ptep, old, new); 1759 + else if (cpus_have_final_cap(ARM64_HAS_LSE_ATOMICS)) 1788 1760 r = __lse_swap_desc(ptep, old, new); 1789 1761 else 1790 1762 r = __llsc_swap_desc(ptep, old, new);
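All three `*_swap_desc()` variants (LL/SC, LSE `cas`, and the new LSUI `cast`) implement the same contract: atomically replace the descriptor only if it still holds the expected old value, returning 0 on success, a fault code on uaccess failure, or -EAGAIN if another writer changed it first. A userspace model of that contract using the GCC/clang `__atomic_compare_exchange_n` builtin (illustrative, with the uaccess/fault handling elided):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

static int swap_desc(uint64_t *ptep, uint64_t old, uint64_t new)
{
	uint64_t expected = old;

	/* strong CAS: succeeds iff *ptep == expected, writing new */
	if (__atomic_compare_exchange_n(ptep, &expected, new, 0,
					__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
		return 0;

	/* descriptor changed under us; the caller retries the walk */
	return -EAGAIN;
}
```

The point of the LSUI variant is that `cast` executes with EL0 (unprivileged) permissions, so the kernel only needs `uaccess_ttbr0_enable()` bracketing rather than toggling PAN around the access.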
+4 -1
arch/arm64/kvm/debug.c
··· 10 10 #include <linux/kvm_host.h> 11 11 #include <linux/hw_breakpoint.h> 12 12 13 + #include <asm/arm_pmuv3.h> 13 14 #include <asm/debug-monitors.h> 14 15 #include <asm/kvm_asm.h> 15 16 #include <asm/kvm_arm.h> ··· 76 75 void kvm_init_host_debug_data(void) 77 76 { 78 77 u64 dfr0 = read_sysreg(id_aa64dfr0_el1); 78 + unsigned int pmuver = cpuid_feature_extract_unsigned_field(dfr0, 79 + ID_AA64DFR0_EL1_PMUVer_SHIFT); 79 80 80 - if (cpuid_feature_extract_signed_field(dfr0, ID_AA64DFR0_EL1_PMUVer_SHIFT) > 0) 81 + if (pmuv3_implemented(pmuver)) 81 82 *host_data_ptr(nr_event_counters) = FIELD_GET(ARMV8_PMU_PMCR_N, 82 83 read_sysreg(pmcr_el0)); 83 84
+8 -4
arch/arm64/kvm/hyp/include/hyp/switch.h
··· 267 267 268 268 static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu) 269 269 { 270 - u64 r = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1; 270 + u64 clr = MPAM2_EL2_EnMPAMSM; 271 + u64 set = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1; 271 272 272 273 if (!system_supports_mpam()) 273 274 return; ··· 278 277 write_sysreg_s(MPAMHCR_EL2_TRAP_MPAMIDR_EL1, SYS_MPAMHCR_EL2); 279 278 } else { 280 279 /* From v1.1 TIDR can trap MPAMIDR, set it unconditionally */ 281 - r |= MPAM2_EL2_TIDR; 280 + set |= MPAM2_EL2_TIDR; 282 281 } 283 282 284 - write_sysreg_s(r, SYS_MPAM2_EL2); 283 + sysreg_clear_set_s(SYS_MPAM2_EL2, clr, set); 285 284 } 286 285 287 286 static inline void __deactivate_traps_mpam(void) 288 287 { 288 + u64 clr = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1 | MPAM2_EL2_TIDR; 289 + u64 set = MPAM2_EL2_EnMPAMSM; 290 + 289 291 if (!system_supports_mpam()) 290 292 return; 291 293 292 - write_sysreg_s(0, SYS_MPAM2_EL2); 294 + sysreg_clear_set_s(SYS_MPAM2_EL2, clr, set); 293 295 294 296 if (system_supports_mpam_hcr()) 295 297 write_sysreg_s(MPAMHCR_HOST_FLAGS, SYS_MPAMHCR_EL2);
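The switch from `write_sysreg_s()` to `sysreg_clear_set_s()` matters because the trap setup must now preserve EnMPAMSM while flipping the trap bits: a plain write would clobber it. `sysreg_clear_set_s()` is a read-modify-write, clearing `clr` then setting `set`. Its effect on a register value, modeled in plain C:

```c
#include <assert.h>
#include <stdint.h>

/* clear the bits in clr, then set the bits in set; all others unchanged */
static uint64_t clear_set(uint64_t reg, uint64_t clr, uint64_t set)
{
	return (reg & ~clr) | set;
}
```

Bits appearing in both `clr` and `set` end up set, which is why the activate path can pass `EnMPAMSM` in `clr` and the trap bits in `set` while the deactivate path does the inverse, each leaving the other's state well-defined.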
+2 -2
arch/arm64/kvm/hyp/nvhe/hyp-init.S
··· 130 130 ldr x1, [x0, #NVHE_INIT_PGD_PA] 131 131 phys_to_ttbr x2, x1 132 132 alternative_if ARM64_HAS_CNP 133 - orr x2, x2, #TTBR_CNP_BIT 133 + orr x2, x2, #TTBRx_EL1_CnP 134 134 alternative_else_nop_endif 135 135 msr ttbr0_el2, x2 136 136 ··· 291 291 /* Install the new pgtables */ 292 292 phys_to_ttbr x5, x0 293 293 alternative_if ARM64_HAS_CNP 294 - orr x5, x5, #TTBR_CNP_BIT 294 + orr x5, x5, #TTBRx_EL1_CnP 295 295 alternative_else_nop_endif 296 296 msr ttbr0_el2, x5 297 297
+1 -1
arch/arm64/kvm/hyp/nvhe/mm.c
··· 270 270 * https://lore.kernel.org/kvm/20221017115209.2099-1-will@kernel.org/T/#mf10dfbaf1eaef9274c581b81c53758918c1d0f03 271 271 */ 272 272 dsb(ishst); 273 - __tlbi_level(vale2is, __TLBI_VADDR(addr, 0), level); 273 + __tlbi_level(vale2is, addr, level); 274 274 __tlbi_sync_s1ish_hyp(); 275 275 isb(); 276 276 }
-2
arch/arm64/kvm/hyp/nvhe/tlb.c
··· 158 158 * Instead, we invalidate Stage-2 for this IPA, and the 159 159 * whole of Stage-1. Weep... 160 160 */ 161 - ipa >>= 12; 162 161 __tlbi_level(ipas2e1is, ipa, level); 163 162 164 163 /* ··· 187 188 * Instead, we invalidate Stage-2 for this IPA, and the 188 189 * whole of Stage-1. Weep... 189 190 */ 190 - ipa >>= 12; 191 191 __tlbi_level(ipas2e1, ipa, level); 192 192 193 193 /*
+2 -2
arch/arm64/kvm/hyp/pgtable.c
··· 490 490 491 491 kvm_clear_pte(ctx->ptep); 492 492 dsb(ishst); 493 - __tlbi_level(vae2is, __TLBI_VADDR(ctx->addr, 0), TLBI_TTL_UNKNOWN); 493 + __tlbi_level(vae2is, ctx->addr, TLBI_TTL_UNKNOWN); 494 494 } else { 495 495 if (ctx->end - ctx->addr < granule) 496 496 return -EINVAL; 497 497 498 498 kvm_clear_pte(ctx->ptep); 499 499 dsb(ishst); 500 - __tlbi_level(vale2is, __TLBI_VADDR(ctx->addr, 0), ctx->level); 500 + __tlbi_level(vale2is, ctx->addr, ctx->level); 501 501 *unmapped += granule; 502 502 } 503 503
+16
arch/arm64/kvm/hyp/vhe/sysreg-sr.c
··· 183 183 } 184 184 NOKPROBE_SYMBOL(sysreg_restore_guest_state_vhe); 185 185 186 + /* 187 + * The _EL0 value was written by the host's context switch and belongs to the 188 + * VMM. Copy this into the guest's _EL1 register. 189 + */ 190 + static inline void __mpam_guest_load(void) 191 + { 192 + u64 mask = MPAM0_EL1_PARTID_D | MPAM0_EL1_PARTID_I | MPAM0_EL1_PMG_D | MPAM0_EL1_PMG_I; 193 + 194 + if (system_supports_mpam()) { 195 + u64 val = (read_sysreg_s(SYS_MPAM0_EL1) & mask) | MPAM1_EL1_MPAMEN; 196 + 197 + write_sysreg_el1(val, SYS_MPAM1); 198 + } 199 + } 200 + 186 201 /** 187 202 * __vcpu_load_switch_sysregs - Load guest system registers to the physical CPU 188 203 * ··· 237 222 */ 238 223 __sysreg32_restore_state(vcpu); 239 224 __sysreg_restore_user_state(guest_ctxt); 225 + __mpam_guest_load(); 240 226 241 227 if (unlikely(is_hyp_ctxt(vcpu))) { 242 228 __sysreg_restore_vel2_state(vcpu);
-2
arch/arm64/kvm/hyp/vhe/tlb.c
··· 104 104 * Instead, we invalidate Stage-2 for this IPA, and the 105 105 * whole of Stage-1. Weep... 106 106 */ 107 - ipa >>= 12; 108 107 __tlbi_level(ipas2e1is, ipa, level); 109 108 110 109 /* ··· 135 136 * Instead, we invalidate Stage-2 for this IPA, and the 136 137 * whole of Stage-1. Weep... 137 138 */ 138 - ipa >>= 12; 139 139 __tlbi_level(ipas2e1, ipa, level); 140 140 141 141 /*
+4 -1
arch/arm64/kvm/sys_regs.c
··· 1805 1805 break; 1806 1806 case SYS_ID_AA64ISAR3_EL1: 1807 1807 val &= ID_AA64ISAR3_EL1_FPRCVT | ID_AA64ISAR3_EL1_LSFE | 1808 - ID_AA64ISAR3_EL1_FAMINMAX; 1808 + ID_AA64ISAR3_EL1_FAMINMAX | ID_AA64ISAR3_EL1_LSUI; 1809 1809 break; 1810 1810 case SYS_ID_AA64MMFR2_EL1: 1811 1811 val &= ~ID_AA64MMFR2_EL1_CCIDX_MASK; ··· 3252 3252 ID_AA64ISAR2_EL1_GPA3)), 3253 3253 ID_WRITABLE(ID_AA64ISAR3_EL1, (ID_AA64ISAR3_EL1_FPRCVT | 3254 3254 ID_AA64ISAR3_EL1_LSFE | 3255 + ID_AA64ISAR3_EL1_LSUI | 3255 3256 ID_AA64ISAR3_EL1_FAMINMAX)), 3256 3257 ID_UNALLOCATED(6,4), 3257 3258 ID_UNALLOCATED(6,5), ··· 3377 3376 3378 3377 { SYS_DESC(SYS_MPAM1_EL1), undef_access }, 3379 3378 { SYS_DESC(SYS_MPAM0_EL1), undef_access }, 3379 + { SYS_DESC(SYS_MPAMSM_EL1), undef_access }, 3380 + 3380 3381 { SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 }, 3381 3382 { SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 }, 3382 3383
+4 -4
arch/arm64/mm/context.c
··· 354 354 355 355 /* Skip CNP for the reserved ASID */ 356 356 if (system_supports_cnp() && asid) 357 - ttbr0 |= TTBR_CNP_BIT; 357 + ttbr0 |= TTBRx_EL1_CnP; 358 358 359 359 /* SW PAN needs a copy of the ASID in TTBR0 for entry */ 360 360 if (IS_ENABLED(CONFIG_ARM64_SW_TTBR0_PAN)) 361 - ttbr0 |= FIELD_PREP(TTBR_ASID_MASK, asid); 361 + ttbr0 |= FIELD_PREP(TTBRx_EL1_ASID_MASK, asid); 362 362 363 363 /* Set ASID in TTBR1 since TCR.A1 is set */ 364 - ttbr1 &= ~TTBR_ASID_MASK; 365 - ttbr1 |= FIELD_PREP(TTBR_ASID_MASK, asid); 364 + ttbr1 &= ~TTBRx_EL1_ASID_MASK; 365 + ttbr1 |= FIELD_PREP(TTBRx_EL1_ASID_MASK, asid); 366 366 367 367 cpu_set_reserved_ttbr0_nosync(); 368 368 write_sysreg(ttbr1, ttbr1_el1);
+8 -4
arch/arm64/mm/contpte.c
··· 225 225 */ 226 226 227 227 if (!system_supports_bbml2_noabort()) 228 - __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3); 228 + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, 3, 229 + TLBF_NOWALKCACHE); 229 230 230 231 __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES); 231 232 } ··· 552 551 * See comment in __ptep_clear_flush_young(); same rationale for 553 552 * eliding the trailing DSB applies here. 554 553 */ 555 - __flush_tlb_range_nosync(vma->vm_mm, addr, end, 556 - PAGE_SIZE, true, 3); 554 + __flush_tlb_range(vma, addr, end, PAGE_SIZE, 3, 555 + TLBF_NOWALKCACHE | TLBF_NOSYNC); 557 556 } 558 557 559 558 return young; ··· 686 685 __ptep_set_access_flags(vma, addr, ptep, entry, 0); 687 686 688 687 if (dirty) 689 - local_flush_tlb_contpte(vma, start_addr); 688 + __flush_tlb_range(vma, start_addr, 689 + start_addr + CONT_PTE_SIZE, 690 + PAGE_SIZE, 3, 691 + TLBF_NOWALKCACHE | TLBF_NOBROADCAST); 690 692 } else { 691 693 __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); 692 694 __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+25 -5
arch/arm64/mm/fault.c
··· 204 204 * 205 205 * Returns whether or not the PTE actually changed. 206 206 */ 207 - int __ptep_set_access_flags(struct vm_area_struct *vma, 208 - unsigned long address, pte_t *ptep, 209 - pte_t entry, int dirty) 207 + int __ptep_set_access_flags_anysz(struct vm_area_struct *vma, 208 + unsigned long address, pte_t *ptep, 209 + pte_t entry, int dirty, unsigned long pgsize) 210 210 { 211 211 pteval_t old_pteval, pteval; 212 212 pte_t pte = __ptep_get(ptep); 213 + int level; 213 214 214 215 if (pte_same(pte, entry)) 215 216 return 0; ··· 239 238 * may still cause page faults and be invalidated via 240 239 * flush_tlb_fix_spurious_fault(). 241 240 */ 242 - if (dirty) 243 - local_flush_tlb_page(vma, address); 241 + if (dirty) { 242 + switch (pgsize) { 243 + case PAGE_SIZE: 244 + level = 3; 245 + break; 246 + case PMD_SIZE: 247 + level = 2; 248 + break; 249 + #ifndef __PAGETABLE_PMD_FOLDED 250 + case PUD_SIZE: 251 + level = 1; 252 + break; 253 + #endif 254 + default: 255 + level = TLBI_TTL_UNKNOWN; 256 + WARN_ON(1); 257 + } 258 + 259 + __flush_tlb_range(vma, address, address + pgsize, pgsize, level, 260 + TLBF_NOWALKCACHE | TLBF_NOBROADCAST); 261 + } 244 262 return 1; 245 263 } 246 264
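The `switch (pgsize)` added to `__ptep_set_access_flags_anysz()` derives a translation-table level hint for the TLBI from the mapping size, falling back to "unknown" (no hint) for unexpected sizes. A sketch of that mapping, assuming a 4K translation granule (so PAGE_SIZE = 4K, PMD_SIZE = 2M, PUD_SIZE = 1G; level 3 is the last level) and an illustrative placeholder for `TLBI_TTL_UNKNOWN`:

```c
#include <assert.h>

#define SZ_4K (4096UL)
#define SZ_2M (2UL << 20)
#define SZ_1G (1UL << 30)
#define TLBI_TTL_UNKNOWN (-1)	/* illustrative "no level hint" value */

static int pgsize_to_level(unsigned long pgsize)
{
	switch (pgsize) {
	case SZ_4K: return 3;	/* PTE */
	case SZ_2M: return 2;	/* PMD block */
	case SZ_1G: return 1;	/* PUD block */
	default:    return TLBI_TTL_UNKNOWN;
	}
}
```

Supplying the level lets the hardware scope the invalidation (TTL hint) instead of probing every level, which is the same motivation behind the level-hinted `__flush_tlb_range()` calls threaded through the rest of this series.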
+5 -5
arch/arm64/mm/hugetlbpage.c
··· 181 181 struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); 182 182 unsigned long end = addr + (pgsize * ncontig); 183 183 184 - __flush_hugetlb_tlb_range(&vma, addr, end, pgsize, true); 184 + __flush_hugetlb_tlb_range(&vma, addr, end, pgsize, TLBF_NOWALKCACHE); 185 185 return orig_pte; 186 186 } 187 187 ··· 209 209 if (mm == &init_mm) 210 210 flush_tlb_kernel_range(saddr, addr); 211 211 else 212 - __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true); 212 + __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, TLBF_NOWALKCACHE); 213 213 } 214 214 215 215 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, ··· 427 427 pte_t orig_pte; 428 428 429 429 VM_WARN_ON(!pte_present(pte)); 430 + ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize); 430 431 431 432 if (!pte_cont(pte)) 432 - return __ptep_set_access_flags(vma, addr, ptep, pte, dirty); 433 - 434 - ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize); 433 + return __ptep_set_access_flags_anysz(vma, addr, ptep, pte, 434 + dirty, pgsize); 435 435 436 436 if (!__cont_access_flags_changed(ptep, pte, ncontig)) 437 437 return 0;
+8 -1
arch/arm64/mm/init.c
··· 350 350 } 351 351 352 352 swiotlb_init(swiotlb, flags); 353 - swiotlb_update_mem_attributes(); 354 353 355 354 /* 356 355 * Check boundaries twice: Some fundamental inconsistencies can be ··· 374 375 */ 375 376 sysctl_overcommit_memory = OVERCOMMIT_ALWAYS; 376 377 } 378 + } 379 + 380 + bool page_alloc_available __ro_after_init; 381 + 382 + void __init mem_init(void) 383 + { 384 + page_alloc_available = true; 385 + swiotlb_update_mem_attributes(); 377 386 } 378 387 379 388 void free_initmem(void)
+212 -74
arch/arm64/mm/mmu.c
··· 112 112 } 113 113 EXPORT_SYMBOL(phys_mem_access_prot); 114 114 115 - static phys_addr_t __init early_pgtable_alloc(enum pgtable_type pgtable_type) 115 + static phys_addr_t __init early_pgtable_alloc(enum pgtable_level pgtable_level) 116 116 { 117 117 phys_addr_t phys; 118 118 ··· 197 197 static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, 198 198 unsigned long end, phys_addr_t phys, 199 199 pgprot_t prot, 200 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 200 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 201 201 int flags) 202 202 { 203 203 unsigned long next; 204 204 pmd_t pmd = READ_ONCE(*pmdp); 205 205 pte_t *ptep; 206 206 207 - BUG_ON(pmd_sect(pmd)); 207 + BUG_ON(pmd_leaf(pmd)); 208 208 if (pmd_none(pmd)) { 209 209 pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF; 210 210 phys_addr_t pte_phys; ··· 212 212 if (flags & NO_EXEC_MAPPINGS) 213 213 pmdval |= PMD_TABLE_PXN; 214 214 BUG_ON(!pgtable_alloc); 215 - pte_phys = pgtable_alloc(TABLE_PTE); 215 + pte_phys = pgtable_alloc(PGTABLE_LEVEL_PTE); 216 216 if (pte_phys == INVALID_PHYS_ADDR) 217 217 return -ENOMEM; 218 218 ptep = pte_set_fixmap(pte_phys); ··· 252 252 253 253 static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end, 254 254 phys_addr_t phys, pgprot_t prot, 255 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags) 255 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), int flags) 256 256 { 257 257 unsigned long next; 258 258 ··· 292 292 static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, 293 293 unsigned long end, phys_addr_t phys, 294 294 pgprot_t prot, 295 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 295 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 296 296 int flags) 297 297 { 298 298 int ret; ··· 303 303 /* 304 304 * Check for initial section mappings in the pgd/pud. 
305 305 */ 306 - BUG_ON(pud_sect(pud)); 306 + BUG_ON(pud_leaf(pud)); 307 307 if (pud_none(pud)) { 308 308 pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF; 309 309 phys_addr_t pmd_phys; ··· 311 311 if (flags & NO_EXEC_MAPPINGS) 312 312 pudval |= PUD_TABLE_PXN; 313 313 BUG_ON(!pgtable_alloc); 314 - pmd_phys = pgtable_alloc(TABLE_PMD); 314 + pmd_phys = pgtable_alloc(PGTABLE_LEVEL_PMD); 315 315 if (pmd_phys == INVALID_PHYS_ADDR) 316 316 return -ENOMEM; 317 317 pmdp = pmd_set_fixmap(pmd_phys); ··· 349 349 350 350 static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end, 351 351 phys_addr_t phys, pgprot_t prot, 352 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 352 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 353 353 int flags) 354 354 { 355 355 int ret = 0; ··· 364 364 if (flags & NO_EXEC_MAPPINGS) 365 365 p4dval |= P4D_TABLE_PXN; 366 366 BUG_ON(!pgtable_alloc); 367 - pud_phys = pgtable_alloc(TABLE_PUD); 367 + pud_phys = pgtable_alloc(PGTABLE_LEVEL_PUD); 368 368 if (pud_phys == INVALID_PHYS_ADDR) 369 369 return -ENOMEM; 370 370 pudp = pud_set_fixmap(pud_phys); ··· 415 415 416 416 static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end, 417 417 phys_addr_t phys, pgprot_t prot, 418 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 418 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 419 419 int flags) 420 420 { 421 421 int ret; ··· 430 430 if (flags & NO_EXEC_MAPPINGS) 431 431 pgdval |= PGD_TABLE_PXN; 432 432 BUG_ON(!pgtable_alloc); 433 - p4d_phys = pgtable_alloc(TABLE_P4D); 433 + p4d_phys = pgtable_alloc(PGTABLE_LEVEL_P4D); 434 434 if (p4d_phys == INVALID_PHYS_ADDR) 435 435 return -ENOMEM; 436 436 p4dp = p4d_set_fixmap(p4d_phys); ··· 467 467 static int __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys, 468 468 unsigned long virt, phys_addr_t size, 469 469 pgprot_t prot, 470 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 470 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 471 471 int 
flags) 472 472 { 473 473 int ret; ··· 500 500 static int __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys, 501 501 unsigned long virt, phys_addr_t size, 502 502 pgprot_t prot, 503 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 503 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 504 504 int flags) 505 505 { 506 506 int ret; ··· 516 516 static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys, 517 517 unsigned long virt, phys_addr_t size, 518 518 pgprot_t prot, 519 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 519 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 520 520 int flags) 521 521 { 522 522 int ret; ··· 528 528 } 529 529 530 530 static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp, 531 - enum pgtable_type pgtable_type) 531 + enum pgtable_level pgtable_level) 532 532 { 533 533 /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */ 534 534 struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0); ··· 539 539 540 540 pa = page_to_phys(ptdesc_page(ptdesc)); 541 541 542 - switch (pgtable_type) { 543 - case TABLE_PTE: 542 + switch (pgtable_level) { 543 + case PGTABLE_LEVEL_PTE: 544 544 BUG_ON(!pagetable_pte_ctor(mm, ptdesc)); 545 545 break; 546 - case TABLE_PMD: 546 + case PGTABLE_LEVEL_PMD: 547 547 BUG_ON(!pagetable_pmd_ctor(mm, ptdesc)); 548 548 break; 549 - case TABLE_PUD: 549 + case PGTABLE_LEVEL_PUD: 550 550 pagetable_pud_ctor(ptdesc); 551 551 break; 552 - case TABLE_P4D: 552 + case PGTABLE_LEVEL_P4D: 553 553 pagetable_p4d_ctor(ptdesc); 554 + break; 555 + case PGTABLE_LEVEL_PGD: 556 + VM_WARN_ON(1); 554 557 break; 555 558 } 556 559 ··· 561 558 } 562 559 563 560 static phys_addr_t 564 - pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp) 561 + pgd_pgtable_alloc_init_mm_gfp(enum pgtable_level pgtable_level, gfp_t gfp) 565 562 { 566 - return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type); 563 + return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_level); 567 564 } 568 565 
569 566 static phys_addr_t __maybe_unused 570 - pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type) 567 + pgd_pgtable_alloc_init_mm(enum pgtable_level pgtable_level) 571 568 { 572 - return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL); 569 + return pgd_pgtable_alloc_init_mm_gfp(pgtable_level, GFP_PGTABLE_KERNEL); 573 570 } 574 571 575 572 static phys_addr_t 576 - pgd_pgtable_alloc_special_mm(enum pgtable_type pgtable_type) 573 + pgd_pgtable_alloc_special_mm(enum pgtable_level pgtable_level) 577 574 { 578 - return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_type); 575 + return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_level); 579 576 } 580 577 581 578 static void split_contpte(pte_t *ptep) ··· 596 593 pte_t *ptep; 597 594 int i; 598 595 599 - pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp); 596 + pte_phys = pgd_pgtable_alloc_init_mm_gfp(PGTABLE_LEVEL_PTE, gfp); 600 597 if (pte_phys == INVALID_PHYS_ADDR) 601 598 return -ENOMEM; 602 599 ptep = (pte_t *)phys_to_virt(pte_phys); ··· 605 602 tableprot |= PMD_TABLE_PXN; 606 603 607 604 prot = __pgprot((pgprot_val(prot) & ~PTE_TYPE_MASK) | PTE_TYPE_PAGE); 605 + if (!pmd_valid(pmd)) 606 + prot = pte_pgprot(pte_mkinvalid(pfn_pte(0, prot))); 608 607 prot = __pgprot(pgprot_val(prot) & ~PTE_CONT); 609 608 if (to_cont) 610 609 prot = __pgprot(pgprot_val(prot) | PTE_CONT); ··· 643 638 pmd_t *pmdp; 644 639 int i; 645 640 646 - pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp); 641 + pmd_phys = pgd_pgtable_alloc_init_mm_gfp(PGTABLE_LEVEL_PMD, gfp); 647 642 if (pmd_phys == INVALID_PHYS_ADDR) 648 643 return -ENOMEM; 649 644 pmdp = (pmd_t *)phys_to_virt(pmd_phys); ··· 652 647 tableprot |= PUD_TABLE_PXN; 653 648 654 649 prot = __pgprot((pgprot_val(prot) & ~PMD_TYPE_MASK) | PMD_TYPE_SECT); 650 + if (!pud_valid(pud)) 651 + prot = pmd_pgprot(pmd_mkinvalid(pfn_pmd(0, prot))); 655 652 prot = __pgprot(pgprot_val(prot) & ~PTE_CONT); 656 653 if (to_cont) 657 654 prot = 
__pgprot(pgprot_val(prot) | PTE_CONT); ··· 775 768 } 776 769 777 770 static DEFINE_MUTEX(pgtable_split_lock); 771 + static bool linear_map_requires_bbml2; 778 772 779 773 int split_kernel_leaf_mapping(unsigned long start, unsigned long end) 780 774 { 781 775 int ret; 782 776 783 777 /* 784 - * !BBML2_NOABORT systems should not be trying to change permissions on 785 - * anything that is not pte-mapped in the first place. Just return early 786 - * and let the permission change code raise a warning if not already 787 - * pte-mapped. 788 - */ 789 - if (!system_supports_bbml2_noabort()) 790 - return 0; 791 - 792 - /* 793 778 * If the region is within a pte-mapped area, there is no need to try to 794 779 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may 795 780 * change permissions from atomic context so for those cases (which are 796 781 * always pte-mapped), we must not go any further because taking the 797 - * mutex below may sleep. 782 + * mutex below may sleep. Do not call force_pte_mapping() here because 783 + * it could return a confusing result if called from a secondary cpu 784 + * prior to finalizing caps. Instead, linear_map_requires_bbml2 gives us 785 + * what we need. 798 786 */ 799 - if (force_pte_mapping() || is_kfence_address((void *)start)) 787 + if (!linear_map_requires_bbml2 || is_kfence_address((void *)start)) 800 788 return 0; 789 + 790 + if (!system_supports_bbml2_noabort()) { 791 + /* 792 + * !BBML2_NOABORT systems should not be trying to change 793 + * permissions on anything that is not pte-mapped in the first 794 + * place. Just return early and let the permission change code 795 + * raise a warning if not already pte-mapped. 796 + */ 797 + if (system_capabilities_finalized()) 798 + return 0; 799 + 800 + /* 801 + * Boot-time: split_kernel_leaf_mapping_locked() allocates from 802 + * page allocator. Can't split until it's available. 
803 + */ 804 + if (WARN_ON(!page_alloc_available)) 805 + return -EBUSY; 806 + 807 + /* 808 + * Boot-time: Started secondary cpus but don't know if they 809 + * support BBML2_NOABORT yet. Can't allow splitting in this 810 + * window in case they don't. 811 + */ 812 + if (WARN_ON(num_online_cpus() > 1)) 813 + return -EBUSY; 814 + } 801 815 802 816 /* 803 817 * Ensure start and end are at least page-aligned since this is the ··· 918 890 919 891 return ret; 920 892 } 921 - 922 - static bool linear_map_requires_bbml2 __initdata; 923 893 924 894 u32 idmap_kpti_bbml2_flag; 925 895 ··· 1252 1226 1253 1227 static phys_addr_t kpti_ng_temp_alloc __initdata; 1254 1228 1255 - static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type) 1229 + static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_level pgtable_level) 1256 1230 { 1257 1231 kpti_ng_temp_alloc -= PAGE_SIZE; 1258 1232 return kpti_ng_temp_alloc; ··· 1484 1458 1485 1459 WARN_ON(!pte_present(pte)); 1486 1460 __pte_clear(&init_mm, addr, ptep); 1487 - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 1488 - if (free_mapped) 1461 + if (free_mapped) { 1462 + /* CONT blocks are not supported in the vmemmap */ 1463 + WARN_ON(pte_cont(pte)); 1464 + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 1489 1465 free_hotplug_page_range(pte_page(pte), 1490 1466 PAGE_SIZE, altmap); 1467 + } 1468 + /* unmap_hotplug_range() flushes TLB for !free_mapped */ 1491 1469 } while (addr += PAGE_SIZE, addr < end); 1492 1470 } 1493 1471 ··· 1510 1480 continue; 1511 1481 1512 1482 WARN_ON(!pmd_present(pmd)); 1513 - if (pmd_sect(pmd)) { 1483 + if (pmd_leaf(pmd)) { 1514 1484 pmd_clear(pmdp); 1515 - 1516 - /* 1517 - * One TLBI should be sufficient here as the PMD_SIZE 1518 - * range is mapped with a single block entry. 
1519 - */ 1520 - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 1521 - if (free_mapped) 1485 + if (free_mapped) { 1486 + /* CONT blocks are not supported in the vmemmap */ 1487 + WARN_ON(pmd_cont(pmd)); 1488 + flush_tlb_kernel_range(addr, addr + PMD_SIZE); 1522 1489 free_hotplug_page_range(pmd_page(pmd), 1523 1490 PMD_SIZE, altmap); 1491 + } 1492 + /* unmap_hotplug_range() flushes TLB for !free_mapped */ 1524 1493 continue; 1525 1494 } 1526 1495 WARN_ON(!pmd_table(pmd)); ··· 1542 1513 continue; 1543 1514 1544 1515 WARN_ON(!pud_present(pud)); 1545 - if (pud_sect(pud)) { 1516 + if (pud_leaf(pud)) { 1546 1517 pud_clear(pudp); 1547 - 1548 - /* 1549 - * One TLBI should be sufficient here as the PUD_SIZE 1550 - * range is mapped with a single block entry. 1551 - */ 1552 - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 1553 - if (free_mapped) 1518 + if (free_mapped) { 1519 + flush_tlb_kernel_range(addr, addr + PUD_SIZE); 1554 1520 free_hotplug_page_range(pud_page(pud), 1555 1521 PUD_SIZE, altmap); 1522 + } 1523 + /* unmap_hotplug_range() flushes TLB for !free_mapped */ 1556 1524 continue; 1557 1525 } 1558 1526 WARN_ON(!pud_table(pud)); ··· 1579 1553 static void unmap_hotplug_range(unsigned long addr, unsigned long end, 1580 1554 bool free_mapped, struct vmem_altmap *altmap) 1581 1555 { 1556 + unsigned long start = addr; 1582 1557 unsigned long next; 1583 1558 pgd_t *pgdp, pgd; 1584 1559 ··· 1601 1574 WARN_ON(!pgd_present(pgd)); 1602 1575 unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap); 1603 1576 } while (addr = next, addr < end); 1577 + 1578 + if (!free_mapped) 1579 + flush_tlb_kernel_range(start, end); 1604 1580 } 1605 1581 1606 1582 static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, ··· 1657 1627 if (pmd_none(pmd)) 1658 1628 continue; 1659 1629 1660 - WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); 1630 + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd)); 1661 1631 free_empty_pte_table(pmdp, addr, next, floor, ceiling); 1662 
1632 } while (addr = next, addr < end); 1663 1633 ··· 1697 1667 if (pud_none(pud)) 1698 1668 continue; 1699 1669 1700 - WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud)); 1670 + WARN_ON(!pud_present(pud) || !pud_table(pud)); 1701 1671 free_empty_pmd_table(pudp, addr, next, floor, ceiling); 1702 1672 } while (addr = next, addr < end); 1703 1673 ··· 1793 1763 { 1794 1764 vmemmap_verify((pte_t *)pmdp, node, addr, next); 1795 1765 1796 - return pmd_sect(READ_ONCE(*pmdp)); 1766 + return pmd_leaf(READ_ONCE(*pmdp)); 1797 1767 } 1798 1768 1799 1769 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, ··· 1857 1827 1858 1828 int pud_clear_huge(pud_t *pudp) 1859 1829 { 1860 - if (!pud_sect(READ_ONCE(*pudp))) 1830 + if (!pud_leaf(READ_ONCE(*pudp))) 1861 1831 return 0; 1862 1832 pud_clear(pudp); 1863 1833 return 1; ··· 1865 1835 1866 1836 int pmd_clear_huge(pmd_t *pmdp) 1867 1837 { 1868 - if (!pmd_sect(READ_ONCE(*pmdp))) 1838 + if (!pmd_leaf(READ_ONCE(*pmdp))) 1869 1839 return 0; 1870 1840 pmd_clear(pmdp); 1871 1841 return 1; ··· 2040 2010 __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size); 2041 2011 } 2042 2012 2013 + 2014 + static bool addr_splits_kernel_leaf(unsigned long addr) 2015 + { 2016 + pgd_t *pgdp, pgd; 2017 + p4d_t *p4dp, p4d; 2018 + pud_t *pudp, pud; 2019 + pmd_t *pmdp, pmd; 2020 + pte_t *ptep, pte; 2021 + 2022 + /* 2023 + * If the given address points at the start address of 2024 + * a possible leaf, we certainly won't split. Otherwise, 2025 + * check if we would actually split a leaf by traversing 2026 + * the page tables further. 
2027 + */ 2028 + if (IS_ALIGNED(addr, PGDIR_SIZE)) 2029 + return false; 2030 + 2031 + pgdp = pgd_offset_k(addr); 2032 + pgd = pgdp_get(pgdp); 2033 + if (!pgd_present(pgd)) 2034 + return false; 2035 + 2036 + if (IS_ALIGNED(addr, P4D_SIZE)) 2037 + return false; 2038 + 2039 + p4dp = p4d_offset(pgdp, addr); 2040 + p4d = p4dp_get(p4dp); 2041 + if (!p4d_present(p4d)) 2042 + return false; 2043 + 2044 + if (IS_ALIGNED(addr, PUD_SIZE)) 2045 + return false; 2046 + 2047 + pudp = pud_offset(p4dp, addr); 2048 + pud = pudp_get(pudp); 2049 + if (!pud_present(pud)) 2050 + return false; 2051 + 2052 + if (pud_leaf(pud)) 2053 + return true; 2054 + 2055 + if (IS_ALIGNED(addr, CONT_PMD_SIZE)) 2056 + return false; 2057 + 2058 + pmdp = pmd_offset(pudp, addr); 2059 + pmd = pmdp_get(pmdp); 2060 + if (!pmd_present(pmd)) 2061 + return false; 2062 + 2063 + if (pmd_cont(pmd)) 2064 + return true; 2065 + 2066 + if (IS_ALIGNED(addr, PMD_SIZE)) 2067 + return false; 2068 + 2069 + if (pmd_leaf(pmd)) 2070 + return true; 2071 + 2072 + if (IS_ALIGNED(addr, CONT_PTE_SIZE)) 2073 + return false; 2074 + 2075 + ptep = pte_offset_kernel(pmdp, addr); 2076 + pte = __ptep_get(ptep); 2077 + if (!pte_present(pte)) 2078 + return false; 2079 + 2080 + if (pte_cont(pte)) 2081 + return true; 2082 + 2083 + return !IS_ALIGNED(addr, PAGE_SIZE); 2084 + } 2085 + 2086 + static bool can_unmap_without_split(unsigned long pfn, unsigned long nr_pages) 2087 + { 2088 + unsigned long phys_start, phys_end, start, end; 2089 + 2090 + phys_start = PFN_PHYS(pfn); 2091 + phys_end = phys_start + nr_pages * PAGE_SIZE; 2092 + 2093 + /* PFN range's linear map edges are leaf entry aligned */ 2094 + start = __phys_to_virt(phys_start); 2095 + end = __phys_to_virt(phys_end); 2096 + if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) { 2097 + pr_warn("[%lx %lx] splits a leaf entry in linear map\n", 2098 + phys_start, phys_end); 2099 + return false; 2100 + } 2101 + 2102 + /* PFN range's vmemmap edges are leaf entry aligned */ 2103 
+ BUILD_BUG_ON(!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)); 2104 + start = (unsigned long)pfn_to_page(pfn); 2105 + end = (unsigned long)pfn_to_page(pfn + nr_pages); 2106 + if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) { 2107 + pr_warn("[%lx %lx] splits a leaf entry in vmemmap\n", 2108 + phys_start, phys_end); 2109 + return false; 2110 + } 2111 + return true; 2112 + } 2113 + 2043 2114 /* 2044 2115 * This memory hotplug notifier helps prevent boot memory from being 2045 2116 * inadvertently removed as it blocks pfn range offlining process in ··· 2149 2018 * In future if and when boot memory could be removed, this notifier 2150 2019 * should be dropped and free_hotplug_page_range() should handle any 2151 2020 * reserved pages allocated during boot. 2021 + * 2022 + * This also blocks any memory remove that would have caused a split 2023 + * in leaf entry in kernel linear or vmemmap mapping. 2152 2024 */ 2153 - static int prevent_bootmem_remove_notifier(struct notifier_block *nb, 2025 + static int prevent_memory_remove_notifier(struct notifier_block *nb, 2154 2026 unsigned long action, void *data) 2155 2027 { 2156 2028 struct mem_section *ms; ··· 2199 2065 return NOTIFY_DONE; 2200 2066 } 2201 2067 } 2068 + 2069 + if (!can_unmap_without_split(pfn, arg->nr_pages)) 2070 + return NOTIFY_BAD; 2071 + 2202 2072 return NOTIFY_OK; 2203 2073 } 2204 2074 2205 - static struct notifier_block prevent_bootmem_remove_nb = { 2206 - .notifier_call = prevent_bootmem_remove_notifier, 2075 + static struct notifier_block prevent_memory_remove_nb = { 2076 + .notifier_call = prevent_memory_remove_notifier, 2207 2077 }; 2208 2078 2209 2079 /* ··· 2257 2119 } 2258 2120 } 2259 2121 2260 - static int __init prevent_bootmem_remove_init(void) 2122 + static int __init prevent_memory_remove_init(void) 2261 2123 { 2262 2124 int ret = 0; 2263 2125 ··· 2265 2127 return ret; 2266 2128 2267 2129 validate_bootmem_online(); 2268 - ret = register_memory_notifier(&prevent_bootmem_remove_nb); 
2130 + ret = register_memory_notifier(&prevent_memory_remove_nb); 2269 2131 if (ret) 2270 2132 pr_err("%s: Notifier registration failed %d\n", __func__, ret); 2271 2133 2272 2134 return ret; 2273 2135 } 2274 - early_initcall(prevent_bootmem_remove_init); 2136 + early_initcall(prevent_memory_remove_init); 2275 2137 #endif 2276 2138 2277 2139 pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr, ··· 2287 2149 */ 2288 2150 if (pte_accessible(vma->vm_mm, pte) && pte_user_exec(pte)) 2289 2151 __flush_tlb_range(vma, addr, nr * PAGE_SIZE, 2290 - PAGE_SIZE, true, 3); 2152 + PAGE_SIZE, 3, TLBF_NOWALKCACHE); 2291 2153 } 2292 2154 2293 2155 return pte; ··· 2326 2188 phys_addr_t ttbr1 = phys_to_ttbr(virt_to_phys(pgdp)); 2327 2189 2328 2190 if (cnp) 2329 - ttbr1 |= TTBR_CNP_BIT; 2191 + ttbr1 |= TTBRx_EL1_CnP; 2330 2192 2331 2193 replace_phys = (void *)__pa_symbol(idmap_cpu_replace_ttbr1); 2332 2194
+28 -22
arch/arm64/mm/pageattr.c
··· 25 25 { 26 26 struct page_change_data *masks = walk->private; 27 27 28 + /* 29 + * Some users clear and set bits which alias each other (e.g. PTE_NG and 30 + * PTE_PRESENT_INVALID). It is therefore important that we always clear 31 + * first then set. 32 + */ 28 33 val &= ~(pgprot_val(masks->clear_mask)); 29 34 val |= (pgprot_val(masks->set_mask)); 30 35 ··· 41 36 { 42 37 pud_t val = pudp_get(pud); 43 38 44 - if (pud_sect(val)) { 39 + if (pud_leaf(val)) { 45 40 if (WARN_ON_ONCE((next - addr) != PUD_SIZE)) 46 41 return -EINVAL; 47 42 val = __pud(set_pageattr_masks(pud_val(val), walk)); ··· 57 52 { 58 53 pmd_t val = pmdp_get(pmd); 59 54 60 - if (pmd_sect(val)) { 55 + if (pmd_leaf(val)) { 61 56 if (WARN_ON_ONCE((next - addr) != PMD_SIZE)) 62 57 return -EINVAL; 63 58 val = __pmd(set_pageattr_masks(pmd_val(val), walk)); ··· 137 132 ret = update_range_prot(start, size, set_mask, clear_mask); 138 133 139 134 /* 140 - * If the memory is being made valid without changing any other bits 141 - * then a TLBI isn't required as a non-valid entry cannot be cached in 142 - * the TLB. 135 + * If the memory is being switched from present-invalid to valid without 136 + * changing any other bits then a TLBI isn't required as a non-valid 137 + * entry cannot be cached in the TLB. 
143 138 */ 144 - if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask)) 139 + if (pgprot_val(set_mask) != PTE_PRESENT_VALID_KERNEL || 140 + pgprot_val(clear_mask) != PTE_PRESENT_INVALID) 145 141 flush_tlb_kernel_range(start, start + size); 146 142 return ret; 147 143 } ··· 243 237 { 244 238 if (enable) 245 239 return __change_memory_common(addr, PAGE_SIZE * numpages, 246 - __pgprot(PTE_VALID), 247 - __pgprot(0)); 240 + __pgprot(PTE_PRESENT_VALID_KERNEL), 241 + __pgprot(PTE_PRESENT_INVALID)); 248 242 else 249 243 return __change_memory_common(addr, PAGE_SIZE * numpages, 250 - __pgprot(0), 251 - __pgprot(PTE_VALID)); 244 + __pgprot(PTE_PRESENT_INVALID), 245 + __pgprot(PTE_PRESENT_VALID_KERNEL)); 252 246 } 253 247 254 248 int set_direct_map_invalid_noflush(struct page *page) 255 249 { 256 - pgprot_t clear_mask = __pgprot(PTE_VALID); 257 - pgprot_t set_mask = __pgprot(0); 250 + pgprot_t clear_mask = __pgprot(PTE_PRESENT_VALID_KERNEL); 251 + pgprot_t set_mask = __pgprot(PTE_PRESENT_INVALID); 258 252 259 253 if (!can_set_direct_map()) 260 254 return 0; ··· 265 259 266 260 int set_direct_map_default_noflush(struct page *page) 267 261 { 268 - pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE); 269 - pgprot_t clear_mask = __pgprot(PTE_RDONLY); 262 + pgprot_t set_mask = __pgprot(PTE_PRESENT_VALID_KERNEL | PTE_WRITE); 263 + pgprot_t clear_mask = __pgprot(PTE_PRESENT_INVALID | PTE_RDONLY); 270 264 271 265 if (!can_set_direct_map()) 272 266 return 0; ··· 302 296 * entries or Synchronous External Aborts caused by RIPAS_EMPTY 303 297 */ 304 298 ret = __change_memory_common(addr, PAGE_SIZE * numpages, 305 - __pgprot(set_prot), 306 - __pgprot(clear_prot | PTE_VALID)); 299 + __pgprot(set_prot | PTE_PRESENT_INVALID), 300 + __pgprot(clear_prot | PTE_PRESENT_VALID_KERNEL)); 307 301 308 302 if (ret) 309 303 return ret; ··· 317 311 return ret; 318 312 319 313 return __change_memory_common(addr, PAGE_SIZE * numpages, 320 - __pgprot(PTE_VALID), 321 - __pgprot(0)); 314 + 
__pgprot(PTE_PRESENT_VALID_KERNEL), 315 + __pgprot(PTE_PRESENT_INVALID)); 322 316 } 323 317 324 318 static int realm_set_memory_encrypted(unsigned long addr, int numpages) ··· 410 404 pud = READ_ONCE(*pudp); 411 405 if (pud_none(pud)) 412 406 return false; 413 - if (pud_sect(pud)) 414 - return true; 407 + if (pud_leaf(pud)) 408 + return pud_valid(pud); 415 409 416 410 pmdp = pmd_offset(pudp, addr); 417 411 pmd = READ_ONCE(*pmdp); 418 412 if (pmd_none(pmd)) 419 413 return false; 420 - if (pmd_sect(pmd)) 421 - return true; 414 + if (pmd_leaf(pmd)) 415 + return pmd_valid(pmd); 422 416 423 417 ptep = pte_offset_kernel(pmdp, addr); 424 418 return pte_valid(__ptep_get(ptep));
+7 -35
arch/arm64/mm/trans_pgd.c
··· 31 31 return info->trans_alloc_page(info->trans_alloc_arg); 32 32 } 33 33 34 - static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr) 35 - { 36 - pte_t pte = __ptep_get(src_ptep); 37 - 38 - if (pte_valid(pte)) { 39 - /* 40 - * Resume will overwrite areas that may be marked 41 - * read only (code, rodata). Clear the RDONLY bit from 42 - * the temporary mappings we use during restore. 43 - */ 44 - __set_pte(dst_ptep, pte_mkwrite_novma(pte)); 45 - } else if (!pte_none(pte)) { 46 - /* 47 - * debug_pagealloc will removed the PTE_VALID bit if 48 - * the page isn't in use by the resume kernel. It may have 49 - * been in use by the original kernel, in which case we need 50 - * to put it back in our copy to do the restore. 51 - * 52 - * Other cases include kfence / vmalloc / memfd_secret which 53 - * may call `set_direct_map_invalid_noflush()`. 54 - * 55 - * Before marking this entry valid, check the pfn should 56 - * be mapped. 57 - */ 58 - BUG_ON(!pfn_valid(pte_pfn(pte))); 59 - 60 - __set_pte(dst_ptep, pte_mkvalid(pte_mkwrite_novma(pte))); 61 - } 62 - } 63 - 64 34 static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp, 65 35 pmd_t *src_pmdp, unsigned long start, unsigned long end) 66 36 { ··· 46 76 47 77 src_ptep = pte_offset_kernel(src_pmdp, start); 48 78 do { 49 - _copy_pte(dst_ptep, src_ptep, addr); 79 + pte_t pte = __ptep_get(src_ptep); 80 + 81 + if (pte_none(pte)) 82 + continue; 83 + __set_pte(dst_ptep, pte_mkvalid_k(pte_mkwrite_novma(pte))); 50 84 } while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end); 51 85 52 86 return 0; ··· 83 109 if (copy_pte(info, dst_pmdp, src_pmdp, addr, next)) 84 110 return -ENOMEM; 85 111 } else { 86 - set_pmd(dst_pmdp, 87 - __pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY)); 112 + set_pmd(dst_pmdp, pmd_mkvalid_k(pmd_mkwrite_novma(pmd))); 88 113 } 89 114 } while (dst_pmdp++, src_pmdp++, addr = next, addr != end); 90 115 ··· 118 145 if (copy_pmd(info, dst_pudp, src_pudp, addr, next)) 119 146 return -ENOMEM; 
120 147 } else { 121 - set_pud(dst_pudp, 122 - __pud(pud_val(pud) & ~PUD_SECT_RDONLY)); 148 + set_pud(dst_pudp, pud_mkvalid_k(pud_mkwrite_novma(pud))); 123 149 } 124 150 } while (dst_pudp++, src_pudp++, addr = next, addr != end); 125 151
+7 -1
arch/arm64/tools/Makefile
··· 3 3 gen := arch/$(ARCH)/include/generated 4 4 kapi := $(gen)/asm 5 5 6 - kapisyshdr-y := cpucap-defs.h sysreg-defs.h 6 + kapisyshdr-y := cpucap-defs.h kernel-hwcap.h sysreg-defs.h 7 7 8 8 kapi-hdrs-y := $(addprefix $(kapi)/, $(kapisyshdr-y)) 9 9 ··· 18 18 quiet_cmd_gen_cpucaps = GEN $@ 19 19 cmd_gen_cpucaps = mkdir -p $(dir $@); $(AWK) -f $(real-prereqs) > $@ 20 20 21 + quiet_cmd_gen_kernel_hwcap = GEN $@ 22 + cmd_gen_kernel_hwcap = mkdir -p $(dir $@); /bin/sh -e $(real-prereqs) > $@ 23 + 21 24 quiet_cmd_gen_sysreg = GEN $@ 22 25 cmd_gen_sysreg = mkdir -p $(dir $@); $(AWK) -f $(real-prereqs) > $@ 23 26 24 27 $(kapi)/cpucap-defs.h: $(src)/gen-cpucaps.awk $(src)/cpucaps FORCE 25 28 $(call if_changed,gen_cpucaps) 29 + 30 + $(kapi)/kernel-hwcap.h: $(src)/gen-kernel-hwcaps.sh $(srctree)/arch/arm64/include/uapi/asm/hwcap.h FORCE 31 + $(call if_changed,gen_kernel_hwcap) 26 32 27 33 $(kapi)/sysreg-defs.h: $(src)/gen-sysreg.awk $(src)/sysreg FORCE 28 34 $(call if_changed,gen_sysreg)
+1
arch/arm64/tools/cpucaps
··· 48 48 HAS_LSE_ATOMICS 49 49 HAS_LS64 50 50 HAS_LS64_V 51 + HAS_LSUI 51 52 HAS_MOPS 52 53 HAS_NESTED_VIRT 53 54 HAS_BBML2_NOABORT
+23
arch/arm64/tools/gen-kernel-hwcaps.sh
··· 1 + #!/bin/sh -e 2 + # SPDX-License-Identifier: GPL-2.0 3 + # 4 + # gen-kernel-hwcaps.sh - Generate kernel internal hwcap.h definitions 5 + # 6 + # Copyright 2026 Arm, Ltd. 7 + 8 + if [ "$1" = "" ]; then 9 + echo "$0: no filename specified" 10 + exit 1 11 + fi 12 + 13 + echo "#ifndef __ASM_KERNEL_HWCAPS_H" 14 + echo "#define __ASM_KERNEL_HWCAPS_H" 15 + echo "" 16 + echo "/* Generated file - do not edit */" 17 + echo "" 18 + 19 + grep -E '^#define HWCAP[0-9]*_[A-Z0-9_]+' $1 | \ 20 + sed 's/.*HWCAP\([0-9]*\)_\([A-Z0-9_]\+\).*/#define KERNEL_HWCAP_\2\t__khwcap\1_feature(\2)/' 21 + 22 + echo "" 23 + echo "#endif /* __ASM_KERNEL_HWCAPS_H */"
+32 -4
arch/arm64/tools/sysreg
··· 1496 1496 0b0000 NI 1497 1497 0b0001 IMP 1498 1498 0b0010 BFSCALE 1499 + 0b0011 B16MM 1499 1500 EndEnum 1500 1501 UnsignedEnum 23:20 BF16 1501 1502 0b0000 NI ··· 1523 1522 0b0001 SVE2 1524 1523 0b0010 SVE2p1 1525 1524 0b0011 SVE2p2 1525 + 0b0100 SVE2p3 1526 1526 EndEnum 1527 1527 EndSysreg 1528 1528 ··· 1532 1530 0b0 NI 1533 1531 0b1 IMP 1534 1532 EndEnum 1535 - Res0 62:61 1533 + Res0 62 1534 + UnsignedEnum 61 LUT6 1535 + 0b0 NI 1536 + 0b1 IMP 1537 + EndEnum 1536 1538 UnsignedEnum 60 LUTv2 1537 1539 0b0 NI 1538 1540 0b1 IMP ··· 1546 1540 0b0001 SME2 1547 1541 0b0010 SME2p1 1548 1542 0b0011 SME2p2 1543 + 0b0100 SME2p3 1549 1544 EndEnum 1550 1545 UnsignedEnum 55:52 I16I64 1551 1546 0b0000 NI ··· 1661 1654 0b0 NI 1662 1655 0b1 IMP 1663 1656 EndEnum 1664 - Res0 25:2 1657 + Res0 25:16 1658 + UnsignedEnum 15 F16MM2 1659 + 0b0 NI 1660 + 0b1 IMP 1661 + EndEnum 1662 + Res0 14:8 1663 + Raz 7:2 1665 1664 UnsignedEnum 1 F8E4M3 1666 1665 0b0 NI 1667 1666 0b1 IMP ··· 1848 1835 UnsignedEnum 51:48 FHM 1849 1836 0b0000 NI 1850 1837 0b0001 IMP 1838 + 0b0010 F16F32DOT 1839 + 0b0011 F16F32MM 1851 1840 EndEnum 1852 1841 UnsignedEnum 47:44 DP 1853 1842 0b0000 NI ··· 1991 1976 UnsignedEnum 59:56 LUT 1992 1977 0b0000 NI 1993 1978 0b0001 IMP 1979 + 0b0010 LUT6 1994 1980 EndEnum 1995 1981 UnsignedEnum 55:52 CSSC 1996 1982 0b0000 NI ··· 3671 3655 EndSysreg 3672 3656 3673 3657 Sysreg SMIDR_EL1 3 1 0 0 6 3674 - Res0 63:32 3658 + Res0 63:60 3659 + Field 59:56 NSMC 3660 + Field 55:52 HIP 3661 + Field 51:32 AFFINITY2 3675 3662 Field 31:24 IMPLEMENTER 3676 3663 Field 23:16 REVISION 3677 3664 Field 15 SMPS 3678 - Res0 14:12 3665 + Field 14:13 SH 3666 + Res0 12 3679 3667 Field 11:0 AFFINITY 3680 3668 EndSysreg 3681 3669 ··· 5190 5170 Field 39:32 PMG_I 5191 5171 Field 31:16 PARTID_D 5192 5172 Field 15:0 PARTID_I 5173 + EndSysreg 5174 + 5175 + Sysreg MPAMSM_EL1 3 0 10 5 3 5176 + Res0 63:48 5177 + Field 47:40 PMG_D 5178 + Res0 39:32 5179 + Field 31:16 PARTID_D 5180 + Res0 15:0 5193 5181 EndSysreg 
5194 5182 5195 5183 Sysreg ISR_EL1 3 0 12 1 0
+1 -1
drivers/acpi/arm64/agdi.c
··· 36 36 37 37 err = sdei_event_register(adata->sdei_event, agdi_sdei_handler, pdev); 38 38 if (err) { 39 - dev_err(&pdev->dev, "Failed to register for SDEI event %d", 39 + dev_err(&pdev->dev, "Failed to register for SDEI event %d\n", 40 40 adata->sdei_event); 41 41 return err; 42 42 }
+14
drivers/perf/Kconfig
··· 311 311 Enable support for PCIe Interface performance monitoring 312 312 on Marvell platform. 313 313 314 + config NVIDIA_TEGRA410_CMEM_LATENCY_PMU 315 + tristate "NVIDIA Tegra410 CPU Memory Latency PMU" 316 + depends on ARM64 && ACPI 317 + help 318 + Enable perf support for CPU memory latency counter monitoring on 319 + the NVIDIA Tegra410 SoC. 320 + 321 + config NVIDIA_TEGRA410_C2C_PMU 322 + tristate "NVIDIA Tegra410 C2C PMU" 323 + depends on ARM64 && ACPI 324 + help 325 + Enable perf support for counters in the C2C interface of the NVIDIA 326 + Tegra410 SoC. 327 + 314 328 endmenu
+2
drivers/perf/Makefile
··· 35 35 obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/ 36 36 obj-$(CONFIG_MESON_DDR_PMU) += amlogic/ 37 37 obj-$(CONFIG_CXL_PMU) += cxl_pmu.o 38 + obj-$(CONFIG_NVIDIA_TEGRA410_CMEM_LATENCY_PMU) += nvidia_t410_cmem_latency_pmu.o 39 + obj-$(CONFIG_NVIDIA_TEGRA410_C2C_PMU) += nvidia_t410_c2c_pmu.o
+31 -39
drivers/perf/arm-cmn.c
··· 2132 2132 static int arm_cmn_init_dtc(struct arm_cmn *cmn, struct arm_cmn_node *dn, int idx) 2133 2133 { 2134 2134 struct arm_cmn_dtc *dtc = cmn->dtc + idx; 2135 + const struct resource *cfg; 2136 + resource_size_t base, size; 2135 2137 2136 2138 dtc->pmu_base = dn->pmu_base; 2137 2139 dtc->base = dtc->pmu_base - arm_cmn_pmu_offset(cmn, dn); 2138 2140 dtc->irq = platform_get_irq(to_platform_device(cmn->dev), idx); 2139 2141 if (dtc->irq < 0) 2140 2142 return dtc->irq; 2143 + 2144 + cfg = platform_get_resource(to_platform_device(cmn->dev), IORESOURCE_MEM, 0); 2145 + base = dtc->base - cmn->base + cfg->start; 2146 + size = cmn->part == PART_CMN600 ? SZ_16K : SZ_64K; 2147 + if (!devm_request_mem_region(cmn->dev, base, size, dev_name(cmn->dev))) 2148 + return dev_err_probe(cmn->dev, -EBUSY, 2149 + "Failed to request DTC region 0x%pa\n", &base); 2141 2150 2142 2151 writel_relaxed(CMN_DT_DTC_CTL_DT_EN, dtc->base + CMN_DT_DTC_CTL); 2143 2152 writel_relaxed(CMN_DT_PMCR_PMU_EN | CMN_DT_PMCR_OVFL_INTR_EN, CMN_DT_PMCR(dtc)); ··· 2534 2525 return 0; 2535 2526 } 2536 2527 2537 - static int arm_cmn600_acpi_probe(struct platform_device *pdev, struct arm_cmn *cmn) 2528 + static int arm_cmn_get_root(struct arm_cmn *cmn, const struct resource *cfg) 2538 2529 { 2539 - struct resource *cfg, *root; 2540 - 2541 - cfg = platform_get_resource(pdev, IORESOURCE_MEM, 0); 2542 - if (!cfg) 2543 - return -EINVAL; 2544 - 2545 - root = platform_get_resource(pdev, IORESOURCE_MEM, 1); 2546 - if (!root) 2547 - return -EINVAL; 2548 - 2549 - if (!resource_contains(cfg, root)) 2550 - swap(cfg, root); 2551 - /* 2552 - * Note that devm_ioremap_resource() is dumb and won't let the platform 2553 - * device claim cfg when the ACPI companion device has already claimed 2554 - * root within it. But since they *are* already both claimed in the 2555 - * appropriate name, we don't really need to do it again here anyway. 
2556 - */ 2557 - cmn->base = devm_ioremap(cmn->dev, cfg->start, resource_size(cfg)); 2558 - if (!cmn->base) 2559 - return -ENOMEM; 2560 - 2561 - return root->start - cfg->start; 2562 - } 2563 - 2564 - static int arm_cmn600_of_probe(struct device_node *np) 2565 - { 2530 + const struct device_node *np = cmn->dev->of_node; 2531 + const struct resource *root; 2566 2532 u32 rootnode; 2567 2533 2568 - return of_property_read_u32(np, "arm,root-node", &rootnode) ?: rootnode; 2534 + if (cmn->part != PART_CMN600) 2535 + return 0; 2536 + 2537 + if (np) 2538 + return of_property_read_u32(np, "arm,root-node", &rootnode) ?: rootnode; 2539 + 2540 + root = platform_get_resource(to_platform_device(cmn->dev), IORESOURCE_MEM, 1); 2541 + return root ? root->start - cfg->start : -EINVAL; 2569 2542 } 2570 2543 2571 2544 static int arm_cmn_probe(struct platform_device *pdev) 2572 2545 { 2573 2546 struct arm_cmn *cmn; 2547 + const struct resource *cfg; 2574 2548 const char *name; 2575 2549 static atomic_t id; 2576 2550 int err, rootnode, this_id; ··· 2567 2575 cmn->cpu = cpumask_local_spread(0, dev_to_node(cmn->dev)); 2568 2576 platform_set_drvdata(pdev, cmn); 2569 2577 2570 - if (cmn->part == PART_CMN600 && has_acpi_companion(cmn->dev)) { 2571 - rootnode = arm_cmn600_acpi_probe(pdev, cmn); 2572 - } else { 2573 - rootnode = 0; 2574 - cmn->base = devm_platform_ioremap_resource(pdev, 0); 2575 - if (IS_ERR(cmn->base)) 2576 - return PTR_ERR(cmn->base); 2577 - if (cmn->part == PART_CMN600) 2578 - rootnode = arm_cmn600_of_probe(pdev->dev.of_node); 2579 - } 2578 + cfg = platform_get_resource(pdev, IORESOURCE_MEM, 0); 2579 + if (!cfg) 2580 + return -EINVAL; 2581 + 2582 + /* Map the whole region now, claim the DTCs once we've found them */ 2583 + cmn->base = devm_ioremap(cmn->dev, cfg->start, resource_size(cfg)); 2584 + if (!cmn->base) 2585 + return -ENOMEM; 2586 + 2587 + rootnode = arm_cmn_get_root(cmn, cfg); 2580 2588 if (rootnode < 0) 2581 2589 return rootnode; 2582 2590
+18 -1
drivers/perf/arm_cspmu/arm_cspmu.c
··· 16 16 * The user should refer to the vendor technical documentation to get details 17 17 * about the supported events. 18 18 * 19 - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 19 + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 20 20 * 21 21 */ 22 22 ··· 1134 1134 1135 1135 return 0; 1136 1136 } 1137 + 1138 + struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu) 1139 + { 1140 + char hid[16] = {}; 1141 + char uid[16] = {}; 1142 + const struct acpi_apmt_node *apmt_node; 1143 + 1144 + apmt_node = arm_cspmu_apmt_node(cspmu->dev); 1145 + if (!apmt_node || apmt_node->type != ACPI_APMT_NODE_TYPE_ACPI) 1146 + return NULL; 1147 + 1148 + memcpy(hid, &apmt_node->inst_primary, sizeof(apmt_node->inst_primary)); 1149 + snprintf(uid, sizeof(uid), "%u", apmt_node->inst_secondary); 1150 + 1151 + return acpi_dev_get_first_match_dev(hid, uid, -1); 1152 + } 1153 + EXPORT_SYMBOL_GPL(arm_cspmu_acpi_dev_get); 1137 1154 #else 1138 1155 static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu) 1139 1156 {
+16 -1
drivers/perf/arm_cspmu/arm_cspmu.h
··· 1 1 /* SPDX-License-Identifier: GPL-2.0 2 2 * 3 3 * ARM CoreSight Architecture PMU driver. 4 - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 4 + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 5 5 * 6 6 */ 7 7 8 8 #ifndef __ARM_CSPMU_H__ 9 9 #define __ARM_CSPMU_H__ 10 10 11 + #include <linux/acpi.h> 11 12 #include <linux/bitfield.h> 12 13 #include <linux/cpumask.h> 13 14 #include <linux/device.h> ··· 255 254 256 255 /* Unregister vendor backend. */ 257 256 void arm_cspmu_impl_unregister(const struct arm_cspmu_impl_match *impl_match); 257 + 258 + #if defined(CONFIG_ACPI) && defined(CONFIG_ARM64) 259 + /** 260 + * Get ACPI device associated with the PMU. 261 + * The caller is responsible for calling acpi_dev_put() on the returned device. 262 + */ 263 + struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu); 264 + #else 265 + static inline struct acpi_device * 266 + arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu) 267 + { 268 + return NULL; 269 + } 270 + #endif 258 271 259 272 #endif /* __ARM_CSPMU_H__ */
+612 -6
drivers/perf/arm_cspmu/nvidia_cspmu.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 /* 3 - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 3 + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 4 4 * 5 5 */ 6 6 ··· 8 8 9 9 #include <linux/io.h> 10 10 #include <linux/module.h> 11 + #include <linux/property.h> 11 12 #include <linux/topology.h> 12 13 13 14 #include "arm_cspmu.h" ··· 21 20 22 21 #define NV_CNVL_PORT_COUNT 4ULL 23 22 #define NV_CNVL_FILTER_ID_MASK GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0) 23 + 24 + #define NV_UCF_SRC_COUNT 3ULL 25 + #define NV_UCF_DST_COUNT 4ULL 26 + #define NV_UCF_FILTER_ID_MASK GENMASK_ULL(11, 0) 27 + #define NV_UCF_FILTER_SRC GENMASK_ULL(2, 0) 28 + #define NV_UCF_FILTER_DST GENMASK_ULL(11, 8) 29 + #define NV_UCF_FILTER_DEFAULT (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST) 30 + 31 + #define NV_PCIE_V2_PORT_COUNT 8ULL 32 + #define NV_PCIE_V2_FILTER_ID_MASK GENMASK_ULL(24, 0) 33 + #define NV_PCIE_V2_FILTER_PORT GENMASK_ULL(NV_PCIE_V2_PORT_COUNT - 1, 0) 34 + #define NV_PCIE_V2_FILTER_BDF_VAL GENMASK_ULL(23, NV_PCIE_V2_PORT_COUNT) 35 + #define NV_PCIE_V2_FILTER_BDF_EN BIT(24) 36 + #define NV_PCIE_V2_FILTER_BDF_VAL_EN GENMASK_ULL(24, NV_PCIE_V2_PORT_COUNT) 37 + #define NV_PCIE_V2_FILTER_DEFAULT NV_PCIE_V2_FILTER_PORT 38 + 39 + #define NV_PCIE_V2_DST_COUNT 5ULL 40 + #define NV_PCIE_V2_FILTER2_ID_MASK GENMASK_ULL(4, 0) 41 + #define NV_PCIE_V2_FILTER2_DST GENMASK_ULL(NV_PCIE_V2_DST_COUNT - 1, 0) 42 + #define NV_PCIE_V2_FILTER2_DEFAULT NV_PCIE_V2_FILTER2_DST 43 + 44 + #define NV_PCIE_TGT_PORT_COUNT 8ULL 45 + #define NV_PCIE_TGT_EV_TYPE_CC 0x4 46 + #define NV_PCIE_TGT_EV_TYPE_COUNT 3ULL 47 + #define NV_PCIE_TGT_EV_TYPE_MASK GENMASK_ULL(NV_PCIE_TGT_EV_TYPE_COUNT - 1, 0) 48 + #define NV_PCIE_TGT_FILTER2_MASK GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT, 0) 49 + #define NV_PCIE_TGT_FILTER2_PORT GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT - 1, 0) 50 + #define NV_PCIE_TGT_FILTER2_ADDR_EN BIT(NV_PCIE_TGT_PORT_COUNT) 51 + #define 
NV_PCIE_TGT_FILTER2_ADDR GENMASK_ULL(15, NV_PCIE_TGT_PORT_COUNT) 52 + #define NV_PCIE_TGT_FILTER2_DEFAULT NV_PCIE_TGT_FILTER2_PORT 53 + 54 + #define NV_PCIE_TGT_ADDR_COUNT 8ULL 55 + #define NV_PCIE_TGT_ADDR_STRIDE 20 56 + #define NV_PCIE_TGT_ADDR_CTRL 0xD38 57 + #define NV_PCIE_TGT_ADDR_BASE_LO 0xD3C 58 + #define NV_PCIE_TGT_ADDR_BASE_HI 0xD40 59 + #define NV_PCIE_TGT_ADDR_MASK_LO 0xD44 60 + #define NV_PCIE_TGT_ADDR_MASK_HI 0xD48 24 61 25 62 #define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0) 26 63 ··· 163 124 NULL, 164 125 }; 165 126 127 + static struct attribute *ucf_pmu_event_attrs[] = { 128 + ARM_CSPMU_EVENT_ATTR(bus_cycles, 0x1D), 129 + 130 + ARM_CSPMU_EVENT_ATTR(slc_allocate, 0xF0), 131 + ARM_CSPMU_EVENT_ATTR(slc_wb, 0xF3), 132 + ARM_CSPMU_EVENT_ATTR(slc_refill_rd, 0x109), 133 + ARM_CSPMU_EVENT_ATTR(slc_refill_wr, 0x10A), 134 + ARM_CSPMU_EVENT_ATTR(slc_hit_rd, 0x119), 135 + 136 + ARM_CSPMU_EVENT_ATTR(slc_access_dataless, 0x183), 137 + ARM_CSPMU_EVENT_ATTR(slc_access_atomic, 0x184), 138 + 139 + ARM_CSPMU_EVENT_ATTR(slc_access_rd, 0x111), 140 + ARM_CSPMU_EVENT_ATTR(slc_access_wr, 0x112), 141 + ARM_CSPMU_EVENT_ATTR(slc_bytes_rd, 0x113), 142 + ARM_CSPMU_EVENT_ATTR(slc_bytes_wr, 0x114), 143 + 144 + ARM_CSPMU_EVENT_ATTR(mem_access_rd, 0x121), 145 + ARM_CSPMU_EVENT_ATTR(mem_access_wr, 0x122), 146 + ARM_CSPMU_EVENT_ATTR(mem_bytes_rd, 0x123), 147 + ARM_CSPMU_EVENT_ATTR(mem_bytes_wr, 0x124), 148 + 149 + ARM_CSPMU_EVENT_ATTR(local_snoop, 0x180), 150 + ARM_CSPMU_EVENT_ATTR(ext_snp_access, 0x181), 151 + ARM_CSPMU_EVENT_ATTR(ext_snp_evict, 0x182), 152 + 153 + ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT), 154 + NULL 155 + }; 156 + 157 + static struct attribute *pcie_v2_pmu_event_attrs[] = { 158 + ARM_CSPMU_EVENT_ATTR(rd_bytes, 0x0), 159 + ARM_CSPMU_EVENT_ATTR(wr_bytes, 0x1), 160 + ARM_CSPMU_EVENT_ATTR(rd_req, 0x2), 161 + ARM_CSPMU_EVENT_ATTR(wr_req, 0x3), 162 + ARM_CSPMU_EVENT_ATTR(rd_cum_outs, 0x4), 163 + ARM_CSPMU_EVENT_ATTR(cycles, 
ARM_CSPMU_EVT_CYCLES_DEFAULT), 164 + NULL 165 + }; 166 + 167 + static struct attribute *pcie_tgt_pmu_event_attrs[] = { 168 + ARM_CSPMU_EVENT_ATTR(rd_bytes, 0x0), 169 + ARM_CSPMU_EVENT_ATTR(wr_bytes, 0x1), 170 + ARM_CSPMU_EVENT_ATTR(rd_req, 0x2), 171 + ARM_CSPMU_EVENT_ATTR(wr_req, 0x3), 172 + ARM_CSPMU_EVENT_ATTR(cycles, NV_PCIE_TGT_EV_TYPE_CC), 173 + NULL 174 + }; 175 + 166 176 static struct attribute *generic_pmu_event_attrs[] = { 167 177 ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT), 168 178 NULL, ··· 238 150 ARM_CSPMU_FORMAT_EVENT_ATTR, 239 151 ARM_CSPMU_FORMAT_ATTR(rem_socket, "config1:0-3"), 240 152 NULL, 153 + }; 154 + 155 + static struct attribute *ucf_pmu_format_attrs[] = { 156 + ARM_CSPMU_FORMAT_EVENT_ATTR, 157 + ARM_CSPMU_FORMAT_ATTR(src_loc_noncpu, "config1:0"), 158 + ARM_CSPMU_FORMAT_ATTR(src_loc_cpu, "config1:1"), 159 + ARM_CSPMU_FORMAT_ATTR(src_rem, "config1:2"), 160 + ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config1:8"), 161 + ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config1:9"), 162 + ARM_CSPMU_FORMAT_ATTR(dst_loc_other, "config1:10"), 163 + ARM_CSPMU_FORMAT_ATTR(dst_rem, "config1:11"), 164 + NULL 165 + }; 166 + 167 + static struct attribute *pcie_v2_pmu_format_attrs[] = { 168 + ARM_CSPMU_FORMAT_EVENT_ATTR, 169 + ARM_CSPMU_FORMAT_ATTR(src_rp_mask, "config1:0-7"), 170 + ARM_CSPMU_FORMAT_ATTR(src_bdf, "config1:8-23"), 171 + ARM_CSPMU_FORMAT_ATTR(src_bdf_en, "config1:24"), 172 + ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config2:0"), 173 + ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config2:1"), 174 + ARM_CSPMU_FORMAT_ATTR(dst_loc_pcie_p2p, "config2:2"), 175 + ARM_CSPMU_FORMAT_ATTR(dst_loc_pcie_cxl, "config2:3"), 176 + ARM_CSPMU_FORMAT_ATTR(dst_rem, "config2:4"), 177 + NULL 178 + }; 179 + 180 + static struct attribute *pcie_tgt_pmu_format_attrs[] = { 181 + ARM_CSPMU_FORMAT_ATTR(event, "config:0-2"), 182 + ARM_CSPMU_FORMAT_ATTR(dst_rp_mask, "config:3-10"), 183 + ARM_CSPMU_FORMAT_ATTR(dst_addr_en, "config:11"), 184 + ARM_CSPMU_FORMAT_ATTR(dst_addr_base, 
"config1:0-63"), 185 + ARM_CSPMU_FORMAT_ATTR(dst_addr_mask, "config2:0-63"), 186 + NULL 241 187 }; 242 188 243 189 static struct attribute *generic_pmu_format_attrs[] = { ··· 304 182 305 183 return ctx->name; 306 184 } 185 + 186 + #if defined(CONFIG_ACPI) && defined(CONFIG_ARM64) 187 + static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id) 188 + { 189 + struct fwnode_handle *fwnode; 190 + struct acpi_device *adev; 191 + int ret; 192 + 193 + adev = arm_cspmu_acpi_dev_get(cspmu); 194 + if (!adev) 195 + return -ENODEV; 196 + 197 + fwnode = acpi_fwnode_handle(adev); 198 + ret = fwnode_property_read_u32(fwnode, "instance_id", id); 199 + if (ret) 200 + dev_err(cspmu->dev, "Failed to get instance ID\n"); 201 + 202 + acpi_dev_put(adev); 203 + return ret; 204 + } 205 + #else 206 + static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id) 207 + { 208 + return -EINVAL; 209 + } 210 + #endif 307 211 308 212 static u32 nv_cspmu_event_filter(const struct perf_event *event) 309 213 { ··· 376 228 } 377 229 } 378 230 231 + static void nv_cspmu_reset_ev_filter(struct arm_cspmu *cspmu, 232 + const struct perf_event *event) 233 + { 234 + const struct nv_cspmu_ctx *ctx = 235 + to_nv_cspmu_ctx(to_arm_cspmu(event->pmu)); 236 + const u32 offset = 4 * event->hw.idx; 237 + 238 + if (ctx->get_filter) 239 + writel(0, cspmu->base0 + PMEVFILTR + offset); 240 + 241 + if (ctx->get_filter2) 242 + writel(0, cspmu->base0 + PMEVFILT2R + offset); 243 + } 244 + 379 245 static void nv_cspmu_set_cc_filter(struct arm_cspmu *cspmu, 380 246 const struct perf_event *event) 381 247 { ··· 398 236 writel(filter, cspmu->base0 + PMCCFILTR); 399 237 } 400 238 239 + static u32 ucf_pmu_event_filter(const struct perf_event *event) 240 + { 241 + u32 ret, filter, src, dst; 242 + 243 + filter = nv_cspmu_event_filter(event); 244 + 245 + /* Monitor all sources if none is selected. 
*/ 246 + src = FIELD_GET(NV_UCF_FILTER_SRC, filter); 247 + if (src == 0) 248 + src = GENMASK_ULL(NV_UCF_SRC_COUNT - 1, 0); 249 + 250 + /* Monitor all destinations if none is selected. */ 251 + dst = FIELD_GET(NV_UCF_FILTER_DST, filter); 252 + if (dst == 0) 253 + dst = GENMASK_ULL(NV_UCF_DST_COUNT - 1, 0); 254 + 255 + ret = FIELD_PREP(NV_UCF_FILTER_SRC, src); 256 + ret |= FIELD_PREP(NV_UCF_FILTER_DST, dst); 257 + 258 + return ret; 259 + } 260 + 261 + static u32 pcie_v2_pmu_bdf_val_en(u32 filter) 262 + { 263 + const u32 bdf_en = FIELD_GET(NV_PCIE_V2_FILTER_BDF_EN, filter); 264 + 265 + /* Returns both BDF value and enable bit if BDF filtering is enabled. */ 266 + if (bdf_en) 267 + return FIELD_GET(NV_PCIE_V2_FILTER_BDF_VAL_EN, filter); 268 + 269 + /* Ignore the BDF value if BDF filter is not enabled. */ 270 + return 0; 271 + } 272 + 273 + static u32 pcie_v2_pmu_event_filter(const struct perf_event *event) 274 + { 275 + u32 filter, lead_filter, lead_bdf; 276 + struct perf_event *leader; 277 + const struct nv_cspmu_ctx *ctx = 278 + to_nv_cspmu_ctx(to_arm_cspmu(event->pmu)); 279 + 280 + filter = event->attr.config1 & ctx->filter_mask; 281 + if (filter != 0) 282 + return filter; 283 + 284 + leader = event->group_leader; 285 + 286 + /* Use leader's filter value if its BDF filtering is enabled. */ 287 + if (event != leader) { 288 + lead_filter = pcie_v2_pmu_event_filter(leader); 289 + lead_bdf = pcie_v2_pmu_bdf_val_en(lead_filter); 290 + if (lead_bdf != 0) 291 + return lead_filter; 292 + } 293 + 294 + /* Otherwise, return default filter value. */ 295 + return ctx->filter_default_val; 296 + } 297 + 298 + static int pcie_v2_pmu_validate_event(struct arm_cspmu *cspmu, 299 + struct perf_event *new_ev) 300 + { 301 + /* 302 + * Make sure the events are using same BDF filter since the PCIE-SRC PMU 303 + * only supports one common BDF filter setting for all of the counters. 
304 + */ 305 + 306 + int idx; 307 + u32 new_filter, new_rp, new_bdf, new_lead_filter, new_lead_bdf; 308 + struct perf_event *new_leader; 309 + 310 + if (cspmu->impl.ops.is_cycle_counter_event(new_ev)) 311 + return 0; 312 + 313 + new_leader = new_ev->group_leader; 314 + 315 + new_filter = pcie_v2_pmu_event_filter(new_ev); 316 + new_lead_filter = pcie_v2_pmu_event_filter(new_leader); 317 + 318 + new_bdf = pcie_v2_pmu_bdf_val_en(new_filter); 319 + new_lead_bdf = pcie_v2_pmu_bdf_val_en(new_lead_filter); 320 + 321 + new_rp = FIELD_GET(NV_PCIE_V2_FILTER_PORT, new_filter); 322 + 323 + if (new_rp != 0 && new_bdf != 0) { 324 + dev_err(cspmu->dev, 325 + "RP and BDF filtering are mutually exclusive\n"); 326 + return -EINVAL; 327 + } 328 + 329 + if (new_bdf != new_lead_bdf) { 330 + dev_err(cspmu->dev, 331 + "sibling and leader BDF value should be equal\n"); 332 + return -EINVAL; 333 + } 334 + 335 + /* Compare BDF filter on existing events. */ 336 + idx = find_first_bit(cspmu->hw_events.used_ctrs, 337 + cspmu->cycle_counter_logical_idx); 338 + 339 + if (idx != cspmu->cycle_counter_logical_idx) { 340 + struct perf_event *leader = cspmu->hw_events.events[idx]->group_leader; 341 + 342 + const u32 lead_filter = pcie_v2_pmu_event_filter(leader); 343 + const u32 lead_bdf = pcie_v2_pmu_bdf_val_en(lead_filter); 344 + 345 + if (new_lead_bdf != lead_bdf) { 346 + dev_err(cspmu->dev, "only one BDF value is supported\n"); 347 + return -EINVAL; 348 + } 349 + } 350 + 351 + return 0; 352 + } 353 + 354 + struct pcie_tgt_addr_filter { 355 + u32 refcount; 356 + u64 base; 357 + u64 mask; 358 + }; 359 + 360 + struct pcie_tgt_data { 361 + struct pcie_tgt_addr_filter addr_filter[NV_PCIE_TGT_ADDR_COUNT]; 362 + void __iomem *addr_filter_reg; 363 + }; 364 + 365 + #if defined(CONFIG_ACPI) && defined(CONFIG_ARM64) 366 + static int pcie_tgt_init_data(struct arm_cspmu *cspmu) 367 + { 368 + int ret; 369 + struct acpi_device *adev; 370 + struct pcie_tgt_data *data; 371 + struct list_head resource_list; 372 + 
struct resource_entry *rentry; 373 + struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu); 374 + struct device *dev = cspmu->dev; 375 + 376 + data = devm_kzalloc(dev, sizeof(struct pcie_tgt_data), GFP_KERNEL); 377 + if (!data) 378 + return -ENOMEM; 379 + 380 + adev = arm_cspmu_acpi_dev_get(cspmu); 381 + if (!adev) { 382 + dev_err(dev, "failed to get associated PCIE-TGT device\n"); 383 + return -ENODEV; 384 + } 385 + 386 + INIT_LIST_HEAD(&resource_list); 387 + ret = acpi_dev_get_memory_resources(adev, &resource_list); 388 + if (ret < 0) { 389 + dev_err(dev, "failed to get PCIE-TGT device memory resources\n"); 390 + acpi_dev_put(adev); 391 + return ret; 392 + } 393 + 394 + rentry = list_first_entry_or_null( 395 + &resource_list, struct resource_entry, node); 396 + if (rentry) { 397 + data->addr_filter_reg = devm_ioremap_resource(dev, rentry->res); 398 + ret = 0; 399 + } 400 + 401 + if (IS_ERR(data->addr_filter_reg)) { 402 + dev_err(dev, "failed to get address filter resource\n"); 403 + ret = PTR_ERR(data->addr_filter_reg); 404 + } 405 + 406 + acpi_dev_free_resource_list(&resource_list); 407 + acpi_dev_put(adev); 408 + 409 + ctx->data = data; 410 + 411 + return ret; 412 + } 413 + #else 414 + static int pcie_tgt_init_data(struct arm_cspmu *cspmu) 415 + { 416 + return -ENODEV; 417 + } 418 + #endif 419 + 420 + static struct pcie_tgt_data *pcie_tgt_get_data(struct arm_cspmu *cspmu) 421 + { 422 + struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu); 423 + 424 + return ctx->data; 425 + } 426 + 427 + /* Find the first available address filter slot. 
*/ 428 + static int pcie_tgt_find_addr_idx(struct arm_cspmu *cspmu, u64 base, u64 mask, 429 + bool is_reset) 430 + { 431 + int i; 432 + struct pcie_tgt_data *data = pcie_tgt_get_data(cspmu); 433 + 434 + for (i = 0; i < NV_PCIE_TGT_ADDR_COUNT; i++) { 435 + if (!is_reset && data->addr_filter[i].refcount == 0) 436 + return i; 437 + 438 + if (data->addr_filter[i].base == base && 439 + data->addr_filter[i].mask == mask) 440 + return i; 441 + } 442 + 443 + return -ENODEV; 444 + } 445 + 446 + static u32 pcie_tgt_pmu_event_filter(const struct perf_event *event) 447 + { 448 + u32 filter; 449 + 450 + filter = (event->attr.config >> NV_PCIE_TGT_EV_TYPE_COUNT) & 451 + NV_PCIE_TGT_FILTER2_MASK; 452 + 453 + return filter; 454 + } 455 + 456 + static bool pcie_tgt_pmu_addr_en(const struct perf_event *event) 457 + { 458 + u32 filter = pcie_tgt_pmu_event_filter(event); 459 + 460 + return FIELD_GET(NV_PCIE_TGT_FILTER2_ADDR_EN, filter) != 0; 461 + } 462 + 463 + static u32 pcie_tgt_pmu_port_filter(const struct perf_event *event) 464 + { 465 + u32 filter = pcie_tgt_pmu_event_filter(event); 466 + 467 + return FIELD_GET(NV_PCIE_TGT_FILTER2_PORT, filter); 468 + } 469 + 470 + static u64 pcie_tgt_pmu_dst_addr_base(const struct perf_event *event) 471 + { 472 + return event->attr.config1; 473 + } 474 + 475 + static u64 pcie_tgt_pmu_dst_addr_mask(const struct perf_event *event) 476 + { 477 + return event->attr.config2; 478 + } 479 + 480 + static int pcie_tgt_pmu_validate_event(struct arm_cspmu *cspmu, 481 + struct perf_event *new_ev) 482 + { 483 + u64 base, mask; 484 + int idx; 485 + 486 + if (!pcie_tgt_pmu_addr_en(new_ev)) 487 + return 0; 488 + 489 + /* Make sure there is a slot available for the address filter. 
*/ 490 + base = pcie_tgt_pmu_dst_addr_base(new_ev); 491 + mask = pcie_tgt_pmu_dst_addr_mask(new_ev); 492 + idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false); 493 + if (idx < 0) 494 + return -EINVAL; 495 + 496 + return 0; 497 + } 498 + 499 + static void pcie_tgt_pmu_config_addr_filter(struct arm_cspmu *cspmu, 500 + bool en, u64 base, u64 mask, int idx) 501 + { 502 + struct pcie_tgt_data *data; 503 + struct pcie_tgt_addr_filter *filter; 504 + void __iomem *filter_reg; 505 + 506 + data = pcie_tgt_get_data(cspmu); 507 + filter = &data->addr_filter[idx]; 508 + filter_reg = data->addr_filter_reg + (idx * NV_PCIE_TGT_ADDR_STRIDE); 509 + 510 + if (en) { 511 + filter->refcount++; 512 + if (filter->refcount == 1) { 513 + filter->base = base; 514 + filter->mask = mask; 515 + 516 + writel(lower_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_LO); 517 + writel(upper_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_HI); 518 + writel(lower_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_LO); 519 + writel(upper_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_HI); 520 + writel(1, filter_reg + NV_PCIE_TGT_ADDR_CTRL); 521 + } 522 + } else { 523 + filter->refcount--; 524 + if (filter->refcount == 0) { 525 + writel(0, filter_reg + NV_PCIE_TGT_ADDR_CTRL); 526 + writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_LO); 527 + writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_HI); 528 + writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_LO); 529 + writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_HI); 530 + 531 + filter->base = 0; 532 + filter->mask = 0; 533 + } 534 + } 535 + } 536 + 537 + static void pcie_tgt_pmu_set_ev_filter(struct arm_cspmu *cspmu, 538 + const struct perf_event *event) 539 + { 540 + bool addr_filter_en; 541 + int idx; 542 + u32 filter2_val, filter2_offset, port_filter; 543 + u64 base, mask; 544 + 545 + filter2_val = 0; 546 + filter2_offset = PMEVFILT2R + (4 * event->hw.idx); 547 + 548 + addr_filter_en = pcie_tgt_pmu_addr_en(event); 549 + if (addr_filter_en) { 550 + base = 
pcie_tgt_pmu_dst_addr_base(event); 551 + mask = pcie_tgt_pmu_dst_addr_mask(event); 552 + idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false); 553 + 554 + if (idx < 0) { 555 + dev_err(cspmu->dev, 556 + "Unable to find a slot for address filtering\n"); 557 + writel(0, cspmu->base0 + filter2_offset); 558 + return; 559 + } 560 + 561 + /* Configure address range filter registers.*/ 562 + pcie_tgt_pmu_config_addr_filter(cspmu, true, base, mask, idx); 563 + 564 + /* Config the counter to use the selected address filter slot. */ 565 + filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_ADDR, 1U << idx); 566 + } 567 + 568 + port_filter = pcie_tgt_pmu_port_filter(event); 569 + 570 + /* Monitor all ports if no filter is selected. */ 571 + if (!addr_filter_en && port_filter == 0) 572 + port_filter = NV_PCIE_TGT_FILTER2_PORT; 573 + 574 + filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_PORT, port_filter); 575 + 576 + writel(filter2_val, cspmu->base0 + filter2_offset); 577 + } 578 + 579 + static void pcie_tgt_pmu_reset_ev_filter(struct arm_cspmu *cspmu, 580 + const struct perf_event *event) 581 + { 582 + bool addr_filter_en; 583 + u64 base, mask; 584 + int idx; 585 + 586 + addr_filter_en = pcie_tgt_pmu_addr_en(event); 587 + if (!addr_filter_en) 588 + return; 589 + 590 + base = pcie_tgt_pmu_dst_addr_base(event); 591 + mask = pcie_tgt_pmu_dst_addr_mask(event); 592 + idx = pcie_tgt_find_addr_idx(cspmu, base, mask, true); 593 + 594 + if (idx < 0) { 595 + dev_err(cspmu->dev, 596 + "Unable to find the address filter slot to reset\n"); 597 + return; 598 + } 599 + 600 + pcie_tgt_pmu_config_addr_filter(cspmu, false, base, mask, idx); 601 + } 602 + 603 + static u32 pcie_tgt_pmu_event_type(const struct perf_event *event) 604 + { 605 + return event->attr.config & NV_PCIE_TGT_EV_TYPE_MASK; 606 + } 607 + 608 + static bool pcie_tgt_pmu_is_cycle_counter_event(const struct perf_event *event) 609 + { 610 + u32 event_type = pcie_tgt_pmu_event_type(event); 611 + 612 + return event_type == 
NV_PCIE_TGT_EV_TYPE_CC; 613 + } 401 614 402 615 enum nv_cspmu_name_fmt { 403 616 NAME_FMT_GENERIC, 404 - NAME_FMT_SOCKET 617 + NAME_FMT_SOCKET, 618 + NAME_FMT_SOCKET_INST, 405 619 }; 406 620 407 621 struct nv_cspmu_match { ··· 881 343 }, 882 344 }, 883 345 { 346 + .prodid = 0x2CF20000, 347 + .prodid_mask = NV_PRODID_MASK, 348 + .name_pattern = "nvidia_ucf_pmu_%u", 349 + .name_fmt = NAME_FMT_SOCKET, 350 + .template_ctx = { 351 + .event_attr = ucf_pmu_event_attrs, 352 + .format_attr = ucf_pmu_format_attrs, 353 + .filter_mask = NV_UCF_FILTER_ID_MASK, 354 + .filter_default_val = NV_UCF_FILTER_DEFAULT, 355 + .filter2_mask = 0x0, 356 + .filter2_default_val = 0x0, 357 + .get_filter = ucf_pmu_event_filter, 358 + }, 359 + }, 360 + { 361 + .prodid = 0x10301000, 362 + .prodid_mask = NV_PRODID_MASK, 363 + .name_pattern = "nvidia_pcie_pmu_%u_rc_%u", 364 + .name_fmt = NAME_FMT_SOCKET_INST, 365 + .template_ctx = { 366 + .event_attr = pcie_v2_pmu_event_attrs, 367 + .format_attr = pcie_v2_pmu_format_attrs, 368 + .filter_mask = NV_PCIE_V2_FILTER_ID_MASK, 369 + .filter_default_val = NV_PCIE_V2_FILTER_DEFAULT, 370 + .filter2_mask = NV_PCIE_V2_FILTER2_ID_MASK, 371 + .filter2_default_val = NV_PCIE_V2_FILTER2_DEFAULT, 372 + .get_filter = pcie_v2_pmu_event_filter, 373 + .get_filter2 = nv_cspmu_event_filter2, 374 + }, 375 + .ops = { 376 + .validate_event = pcie_v2_pmu_validate_event, 377 + .reset_ev_filter = nv_cspmu_reset_ev_filter, 378 + } 379 + }, 380 + { 381 + .prodid = 0x10700000, 382 + .prodid_mask = NV_PRODID_MASK, 383 + .name_pattern = "nvidia_pcie_tgt_pmu_%u_rc_%u", 384 + .name_fmt = NAME_FMT_SOCKET_INST, 385 + .template_ctx = { 386 + .event_attr = pcie_tgt_pmu_event_attrs, 387 + .format_attr = pcie_tgt_pmu_format_attrs, 388 + .filter_mask = 0x0, 389 + .filter_default_val = 0x0, 390 + .filter2_mask = NV_PCIE_TGT_FILTER2_MASK, 391 + .filter2_default_val = NV_PCIE_TGT_FILTER2_DEFAULT, 392 + .init_data = pcie_tgt_init_data 393 + }, 394 + .ops = { 395 + .is_cycle_counter_event = 
pcie_tgt_pmu_is_cycle_counter_event, 396 + .event_type = pcie_tgt_pmu_event_type, 397 + .validate_event = pcie_tgt_pmu_validate_event, 398 + .set_ev_filter = pcie_tgt_pmu_set_ev_filter, 399 + .reset_ev_filter = pcie_tgt_pmu_reset_ev_filter, 400 + } 401 + }, 402 + { 884 403 .prodid = 0, 885 404 .prodid_mask = 0, 886 405 .name_pattern = "nvidia_uncore_pmu_%u", ··· 960 365 static char *nv_cspmu_format_name(const struct arm_cspmu *cspmu, 961 366 const struct nv_cspmu_match *match) 962 367 { 963 - char *name; 368 + char *name = NULL; 964 369 struct device *dev = cspmu->dev; 965 370 966 371 static atomic_t pmu_generic_idx = {0}; ··· 974 379 socket); 975 380 break; 976 381 } 382 + case NAME_FMT_SOCKET_INST: { 383 + const int cpu = cpumask_first(&cspmu->associated_cpus); 384 + const int socket = cpu_to_node(cpu); 385 + u32 inst_id; 386 + 387 + if (!nv_cspmu_get_inst_id(cspmu, &inst_id)) 388 + name = devm_kasprintf(dev, GFP_KERNEL, 389 + match->name_pattern, socket, inst_id); 390 + break; 391 + } 977 392 case NAME_FMT_GENERIC: 978 393 name = devm_kasprintf(dev, GFP_KERNEL, match->name_pattern, 979 394 atomic_fetch_inc(&pmu_generic_idx)); 980 - break; 981 - default: 982 - name = NULL; 983 395 break; 984 396 } 985 397 ··· 1028 426 cspmu->impl.ctx = ctx; 1029 427 1030 428 /* NVIDIA specific callbacks. */ 429 + SET_OP(validate_event, impl_ops, match, NULL); 430 + SET_OP(event_type, impl_ops, match, NULL); 431 + SET_OP(is_cycle_counter_event, impl_ops, match, NULL); 1031 432 SET_OP(set_cc_filter, impl_ops, match, nv_cspmu_set_cc_filter); 1032 433 SET_OP(set_ev_filter, impl_ops, match, nv_cspmu_set_ev_filter); 434 + SET_OP(reset_ev_filter, impl_ops, match, NULL); 1033 435 SET_OP(get_event_attrs, impl_ops, match, nv_cspmu_get_event_attrs); 1034 436 SET_OP(get_format_attrs, impl_ops, match, nv_cspmu_get_format_attrs); 1035 437 SET_OP(get_name, impl_ops, match, nv_cspmu_get_name);
drivers/perf/nvidia_t410_c2c_pmu.c (new file, +1051 lines)
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * NVIDIA Tegra410 C2C PMU driver. 4 + * 5 + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 6 + */ 7 + 8 + #include <linux/acpi.h> 9 + #include <linux/bitops.h> 10 + #include <linux/cpumask.h> 11 + #include <linux/device.h> 12 + #include <linux/interrupt.h> 13 + #include <linux/io.h> 14 + #include <linux/module.h> 15 + #include <linux/perf_event.h> 16 + #include <linux/platform_device.h> 17 + #include <linux/property.h> 18 + 19 + /* The C2C interface types in Tegra410. */ 20 + #define C2C_TYPE_NVLINK 0x0 21 + #define C2C_TYPE_NVCLINK 0x1 22 + #define C2C_TYPE_NVDLINK 0x2 23 + #define C2C_TYPE_COUNT 0x3 24 + 25 + /* The type of the peer device connected to the C2C interface. */ 26 + #define C2C_PEER_TYPE_CPU 0x0 27 + #define C2C_PEER_TYPE_GPU 0x1 28 + #define C2C_PEER_TYPE_CXLMEM 0x2 29 + #define C2C_PEER_TYPE_COUNT 0x3 30 + 31 + /* The number of peer devices can be connected to the C2C interface. */ 32 + #define C2C_NR_PEER_CPU 0x1 33 + #define C2C_NR_PEER_GPU 0x2 34 + #define C2C_NR_PEER_CXLMEM 0x1 35 + #define C2C_NR_PEER_MAX 0x2 36 + 37 + /* Number of instances on each interface. */ 38 + #define C2C_NR_INST_NVLINK 14 39 + #define C2C_NR_INST_NVCLINK 12 40 + #define C2C_NR_INST_NVDLINK 16 41 + #define C2C_NR_INST_MAX 16 42 + 43 + /* Register offsets. */ 44 + #define C2C_CTRL 0x864 45 + #define C2C_IN_STATUS 0x868 46 + #define C2C_CYCLE_CNTR 0x86c 47 + #define C2C_IN_RD_CUM_OUTS_CNTR 0x874 48 + #define C2C_IN_RD_REQ_CNTR 0x87c 49 + #define C2C_IN_WR_CUM_OUTS_CNTR 0x884 50 + #define C2C_IN_WR_REQ_CNTR 0x88c 51 + #define C2C_OUT_STATUS 0x890 52 + #define C2C_OUT_RD_CUM_OUTS_CNTR 0x898 53 + #define C2C_OUT_RD_REQ_CNTR 0x8a0 54 + #define C2C_OUT_WR_CUM_OUTS_CNTR 0x8a8 55 + #define C2C_OUT_WR_REQ_CNTR 0x8b0 56 + 57 + /* C2C_IN_STATUS register field. 
*/ 58 + #define C2C_IN_STATUS_CYCLE_OVF BIT(0) 59 + #define C2C_IN_STATUS_IN_RD_CUM_OUTS_OVF BIT(1) 60 + #define C2C_IN_STATUS_IN_RD_REQ_OVF BIT(2) 61 + #define C2C_IN_STATUS_IN_WR_CUM_OUTS_OVF BIT(3) 62 + #define C2C_IN_STATUS_IN_WR_REQ_OVF BIT(4) 63 + 64 + /* C2C_OUT_STATUS register field. */ 65 + #define C2C_OUT_STATUS_OUT_RD_CUM_OUTS_OVF BIT(0) 66 + #define C2C_OUT_STATUS_OUT_RD_REQ_OVF BIT(1) 67 + #define C2C_OUT_STATUS_OUT_WR_CUM_OUTS_OVF BIT(2) 68 + #define C2C_OUT_STATUS_OUT_WR_REQ_OVF BIT(3) 69 + 70 + /* Events. */ 71 + #define C2C_EVENT_CYCLES 0x0 72 + #define C2C_EVENT_IN_RD_CUM_OUTS 0x1 73 + #define C2C_EVENT_IN_RD_REQ 0x2 74 + #define C2C_EVENT_IN_WR_CUM_OUTS 0x3 75 + #define C2C_EVENT_IN_WR_REQ 0x4 76 + #define C2C_EVENT_OUT_RD_CUM_OUTS 0x5 77 + #define C2C_EVENT_OUT_RD_REQ 0x6 78 + #define C2C_EVENT_OUT_WR_CUM_OUTS 0x7 79 + #define C2C_EVENT_OUT_WR_REQ 0x8 80 + 81 + #define C2C_NUM_EVENTS 0x9 82 + #define C2C_MASK_EVENT 0xFF 83 + #define C2C_MAX_ACTIVE_EVENTS 32 84 + 85 + #define C2C_ACTIVE_CPU_MASK 0x0 86 + #define C2C_ASSOCIATED_CPU_MASK 0x1 87 + 88 + /* 89 + * Maximum poll count for reading counter value using high-low-high sequence. 90 + */ 91 + #define HILOHI_MAX_POLL 1000 92 + 93 + static unsigned long nv_c2c_pmu_cpuhp_state; 94 + 95 + /* PMU descriptor. */ 96 + 97 + /* C2C type information. 
*/ 98 + struct nv_c2c_pmu_data { 99 + unsigned int c2c_type; 100 + unsigned int nr_inst; 101 + const char *name_fmt; 102 + }; 103 + 104 + static const struct nv_c2c_pmu_data nv_c2c_pmu_data[] = { 105 + [C2C_TYPE_NVLINK] = { 106 + .c2c_type = C2C_TYPE_NVLINK, 107 + .nr_inst = C2C_NR_INST_NVLINK, 108 + .name_fmt = "nvidia_nvlink_c2c_pmu_%u", 109 + }, 110 + [C2C_TYPE_NVCLINK] = { 111 + .c2c_type = C2C_TYPE_NVCLINK, 112 + .nr_inst = C2C_NR_INST_NVCLINK, 113 + .name_fmt = "nvidia_nvclink_pmu_%u", 114 + }, 115 + [C2C_TYPE_NVDLINK] = { 116 + .c2c_type = C2C_TYPE_NVDLINK, 117 + .nr_inst = C2C_NR_INST_NVDLINK, 118 + .name_fmt = "nvidia_nvdlink_pmu_%u", 119 + }, 120 + }; 121 + 122 + /* Tracks the events assigned to the PMU for a given logical index. */ 123 + struct nv_c2c_pmu_hw_events { 124 + /* The events that are active. */ 125 + struct perf_event *events[C2C_MAX_ACTIVE_EVENTS]; 126 + 127 + /* 128 + * Each bit indicates a logical counter is being used (or not) for an 129 + * event. 130 + */ 131 + DECLARE_BITMAP(used_ctrs, C2C_MAX_ACTIVE_EVENTS); 132 + }; 133 + 134 + struct nv_c2c_pmu { 135 + struct pmu pmu; 136 + struct device *dev; 137 + struct acpi_device *acpi_dev; 138 + 139 + const char *name; 140 + const char *identifier; 141 + 142 + const struct nv_c2c_pmu_data *data; 143 + unsigned int peer_type; 144 + unsigned int socket; 145 + unsigned int nr_peer; 146 + unsigned long peer_insts[C2C_NR_PEER_MAX][BITS_TO_LONGS(C2C_NR_INST_MAX)]; 147 + u32 filter_default; 148 + 149 + struct nv_c2c_pmu_hw_events hw_events; 150 + 151 + cpumask_t associated_cpus; 152 + cpumask_t active_cpu; 153 + 154 + struct hlist_node cpuhp_node; 155 + 156 + const struct attribute_group **attr_groups; 157 + 158 + void __iomem *base_broadcast; 159 + void __iomem *base[C2C_NR_INST_MAX]; 160 + }; 161 + 162 + #define to_c2c_pmu(p) (container_of(p, struct nv_c2c_pmu, pmu)) 163 + 164 + /* Get event type from perf_event. 
*/ 165 + static inline u32 get_event_type(struct perf_event *event) 166 + { 167 + return (event->attr.config) & C2C_MASK_EVENT; 168 + } 169 + 170 + static inline u32 get_filter_mask(struct perf_event *event) 171 + { 172 + u32 filter; 173 + struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu); 174 + 175 + filter = ((u32)event->attr.config1) & c2c_pmu->filter_default; 176 + if (filter == 0) 177 + filter = c2c_pmu->filter_default; 178 + 179 + return filter; 180 + } 181 + 182 + /* PMU operations. */ 183 + 184 + static int nv_c2c_pmu_get_event_idx(struct nv_c2c_pmu_hw_events *hw_events, 185 + struct perf_event *event) 186 + { 187 + u32 idx; 188 + 189 + idx = find_first_zero_bit(hw_events->used_ctrs, C2C_MAX_ACTIVE_EVENTS); 190 + if (idx >= C2C_MAX_ACTIVE_EVENTS) 191 + return -EAGAIN; 192 + 193 + set_bit(idx, hw_events->used_ctrs); 194 + 195 + return idx; 196 + } 197 + 198 + static bool 199 + nv_c2c_pmu_validate_event(struct pmu *pmu, 200 + struct nv_c2c_pmu_hw_events *hw_events, 201 + struct perf_event *event) 202 + { 203 + if (is_software_event(event)) 204 + return true; 205 + 206 + /* Reject groups spanning multiple HW PMUs. */ 207 + if (event->pmu != pmu) 208 + return false; 209 + 210 + return nv_c2c_pmu_get_event_idx(hw_events, event) >= 0; 211 + } 212 + 213 + /* 214 + * Make sure the group of events can be scheduled at once 215 + * on the PMU. 
216 + */ 217 + static bool nv_c2c_pmu_validate_group(struct perf_event *event) 218 + { 219 + struct perf_event *sibling, *leader = event->group_leader; 220 + struct nv_c2c_pmu_hw_events fake_hw_events; 221 + 222 + if (event->group_leader == event) 223 + return true; 224 + 225 + memset(&fake_hw_events, 0, sizeof(fake_hw_events)); 226 + 227 + if (!nv_c2c_pmu_validate_event(event->pmu, &fake_hw_events, leader)) 228 + return false; 229 + 230 + for_each_sibling_event(sibling, leader) { 231 + if (!nv_c2c_pmu_validate_event(event->pmu, &fake_hw_events, 232 + sibling)) 233 + return false; 234 + } 235 + 236 + return nv_c2c_pmu_validate_event(event->pmu, &fake_hw_events, event); 237 + } 238 + 239 + static int nv_c2c_pmu_event_init(struct perf_event *event) 240 + { 241 + struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu); 242 + struct hw_perf_event *hwc = &event->hw; 243 + u32 event_type = get_event_type(event); 244 + 245 + if (event->attr.type != event->pmu->type || 246 + event_type >= C2C_NUM_EVENTS) 247 + return -ENOENT; 248 + 249 + /* 250 + * Following other "uncore" PMUs, we do not support sampling mode or 251 + * attach to a task (per-process mode). 252 + */ 253 + if (is_sampling_event(event)) { 254 + dev_dbg(c2c_pmu->pmu.dev, "Can't support sampling events\n"); 255 + return -EOPNOTSUPP; 256 + } 257 + 258 + if (event->cpu < 0 || event->attach_state & PERF_ATTACH_TASK) { 259 + dev_dbg(c2c_pmu->pmu.dev, "Can't support per-task counters\n"); 260 + return -EINVAL; 261 + } 262 + 263 + /* 264 + * Make sure the CPU assignment is on one of the CPUs associated with 265 + * this PMU. 266 + */ 267 + if (!cpumask_test_cpu(event->cpu, &c2c_pmu->associated_cpus)) { 268 + dev_dbg(c2c_pmu->pmu.dev, 269 + "Requested cpu is not associated with the PMU\n"); 270 + return -EINVAL; 271 + } 272 + 273 + /* Enforce the current active CPU to handle the events in this PMU. 
*/ 274 + event->cpu = cpumask_first(&c2c_pmu->active_cpu); 275 + if (event->cpu >= nr_cpu_ids) 276 + return -EINVAL; 277 + 278 + if (!nv_c2c_pmu_validate_group(event)) 279 + return -EINVAL; 280 + 281 + hwc->idx = -1; 282 + hwc->config = event_type; 283 + 284 + return 0; 285 + } 286 + 287 + /* 288 + * Read 64-bit register as a pair of 32-bit registers using hi-lo-hi sequence. 289 + */ 290 + static u64 read_reg64_hilohi(const void __iomem *addr, u32 max_poll_count) 291 + { 292 + u32 val_lo, val_hi; 293 + u64 val; 294 + 295 + /* Use high-low-high sequence to avoid tearing */ 296 + do { 297 + if (max_poll_count-- == 0) { 298 + pr_err("NV C2C PMU: timeout hi-low-high sequence\n"); 299 + return 0; 300 + } 301 + 302 + val_hi = readl(addr + 4); 303 + val_lo = readl(addr); 304 + } while (val_hi != readl(addr + 4)); 305 + 306 + val = (((u64)val_hi << 32) | val_lo); 307 + 308 + return val; 309 + } 310 + 311 + static void nv_c2c_pmu_check_status(struct nv_c2c_pmu *c2c_pmu, u32 instance) 312 + { 313 + u32 in_status, out_status; 314 + 315 + in_status = readl(c2c_pmu->base[instance] + C2C_IN_STATUS); 316 + out_status = readl(c2c_pmu->base[instance] + C2C_OUT_STATUS); 317 + 318 + if (in_status || out_status) 319 + dev_warn(c2c_pmu->dev, 320 + "C2C PMU overflow in: 0x%x, out: 0x%x\n", 321 + in_status, out_status); 322 + } 323 + 324 + static u32 nv_c2c_ctr_offset[C2C_NUM_EVENTS] = { 325 + [C2C_EVENT_CYCLES] = C2C_CYCLE_CNTR, 326 + [C2C_EVENT_IN_RD_CUM_OUTS] = C2C_IN_RD_CUM_OUTS_CNTR, 327 + [C2C_EVENT_IN_RD_REQ] = C2C_IN_RD_REQ_CNTR, 328 + [C2C_EVENT_IN_WR_CUM_OUTS] = C2C_IN_WR_CUM_OUTS_CNTR, 329 + [C2C_EVENT_IN_WR_REQ] = C2C_IN_WR_REQ_CNTR, 330 + [C2C_EVENT_OUT_RD_CUM_OUTS] = C2C_OUT_RD_CUM_OUTS_CNTR, 331 + [C2C_EVENT_OUT_RD_REQ] = C2C_OUT_RD_REQ_CNTR, 332 + [C2C_EVENT_OUT_WR_CUM_OUTS] = C2C_OUT_WR_CUM_OUTS_CNTR, 333 + [C2C_EVENT_OUT_WR_REQ] = C2C_OUT_WR_REQ_CNTR, 334 + }; 335 + 336 + static u64 nv_c2c_pmu_read_counter(struct perf_event *event) 337 + { 338 + u32 ctr_id, ctr_offset, 
filter_mask, filter_idx, inst_idx; 339 + unsigned long *inst_mask; 340 + DECLARE_BITMAP(filter_bitmap, C2C_NR_PEER_MAX); 341 + struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu); 342 + u64 val = 0; 343 + 344 + filter_mask = get_filter_mask(event); 345 + bitmap_from_arr32(filter_bitmap, &filter_mask, c2c_pmu->nr_peer); 346 + 347 + ctr_id = event->hw.config; 348 + ctr_offset = nv_c2c_ctr_offset[ctr_id]; 349 + 350 + for_each_set_bit(filter_idx, filter_bitmap, c2c_pmu->nr_peer) { 351 + inst_mask = c2c_pmu->peer_insts[filter_idx]; 352 + for_each_set_bit(inst_idx, inst_mask, c2c_pmu->data->nr_inst) { 353 + nv_c2c_pmu_check_status(c2c_pmu, inst_idx); 354 + 355 + /* 356 + * Each instance share same clock and the driver always 357 + * enables all instances. So we can use the counts from 358 + * one instance for cycle counter. 359 + */ 360 + if (ctr_id == C2C_EVENT_CYCLES) 361 + return read_reg64_hilohi( 362 + c2c_pmu->base[inst_idx] + ctr_offset, 363 + HILOHI_MAX_POLL); 364 + 365 + /* 366 + * For other events, sum up the counts from all instances. 
 */
			val += read_reg64_hilohi(
					c2c_pmu->base[inst_idx] + ctr_offset,
					HILOHI_MAX_POLL);
		}
	}

	return val;
}

static void nv_c2c_pmu_event_update(struct perf_event *event)
{
	struct hw_perf_event *hwc = &event->hw;
	u64 prev, now;

	do {
		prev = local64_read(&hwc->prev_count);
		now = nv_c2c_pmu_read_counter(event);
	} while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);

	local64_add(now - prev, &event->count);
}

static void nv_c2c_pmu_start(struct perf_event *event, int pmu_flags)
{
	event->hw.state = 0;
}

static void nv_c2c_pmu_stop(struct perf_event *event, int pmu_flags)
{
	event->hw.state |= PERF_HES_STOPPED;
}

static int nv_c2c_pmu_add(struct perf_event *event, int flags)
{
	struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu);
	struct nv_c2c_pmu_hw_events *hw_events = &c2c_pmu->hw_events;
	struct hw_perf_event *hwc = &event->hw;
	int idx;

	if (WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(),
					   &c2c_pmu->associated_cpus)))
		return -ENOENT;

	idx = nv_c2c_pmu_get_event_idx(hw_events, event);
	if (idx < 0)
		return idx;

	hw_events->events[idx] = event;
	hwc->idx = idx;
	hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;

	if (flags & PERF_EF_START)
		nv_c2c_pmu_start(event, PERF_EF_RELOAD);

	/* Propagate changes to the userspace mapping. */
	perf_event_update_userpage(event);

	return 0;
}

static void nv_c2c_pmu_del(struct perf_event *event, int flags)
{
	struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu);
	struct nv_c2c_pmu_hw_events *hw_events = &c2c_pmu->hw_events;
	struct hw_perf_event *hwc = &event->hw;
	int idx = hwc->idx;

	nv_c2c_pmu_stop(event, PERF_EF_UPDATE);

	hw_events->events[idx] = NULL;

	clear_bit(idx, hw_events->used_ctrs);

	perf_event_update_userpage(event);
}

static void nv_c2c_pmu_read(struct perf_event *event)
{
	nv_c2c_pmu_event_update(event);
}

static void nv_c2c_pmu_enable(struct pmu *pmu)
{
	void __iomem *bcast;
	struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(pmu);

	/* Nothing to do if no counters are in use. */
	if (bitmap_empty(c2c_pmu->hw_events.used_ctrs, C2C_MAX_ACTIVE_EVENTS))
		return;

	/* Enable all the counters. */
	bcast = c2c_pmu->base_broadcast;
	writel(0x1UL, bcast + C2C_CTRL);
}

static void nv_c2c_pmu_disable(struct pmu *pmu)
{
	unsigned int idx;
	void __iomem *bcast;
	struct perf_event *event;
	struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(pmu);

	/* Disable all the counters. */
	bcast = c2c_pmu->base_broadcast;
	writel(0x0UL, bcast + C2C_CTRL);

	/*
	 * The counters will start from 0 again on restart.
	 * Update the events immediately to avoid losing the counts.
	 */
	for_each_set_bit(idx, c2c_pmu->hw_events.used_ctrs,
			 C2C_MAX_ACTIVE_EVENTS) {
		event = c2c_pmu->hw_events.events[idx];

		if (!event)
			continue;

		nv_c2c_pmu_event_update(event);

		local64_set(&event->hw.prev_count, 0ULL);
	}
}

/* PMU identifier attribute. */

static ssize_t nv_c2c_pmu_identifier_show(struct device *dev,
					  struct device_attribute *attr,
					  char *page)
{
	struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(dev_get_drvdata(dev));

	return sysfs_emit(page, "%s\n", c2c_pmu->identifier);
}

static struct device_attribute nv_c2c_pmu_identifier_attr =
	__ATTR(identifier, 0444, nv_c2c_pmu_identifier_show, NULL);

static struct attribute *nv_c2c_pmu_identifier_attrs[] = {
	&nv_c2c_pmu_identifier_attr.attr,
	NULL,
};

static struct attribute_group nv_c2c_pmu_identifier_attr_group = {
	.attrs = nv_c2c_pmu_identifier_attrs,
};

/* Peer attribute. */

static ssize_t nv_c2c_pmu_peer_show(struct device *dev,
				    struct device_attribute *attr,
				    char *page)
{
	const char *peer_type[C2C_PEER_TYPE_COUNT] = {
		[C2C_PEER_TYPE_CPU] = "cpu",
		[C2C_PEER_TYPE_GPU] = "gpu",
		[C2C_PEER_TYPE_CXLMEM] = "cxlmem",
	};
	struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(dev_get_drvdata(dev));

	return sysfs_emit(page, "nr_%s=%u\n", peer_type[c2c_pmu->peer_type],
			  c2c_pmu->nr_peer);
}

static struct device_attribute nv_c2c_pmu_peer_attr =
	__ATTR(peer, 0444, nv_c2c_pmu_peer_show, NULL);

static struct attribute *nv_c2c_pmu_peer_attrs[] = {
	&nv_c2c_pmu_peer_attr.attr,
	NULL,
};

static struct attribute_group nv_c2c_pmu_peer_attr_group = {
	.attrs = nv_c2c_pmu_peer_attrs,
};

/* Format attributes. */

#define NV_C2C_PMU_EXT_ATTR(_name, _func, _config)			\
	(&((struct dev_ext_attribute[]){				\
		{							\
			.attr = __ATTR(_name, 0444, _func, NULL),	\
			.var = (void *)_config				\
		}							\
	})[0].attr.attr)

#define NV_C2C_PMU_FORMAT_ATTR(_name, _config)				\
	NV_C2C_PMU_EXT_ATTR(_name, device_show_string, _config)

#define NV_C2C_PMU_FORMAT_EVENT_ATTR					\
	NV_C2C_PMU_FORMAT_ATTR(event, "config:0-3")

static struct attribute *nv_c2c_pmu_gpu_formats[] = {
	NV_C2C_PMU_FORMAT_EVENT_ATTR,
	NV_C2C_PMU_FORMAT_ATTR(gpu_mask, "config1:0-1"),
	NULL,
};

static const struct attribute_group nv_c2c_pmu_gpu_format_group = {
	.name = "format",
	.attrs = nv_c2c_pmu_gpu_formats,
};

static struct attribute *nv_c2c_pmu_formats[] = {
	NV_C2C_PMU_FORMAT_EVENT_ATTR,
	NULL,
};

static const struct attribute_group nv_c2c_pmu_format_group = {
	.name = "format",
	.attrs = nv_c2c_pmu_formats,
};

/* Event attributes. */

static ssize_t nv_c2c_pmu_sysfs_event_show(struct device *dev,
					   struct device_attribute *attr,
					   char *buf)
{
	struct perf_pmu_events_attr *pmu_attr;

	pmu_attr = container_of(attr, typeof(*pmu_attr), attr);
	return sysfs_emit(buf, "event=0x%llx\n", pmu_attr->id);
}

#define NV_C2C_PMU_EVENT_ATTR(_name, _config)				\
	PMU_EVENT_ATTR_ID(_name, nv_c2c_pmu_sysfs_event_show, _config)

static struct attribute *nv_c2c_pmu_gpu_events[] = {
	NV_C2C_PMU_EVENT_ATTR(cycles, C2C_EVENT_CYCLES),
	NV_C2C_PMU_EVENT_ATTR(in_rd_cum_outs, C2C_EVENT_IN_RD_CUM_OUTS),
	NV_C2C_PMU_EVENT_ATTR(in_rd_req, C2C_EVENT_IN_RD_REQ),
	NV_C2C_PMU_EVENT_ATTR(in_wr_cum_outs, C2C_EVENT_IN_WR_CUM_OUTS),
	NV_C2C_PMU_EVENT_ATTR(in_wr_req, C2C_EVENT_IN_WR_REQ),
	NV_C2C_PMU_EVENT_ATTR(out_rd_cum_outs, C2C_EVENT_OUT_RD_CUM_OUTS),
	NV_C2C_PMU_EVENT_ATTR(out_rd_req, C2C_EVENT_OUT_RD_REQ),
	NV_C2C_PMU_EVENT_ATTR(out_wr_cum_outs, C2C_EVENT_OUT_WR_CUM_OUTS),
	NV_C2C_PMU_EVENT_ATTR(out_wr_req, C2C_EVENT_OUT_WR_REQ),
	NULL
};

static const struct attribute_group nv_c2c_pmu_gpu_events_group = {
	.name = "events",
	.attrs = nv_c2c_pmu_gpu_events,
};

static struct attribute *nv_c2c_pmu_cpu_events[] = {
	NV_C2C_PMU_EVENT_ATTR(cycles, C2C_EVENT_CYCLES),
	NV_C2C_PMU_EVENT_ATTR(in_rd_cum_outs, C2C_EVENT_IN_RD_CUM_OUTS),
	NV_C2C_PMU_EVENT_ATTR(in_rd_req, C2C_EVENT_IN_RD_REQ),
	NV_C2C_PMU_EVENT_ATTR(out_rd_cum_outs, C2C_EVENT_OUT_RD_CUM_OUTS),
	NV_C2C_PMU_EVENT_ATTR(out_rd_req, C2C_EVENT_OUT_RD_REQ),
	NULL
};

static const struct attribute_group nv_c2c_pmu_cpu_events_group = {
	.name = "events",
	.attrs = nv_c2c_pmu_cpu_events,
};

static struct attribute *nv_c2c_pmu_cxlmem_events[] = {
	NV_C2C_PMU_EVENT_ATTR(cycles, C2C_EVENT_CYCLES),
	NV_C2C_PMU_EVENT_ATTR(in_rd_cum_outs, C2C_EVENT_IN_RD_CUM_OUTS),
	NV_C2C_PMU_EVENT_ATTR(in_rd_req, C2C_EVENT_IN_RD_REQ),
	NULL
};

static const struct attribute_group nv_c2c_pmu_cxlmem_events_group = {
	.name = "events",
	.attrs = nv_c2c_pmu_cxlmem_events,
};

/* Cpumask attributes. */

static ssize_t nv_c2c_pmu_cpumask_show(struct device *dev,
				       struct device_attribute *attr, char *buf)
{
	struct pmu *pmu = dev_get_drvdata(dev);
	struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(pmu);
	struct dev_ext_attribute *eattr =
		container_of(attr, struct dev_ext_attribute, attr);
	unsigned long mask_id = (unsigned long)eattr->var;
	const cpumask_t *cpumask;

	switch (mask_id) {
	case C2C_ACTIVE_CPU_MASK:
		cpumask = &c2c_pmu->active_cpu;
		break;
	case C2C_ASSOCIATED_CPU_MASK:
		cpumask = &c2c_pmu->associated_cpus;
		break;
	default:
		return 0;
	}
	return cpumap_print_to_pagebuf(true, buf, cpumask);
}

#define NV_C2C_PMU_CPUMASK_ATTR(_name, _config)				\
	NV_C2C_PMU_EXT_ATTR(_name, nv_c2c_pmu_cpumask_show,		\
			    (unsigned long)_config)

static struct attribute *nv_c2c_pmu_cpumask_attrs[] = {
	NV_C2C_PMU_CPUMASK_ATTR(cpumask, C2C_ACTIVE_CPU_MASK),
	NV_C2C_PMU_CPUMASK_ATTR(associated_cpus, C2C_ASSOCIATED_CPU_MASK),
	NULL,
};

static const struct attribute_group nv_c2c_pmu_cpumask_attr_group = {
	.attrs = nv_c2c_pmu_cpumask_attrs,
};

/* Attribute groups for C2C PMU connecting SoC and GPU */
static const struct attribute_group *nv_c2c_pmu_gpu_attr_groups[] = {
	&nv_c2c_pmu_gpu_format_group,
	&nv_c2c_pmu_gpu_events_group,
	&nv_c2c_pmu_cpumask_attr_group,
	&nv_c2c_pmu_identifier_attr_group,
	&nv_c2c_pmu_peer_attr_group,
	NULL
};

/* Attribute groups for C2C PMU connecting multiple SoCs */
static const struct attribute_group *nv_c2c_pmu_cpu_attr_groups[] = {
	&nv_c2c_pmu_format_group,
	&nv_c2c_pmu_cpu_events_group,
	&nv_c2c_pmu_cpumask_attr_group,
	&nv_c2c_pmu_identifier_attr_group,
	&nv_c2c_pmu_peer_attr_group,
	NULL
};

/* Attribute groups for C2C PMU connecting SoC and CXLMEM */
static const struct attribute_group *nv_c2c_pmu_cxlmem_attr_groups[] = {
	&nv_c2c_pmu_format_group,
	&nv_c2c_pmu_cxlmem_events_group,
	&nv_c2c_pmu_cpumask_attr_group,
	&nv_c2c_pmu_identifier_attr_group,
	&nv_c2c_pmu_peer_attr_group,
	NULL
};

static int nv_c2c_pmu_online_cpu(unsigned int cpu, struct hlist_node *node)
{
	struct nv_c2c_pmu *c2c_pmu =
		hlist_entry_safe(node, struct nv_c2c_pmu, cpuhp_node);

	if (!cpumask_test_cpu(cpu, &c2c_pmu->associated_cpus))
		return 0;

	/* If the PMU is already managed, there is nothing to do */
	if (!cpumask_empty(&c2c_pmu->active_cpu))
		return 0;

	/* Use this CPU for event counting */
	cpumask_set_cpu(cpu, &c2c_pmu->active_cpu);

	return 0;
}

static int nv_c2c_pmu_cpu_teardown(unsigned int cpu, struct hlist_node *node)
{
	unsigned int dst;
	struct nv_c2c_pmu *c2c_pmu =
		hlist_entry_safe(node, struct nv_c2c_pmu, cpuhp_node);

	/* Nothing to do if this CPU doesn't own the PMU */
	if (!cpumask_test_and_clear_cpu(cpu, &c2c_pmu->active_cpu))
		return 0;

	/* Choose a new CPU to migrate ownership of the PMU to */
	dst = cpumask_any_and_but(&c2c_pmu->associated_cpus,
				  cpu_online_mask, cpu);
	if (dst >= nr_cpu_ids)
		return 0;

	/* Use this CPU for event counting */
	perf_pmu_migrate_context(&c2c_pmu->pmu, cpu, dst);
	cpumask_set_cpu(dst, &c2c_pmu->active_cpu);

	return 0;
}

static int nv_c2c_pmu_get_cpus(struct nv_c2c_pmu *c2c_pmu)
{
	int socket = c2c_pmu->socket, cpu;

	for_each_possible_cpu(cpu) {
		if (cpu_to_node(cpu) == socket)
			cpumask_set_cpu(cpu, &c2c_pmu->associated_cpus);
	}

	if (cpumask_empty(&c2c_pmu->associated_cpus)) {
		dev_dbg(c2c_pmu->dev,
			"No cpu associated with C2C PMU socket-%u\n", socket);
		return -ENODEV;
	}

	return 0;
}

static int nv_c2c_pmu_init_socket(struct nv_c2c_pmu *c2c_pmu)
{
	const char *uid_str;
	u32 socket;
	int ret;

	uid_str = acpi_device_uid(c2c_pmu->acpi_dev);
	if (!uid_str) {
		dev_err(c2c_pmu->dev, "No ACPI device UID\n");
		return -ENODEV;
	}

	ret = kstrtou32(uid_str, 0, &socket);
	if (ret) {
		dev_err(c2c_pmu->dev, "Failed to parse ACPI device UID\n");
		return ret;
	}

	c2c_pmu->socket = socket;
	return 0;
}

static int nv_c2c_pmu_init_id(struct nv_c2c_pmu *c2c_pmu)
{
	char *name;

	name = devm_kasprintf(c2c_pmu->dev, GFP_KERNEL, c2c_pmu->data->name_fmt,
			      c2c_pmu->socket);
	if (!name)
		return -ENOMEM;

	c2c_pmu->name = name;

	c2c_pmu->identifier = acpi_device_hid(c2c_pmu->acpi_dev);

	return 0;
}

static int nv_c2c_pmu_init_filter(struct nv_c2c_pmu *c2c_pmu)
{
	u32 cpu_en = 0;
	struct device *dev = c2c_pmu->dev;
	const struct nv_c2c_pmu_data *data = c2c_pmu->data;

	if (data->c2c_type == C2C_TYPE_NVDLINK) {
		c2c_pmu->peer_type = C2C_PEER_TYPE_CXLMEM;

		c2c_pmu->peer_insts[0][0] = (1UL << data->nr_inst) - 1;

		c2c_pmu->nr_peer = C2C_NR_PEER_CXLMEM;
		c2c_pmu->filter_default = (1 << c2c_pmu->nr_peer) - 1;

		c2c_pmu->attr_groups = nv_c2c_pmu_cxlmem_attr_groups;

		return 0;
	}

	if (device_property_read_u32(dev, "cpu_en_mask", &cpu_en))
		dev_dbg(dev, "no cpu_en_mask property\n");

	if (cpu_en) {
		c2c_pmu->peer_type = C2C_PEER_TYPE_CPU;

		/* Fill peer_insts bitmap with instances connected to peer CPU. */
		bitmap_from_arr32(c2c_pmu->peer_insts[0], &cpu_en, data->nr_inst);

		c2c_pmu->nr_peer = 1;
		c2c_pmu->attr_groups = nv_c2c_pmu_cpu_attr_groups;
	} else {
		u32 i;
		const char *props[C2C_NR_PEER_MAX] = {
			"gpu0_en_mask", "gpu1_en_mask"
		};

		for (i = 0; i < C2C_NR_PEER_MAX; i++) {
			u32 gpu_en = 0;

			if (device_property_read_u32(dev, props[i], &gpu_en))
				dev_dbg(dev, "no %s property\n", props[i]);

			if (gpu_en) {
				/* Fill peer_insts bitmap with instances connected to peer GPU. */
				bitmap_from_arr32(c2c_pmu->peer_insts[i], &gpu_en,
						  data->nr_inst);

				c2c_pmu->nr_peer++;
			}
		}

		if (c2c_pmu->nr_peer == 0) {
			dev_err(dev, "No GPU is enabled\n");
			return -EINVAL;
		}

		c2c_pmu->peer_type = C2C_PEER_TYPE_GPU;
		c2c_pmu->attr_groups = nv_c2c_pmu_gpu_attr_groups;
	}

	c2c_pmu->filter_default = (1 << c2c_pmu->nr_peer) - 1;

	return 0;
}

static void *nv_c2c_pmu_init_pmu(struct platform_device *pdev)
{
	int ret;
	struct nv_c2c_pmu *c2c_pmu;
	struct acpi_device *acpi_dev;
	struct device *dev = &pdev->dev;

	acpi_dev = ACPI_COMPANION(dev);
	if (!acpi_dev)
		return ERR_PTR(-ENODEV);

	c2c_pmu = devm_kzalloc(dev, sizeof(*c2c_pmu), GFP_KERNEL);
	if (!c2c_pmu)
		return ERR_PTR(-ENOMEM);

	c2c_pmu->dev = dev;
	c2c_pmu->acpi_dev = acpi_dev;
	c2c_pmu->data = (const struct nv_c2c_pmu_data *)device_get_match_data(dev);
	if (!c2c_pmu->data)
		return ERR_PTR(-EINVAL);

	platform_set_drvdata(pdev, c2c_pmu);

	ret = nv_c2c_pmu_init_socket(c2c_pmu);
	if (ret)
		return ERR_PTR(ret);

	ret = nv_c2c_pmu_init_id(c2c_pmu);
	if (ret)
		return ERR_PTR(ret);

	ret = nv_c2c_pmu_init_filter(c2c_pmu);
	if (ret)
		return ERR_PTR(ret);

	return c2c_pmu;
}

static int nv_c2c_pmu_init_mmio(struct nv_c2c_pmu *c2c_pmu)
{
	int i;
	struct device *dev = c2c_pmu->dev;
	struct platform_device *pdev = to_platform_device(dev);
	const struct nv_c2c_pmu_data *data = c2c_pmu->data;

	/* Map the address of all the instances. */
	for (i = 0; i < data->nr_inst; i++) {
		c2c_pmu->base[i] = devm_platform_ioremap_resource(pdev, i);
		if (IS_ERR(c2c_pmu->base[i])) {
			dev_err(dev, "Failed to map address for instance %d\n", i);
			return PTR_ERR(c2c_pmu->base[i]);
		}
	}

	/* Map broadcast address. */
	c2c_pmu->base_broadcast = devm_platform_ioremap_resource(pdev,
								 data->nr_inst);
	if (IS_ERR(c2c_pmu->base_broadcast)) {
		dev_err(dev, "Failed to map broadcast address\n");
		return PTR_ERR(c2c_pmu->base_broadcast);
	}

	return 0;
}

static int nv_c2c_pmu_register_pmu(struct nv_c2c_pmu *c2c_pmu)
{
	int ret;

	ret = cpuhp_state_add_instance(nv_c2c_pmu_cpuhp_state,
				       &c2c_pmu->cpuhp_node);
	if (ret) {
		dev_err(c2c_pmu->dev, "Error %d registering hotplug\n", ret);
		return ret;
	}

	c2c_pmu->pmu = (struct pmu) {
		.parent = c2c_pmu->dev,
		.task_ctx_nr = perf_invalid_context,
		.pmu_enable = nv_c2c_pmu_enable,
		.pmu_disable = nv_c2c_pmu_disable,
		.event_init = nv_c2c_pmu_event_init,
		.add = nv_c2c_pmu_add,
		.del = nv_c2c_pmu_del,
		.start = nv_c2c_pmu_start,
		.stop = nv_c2c_pmu_stop,
		.read = nv_c2c_pmu_read,
		.attr_groups = c2c_pmu->attr_groups,
		.capabilities = PERF_PMU_CAP_NO_EXCLUDE |
				PERF_PMU_CAP_NO_INTERRUPT,
	};

	ret = perf_pmu_register(&c2c_pmu->pmu, c2c_pmu->name, -1);
	if (ret) {
		dev_err(c2c_pmu->dev, "Failed to register C2C PMU: %d\n", ret);
		cpuhp_state_remove_instance(nv_c2c_pmu_cpuhp_state,
					    &c2c_pmu->cpuhp_node);
		return ret;
	}

	return 0;
}

static int nv_c2c_pmu_probe(struct platform_device *pdev)
{
	int ret;
	struct nv_c2c_pmu *c2c_pmu;

	c2c_pmu = nv_c2c_pmu_init_pmu(pdev);
	if (IS_ERR(c2c_pmu))
		return PTR_ERR(c2c_pmu);

	ret = nv_c2c_pmu_init_mmio(c2c_pmu);
	if (ret)
		return ret;

	ret = nv_c2c_pmu_get_cpus(c2c_pmu);
	if (ret)
		return ret;

	ret = nv_c2c_pmu_register_pmu(c2c_pmu);
	if (ret)
		return ret;

	dev_dbg(c2c_pmu->dev, "Registered %s PMU\n", c2c_pmu->name);

	return 0;
}

static void nv_c2c_pmu_device_remove(struct platform_device *pdev)
{
	struct nv_c2c_pmu *c2c_pmu = platform_get_drvdata(pdev);

	perf_pmu_unregister(&c2c_pmu->pmu);
	cpuhp_state_remove_instance(nv_c2c_pmu_cpuhp_state, &c2c_pmu->cpuhp_node);
}

static const struct acpi_device_id nv_c2c_pmu_acpi_match[] = {
	{ "NVDA2023", (kernel_ulong_t)&nv_c2c_pmu_data[C2C_TYPE_NVLINK] },
	{ "NVDA2022", (kernel_ulong_t)&nv_c2c_pmu_data[C2C_TYPE_NVCLINK] },
	{ "NVDA2020", (kernel_ulong_t)&nv_c2c_pmu_data[C2C_TYPE_NVDLINK] },
	{ }
};
MODULE_DEVICE_TABLE(acpi, nv_c2c_pmu_acpi_match);

static struct platform_driver nv_c2c_pmu_driver = {
	.driver = {
		.name = "nvidia-t410-c2c-pmu",
		.acpi_match_table = nv_c2c_pmu_acpi_match,
		.suppress_bind_attrs = true,
	},
	.probe = nv_c2c_pmu_probe,
	.remove = nv_c2c_pmu_device_remove,
};

static int __init nv_c2c_pmu_init(void)
{
	int ret;

	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
				      "perf/nvidia/c2c:online",
				      nv_c2c_pmu_online_cpu,
				      nv_c2c_pmu_cpu_teardown);
	if (ret < 0)
		return ret;

	nv_c2c_pmu_cpuhp_state = ret;
	return platform_driver_register(&nv_c2c_pmu_driver);
}

static void __exit nv_c2c_pmu_exit(void)
{
	platform_driver_unregister(&nv_c2c_pmu_driver);
	cpuhp_remove_multi_state(nv_c2c_pmu_cpuhp_state);
}

module_init(nv_c2c_pmu_init);
module_exit(nv_c2c_pmu_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("NVIDIA Tegra410 C2C PMU driver");
MODULE_AUTHOR("Besar Wicaksono <bwicaksono@nvidia.com>");
drivers/perf/nvidia_t410_cmem_latency_pmu.c
// SPDX-License-Identifier: GPL-2.0
/*
 * NVIDIA Tegra410 CPU Memory (CMEM) Latency PMU driver.
 *
 * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 */

#include <linux/acpi.h>
#include <linux/bitops.h>
#include <linux/cpumask.h>
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/module.h>
#include <linux/perf_event.h>
#include <linux/platform_device.h>

#define NUM_INSTANCES 14

/* Register offsets. */
#define CMEM_LAT_CG_CTRL	0x800
#define CMEM_LAT_CTRL		0x808
#define CMEM_LAT_STATUS		0x810
#define CMEM_LAT_CYCLE_CNTR	0x818
#define CMEM_LAT_MC0_REQ_CNTR	0x820
#define CMEM_LAT_MC0_AOR_CNTR	0x830
#define CMEM_LAT_MC1_REQ_CNTR	0x838
#define CMEM_LAT_MC1_AOR_CNTR	0x848
#define CMEM_LAT_MC2_REQ_CNTR	0x850
#define CMEM_LAT_MC2_AOR_CNTR	0x860

/* CMEM_LAT_CTRL values. */
#define CMEM_LAT_CTRL_DISABLE	0x0ULL
#define CMEM_LAT_CTRL_ENABLE	0x1ULL
#define CMEM_LAT_CTRL_CLR	0x2ULL

/* CMEM_LAT_CG_CTRL values. */
#define CMEM_LAT_CG_CTRL_DISABLE	0x0ULL
#define CMEM_LAT_CG_CTRL_ENABLE		0x1ULL

/* CMEM_LAT_STATUS register fields. */
#define CMEM_LAT_STATUS_CYCLE_OVF	BIT(0)
#define CMEM_LAT_STATUS_MC0_AOR_OVF	BIT(1)
#define CMEM_LAT_STATUS_MC0_REQ_OVF	BIT(3)
#define CMEM_LAT_STATUS_MC1_AOR_OVF	BIT(4)
#define CMEM_LAT_STATUS_MC1_REQ_OVF	BIT(6)
#define CMEM_LAT_STATUS_MC2_AOR_OVF	BIT(7)
#define CMEM_LAT_STATUS_MC2_REQ_OVF	BIT(9)

/* Events. */
#define CMEM_LAT_EVENT_CYCLES	0x0
#define CMEM_LAT_EVENT_REQ	0x1
#define CMEM_LAT_EVENT_AOR	0x2

#define CMEM_LAT_NUM_EVENTS	0x3
#define CMEM_LAT_MASK_EVENT	0x3
#define CMEM_LAT_MAX_ACTIVE_EVENTS	32

#define CMEM_LAT_ACTIVE_CPU_MASK	0x0
#define CMEM_LAT_ASSOCIATED_CPU_MASK	0x1

static unsigned long cmem_lat_pmu_cpuhp_state;

struct cmem_lat_pmu_hw_events {
	struct perf_event *events[CMEM_LAT_MAX_ACTIVE_EVENTS];
	DECLARE_BITMAP(used_ctrs, CMEM_LAT_MAX_ACTIVE_EVENTS);
};

struct cmem_lat_pmu {
	struct pmu pmu;
	struct device *dev;
	const char *name;
	const char *identifier;
	void __iomem *base_broadcast;
	void __iomem *base[NUM_INSTANCES];
	cpumask_t associated_cpus;
	cpumask_t active_cpu;
	struct hlist_node node;
	struct cmem_lat_pmu_hw_events hw_events;
};

#define to_cmem_lat_pmu(p) \
	container_of(p, struct cmem_lat_pmu, pmu)

/* Get event type from perf_event. */
static inline u32 get_event_type(struct perf_event *event)
{
	return (event->attr.config) & CMEM_LAT_MASK_EVENT;
}

/* PMU operations. */
static int cmem_lat_pmu_get_event_idx(struct cmem_lat_pmu_hw_events *hw_events,
				      struct perf_event *event)
{
	unsigned int idx;

	idx = find_first_zero_bit(hw_events->used_ctrs, CMEM_LAT_MAX_ACTIVE_EVENTS);
	if (idx >= CMEM_LAT_MAX_ACTIVE_EVENTS)
		return -EAGAIN;

	set_bit(idx, hw_events->used_ctrs);

	return idx;
}

static bool cmem_lat_pmu_validate_event(struct pmu *pmu,
					struct cmem_lat_pmu_hw_events *hw_events,
					struct perf_event *event)
{
	int ret;

	if (is_software_event(event))
		return true;

	/* Reject groups spanning multiple HW PMUs. */
	if (event->pmu != pmu)
		return false;

	ret = cmem_lat_pmu_get_event_idx(hw_events, event);
	if (ret < 0)
		return false;

	return true;
}

/* Make sure the group of events can be scheduled at once on the PMU. */
static bool cmem_lat_pmu_validate_group(struct perf_event *event)
{
	struct perf_event *sibling, *leader = event->group_leader;
	struct cmem_lat_pmu_hw_events fake_hw_events;

	if (event->group_leader == event)
		return true;

	memset(&fake_hw_events, 0, sizeof(fake_hw_events));

	if (!cmem_lat_pmu_validate_event(event->pmu, &fake_hw_events, leader))
		return false;

	for_each_sibling_event(sibling, leader) {
		if (!cmem_lat_pmu_validate_event(event->pmu, &fake_hw_events, sibling))
			return false;
	}

	return cmem_lat_pmu_validate_event(event->pmu, &fake_hw_events, event);
}

static int cmem_lat_pmu_event_init(struct perf_event *event)
{
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
	struct hw_perf_event *hwc = &event->hw;
	u32 event_type = get_event_type(event);

	if (event->attr.type != event->pmu->type ||
	    event_type >= CMEM_LAT_NUM_EVENTS)
		return -ENOENT;

	/*
	 * Sampling, per-process mode, and per-task counters are not supported
	 * since this PMU is shared across all CPUs.
	 */
	if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) {
		dev_dbg(cmem_lat_pmu->pmu.dev,
			"Can't support sampling and per-process mode\n");
		return -EOPNOTSUPP;
	}

	if (event->cpu < 0) {
		dev_dbg(cmem_lat_pmu->pmu.dev, "Can't support per-task counters\n");
		return -EINVAL;
	}

	/*
	 * Make sure the CPU assignment is on one of the CPUs associated with
	 * this PMU.
	 */
	if (!cpumask_test_cpu(event->cpu, &cmem_lat_pmu->associated_cpus)) {
		dev_dbg(cmem_lat_pmu->pmu.dev,
			"Requested cpu is not associated with the PMU\n");
		return -EINVAL;
	}

	/* Enforce the current active CPU to handle the events in this PMU. */
	event->cpu = cpumask_first(&cmem_lat_pmu->active_cpu);
	if (event->cpu >= nr_cpu_ids)
		return -EINVAL;

	if (!cmem_lat_pmu_validate_group(event))
		return -EINVAL;

	hwc->idx = -1;
	hwc->config = event_type;

	return 0;
}

static u64 cmem_lat_pmu_read_status(struct cmem_lat_pmu *cmem_lat_pmu,
				    unsigned int inst)
{
	return readq(cmem_lat_pmu->base[inst] + CMEM_LAT_STATUS);
}

static u64 cmem_lat_pmu_read_cycle_counter(struct perf_event *event)
{
	const unsigned int instance = 0;
	u64 status;
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
	struct device *dev = cmem_lat_pmu->dev;

	/*
	 * Use the reading from the first instance since all instances are
	 * identical.
	 */
	status = cmem_lat_pmu_read_status(cmem_lat_pmu, instance);
	if (status & CMEM_LAT_STATUS_CYCLE_OVF)
		dev_warn(dev, "Cycle counter overflow\n");

	return readq(cmem_lat_pmu->base[instance] + CMEM_LAT_CYCLE_CNTR);
}

static u64 cmem_lat_pmu_read_req_counter(struct perf_event *event)
{
	unsigned int i;
	u64 status, val = 0;
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
	struct device *dev = cmem_lat_pmu->dev;

	/* Sum up the counts from all instances. */
	for (i = 0; i < NUM_INSTANCES; i++) {
		status = cmem_lat_pmu_read_status(cmem_lat_pmu, i);
		if (status & CMEM_LAT_STATUS_MC0_REQ_OVF)
			dev_warn(dev, "MC0 request counter overflow\n");
		if (status & CMEM_LAT_STATUS_MC1_REQ_OVF)
			dev_warn(dev, "MC1 request counter overflow\n");
		if (status & CMEM_LAT_STATUS_MC2_REQ_OVF)
			dev_warn(dev, "MC2 request counter overflow\n");

		val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC0_REQ_CNTR);
		val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC1_REQ_CNTR);
		val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC2_REQ_CNTR);
	}

	return val;
}

static u64 cmem_lat_pmu_read_aor_counter(struct perf_event *event)
{
	unsigned int i;
	u64 status, val = 0;
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
	struct device *dev = cmem_lat_pmu->dev;

	/* Sum up the counts from all instances. */
	for (i = 0; i < NUM_INSTANCES; i++) {
		status = cmem_lat_pmu_read_status(cmem_lat_pmu, i);
		if (status & CMEM_LAT_STATUS_MC0_AOR_OVF)
			dev_warn(dev, "MC0 AOR counter overflow\n");
		if (status & CMEM_LAT_STATUS_MC1_AOR_OVF)
			dev_warn(dev, "MC1 AOR counter overflow\n");
		if (status & CMEM_LAT_STATUS_MC2_AOR_OVF)
			dev_warn(dev, "MC2 AOR counter overflow\n");

		val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC0_AOR_CNTR);
		val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC1_AOR_CNTR);
		val += readq(cmem_lat_pmu->base[i] + CMEM_LAT_MC2_AOR_CNTR);
	}

	return val;
}

static u64 (*read_counter_fn[CMEM_LAT_NUM_EVENTS])(struct perf_event *) = {
	[CMEM_LAT_EVENT_CYCLES] = cmem_lat_pmu_read_cycle_counter,
	[CMEM_LAT_EVENT_REQ] = cmem_lat_pmu_read_req_counter,
	[CMEM_LAT_EVENT_AOR] = cmem_lat_pmu_read_aor_counter,
};

static void cmem_lat_pmu_event_update(struct perf_event *event)
{
	u32 event_type;
	u64 prev, now;
	struct hw_perf_event *hwc = &event->hw;

	if (hwc->state & PERF_HES_STOPPED)
		return;

	event_type = hwc->config;

	do {
		prev = local64_read(&hwc->prev_count);
		now = read_counter_fn[event_type](event);
	} while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);

	local64_add(now - prev, &event->count);

	hwc->state |= PERF_HES_UPTODATE;
}

static void cmem_lat_pmu_start(struct perf_event *event, int pmu_flags)
{
	event->hw.state = 0;
}

static void cmem_lat_pmu_stop(struct perf_event *event, int pmu_flags)
{
	event->hw.state |= PERF_HES_STOPPED;
}

static int cmem_lat_pmu_add(struct perf_event *event, int flags)
{
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
	struct cmem_lat_pmu_hw_events *hw_events = &cmem_lat_pmu->hw_events;
	struct hw_perf_event *hwc = &event->hw;
	int idx;

	if (WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(),
					   &cmem_lat_pmu->associated_cpus)))
		return -ENOENT;

	idx = cmem_lat_pmu_get_event_idx(hw_events, event);
	if (idx < 0)
		return idx;

	hw_events->events[idx] = event;
	hwc->idx = idx;
	hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;

	if (flags & PERF_EF_START)
		cmem_lat_pmu_start(event, PERF_EF_RELOAD);

	/* Propagate changes to the userspace mapping. */
	perf_event_update_userpage(event);

	return 0;
}

static void cmem_lat_pmu_del(struct perf_event *event, int flags)
{
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
	struct cmem_lat_pmu_hw_events *hw_events = &cmem_lat_pmu->hw_events;
	struct hw_perf_event *hwc = &event->hw;
	int idx = hwc->idx;

	cmem_lat_pmu_stop(event, PERF_EF_UPDATE);

	hw_events->events[idx] = NULL;

	clear_bit(idx, hw_events->used_ctrs);

	perf_event_update_userpage(event);
}

static void cmem_lat_pmu_read(struct perf_event *event)
{
	cmem_lat_pmu_event_update(event);
}

static inline void cmem_lat_pmu_cg_ctrl(struct cmem_lat_pmu *cmem_lat_pmu,
					u64 val)
{
	writeq(val, cmem_lat_pmu->base_broadcast + CMEM_LAT_CG_CTRL);
}

static inline void cmem_lat_pmu_ctrl(struct cmem_lat_pmu *cmem_lat_pmu, u64 val)
{
	writeq(val, cmem_lat_pmu->base_broadcast + CMEM_LAT_CTRL);
}

static void cmem_lat_pmu_enable(struct pmu *pmu)
{
	bool disabled;
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);

	disabled = bitmap_empty(cmem_lat_pmu->hw_events.used_ctrs,
				CMEM_LAT_MAX_ACTIVE_EVENTS);

	if (disabled)
		return;

	/* Enable all the counters. */
	cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CMEM_LAT_CG_CTRL_ENABLE);
	cmem_lat_pmu_ctrl(cmem_lat_pmu, CMEM_LAT_CTRL_ENABLE);
}

static void cmem_lat_pmu_disable(struct pmu *pmu)
{
	int idx;
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);

	/* Disable all the counters. */
	cmem_lat_pmu_ctrl(cmem_lat_pmu, CMEM_LAT_CTRL_DISABLE);

	/*
	 * The counters will start from 0 again on restart.
	 * Update the events immediately to avoid losing the counts.
	 */
	for_each_set_bit(idx, cmem_lat_pmu->hw_events.used_ctrs,
			 CMEM_LAT_MAX_ACTIVE_EVENTS) {
		struct perf_event *event = cmem_lat_pmu->hw_events.events[idx];

		if (!event)
			continue;

		cmem_lat_pmu_event_update(event);

		local64_set(&event->hw.prev_count, 0ULL);
	}

	cmem_lat_pmu_ctrl(cmem_lat_pmu, CMEM_LAT_CTRL_CLR);
	cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CMEM_LAT_CG_CTRL_DISABLE);
}

/* PMU identifier attribute. */

static ssize_t cmem_lat_pmu_identifier_show(struct device *dev,
					    struct device_attribute *attr,
					    char *page)
{
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(dev_get_drvdata(dev));

	return sysfs_emit(page, "%s\n", cmem_lat_pmu->identifier);
}

static struct device_attribute cmem_lat_pmu_identifier_attr =
	__ATTR(identifier, 0444, cmem_lat_pmu_identifier_show, NULL);

static struct attribute *cmem_lat_pmu_identifier_attrs[] = {
	&cmem_lat_pmu_identifier_attr.attr,
	NULL
};

static struct attribute_group cmem_lat_pmu_identifier_attr_group = {
	.attrs = cmem_lat_pmu_identifier_attrs,
};

/* Format attributes. */

#define NV_PMU_EXT_ATTR(_name, _func, _config)				\
	(&((struct dev_ext_attribute[]){				\
		{							\
			.attr = __ATTR(_name, 0444, _func, NULL),	\
			.var = (void *)_config				\
		}							\
	})[0].attr.attr)

static struct attribute *cmem_lat_pmu_formats[] = {
	NV_PMU_EXT_ATTR(event, device_show_string, "config:0-1"),
	NULL
};

static const struct attribute_group cmem_lat_pmu_format_group = {
	.name = "format",
	.attrs = cmem_lat_pmu_formats,
};

/* Event attributes. */

static ssize_t cmem_lat_pmu_sysfs_event_show(struct device *dev,
					     struct device_attribute *attr, char *buf)
{
	struct perf_pmu_events_attr *pmu_attr;

	pmu_attr = container_of(attr, typeof(*pmu_attr), attr);
	return sysfs_emit(buf, "event=0x%llx\n", pmu_attr->id);
}

#define NV_PMU_EVENT_ATTR(_name, _config)				\
	PMU_EVENT_ATTR_ID(_name, cmem_lat_pmu_sysfs_event_show, _config)

static struct attribute *cmem_lat_pmu_events[] = {
	NV_PMU_EVENT_ATTR(cycles, CMEM_LAT_EVENT_CYCLES),
	NV_PMU_EVENT_ATTR(rd_req, CMEM_LAT_EVENT_REQ),
	NV_PMU_EVENT_ATTR(rd_cum_outs, CMEM_LAT_EVENT_AOR),
	NULL
};

static const struct attribute_group cmem_lat_pmu_events_group = {
	.name = "events",
	.attrs = cmem_lat_pmu_events,
};

/* Cpumask attributes. */

static ssize_t cmem_lat_pmu_cpumask_show(struct device *dev,
					 struct device_attribute *attr, char *buf)
{
	struct pmu *pmu = dev_get_drvdata(dev);
	struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);
	struct dev_ext_attribute *eattr =
		container_of(attr, struct dev_ext_attribute, attr);
	unsigned long mask_id = (unsigned long)eattr->var;
	const cpumask_t *cpumask;

	switch (mask_id) {
	case CMEM_LAT_ACTIVE_CPU_MASK:
		cpumask = &cmem_lat_pmu->active_cpu;
		break;
	case CMEM_LAT_ASSOCIATED_CPU_MASK:
		cpumask = &cmem_lat_pmu->associated_cpus;
		break;
	default:
		return 0;
	}
	return cpumap_print_to_pagebuf(true, buf, cpumask);
}

#define NV_PMU_CPUMASK_ATTR(_name, _config)				\
	NV_PMU_EXT_ATTR(_name, cmem_lat_pmu_cpumask_show,		\
			(unsigned long)_config)

static struct attribute *cmem_lat_pmu_cpumask_attrs[] = {
	NV_PMU_CPUMASK_ATTR(cpumask, CMEM_LAT_ACTIVE_CPU_MASK),
	NV_PMU_CPUMASK_ATTR(associated_cpus, CMEM_LAT_ASSOCIATED_CPU_MASK),
	NULL
};

static const struct attribute_group cmem_lat_pmu_cpumask_attr_group = {
	.attrs = cmem_lat_pmu_cpumask_attrs,
};

/* Per PMU device attribute groups. */

static const struct attribute_group *cmem_lat_pmu_attr_groups[] = {
	&cmem_lat_pmu_identifier_attr_group,
	&cmem_lat_pmu_format_group,
	&cmem_lat_pmu_events_group,
	&cmem_lat_pmu_cpumask_attr_group,
	NULL
};

static int cmem_lat_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
{
	struct cmem_lat_pmu *cmem_lat_pmu =
		hlist_entry_safe(node, struct cmem_lat_pmu, node);

	if (!cpumask_test_cpu(cpu, &cmem_lat_pmu->associated_cpus))
		return 0;

	/* If the PMU is already managed, there is nothing to do */
	if (!cpumask_empty(&cmem_lat_pmu->active_cpu))
		return 0;

	/* Use this CPU for event counting */
	cpumask_set_cpu(cpu, &cmem_lat_pmu->active_cpu);

	return 0;
}

static int cmem_lat_pmu_cpu_teardown(unsigned int cpu, struct hlist_node *node)
{
	unsigned int dst;
	struct cmem_lat_pmu *cmem_lat_pmu =
		hlist_entry_safe(node, struct cmem_lat_pmu, node);

	/* Nothing to do if this CPU doesn't own the PMU */
	if (!cpumask_test_and_clear_cpu(cpu, &cmem_lat_pmu->active_cpu))
		return 0;

	/* Choose a new CPU to migrate ownership of the PMU to */
	dst = cpumask_any_and_but(&cmem_lat_pmu->associated_cpus,
				  cpu_online_mask, cpu);
	if (dst >= nr_cpu_ids)
		return 0;

	/* Use this CPU for event counting */
	perf_pmu_migrate_context(&cmem_lat_pmu->pmu, cpu, dst);
	cpumask_set_cpu(dst, &cmem_lat_pmu->active_cpu);

	return 0;
}

static int cmem_lat_pmu_get_cpus(struct cmem_lat_pmu *cmem_lat_pmu,
				 unsigned int socket)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		if (cpu_to_node(cpu) ==
socket) 580 + cpumask_set_cpu(cpu, &cmem_lat_pmu->associated_cpus); 581 + } 582 + 583 + if (cpumask_empty(&cmem_lat_pmu->associated_cpus)) { 584 + dev_dbg(cmem_lat_pmu->dev, 585 + "No cpu associated with PMU socket-%u\n", socket); 586 + return -ENODEV; 587 + } 588 + 589 + return 0; 590 + } 591 + 592 + static int cmem_lat_pmu_probe(struct platform_device *pdev) 593 + { 594 + struct device *dev = &pdev->dev; 595 + struct acpi_device *acpi_dev; 596 + struct cmem_lat_pmu *cmem_lat_pmu; 597 + char *name, *uid_str; 598 + int ret, i; 599 + u32 socket; 600 + 601 + acpi_dev = ACPI_COMPANION(dev); 602 + if (!acpi_dev) 603 + return -ENODEV; 604 + 605 + uid_str = acpi_device_uid(acpi_dev); 606 + if (!uid_str) 607 + return -ENODEV; 608 + 609 + ret = kstrtou32(uid_str, 0, &socket); 610 + if (ret) 611 + return ret; 612 + 613 + cmem_lat_pmu = devm_kzalloc(dev, sizeof(*cmem_lat_pmu), GFP_KERNEL); 614 + name = devm_kasprintf(dev, GFP_KERNEL, "nvidia_cmem_latency_pmu_%u", socket); 615 + if (!cmem_lat_pmu || !name) 616 + return -ENOMEM; 617 + 618 + cmem_lat_pmu->dev = dev; 619 + cmem_lat_pmu->name = name; 620 + cmem_lat_pmu->identifier = acpi_device_hid(acpi_dev); 621 + platform_set_drvdata(pdev, cmem_lat_pmu); 622 + 623 + cmem_lat_pmu->pmu = (struct pmu) { 624 + .parent = &pdev->dev, 625 + .task_ctx_nr = perf_invalid_context, 626 + .pmu_enable = cmem_lat_pmu_enable, 627 + .pmu_disable = cmem_lat_pmu_disable, 628 + .event_init = cmem_lat_pmu_event_init, 629 + .add = cmem_lat_pmu_add, 630 + .del = cmem_lat_pmu_del, 631 + .start = cmem_lat_pmu_start, 632 + .stop = cmem_lat_pmu_stop, 633 + .read = cmem_lat_pmu_read, 634 + .attr_groups = cmem_lat_pmu_attr_groups, 635 + .capabilities = PERF_PMU_CAP_NO_EXCLUDE | 636 + PERF_PMU_CAP_NO_INTERRUPT, 637 + }; 638 + 639 + /* Map the address of all the instances. 
*/ 640 + for (i = 0; i < NUM_INSTANCES; i++) { 641 + cmem_lat_pmu->base[i] = devm_platform_ioremap_resource(pdev, i); 642 + if (IS_ERR(cmem_lat_pmu->base[i])) { 643 + dev_err(dev, "Failed to map address for instance %d\n", i); 644 + return PTR_ERR(cmem_lat_pmu->base[i]); 645 + } 646 + } 647 + 648 + /* Map broadcast address. */ 649 + cmem_lat_pmu->base_broadcast = devm_platform_ioremap_resource(pdev, 650 + NUM_INSTANCES); 651 + if (IS_ERR(cmem_lat_pmu->base_broadcast)) { 652 + dev_err(dev, "Failed to map broadcast address\n"); 653 + return PTR_ERR(cmem_lat_pmu->base_broadcast); 654 + } 655 + 656 + ret = cmem_lat_pmu_get_cpus(cmem_lat_pmu, socket); 657 + if (ret) 658 + return ret; 659 + 660 + ret = cpuhp_state_add_instance(cmem_lat_pmu_cpuhp_state, 661 + &cmem_lat_pmu->node); 662 + if (ret) { 663 + dev_err(&pdev->dev, "Error %d registering hotplug\n", ret); 664 + return ret; 665 + } 666 + 667 + cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CMEM_LAT_CG_CTRL_ENABLE); 668 + cmem_lat_pmu_ctrl(cmem_lat_pmu, CMEM_LAT_CTRL_CLR); 669 + cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CMEM_LAT_CG_CTRL_DISABLE); 670 + 671 + ret = perf_pmu_register(&cmem_lat_pmu->pmu, name, -1); 672 + if (ret) { 673 + dev_err(&pdev->dev, "Failed to register PMU: %d\n", ret); 674 + cpuhp_state_remove_instance(cmem_lat_pmu_cpuhp_state, 675 + &cmem_lat_pmu->node); 676 + return ret; 677 + } 678 + 679 + dev_dbg(&pdev->dev, "Registered %s PMU\n", name); 680 + 681 + return 0; 682 + } 683 + 684 + static void cmem_lat_pmu_device_remove(struct platform_device *pdev) 685 + { 686 + struct cmem_lat_pmu *cmem_lat_pmu = platform_get_drvdata(pdev); 687 + 688 + perf_pmu_unregister(&cmem_lat_pmu->pmu); 689 + cpuhp_state_remove_instance(cmem_lat_pmu_cpuhp_state, 690 + &cmem_lat_pmu->node); 691 + } 692 + 693 + static const struct acpi_device_id cmem_lat_pmu_acpi_match[] = { 694 + { "NVDA2021" }, 695 + { } 696 + }; 697 + MODULE_DEVICE_TABLE(acpi, cmem_lat_pmu_acpi_match); 698 + 699 + static struct platform_driver cmem_lat_pmu_driver = { 700 +
.driver = { 701 + .name = "nvidia-t410-cmem-latency-pmu", 702 + .acpi_match_table = ACPI_PTR(cmem_lat_pmu_acpi_match), 703 + .suppress_bind_attrs = true, 704 + }, 705 + .probe = cmem_lat_pmu_probe, 706 + .remove = cmem_lat_pmu_device_remove, 707 + }; 708 + 709 + static int __init cmem_lat_pmu_init(void) 710 + { 711 + int ret; 712 + 713 + ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, 714 + "perf/nvidia/cmem_latency:online", 715 + cmem_lat_pmu_cpu_online, 716 + cmem_lat_pmu_cpu_teardown); 717 + if (ret < 0) 718 + return ret; 719 + 720 + cmem_lat_pmu_cpuhp_state = ret; 721 + 722 + return platform_driver_register(&cmem_lat_pmu_driver); 723 + } 724 + 725 + static void __exit cmem_lat_pmu_exit(void) 726 + { 727 + platform_driver_unregister(&cmem_lat_pmu_driver); 728 + cpuhp_remove_multi_state(cmem_lat_pmu_cpuhp_state); 729 + } 730 + 731 + module_init(cmem_lat_pmu_init); 732 + module_exit(cmem_lat_pmu_exit); 733 + 734 + MODULE_LICENSE("GPL"); 735 + MODULE_DESCRIPTION("NVIDIA Tegra410 CPU Memory Latency PMU driver"); 736 + MODULE_AUTHOR("Besar Wicaksono <bwicaksono@nvidia.com>");
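The driver above keeps exactly one "active" CPU per socket PMU: `cmem_lat_pmu_cpu_online()` claims the first associated CPU, and `cmem_lat_pmu_cpu_teardown()` migrates ownership to any other online associated CPU via `cpumask_any_and_but()`. As a standalone userspace sketch of that selection rule (the `pick_new_owner` helper and bitmask representation are my own, not the kernel API):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the teardown path's owner selection: from the CPUs that are
 * both associated with the PMU and online, excluding the CPU going
 * offline, pick any candidate (lowest set bit here, mirroring what
 * cpumask_any_and_but() may return). -1 means no eligible CPU remains,
 * in which case the PMU is left unowned until an associated CPU
 * comes back online.
 */
static int pick_new_owner(uint64_t associated, uint64_t online, int leaving)
{
    uint64_t candidates = associated & online & ~(1ULL << leaving);

    for (int cpu = 0; cpu < 64; cpu++)
        if (candidates & (1ULL << cpu))
            return cpu;
    return -1; /* no eligible CPU: counting stops for now */
}
```

The same shape explains why `cmem_lat_pmu_cpu_online()` only claims the PMU when `active_cpu` is empty: ownership is sticky until the owner actually goes away.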
+8 -1
drivers/resctrl/Kconfig
··· 1 1 menuconfig ARM64_MPAM_DRIVER 2 2 bool "MPAM driver" 3 - depends on ARM64 && ARM64_MPAM && EXPERT 3 + depends on ARM64 && ARM64_MPAM 4 + select ACPI_MPAM if ACPI 4 5 help 5 6 Memory System Resource Partitioning and Monitoring (MPAM) driver for 6 7 System IP, e.g. caches and memory controllers. ··· 23 22 If unsure, say N. 24 23 25 24 endif 25 + 26 + config ARM64_MPAM_RESCTRL_FS 27 + bool 28 + default y if ARM64_MPAM_DRIVER && RESCTRL_FS 29 + select RESCTRL_RMID_DEPENDS_ON_CLOSID 30 + select RESCTRL_ASSIGN_FIXED
+1
drivers/resctrl/Makefile
··· 1 1 obj-$(CONFIG_ARM64_MPAM_DRIVER) += mpam.o 2 2 mpam-y += mpam_devices.o 3 + mpam-$(CONFIG_ARM64_MPAM_RESCTRL_FS) += mpam_resctrl.o 3 4 4 5 ccflags-$(CONFIG_ARM64_MPAM_DRIVER_DEBUG) += -DDEBUG
+267 -38
drivers/resctrl/mpam_devices.c
··· 29 29 30 30 #include "mpam_internal.h" 31 31 32 - DEFINE_STATIC_KEY_FALSE(mpam_enabled); /* This moves to arch code */ 32 + /* Values for the T241 errata workaround */ 33 + #define T241_CHIPS_MAX 4 34 + #define T241_CHIP_NSLICES 12 35 + #define T241_SPARE_REG0_OFF 0x1b0000 36 + #define T241_SPARE_REG1_OFF 0x1c0000 37 + #define T241_CHIP_ID(phys) FIELD_GET(GENMASK_ULL(44, 43), phys) 38 + #define T241_SHADOW_REG_OFF(sidx, pid) (0x360048 + (sidx) * 0x10000 + (pid) * 8) 39 + #define SMCCC_SOC_ID_T241 0x036b0241 40 + static void __iomem *t241_scratch_regs[T241_CHIPS_MAX]; 33 41 34 42 /* 35 43 * mpam_list_lock protects the SRCU lists when writing. Once the ··· 82 74 83 75 /* When mpam is disabled, the printed reason to aid debugging */ 84 76 static char *mpam_disable_reason; 77 + 78 + /* 79 + * Whether resctrl has been setup. Used by cpuhp in preference to 80 + * mpam_is_enabled(). The disable call after an error interrupt makes 81 + * mpam_is_enabled() false before the cpuhp callbacks are made. 82 + * Reads/writes should hold mpam_cpuhp_state_lock, (or be cpuhp callbacks). 
83 + */ 84 + static bool mpam_resctrl_enabled; 85 85 86 86 /* 87 87 * An MSC is a physical container for controls and monitors, each identified by ··· 640 624 return ERR_PTR(-ENOENT); 641 625 } 642 626 627 + static int mpam_enable_quirk_nvidia_t241_1(struct mpam_msc *msc, 628 + const struct mpam_quirk *quirk) 629 + { 630 + s32 soc_id = arm_smccc_get_soc_id_version(); 631 + struct resource *r; 632 + phys_addr_t phys; 633 + 634 + /* 635 + * A mapping to a device other than the MSC is needed, check 636 + * SOC_ID is NVIDIA T241 chip (036b:0241) 637 + */ 638 + if (soc_id < 0 || soc_id != SMCCC_SOC_ID_T241) 639 + return -EINVAL; 640 + 641 + r = platform_get_resource(msc->pdev, IORESOURCE_MEM, 0); 642 + if (!r) 643 + return -EINVAL; 644 + 645 + /* Find the internal registers base addr from the CHIP ID */ 646 + msc->t241_id = T241_CHIP_ID(r->start); 647 + phys = FIELD_PREP(GENMASK_ULL(45, 44), msc->t241_id) | 0x19000000ULL; 648 + 649 + t241_scratch_regs[msc->t241_id] = ioremap(phys, SZ_8M); 650 + if (WARN_ON_ONCE(!t241_scratch_regs[msc->t241_id])) 651 + return -EINVAL; 652 + 653 + pr_info_once("Enabled workaround for NVIDIA T241 erratum T241-MPAM-1\n"); 654 + 655 + return 0; 656 + } 657 + 658 + static const struct mpam_quirk mpam_quirks[] = { 659 + { 660 + /* NVIDIA t241 erratum T241-MPAM-1 */ 661 + .init = mpam_enable_quirk_nvidia_t241_1, 662 + .iidr = MPAM_IIDR_NVIDIA_T241, 663 + .iidr_mask = MPAM_IIDR_MATCH_ONE, 664 + .workaround = T241_SCRUB_SHADOW_REGS, 665 + }, 666 + { 667 + /* NVIDIA t241 erratum T241-MPAM-4 */ 668 + .iidr = MPAM_IIDR_NVIDIA_T241, 669 + .iidr_mask = MPAM_IIDR_MATCH_ONE, 670 + .workaround = T241_FORCE_MBW_MIN_TO_ONE, 671 + }, 672 + { 673 + /* NVIDIA t241 erratum T241-MPAM-6 */ 674 + .iidr = MPAM_IIDR_NVIDIA_T241, 675 + .iidr_mask = MPAM_IIDR_MATCH_ONE, 676 + .workaround = T241_MBW_COUNTER_SCALE_64, 677 + }, 678 + { 679 + /* ARM CMN-650 CSU erratum 3642720 */ 680 + .iidr = MPAM_IIDR_ARM_CMN_650, 681 + .iidr_mask = MPAM_IIDR_MATCH_ONE, 682 + 
.workaround = IGNORE_CSU_NRDY, 683 + }, 684 + { NULL } /* Sentinel */ 685 + }; 686 + 687 + static void mpam_enable_quirks(struct mpam_msc *msc) 688 + { 689 + const struct mpam_quirk *quirk; 690 + 691 + for (quirk = &mpam_quirks[0]; quirk->iidr_mask; quirk++) { 692 + int err = 0; 693 + 694 + if (quirk->iidr != (msc->iidr & quirk->iidr_mask)) 695 + continue; 696 + 697 + if (quirk->init) 698 + err = quirk->init(msc, quirk); 699 + 700 + if (err) 701 + continue; 702 + 703 + mpam_set_quirk(quirk->workaround, msc); 704 + } 705 + } 706 + 643 707 /* 644 708 * IHI009A.a has this nugget: "If a monitor does not support automatic behaviour 645 709 * of NRDY, software can use this bit for any purpose" - so hardware might not ··· 811 715 mpam_set_feature(mpam_feat_mbw_part, props); 812 716 813 717 props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features); 718 + 719 + /* 720 + * The BWA_WD field can represent 0-63, but the control fields it 721 + * describes have a maximum of 16 bits. 722 + */ 723 + props->bwa_wd = min(props->bwa_wd, 16); 724 + 814 725 if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features)) 815 726 mpam_set_feature(mpam_feat_mbw_max, props); 816 727 ··· 954 851 /* Grab an IDR value to find out how many RIS there are */ 955 852 mutex_lock(&msc->part_sel_lock); 956 853 idr = mpam_msc_read_idr(msc); 854 + msc->iidr = mpam_read_partsel_reg(msc, IIDR); 957 855 mutex_unlock(&msc->part_sel_lock); 856 + 857 + mpam_enable_quirks(msc); 958 858 959 859 msc->ris_max = FIELD_GET(MPAMF_IDR_RIS_MAX, idr); 960 860 ··· 1009 903 enum mpam_device_features type; 1010 904 u64 *val; 1011 905 int err; 906 + bool waited_timeout; 1012 907 }; 1013 908 1014 909 static bool mpam_ris_has_mbwu_long_counter(struct mpam_msc_ris *ris) ··· 1159 1052 } 1160 1053 } 1161 1054 1162 - static u64 mpam_msmon_overflow_val(enum mpam_device_features type) 1055 + static u64 __mpam_msmon_overflow_val(enum mpam_device_features type) 1163 1056 { 1164 1057 /* TODO: implement scaling counters */ 
1165 1058 switch (type) { ··· 1172 1065 default: 1173 1066 return 0; 1174 1067 } 1068 + } 1069 + 1070 + static u64 mpam_msmon_overflow_val(enum mpam_device_features type, 1071 + struct mpam_msc *msc) 1072 + { 1073 + u64 overflow_val = __mpam_msmon_overflow_val(type); 1074 + 1075 + if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) && 1076 + type != mpam_feat_msmon_mbwu_63counter) 1077 + overflow_val *= 64; 1078 + 1079 + return overflow_val; 1175 1080 } 1176 1081 1177 1082 static void __ris_msmon_read(void *arg) ··· 1256 1137 if (mpam_has_feature(mpam_feat_msmon_csu_hw_nrdy, rprops)) 1257 1138 nrdy = now & MSMON___NRDY; 1258 1139 now = FIELD_GET(MSMON___VALUE, now); 1140 + 1141 + if (mpam_has_quirk(IGNORE_CSU_NRDY, msc) && m->waited_timeout) 1142 + nrdy = false; 1143 + 1259 1144 break; 1260 1145 case mpam_feat_msmon_mbwu_31counter: 1261 1146 case mpam_feat_msmon_mbwu_44counter: ··· 1280 1157 now = FIELD_GET(MSMON___VALUE, now); 1281 1158 } 1282 1159 1160 + if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) && 1161 + m->type != mpam_feat_msmon_mbwu_63counter) 1162 + now *= 64; 1163 + 1283 1164 if (nrdy) 1284 1165 break; 1285 1166 1286 1167 mbwu_state = &ris->mbwu_state[ctx->mon]; 1287 1168 1288 1169 if (overflow) 1289 - mbwu_state->correction += mpam_msmon_overflow_val(m->type); 1170 + mbwu_state->correction += mpam_msmon_overflow_val(m->type, msc); 1290 1171 1291 1172 /* 1292 1173 * Include bandwidth consumed before the last hardware reset and ··· 1397 1270 .ctx = ctx, 1398 1271 .type = type, 1399 1272 .val = val, 1273 + .waited_timeout = true, 1400 1274 }; 1401 1275 *val = 0; 1402 1276 ··· 1466 1338 __mpam_write_reg(msc, reg, bm); 1467 1339 } 1468 1340 1341 + static void mpam_apply_t241_erratum(struct mpam_msc_ris *ris, u16 partid) 1342 + { 1343 + int sidx, i, lcount = 1000; 1344 + void __iomem *regs; 1345 + u64 val0, val; 1346 + 1347 + regs = t241_scratch_regs[ris->vmsc->msc->t241_id]; 1348 + 1349 + for (i = 0; i < lcount; i++) { 1350 + /* Read the shadow register 
at index 0 */ 1351 + val0 = readq_relaxed(regs + T241_SHADOW_REG_OFF(0, partid)); 1352 + 1353 + /* Check if all the shadow registers have the same value */ 1354 + for (sidx = 1; sidx < T241_CHIP_NSLICES; sidx++) { 1355 + val = readq_relaxed(regs + 1356 + T241_SHADOW_REG_OFF(sidx, partid)); 1357 + if (val != val0) 1358 + break; 1359 + } 1360 + if (sidx == T241_CHIP_NSLICES) 1361 + break; 1362 + } 1363 + 1364 + if (i == lcount) 1365 + pr_warn_once("t241: inconsistent values in shadow regs\n"); 1366 + 1367 + /* Write zero to the spare registers so the MBW configuration takes effect */ 1368 + writeq_relaxed(0, regs + T241_SPARE_REG0_OFF); 1369 + writeq_relaxed(0, regs + T241_SPARE_REG1_OFF); 1370 + } 1371 + 1372 + static void mpam_quirk_post_config_change(struct mpam_msc_ris *ris, u16 partid, 1373 + struct mpam_config *cfg) 1374 + { 1375 + if (mpam_has_quirk(T241_SCRUB_SHADOW_REGS, ris->vmsc->msc)) 1376 + mpam_apply_t241_erratum(ris, partid); 1377 + } 1378 + 1379 + static u16 mpam_wa_t241_force_mbw_min_to_one(struct mpam_props *props) 1380 + { 1381 + u16 max_hw_value, min_hw_granule, res0_bits; 1382 + 1383 + res0_bits = 16 - props->bwa_wd; 1384 + max_hw_value = ((1 << props->bwa_wd) - 1) << res0_bits; 1385 + min_hw_granule = ~max_hw_value; 1386 + 1387 + return min_hw_granule + 1; 1388 + } 1389 + 1390 + static u16 mpam_wa_t241_calc_min_from_max(struct mpam_props *props, 1391 + struct mpam_config *cfg) 1392 + { 1393 + u16 val = 0; 1394 + u16 max; 1395 + u16 delta = ((5 * MPAMCFG_MBW_MAX_MAX) / 100) - 1; 1396 + 1397 + if (mpam_has_feature(mpam_feat_mbw_max, cfg)) { 1398 + max = cfg->mbw_max; 1399 + } else { 1400 + /* Resetting. Hence, use the ris specific default. */ 1401 + max = GENMASK(15, 16 - props->bwa_wd); 1402 + } 1403 + 1404 + if (max > delta) 1405 + val = max - delta; 1406 + 1407 + return val; 1408 + } 1409 + 1469 1410 /* Called via IPI. 
Call while holding an SRCU reference */ 1470 1411 static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, 1471 1412 struct mpam_config *cfg) ··· 1561 1364 __mpam_intpart_sel(ris->ris_idx, partid, msc); 1562 1365 } 1563 1366 1564 - if (mpam_has_feature(mpam_feat_cpor_part, rprops) && 1565 - mpam_has_feature(mpam_feat_cpor_part, cfg)) { 1566 - if (cfg->reset_cpbm) 1567 - mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, rprops->cpbm_wd); 1568 - else 1367 + if (mpam_has_feature(mpam_feat_cpor_part, rprops)) { 1368 + if (mpam_has_feature(mpam_feat_cpor_part, cfg)) 1569 1369 mpam_write_partsel_reg(msc, CPBM, cfg->cpbm); 1370 + else 1371 + mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, rprops->cpbm_wd); 1570 1372 } 1571 1373 1572 - if (mpam_has_feature(mpam_feat_mbw_part, rprops) && 1573 - mpam_has_feature(mpam_feat_mbw_part, cfg)) { 1574 - if (cfg->reset_mbw_pbm) 1374 + if (mpam_has_feature(mpam_feat_mbw_part, rprops)) { 1375 + if (mpam_has_feature(mpam_feat_mbw_part, cfg)) 1575 1376 mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, rprops->mbw_pbm_bits); 1576 1377 else 1577 1378 mpam_write_partsel_reg(msc, MBW_PBM, cfg->mbw_pbm); 1578 1379 } 1579 1380 1580 - if (mpam_has_feature(mpam_feat_mbw_min, rprops) && 1581 - mpam_has_feature(mpam_feat_mbw_min, cfg)) 1582 - mpam_write_partsel_reg(msc, MBW_MIN, 0); 1381 + if (mpam_has_feature(mpam_feat_mbw_min, rprops)) { 1382 + u16 val = 0; 1583 1383 1584 - if (mpam_has_feature(mpam_feat_mbw_max, rprops) && 1585 - mpam_has_feature(mpam_feat_mbw_max, cfg)) { 1586 - if (cfg->reset_mbw_max) 1587 - mpam_write_partsel_reg(msc, MBW_MAX, MPAMCFG_MBW_MAX_MAX); 1588 - else 1589 - mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max); 1384 + if (mpam_has_quirk(T241_FORCE_MBW_MIN_TO_ONE, msc)) { 1385 + u16 min = mpam_wa_t241_force_mbw_min_to_one(rprops); 1386 + 1387 + val = mpam_wa_t241_calc_min_from_max(rprops, cfg); 1388 + val = max(val, min); 1389 + } 1390 + 1391 + mpam_write_partsel_reg(msc, MBW_MIN, val); 1590 1392 } 1591 1393 1592 - if 
(mpam_has_feature(mpam_feat_mbw_prop, rprops) && 1593 - mpam_has_feature(mpam_feat_mbw_prop, cfg)) 1394 + if (mpam_has_feature(mpam_feat_mbw_max, rprops)) { 1395 + if (mpam_has_feature(mpam_feat_mbw_max, cfg)) 1396 + mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max); 1397 + else 1398 + mpam_write_partsel_reg(msc, MBW_MAX, MPAMCFG_MBW_MAX_MAX); 1399 + } 1400 + 1401 + if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) 1594 1402 mpam_write_partsel_reg(msc, MBW_PROP, 0); 1595 1403 1596 1404 if (mpam_has_feature(mpam_feat_cmax_cmax, rprops)) ··· 1622 1420 1623 1421 mpam_write_partsel_reg(msc, PRI, pri_val); 1624 1422 } 1423 + 1424 + mpam_quirk_post_config_change(ris, partid, cfg); 1625 1425 1626 1426 mutex_unlock(&msc->part_sel_lock); 1627 1427 } ··· 1697 1493 return 0; 1698 1494 } 1699 1495 1700 - static void mpam_init_reset_cfg(struct mpam_config *reset_cfg) 1701 - { 1702 - *reset_cfg = (struct mpam_config) { 1703 - .reset_cpbm = true, 1704 - .reset_mbw_pbm = true, 1705 - .reset_mbw_max = true, 1706 - }; 1707 - bitmap_fill(reset_cfg->features, MPAM_FEATURE_LAST); 1708 - } 1709 - 1710 1496 /* 1711 1497 * Called via smp_call_on_cpu() to prevent migration, while still being 1712 1498 * pre-emptible. Caller must hold mpam_srcu. 
··· 1704 1510 static int mpam_reset_ris(void *arg) 1705 1511 { 1706 1512 u16 partid, partid_max; 1707 - struct mpam_config reset_cfg; 1513 + struct mpam_config reset_cfg = {}; 1708 1514 struct mpam_msc_ris *ris = arg; 1709 1515 1710 1516 if (ris->in_reset_state) 1711 1517 return 0; 1712 - 1713 - mpam_init_reset_cfg(&reset_cfg); 1714 1518 1715 1519 spin_lock(&partid_max_lock); 1716 1520 partid_max = mpam_partid_max; ··· 1824 1632 mpam_reprogram_msc(msc); 1825 1633 } 1826 1634 1635 + if (mpam_resctrl_enabled) 1636 + return mpam_resctrl_online_cpu(cpu); 1637 + 1827 1638 return 0; 1828 1639 } 1829 1640 ··· 1869 1674 static int mpam_cpu_offline(unsigned int cpu) 1870 1675 { 1871 1676 struct mpam_msc *msc; 1677 + 1678 + if (mpam_resctrl_enabled) 1679 + mpam_resctrl_offline_cpu(cpu); 1872 1680 1873 1681 guard(srcu)(&mpam_srcu); 1874 1682 list_for_each_entry_srcu(msc, &mpam_all_msc, all_msc_list, ··· 2169 1971 * resulting safe value must be compatible with both. When merging values in 2170 1972 * the tree, all the aliasing resources must be handled first. 2171 1973 * On mismatch, parent is modified. 1974 + * Quirks on an MSC will apply to all MSC in that class. 2172 1975 */ 2173 1976 static void __props_mismatch(struct mpam_props *parent, 2174 1977 struct mpam_props *child, bool alias) ··· 2289 2090 * nobble the class feature, as we can't configure all the resources. 2290 2091 * e.g. The L3 cache is composed of two resources with 13 and 17 portion 2291 2092 * bitmaps respectively. 2093 + * Quirks on an MSC will apply to all MSC in that class. 
2292 2094 */ 2293 2095 static void 2294 2096 __class_props_mismatch(struct mpam_class *class, struct mpam_vmsc *vmsc) ··· 2302 2102 2303 2103 dev_dbg(dev, "Merging features for class:0x%lx &= vmsc:0x%lx\n", 2304 2104 (long)cprops->features, (long)vprops->features); 2105 + 2106 + /* Merge quirks */ 2107 + class->quirks |= vmsc->msc->quirks; 2305 2108 2306 2109 /* Take the safe value for any common features */ 2307 2110 __props_mismatch(cprops, vprops, false); ··· 2370 2167 2371 2168 list_for_each_entry(vmsc, &comp->vmsc, comp_list) 2372 2169 __class_props_mismatch(class, vmsc); 2170 + 2171 + if (mpam_has_quirk(T241_FORCE_MBW_MIN_TO_ONE, class)) 2172 + mpam_clear_feature(mpam_feat_mbw_min, &class->props); 2373 2173 } 2374 2174 2375 2175 /* ··· 2726 2520 mutex_unlock(&mpam_list_lock); 2727 2521 cpus_read_unlock(); 2728 2522 2523 + if (!err) { 2524 + err = mpam_resctrl_setup(); 2525 + if (err) 2526 + pr_err("Failed to initialise resctrl: %d\n", err); 2527 + } 2528 + 2729 2529 if (err) { 2730 2530 mpam_disable_reason = "Failed to enable."; 2731 2531 schedule_work(&mpam_broken_work); ··· 2739 2527 } 2740 2528 2741 2529 static_branch_enable(&mpam_enabled); 2530 + mpam_resctrl_enabled = true; 2742 2531 mpam_register_cpuhp_callbacks(mpam_cpu_online, mpam_cpu_offline, 2743 2532 "mpam:online"); 2744 2533 ··· 2772 2559 } 2773 2560 } 2774 2561 2775 - static void mpam_reset_class_locked(struct mpam_class *class) 2562 + void mpam_reset_class_locked(struct mpam_class *class) 2776 2563 { 2777 2564 struct mpam_component *comp; 2778 2565 ··· 2799 2586 void mpam_disable(struct work_struct *ignored) 2800 2587 { 2801 2588 int idx; 2589 + bool do_resctrl_exit; 2802 2590 struct mpam_class *class; 2803 2591 struct mpam_msc *msc, *tmp; 2592 + 2593 + if (mpam_is_enabled()) 2594 + static_branch_disable(&mpam_enabled); 2804 2595 2805 2596 mutex_lock(&mpam_cpuhp_state_lock); 2806 2597 if (mpam_cpuhp_state) { 2807 2598 cpuhp_remove_state(mpam_cpuhp_state); 2808 2599 mpam_cpuhp_state = 0; 2809 
2600 } 2601 + 2602 + /* 2603 + * Removing the cpuhp state called mpam_cpu_offline() and told resctrl 2604 + * all the CPUs are offline. 2605 + */ 2606 + do_resctrl_exit = mpam_resctrl_enabled; 2607 + mpam_resctrl_enabled = false; 2810 2608 mutex_unlock(&mpam_cpuhp_state_lock); 2811 2609 2812 - static_branch_disable(&mpam_enabled); 2610 + if (do_resctrl_exit) 2611 + mpam_resctrl_exit(); 2813 2612 2814 2613 mpam_unregister_irqs(); 2815 2614 2816 2615 idx = srcu_read_lock(&mpam_srcu); 2817 2616 list_for_each_entry_srcu(class, &mpam_classes, classes_list, 2818 - srcu_read_lock_held(&mpam_srcu)) 2617 + srcu_read_lock_held(&mpam_srcu)) { 2819 2618 mpam_reset_class(class); 2619 + if (do_resctrl_exit) 2620 + mpam_resctrl_teardown_class(class); 2621 + } 2820 2622 srcu_read_unlock(&mpam_srcu, idx); 2821 2623 2822 2624 mutex_lock(&mpam_list_lock); ··· 2922 2694 srcu_read_lock_held(&mpam_srcu)) { 2923 2695 arg.ris = ris; 2924 2696 mpam_touch_msc(msc, __write_config, &arg); 2697 + ris->in_reset_state = false; 2925 2698 } 2926 2699 mutex_unlock(&msc->cfg_lock); 2927 2700 }
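The T241_FORCE_MBW_MIN_TO_ONE helpers in this file are pure 16-bit arithmetic: with a `bwa_wd`-bit control field left-aligned in a 16-bit register (and `bwa_wd` capped at 16 earlier in the file), the low `16 - bwa_wd` bits are RES0, so the smallest non-zero hardware granule is one more than the mask of those RES0 bits. A standalone model of that computation (function name mine, logic mirroring `mpam_wa_t241_force_mbw_min_to_one()`):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest non-zero MBW value the hardware can represent: the maximum
 * programmable value has all bwa_wd implemented bits set; its 16-bit
 * complement is the RES0 mask, and one granule above that is the floor
 * the workaround forces MBW_MIN to.
 */
static uint16_t t241_min_granule(unsigned int bwa_wd)
{
    uint16_t res0_bits = 16 - bwa_wd;
    uint16_t max_hw_value = ((1u << bwa_wd) - 1) << res0_bits;
    uint16_t min_hw_granule = (uint16_t)~max_hw_value;

    return min_hw_granule + 1;
}
```

For a full-width 16-bit field this yields exactly 1, which is where the quirk name "force MBW min to one" comes from; narrower fields get a correspondingly coarser floor.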
+101 -7
drivers/resctrl/mpam_internal.h
··· 12 12 #include <linux/jump_label.h> 13 13 #include <linux/llist.h> 14 14 #include <linux/mutex.h> 15 + #include <linux/resctrl.h> 15 16 #include <linux/spinlock.h> 16 17 #include <linux/srcu.h> 17 18 #include <linux/types.h> 18 19 20 + #include <asm/mpam.h> 21 + 19 22 #define MPAM_MSC_MAX_NUM_RIS 16 20 23 21 24 struct platform_device; 22 - 23 - DECLARE_STATIC_KEY_FALSE(mpam_enabled); 24 25 25 26 #ifdef CONFIG_MPAM_KUNIT_TEST 26 27 #define PACKED_FOR_KUNIT __packed 27 28 #else 28 29 #define PACKED_FOR_KUNIT 29 30 #endif 31 + 32 + /* 33 + * These 'mon' values must not alias an actual monitor, so must be larger than 34 + * U16_MAX, but not be confused with an errno value, so smaller than 35 + * (u32)-SZ_4K. 36 + * USE_PRE_ALLOCATED is used to avoid confusion with an actual monitor. 37 + */ 38 + #define USE_PRE_ALLOCATED (U16_MAX + 1) 30 39 31 40 static inline bool mpam_is_enabled(void) 32 41 { ··· 85 76 u8 pmg_max; 86 77 unsigned long ris_idxs; 87 78 u32 ris_max; 79 + u32 iidr; 80 + u16 quirks; 88 81 89 82 /* 90 83 * error_irq_lock is taken when registering/unregistering the error ··· 129 118 130 119 void __iomem *mapped_hwpage; 131 120 size_t mapped_hwpage_sz; 121 + 122 + /* Values only used on some platforms for quirks */ 123 + u32 t241_id; 132 124 133 125 struct mpam_garbage garbage; 134 126 };
iidr_mask; 227 + 228 + enum mpam_device_quirks workaround; 229 + }; 230 + 231 + #define MPAM_IIDR_MATCH_ONE (FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0xfff) | \ 232 + FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0xf) | \ 233 + FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0xf) | \ 234 + FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0xfff)) 235 + 236 + #define MPAM_IIDR_NVIDIA_T241 (FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0x241) | \ 237 + FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0) | \ 238 + FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0) | \ 239 + FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0x36b)) 240 + 241 + #define MPAM_IIDR_ARM_CMN_650 (FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0) | \ 242 + FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0) | \ 243 + FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0) | \ 244 + FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0x43b)) 245 + 224 246 /* The values for MSMON_CFG_MBWU_FLT.RWBW */ 225 247 enum mon_filter_options { 226 248 COUNT_BOTH = 0, ··· 265 215 }; 266 216 267 217 struct mon_cfg { 268 - u16 mon; 218 + /* 219 + * mon must be large enough to hold out of range values like 220 + * USE_PRE_ALLOCATED 221 + */ 222 + u32 mon; 269 223 u8 pmg; 270 224 bool match_pmg; 271 225 bool csu_exclude_clean; ··· 300 246 301 247 struct mpam_props props; 302 248 u32 nrdy_usec; 249 + u16 quirks; 303 250 u8 level; 304 251 enum mpam_class_types type; 305 252 ··· 320 265 u32 cpbm; 321 266 u32 mbw_pbm; 322 267 u16 mbw_max; 323 - 324 - bool reset_cpbm; 325 - bool reset_mbw_pbm; 326 - bool reset_mbw_max; 327 268 328 269 struct mpam_garbage garbage; 329 270 }; ··· 388 337 struct mpam_garbage garbage; 389 338 }; 390 339 340 + struct mpam_resctrl_dom { 341 + struct mpam_component *ctrl_comp; 342 + 343 + /* 344 + * There is no single mon_comp because different events may be backed 345 + * by different class/components. mon_comp is indexed by the event 346 + * number. 
347 + */ 348 + struct mpam_component *mon_comp[QOS_NUM_EVENTS]; 349 + 350 + struct rdt_ctrl_domain resctrl_ctrl_dom; 351 + struct rdt_l3_mon_domain resctrl_mon_dom; 352 + }; 353 + 354 + struct mpam_resctrl_res { 355 + struct mpam_class *class; 356 + struct rdt_resource resctrl_res; 357 + bool cdp_enabled; 358 + }; 359 + 360 + struct mpam_resctrl_mon { 361 + struct mpam_class *class; 362 + 363 + /* per-class data that resctrl needs will live here */ 364 + }; 365 + 391 366 static inline int mpam_alloc_csu_mon(struct mpam_class *class) 392 367 { 393 368 struct mpam_props *cprops = &class->props; ··· 458 381 void mpam_enable(struct work_struct *work); 459 382 void mpam_disable(struct work_struct *work); 460 383 384 + /* Reset all the RIS in a class under cpus_read_lock() */ 385 + void mpam_reset_class_locked(struct mpam_class *class); 386 + 461 387 int mpam_apply_config(struct mpam_component *comp, u16 partid, 462 388 struct mpam_config *cfg); 463 389 ··· 470 390 471 391 int mpam_get_cpumask_from_cache_id(unsigned long cache_id, u32 cache_level, 472 392 cpumask_t *affinity); 393 + 394 + #ifdef CONFIG_RESCTRL_FS 395 + int mpam_resctrl_setup(void); 396 + void mpam_resctrl_exit(void); 397 + int mpam_resctrl_online_cpu(unsigned int cpu); 398 + void mpam_resctrl_offline_cpu(unsigned int cpu); 399 + void mpam_resctrl_teardown_class(struct mpam_class *class); 400 + #else 401 + static inline int mpam_resctrl_setup(void) { return 0; } 402 + static inline void mpam_resctrl_exit(void) { } 403 + static inline int mpam_resctrl_online_cpu(unsigned int cpu) { return 0; } 404 + static inline void mpam_resctrl_offline_cpu(unsigned int cpu) { } 405 + static inline void mpam_resctrl_teardown_class(struct mpam_class *class) { } 406 + #endif /* CONFIG_RESCTRL_FS */ 473 407 474 408 /* 475 409 * MPAM MSCs have the following register layout. See:
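The quirk table in mpam_devices.c matches on `quirk->iidr == (msc->iidr & quirk->iidr_mask)`, with `MPAM_IIDR_MATCH_ONE` masking in all four IIDR fields. A standalone sketch of that match (the field layout here, Implementer[31:20], Revision[19:16], Variant[15:12], ProductID[11:0], is an assumption inferred from the `FIELD_PREP_CONST` widths above; macro and function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed MPAMF_IIDR packing, matching the 12/4/4/12-bit field widths */
#define IIDR(impl, rev, var, prod) \
    (((uint32_t)(impl) << 20) | ((uint32_t)(rev) << 16) | \
     ((uint32_t)(var) << 12) | (uint32_t)(prod))

#define IIDR_MATCH_ONE   IIDR(0xfff, 0xf, 0xf, 0xfff)  /* exact part+revision */
#define IIDR_NVIDIA_T241 IIDR(0x36b, 0, 0, 0x241)

/* A quirk fires when the masked MSC IIDR equals the quirk's IIDR */
static int quirk_matches(uint32_t msc_iidr, uint32_t quirk_iidr,
                         uint32_t quirk_mask)
{
    return quirk_iidr == (msc_iidr & quirk_mask);
}
```

With an all-fields mask like `MATCH_ONE`, a later silicon revision no longer matches, which is presumably the point: errata workarounds stay pinned to the exact implementations known to need them.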
+1704
drivers/resctrl/mpam_resctrl.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + // Copyright (C) 2025 Arm Ltd. 3 + 4 + #define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__ 5 + 6 + #include <linux/arm_mpam.h> 7 + #include <linux/cacheinfo.h> 8 + #include <linux/cpu.h> 9 + #include <linux/cpumask.h> 10 + #include <linux/errno.h> 11 + #include <linux/limits.h> 12 + #include <linux/list.h> 13 + #include <linux/math.h> 14 + #include <linux/printk.h> 15 + #include <linux/rculist.h> 16 + #include <linux/resctrl.h> 17 + #include <linux/slab.h> 18 + #include <linux/types.h> 19 + #include <linux/wait.h> 20 + 21 + #include <asm/mpam.h> 22 + 23 + #include "mpam_internal.h" 24 + 25 + DECLARE_WAIT_QUEUE_HEAD(resctrl_mon_ctx_waiters); 26 + 27 + /* 28 + * The classes we've picked to map to resctrl resources, wrapped 29 + * in with their resctrl structure. 30 + * Class pointer may be NULL. 31 + */ 32 + static struct mpam_resctrl_res mpam_resctrl_controls[RDT_NUM_RESOURCES]; 33 + 34 + #define for_each_mpam_resctrl_control(res, rid) \ 35 + for (rid = 0, res = &mpam_resctrl_controls[rid]; \ 36 + rid < RDT_NUM_RESOURCES; \ 37 + rid++, res = &mpam_resctrl_controls[rid]) 38 + 39 + /* 40 + * The classes we've picked to map to resctrl events. 41 + * Resctrl believes all the world's a Xeon, and these are all on the L3. This 42 + * array lets us find the actual class backing the event counters. e.g. 43 + * the only memory bandwidth counters may be on the memory controller, but to 44 + * make use of them, we pretend they are on L3. Restrict the events considered 45 + * to those supported by MPAM. 46 + * Class pointer may be NULL. 
47 + */ 48 + #define MPAM_MAX_EVENT QOS_L3_MBM_TOTAL_EVENT_ID 49 + static struct mpam_resctrl_mon mpam_resctrl_counters[MPAM_MAX_EVENT + 1]; 50 + 51 + #define for_each_mpam_resctrl_mon(mon, eventid) \ 52 + for (eventid = QOS_FIRST_EVENT, mon = &mpam_resctrl_counters[eventid]; \ 53 + eventid <= MPAM_MAX_EVENT; \ 54 + eventid++, mon = &mpam_resctrl_counters[eventid]) 55 + 56 + /* The lock for modifying resctrl's domain lists from cpuhp callbacks. */ 57 + static DEFINE_MUTEX(domain_list_lock); 58 + 59 + /* 60 + * MPAM emulates CDP by setting different PARTID in the I/D fields of MPAM0_EL1. 61 + * This applies globally to all traffic the CPU generates. 62 + */ 63 + static bool cdp_enabled; 64 + 65 + /* 66 + * We use cacheinfo to discover the size of the caches and their id. cacheinfo 67 + * populates this from a device_initcall(). mpam_resctrl_setup() must wait. 68 + */ 69 + static bool cacheinfo_ready; 70 + static DECLARE_WAIT_QUEUE_HEAD(wait_cacheinfo_ready); 71 + 72 + /* 73 + * If resctrl_init() succeeded, resctrl_exit() can be used to remove support 74 + * for the filesystem in the event of an error. 
75 + */ 76 + static bool resctrl_enabled; 77 + 78 + bool resctrl_arch_alloc_capable(void) 79 + { 80 + struct mpam_resctrl_res *res; 81 + enum resctrl_res_level rid; 82 + 83 + for_each_mpam_resctrl_control(res, rid) { 84 + if (res->resctrl_res.alloc_capable) 85 + return true; 86 + } 87 + 88 + return false; 89 + } 90 + 91 + bool resctrl_arch_mon_capable(void) 92 + { 93 + struct mpam_resctrl_res *res = &mpam_resctrl_controls[RDT_RESOURCE_L3]; 94 + struct rdt_resource *l3 = &res->resctrl_res; 95 + 96 + /* All monitors are presented as being on the L3 cache */ 97 + return l3->mon_capable; 98 + } 99 + 100 + bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) 101 + { 102 + return false; 103 + } 104 + 105 + void resctrl_arch_mon_event_config_read(void *info) 106 + { 107 + } 108 + 109 + void resctrl_arch_mon_event_config_write(void *info) 110 + { 111 + } 112 + 113 + void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d) 114 + { 115 + } 116 + 117 + void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d, 118 + u32 closid, u32 rmid, enum resctrl_event_id eventid) 119 + { 120 + } 121 + 122 + void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d, 123 + u32 closid, u32 rmid, int cntr_id, 124 + enum resctrl_event_id eventid) 125 + { 126 + } 127 + 128 + void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d, 129 + enum resctrl_event_id evtid, u32 rmid, u32 closid, 130 + u32 cntr_id, bool assign) 131 + { 132 + } 133 + 134 + int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d, 135 + u32 unused, u32 rmid, int cntr_id, 136 + enum resctrl_event_id eventid, u64 *val) 137 + { 138 + return -EOPNOTSUPP; 139 + } 140 + 141 + bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r) 142 + { 143 + return false; 144 + } 145 + 146 + int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable) 147 + { 148 + return -EINVAL; 149 + 
} 150 + 151 + int resctrl_arch_io_alloc_enable(struct rdt_resource *r, bool enable) 152 + { 153 + return -EOPNOTSUPP; 154 + } 155 + 156 + bool resctrl_arch_get_io_alloc_enabled(struct rdt_resource *r) 157 + { 158 + return false; 159 + } 160 + 161 + void resctrl_arch_pre_mount(void) 162 + { 163 + } 164 + 165 + bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level rid) 166 + { 167 + return mpam_resctrl_controls[rid].cdp_enabled; 168 + } 169 + 170 + /** 171 + * resctrl_reset_task_closids() - Reset the PARTID/PMG values for all tasks. 172 + * 173 + * At boot, all existing tasks use partid zero for D and I. 174 + * To enable/disable CDP emulation, all these tasks need relabelling. 175 + */ 176 + static void resctrl_reset_task_closids(void) 177 + { 178 + struct task_struct *p, *t; 179 + 180 + read_lock(&tasklist_lock); 181 + for_each_process_thread(p, t) { 182 + resctrl_arch_set_closid_rmid(t, RESCTRL_RESERVED_CLOSID, 183 + RESCTRL_RESERVED_RMID); 184 + } 185 + read_unlock(&tasklist_lock); 186 + } 187 + 188 + int resctrl_arch_set_cdp_enabled(enum resctrl_res_level rid, bool enable) 189 + { 190 + u32 partid_i = RESCTRL_RESERVED_CLOSID, partid_d = RESCTRL_RESERVED_CLOSID; 191 + struct mpam_resctrl_res *res = &mpam_resctrl_controls[RDT_RESOURCE_L3]; 192 + struct rdt_resource *l3 = &res->resctrl_res; 193 + int cpu; 194 + 195 + if (!IS_ENABLED(CONFIG_EXPERT) && enable) { 196 + /* 197 + * If the resctrl fs is mounted more than once, sequentially, 198 + * then CDP can lead to the use of out of range PARTIDs. 199 + */ 200 + pr_warn("CDP not supported\n"); 201 + return -EOPNOTSUPP; 202 + } 203 + 204 + if (enable) 205 + pr_warn("CDP is an expert feature and may cause MPAM to malfunction.\n"); 206 + 207 + /* 208 + * resctrl_arch_set_cdp_enabled() is only called with enable set to 209 + * false on error and unmount. 
210 + */ 211 + cdp_enabled = enable; 212 + mpam_resctrl_controls[rid].cdp_enabled = enable; 213 + 214 + if (enable) 215 + l3->mon.num_rmid = resctrl_arch_system_num_rmid_idx() / 2; 216 + else 217 + l3->mon.num_rmid = resctrl_arch_system_num_rmid_idx(); 218 + 219 + /* The mbw_max feature can't hide cdp as it's a per-partid maximum. */ 220 + if (cdp_enabled && !mpam_resctrl_controls[RDT_RESOURCE_MBA].cdp_enabled) 221 + mpam_resctrl_controls[RDT_RESOURCE_MBA].resctrl_res.alloc_capable = false; 222 + 223 + if (mpam_resctrl_controls[RDT_RESOURCE_MBA].cdp_enabled && 224 + mpam_resctrl_controls[RDT_RESOURCE_MBA].class) 225 + mpam_resctrl_controls[RDT_RESOURCE_MBA].resctrl_res.alloc_capable = true; 226 + 227 + if (enable) { 228 + if (mpam_partid_max < 1) 229 + return -EINVAL; 230 + 231 + partid_d = resctrl_get_config_index(RESCTRL_RESERVED_CLOSID, CDP_DATA); 232 + partid_i = resctrl_get_config_index(RESCTRL_RESERVED_CLOSID, CDP_CODE); 233 + } 234 + 235 + mpam_set_task_partid_pmg(current, partid_d, partid_i, 0, 0); 236 + WRITE_ONCE(arm64_mpam_global_default, mpam_get_regval(current)); 237 + 238 + resctrl_reset_task_closids(); 239 + 240 + for_each_possible_cpu(cpu) 241 + mpam_set_cpu_defaults(cpu, partid_d, partid_i, 0, 0); 242 + on_each_cpu(resctrl_arch_sync_cpu_closid_rmid, NULL, 1); 243 + 244 + return 0; 245 + } 246 + 247 + static bool mpam_resctrl_hide_cdp(enum resctrl_res_level rid) 248 + { 249 + return cdp_enabled && !resctrl_arch_get_cdp_enabled(rid); 250 + } 251 + 252 + /* 253 + * MSC may raise an error interrupt if it sees an out-of-range partid/pmg, 254 + * and go on to truncate the value. Regardless of what the hardware supports, 255 + * only the system-wide safe value is safe to use. 
256 + */ 257 + u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) 258 + { 259 + return mpam_partid_max + 1; 260 + } 261 + 262 + u32 resctrl_arch_system_num_rmid_idx(void) 263 + { 264 + return (mpam_pmg_max + 1) * (mpam_partid_max + 1); 265 + } 266 + 267 + u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) 268 + { 269 + return closid * (mpam_pmg_max + 1) + rmid; 270 + } 271 + 272 + void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) 273 + { 274 + *closid = idx / (mpam_pmg_max + 1); 275 + *rmid = idx % (mpam_pmg_max + 1); 276 + } 277 + 278 + void resctrl_arch_sched_in(struct task_struct *tsk) 279 + { 280 + lockdep_assert_preemption_disabled(); 281 + 282 + mpam_thread_switch(tsk); 283 + } 284 + 285 + void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid) 286 + { 287 + WARN_ON_ONCE(closid > U16_MAX); 288 + WARN_ON_ONCE(rmid > U8_MAX); 289 + 290 + if (!cdp_enabled) { 291 + mpam_set_cpu_defaults(cpu, closid, closid, rmid, rmid); 292 + } else { 293 + /* 294 + * When CDP is enabled, resctrl halves the closid range and we 295 + * use odd/even partid for one closid. 
296 + */ 297 + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); 298 + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); 299 + 300 + mpam_set_cpu_defaults(cpu, partid_d, partid_i, rmid, rmid); 301 + } 302 + } 303 + 304 + void resctrl_arch_sync_cpu_closid_rmid(void *info) 305 + { 306 + struct resctrl_cpu_defaults *r = info; 307 + 308 + lockdep_assert_preemption_disabled(); 309 + 310 + if (r) { 311 + resctrl_arch_set_cpu_default_closid_rmid(smp_processor_id(), 312 + r->closid, r->rmid); 313 + } 314 + 315 + resctrl_arch_sched_in(current); 316 + } 317 + 318 + void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) 319 + { 320 + WARN_ON_ONCE(closid > U16_MAX); 321 + WARN_ON_ONCE(rmid > U8_MAX); 322 + 323 + if (!cdp_enabled) { 324 + mpam_set_task_partid_pmg(tsk, closid, closid, rmid, rmid); 325 + } else { 326 + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); 327 + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); 328 + 329 + mpam_set_task_partid_pmg(tsk, partid_d, partid_i, rmid, rmid); 330 + } 331 + } 332 + 333 + bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) 334 + { 335 + u64 regval = mpam_get_regval(tsk); 336 + u32 tsk_closid = FIELD_GET(MPAM0_EL1_PARTID_D, regval); 337 + 338 + if (cdp_enabled) 339 + tsk_closid >>= 1; 340 + 341 + return tsk_closid == closid; 342 + } 343 + 344 + /* The task's pmg is not unique, the partid must be considered too */ 345 + bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid) 346 + { 347 + u64 regval = mpam_get_regval(tsk); 348 + u32 tsk_closid = FIELD_GET(MPAM0_EL1_PARTID_D, regval); 349 + u32 tsk_rmid = FIELD_GET(MPAM0_EL1_PMG_D, regval); 350 + 351 + if (cdp_enabled) 352 + tsk_closid >>= 1; 353 + 354 + return (tsk_closid == closid) && (tsk_rmid == rmid); 355 + } 356 + 357 + struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) 358 + { 359 + if (l >= RDT_NUM_RESOURCES) 360 + return NULL; 361 + 362 + return 
&mpam_resctrl_controls[l].resctrl_res; 363 + } 364 + 365 + static int resctrl_arch_mon_ctx_alloc_no_wait(enum resctrl_event_id evtid) 366 + { 367 + struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[evtid]; 368 + 369 + if (!mpam_is_enabled()) 370 + return -EINVAL; 371 + 372 + if (!mon->class) 373 + return -EINVAL; 374 + 375 + switch (evtid) { 376 + case QOS_L3_OCCUP_EVENT_ID: 377 + /* With CDP, one monitor gets used for both code/data reads */ 378 + return mpam_alloc_csu_mon(mon->class); 379 + case QOS_L3_MBM_LOCAL_EVENT_ID: 380 + case QOS_L3_MBM_TOTAL_EVENT_ID: 381 + return USE_PRE_ALLOCATED; 382 + default: 383 + return -EOPNOTSUPP; 384 + } 385 + } 386 + 387 + void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, 388 + enum resctrl_event_id evtid) 389 + { 390 + DEFINE_WAIT(wait); 391 + int *ret; 392 + 393 + ret = kmalloc_obj(*ret); 394 + if (!ret) 395 + return ERR_PTR(-ENOMEM); 396 + 397 + do { 398 + prepare_to_wait(&resctrl_mon_ctx_waiters, &wait, 399 + TASK_INTERRUPTIBLE); 400 + *ret = resctrl_arch_mon_ctx_alloc_no_wait(evtid); 401 + if (*ret == -ENOSPC) 402 + schedule(); 403 + } while (*ret == -ENOSPC && !signal_pending(current)); 404 + finish_wait(&resctrl_mon_ctx_waiters, &wait); 405 + 406 + return ret; 407 + } 408 + 409 + static void resctrl_arch_mon_ctx_free_no_wait(enum resctrl_event_id evtid, 410 + u32 mon_idx) 411 + { 412 + struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[evtid]; 413 + 414 + if (!mpam_is_enabled()) 415 + return; 416 + 417 + if (!mon->class) 418 + return; 419 + 420 + if (evtid == QOS_L3_OCCUP_EVENT_ID) 421 + mpam_free_csu_mon(mon->class, mon_idx); 422 + 423 + wake_up(&resctrl_mon_ctx_waiters); 424 + } 425 + 426 + void resctrl_arch_mon_ctx_free(struct rdt_resource *r, 427 + enum resctrl_event_id evtid, void *arch_mon_ctx) 428 + { 429 + u32 mon_idx = *(u32 *)arch_mon_ctx; 430 + 431 + kfree(arch_mon_ctx); 432 + 433 + resctrl_arch_mon_ctx_free_no_wait(evtid, mon_idx); 434 + } 435 + 436 + static int __read_mon(struct 
mpam_resctrl_mon *mon, struct mpam_component *mon_comp, 437 + enum mpam_device_features mon_type, 438 + int mon_idx, 439 + enum resctrl_conf_type cdp_type, u32 closid, u32 rmid, u64 *val) 440 + { 441 + struct mon_cfg cfg; 442 + 443 + if (!mpam_is_enabled()) 444 + return -EINVAL; 445 + 446 + /* Shift closid to account for CDP */ 447 + closid = resctrl_get_config_index(closid, cdp_type); 448 + 449 + if (irqs_disabled()) { 450 + /* Check if we can access this domain without an IPI */ 451 + return -EIO; 452 + } 453 + 454 + cfg = (struct mon_cfg) { 455 + .mon = mon_idx, 456 + .match_pmg = true, 457 + .partid = closid, 458 + .pmg = rmid, 459 + }; 460 + 461 + return mpam_msmon_read(mon_comp, &cfg, mon_type, val); 462 + } 463 + 464 + static int read_mon_cdp_safe(struct mpam_resctrl_mon *mon, struct mpam_component *mon_comp, 465 + enum mpam_device_features mon_type, 466 + int mon_idx, u32 closid, u32 rmid, u64 *val) 467 + { 468 + if (cdp_enabled) { 469 + u64 code_val = 0, data_val = 0; 470 + int err; 471 + 472 + err = __read_mon(mon, mon_comp, mon_type, mon_idx, 473 + CDP_CODE, closid, rmid, &code_val); 474 + if (err) 475 + return err; 476 + 477 + err = __read_mon(mon, mon_comp, mon_type, mon_idx, 478 + CDP_DATA, closid, rmid, &data_val); 479 + if (err) 480 + return err; 481 + 482 + *val += code_val + data_val; 483 + return 0; 484 + } 485 + 486 + return __read_mon(mon, mon_comp, mon_type, mon_idx, 487 + CDP_NONE, closid, rmid, val); 488 + } 489 + 490 + /* MBWU when not in ABMC mode (not supported), and CSU counters. 
*/ 491 + int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr, 492 + u32 closid, u32 rmid, enum resctrl_event_id eventid, 493 + void *arch_priv, u64 *val, void *arch_mon_ctx) 494 + { 495 + struct mpam_resctrl_dom *l3_dom; 496 + struct mpam_component *mon_comp; 497 + u32 mon_idx = *(u32 *)arch_mon_ctx; 498 + enum mpam_device_features mon_type; 499 + struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[eventid]; 500 + 501 + resctrl_arch_rmid_read_context_check(); 502 + 503 + if (!mpam_is_enabled()) 504 + return -EINVAL; 505 + 506 + if (eventid >= QOS_NUM_EVENTS || !mon->class) 507 + return -EINVAL; 508 + 509 + l3_dom = container_of(hdr, struct mpam_resctrl_dom, resctrl_mon_dom.hdr); 510 + mon_comp = l3_dom->mon_comp[eventid]; 511 + 512 + if (eventid != QOS_L3_OCCUP_EVENT_ID) 513 + return -EINVAL; 514 + 515 + mon_type = mpam_feat_msmon_csu; 516 + 517 + return read_mon_cdp_safe(mon, mon_comp, mon_type, mon_idx, 518 + closid, rmid, val); 519 + } 520 + 521 + /* 522 + * The rmid realloc threshold should be for the smallest cache exposed to 523 + * resctrl. 524 + */ 525 + static int update_rmid_limits(struct mpam_class *class) 526 + { 527 + u32 num_unique_pmg = resctrl_arch_system_num_rmid_idx(); 528 + struct mpam_props *cprops = &class->props; 529 + struct cacheinfo *ci; 530 + 531 + lockdep_assert_cpus_held(); 532 + 533 + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) 534 + return 0; 535 + 536 + /* 537 + * Assume cache levels are the same size for all CPUs... 538 + * The check just requires any online CPU and it can't go offline as we 539 + * hold the cpu lock. 
540 + */ 541 + ci = get_cpu_cacheinfo_level(raw_smp_processor_id(), class->level); 542 + if (!ci || ci->size == 0) { 543 + pr_debug("Could not read cache size for class %u\n", 544 + class->level); 545 + return -EINVAL; 546 + } 547 + 548 + if (!resctrl_rmid_realloc_limit || 549 + ci->size < resctrl_rmid_realloc_limit) { 550 + resctrl_rmid_realloc_limit = ci->size; 551 + resctrl_rmid_realloc_threshold = ci->size / num_unique_pmg; 552 + } 553 + 554 + return 0; 555 + } 556 + 557 + static bool cache_has_usable_cpor(struct mpam_class *class) 558 + { 559 + struct mpam_props *cprops = &class->props; 560 + 561 + if (!mpam_has_feature(mpam_feat_cpor_part, cprops)) 562 + return false; 563 + 564 + /* resctrl uses u32 for all bitmap configurations */ 565 + return class->props.cpbm_wd <= 32; 566 + } 567 + 568 + static bool mba_class_use_mbw_max(struct mpam_props *cprops) 569 + { 570 + return (mpam_has_feature(mpam_feat_mbw_max, cprops) && 571 + cprops->bwa_wd); 572 + } 573 + 574 + static bool class_has_usable_mba(struct mpam_props *cprops) 575 + { 576 + return mba_class_use_mbw_max(cprops); 577 + } 578 + 579 + static bool cache_has_usable_csu(struct mpam_class *class) 580 + { 581 + struct mpam_props *cprops; 582 + 583 + if (!class) 584 + return false; 585 + 586 + cprops = &class->props; 587 + 588 + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) 589 + return false; 590 + 591 + /* 592 + * CSU counters settle on the value, so we can get away with 593 + * having only one. 594 + */ 595 + if (!cprops->num_csu_mon) 596 + return false; 597 + 598 + return true; 599 + } 600 + 601 + /* 602 + * Calculate the worst-case percentage change from each implemented step 603 + * in the control. 604 + */ 605 + static u32 get_mba_granularity(struct mpam_props *cprops) 606 + { 607 + if (!mba_class_use_mbw_max(cprops)) 608 + return 0; 609 + 610 + /* 611 + * bwa_wd is the number of bits implemented in the 0.xxx 612 + * fixed point fraction. 1 bit is 50%, 2 is 25% etc. 
613 + */ 614 + return DIV_ROUND_UP(MAX_MBA_BW, 1 << cprops->bwa_wd); 615 + } 616 + 617 + /* 618 + * Each fixed-point hardware value architecturally represents a range 619 + * of values: the full range 0% - 100% is split contiguously into 620 + * (1 << cprops->bwa_wd) equal bands. 621 + * 622 + * Although the bwa_wd fields have 6 bits, the maximum valid value is 16 623 + * as it reports the width of fields that are at most 16 bits. When 624 + * fewer than 16 bits are valid the least significant bits are 625 + * ignored. The implied binary point is kept between bits 15 and 16 and 626 + * so the valid bits are leftmost. 627 + * 628 + * See ARM IHI0099B.a "MPAM system component specification", Section 9.3, 629 + * "The fixed-point fractional format" for more information. 630 + * 631 + * Find the nearest percentage value to the upper bound of the selected band: 632 + */ 633 + static u32 mbw_max_to_percent(u16 mbw_max, struct mpam_props *cprops) 634 + { 635 + u32 val = mbw_max; 636 + 637 + val >>= 16 - cprops->bwa_wd; 638 + val += 1; 639 + val *= MAX_MBA_BW; 640 + val = DIV_ROUND_CLOSEST(val, 1 << cprops->bwa_wd); 641 + 642 + return val; 643 + } 644 + 645 + /* 646 + * Find the band whose upper bound is closest to the specified percentage. 647 + * 648 + * A round-to-nearest policy is followed here as a balanced compromise 649 + * between unexpected under-commit of the resource (where the total of 650 + * a set of resource allocations after conversion is less than the 651 + * expected total, due to rounding of the individual converted 652 + * percentages) and over-commit (where the total of the converted 653 + * allocations is greater than expected). 
654 + */ 655 + static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops) 656 + { 657 + u32 val = pc; 658 + 659 + val <<= cprops->bwa_wd; 660 + val = DIV_ROUND_CLOSEST(val, MAX_MBA_BW); 661 + val = max(val, 1) - 1; 662 + val <<= 16 - cprops->bwa_wd; 663 + 664 + return val; 665 + } 666 + 667 + static u32 get_mba_min(struct mpam_props *cprops) 668 + { 669 + if (!mba_class_use_mbw_max(cprops)) { 670 + WARN_ON_ONCE(1); 671 + return 0; 672 + } 673 + 674 + return mbw_max_to_percent(0, cprops); 675 + } 676 + 677 + /* Find the L3 cache that has affinity with this CPU */ 678 + static int find_l3_equivalent_bitmask(int cpu, cpumask_var_t tmp_cpumask) 679 + { 680 + u32 cache_id = get_cpu_cacheinfo_id(cpu, 3); 681 + 682 + lockdep_assert_cpus_held(); 683 + 684 + return mpam_get_cpumask_from_cache_id(cache_id, 3, tmp_cpumask); 685 + } 686 + 687 + /* 688 + * topology_matches_l3() - Is the provided class the same shape as L3 689 + * @victim: The class we'd like to pretend is L3. 690 + * 691 + * resctrl expects all the world's a Xeon, and all counters are on the 692 + * L3. We allow mapping some counters on other classes. This requires 693 + * that the CPU->domain mapping is the same kind of shape. 694 + * 695 + * Using cacheinfo directly would make this work even if resctrl can't 696 + * use the L3 - but cacheinfo can't tell us anything about offline CPUs. 697 + * Using the L3 resctrl domain list also depends on CPUs being online. 698 + * Using the mpam_class we picked for L3 so we can use its domain list 699 + * assumes that there are MPAM controls on the L3. 700 + * Instead, this path eventually uses the mpam_get_cpumask_from_cache_id() 701 + * helper which can tell us about offline CPUs ... but getting the cache_id 702 + * to start with relies on at least one CPU per L3 cache being online at 703 + * boot. 704 + * 705 + * Walk the victim component list and compare the affinity mask with the 706 + * corresponding L3. 
The topology matches if each victim:component's affinity 707 + * mask is the same as the CPU's corresponding L3's. These lists/masks are 708 + * computed from firmware tables so don't change at runtime. 709 + */ 710 + static bool topology_matches_l3(struct mpam_class *victim) 711 + { 712 + int cpu, err; 713 + struct mpam_component *victim_iter; 714 + 715 + lockdep_assert_cpus_held(); 716 + 717 + cpumask_var_t __free(free_cpumask_var) tmp_cpumask = CPUMASK_VAR_NULL; 718 + if (!alloc_cpumask_var(&tmp_cpumask, GFP_KERNEL)) 719 + return false; 720 + 721 + guard(srcu)(&mpam_srcu); 722 + list_for_each_entry_srcu(victim_iter, &victim->components, class_list, 723 + srcu_read_lock_held(&mpam_srcu)) { 724 + if (cpumask_empty(&victim_iter->affinity)) { 725 + pr_debug("class %u has CPU-less component %u - can't match L3!\n", 726 + victim->level, victim_iter->comp_id); 727 + return false; 728 + } 729 + 730 + cpu = cpumask_any_and(&victim_iter->affinity, cpu_online_mask); 731 + if (WARN_ON_ONCE(cpu >= nr_cpu_ids)) 732 + return false; 733 + 734 + cpumask_clear(tmp_cpumask); 735 + err = find_l3_equivalent_bitmask(cpu, tmp_cpumask); 736 + if (err) { 737 + pr_debug("Failed to find L3's equivalent component to class %u component %u\n", 738 + victim->level, victim_iter->comp_id); 739 + return false; 740 + } 741 + 742 + /* Any differing bits in the affinity mask? */ 743 + if (!cpumask_equal(tmp_cpumask, &victim_iter->affinity)) { 744 + pr_debug("class %u component %u has Mismatched CPU mask with L3 equivalent\n" 745 + "L3:%*pbl != victim:%*pbl\n", 746 + victim->level, victim_iter->comp_id, 747 + cpumask_pr_args(tmp_cpumask), 748 + cpumask_pr_args(&victim_iter->affinity)); 749 + 750 + return false; 751 + } 752 + } 753 + 754 + return true; 755 + } 756 + 757 + /* 758 + * Test if the traffic for a class matches that at egress from the L3. 
For 759 + * MSC at memory controllers this is only possible if there is a single L3 760 + * as otherwise the counters at the memory can include bandwidth from the 761 + * non-local L3. 762 + */ 763 + static bool traffic_matches_l3(struct mpam_class *class) 764 + { 765 + int err, cpu; 766 + 767 + lockdep_assert_cpus_held(); 768 + 769 + if (class->type == MPAM_CLASS_CACHE && class->level == 3) 770 + return true; 771 + 772 + if (class->type == MPAM_CLASS_CACHE && class->level != 3) { 773 + pr_debug("class %u is a different cache from L3\n", class->level); 774 + return false; 775 + } 776 + 777 + if (class->type != MPAM_CLASS_MEMORY) { 778 + pr_debug("class %u is neither of type cache nor memory\n", class->level); 779 + return false; 780 + } 781 + 782 + cpumask_var_t __free(free_cpumask_var) tmp_cpumask = CPUMASK_VAR_NULL; 783 + if (!alloc_cpumask_var(&tmp_cpumask, GFP_KERNEL)) { 784 + pr_debug("cpumask allocation failed\n"); 785 + return false; 786 + } 787 + 788 + cpu = cpumask_any_and(&class->affinity, cpu_online_mask); 789 + err = find_l3_equivalent_bitmask(cpu, tmp_cpumask); 790 + if (err) { 791 + pr_debug("Failed to find L3 downstream to cpu %d\n", cpu); 792 + return false; 793 + } 794 + 795 + if (!cpumask_equal(tmp_cpumask, cpu_possible_mask)) { 796 + pr_debug("There is more than one L3\n"); 797 + return false; 798 + } 799 + 800 + /* Be strict; the traffic might stop in the intermediate cache. */ 801 + if (get_cpu_cacheinfo_id(cpu, 4) != -1) { 802 + pr_debug("L3 isn't the last level of cache\n"); 803 + return false; 804 + } 805 + 806 + if (num_possible_nodes() > 1) { 807 + pr_debug("There is more than one numa node\n"); 808 + return false; 809 + } 810 + 811 + #ifdef CONFIG_HMEM_REPORTING 812 + if (node_devices[cpu_to_node(cpu)]->cache_dev) { 813 + pr_debug("There is a memory side cache\n"); 814 + return false; 815 + } 816 + #endif 817 + 818 + return true; 819 + } 820 + 821 + /* Can we export MPAM_CLASS_CACHE:{2,3}? 
*/ 822 + static void mpam_resctrl_pick_caches(void) 823 + { 824 + struct mpam_class *class; 825 + struct mpam_resctrl_res *res; 826 + 827 + lockdep_assert_cpus_held(); 828 + 829 + guard(srcu)(&mpam_srcu); 830 + list_for_each_entry_srcu(class, &mpam_classes, classes_list, 831 + srcu_read_lock_held(&mpam_srcu)) { 832 + if (class->type != MPAM_CLASS_CACHE) { 833 + pr_debug("class %u is not a cache\n", class->level); 834 + continue; 835 + } 836 + 837 + if (class->level != 2 && class->level != 3) { 838 + pr_debug("class %u is not L2 or L3\n", class->level); 839 + continue; 840 + } 841 + 842 + if (!cache_has_usable_cpor(class)) { 843 + pr_debug("class %u cache misses CPOR\n", class->level); 844 + continue; 845 + } 846 + 847 + if (!cpumask_equal(&class->affinity, cpu_possible_mask)) { 848 + pr_debug("class %u has missing CPUs, mask %*pb != %*pb\n", class->level, 849 + cpumask_pr_args(&class->affinity), 850 + cpumask_pr_args(cpu_possible_mask)); 851 + continue; 852 + } 853 + 854 + if (class->level == 2) 855 + res = &mpam_resctrl_controls[RDT_RESOURCE_L2]; 856 + else 857 + res = &mpam_resctrl_controls[RDT_RESOURCE_L3]; 858 + res->class = class; 859 + } 860 + } 861 + 862 + static void mpam_resctrl_pick_mba(void) 863 + { 864 + struct mpam_class *class, *candidate_class = NULL; 865 + struct mpam_resctrl_res *res; 866 + 867 + lockdep_assert_cpus_held(); 868 + 869 + guard(srcu)(&mpam_srcu); 870 + list_for_each_entry_srcu(class, &mpam_classes, classes_list, 871 + srcu_read_lock_held(&mpam_srcu)) { 872 + struct mpam_props *cprops = &class->props; 873 + 874 + if (class->level != 3 && class->type == MPAM_CLASS_CACHE) { 875 + pr_debug("class %u is a cache but not the L3\n", class->level); 876 + continue; 877 + } 878 + 879 + if (!class_has_usable_mba(cprops)) { 880 + pr_debug("class %u has no bandwidth control\n", 881 + class->level); 882 + continue; 883 + } 884 + 885 + if (!cpumask_equal(&class->affinity, cpu_possible_mask)) { 886 + pr_debug("class %u has missing CPUs\n", 
class->level); 887 + continue; 888 + } 889 + 890 + if (!topology_matches_l3(class)) { 891 + pr_debug("class %u topology doesn't match L3\n", 892 + class->level); 893 + continue; 894 + } 895 + 896 + if (!traffic_matches_l3(class)) { 897 + pr_debug("class %u traffic doesn't match L3 egress\n", 898 + class->level); 899 + continue; 900 + } 901 + 902 + /* 903 + * Pick a resource to be MBA that is as close as possible to 904 + * the L3. mbm_total counts the bandwidth leaving the L3 905 + * cache and MBA should correspond as closely as possible 906 + * for proper operation of mba_sc. 907 + */ 908 + if (!candidate_class || class->level < candidate_class->level) 909 + candidate_class = class; 910 + } 911 + 912 + if (candidate_class) { 913 + pr_debug("selected class %u to back MBA\n", 914 + candidate_class->level); 915 + res = &mpam_resctrl_controls[RDT_RESOURCE_MBA]; 916 + res->class = candidate_class; 917 + } 918 + } 919 + 920 + static void counter_update_class(enum resctrl_event_id evt_id, 921 + struct mpam_class *class) 922 + { 923 + struct mpam_class *existing_class = mpam_resctrl_counters[evt_id].class; 924 + 925 + if (existing_class) { 926 + if (class->level == 3) { 927 + pr_debug("Existing class is L3 - L3 wins\n"); 928 + return; 929 + } 930 + 931 + if (existing_class->level < class->level) { 932 + pr_debug("Existing class is closer to L3, %u versus %u - closer is better\n", 933 + existing_class->level, class->level); 934 + return; 935 + } 936 + } 937 + 938 + mpam_resctrl_counters[evt_id].class = class; 939 + } 940 + 941 + static void mpam_resctrl_pick_counters(void) 942 + { 943 + struct mpam_class *class; 944 + 945 + lockdep_assert_cpus_held(); 946 + 947 + guard(srcu)(&mpam_srcu); 948 + list_for_each_entry_srcu(class, &mpam_classes, classes_list, 949 + srcu_read_lock_held(&mpam_srcu)) { 950 + /* The name of the resource is L3... 
*/ 951 + if (class->type == MPAM_CLASS_CACHE && class->level != 3) { 952 + pr_debug("class %u is a cache but not the L3\n", class->level); 953 + continue; 954 + } 955 + 956 + if (!cpumask_equal(&class->affinity, cpu_possible_mask)) { 957 + pr_debug("class %u does not cover all CPUs\n", 958 + class->level); 959 + continue; 960 + } 961 + 962 + if (cache_has_usable_csu(class)) { 963 + pr_debug("class %u has usable CSU\n", 964 + class->level); 965 + 966 + /* CSU counters only make sense on a cache. */ 967 + switch (class->type) { 968 + case MPAM_CLASS_CACHE: 969 + if (update_rmid_limits(class)) 970 + break; 971 + 972 + counter_update_class(QOS_L3_OCCUP_EVENT_ID, class); 973 + break; 974 + default: 975 + break; 976 + } 977 + } 978 + } 979 + } 980 + 981 + static int mpam_resctrl_control_init(struct mpam_resctrl_res *res) 982 + { 983 + struct mpam_class *class = res->class; 984 + struct mpam_props *cprops = &class->props; 985 + struct rdt_resource *r = &res->resctrl_res; 986 + 987 + switch (r->rid) { 988 + case RDT_RESOURCE_L2: 989 + case RDT_RESOURCE_L3: 990 + r->schema_fmt = RESCTRL_SCHEMA_BITMAP; 991 + r->cache.arch_has_sparse_bitmasks = true; 992 + 993 + r->cache.cbm_len = class->props.cpbm_wd; 994 + /* mpam_devices will reject empty bitmaps */ 995 + r->cache.min_cbm_bits = 1; 996 + 997 + if (r->rid == RDT_RESOURCE_L2) { 998 + r->name = "L2"; 999 + r->ctrl_scope = RESCTRL_L2_CACHE; 1000 + r->cdp_capable = true; 1001 + } else { 1002 + r->name = "L3"; 1003 + r->ctrl_scope = RESCTRL_L3_CACHE; 1004 + r->cdp_capable = true; 1005 + } 1006 + 1007 + /* 1008 + * Which bits are shared with other ...things... Unknown 1009 + * devices use partid-0 which uses all the bitmap fields. Until 1010 + * we have configured the SMMU and GIC not to do this, 'all the 1011 + * bits' is the correct answer here. 
1012 + */ 1013 + r->cache.shareable_bits = resctrl_get_default_ctrl(r); 1014 + r->alloc_capable = true; 1015 + break; 1016 + case RDT_RESOURCE_MBA: 1017 + r->schema_fmt = RESCTRL_SCHEMA_RANGE; 1018 + r->ctrl_scope = RESCTRL_L3_CACHE; 1019 + 1020 + r->membw.delay_linear = true; 1021 + r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED; 1022 + r->membw.min_bw = get_mba_min(cprops); 1023 + r->membw.max_bw = MAX_MBA_BW; 1024 + r->membw.bw_gran = get_mba_granularity(cprops); 1025 + 1026 + r->name = "MB"; 1027 + r->alloc_capable = true; 1028 + break; 1029 + default: 1030 + return -EINVAL; 1031 + } 1032 + 1033 + return 0; 1034 + } 1035 + 1036 + static int mpam_resctrl_pick_domain_id(int cpu, struct mpam_component *comp) 1037 + { 1038 + struct mpam_class *class = comp->class; 1039 + 1040 + if (class->type == MPAM_CLASS_CACHE) 1041 + return comp->comp_id; 1042 + 1043 + if (topology_matches_l3(class)) { 1044 + /* Use the corresponding L3 component ID as the domain ID */ 1045 + int id = get_cpu_cacheinfo_id(cpu, 3); 1046 + 1047 + /* Implies topology_matches_l3() made a mistake */ 1048 + if (WARN_ON_ONCE(id == -1)) 1049 + return comp->comp_id; 1050 + 1051 + return id; 1052 + } 1053 + 1054 + /* Otherwise, expose the ID used by the firmware table code. */ 1055 + return comp->comp_id; 1056 + } 1057 + 1058 + static int mpam_resctrl_monitor_init(struct mpam_resctrl_mon *mon, 1059 + enum resctrl_event_id type) 1060 + { 1061 + struct mpam_resctrl_res *res = &mpam_resctrl_controls[RDT_RESOURCE_L3]; 1062 + struct rdt_resource *l3 = &res->resctrl_res; 1063 + 1064 + lockdep_assert_cpus_held(); 1065 + 1066 + /* 1067 + * There also needs to be an L3 cache present. 1068 + * The check just requires any online CPU and it can't go offline as we 1069 + * hold the cpu lock. 1070 + */ 1071 + if (get_cpu_cacheinfo_id(raw_smp_processor_id(), 3) == -1) 1072 + return 0; 1073 + 1074 + /* 1075 + * If there are no MPAM resources on L3, force it into existence. 
1076 + * topology_matches_l3() already ensures this looks like the L3. 1077 + * The domain-ids will be fixed up by mpam_resctrl_domain_hdr_init(). 1078 + */ 1079 + if (!res->class) { 1080 + pr_warn_once("Faking L3 MSC to enable counters.\n"); 1081 + res->class = mpam_resctrl_counters[type].class; 1082 + } 1083 + 1084 + /* 1085 + * Called multiple times, once per event type that has a 1086 + * monitoring class. 1087 + * Setting name is necessary on monitor-only platforms. 1088 + */ 1089 + l3->name = "L3"; 1090 + l3->mon_scope = RESCTRL_L3_CACHE; 1091 + 1092 + /* 1093 + * num-rmid is the upper bound for the number of monitoring groups that 1094 + * can exist simultaneously, including the default monitoring group for 1095 + * each control group. Hence, advertise the whole rmid_idx space even 1096 + * though each control group has its own pmg/rmid space. Unfortunately, 1097 + * this does mean userspace needs to know the architecture to correctly 1098 + * interpret this value. 1099 + */ 1100 + l3->mon.num_rmid = resctrl_arch_system_num_rmid_idx(); 1101 + 1102 + if (resctrl_enable_mon_event(type, false, 0, NULL)) 1103 + l3->mon_capable = true; 1104 + 1105 + return 0; 1106 + } 1107 + 1108 + u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d, 1109 + u32 closid, enum resctrl_conf_type type) 1110 + { 1111 + u32 partid; 1112 + struct mpam_config *cfg; 1113 + struct mpam_props *cprops; 1114 + struct mpam_resctrl_res *res; 1115 + struct mpam_resctrl_dom *dom; 1116 + enum mpam_device_features configured_by; 1117 + 1118 + lockdep_assert_cpus_held(); 1119 + 1120 + if (!mpam_is_enabled()) 1121 + return resctrl_get_default_ctrl(r); 1122 + 1123 + res = container_of(r, struct mpam_resctrl_res, resctrl_res); 1124 + dom = container_of(d, struct mpam_resctrl_dom, resctrl_ctrl_dom); 1125 + cprops = &res->class->props; 1126 + 1127 + /* 1128 + * When CDP is enabled, but the resource doesn't support it, 1129 + * the control is cloned across both partids. 
1130 + * Arbitrarily pick the data partid to read: 1131 + */ 1132 + if (mpam_resctrl_hide_cdp(r->rid)) 1133 + type = CDP_DATA; 1134 + 1135 + partid = resctrl_get_config_index(closid, type); 1136 + cfg = &dom->ctrl_comp->cfg[partid]; 1137 + 1138 + switch (r->rid) { 1139 + case RDT_RESOURCE_L2: 1140 + case RDT_RESOURCE_L3: 1141 + configured_by = mpam_feat_cpor_part; 1142 + break; 1143 + case RDT_RESOURCE_MBA: 1144 + if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { 1145 + configured_by = mpam_feat_mbw_max; 1146 + break; 1147 + } 1148 + fallthrough; 1149 + default: 1150 + return resctrl_get_default_ctrl(r); 1151 + } 1152 + 1153 + if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r) || 1154 + !mpam_has_feature(configured_by, cfg)) 1155 + return resctrl_get_default_ctrl(r); 1156 + 1157 + switch (configured_by) { 1158 + case mpam_feat_cpor_part: 1159 + return cfg->cpbm; 1160 + case mpam_feat_mbw_max: 1161 + return mbw_max_to_percent(cfg->mbw_max, cprops); 1162 + default: 1163 + return resctrl_get_default_ctrl(r); 1164 + } 1165 + } 1166 + 1167 + int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d, 1168 + u32 closid, enum resctrl_conf_type t, u32 cfg_val) 1169 + { 1170 + int err; 1171 + u32 partid; 1172 + struct mpam_config cfg; 1173 + struct mpam_props *cprops; 1174 + struct mpam_resctrl_res *res; 1175 + struct mpam_resctrl_dom *dom; 1176 + 1177 + lockdep_assert_cpus_held(); 1178 + lockdep_assert_irqs_enabled(); 1179 + 1180 + if (!mpam_is_enabled()) 1181 + return -EINVAL; 1182 + 1183 + /* 1184 + * No need to check the CPU as mpam_apply_config() doesn't care, and 1185 + * resctrl_arch_update_domains() relies on this. 
1186 + */ 1187 + res = container_of(r, struct mpam_resctrl_res, resctrl_res); 1188 + dom = container_of(d, struct mpam_resctrl_dom, resctrl_ctrl_dom); 1189 + cprops = &res->class->props; 1190 + 1191 + if (mpam_resctrl_hide_cdp(r->rid)) 1192 + t = CDP_DATA; 1193 + 1194 + partid = resctrl_get_config_index(closid, t); 1195 + if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r)) { 1196 + pr_debug("Not alloc capable or computed PARTID out of range\n"); 1197 + return -EINVAL; 1198 + } 1199 + 1200 + /* 1201 + * Copy the current config to avoid clearing other resources when the 1202 + * same component is exposed multiple times through resctrl. 1203 + */ 1204 + cfg = dom->ctrl_comp->cfg[partid]; 1205 + 1206 + switch (r->rid) { 1207 + case RDT_RESOURCE_L2: 1208 + case RDT_RESOURCE_L3: 1209 + cfg.cpbm = cfg_val; 1210 + mpam_set_feature(mpam_feat_cpor_part, &cfg); 1211 + break; 1212 + case RDT_RESOURCE_MBA: 1213 + if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { 1214 + cfg.mbw_max = percent_to_mbw_max(cfg_val, cprops); 1215 + mpam_set_feature(mpam_feat_mbw_max, &cfg); 1216 + break; 1217 + } 1218 + fallthrough; 1219 + default: 1220 + return -EINVAL; 1221 + } 1222 + 1223 + /* 1224 + * When CDP is enabled, but the resource doesn't support it, we need to 1225 + * apply the same configuration to the other partid. 
1226 + */ 1227 + if (mpam_resctrl_hide_cdp(r->rid)) { 1228 + partid = resctrl_get_config_index(closid, CDP_CODE); 1229 + err = mpam_apply_config(dom->ctrl_comp, partid, &cfg); 1230 + if (err) 1231 + return err; 1232 + 1233 + partid = resctrl_get_config_index(closid, CDP_DATA); 1234 + return mpam_apply_config(dom->ctrl_comp, partid, &cfg); 1235 + } 1236 + 1237 + return mpam_apply_config(dom->ctrl_comp, partid, &cfg); 1238 + } 1239 + 1240 + int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) 1241 + { 1242 + int err; 1243 + struct rdt_ctrl_domain *d; 1244 + 1245 + lockdep_assert_cpus_held(); 1246 + lockdep_assert_irqs_enabled(); 1247 + 1248 + if (!mpam_is_enabled()) 1249 + return -EINVAL; 1250 + 1251 + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list) { 1252 + for (enum resctrl_conf_type t = 0; t < CDP_NUM_TYPES; t++) { 1253 + struct resctrl_staged_config *cfg = &d->staged_config[t]; 1254 + 1255 + if (!cfg->have_new_ctrl) 1256 + continue; 1257 + 1258 + err = resctrl_arch_update_one(r, d, closid, t, 1259 + cfg->new_ctrl); 1260 + if (err) 1261 + return err; 1262 + } 1263 + } 1264 + 1265 + return 0; 1266 + } 1267 + 1268 + void resctrl_arch_reset_all_ctrls(struct rdt_resource *r) 1269 + { 1270 + struct mpam_resctrl_res *res; 1271 + 1272 + lockdep_assert_cpus_held(); 1273 + 1274 + if (!mpam_is_enabled()) 1275 + return; 1276 + 1277 + res = container_of(r, struct mpam_resctrl_res, resctrl_res); 1278 + mpam_reset_class_locked(res->class); 1279 + } 1280 + 1281 + static void mpam_resctrl_domain_hdr_init(int cpu, struct mpam_component *comp, 1282 + enum resctrl_res_level rid, 1283 + struct rdt_domain_hdr *hdr) 1284 + { 1285 + lockdep_assert_cpus_held(); 1286 + 1287 + INIT_LIST_HEAD(&hdr->list); 1288 + hdr->id = mpam_resctrl_pick_domain_id(cpu, comp); 1289 + hdr->rid = rid; 1290 + cpumask_set_cpu(cpu, &hdr->cpu_mask); 1291 + } 1292 + 1293 + static void mpam_resctrl_online_domain_hdr(unsigned int cpu, 1294 + struct rdt_domain_hdr *hdr) 1295 + { 1296 + 
lockdep_assert_cpus_held(); 1297 + 1298 + cpumask_set_cpu(cpu, &hdr->cpu_mask); 1299 + } 1300 + 1301 + /** 1302 + * mpam_resctrl_offline_domain_hdr() - Update the domain header to remove a CPU. 1303 + * @cpu: The CPU to remove from the domain. 1304 + * @hdr: The domain's header. 1305 + * 1306 + * Removes @cpu from the header mask. If this was the last CPU in the domain, 1307 + * the domain header is removed from its parent list and true is returned, 1308 + * indicating the parent structure can be freed. 1309 + * If there are other CPUs in the domain, returns false. 1310 + */ 1311 + static bool mpam_resctrl_offline_domain_hdr(unsigned int cpu, 1312 + struct rdt_domain_hdr *hdr) 1313 + { 1314 + lockdep_assert_held(&domain_list_lock); 1315 + 1316 + cpumask_clear_cpu(cpu, &hdr->cpu_mask); 1317 + if (cpumask_empty(&hdr->cpu_mask)) { 1318 + list_del_rcu(&hdr->list); 1319 + synchronize_rcu(); 1320 + return true; 1321 + } 1322 + 1323 + return false; 1324 + } 1325 + 1326 + static void mpam_resctrl_domain_insert(struct list_head *list, 1327 + struct rdt_domain_hdr *new) 1328 + { 1329 + struct rdt_domain_hdr *err; 1330 + struct list_head *pos = NULL; 1331 + 1332 + lockdep_assert_held(&domain_list_lock); 1333 + 1334 + err = resctrl_find_domain(list, new->id, &pos); 1335 + if (WARN_ON_ONCE(err)) 1336 + return; 1337 + 1338 + list_add_tail_rcu(&new->list, pos); 1339 + } 1340 + 1341 + static struct mpam_component *find_component(struct mpam_class *class, int cpu) 1342 + { 1343 + struct mpam_component *comp; 1344 + 1345 + guard(srcu)(&mpam_srcu); 1346 + list_for_each_entry_srcu(comp, &class->components, class_list, 1347 + srcu_read_lock_held(&mpam_srcu)) { 1348 + if (cpumask_test_cpu(cpu, &comp->affinity)) 1349 + return comp; 1350 + } 1351 + 1352 + return NULL; 1353 + } 1354 + 1355 + static struct mpam_resctrl_dom * 1356 + mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res) 1357 + { 1358 + int err; 1359 + struct mpam_resctrl_dom *dom; 1360 + struct 
rdt_l3_mon_domain *mon_d; 1361 + struct rdt_ctrl_domain *ctrl_d; 1362 + struct mpam_class *class = res->class; 1363 + struct mpam_component *comp_iter, *ctrl_comp; 1364 + struct rdt_resource *r = &res->resctrl_res; 1365 + 1366 + lockdep_assert_held(&domain_list_lock); 1367 + 1368 + ctrl_comp = NULL; 1369 + guard(srcu)(&mpam_srcu); 1370 + list_for_each_entry_srcu(comp_iter, &class->components, class_list, 1371 + srcu_read_lock_held(&mpam_srcu)) { 1372 + if (cpumask_test_cpu(cpu, &comp_iter->affinity)) { 1373 + ctrl_comp = comp_iter; 1374 + break; 1375 + } 1376 + } 1377 + 1378 + /* class has no component for this CPU */ 1379 + if (WARN_ON_ONCE(!ctrl_comp)) 1380 + return ERR_PTR(-EINVAL); 1381 + 1382 + dom = kzalloc_node(sizeof(*dom), GFP_KERNEL, cpu_to_node(cpu)); 1383 + if (!dom) 1384 + return ERR_PTR(-ENOMEM); 1385 + 1386 + if (r->alloc_capable) { 1387 + dom->ctrl_comp = ctrl_comp; 1388 + 1389 + ctrl_d = &dom->resctrl_ctrl_dom; 1390 + mpam_resctrl_domain_hdr_init(cpu, ctrl_comp, r->rid, &ctrl_d->hdr); 1391 + ctrl_d->hdr.type = RESCTRL_CTRL_DOMAIN; 1392 + err = resctrl_online_ctrl_domain(r, ctrl_d); 1393 + if (err) 1394 + goto free_domain; 1395 + 1396 + mpam_resctrl_domain_insert(&r->ctrl_domains, &ctrl_d->hdr); 1397 + } else { 1398 + pr_debug("Skipped control domain online - no controls\n"); 1399 + } 1400 + 1401 + if (r->mon_capable) { 1402 + struct mpam_component *any_mon_comp = NULL; 1403 + struct mpam_resctrl_mon *mon; 1404 + enum resctrl_event_id eventid; 1405 + 1406 + /* 1407 + * Even if the monitor domain is backed by a different 1408 + * component, the L3 component IDs need to be used... only 1409 + * there may be no ctrl_comp for the L3. 1410 + * Search each event's class list for a component with 1411 + * overlapping CPUs and set up the dom->mon_comp array. 
1412 + */ 1413 + 1414 + for_each_mpam_resctrl_mon(mon, eventid) { 1415 + struct mpam_component *mon_comp; 1416 + 1417 + if (!mon->class) 1418 + continue; // dummy resource 1419 + 1420 + mon_comp = find_component(mon->class, cpu); 1421 + dom->mon_comp[eventid] = mon_comp; 1422 + if (mon_comp) 1423 + any_mon_comp = mon_comp; 1424 + } 1425 + if (!any_mon_comp) { 1426 + WARN_ON_ONCE(1); 1427 + err = -EFAULT; 1428 + goto offline_ctrl_domain; 1429 + } 1430 + 1431 + mon_d = &dom->resctrl_mon_dom; 1432 + mpam_resctrl_domain_hdr_init(cpu, any_mon_comp, r->rid, &mon_d->hdr); 1433 + mon_d->hdr.type = RESCTRL_MON_DOMAIN; 1434 + err = resctrl_online_mon_domain(r, &mon_d->hdr); 1435 + if (err) 1436 + goto offline_ctrl_domain; 1437 + 1438 + mpam_resctrl_domain_insert(&r->mon_domains, &mon_d->hdr); 1439 + } else { 1440 + pr_debug("Skipped monitor domain online - no monitors\n"); 1441 + } 1442 + 1443 + return dom; 1444 + 1445 + offline_ctrl_domain: 1446 + if (r->alloc_capable) { 1447 + mpam_resctrl_offline_domain_hdr(cpu, &ctrl_d->hdr); 1448 + resctrl_offline_ctrl_domain(r, ctrl_d); 1449 + } 1450 + free_domain: 1451 + kfree(dom); 1452 + dom = ERR_PTR(err); 1453 + 1454 + return dom; 1455 + } 1456 + 1457 + /* 1458 + * We know all the monitors are associated with the L3, even if there are no 1459 + * controls and therefore no control component. Find the cache-id for the CPU 1460 + * and use that to search for existing resctrl domains. 1461 + * This relies on mpam_resctrl_pick_domain_id() using the L3 cache-id 1462 + * for anything that is not a cache. 
1463 + */ 1464 + static struct mpam_resctrl_dom *mpam_resctrl_get_mon_domain_from_cpu(int cpu) 1465 + { 1466 + int cache_id; 1467 + struct mpam_resctrl_dom *dom; 1468 + struct mpam_resctrl_res *l3 = &mpam_resctrl_controls[RDT_RESOURCE_L3]; 1469 + 1470 + lockdep_assert_cpus_held(); 1471 + 1472 + if (!l3->class) 1473 + return NULL; 1474 + cache_id = get_cpu_cacheinfo_id(cpu, 3); 1475 + if (cache_id < 0) 1476 + return NULL; 1477 + 1478 + list_for_each_entry_rcu(dom, &l3->resctrl_res.mon_domains, resctrl_mon_dom.hdr.list) { 1479 + if (dom->resctrl_mon_dom.hdr.id == cache_id) 1480 + return dom; 1481 + } 1482 + 1483 + return NULL; 1484 + } 1485 + 1486 + static struct mpam_resctrl_dom * 1487 + mpam_resctrl_get_domain_from_cpu(int cpu, struct mpam_resctrl_res *res) 1488 + { 1489 + struct mpam_resctrl_dom *dom; 1490 + struct rdt_resource *r = &res->resctrl_res; 1491 + 1492 + lockdep_assert_cpus_held(); 1493 + 1494 + list_for_each_entry_rcu(dom, &r->ctrl_domains, resctrl_ctrl_dom.hdr.list) { 1495 + if (cpumask_test_cpu(cpu, &dom->ctrl_comp->affinity)) 1496 + return dom; 1497 + } 1498 + 1499 + if (r->rid != RDT_RESOURCE_L3) 1500 + return NULL; 1501 + 1502 + /* Search the mon domain list too - needed on monitor only platforms. 
*/ 1503 + return mpam_resctrl_get_mon_domain_from_cpu(cpu); 1504 + } 1505 + 1506 + int mpam_resctrl_online_cpu(unsigned int cpu) 1507 + { 1508 + struct mpam_resctrl_res *res; 1509 + enum resctrl_res_level rid; 1510 + 1511 + guard(mutex)(&domain_list_lock); 1512 + for_each_mpam_resctrl_control(res, rid) { 1513 + struct mpam_resctrl_dom *dom; 1514 + struct rdt_resource *r = &res->resctrl_res; 1515 + 1516 + if (!res->class) 1517 + continue; // dummy_resource; 1518 + 1519 + dom = mpam_resctrl_get_domain_from_cpu(cpu, res); 1520 + if (!dom) { 1521 + dom = mpam_resctrl_alloc_domain(cpu, res); 1522 + if (IS_ERR(dom)) 1523 + return PTR_ERR(dom); 1524 + } else { 1525 + if (r->alloc_capable) { 1526 + struct rdt_ctrl_domain *ctrl_d = &dom->resctrl_ctrl_dom; 1527 + 1528 + mpam_resctrl_online_domain_hdr(cpu, &ctrl_d->hdr); 1529 + } 1530 + if (r->mon_capable) { 1531 + struct rdt_l3_mon_domain *mon_d = &dom->resctrl_mon_dom; 1532 + 1533 + mpam_resctrl_online_domain_hdr(cpu, &mon_d->hdr); 1534 + } 1535 + } 1536 + } 1537 + 1538 + resctrl_online_cpu(cpu); 1539 + 1540 + return 0; 1541 + } 1542 + 1543 + void mpam_resctrl_offline_cpu(unsigned int cpu) 1544 + { 1545 + struct mpam_resctrl_res *res; 1546 + enum resctrl_res_level rid; 1547 + 1548 + resctrl_offline_cpu(cpu); 1549 + 1550 + guard(mutex)(&domain_list_lock); 1551 + for_each_mpam_resctrl_control(res, rid) { 1552 + struct mpam_resctrl_dom *dom; 1553 + struct rdt_l3_mon_domain *mon_d; 1554 + struct rdt_ctrl_domain *ctrl_d; 1555 + bool ctrl_dom_empty, mon_dom_empty; 1556 + struct rdt_resource *r = &res->resctrl_res; 1557 + 1558 + if (!res->class) 1559 + continue; // dummy resource 1560 + 1561 + dom = mpam_resctrl_get_domain_from_cpu(cpu, res); 1562 + if (WARN_ON_ONCE(!dom)) 1563 + continue; 1564 + 1565 + if (r->alloc_capable) { 1566 + ctrl_d = &dom->resctrl_ctrl_dom; 1567 + ctrl_dom_empty = mpam_resctrl_offline_domain_hdr(cpu, &ctrl_d->hdr); 1568 + if (ctrl_dom_empty) 1569 + resctrl_offline_ctrl_domain(&res->resctrl_res, ctrl_d); 
1570 + } else { 1571 + ctrl_dom_empty = true; 1572 + } 1573 + 1574 + if (r->mon_capable) { 1575 + mon_d = &dom->resctrl_mon_dom; 1576 + mon_dom_empty = mpam_resctrl_offline_domain_hdr(cpu, &mon_d->hdr); 1577 + if (mon_dom_empty) 1578 + resctrl_offline_mon_domain(&res->resctrl_res, &mon_d->hdr); 1579 + } else { 1580 + mon_dom_empty = true; 1581 + } 1582 + 1583 + if (ctrl_dom_empty && mon_dom_empty) 1584 + kfree(dom); 1585 + } 1586 + } 1587 + 1588 + int mpam_resctrl_setup(void) 1589 + { 1590 + int err = 0; 1591 + struct mpam_resctrl_res *res; 1592 + enum resctrl_res_level rid; 1593 + struct mpam_resctrl_mon *mon; 1594 + enum resctrl_event_id eventid; 1595 + 1596 + wait_event(wait_cacheinfo_ready, cacheinfo_ready); 1597 + 1598 + cpus_read_lock(); 1599 + for_each_mpam_resctrl_control(res, rid) { 1600 + INIT_LIST_HEAD_RCU(&res->resctrl_res.ctrl_domains); 1601 + INIT_LIST_HEAD_RCU(&res->resctrl_res.mon_domains); 1602 + res->resctrl_res.rid = rid; 1603 + } 1604 + 1605 + /* Find some classes to use for controls */ 1606 + mpam_resctrl_pick_caches(); 1607 + mpam_resctrl_pick_mba(); 1608 + 1609 + /* Initialise the resctrl structures from the classes */ 1610 + for_each_mpam_resctrl_control(res, rid) { 1611 + if (!res->class) 1612 + continue; // dummy resource 1613 + 1614 + err = mpam_resctrl_control_init(res); 1615 + if (err) { 1616 + pr_debug("Failed to initialise rid %u\n", rid); 1617 + goto internal_error; 1618 + } 1619 + } 1620 + 1621 + /* Find some classes to use for monitors */ 1622 + mpam_resctrl_pick_counters(); 1623 + 1624 + for_each_mpam_resctrl_mon(mon, eventid) { 1625 + if (!mon->class) 1626 + continue; // dummy resource 1627 + 1628 + err = mpam_resctrl_monitor_init(mon, eventid); 1629 + if (err) { 1630 + pr_debug("Failed to initialise event %u\n", eventid); 1631 + goto internal_error; 1632 + } 1633 + } 1634 + 1635 + cpus_read_unlock(); 1636 + 1637 + if (!resctrl_arch_alloc_capable() && !resctrl_arch_mon_capable()) { 1638 + pr_debug("No alloc(%u) or monitor(%u) 
found - resctrl not supported\n", 1639 + resctrl_arch_alloc_capable(), resctrl_arch_mon_capable()); 1640 + return -EOPNOTSUPP; 1641 + } 1642 + 1643 + err = resctrl_init(); 1644 + if (err) 1645 + return err; 1646 + 1647 + WRITE_ONCE(resctrl_enabled, true); 1648 + 1649 + return 0; 1650 + 1651 + internal_error: 1652 + cpus_read_unlock(); 1653 + pr_debug("Internal error %d - resctrl not supported\n", err); 1654 + return err; 1655 + } 1656 + 1657 + void mpam_resctrl_exit(void) 1658 + { 1659 + if (!READ_ONCE(resctrl_enabled)) 1660 + return; 1661 + 1662 + WRITE_ONCE(resctrl_enabled, false); 1663 + resctrl_exit(); 1664 + } 1665 + 1666 + /* 1667 + * The driver is detaching an MSC from this class, if resctrl was using it, 1668 + * pull on resctrl_exit(). 1669 + */ 1670 + void mpam_resctrl_teardown_class(struct mpam_class *class) 1671 + { 1672 + struct mpam_resctrl_res *res; 1673 + enum resctrl_res_level rid; 1674 + struct mpam_resctrl_mon *mon; 1675 + enum resctrl_event_id eventid; 1676 + 1677 + might_sleep(); 1678 + 1679 + for_each_mpam_resctrl_control(res, rid) { 1680 + if (res->class == class) { 1681 + res->class = NULL; 1682 + break; 1683 + } 1684 + } 1685 + for_each_mpam_resctrl_mon(mon, eventid) { 1686 + if (mon->class == class) { 1687 + mon->class = NULL; 1688 + break; 1689 + } 1690 + } 1691 + } 1692 + 1693 + static int __init __cacheinfo_ready(void) 1694 + { 1695 + cacheinfo_ready = true; 1696 + wake_up(&wait_cacheinfo_ready); 1697 + 1698 + return 0; 1699 + } 1700 + device_initcall_sync(__cacheinfo_ready); 1701 + 1702 + #ifdef CONFIG_MPAM_KUNIT_TEST 1703 + #include "test_mpam_resctrl.c" 1704 + #endif
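The CPU hotplug teardown in mpam_resctrl_offline_cpu() above frees a domain only once both its control and monitor headers report that their last CPU has gone. A minimal userspace sketch of that last-CPU-out pattern, with a plain 64-bit mask standing in for the kernel's cpumask and the list removal, RCU and locking omitted (domain_offline_cpu() is an illustrative name, not a kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace analogue of mpam_resctrl_offline_domain_hdr(): remove @cpu
 * from the domain's mask and tell the caller whether it just removed
 * the last CPU, i.e. whether the domain may now be freed. A uint64_t
 * stands in for struct cpumask.
 */
static bool domain_offline_cpu(uint64_t *cpu_mask, unsigned int cpu)
{
	*cpu_mask &= ~(UINT64_C(1) << cpu);

	return *cpu_mask == 0;
}
```

As in the offline path above, a caller would free the parent structure only when both the control and monitor masks drain, treating an absent capability as already empty.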
+315
drivers/resctrl/test_mpam_resctrl.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + // Copyright (C) 2025 Arm Ltd. 3 + /* This file is intended to be included into mpam_resctrl.c */ 4 + 5 + #include <kunit/test.h> 6 + #include <linux/array_size.h> 7 + #include <linux/bits.h> 8 + #include <linux/math.h> 9 + #include <linux/sprintf.h> 10 + 11 + struct percent_value_case { 12 + u8 pc; 13 + u8 width; 14 + u16 value; 15 + }; 16 + 17 + /* 18 + * Mysterious inscriptions taken from the union of ARM DDI 0598D.b, 19 + * "Arm Architecture Reference Manual Supplement - Memory System 20 + * Resource Partitioning and Monitoring (MPAM), for A-profile 21 + * architecture", Section 9.8, "About the fixed-point fractional 22 + * format" (exact percentage entries only) and ARM IHI0099B.a 23 + * "MPAM system component specification", Section 9.3, 24 + * "The fixed-point fractional format": 25 + */ 26 + static const struct percent_value_case percent_value_cases[] = { 27 + /* Architectural cases: */ 28 + { 1, 8, 1 }, { 1, 12, 0x27 }, { 1, 16, 0x28e }, 29 + { 25, 8, 0x3f }, { 25, 12, 0x3ff }, { 25, 16, 0x3fff }, 30 + { 33, 8, 0x53 }, { 33, 12, 0x546 }, { 33, 16, 0x5479 }, 31 + { 35, 8, 0x58 }, { 35, 12, 0x598 }, { 35, 16, 0x5998 }, 32 + { 45, 8, 0x72 }, { 45, 12, 0x732 }, { 45, 16, 0x7332 }, 33 + { 50, 8, 0x7f }, { 50, 12, 0x7ff }, { 50, 16, 0x7fff }, 34 + { 52, 8, 0x84 }, { 52, 12, 0x850 }, { 52, 16, 0x851d }, 35 + { 55, 8, 0x8b }, { 55, 12, 0x8cb }, { 55, 16, 0x8ccb }, 36 + { 58, 8, 0x93 }, { 58, 12, 0x946 }, { 58, 16, 0x9479 }, 37 + { 75, 8, 0xbf }, { 75, 12, 0xbff }, { 75, 16, 0xbfff }, 38 + { 80, 8, 0xcb }, { 80, 12, 0xccb }, { 80, 16, 0xcccb }, 39 + { 88, 8, 0xe0 }, { 88, 12, 0xe13 }, { 88, 16, 0xe146 }, 40 + { 95, 8, 0xf2 }, { 95, 12, 0xf32 }, { 95, 16, 0xf332 }, 41 + { 100, 8, 0xff }, { 100, 12, 0xfff }, { 100, 16, 0xffff }, 42 + }; 43 + 44 + static void test_percent_value_desc(const struct percent_value_case *param, 45 + char *desc) 46 + { 47 + snprintf(desc, KUNIT_PARAM_DESC_SIZE, 48 + "pc=%d, width=%d, 
value=0x%.*x\n", 49 + param->pc, param->width, 50 + DIV_ROUND_UP(param->width, 4), param->value); 51 + } 52 + 53 + KUNIT_ARRAY_PARAM(test_percent_value, percent_value_cases, 54 + test_percent_value_desc); 55 + 56 + struct percent_value_test_info { 57 + u32 pc; /* result of value-to-percent conversion */ 58 + u32 value; /* result of percent-to-value conversion */ 59 + u32 max_value; /* maximum raw value allowed by test params */ 60 + unsigned int shift; /* promotes raw testcase value to 16 bits */ 61 + }; 62 + 63 + /* 64 + * Convert a reference percentage to a fixed-point MAX value and 65 + * vice-versa, based on param (not test->param_value!) 66 + */ 67 + static void __prepare_percent_value_test(struct kunit *test, 68 + struct percent_value_test_info *res, 69 + const struct percent_value_case *param) 70 + { 71 + struct mpam_props fake_props = { }; 72 + 73 + /* Reject bogus test parameters that would break the tests: */ 74 + KUNIT_ASSERT_GE(test, param->width, 1); 75 + KUNIT_ASSERT_LE(test, param->width, 16); 76 + KUNIT_ASSERT_LT(test, param->value, 1 << param->width); 77 + 78 + mpam_set_feature(mpam_feat_mbw_max, &fake_props); 79 + fake_props.bwa_wd = param->width; 80 + 81 + res->shift = 16 - param->width; 82 + res->max_value = GENMASK_U32(param->width - 1, 0); 83 + res->value = percent_to_mbw_max(param->pc, &fake_props); 84 + res->pc = mbw_max_to_percent(param->value << res->shift, &fake_props); 85 + } 86 + 87 + static void test_get_mba_granularity(struct kunit *test) 88 + { 89 + int ret; 90 + struct mpam_props fake_props = { }; 91 + 92 + /* Use MBW_MAX */ 93 + mpam_set_feature(mpam_feat_mbw_max, &fake_props); 94 + 95 + fake_props.bwa_wd = 0; 96 + KUNIT_EXPECT_FALSE(test, mba_class_use_mbw_max(&fake_props)); 97 + 98 + fake_props.bwa_wd = 1; 99 + KUNIT_EXPECT_TRUE(test, mba_class_use_mbw_max(&fake_props)); 100 + 101 + /* Architectural maximum: */ 102 + fake_props.bwa_wd = 16; 103 + KUNIT_EXPECT_TRUE(test, mba_class_use_mbw_max(&fake_props)); 104 + 105 + /* No 
usable control... */ 106 + fake_props.bwa_wd = 0; 107 + ret = get_mba_granularity(&fake_props); 108 + KUNIT_EXPECT_EQ(test, ret, 0); 109 + 110 + fake_props.bwa_wd = 1; 111 + ret = get_mba_granularity(&fake_props); 112 + KUNIT_EXPECT_EQ(test, ret, 50); /* DIV_ROUND_UP(100, 1 << 1)% = 50% */ 113 + 114 + fake_props.bwa_wd = 2; 115 + ret = get_mba_granularity(&fake_props); 116 + KUNIT_EXPECT_EQ(test, ret, 25); /* DIV_ROUND_UP(100, 1 << 2)% = 25% */ 117 + 118 + fake_props.bwa_wd = 3; 119 + ret = get_mba_granularity(&fake_props); 120 + KUNIT_EXPECT_EQ(test, ret, 13); /* DIV_ROUND_UP(100, 1 << 3)% = 13% */ 121 + 122 + fake_props.bwa_wd = 6; 123 + ret = get_mba_granularity(&fake_props); 124 + KUNIT_EXPECT_EQ(test, ret, 2); /* DIV_ROUND_UP(100, 1 << 6)% = 2% */ 125 + 126 + fake_props.bwa_wd = 7; 127 + ret = get_mba_granularity(&fake_props); 128 + KUNIT_EXPECT_EQ(test, ret, 1); /* DIV_ROUND_UP(100, 1 << 7)% = 1% */ 129 + 130 + /* Granularity saturates at 1% */ 131 + fake_props.bwa_wd = 16; /* architectural maximum */ 132 + ret = get_mba_granularity(&fake_props); 133 + KUNIT_EXPECT_EQ(test, ret, 1); /* DIV_ROUND_UP(100, 1 << 16)% = 1% */ 134 + } 135 + 136 + static void test_mbw_max_to_percent(struct kunit *test) 137 + { 138 + const struct percent_value_case *param = test->param_value; 139 + struct percent_value_test_info res; 140 + 141 + /* 142 + * Since the reference values in percent_value_cases[] all 143 + * correspond to exact percentages, round-to-nearest will 144 + * always give the exact percentage back when the MPAM max 145 + * value has precision of 0.5% or finer. (Always true for the 146 + * reference data, since they all specify 8 bits or more of 147 + * precision.) 
148 + * 149 + * So, keep it simple and demand an exact match: 150 + */ 151 + __prepare_percent_value_test(test, &res, param); 152 + KUNIT_EXPECT_EQ(test, res.pc, param->pc); 153 + } 154 + 155 + static void test_percent_to_mbw_max(struct kunit *test) 156 + { 157 + const struct percent_value_case *param = test->param_value; 158 + struct percent_value_test_info res; 159 + 160 + __prepare_percent_value_test(test, &res, param); 161 + 162 + KUNIT_EXPECT_GE(test, res.value, param->value << res.shift); 163 + KUNIT_EXPECT_LE(test, res.value, (param->value + 1) << res.shift); 164 + KUNIT_EXPECT_LE(test, res.value, res.max_value << res.shift); 165 + 166 + /* No flexibility allowed for 0% and 100%! */ 167 + 168 + if (param->pc == 0) 169 + KUNIT_EXPECT_EQ(test, res.value, 0); 170 + 171 + if (param->pc == 100) 172 + KUNIT_EXPECT_EQ(test, res.value, res.max_value << res.shift); 173 + } 174 + 175 + static const void *test_all_bwa_wd_gen_params(struct kunit *test, const void *prev, 176 + char *desc) 177 + { 178 + uintptr_t param = (uintptr_t)prev; 179 + 180 + if (param > 15) 181 + return NULL; 182 + 183 + param++; 184 + 185 + snprintf(desc, KUNIT_PARAM_DESC_SIZE, "wd=%u\n", (unsigned int)param); 186 + 187 + return (void *)param; 188 + } 189 + 190 + static unsigned int test_get_bwa_wd(struct kunit *test) 191 + { 192 + uintptr_t param = (uintptr_t)test->param_value; 193 + 194 + KUNIT_ASSERT_GE(test, param, 1); 195 + KUNIT_ASSERT_LE(test, param, 16); 196 + 197 + return param; 198 + } 199 + 200 + static void test_mbw_max_to_percent_limits(struct kunit *test) 201 + { 202 + struct mpam_props fake_props = {0}; 203 + u32 max_value; 204 + 205 + mpam_set_feature(mpam_feat_mbw_max, &fake_props); 206 + fake_props.bwa_wd = test_get_bwa_wd(test); 207 + max_value = GENMASK(15, 16 - fake_props.bwa_wd); 208 + 209 + KUNIT_EXPECT_EQ(test, mbw_max_to_percent(max_value, &fake_props), 210 + MAX_MBA_BW); 211 + KUNIT_EXPECT_EQ(test, mbw_max_to_percent(0, &fake_props), 212 + get_mba_min(&fake_props)); 213 
+ 214 + /* 215 + * Rounding policy dependent 0% sanity-check: 216 + * With round-to-nearest, the minimum mbw_max value really 217 + * should map to 0% if there are at least 200 steps. 218 + * (100 steps may be enough for some other rounding policies.) 219 + */ 220 + if (fake_props.bwa_wd >= 8) 221 + KUNIT_EXPECT_EQ(test, mbw_max_to_percent(0, &fake_props), 0); 222 + 223 + if (fake_props.bwa_wd < 8 && 224 + mbw_max_to_percent(0, &fake_props) == 0) 225 + kunit_warn(test, "wd=%d: Testsuite/driver Rounding policy mismatch?", 226 + fake_props.bwa_wd); 227 + } 228 + 229 + /* 230 + * Check that converting a percentage to mbw_max and back again (or, as 231 + * appropriate, vice-versa) always restores the original value: 232 + */ 233 + static void test_percent_max_roundtrip_stability(struct kunit *test) 234 + { 235 + struct mpam_props fake_props = {0}; 236 + unsigned int shift; 237 + u32 pc, max, pc2, max2; 238 + 239 + mpam_set_feature(mpam_feat_mbw_max, &fake_props); 240 + fake_props.bwa_wd = test_get_bwa_wd(test); 241 + shift = 16 - fake_props.bwa_wd; 242 + 243 + /* 244 + * Converting a valid value from the coarser scale to the finer 245 + * scale and back again must yield the original value: 246 + */ 247 + if (fake_props.bwa_wd >= 7) { 248 + /* More than 100 steps: only test exact pc values: */ 249 + for (pc = get_mba_min(&fake_props); pc <= MAX_MBA_BW; pc++) { 250 + max = percent_to_mbw_max(pc, &fake_props); 251 + pc2 = mbw_max_to_percent(max, &fake_props); 252 + KUNIT_EXPECT_EQ(test, pc2, pc); 253 + } 254 + } else { 255 + /* Fewer than 100 steps: only test exact mbw_max values: */ 256 + for (max = 0; max < 1 << 16; max += 1 << shift) { 257 + pc = mbw_max_to_percent(max, &fake_props); 258 + max2 = percent_to_mbw_max(pc, &fake_props); 259 + KUNIT_EXPECT_EQ(test, max2, max); 260 + } 261 + } 262 + } 263 + 264 + static void test_percent_to_max_rounding(struct kunit *test) 265 + { 266 + const struct percent_value_case *param = test->param_value; 267 + unsigned int 
num_rounded_up = 0, total = 0; 268 + struct percent_value_test_info res; 269 + 270 + for (param = percent_value_cases, total = 0; 271 + param < &percent_value_cases[ARRAY_SIZE(percent_value_cases)]; 272 + param++, total++) { 273 + __prepare_percent_value_test(test, &res, param); 274 + if (res.value > param->value << res.shift) 275 + num_rounded_up++; 276 + } 277 + 278 + /* 279 + * The MPAM driver applies a round-to-nearest policy, whereas a 280 + * round-down policy seems to have been applied in the 281 + * reference table from which the test vectors were selected. 282 + * 283 + * For a large and well-distributed suite of test vectors, 284 + * about half should be rounded up and half down compared with 285 + * the reference table. The actual test vectors are few in 286 + * number and probably not very well distributed however, so 287 + * tolerate a round-up rate of between 1/4 and 3/4 before 288 + * crying foul: 289 + */ 290 + 291 + kunit_info(test, "Round-up rate: %u%% (%u/%u)\n", 292 + DIV_ROUND_CLOSEST(num_rounded_up * 100, total), 293 + num_rounded_up, total); 294 + 295 + KUNIT_EXPECT_GE(test, 4 * num_rounded_up, 1 * total); 296 + KUNIT_EXPECT_LE(test, 4 * num_rounded_up, 3 * total); 297 + } 298 + 299 + static struct kunit_case mpam_resctrl_test_cases[] = { 300 + KUNIT_CASE(test_get_mba_granularity), 301 + KUNIT_CASE_PARAM(test_mbw_max_to_percent, test_percent_value_gen_params), 302 + KUNIT_CASE_PARAM(test_percent_to_mbw_max, test_percent_value_gen_params), 303 + KUNIT_CASE_PARAM(test_mbw_max_to_percent_limits, test_all_bwa_wd_gen_params), 304 + KUNIT_CASE(test_percent_to_max_rounding), 305 + KUNIT_CASE_PARAM(test_percent_max_roundtrip_stability, 306 + test_all_bwa_wd_gen_params), 307 + {} 308 + }; 309 + 310 + static struct kunit_suite mpam_resctrl_test_suite = { 311 + .name = "mpam_resctrl_test_suite", 312 + .test_cases = mpam_resctrl_test_cases, 313 + }; 314 + 315 + kunit_test_suites(&mpam_resctrl_test_suite);
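The reference vectors above encode the MPAM fixed-point fractional format, in which an n-bit register value x represents the fraction (x + 1) / 2^n. As a cross-check of the tables and of the per-bwa_wd comments in test_get_mba_granularity(), here is a hedged userspace sketch: percent_to_fixed_floor() reproduces the round-down policy the reference tables appear to use (the driver itself rounds to nearest, as test_percent_to_max_rounding() notes), and mba_granularity() is the DIV_ROUND_UP(100, 1 << bwa_wd) formula spelled out in the test comments. Both function names are illustrative, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round-down encoding matching the DDI 0598 / IHI 0099 reference
 * tables quoted above: an n-bit value x encodes (x + 1) / 2^n, so
 * take the largest x whose fraction does not exceed pc%. Only valid
 * for 1 <= pc <= 100 and 1 <= width <= 16. This is the tables'
 * apparent policy, not the driver's round-to-nearest.
 */
static uint16_t percent_to_fixed_floor(unsigned int pc, unsigned int width)
{
	return (uint16_t)((pc * (UINT32_C(1) << width)) / 100 - 1);
}

/*
 * Bandwidth granularity as the test comments state it:
 * DIV_ROUND_UP(100, 1 << bwa_wd) percent, which never drops below 1%.
 */
static unsigned int mba_granularity(unsigned int bwa_wd)
{
	uint32_t steps = UINT32_C(1) << bwa_wd;

	return (100 + steps - 1) / steps;
}
```

For example, percent_to_fixed_floor(25, 8) yields 0x3f and percent_to_fixed_floor(33, 16) yields 0x5479, matching the table rows above.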
+32
include/linux/arm_mpam.h
··· 5 5 #define __LINUX_ARM_MPAM_H 6 6 7 7 #include <linux/acpi.h> 8 + #include <linux/resctrl_types.h> 8 9 #include <linux/types.h> 9 10 10 11 struct mpam_msc; ··· 49 48 return -EINVAL; 50 49 } 51 50 #endif 51 + 52 + bool resctrl_arch_alloc_capable(void); 53 + bool resctrl_arch_mon_capable(void); 54 + 55 + void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid); 56 + void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid); 57 + void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid); 58 + void resctrl_arch_sched_in(struct task_struct *tsk); 59 + bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid); 60 + bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid); 61 + u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid); 62 + void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid); 63 + u32 resctrl_arch_system_num_rmid_idx(void); 64 + 65 + struct rdt_resource; 66 + void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, enum resctrl_event_id evtid); 67 + void resctrl_arch_mon_ctx_free(struct rdt_resource *r, enum resctrl_event_id evtid, void *ctx); 68 + 69 + /* 70 + * The CPU configuration for MPAM is cheap to write, and is only written if it 71 + * has changed. No need for fine grained enables. 72 + */ 73 + static inline void resctrl_arch_enable_mon(void) { } 74 + static inline void resctrl_arch_disable_mon(void) { } 75 + static inline void resctrl_arch_enable_alloc(void) { } 76 + static inline void resctrl_arch_disable_alloc(void) { } 77 + 78 + static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) 79 + { 80 + return val; 81 + } 52 82 53 83 /** 54 84 * mpam_register_requestor() - Register a requestor with the MPAM driver
+1 -1
include/linux/entry-common.h
··· 324 324 { 325 325 instrumentation_begin(); 326 326 syscall_exit_to_user_mode_work(regs); 327 - local_irq_disable_exit_to_user(); 327 + local_irq_disable(); 328 328 syscall_exit_to_user_mode_prepare(regs); 329 329 instrumentation_end(); 330 330 exit_to_user_mode();
+204 -54
include/linux/irq-entry-common.h
···
110 110 }
111 111 
112 112 /**
113     - * local_irq_enable_exit_to_user - Exit to user variant of local_irq_enable()
114     - * @ti_work: Cached TIF flags gathered with interrupts disabled
115     - *
116     - * Defaults to local_irq_enable(). Can be supplied by architecture specific
117     - * code.
118     - */
119     - static inline void local_irq_enable_exit_to_user(unsigned long ti_work);
120     -
121     - #ifndef local_irq_enable_exit_to_user
122     - static __always_inline void local_irq_enable_exit_to_user(unsigned long ti_work)
123     - {
124     - 	local_irq_enable();
125     - }
126     - #endif
127     -
128     - /**
129     - * local_irq_disable_exit_to_user - Exit to user variant of local_irq_disable()
130     - *
131     - * Defaults to local_irq_disable(). Can be supplied by architecture specific
132     - * code.
133     - */
134     - static inline void local_irq_disable_exit_to_user(void);
135     -
136     - #ifndef local_irq_disable_exit_to_user
137     - static __always_inline void local_irq_disable_exit_to_user(void)
138     - {
139     - 	local_irq_disable();
140     - }
141     - #endif
142     -
143     - /**
144 113  * arch_exit_to_user_mode_work - Architecture specific TIF work for exit
145 114  *				  to user mode.
146 115  * @regs: Pointer to currents pt_regs
···
317 348  */
318 349 static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
319 350 {
    351 +	lockdep_assert_irqs_disabled();
    352 +
320 353 	instrumentation_begin();
321 354 	irqentry_exit_to_user_mode_prepare(regs);
322 355 	instrumentation_end();
···
350 379 #endif
351 380 
352 381 /**
    382 + * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
    383 + *
    384 + * Conditional reschedule with additional sanity checks.
    385 + */
    386 + void raw_irqentry_exit_cond_resched(void);
    387 +
    388 + #ifdef CONFIG_PREEMPT_DYNAMIC
    389 + #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
    390 + #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
    391 + #define irqentry_exit_cond_resched_dynamic_disabled	NULL
    392 + DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
    393 + #define irqentry_exit_cond_resched()	static_call(irqentry_exit_cond_resched)()
    394 + #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
    395 + DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
    396 + void dynamic_irqentry_exit_cond_resched(void);
    397 + #define irqentry_exit_cond_resched()	dynamic_irqentry_exit_cond_resched()
    398 + #endif
    399 + #else /* CONFIG_PREEMPT_DYNAMIC */
    400 + #define irqentry_exit_cond_resched()	raw_irqentry_exit_cond_resched()
    401 + #endif /* CONFIG_PREEMPT_DYNAMIC */
    402 +
    403 + /**
    404 + * irqentry_enter_from_kernel_mode - Establish state before invoking the irq handler
    405 + * @regs: Pointer to currents pt_regs
    406 + *
    407 + * Invoked from architecture specific entry code with interrupts disabled.
    408 + * Can only be called when the interrupt entry came from kernel mode. The
    409 + * calling code must be non-instrumentable. When the function returns all
    410 + * state is correct and the subsequent functions can be instrumented.
    411 + *
    412 + * The function establishes state (lockdep, RCU (context tracking), tracing) and
    413 + * is provided for architectures which require a strict split between entry from
    414 + * kernel and user mode and therefore cannot use irqentry_enter() which handles
    415 + * both entry modes.
    416 + *
    417 + * Returns: An opaque object that must be passed to irqentry_exit_to_kernel_mode().
    418 + */
    419 + static __always_inline irqentry_state_t irqentry_enter_from_kernel_mode(struct pt_regs *regs)
    420 + {
    421 + 	irqentry_state_t ret = {
    422 + 		.exit_rcu = false,
    423 + 	};
    424 +
    425 + 	/*
    426 + 	 * If this entry hit the idle task invoke ct_irq_enter() whether
    427 + 	 * RCU is watching or not.
    428 + 	 *
    429 + 	 * Interrupts can nest when the first interrupt invokes softirq
    430 + 	 * processing on return which enables interrupts.
    431 + 	 *
    432 + 	 * Scheduler ticks in the idle task can mark quiescent state and
    433 + 	 * terminate a grace period, if and only if the timer interrupt is
    434 + 	 * not nested into another interrupt.
    435 + 	 *
    436 + 	 * Checking for rcu_is_watching() here would prevent the nesting
    437 + 	 * interrupt to invoke ct_irq_enter(). If that nested interrupt is
    438 + 	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
    439 + 	 * assume that it is the first interrupt and eventually claim
    440 + 	 * quiescent state and end grace periods prematurely.
    441 + 	 *
    442 + 	 * Unconditionally invoke ct_irq_enter() so RCU state stays
    443 + 	 * consistent.
    444 + 	 *
    445 + 	 * TINY_RCU does not support EQS, so let the compiler eliminate
    446 + 	 * this part when enabled.
    447 + 	 */
    448 + 	if (!IS_ENABLED(CONFIG_TINY_RCU) &&
    449 + 	    (is_idle_task(current) || arch_in_rcu_eqs())) {
    450 + 		/*
    451 + 		 * If RCU is not watching then the same careful
    452 + 		 * sequence vs. lockdep and tracing is required
    453 + 		 * as in irqentry_enter_from_user_mode().
    454 + 		 */
    455 + 		lockdep_hardirqs_off(CALLER_ADDR0);
    456 + 		ct_irq_enter();
    457 + 		instrumentation_begin();
    458 + 		kmsan_unpoison_entry_regs(regs);
    459 + 		trace_hardirqs_off_finish();
    460 + 		instrumentation_end();
    461 +
    462 + 		ret.exit_rcu = true;
    463 + 		return ret;
    464 + 	}
    465 +
    466 + 	/*
    467 + 	 * If RCU is watching then RCU only wants to check whether it needs
    468 + 	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
    469 + 	 * already contains a warning when RCU is not watching, so no point
    470 + 	 * in having another one here.
    471 + 	 */
    472 + 	lockdep_hardirqs_off(CALLER_ADDR0);
    473 + 	instrumentation_begin();
    474 + 	kmsan_unpoison_entry_regs(regs);
    475 + 	rcu_irq_enter_check_tick();
    476 + 	trace_hardirqs_off_finish();
    477 + 	instrumentation_end();
    478 +
    479 + 	return ret;
    480 + }
    481 +
    482 + /**
    483 + * irqentry_exit_to_kernel_mode_preempt - Run preempt checks on return to kernel mode
    484 + * @regs: Pointer to current's pt_regs
    485 + * @state: Return value from matching call to irqentry_enter_from_kernel_mode()
    486 + *
    487 + * This is to be invoked before irqentry_exit_to_kernel_mode_after_preempt() to
    488 + * allow kernel preemption on return from interrupt.
    489 + *
    490 + * Must be invoked with interrupts disabled and CPU state which allows kernel
    491 + * preemption.
    492 + *
    493 + * After returning from this function, the caller can modify CPU state before
    494 + * invoking irqentry_exit_to_kernel_mode_after_preempt(), which is required to
    495 + * re-establish the tracing, lockdep and RCU state for returning to the
    496 + * interrupted context.
    497 + */
    498 + static inline void irqentry_exit_to_kernel_mode_preempt(struct pt_regs *regs,
    499 + 							 irqentry_state_t state)
    500 + {
    501 + 	if (regs_irqs_disabled(regs) || state.exit_rcu)
    502 + 		return;
    503 +
    504 + 	if (IS_ENABLED(CONFIG_PREEMPTION))
    505 + 		irqentry_exit_cond_resched();
    506 +
    507 + 	hrtimer_rearm_deferred();
    508 + }
    509 +
    510 + /**
    511 + * irqentry_exit_to_kernel_mode_after_preempt - Establish trace, lockdep and RCU state
    512 + * @regs: Pointer to current's pt_regs
    513 + * @state: Return value from matching call to irqentry_enter_from_kernel_mode()
    514 + *
    515 + * This is to be invoked after irqentry_exit_to_kernel_mode_preempt() and before
    516 + * actually returning to the interrupted context.
    517 + *
    518 + * There are no requirements for the CPU state other than being able to complete
    519 + * the tracing, lockdep and RCU state transitions. After this function returns
    520 + * the caller must return directly to the interrupted context.
    521 + */
    522 + static __always_inline void
    523 + irqentry_exit_to_kernel_mode_after_preempt(struct pt_regs *regs, irqentry_state_t state)
    524 + {
    525 + 	if (!regs_irqs_disabled(regs)) {
    526 + 		/*
    527 + 		 * If RCU was not watching on entry this needs to be done
    528 + 		 * carefully and needs the same ordering of lockdep/tracing
    529 + 		 * and RCU as the return to user mode path.
    530 + 		 */
    531 + 		if (state.exit_rcu) {
    532 + 			instrumentation_begin();
    533 + 			/* Tell the tracer that IRET will enable interrupts */
    534 + 			trace_hardirqs_on_prepare();
    535 + 			lockdep_hardirqs_on_prepare();
    536 + 			instrumentation_end();
    537 + 			ct_irq_exit();
    538 + 			lockdep_hardirqs_on(CALLER_ADDR0);
    539 + 			return;
    540 + 		}
    541 +
    542 + 		instrumentation_begin();
    543 + 		/* Covers both tracing and lockdep */
    544 + 		trace_hardirqs_on();
    545 + 		instrumentation_end();
    546 + 	} else {
    547 + 		/*
    548 + 		 * IRQ flags state is correct already. Just tell RCU if it
    549 + 		 * was not watching on entry.
    550 + 		 */
    551 + 		if (state.exit_rcu)
    552 + 			ct_irq_exit();
    553 + 	}
    554 + }
    555 +
    556 + /**
    557 + * irqentry_exit_to_kernel_mode - Run preempt checks and establish state after
    558 + *				  invoking the interrupt handler
    559 + * @regs: Pointer to current's pt_regs
    560 + * @state: Return value from matching call to irqentry_enter_from_kernel_mode()
    561 + *
    562 + * This is the counterpart of irqentry_enter_from_kernel_mode() and combines
    563 + * the calls to irqentry_exit_to_kernel_mode_preempt() and
    564 + * irqentry_exit_to_kernel_mode_after_preempt().
    565 + *
    566 + * The requirement for the CPU state is that it can schedule. After the function
    567 + * returns the tracing, lockdep and RCU state transitions are completed and the
    568 + * caller must return directly to the interrupted context.
    569 + */
    570 + static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs,
    571 + 							  irqentry_state_t state)
    572 + {
    573 + 	lockdep_assert_irqs_disabled();
    574 +
    575 + 	instrumentation_begin();
    576 + 	irqentry_exit_to_kernel_mode_preempt(regs, state);
    577 + 	instrumentation_end();
    578 +
    579 + 	irqentry_exit_to_kernel_mode_after_preempt(regs, state);
    580 + }
    581 +
    582 + /**
353 583  * irqentry_enter - Handle state tracking on ordinary interrupt entries
354 584  * @regs: Pointer to pt_regs of interrupted context
355 585  *
···
579 407  * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
580 408  * would not be possible.
581 409  *
582     - * Returns: An opaque object that must be passed to idtentry_exit()
    410 + * Returns: An opaque object that must be passed to irqentry_exit()
583 411  */
584 412 irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
585     -
586     - /**
587     - * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
588     - *
589     - * Conditional reschedule with additional sanity checks.
590     - */
591     - void raw_irqentry_exit_cond_resched(void);
592     -
593     - #ifdef CONFIG_PREEMPT_DYNAMIC
594     - #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
595     - #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
596     - #define irqentry_exit_cond_resched_dynamic_disabled	NULL
597     - DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
598     - #define irqentry_exit_cond_resched()	static_call(irqentry_exit_cond_resched)()
599     - #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
600     - DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
601     - void dynamic_irqentry_exit_cond_resched(void);
602     - #define irqentry_exit_cond_resched()	dynamic_irqentry_exit_cond_resched()
603     - #endif
604     - #else /* CONFIG_PREEMPT_DYNAMIC */
605     - #define irqentry_exit_cond_resched()	raw_irqentry_exit_cond_resched()
606     - #endif /* CONFIG_PREEMPT_DYNAMIC */
607 413 
608 414 /**
609 415  * irqentry_exit - Handle return from exception that used irqentry_enter()
+10 -99
kernel/entry/common.c
···
 47  47 	 */
 48  48 	while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
 49  49 
 50     -		local_irq_enable_exit_to_user(ti_work);
     50 +		local_irq_enable();
 51  51 
 52  52 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
 53  53 			if (!rseq_grant_slice_extension(ti_work, TIF_SLICE_EXT_DENY))
···
 74  74 	 * might have changed while interrupts and preemption was
 75  75 	 * enabled above.
 76  76 	 */
 77     -	local_irq_disable_exit_to_user();
     77 +	local_irq_disable();
 78  78 
 79  79 	/* Check if any of the above work has queued a deferred wakeup */
 80  80 	tick_nohz_user_enter_prepare();
···
105 105 
106 106 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
107 107 {
108     -	irqentry_state_t ret = {
109     -		.exit_rcu = false,
110     -	};
111     -
112 108 	if (user_mode(regs)) {
    109 +		irqentry_state_t ret = {
    110 +			.exit_rcu = false,
    111 +		};
    112 +
113 113 		irqentry_enter_from_user_mode(regs);
114 114 		return ret;
115 115 	}
116 116 
117     -	/*
118     -	 * If this entry hit the idle task invoke ct_irq_enter() whether
119     -	 * RCU is watching or not.
120     -	 *
121     -	 * Interrupts can nest when the first interrupt invokes softirq
122     -	 * processing on return which enables interrupts.
123     -	 *
124     -	 * Scheduler ticks in the idle task can mark quiescent state and
125     -	 * terminate a grace period, if and only if the timer interrupt is
126     -	 * not nested into another interrupt.
127     -	 *
128     -	 * Checking for rcu_is_watching() here would prevent the nesting
129     -	 * interrupt to invoke ct_irq_enter(). If that nested interrupt is
130     -	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
131     -	 * assume that it is the first interrupt and eventually claim
132     -	 * quiescent state and end grace periods prematurely.
133     -	 *
134     -	 * Unconditionally invoke ct_irq_enter() so RCU state stays
135     -	 * consistent.
136     -	 *
137     -	 * TINY_RCU does not support EQS, so let the compiler eliminate
138     -	 * this part when enabled.
139     -	 */
140     -	if (!IS_ENABLED(CONFIG_TINY_RCU) &&
141     -	    (is_idle_task(current) || arch_in_rcu_eqs())) {
142     -		/*
143     -		 * If RCU is not watching then the same careful
144     -		 * sequence vs. lockdep and tracing is required
145     -		 * as in irqentry_enter_from_user_mode().
146     -		 */
147     -		lockdep_hardirqs_off(CALLER_ADDR0);
148     -		ct_irq_enter();
149     -		instrumentation_begin();
150     -		kmsan_unpoison_entry_regs(regs);
151     -		trace_hardirqs_off_finish();
152     -		instrumentation_end();
153     -
154     -		ret.exit_rcu = true;
155     -		return ret;
156     -	}
157     -
158     -	/*
159     -	 * If RCU is watching then RCU only wants to check whether it needs
160     -	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
161     -	 * already contains a warning when RCU is not watching, so no point
162     -	 * in having another one here.
163     -	 */
164     -	lockdep_hardirqs_off(CALLER_ADDR0);
165     -	instrumentation_begin();
166     -	kmsan_unpoison_entry_regs(regs);
167     -	rcu_irq_enter_check_tick();
168     -	trace_hardirqs_off_finish();
169     -	instrumentation_end();
170     -
171     -	return ret;
    117 +	return irqentry_enter_from_kernel_mode(regs);
172 118 }
173 119 
174 120 /**
···
158 212 
159 213 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
160 214 {
161     -	lockdep_assert_irqs_disabled();
162     -
163     -	/* Check whether this returns to user mode */
164     -	if (user_mode(regs)) {
    215 +	if (user_mode(regs))
165 216 		irqentry_exit_to_user_mode(regs);
166     -	} else if (!regs_irqs_disabled(regs)) {
167     -		/*
168     -		 * If RCU was not watching on entry this needs to be done
169     -		 * carefully and needs the same ordering of lockdep/tracing
170     -		 * and RCU as the return to user mode path.
171     -		 */
172     -		if (state.exit_rcu) {
173     -			instrumentation_begin();
174     -			hrtimer_rearm_deferred();
175     -			/* Tell the tracer that IRET will enable interrupts */
176     -			trace_hardirqs_on_prepare();
177     -			lockdep_hardirqs_on_prepare();
178     -			instrumentation_end();
179     -			ct_irq_exit();
180     -			lockdep_hardirqs_on(CALLER_ADDR0);
181     -			return;
182     -		}
183     -
184     -		instrumentation_begin();
185     -		if (IS_ENABLED(CONFIG_PREEMPTION))
186     -			irqentry_exit_cond_resched();
187     -
188     -		hrtimer_rearm_deferred();
189     -		/* Covers both tracing and lockdep */
190     -		trace_hardirqs_on();
191     -		instrumentation_end();
192     -	} else {
193     -		/*
194     -		 * IRQ flags state is correct already. Just tell RCU if it
195     -		 * was not watching on entry.
196     -		 */
197     -		if (state.exit_rcu)
198     -			ct_irq_exit();
199     -	}
    217 +	else
    218 +		irqentry_exit_to_kernel_mode(regs, state);
200 219 }
201 220 
202 221 irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+2 -1
tools/testing/selftests/arm64/abi/hwcap.c
···
56 56 
57 57 static void cmpbr_sigill(void)
58 58 {
59    -	/* Not implemented, too complicated and unreliable anyway */
   59 +	asm volatile(".inst 0x74C00040\n"	/* CBEQ w0, w0, +8 */
   60 +		     "udf #0" : : : "cc");	/* UDF #0 */
60 61 }
61 62 
62 63 static void crc32_sigill(void)
+1
tools/testing/selftests/kvm/arm64/set_id_regs.c
···
124 124 
125 125 static const struct reg_ftr_bits ftr_id_aa64isar3_el1[] = {
126 126 	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, FPRCVT, 0),
    127 +	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, LSUI, 0),
127 128 	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, LSFE, 0),
128 129 	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, FAMINMAX, 0),
129 130 	REG_FTR_END,