Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

accel/qaic: Add documentation for AIC100 accelerator driver

The Qualcomm Cloud AI 100 (AIC100) device is an Artificial Intelligence
accelerator PCIe card. It contains a number of components both in the
SoC and on the card which facilitate running workloads:

QSM: management processor
NSPs: workload compute units
DMA Bridge: dedicated data mover for the workloads
MHI: multiplexed communication channels
DDR: workload storage and memory

The Linux kernel driver for AIC100 is called "QAIC" and is located in the
accel subsystem.

Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Reviewed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Acked-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1679932497-30277-2-git-send-email-quic_jhugo@quicinc.com

Authored by Jeffrey Hugo, committed by Jacek Lawrynowicz (commit 830f3f27, parent f2c7ca89)

694 lines added across 4 files.

Documentation/accel/index.rst (+1)

   :maxdepth: 1

   introduction
   qaic/index

.. only:: subproject and html
Documentation/accel/qaic/aic100.rst (+510)

.. SPDX-License-Identifier: GPL-2.0-only

===============================
 Qualcomm Cloud AI 100 (AIC100)
===============================

Overview
========

The Qualcomm Cloud AI 100/AIC100 family of products (including SA9000P - part of
Snapdragon Ride) are PCIe adapter cards which contain a dedicated SoC ASIC for
the purpose of efficiently running Artificial Intelligence (AI) Deep Learning
inference workloads. They are AI accelerators.

The PCIe interface of AIC100 is capable of PCIe Gen4 speeds over eight lanes
(x8). An individual SoC on a card can have up to 16 NSPs for running workloads.
Each SoC has an A53 management CPU. On card, there can be up to 32 GB of DDR.

Multiple AIC100 cards can be hosted in a single system to scale overall
performance. AIC100 cards are multi-user capable and able to execute workloads
from multiple users in a concurrent manner.

Hardware Description
====================

An AIC100 card consists of an AIC100 SoC, on-card DDR, and a set of misc
peripherals (PMICs, etc).

An AIC100 card can either be a PCIe HHHL form factor (a traditional PCIe card),
or a Dual M.2 card. Both use PCIe to connect to the host system.

As a PCIe endpoint/adapter, AIC100 uses the standard VendorID(VID)/
DeviceID(DID) combination to uniquely identify itself to the host. AIC100
uses the standard Qualcomm VID (0x17cb). All AIC100 SKUs use the same
AIC100 DID (0xa100).

AIC100 does not implement FLR (function level reset).

AIC100 implements MSI but does not implement MSI-X. AIC100 requires 17 MSIs to
operate (1 for MHI, 16 for the DMA Bridge).

As a PCIe device, AIC100 utilizes BARs to provide host interfaces to the device
hardware. AIC100 provides three 64-bit BARs.

* The first BAR is 4K in size, and exposes the MHI interface to the host.

* The second BAR is 2M in size, and exposes the DMA Bridge interface to the
  host.

* The third BAR is variable in size based on an individual AIC100's
  configuration, but defaults to 64K. This BAR currently has no purpose.

From the host perspective, AIC100 has several key hardware components -

* MHI (Modem Host Interface)
* QSM (QAIC Service Manager)
* NSPs (Neural Signal Processor)
* DMA Bridge
* DDR

MHI
---

AIC100 has one MHI interface over PCIe. MHI itself is documented at
Documentation/mhi/index.rst. MHI is the mechanism the host uses to communicate
with the QSM. Except for workload data via the DMA Bridge, all interaction with
the device occurs via MHI.

QSM
---

QAIC Service Manager. This is an ARM A53 CPU that runs the primary firmware of
the card and performs on-card management tasks. It also communicates with the
host via MHI. Each AIC100 has one of these.

NSP
---

Neural Signal Processor. Each AIC100 has up to 16 of these. These are the
processors that run the workloads on AIC100. Each NSP is a Qualcomm Hexagon
(Q6) DSP with HVX and HMX. Each NSP can only run one workload at a time, but
multiple NSPs may be assigned to a single workload. Since each NSP can only run
one workload, AIC100 is limited to 16 concurrent workloads. Workload
"scheduling" is under the purview of the host. AIC100 does not automatically
timeslice.

DMA Bridge
----------

The DMA Bridge is a custom DMA engine that manages the flow of data in and out
of workloads. AIC100 has one of these. The DMA Bridge has 16 channels, each
consisting of a set of request/response FIFOs. Each active workload is
assigned a single DMA Bridge channel. The DMA Bridge exposes hardware registers
to manage the FIFOs (head/tail pointers), but requires host memory to store the
FIFOs.

DDR
---

AIC100 has on-card DDR. In total, an AIC100 can have up to 32 GB of DDR.
This DDR is used to store workloads, data for the workloads, and is used by the
QSM for managing the device. NSPs are granted access to sections of the DDR by
the QSM. The host does not have direct access to the DDR, and must make
requests to the QSM to transfer data to the DDR.

High-level Use Flow
===================

AIC100 is a multi-user, programmable accelerator typically used for running
neural networks in inferencing mode to efficiently perform AI operations.
AIC100 is not intended for training neural networks. AIC100 can be utilized
for generic compute workloads.

Assuming a user wants to utilize AIC100, they would follow these steps:

1. Compile the workload into an ELF targeting the NSP(s)
2. Make requests to the QSM to load the workload and related artifacts into the
   device DDR
3. Make a request to the QSM to activate the workload onto a set of idle NSPs
4. Make requests to the DMA Bridge to send input data to the workload to be
   processed, and other requests to receive processed output data from the
   workload.
5. Once the workload is no longer required, make a request to the QSM to
   deactivate the workload, thus putting the NSPs back into an idle state.
6. Once the workload and related artifacts are no longer needed for future
   sessions, make requests to the QSM to unload the data from DDR. This frees
   the DDR to be used by other users.


Boot Flow
=========

AIC100 uses a flashless boot flow, derived from Qualcomm MSMs.

When AIC100 is first powered on, it begins executing PBL (Primary Bootloader)
from ROM.
PBL enumerates the PCIe link, and initializes the BHI (Boot Host Interface)
component of MHI.

Using BHI, the host points PBL to the location of the SBL (Secondary Bootloader)
image. The PBL pulls the image from the host, validates it, and begins
execution of SBL.

SBL initializes MHI, and uses MHI to notify the host that the device has entered
the SBL stage. SBL performs a number of operations:

* SBL initializes the majority of hardware (anything PBL left uninitialized),
  including DDR.
* SBL offloads the bootlog to the host.
* SBL synchronizes timestamps with the host for future logging.
* SBL uses the Sahara protocol to obtain the runtime firmware images from the
  host.

Once SBL has obtained and validated the runtime firmware, it brings the NSPs out
of reset, and jumps into the QSM.

The QSM uses MHI to notify the host that the device has entered the QSM stage
(AMSS in MHI terms). At this point, the AIC100 device is fully functional, and
ready to process workloads.

Userspace components
====================

Compiler
--------

An open compiler for AIC100 based on upstream LLVM can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100-cc

Usermode Driver (UMD)
---------------------

An open UMD that interfaces with the qaic kernel driver can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100

Sahara loader
-------------

An open implementation of the Sahara protocol called kickstart can be found at:
https://github.com/andersson/qdl

MHI Channels
============

AIC100 defines a number of MHI channels for different purposes. This is a list
of the defined channels, and their uses.

+----------------+---------+----------+----------------------------------------+
| Channel name   | IDs     | EEs      | Purpose                                |
+================+=========+==========+========================================+
| QAIC_LOOPBACK  | 0 & 1   | AMSS     | Any data sent to the device on this    |
|                |         |          | channel is sent back to the host.      |
+----------------+---------+----------+----------------------------------------+
| QAIC_SAHARA    | 2 & 3   | SBL      | Used by SBL to obtain the runtime      |
|                |         |          | firmware from the host.                |
+----------------+---------+----------+----------------------------------------+
| QAIC_DIAG      | 4 & 5   | AMSS     | Used to communicate with QSM via the   |
|                |         |          | DIAG protocol.                         |
+----------------+---------+----------+----------------------------------------+
| QAIC_SSR       | 6 & 7   | AMSS     | Used to notify the host of subsystem   |
|                |         |          | restart events, and to offload SSR     |
|                |         |          | crashdumps.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_QDSS      | 8 & 9   | AMSS     | Used for the Qualcomm Debug Subsystem. |
+----------------+---------+----------+----------------------------------------+
| QAIC_CONTROL   | 10 & 11 | AMSS     | Used for the Neural Network Control    |
|                |         |          | (NNC) protocol. This is the primary    |
|                |         |          | channel between host and QSM for       |
|                |         |          | managing workloads.                    |
+----------------+---------+----------+----------------------------------------+
| QAIC_LOGGING   | 12 & 13 | SBL      | Used by the SBL to send the bootlog to |
|                |         |          | the host.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_STATUS    | 14 & 15 | AMSS     | Used to notify the host of Reliability,|
|                |         |          | Accessibility, Serviceability (RAS)    |
|                |         |          | events.                                |
+----------------+---------+----------+----------------------------------------+
| QAIC_TELEMETRY | 16 & 17 | AMSS     | Used to get/set power/thermal/etc      |
|                |         |          | attributes.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_DEBUG     | 18 & 19 | AMSS     | Not used.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_TIMESYNC  | 20 & 21 | SBL/AMSS | Used to synchronize timestamps in the  |
|                |         |          | device side logs with the host time    |
|                |         |          | source.                                |
+----------------+---------+----------+----------------------------------------+

DMA Bridge
==========

Overview
--------

The DMA Bridge is one of the main interfaces to the host from the device
(the other being MHI). As part of activating a workload to run on NSPs, the QSM
assigns that network a DMA Bridge channel. A workload's DMA Bridge channel
(DBC for short) is solely for the use of that workload and is not shared with
other workloads.

Each DBC is a pair of FIFOs that manage data in and out of the workload. One
FIFO is the request FIFO. The other FIFO is the response FIFO.

Each DBC contains 4 registers in hardware:

* Request FIFO head pointer (offset 0x0). Read only by the host. Indicates the
  latest item in the FIFO the device has consumed.
* Request FIFO tail pointer (offset 0x4). Read/write by the host. Host
  increments this register to add new items to the FIFO.
* Response FIFO head pointer (offset 0x8). Read/write by the host. Indicates
  the latest item in the FIFO the host has consumed.
* Response FIFO tail pointer (offset 0xc). Read only by the host. Device
  increments this register to add new items to the FIFO.

The values in each register are indexes in the FIFO.
To get the location of the
FIFO element pointed to by the register: FIFO base address + register * element
size.

DBC registers are exposed to the host via the second BAR. Each DBC consumes
4KB of space in the BAR.

The actual FIFOs are backed by host memory. When sending a request to the QSM
to activate a network, the host must donate memory to be used for the FIFOs.
Due to internal mapping limitations of the device, a single contiguous chunk of
memory must be provided per DBC, which hosts both FIFOs. The request FIFO will
consume the beginning of the memory chunk, and the response FIFO will consume
the end of the memory chunk.

Request FIFO
------------

A request FIFO element has the following structure:

.. code-block:: c

  struct request_elem {
          u16 req_id;
          u8  seq_id;
          u8  pcie_dma_cmd;
          u32 reserved1;
          u64 pcie_dma_source_addr;
          u64 pcie_dma_dest_addr;
          u32 pcie_dma_len;
          u32 reserved2;
          u64 doorbell_addr;
          u8  doorbell_attr;
          u8  reserved3;
          u16 reserved4;
          u32 doorbell_data;
          u32 sem_cmd0;
          u32 sem_cmd1;
          u32 sem_cmd2;
          u32 sem_cmd3;
  };

Request field descriptions:

req_id
    request ID. A request FIFO element and a response FIFO element with
    the same request ID refer to the same command.

seq_id
    sequence ID within a request. Ignored by the DMA Bridge.

pcie_dma_cmd
    describes the DMA element of this request.

    * Bit(7) is the force msi flag, which overrides the DMA Bridge MSI logic
      and generates an MSI when this request is complete, assuming QSM has
      configured the DMA Bridge to look at this bit.
    * Bits(6:5) are reserved.
    * Bit(4) is the completion code flag, and indicates that the DMA Bridge
      shall generate a response FIFO element when this request is complete.
    * Bit(3) indicates if this request is a linked list transfer(0) or a bulk
      transfer(1).
    * Bit(2) is reserved.
    * Bits(1:0) indicate the type of transfer. No transfer(0), to device(1),
      from device(2). Value 3 is illegal.

pcie_dma_source_addr
    source address for a bulk transfer, or the address of the linked list.

pcie_dma_dest_addr
    destination address for a bulk transfer.

pcie_dma_len
    length of the bulk transfer. Note that the size of this field
    limits transfers to 4G in size.

doorbell_addr
    address of the doorbell to ring when this request is complete.

doorbell_attr
    doorbell attributes.

    * Bit(7) indicates if a write to a doorbell is to occur.
    * Bits(6:2) are reserved.
    * Bits(1:0) contain the encoding of the doorbell length. 0 is 32-bit,
      1 is 16-bit, 2 is 8-bit, 3 is reserved. The doorbell address
      must be naturally aligned to the specified length.

doorbell_data
    data to write to the doorbell. Only the bits corresponding to
    the doorbell length are valid.

sem_cmdN
    semaphore command.

    * Bit(31) indicates this semaphore command is enabled.
    * Bit(30) is the to-device DMA fence. Block this request until all
      to-device DMA transfers are complete.
    * Bit(29) is the from-device DMA fence. Block this request until all
      from-device DMA transfers are complete.
    * Bits(28:27) are reserved.
    * Bits(26:24) are the semaphore command. 0 is NOP. 1 is init with the
      specified value. 2 is increment. 3 is decrement. 4 is wait
      until the semaphore is equal to the specified value. 5 is wait
      until the semaphore is greater or equal to the specified value.
      6 is "P", wait until semaphore is greater than 0, then
      decrement by 1. 7 is reserved.
    * Bit(23) is reserved.
    * Bit(22) is the semaphore sync. 0 is post sync, which means that the
      semaphore operation is done after the DMA transfer. 1 is
      presync, which gates the DMA transfer. Only one presync is
      allowed per request.
    * Bit(21) is reserved.
    * Bits(20:16) are the index of the semaphore to operate on.
    * Bits(15:12) are reserved.
    * Bits(11:0) are the semaphore value to use in operations.

Overall, a request is processed in 4 steps:

1. If specified, the presync semaphore condition must be true
2. If enabled, the DMA transfer occurs
3. If specified, the postsync semaphore conditions must be true
4. If enabled, the doorbell is written

By using the semaphores in conjunction with the workload running on the NSPs,
the data pipeline can be synchronized such that the host can queue multiple
requests of data for the workload to process, but the DMA Bridge will only copy
the data into the memory of the workload when the workload is ready to process
the next input.

Response FIFO
-------------

Once a request is fully processed, a response FIFO element is generated if
specified in pcie_dma_cmd. The structure of a response FIFO element:

.. code-block:: c

  struct response_elem {
          u16 req_id;
          u16 completion_code;
  };

req_id
    matches the req_id of the request that generated this element.

completion_code
    status of this request. 0 is success. Non-zero is an error.

The DMA Bridge will generate an MSI to the host as a reaction to activity in the
response FIFO of a DBC.
The DMA Bridge hardware has an IRQ storm mitigation
algorithm, where it will only generate an MSI when the response FIFO transitions
from empty to non-empty (unless force MSI is enabled and triggered). In
response to this MSI, the host is expected to drain the response FIFO, and must
take care to handle any race conditions between draining the FIFO, and the
device inserting elements into the FIFO.

Neural Network Control (NNC) Protocol
=====================================

The NNC protocol is how the host makes requests to the QSM to manage workloads.
It uses the QAIC_CONTROL MHI channel.

Each NNC request is packaged into a message. Each message is a series of
transactions. A passthrough type transaction can contain elements known as
commands.

QSM requires NNC messages be little endian encoded and the fields be naturally
aligned. Since there are 64-bit elements in some NNC messages, 64-bit alignment
must be maintained.

A message contains a header and then a series of transactions. A message may be
at most 4K in size from QSM to the host. From the host to the QSM, a message
can be at most 64K (maximum size of a single MHI packet), but there is a
continuation feature where message N+1 can be marked as a continuation of
message N. This is used for exceedingly large DMA xfer transactions.

Transaction descriptions
------------------------

passthrough
    Allows userspace to send an opaque payload directly to the QSM.
    This is used for NNC commands. Userspace is responsible for managing
    the QSM message requirements in the payload.

dma_xfer
    DMA transfer. Describes an object that the QSM should DMA into the
    device via address and size tuples.

activate
    Activate a workload onto NSPs. The host must provide memory to be
    used by the DBC.

deactivate
    Deactivate an active workload and return the NSPs to idle.

status
    Query the QSM about its NNC implementation. Returns the NNC version,
    and if CRC is used.

terminate
    Release a user's resources.

dma_xfer_cont
    Continuation of a previous DMA transfer. If a DMA transfer
    cannot be specified in a single message (highly fragmented), this
    transaction can be used to specify more ranges.

validate_partition
    Query the QSM to determine if a partition identifier is valid.

Each message is tagged with a user id, and a partition id. The user id allows
QSM to track resources, and release them when the user goes away (e.g. the
process crashes). A partition id identifies the resource partition that QSM
manages, which this message applies to.

Messages may have CRCs. Messages should have CRCs applied until the QSM
reports via the status transaction that CRCs are not needed. The QSM on the
SA9000P requires CRCs for black channel safing.

Subsystem Restart (SSR)
=======================

SSR is the concept of limiting the impact of an error. An AIC100 device may
have multiple users, each with their own workload running. If the workload of
one user crashes, the fallout of that should be limited to that workload and not
impact other workloads. SSR accomplishes this.

If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
channel. This notification identifies the workload by its assigned DBC. A
multi-stage recovery process is then used to clean up both sides, and get the
DBC/NSPs into a working state.

When SSR occurs, any state in the workload is lost. Any inputs that were in
process, or queued but not yet serviced, are lost. The loaded artifacts will
remain in on-card DDR, but the host will need to re-activate the workload if
it desires to recover the workload.

Reliability, Accessibility, Serviceability (RAS)
================================================

AIC100 is expected to be deployed in server systems where RAS ideology is
applied. Simply put, RAS is the concept of detecting, classifying, and
reporting errors. While PCIe has AER (Advanced Error Reporting) which factors
into RAS, AER does not allow for a device to report details about internal
errors. Therefore, AIC100 implements a custom RAS mechanism. When a RAS event
occurs, QSM will report the event with appropriate details via the QAIC_STATUS
MHI channel. A sysadmin may determine that a particular device needs
additional service based on RAS reports.

Telemetry
=========

QSM has the ability to report various physical attributes of the device, and in
some cases, to allow the host to control them. Examples include thermal limits,
thermal readings, and power readings. These items are communicated via the
QAIC_TELEMETRY MHI channel.
Documentation/accel/qaic/index.rst (+13)

.. SPDX-License-Identifier: GPL-2.0-only

=====================================
 accel/qaic Qualcomm Cloud AI driver
=====================================

The accel/qaic driver supports the Qualcomm Cloud AI machine learning
accelerator cards.

.. toctree::

   qaic
   aic100
Documentation/accel/qaic/qaic.rst (+170)

.. SPDX-License-Identifier: GPL-2.0-only

=============
 QAIC driver
=============

The QAIC driver is the Kernel Mode Driver (KMD) for the AIC100 family of AI
accelerator products.

Interrupts
==========

While the AIC100 DMA Bridge hardware implements an IRQ storm mitigation
mechanism, it is still possible for an IRQ storm to occur. A storm can happen
if the workload is particularly quick, and the host is responsive. If the host
can drain the response FIFO as quickly as the device can insert elements into
it, then the device will frequently transition the response FIFO from empty to
non-empty and generate MSIs at a rate equivalent to the speed of the
workload's ability to process inputs. The lprnet (license plate reader network)
workload is known to trigger this condition, and can generate in excess of 100k
MSIs per second. It has been observed that most systems cannot tolerate this
for long, and will crash due to some form of watchdog due to the overhead of
the interrupt controller interrupting the host CPU.

To mitigate this issue, the QAIC driver implements specific IRQ handling. When
QAIC receives an IRQ, it disables that line. This prevents the interrupt
controller from interrupting the CPU. Then QAIC drains the FIFO. Once the FIFO
is drained, QAIC implements a "last chance" polling algorithm where QAIC will
sleep for a time to see if the workload will generate more activity. The IRQ
line remains disabled during this time. If no activity is detected, QAIC exits
polling mode and reenables the IRQ line.

This mitigation in QAIC is very effective. The same lprnet usecase that
generates 100k IRQs per second (per /proc/interrupts) is reduced to roughly 64
IRQs over 5 minutes while keeping the host system stable, and having the same
workload throughput performance (within run to run noise variation).
Neural Network Control (NNC) Protocol
=====================================

The implementation of NNC is split between the KMD (QAIC) and UMD. In general
QAIC understands how to encode/decode NNC wire protocol, and elements of the
protocol which require kernel space knowledge to process (for example, mapping
host memory to device IOVAs). QAIC understands the structure of a message, and
all of the transactions. QAIC does not understand commands (the payload of a
passthrough transaction).

QAIC handles and enforces the required little endianness and 64-bit alignment,
to the degree that it can. Since QAIC does not know the contents of a
passthrough transaction, it relies on the UMD to satisfy the requirements.

The terminate transaction is of particular use to QAIC. QAIC is not aware of
the resources that are loaded onto a device since the majority of that activity
occurs within NNC commands. As a result, QAIC does not have the means to
roll back userspace activity. To ensure that a userspace client's resources
are fully released in the case of a process crash, or a bug, QAIC uses the
terminate command to let QSM know when a user has gone away, and the resources
can be released.

QSM can report a version number of the NNC protocol it supports. This is in the
form of a Major number and a Minor number.

Major number updates indicate changes to the NNC protocol which impact the
message format, or transactions (impacts QAIC).

Minor number updates indicate changes to the NNC protocol which impact the
commands (does not impact QAIC).

uAPI
====

QAIC defines a number of driver specific IOCTLs as part of the userspace API.
This section describes those APIs.

DRM_IOCTL_QAIC_MANAGE
    This IOCTL allows userspace to send an NNC request to the QSM. The call
    will block until a response is received, or the request has timed out.

DRM_IOCTL_QAIC_CREATE_BO
    This IOCTL allows userspace to allocate a buffer object (BO) which can send
    or receive data from a workload. The call will return a GEM handle that
    represents the allocated buffer. The BO is not usable until it has been
    sliced (see DRM_IOCTL_QAIC_ATTACH_SLICE_BO).

DRM_IOCTL_QAIC_MMAP_BO
    This IOCTL allows userspace to prepare an allocated BO to be mmap'd into
    the userspace process.

DRM_IOCTL_QAIC_ATTACH_SLICE_BO
    This IOCTL allows userspace to slice a BO in preparation for sending the BO
    to the device. Slicing is the operation of describing what portions of a BO
    get sent where to a workload. This requires a set of DMA transfers for the
    DMA Bridge, and as such, locks the BO to a specific DBC.

DRM_IOCTL_QAIC_EXECUTE_BO
    This IOCTL allows userspace to submit a set of sliced BOs to the device.
    The call is non-blocking. Success only indicates that the BOs have been
    queued to the device, but does not guarantee they have been executed.

DRM_IOCTL_QAIC_PARTIAL_EXECUTE_BO
    This IOCTL operates like DRM_IOCTL_QAIC_EXECUTE_BO, but it allows userspace
    to shrink the BOs sent to the device for this specific call. If a BO
    typically has N inputs, but only a subset of those is available, this IOCTL
    allows userspace to indicate that only the first M bytes of the BO should
    be sent to the device to minimize data transfer overhead. This IOCTL
    dynamically recomputes the slicing, and therefore has some processing
    overhead before the BOs can be queued to the device.

DRM_IOCTL_QAIC_WAIT_BO
    This IOCTL allows userspace to determine when a particular BO has been
    processed by the device. The call will block until either the BO has been
    processed and can be re-queued to the device, or a timeout occurs.

DRM_IOCTL_QAIC_PERF_STATS_BO
    This IOCTL allows userspace to collect performance statistics on the most
    recent execution of a BO. This allows userspace to construct an end to end
    timeline of the BO processing for a performance analysis.

DRM_IOCTL_QAIC_PART_DEV
    This IOCTL allows userspace to request a duplicate "shadow device". This
    extra accelN device is associated with a specific partition of resources on
    the AIC100 device and can be used for limiting a process to some subset of
    resources.

Userspace Client Isolation
==========================

AIC100 supports multiple clients. Multiple DBCs can be consumed by a single
client, and multiple clients can each consume one or more DBCs. Workloads
may contain sensitive information, therefore only the client that owns the
workload should be allowed to interface with the DBC.

Clients are identified by the instance associated with their open(). A client
may only use memory they allocate, and DBCs that are assigned to their
workloads. Attempts to access resources assigned to other clients will be
rejected.

Module parameters
=================

QAIC supports the following module parameters:

**datapath_polling (bool)**

Configures QAIC to use a polling thread for datapath events instead of relying
on the device interrupts. Useful for platforms with broken multiMSI. Must be
set at QAIC driver initialization. Default is 0 (off).

**mhi_timeout_ms (unsigned int)**

Sets the timeout value for MHI operations in milliseconds (ms). Must be set
at the time the driver detects a device. Default is 2000 (2 seconds).

**control_resp_timeout_s (unsigned int)**

Sets the timeout value for QSM responses to NNC messages in seconds (s). Must
be set at the time the driver is sending a request to QSM. Default is 60 (one
minute).

**wait_exec_default_timeout_ms (unsigned int)**

Sets the default timeout for the wait_exec ioctl in milliseconds (ms). Must be
set prior to the wait_exec ioctl call. A value specified in the ioctl call
overrides this for that call. Default is 5000 (5 seconds).

**datapath_poll_interval_us (unsigned int)**

Sets the polling interval in microseconds (us) when datapath polling is active.
Takes effect at the next polling interval. Default is 100 (100 us).
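The parameters above are ordinary kernel module parameters, so on a host with
the card present they could be set the usual ways; a minimal sketch (the
specific values are illustrative only, not recommendations):

```shell
# Load QAIC with datapath polling enabled and a longer MHI timeout.
modprobe qaic datapath_polling=1 mhi_timeout_ms=5000

# Boot-time equivalent on the kernel command line:
#   qaic.datapath_polling=1 qaic.mhi_timeout_ms=5000

# Inspect the active values once the module is loaded.
cat /sys/module/qaic/parameters/datapath_poll_interval_us
```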