Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'drm-habanalabs-next-2023-12-19' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next

This tag contains habanalabs driver changes for v6.8.

The notable changes are:

- uAPI changes:
  - Add sysfs entry to allow users to identify a device minor id with its
    debugfs path
  - Add sysfs entry to expose the device's module id as given to us from
    the f/w
  - Add signed device information retrieval through the INFO ioctl

- New features and improvements:
  - Update documentation of debugfs paths
  - Add support for Gaudi2C device (new PCI revision number)
  - Add PCIe reset prepare/done hooks

- Firmware related fixes and changes:
  - Print the version numbers of all three Infineon second-stage instances
  - Assume hard-reset is done by f/w upon PCIe AXI drain

- Bug fixes and code cleanups:
  - Fix information leak in sec_attest_info()
  - Avoid overriding existing undefined opcode data in Gaudi2
  - Multiple Queue Manager (QMAN) fixes for Gaudi2
  - Set hard reset flag if graceful reset is skipped
  - Remove 'get temperature' debug print
  - Fix the new Event Queue heartbeat mechanism

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Oded Gabbay <ogabbay@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/ZYFpihZscr/fsRRd@ogabbay-vm-u22.habana-labs.com

+333 -184
+36 -36
Documentation/ABI/testing/debugfs-driver-habanalabs
··· 1 - What: /sys/kernel/debug/accel/<n>/addr 1 + What: /sys/kernel/debug/accel/<parent_device>/addr 2 2 Date: Jan 2019 3 3 KernelVersion: 5.1 4 4 Contact: ogabbay@kernel.org ··· 8 8 only when the IOMMU is disabled. 9 9 The acceptable value is a string that starts with "0x" 10 10 11 - What: /sys/kernel/debug/accel/<n>/clk_gate 11 + What: /sys/kernel/debug/accel/<parent_device>/clk_gate 12 12 Date: May 2020 13 13 KernelVersion: 5.8 14 14 Contact: ogabbay@kernel.org 15 15 Description: This setting is now deprecated as clock gating is handled solely by the f/w 16 16 17 - What: /sys/kernel/debug/accel/<n>/command_buffers 17 + What: /sys/kernel/debug/accel/<parent_device>/command_buffers 18 18 Date: Jan 2019 19 19 KernelVersion: 5.1 20 20 Contact: ogabbay@kernel.org 21 21 Description: Displays a list with information about the currently allocated 22 22 command buffers 23 23 24 - What: /sys/kernel/debug/accel/<n>/command_submission 24 + What: /sys/kernel/debug/accel/<parent_device>/command_submission 25 25 Date: Jan 2019 26 26 KernelVersion: 5.1 27 27 Contact: ogabbay@kernel.org 28 28 Description: Displays a list with information about the currently active 29 29 command submissions 30 30 31 - What: /sys/kernel/debug/accel/<n>/command_submission_jobs 31 + What: /sys/kernel/debug/accel/<parent_device>/command_submission_jobs 32 32 Date: Jan 2019 33 33 KernelVersion: 5.1 34 34 Contact: ogabbay@kernel.org 35 35 Description: Displays a list with detailed information about each JOB (CB) of 36 36 each active command submission 37 37 38 - What: /sys/kernel/debug/accel/<n>/data32 38 + What: /sys/kernel/debug/accel/<parent_device>/data32 39 39 Date: Jan 2019 40 40 KernelVersion: 5.1 41 41 Contact: ogabbay@kernel.org ··· 50 50 If the IOMMU is disabled, it also allows the root user to read 51 51 or write from the host a device VA of a host mapped memory 52 52 53 - What: /sys/kernel/debug/accel/<n>/data64 53 + What: /sys/kernel/debug/accel/<parent_device>/data64 54 54 Date: Jan 2020 
55 55 KernelVersion: 5.6 56 56 Contact: ogabbay@kernel.org ··· 65 65 If the IOMMU is disabled, it also allows the root user to read 66 66 or write from the host a device VA of a host mapped memory 67 67 68 - What: /sys/kernel/debug/accel/<n>/data_dma 68 + What: /sys/kernel/debug/accel/<parent_device>/data_dma 69 69 Date: Apr 2021 70 70 KernelVersion: 5.13 71 71 Contact: ogabbay@kernel.org ··· 83 83 workloads. 84 84 Only supported on GAUDI at this stage. 85 85 86 - What: /sys/kernel/debug/accel/<n>/device 86 + What: /sys/kernel/debug/accel/<parent_device>/device 87 87 Date: Jan 2019 88 88 KernelVersion: 5.1 89 89 Contact: ogabbay@kernel.org ··· 91 91 Valid values are "disable", "enable", "suspend", "resume". 92 92 User can read this property to see the valid values 93 93 94 - What: /sys/kernel/debug/accel/<n>/device_release_watchdog_timeout 94 + What: /sys/kernel/debug/accel/<parent_device>/device_release_watchdog_timeout 95 95 Date: Oct 2022 96 96 KernelVersion: 6.2 97 97 Contact: ttayar@habana.ai 98 98 Description: The watchdog timeout value in seconds for a device release upon 99 99 certain error cases, after which the device is reset. 100 100 101 - What: /sys/kernel/debug/accel/<n>/dma_size 101 + What: /sys/kernel/debug/accel/<parent_device>/dma_size 102 102 Date: Apr 2021 103 103 KernelVersion: 5.13 104 104 Contact: ogabbay@kernel.org ··· 108 108 When the write is finished, the user can read the "data_dma" 109 109 blob 110 110 111 - What: /sys/kernel/debug/accel/<n>/dump_razwi_events 111 + What: /sys/kernel/debug/accel/<parent_device>/dump_razwi_events 112 112 Date: Aug 2022 113 113 KernelVersion: 5.20 114 114 Contact: fkassabri@habana.ai ··· 117 117 the routine will clear the status register. 
118 118 Usage: cat dump_razwi_events 119 119 120 - What: /sys/kernel/debug/accel/<n>/dump_security_violations 120 + What: /sys/kernel/debug/accel/<parent_device>/dump_security_violations 121 121 Date: Jan 2021 122 122 KernelVersion: 5.12 123 123 Contact: ogabbay@kernel.org ··· 125 125 all security violations meanings those violations will not be 126 126 dumped next time user calls this API 127 127 128 - What: /sys/kernel/debug/accel/<n>/engines 128 + What: /sys/kernel/debug/accel/<parent_device>/engines 129 129 Date: Jul 2019 130 130 KernelVersion: 5.3 131 131 Contact: ogabbay@kernel.org 132 132 Description: Displays the status registers values of the device engines and 133 133 their derived idle status 134 134 135 - What: /sys/kernel/debug/accel/<n>/i2c_addr 135 + What: /sys/kernel/debug/accel/<parent_device>/i2c_addr 136 136 Date: Jan 2019 137 137 KernelVersion: 5.1 138 138 Contact: ogabbay@kernel.org ··· 140 140 by the device's CPU, Not available when device is loaded with secured 141 141 firmware 142 142 143 - What: /sys/kernel/debug/accel/<n>/i2c_bus 143 + What: /sys/kernel/debug/accel/<parent_device>/i2c_bus 144 144 Date: Jan 2019 145 145 KernelVersion: 5.1 146 146 Contact: ogabbay@kernel.org ··· 148 148 the device's CPU, Not available when device is loaded with secured 149 149 firmware 150 150 151 - What: /sys/kernel/debug/accel/<n>/i2c_data 151 + What: /sys/kernel/debug/accel/<parent_device>/i2c_data 152 152 Date: Jan 2019 153 153 KernelVersion: 5.1 154 154 Contact: ogabbay@kernel.org ··· 157 157 reading from the file generates a read transaction, Not available 158 158 when device is loaded with secured firmware 159 159 160 - What: /sys/kernel/debug/accel/<n>/i2c_len 160 + What: /sys/kernel/debug/accel/<parent_device>/i2c_len 161 161 Date: Dec 2021 162 162 KernelVersion: 5.17 163 163 Contact: obitton@habana.ai ··· 165 165 the device's CPU, Not available when device is loaded with secured 166 166 firmware 167 167 168 - What: 
/sys/kernel/debug/accel/<n>/i2c_reg 168 + What: /sys/kernel/debug/accel/<parent_device>/i2c_reg 169 169 Date: Jan 2019 170 170 KernelVersion: 5.1 171 171 Contact: ogabbay@kernel.org ··· 173 173 the device's CPU, Not available when device is loaded with secured 174 174 firmware 175 175 176 - What: /sys/kernel/debug/accel/<n>/led0 176 + What: /sys/kernel/debug/accel/<parent_device>/led0 177 177 Date: Jan 2019 178 178 KernelVersion: 5.1 179 179 Contact: ogabbay@kernel.org 180 180 Description: Sets the state of the first S/W led on the device, Not available 181 181 when device is loaded with secured firmware 182 182 183 - What: /sys/kernel/debug/accel/<n>/led1 183 + What: /sys/kernel/debug/accel/<parent_device>/led1 184 184 Date: Jan 2019 185 185 KernelVersion: 5.1 186 186 Contact: ogabbay@kernel.org 187 187 Description: Sets the state of the second S/W led on the device, Not available 188 188 when device is loaded with secured firmware 189 189 190 - What: /sys/kernel/debug/accel/<n>/led2 190 + What: /sys/kernel/debug/accel/<parent_device>/led2 191 191 Date: Jan 2019 192 192 KernelVersion: 5.1 193 193 Contact: ogabbay@kernel.org 194 194 Description: Sets the state of the third S/W led on the device, Not available 195 195 when device is loaded with secured firmware 196 196 197 - What: /sys/kernel/debug/accel/<n>/memory_scrub 197 + What: /sys/kernel/debug/accel/<parent_device>/memory_scrub 198 198 Date: May 2022 199 199 KernelVersion: 5.19 200 200 Contact: dhirschfeld@habana.ai 201 201 Description: Allows the root user to scrub the dram memory. The scrubbing 202 202 value can be set using the debugfs file memory_scrub_val. 
203 203 204 - What: /sys/kernel/debug/accel/<n>/memory_scrub_val 204 + What: /sys/kernel/debug/accel/<parent_device>/memory_scrub_val 205 205 Date: May 2022 206 206 KernelVersion: 5.19 207 207 Contact: dhirschfeld@habana.ai ··· 209 209 scrubs the dram using 'memory_scrub' debugfs file and 210 210 the scrubbing value when using module param 'memory_scrub' 211 211 212 - What: /sys/kernel/debug/accel/<n>/mmu 212 + What: /sys/kernel/debug/accel/<parent_device>/mmu 213 213 Date: Jan 2019 214 214 KernelVersion: 5.1 215 215 Contact: ogabbay@kernel.org ··· 219 219 e.g. to display info about VA 0x1000 for ASID 1 you need to do: 220 220 echo "1 0x1000" > /sys/kernel/debug/accel/0/mmu 221 221 222 - What: /sys/kernel/debug/accel/<n>/mmu_error 222 + What: /sys/kernel/debug/accel/<parent_device>/mmu_error 223 223 Date: Mar 2021 224 224 KernelVersion: 5.12 225 225 Contact: fkassabri@habana.ai ··· 229 229 echo "0x200" > /sys/kernel/debug/accel/0/mmu_error 230 230 cat /sys/kernel/debug/accel/0/mmu_error 231 231 232 - What: /sys/kernel/debug/accel/<n>/monitor_dump 232 + What: /sys/kernel/debug/accel/<parent_device>/monitor_dump 233 233 Date: Mar 2022 234 234 KernelVersion: 5.19 235 235 Contact: osharabi@habana.ai ··· 243 243 This interface doesn't support concurrency in the same device. 244 244 Only supported on GAUDI. 245 245 246 - What: /sys/kernel/debug/accel/<n>/monitor_dump_trig 246 + What: /sys/kernel/debug/accel/<parent_device>/monitor_dump_trig 247 247 Date: Mar 2022 248 248 KernelVersion: 5.19 249 249 Contact: osharabi@habana.ai ··· 253 253 When the write is finished, the user can read the "monitor_dump" 254 254 blob 255 255 256 - What: /sys/kernel/debug/accel/<n>/set_power_state 256 + What: /sys/kernel/debug/accel/<parent_device>/set_power_state 257 257 Date: Jan 2019 258 258 KernelVersion: 5.1 259 259 Contact: ogabbay@kernel.org 260 260 Description: Sets the PCI power state. 
Valid values are "1" for D0 and "2" 261 261 for D3Hot 262 262 263 - What: /sys/kernel/debug/accel/<n>/skip_reset_on_timeout 263 + What: /sys/kernel/debug/accel/<parent_device>/skip_reset_on_timeout 264 264 Date: Jun 2021 265 265 KernelVersion: 5.13 266 266 Contact: ynudelman@habana.ai ··· 268 268 "0" means device will be reset in case some CS has timed out, 269 269 otherwise it will not be reset. 270 270 271 - What: /sys/kernel/debug/accel/<n>/state_dump 271 + What: /sys/kernel/debug/accel/<parent_device>/state_dump 272 272 Date: Oct 2021 273 273 KernelVersion: 5.15 274 274 Contact: ynudelman@habana.ai ··· 279 279 Writing an integer X discards X state dumps, so that the 280 280 next read would return X+1-st newest state dump. 281 281 282 - What: /sys/kernel/debug/accel/<n>/stop_on_err 282 + What: /sys/kernel/debug/accel/<parent_device>/stop_on_err 283 283 Date: Mar 2020 284 284 KernelVersion: 5.6 285 285 Contact: ogabbay@kernel.org ··· 287 287 "0" is for disable, otherwise enable. 288 288 Relevant only for GOYA and GAUDI. 289 289 290 - What: /sys/kernel/debug/accel/<n>/timeout_locked 290 + What: /sys/kernel/debug/accel/<parent_device>/timeout_locked 291 291 Date: Sep 2021 292 292 KernelVersion: 5.16 293 293 Contact: obitton@habana.ai 294 294 Description: Sets the command submission timeout value in seconds. 295 295 296 - What: /sys/kernel/debug/accel/<n>/userptr 296 + What: /sys/kernel/debug/accel/<parent_device>/userptr 297 297 Date: Jan 2019 298 298 KernelVersion: 5.1 299 299 Contact: ogabbay@kernel.org ··· 301 301 pointers (user virtual addresses) that are pinned and mapped 302 302 to DMA addresses 303 303 304 - What: /sys/kernel/debug/accel/<n>/userptr_lookup 304 + What: /sys/kernel/debug/accel/<parent_device>/userptr_lookup 305 305 Date: Oct 2021 306 306 KernelVersion: 5.15 307 307 Contact: ogabbay@kernel.org ··· 309 309 addresses) that are pinned and mapped to DMA addresses, and see 310 310 their resolution to the specific dma address. 
311 311 312 - What: /sys/kernel/debug/accel/<n>/vm 312 + What: /sys/kernel/debug/accel/<parent_device>/vm 313 313 Date: Jan 2019 314 314 KernelVersion: 5.1 315 315 Contact: ogabbay@kernel.org
+12
Documentation/ABI/testing/sysfs-driver-habanalabs
···
 Description:    Displays the current clock frequency, in Hz, of the MME compute
                 engine. This property is valid only for the Goya ASIC family

+What:           /sys/class/accel/accel<n>/device/module_id
+Date:           Nov 2023
+KernelVersion:  not yet upstreamed
+Contact:        ogabbay@kernel.org
+Description:    Displays the device's module id
+
+What:           /sys/class/accel/accel<n>/device/parent_device
+Date:           Nov 2023
+KernelVersion:  6.8
+Contact:        ttayar@habana.ai
+Description:    Displays the name of the parent device of the accel device
+
 What:           /sys/class/accel/accel<n>/device/pci_addr
 Date:           Jan 2019
 KernelVersion:  5.1
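The two documentation changes above fit together: the debugfs directory is now named after the parent device, and the new parent_device sysfs attribute reports that name. A minimal userspace sketch of combining them follows. The helper name hl_debugfs_dir and the sample PCI address are illustrative assumptions, not part of the driver.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the debugfs directory of an accel device from
 * the string read out of /sys/class/accel/accel<n>/device/parent_device.
 * Returns 0 on success, -1 if the name is empty or the buffer is too small.
 */
int hl_debugfs_dir(const char *parent_device, char *out, size_t out_sz)
{
	/* sysfs attribute values end with a newline; stop before it */
	size_t len = strcspn(parent_device, "\n");
	int n;

	if (len == 0)
		return -1;

	n = snprintf(out, out_sz, "/sys/kernel/debug/accel/%.*s",
		     (int)len, parent_device);
	if (n < 0 || (size_t)n >= out_sz)
		return -1;

	return 0;
}
```

Given a parent_device value such as "0000:19:00.0" (an assumed example address), the helper produces /sys/kernel/debug/accel/0000:19:00.0.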
+15 -10
drivers/accel/habanalabs/common/device.c
···
 		gaudi2_set_asic_funcs(hdev);
 		strscpy(hdev->asic_name, "GAUDI2B", sizeof(hdev->asic_name));
 		break;
+	case ASIC_GAUDI2C:
+		gaudi2_set_asic_funcs(hdev);
+		strscpy(hdev->asic_name, "GAUDI2C", sizeof(hdev->asic_name));
 		break;
 	default:
 		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
···
 	return (vendor_id == PCI_VENDOR_ID_HABANALABS);
 }

-static void hl_device_eq_heartbeat(struct hl_device *hdev)
+static int hl_device_eq_heartbeat_check(struct hl_device *hdev)
 {
-	u64 event_mask = HL_NOTIFIER_EVENT_DEVICE_RESET | HL_NOTIFIER_EVENT_DEVICE_UNAVAILABLE;
 	struct asic_fixed_properties *prop = &hdev->asic_prop;

 	if (!prop->cpucp_info.eq_health_check_supported)
-		return;
+		return 0;

-	if (hdev->eq_heartbeat_received)
+	if (hdev->eq_heartbeat_received) {
 		hdev->eq_heartbeat_received = false;
-	else
-		hl_device_cond_reset(hdev, HL_DRV_RESET_HARD, event_mask);
+	} else {
+		dev_err(hdev->dev, "EQ heartbeat event was not received!\n");
+		return -EIO;
+	}
+
+	return 0;
 }

 static void hl_device_heartbeat(struct work_struct *work)
···
 	/*
 	 * For EQ health check need to check if driver received the heartbeat eq event
 	 * in order to validate the eq is working.
+	 * Only if both the EQ is healthy and we managed to send the next heartbeat reschedule.
 	 */
-	hl_device_eq_heartbeat(hdev);
-
-	if (!hdev->asic_funcs->send_heartbeat(hdev))
+	if ((!hl_device_eq_heartbeat_check(hdev)) && (!hdev->asic_funcs->send_heartbeat(hdev)))
 		goto reschedule;

 	if (hl_device_operational(hdev, NULL))
···
 	if (ctx)
 		hl_ctx_put(ctx);

-	return hl_device_reset(hdev, flags);
+	return hl_device_reset(hdev, flags | HL_DRV_RESET_HARD);
 }

 static void hl_notifier_event_send(struct hl_notifier_event *notifier_event, u64 event_mask)
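The heartbeat change above turns the EQ check into a function that returns an error code, and the work item is rescheduled only when both the EQ check and the heartbeat send succeed. The following userspace sketch models that decision. struct fake_dev and all function names here are stand-ins invented for illustration; only the control flow follows the diff.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few struct hl_device fields used here. */
struct fake_dev {
	bool eq_health_check_supported;
	bool eq_heartbeat_received;
	bool send_heartbeat_fails;
};

/* Consume the "event received" flag, or report -EIO if it was never set. */
int eq_heartbeat_check(struct fake_dev *dev)
{
	if (!dev->eq_health_check_supported)
		return 0;

	if (!dev->eq_heartbeat_received)
		return -EIO;

	dev->eq_heartbeat_received = false;
	return 0;
}

/* Stub for the ASIC-specific heartbeat send: 0 on success. */
int send_heartbeat(struct fake_dev *dev)
{
	return dev->send_heartbeat_fails ? -EIO : 0;
}

/* True when the heartbeat work can simply be rescheduled. */
bool heartbeat_ok(struct fake_dev *dev)
{
	return !eq_heartbeat_check(dev) && !send_heartbeat(dev);
}
```

Because the two checks are joined with a short-circuit AND, no new heartbeat is sent once the EQ check has failed, which matches the combined condition in the diff.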
+40 -83
drivers/accel/habanalabs/common/firmware_if.c
··· 646 646 return rc; 647 647 } 648 648 649 - static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val, 650 - u32 sts_val) 649 + static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val, u32 sts_val) 651 650 { 652 651 bool err_exists = false; 653 652 654 653 if (!(err_val & CPU_BOOT_ERR0_ENABLED)) 655 654 return false; 656 655 657 - if (err_val & CPU_BOOT_ERR0_DRAM_INIT_FAIL) { 658 - dev_err(hdev->dev, 659 - "Device boot error - DRAM initialization failed\n"); 660 - err_exists = true; 661 - } 656 + if (err_val & CPU_BOOT_ERR0_DRAM_INIT_FAIL) 657 + dev_err(hdev->dev, "Device boot error - DRAM initialization failed\n"); 662 658 663 - if (err_val & CPU_BOOT_ERR0_FIT_CORRUPTED) { 659 + if (err_val & CPU_BOOT_ERR0_FIT_CORRUPTED) 664 660 dev_err(hdev->dev, "Device boot error - FIT image corrupted\n"); 665 - err_exists = true; 666 - } 667 661 668 - if (err_val & CPU_BOOT_ERR0_TS_INIT_FAIL) { 669 - dev_err(hdev->dev, 670 - "Device boot error - Thermal Sensor initialization failed\n"); 671 - err_exists = true; 672 - } 662 + if (err_val & CPU_BOOT_ERR0_TS_INIT_FAIL) 663 + dev_err(hdev->dev, "Device boot error - Thermal Sensor initialization failed\n"); 673 664 674 665 if (err_val & CPU_BOOT_ERR0_BMC_WAIT_SKIPPED) { 675 666 if (hdev->bmc_enable) { 676 - dev_err(hdev->dev, 677 - "Device boot error - Skipped waiting for BMC\n"); 678 - err_exists = true; 667 + dev_err(hdev->dev, "Device boot error - Skipped waiting for BMC\n"); 679 668 } else { 680 - dev_info(hdev->dev, 681 - "Device boot message - Skipped waiting for BMC\n"); 669 + dev_info(hdev->dev, "Device boot message - Skipped waiting for BMC\n"); 682 670 /* This is an info so we don't want it to disable the 683 671 * device 684 672 */ ··· 674 686 } 675 687 } 676 688 677 - if (err_val & CPU_BOOT_ERR0_NIC_DATA_NOT_RDY) { 678 - dev_err(hdev->dev, 679 - "Device boot error - Serdes data from BMC not available\n"); 680 - err_exists = true; 681 - } 689 + if (err_val & CPU_BOOT_ERR0_NIC_DATA_NOT_RDY) 690 + 
dev_err(hdev->dev, "Device boot error - Serdes data from BMC not available\n"); 682 691 683 - if (err_val & CPU_BOOT_ERR0_NIC_FW_FAIL) { 684 - dev_err(hdev->dev, 685 - "Device boot error - NIC F/W initialization failed\n"); 686 - err_exists = true; 687 - } 692 + if (err_val & CPU_BOOT_ERR0_NIC_FW_FAIL) 693 + dev_err(hdev->dev, "Device boot error - NIC F/W initialization failed\n"); 688 694 689 - if (err_val & CPU_BOOT_ERR0_SECURITY_NOT_RDY) { 690 - dev_err(hdev->dev, 691 - "Device boot warning - security not ready\n"); 692 - err_exists = true; 693 - } 695 + if (err_val & CPU_BOOT_ERR0_SECURITY_NOT_RDY) 696 + dev_err(hdev->dev, "Device boot warning - security not ready\n"); 694 697 695 - if (err_val & CPU_BOOT_ERR0_SECURITY_FAIL) { 698 + if (err_val & CPU_BOOT_ERR0_SECURITY_FAIL) 696 699 dev_err(hdev->dev, "Device boot error - security failure\n"); 697 - err_exists = true; 698 - } 699 700 700 - if (err_val & CPU_BOOT_ERR0_EFUSE_FAIL) { 701 + if (err_val & CPU_BOOT_ERR0_EFUSE_FAIL) 701 702 dev_err(hdev->dev, "Device boot error - eFuse failure\n"); 702 - err_exists = true; 703 - } 704 703 705 - if (err_val & CPU_BOOT_ERR0_SEC_IMG_VER_FAIL) { 704 + if (err_val & CPU_BOOT_ERR0_SEC_IMG_VER_FAIL) 706 705 dev_err(hdev->dev, "Device boot error - Failed to load preboot secondary image\n"); 707 - err_exists = true; 708 - } 709 706 710 - if (err_val & CPU_BOOT_ERR0_PLL_FAIL) { 707 + if (err_val & CPU_BOOT_ERR0_PLL_FAIL) 711 708 dev_err(hdev->dev, "Device boot error - PLL failure\n"); 712 - err_exists = true; 713 - } 714 709 715 - if (err_val & CPU_BOOT_ERR0_TMP_THRESH_INIT_FAIL) { 710 + if (err_val & CPU_BOOT_ERR0_TMP_THRESH_INIT_FAIL) 716 711 dev_err(hdev->dev, "Device boot error - Failed to set threshold for temperature sensor\n"); 717 - err_exists = true; 718 - } 719 712 720 713 if (err_val & CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL) { 721 714 /* Ignore this bit, don't prevent driver loading */ ··· 704 735 err_val &= ~CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL; 705 736 } 706 737 707 - if 
(err_val & CPU_BOOT_ERR0_BINNING_FAIL) { 738 + if (err_val & CPU_BOOT_ERR0_BINNING_FAIL) 708 739 dev_err(hdev->dev, "Device boot error - binning failure\n"); 709 - err_exists = true; 710 - } 711 740 712 741 if (sts_val & CPU_BOOT_DEV_STS0_ENABLED) 713 742 dev_dbg(hdev->dev, "Device status0 %#x\n", sts_val); 714 743 744 + if (err_val & CPU_BOOT_ERR0_DRAM_SKIPPED) 745 + dev_err(hdev->dev, "Device boot warning - Skipped DRAM initialization\n"); 746 + 747 + if (err_val & CPU_BOOT_ERR_ENG_ARC_MEM_SCRUB_FAIL) 748 + dev_err(hdev->dev, "Device boot error - ARC memory scrub failed\n"); 749 + 750 + /* All warnings should go here in order not to reach the unknown error validation */ 715 751 if (err_val & CPU_BOOT_ERR0_EEPROM_FAIL) { 716 752 dev_err(hdev->dev, "Device boot error - EEPROM failure detected\n"); 717 753 err_exists = true; 718 754 } 719 755 720 - /* All warnings should go here in order not to reach the unknown error validation */ 721 - if (err_val & CPU_BOOT_ERR0_DRAM_SKIPPED) { 722 - dev_warn(hdev->dev, 723 - "Device boot warning - Skipped DRAM initialization\n"); 724 - /* This is a warning so we don't want it to disable the 725 - * device 726 - */ 727 - err_val &= ~CPU_BOOT_ERR0_DRAM_SKIPPED; 728 - } 756 + if (err_val & CPU_BOOT_ERR0_PRI_IMG_VER_FAIL) 757 + dev_warn(hdev->dev, "Device boot warning - Failed to load preboot primary image\n"); 729 758 730 - if (err_val & CPU_BOOT_ERR0_PRI_IMG_VER_FAIL) { 731 - dev_warn(hdev->dev, 732 - "Device boot warning - Failed to load preboot primary image\n"); 733 - /* This is a warning so we don't want it to disable the 734 - * device as we have a secondary preboot image 735 - */ 736 - err_val &= ~CPU_BOOT_ERR0_PRI_IMG_VER_FAIL; 737 - } 759 + if (err_val & CPU_BOOT_ERR0_TPM_FAIL) 760 + dev_warn(hdev->dev, "Device boot warning - TPM failure\n"); 738 761 739 - if (err_val & CPU_BOOT_ERR0_TPM_FAIL) { 740 - dev_warn(hdev->dev, 741 - "Device boot warning - TPM failure\n"); 742 - /* This is a warning so we don't want it to disable 
the 743 - * device 744 - */ 745 - err_val &= ~CPU_BOOT_ERR0_TPM_FAIL; 746 - } 747 - 748 - if (!err_exists && (err_val & ~CPU_BOOT_ERR0_ENABLED)) { 749 - dev_err(hdev->dev, 750 - "Device boot error - unknown ERR0 error 0x%08x\n", err_val); 762 + if (err_val & CPU_BOOT_ERR_FATAL_MASK) 751 763 err_exists = true; 752 - } 753 764 754 765 /* return error only if it's in the predefined mask */ 755 766 if (err_exists && ((err_val & ~CPU_BOOT_ERR0_ENABLED) & ··· 3242 3293 return hl_fw_get_sec_attest_data(hdev, CPUCP_PACKET_SEC_ATTEST_GET, sec_attest_info, 3243 3294 sizeof(struct cpucp_sec_attest_info), nonce, 3244 3295 HL_CPUCP_SEC_ATTEST_INFO_TINEOUT_USEC); 3296 + } 3297 + 3298 + int hl_fw_get_dev_info_signed(struct hl_device *hdev, 3299 + struct cpucp_dev_info_signed *dev_info_signed, u32 nonce) 3300 + { 3301 + return hl_fw_get_sec_attest_data(hdev, CPUCP_PACKET_INFO_SIGNED_GET, dev_info_signed, 3302 + sizeof(struct cpucp_dev_info_signed), nonce, 3303 + HL_CPUCP_SEC_ATTEST_INFO_TINEOUT_USEC); 3245 3304 } 3246 3305 3247 3306 int hl_fw_send_generic_request(struct hl_device *hdev, enum hl_passthrough_type sub_opcode,
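In the fw_report_boot_dev0() change above, the per-bit err_exists bookkeeping is replaced by a single test against CPU_BOOT_ERR_FATAL_MASK, so warning bits are only logged. A minimal sketch of that pattern follows. The bit positions and the composition of the mask are invented for illustration; the real values come from the firmware interface headers, which are not shown in this diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define ERR_DRAM_INIT_FAIL	(1u << 0)
#define ERR_FIT_CORRUPTED	(1u << 1)
#define ERR_TPM_FAIL		(1u << 2)	/* assumed to be a warning */
#define ERR_ENABLED		(1u << 31)

/* Only the bits in this mask should stop the device from loading. */
#define ERR_FATAL_MASK		(ERR_DRAM_INIT_FAIL | ERR_FIT_CORRUPTED)

bool boot_error_is_fatal(uint32_t err_val)
{
	/* The register is meaningful only when the "enabled" bit is set. */
	if (!(err_val & ERR_ENABLED))
		return false;

	return (err_val & ERR_FATAL_MASK) != 0;
}
```

With a mask, adding a new warning bit needs only a log line, and it cannot accidentally fall through to an "unknown error" path.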
+15
drivers/accel/habanalabs/common/habanalabs.h
···
  * @ASIC_GAUDI_SEC: Gaudi secured device (HL-2000).
  * @ASIC_GAUDI2: Gaudi2 device.
  * @ASIC_GAUDI2B: Gaudi2B device.
+ * @ASIC_GAUDI2C: Gaudi2C device.
  */
 enum hl_asic_type {
 	ASIC_INVALID,
···
 	ASIC_GAUDI_SEC,
 	ASIC_GAUDI2,
 	ASIC_GAUDI2B,
+	ASIC_GAUDI2C,
 };

 struct hl_cs_parser;
···
 	u8 heartbeat;
 };

+/* Retrieve PCI device name in case of a PCI device or dev name in simulator */
+#define HL_DEV_NAME(hdev)	\
+		((hdev)->pdev ? dev_name(&(hdev)->pdev->dev) : "NA-DEVICE")

 /**
  * struct hl_cs_encaps_sig_handle - encapsulated signals handle structure
···
 	if (hdev->fw_sw_minor_ver < fw_sw_minor)
 		return true;
 	return false;
+}
+
+static inline bool hl_is_fw_sw_ver_equal_or_greater(struct hl_device *hdev, u32 fw_sw_major,
+							u32 fw_sw_minor)
+{
+	return (hdev->fw_sw_major_ver > fw_sw_major ||
+			(hdev->fw_sw_major_ver == fw_sw_major &&
+					hdev->fw_sw_minor_ver >= fw_sw_minor));
 }

 /*
···
 void hl_fw_set_max_power(struct hl_device *hdev);
 int hl_fw_get_sec_attest_info(struct hl_device *hdev, struct cpucp_sec_attest_info *sec_attest_info,
 				u32 nonce);
+int hl_fw_get_dev_info_signed(struct hl_device *hdev,
+				struct cpucp_dev_info_signed *dev_info_signed, u32 nonce);
 int hl_set_voltage(struct hl_device *hdev, int sensor_index, u32 attr, long value);
 int hl_set_current(struct hl_device *hdev, int sensor_index, u32 attr, long value);
 int hl_set_power(struct hl_device *hdev, int sensor_index, u32 attr, long value);
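The new hl_is_fw_sw_ver_equal_or_greater() helper above is a lexicographic comparison on (major, minor). A standalone sketch of the same comparison, with the device fields replaced by plain parameters (the function and parameter names are assumptions made for this example):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Lexicographic (major, minor) comparison: a higher major always wins,
 * and the minor is consulted only when the majors are equal.
 */
bool fw_ver_equal_or_greater(uint32_t have_major, uint32_t have_minor,
			     uint32_t want_major, uint32_t want_minor)
{
	return have_major > want_major ||
	       (have_major == want_major && have_minor >= want_minor);
}
```

Comparing the minor alone would be wrong: version 2.0 must satisfy a requirement of 1.12 even though 0 is less than 12.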
+37
drivers/accel/habanalabs/common/habanalabs_drv.c
···
 		case REV_ID_B:
 			asic_type = ASIC_GAUDI2B;
 			break;
+		case REV_ID_C:
+			asic_type = ASIC_GAUDI2C;
+			break;
 		default:
 			break;
 		}
···
 	return PCI_ERS_RESULT_RECOVERED;
 }

+static void hl_pci_reset_prepare(struct pci_dev *pdev)
+{
+	struct hl_device *hdev;
+
+	hdev = pci_get_drvdata(pdev);
+	if (!hdev)
+		return;
+
+	hdev->disabled = true;
+}
+
+static void hl_pci_reset_done(struct pci_dev *pdev)
+{
+	struct hl_device *hdev;
+	u32 flags;
+
+	hdev = pci_get_drvdata(pdev);
+	if (!hdev)
+		return;
+
+	/*
+	 * Schedule a thread to trigger hard reset.
+	 * The reason for this handler, is for rare cases where the driver is up
+	 * and FLR occurs. This is valid only when working with no VM, so FW handles FLR
+	 * and resets the device. FW will go back preboot stage, so driver needs to perform
+	 * hard reset in order to load FW fit again.
+	 */
+	flags = HL_DRV_RESET_HARD | HL_DRV_RESET_BYPASS_REQ_TO_FW;
+
+	hl_device_reset(hdev, flags);
+}
+
 static const struct dev_pm_ops hl_pm_ops = {
 	.suspend = hl_pmops_suspend,
 	.resume = hl_pmops_resume,
···
 	.error_detected = hl_pci_err_detected,
 	.slot_reset = hl_pci_err_slot_reset,
 	.resume = hl_pci_err_resume,
+	.reset_prepare = hl_pci_reset_prepare,
+	.reset_done = hl_pci_reset_done,
 };

 static struct pci_driver hl_pci_driver = {
+54 -1
drivers/accel/habanalabs/common/habanalabs_ioctl.c
···

 #include <asm/msr.h>

+/* make sure there is space for all the signed info */
+static_assert(sizeof(struct cpucp_info) <= SEC_DEV_INFO_BUF_SZ);
+
 static u32 hl_debug_struct_size[HL_DEBUG_OP_TIMESTAMP + 1] = {
 	[HL_DEBUG_OP_ETR] = sizeof(struct hl_debug_params_etr),
 	[HL_DEBUG_OP_ETF] = sizeof(struct hl_debug_params_etf),
···
 	if (!sec_attest_info)
 		return -ENOMEM;

-	info = kmalloc(sizeof(*info), GFP_KERNEL);
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
 	if (!info) {
 		rc = -ENOMEM;
 		goto free_sec_attest_info;
···

 	return rc;
 }
+
+static int dev_info_signed(struct hl_fpriv *hpriv, struct hl_info_args *args)
+{
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+	struct cpucp_dev_info_signed *dev_info_signed;
+	struct hl_info_signed *info;
+	u32 max_size = args->return_size;
+	int rc;
+
+	if ((!max_size) || (!out))
+		return -EINVAL;
+
+	dev_info_signed = kzalloc(sizeof(*dev_info_signed), GFP_KERNEL);
+	if (!dev_info_signed)
+		return -ENOMEM;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		rc = -ENOMEM;
+		goto free_dev_info_signed;
+	}
+
+	rc = hl_fw_get_dev_info_signed(hpriv->hdev,
+					dev_info_signed, args->sec_attest_nonce);
+	if (rc)
+		goto free_info;
+
+	info->nonce = le32_to_cpu(dev_info_signed->nonce);
+	info->info_sig_len = dev_info_signed->info_sig_len;
+	info->pub_data_len = le16_to_cpu(dev_info_signed->pub_data_len);
+	info->certificate_len = le16_to_cpu(dev_info_signed->certificate_len);
+	info->dev_info_len = sizeof(struct cpucp_info);
+	memcpy(&info->info_sig, &dev_info_signed->info_sig, sizeof(info->info_sig));
+	memcpy(&info->public_data, &dev_info_signed->public_data, sizeof(info->public_data));
+	memcpy(&info->certificate, &dev_info_signed->certificate, sizeof(info->certificate));
+	memcpy(&info->dev_info, &dev_info_signed->info, info->dev_info_len);
+
+	rc = copy_to_user(out, info, min_t(size_t, max_size, sizeof(*info))) ? -EFAULT : 0;
+
+free_info:
+	kfree(info);
+free_dev_info_signed:
+	kfree(dev_info_signed);
+
+	return rc;
+}
+

 static int eventfd_register(struct hl_fpriv *hpriv, struct hl_info_args *args)
 {
···

 	case HL_INFO_FW_GENERIC_REQ:
 		return send_fw_generic_request(hdev, args);
+
+	case HL_INFO_DEV_SIGNED:
+		return dev_info_signed(hpriv, args);

 	default:
 		dev_err(dev, "Invalid request %d\n", args->op);
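Two details in the ioctl changes above guard against leaking kernel memory: the reply buffer is zero-allocated (kzalloc instead of kmalloc), and the copy to the caller is capped at the smaller of the caller's buffer and the struct size. A userspace sketch of both ideas follows, using calloc and memcpy as stand-ins for kzalloc and copy_to_user. struct reply and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reply layout; it has padding and a partly used array. */
struct reply {
	uint32_t nonce;
	uint16_t len;
	uint8_t data[32];
};

/* Zeroed allocation, so unused fields and padding never hold stale bytes. */
struct reply *reply_alloc(void)
{
	return calloc(1, sizeof(struct reply));
}

/*
 * Copy at most max_size bytes of *info, mirroring
 * min_t(size_t, max_size, sizeof(*info)). Returns the bytes copied.
 */
size_t bounded_copy(void *out, size_t max_size, const struct reply *info)
{
	size_t n = max_size < sizeof(*info) ? max_size : sizeof(*info);

	memcpy(out, info, n);
	return n;
}
```

With an uninitialized allocation, any byte of the struct that the code does not explicitly write would reach the caller with whatever the allocator left there. Zeroing first makes the whole copy safe regardless of which fields are filled in.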
-4
drivers/accel/habanalabs/common/hwmon.c
···
 				CPUCP_PKT_CTL_OPCODE_SHIFT);
 	pkt.sensor_index = __cpu_to_le16(sensor_index);
 	pkt.type = __cpu_to_le16(attr);
-
-	dev_dbg(hdev->dev, "get temp, ctl 0x%x, sensor %d, type %d\n",
-		pkt.ctl, pkt.sensor_index, pkt.type);
-
 	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
 						0, &result);

+4 -3
drivers/accel/habanalabs/common/memory.c
···
 					(i + 1) == phys_pg_pack->npages);
 		if (rc) {
 			dev_err(hdev->dev,
-				"map failed for handle %u, npages: %llu, mapped: %llu",
-				phys_pg_pack->handle, phys_pg_pack->npages,
+				"map failed (%d) for handle %u, npages: %llu, mapped: %llu\n",
+				rc, phys_pg_pack->handle, phys_pg_pack->npages,
 				mapped_pg_cnt);
 			goto err;
 		}
···

 	rc = map_phys_pg_pack(ctx, ret_vaddr, phys_pg_pack);
 	if (rc) {
-		dev_err(hdev->dev, "mapping page pack failed for handle %u\n", handle);
+		dev_err(hdev->dev, "mapping page pack failed (%d) for handle %u\n",
+			rc, handle);
 		mutex_unlock(&hdev->mmu_lock);
 		goto map_err;
 	}
+1
drivers/accel/habanalabs/common/mmu/mmu.c
···
 		break;
 	case ASIC_GAUDI2:
 	case ASIC_GAUDI2B:
+	case ASIC_GAUDI2C:
 		/* MMUs in Gaudi2 are always host resident */
 		hl_mmu_v2_hr_set_funcs(hdev, &hdev->mmu_func[MMU_HR_PGT]);
 		break;
+40 -2
drivers/accel/habanalabs/common/sysfs.c
···
 #include "habanalabs.h"

 #include <linux/pci.h>
+#include <linux/types.h>

 static ssize_t clk_max_freq_mhz_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
···
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	struct cpucp_info *cpucp_info;
+	u32 infineon_second_stage_version;
+	u32 infineon_second_stage_first_instance;
+	u32 infineon_second_stage_second_instance;
+	u32 infineon_second_stage_third_instance;
+	u32 mask = 0xff;

 	cpucp_info = &hdev->asic_prop.cpucp_info;

+	infineon_second_stage_version = le32_to_cpu(cpucp_info->infineon_second_stage_version);
+	infineon_second_stage_first_instance = infineon_second_stage_version & mask;
+	infineon_second_stage_second_instance =
+					(infineon_second_stage_version >> 8) & mask;
+	infineon_second_stage_third_instance =
+					(infineon_second_stage_version >> 16) & mask;
+
 	if (cpucp_info->infineon_second_stage_version)
-		return sprintf(buf, "%#04x %#04x\n", le32_to_cpu(cpucp_info->infineon_version),
-				le32_to_cpu(cpucp_info->infineon_second_stage_version));
+		return sprintf(buf, "%#04x %#04x:%#04x:%#04x\n",
+				le32_to_cpu(cpucp_info->infineon_version),
+				infineon_second_stage_first_instance,
+				infineon_second_stage_second_instance,
+				infineon_second_stage_third_instance);
 	else
 		return sprintf(buf, "%#04x\n", le32_to_cpu(cpucp_info->infineon_version));
 }
···
 	case ASIC_GAUDI2B:
 		str = "GAUDI2B";
 		break;
+	case ASIC_GAUDI2C:
+		str = "GAUDI2C";
+		break;
 	default:
 		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
 				hdev->asic_type);
···
 	return sprintf(buf, "%d\n", hdev->asic_prop.fw_security_enabled);
 }

+static ssize_t module_id_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%u\n", le32_to_cpu(hdev->asic_prop.cpucp_info.card_location));
+}
+
+static ssize_t parent_device_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%s\n", HL_DEV_NAME(hdev));
+}
+
 static DEVICE_ATTR_RO(armcp_kernel_ver);
 static DEVICE_ATTR_RO(armcp_ver);
 static DEVICE_ATTR_RO(cpld_ver);
···
 static DEVICE_ATTR_RO(uboot_ver);
 static DEVICE_ATTR_RO(fw_os_ver);
 static DEVICE_ATTR_RO(security_enabled);
+static DEVICE_ATTR_RO(module_id);
+static DEVICE_ATTR_RO(parent_device);

 static struct bin_attribute bin_attr_eeprom = {
 	.attr = {.name = "eeprom", .mode = (0444)},
···
 	&dev_attr_uboot_ver.attr,
 	&dev_attr_fw_os_ver.attr,
 	&dev_attr_security_enabled.attr,
+	&dev_attr_module_id.attr,
+	&dev_attr_parent_device.attr,
 	NULL,
 };

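The sysfs.c change above splits the 32-bit Infineon second-stage version word into three 8-bit instance fields and prints them colon-separated. A minimal userspace sketch of that unpacking follows, assuming the word is already in CPU byte order. The names `second_stage_ver`, `split_second_stage` and `format_second_stage` are illustrative and not part of the driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical holder for the three per-instance version bytes. */
struct second_stage_ver {
	unsigned int first;
	unsigned int second;
	unsigned int third;
};

/* Unpack bits [7:0], [15:8] and [23:16], mirroring the shifts in the patch. */
static struct second_stage_ver split_second_stage(uint32_t version)
{
	const uint32_t mask = 0xff;
	struct second_stage_ver v = {
		.first = version & mask,
		.second = (version >> 8) & mask,
		.third = (version >> 16) & mask,
	};

	return v;
}

/* Format only the second-stage part, using the same "%#04x" conversions. */
static int format_second_stage(char *buf, size_t len, uint32_t version)
{
	struct second_stage_ver v = split_second_stage(version);

	return snprintf(buf, len, "%#04x:%#04x:%#04x", v.first, v.second, v.third);
}
```

Bits [31:24] of the word are ignored here, as they are in the patch. Note that `%#04x` prints a zero field without the `0x` prefix, which is standard `printf` behaviour.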
+36 -38
drivers/accel/habanalabs/gaudi2/gaudi2.c
···
 	return !!ecc_data->is_critical;
 }

-static void handle_lower_qman_data_on_err(struct hl_device *hdev, u64 qman_base, u64 event_mask)
+static void handle_lower_qman_data_on_err(struct hl_device *hdev, u64 qman_base, u32 engine_id)
 {
-	u32 lo, hi, cq_ptr_size, arc_cq_ptr_size;
-	u64 cq_ptr, arc_cq_ptr, cp_current_inst;
+	struct undefined_opcode_info *undef_opcode = &hdev->captured_err_info.undef_opcode;
+	u64 cq_ptr, cp_current_inst;
+	u32 lo, hi, cq_size, cp_sts;
+	bool is_arc_cq;

-	lo = RREG32(qman_base + QM_CQ_PTR_LO_4_OFFSET);
-	hi = RREG32(qman_base + QM_CQ_PTR_HI_4_OFFSET);
-	cq_ptr = ((u64) hi) << 32 | lo;
-	cq_ptr_size = RREG32(qman_base + QM_CQ_TSIZE_4_OFFSET);
+	cp_sts = RREG32(qman_base + QM_CP_STS_4_OFFSET);
+	is_arc_cq = FIELD_GET(PDMA0_QM_CP_STS_CUR_CQ_MASK, cp_sts); /* 0 - legacy CQ, 1 - ARC_CQ */

-	lo = RREG32(qman_base + QM_ARC_CQ_PTR_LO_OFFSET);
-	hi = RREG32(qman_base + QM_ARC_CQ_PTR_HI_OFFSET);
-	arc_cq_ptr = ((u64) hi) << 32 | lo;
-	arc_cq_ptr_size = RREG32(qman_base + QM_ARC_CQ_TSIZE_OFFSET);
+	if (is_arc_cq) {
+		lo = RREG32(qman_base + QM_ARC_CQ_PTR_LO_STS_OFFSET);
+		hi = RREG32(qman_base + QM_ARC_CQ_PTR_HI_STS_OFFSET);
+		cq_ptr = ((u64) hi) << 32 | lo;
+		cq_size = RREG32(qman_base + QM_ARC_CQ_TSIZE_STS_OFFSET);
+	} else {
+		lo = RREG32(qman_base + QM_CQ_PTR_LO_STS_4_OFFSET);
+		hi = RREG32(qman_base + QM_CQ_PTR_HI_STS_4_OFFSET);
+		cq_ptr = ((u64) hi) << 32 | lo;
+		cq_size = RREG32(qman_base + QM_CQ_TSIZE_STS_4_OFFSET);
+	}

 	lo = RREG32(qman_base + QM_CP_CURRENT_INST_LO_4_OFFSET);
 	hi = RREG32(qman_base + QM_CP_CURRENT_INST_HI_4_OFFSET);
 	cp_current_inst = ((u64) hi) << 32 | lo;

 	dev_info(hdev->dev,
-		"LowerQM. CQ: {ptr %#llx, size %u}, ARC_CQ: {ptr %#llx, size %u}, CP: {instruction %#llx}\n",
-		cq_ptr, cq_ptr_size, arc_cq_ptr, arc_cq_ptr_size, cp_current_inst);
+		"LowerQM. %sCQ: {ptr %#llx, size %u}, CP: {instruction %#018llx}\n",
+		is_arc_cq ? "ARC_" : "", cq_ptr, cq_size, cp_current_inst);

-	if (event_mask & HL_NOTIFIER_EVENT_UNDEFINED_OPCODE) {
-		if (arc_cq_ptr) {
-			hdev->captured_err_info.undef_opcode.cq_addr = arc_cq_ptr;
-			hdev->captured_err_info.undef_opcode.cq_size = arc_cq_ptr_size;
-		} else {
-			hdev->captured_err_info.undef_opcode.cq_addr = cq_ptr;
-			hdev->captured_err_info.undef_opcode.cq_size = cq_ptr_size;
-		}
-
-		hdev->captured_err_info.undef_opcode.stream_id = QMAN_STREAMS;
+	if (undef_opcode->write_enable) {
+		memset(undef_opcode, 0, sizeof(*undef_opcode));
+		undef_opcode->timestamp = ktime_get();
+		undef_opcode->cq_addr = cq_ptr;
+		undef_opcode->cq_size = cq_size;
+		undef_opcode->engine_id = engine_id;
+		undef_opcode->stream_id = QMAN_STREAMS;
+		undef_opcode->write_enable = 0;
 	}
 }

···
 				error_count++;
 			}

-		if (i == QMAN_STREAMS && error_count) {
-			/* check for undefined opcode */
-			if (glbl_sts_val & PDMA0_QM_GLBL_ERR_STS_CP_UNDEF_CMD_ERR_MASK &&
-					hdev->captured_err_info.undef_opcode.write_enable) {
-				memset(&hdev->captured_err_info.undef_opcode, 0,
-						sizeof(hdev->captured_err_info.undef_opcode));
-
-				hdev->captured_err_info.undef_opcode.write_enable = false;
-				hdev->captured_err_info.undef_opcode.timestamp = ktime_get();
-				hdev->captured_err_info.undef_opcode.engine_id =
-							gaudi2_queue_id_to_engine_id[qid_base];
-				*event_mask |= HL_NOTIFIER_EVENT_UNDEFINED_OPCODE;
-			}
-
-			handle_lower_qman_data_on_err(hdev, qman_base, *event_mask);
+		/* Check for undefined opcode error in lower QM */
+		if ((i == QMAN_STREAMS) &&
+				(glbl_sts_val & PDMA0_QM_GLBL_ERR_STS_CP_UNDEF_CMD_ERR_MASK)) {
+			handle_lower_qman_data_on_err(hdev, qman_base,
+							gaudi2_queue_id_to_engine_id[qid_base]);
+			*event_mask |= HL_NOTIFIER_EVENT_UNDEFINED_OPCODE;
 		}
 	}

···
 		error_count = gaudi2_handle_pcie_drain(hdev, &eq_entry->pcie_drain_ind_data);
 		reset_flags |= HL_DRV_RESET_FW_FATAL_ERR;
 		event_mask |= HL_NOTIFIER_EVENT_GENERAL_HW_ERR;
+		if (hl_is_fw_sw_ver_equal_or_greater(hdev, 1, 13))
+			is_critical = true;
 		break;

 	case GAUDI2_EVENT_PSOC59_RPM_ERROR_OR_DRAIN:
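The reworked handler in gaudi2.c reads a single queue (ARC CQ or legacy CQ, selected by a status bit), joins the two 32-bit halves of its pointer, and records the result only while `write_enable` is set, so that a later error does not overwrite the first capture. A minimal userspace sketch of that capture-once pattern follows. `struct capture`, `join_hi_lo` and `capture_once` are illustrative names, and plain integers stand in for the `RREG32()` register reads.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the driver's captured undefined-opcode record. */
struct capture {
	uint64_t cq_addr;
	uint32_t cq_size;
	uint32_t engine_id;
	bool write_enable;	/* true: slot is free and may be filled */
};

/* Combine the high and low 32-bit register halves into one 64-bit value. */
static uint64_t join_hi_lo(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/*
 * Store the first error only. Returns true if the record was written,
 * false if an earlier capture is still held.
 */
static bool capture_once(struct capture *c, uint64_t cq_addr,
			 uint32_t cq_size, uint32_t engine_id)
{
	if (!c->write_enable)
		return false;

	memset(c, 0, sizeof(*c));	/* also clears write_enable */
	c->cq_addr = cq_addr;
	c->cq_size = cq_size;
	c->engine_id = engine_id;
	return true;
}
```

Re-arming the slot by setting `write_enable` again is left to the caller in this sketch. How the driver re-arms it is outside this diff.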
+7 -6
drivers/accel/habanalabs/include/gaudi2/asic_reg/gaudi2_regs.h
···
 #define QM_FENCE2_OFFSET (mmPDMA0_QM_CP_FENCE2_RDATA_0 - mmPDMA0_QM_BASE)
 #define QM_SEI_STATUS_OFFSET (mmPDMA0_QM_SEI_STATUS - mmPDMA0_QM_BASE)

-#define QM_CQ_PTR_LO_4_OFFSET (mmPDMA0_QM_CQ_PTR_LO_4 - mmPDMA0_QM_BASE)
-#define QM_CQ_PTR_HI_4_OFFSET (mmPDMA0_QM_CQ_PTR_HI_4 - mmPDMA0_QM_BASE)
-#define QM_CQ_TSIZE_4_OFFSET (mmPDMA0_QM_CQ_TSIZE_4 - mmPDMA0_QM_BASE)
+#define QM_CQ_TSIZE_STS_4_OFFSET (mmPDMA0_QM_CQ_TSIZE_STS_4 - mmPDMA0_QM_BASE)
+#define QM_CQ_PTR_LO_STS_4_OFFSET (mmPDMA0_QM_CQ_PTR_LO_STS_4 - mmPDMA0_QM_BASE)
+#define QM_CQ_PTR_HI_STS_4_OFFSET (mmPDMA0_QM_CQ_PTR_HI_STS_4 - mmPDMA0_QM_BASE)

-#define QM_ARC_CQ_PTR_LO_OFFSET (mmPDMA0_QM_ARC_CQ_PTR_LO - mmPDMA0_QM_BASE)
-#define QM_ARC_CQ_PTR_HI_OFFSET (mmPDMA0_QM_ARC_CQ_PTR_HI - mmPDMA0_QM_BASE)
-#define QM_ARC_CQ_TSIZE_OFFSET (mmPDMA0_QM_ARC_CQ_TSIZE - mmPDMA0_QM_BASE)
+#define QM_ARC_CQ_TSIZE_STS_OFFSET (mmPDMA0_QM_ARC_CQ_TSIZE_STS - mmPDMA0_QM_BASE)
+#define QM_ARC_CQ_PTR_LO_STS_OFFSET (mmPDMA0_QM_ARC_CQ_PTR_LO_STS - mmPDMA0_QM_BASE)
+#define QM_ARC_CQ_PTR_HI_STS_OFFSET (mmPDMA0_QM_ARC_CQ_PTR_HI_STS - mmPDMA0_QM_BASE)

+#define QM_CP_STS_4_OFFSET (mmPDMA0_QM_CP_STS_4 - mmPDMA0_QM_BASE)
 #define QM_CP_CURRENT_INST_LO_4_OFFSET (mmPDMA0_QM_CP_CURRENT_INST_LO_4 - mmPDMA0_QM_BASE)
 #define QM_CP_CURRENT_INST_HI_4_OFFSET (mmPDMA0_QM_CP_CURRENT_INST_HI_4 - mmPDMA0_QM_BASE)

+1
drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h
···
 	REV_ID_INVALID = 0x00,
 	REV_ID_A = 0x01,
 	REV_ID_B = 0x02,
+	REV_ID_C = 0x03
 };

 #endif /* INCLUDE_PCI_GENERAL_H_ */
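The new `REV_ID_C` value is what lets the driver tell a Gaudi2C apart from earlier revisions. A minimal sketch of such a lookup follows. The enum values are copied from the hunk above, while the function `rev_id_to_name` and the mapping of revisions A, B and C to the names GAUDI2, GAUDI2B and GAUDI2C are assumptions made for illustration. The driver's own probe logic is not part of this diff.

```c
#include <stdint.h>

/* Values copied from the enum in pci_general.h as shown in the diff. */
enum pci_rev_id {
	REV_ID_INVALID = 0x00,
	REV_ID_A = 0x01,
	REV_ID_B = 0x02,
	REV_ID_C = 0x03
};

/*
 * Illustrative mapping from PCI revision ID to a variant name.
 * Assumption: A, B and C correspond to GAUDI2, GAUDI2B and GAUDI2C.
 */
static const char *rev_id_to_name(uint8_t rev_id)
{
	switch (rev_id) {
	case REV_ID_A:
		return "GAUDI2";
	case REV_ID_B:
		return "GAUDI2B";
	case REV_ID_C:
		return "GAUDI2C";
	default:
		return "UNKNOWN";	/* includes REV_ID_INVALID */
	}
}
```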
+7 -1
include/linux/habanalabs/cpucp_if.h
···
  *       number (nonce) provided by the host to prevent replay attacks.
  *       public key and certificate also provided as part of the FW response.
  *
+ * CPUCP_PACKET_INFO_SIGNED_GET -
+ *       Get the device information signed by the Trusted Platform device.
+ *       device info data is also hashed with some unique number (nonce) provided
+ *       by the host to prevent replay attacks. public key and certificate also
+ *       provided as part of the FW response.
+ *
  * CPUCP_PACKET_MONITOR_DUMP_GET -
  *       Get monitors registers dump from the CpuCP kernel.
  *       The CPU will put the registers dump in the a buffer allocated by the driver
···
 	CPUCP_PACKET_ENGINE_CORE_ASID_SET, /* internal */
 	CPUCP_PACKET_RESERVED2, /* not used */
 	CPUCP_PACKET_SEC_ATTEST_GET, /* internal */
-	CPUCP_PACKET_RESERVED3, /* not used */
+	CPUCP_PACKET_INFO_SIGNED_GET, /* internal */
 	CPUCP_PACKET_RESERVED4, /* not used */
 	CPUCP_PACKET_MONITOR_DUMP_GET, /* debugfs */
 	CPUCP_PACKET_RESERVED5, /* not used */
+28
include/uapi/drm/habanalabs_accel.h
···
 #define HL_INFO_HW_ERR_EVENT 36
 #define HL_INFO_FW_ERR_EVENT 37
 #define HL_INFO_USER_ENGINE_ERR_EVENT 38
+#define HL_INFO_DEV_SIGNED 40

 #define HL_INFO_VERSION_MAX_LEN 128
 #define HL_INFO_CARD_NAME_MAX_LEN 16
···
 #define SEC_SIGNATURE_BUF_SZ 255 /* (256 - 1) 1 byte used for size */
 #define SEC_PUB_DATA_BUF_SZ 510 /* (512 - 2) 2 bytes used for size */
 #define SEC_CERTIFICATE_BUF_SZ 2046 /* (2048 - 2) 2 bytes used for size */
+#define SEC_DEV_INFO_BUF_SZ 5120

 /*
  * struct hl_info_sec_attest - attestation report of the boot
···
 	__u8 public_data[SEC_PUB_DATA_BUF_SZ];
 	__u8 certificate[SEC_CERTIFICATE_BUF_SZ];
 	__u8 pad0[2];
+};
+
+/*
+ * struct hl_info_signed - device information signed by a secured device.
+ * @nonce: number only used once. random number provided by host. this also passed to the quote
+ *         command as a qualifying data.
+ * @pub_data_len: length of the public data (bytes)
+ * @certificate_len: length of the certificate (bytes)
+ * @info_sig_len: length of the attestation signature (bytes)
+ * @public_data: public key info signed info data (outPublic + name + qualifiedName)
+ * @certificate: certificate for the signing key
+ * @info_sig: signature of the info + nonce data.
+ * @dev_info_len: length of device info (bytes)
+ * @dev_info: device info as byte array.
+ */
+struct hl_info_signed {
+	__u32 nonce;
+	__u16 pub_data_len;
+	__u16 certificate_len;
+	__u8 info_sig_len;
+	__u8 public_data[SEC_PUB_DATA_BUF_SZ];
+	__u8 certificate[SEC_CERTIFICATE_BUF_SZ];
+	__u8 info_sig[SEC_SIGNATURE_BUF_SZ];
+	__u16 dev_info_len;
+	__u8 dev_info[SEC_DEV_INFO_BUF_SZ];
+	__u8 pad[2];
 };

 /**
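Because `struct hl_info_signed` crosses the user/kernel boundary, its layout should not depend on compiler-inserted padding. The odd buffer sizes (510, 2046, 255) appear chosen so that each multi-byte field lands on a natural boundary. A minimal sketch that checks this in userspace follows. `struct hl_info_signed_mirror` and `hl_info_signed_field_bytes` are illustrative names, the mirror uses `<stdint.h>` types in place of `__u8`/`__u16`/`__u32`, and a real program would include the uapi header instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Buffer sizes copied from habanalabs_accel.h as shown in the diff. */
#define SEC_SIGNATURE_BUF_SZ	255
#define SEC_PUB_DATA_BUF_SZ	510
#define SEC_CERTIFICATE_BUF_SZ	2046
#define SEC_DEV_INFO_BUF_SZ	5120

/* Userspace mirror of struct hl_info_signed, for layout checking only. */
struct hl_info_signed_mirror {
	uint32_t nonce;
	uint16_t pub_data_len;
	uint16_t certificate_len;
	uint8_t info_sig_len;
	uint8_t public_data[SEC_PUB_DATA_BUF_SZ];
	uint8_t certificate[SEC_CERTIFICATE_BUF_SZ];
	uint8_t info_sig[SEC_SIGNATURE_BUF_SZ];
	uint16_t dev_info_len;
	uint8_t dev_info[SEC_DEV_INFO_BUF_SZ];
	uint8_t pad[2];
};

/*
 * Sum of the declared field sizes. It equals sizeof() only when the
 * compiler inserted no hidden padding between or after the fields.
 */
static size_t hl_info_signed_field_bytes(void)
{
	return 4 + 2 + 2 + 1 + SEC_PUB_DATA_BUF_SZ + SEC_CERTIFICATE_BUF_SZ +
	       SEC_SIGNATURE_BUF_SZ + 2 + SEC_DEV_INFO_BUF_SZ + 2;
}
```

Comparing `sizeof(struct hl_info_signed_mirror)` with `hl_info_signed_field_bytes()` is a quick way to confirm the layout on a given ABI. On common 64-bit ABIs the two are expected to match, since `dev_info_len` falls on an even offset and the explicit `pad[2]` rounds the total to a multiple of four.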