Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/xe/xe_survivability: Add support for Runtime survivability mode

Certain runtime firmware errors can leave the device in an unusable
state that requires a firmware flash to restore normal operation.
Runtime Survivability Mode indicates that a firmware flash is necessary
by wedging the device and exposing the survivability mode sysfs.

The following sysfs file indicates that the device is in survivability mode:

/sys/bus/pci/devices/<device>/survivability_mode
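
As an illustration of how userspace might consume this file, here is a minimal sketch in C. It assumes the attribute text contains the "Survivability mode type: <Boot|Runtime>" line that this patch emits. The names surv_type, parse_survivability_type and read_survivability_type are hypothetical and not part of the driver, and a missing file is simply treated as "not in survivability mode".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace mirror of the driver's type enum; -1 is a local sentinel. */
enum surv_type { SURV_UNKNOWN = -1, SURV_BOOT = 0, SURV_RUNTIME = 1 };

/* Find the "Survivability mode type: ..." line and map it to a type. */
static enum surv_type parse_survivability_type(const char *text)
{
	static const char prefix[] = "Survivability mode type: ";
	const char *p = strstr(text, prefix);

	if (!p)
		return SURV_UNKNOWN;
	p += sizeof(prefix) - 1;
	if (strncmp(p, "Runtime", 7) == 0)
		return SURV_RUNTIME;
	if (strncmp(p, "Boot", 4) == 0)
		return SURV_BOOT;
	return SURV_UNKNOWN;
}

/*
 * Read the attribute at path (the caller substitutes the PCI address
 * for <device>). If the file cannot be opened, report SURV_UNKNOWN.
 */
static enum surv_type read_survivability_type(const char *path)
{
	char buf[4096];
	size_t n;
	FILE *f = fopen(path, "r");

	if (!f)
		return SURV_UNKNOWN;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_survivability_type(buf);
}
```

A caller would pass the full sysfs path for its device and treat SURV_RUNTIME as a signal that a firmware flash is needed before the device can be used again.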

v2: Fix kernel-doc (Umesh)
v3: Add user friendly dmesg (Frank)

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Link: https://lore.kernel.org/r/20250826063419.3022216-7-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Authored by Riana Tauro and committed by Rodrigo Vivi
a2ca0633 41ff795a

+44 -1
+42 -1
drivers/gpu/drm/xe/xe_survivability_mode.c
···
 	struct xe_survivability_info *info = survivability->info;
 	int index = 0, count = 0;
 
-	count += sysfs_emit_at(buff, count, "Survivability mode type: Boot\n");
+	count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
+			       survivability->type ? "Runtime" : "Boot");
 
 	if (!check_boot_failure(xe))
 		return count;
···
 	survivability->boot_status = REG_FIELD_GET(BOOT_STATUS, data);
 
 	return check_boot_failure(xe);
+}
+
+/**
+ * xe_survivability_mode_runtime_enable - Initialize and enable runtime survivability mode
+ * @xe: xe device instance
+ *
+ * Initialize survivability information and enable runtime survivability mode.
+ * Runtime survivability mode is enabled when certain errors cause the device to be
+ * in non-recoverable state. The device is declared wedged with the appropriate
+ * recovery method and survivability mode sysfs exposed to userspace
+ *
+ * Return: 0 if runtime survivability mode is enabled, negative error code otherwise.
+ */
+int xe_survivability_mode_runtime_enable(struct xe_device *xe)
+{
+	struct xe_survivability *survivability = &xe->survivability;
+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+	int ret;
+
+	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe) || xe->info.platform < XE_BATTLEMAGE) {
+		dev_err(&pdev->dev, "Runtime Survivability Mode not supported\n");
+		return -EINVAL;
+	}
+
+	ret = init_survivability_mode(xe);
+	if (ret)
+		return ret;
+
+	ret = create_survivability_sysfs(pdev);
+	if (ret)
+		dev_err(&pdev->dev, "Failed to create survivability mode sysfs\n");
+
+	survivability->type = XE_SURVIVABILITY_TYPE_RUNTIME;
+	dev_err(&pdev->dev, "Runtime Survivability mode enabled\n");
+
+	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
+	xe_device_declare_wedged(xe);
+	dev_err(&pdev->dev, "Firmware update required, Refer the userspace documentation for more details!\n");
+
+	return 0;
 }
 
 /**
+1
drivers/gpu/drm/xe/xe_survivability_mode.h
···
 struct xe_device;
 
 int xe_survivability_mode_boot_enable(struct xe_device *xe);
+int xe_survivability_mode_runtime_enable(struct xe_device *xe);
 bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe);
 bool xe_survivability_mode_is_requested(struct xe_device *xe);
 
+1
drivers/gpu/drm/xe/xe_survivability_mode_types.h
···
 
 enum xe_survivability_type {
 	XE_SURVIVABILITY_TYPE_BOOT,
+	XE_SURVIVABILITY_TYPE_RUNTIME,
 };
 
 struct xe_survivability_info {
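
The sysfs change in this patch picks the label with `survivability->type ? "Runtime" : "Boot"`, which works because XE_SURVIVABILITY_TYPE_BOOT is the first enumerator and therefore zero. Purely as an illustration, and not something the patch does, a table-based lookup would keep working if more types were added later. The helper name survivability_type_name and the "Unknown" fallback are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same enumerators and order as xe_survivability_mode_types.h after this patch. */
enum xe_survivability_type {
	XE_SURVIVABILITY_TYPE_BOOT,
	XE_SURVIVABILITY_TYPE_RUNTIME,
};

/* Hypothetical helper, not part of the patch: map a type to its sysfs label. */
static const char *survivability_type_name(enum xe_survivability_type type)
{
	static const char * const names[] = {
		[XE_SURVIVABILITY_TYPE_BOOT] = "Boot",
		[XE_SURVIVABILITY_TYPE_RUNTIME] = "Runtime",
	};

	/* Out-of-range values fall back to a placeholder label. */
	if ((size_t)type >= sizeof(names) / sizeof(names[0]))
		return "Unknown";
	return names[type];
}
```

With only two types the ternary in the patch is shorter; the table form only pays off once a third enumerator exists.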