Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/amdkfd: Introduce KFD module parameter halt_if_hws_hang

This avoids triggering a GPU reset or otherwise changing the HW
state. Instead KFD will hang, which allows HW debugging tools to
analyze the problem.

Signed-off-by: Yong Zhao <yong.zhao@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

Authored by Yong Zhao and committed by Oded Gabbay
0e9a860c a29ec470

3 files changed, 16 insertions(+)
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c (+7)
@@ -1217,6 +1217,13 @@
 	while (*fence_addr != fence_value) {
 		if (time_after(jiffies, end_jiffies)) {
 			pr_err("qcm fence wait loop timeout expired\n");
+			/* In HWS case, this is used to halt the driver thread
+			 * in order not to mess up CP states before doing
+			 * scandumps for FW debugging.
+			 */
+			while (halt_if_hws_hang)
+				schedule();
+
 			return -ETIME;
 		}
 		schedule();
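The hunk above parks the driver thread inside the timeout branch for as long as halt_if_hws_hang is non-zero. The following is a minimal userspace sketch of that control flow, not the kernel code: `fence_wait_timeout` and `now_ms` are hypothetical names, `sched_yield()` stands in for the kernel's `schedule()`, and a monotonic clock stands in for `jiffies`.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the module parameter (assumption: a plain global here;
 * in the kernel it is registered with module_param()). */
static volatile int halt_if_hws_hang;

/* Monotonic time in milliseconds (hypothetical helper). */
static int64_t now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Wait until *fence_addr equals fence_value or timeout_ms elapses.
 * Returns 0 on success and -ETIME on timeout. */
static int fence_wait_timeout(volatile unsigned int *fence_addr,
			      unsigned int fence_value, int64_t timeout_ms)
{
	int64_t end = now_ms() + timeout_ms;

	while (*fence_addr != fence_value) {
		if (now_ms() > end) {
			/* Park here while the flag is set, so the caller's
			 * recovery path never runs and state is left as-is.
			 * Clearing the flag lets the function return. */
			while (halt_if_hws_hang)
				sched_yield();

			return -ETIME;
		}
		sched_yield();
	}
	return 0;
}
```

With the flag at 0 the function behaves as before the patch and returns -ETIME on timeout. With the flag at 1 the caller never sees the error until the flag is cleared, which is the point of the change: nothing downstream of the timeout gets a chance to alter hardware state.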
drivers/gpu/drm/amd/amdkfd/kfd_module.c (+4)
@@ -92,6 +92,10 @@
 
 static int amdkfd_init_completed;
 
+int halt_if_hws_hang;
+module_param(halt_if_hws_hang, int, 0644);
+MODULE_PARM_DESC(halt_if_hws_hang, "Halt if HWS hang is detected (0 = off (default), 1 = on)");
+
 int kgd2kfd_init(unsigned int interface_version,
 		const struct kgd2kfd_calls **g2f)
 {
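Because the parameter is registered with mode 0644, it is readable by everyone and writable by root through sysfs at runtime. The sketch below shows one way a debugging tool might read or toggle it from userspace. The helper names are hypothetical, and the path in the usage note assumes the driver is loaded as a module named amdkfd; if KFD is built into another module, the parameter would appear under that module's directory instead.

```c
#include <stdio.h>

/* Read an integer parameter from a sysfs-style file.
 * Returns 0 and stores the value in *out on success, -1 on failure. */
static int read_int_param(const char *path, int *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Write an integer parameter. For a 0644 sysfs file this needs root.
 * Returns 0 on success, -1 on failure. */
static int write_int_param(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%d\n", value) > 0);
	/* fclose() flushes the buffer, so a rejected write surfaces here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Under the naming assumption above, a tool would call `write_int_param("/sys/module/amdkfd/parameters/halt_if_hws_hang", 1)` before reproducing a hang, and write 0 afterwards to release a parked thread.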
drivers/gpu/drm/amd/amdkfd/kfd_priv.h (+5)
@@ -144,6 +144,11 @@
  */
 extern int vega10_noretry;
 
+/*
+ * Halt if HWS hang is detected
+ */
+extern int halt_if_hws_hang;
+
 /**
  * enum kfd_sched_policy
  *