Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/amdgpu: Handle the GPU recovery failure in SRIOV environment.

This patch handles GPU recovery failure in an SR-IOV environment by
retrying the reset if the first attempt fails. To decide whether to
retry, a new macro AMDGPU_RETRY_SRIOV_RESET is added; it evaluates to
true if the failure is -ETIMEDOUT, -EINVAL or -EBUSY, and to false
otherwise. A new macro AMDGPU_MAX_RETRY_LIMIT limits the number of
retries to 2.
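
The retry policy described above can be modelled outside the kernel. The following is a minimal user-space sketch, not the driver code: the two macros are copied from the patch, while reset_with_retry(), reset_attempt_fn, struct scripted_reset and scripted_attempt() are hypothetical names introduced only for illustration. It uses a loop where the patch uses a goto label.

```c
#include <errno.h>
#include <stddef.h>

/* Copied from the patch. */
#define AMDGPU_MAX_RETRY_LIMIT 2
#define AMDGPU_RETRY_SRIOV_RESET(r) ((r) == -EBUSY || (r) == -ETIMEDOUT || (r) == -EINVAL)

/* Hypothetical stand-in for one SR-IOV reset attempt; returns 0 or -errno. */
typedef int (*reset_attempt_fn)(void *ctx);

/*
 * Run the attempt once, then retry at most AMDGPU_MAX_RETRY_LIMIT times
 * while the result is one of the retryable errors. Returns the result of
 * the last attempt and, if attempts_out is non-NULL, stores the number of
 * attempts made.
 */
static int reset_with_retry(reset_attempt_fn attempt, void *ctx, int *attempts_out)
{
	int retry_limit = 0;
	int attempts = 0;
	int r;

	for (;;) {
		r = attempt(ctx);
		attempts++;

		if (!AMDGPU_RETRY_SRIOV_RESET(r))
			break;		/* success or a non-retryable error */
		if (retry_limit >= AMDGPU_MAX_RETRY_LIMIT)
			break;		/* retry budget exhausted */
		retry_limit++;
	}

	if (attempts_out != NULL)
		*attempts_out = attempts;
	return r;
}

/* Hypothetical test double: replays a fixed list of results, repeating the last one. */
struct scripted_reset {
	const int *results;
	int count;
	int next;
};

static int scripted_attempt(void *ctx)
{
	struct scripted_reset *s = ctx;
	int idx = s->next < s->count ? s->next : s->count - 1;

	s->next++;
	return s->results[idx];
}
```

With a limit of 2, a reset that keeps failing with a retryable error is attempted three times in total (the initial attempt plus two retries) before the last error code is returned.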

It also handles the return status in the post-ASIC-reset path: the return
code is updated with asic_reset_res, and amdgpu_job_timedout() checks the
value returned by amdgpu_device_gpu_recover() and logs an error if it is
non-zero.
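
The return-code handling can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: struct fake_device, post_reset_collect() and job_timedout_model() are hypothetical stand-ins, not driver functions, and the model returns the code to the caller only so that it can be checked.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-device reset status kept by the driver. */
struct fake_device {
	int asic_reset_res;
};

/*
 * Model of the post-reset step: a non-zero per-device asic_reset_res
 * overrides the running return code r, and the field is then cleared.
 */
static int post_reset_collect(struct fake_device *devs, int n, int r)
{
	for (int i = 0; i < n; i++) {
		if (devs[i].asic_reset_res)
			r = devs[i].asic_reset_res;

		devs[i].asic_reset_res = 0;
	}
	return r;
}

/* Model of the timeout handler: it logs when recovery reports a failure. */
static int job_timedout_model(struct fake_device *devs, int n)
{
	int r = post_reset_collect(devs, n, 0);

	if (r)
		fprintf(stderr, "GPU Recovery Failed: %d\n", r);
	return r;
}
```

Before this change the per-device status was cleared without being propagated, so the caller had no way to tell that a reset had failed.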

Signed-off-by: Surbhi Kakarya <surbhi.kakarya@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Authored by Surbhi Kakarya and committed by Alex Deucher
7258fa31 1ec1944e

+20 -1
+15
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
···
 MODULE_FIRMWARE("amdgpu/yellow_carp_gpu_info.bin");
 
 #define AMDGPU_RESUME_MS		2000
+#define AMDGPU_MAX_RETRY_LIMIT		2
+#define AMDGPU_RETRY_SRIOV_RESET(r) ((r) == -EBUSY || (r) == -ETIMEDOUT || (r) == -EINVAL)
 
 const char *amdgpu_asic_name[] = {
 	"TAHITI",
···
 {
 	int r;
 	struct amdgpu_hive_info *hive = NULL;
+	int retry_limit = 0;
 
+retry:
 	amdgpu_amdkfd_pre_reset(adev);
 
 	amdgpu_amdkfd_pre_reset(adev);
···
 		r = amdgpu_device_recover_vram(adev);
 	}
 	amdgpu_virt_release_full_gpu(adev, true);
+
+	if (AMDGPU_RETRY_SRIOV_RESET(r)) {
+		if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
+			retry_limit++;
+			goto retry;
+		} else
+			DRM_ERROR("GPU reset retry is beyond the retry limit\n");
+	}
 
 	return r;
 }
···
 		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled) {
 			drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
 		}
+
+		if (tmp_adev->asic_reset_res)
+			r = tmp_adev->asic_reset_res;
 
 		tmp_adev->asic_reset_res = 0;
 
+5 -1
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
···
 	struct amdgpu_task_info ti;
 	struct amdgpu_device *adev = ring->adev;
 	int idx;
+	int r;
 
 	if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
 		DRM_INFO("%s - device unplugged skipping recovery on scheduler:%s",
···
 		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
 
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
-		amdgpu_device_gpu_recover(ring->adev, job);
+		r = amdgpu_device_gpu_recover(ring->adev, job);
+		if (r)
+			DRM_ERROR("GPU Recovery Failed: %d\n", r);
+
 	} else {
 		drm_sched_suspend_timeout(&ring->sched);
 		if (amdgpu_sriov_vf(adev))