drm/amdgpu: Handle the GPU recovery failure in SRIOV environment.

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


This patch handles GPU recovery failure in an SRIOV environment by
retrying the reset if it fails. To decide whether to retry, a new macro
AMDGPU_RETRY_SRIOV_RESET is added, which evaluates to true if the
failure is -ETIMEDOUT, -EINVAL or -EBUSY and to false otherwise. A new
macro AMDGPU_MAX_RETRY_LIMIT limits the number of retries to 2.
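
The retry logic described above can be sketched outside the kernel as a small user-space C example. The two macros match the ones added by this patch; reset_with_retry(), struct fake_reset and fake_reset_once() are hypothetical stand-ins introduced only for illustration, and fprintf() replaces DRM_ERROR().

```c
#include <errno.h>
#include <stdio.h>

/* Same values and condition as the macros added by this patch. */
#define AMDGPU_MAX_RETRY_LIMIT 2
#define AMDGPU_RETRY_SRIOV_RESET(r) \
	((r) == -EBUSY || (r) == -ETIMEDOUT || (r) == -EINVAL)

/* Hypothetical stand-in for one reset attempt; not a kernel function. */
typedef int (*reset_fn)(void *ctx);

/*
 * Run the reset and retry at most AMDGPU_MAX_RETRY_LIMIT times when the
 * error is one of the retryable codes. Returns the result of the last
 * attempt.
 */
static int reset_with_retry(reset_fn do_reset, void *ctx)
{
	int retry_limit = 0;
	int r;

retry:
	r = do_reset(ctx);

	if (AMDGPU_RETRY_SRIOV_RESET(r)) {
		if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
			retry_limit++;
			goto retry;
		} else {
			fprintf(stderr, "GPU reset retry is beyond the retry limit\n");
		}
	}

	return r;
}

/* Hypothetical stub: fails with -ETIMEDOUT a set number of times, then succeeds. */
struct fake_reset {
	int failures_left;
	int calls;
};

static int fake_reset_once(void *ctx)
{
	struct fake_reset *f = ctx;

	f->calls++;
	if (f->failures_left > 0) {
		f->failures_left--;
		return -ETIMEDOUT;
	}
	return 0;
}
```

With a limit of 2, a reset that keeps failing with a retryable code is attempted three times in total: the initial attempt plus two retries. Errors outside the three listed codes are returned immediately without a retry.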

It also handles the return status in the post-ASIC-reset path: the return
code is updated with asic_reset_res and returned to
amdgpu_job_timedout(), which logs an error if recovery failed.
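
The return-code handling can be illustrated with a minimal user-space sketch. It assumes a simplified device structure that carries only asic_reset_res, and a plain array in place of the kernel's device list; struct fake_adev and post_reset_collect() are hypothetical names, not driver code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified device: only the field this sketch needs. */
struct fake_adev {
	int asic_reset_res;
};

/*
 * Walk the devices after a reset. A non-zero asic_reset_res is copied
 * into the return code before the field is cleared, so the caller can
 * see that the reset failed. If several devices failed, the last
 * non-zero result is the one returned.
 */
static int post_reset_collect(struct fake_adev *devs, size_t n)
{
	int r = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (devs[i].asic_reset_res)
			r = devs[i].asic_reset_res;

		devs[i].asic_reset_res = 0;
	}

	return r;
}
```

Without the copy, clearing asic_reset_res would discard the failure and the caller would have no error to log.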

Signed-off-by: Surbhi Kakarya <surbhi.kakarya@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

authored by Surbhi Kakarya and committed by Alex Deucher 4 years ago 7258fa31 1ec1944e

+20 -1

2 changed files

+15

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

···
 MODULE_FIRMWARE("amdgpu/yellow_carp_gpu_info.bin");

 #define AMDGPU_RESUME_MS		2000
+#define AMDGPU_MAX_RETRY_LIMIT		2
+#define AMDGPU_RETRY_SRIOV_RESET(r) ((r) == -EBUSY || (r) == -ETIMEDOUT || (r) == -EINVAL)

 const char *amdgpu_asic_name[] = {
 	"TAHITI",
···
 {
 	int r;
 	struct amdgpu_hive_info *hive = NULL;
+	int retry_limit = 0;

+retry:
 	amdgpu_amdkfd_pre_reset(adev);

 	amdgpu_amdkfd_pre_reset(adev);
···
 		r = amdgpu_device_recover_vram(adev);
 	}
 	amdgpu_virt_release_full_gpu(adev, true);
+
+	if (AMDGPU_RETRY_SRIOV_RESET(r)) {
+		if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
+			retry_limit++;
+			goto retry;
+		} else
+			DRM_ERROR("GPU reset retry is beyond the retry limit\n");
+	}

 	return r;
 }
···
 		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled) {
 			drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
 		}
+
+		if (tmp_adev->asic_reset_res)
+			r = tmp_adev->asic_reset_res;

 		tmp_adev->asic_reset_res = 0;

+5 -1

drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

···
 	struct amdgpu_task_info ti;
 	struct amdgpu_device *adev = ring->adev;
 	int idx;
+	int r;

 	if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
 		DRM_INFO("%s - device unplugged skipping recovery on scheduler:%s",
···
 		  ti.process_name, ti.tgid, ti.task_name, ti.pid);

 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
-		amdgpu_device_gpu_recover(ring->adev, job);
+		r = amdgpu_device_gpu_recover(ring->adev, job);
+		if (r)
+			DRM_ERROR("GPU Recovery Failed: %d\n", r);
+
 	} else {
 		drm_sched_suspend_timeout(&ring->sched);
 		if (amdgpu_sriov_vf(adev))