Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/amdgpu: Report individual reset error

If reinitialization of one of the GPUs fails after reset, the driver
logs a failure for all subsequent GPUs even though they have resumed
successfully.

A sample log where only the device at 0000:95:00.0 had a failure:

amdgpu 0000:15:00.0: amdgpu: GPU reset(19) succeeded!
amdgpu 0000:65:00.0: amdgpu: GPU reset(19) succeeded!
amdgpu 0000:75:00.0: amdgpu: GPU reset(19) succeeded!
amdgpu 0000:85:00.0: amdgpu: GPU reset(19) succeeded!
amdgpu 0000:95:00.0: amdgpu: GPU reset(19) failed
amdgpu 0000:e5:00.0: amdgpu: GPU reset(19) failed
amdgpu 0000:f5:00.0: amdgpu: GPU reset(19) failed
amdgpu 0000:05:00.0: amdgpu: GPU reset(19) failed
amdgpu 0000:15:00.0: amdgpu: GPU reset end with ret = -5

To avoid confusion, report the error for each device
separately and return the first error as the overall result.
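
The intended behaviour can be sketched outside the kernel as follows. This is a minimal illustration, not the driver code: struct fake_dev and report_reset_results() are hypothetical names, and printf stands in for dev_info(). The point it shows is that each device is judged by its own result, while the accumulated return value only ever takes the first error.

```c
#include <stdio.h>

/* Hypothetical stand-in for a device; not the real struct amdgpu_device. */
struct fake_dev {
	const char *name;
	int reset_res;	/* 0 on success, negative errno-style code on failure */
};

/*
 * Log each device's own result and return the first error seen.
 * The per-device check never looks at the accumulated value r, so one
 * failure cannot leak into the log lines of later devices.
 */
static int report_reset_results(struct fake_dev *devs, int n)
{
	int r = 0;

	for (int i = 0; i < n; i++) {
		if (devs[i].reset_res) {
			printf("%s: GPU reset failed with error %d\n",
			       devs[i].name, devs[i].reset_res);
			if (!r)
				r = devs[i].reset_res;	/* keep only the first error */
			devs[i].reset_res = 0;		/* clear after reporting */
		} else {
			printf("%s: GPU reset succeeded!\n", devs[i].name);
		}
	}
	return r;
}
```

With four devices whose results are 0, -5, 0 and -22, this sketch would log one success, one failure, one success and one failure, and return -5.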

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

authored by Lijo Lazar and committed by Alex Deucher
2e976637 a107aeb6

+15 -10
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6389,23 +6389,28 @@
 		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
 			drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
 
-		if (tmp_adev->asic_reset_res)
-			r = tmp_adev->asic_reset_res;
-
-		tmp_adev->asic_reset_res = 0;
-
-		if (r) {
+		if (tmp_adev->asic_reset_res) {
 			/* bad news, how to tell it to userspace ?
 			 * for ras error, we should report GPU bad status instead of
 			 * reset failure
 			 */
 			if (reset_context->src != AMDGPU_RESET_SRC_RAS ||
 			    !amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
-				dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
-					atomic_read(&tmp_adev->gpu_reset_counter));
-			amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
+				dev_info(
+					tmp_adev->dev,
+					"GPU reset(%d) failed with error %d \n",
+					atomic_read(
+						&tmp_adev->gpu_reset_counter),
+					tmp_adev->asic_reset_res);
+			amdgpu_vf_error_put(tmp_adev,
+					    AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0,
+					    tmp_adev->asic_reset_res);
+			if (!r)
+				r = tmp_adev->asic_reset_res;
+			tmp_adev->asic_reset_res = 0;
 		} else {
-			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
+			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n",
+				 atomic_read(&tmp_adev->gpu_reset_counter));
 			if (amdgpu_acpi_smart_shift_update(tmp_adev,
 							   AMDGPU_SS_DEV_D0))
 				dev_warn(tmp_adev->dev,