Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/amdgpu: RAS emergency restart logic refine

If a RAS interrupt has been triggered and BACO isn't
supported, an emergency restart is needed. This code
is only needed for some specific cases (vega20 with
a given smu fw version).

After we add smu mode1 reset for sienna cichlid, we
need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset.
In amdgpu_device_gpu_recover we would then need to
differentiate which mode1 reset we are using, decide whether
it's a full reset, and then decide whether an emergency
restart is needed, so the logic would become much more complex.

After discussion with Hawking, move the emergency restart
logic to an independent function.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

authored by Wenhui Sheng and committed by Alex Deucher
bb5c7235 ea8139d8

+24 -11
+12 -11
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
···
 	struct amdgpu_hive_info *hive = NULL;
 	struct amdgpu_device *tmp_adev = NULL;
 	int i, r = 0;
-	bool in_ras_intr = amdgpu_ras_intr_triggered();
-	bool use_baco =
-		(amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) ?
-		true : false;
+	bool need_emergency_restart = false;
 	bool audio_suspended = false;
+
+	/**
+	 * Special case: RAS triggered and full reset isn't supported
+	 */
+	need_emergency_restart = amdgpu_ras_need_emergency_restart(adev);
 
 	/*
 	 * Flush RAM to disk so that after reboot
 	 * the user can read log and see why the system rebooted.
 	 */
-	if (in_ras_intr && !use_baco && amdgpu_ras_get_context(adev)->reboot) {
-
+	if (need_emergency_restart && amdgpu_ras_get_context(adev)->reboot) {
 		DRM_WARN("Emergency reboot.");
 
 		ksys_sync_helper();
···
 	}
 
 	dev_info(adev->dev, "GPU %s begin!\n",
-		(in_ras_intr && !use_baco) ? "jobs stop":"reset");
+		need_emergency_restart ? "jobs stop":"reset");
 
 	/*
 	 * Here we trylock to avoid chain of resets executing from
···
 		amdgpu_fbdev_set_suspend(tmp_adev, 1);
 
 		/* disable ras on ALL IPs */
-		if (!(in_ras_intr && !use_baco) &&
+		if (!need_emergency_restart &&
 		      amdgpu_device_ip_need_full_reset(tmp_adev))
 			amdgpu_ras_suspend(tmp_adev);
 
···
 
 			drm_sched_stop(&ring->sched, job ? &job->base : NULL);
 
-			if (in_ras_intr && !use_baco)
+			if (need_emergency_restart)
 				amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
 		}
 	}
 
-	if (in_ras_intr && !use_baco)
+	if (need_emergency_restart)
 		goto skip_sched_resume;
 
 	/*
···
 skip_sched_resume:
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 		/*unlock kfd: SRIOV would do it separately */
-		if (!(in_ras_intr && !use_baco) && !amdgpu_sriov_vf(tmp_adev))
+		if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
 			amdgpu_amdkfd_post_reset(tmp_adev);
 		if (audio_suspended)
 			amdgpu_device_resume_display_audio(tmp_adev);
+11
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
···
 		amdgpu_ras_reset_gpu(adev);
 	}
 }
+
+bool amdgpu_ras_need_emergency_restart(struct amdgpu_device *adev)
+{
+	if (adev->asic_type == CHIP_VEGA20 &&
+	    adev->pm.fw_version <= 0x283400) {
+		return !(amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) &&
+				amdgpu_ras_intr_triggered();
+	}
+
+	return false;
+}
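
Note that the new helper is narrower than the inline condition it replaces: the old check (`in_ras_intr && !use_baco`) applied to any ASIC, while the helper additionally requires Vega20 with an SMU firmware version at or below 0x283400. The following is a minimal standalone sketch of that difference, not driver code. The `sketch_*` types, the `old_condition`/`new_condition` names, and the self-check are hypothetical stand-ins introduced for illustration; only the threshold 0x283400 and the shape of the two conditions are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for driver state; these are NOT the amdgpu types. */
enum sketch_asic { SKETCH_VEGA20, SKETCH_OTHER };
enum sketch_reset { SKETCH_RESET_BACO, SKETCH_RESET_MODE1 };

struct sketch_dev {
	enum sketch_asic asic;        /* models adev->asic_type */
	uint32_t smu_fw_version;      /* models adev->pm.fw_version */
	enum sketch_reset reset;      /* models amdgpu_asic_reset_method() */
	bool ras_intr_triggered;      /* models amdgpu_ras_intr_triggered() */
};

/* Inline condition used before the patch: in_ras_intr && !use_baco. */
bool old_condition(const struct sketch_dev *d)
{
	return d->ras_intr_triggered && d->reset != SKETCH_RESET_BACO;
}

/* Same shape as amdgpu_ras_need_emergency_restart() in the diff. */
bool new_condition(const struct sketch_dev *d)
{
	if (d->asic == SKETCH_VEGA20 && d->smu_fw_version <= 0x283400)
		return d->reset != SKETCH_RESET_BACO && d->ras_intr_triggered;

	return false;
}

/* Checks that new_condition() implies old_condition() on a few cases. */
void sketch_self_check(void)
{
	const struct sketch_dev cases[] = {
		{ SKETCH_VEGA20, 0x283400, SKETCH_RESET_MODE1, true  },
		{ SKETCH_VEGA20, 0x283401, SKETCH_RESET_MODE1, true  },
		{ SKETCH_VEGA20, 0x283400, SKETCH_RESET_BACO,  true  },
		{ SKETCH_OTHER,  0x100000, SKETCH_RESET_MODE1, true  },
		{ SKETCH_VEGA20, 0x283400, SKETCH_RESET_MODE1, false },
	};

	for (unsigned int i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
		/* The new predicate never fires where the old one did not. */
		if (new_condition(&cases[i]))
			assert(old_condition(&cases[i]));
	}
}
```

In this model, a non-Vega20 device with a RAS interrupt and a non-BACO reset method satisfies the old condition but not the new one, so it would go through the normal reset path instead of the "jobs stop" path.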
+1
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
···
 
 void amdgpu_ras_set_error_query_ready(struct amdgpu_device *adev, bool ready);
 
+bool amdgpu_ras_need_emergency_restart(struct amdgpu_device *adev);
 #endif