Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/amd/amdgpu: consider kernel job always not guilty

[Why]
Currently every timed-out job is considered guilty. In the SR-IOV
multi-VF use case, the VF FLR happens first and the job timeouts are
detected afterwards. Several jobs can time out within a very small time
slice. If the timeout of an innocent SDMA job is detected before that of
the real bad job, the innocent SDMA job is marked guilty, which leads to
a page fault after the job is resubmitted.

[How]
If the job is a kernel job (job->vm is NULL), never consider it guilty.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Authored by Jingwen Chen and committed by Alex Deucher
ff99849b 410e302e

+3 -3
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4468,7 +4468,7 @@
 		amdgpu_fence_driver_force_completion(ring);
 	}
 
-	if(job)
+	if (job && job->vm)
 		drm_sched_increase_karma(&job->base);
 
 	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
@@ -4932,7 +4932,7 @@
 			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
 				job ? job->base.id : -1, hive->hive_id);
 			amdgpu_put_xgmi_hive(hive);
-			if (job)
+			if (job && job->vm)
 				drm_sched_increase_karma(&job->base);
 			return 0;
 		}
@@ -4956,7 +4956,7 @@
 					job ? job->base.id : -1);
 
 		/* even we skipped this reset, still need to set the job to guilty */
-		if (job)
+		if (job && job->vm)
 			drm_sched_increase_karma(&job->base);
 		goto skip_recovery;
 	}
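
The effect of the new condition can be illustrated outside the driver
with a minimal sketch. Everything named mock_* below is a hypothetical
stand-in invented for this illustration, not the real amdgpu or DRM
scheduler types; it only assumes, as the patch does, that a kernel job
is one whose vm pointer is NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures (not the real types). */
struct mock_vm {
	int id;
};

struct mock_job {
	struct mock_vm *vm;	/* NULL for a kernel-submitted job */
	int karma;		/* stand-in for the scheduler's karma counter */
};

/* Stand-in for drm_sched_increase_karma(): only bumps a counter. */
static void mock_increase_karma(struct mock_job *job)
{
	assert(job != NULL);
	job->karma++;
}

/*
 * Mirrors the patched condition: a timed-out job is blamed only when
 * it belongs to a VM. Kernel jobs (vm == NULL) and a missing job are
 * left untouched.
 */
static void mock_blame_timed_out_job(struct mock_job *job)
{
	if (job && job->vm)
		mock_increase_karma(job);
}
```

Calling mock_blame_timed_out_job() on a job with a VM raises its karma
by one, while a job with vm == NULL keeps its karma unchanged, which is
the behaviour the commit message describes for kernel jobs.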