Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/amdgpu: fix gpu page fault after hibernation on PF passthrough

In a PF passthrough environment, running coralgemm after a hibernate
and resume cycle causes a GPU page fault.

A mode1 reset happens during hibernation, but the partition mode is not
restored on resume, so registers mmCP_HYP_XCP_CTL and mmCP_PSP_XCP_CTL
hold incorrect values after resume. When the CP accesses the MQD BO, it
uses the wrong stride size, which causes an out-of-bounds access on the
MQD BO and results in a page fault.
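
To illustrate why a wrong stride leads to an out-of-bounds access, here is a minimal sketch in plain C. It is not driver code: `struct desc_buf`, `desc_offset()`, `desc_in_bounds()`, and the sizes used in the usage note are hypothetical and do not reflect the real MQD layout or the contents of the XCP_CTL registers. It only shows that indexing a fixed-size buffer with a larger stride than it was allocated for can land past the end.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a buffer that holds fixed-size descriptors. */
struct desc_buf {
	size_t size;	/* total bytes allocated for the buffer */
};

/* Byte offset of descriptor `index` when the consumer assumes `stride`. */
static size_t desc_offset(size_t index, size_t stride)
{
	return index * stride;
}

/* True if descriptor `index`, read with `stride`, lies fully inside buf. */
static bool desc_in_bounds(const struct desc_buf *buf, size_t index,
			   size_t stride)
{
	size_t off = desc_offset(index, stride);

	return off <= buf->size && stride <= buf->size - off;
}
```

With a buffer sized for four 256-byte descriptors, index 3 is in bounds at stride 256 but out of bounds at stride 512, which is the kind of mismatch the commit message describes.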

The fix is to ensure gfx_v9_4_3_switch_compute_partition() is called
when resuming from hibernation.
KFD resume is called separately during a reset recovery or a
resume-from-suspend sequence, so it does not need to be called as part
of the partition switch.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5d1b32cfe4a676fe552416cb5ae847b215463a1a)

Authored by Samuel Zhang and committed by Alex Deucher
eb6e7f52 6dd97ceb

+5 -2
+2 -1
drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
@@ -407,7 +407,8 @@
 		return -EINVAL;
 	}
 
-	if (adev->kfd.init_complete && !amdgpu_in_reset(adev))
+	if (adev->kfd.init_complete && !amdgpu_in_reset(adev) &&
+	    !adev->in_suspend)
 		flags |= AMDGPU_XCP_OPS_KFD;
 
 	if (flags & AMDGPU_XCP_OPS_KFD) {
+3 -1
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -2292,7 +2292,9 @@
 		r = amdgpu_xcp_init(adev->xcp_mgr, num_xcp, mode);
 
 	} else {
-		if (amdgpu_xcp_query_partition_mode(adev->xcp_mgr,
+		if (adev->in_suspend)
+			amdgpu_xcp_restore_partition_mode(adev->xcp_mgr);
+		else if (amdgpu_xcp_query_partition_mode(adev->xcp_mgr,
 						    AMDGPU_XCP_FL_NONE) ==
 		    AMDGPU_UNKNOWN_COMPUTE_PARTITION_MODE)
 			r = amdgpu_xcp_switch_partition_mode(
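
The control flow that the two hunks introduce can be modeled outside the kernel as a small, self-contained C sketch. Everything here is hypothetical and for illustration only: `struct dev_state`, `enum partition_action`, `pick_partition_action()`, and `partition_switch_handles_kfd()` are not kernel symbols. The `PARTITION_KEEP` outcome is an assumption, since the diff is truncated and does not show what happens when the mode is already known.

```c
#include <stdbool.h>

/* Hypothetical outcomes of the partition-mode decision. */
enum partition_action {
	PARTITION_RESTORE,	/* resume path: reprogram the saved mode */
	PARTITION_SWITCH,	/* mode unknown: switch to a valid mode */
	PARTITION_KEEP,		/* assumed: mode already known, nothing to do */
};

/* Hypothetical stand-in for the device flags the patch consults. */
struct dev_state {
	bool in_suspend;	/* models adev->in_suspend */
	bool in_reset;		/* models amdgpu_in_reset(adev) */
	bool kfd_init_complete;	/* models adev->kfd.init_complete */
	bool mode_known;	/* models query != UNKNOWN_COMPUTE_PARTITION_MODE */
};

/* Mirrors the gfx_v9_4_3.c hunk: suspend/resume is checked first. */
static enum partition_action pick_partition_action(const struct dev_state *s)
{
	if (s->in_suspend)
		return PARTITION_RESTORE;
	if (!s->mode_known)
		return PARTITION_SWITCH;
	return PARTITION_KEEP;
}

/* Mirrors the aqua_vanjaram.c hunk: skip KFD handling during resume. */
static bool partition_switch_handles_kfd(const struct dev_state *s)
{
	return s->kfd_init_complete && !s->in_reset && !s->in_suspend;
}
```

In this model, a device resuming from hibernation always takes the restore path and leaves KFD to the separate resume sequence, while a device outside suspend and reset with an unknown mode switches partitions and handles KFD as before.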