Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout

If the g2h worker doesn't get an opportunity to run within the specified
timeout, flush the g2h worker explicitly.

v2:
- Describe change in the comment and add TODO (Matt B/John H)
- Add xe_gt_warn on fence done after G2H flush (John H)
v3:
- Updated the comment with root cause
- Clean up xe_gt_warn message (John H)

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2902
Signed-off-by: Badal Nilawar <badal.nilawar@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Acked-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241017111410.2553784-2-badal.nilawar@intel.com
(cherry picked from commit e5152723380404acb8175e0777b1cea57f319a01)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>

authored by Badal Nilawar and committed by Lucas De Marchi
22ef43c7 c8fb95e7

 drivers/gpu/drm/xe/xe_guc_ct.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
@@ -898,6 +898,24 @@
 	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
 
 	/*
+	 * Occasionally it is seen that the G2H worker starts running after a delay of more than
+	 * a second even after being queued and activated by the Linux workqueue subsystem. This
+	 * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
+	 * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
+	 * and this is beyond xe kmd.
+	 *
+	 * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
+	 */
+	if (!ret) {
+		flush_work(&ct->g2h_worker);
+		if (g2h_fence.done) {
+			xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
+				   g2h_fence.seqno, action[0]);
+			ret = 1;
+		}
+	}
+
+	/*
 	 * Ensure we serialize with completion side to prevent UAF with fence going out of scope on
 	 * the stack, since we have no clue if it will fire after the timeout before we can erase
 	 * from the xa. Also we have some dependent loads and stores below for which we need the
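
The control flow added by this hunk (wait with a timeout, flush the worker on timeout, then re-check the fence) can be illustrated with a small single-threaded userspace model. This is only a sketch and not kernel code: every `toy_*` name is hypothetical, the "scheduler delay" is reduced to a boolean parameter, and the real `wait_event_timeout()` and `flush_work()` are merely imitated by their return-value conventions (nonzero when the condition became true, 0 on timeout; `true` when pending work had to be run).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the on-stack g2h_fence. */
struct toy_fence {
	bool done;
};

/* Hypothetical stand-in for a queued work item such as ct->g2h_worker. */
struct toy_work {
	bool pending;            /* queued but not yet executed */
	struct toy_fence *fence; /* fence the handler completes */
};

/* Work handler: consumes the response and marks the fence done. */
static void toy_worker_fn(struct toy_work *w)
{
	w->fence->done = true;
	w->pending = false;
}

/*
 * Imitates wait_event_timeout(): ran_in_time models whether the
 * scheduler let the worker run before the timeout expired.
 * Returns nonzero if the fence is done, 0 on timeout.
 */
static int toy_wait_event_timeout(struct toy_work *w, bool ran_in_time)
{
	if (ran_in_time && w->pending)
		toy_worker_fn(w);
	return w->fence->done ? 1 : 0;
}

/* Imitates flush_work(): run still-pending work to completion. */
static bool toy_flush_work(struct toy_work *w)
{
	if (!w->pending)
		return false;
	toy_worker_fn(w);
	return true;
}

/*
 * Mirrors the shape of the patched wait path. response_queued models
 * whether a response arrived at all (i.e. the work was queued).
 * Returns nonzero on success, 0 on a genuine timeout.
 */
static int toy_send_recv(bool response_queued, bool ran_in_time)
{
	struct toy_fence fence = { .done = false };
	struct toy_work work = { .pending = response_queued, .fence = &fence };
	int ret;

	ret = toy_wait_event_timeout(&work, ran_in_time);
	if (!ret) {
		/* Timed out: give a delayed worker one last chance. */
		toy_flush_work(&work);
		if (fence.done) {
			fprintf(stderr, "toy fence done only after flush\n");
			ret = 1;
		}
	}
	return ret;
}
```

In this model, `toy_send_recv(true, false)` corresponds to the case described in the comment: the response was queued but the worker was scheduled late, so the flush rescues the request and a warning is logged. `toy_send_recv(false, false)` remains a genuine timeout, because flushing an idle work item does not complete the fence.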