
drm/xe/pm: Temporarily disable D3Cold on BMG

Currently, many instability cases related to the D3Cold -> D0 transition
on BMG are under investigation. Among them are some bad cases where
the device is lost after 1 to 3 transitions from D3Cold to D0
under runtime PM, with a link retrain failure on the pcieport upstream
bridge port.

In other cases, the transition works, but sudden random memory
corruption follows D3Cold, which can show up as a 0xffff missed ack on
GT forcewake or as GuC reload related failures.

In some other cases though, D3Cold -> D0 works pretty reliably.
At this point, the outcome appears to depend on the combination of GPU
card and host board, so no quirk is available at this time.

This patch disables D3Cold by default on BMG by reducing the
vram_d3cold_threshold to 0. Users and developers who want to enable
it can still do so via:
$ echo 300 > /sys/bus/pci/devices/<addr>/vram_d3cold_threshold
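
To illustrate why a threshold of 0 disables D3Cold, here is a minimal standalone sketch. It is not the driver code: the names prefixed with model_ are hypothetical, the default of 300 MB is assumed from the echo example above, and the gating rule (D3Cold is allowed only while VRAM usage is strictly below the threshold) is an assumption about how the threshold is consumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's platform enum and default value. */
enum model_platform { MODEL_OTHER, MODEL_BATTLEMAGE };

/* Assumed default in MB; mirrors the 300 used in the echo example. */
#define MODEL_DEFAULT_VRAM_THRESHOLD_MB 300u

/* Same shape as vram_threshold_value() in the patch: BMG gets 0. */
static uint32_t model_threshold_value(enum model_platform platform)
{
	if (platform == MODEL_BATTLEMAGE)
		return 0;

	return MODEL_DEFAULT_VRAM_THRESHOLD_MB;
}

/*
 * Assumed gating rule: D3Cold is allowed only while VRAM usage is
 * strictly below the threshold. With an unsigned threshold of 0,
 * no usage value can satisfy the comparison, so D3Cold stays off.
 */
static bool model_d3cold_allowed(uint32_t vram_used_mb, uint32_t threshold_mb)
{
	return vram_used_mb < threshold_mb;
}
```

Under these assumptions, model_d3cold_allowed(0, model_threshold_value(MODEL_BATTLEMAGE)) is false even with no VRAM in use, while writing a nonzero value such as 300 to the sysfs file restores the usual behavior.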

Fixes: 3adcf970dc7e ("drm/xe/bmg: Drop force_probe requirement")
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4395
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4396
Cc: Karthik Poosa <karthik.poosa@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250308005636.1475420-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit d945cc876277851053c0cf37927c8d7bd9d0e880)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

+12 -1
drivers/gpu/drm/xe/xe_pm.c
···
 }
 ALLOW_ERROR_INJECTION(xe_pm_init_early, ERRNO); /* See xe_pci_probe() */
 
+static u32 vram_threshold_value(struct xe_device *xe)
+{
+	/* FIXME: D3Cold temporarily disabled by default on BMG */
+	if (xe->info.platform == XE_BATTLEMAGE)
+		return 0;
+
+	return DEFAULT_VRAM_THRESHOLD;
+}
+
 /**
  * xe_pm_init - Initialize Xe Power Management
  * @xe: xe device instance
···
  */
 int xe_pm_init(struct xe_device *xe)
 {
+	u32 vram_threshold;
 	int err;
 
 	/* For now suspend/resume is only allowed with GuC */
···
 		if (err)
 			return err;
 
-		err = xe_pm_set_vram_threshold(xe, DEFAULT_VRAM_THRESHOLD);
+		vram_threshold = vram_threshold_value(xe);
+		err = xe_pm_set_vram_threshold(xe, vram_threshold);
 		if (err)
 			return err;
 	}