Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/doc: Document device wedged event

Add documentation for device wedged event in a new "Device wedging"
chapter. This describes basic definitions, prerequisites and consumer
expectations along with an example.

v8: Improve introduction (Christian, Rodrigo)
v9: Add prerequisites section (Christian)
v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)
v11: Reference wedged event in device reset chapter (André)
v12: Refine consumer expectations and terminologies (Xaver, Pekka)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250204070528.1919158-3-raag.jadav@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Authored by Raag Jadav and committed by Rodrigo Vivi
a97bc11b b7cf9f4a

+113 -3
Documentation/gpu/drm-uapi.rst
@@ -371,9 +371,119 @@
 
 Apart from propagating the reset through the stack so apps can recover, it's
 really useful for driver developers to learn more about what caused the reset in
-the first place. DRM devices should make use of devcoredump to store relevant
-information about the reset, so this information can be added to user bug
-reports.
+the first place. For this, drivers can make use of devcoredump to store relevant
+information about the reset and send device wedged event with ``none`` recovery
+method (as explained in "Device Wedging" chapter) to notify userspace, so this
+information can be collected and added to user bug reports.
+
+Device Wedging
+==============
+
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem), which notifies userspace of 'wedged'
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected and has
+become unrecoverable from driver context. Purpose of this implementation is to
+provide drivers a generic way to recover the device with the help of userspace
+intervention, without taking any drastic measures (like resetting or
+re-enumerating the full bus, on which the underlying physical device is sitting)
+in the driver.
+
+A 'wedged' device is basically a device that is declared dead by the driver
+after exhausting all possible attempts to recover it from driver context. The
+uevent is the notification that is sent to userspace along with a hint about
+what could possibly be attempted to recover the device from userspace and bring
+it back to usable state. Different drivers may have different ideas of a
+'wedged' device depending on hardware implementation of the underlying physical
+device, and hence the vendor agnostic nature of the event. It is up to the
+drivers to decide when they see the need for device recovery and how they want
+to recover from the available methods.
+
+Driver prerequisites
+--------------------
+
+The driver, before opting for recovery, needs to make sure that the 'wedged'
+device doesn't harm the system as a whole by taking care of the prerequisites.
+Necessary actions must include disabling DMA to system memory as well as any
+communication channels with other devices. Further, the driver must ensure
+that all dma_fences are signalled and any device state that the core kernel
+might depend on is cleaned up. All existing mmaps should be invalidated and
+page faults should be redirected to a dummy page. Once the event is sent, the
+device must be kept in 'wedged' state until the recovery is performed. New
+accesses to the device (IOCTLs) should be rejected, preferably with an error
+code that resembles the type of failure the device has encountered. This will
+signify the reason for wedging, which can be reported to the application if
+needed.
+
+Recovery
+--------
+
+Current implementation defines three recovery methods, out of which, drivers
+can use any one, multiple or none. Method(s) of choice will be sent in the
+uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
+more side-effects. If driver is unsure about recovery or method is unknown
+(like soft/hard system reboot, firmware flashing, physical device replacement
+or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
+will be sent instead.
+
+Userspace consumers can parse this event and attempt recovery as per the
+following expectations.
+
+    =============== ========================================
+    Recovery method Consumer expectations
+    =============== ========================================
+    none            optional telemetry collection
+    rebind          unbind + bind driver
+    bus-reset       unbind + bus reset/re-enumeration + bind
+    unknown         consumer policy
+    =============== ========================================
+
+The only exception to this is ``WEDGED=none``, which signifies that the device
+was temporarily 'wedged' at some point but was recovered from driver context
+using device specific methods like reset. No explicit recovery is expected from
+the consumer in this case, but it can still take additional steps like gathering
+telemetry information (devcoredump, syslog). This is useful because the first
+hang is usually the most critical one which can result in consequential hangs or
+complete wedging.
+
+Consumer prerequisites
+----------------------
+
+It is the responsibility of the consumer to make sure that the device or its
+resources are not in use by any process before attempting recovery. With IOCTLs
+erroring out, all device memory should be unmapped and file descriptors should
+be closed to prevent leaks or undefined behaviour. The idea here is to clear the
+device of all user context beforehand and set the stage for a clean recovery.
+
+Example
+-------
+
+Udev rule::
+
+    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
+    RUN+="/path/to/rebind.sh $env{DEVPATH}"
+
+Recovery script::
+
+    #!/bin/sh
+
+    DEVPATH=$(readlink -f /sys/$1/device)
+    DEVICE=$(basename $DEVPATH)
+    DRIVER=$(readlink -f $DEVPATH/driver)
+
+    echo -n $DEVICE > $DRIVER/unbind
+    echo -n $DEVICE > $DRIVER/bind
+
+Customization
+-------------
+
+Although basic recovery is possible with a simple script, consumers can define
+custom policies around recovery. For example, if the driver supports multiple
+recovery methods, consumers can opt for the suitable one depending on scenarios
+like repeat offences or vendor specific failures. Consumers can also choose to
+have the device available for debugging or telemetry collection and base their
+recovery decision on the findings. This is useful especially when the driver is
+unsure about recovery or method is unknown.
 
 .. _drm_driver_ioctl:
 
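Note on consumer policy (not part of the patch): the "Recovery" and "Customization" sections in the diff above say that ``WEDGED`` may carry several comma-separated methods ordered from fewer to more side effects, and that consumers may choose among them. The following is a minimal sketch of such a choice in POSIX shell. The function name `pick_method` and the idea of passing a space-separated list of consumer-supported methods are assumptions made for illustration; neither comes from the patch.

```shell
#!/bin/sh
# Hypothetical helper (illustration only, not from the patch).
#
# $1: value of the WEDGED uevent variable, e.g. "rebind,bus-reset"
# $2: space-separated recovery methods this consumer is willing to perform
#
# Prints the first advertised method the consumer supports and returns 0.
# Because methods are advertised in order of increasing side effects, the
# first match is the least disruptive option. Returns 1 when nothing matches,
# which is also the outcome for "none" and "unknown" unless the caller lists
# them as supported.
pick_method() {
    pm_advertised=$1
    pm_supported=" $2 "

    # Turn the comma-separated list into whitespace-separated words.
    for pm_method in $(printf '%s' "$pm_advertised" | tr ',' ' '); do
        case "$pm_supported" in
            *" $pm_method "*)
                printf '%s\n' "$pm_method"
                return 0
                ;;
        esac
    done
    return 1
}
```

A recovery script could call, for instance, `pick_method "$WEDGED" "rebind"` and fall back to telemetry collection or an administrator-defined policy when the function returns non-zero.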