Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drm/amdgpu: Improve RAS documentation (v2)

Clarify some areas, clean up formatting, add section for
unrecoverable error handling.

v2: fix grammatical errors

Reviewed-by: Yong Zhao <yong.zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

+68 -7
+35
Documentation/gpu/amdgpu.rst
···
 AMDGPU RAS Support
 ==================

+The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and
+debugfs (for error injection).
+
 RAS debugfs/sysfs Control and Error Injection Interfaces
 --------------------------------------------------------

 .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
    :doc: AMDGPU RAS debugfs control interface
+
+RAS Reboot Behavior for Unrecoverable Errors
+--------------------------------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors

 RAS Error Count sysfs Interface
 -------------------------------
···

 .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
    :internal:
+
+Sample Code
+-----------
+Sample code for testing error injection can be found here:
+https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
+
+This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
+There are four sets of tests:
+
+RAS Basic Test
+
+The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files
+are present.
+
+RAS Query Test
+
+This test checks the RAS availability and enablement status for each supported IP block as well as
+the error counts.
+
+RAS Inject Test
+
+This test injects errors for each IP.
+
+RAS Disable Test
+
+This test tests disabling of RAS features for each IP block.


 GPU Power/Thermal Controls and Monitoring
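As a rough illustration of what the "RAS Basic Test" description in this hunk amounts to, the shell sketch below checks that a list of expected files exists under a given RAS directory. It is not the libdrm test itself: the function name `ras_files_present`, its arguments, and whichever file names are passed to it are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper, not part of libdrm or the kernel.
# Usage: ras_files_present DIR FILE...
# Returns 0 if every FILE exists under DIR; otherwise prints each missing
# path to stderr and returns 1.
ras_files_present() {
    dir=$1
    shift
    missing=0
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

On a machine with an amdgpu device this might be called as `ras_files_present /sys/kernel/debug/dri/0/ras ras_ctrl auto_reboot` for debugfs and `ras_files_present /sys/class/drm/card0/device/ras gfx_err_count` for sysfs. The card index 0 and the gfx block are examples, and placing ras_ctrl next to auto_reboot is an assumption based on the paths quoted in this commit.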
+33 -7
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
···
  * As their names indicate, inject operation will write the
  * value to the address.
  *
- * Second member: struct ras_debug_if::op.
+ * The second member: struct ras_debug_if::op.
  * It has three kinds of operations.
  *
  * - 0: disable RAS on the block. Take ::head as its data.
···
  * - 2: inject errors on the block. Take ::inject as its data.
  *
  * How to use the interface?
- * programs:
- * copy the struct ras_debug_if in your codes and initialize it.
- * write the struct to the control node.
+ *
+ * Programs
+ *
+ * Copy the struct ras_debug_if in your codes and initialize it.
+ * Write the struct to the control node.
+ *
+ * Shells
  *
  * .. code-block:: bash
  *
  *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
+ *
+ * Parameters:
  *
  * op: disable, enable, inject
  *	disable: only block is needed
···
  * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
  *
  * .. note::
- *	Operation is only allowed on blocks which are supported.
+ *	Operations are only allowed on blocks which are supported.
  *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
+ *	to see which blocks support RAS on a particular asic.
+ *
  */
 static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
 		size_t size, loff_t *pos)
···
  * DOC: AMDGPU RAS debugfs EEPROM table reset interface
  *
  * Some boards contain an EEPROM which is used to persistently store a list of
- * bad pages containing ECC errors detected in vram. This interface provides
+ * bad pages which experiences ECC errors in vram. This interface provides
  * a way to reset the EEPROM, e.g., after testing error injection.
  *
  * Usage:
···
 /**
  * DOC: AMDGPU RAS sysfs Error Count Interface
  *
- * It allows user to read the error count for each IP block on the gpu through
+ * It allows the user to read the error count for each IP block on the gpu through
  * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
  *
  * It outputs the multiple lines which report the uncorrected (ue) and corrected
···
 }
 /* sysfs end */

+/**
+ * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
+ *
+ * Normally when there is an uncorrectable error, the driver will reset
+ * the GPU to recover. However, in the event of an unrecoverable error,
+ * the driver provides an interface to reboot the system automatically
+ * in that event.
+ *
+ * The following file in debugfs provides that interface:
+ * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
+ *
+ * Usage:
+ *
+ * .. code-block:: bash
+ *
+ *	echo true > .../ras/auto_reboot
+ *
+ */
 /* debugfs begin */
 static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
 {
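
The shell form of the control interface documented above, `echo op block [error [sub_block address value]] > .../ras/ras_ctrl`, can be wrapped in a small helper that rejects malformed requests before anything reaches the control node. This is a minimal sketch, not driver or libdrm code. The function names `ras_ctrl_line` and `ras_ctrl_write` are hypothetical, and the argument counts for enable and inject are assumptions read off the bracket nesting in the usage string, since the quoted comment only states that disable needs the block alone.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the kernel or libdrm.

# Usage: ras_ctrl_line OP BLOCK [ERROR [SUB_BLOCK ADDRESS VALUE]]
# Prints the line to be written to ras_ctrl, or returns 1 when the number
# of arguments does not fit OP.
#   disable: OP BLOCK                 (stated in the driver comment)
#   enable:  OP BLOCK ERROR           (assumed from the usage string)
#   inject:  all six fields           (assumed from the usage string)
ras_ctrl_line() {
    case "$1" in
        disable) [ "$#" -eq 2 ] || return 1 ;;
        enable)  [ "$#" -eq 3 ] || return 1 ;;
        inject)  [ "$#" -eq 6 ] || return 1 ;;
        *)       return 1 ;;
    esac
    echo "$*"
}

# Usage: ras_ctrl_write NODE OP BLOCK [...]
# NODE is the path of the ras_ctrl control node, supplied by the caller.
# Nothing is written if the request is malformed.
ras_ctrl_write() {
    node=$1
    shift
    line=$(ras_ctrl_line "$@") || return 1
    printf '%s\n' "$line" > "$node"
}
```

A call might look like `ras_ctrl_write /sys/kernel/debug/dri/0/ras/ras_ctrl disable gfx`, assuming ras_ctrl lives in the same debugfs ras directory as auto_reboot and that card 0 supports RAS on the gfx block. As the note in the driver comment says, operations are only allowed on supported blocks, so checking /sys/module/amdgpu/parameters/ras_mask first is advisable.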