Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Documentation/gpu: Document how to narrow down display issues

The amdgpu driver is composed of multiple components, each of which can
be a source of some specific problem that the user/developer can see.
This commit introduces steps to narrow down and collect display
information.

Cc: Leo Li <sunpeng.li@amd.com>
Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
Cc: Hamza Mahfooz <hamza.mahfooz@amd.com>
Cc: Harry Wentland <harry.wentland@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Authored by Rodrigo Siqueira and committed by Alex Deucher
dec36b22 3c0be69b

+187
Documentation/gpu/amdgpu/display/dc-debug.rst
···
 Display Core Debug tools
 ========================
 
+In this section, you will find helpful information on debugging the amdgpu
+driver from the display perspective. This page introduces debug mechanisms and
+procedures to help you identify if some issues are related to display code.
+
+Narrow down display issues
+==========================
+
+Since the display is the driver's visual component, it is common to see users
+reporting issues as display issues when another component causes the problem. This
+section equips users to determine if a specific issue was caused by the display
+component or another part of the driver.
+
+DC dmesg important messages
+---------------------------
+
+The dmesg log is the first source of information to be checked, and amdgpu
+takes advantage of this feature by logging some valuable information. When
+looking for the issues associated with amdgpu, remember that each component of
+the driver (e.g., smu, PSP, dm, etc.) is loaded one by one, and this
+information can be found in the dmesg log. In this sense, look for the part of
+the log that looks like the below log snippet::
+
+  [ 4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
+  [ 4.254718] [drm] register mmio base: 0xFCB00000
+  [ 4.254918] [drm] register mmio size: 1048576
+  [ 4.260095] [drm] add ip block number 0 <soc21_common>
+  [ 4.260318] [drm] add ip block number 1 <gmc_v11_0>
+  [ 4.260510] [drm] add ip block number 2 <ih_v6_0>
+  [ 4.260696] [drm] add ip block number 3 <psp>
+  [ 4.260878] [drm] add ip block number 4 <smu>
+  [ 4.261057] [drm] add ip block number 5 <dm>
+  [ 4.261231] [drm] add ip block number 6 <gfx_v11_0>
+  [ 4.261402] [drm] add ip block number 7 <sdma_v6_0>
+  [ 4.261568] [drm] add ip block number 8 <vcn_v4_0>
+  [ 4.261729] [drm] add ip block number 9 <jpeg_v4_0>
+  [ 4.261887] [drm] add ip block number 10 <mes_v11_0>
+
+From the above example, you can see the line that reports that `<dm>`,
+(**Display Manager**), was loaded, which means that display can be part of the
+issue. If you do not see that line, something else might have failed before
+amdgpu loads the display component, indicating that we don't have a
+display issue.
+
+After you have identified that the DM was loaded correctly, you can check for the
+display version of the hardware in use, which can be retrieved from the dmesg
+log with the command::
+
+  dmesg | grep -i 'display core'
+
+This command shows a message that looks like this::
+
+  [ 4.655828] [drm] Display Core v3.2.285 initialized on DCN 3.2
+
+This message has two key pieces of information:
+
+* **The DC version (e.g., v3.2.285)**: Display developers release a new DC version
+  every week, and this information can be advantageous in a situation where a
+  user/developer must find a good point versus a bad point based on a tested
+  version of the display code. Remember from page :ref:`Display Core <amdgpu-display-core>`,
+  that every week the new patches for display are heavily tested with IGT and
+  manual tests.
+* **The DCN version (e.g., DCN 3.2)**: The DCN block is associated with the
+  hardware generation, and the DCN version conveys the hardware generation that
+  the driver is currently running. This information helps to narrow down the
+  code debug area since each DCN version has its files in the DC folder per DCN
+  component (from the example, the developer might want to focus on
+  files/folders/functions/structs with the dcn32 label).
+  However, keep in mind that DC reuses code across different DCN versions; for
+  example, it is expected to have some callbacks set in one DCN that are the same
+  as those from another DCN. In summary, use the DCN version just as a guide.
+
+From the dmesg file, it is also possible to get the ATOM bios code by using::
+
+  dmesg | grep -i 'ATOM BIOS'
+
+Which generates an output that looks like this::
+
+  [ 4.274534] amdgpu: ATOM BIOS: 113-D7020100-102
+
+This type of information is useful to include in a bug report.
+
+Avoid loading display core
+--------------------------
+
+Sometimes, it might be hard to figure out which part of the driver is causing
+the issue; if you suspect that the display is not part of the problem and your
+bug scenario is simple (e.g., some desktop configuration) you can try to remove
+the display component from the equation. First, you need to identify `dm` ID
+from the dmesg log; for example, search for the following log::
+
+  [ 4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
+  [..]
+  [ 4.260095] [drm] add ip block number 0 <soc21_common>
+  [ 4.260318] [drm] add ip block number 1 <gmc_v11_0>
+  [..]
+  [ 4.261057] [drm] add ip block number 5 <dm>
+
+Notice from the above example that the `dm` id is 5 for this specific hardware.
+Next, you need to run the following binary operation to identify the IP block
+mask::
+
+  0xffffffff & ~(1 << [DM ID])
+
+From our example the IP mask is::
+
+  0xffffffff & ~(1 << 5) = 0xffffffdf
+
+Finally, to disable DC, you just need to set the below parameter in your
+bootloader::
+
+  amdgpu.ip_block_mask=0xffffffdf
+
+If you can boot your system with the DC disabled and still see the issue, it
+means you can rule DC out of the equation. However, if the bug disappears, you
+still need to consider DC as part of the problem and keep narrowing down the
+issue. In some scenarios, disabling DC is impossible since it might be
+necessary to use the display component to reproduce the issue (e.g., play a
+game).
+
+**Note: This will probably lead to the absence of a display output.**
+
+Display flickering
+------------------
+
+Display flickering might have multiple causes; one is the lack of proper power
+to the GPU or problems in the DPM switches. A good first generic verification
+is to set the GPU to use high voltage::
+
+  bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"
+
+The above command sets the GPU/APU to use the maximum power allowed which
+disables DPM switches. If forcing DPM levels high does not fix the issue, it
+is less likely that the issue is related to power management. If the issue
+disappears, there is a good chance that other components might be involved, and
+the display should not be ignored since this could be a DPM issue. From the
+display side, if the power increase fixes the issue, it is worth debugging the
+clock configuration and the pipe split policy used in the specific
+configuration.
+
+Display artifacts
+-----------------
+
+Users may see some screen artifacts that can be categorized into two different
+types: localized artifacts and general artifacts. The localized artifacts
+happen in some specific areas, such as around the UI window corners; if you see
+this type of issue, there is a considerable chance that you have a userspace
+problem, likely Mesa or similar. The general artifacts usually happen on the
+entire screen. They might be caused by a misconfiguration at the driver level
+of the display parameters, but the userspace might also cause this issue. One
+way to identify the source of the problem is to take a screenshot or make a
+desktop video capture when the problem happens; after checking the
+screenshot/video recording, if you don't see any of the artifacts, it means
+that the issue is likely on the driver side. If you can still see the
+problem in the data collected, it is an issue that probably happened during
+rendering, and the display code just got the framebuffer already corrupted.
+
+Disabling/Enabling specific features
+====================================
+
+DC has a struct named `dc_debug_options`, which is statically initialized by
+all DCE/DCN components based on the specific hardware characteristic. This
+structure usually facilitates the bring-up phase since developers can start
+with many disabled features and enable them individually. This is also an
+important debug feature since users can change it when debugging specific
+issues.
+
+For example, dGPU users sometimes see a problem where a horizontal fillet of
+flickering happens in some specific part of the screen. This could be an
+indication of Sub-Viewport issues; after the users have identified the target DCN,
+they can set the `force_disable_subvp` field to true in the statically
+initialized version of `dc_debug_options` to see if the issue gets fixed. Along
+the same lines, users/developers can also try to turn off `fams2_config` and
+`enable_single_display_2to1_odm_policy`. In summary, `dc_debug_options` is
+a useful way to narrow down the problem.
+
 DC Visual Confirmation
 ======================
 
···
 
 When reporting a bug related to DC, consider attaching this log before and
 after you reproduce the bug.
+
+Collect Firmware information
+============================
+
+When reporting issues, it is important to have the firmware information since
+it can be helpful for debugging purposes. To get all the firmware information,
+use the command::
+
+  cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
+
+From the display perspective, pay attention to the firmware of the DMCU and
+DMCUB.
 
 DMUB Firmware Debug
 ===================
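
As a worked example of the IP block mask arithmetic in the "Avoid loading display core" section of the patch, the two steps (find the `dm` block number, then clear that bit in 0xffffffff) can be sketched in shell. This is only an illustrative sketch: the helper names `dm_block_id` and `ip_block_mask` are hypothetical and not part of amdgpu or any existing tool, and the sample log line is the one quoted in the patch.

```shell
#!/bin/sh
# Sketch only: dm_block_id and ip_block_mask are hypothetical helper names
# introduced for illustration; they are not provided by the kernel or amdgpu.

# Read dmesg-style text on stdin and print the IP block number of <dm>.
dm_block_id() {
    sed -n 's/.*add ip block number \([0-9][0-9]*\) <dm>.*/\1/p' | head -n 1
}

# Print 0xffffffff & ~(1 << ID) as an 8-digit hexadecimal mask.
ip_block_mask() {
    printf '0x%08x\n' "$(( 0xffffffff & ~(1 << $1) ))"
}

# Example using the log line quoted in the patch, where <dm> is block 5.
id=$(printf '%s\n' '[ 4.261057] [drm] add ip block number 5 <dm>' | dm_block_id)
echo "amdgpu.ip_block_mask=$(ip_block_mask "$id")"
```

On a live system the sample line could presumably be replaced with `dmesg | dm_block_id`, which may require elevated privileges depending on the distribution's dmesg restrictions. The block number differs between hardware generations, so the mask should be recomputed for each machine rather than copied.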