Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

PCI-Express AER implementation: AER howto document

PCI-Express AER (Advanced Error Reporting) provides more robust error reporting.
This patch series enables kernel support for AER.

The initial patches were written by Tom Long Nguyen. I ported them to the kernel
2.6.18-rc3. Many thanks to Rajesh Shah and Narayanan Chandramouli for their great
review comments and testing help.

Patch 1 consists of the pcieaer-howto.txt document.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


Authored by Zhang, Yanmin and committed by Greg Kroah-Hartman
47402400 20d51660

+253
Documentation/pcieaer-howto.txt
   The PCI Express Advanced Error Reporting Driver Guide HOWTO
		T. Long Nguyen	<tom.l.nguyen@intel.com>
		Yanmin Zhang	<yanmin.zhang@intel.com>
				07/29/2006


1. Overview

1.1 About this guide

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.

1.2 Copyright © Intel Corporation 2006.

1.3 What is the PCI Express AER Driver?

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure that provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

-	Gathers comprehensive error information when errors occur.
-	Reports errors to users.
-	Performs error recovery actions.

The AER driver attaches only to root ports that support the
PCI-Express AER capability.


2. User Guide

2.1 Include the PCI Express AER Root Driver into the Linux Kernel

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver must be
compiled into the kernel. The CONFIG_PCIEAER option enables it. It
depends on CONFIG_PCIEPORTBUS, so please
set CONFIG_PCIEPORTBUS=y and CONFIG_PCIEAER=y.

2.2 Load PCI Express AER Root Driver

Some systems have AER support in the BIOS. Enabling the AER Root
driver on a system whose BIOS also handles AER may result in
unpredictable behavior. To avoid this conflict, a successful load of
the AER Root driver requires ACPI _OSC support in the BIOS, which
allows the AER Root driver to request native control of AER. See the
PCI FW 3.0 Specification for details regarding _OSC usage. Currently,
many firmware implementations use PCI Express but do not provide _OSC
support. To support such firmware, forceload, a parameter of type
bool, lets AER initialization continue even though the firmware has no
_OSC support. To enable this workaround, add aerdriver.forceload=y to
the kernel boot parameter line when booting the kernel. Note that
forceload=n by default.

2.3 AER error output

When a PCI-E AER error is captured, an error message is output to the
console. A correctable error is output as a warning; any other error
is printed as an error. Users can therefore choose a log level that
filters out correctable error messages.

Below is an example.
+------ PCI-Express Device Error -----+
Error Severity		: Uncorrected (Fatal)
PCIE Bus Error type	: Transaction Layer
Unsupported Request	: First
Requester ID		: 0500
VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h
TLB Header:
04000001 00200a03 05010000 00050100

In the example, 'Requester ID' is the ID of the device that sent the
error message to the root port. Please refer to the PCI Express
specifications for the other fields.


3. Developer Guide

Enabling AER-aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.
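
To make "provide callbacks" concrete, here is a minimal sketch of the
driver side, based on the err_handler pointer described in section
3.2.2. It is a userspace model, not kernel code: struct pci_dev, the
enum values, and the struct layouts below are simplified stand-ins for
the definitions in the kernel's PCI headers, and the driver name "demo"
and its recovery policy are purely hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins so the sketch compiles outside the kernel tree.
 * A real driver gets these definitions from the kernel's PCI headers. */
struct pci_dev { int unused; };

enum pci_channel_state {
	pci_channel_io_normal,	/* passed for non-fatal errors */
	pci_channel_io_frozen	/* passed for fatal errors */
};

typedef enum {
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED
} pci_ers_result_t;

/* Only the two callbacks used in this sketch are modelled. */
struct pci_error_handlers {
	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
					   enum pci_channel_state state);
	pci_ers_result_t (*mmio_enabled)(struct pci_dev *dev);
};

struct pci_driver {
	const char *name;
	const struct pci_error_handlers *err_handler;
};

/* Hypothetical policy: ask for a reset only when the channel is frozen. */
static pci_ers_result_t demo_error_detected(struct pci_dev *dev,
					    enum pci_channel_state state)
{
	(void)dev;
	if (state == pci_channel_io_frozen)
		return PCI_ERS_RESULT_NEED_RESET;
	return PCI_ERS_RESULT_CAN_RECOVER;
}

/* Hypothetical: the device needs no further work once MMIO is usable. */
static pci_ers_result_t demo_mmio_enabled(struct pci_dev *dev)
{
	(void)dev;
	return PCI_ERS_RESULT_RECOVERED;
}

static const struct pci_error_handlers demo_err_handler = {
	.error_detected	= demo_error_detected,
	.mmio_enabled	= demo_mmio_enabled,
};

/* A driver without err_handler (NULL) is not recovered; see section 3.4. */
struct pci_driver demo_driver = {
	.name		= "demo",
	.err_handler	= &demo_err_handler,
};
```

A real driver would save and restore device state in these callbacks;
pci-error-recovery.txt defines the full callback set and its semantics.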

To support AER better, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device automatically sends an
error message to the PCIE root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Logging the error includes storing the error
reporting agent's requester ID in the Error Source Identification
Registers and setting the error bits of the Root Error Status
Register accordingly. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.

Note that the errors described above are related to the PCI Express
hierarchy and links.
These errors do not include any device-specific errors, because
device-specific errors are still sent directly to the device driver.

3.1 Configure the AER capability structure

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They can also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
section 3.3.

3.2 Provide callbacks

3.2.1 Callback reset_link to reset the PCI Express link

This callback is used to reset the PCI Express physical link when a
fatal error happens. The root port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.

	pci_ers_result_t (*reset_link) (struct pci_dev *dev);

Section 3.2.2.2 provides more detailed information on when reset_link
is called.

3.2.2 PCI error-recovery callbacks

The PCI Express AER Root driver uses error callbacks to coordinate
with the downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

The struct pci_driver data structure has a pointer, err_handler, that
points to a pci_error_handlers structure, which consists of a set of
callback function pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are called.
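
As a compact summary of the rules in sections 3.2.2.1 and 3.2.2.2, the
sketch below maps an error severity to the actions this guide
describes. It is an illustrative userspace model only: enum
aer_severity, struct aer_recovery_plan, and aer_plan_for are
hypothetical names that do not exist in the AER driver, and enum
pci_channel_state is a simplified stand-in for the kernel definition.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's channel state enum. */
enum pci_channel_state { pci_channel_io_normal, pci_channel_io_frozen };

/* Hypothetical names, used only for this illustration. */
enum aer_severity { AER_SEV_CORRECTABLE, AER_SEV_NONFATAL, AER_SEV_FATAL };

struct aer_recovery_plan {
	int notify_drivers;		/* call error_detected on the hierarchy? */
	enum pci_channel_state state;	/* state passed to error_detected */
	int reset_link;			/* call reset_link at the upstream port? */
};

struct aer_recovery_plan aer_plan_for(enum aer_severity sev)
{
	/* Default: correctable error, only cleared and logged. */
	struct aer_recovery_plan plan = { 0, pci_channel_io_normal, 0 };

	switch (sev) {
	case AER_SEV_CORRECTABLE:
		break;
	case AER_SEV_NONFATAL:
		/* Drivers are notified; no link reset is required. */
		plan.notify_drivers = 1;
		break;
	case AER_SEV_FATAL:
		/* Drivers see a frozen channel, then the link is reset. */
		plan.notify_drivers = 1;
		plan.state = pci_channel_io_frozen;
		plan.reset_link = 1;
		break;
	}
	return plan;
}
```

For example, aer_plan_for(AER_SEV_FATAL) yields a plan that notifies
drivers with pci_channel_io_frozen and requests a link reset, while
aer_plan_for(AER_SEV_CORRECTABLE) requests neither.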

3.2.2.1 Correctable errors

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

3.2.2.2 Non-correctable (non-fatal and fatal) errors

If an error message indicates a non-fatal error, performing a link
reset at the upstream port is not required. The AER driver calls
error_detected(dev, pci_channel_io_normal) for all drivers associated
with the hierarchy in question. For example:
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover or whether the AER driver should call
mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Then, performing a link reset at the
upstream port is necessary. As different kinds of devices might use
different approaches to reset the link, the AER port service driver is
required to provide the function that resets the link. First, the
kernel checks whether the upstream component has an AER driver. If it
does, the kernel uses the reset_link callback of that AER driver. If
the upstream component has no AER driver and the port is a downstream
port, the kernel uses the AER driver of the root port that reports the
AER error. Upstream ports should provide their own AER service drivers
with a reset_link function.
If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
to mmio_enabled.

3.3 Helper functions

3.3.1 int pci_find_aer_capability(struct pci_dev *dev);
pci_find_aer_capability locates the PCI Express AER capability
in the device configuration space. If the device doesn't support
PCI-Express AER, the function returns 0.

3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected. Note that devices
don't enable error reporting by default, so device drivers need to
call this function to enable it.

3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting stops the device from sending error
messages to the root port when an error is detected.

3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable
error status register.

3.4 Frequently Asked Questions

Q: What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?

A: The devices attached to the driver won't be recovered. If the
error is fatal, the kernel will print warning messages. Please refer
to section 3 for more information.

Q: What happens if an upstream port service driver does not provide
the reset_link callback?

A: Fatal error recovery will fail if the errors are reported by the
upstream ports to which the service driver is attached.

Q: How does this infrastructure deal with a driver that is not PCI
Express aware?

A: This infrastructure calls the error callback functions of the
driver when an error happens. But if the driver is not aware of
PCI Express, the device might not report its own errors to the root
port.

Q: What modifications will that driver need to make it compatible
with the PCI Express AER Root driver?

A: It can call the helper functions to enable AER in its devices and
to clean up the uncorrectable error status register. Please refer to
section 3.3.
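
As an illustration of the last answer, the sketch below shows one way a
driver might call the section 3.3 helpers from its probe and remove
paths. It is a userspace model under stated assumptions: struct pci_dev
is a stand-in, the helper bodies are invented stubs that only reuse the
signatures listed in section 3.3, the convention that the enable and
disable helpers return 0 on success is assumed, and demo_probe and
demo_remove are hypothetical names.

```c
#include <assert.h>

/* Stand-in device: just enough state to observe the helper calls. */
struct pci_dev {
	int aer_cap_pos;	/* 0 means the device has no AER capability */
	int reporting_on;	/* set while error reporting is enabled */
};

/* Stubs with the signatures from section 3.3; the bodies are invented. */
int pci_find_aer_capability(struct pci_dev *dev)
{
	return dev->aer_cap_pos;
}

int pci_enable_pcie_error_reporting(struct pci_dev *dev)
{
	dev->reporting_on = 1;
	return 0;	/* assumed success value */
}

int pci_disable_pcie_error_reporting(struct pci_dev *dev)
{
	dev->reporting_on = 0;
	return 0;	/* assumed success value */
}

/* Hypothetical probe path: enable reporting only if AER is present. */
int demo_probe(struct pci_dev *dev)
{
	if (!pci_find_aer_capability(dev))
		return 0;	/* no AER capability: continue without it */
	return pci_enable_pcie_error_reporting(dev);
}

/* Hypothetical remove path: stop the device from reporting errors. */
void demo_remove(struct pci_dev *dev)
{
	pci_disable_pcie_error_reporting(dev);
}
```

Treating a missing AER capability as non-fatal in probe is a design
choice of this sketch, so the same driver still works on devices
without AER.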