Documentation: PCI: convert pci-error-recovery.txt to reST

+1

Documentation/PCI/index.rst

···
    pci-iov-howto
    msi-howto
    acpi-info
+   pci-error-recovery

+141 -130

Documentation/PCI/pci-error-recovery.txt → Documentation/PCI/pci-error-recovery.rst

···
+.. SPDX-License-Identifier: GPL-2.0
 
-PCI Error Recovery
-------------------
-February 2, 2006
+==================
+PCI Error Recovery
+==================
 
-Current document maintainer:
-Linas Vepstas <linasvepstas@gmail.com>
-updated by Richard Lary <rlary@us.ibm.com>
-and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009
+
+:Authors: - Linas Vepstas <linasvepstas@gmail.com>
+          - Richard Lary <rlary@us.ibm.com>
+          - Mike Mason <mmlnx@us.ibm.com>
 
 
 Many PCI bus controllers are able to detect a variety of hardware
···
 
 
 Detailed Design
----------------
+===============
+
 Design and implementation details below, based on a chain of
 public email discussions with Ben Herrenschmidt, circa 5 April 2005.
 
···
 and the actual recovery steps taken are platform dependent. The
 arch/powerpc implementation will simulate a PCI hotplug remove/add.
 
-This structure has the form:
-struct pci_error_handlers
-{
-    int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
-    int (*mmio_enabled)(struct pci_dev *dev);
-    int (*slot_reset)(struct pci_dev *dev);
-    void (*resume)(struct pci_dev *dev);
-};
+This structure has the form::
 
-The possible channel states are:
-enum pci_channel_state {
-    pci_channel_io_normal, /* I/O channel is in normal state */
-    pci_channel_io_frozen, /* I/O to channel is blocked */
-    pci_channel_io_perm_failure, /* PCI card is dead */
-};
+    struct pci_error_handlers
+    {
+        int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
+        int (*mmio_enabled)(struct pci_dev *dev);
+        int (*slot_reset)(struct pci_dev *dev);
+        void (*resume)(struct pci_dev *dev);
+    };
 
-Possible return values are:
-enum pci_ers_result {
-    PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */
-    PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
-    PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
-    PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
-    PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
-};
+The possible channel states are::
+
+    enum pci_channel_state {
+        pci_channel_io_normal, /* I/O channel is in normal state */
+        pci_channel_io_frozen, /* I/O to channel is blocked */
+        pci_channel_io_perm_failure, /* PCI card is dead */
+    };
+
+Possible return values are::
+
+    enum pci_ers_result {
+        PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */
+        PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
+        PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
+        PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
+        PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
+    };
 
 A driver does not have to implement all of these callbacks; however,
 if it implements any, it must implement error_detected(). If a callback
···
 
 All drivers participating in this system must implement this call.
 The driver must return one of the following result codes:
-    - PCI_ERS_RESULT_CAN_RECOVER:
-      Driver returns this if it thinks it might be able to recover
-      the HW by just banging IOs or if it wants to be given
-      a chance to extract some diagnostic information (see
-      mmio_enable, below).
-    - PCI_ERS_RESULT_NEED_RESET:
-      Driver returns this if it can't recover without a
-      slot reset.
-    - PCI_ERS_RESULT_DISCONNECT:
-      Driver returns this if it doesn't want to recover at all.
+
+  - PCI_ERS_RESULT_CAN_RECOVER
+      Driver returns this if it thinks it might be able to recover
+      the HW by just banging IOs or if it wants to be given
+      a chance to extract some diagnostic information (see
+      mmio_enable, below).
+  - PCI_ERS_RESULT_NEED_RESET
+      Driver returns this if it can't recover without a
+      slot reset.
+  - PCI_ERS_RESULT_DISCONNECT
+      Driver returns this if it doesn't want to recover at all.
 
 The next step taken will depend on the result codes returned by the
 drivers.
···
 If the platform is unable to recover the slot, the next step
 is STEP 6 (Permanent Failure).
 
->>> The current powerpc implementation assumes that a device driver will
->>> *not* schedule or semaphore in this routine; the current powerpc
->>> implementation uses one kernel thread to notify all devices;
->>> thus, if one device sleeps/schedules, all devices are affected.
->>> Doing better requires complex multi-threaded logic in the error
->>> recovery implementation (e.g. waiting for all notification threads
->>> to "join" before proceeding with recovery.) This seems excessively
->>> complex and not worth implementing.
+.. note::
 
->>> The current powerpc implementation doesn't much care if the device
->>> attempts I/O at this point, or not. I/O's will fail, returning
->>> a value of 0xff on read, and writes will be dropped. If more than
->>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
->>> assumes that the device driver has gone into an infinite loop
->>> and prints an error to syslog. A reboot is then required to
->>> get the device working again.
+   The current powerpc implementation assumes that a device driver will
+   *not* schedule or semaphore in this routine; the current powerpc
+   implementation uses one kernel thread to notify all devices;
+   thus, if one device sleeps/schedules, all devices are affected.
+   Doing better requires complex multi-threaded logic in the error
+   recovery implementation (e.g. waiting for all notification threads
+   to "join" before proceeding with recovery.) This seems excessively
+   complex and not worth implementing.
+
+   The current powerpc implementation doesn't much care if the device
+   attempts I/O at this point, or not. I/O's will fail, returning
+   a value of 0xff on read, and writes will be dropped. If more than
+   EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
+   assumes that the device driver has gone into an infinite loop
+   and prints an error to syslog. A reboot is then required to
+   get the device working again.
 
 STEP 2: MMIO Enabled
--------------------
+--------------------
 The platform re-enables MMIO to the device (but typically not the
 DMA), and then calls the mmio_enabled() callback on all affected
 device drivers.
···
 without a slot reset or a link reset, it will not call this callback, and
 instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
 
->>> The following is proposed; no platform implements this yet:
->>> Proposal: All I/O's should be done _synchronously_ from within
->>> this callback, errors triggered by them will be returned via
->>> the normal pci_check_whatever() API, no new error_detected()
->>> callback will be issued due to an error happening here. However,
->>> such an error might cause IOs to be re-blocked for the whole
->>> segment, and thus invalidate the recovery that other devices
->>> on the same segment might have done, forcing the whole segment
->>> into one of the next states, that is, link reset or slot reset.
+.. note::
+
+   The following is proposed; no platform implements this yet:
+   Proposal: All I/O's should be done _synchronously_ from within
+   this callback, errors triggered by them will be returned via
+   the normal pci_check_whatever() API, no new error_detected()
+   callback will be issued due to an error happening here. However,
+   such an error might cause IOs to be re-blocked for the whole
+   segment, and thus invalidate the recovery that other devices
+   on the same segment might have done, forcing the whole segment
+   into one of the next states, that is, link reset or slot reset.
 
 The driver should return one of the following result codes:
-        - PCI_ERS_RESULT_RECOVERED
-          Driver returns this if it thinks the device is fully
-          functional and thinks it is ready to start
-          normal driver operations again. There is no
-          guarantee that the driver will actually be
-          allowed to proceed, as another driver on the
-          same segment might have failed and thus triggered a
-          slot reset on platforms that support it.
+  - PCI_ERS_RESULT_RECOVERED
+      Driver returns this if it thinks the device is fully
+      functional and thinks it is ready to start
+      normal driver operations again. There is no
+      guarantee that the driver will actually be
+      allowed to proceed, as another driver on the
+      same segment might have failed and thus triggered a
+      slot reset on platforms that support it.
 
-        - PCI_ERS_RESULT_NEED_RESET
-          Driver returns this if it thinks the device is not
-          recoverable in its current state and it needs a slot
-          reset to proceed.
+  - PCI_ERS_RESULT_NEED_RESET
+      Driver returns this if it thinks the device is not
+      recoverable in its current state and it needs a slot
+      reset to proceed.
 
-        - PCI_ERS_RESULT_DISCONNECT
-          Same as above. Total failure, no recovery even after
-          reset driver dead. (To be defined more precisely)
+  - PCI_ERS_RESULT_DISCONNECT
+      Same as above. Total failure, no recovery even after
+      reset driver dead. (To be defined more precisely)
 
 The next step taken depends on the results returned by the drivers.
 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
···
 Drivers for multi-function cards will need to coordinate among
 themselves as to which driver instance will perform any "one-shot"
 or global device initialization. For example, the Symbios sym53cxx2
-driver performs device init only from PCI function 0:
+driver performs device init only from PCI function 0::
 
-+    if (PCI_FUNC(pdev->devfn) == 0)
-+        sym_reset_scsi_bus(np, 0);
+    +    if (PCI_FUNC(pdev->devfn) == 0)
+    +        sym_reset_scsi_bus(np, 0);
 
-    Result codes:
-    - PCI_ERS_RESULT_DISCONNECT
-    Same as above.
+Result codes:
+    - PCI_ERS_RESULT_DISCONNECT
+      Same as above.
 
 Drivers for PCI Express cards that require a fundamental reset must
 set the needs_freset bit in the pci_dev structure in their probe function.
 For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
-PCI card types:
+PCI card types::
 
-+    /* Set EEH reset type to fundamental if required by hba */
-+    if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
-+        pdev->needs_freset = 1;
-+
+    +    /* Set EEH reset type to fundamental if required by hba */
+    +    if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
+    +        pdev->needs_freset = 1;
+    +
 
 Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
 Failure).
 
->>> The current powerpc implementation does not try a power-cycle
->>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
->>> However, it probably should.
+.. note::
+
+   The current powerpc implementation does not try a power-cycle
+   reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
+   However, it probably should.
 
 
 STEP 5: Resume Operations
···
 That is, the recovery API only requires that:
 
 - There is no guarantee that interrupt delivery can proceed from any
-    device on the segment starting from the error detection and until the
-    slot_reset callback is called, at which point interrupts are expected
-    to be fully operational.
+  device on the segment starting from the error detection and until the
+  slot_reset callback is called, at which point interrupts are expected
+  to be fully operational.
 
 - There is no guarantee that interrupt delivery is stopped, that is,
-    a driver that gets an interrupt after detecting an error, or that detects
-    an error within the interrupt handler such that it prevents proper
-    ack'ing of the interrupt (and thus removal of the source) should just
-    return IRQ_NOTHANDLED. It's up to the platform to deal with that
-    condition, typically by masking the IRQ source during the duration of
-    the error handling. It is expected that the platform "knows" which
-    interrupts are routed to error-management capable slots and can deal
-    with temporarily disabling that IRQ number during error processing (this
-    isn't terribly complex). That means some IRQ latency for other devices
-    sharing the interrupt, but there is simply no other way. High end
-    platforms aren't supposed to share interrupts between many devices
-    anyway :)
+  a driver that gets an interrupt after detecting an error, or that detects
+  an error within the interrupt handler such that it prevents proper
+  ack'ing of the interrupt (and thus removal of the source) should just
+  return IRQ_NOTHANDLED. It's up to the platform to deal with that
+  condition, typically by masking the IRQ source during the duration of
+  the error handling. It is expected that the platform "knows" which
+  interrupts are routed to error-management capable slots and can deal
+  with temporarily disabling that IRQ number during error processing (this
+  isn't terribly complex). That means some IRQ latency for other devices
+  sharing the interrupt, but there is simply no other way. High end
+  platforms aren't supposed to share interrupts between many devices
+  anyway :)
 
->>> Implementation details for the powerpc platform are discussed in
->>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
+.. note::
 
->>> As of this writing, there is a growing list of device drivers with
->>> patches implementing error recovery. Not all of these patches are in
->>> mainline yet. These may be used as "examples":
->>>
->>> drivers/scsi/ipr
->>> drivers/scsi/sym53c8xx_2
->>> drivers/scsi/qla2xxx
->>> drivers/scsi/lpfc
->>> drivers/next/bnx2.c
->>> drivers/next/e100.c
->>> drivers/net/e1000
->>> drivers/net/e1000e
->>> drivers/net/ixgb
->>> drivers/net/ixgbe
->>> drivers/net/cxgb3
->>> drivers/net/s2io.c
->>> drivers/net/qlge
+   Implementation details for the powerpc platform are discussed in
+   the file Documentation/powerpc/eeh-pci-error-recovery.txt
 
-The End
--------
+   As of this writing, there is a growing list of device drivers with
+   patches implementing error recovery. Not all of these patches are in
+   mainline yet. These may be used as "examples":
+
+   - drivers/scsi/ipr
+   - drivers/scsi/sym53c8xx_2
+   - drivers/scsi/qla2xxx
+   - drivers/scsi/lpfc
+   - drivers/next/bnx2.c
+   - drivers/next/e100.c
+   - drivers/net/e1000
+   - drivers/net/e1000e
+   - drivers/net/ixgb
+   - drivers/net/ixgbe
+   - drivers/net/cxgb3
+   - drivers/net/s2io.c
+   - drivers/net/qlge

+2 -2

MAINTAINERS

···
 M:	Oliver O'Halloran <oohall@gmail.com>
 L:	linuxppc-dev@lists.ozlabs.org
 S:	Supported
-F:	Documentation/PCI/pci-error-recovery.txt
+F:	Documentation/PCI/pci-error-recovery.rst
 F:	drivers/pci/pcie/aer.c
 F:	drivers/pci/pcie/dpc.c
 F:	drivers/pci/pcie/err.c
···
 M:	Linas Vepstas <linasvepstas@gmail.com>
 L:	linux-pci@vger.kernel.org
 S:	Supported
-F:	Documentation/PCI/pci-error-recovery.txt
+F:	Documentation/PCI/pci-error-recovery.rst
 
 PCI MSI DRIVER FOR ALTERA MSI IP
 M:	Ley Foon Tan <lftan@altera.com>