                      PCI Bus EEH Error Recovery
                      --------------------------
                           Linas Vepstas
                       <linas@austin.ibm.com>
                          12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions.  These features
go under the name of "EEH", for "Extended Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to corruption of both user data and kernel data,
hung/unresponsive adapters, or system crashes/lockups.  Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections.  The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards or,
unfortunately quite commonly, device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card.  Catching this is a powerful
feature, as it prevents what would otherwise have been silent memory
corruption caused by the bad DMA.  A number of device driver
bugs have been found and fixed in this way over the past few
years.  Other possible causes of EEH errors include data or
address line parity errors (for example, poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
In the following discussion, a generic overview of how to detect
and recover from EEH errors will be presented.  This is followed
by an overview of how the current implementation in the Linux
kernel does it.  The actual implementation is subject to change,
and some of the finer points are still being debated.  These
may in turn be swayed if or when other architectures implement
similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
Isolation applies to accesses to PCI memory, I/O space, and PCI
config space.  Interrupts, however, will continue to be delivered.

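As an illustration of the all-ff's convention described above, the
following minimal C sketch tests whether a value read back from a
device has every bit set for a given access width.  The function name
read_is_all_ones() is hypothetical and is not part of the kernel.  An
all-ones result is only a hint that the slot may be isolated, since a
healthy device can legitimately return the same value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): returns true if the low
 * `width` bytes of `value` are all 0xff.  `width` must be 1, 2 or 4,
 * matching 8/16/32-bit reads.
 */
bool read_is_all_ones(uint32_t value, unsigned width)
{
    assert(width == 1 || width == 2 || width == 4);

    /* Build a mask of 8*width one bits without shifting by 32. */
    uint32_t mask = (width == 4) ? 0xffffffffu
                                 : ((1u << (8 * width)) - 1u);

    return (value & mask) == mask;
}
```

A caller would treat a true result as a reason to ask the firmware
whether the slot is really isolated, not as proof of an error.
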
Detection and recovery are performed with the aid of ppc64
firmware.  The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks.  The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case.  If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card.  Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on).  This is followed by a
reinitialization of the device driver.  In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots.  In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.

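The "give up after a few resets" policy can be sketched as a bounded
retry loop.  This is a user-space model under stated assumptions, not
kernel code: struct slot, slot_reset_and_reinit() and recover_slot()
are hypothetical names, the reset itself is simulated by a counter,
and MAX_RESETS is set to 3 only because the text says "three or four".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RESETS 3   /* assumption: the text says "three or four" */

/* Hypothetical stand-in for a PCI slot; not a kernel structure. */
struct slot {
    int  resets_needed;  /* simulated: resets required before success */
    int  resets_done;    /* resets attempted so far */
    bool marked_dead;    /* set once recovery has been abandoned */
};

/*
 * Simulated "reset, restore config space, re-init driver" step.
 * Returns true once enough resets have been performed.
 */
bool slot_reset_and_reinit(struct slot *s)
{
    s->resets_done++;
    return s->resets_done >= s->resets_needed;
}

/* Try up to MAX_RESETS resets; mark the card dead if none succeeds. */
bool recover_slot(struct slot *s)
{
    assert(s != NULL);

    for (int attempt = 0; attempt < MAX_RESETS; attempt++) {
        if (slot_reset_and_reinit(s))
            return true;
    }

    /* Worst case: report to the sysadmin; the card must be replaced. */
    s->marked_dead = true;
    return false;
}
```

A slot that needs two resets recovers on the second attempt, while one
that never responds is marked dead after the third.
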

Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery.  This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the hotplug/udev
infrastructure.  Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and again whenever a PCI slot is hot-plugged.  The former is performed
by eeh_init() in arch/ppc64/kernel/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 can run with it disabled.  Effectively,
EEH can no longer be turned off.  PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error.  Given an arbitrary address, the routine
pci_get_device_by_addr() will find the PCI device associated
with that address (if any).

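The address-to-device lookup can be pictured as a search over the
registered I/O ranges.  The sketch below is a simplified user-space
model and not the kernel's pci_get_device_by_addr(); the struct
io_range table, the linear search and the integer dev_id (standing in
for a device pointer) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one registered I/O range. */
struct io_range {
    uint64_t start;   /* first address of the range */
    uint64_t len;     /* length of the range in bytes */
    int      dev_id;  /* stand-in for a pointer to the PCI device */
};

/*
 * Return the dev_id whose range contains addr, or -1 if the address
 * does not belong to any registered device.
 */
int find_device_by_addr(const struct io_range *tbl, size_t n, uint64_t addr)
{
    assert(tbl != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Written as a subtraction so start + len cannot overflow. */
        if (addr >= tbl[i].start && addr - tbl[i].start < tbl[i].len)
            return tbl[i].dev_id;
    }
    return -1;
}
```

A linear scan keeps the example short; a real implementation with
many devices would more plausibly use a sorted or tree-based structure.
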
The default include/asm-ppc64/io.h macros readb(), inb(), insb(),
etc. include a check to see if the I/O read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error.  If it is not, processing continues as normal.  The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change).  Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.

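The "check only when the read looks suspicious" pattern can be modelled
in a few lines of C.  This is a user-space sketch and not the contents
of io.h or eeh.c: struct sim_slot, firmware_says_frozen() and
checked_read32() are hypothetical, and the firmware query is replaced
by a flag in the simulated slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated slot; real code would query firmware. */
struct sim_slot {
    uint32_t reg;     /* value a healthy device would return */
    bool     frozen;  /* true if the slot has been EEH-isolated */
};

unsigned long false_positives;  /* all-ones reads that were not errors */
unsigned long slot_freezes;     /* all-ones reads confirmed as errors */

/* Stand-in for asking the firmware whether the slot is isolated. */
bool firmware_says_frozen(const struct sim_slot *s)
{
    return s->frozen;
}

/*
 * 32-bit read with an EEH-style check.  The (expensive) firmware query
 * runs only when the value read back is all ones.
 */
uint32_t checked_read32(const struct sim_slot *s)
{
    assert(s != NULL);

    /* An isolated slot returns all ones regardless of the register. */
    uint32_t val = s->frozen ? 0xffffffffu : s->reg;

    if (val == 0xffffffffu) {
        if (firmware_says_frozen(s))
            slot_freezes++;
        else
            false_positives++;   /* legitimate all-ones value */
    }
    return val;
}
```

The design point is that the common case, a read that returns anything
other than all ones, costs only one extra comparison.
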
If a frozen slot is detected, code in arch/ppc64/kernel/eeh.c will
print a stack trace to syslog (/var/log/messages).  This stack trace
has proven to be very useful to device-driver authors for finding
out at what point the EEH error was detected, as the error itself
usually occurs slightly beforehand.

Next, it uses the Linux kernel notifier chain/work queue mechanism to
allow any interested parties to find out about the failure.  Device
drivers, or other parts of the kernel, can use
eeh_register_notifier(struct notifier_block *) to find out about EEH
events.  The event will include a pointer to the PCI device, the
device node and some state info.  Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.

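For readers unfamiliar with notifier chains, the idea is a linked list
of callbacks that are each handed the same event.  The following is a
deliberately simplified user-space model, not the kernel's
notifier_block API: struct sim_notifier, struct sim_eeh_event and all
function names below are hypothetical, and locking and priorities are
omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event payload; field names are illustrative only. */
struct sim_eeh_event {
    int slot_id;  /* which slot failed */
    int state;    /* some state info about the failure */
};

/* Simplified notifier: a callback plus a link to the next entry. */
struct sim_notifier {
    int (*call)(struct sim_notifier *self, struct sim_eeh_event *ev);
    struct sim_notifier *next;
};

struct sim_notifier *chain_head;  /* empty chain initially */

/* Add a notifier at the head of the chain. */
void sim_register_notifier(struct sim_notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);
    nb->next = chain_head;
    chain_head = nb;
}

/* Hand the event to every registered notifier; return the count. */
int sim_notify_all(struct sim_eeh_event *ev)
{
    int delivered = 0;

    for (struct sim_notifier *nb = chain_head; nb != NULL; nb = nb->next) {
        nb->call(nb, ev);
        delivered++;
    }
    return delivered;
}

int last_slot_seen = -1;

/* Example receiver: just records which slot failed. */
int record_slot(struct sim_notifier *self, struct sim_eeh_event *ev)
{
    (void)self;
    last_slot_seen = ev->slot_id;
    return 0;
}
```

A receiver registers once and is then called for every event; what it
does with the event is entirely up to the receiver.
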
To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge() -- ask firmware to configure any PCI bridges
   located topologically under the PCI slot.
eeh_save_bars() and eeh_restore_bars() -- save and restore the PCI
   config-space info for a device and any devices under it.


A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes hotplug events to go out to user space.  This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for Ethernet cards, and so on.  This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs, and
any bridges underneath.  It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for Ethernet cards).

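The ordering of the handler's steps matters more than any single call,
so it may help to see the sequence laid out.  The sketch below is a
user-space outline only: sim_handle_eeh_event() and the step_trace
helpers are hypothetical, each step merely records its name, and no
real reset or sleep takes place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_STEPS 8

/* Ordered record of the steps taken, for later inspection. */
struct step_trace {
    const char *steps[MAX_STEPS];
    size_t      count;
};

/* Append one step name to the trace. */
void record_step(struct step_trace *t, const char *name)
{
    assert(t != NULL && t->count < MAX_STEPS);
    t->steps[t->count++] = name;
}

/* True if step number i exists and has the given name. */
bool step_is(const struct step_trace *t, size_t i, const char *name)
{
    return i < t->count && strcmp(t->steps[i], name) == 0;
}

/* Hypothetical outline of the handler described in the text. */
void sim_handle_eeh_event(struct step_trace *t)
{
    record_step(t, "save BARs");
    record_step(t, "unconfigure adapter");  /* driver stops, events fire */
    record_step(t, "wait for user space");  /* text: a 5 second sleep */
    record_step(t, "reset slot");
    record_step(t, "restore BARs");
    record_step(t, "configure bridges");
    record_step(t, "enable slot");          /* driver restarts */
}
```

Note that the BARs are saved before the adapter is unconfigured and
restored only after the reset, since the reset clears config space.
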

Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a PCI slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

Following is an example sequence of events that cause a device driver
close function to be called during the first phase of an EEH reset.
The example uses the pcnet32 device driver.

    rpaphp_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in /drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in /drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove()  // in /drivers/pci/pci_driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in /drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in /net/core/dev.c
                      {
                        calls
                        dev_close()  // in /net/core/dev.c
                        {
                           calls dev->stop();
                           which is just pcnet32_close() // in pcnet32.c
                           {
                             which does what you wanted
                             to stop the device
                           }
                        }
                      }
                      which
                      frees pcnet32 device driver memory
                    }
                  }
    }}}}}}}


    in drivers/pci/pci_driver.c,
    struct device_driver->remove() is just pci_device_remove()
    which calls struct pci_driver->remove() which is pcnet32_remove_one()
    which calls unregister_netdev()  (in net/core/dev.c)
    which calls dev_close()  (in net/core/dev.c)
    which calls dev->stop() which is pcnet32_close()
    which then does the appropriate shutdown.

---
Following is the analogous stack trace for events sent to user-space
when the PCI device is unconfigured.

rpaphp_unconfig_pci_adapter() {              // in rpaphp_pci.c
  calls
  pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
    calls
    pci_destroy_dev (struct pci_dev *) {
      calls
      device_unregister (&dev->dev) {      // in /drivers/base/core.c
        calls
        device_del(struct device * dev) {  // in /drivers/base/core.c
          calls
          kobject_del() {                  // in /lib/kobject.c
            calls
            kobject_hotplug() {            // in /lib/kobject.c
              calls
              kset_hotplug() {             // in /lib/kobject.c
                calls
                kset->hotplug_ops->hotplug() which is really just
                a call to
                dev_hotplug() {           // in /drivers/base/core.c
                  calls
                  dev->bus->hotplug() which is really just a call to
                  pci_hotplug () {      // in drivers/pci/hotplug.c
                    which prints device name, etc....
                  }
                }
                then kset_hotplug() calls
                call_usermodehelper () with
                   argv[0]=hotplug_path[] which is "/sbin/hotplug"
                --> event to userspace
              }
            }
            kobject_del() then calls sysfs_remove_dir(), which would
            notify any user-space daemon that was watching /sysfs
            of the delete event.
          }
}}}}}


Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions.  But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-- A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the PCI
   card was being rebooted.

-- A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems.  Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped.  Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed.  Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets.  These are cascaded into a chain of attempted
   resets if a SCSI command fails.  These are completely hidden
   from the block layer.  It would be very natural to add an EEH
   reset into this chain of events.

-- If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.


Conclusions
-----------
There's forward progress ...