Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

PCI/ERR: Update device error_state already after reset

After a Fatal Error has been reported by a device and has been recovered
through a Secondary Bus Reset, AER updates the device's error_state to
pci_channel_io_normal before invoking its driver's ->resume() callback.

By contrast, EEH updates the error_state earlier, namely after resetting
the device and before invoking its driver's ->slot_reset() callback.
Commit c58dc575f3c8 ("powerpc/pseries: Set error_state to
pci_channel_io_normal in eeh_report_reset()") explains in great detail
that the earlier invocation is necessitated by various drivers checking
accessibility of the device with pci_channel_offline() and avoiding
accesses if it returns true. It returns true for any error_state other
than pci_channel_io_normal.
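
The driver pattern described above can be sketched as a small userspace model. This is not kernel code: the `model_` types and functions are hypothetical stand-ins for pci_channel_state_t, struct pci_dev and pci_channel_offline(), and the driver callback is invented purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's pci_channel_state_t. */
enum model_channel_state {
	MODEL_IO_NORMAL = 1,
	MODEL_IO_FROZEN,
	MODEL_IO_PERM_FAILURE,
};

/* Hypothetical stand-in for struct pci_dev, reduced to what the model needs. */
struct model_pci_dev {
	enum model_channel_state error_state;
	unsigned int mmio_reads;	/* counts simulated hardware accesses */
};

/* Models the check described above: anything but "normal" counts as offline. */
static bool model_channel_offline(const struct model_pci_dev *dev)
{
	assert(dev != NULL);
	return dev->error_state != MODEL_IO_NORMAL;
}

/*
 * Invented driver ->slot_reset() helper: it refuses to touch the hardware
 * while the device is considered offline, so it can only make progress if
 * error_state was set back to normal before it was called.
 */
static bool model_driver_slot_reset(struct model_pci_dev *dev)
{
	if (model_channel_offline(dev))
		return false;

	dev->mmio_reads++;
	return true;
}
```

In this model, a callback invoked while error_state is still frozen performs no hardware access at all, which is the situation the earlier update avoids.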

The device should be accessible already after reset, hence the reasoning
is that it's safe to update the error_state immediately afterwards.

This deviation between AER and EEH seems problematic because drivers
behave differently depending on which error recovery mechanism the
platform uses. Three drivers have gone so far as to update the
error_state themselves, presumably to work around AER's behavior.

For consistency, amend AER to update the error_state at the same recovery
steps as EEH. Drop the now unnecessary workaround from the three drivers.

Keep updating the error_state before ->resume() in case ->error_detected()
or ->mmio_enabled() return PCI_ERS_RESULT_RECOVERED, which causes
->slot_reset() to be skipped. There are drivers doing this even for Fatal
Errors, e.g. mhi_pci_error_detected().
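
Why both update points are needed can be sketched with a simplified userspace model of the recovery sequence. All `model_` names are hypothetical, the state helper is a reduced stand-in for pci_dev_set_io_state(), and the actual bus reset and callbacks are only recorded, not performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_NORMAL = 1, MODEL_FROZEN, MODEL_PERM_FAILURE };
enum model_result { MODEL_RECOVERED, MODEL_NEED_RESET };

struct model_dev {
	enum model_state error_state;
	enum model_result detected_result;	/* what ->error_detected() returns */
	int seen_in_slot_reset;	/* error_state seen by ->slot_reset(), 0 if skipped */
	int seen_in_resume;	/* error_state seen by ->resume(), 0 if skipped */
};

/* Reduced stand-in: a device in permanent failure never changes state again. */
static bool model_set_io_state(struct model_dev *dev, enum model_state new_state)
{
	if (dev->error_state == MODEL_PERM_FAILURE)
		return false;
	dev->error_state = new_state;
	return true;
}

static void model_recover(struct model_dev *dev)
{
	assert(dev != NULL);

	/* A Fatal Error has been reported: the channel is frozen. */
	model_set_io_state(dev, MODEL_FROZEN);

	if (dev->detected_result == MODEL_NEED_RESET) {
		/* The reset would happen here; the state is updated right after it. */
		if (model_set_io_state(dev, MODEL_NORMAL))
			dev->seen_in_slot_reset = dev->error_state;
	}

	/* Kept: covers the path on which ->slot_reset() was skipped. */
	if (model_set_io_state(dev, MODEL_NORMAL))
		dev->seen_in_resume = dev->error_state;
}
```

With `detected_result` set to `MODEL_RECOVERED`, the model never records a slot reset, yet the resume step still sees a normal channel because of the second update.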

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/4517af6359ffb9d66152b827a5d2833459144e3f.1755008151.git.lukas@wunner.de

Authored by Lukas Wunner and committed by Bjorn Helgaas
45bc8256 9011f066

4 files changed, 2 insertions(+), 9 deletions(-)

--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
@@ -4215,7 +4215,6 @@
 	struct qlcnic_adapter *adapter = pci_get_drvdata(pdev);
 	int err = 0;
 
-	pdev->error_state = pci_channel_io_normal;
 	err = pci_enable_device(pdev);
 	if (err)
 		goto disconnect;

--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -3766,8 +3766,6 @@
 	struct qlcnic_adapter *adapter = pci_get_drvdata(pdev);
 	struct net_device *netdev = adapter->netdev;
 
-	pdev->error_state = pci_channel_io_normal;
-
 	err = pci_enable_device(pdev);
 	if (err)
 		return err;

--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -153,7 +153,8 @@
 
 	device_lock(&dev->dev);
 	pdrv = dev->driver;
-	if (!pdrv || !pdrv->err_handler || !pdrv->err_handler->slot_reset)
+	if (!pci_dev_set_io_state(dev, pci_channel_io_normal) ||
+	    !pdrv || !pdrv->err_handler || !pdrv->err_handler->slot_reset)
 		goto out;
 
 	err_handler = pdrv->err_handler;

--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -7883,11 +7883,6 @@
 	    "Slot Reset.\n");
 
 	ha->pci_error_state = QLA_PCI_SLOT_RESET;
-	/* Workaround: qla2xxx driver which access hardware earlier
-	 * needs error state to be pci_channel_io_online.
-	 * Otherwise mailbox command timesout.
-	 */
-	pdev->error_state = pci_channel_io_normal;
 
 	pci_restore_state(pdev);
 