Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

PCI: pciehp: Ignore Link Down/Up caused by Secondary Bus Reset

When a Secondary Bus Reset is issued at a hotplug port, it causes a Data
Link Layer State Changed event as a side effect. On hotplug ports using
in-band presence detect, it additionally causes a Presence Detect Changed
event.

These spurious events should not result in teardown and re-enumeration of
the device in the slot. Hence commit 2e35afaefe64 ("PCI: pciehp: Add
reset_slot() method") masked the Presence Detect Changed Enable bit in the
Slot Control register during a Secondary Bus Reset. Commit 06a8d89af551
("PCI: pciehp: Disable link notification across slot reset") additionally
masked the Data Link Layer State Changed Enable bit.

However masking those bits only disables interrupt generation (PCIe r6.2
sec 6.7.3.1). The events are still visible in the Slot Status register
and picked up by the IRQ handler if it runs during a Secondary Bus Reset.
This can happen if the interrupt is shared or if an unmasked hotplug event
occurs, e.g. Attention Button Pressed or Power Fault Detected.

The likelihood of this happening used to be small, so it wasn't much of a
problem in practice. That has changed with the recent introduction of
bandwidth control in v6.13-rc1 with commit 665745f27487 ("PCI/bwctrl:
Re-add BW notification portdrv as PCIe BW controller"):

Bandwidth control shares the interrupt with PCIe hotplug. A Secondary Bus
Reset causes a Link Bandwidth Notification, so the hotplug IRQ handler
runs, picks up the masked events and tears down the device in the slot.

As a result, Joel reports VFIO passthrough failure of a GPU, which Ilpo
root-caused to the incorrect handling of masked hotplug events.

Clearly, a more reliable way is needed to ignore spurious hotplug events.

For Downstream Port Containment, a new ignore mechanism was introduced by
commit a97396c6eb13 ("PCI: pciehp: Ignore Link Down/Up caused by DPC").
It has been working reliably for the past four years.

Adapt it for Secondary Bus Resets.

Introduce two helpers to annotate code sections which cause spurious link
changes: pci_hp_ignore_link_change() and pci_hp_unignore_link_change()
Use those helpers in lieu of masking interrupts in the Slot Control
register.

Introduce a helper to check whether such a code section is executing
concurrently and if so, await it: pci_hp_spurious_link_change()
Invoke the helper in the hotplug IRQ thread pciehp_ist(). Re-use the
IRQ thread's existing code which ignores DPC-induced link changes unless
the link is unexpectedly down after reset recovery or the device was
replaced during the bus reset.

That code block in pciehp_ist() was previously only executed if a Data
Link Layer State Changed event has occurred. Additionally execute it for
Presence Detect Changed events. That's necessary for compatibility with
PCIe r1.0 hotplug ports because Data Link Layer State Changed didn't exist
before PCIe r1.1. DPC was added with PCIe r3.1 and thus DPC-capable
hotplug ports always support Data Link Layer State Changed events.
But the same cannot be assumed for Secondary Bus Reset, which already
existed in PCIe r1.0.

Secondary Bus Reset is only one of many causes of spurious link changes.
Others include runtime suspend to D3cold, firmware updates or FPGA
reconfiguration. The new pci_hp_{,un}ignore_link_change() helpers may be
used by all kinds of drivers to annotate such code sections, hence their
declarations are publicly visible in <linux/pci.h>. A case in point is
the Mellanox Ethernet driver which disables a firmware reset feature if
the Ethernet card is attached to a hotplug port, see commit 3d7a3f2612d7
("net/mlx5: Nack sync reset request when HotPlug is enabled"). Going
forward, PCIe hotplug will be able to cope gracefully with all such use
cases once the code sections are properly annotated.

The new helpers internally use two bits in struct pci_dev's priv_flags as
well as a wait_queue. This mirrors what was done for DPC by commit
a97396c6eb13 ("PCI: pciehp: Ignore Link Down/Up caused by DPC"). That may
be insufficient if spurious link changes are caused by multiple sources
simultaneously. An example might be a Secondary Bus Reset issued by AER
during FPGA reconfiguration. If this turns out to happen in real life,
support for it can easily be added by replacing the PCI_LINK_CHANGING flag
with an atomic_t counter incremented by pci_hp_ignore_link_change() and
decremented by pci_hp_unignore_link_change(). Instead of awaiting a zero
PCI_LINK_CHANGING flag, the pci_hp_spurious_link_change() helper would
then simply await a zero counter.

Fixes: 665745f27487 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller")
Reported-by: Joel Mathew Thomas <proxy0@tutamail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219765
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Joel Mathew Thomas <proxy0@tutamail.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/d04deaf49d634a2edf42bf3c06ed81b4ca54d17b.1744298239.git.lukas@wunner.de

authored by

Lukas Wunner and committed by
Bjorn Helgaas
2af781a9 c3be50f7

+92 -23
+69
drivers/pci/hotplug/pci_hotplug_core.c
··· 492 492 } 493 493 EXPORT_SYMBOL_GPL(pci_hp_destroy); 494 494 495 + static DECLARE_WAIT_QUEUE_HEAD(pci_hp_link_change_wq); 496 + 497 + /** 498 + * pci_hp_ignore_link_change - begin code section causing spurious link changes 499 + * @pdev: PCI hotplug bridge 500 + * 501 + * Mark the beginning of a code section causing spurious link changes on the 502 + * Secondary Bus of @pdev, e.g. as a side effect of a Secondary Bus Reset, 503 + * D3cold transition, firmware update or FPGA reconfiguration. 504 + * 505 + * Hotplug drivers can thus check whether such a code section is executing 506 + * concurrently, await it with pci_hp_spurious_link_change() and ignore the 507 + * resulting link change events. 508 + * 509 + * Must be paired with pci_hp_unignore_link_change(). May be called both 510 + * from the PCI core and from Endpoint drivers. May be called for bridges 511 + * which are not hotplug-capable, in which case it has no effect because 512 + * no hotplug driver is bound to the bridge. 513 + */ 514 + void pci_hp_ignore_link_change(struct pci_dev *pdev) 515 + { 516 + set_bit(PCI_LINK_CHANGING, &pdev->priv_flags); 517 + smp_mb__after_atomic(); /* pairs with implied barrier of wait_event() */ 518 + } 519 + 520 + /** 521 + * pci_hp_unignore_link_change - end code section causing spurious link changes 522 + * @pdev: PCI hotplug bridge 523 + * 524 + * Mark the end of a code section causing spurious link changes on the 525 + * Secondary Bus of @pdev. Must be paired with pci_hp_ignore_link_change(). 526 + */ 527 + void pci_hp_unignore_link_change(struct pci_dev *pdev) 528 + { 529 + set_bit(PCI_LINK_CHANGED, &pdev->priv_flags); 530 + mb(); /* ensure pci_hp_spurious_link_change() sees either bit set */ 531 + clear_bit(PCI_LINK_CHANGING, &pdev->priv_flags); 532 + wake_up_all(&pci_hp_link_change_wq); 533 + } 534 + 535 + /** 536 + * pci_hp_spurious_link_change - check for spurious link changes 537 + * @pdev: PCI hotplug bridge 538 + * 539 + * Check whether a code section is executing concurrently which is causing 540 + * spurious link changes on the Secondary Bus of @pdev. Await the end of the 541 + * code section if so. 542 + * 543 + * May be called by hotplug drivers to check whether a link change is spurious 544 + * and can be ignored. 545 + * 546 + * Because a genuine link change may have occurred in-between a spurious link 547 + * change and the invocation of this function, hotplug drivers should perform 548 + * sanity checks such as retrieving the current link state and bringing down 549 + * the slot if the link is down. 550 + * 551 + * Return: %true if such a code section has been executing concurrently, 552 + * otherwise %false. Also return %true if such a code section has not been 553 + * executing concurrently, but at least once since the last invocation of this 554 + * function. 555 + */ 556 + bool pci_hp_spurious_link_change(struct pci_dev *pdev) 557 + { 558 + wait_event(pci_hp_link_change_wq, 559 + !test_bit(PCI_LINK_CHANGING, &pdev->priv_flags)); 560 + 561 + return test_and_clear_bit(PCI_LINK_CHANGED, &pdev->priv_flags); 562 + } 563 + 495 564 static int __init pci_hotplug_init(void) 496 565 { 497 566 int result;
+12 -23
drivers/pci/hotplug/pciehp_hpc.c
··· 592 592 return false; 593 593 } 594 594 595 - static void pciehp_ignore_dpc_link_change(struct controller *ctrl, 596 - struct pci_dev *pdev, int irq, 597 - u16 ignored_events) 595 + static void pciehp_ignore_link_change(struct controller *ctrl, 596 + struct pci_dev *pdev, int irq, 597 + u16 ignored_events) 598 598 { 599 599 /* 600 600 * Ignore link changes which occurred while waiting for DPC recovery. 601 601 * Could be several if DPC triggered multiple times consecutively. 602 + * Also ignore link changes caused by Secondary Bus Reset, etc. 602 603 */ 603 604 synchronize_hardirq(irq); 604 605 atomic_and(~ignored_events, &ctrl->pending_events); 605 606 if (pciehp_poll_mode) 606 607 pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, 607 608 ignored_events); 608 - ctrl_info(ctrl, "Slot(%s): Link Down/Up ignored (recovered by DPC)\n", 609 - slot_name(ctrl)); 609 + ctrl_info(ctrl, "Slot(%s): Link Down/Up ignored\n", slot_name(ctrl)); 610 610 611 611 /* 612 612 * If the link is unexpectedly down after successful recovery, ··· 762 762 763 763 /* 764 764 * Ignore Link Down/Up events caused by Downstream Port Containment 765 - * if recovery from the error succeeded. 765 + * if recovery succeeded, or caused by Secondary Bus Reset, 766 + * suspend to D3cold, firmware update, FPGA reconfiguration, etc. 766 767 */ 767 - if ((events & PCI_EXP_SLTSTA_DLLSC) && pci_dpc_recovered(pdev) && 768 + if ((events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) && 769 + (pci_dpc_recovered(pdev) || pci_hp_spurious_link_change(pdev)) && 768 770 ctrl->state == ON_STATE) { 769 771 u16 ignored_events = PCI_EXP_SLTSTA_DLLSC; 770 772 ··· 774 772 ignored_events |= events & PCI_EXP_SLTSTA_PDC; 775 773 776 774 events &= ~ignored_events; 777 - pciehp_ignore_dpc_link_change(ctrl, pdev, irq, ignored_events); 775 + pciehp_ignore_link_change(ctrl, pdev, irq, ignored_events); 778 776 } 779 777 780 778 /* ··· 939 937 { 940 938 struct controller *ctrl = to_ctrl(hotplug_slot); 941 939 struct pci_dev *pdev = ctrl_dev(ctrl); 942 - u16 stat_mask = 0, ctrl_mask = 0; 943 940 int rc; 944 941 945 942 if (probe) ··· 946 945 947 946 down_write_nested(&ctrl->reset_lock, ctrl->depth); 948 947 949 - if (!ATTN_BUTTN(ctrl)) { 950 - ctrl_mask |= PCI_EXP_SLTCTL_PDCE; 951 - stat_mask |= PCI_EXP_SLTSTA_PDC; 952 - } 953 - ctrl_mask |= PCI_EXP_SLTCTL_DLLSCE; 954 - stat_mask |= PCI_EXP_SLTSTA_DLLSC; 955 - 956 - pcie_write_cmd(ctrl, 0, ctrl_mask); 957 - ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__, 958 - pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, 0); 948 + pci_hp_ignore_link_change(pdev); 959 949 960 950 rc = pci_bridge_secondary_bus_reset(ctrl->pcie->port); 961 951 962 - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, stat_mask); 963 - pcie_write_cmd_nowait(ctrl, ctrl_mask, ctrl_mask); 964 - ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__, 965 - pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask); 952 + pci_hp_unignore_link_change(pdev); 966 953 967 954 up_write(&ctrl->reset_lock); 968 955 return rc;
+3
drivers/pci/pci.h
··· 227 227 228 228 /* Functions for PCI Hotplug drivers to use */ 229 229 int pci_hp_add_bridge(struct pci_dev *dev); 230 + bool pci_hp_spurious_link_change(struct pci_dev *pdev); 230 231 231 232 #if defined(CONFIG_SYSFS) && defined(HAVE_PCI_LEGACY) 232 233 void pci_create_legacy_files(struct pci_bus *bus); ··· 558 557 #define PCI_DPC_RECOVERED 1 559 558 #define PCI_DPC_RECOVERING 2 560 559 #define PCI_DEV_REMOVED 3 560 + #define PCI_LINK_CHANGED 4 561 + #define PCI_LINK_CHANGING 5 561 562 562 563 static inline void pci_dev_assign_added(struct pci_dev *dev) 563 564 {
+8
include/linux/pci.h
··· 1848 1848 static inline bool pcie_aspm_enabled(struct pci_dev *pdev) { return false; } 1849 1849 #endif 1850 1850 1851 + #ifdef CONFIG_HOTPLUG_PCI 1852 + void pci_hp_ignore_link_change(struct pci_dev *pdev); 1853 + void pci_hp_unignore_link_change(struct pci_dev *pdev); 1854 + #else 1855 + static inline void pci_hp_ignore_link_change(struct pci_dev *pdev) { } 1856 + static inline void pci_hp_unignore_link_change(struct pci_dev *pdev) { } 1857 + #endif 1858 + 1851 1859 #ifdef CONFIG_PCIEAER 1852 1860 bool pci_aer_available(void); 1853 1861 #else