Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

iommu: Get DT/ACPI parsing into the proper probe path

In hindsight, there were some crucial subtleties overlooked when moving
{of,acpi}_dma_configure() to driver probe time to allow waiting for
IOMMU drivers with -EPROBE_DEFER, and these have become an
ever-increasing source of problems. The IOMMU API has some fundamental
assumptions that iommu_probe_device() is called for every device added
to the system, in the order in which they are added. Calling it in a
random order or not at all dependent on driver binding leads to
malformed groups, a potential lack of isolation for devices with no
driver, and all manner of unexpected concurrency and race conditions.
We've attempted to mitigate the latter with point-fix bodges like
iommu_probe_device_lock, but it's a losing battle and the time has come
to bite the bullet and address the true source of the problem instead.

The crux of the matter is that the firmware parsing actually serves two
distinct purposes; one is identifying the IOMMU instance associated with
a device so we can check its availability, the second is actually
telling that instance about the relevant firmware-provided data for the
device. However the latter also depends on the former, and at the time
there was no good place to defer and retry that separately from the
availability check we also wanted for client driver probe.

Nowadays, though, we have a proper notion of multiple IOMMU instances in
the core API itself, and each one gets a chance to probe its own devices
upon registration, so we can finally make that work as intended for
DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to
be able to run the iommu_fwspec machinery currently buried deep in the
wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
surprisingly straightforward to bootstrap this transformation by pretty
much just calling the same path twice. At client driver probe time,
dev->driver is obviously set; conversely at device_add(), or a
subsequent bus_iommu_probe(), any device waiting for an IOMMU really
should *not* have a driver already, so we can use that as a condition to
disambiguate the two cases, and avoid recursing back into the IOMMU core
at the wrong times.

Obviously this isn't the nicest thing, but for now it gives us a
functional baseline to then unpick the layers in between without many
more awkward cross-subsystem patches. There are some minor side-effects
like dma_range_map potentially being created earlier, and some debug
prints being repeated, but these aren't significantly detrimental. Let's
make things work first, then deal with making them nice.

With the basic flow finally in the right order again, the next step is
probably turning the bus->dma_configure paths inside-out, since all we
really need from bus code is its notion of which device and input ID(s)
to parse the common firmware properties with...

Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/e3b191e6fd6ca9a1e84c5e5e40044faf97abb874.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

Authored by Robin Murphy, committed by Joerg Roedel
Commit bcb81ac6, parent 3832862e

48 insertions(+), 17 deletions(-)
drivers/acpi/arm64/dma.c (+5):

 	else
 		end = (1ULL << 32) - 1;

+	if (dev->dma_range_map) {
+		dev_dbg(dev, "dma_range_map already set\n");
+		return;
+	}
+
 	ret = acpi_dma_get_range(dev, &map);
 	if (!ret && map) {
 		end = dma_range_map_max(map);
drivers/acpi/scan.c (-7):

 	err = viot_iommu_configure(dev);
 	mutex_unlock(&iommu_probe_device_lock);

-	/*
-	 * If we have reason to believe the IOMMU driver missed the initial
-	 * iommu_probe_device() call for dev, replay it to get things in order.
-	 */
-	if (!err && dev->bus)
-		err = iommu_probe_device(dev);
-
 	return err;
 }
drivers/amba/bus.c (+2 -1):

 		ret = acpi_dma_configure(dev, attr);
 	}

-	if (!ret && !drv->driver_managed_dma) {
+	/* @drv may not be valid when we're called from the IOMMU layer */
+	if (!ret && dev->driver && !drv->driver_managed_dma) {
 		ret = iommu_device_use_default_domain(dev);
 		if (ret)
 			arch_teardown_dma_ops(dev);
drivers/base/platform.c (+2 -1):

 		attr = acpi_get_dma_attr(to_acpi_device_node(fwnode));
 		ret = acpi_dma_configure(dev, attr);
 	}
-	if (ret || drv->driver_managed_dma)
+	/* @drv may not be valid when we're called from the IOMMU layer */
+	if (ret || !dev->driver || drv->driver_managed_dma)
 		return ret;

 	ret = iommu_device_use_default_domain(dev);
drivers/bus/fsl-mc/fsl-mc-bus.c (+2 -1):

 	else
 		ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id);

-	if (!ret && !mc_drv->driver_managed_dma) {
+	/* @mc_drv may not be valid when we're called from the IOMMU layer */
+	if (!ret && dev->driver && !mc_drv->driver_managed_dma) {
 		ret = iommu_device_use_default_domain(dev);
 		if (ret)
 			arch_teardown_dma_ops(dev);
drivers/cdx/cdx.c (+2 -1):

 		return ret;
 	}

-	if (!ret && !cdx_drv->driver_managed_dma) {
+	/* @cdx_drv may not be valid when we're called from the IOMMU layer */
+	if (!ret && dev->driver && !cdx_drv->driver_managed_dma) {
 		ret = iommu_device_use_default_domain(dev);
 		if (ret)
 			arch_teardown_dma_ops(dev);
drivers/iommu/iommu.c (+21 -3):

 	if (!dev_iommu_get(dev))
 		return -ENOMEM;
 	/*
-	 * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU
-	 * instances with non-NULL fwnodes, and client devices should have been
-	 * identified with a fwspec by this point. Otherwise, we can currently
+	 * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing
+	 * is buried in the bus dma_configure path. Properly unpicking that is
+	 * still a big job, so for now just invoke the whole thing. The device
+	 * already having a driver bound means dma_configure has already run and
+	 * either found no IOMMU to wait for, or we're in its replay call right
+	 * now, so either way there's no point calling it again.
+	 */
+	if (!dev->driver && dev->bus->dma_configure) {
+		mutex_unlock(&iommu_probe_device_lock);
+		dev->bus->dma_configure(dev);
+		mutex_lock(&iommu_probe_device_lock);
+	}
+	/*
+	 * At this point, relevant devices either now have a fwspec which will
+	 * match ops registered with a non-NULL fwnode, or we can reasonably
 	 * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can
 	 * be present, and that any of their registered instances has suitable
 	 * ops for probing, and thus cheekily co-opt the same mechanism.
@@
 		ret = -ENODEV;
 		goto err_free;
 	}
+	/*
+	 * And if we do now see any replay calls, they would indicate someone
+	 * misusing the dma_configure path outside bus code.
+	 */
+	if (dev->driver)
+		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");

 	if (!try_module_get(ops->owner)) {
 		ret = -EINVAL;
drivers/iommu/of_iommu.c (+6 -1):

 		dev_iommu_free(dev);
 	mutex_unlock(&iommu_probe_device_lock);

-	if (!err && dev->bus)
+	/*
+	 * If we're not on the iommu_probe_device() path (as indicated by the
+	 * initial dev->iommu) then try to simulate it. This should no longer
+	 * happen unless of_dma_configure() is being misused outside bus code.
+	 */
+	if (!err && dev->bus && !dev_iommu_present)
 		err = iommu_probe_device(dev);

 	if (err && err != -EPROBE_DEFER)
drivers/of/device.c (+6 -1):

 	bool coherent, set_map = false;
 	int ret;

+	if (dev->dma_range_map) {
+		dev_dbg(dev, "dma_range_map already set\n");
+		goto skip_map;
+	}
+
 	if (np == dev->of_node)
 		bus_np = __of_get_dma_parent(np);
 	else
@@
 		end = dma_range_map_max(map);
 		set_map = true;
 	}
-
+skip_map:
 	/*
 	 * If @dev is expected to be DMA-capable then the bus code that created
 	 * it should have initialised its dma_mask pointer by this point. For
drivers/pci/pci-driver.c (+2 -1):

 	pci_put_host_bridge_device(bridge);

-	if (!ret && !driver->driver_managed_dma) {
+	/* @driver may not be valid when we're called from the IOMMU layer */
+	if (!ret && dev->driver && !driver->driver_managed_dma) {
 		ret = iommu_device_use_default_domain(dev);
 		if (ret)
 			arch_teardown_dma_ops(dev);