Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'iommufd_dmabuf' into k.o-iommufd/for-next

Jason Gunthorpe says:

====================
This series is the start of adding full DMABUF support to iommufd.
Currently it is limited to working only with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:

https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com/

The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fds, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and, if the underlying alignment
requirements are met, it will be placed in the iommu_domain.

Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer-to-Peer support in VMs, and is the last feature that the VFIO
type 1 container has that iommufd couldn't do.

The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.

Instead iommufd relies on revocable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd,
which will unmap it from the iommu_domain. There is no automatic remap;
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. E.g. when QEMU goes to issue FLR it
should do the map/unmap to iommufd.

Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.

The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().

Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases, so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().

I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.

The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com

And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
====================

Based on a shared branch with vfio.

* iommufd_dmabuf:
iommufd/selftest: Add some tests for the dmabuf flow
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Allow MMIO pages in a batch
iommufd: Allow a DMABUF to be revoked
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Add DMABUF to iopt_pages
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
vfio/nvgrace: Support get_dmabuf_phys
vfio/pci: Add dma-buf export support for MMIO regions
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Share the core device pointer while invoking feature functions
vfio: Export vfio device get and put registration helpers
dma-buf: provide phys_vec to scatter-gather mapping routine
PCI/P2PDMA: Document DMABUF model
PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Separate the mmap() support from the core logic

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

+1887 -211
+73 -22
Documentation/driver-api/pci/p2pdma.rst
@@ -9,22 +9,48 @@
 called Peer-to-Peer (or P2P). However, there are a number of issues that
 make P2P transactions tricky to do in a perfectly safe way.
 
-One of the biggest issues is that PCI doesn't require forwarding
-transactions between hierarchy domains, and in PCIe, each Root Port
-defines a separate hierarchy domain. To make things worse, there is no
-simple way to determine if a given Root Complex supports this or not.
-(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
-only supports doing P2P when the endpoints involved are all behind the
-same PCI bridge, as such devices are all in the same PCI hierarchy
-domain, and the spec guarantees that all transactions within the
-hierarchy will be routable, but it does not require routing
-between hierarchies.
+For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
+until they reach a host bridge or root port. If the path includes PCIe switches
+then based on the ACS settings the transaction can route entirely within
+the PCIe hierarchy and never reach the root port. The kernel will evaluate
+the PCIe topology and always permit P2P in these well-defined cases.
 
-The second issue is that to make use of existing interfaces in Linux,
-memory that is used for P2P transactions needs to be backed by struct
-pages. However, PCI BARs are not typically cache coherent so there are
-a few corner case gotchas with these pages so developers need to
-be careful about what they do with them.
+However, if the P2P transaction reaches the host bridge then it might have to
+hairpin back out the same root port, be routed inside the CPU SOC to another
+PCIe root port, or routed internally to the SOC.
+
+The PCIe specification doesn't define the forwarding of transactions between
+hierarchy domains and the kernel defaults to blocking such routing. There is an
+allow list to allow detecting known-good HW, in which case P2P between any
+two PCIe devices will be permitted.
+
+Since P2P inherently is doing transactions between two devices it requires two
+drivers to be co-operating inside the kernel. The providing driver has to convey
+its MMIO to the consuming driver. To meet the driver model lifecycle rules the
+MMIO must have all DMA mapping removed, all CPU accesses prevented, all page
+table mappings undone before the providing driver completes remove().
+
+This requires the providing and consuming driver to actively work together to
+guarantee that the consuming driver has stopped using the MMIO during a removal
+cycle. This is done by either a synchronous invalidation shutdown or waiting
+for all usage refcounts to reach zero.
+
+At the lowest level the P2P subsystem offers a naked struct p2p_provider that
+delegates lifecycle management to the providing driver. It is expected that
+drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
+to provide an invalidation shutdown. These MMIO addresses have no struct page, and
+if used with mmap() must create special PTEs. As such there are very few
+kernel uAPIs that can accept pointers to them; in particular they cannot be used
+with read()/write(), including O_DIRECT.
+
+Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
+pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
+pgmap ensures that when the pgmap is destroyed all other drivers have stopped
+using the MMIO. This option works with O_DIRECT flows, in some cases, if the
+underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
+FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
+it also relies on architecture support along with alignment and minimum size
+limitations.
 
 
 Driver Writer's Guide
@@ -140,14 +114,39 @@
 Struct Page Caveats
 -------------------
 
-Driver writers should be very careful about not passing these special
-struct pages to code that isn't prepared for it. At this time, the kernel
-interfaces do not have any checks for ensuring this. This obviously
-precludes passing these pages to userspace.
+While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
+pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
 
-P2P memory is also technically IO memory but should never have any side
-effects behind it. Thus, the order of loads and stores should not be important
-and ioreadX(), iowriteX() and friends should not be necessary.
+The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
+KVA is still MMIO and must still be accessed through the normal
+readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
+like any other MMIO mapping. While this will actually work on some
+architectures, others will experience corruption or just crash in the kernel.
+Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
+access happens.
+
+
+Usage With DMABUF
+=================
+
+DMABUF provides an alternative to the above struct page-based
+client/provider/orchestrator system and should be used when struct page
+doesn't exist. In this mode the exporting driver will wrap
+some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
+
+Userspace can then pass the FD to an importing driver which will ask the
+exporting driver to map it to the importer.
+
+In this case the initiator and target pci_devices are known and the P2P subsystem
+is used to determine the mapping type. The phys_addr_t-based DMA API is used to
+establish the dma_addr_t.
+
+Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
+to remove() it must deliver an invalidation shutdown to all DMABUF importing
+drivers through move_notify() and synchronously DMA unmap all the MMIO.
+
+No importing driver can continue to have a DMA map to the MMIO after the
+exporting driver has destroyed its p2p_provider.
 
 
 P2P DMA Support Library
+1 -1
block/blk-mq-dma.c
@@ -85,7 +85,7 @@
 
 static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
 {
-	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+	iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
 	iter->len = vec->len;
 	return true;
 }
+1 -1
drivers/dma-buf/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
-	dma-fence-unwrap.o dma-resv.o
+	dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o
 obj-$(CONFIG_DMABUF_HEAPS) += dma-heap.o
 obj-$(CONFIG_DMABUF_HEAPS) += heaps/
 obj-$(CONFIG_SYNC_FILE) += sync_file.o
+248
drivers/dma-buf/dma-buf-mapping.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * DMA BUF Mapping Helpers 4 + * 5 + */ 6 + #include <linux/dma-buf-mapping.h> 7 + #include <linux/dma-resv.h> 8 + 9 + static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, 10 + dma_addr_t addr) 11 + { 12 + unsigned int len, nents; 13 + int i; 14 + 15 + nents = DIV_ROUND_UP(length, UINT_MAX); 16 + for (i = 0; i < nents; i++) { 17 + len = min_t(size_t, length, UINT_MAX); 18 + length -= len; 19 + /* 20 + * DMABUF abuses scatterlist to create a scatterlist 21 + * that does not have any CPU list, only the DMA list. 22 + * Always set the page related values to NULL to ensure 23 + * importers can't use it. The phys_addr based DMA API 24 + * does not require the CPU list for mapping or unmapping. 25 + */ 26 + sg_set_page(sgl, NULL, 0, 0); 27 + sg_dma_address(sgl) = addr + i * UINT_MAX; 28 + sg_dma_len(sgl) = len; 29 + sgl = sg_next(sgl); 30 + } 31 + 32 + return sgl; 33 + } 34 + 35 + static unsigned int calc_sg_nents(struct dma_iova_state *state, 36 + struct dma_buf_phys_vec *phys_vec, 37 + size_t nr_ranges, size_t size) 38 + { 39 + unsigned int nents = 0; 40 + size_t i; 41 + 42 + if (!state || !dma_use_iova(state)) { 43 + for (i = 0; i < nr_ranges; i++) 44 + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX); 45 + } else { 46 + /* 47 + * In IOVA case, there is only one SG entry which spans 48 + * for whole IOVA address space, but we need to make sure 49 + * that it fits sg->length, maybe we need more. 
50 + */ 51 + nents = DIV_ROUND_UP(size, UINT_MAX); 52 + } 53 + 54 + return nents; 55 + } 56 + 57 + /** 58 + * struct dma_buf_dma - holds DMA mapping information 59 + * @sgt: Scatter-gather table 60 + * @state: DMA IOVA state relevant in IOMMU-based DMA 61 + * @size: Total size of DMA transfer 62 + */ 63 + struct dma_buf_dma { 64 + struct sg_table sgt; 65 + struct dma_iova_state *state; 66 + size_t size; 67 + }; 68 + 69 + /** 70 + * dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment 71 + * from arrays of physical vectors. This function is intended for MMIO memory 72 + * only. 73 + * @attach: [in] attachment whose scatterlist is to be returned 74 + * @provider: [in] p2pdma provider 75 + * @phys_vec: [in] array of physical vectors 76 + * @nr_ranges: [in] number of entries in phys_vec array 77 + * @size: [in] total size of phys_vec 78 + * @dir: [in] direction of DMA transfer 79 + * 80 + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR 81 + * on error. May return -EINTR if it is interrupted by a signal. 82 + * 83 + * On success, the DMA addresses and lengths in the returned scatterlist are 84 + * PAGE_SIZE aligned. 85 + * 86 + * A mapping must be unmapped by using dma_buf_free_sgt(). 87 + * 88 + * NOTE: This function is intended for exporters. If direct traffic routing is 89 + * mandatory the exporter should call pci_p2pdma_map_type() before calling 90 + * this function. 
91 + */ 92 + struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach, 93 + struct p2pdma_provider *provider, 94 + struct dma_buf_phys_vec *phys_vec, 95 + size_t nr_ranges, size_t size, 96 + enum dma_data_direction dir) 97 + { 98 + unsigned int nents, mapped_len = 0; 99 + struct dma_buf_dma *dma; 100 + struct scatterlist *sgl; 101 + dma_addr_t addr; 102 + size_t i; 103 + int ret; 104 + 105 + dma_resv_assert_held(attach->dmabuf->resv); 106 + 107 + if (WARN_ON(!attach || !attach->dmabuf || !provider)) 108 + /* This function is supposed to work on MMIO memory only */ 109 + return ERR_PTR(-EINVAL); 110 + 111 + dma = kzalloc(sizeof(*dma), GFP_KERNEL); 112 + if (!dma) 113 + return ERR_PTR(-ENOMEM); 114 + 115 + switch (pci_p2pdma_map_type(provider, attach->dev)) { 116 + case PCI_P2PDMA_MAP_BUS_ADDR: 117 + /* 118 + * There is no need in IOVA at all for this flow. 119 + */ 120 + break; 121 + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: 122 + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL); 123 + if (!dma->state) { 124 + ret = -ENOMEM; 125 + goto err_free_dma; 126 + } 127 + 128 + dma_iova_try_alloc(attach->dev, dma->state, 0, size); 129 + break; 130 + default: 131 + ret = -EINVAL; 132 + goto err_free_dma; 133 + } 134 + 135 + nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size); 136 + ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO); 137 + if (ret) 138 + goto err_free_state; 139 + 140 + sgl = dma->sgt.sgl; 141 + 142 + for (i = 0; i < nr_ranges; i++) { 143 + if (!dma->state) { 144 + addr = pci_p2pdma_bus_addr_map(provider, 145 + phys_vec[i].paddr); 146 + } else if (dma_use_iova(dma->state)) { 147 + ret = dma_iova_link(attach->dev, dma->state, 148 + phys_vec[i].paddr, 0, 149 + phys_vec[i].len, dir, 150 + DMA_ATTR_MMIO); 151 + if (ret) 152 + goto err_unmap_dma; 153 + 154 + mapped_len += phys_vec[i].len; 155 + } else { 156 + addr = dma_map_phys(attach->dev, phys_vec[i].paddr, 157 + phys_vec[i].len, dir, 158 + DMA_ATTR_MMIO); 159 + ret = 
dma_mapping_error(attach->dev, addr); 160 + if (ret) 161 + goto err_unmap_dma; 162 + } 163 + 164 + if (!dma->state || !dma_use_iova(dma->state)) 165 + sgl = fill_sg_entry(sgl, phys_vec[i].len, addr); 166 + } 167 + 168 + if (dma->state && dma_use_iova(dma->state)) { 169 + WARN_ON_ONCE(mapped_len != size); 170 + ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len); 171 + if (ret) 172 + goto err_unmap_dma; 173 + 174 + sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr); 175 + } 176 + 177 + dma->size = size; 178 + 179 + /* 180 + * No CPU list included — set orig_nents = 0 so others can detect 181 + * this via SG table (use nents only). 182 + */ 183 + dma->sgt.orig_nents = 0; 184 + 185 + 186 + /* 187 + * SGL must be NULL to indicate that SGL is the last one 188 + * and we allocated correct number of entries in sg_alloc_table() 189 + */ 190 + WARN_ON_ONCE(sgl); 191 + return &dma->sgt; 192 + 193 + err_unmap_dma: 194 + if (!i || !dma->state) { 195 + ; /* Do nothing */ 196 + } else if (dma_use_iova(dma->state)) { 197 + dma_iova_destroy(attach->dev, dma->state, mapped_len, dir, 198 + DMA_ATTR_MMIO); 199 + } else { 200 + for_each_sgtable_dma_sg(&dma->sgt, sgl, i) 201 + dma_unmap_phys(attach->dev, sg_dma_address(sgl), 202 + sg_dma_len(sgl), dir, DMA_ATTR_MMIO); 203 + } 204 + sg_free_table(&dma->sgt); 205 + err_free_state: 206 + kfree(dma->state); 207 + err_free_dma: 208 + kfree(dma); 209 + return ERR_PTR(ret); 210 + } 211 + EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF"); 212 + 213 + /** 214 + * dma_buf_free_sgt- unmaps the buffer 215 + * @attach: [in] attachment to unmap buffer from 216 + * @sgt: [in] scatterlist info of the buffer to unmap 217 + * @dir: [in] direction of DMA transfer 218 + * 219 + * This unmaps a DMA mapping for @attached obtained 220 + * by dma_buf_phys_vec_to_sgt(). 
221 + */ 222 + void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, 223 + enum dma_data_direction dir) 224 + { 225 + struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt); 226 + int i; 227 + 228 + dma_resv_assert_held(attach->dmabuf->resv); 229 + 230 + if (!dma->state) { 231 + ; /* Do nothing */ 232 + } else if (dma_use_iova(dma->state)) { 233 + dma_iova_destroy(attach->dev, dma->state, dma->size, dir, 234 + DMA_ATTR_MMIO); 235 + } else { 236 + struct scatterlist *sgl; 237 + 238 + for_each_sgtable_dma_sg(sgt, sgl, i) 239 + dma_unmap_phys(attach->dev, sg_dma_address(sgl), 240 + sg_dma_len(sgl), dir, DMA_ATTR_MMIO); 241 + } 242 + 243 + sg_free_table(sgt); 244 + kfree(dma->state); 245 + kfree(dma); 246 + 247 + } 248 + EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF");
+2 -2
drivers/iommu/dma-iommu.c
@@ -1439,8 +1439,8 @@
 			 * as a bus address, __finalise_sg() will copy the dma
 			 * address into the output segment.
 			 */
-			s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-								 sg_phys(s));
+			s->dma_address = pci_p2pdma_bus_addr_map(
+				p2pdma_state.mem, sg_phys(s));
 			sg_dma_len(s) = sg->length;
 			sg_dma_mark_bus_address(s);
 			continue;
+65 -13
drivers/iommu/iommufd/io_pagetable.c
··· 8 8 * The datastructure uses the iopt_pages to optimize the storage of the PFNs 9 9 * between the domains and xarray. 10 10 */ 11 + #include <linux/dma-buf.h> 11 12 #include <linux/err.h> 12 13 #include <linux/errno.h> 14 + #include <linux/file.h> 13 15 #include <linux/iommu.h> 14 16 #include <linux/iommufd.h> 15 17 #include <linux/lockdep.h> ··· 286 284 case IOPT_ADDRESS_FILE: 287 285 start = elm->start_byte + elm->pages->start; 288 286 break; 287 + case IOPT_ADDRESS_DMABUF: 288 + start = elm->start_byte + elm->pages->dmabuf.start; 289 + break; 289 290 } 290 291 rc = iopt_alloc_iova(iopt, dst_iova, start, length); 291 292 if (rc) ··· 473 468 * @iopt: io_pagetable to act on 474 469 * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains 475 470 * the chosen iova on output. Otherwise is the iova to map to on input 476 - * @file: file to map 471 + * @fd: fdno of a file to map 477 472 * @start: map file starting at this byte offset 478 473 * @length: Number of bytes to map 479 474 * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping 480 475 * @flags: IOPT_ALLOC_IOVA or zero 481 476 */ 482 477 int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt, 483 - unsigned long *iova, struct file *file, 484 - unsigned long start, unsigned long length, 485 - int iommu_prot, unsigned int flags) 478 + unsigned long *iova, int fd, unsigned long start, 479 + unsigned long length, int iommu_prot, 480 + unsigned int flags) 486 481 { 487 482 struct iopt_pages *pages; 483 + struct dma_buf *dmabuf; 484 + unsigned long start_byte; 485 + unsigned long last; 488 486 489 - pages = iopt_alloc_file_pages(file, start, length, 490 - iommu_prot & IOMMU_WRITE); 491 - if (IS_ERR(pages)) 492 - return PTR_ERR(pages); 487 + if (!length) 488 + return -EINVAL; 489 + if (check_add_overflow(start, length - 1, &last)) 490 + return -EOVERFLOW; 491 + 492 + start_byte = start - ALIGN_DOWN(start, PAGE_SIZE); 493 + dmabuf = dma_buf_get(fd); 494 + if 
(!IS_ERR(dmabuf)) { 495 + pages = iopt_alloc_dmabuf_pages(ictx, dmabuf, start_byte, start, 496 + length, 497 + iommu_prot & IOMMU_WRITE); 498 + if (IS_ERR(pages)) { 499 + dma_buf_put(dmabuf); 500 + return PTR_ERR(pages); 501 + } 502 + } else { 503 + struct file *file; 504 + 505 + file = fget(fd); 506 + if (!file) 507 + return -EBADF; 508 + 509 + pages = iopt_alloc_file_pages(file, start_byte, start, length, 510 + iommu_prot & IOMMU_WRITE); 511 + fput(file); 512 + if (IS_ERR(pages)) 513 + return PTR_ERR(pages); 514 + } 515 + 493 516 return iopt_map_common(ictx, iopt, pages, iova, length, 494 - start - pages->start, iommu_prot, flags); 517 + start_byte, iommu_prot, flags); 495 518 } 496 519 497 520 struct iova_bitmap_fn_arg { ··· 994 961 WARN_ON(!area->storage_domain); 995 962 if (area->storage_domain == domain) 996 963 area->storage_domain = storage_domain; 964 + if (iopt_is_dmabuf(pages)) { 965 + if (!iopt_dmabuf_revoked(pages)) 966 + iopt_area_unmap_domain(area, domain); 967 + iopt_dmabuf_untrack_domain(pages, area, domain); 968 + } 997 969 mutex_unlock(&pages->mutex); 998 970 999 - iopt_area_unmap_domain(area, domain); 971 + if (!iopt_is_dmabuf(pages)) 972 + iopt_area_unmap_domain(area, domain); 1000 973 } 1001 974 return; 1002 975 } ··· 1019 980 WARN_ON(area->storage_domain != domain); 1020 981 area->storage_domain = NULL; 1021 982 iopt_area_unfill_domain(area, pages, domain); 983 + if (iopt_is_dmabuf(pages)) 984 + iopt_dmabuf_untrack_domain(pages, area, domain); 1022 985 mutex_unlock(&pages->mutex); 1023 986 } 1024 987 } ··· 1050 1009 if (!pages) 1051 1010 continue; 1052 1011 1053 - mutex_lock(&pages->mutex); 1012 + guard(mutex)(&pages->mutex); 1013 + if (iopt_is_dmabuf(pages)) { 1014 + rc = iopt_dmabuf_track_domain(pages, area, domain); 1015 + if (rc) 1016 + goto out_unfill; 1017 + } 1054 1018 rc = iopt_area_fill_domain(area, domain); 1055 1019 if (rc) { 1056 - mutex_unlock(&pages->mutex); 1020 + if (iopt_is_dmabuf(pages)) 1021 + 
iopt_dmabuf_untrack_domain(pages, area, domain); 1057 1022 goto out_unfill; 1058 1023 } 1059 1024 if (!area->storage_domain) { ··· 1068 1021 interval_tree_insert(&area->pages_node, 1069 1022 &pages->domains_itree); 1070 1023 } 1071 - mutex_unlock(&pages->mutex); 1072 1024 } 1073 1025 return 0; 1074 1026 ··· 1088 1042 area->storage_domain = NULL; 1089 1043 } 1090 1044 iopt_area_unfill_domain(area, pages, domain); 1045 + if (iopt_is_dmabuf(pages)) 1046 + iopt_dmabuf_untrack_domain(pages, area, domain); 1091 1047 mutex_unlock(&pages->mutex); 1092 1048 } 1093 1049 return rc; ··· 1299 1251 1300 1252 if (!pages || area->prevent_access) 1301 1253 return -EBUSY; 1254 + 1255 + /* Maintaining the domains_itree below is a bit complicated */ 1256 + if (iopt_is_dmabuf(pages)) 1257 + return -EOPNOTSUPP; 1302 1258 1303 1259 if (new_start & (alignment - 1) || 1304 1260 iopt_area_start_byte(area, new_start) & (alignment - 1))
+52 -2
drivers/iommu/iommufd/io_pagetable.h
··· 5 5 #ifndef __IO_PAGETABLE_H 6 6 #define __IO_PAGETABLE_H 7 7 8 + #include <linux/dma-buf.h> 8 9 #include <linux/interval_tree.h> 9 10 #include <linux/kref.h> 10 11 #include <linux/mutex.h> ··· 69 68 struct iommu_domain *domain); 70 69 void iopt_area_unmap_domain(struct iopt_area *area, 71 70 struct iommu_domain *domain); 71 + 72 + int iopt_dmabuf_track_domain(struct iopt_pages *pages, struct iopt_area *area, 73 + struct iommu_domain *domain); 74 + void iopt_dmabuf_untrack_domain(struct iopt_pages *pages, 75 + struct iopt_area *area, 76 + struct iommu_domain *domain); 77 + int iopt_dmabuf_track_all_domains(struct iopt_area *area, 78 + struct iopt_pages *pages); 79 + void iopt_dmabuf_untrack_all_domains(struct iopt_area *area, 80 + struct iopt_pages *pages); 72 81 73 82 static inline unsigned long iopt_area_index(struct iopt_area *area) 74 83 { ··· 190 179 191 180 enum iopt_address_type { 192 181 IOPT_ADDRESS_USER = 0, 193 - IOPT_ADDRESS_FILE = 1, 182 + IOPT_ADDRESS_FILE, 183 + IOPT_ADDRESS_DMABUF, 184 + }; 185 + 186 + struct iopt_pages_dmabuf_track { 187 + struct iommu_domain *domain; 188 + struct iopt_area *area; 189 + struct list_head elm; 190 + }; 191 + 192 + struct iopt_pages_dmabuf { 193 + struct dma_buf_attachment *attach; 194 + struct dma_buf_phys_vec phys; 195 + /* Always PAGE_SIZE aligned */ 196 + unsigned long start; 197 + struct list_head tracker; 194 198 }; 195 199 196 200 /* ··· 235 209 struct file *file; 236 210 unsigned long start; 237 211 }; 212 + /* IOPT_ADDRESS_DMABUF */ 213 + struct iopt_pages_dmabuf dmabuf; 238 214 }; 239 215 bool writable:1; 240 216 u8 account_mode; ··· 248 220 struct rb_root_cached domains_itree; 249 221 }; 250 222 223 + static inline bool iopt_is_dmabuf(struct iopt_pages *pages) 224 + { 225 + if (!IS_ENABLED(CONFIG_DMA_SHARED_BUFFER)) 226 + return false; 227 + return pages->type == IOPT_ADDRESS_DMABUF; 228 + } 229 + 230 + static inline bool iopt_dmabuf_revoked(struct iopt_pages *pages) 231 + { 232 + 
lockdep_assert_held(&pages->mutex); 233 + if (iopt_is_dmabuf(pages)) 234 + return pages->dmabuf.phys.len == 0; 235 + return false; 236 + } 237 + 251 238 struct iopt_pages *iopt_alloc_user_pages(void __user *uptr, 252 239 unsigned long length, bool writable); 253 - struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start, 240 + struct iopt_pages *iopt_alloc_file_pages(struct file *file, 241 + unsigned long start_byte, 242 + unsigned long start, 254 243 unsigned long length, bool writable); 244 + struct iopt_pages *iopt_alloc_dmabuf_pages(struct iommufd_ctx *ictx, 245 + struct dma_buf *dmabuf, 246 + unsigned long start_byte, 247 + unsigned long start, 248 + unsigned long length, bool writable); 255 249 void iopt_release_pages(struct kref *kref); 256 250 static inline void iopt_put_pages(struct iopt_pages *pages) 257 251 {
+1 -7
drivers/iommu/iommufd/ioas.c
··· 207 207 unsigned long iova = cmd->iova; 208 208 struct iommufd_ioas *ioas; 209 209 unsigned int flags = 0; 210 - struct file *file; 211 210 int rc; 212 211 213 212 if (cmd->flags & ··· 228 229 if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA)) 229 230 flags = IOPT_ALLOC_IOVA; 230 231 231 - file = fget(cmd->fd); 232 - if (!file) 233 - return -EBADF; 234 - 235 - rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova, file, 232 + rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova, cmd->fd, 236 233 cmd->start, cmd->length, 237 234 conv_iommu_prot(cmd->flags), flags); 238 235 if (rc) ··· 238 243 rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); 239 244 out_put: 240 245 iommufd_put_object(ucmd->ictx, &ioas->obj); 241 - fput(file); 242 246 return rc; 243 247 } 244 248
+13 -1
drivers/iommu/iommufd/iommufd_private.h
··· 19 19 struct iommu_group; 20 20 struct iommu_option; 21 21 struct iommufd_device; 22 + struct dma_buf_attachment; 23 + struct dma_buf_phys_vec; 22 24 23 25 struct iommufd_sw_msi_map { 24 26 struct list_head sw_msi_item; ··· 110 108 unsigned long length, int iommu_prot, 111 109 unsigned int flags); 112 110 int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt, 113 - unsigned long *iova, struct file *file, 111 + unsigned long *iova, int fd, 114 112 unsigned long start, unsigned long length, 115 113 int iommu_prot, unsigned int flags); 116 114 int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list, ··· 506 504 void iommufd_device_destroy(struct iommufd_object *obj); 507 505 int iommufd_get_hw_info(struct iommufd_ucmd *ucmd); 508 506 507 + struct device *iommufd_global_device(void); 508 + 509 509 struct iommufd_access { 510 510 struct iommufd_object obj; 511 511 struct iommufd_ctx *ictx; ··· 715 711 int __init iommufd_test_init(void); 716 712 void iommufd_test_exit(void); 717 713 bool iommufd_selftest_is_mock_dev(struct device *dev); 714 + int iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 715 + struct dma_buf_phys_vec *phys); 718 716 #else 719 717 static inline void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd, 720 718 unsigned int ioas_id, ··· 737 731 static inline bool iommufd_selftest_is_mock_dev(struct device *dev) 738 732 { 739 733 return false; 734 + } 735 + static inline int 736 + iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 737 + struct dma_buf_phys_vec *phys) 738 + { 739 + return -EOPNOTSUPP; 740 740 } 741 741 #endif 742 742 #endif
+10
drivers/iommu/iommufd/iommufd_test.h
··· 29 29 IOMMU_TEST_OP_PASID_REPLACE, 30 30 IOMMU_TEST_OP_PASID_DETACH, 31 31 IOMMU_TEST_OP_PASID_CHECK_HWPT, 32 + IOMMU_TEST_OP_DMABUF_GET, 33 + IOMMU_TEST_OP_DMABUF_REVOKE, 32 34 }; 33 35 34 36 enum { ··· 178 176 __u32 hwpt_id; 179 177 /* @id is stdev_id */ 180 178 } pasid_check; 179 + struct { 180 + __u32 length; 181 + __u32 open_flags; 182 + } dmabuf_get; 183 + struct { 184 + __s32 dmabuf_fd; 185 + __u32 revoked; 186 + } dmabuf_revoke; 181 187 }; 182 188 __u32 last; 183 189 };
+10
drivers/iommu/iommufd/main.c
··· 751 751 .mode = 0666, 752 752 }; 753 753 754 + /* 755 + * Used only by DMABUF, returns a valid struct device to use as a dummy struct 756 + * device for attachment. 757 + */ 758 + struct device *iommufd_global_device(void) 759 + { 760 + return iommu_misc_dev.this_device; 761 + } 762 + 754 763 static int __init iommufd_init(void) 755 764 { 756 765 int ret; ··· 803 794 #endif 804 795 MODULE_IMPORT_NS("IOMMUFD_INTERNAL"); 805 796 MODULE_IMPORT_NS("IOMMUFD"); 797 + MODULE_IMPORT_NS("DMA_BUF"); 806 798 MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices"); 807 799 MODULE_LICENSE("GPL");
+367 -47
drivers/iommu/iommufd/pages.c
··· 45 45 * last_iova + 1 can overflow. An iopt_pages index will always be much less than 46 46 * ULONG_MAX so last_index + 1 cannot overflow. 47 47 */ 48 + #include <linux/dma-buf.h> 49 + #include <linux/dma-resv.h> 48 50 #include <linux/file.h> 49 51 #include <linux/highmem.h> 50 52 #include <linux/iommu.h> ··· 55 53 #include <linux/overflow.h> 56 54 #include <linux/slab.h> 57 55 #include <linux/sched/mm.h> 56 + #include <linux/vfio_pci_core.h> 58 57 59 58 #include "double_span.h" 60 59 #include "io_pagetable.h" ··· 261 258 return container_of(node, struct iopt_area, pages_node); 262 259 } 263 260 261 + enum batch_kind { 262 + BATCH_CPU_MEMORY = 0, 263 + BATCH_MMIO, 264 + }; 265 + 264 266 /* 265 267 * A simple datastructure to hold a vector of PFNs, optimized for contiguous 266 268 * PFNs. This is used as a temporary holding memory for shuttling pfns from one ··· 279 271 unsigned int array_size; 280 272 unsigned int end; 281 273 unsigned int total_pfns; 274 + enum batch_kind kind; 282 275 }; 276 + enum { MAX_NPFNS = type_max(typeof(((struct pfn_batch *)0)->npfns[0])) }; 283 277 284 278 static void batch_clear(struct pfn_batch *batch) 285 279 { ··· 358 348 } 359 349 360 350 static bool batch_add_pfn_num(struct pfn_batch *batch, unsigned long pfn, 361 - u32 nr) 351 + u32 nr, enum batch_kind kind) 362 352 { 363 - const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns)); 364 353 unsigned int end = batch->end; 354 + 355 + if (batch->kind != kind) { 356 + /* One kind per batch */ 357 + if (batch->end != 0) 358 + return false; 359 + batch->kind = kind; 360 + } 365 361 366 362 if (end && pfn == batch->pfns[end - 1] + batch->npfns[end - 1] && 367 363 nr <= MAX_NPFNS - batch->npfns[end - 1]) { ··· 395 379 /* true if the pfn was added, false otherwise */ 396 380 static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn) 397 381 { 398 - return batch_add_pfn_num(batch, pfn, 1); 382 + return batch_add_pfn_num(batch, pfn, 1, BATCH_CPU_MEMORY); 399 383 } 400 384 
401 385 /* ··· 508 492 { 509 493 bool disable_large_pages = area->iopt->disable_large_pages; 510 494 unsigned long last_iova = iopt_area_last_iova(area); 495 + int iommu_prot = area->iommu_prot; 511 496 unsigned int page_offset = 0; 512 497 unsigned long start_iova; 513 498 unsigned long next_iova; 514 499 unsigned int cur = 0; 515 500 unsigned long iova; 516 501 int rc; 502 + 503 + if (batch->kind == BATCH_MMIO) { 504 + iommu_prot &= ~IOMMU_CACHE; 505 + iommu_prot |= IOMMU_MMIO; 506 + } 517 507 518 508 /* The first index might be a partial page */ 519 509 if (start_index == iopt_area_index(area)) ··· 534 512 rc = batch_iommu_map_small( 535 513 domain, iova, 536 514 PFN_PHYS(batch->pfns[cur]) + page_offset, 537 - next_iova - iova, area->iommu_prot); 515 + next_iova - iova, iommu_prot); 538 516 else 539 517 rc = iommu_map(domain, iova, 540 518 PFN_PHYS(batch->pfns[cur]) + page_offset, 541 - next_iova - iova, area->iommu_prot, 519 + next_iova - iova, iommu_prot, 542 520 GFP_KERNEL_ACCOUNT); 543 521 if (rc) 544 522 goto err_unmap; ··· 674 652 nr = min(nr, npages); 675 653 npages -= nr; 676 654 677 - if (!batch_add_pfn_num(batch, pfn, nr)) 655 + if (!batch_add_pfn_num(batch, pfn, nr, BATCH_CPU_MEMORY)) 678 656 break; 679 657 if (nr > 1) { 680 658 rc = folio_add_pins(folio, nr - 1); ··· 1076 1054 return iopt_pages_update_pinned(pages, npages, inc, user); 1077 1055 } 1078 1056 1057 + struct pfn_reader_dmabuf { 1058 + struct dma_buf_phys_vec phys; 1059 + unsigned long start_offset; 1060 + }; 1061 + 1062 + static int pfn_reader_dmabuf_init(struct pfn_reader_dmabuf *dmabuf, 1063 + struct iopt_pages *pages) 1064 + { 1065 + /* Callers must not get here if the dmabuf was already revoked */ 1066 + if (WARN_ON(iopt_dmabuf_revoked(pages))) 1067 + return -EINVAL; 1068 + 1069 + dmabuf->phys = pages->dmabuf.phys; 1070 + dmabuf->start_offset = pages->dmabuf.start; 1071 + return 0; 1072 + } 1073 + 1074 + static int pfn_reader_fill_dmabuf(struct pfn_reader_dmabuf *dmabuf, 1075 + struct 
pfn_batch *batch, 1076 + unsigned long start_index, 1077 + unsigned long last_index) 1078 + { 1079 + unsigned long start = dmabuf->start_offset + start_index * PAGE_SIZE; 1080 + 1081 + /* 1082 + * start/last_index and start are all PAGE_SIZE aligned, the batch is 1083 + * always filled using page size aligned PFNs just like the other types. 1084 + * If the dmabuf has been sliced on a sub page offset then the common 1085 + * batch to domain code will adjust it before mapping to the domain. 1086 + */ 1087 + batch_add_pfn_num(batch, PHYS_PFN(dmabuf->phys.paddr + start), 1088 + last_index - start_index + 1, BATCH_MMIO); 1089 + return 0; 1090 + } 1091 + 1079 1092 /* 1080 1093 * PFNs are stored in three places, in order of preference: 1081 1094 * - The iopt_pages xarray. This is only populated if there is a ··· 1129 1072 unsigned long batch_end_index; 1130 1073 unsigned long last_index; 1131 1074 1132 - struct pfn_reader_user user; 1075 + union { 1076 + struct pfn_reader_user user; 1077 + struct pfn_reader_dmabuf dmabuf; 1078 + }; 1133 1079 }; 1134 1080 1135 1081 static int pfn_reader_update_pinned(struct pfn_reader *pfns) ··· 1168 1108 { 1169 1109 struct interval_tree_double_span_iter *span = &pfns->span; 1170 1110 unsigned long start_index = pfns->batch_end_index; 1171 - struct pfn_reader_user *user = &pfns->user; 1111 + struct pfn_reader_user *user; 1172 1112 unsigned long npages; 1173 1113 struct iopt_area *area; 1174 1114 int rc; ··· 1200 1140 return 0; 1201 1141 } 1202 1142 1203 - if (start_index >= pfns->user.upages_end) { 1204 - rc = pfn_reader_user_pin(&pfns->user, pfns->pages, start_index, 1143 + if (iopt_is_dmabuf(pfns->pages)) 1144 + return pfn_reader_fill_dmabuf(&pfns->dmabuf, &pfns->batch, 1145 + start_index, span->last_hole); 1146 + 1147 + user = &pfns->user; 1148 + if (start_index >= user->upages_end) { 1149 + rc = pfn_reader_user_pin(user, pfns->pages, start_index, 1205 1150 span->last_hole); 1206 1151 if (rc) 1207 1152 return rc; ··· 1274 1209 
pfns->batch_start_index = start_index; 1275 1210 pfns->batch_end_index = start_index; 1276 1211 pfns->last_index = last_index; 1277 - pfn_reader_user_init(&pfns->user, pages); 1212 + if (iopt_is_dmabuf(pages)) 1213 + pfn_reader_dmabuf_init(&pfns->dmabuf, pages); 1214 + else 1215 + pfn_reader_user_init(&pfns->user, pages); 1278 1216 rc = batch_init(&pfns->batch, last_index - start_index + 1); 1279 1217 if (rc) 1280 1218 return rc; ··· 1298 1230 static void pfn_reader_release_pins(struct pfn_reader *pfns) 1299 1231 { 1300 1232 struct iopt_pages *pages = pfns->pages; 1301 - struct pfn_reader_user *user = &pfns->user; 1233 + struct pfn_reader_user *user; 1302 1234 1235 + if (iopt_is_dmabuf(pages)) 1236 + return; 1237 + 1238 + user = &pfns->user; 1303 1239 if (user->upages_end > pfns->batch_end_index) { 1304 1240 /* Any pages not transferred to the batch are just unpinned */ 1305 1241 ··· 1333 1261 struct iopt_pages *pages = pfns->pages; 1334 1262 1335 1263 pfn_reader_release_pins(pfns); 1336 - pfn_reader_user_destroy(&pfns->user, pfns->pages); 1264 + if (!iopt_is_dmabuf(pfns->pages)) 1265 + pfn_reader_user_destroy(&pfns->user, pfns->pages); 1337 1266 batch_destroy(&pfns->batch, NULL); 1338 1267 WARN_ON(pages->last_npinned != pages->npinned); 1339 1268 } ··· 1413 1340 return pages; 1414 1341 } 1415 1342 1416 - struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start, 1343 + struct iopt_pages *iopt_alloc_file_pages(struct file *file, 1344 + unsigned long start_byte, 1345 + unsigned long start, 1417 1346 unsigned long length, bool writable) 1418 1347 1419 1348 { 1420 1349 struct iopt_pages *pages; 1421 - unsigned long start_down = ALIGN_DOWN(start, PAGE_SIZE); 1422 - unsigned long end; 1423 1350 1424 - if (length && check_add_overflow(start, length - 1, &end)) 1425 - return ERR_PTR(-EOVERFLOW); 1426 - 1427 - pages = iopt_alloc_pages(start - start_down, length, writable); 1351 + pages = iopt_alloc_pages(start_byte, length, writable); 1428 1352 if 
(IS_ERR(pages)) 1429 1353 return pages; 1430 1354 pages->file = get_file(file); 1431 - pages->start = start_down; 1355 + pages->start = start - start_byte; 1432 1356 pages->type = IOPT_ADDRESS_FILE; 1433 1357 return pages; 1358 + } 1359 + 1360 + static void iopt_revoke_notify(struct dma_buf_attachment *attach) 1361 + { 1362 + struct iopt_pages *pages = attach->importer_priv; 1363 + struct iopt_pages_dmabuf_track *track; 1364 + 1365 + guard(mutex)(&pages->mutex); 1366 + if (iopt_dmabuf_revoked(pages)) 1367 + return; 1368 + 1369 + list_for_each_entry(track, &pages->dmabuf.tracker, elm) { 1370 + struct iopt_area *area = track->area; 1371 + 1372 + iopt_area_unmap_domain_range(area, track->domain, 1373 + iopt_area_index(area), 1374 + iopt_area_last_index(area)); 1375 + } 1376 + pages->dmabuf.phys.len = 0; 1377 + } 1378 + 1379 + static struct dma_buf_attach_ops iopt_dmabuf_attach_revoke_ops = { 1380 + .allow_peer2peer = true, 1381 + .move_notify = iopt_revoke_notify, 1382 + }; 1383 + 1384 + /* 1385 + * iommufd and vfio have a circular dependency. Future work for a phys 1386 + * based private interconnect will remove this. 
1387 + */ 1388 + static int 1389 + sym_vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 1390 + struct dma_buf_phys_vec *phys) 1391 + { 1392 + typeof(&vfio_pci_dma_buf_iommufd_map) fn; 1393 + int rc; 1394 + 1395 + rc = iommufd_test_dma_buf_iommufd_map(attachment, phys); 1396 + if (rc != -EOPNOTSUPP) 1397 + return rc; 1398 + 1399 + if (!IS_ENABLED(CONFIG_VFIO_PCI_DMABUF)) 1400 + return -EOPNOTSUPP; 1401 + 1402 + fn = symbol_get(vfio_pci_dma_buf_iommufd_map); 1403 + if (!fn) 1404 + return -EOPNOTSUPP; 1405 + rc = fn(attachment, phys); 1406 + symbol_put(vfio_pci_dma_buf_iommufd_map); 1407 + return rc; 1408 + } 1409 + 1410 + static int iopt_map_dmabuf(struct iommufd_ctx *ictx, struct iopt_pages *pages, 1411 + struct dma_buf *dmabuf) 1412 + { 1413 + struct dma_buf_attachment *attach; 1414 + int rc; 1415 + 1416 + attach = dma_buf_dynamic_attach(dmabuf, iommufd_global_device(), 1417 + &iopt_dmabuf_attach_revoke_ops, pages); 1418 + if (IS_ERR(attach)) 1419 + return PTR_ERR(attach); 1420 + 1421 + dma_resv_lock(dmabuf->resv, NULL); 1422 + /* 1423 + * Lock ordering requires the mutex to be taken inside the reservation, 1424 + * make sure lockdep sees this. 1425 + */ 1426 + if (IS_ENABLED(CONFIG_LOCKDEP)) { 1427 + mutex_lock(&pages->mutex); 1428 + mutex_unlock(&pages->mutex); 1429 + } 1430 + 1431 + rc = sym_vfio_pci_dma_buf_iommufd_map(attach, &pages->dmabuf.phys); 1432 + if (rc) 1433 + goto err_detach; 1434 + 1435 + dma_resv_unlock(dmabuf->resv); 1436 + 1437 + /* On success iopt_release_pages() will detach and put the dmabuf. 
*/ 1438 + pages->dmabuf.attach = attach; 1439 + return 0; 1440 + 1441 + err_detach: 1442 + dma_resv_unlock(dmabuf->resv); 1443 + dma_buf_detach(dmabuf, attach); 1444 + return rc; 1445 + } 1446 + 1447 + struct iopt_pages *iopt_alloc_dmabuf_pages(struct iommufd_ctx *ictx, 1448 + struct dma_buf *dmabuf, 1449 + unsigned long start_byte, 1450 + unsigned long start, 1451 + unsigned long length, bool writable) 1452 + { 1453 + static struct lock_class_key pages_dmabuf_mutex_key; 1454 + struct iopt_pages *pages; 1455 + int rc; 1456 + 1457 + if (!IS_ENABLED(CONFIG_DMA_SHARED_BUFFER)) 1458 + return ERR_PTR(-EOPNOTSUPP); 1459 + 1460 + if (dmabuf->size <= (start + length - 1) || 1461 + length / PAGE_SIZE >= MAX_NPFNS) 1462 + return ERR_PTR(-EINVAL); 1463 + 1464 + pages = iopt_alloc_pages(start_byte, length, writable); 1465 + if (IS_ERR(pages)) 1466 + return pages; 1467 + 1468 + /* 1469 + * The mmap_lock can be held when obtaining the dmabuf reservation lock 1470 + * which creates a locking cycle with the pages mutex which is held 1471 + * while obtaining the mmap_lock. This locking path is not present for 1472 + * IOPT_ADDRESS_DMABUF so split the lock class. 1473 + */ 1474 + lockdep_set_class(&pages->mutex, &pages_dmabuf_mutex_key); 1475 + 1476 + /* dmabuf does not use pinned page accounting. 
*/ 1477 + pages->account_mode = IOPT_PAGES_ACCOUNT_NONE; 1478 + pages->type = IOPT_ADDRESS_DMABUF; 1479 + pages->dmabuf.start = start - start_byte; 1480 + INIT_LIST_HEAD(&pages->dmabuf.tracker); 1481 + 1482 + rc = iopt_map_dmabuf(ictx, pages, dmabuf); 1483 + if (rc) { 1484 + iopt_put_pages(pages); 1485 + return ERR_PTR(rc); 1486 + } 1487 + 1488 + return pages; 1489 + } 1490 + 1491 + int iopt_dmabuf_track_domain(struct iopt_pages *pages, struct iopt_area *area, 1492 + struct iommu_domain *domain) 1493 + { 1494 + struct iopt_pages_dmabuf_track *track; 1495 + 1496 + lockdep_assert_held(&pages->mutex); 1497 + if (WARN_ON(!iopt_is_dmabuf(pages))) 1498 + return -EINVAL; 1499 + 1500 + list_for_each_entry(track, &pages->dmabuf.tracker, elm) 1501 + if (WARN_ON(track->domain == domain && track->area == area)) 1502 + return -EINVAL; 1503 + 1504 + track = kzalloc(sizeof(*track), GFP_KERNEL); 1505 + if (!track) 1506 + return -ENOMEM; 1507 + track->domain = domain; 1508 + track->area = area; 1509 + list_add_tail(&track->elm, &pages->dmabuf.tracker); 1510 + 1511 + return 0; 1512 + } 1513 + 1514 + void iopt_dmabuf_untrack_domain(struct iopt_pages *pages, 1515 + struct iopt_area *area, 1516 + struct iommu_domain *domain) 1517 + { 1518 + struct iopt_pages_dmabuf_track *track; 1519 + 1520 + lockdep_assert_held(&pages->mutex); 1521 + WARN_ON(!iopt_is_dmabuf(pages)); 1522 + 1523 + list_for_each_entry(track, &pages->dmabuf.tracker, elm) { 1524 + if (track->domain == domain && track->area == area) { 1525 + list_del(&track->elm); 1526 + kfree(track); 1527 + return; 1528 + } 1529 + } 1530 + WARN_ON(true); 1531 + } 1532 + 1533 + int iopt_dmabuf_track_all_domains(struct iopt_area *area, 1534 + struct iopt_pages *pages) 1535 + { 1536 + struct iopt_pages_dmabuf_track *track; 1537 + struct iommu_domain *domain; 1538 + unsigned long index; 1539 + int rc; 1540 + 1541 + list_for_each_entry(track, &pages->dmabuf.tracker, elm) 1542 + if (WARN_ON(track->area == area)) 1543 + return -EINVAL; 1544 + 
1545 + xa_for_each(&area->iopt->domains, index, domain) { 1546 + rc = iopt_dmabuf_track_domain(pages, area, domain); 1547 + if (rc) 1548 + goto err_untrack; 1549 + } 1550 + return 0; 1551 + err_untrack: 1552 + iopt_dmabuf_untrack_all_domains(area, pages); 1553 + return rc; 1554 + } 1555 + 1556 + void iopt_dmabuf_untrack_all_domains(struct iopt_area *area, 1557 + struct iopt_pages *pages) 1558 + { 1559 + struct iopt_pages_dmabuf_track *track; 1560 + struct iopt_pages_dmabuf_track *tmp; 1561 + 1562 + list_for_each_entry_safe(track, tmp, &pages->dmabuf.tracker, 1563 + elm) { 1564 + if (track->area == area) { 1565 + list_del(&track->elm); 1566 + kfree(track); 1567 + } 1568 + } 1434 1569 } 1435 1570 1436 1571 void iopt_release_pages(struct kref *kref) ··· 1653 1372 mutex_destroy(&pages->mutex); 1654 1373 put_task_struct(pages->source_task); 1655 1374 free_uid(pages->source_user); 1656 - if (pages->type == IOPT_ADDRESS_FILE) 1375 + if (iopt_is_dmabuf(pages) && pages->dmabuf.attach) { 1376 + struct dma_buf *dmabuf = pages->dmabuf.attach->dmabuf; 1377 + 1378 + dma_buf_detach(dmabuf, pages->dmabuf.attach); 1379 + dma_buf_put(dmabuf); 1380 + WARN_ON(!list_empty(&pages->dmabuf.tracker)); 1381 + } else if (pages->type == IOPT_ADDRESS_FILE) { 1657 1382 fput(pages->file); 1383 + } 1658 1384 kfree(pages); 1659 1385 } 1660 1386 ··· 1739 1451 1740 1452 lockdep_assert_held(&pages->mutex); 1741 1453 1454 + if (iopt_is_dmabuf(pages)) { 1455 + if (WARN_ON(iopt_dmabuf_revoked(pages))) 1456 + return; 1457 + iopt_area_unmap_domain_range(area, domain, start_index, 1458 + last_index); 1459 + return; 1460 + } 1461 + 1742 1462 /* 1743 1463 * For security we must not unpin something that is still DMA mapped, 1744 1464 * so this must unmap any IOVA before we go ahead and unpin the pages. 
··· 1822 1526 void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages, 1823 1527 struct iommu_domain *domain) 1824 1528 { 1529 + if (iopt_dmabuf_revoked(pages)) 1530 + return; 1531 + 1825 1532 __iopt_area_unfill_domain(area, pages, domain, 1826 1533 iopt_area_last_index(area)); 1827 1534 } ··· 1844 1545 int rc; 1845 1546 1846 1547 lockdep_assert_held(&area->pages->mutex); 1548 + 1549 + if (iopt_dmabuf_revoked(area->pages)) 1550 + return 0; 1847 1551 1848 1552 rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area), 1849 1553 iopt_area_last_index(area)); ··· 1907 1605 return 0; 1908 1606 1909 1607 mutex_lock(&pages->mutex); 1910 - rc = pfn_reader_first(&pfns, pages, iopt_area_index(area), 1911 - iopt_area_last_index(area)); 1912 - if (rc) 1913 - goto out_unlock; 1608 + if (iopt_is_dmabuf(pages)) { 1609 + rc = iopt_dmabuf_track_all_domains(area, pages); 1610 + if (rc) 1611 + goto out_unlock; 1612 + } 1914 1613 1915 - while (!pfn_reader_done(&pfns)) { 1916 - done_first_end_index = pfns.batch_end_index; 1917 - done_all_end_index = pfns.batch_start_index; 1918 - xa_for_each(&area->iopt->domains, index, domain) { 1919 - rc = batch_to_domain(&pfns.batch, domain, area, 1920 - pfns.batch_start_index); 1614 + if (!iopt_dmabuf_revoked(pages)) { 1615 + rc = pfn_reader_first(&pfns, pages, iopt_area_index(area), 1616 + iopt_area_last_index(area)); 1617 + if (rc) 1618 + goto out_untrack; 1619 + 1620 + while (!pfn_reader_done(&pfns)) { 1621 + done_first_end_index = pfns.batch_end_index; 1622 + done_all_end_index = pfns.batch_start_index; 1623 + xa_for_each(&area->iopt->domains, index, domain) { 1624 + rc = batch_to_domain(&pfns.batch, domain, area, 1625 + pfns.batch_start_index); 1626 + if (rc) 1627 + goto out_unmap; 1628 + } 1629 + done_all_end_index = done_first_end_index; 1630 + 1631 + rc = pfn_reader_next(&pfns); 1921 1632 if (rc) 1922 1633 goto out_unmap; 1923 1634 } 1924 - done_all_end_index = done_first_end_index; 1925 - 1926 - rc = 
pfn_reader_next(&pfns); 1635 + rc = pfn_reader_update_pinned(&pfns); 1927 1636 if (rc) 1928 1637 goto out_unmap; 1638 + 1639 + pfn_reader_destroy(&pfns); 1929 1640 } 1930 - rc = pfn_reader_update_pinned(&pfns); 1931 - if (rc) 1932 - goto out_unmap; 1933 1641 1934 1642 area->storage_domain = xa_load(&area->iopt->domains, 0); 1935 1643 interval_tree_insert(&area->pages_node, &pages->domains_itree); 1936 - goto out_destroy; 1644 + mutex_unlock(&pages->mutex); 1645 + return 0; 1937 1646 1938 1647 out_unmap: 1939 1648 pfn_reader_release_pins(&pfns); ··· 1971 1658 end_index); 1972 1659 } 1973 1660 } 1974 - out_destroy: 1975 1661 pfn_reader_destroy(&pfns); 1662 + out_untrack: 1663 + if (iopt_is_dmabuf(pages)) 1664 + iopt_dmabuf_untrack_all_domains(area, pages); 1976 1665 out_unlock: 1977 1666 mutex_unlock(&pages->mutex); 1978 1667 return rc; ··· 2000 1685 if (!area->storage_domain) 2001 1686 goto out_unlock; 2002 1687 2003 - xa_for_each(&iopt->domains, index, domain) 2004 - if (domain != area->storage_domain) 1688 + xa_for_each(&iopt->domains, index, domain) { 1689 + if (domain == area->storage_domain) 1690 + continue; 1691 + 1692 + if (!iopt_dmabuf_revoked(pages)) 2005 1693 iopt_area_unmap_domain_range( 2006 1694 area, domain, iopt_area_index(area), 2007 1695 iopt_area_last_index(area)); 1696 + } 2008 1697 2009 1698 if (IS_ENABLED(CONFIG_IOMMUFD_TEST)) 2010 1699 WARN_ON(RB_EMPTY_NODE(&area->pages_node.rb)); 2011 1700 interval_tree_remove(&area->pages_node, &pages->domains_itree); 2012 1701 iopt_area_unfill_domain(area, pages, area->storage_domain); 1702 + if (iopt_is_dmabuf(pages)) 1703 + iopt_dmabuf_untrack_all_domains(area, pages); 2013 1704 area->storage_domain = NULL; 2014 1705 out_unlock: 2015 1706 mutex_unlock(&pages->mutex); ··· 2352 2031 if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable) 2353 2032 return -EPERM; 2354 2033 2355 - if (pages->type == IOPT_ADDRESS_FILE) 2034 + if (iopt_is_dmabuf(pages)) 2035 + return -EINVAL; 2036 + 2037 + if (pages->type != 
IOPT_ADDRESS_USER) 2356 2038 return iopt_pages_rw_slow(pages, start_index, last_index, 2357 2039 start_byte % PAGE_SIZE, data, length, 2358 2040 flags); 2359 - 2360 - if (IS_ENABLED(CONFIG_IOMMUFD_TEST) && 2361 - WARN_ON(pages->type != IOPT_ADDRESS_USER)) 2362 - return -EINVAL; 2363 2041 2364 2042 if (!(flags & IOMMUFD_ACCESS_RW_KTHREAD) && change_mm) { 2365 2043 if (start_index == last_index)
+143
drivers/iommu/iommufd/selftest.c
··· 5 5 */ 6 6 #include <linux/anon_inodes.h> 7 7 #include <linux/debugfs.h> 8 + #include <linux/dma-buf.h> 9 + #include <linux/dma-resv.h> 8 10 #include <linux/fault-inject.h> 9 11 #include <linux/file.h> 10 12 #include <linux/iommu.h> ··· 2033 2031 } 2034 2032 } 2035 2033 2034 + struct iommufd_test_dma_buf { 2035 + void *memory; 2036 + size_t length; 2037 + bool revoked; 2038 + }; 2039 + 2040 + static int iommufd_test_dma_buf_attach(struct dma_buf *dmabuf, 2041 + struct dma_buf_attachment *attachment) 2042 + { 2043 + return 0; 2044 + } 2045 + 2046 + static void iommufd_test_dma_buf_detach(struct dma_buf *dmabuf, 2047 + struct dma_buf_attachment *attachment) 2048 + { 2049 + } 2050 + 2051 + static struct sg_table * 2052 + iommufd_test_dma_buf_map(struct dma_buf_attachment *attachment, 2053 + enum dma_data_direction dir) 2054 + { 2055 + return ERR_PTR(-EOPNOTSUPP); 2056 + } 2057 + 2058 + static void iommufd_test_dma_buf_unmap(struct dma_buf_attachment *attachment, 2059 + struct sg_table *sgt, 2060 + enum dma_data_direction dir) 2061 + { 2062 + } 2063 + 2064 + static void iommufd_test_dma_buf_release(struct dma_buf *dmabuf) 2065 + { 2066 + struct iommufd_test_dma_buf *priv = dmabuf->priv; 2067 + 2068 + kfree(priv->memory); 2069 + kfree(priv); 2070 + } 2071 + 2072 + static const struct dma_buf_ops iommufd_test_dmabuf_ops = { 2073 + .attach = iommufd_test_dma_buf_attach, 2074 + .detach = iommufd_test_dma_buf_detach, 2075 + .map_dma_buf = iommufd_test_dma_buf_map, 2076 + .release = iommufd_test_dma_buf_release, 2077 + .unmap_dma_buf = iommufd_test_dma_buf_unmap, 2078 + }; 2079 + 2080 + int iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 2081 + struct dma_buf_phys_vec *phys) 2082 + { 2083 + struct iommufd_test_dma_buf *priv = attachment->dmabuf->priv; 2084 + 2085 + dma_resv_assert_held(attachment->dmabuf->resv); 2086 + 2087 + if (attachment->dmabuf->ops != &iommufd_test_dmabuf_ops) 2088 + return -EOPNOTSUPP; 2089 + 2090 + if (priv->revoked) 2091 + 
return -ENODEV; 2092 + 2093 + phys->paddr = virt_to_phys(priv->memory); 2094 + phys->len = priv->length; 2095 + return 0; 2096 + } 2097 + 2098 + static int iommufd_test_dmabuf_get(struct iommufd_ucmd *ucmd, 2099 + unsigned int open_flags, 2100 + size_t len) 2101 + { 2102 + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); 2103 + struct iommufd_test_dma_buf *priv; 2104 + struct dma_buf *dmabuf; 2105 + int rc; 2106 + 2107 + len = ALIGN(len, PAGE_SIZE); 2108 + if (len == 0 || len > PAGE_SIZE * 512) 2109 + return -EINVAL; 2110 + 2111 + priv = kzalloc(sizeof(*priv), GFP_KERNEL); 2112 + if (!priv) 2113 + return -ENOMEM; 2114 + 2115 + priv->length = len; 2116 + priv->memory = kzalloc(len, GFP_KERNEL); 2117 + if (!priv->memory) { 2118 + rc = -ENOMEM; 2119 + goto err_free; 2120 + } 2121 + 2122 + exp_info.ops = &iommufd_test_dmabuf_ops; 2123 + exp_info.size = len; 2124 + exp_info.flags = open_flags; 2125 + exp_info.priv = priv; 2126 + 2127 + dmabuf = dma_buf_export(&exp_info); 2128 + if (IS_ERR(dmabuf)) { 2129 + rc = PTR_ERR(dmabuf); 2130 + goto err_free; 2131 + } 2132 + 2133 + return dma_buf_fd(dmabuf, open_flags); 2134 + 2135 + err_free: 2136 + kfree(priv->memory); 2137 + kfree(priv); 2138 + return rc; 2139 + } 2140 + 2141 + static int iommufd_test_dmabuf_revoke(struct iommufd_ucmd *ucmd, int fd, 2142 + bool revoked) 2143 + { 2144 + struct iommufd_test_dma_buf *priv; 2145 + struct dma_buf *dmabuf; 2146 + int rc = 0; 2147 + 2148 + dmabuf = dma_buf_get(fd); 2149 + if (IS_ERR(dmabuf)) 2150 + return PTR_ERR(dmabuf); 2151 + 2152 + if (dmabuf->ops != &iommufd_test_dmabuf_ops) { 2153 + rc = -EOPNOTSUPP; 2154 + goto err_put; 2155 + } 2156 + 2157 + priv = dmabuf->priv; 2158 + dma_resv_lock(dmabuf->resv, NULL); 2159 + priv->revoked = revoked; 2160 + dma_buf_move_notify(dmabuf); 2161 + dma_resv_unlock(dmabuf->resv); 2162 + 2163 + err_put: 2164 + dma_buf_put(dmabuf); 2165 + return rc; 2166 + } 2167 + 2036 2168 int iommufd_test(struct iommufd_ucmd *ucmd) 2037 2169 { 2038 2170 struct iommu_test_cmd 
*cmd = ucmd->cmd; ··· 2245 2109 return iommufd_test_pasid_detach(ucmd, cmd); 2246 2110 case IOMMU_TEST_OP_PASID_CHECK_HWPT: 2247 2111 return iommufd_test_pasid_check_hwpt(ucmd, cmd); 2112 + case IOMMU_TEST_OP_DMABUF_GET: 2113 + return iommufd_test_dmabuf_get(ucmd, cmd->dmabuf_get.open_flags, 2114 + cmd->dmabuf_get.length); 2115 + case IOMMU_TEST_OP_DMABUF_REVOKE: 2116 + return iommufd_test_dmabuf_revoke(ucmd, 2117 + cmd->dmabuf_revoke.dmabuf_fd, 2118 + cmd->dmabuf_revoke.revoked); 2248 2119 default: 2249 2120 return -EOPNOTSUPP; 2250 2121 }
+145 -43
drivers/pci/p2pdma.c
··· 25 25 struct gen_pool *pool; 26 26 bool p2pmem_published; 27 27 struct xarray map_types; 28 + struct p2pdma_provider mem[PCI_STD_NUM_BARS]; 28 29 }; 29 30 30 31 struct pci_p2pdma_pagemap { 31 - struct pci_dev *provider; 32 - u64 bus_offset; 33 32 struct dev_pagemap pgmap; 33 + struct p2pdma_provider *mem; 34 34 }; 35 35 36 36 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap) ··· 204 204 { 205 205 struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page)); 206 206 /* safe to dereference while a reference is held to the percpu ref */ 207 - struct pci_p2pdma *p2pdma = 208 - rcu_dereference_protected(pgmap->provider->p2pdma, 1); 207 + struct pci_p2pdma *p2pdma = rcu_dereference_protected( 208 + to_pci_dev(pgmap->mem->owner)->p2pdma, 1); 209 209 struct percpu_ref *ref; 210 210 211 211 gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page), ··· 228 228 229 229 /* Flush and disable pci_alloc_p2p_mem() */ 230 230 pdev->p2pdma = NULL; 231 - synchronize_rcu(); 231 + if (p2pdma->pool) 232 + synchronize_rcu(); 233 + xa_destroy(&p2pdma->map_types); 234 + 235 + if (!p2pdma->pool) 236 + return; 232 237 233 238 gen_pool_destroy(p2pdma->pool); 234 239 sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group); 235 - xa_destroy(&p2pdma->map_types); 236 240 } 237 241 238 - static int pci_p2pdma_setup(struct pci_dev *pdev) 242 + /** 243 + * pcim_p2pdma_init - Initialise peer-to-peer DMA providers 244 + * @pdev: The PCI device to enable P2PDMA for 245 + * 246 + * This function initializes the peer-to-peer DMA infrastructure 247 + * for a PCI device. It allocates and sets up the necessary data 248 + * structures to support P2PDMA operations, including mapping type 249 + * tracking. 
250 + */ 251 + int pcim_p2pdma_init(struct pci_dev *pdev) 239 252 { 240 - int error = -ENOMEM; 241 253 struct pci_p2pdma *p2p; 254 + int i, ret; 255 + 256 + p2p = rcu_dereference_protected(pdev->p2pdma, 1); 257 + if (p2p) 258 + return 0; 242 259 243 260 p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL); 244 261 if (!p2p) 245 262 return -ENOMEM; 246 263 247 264 xa_init(&p2p->map_types); 265 + /* 266 + * Iterate over all standard PCI BARs and record only those that 267 + * correspond to MMIO regions. Skip non-memory resources (e.g. I/O 268 + * port BARs) since they cannot be used for peer-to-peer (P2P) 269 + * transactions. 270 + */ 271 + for (i = 0; i < PCI_STD_NUM_BARS; i++) { 272 + if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM)) 273 + continue; 248 274 249 - p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev)); 250 - if (!p2p->pool) 251 - goto out; 275 + p2p->mem[i].owner = &pdev->dev; 276 + p2p->mem[i].bus_offset = 277 + pci_bus_address(pdev, i) - pci_resource_start(pdev, i); 278 + } 252 279 253 - error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); 254 - if (error) 255 - goto out_pool_destroy; 256 - 257 - error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group); 258 - if (error) 259 - goto out_pool_destroy; 280 + ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); 281 + if (ret) 282 + goto out_p2p; 260 283 261 284 rcu_assign_pointer(pdev->p2pdma, p2p); 262 285 return 0; 263 286 264 - out_pool_destroy: 265 - gen_pool_destroy(p2p->pool); 266 - out: 287 + out_p2p: 267 288 devm_kfree(&pdev->dev, p2p); 268 - return error; 289 + return ret; 290 + } 291 + EXPORT_SYMBOL_GPL(pcim_p2pdma_init); 292 + 293 + /** 294 + * pcim_p2pdma_provider - Get peer-to-peer DMA provider 295 + * @pdev: The PCI device to enable P2PDMA for 296 + * @bar: BAR index to get provider 297 + * 298 + * This function gets peer-to-peer DMA provider for a PCI device. 
The lifetime 299 + * of the provider (and of course the MMIO) is bound to the lifetime of the 300 + * driver. A driver calling this function must ensure that all references to the 301 + * provider, and any DMA mappings created for any MMIO, are all cleaned up 302 + * before the driver remove() completes. 303 + * 304 + * Since P2P is almost always shared with a second driver this means some system 305 + * to notify, invalidate and revoke the MMIO's DMA must be in place to use this 306 + * function. For example a revoke can be built using DMABUF. 307 + */ 308 + struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar) 309 + { 310 + struct pci_p2pdma *p2p; 311 + 312 + if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM)) 313 + return NULL; 314 + 315 + p2p = rcu_dereference_protected(pdev->p2pdma, 1); 316 + if (WARN_ON(!p2p)) 317 + /* Someone forgot to call to pcim_p2pdma_init() before */ 318 + return NULL; 319 + 320 + return &p2p->mem[bar]; 321 + } 322 + EXPORT_SYMBOL_GPL(pcim_p2pdma_provider); 323 + 324 + static int pci_p2pdma_setup_pool(struct pci_dev *pdev) 325 + { 326 + struct pci_p2pdma *p2pdma; 327 + int ret; 328 + 329 + p2pdma = rcu_dereference_protected(pdev->p2pdma, 1); 330 + if (p2pdma->pool) 331 + /* We already setup pools, do nothing, */ 332 + return 0; 333 + 334 + p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev)); 335 + if (!p2pdma->pool) 336 + return -ENOMEM; 337 + 338 + ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group); 339 + if (ret) 340 + goto out_pool_destroy; 341 + 342 + return 0; 343 + 344 + out_pool_destroy: 345 + gen_pool_destroy(p2pdma->pool); 346 + p2pdma->pool = NULL; 347 + return ret; 269 348 } 270 349 271 350 static void pci_p2pdma_unmap_mappings(void *data) 272 351 { 273 - struct pci_dev *pdev = data; 352 + struct pci_p2pdma_pagemap *p2p_pgmap = data; 274 353 275 354 /* 276 355 * Removing the alloc attribute from sysfs will call 277 356 * unmap_mapping_range() on the inode, teardown any existing 
userspace 278 357 * mappings and prevent new ones from being created. 279 358 */ 280 - sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr, 359 + sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj, 360 + &p2pmem_alloc_attr.attr, 281 361 p2pmem_group.name); 282 362 } 283 363 ··· 375 295 u64 offset) 376 296 { 377 297 struct pci_p2pdma_pagemap *p2p_pgmap; 298 + struct p2pdma_provider *mem; 378 299 struct dev_pagemap *pgmap; 379 300 struct pci_p2pdma *p2pdma; 380 301 void *addr; ··· 393 312 if (size + offset > pci_resource_len(pdev, bar)) 394 313 return -EINVAL; 395 314 396 - if (!pdev->p2pdma) { 397 - error = pci_p2pdma_setup(pdev); 398 - if (error) 399 - return error; 400 - } 315 + error = pcim_p2pdma_init(pdev); 316 + if (error) 317 + return error; 318 + 319 + error = pci_p2pdma_setup_pool(pdev); 320 + if (error) 321 + return error; 322 + 323 + mem = pcim_p2pdma_provider(pdev, bar); 324 + /* 325 + * We checked validity of BAR prior to call 326 + * to pcim_p2pdma_provider. It should never return NULL. 
327 + */ 328 + if (WARN_ON(!mem)) 329 + return -EINVAL; 401 330 402 331 p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL); 403 332 if (!p2p_pgmap) ··· 419 328 pgmap->nr_range = 1; 420 329 pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; 421 330 pgmap->ops = &p2pdma_pgmap_ops; 422 - 423 - p2p_pgmap->provider = pdev; 424 - p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) - 425 - pci_resource_start(pdev, bar); 331 + p2p_pgmap->mem = mem; 426 332 427 333 addr = devm_memremap_pages(&pdev->dev, pgmap); 428 334 if (IS_ERR(addr)) { ··· 428 340 } 429 341 430 342 error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings, 431 - pdev); 343 + p2p_pgmap); 432 344 if (error) 433 345 goto pages_free; 434 346 ··· 1060 972 } 1061 973 EXPORT_SYMBOL_GPL(pci_p2pmem_publish); 1062 974 1063 - static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, 1064 - struct device *dev) 975 + /** 976 + * pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers 977 + * @provider: P2PDMA provider structure 978 + * @dev: Target device for the transfer 979 + * 980 + * Determines how peer-to-peer DMA transfers should be mapped between 981 + * the provider and the target device. The mapping type indicates whether 982 + * the transfer can be done directly through PCI switches or must go 983 + * through the host bridge. 
984 + */ 985 + enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, 986 + struct device *dev) 1065 987 { 1066 988 enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED; 1067 - struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider; 989 + struct pci_dev *pdev = to_pci_dev(provider->owner); 1068 990 struct pci_dev *client; 1069 991 struct pci_p2pdma *p2pdma; 1070 992 int dist; 1071 993 1072 - if (!provider->p2pdma) 994 + if (!pdev->p2pdma) 1073 995 return PCI_P2PDMA_MAP_NOT_SUPPORTED; 1074 996 1075 997 if (!dev_is_pci(dev)) ··· 1088 990 client = to_pci_dev(dev); 1089 991 1090 992 rcu_read_lock(); 1091 - p2pdma = rcu_dereference(provider->p2pdma); 993 + p2pdma = rcu_dereference(pdev->p2pdma); 1092 994 1093 995 if (p2pdma) 1094 996 type = xa_to_value(xa_load(&p2pdma->map_types, ··· 1096 998 rcu_read_unlock(); 1097 999 1098 1000 if (type == PCI_P2PDMA_MAP_UNKNOWN) 1099 - return calc_map_type_and_dist(provider, client, &dist, true); 1001 + return calc_map_type_and_dist(pdev, client, &dist, true); 1100 1002 1101 1003 return type; 1102 1004 } ··· 1104 1006 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, 1105 1007 struct device *dev, struct page *page) 1106 1008 { 1107 - state->pgmap = page_pgmap(page); 1108 - state->map = pci_p2pdma_map_type(state->pgmap, dev); 1109 - state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset; 1009 + struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page)); 1010 + 1011 + if (state->mem == p2p_pgmap->mem) 1012 + return; 1013 + 1014 + state->mem = p2p_pgmap->mem; 1015 + state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev); 1110 1016 } 1111 1017 1112 1018 /**
+3
drivers/vfio/pci/Kconfig
··· 55 55 56 56 To enable s390x KVM vfio-pci extensions, say Y. 57 57 58 + config VFIO_PCI_DMABUF 59 + def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER 60 + 58 61 source "drivers/vfio/pci/mlx5/Kconfig" 59 62 60 63 source "drivers/vfio/pci/hisilicon/Kconfig"
+1
drivers/vfio/pci/Makefile
··· 2 2 3 3 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o 4 4 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o 5 + vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o 5 6 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o 6 7 7 8 vfio-pci-y := vfio_pci.o
+52
drivers/vfio/pci/nvgrace-gpu/main.c
··· 7 7 #include <linux/vfio_pci_core.h> 8 8 #include <linux/delay.h> 9 9 #include <linux/jiffies.h> 10 + #include <linux/pci-p2pdma.h> 10 11 11 12 /* 12 13 * The device memory usable to the workloads running in the VM is cached ··· 684 683 return vfio_pci_core_write(core_vdev, buf, count, ppos); 685 684 } 686 685 686 + static int nvgrace_get_dmabuf_phys(struct vfio_pci_core_device *core_vdev, 687 + struct p2pdma_provider **provider, 688 + unsigned int region_index, 689 + struct dma_buf_phys_vec *phys_vec, 690 + struct vfio_region_dma_range *dma_ranges, 691 + size_t nr_ranges) 692 + { 693 + struct nvgrace_gpu_pci_core_device *nvdev = container_of( 694 + core_vdev, struct nvgrace_gpu_pci_core_device, core_device); 695 + struct pci_dev *pdev = core_vdev->pdev; 696 + struct mem_region *mem_region; 697 + 698 + /* 699 + * if (nvdev->resmem.memlength && region_index == RESMEM_REGION_INDEX) { 700 + * The P2P properties of the non-BAR memory are the same as the 701 + * BAR memory, so just use the provider for index 0. Someday 702 + * when CXL gets P2P support we could create CXLish providers 703 + * for the non-BAR memory. 704 + * } else if (region_index == USEMEM_REGION_INDEX) { 705 + * This is actually cacheable memory and isn't treated as P2P in 706 + * the chip. For now we have no way to push cacheable memory 707 + * through everything and the Grace HW doesn't care what caching 708 + * attribute is programmed into the SMMU. So use BAR 0. 
709 + * } 710 + */ 711 + mem_region = nvgrace_gpu_memregion(region_index, nvdev); 712 + if (mem_region) { 713 + *provider = pcim_p2pdma_provider(pdev, 0); 714 + if (!*provider) 715 + return -EINVAL; 716 + return vfio_pci_core_fill_phys_vec(phys_vec, dma_ranges, 717 + nr_ranges, 718 + mem_region->memphys, 719 + mem_region->memlength); 720 + } 721 + 722 + return vfio_pci_core_get_dmabuf_phys(core_vdev, provider, region_index, 723 + phys_vec, dma_ranges, nr_ranges); 724 + } 725 + 726 + static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_ops = { 727 + .get_dmabuf_phys = nvgrace_get_dmabuf_phys, 728 + }; 729 + 687 730 static const struct vfio_device_ops nvgrace_gpu_pci_ops = { 688 731 .name = "nvgrace-gpu-vfio-pci", 689 732 .init = vfio_pci_core_init_dev, ··· 746 701 .unbind_iommufd = vfio_iommufd_physical_unbind, 747 702 .attach_ioas = vfio_iommufd_physical_attach_ioas, 748 703 .detach_ioas = vfio_iommufd_physical_detach_ioas, 704 + }; 705 + 706 + static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_core_ops = { 707 + .get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys, 749 708 }; 750 709 751 710 static const struct vfio_device_ops nvgrace_gpu_pci_core_ops = { ··· 1014 965 memphys, memlength); 1015 966 if (ret) 1016 967 goto out_put_vdev; 968 + nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_ops; 969 + } else { 970 + nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_core_ops; 1017 971 } 1018 972 1019 973 ret = vfio_pci_core_register_device(&nvdev->core_device);
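nvgrace_get_dmabuf_phys() above claims the device's special memory regions and falls back to vfio_pci_core_get_dmabuf_phys() for plain BARs. That variant-driver dispatch shape can be sketched in plain C (invented types and region numbers, purely illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of the pci_ops dispatch pattern, not the real API. */
struct device;

struct device_ops {
	int (*get_phys)(struct device *dev, int region_index,
			unsigned long *start);
};

struct device {
	const struct device_ops *ops;
	unsigned long special_start;	/* e.g. coherent memory not in a BAR */
};

/* Core fallback: ordinary BAR-backed regions. */
static int core_get_phys(struct device *dev, int region_index,
			 unsigned long *start)
{
	(void)dev;
	*start = 0x1000UL * (region_index + 1);	/* stand-in for pci_resource_start() */
	return 0;
}

/* Variant driver: claim the special region, delegate everything else. */
static int variant_get_phys(struct device *dev, int region_index,
			    unsigned long *start)
{
	if (region_index == 7) {	/* hypothetical USEMEM-style index */
		*start = dev->special_start;
		return 0;
	}
	return core_get_phys(dev, region_index, start);
}

static const struct device_ops variant_ops = {
	.get_phys = variant_get_phys,
};
```

The patch picks between two ops tables at probe time the same way nvgrace-gpu chooses `nvgrace_gpu_pci_dev_ops` or `nvgrace_gpu_pci_dev_core_ops` depending on whether the special memory exists.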
+5
drivers/vfio/pci/vfio_pci.c
··· 147 147 .pasid_detach_ioas = vfio_iommufd_physical_pasid_detach_ioas, 148 148 }; 149 149 150 + static const struct vfio_pci_device_ops vfio_pci_dev_ops = { 151 + .get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys, 152 + }; 153 + 150 154 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) 151 155 { 152 156 struct vfio_pci_core_device *vdev; ··· 165 161 return PTR_ERR(vdev); 166 162 167 163 dev_set_drvdata(&pdev->dev, vdev); 164 + vdev->pci_ops = &vfio_pci_dev_ops; 168 165 ret = vfio_pci_core_register_device(vdev); 169 166 if (ret) 170 167 goto out_put_vdev;
+18 -4
drivers/vfio/pci/vfio_pci_config.c
··· 589 589 virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); 590 590 new_mem = !!(new_cmd & PCI_COMMAND_MEMORY); 591 591 592 - if (!new_mem) 592 + if (!new_mem) { 593 593 vfio_pci_zap_and_down_write_memory_lock(vdev); 594 - else 594 + vfio_pci_dma_buf_move(vdev, true); 595 + } else { 595 596 down_write(&vdev->memory_lock); 597 + } 596 598 597 599 /* 598 600 * If the user is writing mem/io enable (new_mem/io) and we ··· 629 627 *virt_cmd &= cpu_to_le16(~mask); 630 628 *virt_cmd |= cpu_to_le16(new_cmd & mask); 631 629 630 + if (__vfio_pci_memory_enabled(vdev)) 631 + vfio_pci_dma_buf_move(vdev, false); 632 632 up_write(&vdev->memory_lock); 633 633 } 634 634 ··· 711 707 static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev, 712 708 pci_power_t state) 713 709 { 714 - if (state >= PCI_D3hot) 710 + if (state >= PCI_D3hot) { 715 711 vfio_pci_zap_and_down_write_memory_lock(vdev); 716 - else 712 + vfio_pci_dma_buf_move(vdev, true); 713 + } else { 717 714 down_write(&vdev->memory_lock); 715 + } 718 716 719 717 vfio_pci_set_power_state(vdev, state); 718 + if (__vfio_pci_memory_enabled(vdev)) 719 + vfio_pci_dma_buf_move(vdev, false); 720 720 up_write(&vdev->memory_lock); 721 721 } 722 722 ··· 908 900 909 901 if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) { 910 902 vfio_pci_zap_and_down_write_memory_lock(vdev); 903 + vfio_pci_dma_buf_move(vdev, true); 911 904 pci_try_reset_function(vdev->pdev); 905 + if (__vfio_pci_memory_enabled(vdev)) 906 + vfio_pci_dma_buf_move(vdev, false); 912 907 up_write(&vdev->memory_lock); 913 908 } 914 909 } ··· 993 982 994 983 if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) { 995 984 vfio_pci_zap_and_down_write_memory_lock(vdev); 985 + vfio_pci_dma_buf_move(vdev, true); 996 986 pci_try_reset_function(vdev->pdev); 987 + if (__vfio_pci_memory_enabled(vdev)) 988 + vfio_pci_dma_buf_move(vdev, false); 997 989 up_write(&vdev->memory_lock); 998 990 } 999 991 }
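Each site in the hunk above follows the same bracket: revoke the dmabufs before an operation that may disturb MMIO (memory-disable, D3hot, FLR), then restore them only if memory decoding is still enabled afterwards. A toy model of that protocol (plain C, invented names; not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the revoke protocol added around FLR/power transitions. */
struct vdev {
	bool mem_enabled;	/* PCI_COMMAND_MEMORY decoding state */
	bool revoked;		/* dmabuf mappings torn down */
};

static void dma_buf_move(struct vdev *v, bool revoked)
{
	v->revoked = revoked;
}

/*
 * Mirrors the bracket in the patch: revoke, perform the disruptive
 * operation, restore only if memory decoding survived it.
 */
static void reset_with_revoke(struct vdev *v, bool reset_disables_memory)
{
	dma_buf_move(v, true);		/* unmap before the disruptive op */
	if (reset_disables_memory)
		v->mem_enabled = false;	/* stand-in for pci_try_reset_function() */
	if (v->mem_enabled)
		dma_buf_move(v, false);	/* restore only when safe */
}
```

The asymmetry is deliberate: if the operation left memory decoding off, the buffers stay revoked until userspace re-enables decoding, matching the "no automatic remap" safety rule described in the merge message.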
+35 -18
drivers/vfio/pci/vfio_pci_core.c
··· 28 28 #include <linux/nospec.h> 29 29 #include <linux/sched/mm.h> 30 30 #include <linux/iommufd.h> 31 + #include <linux/pci-p2pdma.h> 31 32 #if IS_ENABLED(CONFIG_EEH) 32 33 #include <asm/eeh.h> 33 34 #endif ··· 287 286 * semaphore. 288 287 */ 289 288 vfio_pci_zap_and_down_write_memory_lock(vdev); 289 + vfio_pci_dma_buf_move(vdev, true); 290 + 290 291 if (vdev->pm_runtime_engaged) { 291 292 up_write(&vdev->memory_lock); 292 293 return -EINVAL; ··· 302 299 return 0; 303 300 } 304 301 305 - static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags, 302 + static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags, 306 303 void __user *arg, size_t argsz) 307 304 { 308 - struct vfio_pci_core_device *vdev = 309 - container_of(device, struct vfio_pci_core_device, vdev); 310 305 int ret; 311 306 312 307 ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); ··· 321 320 } 322 321 323 322 static int vfio_pci_core_pm_entry_with_wakeup( 324 - struct vfio_device *device, u32 flags, 323 + struct vfio_pci_core_device *vdev, u32 flags, 325 324 struct vfio_device_low_power_entry_with_wakeup __user *arg, 326 325 size_t argsz) 327 326 { 328 - struct vfio_pci_core_device *vdev = 329 - container_of(device, struct vfio_pci_core_device, vdev); 330 327 struct vfio_device_low_power_entry_with_wakeup entry; 331 328 struct eventfd_ctx *efdctx; 332 329 int ret; ··· 372 373 */ 373 374 down_write(&vdev->memory_lock); 374 375 __vfio_pci_runtime_pm_exit(vdev); 376 + if (__vfio_pci_memory_enabled(vdev)) 377 + vfio_pci_dma_buf_move(vdev, false); 375 378 up_write(&vdev->memory_lock); 376 379 } 377 380 378 - static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags, 381 + static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags, 379 382 void __user *arg, size_t argsz) 380 383 { 381 - struct vfio_pci_core_device *vdev = 382 - container_of(device, struct vfio_pci_core_device, vdev); 383 384 int ret; 384 385 385 386 ret 
= vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); ··· 693 694 eeh_dev_release(vdev->pdev); 694 695 #endif 695 696 vfio_pci_core_disable(vdev); 697 + 698 + vfio_pci_dma_buf_cleanup(vdev); 696 699 697 700 mutex_lock(&vdev->igate); 698 701 if (vdev->err_trigger) { ··· 1228 1227 */ 1229 1228 vfio_pci_set_power_state(vdev, PCI_D0); 1230 1229 1230 + vfio_pci_dma_buf_move(vdev, true); 1231 1231 ret = pci_try_reset_function(vdev->pdev); 1232 + if (__vfio_pci_memory_enabled(vdev)) 1233 + vfio_pci_dma_buf_move(vdev, false); 1232 1234 up_write(&vdev->memory_lock); 1233 1235 1234 1236 return ret; ··· 1477 1473 } 1478 1474 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl); 1479 1475 1480 - static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, 1481 - uuid_t __user *arg, size_t argsz) 1476 + static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev, 1477 + u32 flags, uuid_t __user *arg, 1478 + size_t argsz) 1482 1479 { 1483 - struct vfio_pci_core_device *vdev = 1484 - container_of(device, struct vfio_pci_core_device, vdev); 1485 1480 uuid_t uuid; 1486 1481 int ret; 1487 1482 ··· 1507 1504 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, 1508 1505 void __user *arg, size_t argsz) 1509 1506 { 1507 + struct vfio_pci_core_device *vdev = 1508 + container_of(device, struct vfio_pci_core_device, vdev); 1509 + 1510 1510 switch (flags & VFIO_DEVICE_FEATURE_MASK) { 1511 1511 case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY: 1512 - return vfio_pci_core_pm_entry(device, flags, arg, argsz); 1512 + return vfio_pci_core_pm_entry(vdev, flags, arg, argsz); 1513 1513 case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP: 1514 - return vfio_pci_core_pm_entry_with_wakeup(device, flags, 1514 + return vfio_pci_core_pm_entry_with_wakeup(vdev, flags, 1515 1515 arg, argsz); 1516 1516 case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT: 1517 - return vfio_pci_core_pm_exit(device, flags, arg, argsz); 1517 + return vfio_pci_core_pm_exit(vdev, flags, arg, argsz); 
1518 1518 case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: 1519 - return vfio_pci_core_feature_token(device, flags, arg, argsz); 1519 + return vfio_pci_core_feature_token(vdev, flags, arg, argsz); 1520 + case VFIO_DEVICE_FEATURE_DMA_BUF: 1521 + return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); 1520 1522 default: 1521 1523 return -ENOTTY; 1522 1524 } ··· 2093 2085 { 2094 2086 struct vfio_pci_core_device *vdev = 2095 2087 container_of(core_vdev, struct vfio_pci_core_device, vdev); 2088 + int ret; 2096 2089 2097 2090 vdev->pdev = to_pci_dev(core_vdev->dev); 2098 2091 vdev->irq_type = VFIO_PCI_NUM_IRQS; ··· 2103 2094 INIT_LIST_HEAD(&vdev->dummy_resources_list); 2104 2095 INIT_LIST_HEAD(&vdev->ioeventfds_list); 2105 2096 INIT_LIST_HEAD(&vdev->sriov_pfs_item); 2097 + ret = pcim_p2pdma_init(vdev->pdev); 2098 + if (ret && ret != -EOPNOTSUPP) 2099 + return ret; 2100 + INIT_LIST_HEAD(&vdev->dmabufs); 2106 2101 init_rwsem(&vdev->memory_lock); 2107 2102 xa_init(&vdev->ctx); 2108 2103 ··· 2471 2458 break; 2472 2459 } 2473 2460 2461 + vfio_pci_dma_buf_move(vdev, true); 2474 2462 vfio_pci_zap_bars(vdev); 2475 2463 } 2476 2464 ··· 2500 2486 2501 2487 err_undo: 2502 2488 list_for_each_entry_from_reverse(vdev, &dev_set->device_list, 2503 - vdev.dev_set_list) 2489 + vdev.dev_set_list) { 2490 + if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev)) 2491 + vfio_pci_dma_buf_move(vdev, false); 2504 2492 up_write(&vdev->memory_lock); 2493 + } 2505 2494 2506 2495 list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) 2507 2496 pm_runtime_put(&vdev->pdev->dev);
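The refactor above threads `struct vfio_pci_core_device *` through the feature handlers so `container_of()` runs once in the dispatcher instead of in every handler. A self-contained sketch of that hoisting pattern, using a local container_of-style macro (names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

struct base { int id; };

struct core_device {
	int irq_type;
	struct base vdev;	/* embedded base object */
};

/* Local equivalent of the kernel's container_of() */
#define container_of_(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Handler now takes the derived type directly. */
static int feature_token(struct core_device *cdev)
{
	return cdev->irq_type;
}

static int ioctl_feature(struct base *device)
{
	/* one conversion here instead of in every handler */
	struct core_device *cdev =
		container_of_(device, struct core_device, vdev);

	return feature_token(cdev);
}
```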
+350
drivers/vfio/pci/vfio_pci_dmabuf.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. 3 + */ 4 + #include <linux/dma-buf-mapping.h> 5 + #include <linux/pci-p2pdma.h> 6 + #include <linux/dma-resv.h> 7 + 8 + #include "vfio_pci_priv.h" 9 + 10 + MODULE_IMPORT_NS("DMA_BUF"); 11 + 12 + struct vfio_pci_dma_buf { 13 + struct dma_buf *dmabuf; 14 + struct vfio_pci_core_device *vdev; 15 + struct list_head dmabufs_elm; 16 + size_t size; 17 + struct dma_buf_phys_vec *phys_vec; 18 + struct p2pdma_provider *provider; 19 + u32 nr_ranges; 20 + u8 revoked : 1; 21 + }; 22 + 23 + static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, 24 + struct dma_buf_attachment *attachment) 25 + { 26 + struct vfio_pci_dma_buf *priv = dmabuf->priv; 27 + 28 + if (!attachment->peer2peer) 29 + return -EOPNOTSUPP; 30 + 31 + if (priv->revoked) 32 + return -ENODEV; 33 + 34 + return 0; 35 + } 36 + 37 + static struct sg_table * 38 + vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, 39 + enum dma_data_direction dir) 40 + { 41 + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; 42 + 43 + dma_resv_assert_held(priv->dmabuf->resv); 44 + 45 + if (priv->revoked) 46 + return ERR_PTR(-ENODEV); 47 + 48 + return dma_buf_phys_vec_to_sgt(attachment, priv->provider, 49 + priv->phys_vec, priv->nr_ranges, 50 + priv->size, dir); 51 + } 52 + 53 + static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, 54 + struct sg_table *sgt, 55 + enum dma_data_direction dir) 56 + { 57 + dma_buf_free_sgt(attachment, sgt, dir); 58 + } 59 + 60 + static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) 61 + { 62 + struct vfio_pci_dma_buf *priv = dmabuf->priv; 63 + 64 + /* 65 + * Either this or vfio_pci_dma_buf_cleanup() will remove from the list. 66 + * The refcount prevents both. 
67 + */ 68 + if (priv->vdev) { 69 + down_write(&priv->vdev->memory_lock); 70 + list_del_init(&priv->dmabufs_elm); 71 + up_write(&priv->vdev->memory_lock); 72 + vfio_device_put_registration(&priv->vdev->vdev); 73 + } 74 + kfree(priv->phys_vec); 75 + kfree(priv); 76 + } 77 + 78 + static const struct dma_buf_ops vfio_pci_dmabuf_ops = { 79 + .attach = vfio_pci_dma_buf_attach, 80 + .map_dma_buf = vfio_pci_dma_buf_map, 81 + .unmap_dma_buf = vfio_pci_dma_buf_unmap, 82 + .release = vfio_pci_dma_buf_release, 83 + }; 84 + 85 + /* 86 + * This is a temporary "private interconnect" between VFIO DMABUF and iommufd. 87 + * It allows the two co-operating drivers to exchange the physical address of 88 + * the BAR. This is to be replaced with a formal DMABUF system for negotiated 89 + * interconnect types. 90 + * 91 + * If this function succeeds the following are true: 92 + * - There is one physical range and it is pointing to MMIO 93 + * - When move_notify is called it means revoke, not move, vfio_dma_buf_map 94 + * will fail if it is currently revoked 95 + */ 96 + int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 97 + struct dma_buf_phys_vec *phys) 98 + { 99 + struct vfio_pci_dma_buf *priv; 100 + 101 + dma_resv_assert_held(attachment->dmabuf->resv); 102 + 103 + if (attachment->dmabuf->ops != &vfio_pci_dmabuf_ops) 104 + return -EOPNOTSUPP; 105 + 106 + priv = attachment->dmabuf->priv; 107 + if (priv->revoked) 108 + return -ENODEV; 109 + 110 + /* More than one range to iommufd will require proper DMABUF support */ 111 + if (priv->nr_ranges != 1) 112 + return -EOPNOTSUPP; 113 + 114 + *phys = priv->phys_vec[0]; 115 + return 0; 116 + } 117 + EXPORT_SYMBOL_FOR_MODULES(vfio_pci_dma_buf_iommufd_map, "iommufd"); 118 + 119 + int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec, 120 + struct vfio_region_dma_range *dma_ranges, 121 + size_t nr_ranges, phys_addr_t start, 122 + phys_addr_t len) 123 + { 124 + phys_addr_t max_addr; 125 + unsigned int i; 126 + 127 
+ max_addr = start + len; 128 + for (i = 0; i < nr_ranges; i++) { 129 + phys_addr_t end; 130 + 131 + if (!dma_ranges[i].length) 132 + return -EINVAL; 133 + 134 + if (check_add_overflow(start, dma_ranges[i].offset, 135 + &phys_vec[i].paddr) || 136 + check_add_overflow(phys_vec[i].paddr, 137 + dma_ranges[i].length, &end)) 138 + return -EOVERFLOW; 139 + if (end > max_addr) 140 + return -EINVAL; 141 + 142 + phys_vec[i].len = dma_ranges[i].length; 143 + } 144 + return 0; 145 + } 146 + EXPORT_SYMBOL_GPL(vfio_pci_core_fill_phys_vec); 147 + 148 + int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev, 149 + struct p2pdma_provider **provider, 150 + unsigned int region_index, 151 + struct dma_buf_phys_vec *phys_vec, 152 + struct vfio_region_dma_range *dma_ranges, 153 + size_t nr_ranges) 154 + { 155 + struct pci_dev *pdev = vdev->pdev; 156 + 157 + *provider = pcim_p2pdma_provider(pdev, region_index); 158 + if (!*provider) 159 + return -EINVAL; 160 + 161 + return vfio_pci_core_fill_phys_vec( 162 + phys_vec, dma_ranges, nr_ranges, 163 + pci_resource_start(pdev, region_index), 164 + pci_resource_len(pdev, region_index)); 165 + } 166 + EXPORT_SYMBOL_GPL(vfio_pci_core_get_dmabuf_phys); 167 + 168 + static int validate_dmabuf_input(struct vfio_device_feature_dma_buf *dma_buf, 169 + struct vfio_region_dma_range *dma_ranges, 170 + size_t *lengthp) 171 + { 172 + size_t length = 0; 173 + u32 i; 174 + 175 + for (i = 0; i < dma_buf->nr_ranges; i++) { 176 + u64 offset = dma_ranges[i].offset; 177 + u64 len = dma_ranges[i].length; 178 + 179 + if (!len || !PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) 180 + return -EINVAL; 181 + 182 + if (check_add_overflow(length, len, &length)) 183 + return -EINVAL; 184 + } 185 + 186 + /* 187 + * dma_iova_try_alloc() will WARN if userspace proposes a size that 188 + * is too big, e.g. with lots of ranges. 
189 + */ 190 + if ((u64)(length) & DMA_IOVA_USE_SWIOTLB) 191 + return -EINVAL; 192 + 193 + *lengthp = length; 194 + return 0; 195 + } 196 + 197 + int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, 198 + struct vfio_device_feature_dma_buf __user *arg, 199 + size_t argsz) 200 + { 201 + struct vfio_device_feature_dma_buf get_dma_buf = {}; 202 + struct vfio_region_dma_range *dma_ranges; 203 + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); 204 + struct vfio_pci_dma_buf *priv; 205 + size_t length; 206 + int ret; 207 + 208 + if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys) 209 + return -EOPNOTSUPP; 210 + 211 + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, 212 + sizeof(get_dma_buf)); 213 + if (ret != 1) 214 + return ret; 215 + 216 + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf))) 217 + return -EFAULT; 218 + 219 + if (!get_dma_buf.nr_ranges || get_dma_buf.flags) 220 + return -EINVAL; 221 + 222 + /* 223 + * For PCI the region_index is the BAR number like everything else. 
224 + */ 225 + if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX) 226 + return -ENODEV; 227 + 228 + dma_ranges = memdup_array_user(&arg->dma_ranges, get_dma_buf.nr_ranges, 229 + sizeof(*dma_ranges)); 230 + if (IS_ERR(dma_ranges)) 231 + return PTR_ERR(dma_ranges); 232 + 233 + ret = validate_dmabuf_input(&get_dma_buf, dma_ranges, &length); 234 + if (ret) 235 + goto err_free_ranges; 236 + 237 + priv = kzalloc(sizeof(*priv), GFP_KERNEL); 238 + if (!priv) { 239 + ret = -ENOMEM; 240 + goto err_free_ranges; 241 + } 242 + priv->phys_vec = kcalloc(get_dma_buf.nr_ranges, sizeof(*priv->phys_vec), 243 + GFP_KERNEL); 244 + if (!priv->phys_vec) { 245 + ret = -ENOMEM; 246 + goto err_free_priv; 247 + } 248 + 249 + priv->vdev = vdev; 250 + priv->nr_ranges = get_dma_buf.nr_ranges; 251 + priv->size = length; 252 + ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider, 253 + get_dma_buf.region_index, 254 + priv->phys_vec, dma_ranges, 255 + priv->nr_ranges); 256 + if (ret) 257 + goto err_free_phys; 258 + 259 + kfree(dma_ranges); 260 + dma_ranges = NULL; 261 + 262 + if (!vfio_device_try_get_registration(&vdev->vdev)) { 263 + ret = -ENODEV; 264 + goto err_free_phys; 265 + } 266 + 267 + exp_info.ops = &vfio_pci_dmabuf_ops; 268 + exp_info.size = priv->size; 269 + exp_info.flags = get_dma_buf.open_flags; 270 + exp_info.priv = priv; 271 + 272 + priv->dmabuf = dma_buf_export(&exp_info); 273 + if (IS_ERR(priv->dmabuf)) { 274 + ret = PTR_ERR(priv->dmabuf); 275 + goto err_dev_put; 276 + } 277 + 278 + /* dma_buf_put() now frees priv */ 279 + INIT_LIST_HEAD(&priv->dmabufs_elm); 280 + down_write(&vdev->memory_lock); 281 + dma_resv_lock(priv->dmabuf->resv, NULL); 282 + priv->revoked = !__vfio_pci_memory_enabled(vdev); 283 + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); 284 + dma_resv_unlock(priv->dmabuf->resv); 285 + up_write(&vdev->memory_lock); 286 + 287 + /* 288 + * dma_buf_fd() consumes the reference, when the file closes the dmabuf 289 + * will be released. 
290 + */ 291 + ret = dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags); 292 + if (ret < 0) 293 + goto err_dma_buf; 294 + return ret; 295 + 296 + err_dma_buf: 297 + dma_buf_put(priv->dmabuf); 298 + err_dev_put: 299 + vfio_device_put_registration(&vdev->vdev); 300 + err_free_phys: 301 + kfree(priv->phys_vec); 302 + err_free_priv: 303 + kfree(priv); 304 + err_free_ranges: 305 + kfree(dma_ranges); 306 + return ret; 307 + } 308 + 309 + void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) 310 + { 311 + struct vfio_pci_dma_buf *priv; 312 + struct vfio_pci_dma_buf *tmp; 313 + 314 + lockdep_assert_held_write(&vdev->memory_lock); 315 + 316 + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { 317 + if (!get_file_active(&priv->dmabuf->file)) 318 + continue; 319 + 320 + if (priv->revoked != revoked) { 321 + dma_resv_lock(priv->dmabuf->resv, NULL); 322 + priv->revoked = revoked; 323 + dma_buf_move_notify(priv->dmabuf); 324 + dma_resv_unlock(priv->dmabuf->resv); 325 + } 326 + fput(priv->dmabuf->file); 327 + } 328 + } 329 + 330 + void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) 331 + { 332 + struct vfio_pci_dma_buf *priv; 333 + struct vfio_pci_dma_buf *tmp; 334 + 335 + down_write(&vdev->memory_lock); 336 + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { 337 + if (!get_file_active(&priv->dmabuf->file)) 338 + continue; 339 + 340 + dma_resv_lock(priv->dmabuf->resv, NULL); 341 + list_del_init(&priv->dmabufs_elm); 342 + priv->vdev = NULL; 343 + priv->revoked = true; 344 + dma_buf_move_notify(priv->dmabuf); 345 + dma_resv_unlock(priv->dmabuf->resv); 346 + vfio_device_put_registration(&vdev->vdev); 347 + fput(priv->dmabuf->file); 348 + } 349 + up_write(&vdev->memory_lock); 350 + }
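validate_dmabuf_input() and vfio_pci_core_fill_phys_vec() above lean on check_add_overflow() so hostile offsets and lengths cannot wrap. The same validation can be sketched standalone with the GCC/Clang `__builtin_add_overflow` primitive that the kernel macro wraps (`PAGE_SIZE_` and the range struct are local stand-ins):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_ 4096ULL

struct range { uint64_t offset; uint64_t length; };

/*
 * Sketch of validate_dmabuf_input(): every range must be non-empty and
 * page-aligned, and the summed length must not overflow.
 */
static int validate_ranges(const struct range *r, uint32_t n, uint64_t *total)
{
	uint64_t length = 0;
	uint32_t i;

	for (i = 0; i < n; i++) {
		if (!r[i].length ||
		    r[i].offset % PAGE_SIZE_ || r[i].length % PAGE_SIZE_)
			return -1;
		if (__builtin_add_overflow(length, r[i].length, &length))
			return -1;
	}
	*total = length;
	return 0;
}
```

Rejecting the overflow here, before any allocation is sized from `length`, is what keeps a malicious `nr_ranges` array from producing an undersized `phys_vec`.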
+23
drivers/vfio/pci/vfio_pci_priv.h
··· 107 107 return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; 108 108 } 109 109 110 + #ifdef CONFIG_VFIO_PCI_DMABUF 111 + int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, 112 + struct vfio_device_feature_dma_buf __user *arg, 113 + size_t argsz); 114 + void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); 115 + void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); 116 + #else 117 + static inline int 118 + vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, 119 + struct vfio_device_feature_dma_buf __user *arg, 120 + size_t argsz) 121 + { 122 + return -ENOTTY; 123 + } 124 + static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) 125 + { 126 + } 127 + static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, 128 + bool revoked) 129 + { 130 + } 131 + #endif 132 + 110 133 #endif
+2
drivers/vfio/vfio_main.c
··· 172 172 if (refcount_dec_and_test(&device->refcount)) 173 173 complete(&device->comp); 174 174 } 175 + EXPORT_SYMBOL_GPL(vfio_device_put_registration); 175 176 176 177 bool vfio_device_try_get_registration(struct vfio_device *device) 177 178 { 178 179 return refcount_inc_not_zero(&device->refcount); 179 180 } 181 + EXPORT_SYMBOL_GPL(vfio_device_try_get_registration); 180 182 181 183 /* 182 184 * VFIO driver API
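These two registration helpers are exported so a dmabuf can pin the vfio_device registration for its whole lifetime; the try-get must fail once unregistration has dropped the count toward zero. A minimal model of those semantics (plain C counter, not the kernel's refcount_t API):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the registration refcount the patch exports: a dmabuf holds a
 * registration reference, and try-get fails once the count has hit zero.
 */
struct registration { int refcount; };

static bool try_get(struct registration *r)
{
	if (r->refcount == 0)	/* refcount_inc_not_zero() semantics */
		return false;
	r->refcount++;
	return true;
}

static bool put(struct registration *r)
{
	return --r->refcount == 0;	/* true when the last ref is dropped */
}
```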
+17
include/linux/dma-buf-mapping.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + /* 3 + * DMA BUF Mapping Helpers 4 + * 5 + */ 6 + #ifndef __DMA_BUF_MAPPING_H__ 7 + #define __DMA_BUF_MAPPING_H__ 8 + #include <linux/dma-buf.h> 9 + 10 + struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach, 11 + struct p2pdma_provider *provider, 12 + struct dma_buf_phys_vec *phys_vec, 13 + size_t nr_ranges, size_t size, 14 + enum dma_data_direction dir); 15 + void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, 16 + enum dma_data_direction dir); 17 + #endif
+11
include/linux/dma-buf.h
··· 22 22 #include <linux/fs.h> 23 23 #include <linux/dma-fence.h> 24 24 #include <linux/wait.h> 25 + #include <linux/pci-p2pdma.h> 25 26 26 27 struct device; 27 28 struct dma_buf; ··· 529 528 int flags; 530 529 struct dma_resv *resv; 531 530 void *priv; 531 + }; 532 + 533 + /** 534 + * struct dma_buf_phys_vec - describe a contiguous chunk of memory 535 + * @paddr: physical address of that chunk 536 + * @len: length of this chunk 537 + */ 538 + struct dma_buf_phys_vec { 539 + phys_addr_t paddr; 540 + size_t len; 532 541 }; 533 542 534 543 /**
+73 -47
include/linux/pci-p2pdma.h
··· 16 16 struct block_device; 17 17 struct scatterlist; 18 18 19 + /** 20 + * struct p2pdma_provider 21 + * 22 + * A p2pdma provider is a range of MMIO address space available to the CPU. 23 + */ 24 + struct p2pdma_provider { 25 + struct device *owner; 26 + u64 bus_offset; 27 + }; 28 + 29 + enum pci_p2pdma_map_type { 30 + /* 31 + * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before 32 + * the mapping type has been calculated. Exported routines for the API 33 + * will never return this value. 34 + */ 35 + PCI_P2PDMA_MAP_UNKNOWN = 0, 36 + 37 + /* 38 + * Not a PCI P2PDMA transfer. 39 + */ 40 + PCI_P2PDMA_MAP_NONE, 41 + 42 + /* 43 + * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will 44 + * traverse the host bridge and the host bridge is not in the 45 + * allowlist. DMA Mapping routines should return an error when 46 + * this is returned. 47 + */ 48 + PCI_P2PDMA_MAP_NOT_SUPPORTED, 49 + 50 + /* 51 + * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to 52 + * each other directly through a PCI switch and the transaction will 53 + * not traverse the host bridge. Such a mapping should program 54 + * the DMA engine with PCI bus addresses. 55 + */ 56 + PCI_P2PDMA_MAP_BUS_ADDR, 57 + 58 + /* 59 + * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk 60 + * to each other, but the transaction traverses a host bridge on the 61 + * allowlist. In this case, a normal mapping either with CPU physical 62 + * addresses (in the case of dma-direct) or IOVA addresses (in the 63 + * case of IOMMUs) should be used to program the DMA engine. 
64 + */ 65 + PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, 66 + }; 67 + 19 68 #ifdef CONFIG_PCI_P2PDMA 69 + int pcim_p2pdma_init(struct pci_dev *pdev); 70 + struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar); 20 71 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, 21 72 u64 offset); 22 73 int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients, ··· 84 33 bool *use_p2pdma); 85 34 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, 86 35 bool use_p2pdma); 36 + enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, 37 + struct device *dev); 87 38 #else /* CONFIG_PCI_P2PDMA */ 39 + static inline int pcim_p2pdma_init(struct pci_dev *pdev) 40 + { 41 + return -EOPNOTSUPP; 42 + } 43 + static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, 44 + int bar) 45 + { 46 + return NULL; 47 + } 88 48 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, 89 49 size_t size, u64 offset) 90 50 { ··· 147 85 { 148 86 return sprintf(page, "none\n"); 149 87 } 88 + static inline enum pci_p2pdma_map_type 89 + pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev) 90 + { 91 + return PCI_P2PDMA_MAP_NOT_SUPPORTED; 92 + } 150 93 #endif /* CONFIG_PCI_P2PDMA */ 151 94 152 95 ··· 166 99 return pci_p2pmem_find_many(&client, 1); 167 100 } 168 101 169 - enum pci_p2pdma_map_type { 170 - /* 171 - * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before 172 - * the mapping type has been calculated. Exported routines for the API 173 - * will never return this value. 174 - */ 175 - PCI_P2PDMA_MAP_UNKNOWN = 0, 176 - 177 - /* 178 - * Not a PCI P2PDMA transfer. 179 - */ 180 - PCI_P2PDMA_MAP_NONE, 181 - 182 - /* 183 - * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will 184 - * traverse the host bridge and the host bridge is not in the 185 - * allowlist. DMA Mapping routines should return an error when 186 - * this is returned. 
187 - */ 188 - PCI_P2PDMA_MAP_NOT_SUPPORTED, 189 - 190 - /* 191 - * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to 192 - * each other directly through a PCI switch and the transaction will 193 - * not traverse the host bridge. Such a mapping should program 194 - * the DMA engine with PCI bus addresses. 195 - */ 196 - PCI_P2PDMA_MAP_BUS_ADDR, 197 - 198 - /* 199 - * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk 200 - * to each other, but the transaction traverses a host bridge on the 201 - * allowlist. In this case, a normal mapping either with CPU physical 202 - * addresses (in the case of dma-direct) or IOVA addresses (in the 203 - * case of IOMMUs) should be used to program the DMA engine. 204 - */ 205 - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, 206 - }; 207 - 208 102 struct pci_p2pdma_map_state { 209 - struct dev_pagemap *pgmap; 103 + struct p2pdma_provider *mem; 210 104 enum pci_p2pdma_map_type map; 211 - u64 bus_off; 212 105 }; 106 + 213 107 214 108 /* helper for pci_p2pdma_state(), do not use directly */ 215 109 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, ··· 190 162 struct page *page) 191 163 { 192 164 if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) { 193 - if (state->pgmap != page_pgmap(page)) 194 - __pci_p2pdma_update_state(state, dev, page); 165 + __pci_p2pdma_update_state(state, dev, page); 195 166 return state->map; 196 167 } 197 168 return PCI_P2PDMA_MAP_NONE; ··· 199 172 /** 200 173 * pci_p2pdma_bus_addr_map - Translate a physical address to a bus address 201 174 * for a PCI_P2PDMA_MAP_BUS_ADDR transfer. 202 - * @state: P2P state structure 175 + * @provider: P2P provider structure 203 176 * @paddr: physical address to map 204 177 * 205 178 * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer. 
206 179 */ 207 180 static inline dma_addr_t 208 - pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr) 181 + pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr) 209 182 { 210 - WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR); 211 - return paddr + state->bus_off; 183 + return paddr + provider->bus_offset; 212 184 } 213 185 214 186 #endif /* _LINUX_PCI_P2P_H */
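With the provider carrying a precomputed `bus_offset`, a PCI_P2PDMA_MAP_BUS_ADDR mapping is now a plain add; the offset itself was derived once per BAR as `pci_bus_address() - pci_resource_start()`. A standalone sketch of that arithmetic with made-up addresses:

```c
#include <assert.h>
#include <stdint.h>

/*
 * The provider's bus_offset is computed once from the BAR:
 *   bus_offset = pci_bus_address(bar) - pci_resource_start(bar)
 * and a BUS_ADDR mapping is then just an add. Values below are invented.
 */
struct provider_ { int64_t bus_offset; };

static uint64_t bus_addr_map(const struct provider_ *p, uint64_t paddr)
{
	return paddr + p->bus_offset;	/* wraps correctly for negative offsets */
}
```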
+2
include/linux/vfio.h
··· 297 297 int vfio_register_group_dev(struct vfio_device *device); 298 298 int vfio_register_emulated_iommu_dev(struct vfio_device *device); 299 299 void vfio_unregister_group_dev(struct vfio_device *device); 300 + bool vfio_device_try_get_registration(struct vfio_device *device); 301 + void vfio_device_put_registration(struct vfio_device *device); 300 302 301 303 int vfio_assign_device_set(struct vfio_device *device, void *set_id); 302 304 unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
+46
include/linux/vfio_pci_core.h
···
 
 struct vfio_pci_core_device;
 struct vfio_pci_region;
+struct p2pdma_provider;
+struct dma_buf_phys_vec;
+struct dma_buf_attachment;
 
 struct vfio_pci_regops {
 	ssize_t (*rw)(struct vfio_pci_core_device *vdev, char __user *buf,
···
 	u32			flags;
 };
 
+struct vfio_pci_device_ops {
+	int (*get_dmabuf_phys)(struct vfio_pci_core_device *vdev,
+			       struct p2pdma_provider **provider,
+			       unsigned int region_index,
+			       struct dma_buf_phys_vec *phys_vec,
+			       struct vfio_region_dma_range *dma_ranges,
+			       size_t nr_ranges);
+};
+
+#if IS_ENABLED(CONFIG_VFIO_PCI_DMABUF)
+int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
+				struct vfio_region_dma_range *dma_ranges,
+				size_t nr_ranges, phys_addr_t start,
+				phys_addr_t len);
+int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
+				  struct p2pdma_provider **provider,
+				  unsigned int region_index,
+				  struct dma_buf_phys_vec *phys_vec,
+				  struct vfio_region_dma_range *dma_ranges,
+				  size_t nr_ranges);
+#else
+static inline int
+vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
+			    struct vfio_region_dma_range *dma_ranges,
+			    size_t nr_ranges, phys_addr_t start,
+			    phys_addr_t len)
+{
+	return -EINVAL;
+}
+static inline int vfio_pci_core_get_dmabuf_phys(
+	struct vfio_pci_core_device *vdev, struct p2pdma_provider **provider,
+	unsigned int region_index, struct dma_buf_phys_vec *phys_vec,
+	struct vfio_region_dma_range *dma_ranges, size_t nr_ranges)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 struct vfio_pci_core_device {
 	struct vfio_device	vdev;
 	struct pci_dev		*pdev;
+	const struct vfio_pci_device_ops *pci_ops;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
 	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
 	u8			*pci_config_map;
···
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
 	struct rw_semaphore	memory_lock;
+	struct list_head	dmabufs;
 };
 
 /* Will be exported for vfio pci drivers usage */
···
 #ifdef ioread64
 VFIO_IOREAD_DECLARATION(64)
 #endif
+
+int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
+				 struct dma_buf_phys_vec *phys);
 
 #endif /* VFIO_PCI_CORE_H */
+28
include/uapi/linux/vfio.h
···
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/stddef.h>
 
 #define VFIO_API_VERSION	0
 
···
 #define VFIO_DEVICE_FEATURE_SET_MASTER	1	/* Set Bus Master */
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
+
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
+ * regions selected.
+ *
+ * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
+ * etc. offset/length specify a slice of the region to create the dmabuf from.
+ * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
+ *
+ * flags should be 0.
+ *
+ * Return: The fd number on success, -1 and errno is set on failure.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF 11
+
+struct vfio_region_dma_range {
+	__u64 offset;
+	__u64 length;
+};
+
+struct vfio_device_feature_dma_buf {
+	__u32 region_index;
+	__u32 open_flags;
+	__u32 flags;
+	__u32 nr_ranges;
+	struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
+};
 
 /* -------- API for Type1 VFIO IOMMU -------- */
 
+2 -2
kernel/dma/direct.c
···
 		}
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
-		sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-							  sg_phys(sg));
+		sg->dma_address = pci_p2pdma_bus_addr_map(
+			p2pdma_state.mem, sg_phys(sg));
 		sg_dma_mark_bus_address(sg);
 		continue;
 	default:
+1 -1
mm/hmm.c
···
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
-		return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
+		return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr);
 	default:
 		return DMA_MAPPING_ERROR;
 	}
+43
tools/testing/selftests/iommu/iommufd.c
···
 	test_ioctl_destroy(dst_ioas_id);
 }
 
+TEST_F(iommufd_ioas, dmabuf_simple)
+{
+	size_t buf_size = PAGE_SIZE * 4;
+	__u64 iova;
+	int dfd;
+
+	test_cmd_get_dmabuf(buf_size, &dfd);
+	test_err_ioctl_ioas_map_file(EINVAL, dfd, 0, 0, &iova);
+	test_err_ioctl_ioas_map_file(EINVAL, dfd, buf_size, buf_size, &iova);
+	test_err_ioctl_ioas_map_file(EINVAL, dfd, 0, buf_size + 1, &iova);
+	test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova);
+
+	close(dfd);
+}
+
+TEST_F(iommufd_ioas, dmabuf_revoke)
+{
+	size_t buf_size = PAGE_SIZE * 4;
+	__u32 hwpt_id;
+	__u64 iova;
+	__u64 iova2;
+	int dfd;
+
+	test_cmd_get_dmabuf(buf_size, &dfd);
+	test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova);
+	test_cmd_revoke_dmabuf(dfd, true);
+
+	if (variant->mock_domains)
+		test_cmd_hwpt_alloc(self->device_id, self->ioas_id, 0,
+				    &hwpt_id);
+
+	test_err_ioctl_ioas_map_file(ENODEV, dfd, 0, buf_size, &iova2);
+
+	test_cmd_revoke_dmabuf(dfd, false);
+	test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova2);
+
+	/* Restore the iova back */
+	test_ioctl_ioas_unmap(iova, buf_size);
+	test_ioctl_ioas_map_fixed_file(dfd, 0, buf_size, iova);
+
+	close(dfd);
+}
+
 FIXTURE(iommufd_mock_domain)
 {
 	int fd;
+44
tools/testing/selftests/iommu/iommufd_utils.h
···
 	EXPECT_ERRNO(_errno, _test_cmd_destroy_access_pages(                  \
 				     self->fd, access_id, access_pages_id))
 
+static int _test_cmd_get_dmabuf(int fd, size_t len, int *out_fd)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DMABUF_GET,
+		.dmabuf_get = { .length = len, .open_flags = O_CLOEXEC },
+	};
+
+	*out_fd = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (*out_fd < 0)
+		return -1;
+	return 0;
+}
+#define test_cmd_get_dmabuf(len, out_fd) \
+	ASSERT_EQ(0, _test_cmd_get_dmabuf(self->fd, len, out_fd))
+
+static int _test_cmd_revoke_dmabuf(int fd, int dmabuf_fd, bool revoked)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DMABUF_REVOKE,
+		.dmabuf_revoke = { .dmabuf_fd = dmabuf_fd, .revoked = revoked },
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (ret < 0)
+		return -1;
+	return 0;
+}
+#define test_cmd_revoke_dmabuf(dmabuf_fd, revoke) \
+	ASSERT_EQ(0, _test_cmd_revoke_dmabuf(self->fd, dmabuf_fd, revoke))
+
 static int _test_ioctl_destroy(int fd, unsigned int id)
 {
 	struct iommu_destroy cmd = {
···
 	_test_ioctl_ioas_map_file(                                            \
 		self->fd, ioas_id, mfd, start, length, iova_p,                \
 		IOMMU_IOAS_MAP_WRITEABLE | IOMMU_IOAS_MAP_READABLE))
+
+#define test_ioctl_ioas_map_fixed_file(mfd, start, length, iova)             \
+	({                                                                    \
+		__u64 __iova = iova;                                          \
+		ASSERT_EQ(0, _test_ioctl_ioas_map_file(                       \
+				     self->fd, self->ioas_id, mfd, start,     \
+				     length, &__iova,                         \
+				     IOMMU_IOAS_MAP_FIXED_IOVA |              \
+					     IOMMU_IOAS_MAP_WRITEABLE |       \
+					     IOMMU_IOAS_MAP_READABLE));       \
+	})
 
 static int _test_ioctl_set_temp_memory_limit(int fd, unsigned int limit)
 {