Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'iommufd_dmabuf' into k.o-iommufd/for-next

Jason Gunthorpe says:

====================
This series is the start of adding full DMABUF support to iommufd.
Currently it is limited to working only with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:

https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com/

The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fds, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and, if the underlying alignment
requirements are met, it will be placed in the iommu_domain.

Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer-to-Peer support in VMs, and is the last feature that the VFIO
type 1 container has that iommufd couldn't do.

The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.

Instead iommufd relies on revocable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd,
which will unmap it from the iommu_domain. There is no automatic remap;
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. E.g. when QEMU goes to issue FLR it
should do the map/unmap to iommufd.

Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.

The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().

Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases, so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().

I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.

The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com

And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
====================

Based on a shared branch with vfio.

* iommufd_dmabuf:
iommufd/selftest: Add some tests for the dmabuf flow
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Allow MMIO pages in a batch
iommufd: Allow a DMABUF to be revoked
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Add DMABUF to iopt_pages
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
vfio/nvgrace: Support get_dmabuf_phys
vfio/pci: Add dma-buf export support for MMIO regions
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Share the core device pointer while invoking feature functions
vfio: Export vfio device get and put registration helpers
dma-buf: provide phys_vec to scatter-gather mapping routine
PCI/P2PDMA: Document DMABUF model
PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Separate the mmap() support from the core logic

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

+1887 -211
+73 -22
Documentation/driver-api/pci/p2pdma.rst
@@ -9,22 +9,48 @@
 called Peer-to-Peer (or P2P). However, there are a number of issues that
 make P2P transactions tricky to do in a perfectly safe way.
 
-One of the biggest issues is that PCI doesn't require forwarding
-transactions between hierarchy domains, and in PCIe, each Root Port
-defines a separate hierarchy domain. To make things worse, there is no
-simple way to determine if a given Root Complex supports this or not.
-(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
-only supports doing P2P when the endpoints involved are all behind the
-same PCI bridge, as such devices are all in the same PCI hierarchy
-domain, and the spec guarantees that all transactions within the
-hierarchy will be routable, but it does not require routing
-between hierarchies.
+For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
+until they reach a host bridge or root port. If the path includes PCIe switches
+then based on the ACS settings the transaction can route entirely within
+the PCIe hierarchy and never reach the root port. The kernel will evaluate
+the PCIe topology and always permit P2P in these well-defined cases.
 
-The second issue is that to make use of existing interfaces in Linux,
-memory that is used for P2P transactions needs to be backed by struct
-pages. However, PCI BARs are not typically cache coherent so there are
-a few corner case gotchas with these pages so developers need to
-be careful about what they do with them.
+However, if the P2P transaction reaches the host bridge then it might have to
+hairpin back out the same root port, be routed inside the CPU SOC to another
+PCIe root port, or routed internally to the SOC.
+
+The PCIe specification doesn't define the forwarding of transactions between
+hierarchy domains and the kernel defaults to blocking such routing. There is an
+allow list to allow detecting known-good HW, in which case P2P between any
+two PCIe devices will be permitted.
+
+Since P2P inherently is doing transactions between two devices it requires two
+drivers to be co-operating inside the kernel. The providing driver has to convey
+its MMIO to the consuming driver. To meet the driver model lifecycle rules the
+MMIO must have all DMA mapping removed, all CPU accesses prevented, all page
+table mappings undone before the providing driver completes remove().
+
+This requires the providing and consuming driver to actively work together to
+guarantee that the consuming driver has stopped using the MMIO during a removal
+cycle. This is done by either a synchronous invalidation shutdown or waiting
+for all usage refcounts to reach zero.
+
+At the lowest level the P2P subsystem offers a naked struct p2p_provider that
+delegates lifecycle management to the providing driver. It is expected that
+drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
+to provide an invalidation shutdown. These MMIO addresses have no struct page, and
+if used with mmap() must create special PTEs. As such there are very few
+kernel uAPIs that can accept pointers to them; in particular they cannot be used
+with read()/write(), including O_DIRECT.
+
+Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
+pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
+pgmap ensures that when the pgmap is destroyed all other drivers have stopped
+using the MMIO. This option works with O_DIRECT flows, in some cases, if the
+underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
+FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
+it also relies on architecture support along with alignment and minimum size
+limitations.
 
 
 Driver Writer's Guide
@@ -140,14 +114,39 @@
 Struct Page Caveats
 -------------------
 
-Driver writers should be very careful about not passing these special
-struct pages to code that isn't prepared for it. At this time, the kernel
-interfaces do not have any checks for ensuring this. This obviously
-precludes passing these pages to userspace.
+While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
+pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
 
-P2P memory is also technically IO memory but should never have any side
-effects behind it. Thus, the order of loads and stores should not be important
-and ioreadX(), iowriteX() and friends should not be necessary.
+The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
+KVA is still MMIO and must still be accessed through the normal
+readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
+like any other MMIO mapping. While this will actually work on some
+architectures, others will experience corruption or just crash in the kernel.
+Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
+access happens.
+
+
+Usage With DMABUF
+=================
+
+DMABUF provides an alternative to the above struct page-based
+client/provider/orchestrator system and should be used when struct page
+doesn't exist. In this mode the exporting driver will wrap
+some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
+
+Userspace can then pass the FD to an importing driver which will ask the
+exporting driver to map it to the importer.
+
+In this case the initiator and target pci_devices are known and the P2P subsystem
+is used to determine the mapping type. The phys_addr_t-based DMA API is used to
+establish the dma_addr_t.
+
+Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
+to remove() it must deliver an invalidation shutdown to all DMABUF importing
+drivers through move_notify() and synchronously DMA unmap all the MMIO.
+
+No importing driver can continue to have a DMA map to the MMIO after the
+exporting driver has destroyed its p2p_provider.
 
 
 P2P DMA Support Library
+1 -1
block/blk-mq-dma.c
@@ -85,7 +85,7 @@
 
 static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
 {
-	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+	iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
 	iter->len = vec->len;
 	return true;
 }
+1 -1
drivers/dma-buf/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
-	dma-fence-unwrap.o dma-resv.o
+	dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o
 obj-$(CONFIG_DMABUF_HEAPS) += dma-heap.o
 obj-$(CONFIG_DMABUF_HEAPS) += heaps/
 obj-$(CONFIG_SYNC_FILE) += sync_file.o
+248
drivers/dma-buf/dma-buf-mapping.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * DMA BUF Mapping Helpers 4 + * 5 + */ 6 + #include <linux/dma-buf-mapping.h> 7 + #include <linux/dma-resv.h> 8 + 9 + static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, 10 + dma_addr_t addr) 11 + { 12 + unsigned int len, nents; 13 + int i; 14 + 15 + nents = DIV_ROUND_UP(length, UINT_MAX); 16 + for (i = 0; i < nents; i++) { 17 + len = min_t(size_t, length, UINT_MAX); 18 + length -= len; 19 + /* 20 + * DMABUF abuses scatterlist to create a scatterlist 21 + * that does not have any CPU list, only the DMA list. 22 + * Always set the page related values to NULL to ensure 23 + * importers can't use it. The phys_addr based DMA API 24 + * does not require the CPU list for mapping or unmapping. 25 + */ 26 + sg_set_page(sgl, NULL, 0, 0); 27 + sg_dma_address(sgl) = addr + i * UINT_MAX; 28 + sg_dma_len(sgl) = len; 29 + sgl = sg_next(sgl); 30 + } 31 + 32 + return sgl; 33 + } 34 + 35 + static unsigned int calc_sg_nents(struct dma_iova_state *state, 36 + struct dma_buf_phys_vec *phys_vec, 37 + size_t nr_ranges, size_t size) 38 + { 39 + unsigned int nents = 0; 40 + size_t i; 41 + 42 + if (!state || !dma_use_iova(state)) { 43 + for (i = 0; i < nr_ranges; i++) 44 + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX); 45 + } else { 46 + /* 47 + * In IOVA case, there is only one SG entry which spans 48 + * for whole IOVA address space, but we need to make sure 49 + * that it fits sg->length, maybe we need more. 
50 + */ 51 + nents = DIV_ROUND_UP(size, UINT_MAX); 52 + } 53 + 54 + return nents; 55 + } 56 + 57 + /** 58 + * struct dma_buf_dma - holds DMA mapping information 59 + * @sgt: Scatter-gather table 60 + * @state: DMA IOVA state relevant in IOMMU-based DMA 61 + * @size: Total size of DMA transfer 62 + */ 63 + struct dma_buf_dma { 64 + struct sg_table sgt; 65 + struct dma_iova_state *state; 66 + size_t size; 67 + }; 68 + 69 + /** 70 + * dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment 71 + * from arrays of physical vectors. This function is intended for MMIO memory 72 + * only. 73 + * @attach: [in] attachment whose scatterlist is to be returned 74 + * @provider: [in] p2pdma provider 75 + * @phys_vec: [in] array of physical vectors 76 + * @nr_ranges: [in] number of entries in phys_vec array 77 + * @size: [in] total size of phys_vec 78 + * @dir: [in] direction of DMA transfer 79 + * 80 + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR 81 + * on error. May return -EINTR if it is interrupted by a signal. 82 + * 83 + * On success, the DMA addresses and lengths in the returned scatterlist are 84 + * PAGE_SIZE aligned. 85 + * 86 + * A mapping must be unmapped by using dma_buf_free_sgt(). 87 + * 88 + * NOTE: This function is intended for exporters. If direct traffic routing is 89 + * mandatory the exporter should call pci_p2pdma_map_type() before calling 90 + * this function. 
91 + */ 92 + struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach, 93 + struct p2pdma_provider *provider, 94 + struct dma_buf_phys_vec *phys_vec, 95 + size_t nr_ranges, size_t size, 96 + enum dma_data_direction dir) 97 + { 98 + unsigned int nents, mapped_len = 0; 99 + struct dma_buf_dma *dma; 100 + struct scatterlist *sgl; 101 + dma_addr_t addr; 102 + size_t i; 103 + int ret; 104 + 105 + dma_resv_assert_held(attach->dmabuf->resv); 106 + 107 + if (WARN_ON(!attach || !attach->dmabuf || !provider)) 108 + /* This function is supposed to work on MMIO memory only */ 109 + return ERR_PTR(-EINVAL); 110 + 111 + dma = kzalloc(sizeof(*dma), GFP_KERNEL); 112 + if (!dma) 113 + return ERR_PTR(-ENOMEM); 114 + 115 + switch (pci_p2pdma_map_type(provider, attach->dev)) { 116 + case PCI_P2PDMA_MAP_BUS_ADDR: 117 + /* 118 + * There is no need in IOVA at all for this flow. 119 + */ 120 + break; 121 + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: 122 + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL); 123 + if (!dma->state) { 124 + ret = -ENOMEM; 125 + goto err_free_dma; 126 + } 127 + 128 + dma_iova_try_alloc(attach->dev, dma->state, 0, size); 129 + break; 130 + default: 131 + ret = -EINVAL; 132 + goto err_free_dma; 133 + } 134 + 135 + nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size); 136 + ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO); 137 + if (ret) 138 + goto err_free_state; 139 + 140 + sgl = dma->sgt.sgl; 141 + 142 + for (i = 0; i < nr_ranges; i++) { 143 + if (!dma->state) { 144 + addr = pci_p2pdma_bus_addr_map(provider, 145 + phys_vec[i].paddr); 146 + } else if (dma_use_iova(dma->state)) { 147 + ret = dma_iova_link(attach->dev, dma->state, 148 + phys_vec[i].paddr, 0, 149 + phys_vec[i].len, dir, 150 + DMA_ATTR_MMIO); 151 + if (ret) 152 + goto err_unmap_dma; 153 + 154 + mapped_len += phys_vec[i].len; 155 + } else { 156 + addr = dma_map_phys(attach->dev, phys_vec[i].paddr, 157 + phys_vec[i].len, dir, 158 + DMA_ATTR_MMIO); 159 + ret = 
dma_mapping_error(attach->dev, addr); 160 + if (ret) 161 + goto err_unmap_dma; 162 + } 163 + 164 + if (!dma->state || !dma_use_iova(dma->state)) 165 + sgl = fill_sg_entry(sgl, phys_vec[i].len, addr); 166 + } 167 + 168 + if (dma->state && dma_use_iova(dma->state)) { 169 + WARN_ON_ONCE(mapped_len != size); 170 + ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len); 171 + if (ret) 172 + goto err_unmap_dma; 173 + 174 + sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr); 175 + } 176 + 177 + dma->size = size; 178 + 179 + /* 180 + * No CPU list included — set orig_nents = 0 so others can detect 181 + * this via SG table (use nents only). 182 + */ 183 + dma->sgt.orig_nents = 0; 184 + 185 + 186 + /* 187 + * SGL must be NULL to indicate that SGL is the last one 188 + * and we allocated correct number of entries in sg_alloc_table() 189 + */ 190 + WARN_ON_ONCE(sgl); 191 + return &dma->sgt; 192 + 193 + err_unmap_dma: 194 + if (!i || !dma->state) { 195 + ; /* Do nothing */ 196 + } else if (dma_use_iova(dma->state)) { 197 + dma_iova_destroy(attach->dev, dma->state, mapped_len, dir, 198 + DMA_ATTR_MMIO); 199 + } else { 200 + for_each_sgtable_dma_sg(&dma->sgt, sgl, i) 201 + dma_unmap_phys(attach->dev, sg_dma_address(sgl), 202 + sg_dma_len(sgl), dir, DMA_ATTR_MMIO); 203 + } 204 + sg_free_table(&dma->sgt); 205 + err_free_state: 206 + kfree(dma->state); 207 + err_free_dma: 208 + kfree(dma); 209 + return ERR_PTR(ret); 210 + } 211 + EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF"); 212 + 213 + /** 214 + * dma_buf_free_sgt- unmaps the buffer 215 + * @attach: [in] attachment to unmap buffer from 216 + * @sgt: [in] scatterlist info of the buffer to unmap 217 + * @dir: [in] direction of DMA transfer 218 + * 219 + * This unmaps a DMA mapping for @attached obtained 220 + * by dma_buf_phys_vec_to_sgt(). 
221 + */ 222 + void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, 223 + enum dma_data_direction dir) 224 + { 225 + struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt); 226 + int i; 227 + 228 + dma_resv_assert_held(attach->dmabuf->resv); 229 + 230 + if (!dma->state) { 231 + ; /* Do nothing */ 232 + } else if (dma_use_iova(dma->state)) { 233 + dma_iova_destroy(attach->dev, dma->state, dma->size, dir, 234 + DMA_ATTR_MMIO); 235 + } else { 236 + struct scatterlist *sgl; 237 + 238 + for_each_sgtable_dma_sg(sgt, sgl, i) 239 + dma_unmap_phys(attach->dev, sg_dma_address(sgl), 240 + sg_dma_len(sgl), dir, DMA_ATTR_MMIO); 241 + } 242 + 243 + sg_free_table(sgt); 244 + kfree(dma->state); 245 + kfree(dma); 246 + 247 + } 248 + EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF");
+2 -2
drivers/iommu/dma-iommu.c
@@ -1439,8 +1439,8 @@
 			 * as a bus address, __finalise_sg() will copy the dma
 			 * address into the output segment.
 			 */
-			s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-								 sg_phys(s));
+			s->dma_address = pci_p2pdma_bus_addr_map(
+				p2pdma_state.mem, sg_phys(s));
 			sg_dma_len(s) = sg->length;
 			sg_dma_mark_bus_address(s);
 			continue;
+65 -13
drivers/iommu/iommufd/io_pagetable.c
··· 8 8 * The datastructure uses the iopt_pages to optimize the storage of the PFNs 9 9 * between the domains and xarray. 10 10 */ 11 + #include <linux/dma-buf.h> 11 12 #include <linux/err.h> 12 13 #include <linux/errno.h> 14 + #include <linux/file.h> 13 15 #include <linux/iommu.h> 14 16 #include <linux/iommufd.h> 15 17 #include <linux/lockdep.h> ··· 286 284 case IOPT_ADDRESS_FILE: 287 285 start = elm->start_byte + elm->pages->start; 288 286 break; 287 + case IOPT_ADDRESS_DMABUF: 288 + start = elm->start_byte + elm->pages->dmabuf.start; 289 + break; 289 290 } 290 291 rc = iopt_alloc_iova(iopt, dst_iova, start, length); 291 292 if (rc) ··· 473 468 * @iopt: io_pagetable to act on 474 469 * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains 475 470 * the chosen iova on output. Otherwise is the iova to map to on input 476 - * @file: file to map 471 + * @fd: fdno of a file to map 477 472 * @start: map file starting at this byte offset 478 473 * @length: Number of bytes to map 479 474 * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping 480 475 * @flags: IOPT_ALLOC_IOVA or zero 481 476 */ 482 477 int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt, 483 - unsigned long *iova, struct file *file, 484 - unsigned long start, unsigned long length, 485 - int iommu_prot, unsigned int flags) 478 + unsigned long *iova, int fd, unsigned long start, 479 + unsigned long length, int iommu_prot, 480 + unsigned int flags) 486 481 { 487 482 struct iopt_pages *pages; 483 + struct dma_buf *dmabuf; 484 + unsigned long start_byte; 485 + unsigned long last; 488 486 489 - pages = iopt_alloc_file_pages(file, start, length, 490 - iommu_prot & IOMMU_WRITE); 491 - if (IS_ERR(pages)) 492 - return PTR_ERR(pages); 487 + if (!length) 488 + return -EINVAL; 489 + if (check_add_overflow(start, length - 1, &last)) 490 + return -EOVERFLOW; 491 + 492 + start_byte = start - ALIGN_DOWN(start, PAGE_SIZE); 493 + dmabuf = dma_buf_get(fd); 494 + if 
(!IS_ERR(dmabuf)) { 495 + pages = iopt_alloc_dmabuf_pages(ictx, dmabuf, start_byte, start, 496 + length, 497 + iommu_prot & IOMMU_WRITE); 498 + if (IS_ERR(pages)) { 499 + dma_buf_put(dmabuf); 500 + return PTR_ERR(pages); 501 + } 502 + } else { 503 + struct file *file; 504 + 505 + file = fget(fd); 506 + if (!file) 507 + return -EBADF; 508 + 509 + pages = iopt_alloc_file_pages(file, start_byte, start, length, 510 + iommu_prot & IOMMU_WRITE); 511 + fput(file); 512 + if (IS_ERR(pages)) 513 + return PTR_ERR(pages); 514 + } 515 + 493 516 return iopt_map_common(ictx, iopt, pages, iova, length, 494 - start - pages->start, iommu_prot, flags); 517 + start_byte, iommu_prot, flags); 495 518 } 496 519 497 520 struct iova_bitmap_fn_arg { ··· 994 961 WARN_ON(!area->storage_domain); 995 962 if (area->storage_domain == domain) 996 963 area->storage_domain = storage_domain; 964 + if (iopt_is_dmabuf(pages)) { 965 + if (!iopt_dmabuf_revoked(pages)) 966 + iopt_area_unmap_domain(area, domain); 967 + iopt_dmabuf_untrack_domain(pages, area, domain); 968 + } 997 969 mutex_unlock(&pages->mutex); 998 970 999 - iopt_area_unmap_domain(area, domain); 971 + if (!iopt_is_dmabuf(pages)) 972 + iopt_area_unmap_domain(area, domain); 1000 973 } 1001 974 return; 1002 975 } ··· 1019 980 WARN_ON(area->storage_domain != domain); 1020 981 area->storage_domain = NULL; 1021 982 iopt_area_unfill_domain(area, pages, domain); 983 + if (iopt_is_dmabuf(pages)) 984 + iopt_dmabuf_untrack_domain(pages, area, domain); 1022 985 mutex_unlock(&pages->mutex); 1023 986 } 1024 987 } ··· 1050 1009 if (!pages) 1051 1010 continue; 1052 1011 1053 - mutex_lock(&pages->mutex); 1012 + guard(mutex)(&pages->mutex); 1013 + if (iopt_is_dmabuf(pages)) { 1014 + rc = iopt_dmabuf_track_domain(pages, area, domain); 1015 + if (rc) 1016 + goto out_unfill; 1017 + } 1054 1018 rc = iopt_area_fill_domain(area, domain); 1055 1019 if (rc) { 1056 - mutex_unlock(&pages->mutex); 1020 + if (iopt_is_dmabuf(pages)) 1021 + 
iopt_dmabuf_untrack_domain(pages, area, domain); 1057 1022 goto out_unfill; 1058 1023 } 1059 1024 if (!area->storage_domain) { ··· 1068 1021 interval_tree_insert(&area->pages_node, 1069 1022 &pages->domains_itree); 1070 1023 } 1071 - mutex_unlock(&pages->mutex); 1072 1024 } 1073 1025 return 0; 1074 1026 ··· 1088 1042 area->storage_domain = NULL; 1089 1043 } 1090 1044 iopt_area_unfill_domain(area, pages, domain); 1045 + if (iopt_is_dmabuf(pages)) 1046 + iopt_dmabuf_untrack_domain(pages, area, domain); 1091 1047 mutex_unlock(&pages->mutex); 1092 1048 } 1093 1049 return rc; ··· 1299 1251 1300 1252 if (!pages || area->prevent_access) 1301 1253 return -EBUSY; 1254 + 1255 + /* Maintaining the domains_itree below is a bit complicated */ 1256 + if (iopt_is_dmabuf(pages)) 1257 + return -EOPNOTSUPP; 1302 1258 1303 1259 if (new_start & (alignment - 1) || 1304 1260 iopt_area_start_byte(area, new_start) & (alignment - 1))
+52 -2
drivers/iommu/iommufd/io_pagetable.h
··· 5 5 #ifndef __IO_PAGETABLE_H 6 6 #define __IO_PAGETABLE_H 7 7 8 + #include <linux/dma-buf.h> 8 9 #include <linux/interval_tree.h> 9 10 #include <linux/kref.h> 10 11 #include <linux/mutex.h> ··· 69 68 struct iommu_domain *domain); 70 69 void iopt_area_unmap_domain(struct iopt_area *area, 71 70 struct iommu_domain *domain); 71 + 72 + int iopt_dmabuf_track_domain(struct iopt_pages *pages, struct iopt_area *area, 73 + struct iommu_domain *domain); 74 + void iopt_dmabuf_untrack_domain(struct iopt_pages *pages, 75 + struct iopt_area *area, 76 + struct iommu_domain *domain); 77 + int iopt_dmabuf_track_all_domains(struct iopt_area *area, 78 + struct iopt_pages *pages); 79 + void iopt_dmabuf_untrack_all_domains(struct iopt_area *area, 80 + struct iopt_pages *pages); 72 81 73 82 static inline unsigned long iopt_area_index(struct iopt_area *area) 74 83 { ··· 190 179 191 180 enum iopt_address_type { 192 181 IOPT_ADDRESS_USER = 0, 193 - IOPT_ADDRESS_FILE = 1, 182 + IOPT_ADDRESS_FILE, 183 + IOPT_ADDRESS_DMABUF, 184 + }; 185 + 186 + struct iopt_pages_dmabuf_track { 187 + struct iommu_domain *domain; 188 + struct iopt_area *area; 189 + struct list_head elm; 190 + }; 191 + 192 + struct iopt_pages_dmabuf { 193 + struct dma_buf_attachment *attach; 194 + struct dma_buf_phys_vec phys; 195 + /* Always PAGE_SIZE aligned */ 196 + unsigned long start; 197 + struct list_head tracker; 194 198 }; 195 199 196 200 /* ··· 235 209 struct file *file; 236 210 unsigned long start; 237 211 }; 212 + /* IOPT_ADDRESS_DMABUF */ 213 + struct iopt_pages_dmabuf dmabuf; 238 214 }; 239 215 bool writable:1; 240 216 u8 account_mode; ··· 248 220 struct rb_root_cached domains_itree; 249 221 }; 250 222 223 + static inline bool iopt_is_dmabuf(struct iopt_pages *pages) 224 + { 225 + if (!IS_ENABLED(CONFIG_DMA_SHARED_BUFFER)) 226 + return false; 227 + return pages->type == IOPT_ADDRESS_DMABUF; 228 + } 229 + 230 + static inline bool iopt_dmabuf_revoked(struct iopt_pages *pages) 231 + { 232 + 
lockdep_assert_held(&pages->mutex); 233 + if (iopt_is_dmabuf(pages)) 234 + return pages->dmabuf.phys.len == 0; 235 + return false; 236 + } 237 + 251 238 struct iopt_pages *iopt_alloc_user_pages(void __user *uptr, 252 239 unsigned long length, bool writable); 253 - struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start, 240 + struct iopt_pages *iopt_alloc_file_pages(struct file *file, 241 + unsigned long start_byte, 242 + unsigned long start, 254 243 unsigned long length, bool writable); 244 + struct iopt_pages *iopt_alloc_dmabuf_pages(struct iommufd_ctx *ictx, 245 + struct dma_buf *dmabuf, 246 + unsigned long start_byte, 247 + unsigned long start, 248 + unsigned long length, bool writable); 255 249 void iopt_release_pages(struct kref *kref); 256 250 static inline void iopt_put_pages(struct iopt_pages *pages) 257 251 {
+1 -7
drivers/iommu/iommufd/ioas.c
··· 207 207 unsigned long iova = cmd->iova; 208 208 struct iommufd_ioas *ioas; 209 209 unsigned int flags = 0; 210 - struct file *file; 211 210 int rc; 212 211 213 212 if (cmd->flags & ··· 228 229 if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA)) 229 230 flags = IOPT_ALLOC_IOVA; 230 231 231 - file = fget(cmd->fd); 232 - if (!file) 233 - return -EBADF; 234 - 235 - rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova, file, 232 + rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova, cmd->fd, 236 233 cmd->start, cmd->length, 237 234 conv_iommu_prot(cmd->flags), flags); 238 235 if (rc) ··· 238 243 rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); 239 244 out_put: 240 245 iommufd_put_object(ucmd->ictx, &ioas->obj); 241 - fput(file); 242 246 return rc; 243 247 } 244 248
+13 -1
drivers/iommu/iommufd/iommufd_private.h
··· 19 19 struct iommu_group; 20 20 struct iommu_option; 21 21 struct iommufd_device; 22 + struct dma_buf_attachment; 23 + struct dma_buf_phys_vec; 22 24 23 25 struct iommufd_sw_msi_map { 24 26 struct list_head sw_msi_item; ··· 110 108 unsigned long length, int iommu_prot, 111 109 unsigned int flags); 112 110 int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt, 113 - unsigned long *iova, struct file *file, 111 + unsigned long *iova, int fd, 114 112 unsigned long start, unsigned long length, 115 113 int iommu_prot, unsigned int flags); 116 114 int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list, ··· 506 504 void iommufd_device_destroy(struct iommufd_object *obj); 507 505 int iommufd_get_hw_info(struct iommufd_ucmd *ucmd); 508 506 507 + struct device *iommufd_global_device(void); 508 + 509 509 struct iommufd_access { 510 510 struct iommufd_object obj; 511 511 struct iommufd_ctx *ictx; ··· 715 711 int __init iommufd_test_init(void); 716 712 void iommufd_test_exit(void); 717 713 bool iommufd_selftest_is_mock_dev(struct device *dev); 714 + int iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 715 + struct dma_buf_phys_vec *phys); 718 716 #else 719 717 static inline void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd, 720 718 unsigned int ioas_id, ··· 737 731 static inline bool iommufd_selftest_is_mock_dev(struct device *dev) 738 732 { 739 733 return false; 734 + } 735 + static inline int 736 + iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 737 + struct dma_buf_phys_vec *phys) 738 + { 739 + return -EOPNOTSUPP; 740 740 } 741 741 #endif 742 742 #endif
+10
drivers/iommu/iommufd/iommufd_test.h
··· 29 29 IOMMU_TEST_OP_PASID_REPLACE, 30 30 IOMMU_TEST_OP_PASID_DETACH, 31 31 IOMMU_TEST_OP_PASID_CHECK_HWPT, 32 + IOMMU_TEST_OP_DMABUF_GET, 33 + IOMMU_TEST_OP_DMABUF_REVOKE, 32 34 }; 33 35 34 36 enum { ··· 178 176 __u32 hwpt_id; 179 177 /* @id is stdev_id */ 180 178 } pasid_check; 179 + struct { 180 + __u32 length; 181 + __u32 open_flags; 182 + } dmabuf_get; 183 + struct { 184 + __s32 dmabuf_fd; 185 + __u32 revoked; 186 + } dmabuf_revoke; 181 187 }; 182 188 __u32 last; 183 189 };
+10
drivers/iommu/iommufd/main.c
··· 751 751 .mode = 0666, 752 752 }; 753 753 754 + /* 755 + * Used only by DMABUF, returns a valid struct device to use as a dummy struct 756 + * device for attachment. 757 + */ 758 + struct device *iommufd_global_device(void) 759 + { 760 + return iommu_misc_dev.this_device; 761 + } 762 + 754 763 static int __init iommufd_init(void) 755 764 { 756 765 int ret; ··· 803 794 #endif 804 795 MODULE_IMPORT_NS("IOMMUFD_INTERNAL"); 805 796 MODULE_IMPORT_NS("IOMMUFD"); 797 + MODULE_IMPORT_NS("DMA_BUF"); 806 798 MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices"); 807 799 MODULE_LICENSE("GPL");
+367 -47
drivers/iommu/iommufd/pages.c
··· 45 45 * last_iova + 1 can overflow. An iopt_pages index will always be much less than 46 46 * ULONG_MAX so last_index + 1 cannot overflow. 47 47 */ 48 + #include <linux/dma-buf.h> 49 + #include <linux/dma-resv.h> 48 50 #include <linux/file.h> 49 51 #include <linux/highmem.h> 50 52 #include <linux/iommu.h> ··· 55 53 #include <linux/overflow.h> 56 54 #include <linux/slab.h> 57 55 #include <linux/sched/mm.h> 56 + #include <linux/vfio_pci_core.h> 58 57 59 58 #include "double_span.h" 60 59 #include "io_pagetable.h" ··· 261 258 return container_of(node, struct iopt_area, pages_node); 262 259 } 263 260 261 + enum batch_kind { 262 + BATCH_CPU_MEMORY = 0, 263 + BATCH_MMIO, 264 + }; 265 + 264 266 /* 265 267 * A simple datastructure to hold a vector of PFNs, optimized for contiguous 266 268 * PFNs. This is used as a temporary holding memory for shuttling pfns from one ··· 279 271 unsigned int array_size; 280 272 unsigned int end; 281 273 unsigned int total_pfns; 274 + enum batch_kind kind; 282 275 }; 276 + enum { MAX_NPFNS = type_max(typeof(((struct pfn_batch *)0)->npfns[0])) }; 283 277 284 278 static void batch_clear(struct pfn_batch *batch) 285 279 { ··· 358 348 } 359 349 360 350 static bool batch_add_pfn_num(struct pfn_batch *batch, unsigned long pfn, 361 - u32 nr) 351 + u32 nr, enum batch_kind kind) 362 352 { 363 - const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns)); 364 353 unsigned int end = batch->end; 354 + 355 + if (batch->kind != kind) { 356 + /* One kind per batch */ 357 + if (batch->end != 0) 358 + return false; 359 + batch->kind = kind; 360 + } 365 361 366 362 if (end && pfn == batch->pfns[end - 1] + batch->npfns[end - 1] && 367 363 nr <= MAX_NPFNS - batch->npfns[end - 1]) { ··· 395 379 /* true if the pfn was added, false otherwise */ 396 380 static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn) 397 381 { 398 - return batch_add_pfn_num(batch, pfn, 1); 382 + return batch_add_pfn_num(batch, pfn, 1, BATCH_CPU_MEMORY); 399 383 } 400 384 
401 385 /* ··· 508 492 { 509 493 bool disable_large_pages = area->iopt->disable_large_pages; 510 494 unsigned long last_iova = iopt_area_last_iova(area); 495 + int iommu_prot = area->iommu_prot; 511 496 unsigned int page_offset = 0; 512 497 unsigned long start_iova; 513 498 unsigned long next_iova; 514 499 unsigned int cur = 0; 515 500 unsigned long iova; 516 501 int rc; 502 + 503 + if (batch->kind == BATCH_MMIO) { 504 + iommu_prot &= ~IOMMU_CACHE; 505 + iommu_prot |= IOMMU_MMIO; 506 + } 517 507 518 508 /* The first index might be a partial page */ 519 509 if (start_index == iopt_area_index(area)) ··· 534 512 rc = batch_iommu_map_small( 535 513 domain, iova, 536 514 PFN_PHYS(batch->pfns[cur]) + page_offset, 537 - next_iova - iova, area->iommu_prot); 515 + next_iova - iova, iommu_prot); 538 516 else 539 517 rc = iommu_map(domain, iova, 540 518 PFN_PHYS(batch->pfns[cur]) + page_offset, 541 - next_iova - iova, area->iommu_prot, 519 + next_iova - iova, iommu_prot, 542 520 GFP_KERNEL_ACCOUNT); 543 521 if (rc) 544 522 goto err_unmap; ··· 674 652 nr = min(nr, npages); 675 653 npages -= nr; 676 654 677 - if (!batch_add_pfn_num(batch, pfn, nr)) 655 + if (!batch_add_pfn_num(batch, pfn, nr, BATCH_CPU_MEMORY)) 678 656 break; 679 657 if (nr > 1) { 680 658 rc = folio_add_pins(folio, nr - 1); ··· 1076 1054 return iopt_pages_update_pinned(pages, npages, inc, user); 1077 1055 } 1078 1056 1057 + struct pfn_reader_dmabuf { 1058 + struct dma_buf_phys_vec phys; 1059 + unsigned long start_offset; 1060 + }; 1061 + 1062 + static int pfn_reader_dmabuf_init(struct pfn_reader_dmabuf *dmabuf, 1063 + struct iopt_pages *pages) 1064 + { 1065 + /* Callers must not get here if the dmabuf was already revoked */ 1066 + if (WARN_ON(iopt_dmabuf_revoked(pages))) 1067 + return -EINVAL; 1068 + 1069 + dmabuf->phys = pages->dmabuf.phys; 1070 + dmabuf->start_offset = pages->dmabuf.start; 1071 + return 0; 1072 + } 1073 + 1074 + static int pfn_reader_fill_dmabuf(struct pfn_reader_dmabuf *dmabuf, 1075 + struct 
pfn_batch *batch, 1076 + unsigned long start_index, 1077 + unsigned long last_index) 1078 + { 1079 + unsigned long start = dmabuf->start_offset + start_index * PAGE_SIZE; 1080 + 1081 + /* 1082 + * start/last_index and start are all PAGE_SIZE aligned, the batch is 1083 + * always filled using page size aligned PFNs just like the other types. 1084 + * If the dmabuf has been sliced on a sub page offset then the common 1085 + * batch to domain code will adjust it before mapping to the domain. 1086 + */ 1087 + batch_add_pfn_num(batch, PHYS_PFN(dmabuf->phys.paddr + start), 1088 + last_index - start_index + 1, BATCH_MMIO); 1089 + return 0; 1090 + } 1091 + 1079 1092 /* 1080 1093 * PFNs are stored in three places, in order of preference: 1081 1094 * - The iopt_pages xarray. This is only populated if there is a ··· 1129 1072 unsigned long batch_end_index; 1130 1073 unsigned long last_index; 1131 1074 1132 - struct pfn_reader_user user; 1075 + union { 1076 + struct pfn_reader_user user; 1077 + struct pfn_reader_dmabuf dmabuf; 1078 + }; 1133 1079 }; 1134 1080 1135 1081 static int pfn_reader_update_pinned(struct pfn_reader *pfns) ··· 1168 1108 { 1169 1109 struct interval_tree_double_span_iter *span = &pfns->span; 1170 1110 unsigned long start_index = pfns->batch_end_index; 1171 - struct pfn_reader_user *user = &pfns->user; 1111 + struct pfn_reader_user *user; 1172 1112 unsigned long npages; 1173 1113 struct iopt_area *area; 1174 1114 int rc; ··· 1200 1140 return 0; 1201 1141 } 1202 1142 1203 - if (start_index >= pfns->user.upages_end) { 1204 - rc = pfn_reader_user_pin(&pfns->user, pfns->pages, start_index, 1143 + if (iopt_is_dmabuf(pfns->pages)) 1144 + return pfn_reader_fill_dmabuf(&pfns->dmabuf, &pfns->batch, 1145 + start_index, span->last_hole); 1146 + 1147 + user = &pfns->user; 1148 + if (start_index >= user->upages_end) { 1149 + rc = pfn_reader_user_pin(user, pfns->pages, start_index, 1205 1150 span->last_hole); 1206 1151 if (rc) 1207 1152 return rc; ··· 1274 1209 
pfns->batch_start_index = start_index; 1275 1210 pfns->batch_end_index = start_index; 1276 1211 pfns->last_index = last_index; 1277 - pfn_reader_user_init(&pfns->user, pages); 1212 + if (iopt_is_dmabuf(pages)) 1213 + pfn_reader_dmabuf_init(&pfns->dmabuf, pages); 1214 + else 1215 + pfn_reader_user_init(&pfns->user, pages); 1278 1216 rc = batch_init(&pfns->batch, last_index - start_index + 1); 1279 1217 if (rc) 1280 1218 return rc; ··· 1298 1230 static void pfn_reader_release_pins(struct pfn_reader *pfns) 1299 1231 { 1300 1232 struct iopt_pages *pages = pfns->pages; 1301 - struct pfn_reader_user *user = &pfns->user; 1233 + struct pfn_reader_user *user; 1302 1234 1235 + if (iopt_is_dmabuf(pages)) 1236 + return; 1237 + 1238 + user = &pfns->user; 1303 1239 if (user->upages_end > pfns->batch_end_index) { 1304 1240 /* Any pages not transferred to the batch are just unpinned */ 1305 1241 ··· 1333 1261 struct iopt_pages *pages = pfns->pages; 1334 1262 1335 1263 pfn_reader_release_pins(pfns); 1336 - pfn_reader_user_destroy(&pfns->user, pfns->pages); 1264 + if (!iopt_is_dmabuf(pfns->pages)) 1265 + pfn_reader_user_destroy(&pfns->user, pfns->pages); 1337 1266 batch_destroy(&pfns->batch, NULL); 1338 1267 WARN_ON(pages->last_npinned != pages->npinned); 1339 1268 } ··· 1413 1340 return pages; 1414 1341 } 1415 1342 1416 - struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start, 1343 + struct iopt_pages *iopt_alloc_file_pages(struct file *file, 1344 + unsigned long start_byte, 1345 + unsigned long start, 1417 1346 unsigned long length, bool writable) 1418 1347 1419 1348 { 1420 1349 struct iopt_pages *pages; 1421 - unsigned long start_down = ALIGN_DOWN(start, PAGE_SIZE); 1422 - unsigned long end; 1423 1350 1424 - if (length && check_add_overflow(start, length - 1, &end)) 1425 - return ERR_PTR(-EOVERFLOW); 1426 - 1427 - pages = iopt_alloc_pages(start - start_down, length, writable); 1351 + pages = iopt_alloc_pages(start_byte, length, writable); 1428 1352 if 
(IS_ERR(pages)) 1429 1353 return pages; 1430 1354 pages->file = get_file(file); 1431 - pages->start = start_down; 1355 + pages->start = start - start_byte; 1432 1356 pages->type = IOPT_ADDRESS_FILE; 1433 1357 return pages; 1358 + } 1359 + 1360 + static void iopt_revoke_notify(struct dma_buf_attachment *attach) 1361 + { 1362 + struct iopt_pages *pages = attach->importer_priv; 1363 + struct iopt_pages_dmabuf_track *track; 1364 + 1365 + guard(mutex)(&pages->mutex); 1366 + if (iopt_dmabuf_revoked(pages)) 1367 + return; 1368 + 1369 + list_for_each_entry(track, &pages->dmabuf.tracker, elm) { 1370 + struct iopt_area *area = track->area; 1371 + 1372 + iopt_area_unmap_domain_range(area, track->domain, 1373 + iopt_area_index(area), 1374 + iopt_area_last_index(area)); 1375 + } 1376 + pages->dmabuf.phys.len = 0; 1377 + } 1378 + 1379 + static struct dma_buf_attach_ops iopt_dmabuf_attach_revoke_ops = { 1380 + .allow_peer2peer = true, 1381 + .move_notify = iopt_revoke_notify, 1382 + }; 1383 + 1384 + /* 1385 + * iommufd and vfio have a circular dependency. Future work for a phys 1386 + * based private interconnect will remove this. 
1387 + */ 1388 + static int 1389 + sym_vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 1390 + struct dma_buf_phys_vec *phys) 1391 + { 1392 + typeof(&vfio_pci_dma_buf_iommufd_map) fn; 1393 + int rc; 1394 + 1395 + rc = iommufd_test_dma_buf_iommufd_map(attachment, phys); 1396 + if (rc != -EOPNOTSUPP) 1397 + return rc; 1398 + 1399 + if (!IS_ENABLED(CONFIG_VFIO_PCI_DMABUF)) 1400 + return -EOPNOTSUPP; 1401 + 1402 + fn = symbol_get(vfio_pci_dma_buf_iommufd_map); 1403 + if (!fn) 1404 + return -EOPNOTSUPP; 1405 + rc = fn(attachment, phys); 1406 + symbol_put(vfio_pci_dma_buf_iommufd_map); 1407 + return rc; 1408 + } 1409 + 1410 + static int iopt_map_dmabuf(struct iommufd_ctx *ictx, struct iopt_pages *pages, 1411 + struct dma_buf *dmabuf) 1412 + { 1413 + struct dma_buf_attachment *attach; 1414 + int rc; 1415 + 1416 + attach = dma_buf_dynamic_attach(dmabuf, iommufd_global_device(), 1417 + &iopt_dmabuf_attach_revoke_ops, pages); 1418 + if (IS_ERR(attach)) 1419 + return PTR_ERR(attach); 1420 + 1421 + dma_resv_lock(dmabuf->resv, NULL); 1422 + /* 1423 + * Lock ordering requires the mutex to be taken inside the reservation, 1424 + * make sure lockdep sees this. 1425 + */ 1426 + if (IS_ENABLED(CONFIG_LOCKDEP)) { 1427 + mutex_lock(&pages->mutex); 1428 + mutex_unlock(&pages->mutex); 1429 + } 1430 + 1431 + rc = sym_vfio_pci_dma_buf_iommufd_map(attach, &pages->dmabuf.phys); 1432 + if (rc) 1433 + goto err_detach; 1434 + 1435 + dma_resv_unlock(dmabuf->resv); 1436 + 1437 + /* On success iopt_release_pages() will detach and put the dmabuf. 
*/ 1438 + pages->dmabuf.attach = attach; 1439 + return 0; 1440 + 1441 + err_detach: 1442 + dma_resv_unlock(dmabuf->resv); 1443 + dma_buf_detach(dmabuf, attach); 1444 + return rc; 1445 + } 1446 + 1447 + struct iopt_pages *iopt_alloc_dmabuf_pages(struct iommufd_ctx *ictx, 1448 + struct dma_buf *dmabuf, 1449 + unsigned long start_byte, 1450 + unsigned long start, 1451 + unsigned long length, bool writable) 1452 + { 1453 + static struct lock_class_key pages_dmabuf_mutex_key; 1454 + struct iopt_pages *pages; 1455 + int rc; 1456 + 1457 + if (!IS_ENABLED(CONFIG_DMA_SHARED_BUFFER)) 1458 + return ERR_PTR(-EOPNOTSUPP); 1459 + 1460 + if (dmabuf->size <= (start + length - 1) || 1461 + length / PAGE_SIZE >= MAX_NPFNS) 1462 + return ERR_PTR(-EINVAL); 1463 + 1464 + pages = iopt_alloc_pages(start_byte, length, writable); 1465 + if (IS_ERR(pages)) 1466 + return pages; 1467 + 1468 + /* 1469 + * The mmap_lock can be held when obtaining the dmabuf reservation lock 1470 + * which creates a locking cycle with the pages mutex which is held 1471 + * while obtaining the mmap_lock. This locking path is not present for 1472 + * IOPT_ADDRESS_DMABUF so split the lock class. 1473 + */ 1474 + lockdep_set_class(&pages->mutex, &pages_dmabuf_mutex_key); 1475 + 1476 + /* dmabuf does not use pinned page accounting. 
*/ 1477 + pages->account_mode = IOPT_PAGES_ACCOUNT_NONE; 1478 + pages->type = IOPT_ADDRESS_DMABUF; 1479 + pages->dmabuf.start = start - start_byte; 1480 + INIT_LIST_HEAD(&pages->dmabuf.tracker); 1481 + 1482 + rc = iopt_map_dmabuf(ictx, pages, dmabuf); 1483 + if (rc) { 1484 + iopt_put_pages(pages); 1485 + return ERR_PTR(rc); 1486 + } 1487 + 1488 + return pages; 1489 + } 1490 + 1491 + int iopt_dmabuf_track_domain(struct iopt_pages *pages, struct iopt_area *area, 1492 + struct iommu_domain *domain) 1493 + { 1494 + struct iopt_pages_dmabuf_track *track; 1495 + 1496 + lockdep_assert_held(&pages->mutex); 1497 + if (WARN_ON(!iopt_is_dmabuf(pages))) 1498 + return -EINVAL; 1499 + 1500 + list_for_each_entry(track, &pages->dmabuf.tracker, elm) 1501 + if (WARN_ON(track->domain == domain && track->area == area)) 1502 + return -EINVAL; 1503 + 1504 + track = kzalloc(sizeof(*track), GFP_KERNEL); 1505 + if (!track) 1506 + return -ENOMEM; 1507 + track->domain = domain; 1508 + track->area = area; 1509 + list_add_tail(&track->elm, &pages->dmabuf.tracker); 1510 + 1511 + return 0; 1512 + } 1513 + 1514 + void iopt_dmabuf_untrack_domain(struct iopt_pages *pages, 1515 + struct iopt_area *area, 1516 + struct iommu_domain *domain) 1517 + { 1518 + struct iopt_pages_dmabuf_track *track; 1519 + 1520 + lockdep_assert_held(&pages->mutex); 1521 + WARN_ON(!iopt_is_dmabuf(pages)); 1522 + 1523 + list_for_each_entry(track, &pages->dmabuf.tracker, elm) { 1524 + if (track->domain == domain && track->area == area) { 1525 + list_del(&track->elm); 1526 + kfree(track); 1527 + return; 1528 + } 1529 + } 1530 + WARN_ON(true); 1531 + } 1532 + 1533 + int iopt_dmabuf_track_all_domains(struct iopt_area *area, 1534 + struct iopt_pages *pages) 1535 + { 1536 + struct iopt_pages_dmabuf_track *track; 1537 + struct iommu_domain *domain; 1538 + unsigned long index; 1539 + int rc; 1540 + 1541 + list_for_each_entry(track, &pages->dmabuf.tracker, elm) 1542 + if (WARN_ON(track->area == area)) 1543 + return -EINVAL; 1544 + 
1545 + xa_for_each(&area->iopt->domains, index, domain) { 1546 + rc = iopt_dmabuf_track_domain(pages, area, domain); 1547 + if (rc) 1548 + goto err_untrack; 1549 + } 1550 + return 0; 1551 + err_untrack: 1552 + iopt_dmabuf_untrack_all_domains(area, pages); 1553 + return rc; 1554 + } 1555 + 1556 + void iopt_dmabuf_untrack_all_domains(struct iopt_area *area, 1557 + struct iopt_pages *pages) 1558 + { 1559 + struct iopt_pages_dmabuf_track *track; 1560 + struct iopt_pages_dmabuf_track *tmp; 1561 + 1562 + list_for_each_entry_safe(track, tmp, &pages->dmabuf.tracker, 1563 + elm) { 1564 + if (track->area == area) { 1565 + list_del(&track->elm); 1566 + kfree(track); 1567 + } 1568 + } 1434 1569 } 1435 1570 1436 1571 void iopt_release_pages(struct kref *kref) ··· 1653 1372 mutex_destroy(&pages->mutex); 1654 1373 put_task_struct(pages->source_task); 1655 1374 free_uid(pages->source_user); 1656 - if (pages->type == IOPT_ADDRESS_FILE) 1375 + if (iopt_is_dmabuf(pages) && pages->dmabuf.attach) { 1376 + struct dma_buf *dmabuf = pages->dmabuf.attach->dmabuf; 1377 + 1378 + dma_buf_detach(dmabuf, pages->dmabuf.attach); 1379 + dma_buf_put(dmabuf); 1380 + WARN_ON(!list_empty(&pages->dmabuf.tracker)); 1381 + } else if (pages->type == IOPT_ADDRESS_FILE) { 1657 1382 fput(pages->file); 1383 + } 1658 1384 kfree(pages); 1659 1385 } 1660 1386 ··· 1739 1451 1740 1452 lockdep_assert_held(&pages->mutex); 1741 1453 1454 + if (iopt_is_dmabuf(pages)) { 1455 + if (WARN_ON(iopt_dmabuf_revoked(pages))) 1456 + return; 1457 + iopt_area_unmap_domain_range(area, domain, start_index, 1458 + last_index); 1459 + return; 1460 + } 1461 + 1742 1462 /* 1743 1463 * For security we must not unpin something that is still DMA mapped, 1744 1464 * so this must unmap any IOVA before we go ahead and unpin the pages. 
··· 1822 1526 void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages, 1823 1527 struct iommu_domain *domain) 1824 1528 { 1529 + if (iopt_dmabuf_revoked(pages)) 1530 + return; 1531 + 1825 1532 __iopt_area_unfill_domain(area, pages, domain, 1826 1533 iopt_area_last_index(area)); 1827 1534 } ··· 1844 1545 int rc; 1845 1546 1846 1547 lockdep_assert_held(&area->pages->mutex); 1548 + 1549 + if (iopt_dmabuf_revoked(area->pages)) 1550 + return 0; 1847 1551 1848 1552 rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area), 1849 1553 iopt_area_last_index(area)); ··· 1907 1605 return 0; 1908 1606 1909 1607 mutex_lock(&pages->mutex); 1910 - rc = pfn_reader_first(&pfns, pages, iopt_area_index(area), 1911 - iopt_area_last_index(area)); 1912 - if (rc) 1913 - goto out_unlock; 1608 + if (iopt_is_dmabuf(pages)) { 1609 + rc = iopt_dmabuf_track_all_domains(area, pages); 1610 + if (rc) 1611 + goto out_unlock; 1612 + } 1914 1613 1915 - while (!pfn_reader_done(&pfns)) { 1916 - done_first_end_index = pfns.batch_end_index; 1917 - done_all_end_index = pfns.batch_start_index; 1918 - xa_for_each(&area->iopt->domains, index, domain) { 1919 - rc = batch_to_domain(&pfns.batch, domain, area, 1920 - pfns.batch_start_index); 1614 + if (!iopt_dmabuf_revoked(pages)) { 1615 + rc = pfn_reader_first(&pfns, pages, iopt_area_index(area), 1616 + iopt_area_last_index(area)); 1617 + if (rc) 1618 + goto out_untrack; 1619 + 1620 + while (!pfn_reader_done(&pfns)) { 1621 + done_first_end_index = pfns.batch_end_index; 1622 + done_all_end_index = pfns.batch_start_index; 1623 + xa_for_each(&area->iopt->domains, index, domain) { 1624 + rc = batch_to_domain(&pfns.batch, domain, area, 1625 + pfns.batch_start_index); 1626 + if (rc) 1627 + goto out_unmap; 1628 + } 1629 + done_all_end_index = done_first_end_index; 1630 + 1631 + rc = pfn_reader_next(&pfns); 1921 1632 if (rc) 1922 1633 goto out_unmap; 1923 1634 } 1924 - done_all_end_index = done_first_end_index; 1925 - 1926 - rc = 
pfn_reader_next(&pfns); 1635 + rc = pfn_reader_update_pinned(&pfns); 1927 1636 if (rc) 1928 1637 goto out_unmap; 1638 + 1639 + pfn_reader_destroy(&pfns); 1929 1640 } 1930 - rc = pfn_reader_update_pinned(&pfns); 1931 - if (rc) 1932 - goto out_unmap; 1933 1641 1934 1642 area->storage_domain = xa_load(&area->iopt->domains, 0); 1935 1643 interval_tree_insert(&area->pages_node, &pages->domains_itree); 1936 - goto out_destroy; 1644 + mutex_unlock(&pages->mutex); 1645 + return 0; 1937 1646 1938 1647 out_unmap: 1939 1648 pfn_reader_release_pins(&pfns); ··· 1971 1658 end_index); 1972 1659 } 1973 1660 } 1974 - out_destroy: 1975 1661 pfn_reader_destroy(&pfns); 1662 + out_untrack: 1663 + if (iopt_is_dmabuf(pages)) 1664 + iopt_dmabuf_untrack_all_domains(area, pages); 1976 1665 out_unlock: 1977 1666 mutex_unlock(&pages->mutex); 1978 1667 return rc; ··· 2000 1685 if (!area->storage_domain) 2001 1686 goto out_unlock; 2002 1687 2003 - xa_for_each(&iopt->domains, index, domain) 2004 - if (domain != area->storage_domain) 1688 + xa_for_each(&iopt->domains, index, domain) { 1689 + if (domain == area->storage_domain) 1690 + continue; 1691 + 1692 + if (!iopt_dmabuf_revoked(pages)) 2005 1693 iopt_area_unmap_domain_range( 2006 1694 area, domain, iopt_area_index(area), 2007 1695 iopt_area_last_index(area)); 1696 + } 2008 1697 2009 1698 if (IS_ENABLED(CONFIG_IOMMUFD_TEST)) 2010 1699 WARN_ON(RB_EMPTY_NODE(&area->pages_node.rb)); 2011 1700 interval_tree_remove(&area->pages_node, &pages->domains_itree); 2012 1701 iopt_area_unfill_domain(area, pages, area->storage_domain); 1702 + if (iopt_is_dmabuf(pages)) 1703 + iopt_dmabuf_untrack_all_domains(area, pages); 2013 1704 area->storage_domain = NULL; 2014 1705 out_unlock: 2015 1706 mutex_unlock(&pages->mutex); ··· 2352 2031 if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable) 2353 2032 return -EPERM; 2354 2033 2355 - if (pages->type == IOPT_ADDRESS_FILE) 2034 + if (iopt_is_dmabuf(pages)) 2035 + return -EINVAL; 2036 + 2037 + if (pages->type != 
IOPT_ADDRESS_USER) 2356 2038 return iopt_pages_rw_slow(pages, start_index, last_index, 2357 2039 start_byte % PAGE_SIZE, data, length, 2358 2040 flags); 2359 - 2360 - if (IS_ENABLED(CONFIG_IOMMUFD_TEST) && 2361 - WARN_ON(pages->type != IOPT_ADDRESS_USER)) 2362 - return -EINVAL; 2363 2041 2364 2042 if (!(flags & IOMMUFD_ACCESS_RW_KTHREAD) && change_mm) { 2365 2043 if (start_index == last_index)
+143
drivers/iommu/iommufd/selftest.c
··· 5 5 */ 6 6 #include <linux/anon_inodes.h> 7 7 #include <linux/debugfs.h> 8 + #include <linux/dma-buf.h> 9 + #include <linux/dma-resv.h> 8 10 #include <linux/fault-inject.h> 9 11 #include <linux/file.h> 10 12 #include <linux/iommu.h> ··· 2033 2031 } 2034 2032 } 2035 2033 2034 + struct iommufd_test_dma_buf { 2035 + void *memory; 2036 + size_t length; 2037 + bool revoked; 2038 + }; 2039 + 2040 + static int iommufd_test_dma_buf_attach(struct dma_buf *dmabuf, 2041 + struct dma_buf_attachment *attachment) 2042 + { 2043 + return 0; 2044 + } 2045 + 2046 + static void iommufd_test_dma_buf_detach(struct dma_buf *dmabuf, 2047 + struct dma_buf_attachment *attachment) 2048 + { 2049 + } 2050 + 2051 + static struct sg_table * 2052 + iommufd_test_dma_buf_map(struct dma_buf_attachment *attachment, 2053 + enum dma_data_direction dir) 2054 + { 2055 + return ERR_PTR(-EOPNOTSUPP); 2056 + } 2057 + 2058 + static void iommufd_test_dma_buf_unmap(struct dma_buf_attachment *attachment, 2059 + struct sg_table *sgt, 2060 + enum dma_data_direction dir) 2061 + { 2062 + } 2063 + 2064 + static void iommufd_test_dma_buf_release(struct dma_buf *dmabuf) 2065 + { 2066 + struct iommufd_test_dma_buf *priv = dmabuf->priv; 2067 + 2068 + kfree(priv->memory); 2069 + kfree(priv); 2070 + } 2071 + 2072 + static const struct dma_buf_ops iommufd_test_dmabuf_ops = { 2073 + .attach = iommufd_test_dma_buf_attach, 2074 + .detach = iommufd_test_dma_buf_detach, 2075 + .map_dma_buf = iommufd_test_dma_buf_map, 2076 + .release = iommufd_test_dma_buf_release, 2077 + .unmap_dma_buf = iommufd_test_dma_buf_unmap, 2078 + }; 2079 + 2080 + int iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 2081 + struct dma_buf_phys_vec *phys) 2082 + { 2083 + struct iommufd_test_dma_buf *priv = attachment->dmabuf->priv; 2084 + 2085 + dma_resv_assert_held(attachment->dmabuf->resv); 2086 + 2087 + if (attachment->dmabuf->ops != &iommufd_test_dmabuf_ops) 2088 + return -EOPNOTSUPP; 2089 + 2090 + if (priv->revoked) 2091 + 
return -ENODEV; 2092 + 2093 + phys->paddr = virt_to_phys(priv->memory); 2094 + phys->len = priv->length; 2095 + return 0; 2096 + } 2097 + 2098 + static int iommufd_test_dmabuf_get(struct iommufd_ucmd *ucmd, 2099 + unsigned int open_flags, 2100 + size_t len) 2101 + { 2102 + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); 2103 + struct iommufd_test_dma_buf *priv; 2104 + struct dma_buf *dmabuf; 2105 + int rc; 2106 + 2107 + len = ALIGN(len, PAGE_SIZE); 2108 + if (len == 0 || len > PAGE_SIZE * 512) 2109 + return -EINVAL; 2110 + 2111 + priv = kzalloc(sizeof(*priv), GFP_KERNEL); 2112 + if (!priv) 2113 + return -ENOMEM; 2114 + 2115 + priv->length = len; 2116 + priv->memory = kzalloc(len, GFP_KERNEL); 2117 + if (!priv->memory) { 2118 + rc = -ENOMEM; 2119 + goto err_free; 2120 + } 2121 + 2122 + exp_info.ops = &iommufd_test_dmabuf_ops; 2123 + exp_info.size = len; 2124 + exp_info.flags = open_flags; 2125 + exp_info.priv = priv; 2126 + 2127 + dmabuf = dma_buf_export(&exp_info); 2128 + if (IS_ERR(dmabuf)) { 2129 + rc = PTR_ERR(dmabuf); 2130 + goto err_free; 2131 + } 2132 + 2133 + return dma_buf_fd(dmabuf, open_flags); 2134 + 2135 + err_free: 2136 + kfree(priv->memory); 2137 + kfree(priv); 2138 + return rc; 2139 + } 2140 + 2141 + static int iommufd_test_dmabuf_revoke(struct iommufd_ucmd *ucmd, int fd, 2142 + bool revoked) 2143 + { 2144 + struct iommufd_test_dma_buf *priv; 2145 + struct dma_buf *dmabuf; 2146 + int rc = 0; 2147 + 2148 + dmabuf = dma_buf_get(fd); 2149 + if (IS_ERR(dmabuf)) 2150 + return PTR_ERR(dmabuf); 2151 + 2152 + if (dmabuf->ops != &iommufd_test_dmabuf_ops) { 2153 + rc = -EOPNOTSUPP; 2154 + goto err_put; 2155 + } 2156 + 2157 + priv = dmabuf->priv; 2158 + dma_resv_lock(dmabuf->resv, NULL); 2159 + priv->revoked = revoked; 2160 + dma_buf_move_notify(dmabuf); 2161 + dma_resv_unlock(dmabuf->resv); 2162 + 2163 + err_put: 2164 + dma_buf_put(dmabuf); 2165 + return rc; 2166 + } 2167 + 2036 2168 int iommufd_test(struct iommufd_ucmd *ucmd) 2037 2169 { 2038 2170 struct iommu_test_cmd 
*cmd = ucmd->cmd; ··· 2245 2109 return iommufd_test_pasid_detach(ucmd, cmd); 2246 2110 case IOMMU_TEST_OP_PASID_CHECK_HWPT: 2247 2111 return iommufd_test_pasid_check_hwpt(ucmd, cmd); 2112 + case IOMMU_TEST_OP_DMABUF_GET: 2113 + return iommufd_test_dmabuf_get(ucmd, cmd->dmabuf_get.open_flags, 2114 + cmd->dmabuf_get.length); 2115 + case IOMMU_TEST_OP_DMABUF_REVOKE: 2116 + return iommufd_test_dmabuf_revoke(ucmd, 2117 + cmd->dmabuf_revoke.dmabuf_fd, 2118 + cmd->dmabuf_revoke.revoked); 2248 2119 default: 2249 2120 return -EOPNOTSUPP; 2250 2121 }
+145 -43
drivers/pci/p2pdma.c
··· 25 25 struct gen_pool *pool; 26 26 bool p2pmem_published; 27 27 struct xarray map_types; 28 + struct p2pdma_provider mem[PCI_STD_NUM_BARS]; 28 29 }; 29 30 30 31 struct pci_p2pdma_pagemap { 31 - struct pci_dev *provider; 32 - u64 bus_offset; 33 32 struct dev_pagemap pgmap; 33 + struct p2pdma_provider *mem; 34 34 }; 35 35 36 36 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap) ··· 204 204 { 205 205 struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page)); 206 206 /* safe to dereference while a reference is held to the percpu ref */ 207 - struct pci_p2pdma *p2pdma = 208 - rcu_dereference_protected(pgmap->provider->p2pdma, 1); 207 + struct pci_p2pdma *p2pdma = rcu_dereference_protected( 208 + to_pci_dev(pgmap->mem->owner)->p2pdma, 1); 209 209 struct percpu_ref *ref; 210 210 211 211 gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page), ··· 228 228 229 229 /* Flush and disable pci_alloc_p2p_mem() */ 230 230 pdev->p2pdma = NULL; 231 - synchronize_rcu(); 231 + if (p2pdma->pool) 232 + synchronize_rcu(); 233 + xa_destroy(&p2pdma->map_types); 234 + 235 + if (!p2pdma->pool) 236 + return; 232 237 233 238 gen_pool_destroy(p2pdma->pool); 234 239 sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group); 235 - xa_destroy(&p2pdma->map_types); 236 240 } 237 241 238 - static int pci_p2pdma_setup(struct pci_dev *pdev) 242 + /** 243 + * pcim_p2pdma_init - Initialise peer-to-peer DMA providers 244 + * @pdev: The PCI device to enable P2PDMA for 245 + * 246 + * This function initializes the peer-to-peer DMA infrastructure 247 + * for a PCI device. It allocates and sets up the necessary data 248 + * structures to support P2PDMA operations, including mapping type 249 + * tracking. 
250 + */ 251 + int pcim_p2pdma_init(struct pci_dev *pdev) 239 252 { 240 - int error = -ENOMEM; 241 253 struct pci_p2pdma *p2p; 254 + int i, ret; 255 + 256 + p2p = rcu_dereference_protected(pdev->p2pdma, 1); 257 + if (p2p) 258 + return 0; 242 259 243 260 p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL); 244 261 if (!p2p) 245 262 return -ENOMEM; 246 263 247 264 xa_init(&p2p->map_types); 265 + /* 266 + * Iterate over all standard PCI BARs and record only those that 267 + * correspond to MMIO regions. Skip non-memory resources (e.g. I/O 268 + * port BARs) since they cannot be used for peer-to-peer (P2P) 269 + * transactions. 270 + */ 271 + for (i = 0; i < PCI_STD_NUM_BARS; i++) { 272 + if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM)) 273 + continue; 248 274 249 - p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev)); 250 - if (!p2p->pool) 251 - goto out; 275 + p2p->mem[i].owner = &pdev->dev; 276 + p2p->mem[i].bus_offset = 277 + pci_bus_address(pdev, i) - pci_resource_start(pdev, i); 278 + } 252 279 253 - error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); 254 - if (error) 255 - goto out_pool_destroy; 256 - 257 - error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group); 258 - if (error) 259 - goto out_pool_destroy; 280 + ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); 281 + if (ret) 282 + goto out_p2p; 260 283 261 284 rcu_assign_pointer(pdev->p2pdma, p2p); 262 285 return 0; 263 286 264 - out_pool_destroy: 265 - gen_pool_destroy(p2p->pool); 266 - out: 287 + out_p2p: 267 288 devm_kfree(&pdev->dev, p2p); 268 - return error; 289 + return ret; 290 + } 291 + EXPORT_SYMBOL_GPL(pcim_p2pdma_init); 292 + 293 + /** 294 + * pcim_p2pdma_provider - Get peer-to-peer DMA provider 295 + * @pdev: The PCI device to enable P2PDMA for 296 + * @bar: BAR index to get provider 297 + * 298 + * This function gets peer-to-peer DMA provider for a PCI device. 
The lifetime 299 + * of the provider (and of course the MMIO) is bound to the lifetime of the 300 + * driver. A driver calling this function must ensure that all references to the 301 + * provider, and any DMA mappings created for any MMIO, are all cleaned up 302 + * before the driver remove() completes. 303 + * 304 + * Since P2P is almost always shared with a second driver this means some system 305 + * to notify, invalidate and revoke the MMIO's DMA must be in place to use this 306 + * function. For example a revoke can be built using DMABUF. 307 + */ 308 + struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar) 309 + { 310 + struct pci_p2pdma *p2p; 311 + 312 + if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM)) 313 + return NULL; 314 + 315 + p2p = rcu_dereference_protected(pdev->p2pdma, 1); 316 + if (WARN_ON(!p2p)) 317 + /* Someone forgot to call to pcim_p2pdma_init() before */ 318 + return NULL; 319 + 320 + return &p2p->mem[bar]; 321 + } 322 + EXPORT_SYMBOL_GPL(pcim_p2pdma_provider); 323 + 324 + static int pci_p2pdma_setup_pool(struct pci_dev *pdev) 325 + { 326 + struct pci_p2pdma *p2pdma; 327 + int ret; 328 + 329 + p2pdma = rcu_dereference_protected(pdev->p2pdma, 1); 330 + if (p2pdma->pool) 331 + /* We already setup pools, do nothing, */ 332 + return 0; 333 + 334 + p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev)); 335 + if (!p2pdma->pool) 336 + return -ENOMEM; 337 + 338 + ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group); 339 + if (ret) 340 + goto out_pool_destroy; 341 + 342 + return 0; 343 + 344 + out_pool_destroy: 345 + gen_pool_destroy(p2pdma->pool); 346 + p2pdma->pool = NULL; 347 + return ret; 269 348 } 270 349 271 350 static void pci_p2pdma_unmap_mappings(void *data) 272 351 { 273 - struct pci_dev *pdev = data; 352 + struct pci_p2pdma_pagemap *p2p_pgmap = data; 274 353 275 354 /* 276 355 * Removing the alloc attribute from sysfs will call 277 356 * unmap_mapping_range() on the inode, teardown any existing 
userspace 278 357 * mappings and prevent new ones from being created. 279 358 */ 280 - sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr, 359 + sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj, 360 + &p2pmem_alloc_attr.attr, 281 361 p2pmem_group.name); 282 362 } 283 363 ··· 375 295 u64 offset) 376 296 { 377 297 struct pci_p2pdma_pagemap *p2p_pgmap; 298 + struct p2pdma_provider *mem; 378 299 struct dev_pagemap *pgmap; 379 300 struct pci_p2pdma *p2pdma; 380 301 void *addr; ··· 393 312 if (size + offset > pci_resource_len(pdev, bar)) 394 313 return -EINVAL; 395 314 396 - if (!pdev->p2pdma) { 397 - error = pci_p2pdma_setup(pdev); 398 - if (error) 399 - return error; 400 - } 315 + error = pcim_p2pdma_init(pdev); 316 + if (error) 317 + return error; 318 + 319 + error = pci_p2pdma_setup_pool(pdev); 320 + if (error) 321 + return error; 322 + 323 + mem = pcim_p2pdma_provider(pdev, bar); 324 + /* 325 + * We checked validity of BAR prior to call 326 + * to pcim_p2pdma_provider. It should never return NULL. 
327 + */ 328 + if (WARN_ON(!mem)) 329 + return -EINVAL; 401 330 402 331 p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL); 403 332 if (!p2p_pgmap) ··· 419 328 pgmap->nr_range = 1; 420 329 pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; 421 330 pgmap->ops = &p2pdma_pgmap_ops; 422 - 423 - p2p_pgmap->provider = pdev; 424 - p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) - 425 - pci_resource_start(pdev, bar); 331 + p2p_pgmap->mem = mem; 426 332 427 333 addr = devm_memremap_pages(&pdev->dev, pgmap); 428 334 if (IS_ERR(addr)) { ··· 428 340 } 429 341 430 342 error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings, 431 - pdev); 343 + p2p_pgmap); 432 344 if (error) 433 345 goto pages_free; 434 346 ··· 1060 972 } 1061 973 EXPORT_SYMBOL_GPL(pci_p2pmem_publish); 1062 974 1063 - static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, 1064 - struct device *dev) 975 + /** 976 + * pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers 977 + * @provider: P2PDMA provider structure 978 + * @dev: Target device for the transfer 979 + * 980 + * Determines how peer-to-peer DMA transfers should be mapped between 981 + * the provider and the target device. The mapping type indicates whether 982 + * the transfer can be done directly through PCI switches or must go 983 + * through the host bridge. 
984 + */ 985 + enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, 986 + struct device *dev) 1065 987 { 1066 988 enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED; 1067 - struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider; 989 + struct pci_dev *pdev = to_pci_dev(provider->owner); 1068 990 struct pci_dev *client; 1069 991 struct pci_p2pdma *p2pdma; 1070 992 int dist; 1071 993 1072 - if (!provider->p2pdma) 994 + if (!pdev->p2pdma) 1073 995 return PCI_P2PDMA_MAP_NOT_SUPPORTED; 1074 996 1075 997 if (!dev_is_pci(dev)) ··· 1088 990 client = to_pci_dev(dev); 1089 991 1090 992 rcu_read_lock(); 1091 - p2pdma = rcu_dereference(provider->p2pdma); 993 + p2pdma = rcu_dereference(pdev->p2pdma); 1092 994 1093 995 if (p2pdma) 1094 996 type = xa_to_value(xa_load(&p2pdma->map_types, ··· 1096 998 rcu_read_unlock(); 1097 999 1098 1000 if (type == PCI_P2PDMA_MAP_UNKNOWN) 1099 - return calc_map_type_and_dist(provider, client, &dist, true); 1001 + return calc_map_type_and_dist(pdev, client, &dist, true); 1100 1002 1101 1003 return type; 1102 1004 } ··· 1104 1006 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, 1105 1007 struct device *dev, struct page *page) 1106 1008 { 1107 - state->pgmap = page_pgmap(page); 1108 - state->map = pci_p2pdma_map_type(state->pgmap, dev); 1109 - state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset; 1009 + struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page)); 1010 + 1011 + if (state->mem == p2p_pgmap->mem) 1012 + return; 1013 + 1014 + state->mem = p2p_pgmap->mem; 1015 + state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev); 1110 1016 } 1111 1017 1112 1018 /**
+3
drivers/vfio/pci/Kconfig
··· 55 55 56 56 To enable s390x KVM vfio-pci extensions, say Y. 57 57 58 + config VFIO_PCI_DMABUF 59 + def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER 60 + 58 61 source "drivers/vfio/pci/mlx5/Kconfig" 59 62 60 63 source "drivers/vfio/pci/hisilicon/Kconfig"
+1
drivers/vfio/pci/Makefile
··· 2 2 3 3 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o 4 4 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o 5 + vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o 5 6 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o 6 7 7 8 vfio-pci-y := vfio_pci.o
+52
drivers/vfio/pci/nvgrace-gpu/main.c
··· 7 7 #include <linux/vfio_pci_core.h> 8 8 #include <linux/delay.h> 9 9 #include <linux/jiffies.h> 10 + #include <linux/pci-p2pdma.h> 10 11 11 12 /* 12 13 * The device memory usable to the workloads running in the VM is cached ··· 684 683 return vfio_pci_core_write(core_vdev, buf, count, ppos); 685 684 } 686 685 686 + static int nvgrace_get_dmabuf_phys(struct vfio_pci_core_device *core_vdev, 687 + struct p2pdma_provider **provider, 688 + unsigned int region_index, 689 + struct dma_buf_phys_vec *phys_vec, 690 + struct vfio_region_dma_range *dma_ranges, 691 + size_t nr_ranges) 692 + { 693 + struct nvgrace_gpu_pci_core_device *nvdev = container_of( 694 + core_vdev, struct nvgrace_gpu_pci_core_device, core_device); 695 + struct pci_dev *pdev = core_vdev->pdev; 696 + struct mem_region *mem_region; 697 + 698 + /* 699 + * if (nvdev->resmem.memlength && region_index == RESMEM_REGION_INDEX) { 700 + * The P2P properties of the non-BAR memory are the same as the 701 + * BAR memory, so just use the provider for index 0. Someday 702 + * when CXL gets P2P support we could create CXLish providers 703 + * for the non-BAR memory. 704 + * } else if (region_index == USEMEM_REGION_INDEX) { 705 + * This is actually cacheable memory and isn't treated as P2P in 706 + * the chip. For now we have no way to push cacheable memory 707 + * through everything and the Grace HW doesn't care what caching 708 + * attribute is programmed into the SMMU. So use BAR 0. 
709 + * } 710 + */ 711 + mem_region = nvgrace_gpu_memregion(region_index, nvdev); 712 + if (mem_region) { 713 + *provider = pcim_p2pdma_provider(pdev, 0); 714 + if (!*provider) 715 + return -EINVAL; 716 + return vfio_pci_core_fill_phys_vec(phys_vec, dma_ranges, 717 + nr_ranges, 718 + mem_region->memphys, 719 + mem_region->memlength); 720 + } 721 + 722 + return vfio_pci_core_get_dmabuf_phys(core_vdev, provider, region_index, 723 + phys_vec, dma_ranges, nr_ranges); 724 + } 725 + 726 + static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_ops = { 727 + .get_dmabuf_phys = nvgrace_get_dmabuf_phys, 728 + }; 729 + 687 730 static const struct vfio_device_ops nvgrace_gpu_pci_ops = { 688 731 .name = "nvgrace-gpu-vfio-pci", 689 732 .init = vfio_pci_core_init_dev, ··· 746 701 .unbind_iommufd = vfio_iommufd_physical_unbind, 747 702 .attach_ioas = vfio_iommufd_physical_attach_ioas, 748 703 .detach_ioas = vfio_iommufd_physical_detach_ioas, 704 + }; 705 + 706 + static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_core_ops = { 707 + .get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys, 749 708 }; 750 709 751 710 static const struct vfio_device_ops nvgrace_gpu_pci_core_ops = { ··· 1014 965 memphys, memlength); 1015 966 if (ret) 1016 967 goto out_put_vdev; 968 + nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_ops; 969 + } else { 970 + nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_core_ops; 1017 971 } 1018 972 1019 973 ret = vfio_pci_core_register_device(&nvdev->core_device);
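nvgrace_get_dmabuf_phys() above claims the device's special memory regions and falls back to vfio_pci_core_get_dmabuf_phys() for plain BARs. That variant-driver dispatch shape can be sketched in plain C (invented types and region numbers, purely illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of the pci_ops dispatch pattern, not the real API. */
struct device;

struct device_ops {
	int (*get_phys)(struct device *dev, int region_index,
			unsigned long *start);
};

struct device {
	const struct device_ops *ops;
	unsigned long special_start;	/* e.g. coherent memory not in a BAR */
};

/* Core fallback: ordinary BAR-backed regions. */
static int core_get_phys(struct device *dev, int region_index,
			 unsigned long *start)
{
	(void)dev;
	*start = 0x1000UL * (region_index + 1);	/* stand-in for pci_resource_start() */
	return 0;
}

/* Variant driver: claim the special region, delegate everything else. */
static int variant_get_phys(struct device *dev, int region_index,
			    unsigned long *start)
{
	if (region_index == 7) {	/* hypothetical USEMEM-style index */
		*start = dev->special_start;
		return 0;
	}
	return core_get_phys(dev, region_index, start);
}

static const struct device_ops variant_ops = {
	.get_phys = variant_get_phys,
};
```

The patch picks between two ops tables at probe time the same way nvgrace-gpu chooses `nvgrace_gpu_pci_dev_ops` or `nvgrace_gpu_pci_dev_core_ops` depending on whether the special memory exists.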
+5
drivers/vfio/pci/vfio_pci.c
··· 147 147 .pasid_detach_ioas = vfio_iommufd_physical_pasid_detach_ioas, 148 148 }; 149 149 150 + static const struct vfio_pci_device_ops vfio_pci_dev_ops = { 151 + .get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys, 152 + }; 153 + 150 154 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) 151 155 { 152 156 struct vfio_pci_core_device *vdev; ··· 165 161 return PTR_ERR(vdev); 166 162 167 163 dev_set_drvdata(&pdev->dev, vdev); 164 + vdev->pci_ops = &vfio_pci_dev_ops; 168 165 ret = vfio_pci_core_register_device(vdev); 169 166 if (ret) 170 167 goto out_put_vdev;
+18 -4
drivers/vfio/pci/vfio_pci_config.c
··· 589 589 virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); 590 590 new_mem = !!(new_cmd & PCI_COMMAND_MEMORY); 591 591 592 - if (!new_mem) 592 + if (!new_mem) { 593 593 vfio_pci_zap_and_down_write_memory_lock(vdev); 594 - else 594 + vfio_pci_dma_buf_move(vdev, true); 595 + } else { 595 596 down_write(&vdev->memory_lock); 597 + } 596 598 597 599 /* 598 600 * If the user is writing mem/io enable (new_mem/io) and we ··· 629 627 *virt_cmd &= cpu_to_le16(~mask); 630 628 *virt_cmd |= cpu_to_le16(new_cmd & mask); 631 629 630 + if (__vfio_pci_memory_enabled(vdev)) 631 + vfio_pci_dma_buf_move(vdev, false); 632 632 up_write(&vdev->memory_lock); 633 633 } 634 634 ··· 711 707 static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev, 712 708 pci_power_t state) 713 709 { 714 - if (state >= PCI_D3hot) 710 + if (state >= PCI_D3hot) { 715 711 vfio_pci_zap_and_down_write_memory_lock(vdev); 716 - else 712 + vfio_pci_dma_buf_move(vdev, true); 713 + } else { 717 714 down_write(&vdev->memory_lock); 715 + } 718 716 719 717 vfio_pci_set_power_state(vdev, state); 718 + if (__vfio_pci_memory_enabled(vdev)) 719 + vfio_pci_dma_buf_move(vdev, false); 720 720 up_write(&vdev->memory_lock); 721 721 } 722 722 ··· 908 900 909 901 if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) { 910 902 vfio_pci_zap_and_down_write_memory_lock(vdev); 903 + vfio_pci_dma_buf_move(vdev, true); 911 904 pci_try_reset_function(vdev->pdev); 905 + if (__vfio_pci_memory_enabled(vdev)) 906 + vfio_pci_dma_buf_move(vdev, false); 912 907 up_write(&vdev->memory_lock); 913 908 } 914 909 } ··· 993 982 994 983 if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) { 995 984 vfio_pci_zap_and_down_write_memory_lock(vdev); 985 + vfio_pci_dma_buf_move(vdev, true); 996 986 pci_try_reset_function(vdev->pdev); 987 + if (__vfio_pci_memory_enabled(vdev)) 988 + vfio_pci_dma_buf_move(vdev, false); 997 989 up_write(&vdev->memory_lock); 998 990 } 999 991 }
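Each site in the hunk above follows the same bracket: revoke the dmabufs before an operation that may disturb MMIO (memory-disable, D3hot, FLR), then restore them only if memory decoding is still enabled afterwards. A toy model of that protocol (plain C, invented names; not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the revoke protocol added around FLR/power transitions. */
struct vdev {
	bool mem_enabled;	/* PCI_COMMAND_MEMORY decoding state */
	bool revoked;		/* dmabuf mappings torn down */
};

static void dma_buf_move(struct vdev *v, bool revoked)
{
	v->revoked = revoked;
}

/*
 * Mirrors the bracket in the patch: revoke, perform the disruptive
 * operation, restore only if memory decoding survived it.
 */
static void reset_with_revoke(struct vdev *v, bool reset_disables_memory)
{
	dma_buf_move(v, true);		/* unmap before the disruptive op */
	if (reset_disables_memory)
		v->mem_enabled = false;	/* stand-in for pci_try_reset_function() */
	if (v->mem_enabled)
		dma_buf_move(v, false);	/* restore only when safe */
}
```

The asymmetry is deliberate: if the operation left memory decoding off, the buffers stay revoked until userspace re-enables decoding, matching the "no automatic remap" safety rule described in the merge message.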
+35 -18
drivers/vfio/pci/vfio_pci_core.c
··· 28 28 #include <linux/nospec.h> 29 29 #include <linux/sched/mm.h> 30 30 #include <linux/iommufd.h> 31 + #include <linux/pci-p2pdma.h> 31 32 #if IS_ENABLED(CONFIG_EEH) 32 33 #include <asm/eeh.h> 33 34 #endif ··· 287 286 * semaphore. 288 287 */ 289 288 vfio_pci_zap_and_down_write_memory_lock(vdev); 289 + vfio_pci_dma_buf_move(vdev, true); 290 + 290 291 if (vdev->pm_runtime_engaged) { 291 292 up_write(&vdev->memory_lock); 292 293 return -EINVAL; ··· 302 299 return 0; 303 300 } 304 301 305 - static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags, 302 + static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags, 306 303 void __user *arg, size_t argsz) 307 304 { 308 - struct vfio_pci_core_device *vdev = 309 - container_of(device, struct vfio_pci_core_device, vdev); 310 305 int ret; 311 306 312 307 ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); ··· 321 320 } 322 321 323 322 static int vfio_pci_core_pm_entry_with_wakeup( 324 - struct vfio_device *device, u32 flags, 323 + struct vfio_pci_core_device *vdev, u32 flags, 325 324 struct vfio_device_low_power_entry_with_wakeup __user *arg, 326 325 size_t argsz) 327 326 { 328 - struct vfio_pci_core_device *vdev = 329 - container_of(device, struct vfio_pci_core_device, vdev); 330 327 struct vfio_device_low_power_entry_with_wakeup entry; 331 328 struct eventfd_ctx *efdctx; 332 329 int ret; ··· 372 373 */ 373 374 down_write(&vdev->memory_lock); 374 375 __vfio_pci_runtime_pm_exit(vdev); 376 + if (__vfio_pci_memory_enabled(vdev)) 377 + vfio_pci_dma_buf_move(vdev, false); 375 378 up_write(&vdev->memory_lock); 376 379 } 377 380 378 - static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags, 381 + static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags, 379 382 void __user *arg, size_t argsz) 380 383 { 381 - struct vfio_pci_core_device *vdev = 382 - container_of(device, struct vfio_pci_core_device, vdev); 383 384 int ret; 384 385 385 386 ret 
= vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); ··· 693 694 eeh_dev_release(vdev->pdev); 694 695 #endif 695 696 vfio_pci_core_disable(vdev); 697 + 698 + vfio_pci_dma_buf_cleanup(vdev); 696 699 697 700 mutex_lock(&vdev->igate); 698 701 if (vdev->err_trigger) { ··· 1228 1227 */ 1229 1228 vfio_pci_set_power_state(vdev, PCI_D0); 1230 1229 1230 + vfio_pci_dma_buf_move(vdev, true); 1231 1231 ret = pci_try_reset_function(vdev->pdev); 1232 + if (__vfio_pci_memory_enabled(vdev)) 1233 + vfio_pci_dma_buf_move(vdev, false); 1232 1234 up_write(&vdev->memory_lock); 1233 1235 1234 1236 return ret; ··· 1477 1473 } 1478 1474 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl); 1479 1475 1480 - static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, 1481 - uuid_t __user *arg, size_t argsz) 1476 + static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev, 1477 + u32 flags, uuid_t __user *arg, 1478 + size_t argsz) 1482 1479 { 1483 - struct vfio_pci_core_device *vdev = 1484 - container_of(device, struct vfio_pci_core_device, vdev); 1485 1480 uuid_t uuid; 1486 1481 int ret; 1487 1482 ··· 1507 1504 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, 1508 1505 void __user *arg, size_t argsz) 1509 1506 { 1507 + struct vfio_pci_core_device *vdev = 1508 + container_of(device, struct vfio_pci_core_device, vdev); 1509 + 1510 1510 switch (flags & VFIO_DEVICE_FEATURE_MASK) { 1511 1511 case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY: 1512 - return vfio_pci_core_pm_entry(device, flags, arg, argsz); 1512 + return vfio_pci_core_pm_entry(vdev, flags, arg, argsz); 1513 1513 case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP: 1514 - return vfio_pci_core_pm_entry_with_wakeup(device, flags, 1514 + return vfio_pci_core_pm_entry_with_wakeup(vdev, flags, 1515 1515 arg, argsz); 1516 1516 case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT: 1517 - return vfio_pci_core_pm_exit(device, flags, arg, argsz); 1517 + return vfio_pci_core_pm_exit(vdev, flags, arg, argsz); 
1518 1518 case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: 1519 - return vfio_pci_core_feature_token(device, flags, arg, argsz); 1519 + return vfio_pci_core_feature_token(vdev, flags, arg, argsz); 1520 + case VFIO_DEVICE_FEATURE_DMA_BUF: 1521 + return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); 1520 1522 default: 1521 1523 return -ENOTTY; 1522 1524 } ··· 2093 2085 { 2094 2086 struct vfio_pci_core_device *vdev = 2095 2087 container_of(core_vdev, struct vfio_pci_core_device, vdev); 2088 + int ret; 2096 2089 2097 2090 vdev->pdev = to_pci_dev(core_vdev->dev); 2098 2091 vdev->irq_type = VFIO_PCI_NUM_IRQS; ··· 2103 2094 INIT_LIST_HEAD(&vdev->dummy_resources_list); 2104 2095 INIT_LIST_HEAD(&vdev->ioeventfds_list); 2105 2096 INIT_LIST_HEAD(&vdev->sriov_pfs_item); 2097 + ret = pcim_p2pdma_init(vdev->pdev); 2098 + if (ret && ret != -EOPNOTSUPP) 2099 + return ret; 2100 + INIT_LIST_HEAD(&vdev->dmabufs); 2106 2101 init_rwsem(&vdev->memory_lock); 2107 2102 xa_init(&vdev->ctx); 2108 2103 ··· 2471 2458 break; 2472 2459 } 2473 2460 2461 + vfio_pci_dma_buf_move(vdev, true); 2474 2462 vfio_pci_zap_bars(vdev); 2475 2463 } 2476 2464 ··· 2500 2486 2501 2487 err_undo: 2502 2488 list_for_each_entry_from_reverse(vdev, &dev_set->device_list, 2503 - vdev.dev_set_list) 2489 + vdev.dev_set_list) { 2490 + if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev)) 2491 + vfio_pci_dma_buf_move(vdev, false); 2504 2492 up_write(&vdev->memory_lock); 2493 + } 2505 2494 2506 2495 list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) 2507 2496 pm_runtime_put(&vdev->pdev->dev);
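The refactor above threads `struct vfio_pci_core_device *` through the feature handlers so `container_of()` runs once in the dispatcher instead of in every handler. A self-contained sketch of that hoisting pattern, using a local container_of-style macro (names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

struct base { int id; };

struct core_device {
	int irq_type;
	struct base vdev;	/* embedded base object */
};

/* Local equivalent of the kernel's container_of() */
#define container_of_(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Handler now takes the derived type directly. */
static int feature_token(struct core_device *cdev)
{
	return cdev->irq_type;
}

static int ioctl_feature(struct base *device)
{
	/* one conversion here instead of in every handler */
	struct core_device *cdev =
		container_of_(device, struct core_device, vdev);

	return feature_token(cdev);
}
```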
+350
drivers/vfio/pci/vfio_pci_dmabuf.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. 3 + */ 4 + #include <linux/dma-buf-mapping.h> 5 + #include <linux/pci-p2pdma.h> 6 + #include <linux/dma-resv.h> 7 + 8 + #include "vfio_pci_priv.h" 9 + 10 + MODULE_IMPORT_NS("DMA_BUF"); 11 + 12 + struct vfio_pci_dma_buf { 13 + struct dma_buf *dmabuf; 14 + struct vfio_pci_core_device *vdev; 15 + struct list_head dmabufs_elm; 16 + size_t size; 17 + struct dma_buf_phys_vec *phys_vec; 18 + struct p2pdma_provider *provider; 19 + u32 nr_ranges; 20 + u8 revoked : 1; 21 + }; 22 + 23 + static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, 24 + struct dma_buf_attachment *attachment) 25 + { 26 + struct vfio_pci_dma_buf *priv = dmabuf->priv; 27 + 28 + if (!attachment->peer2peer) 29 + return -EOPNOTSUPP; 30 + 31 + if (priv->revoked) 32 + return -ENODEV; 33 + 34 + return 0; 35 + } 36 + 37 + static struct sg_table * 38 + vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, 39 + enum dma_data_direction dir) 40 + { 41 + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; 42 + 43 + dma_resv_assert_held(priv->dmabuf->resv); 44 + 45 + if (priv->revoked) 46 + return ERR_PTR(-ENODEV); 47 + 48 + return dma_buf_phys_vec_to_sgt(attachment, priv->provider, 49 + priv->phys_vec, priv->nr_ranges, 50 + priv->size, dir); 51 + } 52 + 53 + static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, 54 + struct sg_table *sgt, 55 + enum dma_data_direction dir) 56 + { 57 + dma_buf_free_sgt(attachment, sgt, dir); 58 + } 59 + 60 + static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) 61 + { 62 + struct vfio_pci_dma_buf *priv = dmabuf->priv; 63 + 64 + /* 65 + * Either this or vfio_pci_dma_buf_cleanup() will remove from the list. 66 + * The refcount prevents both. 
67 + */ 68 + if (priv->vdev) { 69 + down_write(&priv->vdev->memory_lock); 70 + list_del_init(&priv->dmabufs_elm); 71 + up_write(&priv->vdev->memory_lock); 72 + vfio_device_put_registration(&priv->vdev->vdev); 73 + } 74 + kfree(priv->phys_vec); 75 + kfree(priv); 76 + } 77 + 78 + static const struct dma_buf_ops vfio_pci_dmabuf_ops = { 79 + .attach = vfio_pci_dma_buf_attach, 80 + .map_dma_buf = vfio_pci_dma_buf_map, 81 + .unmap_dma_buf = vfio_pci_dma_buf_unmap, 82 + .release = vfio_pci_dma_buf_release, 83 + }; 84 + 85 + /* 86 + * This is a temporary "private interconnect" between VFIO DMABUF and iommufd. 87 + * It allows the two co-operating drivers to exchange the physical address of 88 + * the BAR. This is to be replaced with a formal DMABUF system for negotiated 89 + * interconnect types. 90 + * 91 + * If this function succeeds the following are true: 92 + * - There is one physical range and it is pointing to MMIO 93 + * - When move_notify is called it means revoke, not move, vfio_dma_buf_map 94 + * will fail if it is currently revoked 95 + */ 96 + int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, 97 + struct dma_buf_phys_vec *phys) 98 + { 99 + struct vfio_pci_dma_buf *priv; 100 + 101 + dma_resv_assert_held(attachment->dmabuf->resv); 102 + 103 + if (attachment->dmabuf->ops != &vfio_pci_dmabuf_ops) 104 + return -EOPNOTSUPP; 105 + 106 + priv = attachment->dmabuf->priv; 107 + if (priv->revoked) 108 + return -ENODEV; 109 + 110 + /* More than one range to iommufd will require proper DMABUF support */ 111 + if (priv->nr_ranges != 1) 112 + return -EOPNOTSUPP; 113 + 114 + *phys = priv->phys_vec[0]; 115 + return 0; 116 + } 117 + EXPORT_SYMBOL_FOR_MODULES(vfio_pci_dma_buf_iommufd_map, "iommufd"); 118 + 119 + int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec, 120 + struct vfio_region_dma_range *dma_ranges, 121 + size_t nr_ranges, phys_addr_t start, 122 + phys_addr_t len) 123 + { 124 + phys_addr_t max_addr; 125 + unsigned int i; 126 + 127 
+ max_addr = start + len; 128 + for (i = 0; i < nr_ranges; i++) { 129 + phys_addr_t end; 130 + 131 + if (!dma_ranges[i].length) 132 + return -EINVAL; 133 + 134 + if (check_add_overflow(start, dma_ranges[i].offset, 135 + &phys_vec[i].paddr) || 136 + check_add_overflow(phys_vec[i].paddr, 137 + dma_ranges[i].length, &end)) 138 + return -EOVERFLOW; 139 + if (end > max_addr) 140 + return -EINVAL; 141 + 142 + phys_vec[i].len = dma_ranges[i].length; 143 + } 144 + return 0; 145 + } 146 + EXPORT_SYMBOL_GPL(vfio_pci_core_fill_phys_vec); 147 + 148 + int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev, 149 + struct p2pdma_provider **provider, 150 + unsigned int region_index, 151 + struct dma_buf_phys_vec *phys_vec, 152 + struct vfio_region_dma_range *dma_ranges, 153 + size_t nr_ranges) 154 + { 155 + struct pci_dev *pdev = vdev->pdev; 156 + 157 + *provider = pcim_p2pdma_provider(pdev, region_index); 158 + if (!*provider) 159 + return -EINVAL; 160 + 161 + return vfio_pci_core_fill_phys_vec( 162 + phys_vec, dma_ranges, nr_ranges, 163 + pci_resource_start(pdev, region_index), 164 + pci_resource_len(pdev, region_index)); 165 + } 166 + EXPORT_SYMBOL_GPL(vfio_pci_core_get_dmabuf_phys); 167 + 168 + static int validate_dmabuf_input(struct vfio_device_feature_dma_buf *dma_buf, 169 + struct vfio_region_dma_range *dma_ranges, 170 + size_t *lengthp) 171 + { 172 + size_t length = 0; 173 + u32 i; 174 + 175 + for (i = 0; i < dma_buf->nr_ranges; i++) { 176 + u64 offset = dma_ranges[i].offset; 177 + u64 len = dma_ranges[i].length; 178 + 179 + if (!len || !PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) 180 + return -EINVAL; 181 + 182 + if (check_add_overflow(length, len, &length)) 183 + return -EINVAL; 184 + } 185 + 186 + /* 187 + * dma_iova_try_alloc() will WARN if userspace proposes a size that 188 + * is too big, e.g. with lots of ranges. 
189 + */ 190 + if ((u64)(length) & DMA_IOVA_USE_SWIOTLB) 191 + return -EINVAL; 192 + 193 + *lengthp = length; 194 + return 0; 195 + } 196 + 197 + int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, 198 + struct vfio_device_feature_dma_buf __user *arg, 199 + size_t argsz) 200 + { 201 + struct vfio_device_feature_dma_buf get_dma_buf = {}; 202 + struct vfio_region_dma_range *dma_ranges; 203 + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); 204 + struct vfio_pci_dma_buf *priv; 205 + size_t length; 206 + int ret; 207 + 208 + if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys) 209 + return -EOPNOTSUPP; 210 + 211 + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, 212 + sizeof(get_dma_buf)); 213 + if (ret != 1) 214 + return ret; 215 + 216 + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf))) 217 + return -EFAULT; 218 + 219 + if (!get_dma_buf.nr_ranges || get_dma_buf.flags) 220 + return -EINVAL; 221 + 222 + /* 223 + * For PCI the region_index is the BAR number like everything else. 
224 + */ 225 + if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX) 226 + return -ENODEV; 227 + 228 + dma_ranges = memdup_array_user(&arg->dma_ranges, get_dma_buf.nr_ranges, 229 + sizeof(*dma_ranges)); 230 + if (IS_ERR(dma_ranges)) 231 + return PTR_ERR(dma_ranges); 232 + 233 + ret = validate_dmabuf_input(&get_dma_buf, dma_ranges, &length); 234 + if (ret) 235 + goto err_free_ranges; 236 + 237 + priv = kzalloc(sizeof(*priv), GFP_KERNEL); 238 + if (!priv) { 239 + ret = -ENOMEM; 240 + goto err_free_ranges; 241 + } 242 + priv->phys_vec = kcalloc(get_dma_buf.nr_ranges, sizeof(*priv->phys_vec), 243 + GFP_KERNEL); 244 + if (!priv->phys_vec) { 245 + ret = -ENOMEM; 246 + goto err_free_priv; 247 + } 248 + 249 + priv->vdev = vdev; 250 + priv->nr_ranges = get_dma_buf.nr_ranges; 251 + priv->size = length; 252 + ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider, 253 + get_dma_buf.region_index, 254 + priv->phys_vec, dma_ranges, 255 + priv->nr_ranges); 256 + if (ret) 257 + goto err_free_phys; 258 + 259 + kfree(dma_ranges); 260 + dma_ranges = NULL; 261 + 262 + if (!vfio_device_try_get_registration(&vdev->vdev)) { 263 + ret = -ENODEV; 264 + goto err_free_phys; 265 + } 266 + 267 + exp_info.ops = &vfio_pci_dmabuf_ops; 268 + exp_info.size = priv->size; 269 + exp_info.flags = get_dma_buf.open_flags; 270 + exp_info.priv = priv; 271 + 272 + priv->dmabuf = dma_buf_export(&exp_info); 273 + if (IS_ERR(priv->dmabuf)) { 274 + ret = PTR_ERR(priv->dmabuf); 275 + goto err_dev_put; 276 + } 277 + 278 + /* dma_buf_put() now frees priv */ 279 + INIT_LIST_HEAD(&priv->dmabufs_elm); 280 + down_write(&vdev->memory_lock); 281 + dma_resv_lock(priv->dmabuf->resv, NULL); 282 + priv->revoked = !__vfio_pci_memory_enabled(vdev); 283 + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); 284 + dma_resv_unlock(priv->dmabuf->resv); 285 + up_write(&vdev->memory_lock); 286 + 287 + /* 288 + * dma_buf_fd() consumes the reference, when the file closes the dmabuf 289 + * will be released. 
290 + */ 291 + ret = dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags); 292 + if (ret < 0) 293 + goto err_dma_buf; 294 + return ret; 295 + 296 + err_dma_buf: 297 + dma_buf_put(priv->dmabuf); 298 + err_dev_put: 299 + vfio_device_put_registration(&vdev->vdev); 300 + err_free_phys: 301 + kfree(priv->phys_vec); 302 + err_free_priv: 303 + kfree(priv); 304 + err_free_ranges: 305 + kfree(dma_ranges); 306 + return ret; 307 + } 308 + 309 + void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) 310 + { 311 + struct vfio_pci_dma_buf *priv; 312 + struct vfio_pci_dma_buf *tmp; 313 + 314 + lockdep_assert_held_write(&vdev->memory_lock); 315 + 316 + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { 317 + if (!get_file_active(&priv->dmabuf->file)) 318 + continue; 319 + 320 + if (priv->revoked != revoked) { 321 + dma_resv_lock(priv->dmabuf->resv, NULL); 322 + priv->revoked = revoked; 323 + dma_buf_move_notify(priv->dmabuf); 324 + dma_resv_unlock(priv->dmabuf->resv); 325 + } 326 + fput(priv->dmabuf->file); 327 + } 328 + } 329 + 330 + void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) 331 + { 332 + struct vfio_pci_dma_buf *priv; 333 + struct vfio_pci_dma_buf *tmp; 334 + 335 + down_write(&vdev->memory_lock); 336 + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { 337 + if (!get_file_active(&priv->dmabuf->file)) 338 + continue; 339 + 340 + dma_resv_lock(priv->dmabuf->resv, NULL); 341 + list_del_init(&priv->dmabufs_elm); 342 + priv->vdev = NULL; 343 + priv->revoked = true; 344 + dma_buf_move_notify(priv->dmabuf); 345 + dma_resv_unlock(priv->dmabuf->resv); 346 + vfio_device_put_registration(&vdev->vdev); 347 + fput(priv->dmabuf->file); 348 + } 349 + up_write(&vdev->memory_lock); 350 + }
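validate_dmabuf_input() and vfio_pci_core_fill_phys_vec() above lean on check_add_overflow() so hostile offsets and lengths cannot wrap. The same validation can be sketched standalone with the GCC/Clang `__builtin_add_overflow` primitive that the kernel macro wraps (`PAGE_SIZE_` and the range struct are local stand-ins):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_ 4096ULL

struct range { uint64_t offset; uint64_t length; };

/*
 * Sketch of validate_dmabuf_input(): every range must be non-empty and
 * page-aligned, and the summed length must not overflow.
 */
static int validate_ranges(const struct range *r, uint32_t n, uint64_t *total)
{
	uint64_t length = 0;
	uint32_t i;

	for (i = 0; i < n; i++) {
		if (!r[i].length ||
		    r[i].offset % PAGE_SIZE_ || r[i].length % PAGE_SIZE_)
			return -1;
		if (__builtin_add_overflow(length, r[i].length, &length))
			return -1;
	}
	*total = length;
	return 0;
}
```

Rejecting the overflow here, before any allocation is sized from `length`, is what keeps a malicious `nr_ranges` array from producing an undersized `phys_vec`.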
+23
drivers/vfio/pci/vfio_pci_priv.h
··· 107 107 return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; 108 108 } 109 109 110 + #ifdef CONFIG_VFIO_PCI_DMABUF 111 + int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, 112 + struct vfio_device_feature_dma_buf __user *arg, 113 + size_t argsz); 114 + void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); 115 + void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); 116 + #else 117 + static inline int 118 + vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, 119 + struct vfio_device_feature_dma_buf __user *arg, 120 + size_t argsz) 121 + { 122 + return -ENOTTY; 123 + } 124 + static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) 125 + { 126 + } 127 + static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, 128 + bool revoked) 129 + { 130 + } 131 + #endif 132 + 110 133 #endif
+2
drivers/vfio/vfio_main.c
··· 172 172 if (refcount_dec_and_test(&device->refcount)) 173 173 complete(&device->comp); 174 174 } 175 + EXPORT_SYMBOL_GPL(vfio_device_put_registration); 175 176 176 177 bool vfio_device_try_get_registration(struct vfio_device *device) 177 178 { 178 179 return refcount_inc_not_zero(&device->refcount); 179 180 } 181 + EXPORT_SYMBOL_GPL(vfio_device_try_get_registration); 180 182 181 183 /* 182 184 * VFIO driver API
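These two registration helpers are exported so a dmabuf can pin the vfio_device registration for its whole lifetime; the try-get must fail once unregistration has dropped the count toward zero. A minimal model of those semantics (plain C counter, not the kernel's refcount_t API):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the registration refcount the patch exports: a dmabuf holds a
 * registration reference, and try-get fails once the count has hit zero.
 */
struct registration { int refcount; };

static bool try_get(struct registration *r)
{
	if (r->refcount == 0)	/* refcount_inc_not_zero() semantics */
		return false;
	r->refcount++;
	return true;
}

static bool put(struct registration *r)
{
	return --r->refcount == 0;	/* true when the last ref is dropped */
}
```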
+17
include/linux/dma-buf-mapping.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + /* 3 + * DMA BUF Mapping Helpers 4 + * 5 + */ 6 + #ifndef __DMA_BUF_MAPPING_H__ 7 + #define __DMA_BUF_MAPPING_H__ 8 + #include <linux/dma-buf.h> 9 + 10 + struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach, 11 + struct p2pdma_provider *provider, 12 + struct dma_buf_phys_vec *phys_vec, 13 + size_t nr_ranges, size_t size, 14 + enum dma_data_direction dir); 15 + void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, 16 + enum dma_data_direction dir); 17 + #endif
+11
include/linux/dma-buf.h
··· 22 22 #include <linux/fs.h> 23 23 #include <linux/dma-fence.h> 24 24 #include <linux/wait.h> 25 + #include <linux/pci-p2pdma.h> 25 26 26 27 struct device; 27 28 struct dma_buf; ··· 529 528 int flags; 530 529 struct dma_resv *resv; 531 530 void *priv; 531 + }; 532 + 533 + /** 534 + * struct dma_buf_phys_vec - describe a contiguous chunk of memory 535 + * @paddr: physical address of that chunk 536 + * @len: length of this chunk 537 + */ 538 + struct dma_buf_phys_vec { 539 + phys_addr_t paddr; 540 + size_t len; 532 541 }; 533 542 534 543 /**
+73 -47
include/linux/pci-p2pdma.h
··· 16 16 struct block_device; 17 17 struct scatterlist; 18 18 19 + /** 20 + * struct p2pdma_provider 21 + * 22 + * A p2pdma provider is a range of MMIO address space available to the CPU. 23 + */ 24 + struct p2pdma_provider { 25 + struct device *owner; 26 + u64 bus_offset; 27 + }; 28 + 29 + enum pci_p2pdma_map_type { 30 + /* 31 + * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before 32 + * the mapping type has been calculated. Exported routines for the API 33 + * will never return this value. 34 + */ 35 + PCI_P2PDMA_MAP_UNKNOWN = 0, 36 + 37 + /* 38 + * Not a PCI P2PDMA transfer. 39 + */ 40 + PCI_P2PDMA_MAP_NONE, 41 + 42 + /* 43 + * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will 44 + * traverse the host bridge and the host bridge is not in the 45 + * allowlist. DMA Mapping routines should return an error when 46 + * this is returned. 47 + */ 48 + PCI_P2PDMA_MAP_NOT_SUPPORTED, 49 + 50 + /* 51 + * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to 52 + * each other directly through a PCI switch and the transaction will 53 + * not traverse the host bridge. Such a mapping should program 54 + * the DMA engine with PCI bus addresses. 55 + */ 56 + PCI_P2PDMA_MAP_BUS_ADDR, 57 + 58 + /* 59 + * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk 60 + * to each other, but the transaction traverses a host bridge on the 61 + * allowlist. In this case, a normal mapping either with CPU physical 62 + * addresses (in the case of dma-direct) or IOVA addresses (in the 63 + * case of IOMMUs) should be used to program the DMA engine. 
64 + */ 65 + PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, 66 + }; 67 + 19 68 #ifdef CONFIG_PCI_P2PDMA 69 + int pcim_p2pdma_init(struct pci_dev *pdev); 70 + struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar); 20 71 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, 21 72 u64 offset); 22 73 int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients, ··· 84 33 bool *use_p2pdma); 85 34 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, 86 35 bool use_p2pdma); 36 + enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, 37 + struct device *dev); 87 38 #else /* CONFIG_PCI_P2PDMA */ 39 + static inline int pcim_p2pdma_init(struct pci_dev *pdev) 40 + { 41 + return -EOPNOTSUPP; 42 + } 43 + static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, 44 + int bar) 45 + { 46 + return NULL; 47 + } 88 48 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, 89 49 size_t size, u64 offset) 90 50 { ··· 147 85 { 148 86 return sprintf(page, "none\n"); 149 87 } 88 + static inline enum pci_p2pdma_map_type 89 + pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev) 90 + { 91 + return PCI_P2PDMA_MAP_NOT_SUPPORTED; 92 + } 150 93 #endif /* CONFIG_PCI_P2PDMA */ 151 94 152 95 ··· 166 99 return pci_p2pmem_find_many(&client, 1); 167 100 } 168 101 169 - enum pci_p2pdma_map_type { 170 - /* 171 - * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before 172 - * the mapping type has been calculated. Exported routines for the API 173 - * will never return this value. 174 - */ 175 - PCI_P2PDMA_MAP_UNKNOWN = 0, 176 - 177 - /* 178 - * Not a PCI P2PDMA transfer. 179 - */ 180 - PCI_P2PDMA_MAP_NONE, 181 - 182 - /* 183 - * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will 184 - * traverse the host bridge and the host bridge is not in the 185 - * allowlist. DMA Mapping routines should return an error when 186 - * this is returned. 
187 - */ 188 - PCI_P2PDMA_MAP_NOT_SUPPORTED, 189 - 190 - /* 191 - * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to 192 - * each other directly through a PCI switch and the transaction will 193 - * not traverse the host bridge. Such a mapping should program 194 - * the DMA engine with PCI bus addresses. 195 - */ 196 - PCI_P2PDMA_MAP_BUS_ADDR, 197 - 198 - /* 199 - * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk 200 - * to each other, but the transaction traverses a host bridge on the 201 - * allowlist. In this case, a normal mapping either with CPU physical 202 - * addresses (in the case of dma-direct) or IOVA addresses (in the 203 - * case of IOMMUs) should be used to program the DMA engine. 204 - */ 205 - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, 206 - }; 207 - 208 102 struct pci_p2pdma_map_state { 209 - struct dev_pagemap *pgmap; 103 + struct p2pdma_provider *mem; 210 104 enum pci_p2pdma_map_type map; 211 - u64 bus_off; 212 105 }; 106 + 213 107 214 108 /* helper for pci_p2pdma_state(), do not use directly */ 215 109 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, ··· 190 162 struct page *page) 191 163 { 192 164 if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) { 193 - if (state->pgmap != page_pgmap(page)) 194 - __pci_p2pdma_update_state(state, dev, page); 165 + __pci_p2pdma_update_state(state, dev, page); 195 166 return state->map; 196 167 } 197 168 return PCI_P2PDMA_MAP_NONE; ··· 199 172 /** 200 173 * pci_p2pdma_bus_addr_map - Translate a physical address to a bus address 201 174 * for a PCI_P2PDMA_MAP_BUS_ADDR transfer. 202 - * @state: P2P state structure 175 + * @provider: P2P provider structure 203 176 * @paddr: physical address to map 204 177 * 205 178 * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer. 
206 179 */ 207 180 static inline dma_addr_t 208 - pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr) 181 + pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr) 209 182 { 210 - WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR); 211 - return paddr + state->bus_off; 183 + return paddr + provider->bus_offset; 212 184 } 213 185 214 186 #endif /* _LINUX_PCI_P2P_H */
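With the provider carrying a precomputed `bus_offset`, a PCI_P2PDMA_MAP_BUS_ADDR mapping is now a plain add; the offset itself was derived once per BAR as `pci_bus_address() - pci_resource_start()`. A standalone sketch of that arithmetic with made-up addresses:

```c
#include <assert.h>
#include <stdint.h>

/*
 * The provider's bus_offset is computed once from the BAR:
 *   bus_offset = pci_bus_address(bar) - pci_resource_start(bar)
 * and a BUS_ADDR mapping is then just an add. Values below are invented.
 */
struct provider_ { int64_t bus_offset; };

static uint64_t bus_addr_map(const struct provider_ *p, uint64_t paddr)
{
	return paddr + p->bus_offset;	/* wraps correctly for negative offsets */
}
```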
+2
include/linux/vfio.h
··· 297 297 int vfio_register_group_dev(struct vfio_device *device); 298 298 int vfio_register_emulated_iommu_dev(struct vfio_device *device); 299 299 void vfio_unregister_group_dev(struct vfio_device *device); 300 + bool vfio_device_try_get_registration(struct vfio_device *device); 301 + void vfio_device_put_registration(struct vfio_device *device); 300 302 301 303 int vfio_assign_device_set(struct vfio_device *device, void *set_id); 302 304 unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
+46
include/linux/vfio_pci_core.h
···
 
 struct vfio_pci_core_device;
 struct vfio_pci_region;
+struct p2pdma_provider;
+struct dma_buf_phys_vec;
+struct dma_buf_attachment;
 
 struct vfio_pci_regops {
 	ssize_t (*rw)(struct vfio_pci_core_device *vdev, char __user *buf,
···
 	u32			flags;
 };
 
+struct vfio_pci_device_ops {
+	int (*get_dmabuf_phys)(struct vfio_pci_core_device *vdev,
+			       struct p2pdma_provider **provider,
+			       unsigned int region_index,
+			       struct dma_buf_phys_vec *phys_vec,
+			       struct vfio_region_dma_range *dma_ranges,
+			       size_t nr_ranges);
+};
+
+#if IS_ENABLED(CONFIG_VFIO_PCI_DMABUF)
+int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
+				struct vfio_region_dma_range *dma_ranges,
+				size_t nr_ranges, phys_addr_t start,
+				phys_addr_t len);
+int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
+				  struct p2pdma_provider **provider,
+				  unsigned int region_index,
+				  struct dma_buf_phys_vec *phys_vec,
+				  struct vfio_region_dma_range *dma_ranges,
+				  size_t nr_ranges);
+#else
+static inline int
+vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
+			    struct vfio_region_dma_range *dma_ranges,
+			    size_t nr_ranges, phys_addr_t start,
+			    phys_addr_t len)
+{
+	return -EINVAL;
+}
+static inline int vfio_pci_core_get_dmabuf_phys(
+	struct vfio_pci_core_device *vdev, struct p2pdma_provider **provider,
+	unsigned int region_index, struct dma_buf_phys_vec *phys_vec,
+	struct vfio_region_dma_range *dma_ranges, size_t nr_ranges)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 struct vfio_pci_core_device {
 	struct vfio_device	vdev;
 	struct pci_dev		*pdev;
+	const struct vfio_pci_device_ops *pci_ops;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
 	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
 	u8			*pci_config_map;
···
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
 	struct rw_semaphore	memory_lock;
+	struct list_head	dmabufs;
 };
 
 /* Will be exported for vfio pci drivers usage */
···
 #ifdef ioread64
 VFIO_IOREAD_DECLARATION(64)
 #endif
+
+int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
+				 struct dma_buf_phys_vec *phys);
 
 #endif /* VFIO_PCI_CORE_H */
+28
include/uapi/linux/vfio.h
···
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/stddef.h>
 
 #define VFIO_API_VERSION	0
 
···
 #define VFIO_DEVICE_FEATURE_SET_MASTER	1	/* Set Bus Master */
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
+
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
+ * regions selected.
+ *
+ * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
+ * etc. offset/length specify a slice of the region to create the dmabuf from.
+ * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
+ *
+ * flags should be 0.
+ *
+ * Return: The fd number on success, -1 and errno is set on failure.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF 11
+
+struct vfio_region_dma_range {
+	__u64 offset;
+	__u64 length;
+};
+
+struct vfio_device_feature_dma_buf {
+	__u32 region_index;
+	__u32 open_flags;
+	__u32 flags;
+	__u32 nr_ranges;
+	struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
+};
 
 /* -------- API for Type1 VFIO IOMMU -------- */
 
+2 -2
kernel/dma/direct.c
···
 		}
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
-		sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-							  sg_phys(sg));
+		sg->dma_address = pci_p2pdma_bus_addr_map(
+			p2pdma_state.mem, sg_phys(sg));
 		sg_dma_mark_bus_address(sg);
 		continue;
 	default:
+1 -1
mm/hmm.c
···
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
-		return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
+		return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr);
 	default:
 		return DMA_MAPPING_ERROR;
 	}
+43
tools/testing/selftests/iommu/iommufd.c
···
 	test_ioctl_destroy(dst_ioas_id);
 }
 
+TEST_F(iommufd_ioas, dmabuf_simple)
+{
+	size_t buf_size = PAGE_SIZE * 4;
+	__u64 iova;
+	int dfd;
+
+	test_cmd_get_dmabuf(buf_size, &dfd);
+	test_err_ioctl_ioas_map_file(EINVAL, dfd, 0, 0, &iova);
+	test_err_ioctl_ioas_map_file(EINVAL, dfd, buf_size, buf_size, &iova);
+	test_err_ioctl_ioas_map_file(EINVAL, dfd, 0, buf_size + 1, &iova);
+	test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova);
+
+	close(dfd);
+}
+
+TEST_F(iommufd_ioas, dmabuf_revoke)
+{
+	size_t buf_size = PAGE_SIZE * 4;
+	__u32 hwpt_id;
+	__u64 iova;
+	__u64 iova2;
+	int dfd;
+
+	test_cmd_get_dmabuf(buf_size, &dfd);
+	test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova);
+	test_cmd_revoke_dmabuf(dfd, true);
+
+	if (variant->mock_domains)
+		test_cmd_hwpt_alloc(self->device_id, self->ioas_id, 0,
+				    &hwpt_id);
+
+	test_err_ioctl_ioas_map_file(ENODEV, dfd, 0, buf_size, &iova2);
+
+	test_cmd_revoke_dmabuf(dfd, false);
+	test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova2);
+
+	/* Restore the iova back */
+	test_ioctl_ioas_unmap(iova, buf_size);
+	test_ioctl_ioas_map_fixed_file(dfd, 0, buf_size, iova);
+
+	close(dfd);
+}
+
 FIXTURE(iommufd_mock_domain)
 {
 	int fd;
+44
tools/testing/selftests/iommu/iommufd_utils.h
···
 	EXPECT_ERRNO(_errno, _test_cmd_destroy_access_pages(                  \
 				     self->fd, access_id, access_pages_id))
 
+static int _test_cmd_get_dmabuf(int fd, size_t len, int *out_fd)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DMABUF_GET,
+		.dmabuf_get = { .length = len, .open_flags = O_CLOEXEC },
+	};
+
+	*out_fd = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (*out_fd < 0)
+		return -1;
+	return 0;
+}
+#define test_cmd_get_dmabuf(len, out_fd) \
+	ASSERT_EQ(0, _test_cmd_get_dmabuf(self->fd, len, out_fd))
+
+static int _test_cmd_revoke_dmabuf(int fd, int dmabuf_fd, bool revoked)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DMABUF_REVOKE,
+		.dmabuf_revoke = { .dmabuf_fd = dmabuf_fd, .revoked = revoked },
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (ret < 0)
+		return -1;
+	return 0;
+}
+#define test_cmd_revoke_dmabuf(dmabuf_fd, revoke) \
+	ASSERT_EQ(0, _test_cmd_revoke_dmabuf(self->fd, dmabuf_fd, revoke))
+
 static int _test_ioctl_destroy(int fd, unsigned int id)
 {
 	struct iommu_destroy cmd = {
···
 	_test_ioctl_ioas_map_file(                                            \
 		self->fd, ioas_id, mfd, start, length, iova_p,                \
 		IOMMU_IOAS_MAP_WRITEABLE | IOMMU_IOAS_MAP_READABLE))
+
+#define test_ioctl_ioas_map_fixed_file(mfd, start, length, iova)             \
+	({                                                                    \
+		__u64 __iova = iova;                                          \
+		ASSERT_EQ(0, _test_ioctl_ioas_map_file(                       \
+				     self->fd, self->ioas_id, mfd, start,     \
+				     length, &__iova,                         \
+				     IOMMU_IOAS_MAP_FIXED_IOVA |              \
+					     IOMMU_IOAS_MAP_WRITEABLE |       \
+					     IOMMU_IOAS_MAP_READABLE));       \
+	})
 
 static int _test_ioctl_set_temp_memory_limit(int fd, unsigned int limit)
 {