
vfio: powerpc/spapr: Register memory and define IOMMU v2

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to get worse with multiple
containers and huge DMA windows. Also, real-time accounting would require
additional tracking of accounted pages due to the page size difference:
the IOMMU uses 4K pages while the system uses 4K or 64K pages.

Another issue is that the actual page pinning/unpinning happens on every
DMA map/unmap request. This does not affect performance much now, as
we spend far more time switching context between
guest/userspace/host, but it will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
The new IOMMU type deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and
introduces 2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive the user space address and size of a memory region that
needs to be pinned/unpinned and counted in locked_vm.
The new IOMMU type splits physical page pinning and TCE table updates
into 2 different operations. It requires:
1) guest pages to be registered first;
2) subsequent map/unmap requests to work only with pre-registered memory.
For the default single-window case this means that the entire guest RAM
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required; otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA windows and in-kernel acceleration
will require memory to be pre-registered in order to work.

The accounting is done per user process.

This advertises the v2 SPAPR TCE IOMMU and restricts what userspace
can do with v1 or v2 IOMMUs.

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is no longer in use. So we need a way to tell which region
a just-cleared TCE came from.

This adds a userspace view of the TCE table into the iommu_table struct.
It contains one userspace address per TCE entry. The table is only
allocated when ownership over an IOMMU group is taken, which means
it is only used from outside of the powernv code (such as VFIO).

As the v2 IOMMU supports both IODA2 and pre-IODA2 IOMMUs (which do not
support the DDW API), this creates a default DMA window for IODA2 for
consistency.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Authored by Alexey Kardashevskiy and committed by Michael Ellerman
2157e7b8 15b244a8

+488 -89
+28 -3
Documentation/vfio.txt
···
 
 This implementation has some specifics:
 
-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at the boot time,
+one table per a IOMMU group which is a Partitionable Endpoint (PE)
 (PE is often a PCI domain but not always).
+Newer systems (POWER8 with IODA2) have improved hardware design which allows
+to remove this limitation and have multiple IOMMU groups per a VFIO container.
 
 2) The hardware supports so called DMA windows - the PCI address range
 within which DMA transfer is allowed, any attempt to access address space
···
 */
 
 ....
+
+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).
+
+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and the handling of those includes pinning/unpinning pages and updating
+mm::locked_vm counter to make sure we do not exceed the rlimit.
+The v2 IOMMU splits accounting and pinning into separate operations:
+
+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
+receive a user space address and size of the block to be pinned.
+Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
+be called with the exact address and size used for registering
+the memory block. The userspace is not expected to call these often.
+The ranges are stored in a linked list in a VFIO container.
+
+- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
+IOMMU table and do not do pinning; instead these check that the userspace
+address is from pre-registered range.
+
+This separation helps in optimizing DMA for guests.
 
 -------------------------------------------------------------------------------
 
+6
arch/powerpc/include/asm/iommu.h
···
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
 	unsigned long  it_page_shift;/* table iommu page size */
 	struct list_head it_group_list;/* List of iommu_table_group_link */
+	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
 };
+
+#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
+		((tbl)->it_userspace ? \
+		&((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
+		NULL)
 
 /* Pure 2^n version of get_order */
 static inline __attribute_const__
+427 -86
drivers/vfio/vfio_iommu_spapr_tce.c
···
 #include <linux/uaccess.h>
 #include <linux/err.h>
 #include <linux/vfio.h>
+#include <linux/vmalloc.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
 
 #define DRIVER_VERSION  "0.1"
 #define DRIVER_AUTHOR   "aik@ozlabs.ru"
···
  * into DMA'ble space using the IOMMU
  */
 
+struct tce_iommu_group {
+	struct list_head next;
+	struct iommu_group *grp;
+};
+
 /*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
···
  */
 struct tce_container {
 	struct mutex lock;
-	struct iommu_group *grp;
 	bool enabled;
+	bool v2;
 	unsigned long locked_pages;
+	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+	struct list_head group_list;
 };
+
+static long tce_iommu_unregister_pages(struct tce_container *container,
+		__u64 vaddr, __u64 size)
+{
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
+		return -EINVAL;
+
+	mem = mm_iommu_find(vaddr, size >> PAGE_SHIFT);
+	if (!mem)
+		return -ENOENT;
+
+	return mm_iommu_put(mem);
+}
+
+static long tce_iommu_register_pages(struct tce_container *container,
+		__u64 vaddr, __u64 size)
+{
+	long ret = 0;
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	unsigned long entries = size >> PAGE_SHIFT;
+
+	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
+			((vaddr + size) < vaddr))
+		return -EINVAL;
+
+	ret = mm_iommu_get(vaddr, entries, &mem);
+	if (ret)
+		return ret;
+
+	container->enabled = true;
+
+	return 0;
+}
+
+static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
+{
+	unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
+			tbl->it_size, PAGE_SIZE);
+	unsigned long *uas;
+	long ret;
+
+	BUG_ON(tbl->it_userspace);
+
+	ret = try_increment_locked_vm(cb >> PAGE_SHIFT);
+	if (ret)
+		return ret;
+
+	uas = vzalloc(cb);
+	if (!uas) {
+		decrement_locked_vm(cb >> PAGE_SHIFT);
+		return -ENOMEM;
+	}
+	tbl->it_userspace = uas;
+
+	return 0;
+}
+
+static void tce_iommu_userspace_view_free(struct iommu_table *tbl)
+{
+	unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
+			tbl->it_size, PAGE_SIZE);
+
+	if (!tbl->it_userspace)
+		return;
+
+	vfree(tbl->it_userspace);
+	tbl->it_userspace = NULL;
+	decrement_locked_vm(cb >> PAGE_SHIFT);
+}
 
 static bool tce_page_is_contained(struct page *page, unsigned page_shift)
 {
···
 	return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
 }
 
+static inline bool tce_groups_attached(struct tce_container *container)
+{
+	return !list_empty(&container->group_list);
+}
+
 static long tce_iommu_find_table(struct tce_container *container,
 		phys_addr_t ioba, struct iommu_table **ptbl)
 {
 	long i;
-	struct iommu_table_group *table_group;
-
-	table_group = iommu_group_get_iommudata(container->grp);
-	if (!table_group)
-		return -1;
 
 	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = table_group->tables[i];
+		struct iommu_table *tbl = container->tables[i];
 
 		if (tbl) {
 			unsigned long entry = ioba >> tbl->it_page_shift;
···
 	int ret = 0;
 	unsigned long locked;
 	struct iommu_table_group *table_group;
-
-	if (!container->grp)
-		return -ENXIO;
+	struct tce_iommu_group *tcegrp;
 
 	if (!current->mm)
 		return -ESRCH; /* process exited */
···
 	 * as there is no way to know how much we should increment
 	 * the locked_vm counter.
 	 */
-	table_group = iommu_group_get_iommudata(container->grp);
+	if (!tce_groups_attached(container))
+		return -ENODEV;
+
+	tcegrp = list_first_entry(&container->group_list,
+			struct tce_iommu_group, next);
+	table_group = iommu_group_get_iommudata(tcegrp->grp);
 	if (!table_group)
 		return -ENODEV;
···
 {
 	struct tce_container *container;
 
-	if (arg != VFIO_SPAPR_TCE_IOMMU) {
+	if ((arg != VFIO_SPAPR_TCE_IOMMU) && (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
 		pr_err("tce_vfio: Wrong IOMMU type\n");
 		return ERR_PTR(-EINVAL);
 	}
···
 		return ERR_PTR(-ENOMEM);
 
 	mutex_init(&container->lock);
+	INIT_LIST_HEAD_RCU(&container->group_list);
+
+	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
 
 	return container;
 }
 
+static int tce_iommu_clear(struct tce_container *container,
+		struct iommu_table *tbl,
+		unsigned long entry, unsigned long pages);
+static void tce_iommu_free_table(struct iommu_table *tbl);
+
 static void tce_iommu_release(void *iommu_data)
 {
 	struct tce_container *container = iommu_data;
+	struct iommu_table_group *table_group;
+	struct tce_iommu_group *tcegrp;
+	long i;
 
-	WARN_ON(container->grp);
+	while (tce_groups_attached(container)) {
+		tcegrp = list_first_entry(&container->group_list,
+				struct tce_iommu_group, next);
+		table_group = iommu_group_get_iommudata(tcegrp->grp);
+		tce_iommu_detach_group(iommu_data, tcegrp->grp);
+	}
 
-	if (container->grp)
-		tce_iommu_detach_group(iommu_data, container->grp);
+	/*
+	 * If VFIO created a table, it was not disposed
+	 * by tce_iommu_detach_group() so do it now.
+	 */
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = container->tables[i];
+
+		if (!tbl)
+			continue;
+
+		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+		tce_iommu_free_table(tbl);
+	}
 
 	tce_iommu_disable(container);
 	mutex_destroy(&container->lock);
···
 
 	page = pfn_to_page(hpa >> PAGE_SHIFT);
 	put_page(page);
+}
+
+static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
+		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
+{
+	long ret = 0;
+	struct mm_iommu_table_group_mem_t *mem;
+
+	mem = mm_iommu_lookup(tce, size);
+	if (!mem)
+		return -EINVAL;
+
+	ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
+	if (ret)
+		return -EINVAL;
+
+	*pmem = mem;
+
+	return 0;
+}
+
+static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
+		unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	int ret;
+	unsigned long hpa = 0;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua || !current || !current->mm)
+		return;
+
+	ret = tce_iommu_prereg_ua_to_hpa(*pua, IOMMU_PAGE_SIZE(tbl),
+			&hpa, &mem);
+	if (ret)
+		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
+				__func__, *pua, entry, ret);
+	if (mem)
+		mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
 }
 
 static int tce_iommu_clear(struct tce_container *container,
···
 
 		if (direction == DMA_NONE)
 			continue;
+
+		if (container->v2) {
+			tce_iommu_unuse_page_v2(tbl, entry);
+			continue;
+		}
 
 		tce_iommu_unuse_page(container, oldhpa);
 	}
···
 	return ret;
 }
 
+static long tce_iommu_build_v2(struct tce_container *container,
+		struct iommu_table *tbl,
+		unsigned long entry, unsigned long tce, unsigned long pages,
+		enum dma_data_direction direction)
+{
+	long i, ret = 0;
+	struct page *page;
+	unsigned long hpa;
+	enum dma_data_direction dirtmp;
+
+	for (i = 0; i < pages; ++i) {
+		struct mm_iommu_table_group_mem_t *mem = NULL;
+		unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
+				entry + i);
+
+		ret = tce_iommu_prereg_ua_to_hpa(tce, IOMMU_PAGE_SIZE(tbl),
+				&hpa, &mem);
+		if (ret)
+			break;
+
+		page = pfn_to_page(hpa >> PAGE_SHIFT);
+		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+			ret = -EPERM;
+			break;
+		}
+
+		/* Preserve offset within IOMMU page */
+		hpa |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
+		dirtmp = direction;
+
+		/* The registered region is being unregistered */
+		if (mm_iommu_mapped_inc(mem))
+			break;
+
+		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
+		if (ret) {
+			/* dirtmp cannot be DMA_NONE here */
+			tce_iommu_unuse_page_v2(tbl, entry + i);
+			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+					__func__, entry << tbl->it_page_shift,
+					tce, ret);
+			break;
+		}
+
+		if (dirtmp != DMA_NONE)
+			tce_iommu_unuse_page_v2(tbl, entry + i);
+
+		*pua = tce;
+
+		tce += IOMMU_PAGE_SIZE(tbl);
+	}
+
+	if (ret)
+		tce_iommu_clear(container, tbl, entry, i);
+
+	return ret;
+}
+
 static long tce_iommu_create_table(struct tce_container *container,
 		struct iommu_table_group *table_group,
 		int num,
···
 	WARN_ON(!ret && !(*ptbl)->it_ops->free);
 	WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size));
 
+	if (!ret && container->v2) {
+		ret = tce_iommu_userspace_view_alloc(*ptbl);
+		if (ret)
+			(*ptbl)->it_ops->free(*ptbl);
+	}
+
 	if (ret)
 		decrement_locked_vm(table_size >> PAGE_SHIFT);
···
 {
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
+	tce_iommu_userspace_view_free(tbl);
 	tbl->it_ops->free(tbl);
 	decrement_locked_vm(pages);
 }
···
 	case VFIO_CHECK_EXTENSION:
 		switch (arg) {
 		case VFIO_SPAPR_TCE_IOMMU:
+		case VFIO_SPAPR_TCE_v2_IOMMU:
 			ret = 1;
 			break;
 		default:
···
 
 	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
 		struct vfio_iommu_spapr_tce_info info;
+		struct tce_iommu_group *tcegrp;
 		struct iommu_table_group *table_group;
 
-		if (WARN_ON(!container->grp))
+		if (!tce_groups_attached(container))
 			return -ENXIO;
 
-		table_group = iommu_group_get_iommudata(container->grp);
+		tcegrp = list_first_entry(&container->group_list,
+				struct tce_iommu_group, next);
+		table_group = iommu_group_get_iommudata(tcegrp->grp);
 
 		if (!table_group)
 			return -ENXIO;
···
 		if (ret)
 			return ret;
 
-		ret = tce_iommu_build(container, tbl,
-				param.iova >> tbl->it_page_shift,
-				param.vaddr,
-				param.size >> tbl->it_page_shift,
-				direction);
+		if (container->v2)
+			ret = tce_iommu_build_v2(container, tbl,
+					param.iova >> tbl->it_page_shift,
+					param.vaddr,
+					param.size >> tbl->it_page_shift,
+					direction);
+		else
+			ret = tce_iommu_build(container, tbl,
+					param.iova >> tbl->it_page_shift,
+					param.vaddr,
+					param.size >> tbl->it_page_shift,
+					direction);
 
 		iommu_flush_tce(tbl);
···
 
 		return ret;
 	}
+	case VFIO_IOMMU_SPAPR_REGISTER_MEMORY: {
+		struct vfio_iommu_spapr_register_memory param;
+
+		if (!container->v2)
+			break;
+
+		minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
+				size);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		/* No flag is supported now */
+		if (param.flags)
+			return -EINVAL;
+
+		mutex_lock(&container->lock);
+		ret = tce_iommu_register_pages(container, param.vaddr,
+				param.size);
+		mutex_unlock(&container->lock);
+
+		return ret;
+	}
+	case VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY: {
+		struct vfio_iommu_spapr_register_memory param;
+
+		if (!container->v2)
+			break;
+
+		minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
+				size);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		/* No flag is supported now */
+		if (param.flags)
+			return -EINVAL;
+
+		mutex_lock(&container->lock);
+		ret = tce_iommu_unregister_pages(container, param.vaddr,
+				param.size);
+		mutex_unlock(&container->lock);
+
+		return ret;
+	}
 	case VFIO_IOMMU_ENABLE:
+		if (container->v2)
+			break;
+
 		mutex_lock(&container->lock);
 		ret = tce_iommu_enable(container);
 		mutex_unlock(&container->lock);
···
 
 	case VFIO_IOMMU_DISABLE:
+		if (container->v2)
+			break;
+
 		mutex_lock(&container->lock);
 		tce_iommu_disable(container);
 		mutex_unlock(&container->lock);
 		return 0;
-	case VFIO_EEH_PE_OP:
-		if (!container->grp)
-			return -ENODEV;
 
-		return vfio_spapr_iommu_eeh_ioctl(container->grp,
-						  cmd, arg);
+	case VFIO_EEH_PE_OP: {
+		struct tce_iommu_group *tcegrp;
+
+		ret = 0;
+		list_for_each_entry(tcegrp, &container->group_list, next) {
+			ret = vfio_spapr_iommu_eeh_ioctl(tcegrp->grp,
+					cmd, arg);
+			if (ret)
+				return ret;
+		}
+		return ret;
+	}
+
 	}
 
 	return -ENOTTY;
···
 	int i;
 
 	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = table_group->tables[i];
+		struct iommu_table *tbl = container->tables[i];
 
 		if (!tbl)
 			continue;
 
 		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+		tce_iommu_userspace_view_free(tbl);
 		if (tbl->it_map)
 			iommu_release_ownership(tbl);
+
+		container->tables[i] = NULL;
 	}
 }
···
 		if (!tbl || !tbl->it_map)
 			continue;
 
-		rc = iommu_take_ownership(tbl);
+		rc = tce_iommu_userspace_view_alloc(tbl);
+		if (!rc)
+			rc = iommu_take_ownership(tbl);
+
 		if (rc) {
 			for (j = 0; j < i; ++j)
 				iommu_release_ownership(
···
 			return rc;
 		}
 	}
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
+		container->tables[i] = table_group->tables[i];
 
 	return 0;
 }
···
 		return;
 	}
 
-	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
-		/* Store table pointer as unset_window resets it */
-		struct iommu_table *tbl = table_group->tables[i];
-
-		if (!tbl)
-			continue;
-
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
 		table_group->ops->unset_window(table_group, i);
-		tce_iommu_clear(container, tbl,
-				tbl->it_offset, tbl->it_size);
-		tce_iommu_free_table(tbl);
-	}
 
 	table_group->ops->release_ownership(table_group);
 }
···
 static long tce_iommu_take_ownership_ddw(struct tce_container *container,
 		struct iommu_table_group *table_group)
 {
-	long ret;
+	long i, ret = 0;
 	struct iommu_table *tbl = NULL;
 
 	if (!table_group->ops->create_table || !table_group->ops->set_window ||
···
 
 	table_group->ops->take_ownership(table_group);
 
-	ret = tce_iommu_create_table(container,
-			table_group,
-			0, /* window number */
-			IOMMU_PAGE_SHIFT_4K,
-			table_group->tce32_size,
-			1, /* default levels */
-			&tbl);
-	if (!ret) {
-		ret = table_group->ops->set_window(table_group, 0, tbl);
+	/*
+	 * If it the first group attached, check if there is
+	 * a default DMA window and create one if none as
+	 * the userspace expects it to exist.
+	 */
+	if (!tce_groups_attached(container) && !container->tables[0]) {
+		ret = tce_iommu_create_table(container,
+				table_group,
+				0, /* window number */
+				IOMMU_PAGE_SHIFT_4K,
+				table_group->tce32_size,
+				1, /* default levels */
+				&tbl);
 		if (ret)
-			tce_iommu_free_table(tbl);
+			goto release_exit;
 		else
-			table_group->tables[0] = tbl;
+			container->tables[0] = tbl;
 	}
 
-	if (ret)
-		table_group->ops->release_ownership(table_group);
+	/* Set all windows to the new group */
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		tbl = container->tables[i];
+
+		if (!tbl)
+			continue;
+
+		/* Set the default window to a new group */
+		ret = table_group->ops->set_window(table_group, i, tbl);
+		if (ret)
+			goto release_exit;
+	}
+
+	return 0;
+
+release_exit:
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
+		table_group->ops->unset_window(table_group, i);
+
+	table_group->ops->release_ownership(table_group);
 
 	return ret;
 }
···
 	int ret;
 	struct tce_container *container = iommu_data;
 	struct iommu_table_group *table_group;
+	struct tce_iommu_group *tcegrp = NULL;
 
 	mutex_lock(&container->lock);
 
 	/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
 			iommu_group_id(iommu_group), iommu_group); */
-	if (container->grp) {
-		pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
-				iommu_group_id(container->grp),
-				iommu_group_id(iommu_group));
-		ret = -EBUSY;
-		goto unlock_exit;
-	}
-
-	if (container->enabled) {
-		pr_err("tce_vfio: attaching group #%u to enabled container\n",
-				iommu_group_id(iommu_group));
-		ret = -EBUSY;
-		goto unlock_exit;
-	}
-
 	table_group = iommu_group_get_iommudata(iommu_group);
-	if (!table_group) {
-		ret = -ENXIO;
+
+	if (tce_groups_attached(container) && (!table_group->ops ||
+			!table_group->ops->take_ownership ||
+			!table_group->ops->release_ownership)) {
+		ret = -EBUSY;
+		goto unlock_exit;
+	}
+
+	/* Check if new group has the same iommu_ops (i.e. compatible) */
+	list_for_each_entry(tcegrp, &container->group_list, next) {
+		struct iommu_table_group *table_group_tmp;
+
+		if (tcegrp->grp == iommu_group) {
+			pr_warn("tce_vfio: Group %d is already attached\n",
+					iommu_group_id(iommu_group));
+			ret = -EBUSY;
+			goto unlock_exit;
+		}
+		table_group_tmp = iommu_group_get_iommudata(tcegrp->grp);
+		if (table_group_tmp->ops != table_group->ops) {
+			pr_warn("tce_vfio: Group %d is incompatible with group %d\n",
+					iommu_group_id(iommu_group),
+					iommu_group_id(tcegrp->grp));
+			ret = -EPERM;
+			goto unlock_exit;
+		}
+	}
+
+	tcegrp = kzalloc(sizeof(*tcegrp), GFP_KERNEL);
+	if (!tcegrp) {
+		ret = -ENOMEM;
 		goto unlock_exit;
 	}
···
 	else
 		ret = tce_iommu_take_ownership_ddw(container, table_group);
 
-	if (!ret)
-		container->grp = iommu_group;
+	if (!ret) {
+		tcegrp->grp = iommu_group;
+		list_add(&tcegrp->next, &container->group_list);
+	}
 
 unlock_exit:
+	if (ret && tcegrp)
+		kfree(tcegrp);
+
 	mutex_unlock(&container->lock);
 
 	return ret;
···
 {
 	struct tce_container *container = iommu_data;
 	struct iommu_table_group *table_group;
+	bool found = false;
+	struct tce_iommu_group *tcegrp;
 
 	mutex_lock(&container->lock);
-	if (iommu_group != container->grp) {
-		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
-				iommu_group_id(iommu_group),
-				iommu_group_id(container->grp));
+
+	list_for_each_entry(tcegrp, &container->group_list, next) {
+		if (tcegrp->grp == iommu_group) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		pr_warn("tce_vfio: detaching unattached group #%u\n",
+				iommu_group_id(iommu_group));
 		goto unlock_exit;
 	}
 
-	if (container->enabled) {
-		pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
-				iommu_group_id(container->grp));
-		tce_iommu_disable(container);
-	}
-
-	/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
-		iommu_group_id(iommu_group), iommu_group); */
-	container->grp = NULL;
+	list_del(&tcegrp->next);
+	kfree(tcegrp);
 
 	table_group = iommu_group_get_iommudata(iommu_group);
 	BUG_ON(!table_group);
+27
include/uapi/linux/vfio.h
···
 /* Two-stage IOMMU */
 #define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
 
+#define VFIO_SPAPR_TCE_v2_IOMMU		7
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
···
 #define VFIO_EEH_PE_INJECT_ERR		9	/* Inject EEH error */
 
 #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/**
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
+ *
+ * Registers user space memory where DMA is allowed. It pins
+ * user pages and does the locked memory accounting so
+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
+ * get faster.
+ */
+struct vfio_iommu_spapr_register_memory {
+	__u32	argsz;
+	__u32	flags;
+	__u64	vaddr;				/* Process virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
+ * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
+ *
+ * Unregisters user space memory registered with
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
+ * Uses vfio_iommu_spapr_register_memory for parameters.
+ */
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
 
 /* ***************************************************************** */
 