Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/memory.c: fix race when faulting a device private page

Patch series "Fix several device private page reference counting issues",
v2

This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages. These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems. However,
without these fixes it is possible to crash the kernel from userspace.
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting.
In modules such as Nouveau it is also possible to trigger some of these
issues by explicitly closing the device file-descriptor prior to the task
exiting and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code.
Unfortunately I lack hardware to test either of those so any help there
would be appreciated. The changes mimic what is done for both Nouveau
and hmm-tests though so I doubt they will cause problems.


This patch (of 8):

When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called. However no
reference is taken on the faulting page. Therefore a concurrent migration
of the device private page can free the page and possibly the underlying
pgmap. This results in a race which can crash the kernel due to the
migrate_to_ram() function pointer becoming invalid. It also means drivers
can't reliably read the zone_device_data field because the page may have
been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl to
ensure it has not been freed. Unfortunately the elevated reference count
will cause the migration required to handle the fault to fail. To avoid
this failure, pass the faulting page into the migrate_vma functions so
that if an elevated reference count is found it can be checked to see
whether it is expected.

[mpe@ellerman.id.au: fix build]
Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Alistair Popple and committed by Andrew Morton
16ce101d ab63f63f

89 insertions(+), 43 deletions(-)

arch/powerpc/kvm/book3s_hv_uvmem.c (+11 -8)

@@ ... @@
 static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
 				 unsigned long start,
 				 unsigned long end, unsigned long page_shift,
-				 struct kvm *kvm, unsigned long gpa)
+				 struct kvm *kvm, unsigned long gpa, struct page *fault_page)
 {
 	unsigned long src_pfn, dst_pfn = 0;
-	struct migrate_vma mig;
+	struct migrate_vma mig = { 0 };
 	struct page *dpage, *spage;
 	struct kvmppc_uvmem_page_pvt *pvt;
 	unsigned long pfn;
@@ ... @@
 	mig.dst = &dst_pfn;
 	mig.pgmap_owner = &kvmppc_uvmem_pgmap;
 	mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+	mig.fault_page = fault_page;
 
 	/* The requested page is already paged-out, nothing to do */
 	if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
@@ ... @@
 static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
 				      unsigned long start, unsigned long end,
 				      unsigned long page_shift,
-				      struct kvm *kvm, unsigned long gpa)
+				      struct kvm *kvm, unsigned long gpa,
+				      struct page *fault_page)
 {
 	int ret;
 
 	mutex_lock(&kvm->arch.uvmem_lock);
-	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
+				    fault_page);
 	mutex_unlock(&kvm->arch.uvmem_lock);
 
 	return ret;
@@ ... @@
 		pvt->remove_gfn = true;
 
 		if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
-					  PAGE_SHIFT, kvm, pvt->gpa))
+					  PAGE_SHIFT, kvm, pvt->gpa, NULL))
 			pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
 			       pvt->gpa, addr);
 	} else {
@@ ... @@
 			   bool pagein)
 {
 	unsigned long src_pfn, dst_pfn = 0;
-	struct migrate_vma mig;
+	struct migrate_vma mig = { 0 };
 	struct page *spage;
 	unsigned long pfn;
 	struct page *dpage;
@@ ... @@
 
 	if (kvmppc_svm_page_out(vmf->vma, vmf->address,
 				vmf->address + PAGE_SIZE, PAGE_SHIFT,
-				pvt->kvm, pvt->gpa))
+				pvt->kvm, pvt->gpa, vmf->page))
 		return VM_FAULT_SIGBUS;
 	else
 		return 0;
@@ ... @@
 	if (!vma || vma->vm_start > start || vma->vm_end < end)
 		goto out;
 
-	if (!kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa))
+	if (!kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa, NULL))
 		ret = H_SUCCESS;
 out:
 	mmap_read_unlock(kvm->mm);
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c (+10 -7)

@@ ... @@
 	uint64_t npages = (end - start) >> PAGE_SHIFT;
 	struct kfd_process_device *pdd;
 	struct dma_fence *mfence = NULL;
-	struct migrate_vma migrate;
+	struct migrate_vma migrate = { 0 };
 	unsigned long cpages = 0;
 	dma_addr_t *scratch;
 	void *buf;
@@ ... @@
 static long
 svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 		       struct vm_area_struct *vma, uint64_t start, uint64_t end,
-		       uint32_t trigger)
+		       uint32_t trigger, struct page *fault_page)
 {
 	struct kfd_process *p = container_of(prange->svms, struct kfd_process, svms);
 	uint64_t npages = (end - start) >> PAGE_SHIFT;
@@ ... @@
 	unsigned long cpages = 0;
 	struct kfd_process_device *pdd;
 	struct dma_fence *mfence = NULL;
-	struct migrate_vma migrate;
+	struct migrate_vma migrate = { 0 };
 	dma_addr_t *scratch;
 	void *buf;
 	int r = -ENOMEM;
@@ ... @@
 
 	migrate.src = buf;
 	migrate.dst = migrate.src + npages;
+	migrate.fault_page = fault_page;
 	scratch = (dma_addr_t *)(migrate.dst + npages);
 
 	kfd_smi_event_migration_start(adev->kfd.dev, p->lead_thread->pid,
@@ ... @@
  *
  * 0 - OK, otherwise error code
  */
 int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
-			    uint32_t trigger)
+			    uint32_t trigger, struct page *fault_page)
 {
 	struct amdgpu_device *adev;
 	struct vm_area_struct *vma;
@@ ... @@
 		}
 
 		next = min(vma->vm_end, end);
-		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger);
+		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger,
+					   fault_page);
 		if (r < 0) {
 			pr_debug("failed %ld to migrate prange %p\n", r, prange);
 			break;
@@ ... @@
 	pr_debug("from gpu 0x%x to gpu 0x%x\n", prange->actual_loc, best_loc);
 
 	do {
-		r = svm_migrate_vram_to_ram(prange, mm, trigger);
+		r = svm_migrate_vram_to_ram(prange, mm, trigger, NULL);
 		if (r)
 			return r;
 	} while (prange->actual_loc && --retries);
@@ ... @@
 		goto out_unlock_prange;
 	}
 
-	r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU);
+	r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU,
+				    vmf->page);
 	if (r)
 		pr_debug("failed %d migrate 0x%p [0x%lx 0x%lx] to ram\n", r,
 			 prange, prange->start, prange->last);
drivers/gpu/drm/amd/amdkfd/kfd_migrate.h (+1 -1)

@@ ... @@
 int svm_migrate_to_vram(struct svm_range *prange, uint32_t best_loc,
 			struct mm_struct *mm, uint32_t trigger);
 int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
-			    uint32_t trigger);
+			    uint32_t trigger, struct page *fault_page);
 unsigned long
 svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);
drivers/gpu/drm/amd/amdkfd/kfd_svm.c (+7 -4)

@@ ... @@
 			 */
 			if (prange->actual_loc)
 				r = svm_migrate_vram_to_ram(prange, mm,
-					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
+					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
+					NULL);
 			else
 				r = 0;
 		}
 	} else {
 		r = svm_migrate_vram_to_ram(prange, mm,
-				KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
+				KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
+				NULL);
 	}
 	if (r) {
 		pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
@@ ... @@
 		return 0;
 
 	if (!best_loc) {
-		r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PREFETCH);
+		r = svm_migrate_vram_to_ram(prange, mm,
+					KFD_MIGRATE_TRIGGER_PREFETCH, NULL);
 		*migrated = !r;
 		return r;
 	}
@@ ... @@
 		mutex_lock(&prange->migrate_mutex);
 		do {
 			r = svm_migrate_vram_to_ram(prange, mm,
-					KFD_MIGRATE_TRIGGER_TTM_EVICTION);
+					KFD_MIGRATE_TRIGGER_TTM_EVICTION, NULL);
 		} while (!r && prange->actual_loc && --retries);
 
 		if (!r && prange->actual_loc)
include/linux/migrate.h (+8)

@@ ... @@
 #ifdef CONFIG_MIGRATION
 
 extern void putback_movable_pages(struct list_head *l);
+int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
+		struct folio *src, enum migrate_mode mode, int extra_count);
 int migrate_folio(struct address_space *mapping, struct folio *dst,
 		struct folio *src, enum migrate_mode mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
@@ ... @@
 	 */
 	void *pgmap_owner;
 	unsigned long flags;
+
+	/*
+	 * Set to vmf->page if this is being called to migrate a page as part of
+	 * a migrate_to_ram() callback.
+	 */
+	struct page *fault_page;
 };
 
 int migrate_vma_setup(struct migrate_vma *args);
lib/test_hmm.c (+4 -3)

@@ ... @@
 	struct vm_area_struct *vma;
 	unsigned long src_pfns[64] = { 0 };
 	unsigned long dst_pfns[64] = { 0 };
-	struct migrate_vma args;
+	struct migrate_vma args = { 0 };
 	unsigned long next;
 	int ret;
@@ ... @@
 	unsigned long src_pfns[64] = { 0 };
 	unsigned long dst_pfns[64] = { 0 };
 	struct dmirror_bounce bounce;
-	struct migrate_vma args;
+	struct migrate_vma args = { 0 };
 	unsigned long next;
 	int ret;
@@ ... @@
 static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 {
-	struct migrate_vma args;
+	struct migrate_vma args = { 0 };
 	unsigned long src_pfns = 0;
 	unsigned long dst_pfns = 0;
 	struct page *rpage;
@@ ... @@
 	args.dst = &dst_pfns;
 	args.pgmap_owner = dmirror->mdevice;
 	args.flags = dmirror_select_device(dmirror);
+	args.fault_page = vmf->page;
 
 	if (migrate_vma_setup(&args))
 		return VM_FAULT_SIGBUS;
mm/memory.c (+15 -1)

@@ ... @@
 		ret = remove_device_exclusive_entry(vmf);
 	} else if (is_device_private_entry(entry)) {
 		vmf->page = pfn_swap_entry_to_page(entry);
-		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+				vmf->address, &vmf->ptl);
+		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+			spin_unlock(vmf->ptl);
+			goto out;
+		}
+
+		/*
+		 * Get a page reference while we know the page can't be
+		 * freed.
+		 */
+		get_page(vmf->page);
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+		put_page(vmf->page);
 	} else if (is_hwpoison_entry(entry)) {
 		ret = VM_FAULT_HWPOISON;
 	} else if (is_swapin_error_entry(entry)) {
mm/migrate.c (+20 -14)

@@ ... @@
  * Migration functions
  ***********************************************************/
 
+int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
+		struct folio *src, enum migrate_mode mode, int extra_count)
+{
+	int rc;
+
+	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
+
+	rc = folio_migrate_mapping(mapping, dst, src, extra_count);
+
+	if (rc != MIGRATEPAGE_SUCCESS)
+		return rc;
+
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		folio_migrate_copy(dst, src);
+	else
+		folio_migrate_flags(dst, src);
+	return MIGRATEPAGE_SUCCESS;
+}
+
 /**
  * migrate_folio() - Simple folio migration.
  * @mapping: The address_space containing the folio.
@@ ... @@
 int migrate_folio(struct address_space *mapping, struct folio *dst,
 		struct folio *src, enum migrate_mode mode)
 {
-	int rc;
-
-	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
-
-	rc = folio_migrate_mapping(mapping, dst, src, 0);
-
-	if (rc != MIGRATEPAGE_SUCCESS)
-		return rc;
-
-	if (mode != MIGRATE_SYNC_NO_COPY)
-		folio_migrate_copy(dst, src);
-	else
-		folio_migrate_flags(dst, src);
-	return MIGRATEPAGE_SUCCESS;
+	return migrate_folio_extra(mapping, dst, src, mode, 0);
 }
 EXPORT_SYMBOL(migrate_folio);
mm/migrate_device.c (+13 -5)

@@ ... @@
  * folio_migrate_mapping(), except that here we allow migration of a
  * ZONE_DEVICE page.
  */
-static bool migrate_vma_check_page(struct page *page)
+static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
 {
 	/*
 	 * One extra ref because caller holds an extra reference, either from
 	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
 	 * a device page.
 	 */
-	int extra = 1;
+	int extra = 1 + (page == fault_page);
 
 	/*
 	 * FIXME support THP (transparent huge page), it is bit more complex to
@@ ... @@
 		if (folio_mapped(folio))
 			try_to_migrate(folio, 0);
 
-		if (page_mapped(page) || !migrate_vma_check_page(page)) {
+		if (page_mapped(page) ||
+		    !migrate_vma_check_page(page, migrate->fault_page)) {
 			if (!is_zone_device_page(page)) {
 				get_page(page);
 				putback_lru_page(page);
@@ ... @@
 	if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
 		return -EINVAL;
 	if (!args->src || !args->dst)
+		return -EINVAL;
+	if (args->fault_page && !is_device_private_page(args->fault_page))
 		return -EINVAL;
 
 	memset(args->src, 0, sizeof(*args->src) * nr_pages);
@@ ... @@
 			continue;
 		}
 
-		r = migrate_folio(mapping, page_folio(newpage),
-				page_folio(page), MIGRATE_SYNC_NO_COPY);
+		if (migrate->fault_page == page)
+			r = migrate_folio_extra(mapping, page_folio(newpage),
+						page_folio(page),
+						MIGRATE_SYNC_NO_COPY, 1);
+		else
+			r = migrate_folio(mapping, page_folio(newpage),
+					page_folio(page), MIGRATE_SYNC_NO_COPY);
 		if (r != MIGRATEPAGE_SUCCESS)
 			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 	}