Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull hmm updates from Jason Gunthorpe:
"This is another round of bug fixing and cleanup. This time the focus
is on the driver pattern to use mmu notifiers to monitor a VA range.
This code is lifted out of many drivers and hmm_mirror directly into
the mmu_notifier core and written using the best ideas from all the
driver implementations.

This removes many bugs from the drivers and has a very pleasing
diffstat. More drivers can still be converted, but that is for another
cycle.

- A shared branch with RDMA reworking the RDMA ODP implementation

- New mmu_interval_notifier API. This is focused on the use case of
monitoring a VA and simplifies the process for drivers

- A common seq-count locking scheme built into the
mmu_interval_notifier API usable by drivers that call
get_user_pages() or hmm_range_fault() with the VA range

- Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
GntDev drivers to the new API. This deletes a lot of wonky driver
code.

- Two improvements for hmm_range_fault(), from testing done by Ralph"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
mm/hmm: make full use of walk_page_range()
xen/gntdev: use mmu_interval_notifier_insert
mm/hmm: remove hmm_mirror and related
drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
drm/amdgpu: Call find_vma under mmap_sem
nouveau: use mmu_interval_notifier instead of hmm_mirror
nouveau: use mmu_notifier directly for invalidate_range_start
drm/radeon: use mmu_interval_notifier_insert
RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
RDMA/odp: Use mmu_interval_notifier_insert()
mm/hmm: define the pre-processor related parts of hmm.h even if disabled
mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
mm/mmu_notifier: add an interval tree notifier
mm/mmu_notifier: define the header pre-processor parts even if disabled
mm/hmm: allow snapshot of the special zero page

+1303 -2144

Documentation/vm/hmm.rst (+22 -79)
···
 Address space mirroring's main objective is to allow duplication of a range of
 CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that wants to mirror a process address space must start with the
-registration of an hmm_mirror struct::
+registration of a mmu_interval_notifier::
 
-    int hmm_mirror_register(struct hmm_mirror *mirror,
-                            struct mm_struct *mm);
+    mni->ops = &driver_ops;
+    int mmu_interval_notifier_insert(struct mmu_interval_notifier *mni,
+                                     unsigned long start, unsigned long length,
+                                     struct mm_struct *mm);
 
-The mirror struct has a set of callbacks that are used
-to propagate CPU page tables::
-
-    struct hmm_mirror_ops {
-        /* release() - release hmm_mirror
-         *
-         * @mirror: pointer to struct hmm_mirror
-         *
-         * This is called when the mm_struct is being released. The callback
-         * must ensure that all access to any pages obtained from this mirror
-         * is halted before the callback returns. All future access should
-         * fault.
-         */
-        void (*release)(struct hmm_mirror *mirror);
-
-        /* sync_cpu_device_pagetables() - synchronize page tables
-         *
-         * @mirror: pointer to struct hmm_mirror
-         * @update: update information (see struct mmu_notifier_range)
-         * Return: -EAGAIN if update.blockable false and callback need to
-         *         block, 0 otherwise.
-         *
-         * This callback ultimately originates from mmu_notifiers when the CPU
-         * page table is updated. The device driver must update its page table
-         * in response to this callback. The update argument tells what action
-         * to perform.
-         *
-         * The device driver must not return from this callback until the device
-         * page tables are completely updated (TLBs flushed, etc); this is a
-         * synchronous call.
-         */
-        int (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
-                                          const struct hmm_update *update);
-    };
-
-The device driver must perform the update action to the range (mark range
-read only, or fully unmap, etc.). The device must complete the update before
-the driver callback returns.
+During the driver_ops->invalidate() callback the device driver must perform
+the update action to the range (mark range read only, or fully unmap,
+etc.). The device must complete the update before the driver callback returns.
 
 When the device driver wants to populate a range of virtual addresses, it can
 use::
···
      struct hmm_range range;
      ...
 
+     range.notifier = &mni;
      range.start = ...;
      range.end = ...;
      range.pfns = ...;
      range.flags = ...;
      range.values = ...;
      range.pfn_shift = ...;
-     hmm_range_register(&range, mirror);
 
-     /*
-      * Just wait for range to be valid, safe to ignore return value as we
-      * will use the return value of hmm_range_fault() below under the
-      * mmap_sem to ascertain the validity of the range.
-      */
-     hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
+     if (!mmget_not_zero(mni->notifier.mm))
+         return -EFAULT;
 
 again:
+     range.notifier_seq = mmu_interval_read_begin(&mni);
      down_read(&mm->mmap_sem);
      ret = hmm_range_fault(&range, HMM_RANGE_SNAPSHOT);
      if (ret) {
          up_read(&mm->mmap_sem);
-         if (ret == -EBUSY) {
-             /*
-              * No need to check hmm_range_wait_until_valid() return value
-              * on retry we will get proper error with hmm_range_fault()
-              */
-             hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
-             goto again;
-         }
-         hmm_range_unregister(&range);
+         if (ret == -EBUSY)
+             goto again;
          return ret;
      }
+     up_read(&mm->mmap_sem);
+
      take_lock(driver->update);
-     if (!hmm_range_valid(&range)) {
+     if (mmu_interval_read_retry(&ni, range.notifier_seq) {
          release_lock(driver->update);
-         up_read(&mm->mmap_sem);
          goto again;
      }
 
-     // Use pfns array content to update device page table
+     /* Use pfns array content to update device page table,
+      * under the update lock */
 
-     hmm_range_unregister(&range);
      release_lock(driver->update);
-     up_read(&mm->mmap_sem);
      return 0;
 }
 
 The driver->update lock is the same lock that the driver takes inside its
-sync_cpu_device_pagetables() callback. That lock must be held before calling
-hmm_range_valid() to avoid any race with a concurrent CPU page table update.
-
-HMM implements all this on top of the mmu_notifier API because we wanted a
-simpler API and also to be able to perform optimizations latter on like doing
-concurrent device updates in multi-devices scenario.
-
-HMM also serves as an impedance mismatch between how CPU page table updates
-are done (by CPU write to the page table and TLB flushes) and how devices
-update their own page table. Device updates are a multi-step process. First,
-appropriate commands are written to a buffer, then this buffer is scheduled for
-execution on the device. It is only once the device has executed commands in
-the buffer that the update is done. Creating and scheduling the update command
-buffer can happen concurrently for multiple devices. Waiting for each device to
-report commands as executed is serialized (there is no point in doing this
-concurrently).
+invalidate() callback. That lock must be held before calling
+mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
+update.
 
 Leverage default_flags and pfn_flags_mask
 =========================================
drivers/gpu/drm/amd/amdgpu/amdgpu.h (+2)
···
 	struct mutex			lock_reset;
 	struct amdgpu_doorbell_index doorbell_index;
 
+	struct mutex			notifier_lock;
+
 	int asic_reset_res;
 	struct work_struct		xgmi_reset_work;
 
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c (+6 -3)
···
  *
  * Returns 0 for success, negative errno for errors.
  */
-static int init_user_pages(struct kgd_mem *mem, struct mm_struct *mm,
-			   uint64_t user_addr)
+static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr)
 {
 	struct amdkfd_process_info *process_info = mem->process_info;
 	struct amdgpu_bo *bo = mem->bo;
···
 	add_kgd_mem_to_kfd_bo_list(*mem, avm->process_info, user_addr);
 
 	if (user_addr) {
-		ret = init_user_pages(*mem, current->mm, user_addr);
+		ret = init_user_pages(*mem, user_addr);
 		if (ret)
 			goto allocate_init_user_pages_failed;
 	}
···
 		return ret;
 	}
 
+	/*
+	 * FIXME: Cannot ignore the return code, must hold
+	 * notifier_lock
+	 */
 	amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
 
 	/* Mark the BO as valid unless it was invalidated
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c (+6 -8)
···
 	e->tv.num_shared = 2;
 
 	amdgpu_bo_list_get_list(p->bo_list, &p->validated);
-	if (p->bo_list->first_userptr != p->bo_list->num_entries)
-		p->mn = amdgpu_mn_get(p->adev, AMDGPU_MN_TYPE_GFX);
 
 	INIT_LIST_HEAD(&duplicates);
 	amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd);
···
 	if (r)
 		goto error_unlock;
 
-	/* No memory allocation is allowed while holding the mn lock.
-	 * p->mn is hold until amdgpu_cs_submit is finished and fence is added
-	 * to BOs.
+	/* No memory allocation is allowed while holding the notifier lock.
+	 * The lock is held until amdgpu_cs_submit is finished and fence is
+	 * added to BOs.
 	 */
-	amdgpu_mn_lock(p->mn);
+	mutex_lock(&p->adev->notifier_lock);
 
 	/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
 	 * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
···
 	amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
 
 	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
-	amdgpu_mn_unlock(p->mn);
+	mutex_unlock(&p->adev->notifier_lock);
 
 	return 0;
 
 error_abort:
 	drm_sched_job_cleanup(&job->base);
-	amdgpu_mn_unlock(p->mn);
+	mutex_unlock(&p->adev->notifier_lock);
 
 error_unlock:
 	amdgpu_job_free(job);
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c (+1)
···
 	mutex_init(&adev->virt.vf_errors.lock);
 	hash_init(adev->mn_hash);
 	mutex_init(&adev->lock_reset);
+	mutex_init(&adev->notifier_lock);
 	mutex_init(&adev->virt.dpm_mutex);
 	mutex_init(&adev->psp.mutex);
 
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c (+57 -389)
···
 #include "amdgpu_amdkfd.h"
 
 /**
- * struct amdgpu_mn_node
+ * amdgpu_mn_invalidate_gfx - callback to notify about mm change
  *
- * @it: interval node defining start-last of the affected address range
- * @bos: list of all BOs in the affected address range
- *
- * Manages all BOs which are affected of a certain range of address space.
- */
-struct amdgpu_mn_node {
-	struct interval_tree_node	it;
-	struct list_head		bos;
-};
-
-/**
- * amdgpu_mn_destroy - destroy the HMM mirror
- *
- * @work: previously sheduled work item
- *
- * Lazy destroys the notifier from a work item
- */
-static void amdgpu_mn_destroy(struct work_struct *work)
-{
-	struct amdgpu_mn *amn = container_of(work, struct amdgpu_mn, work);
-	struct amdgpu_device *adev = amn->adev;
-	struct amdgpu_mn_node *node, *next_node;
-	struct amdgpu_bo *bo, *next_bo;
-
-	mutex_lock(&adev->mn_lock);
-	down_write(&amn->lock);
-	hash_del(&amn->node);
-	rbtree_postorder_for_each_entry_safe(node, next_node,
-					     &amn->objects.rb_root, it.rb) {
-		list_for_each_entry_safe(bo, next_bo, &node->bos, mn_list) {
-			bo->mn = NULL;
-			list_del_init(&bo->mn_list);
-		}
-		kfree(node);
-	}
-	up_write(&amn->lock);
-	mutex_unlock(&adev->mn_lock);
-
-	hmm_mirror_unregister(&amn->mirror);
-	kfree(amn);
-}
-
-/**
- * amdgpu_hmm_mirror_release - callback to notify about mm destruction
- *
- * @mirror: the HMM mirror (mm) this callback is about
- *
- * Shedule a work item to lazy destroy HMM mirror.
- */
-static void amdgpu_hmm_mirror_release(struct hmm_mirror *mirror)
-{
-	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
-
-	INIT_WORK(&amn->work, amdgpu_mn_destroy);
-	schedule_work(&amn->work);
-}
-
-/**
- * amdgpu_mn_lock - take the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_lock(struct amdgpu_mn *mn)
-{
-	if (mn)
-		down_write(&mn->lock);
-}
-
-/**
- * amdgpu_mn_unlock - drop the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_unlock(struct amdgpu_mn *mn)
-{
-	if (mn)
-		up_write(&mn->lock);
-}
-
-/**
- * amdgpu_mn_read_lock - take the read side lock for this notifier
- *
- * @amn: our notifier
- * @blockable: is the notifier blockable
- */
-static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
-{
-	if (blockable)
-		down_read(&amn->lock);
-	else if (!down_read_trylock(&amn->lock))
-		return -EAGAIN;
-
-	return 0;
-}
-
-/**
- * amdgpu_mn_read_unlock - drop the read side lock for this notifier
- *
- * @amn: our notifier
- */
-static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
-{
-	up_read(&amn->lock);
-}
-
-/**
- * amdgpu_mn_invalidate_node - unmap all BOs of a node
- *
- * @node: the node with the BOs to unmap
- * @start: start of address range affected
- * @end: end of address range affected
+ * @mni: the range (mm) is about to update
+ * @range: details on the invalidation
+ * @cur_seq: Value to pass to mmu_interval_set_seq()
  *
  * Block for operations on BOs to finish and mark pages as accessed and
  * potentially dirty.
  */
-static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
-				      unsigned long start,
-				      unsigned long end)
+static bool amdgpu_mn_invalidate_gfx(struct mmu_interval_notifier *mni,
+				     const struct mmu_notifier_range *range,
+				     unsigned long cur_seq)
 {
-	struct amdgpu_bo *bo;
+	struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 	long r;
 
-	list_for_each_entry(bo, &node->bos, mn_list) {
+	if (!mmu_notifier_range_blockable(range))
+		return false;
 
-		if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, start, end))
-			continue;
+	mutex_lock(&adev->notifier_lock);
 
-		r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
-			true, false, MAX_SCHEDULE_TIMEOUT);
-		if (r <= 0)
-			DRM_ERROR("(%ld) failed to wait for user bo\n", r);
-	}
+	mmu_interval_set_seq(mni, cur_seq);
+
+	r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
+				      MAX_SCHEDULE_TIMEOUT);
+	mutex_unlock(&adev->notifier_lock);
+	if (r <= 0)
+		DRM_ERROR("(%ld) failed to wait for user bo\n", r);
+	return true;
 }
 
-/**
- * amdgpu_mn_sync_pagetables_gfx - callback to notify about mm change
- *
- * @mirror: the hmm_mirror (mm) is about to update
- * @update: the update start, end address
- *
- * Block for operations on BOs to finish and mark pages as accessed and
- * potentially dirty.
- */
-static int
-amdgpu_mn_sync_pagetables_gfx(struct hmm_mirror *mirror,
-			      const struct mmu_notifier_range *update)
-{
-	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
-	unsigned long start = update->start;
-	unsigned long end = update->end;
-	bool blockable = mmu_notifier_range_blockable(update);
-	struct interval_tree_node *it;
-
-	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
-
-	/* TODO we should be able to split locking for interval tree and
-	 * amdgpu_mn_invalidate_node
-	 */
-	if (amdgpu_mn_read_lock(amn, blockable))
-		return -EAGAIN;
-
-	it = interval_tree_iter_first(&amn->objects, start, end);
-	while (it) {
-		struct amdgpu_mn_node *node;
-
-		if (!blockable) {
-			amdgpu_mn_read_unlock(amn);
-			return -EAGAIN;
-		}
-
-		node = container_of(it, struct amdgpu_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
-
-		amdgpu_mn_invalidate_node(node, start, end);
-	}
-
-	amdgpu_mn_read_unlock(amn);
-
-	return 0;
-}
-
-/**
- * amdgpu_mn_sync_pagetables_hsa - callback to notify about mm change
- *
- * @mirror: the hmm_mirror (mm) is about to update
- * @update: the update start, end address
- *
- * We temporarily evict all BOs between start and end. This
- * necessitates evicting all user-mode queues of the process. The BOs
- * are restorted in amdgpu_mn_invalidate_range_end_hsa.
- */
-static int
-amdgpu_mn_sync_pagetables_hsa(struct hmm_mirror *mirror,
-			      const struct mmu_notifier_range *update)
-{
-	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
-	unsigned long start = update->start;
-	unsigned long end = update->end;
-	bool blockable = mmu_notifier_range_blockable(update);
-	struct interval_tree_node *it;
-
-	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
-
-	if (amdgpu_mn_read_lock(amn, blockable))
-		return -EAGAIN;
-
-	it = interval_tree_iter_first(&amn->objects, start, end);
-	while (it) {
-		struct amdgpu_mn_node *node;
-		struct amdgpu_bo *bo;
-
-		if (!blockable) {
-			amdgpu_mn_read_unlock(amn);
-			return -EAGAIN;
-		}
-
-		node = container_of(it, struct amdgpu_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
-
-		list_for_each_entry(bo, &node->bos, mn_list) {
-			struct kgd_mem *mem = bo->kfd_bo;
-
-			if (amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm,
-							 start, end))
-				amdgpu_amdkfd_evict_userptr(mem, amn->mm);
-		}
-	}
-
-	amdgpu_mn_read_unlock(amn);
-
-	return 0;
-}
-
-/* Low bits of any reasonable mm pointer will be unused due to struct
- * alignment. Use these bits to make a unique key from the mm pointer
- * and notifier type.
- */
-#define AMDGPU_MN_KEY(mm, type) ((unsigned long)(mm) + (type))
-
-static struct hmm_mirror_ops amdgpu_hmm_mirror_ops[] = {
-	[AMDGPU_MN_TYPE_GFX] = {
-		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_gfx,
-		.release = amdgpu_hmm_mirror_release
-	},
-	[AMDGPU_MN_TYPE_HSA] = {
-		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_hsa,
-		.release = amdgpu_hmm_mirror_release
-	},
+static const struct mmu_interval_notifier_ops amdgpu_mn_gfx_ops = {
+	.invalidate = amdgpu_mn_invalidate_gfx,
 };
 
 /**
- * amdgpu_mn_get - create HMM mirror context
+ * amdgpu_mn_invalidate_hsa - callback to notify about mm change
  *
- * @adev: amdgpu device pointer
- * @type: type of MMU notifier context
+ * @mni: the range (mm) is about to update
+ * @range: details on the invalidation
+ * @cur_seq: Value to pass to mmu_interval_set_seq()
  *
- * Creates a HMM mirror context for current->mm.
+ * We temporarily evict the BO attached to this range. This necessitates
+ * evicting all user-mode queues of the process.
  */
-struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
-				enum amdgpu_mn_type type)
+static bool amdgpu_mn_invalidate_hsa(struct mmu_interval_notifier *mni,
+				     const struct mmu_notifier_range *range,
+				     unsigned long cur_seq)
 {
-	struct mm_struct *mm = current->mm;
-	struct amdgpu_mn *amn;
-	unsigned long key = AMDGPU_MN_KEY(mm, type);
-	int r;
+	struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 
-	mutex_lock(&adev->mn_lock);
-	if (down_write_killable(&mm->mmap_sem)) {
-		mutex_unlock(&adev->mn_lock);
-		return ERR_PTR(-EINTR);
-	}
+	if (!mmu_notifier_range_blockable(range))
+		return false;
 
-	hash_for_each_possible(adev->mn_hash, amn, node, key)
-		if (AMDGPU_MN_KEY(amn->mm, amn->type) == key)
-			goto release_locks;
+	mutex_lock(&adev->notifier_lock);
 
-	amn = kzalloc(sizeof(*amn), GFP_KERNEL);
-	if (!amn) {
-		amn = ERR_PTR(-ENOMEM);
-		goto release_locks;
-	}
+	mmu_interval_set_seq(mni, cur_seq);
 
-	amn->adev = adev;
-	amn->mm = mm;
-	init_rwsem(&amn->lock);
-	amn->type = type;
-	amn->objects = RB_ROOT_CACHED;
+	amdgpu_amdkfd_evict_userptr(bo->kfd_bo, bo->notifier.mm);
+	mutex_unlock(&adev->notifier_lock);
 
-	amn->mirror.ops = &amdgpu_hmm_mirror_ops[type];
-	r = hmm_mirror_register(&amn->mirror, mm);
-	if (r)
-		goto free_amn;
-
-	hash_add(adev->mn_hash, &amn->node, AMDGPU_MN_KEY(mm, type));
-
-release_locks:
-	up_write(&mm->mmap_sem);
-	mutex_unlock(&adev->mn_lock);
-
-	return amn;
-
-free_amn:
-	up_write(&mm->mmap_sem);
-	mutex_unlock(&adev->mn_lock);
-	kfree(amn);
-
-	return ERR_PTR(r);
+	return true;
 }
+
+static const struct mmu_interval_notifier_ops amdgpu_mn_hsa_ops = {
+	.invalidate = amdgpu_mn_invalidate_hsa,
+};
 
 /**
  * amdgpu_mn_register - register a BO for notifier updates
···
  * @bo: amdgpu buffer object
  * @addr: userptr addr we should monitor
  *
- * Registers an HMM mirror for the given BO at the specified address.
+ * Registers a mmu_notifier for the given BO at the specified address.
  * Returns 0 on success, -ERRNO if anything goes wrong.
  */
 int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr)
 {
-	unsigned long end = addr + amdgpu_bo_size(bo) - 1;
-	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
-	enum amdgpu_mn_type type =
-		bo->kfd_bo ? AMDGPU_MN_TYPE_HSA : AMDGPU_MN_TYPE_GFX;
-	struct amdgpu_mn *amn;
-	struct amdgpu_mn_node *node = NULL, *new_node;
-	struct list_head bos;
-	struct interval_tree_node *it;
-
-	amn = amdgpu_mn_get(adev, type);
-	if (IS_ERR(amn))
-		return PTR_ERR(amn);
-
-	new_node = kmalloc(sizeof(*new_node), GFP_KERNEL);
-	if (!new_node)
-		return -ENOMEM;
-
-	INIT_LIST_HEAD(&bos);
-
-	down_write(&amn->lock);
-
-	while ((it = interval_tree_iter_first(&amn->objects, addr, end))) {
-		kfree(node);
-		node = container_of(it, struct amdgpu_mn_node, it);
-		interval_tree_remove(&node->it, &amn->objects);
-		addr = min(it->start, addr);
-		end = max(it->last, end);
-		list_splice(&node->bos, &bos);
-	}
-
-	if (!node)
-		node = new_node;
-	else
-		kfree(new_node);
-
-	bo->mn = amn;
-
-	node->it.start = addr;
-	node->it.last = end;
-	INIT_LIST_HEAD(&node->bos);
-	list_splice(&bos, &node->bos);
-	list_add(&bo->mn_list, &node->bos);
-
-	interval_tree_insert(&node->it, &amn->objects);
-
-	up_write(&amn->lock);
-
-	return 0;
+	if (bo->kfd_bo)
+		return mmu_interval_notifier_insert(&bo->notifier, current->mm,
+						    addr, amdgpu_bo_size(bo),
+						    &amdgpu_mn_hsa_ops);
+	return mmu_interval_notifier_insert(&bo->notifier, current->mm, addr,
+					    amdgpu_bo_size(bo),
+					    &amdgpu_mn_gfx_ops);
 }
 
 /**
- * amdgpu_mn_unregister - unregister a BO for HMM mirror updates
+ * amdgpu_mn_unregister - unregister a BO for notifier updates
  *
  * @bo: amdgpu buffer object
  *
- * Remove any registration of HMM mirror updates from the buffer object.
+ * Remove any registration of mmu notifier updates from the buffer object.
  */
 void amdgpu_mn_unregister(struct amdgpu_bo *bo)
 {
-	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
-	struct amdgpu_mn *amn;
-	struct list_head *head;
-
-	mutex_lock(&adev->mn_lock);
-
-	amn = bo->mn;
-	if (amn == NULL) {
-		mutex_unlock(&adev->mn_lock);
+	if (!bo->notifier.mm)
 		return;
-	}
-
-	down_write(&amn->lock);
-
-	/* save the next list entry for later */
-	head = bo->mn_list.next;
-
-	bo->mn = NULL;
-	list_del_init(&bo->mn_list);
-
-	if (list_empty(head)) {
-		struct amdgpu_mn_node *node;
-
-		node = container_of(head, struct amdgpu_mn_node, bos);
-		interval_tree_remove(&node->it, &amn->objects);
-		kfree(node);
-	}
-
-	up_write(&amn->lock);
-	mutex_unlock(&adev->mn_lock);
-}
-
-/* flags used by HMM internal, not related to CPU/GPU PTE flags */
-static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
-	(1 << 0), /* HMM_PFN_VALID */
-	(1 << 1), /* HMM_PFN_WRITE */
-	0 /* HMM_PFN_DEVICE_PRIVATE */
-};
-
-static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
-	0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
-	0, /* HMM_PFN_NONE */
-	0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
-};
-
-void amdgpu_hmm_init_range(struct hmm_range *range)
-{
-	if (range) {
-		range->flags = hmm_range_flags;
-		range->values = hmm_range_values;
-		range->pfn_shift = PAGE_SHIFT;
-	}
+	mmu_interval_notifier_remove(&bo->notifier);
+	bo->notifier.mm = NULL;
 }
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h (-53)
···
 #include <linux/workqueue.h>
 #include <linux/interval_tree.h>
 
-enum amdgpu_mn_type {
-	AMDGPU_MN_TYPE_GFX,
-	AMDGPU_MN_TYPE_HSA,
-};
-
-/**
- * struct amdgpu_mn
- *
- * @adev: amdgpu device pointer
- * @mm: process address space
- * @type: type of MMU notifier
- * @work: destruction work item
- * @node: hash table node to find structure by adev and mn
- * @lock: rw semaphore protecting the notifier nodes
- * @objects: interval tree containing amdgpu_mn_nodes
- * @mirror: HMM mirror function support
- *
- * Data for each amdgpu device and process address space.
- */
-struct amdgpu_mn {
-	/* constant after initialisation */
-	struct amdgpu_device	*adev;
-	struct mm_struct	*mm;
-	enum amdgpu_mn_type	type;
-
-	/* only used on destruction */
-	struct work_struct	work;
-
-	/* protected by adev->mn_lock */
-	struct hlist_node	node;
-
-	/* objects protected by lock */
-	struct rw_semaphore	lock;
-	struct rb_root_cached	objects;
-
-#ifdef CONFIG_HMM_MIRROR
-	/* HMM mirror */
-	struct hmm_mirror	mirror;
-#endif
-};
-
 #if defined(CONFIG_HMM_MIRROR)
-void amdgpu_mn_lock(struct amdgpu_mn *mn);
-void amdgpu_mn_unlock(struct amdgpu_mn *mn);
-struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
-				enum amdgpu_mn_type type);
 int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr);
 void amdgpu_mn_unregister(struct amdgpu_bo *bo);
-void amdgpu_hmm_init_range(struct hmm_range *range);
 #else
-static inline void amdgpu_mn_lock(struct amdgpu_mn *mn) {}
-static inline void amdgpu_mn_unlock(struct amdgpu_mn *mn) {}
-static inline struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
-					      enum amdgpu_mn_type type)
-{
-	return NULL;
-}
 static inline int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr)
 {
 	DRM_WARN_ONCE("HMM_MIRROR kernel config option is not enabled, "
drivers/gpu/drm/amd/amdgpu/amdgpu_object.h (+9 -4)
···
 
 #include <drm/amdgpu_drm.h>
 #include "amdgpu.h"
+#ifdef CONFIG_MMU_NOTIFIER
+#include <linux/mmu_notifier.h>
+#endif
 
 #define AMDGPU_BO_INVALID_OFFSET	LONG_MAX
 #define AMDGPU_BO_MAX_PLACEMENTS	3
···
 	struct ttm_bo_kmap_obj		dma_buf_vmap;
 	struct amdgpu_mn		*mn;
 
-	union {
-		struct list_head	mn_list;
-		struct list_head	shadow_list;
-	};
+
+#ifdef CONFIG_MMU_NOTIFIER
+	struct mmu_interval_notifier	notifier;
+#endif
+
+	struct list_head		shadow_list;
 
 	struct kgd_mem			*kfd_bo;
 };
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c (+90 -57)
···
 #include <linux/hmm.h>
 #include <linux/pagemap.h>
 #include <linux/sched/task.h>
+#include <linux/sched/mm.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
···
 #endif
 };
 
+#ifdef CONFIG_DRM_AMDGPU_USERPTR
+/* flags used by HMM internal, not related to CPU/GPU PTE flags */
+static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
+	(1 << 0), /* HMM_PFN_VALID */
+	(1 << 1), /* HMM_PFN_WRITE */
+	0 /* HMM_PFN_DEVICE_PRIVATE */
+};
+
+static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
+	0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
+	0, /* HMM_PFN_NONE */
+	0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
+};
+
 /**
  * amdgpu_ttm_tt_get_user_pages - get device accessible pages that back user
  * memory and start HMM tracking CPU page table update
···
  * Calling function must call amdgpu_ttm_tt_userptr_range_done() once and only
  * once afterwards to stop HMM tracking
  */
-#if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
-
-#define MAX_RETRY_HMM_RANGE_FAULT	16
-
 int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 {
-	struct hmm_mirror *mirror = bo->mn ? &bo->mn->mirror : NULL;
 	struct ttm_tt *ttm = bo->tbo.ttm;
 	struct amdgpu_ttm_tt *gtt = (void *)ttm;
-	struct mm_struct *mm = gtt->usertask->mm;
 	unsigned long start = gtt->userptr;
 	struct vm_area_struct *vma;
 	struct hmm_range *range;
+	unsigned long timeout;
+	struct mm_struct *mm;
 	unsigned long i;
-	uint64_t *pfns;
 	int r = 0;
 
-	if (!mm) /* Happens during process shutdown */
+	mm = bo->notifier.mm;
+	if (unlikely(!mm)) {
+		DRM_DEBUG_DRIVER("BO is not registered?\n");
+		return -EFAULT;
+	}
+
+	/* Another get_user_pages is running at the same time?? */
+	if (WARN_ON(gtt->range))
+		return -EFAULT;
+
+	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
 		return -ESRCH;
-
-	if (unlikely(!mirror)) {
-		DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
-		r = -EFAULT;
-		goto out;
-	}
-
-	vma = find_vma(mm, start);
-	if (unlikely(!vma || start < vma->vm_start)) {
-		r = -EFAULT;
-		goto out;
-	}
-	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
-		vma->vm_file)) {
-		r = -EPERM;
-		goto out;
-	}
 
 	range = kzalloc(sizeof(*range), GFP_KERNEL);
 	if (unlikely(!range)) {
 		r = -ENOMEM;
 		goto out;
 	}
+	range->notifier = &bo->notifier;
+	range->flags = hmm_range_flags;
+	range->values = hmm_range_values;
+	range->pfn_shift = PAGE_SHIFT;
+	range->start = bo->notifier.interval_tree.start;
+	range->end = bo->notifier.interval_tree.last + 1;
+	range->default_flags = hmm_range_flags[HMM_PFN_VALID];
+	if (!amdgpu_ttm_tt_is_readonly(ttm))
+		range->default_flags |= range->flags[HMM_PFN_WRITE];
 
-	pfns = kvmalloc_array(ttm->num_pages, sizeof(*pfns), GFP_KERNEL);
-	if (unlikely(!pfns)) {
+	range->pfns = kvmalloc_array(ttm->num_pages, sizeof(*range->pfns),
+				     GFP_KERNEL);
+	if (unlikely(!range->pfns)) {
 		r = -ENOMEM;
 		goto out_free_ranges;
 	}
 
-	amdgpu_hmm_init_range(range);
-	range->default_flags = range->flags[HMM_PFN_VALID];
-	range->default_flags |= amdgpu_ttm_tt_is_readonly(ttm) ?
-				0 : range->flags[HMM_PFN_WRITE];
-	range->pfn_flags_mask = 0;
-	range->pfns = pfns;
-	range->start = start;
-	range->end = start + ttm->num_pages * PAGE_SIZE;
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	if (unlikely(!vma || start < vma->vm_start)) {
+		r = -EFAULT;
+		goto out_unlock;
+	}
+	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
+		vma->vm_file)) {
+		r = -EPERM;
+		goto out_unlock;
+	}
+	up_read(&mm->mmap_sem);
+	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
 
-	hmm_range_register(range, mirror);
-
-	/*
-	 * Just wait for range to be valid, safe to ignore return value as we
-	 * will use the return value of hmm_range_fault() below under the
-	 * mmap_sem to ascertain the validity of the range.
-	 */
-	hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
+retry:
+	range->notifier_seq = mmu_interval_read_begin(&bo->notifier);
 
 	down_read(&mm->mmap_sem);
 	r = hmm_range_fault(range, 0);
 	up_read(&mm->mmap_sem);
-
-	if (unlikely(r < 0))
+	if (unlikely(r <= 0)) {
+		/*
+		 * FIXME: This timeout should encompass the retry from
+		 * mmu_interval_read_retry() as well.
+		 */
+		if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
+			goto retry;
 		goto out_free_pfns;
+	}
 
 	for (i = 0; i < ttm->num_pages; i++) {
-		pages[i] = hmm_device_entry_to_page(range, pfns[i]);
+		/* FIXME: The pages cannot be touched outside the notifier_lock */
+		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
 		if (unlikely(!pages[i])) {
 			pr_err("Page fault failed for pfn[%lu] = 0x%llx\n",
-			       i, pfns[i]);
+			       i, range->pfns[i]);
 			r = -ENOMEM;
 
 			goto out_free_pfns;
···
 	}
 
 	gtt->range = range;
+	mmput(mm);
 
 	return 0;
 
+out_unlock:
+	up_read(&mm->mmap_sem);
 out_free_pfns:
-	hmm_range_unregister(range);
-	kvfree(pfns);
+	kvfree(range->pfns);
 out_free_ranges:
 	kfree(range);
 out:
+	mmput(mm);
 	return r;
 }
···
 		"No user pages to check\n");
 
 	if (gtt->range) {
-		r = hmm_range_valid(gtt->range);
-		hmm_range_unregister(gtt->range);
-
+		/*
+		 * FIXME: Must always hold notifier_lock for this, and must
+		 * not ignore the return code.
901 + */ 902 + r = mmu_interval_read_retry(gtt->range->notifier, 903 + gtt->range->notifier_seq); 923 904 kvfree(gtt->range->pfns); 924 905 kfree(gtt->range); 925 906 gtt->range = NULL; 926 907 } 927 908 928 - return r; 909 + return !r; 929 910 } 930 911 #endif 931 912 ··· 1009 984 sg_free_table(ttm->sg); 1010 985 1011 986 #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR) 1012 - if (gtt->range && 1013 - ttm->pages[0] == hmm_device_entry_to_page(gtt->range, 1014 - gtt->range->pfns[0])) 1015 - WARN_ONCE(1, "Missing get_user_page_done\n"); 987 + if (gtt->range) { 988 + unsigned long i; 989 + 990 + for (i = 0; i < ttm->num_pages; i++) { 991 + if (ttm->pages[i] != 992 + hmm_device_entry_to_page(gtt->range, 993 + gtt->range->pfns[i])) 994 + break; 995 + } 996 + 997 + WARN((i == ttm->num_pages), "Missing get_user_page_done\n"); 998 + } 1016 999 #endif 1017 1000 } 1018 1001
+141 -91
drivers/gpu/drm/nouveau/nouveau_svm.c
··· 88 88 } 89 89 90 90 struct nouveau_svmm { 91 + struct mmu_notifier notifier; 91 92 struct nouveau_vmm *vmm; 92 93 struct { 93 94 unsigned long start; ··· 96 95 } unmanaged; 97 96 98 97 struct mutex mutex; 99 - 100 - struct mm_struct *mm; 101 - struct hmm_mirror mirror; 102 98 }; 103 99 104 100 #define SVMM_DBG(s,f,a...) \ ··· 249 251 } 250 252 251 253 static int 252 - nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror, 253 - const struct mmu_notifier_range *update) 254 + nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn, 255 + const struct mmu_notifier_range *update) 254 256 { 255 - struct nouveau_svmm *svmm = container_of(mirror, typeof(*svmm), mirror); 257 + struct nouveau_svmm *svmm = 258 + container_of(mn, struct nouveau_svmm, notifier); 256 259 unsigned long start = update->start; 257 260 unsigned long limit = update->end; 258 261 ··· 263 264 SVMM_DBG(svmm, "invalidate %016lx-%016lx", start, limit); 264 265 265 266 mutex_lock(&svmm->mutex); 267 + if (unlikely(!svmm->vmm)) 268 + goto out; 269 + 266 270 if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) { 267 271 if (start < svmm->unmanaged.start) { 268 272 nouveau_svmm_invalidate(svmm, start, ··· 275 273 } 276 274 277 275 nouveau_svmm_invalidate(svmm, start, limit); 276 + 277 + out: 278 278 mutex_unlock(&svmm->mutex); 279 279 return 0; 280 280 } 281 281 282 - static void 283 - nouveau_svmm_release(struct hmm_mirror *mirror) 282 + static void nouveau_svmm_free_notifier(struct mmu_notifier *mn) 284 283 { 284 + kfree(container_of(mn, struct nouveau_svmm, notifier)); 285 285 } 286 286 287 - static const struct hmm_mirror_ops 288 - nouveau_svmm = { 289 - .sync_cpu_device_pagetables = nouveau_svmm_sync_cpu_device_pagetables, 290 - .release = nouveau_svmm_release, 287 + static const struct mmu_notifier_ops nouveau_mn_ops = { 288 + .invalidate_range_start = nouveau_svmm_invalidate_range_start, 289 + .free_notifier = nouveau_svmm_free_notifier, 291 290 }; 292 291 293 292 
void ··· 296 293 { 297 294 struct nouveau_svmm *svmm = *psvmm; 298 295 if (svmm) { 299 - hmm_mirror_unregister(&svmm->mirror); 300 - kfree(*psvmm); 296 + mutex_lock(&svmm->mutex); 297 + svmm->vmm = NULL; 298 + mutex_unlock(&svmm->mutex); 299 + mmu_notifier_put(&svmm->notifier); 301 300 *psvmm = NULL; 302 301 } 303 302 } ··· 325 320 mutex_lock(&cli->mutex); 326 321 if (cli->svm.cli) { 327 322 ret = -EBUSY; 328 - goto done; 323 + goto out_free; 329 324 } 330 325 331 326 /* Allocate a new GPU VMM that can support SVM (managed by the ··· 340 335 .fault_replay = true, 341 336 }, sizeof(struct gp100_vmm_v0), &cli->svm.vmm); 342 337 if (ret) 343 - goto done; 338 + goto out_free; 344 339 345 - /* Enable HMM mirroring of CPU address-space to VMM. */ 346 - svmm->mm = get_task_mm(current); 347 - down_write(&svmm->mm->mmap_sem); 348 - svmm->mirror.ops = &nouveau_svmm; 349 - ret = hmm_mirror_register(&svmm->mirror, svmm->mm); 350 - if (ret == 0) { 351 - cli->svm.svmm = svmm; 352 - cli->svm.cli = cli; 353 - } 354 - up_write(&svmm->mm->mmap_sem); 355 - mmput(svmm->mm); 356 - 357 - done: 340 + down_write(&current->mm->mmap_sem); 341 + svmm->notifier.ops = &nouveau_mn_ops; 342 + ret = __mmu_notifier_register(&svmm->notifier, current->mm); 358 343 if (ret) 359 - nouveau_svmm_fini(&svmm); 344 + goto out_mm_unlock; 345 + /* Note, ownership of svmm transfers to mmu_notifier */ 346 + 347 + cli->svm.svmm = svmm; 348 + cli->svm.cli = cli; 349 + up_write(&current->mm->mmap_sem); 360 350 mutex_unlock(&cli->mutex); 351 + return 0; 352 + 353 + out_mm_unlock: 354 + up_write(&current->mm->mmap_sem); 355 + out_free: 356 + mutex_unlock(&cli->mutex); 357 + kfree(svmm); 361 358 return ret; 362 359 } 363 360 ··· 482 475 fault->inst, fault->addr, fault->access); 483 476 } 484 477 485 - static inline bool 486 - nouveau_range_done(struct hmm_range *range) 487 - { 488 - bool ret = hmm_range_valid(range); 478 + struct svm_notifier { 479 + struct mmu_interval_notifier notifier; 480 + struct nouveau_svmm 
*svmm; 481 + }; 489 482 490 - hmm_range_unregister(range); 491 - return ret; 483 + static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni, 484 + const struct mmu_notifier_range *range, 485 + unsigned long cur_seq) 486 + { 487 + struct svm_notifier *sn = 488 + container_of(mni, struct svm_notifier, notifier); 489 + 490 + /* 491 + * serializes the update to mni->invalidate_seq done by caller and 492 + * prevents invalidation of the PTE from progressing while HW is being 493 + * programmed. This is very hacky and only works because the normal 494 + * notifier that does invalidation is always called after the range 495 + * notifier. 496 + */ 497 + if (mmu_notifier_range_blockable(range)) 498 + mutex_lock(&sn->svmm->mutex); 499 + else if (!mutex_trylock(&sn->svmm->mutex)) 500 + return false; 501 + mmu_interval_set_seq(mni, cur_seq); 502 + mutex_unlock(&sn->svmm->mutex); 503 + return true; 492 504 } 493 505 494 - static int 495 - nouveau_range_fault(struct nouveau_svmm *svmm, struct hmm_range *range) 506 + static const struct mmu_interval_notifier_ops nouveau_svm_mni_ops = { 507 + .invalidate = nouveau_svm_range_invalidate, 508 + }; 509 + 510 + static int nouveau_range_fault(struct nouveau_svmm *svmm, 511 + struct nouveau_drm *drm, void *data, u32 size, 512 + u64 *pfns, struct svm_notifier *notifier) 496 513 { 514 + unsigned long timeout = 515 + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); 516 + /* Have HMM fault pages within the fault window to the GPU. 
*/ 517 + struct hmm_range range = { 518 + .notifier = &notifier->notifier, 519 + .start = notifier->notifier.interval_tree.start, 520 + .end = notifier->notifier.interval_tree.last + 1, 521 + .pfns = pfns, 522 + .flags = nouveau_svm_pfn_flags, 523 + .values = nouveau_svm_pfn_values, 524 + .pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT, 525 + }; 526 + struct mm_struct *mm = notifier->notifier.mm; 497 527 long ret; 498 528 499 - range->default_flags = 0; 500 - range->pfn_flags_mask = -1UL; 529 + while (true) { 530 + if (time_after(jiffies, timeout)) 531 + return -EBUSY; 501 532 502 - ret = hmm_range_register(range, &svmm->mirror); 503 - if (ret) { 504 - up_read(&svmm->mm->mmap_sem); 505 - return (int)ret; 533 + range.notifier_seq = mmu_interval_read_begin(range.notifier); 534 + range.default_flags = 0; 535 + range.pfn_flags_mask = -1UL; 536 + down_read(&mm->mmap_sem); 537 + ret = hmm_range_fault(&range, 0); 538 + up_read(&mm->mmap_sem); 539 + if (ret <= 0) { 540 + if (ret == 0 || ret == -EBUSY) 541 + continue; 542 + return ret; 543 + } 544 + 545 + mutex_lock(&svmm->mutex); 546 + if (mmu_interval_read_retry(range.notifier, 547 + range.notifier_seq)) { 548 + mutex_unlock(&svmm->mutex); 549 + continue; 550 + } 551 + break; 506 552 } 507 553 508 - if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) { 509 - up_read(&svmm->mm->mmap_sem); 510 - return -EBUSY; 511 - } 554 + nouveau_dmem_convert_pfn(drm, &range); 512 555 513 - ret = hmm_range_fault(range, 0); 514 - if (ret <= 0) { 515 - if (ret == 0) 516 - ret = -EBUSY; 517 - up_read(&svmm->mm->mmap_sem); 518 - hmm_range_unregister(range); 519 - return ret; 520 - } 521 - return 0; 556 + svmm->vmm->vmm.object.client->super = true; 557 + ret = nvif_object_ioctl(&svmm->vmm->vmm.object, data, size, NULL); 558 + svmm->vmm->vmm.object.client->super = false; 559 + mutex_unlock(&svmm->mutex); 560 + 561 + return ret; 522 562 } 523 563 524 564 static int ··· 585 531 } i; 586 532 u64 phys[16]; 587 533 } args; 588 - struct 
hmm_range range; 589 534 struct vm_area_struct *vma; 590 535 u64 inst, start, limit; 591 536 int fi, fn, pi, fill; ··· 640 587 args.i.p.version = 0; 641 588 642 589 for (fi = 0; fn = fi + 1, fi < buffer->fault_nr; fi = fn) { 590 + struct svm_notifier notifier; 591 + struct mm_struct *mm; 592 + 643 593 /* Cancel any faults from non-SVM channels. */ 644 594 if (!(svmm = buffer->fault[fi]->svmm)) { 645 595 nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]); ··· 662 606 start = max_t(u64, start, svmm->unmanaged.limit); 663 607 SVMM_DBG(svmm, "wndw %016llx-%016llx", start, limit); 664 608 609 + mm = svmm->notifier.mm; 610 + if (!mmget_not_zero(mm)) { 611 + nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]); 612 + continue; 613 + } 614 + 665 615 /* Intersect fault window with the CPU VMA, cancelling 666 616 * the fault if the address is invalid. 667 617 */ 668 - down_read(&svmm->mm->mmap_sem); 669 - vma = find_vma_intersection(svmm->mm, start, limit); 618 + down_read(&mm->mmap_sem); 619 + vma = find_vma_intersection(mm, start, limit); 670 620 if (!vma) { 671 621 SVMM_ERR(svmm, "wndw %016llx-%016llx", start, limit); 672 - up_read(&svmm->mm->mmap_sem); 622 + up_read(&mm->mmap_sem); 623 + mmput(mm); 673 624 nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]); 674 625 continue; 675 626 } 676 627 start = max_t(u64, start, vma->vm_start); 677 628 limit = min_t(u64, limit, vma->vm_end); 629 + up_read(&mm->mmap_sem); 678 630 SVMM_DBG(svmm, "wndw %016llx-%016llx", start, limit); 679 631 680 632 if (buffer->fault[fi]->addr != start) { 681 633 SVMM_ERR(svmm, "addr %016llx", buffer->fault[fi]->addr); 682 - up_read(&svmm->mm->mmap_sem); 634 + mmput(mm); 683 635 nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]); 684 636 continue; 685 637 } ··· 743 679 args.i.p.addr, 744 680 args.i.p.addr + args.i.p.size, fn - fi); 745 681 746 - /* Have HMM fault pages within the fault window to the GPU. 
*/ 747 - range.start = args.i.p.addr; 748 - range.end = args.i.p.addr + args.i.p.size; 749 - range.pfns = args.phys; 750 - range.flags = nouveau_svm_pfn_flags; 751 - range.values = nouveau_svm_pfn_values; 752 - range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT; 753 - again: 754 - ret = nouveau_range_fault(svmm, &range); 755 - if (ret == 0) { 756 - mutex_lock(&svmm->mutex); 757 - if (!nouveau_range_done(&range)) { 758 - mutex_unlock(&svmm->mutex); 759 - goto again; 760 - } 761 - 762 - nouveau_dmem_convert_pfn(svm->drm, &range); 763 - 764 - svmm->vmm->vmm.object.client->super = true; 765 - ret = nvif_object_ioctl(&svmm->vmm->vmm.object, 766 - &args, sizeof(args.i) + 767 - pi * sizeof(args.phys[0]), 768 - NULL); 769 - svmm->vmm->vmm.object.client->super = false; 770 - mutex_unlock(&svmm->mutex); 771 - up_read(&svmm->mm->mmap_sem); 682 + notifier.svmm = svmm; 683 + ret = mmu_interval_notifier_insert(&notifier.notifier, 684 + svmm->notifier.mm, 685 + args.i.p.addr, args.i.p.size, 686 + &nouveau_svm_mni_ops); 687 + if (!ret) { 688 + ret = nouveau_range_fault( 689 + svmm, svm->drm, &args, 690 + sizeof(args.i) + pi * sizeof(args.phys[0]), 691 + args.phys, &notifier); 692 + mmu_interval_notifier_remove(&notifier.notifier); 772 693 } 694 + mmput(mm); 773 695 774 696 /* Cancel any faults in the window whose pages didn't manage 775 697 * to keep their valid bit, or stay writeable when required. ··· 764 714 */ 765 715 while (fi < fn) { 766 716 struct nouveau_svm_fault *fault = buffer->fault[fi++]; 767 - pi = (fault->addr - range.start) >> PAGE_SHIFT; 717 + pi = (fault->addr - args.i.p.addr) >> PAGE_SHIFT; 768 718 if (ret || 769 - !(range.pfns[pi] & NVIF_VMM_PFNMAP_V0_V) || 770 - (!(range.pfns[pi] & NVIF_VMM_PFNMAP_V0_W) && 719 + !(args.phys[pi] & NVIF_VMM_PFNMAP_V0_V) || 720 + (!(args.phys[pi] & NVIF_VMM_PFNMAP_V0_W) && 771 721 fault->access != 0 && fault->access != 3)) { 772 722 nouveau_svm_fault_cancel_fault(svm, fault); 773 723 continue;
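nouveau's `nouveau_svm_range_invalidate()` above must not sleep when the notifier range is non-blockable, so it takes `svmm->mutex` with `mutex_trylock()` in that case and reports failure to the caller rather than waiting. A single-threaded userspace sketch of that decision; the `locked` flag stands in for `svmm->mutex`, and the blockable path assumes the lock is free since the sketch cannot actually sleep:

```c
#include <assert.h>
#include <stdbool.h>

struct svmm_sim {
    bool locked;       /* stands in for svmm->mutex */
    unsigned long seq; /* stands in for the notifier's invalidate_seq */
};

/* Shaped like nouveau_svm_range_invalidate(): a blockable invalidation
 * may wait for the mutex, a non-blockable one must trylock and return
 * false so the mmu_notifier core can retry later. */
static bool invalidate_cb(struct svmm_sim *s, unsigned long cur_seq,
                          bool blockable)
{
    if (blockable) {
        /* mutex_lock(): may sleep in the kernel; this single-threaded
         * sketch assumes the lock is free on the blockable path */
        s->locked = true;
    } else if (s->locked) {
        return false;  /* mutex_trylock() failed: invalidation deferred */
    } else {
        s->locked = true;
    }

    s->seq = cur_seq;  /* mmu_interval_set_seq() */
    s->locked = false; /* mutex_unlock() */
    return true;
}
```

Holding the same mutex on the fault side while programming the HW is what makes the sequence update and the PTE invalidation mutually exclusive, which is the "very hacky" serialization the comment in the hunk above describes.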
+7 -2
drivers/gpu/drm/radeon/radeon.h
··· 68 68 #include <linux/hashtable.h> 69 69 #include <linux/dma-fence.h> 70 70 71 + #ifdef CONFIG_MMU_NOTIFIER 72 + #include <linux/mmu_notifier.h> 73 + #endif 74 + 71 75 #include <drm/ttm/ttm_bo_api.h> 72 76 #include <drm/ttm/ttm_bo_driver.h> 73 77 #include <drm/ttm/ttm_placement.h> ··· 513 509 struct ttm_bo_kmap_obj dma_buf_vmap; 514 510 pid_t pid; 515 511 516 - struct radeon_mn *mn; 517 - struct list_head mn_list; 512 + #ifdef CONFIG_MMU_NOTIFIER 513 + struct mmu_interval_notifier notifier; 514 + #endif 518 515 }; 519 516 #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base) 520 517
+44 -174
drivers/gpu/drm/radeon/radeon_mn.c
··· 36 36 37 37 #include "radeon.h" 38 38 39 - struct radeon_mn { 40 - struct mmu_notifier mn; 41 - 42 - /* objects protected by lock */ 43 - struct mutex lock; 44 - struct rb_root_cached objects; 45 - }; 46 - 47 - struct radeon_mn_node { 48 - struct interval_tree_node it; 49 - struct list_head bos; 50 - }; 51 - 52 39 /** 53 - * radeon_mn_invalidate_range_start - callback to notify about mm change 40 + * radeon_mn_invalidate - callback to notify about mm change 54 41 * 55 42 * @mn: our notifier 56 - * @mn: the mm this callback is about 57 - * @start: start of updated range 58 - * @end: end of updated range 43 + * @range: the VMA under invalidation 59 44 * 60 45 * We block for all BOs between start and end to be idle and 61 46 * unmap them by move them into system domain again. 62 47 */ 63 - static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn, 64 - const struct mmu_notifier_range *range) 48 + static bool radeon_mn_invalidate(struct mmu_interval_notifier *mn, 49 + const struct mmu_notifier_range *range, 50 + unsigned long cur_seq) 65 51 { 66 - struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn); 52 + struct radeon_bo *bo = container_of(mn, struct radeon_bo, notifier); 67 53 struct ttm_operation_ctx ctx = { false, false }; 68 - struct interval_tree_node *it; 69 - unsigned long end; 70 - int ret = 0; 54 + long r; 71 55 72 - /* notification is exclusive, but interval is inclusive */ 73 - end = range->end - 1; 56 + if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound) 57 + return true; 74 58 75 - /* TODO we should be able to split locking for interval tree and 76 - * the tear down. 
77 - */ 78 - if (mmu_notifier_range_blockable(range)) 79 - mutex_lock(&rmn->lock); 80 - else if (!mutex_trylock(&rmn->lock)) 81 - return -EAGAIN; 59 + if (!mmu_notifier_range_blockable(range)) 60 + return false; 82 61 83 - it = interval_tree_iter_first(&rmn->objects, range->start, end); 84 - while (it) { 85 - struct radeon_mn_node *node; 86 - struct radeon_bo *bo; 87 - long r; 88 - 89 - if (!mmu_notifier_range_blockable(range)) { 90 - ret = -EAGAIN; 91 - goto out_unlock; 92 - } 93 - 94 - node = container_of(it, struct radeon_mn_node, it); 95 - it = interval_tree_iter_next(it, range->start, end); 96 - 97 - list_for_each_entry(bo, &node->bos, mn_list) { 98 - 99 - if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound) 100 - continue; 101 - 102 - r = radeon_bo_reserve(bo, true); 103 - if (r) { 104 - DRM_ERROR("(%ld) failed to reserve user bo\n", r); 105 - continue; 106 - } 107 - 108 - r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, 109 - true, false, MAX_SCHEDULE_TIMEOUT); 110 - if (r <= 0) 111 - DRM_ERROR("(%ld) failed to wait for user bo\n", r); 112 - 113 - radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU); 114 - r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); 115 - if (r) 116 - DRM_ERROR("(%ld) failed to validate user bo\n", r); 117 - 118 - radeon_bo_unreserve(bo); 119 - } 62 + r = radeon_bo_reserve(bo, true); 63 + if (r) { 64 + DRM_ERROR("(%ld) failed to reserve user bo\n", r); 65 + return true; 120 66 } 121 - 122 - out_unlock: 123 - mutex_unlock(&rmn->lock); 124 67 125 - return ret; 68 + r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false, 69 + MAX_SCHEDULE_TIMEOUT); 70 + if (r <= 0) 71 + DRM_ERROR("(%ld) failed to wait for user bo\n", r); 72 + 73 + radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU); 74 + r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); 75 + if (r) 76 + DRM_ERROR("(%ld) failed to validate user bo\n", r); 77 + 78 + radeon_bo_unreserve(bo); 79 + return true; 126 80 } 127 81 128 - static void 
radeon_mn_release(struct mmu_notifier *mn, struct mm_struct *mm) 129 - { 130 - struct mmu_notifier_range range = { 131 - .mm = mm, 132 - .start = 0, 133 - .end = ULONG_MAX, 134 - .flags = 0, 135 - .event = MMU_NOTIFY_UNMAP, 136 - }; 137 - 138 - radeon_mn_invalidate_range_start(mn, &range); 139 - } 140 - 141 - static struct mmu_notifier *radeon_mn_alloc_notifier(struct mm_struct *mm) 142 - { 143 - struct radeon_mn *rmn; 144 - 145 - rmn = kzalloc(sizeof(*rmn), GFP_KERNEL); 146 - if (!rmn) 147 - return ERR_PTR(-ENOMEM); 148 - 149 - mutex_init(&rmn->lock); 150 - rmn->objects = RB_ROOT_CACHED; 151 - return &rmn->mn; 152 - } 153 - 154 - static void radeon_mn_free_notifier(struct mmu_notifier *mn) 155 - { 156 - kfree(container_of(mn, struct radeon_mn, mn)); 157 - } 158 - 159 - static const struct mmu_notifier_ops radeon_mn_ops = { 160 - .release = radeon_mn_release, 161 - .invalidate_range_start = radeon_mn_invalidate_range_start, 162 - .alloc_notifier = radeon_mn_alloc_notifier, 163 - .free_notifier = radeon_mn_free_notifier, 82 + static const struct mmu_interval_notifier_ops radeon_mn_ops = { 83 + .invalidate = radeon_mn_invalidate, 164 84 }; 165 85 166 86 /** ··· 94 174 */ 95 175 int radeon_mn_register(struct radeon_bo *bo, unsigned long addr) 96 176 { 97 - unsigned long end = addr + radeon_bo_size(bo) - 1; 98 - struct mmu_notifier *mn; 99 - struct radeon_mn *rmn; 100 - struct radeon_mn_node *node = NULL; 101 - struct list_head bos; 102 - struct interval_tree_node *it; 177 + int ret; 103 178 104 - mn = mmu_notifier_get(&radeon_mn_ops, current->mm); 105 - if (IS_ERR(mn)) 106 - return PTR_ERR(mn); 107 - rmn = container_of(mn, struct radeon_mn, mn); 179 + ret = mmu_interval_notifier_insert(&bo->notifier, current->mm, addr, 180 + radeon_bo_size(bo), &radeon_mn_ops); 181 + if (ret) 182 + return ret; 108 183 109 - INIT_LIST_HEAD(&bos); 110 - 111 - mutex_lock(&rmn->lock); 112 - 113 - while ((it = interval_tree_iter_first(&rmn->objects, addr, end))) { 114 - kfree(node); 115 - 
node = container_of(it, struct radeon_mn_node, it); 116 - interval_tree_remove(&node->it, &rmn->objects); 117 - addr = min(it->start, addr); 118 - end = max(it->last, end); 119 - list_splice(&node->bos, &bos); 120 - } 121 - 122 - if (!node) { 123 - node = kmalloc(sizeof(struct radeon_mn_node), GFP_KERNEL); 124 - if (!node) { 125 - mutex_unlock(&rmn->lock); 126 - return -ENOMEM; 127 - } 128 - } 129 - 130 - bo->mn = rmn; 131 - 132 - node->it.start = addr; 133 - node->it.last = end; 134 - INIT_LIST_HEAD(&node->bos); 135 - list_splice(&bos, &node->bos); 136 - list_add(&bo->mn_list, &node->bos); 137 - 138 - interval_tree_insert(&node->it, &rmn->objects); 139 - 140 - mutex_unlock(&rmn->lock); 141 - 184 + /* 185 + * FIXME: radeon appears to allow get_user_pages to run during 186 + * invalidate_range_start/end, which is not a safe way to read the 187 + * PTEs. It should use the mmu_interval_read_begin() scheme around the 188 + * get_user_pages to ensure that the PTEs are read properly 189 + */ 190 + mmu_interval_read_begin(&bo->notifier); 142 191 return 0; 143 192 } 144 193 ··· 120 231 */ 121 232 void radeon_mn_unregister(struct radeon_bo *bo) 122 233 { 123 - struct radeon_mn *rmn = bo->mn; 124 - struct list_head *head; 125 - 126 - if (!rmn) 234 + if (!bo->notifier.mm) 127 235 return; 128 - 129 - mutex_lock(&rmn->lock); 130 - /* save the next list entry for later */ 131 - head = bo->mn_list.next; 132 - 133 - list_del(&bo->mn_list); 134 - 135 - if (list_empty(head)) { 136 - struct radeon_mn_node *node; 137 - node = container_of(head, struct radeon_mn_node, bos); 138 - interval_tree_remove(&node->it, &rmn->objects); 139 - kfree(node); 140 - } 141 - 142 - mutex_unlock(&rmn->lock); 143 - 144 - mmu_notifier_put(&rmn->mn); 145 - bo->mn = NULL; 236 + mmu_interval_notifier_remove(&bo->notifier); 237 + bo->notifier.mm = NULL; 146 238 }
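A detail running through several of these conversions: interval-tree nodes store an inclusive `last` endpoint, while mmu_notifier ranges use an exclusive `end` — hence `interval_tree.last + 1` in the amdgpu and nouveau hunks, and `end = range->end - 1` ("notification is exclusive, but interval is inclusive") in the old radeon code deleted above. A small sketch of the two conventions, with illustrative helper names:

```c
#include <assert.h>
#include <stdbool.h>

struct interval {
    unsigned long start;
    unsigned long last; /* inclusive, interval-tree convention */
};

static struct interval make_interval(unsigned long addr, unsigned long size)
{
    struct interval it = { addr, addr + size - 1 };
    return it;
}

/* Exclusive end as hmm_range wants it: interval_tree.last + 1 */
static unsigned long interval_end(const struct interval *it)
{
    return it->last + 1;
}

/* Overlap test against an exclusive-end notifier range, mirroring the
 * "end = range->end - 1" conversion in the removed radeon callback */
static bool interval_overlaps(const struct interval *it,
                              unsigned long range_start,
                              unsigned long range_end /* exclusive */)
{
    return it->start < range_end && range_start <= it->last;
}
```

The inclusive `last` avoids an overflow for ranges ending at `ULONG_MAX`, which is why the tree stores it that way and callers convert at the boundary.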
-1
drivers/infiniband/core/device.c
··· 2634 2634 SET_DEVICE_OP(dev_ops, get_vf_guid); 2635 2635 SET_DEVICE_OP(dev_ops, get_vf_stats); 2636 2636 SET_DEVICE_OP(dev_ops, init_port); 2637 - SET_DEVICE_OP(dev_ops, invalidate_range); 2638 2637 SET_DEVICE_OP(dev_ops, iw_accept); 2639 2638 SET_DEVICE_OP(dev_ops, iw_add_ref); 2640 2639 SET_DEVICE_OP(dev_ops, iw_connect);
+41 -262
drivers/infiniband/core/umem_odp.c
··· 48 48 49 49 #include "uverbs.h" 50 50 51 - static void ib_umem_notifier_start_account(struct ib_umem_odp *umem_odp) 51 + static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp, 52 + const struct mmu_interval_notifier_ops *ops) 52 53 { 53 - mutex_lock(&umem_odp->umem_mutex); 54 - if (umem_odp->notifiers_count++ == 0) 55 - /* 56 - * Initialize the completion object for waiting on 57 - * notifiers. Since notifier_count is zero, no one should be 58 - * waiting right now. 59 - */ 60 - reinit_completion(&umem_odp->notifier_completion); 61 - mutex_unlock(&umem_odp->umem_mutex); 62 - } 63 - 64 - static void ib_umem_notifier_end_account(struct ib_umem_odp *umem_odp) 65 - { 66 - mutex_lock(&umem_odp->umem_mutex); 67 - /* 68 - * This sequence increase will notify the QP page fault that the page 69 - * that is going to be mapped in the spte could have been freed. 70 - */ 71 - ++umem_odp->notifiers_seq; 72 - if (--umem_odp->notifiers_count == 0) 73 - complete_all(&umem_odp->notifier_completion); 74 - mutex_unlock(&umem_odp->umem_mutex); 75 - } 76 - 77 - static void ib_umem_notifier_release(struct mmu_notifier *mn, 78 - struct mm_struct *mm) 79 - { 80 - struct ib_ucontext_per_mm *per_mm = 81 - container_of(mn, struct ib_ucontext_per_mm, mn); 82 - struct rb_node *node; 83 - 84 - down_read(&per_mm->umem_rwsem); 85 - if (!per_mm->mn.users) 86 - goto out; 87 - 88 - for (node = rb_first_cached(&per_mm->umem_tree); node; 89 - node = rb_next(node)) { 90 - struct ib_umem_odp *umem_odp = 91 - rb_entry(node, struct ib_umem_odp, interval_tree.rb); 92 - 93 - /* 94 - * Increase the number of notifiers running, to prevent any 95 - * further fault handling on this MR. 
96 - */ 97 - ib_umem_notifier_start_account(umem_odp); 98 - complete_all(&umem_odp->notifier_completion); 99 - umem_odp->umem.ibdev->ops.invalidate_range( 100 - umem_odp, ib_umem_start(umem_odp), 101 - ib_umem_end(umem_odp)); 102 - } 103 - 104 - out: 105 - up_read(&per_mm->umem_rwsem); 106 - } 107 - 108 - static int invalidate_range_start_trampoline(struct ib_umem_odp *item, 109 - u64 start, u64 end, void *cookie) 110 - { 111 - ib_umem_notifier_start_account(item); 112 - item->umem.ibdev->ops.invalidate_range(item, start, end); 113 - return 0; 114 - } 115 - 116 - static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn, 117 - const struct mmu_notifier_range *range) 118 - { 119 - struct ib_ucontext_per_mm *per_mm = 120 - container_of(mn, struct ib_ucontext_per_mm, mn); 121 - int rc; 122 - 123 - if (mmu_notifier_range_blockable(range)) 124 - down_read(&per_mm->umem_rwsem); 125 - else if (!down_read_trylock(&per_mm->umem_rwsem)) 126 - return -EAGAIN; 127 - 128 - if (!per_mm->mn.users) { 129 - up_read(&per_mm->umem_rwsem); 130 - /* 131 - * At this point users is permanently zero and visible to this 132 - * CPU without a lock, that fact is relied on to skip the unlock 133 - * in range_end. 
134 - */ 135 - return 0; 136 - } 137 - 138 - rc = rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start, 139 - range->end, 140 - invalidate_range_start_trampoline, 141 - mmu_notifier_range_blockable(range), 142 - NULL); 143 - if (rc) 144 - up_read(&per_mm->umem_rwsem); 145 - return rc; 146 - } 147 - 148 - static int invalidate_range_end_trampoline(struct ib_umem_odp *item, u64 start, 149 - u64 end, void *cookie) 150 - { 151 - ib_umem_notifier_end_account(item); 152 - return 0; 153 - } 154 - 155 - static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn, 156 - const struct mmu_notifier_range *range) 157 - { 158 - struct ib_ucontext_per_mm *per_mm = 159 - container_of(mn, struct ib_ucontext_per_mm, mn); 160 - 161 - if (unlikely(!per_mm->mn.users)) 162 - return; 163 - 164 - rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start, 165 - range->end, 166 - invalidate_range_end_trampoline, true, NULL); 167 - up_read(&per_mm->umem_rwsem); 168 - } 169 - 170 - static struct mmu_notifier *ib_umem_alloc_notifier(struct mm_struct *mm) 171 - { 172 - struct ib_ucontext_per_mm *per_mm; 173 - 174 - per_mm = kzalloc(sizeof(*per_mm), GFP_KERNEL); 175 - if (!per_mm) 176 - return ERR_PTR(-ENOMEM); 177 - 178 - per_mm->umem_tree = RB_ROOT_CACHED; 179 - init_rwsem(&per_mm->umem_rwsem); 180 - 181 - WARN_ON(mm != current->mm); 182 - rcu_read_lock(); 183 - per_mm->tgid = get_task_pid(current->group_leader, PIDTYPE_PID); 184 - rcu_read_unlock(); 185 - return &per_mm->mn; 186 - } 187 - 188 - static void ib_umem_free_notifier(struct mmu_notifier *mn) 189 - { 190 - struct ib_ucontext_per_mm *per_mm = 191 - container_of(mn, struct ib_ucontext_per_mm, mn); 192 - 193 - WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root)); 194 - 195 - put_pid(per_mm->tgid); 196 - kfree(per_mm); 197 - } 198 - 199 - static const struct mmu_notifier_ops ib_umem_notifiers = { 200 - .release = ib_umem_notifier_release, 201 - .invalidate_range_start = 
ib_umem_notifier_invalidate_range_start, 202 - .invalidate_range_end = ib_umem_notifier_invalidate_range_end, 203 - .alloc_notifier = ib_umem_alloc_notifier, 204 - .free_notifier = ib_umem_free_notifier, 205 - }; 206 - 207 - static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp) 208 - { 209 - struct ib_ucontext_per_mm *per_mm; 210 - struct mmu_notifier *mn; 211 54 int ret; 212 55 213 56 umem_odp->umem.is_odp = 1; 57 + mutex_init(&umem_odp->umem_mutex); 58 + 214 59 if (!umem_odp->is_implicit_odp) { 215 60 size_t page_size = 1UL << umem_odp->page_shift; 61 + unsigned long start; 62 + unsigned long end; 216 63 size_t pages; 217 64 218 - umem_odp->interval_tree.start = 219 - ALIGN_DOWN(umem_odp->umem.address, page_size); 65 + start = ALIGN_DOWN(umem_odp->umem.address, page_size); 220 66 if (check_add_overflow(umem_odp->umem.address, 221 67 (unsigned long)umem_odp->umem.length, 222 - &umem_odp->interval_tree.last)) 68 + &end)) 223 69 return -EOVERFLOW; 224 - umem_odp->interval_tree.last = 225 - ALIGN(umem_odp->interval_tree.last, page_size); 226 - if (unlikely(umem_odp->interval_tree.last < page_size)) 70 + end = ALIGN(end, page_size); 71 + if (unlikely(end < page_size)) 227 72 return -EOVERFLOW; 228 73 229 - pages = (umem_odp->interval_tree.last - 230 - umem_odp->interval_tree.start) >> 231 - umem_odp->page_shift; 74 + pages = (end - start) >> umem_odp->page_shift; 232 75 if (!pages) 233 76 return -EINVAL; 234 - 235 - /* 236 - * Note that the representation of the intervals in the 237 - * interval tree considers the ending point as contained in 238 - * the interval. 
-  */
- 	umem_odp->interval_tree.last--;
 
 	umem_odp->page_list = kvcalloc(
 		pages, sizeof(*umem_odp->page_list), GFP_KERNEL);
···
 			ret = -ENOMEM;
 			goto out_page_list;
 		}
- 	}
 
- 	mn = mmu_notifier_get(&ib_umem_notifiers, umem_odp->umem.owning_mm);
- 	if (IS_ERR(mn)) {
- 		ret = PTR_ERR(mn);
- 		goto out_dma_list;
+ 		ret = mmu_interval_notifier_insert(&umem_odp->notifier,
+ 						   umem_odp->umem.owning_mm,
+ 						   start, end - start, ops);
+ 		if (ret)
+ 			goto out_dma_list;
 	}
- 	umem_odp->per_mm = per_mm =
- 		container_of(mn, struct ib_ucontext_per_mm, mn);
-
- 	mutex_init(&umem_odp->umem_mutex);
- 	init_completion(&umem_odp->notifier_completion);
-
- 	if (!umem_odp->is_implicit_odp) {
- 		down_write(&per_mm->umem_rwsem);
- 		interval_tree_insert(&umem_odp->interval_tree,
- 				     &per_mm->umem_tree);
- 		up_write(&per_mm->umem_rwsem);
- 	}
- 	mmgrab(umem_odp->umem.owning_mm);
 
 	return 0;
···
 	if (!context)
 		return ERR_PTR(-EIO);
- 	if (WARN_ON_ONCE(!context->device->ops.invalidate_range))
- 		return ERR_PTR(-EINVAL);
 
 	umem_odp = kzalloc(sizeof(*umem_odp), GFP_KERNEL);
 	if (!umem_odp)
···
 	umem_odp->is_implicit_odp = 1;
 	umem_odp->page_shift = PAGE_SHIFT;
 
- 	ret = ib_init_umem_odp(umem_odp);
+ 	umem_odp->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
+ 	ret = ib_init_umem_odp(umem_odp, NULL);
 	if (ret) {
+ 		put_pid(umem_odp->tgid);
 		kfree(umem_odp);
 		return ERR_PTR(ret);
 	}
···
  * @addr: The starting userspace VA
  * @size: The length of the userspace VA
  */
- struct ib_umem_odp *ib_umem_odp_alloc_child(struct ib_umem_odp *root,
- 					    unsigned long addr, size_t size)
+ struct ib_umem_odp *
+ ib_umem_odp_alloc_child(struct ib_umem_odp *root, unsigned long addr,
+ 			size_t size,
+ 			const struct mmu_interval_notifier_ops *ops)
 {
 	/*
 	 * Caller must ensure that root cannot be freed during the call to
···
 	umem->writable = root->umem.writable;
 	umem->owning_mm = root->umem.owning_mm;
 	odp_data->page_shift = PAGE_SHIFT;
+ 	odp_data->notifier.ops = ops;
 
- 	ret = ib_init_umem_odp(odp_data);
+ 	odp_data->tgid = get_pid(root->tgid);
+ 	ret = ib_init_umem_odp(odp_data, ops);
 	if (ret) {
+ 		put_pid(odp_data->tgid);
 		kfree(odp_data);
 		return ERR_PTR(ret);
 	}
···
  * conjunction with MMU notifiers.
  */
 struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata, unsigned long addr,
- 				    size_t size, int access)
+ 				    size_t size, int access,
+ 				    const struct mmu_interval_notifier_ops *ops)
 {
 	struct ib_umem_odp *umem_odp;
 	struct ib_ucontext *context;
···
 	if (!context)
 		return ERR_PTR(-EIO);
 
- 	if (WARN_ON_ONCE(!(access & IB_ACCESS_ON_DEMAND)) ||
- 	    WARN_ON_ONCE(!context->device->ops.invalidate_range))
+ 	if (WARN_ON_ONCE(!(access & IB_ACCESS_ON_DEMAND)))
 		return ERR_PTR(-EINVAL);
 
 	umem_odp = kzalloc(sizeof(struct ib_umem_odp), GFP_KERNEL);
···
 	umem_odp->umem.address = addr;
 	umem_odp->umem.writable = ib_access_writable(access);
 	umem_odp->umem.owning_mm = mm = current->mm;
+ 	umem_odp->notifier.ops = ops;
 
 	umem_odp->page_shift = PAGE_SHIFT;
 	if (access & IB_ACCESS_HUGETLB) {
···
 		up_read(&mm->mmap_sem);
 	}
 
- 	ret = ib_init_umem_odp(umem_odp);
+ 	umem_odp->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
+ 	ret = ib_init_umem_odp(umem_odp, ops);
 	if (ret)
- 		goto err_free;
+ 		goto err_put_pid;
 	return umem_odp;
 
+ err_put_pid:
+ 	put_pid(umem_odp->tgid);
 err_free:
 	kfree(umem_odp);
 	return ERR_PTR(ret);
···
 void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 {
- 	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
-
 	/*
 	 * Ensure that no more pages are mapped in the umem.
 	 *
···
 		ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp),
 					    ib_umem_end(umem_odp));
 		mutex_unlock(&umem_odp->umem_mutex);
+ 		mmu_interval_notifier_remove(&umem_odp->notifier);
 		kvfree(umem_odp->dma_list);
 		kvfree(umem_odp->page_list);
+ 		put_pid(umem_odp->tgid);
 	}
-
- 	down_write(&per_mm->umem_rwsem);
- 	if (!umem_odp->is_implicit_odp) {
- 		interval_tree_remove(&umem_odp->interval_tree,
- 				     &per_mm->umem_tree);
- 		complete_all(&umem_odp->notifier_completion);
- 	}
- 	/*
- 	 * NOTE! mmu_notifier_unregister() can happen between a start/end
- 	 * callback, resulting in a missing end, and thus an unbalanced
- 	 * lock. This doesn't really matter to us since we are about to kfree
- 	 * the memory that holds the lock, however LOCKDEP doesn't like this.
- 	 * Thus we call the mmu_notifier_put under the rwsem and test the
- 	 * internal users count to reliably see if we are past this point.
- 	 */
- 	mmu_notifier_put(&per_mm->mn);
- 	up_write(&per_mm->umem_rwsem);
-
- 	mmdrop(umem_odp->umem.owning_mm);
 	kfree(umem_odp);
 }
 EXPORT_SYMBOL(ib_umem_odp_release);
···
  */
 static int ib_umem_odp_map_dma_single_page(
 		struct ib_umem_odp *umem_odp,
- 		int page_index,
+ 		unsigned int page_index,
 		struct page *page,
 		u64 access_mask,
 		unsigned long current_seq)
···
 	dma_addr_t dma_addr;
 	int ret = 0;
 
- 	/*
- 	 * Note: we avoid writing if seq is different from the initial seq, to
- 	 * handle case of a racing notifier. This check also allows us to bail
- 	 * early if we have a notifier running in parallel with us.
- 	 */
- 	if (ib_umem_mmu_notifier_retry(umem_odp, current_seq)) {
+ 	if (mmu_interval_check_retry(&umem_odp->notifier, current_seq)) {
 		ret = -EAGAIN;
 		goto out;
 	}
···
 	 * existing beyond the lifetime of the originating process.. Presumably
 	 * mmget_not_zero will fail in this case.
 	 */
- 	owning_process = get_pid_task(umem_odp->per_mm->tgid, PIDTYPE_PID);
+ 	owning_process = get_pid_task(umem_odp->tgid, PIDTYPE_PID);
 	if (!owning_process || !mmget_not_zero(owning_mm)) {
 		ret = -EINVAL;
 		goto out_put_task;
···
 	}
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
-
- /* @last is not a part of the interval. See comment for function
-  * node_last.
-  */
- int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
- 				  u64 start, u64 last,
- 				  umem_call_back cb,
- 				  bool blockable,
- 				  void *cookie)
- {
- 	int ret_val = 0;
- 	struct interval_tree_node *node, *next;
- 	struct ib_umem_odp *umem;
-
- 	if (unlikely(start == last))
- 		return ret_val;
-
- 	for (node = interval_tree_iter_first(root, start, last - 1);
- 	     node; node = next) {
- 		/* TODO move the blockable decision up to the callback */
- 		if (!blockable)
- 			return -EAGAIN;
- 		next = interval_tree_iter_next(node, start, last - 1);
- 		umem = container_of(node, struct ib_umem_odp, interval_tree);
- 		ret_val = cb(umem, start, last, cookie) || ret_val;
- 	}
-
- 	return ret_val;
- }
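The deleted `rbt_ib_umem_for_each_in_range()` queried the interval tree with `last - 1` (and the init path decremented `->last`) because the kernel's interval tree stores *closed* intervals [start, last], while the umem ranges are half-open [start, end). A userspace model of that closed-interval overlap test (the `struct node` and `overlaps` names here are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Closed-interval node as in the kernel's interval tree:
 * covers [start, last] inclusive of both endpoints. */
struct node {
	unsigned long start;
	unsigned long last;
};

/* Two closed intervals overlap iff neither lies entirely
 * before the other: a->start <= l && s <= a->last. */
static bool overlaps(const struct node *a, unsigned long s, unsigned long l)
{
	return a->start <= l && s <= a->last;
}
```

This is why a half-open caller range [start, end) must be converted to the closed range [start, end - 1] before querying, exactly the `- 1` the removed code applied.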
+1 -1
drivers/infiniband/hw/hfi1/file_ops.c
···
 		HFI1_CAP_UGET_MASK(uctxt->flags, MASK) |
 		HFI1_CAP_KGET_MASK(uctxt->flags, K2U);
 	/* adjust flag if this fd is not able to cache */
- 	if (!fd->handler)
+ 	if (!fd->use_mn)
 		cinfo.runtime_flags |= HFI1_CAP_TID_UNMAP; /* no caching */
 
 	cinfo.num_active = hfi1_count_active_units();
+1 -1
drivers/infiniband/hw/hfi1/hfi.h
···
 	/* for cpu affinity; -1 if none */
 	int rec_cpu_num;
 	u32 tid_n_pinned;
- 	struct mmu_rb_handler *handler;
+ 	bool use_mn;
 	struct tid_rb_node **entry_to_rb;
 	spinlock_t tid_lock; /* protect tid_[limit,used] counters */
 	u32 tid_limit;
+56 -90
drivers/infiniband/hw/hfi1/user_exp_rcv.c
···
 			      struct tid_user_buf *tbuf,
 			      u32 rcventry, struct tid_group *grp,
 			      u16 pageidx, unsigned int npages);
- static int tid_rb_insert(void *arg, struct mmu_rb_node *node);
 static void cacheless_tid_rb_remove(struct hfi1_filedata *fdata,
 				    struct tid_rb_node *tnode);
- static void tid_rb_remove(void *arg, struct mmu_rb_node *node);
- static int tid_rb_invalidate(void *arg, struct mmu_rb_node *mnode);
+ static bool tid_rb_invalidate(struct mmu_interval_notifier *mni,
+ 			      const struct mmu_notifier_range *range,
+ 			      unsigned long cur_seq);
 static int program_rcvarray(struct hfi1_filedata *fd, struct tid_user_buf *,
 			    struct tid_group *grp,
 			    unsigned int start, u16 count,
···
 			    struct tid_group **grp);
 static void clear_tid_node(struct hfi1_filedata *fd, struct tid_rb_node *node);
 
- static struct mmu_rb_ops tid_rb_ops = {
- 	.insert = tid_rb_insert,
- 	.remove = tid_rb_remove,
- 	.invalidate = tid_rb_invalidate
+ static const struct mmu_interval_notifier_ops tid_mn_ops = {
+ 	.invalidate = tid_rb_invalidate,
 };
 
 /*
···
 int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
 			   struct hfi1_ctxtdata *uctxt)
 {
- 	struct hfi1_devdata *dd = uctxt->dd;
 	int ret = 0;
 
 	spin_lock_init(&fd->tid_lock);
···
 		fd->entry_to_rb = NULL;
 		return -ENOMEM;
 	}
-
- 	/*
- 	 * Register MMU notifier callbacks. If the registration
- 	 * fails, continue without TID caching for this context.
- 	 */
- 	ret = hfi1_mmu_rb_register(fd, fd->mm, &tid_rb_ops,
- 				   dd->pport->hfi1_wq,
- 				   &fd->handler);
- 	if (ret) {
- 		dd_dev_info(dd,
- 			    "Failed MMU notifier registration %d\n",
- 			    ret);
- 		ret = 0;
- 	}
+ 	fd->use_mn = true;
 	}
 
 	/*
···
 	 * init.
 	 */
 	spin_lock(&fd->tid_lock);
- 	if (uctxt->subctxt_cnt && fd->handler) {
+ 	if (uctxt->subctxt_cnt && fd->use_mn) {
 		u16 remainder;
 
 		fd->tid_limit = uctxt->expected_count / uctxt->subctxt_cnt;
···
 {
 	struct hfi1_ctxtdata *uctxt = fd->uctxt;
 
- 	/*
- 	 * The notifier would have been removed when the process'es mm
- 	 * was freed.
- 	 */
- 	if (fd->handler) {
- 		hfi1_mmu_rb_unregister(fd->handler);
- 	} else {
- 		if (!EXP_TID_SET_EMPTY(uctxt->tid_full_list))
- 			unlock_exp_tids(uctxt, &uctxt->tid_full_list, fd);
- 		if (!EXP_TID_SET_EMPTY(uctxt->tid_used_list))
- 			unlock_exp_tids(uctxt, &uctxt->tid_used_list, fd);
- 	}
+ 	if (!EXP_TID_SET_EMPTY(uctxt->tid_full_list))
+ 		unlock_exp_tids(uctxt, &uctxt->tid_full_list, fd);
+ 	if (!EXP_TID_SET_EMPTY(uctxt->tid_used_list))
+ 		unlock_exp_tids(uctxt, &uctxt->tid_used_list, fd);
 
 	kfree(fd->invalid_tids);
 	fd->invalid_tids = NULL;
···
 	if (mapped) {
 		pci_unmap_single(dd->pcidev, node->dma_addr,
- 				 node->mmu.len, PCI_DMA_FROMDEVICE);
+ 				 node->npages * PAGE_SIZE, PCI_DMA_FROMDEVICE);
 		pages = &node->pages[idx];
 	} else {
 		pages = &tidbuf->pages[idx];
···
 		return -EFAULT;
 	}
 
- 	node->mmu.addr = tbuf->vaddr + (pageidx * PAGE_SIZE);
- 	node->mmu.len = npages * PAGE_SIZE;
+ 	node->fdata = fd;
 	node->phys = page_to_phys(pages[0]);
 	node->npages = npages;
 	node->rcventry = rcventry;
···
 	node->freed = false;
 	memcpy(node->pages, pages, sizeof(struct page *) * npages);
 
- 	if (!fd->handler)
- 		ret = tid_rb_insert(fd, &node->mmu);
- 	else
- 		ret = hfi1_mmu_rb_insert(fd->handler, &node->mmu);
-
- 	if (ret) {
- 		hfi1_cdbg(TID, "Failed to insert RB node %u 0x%lx, 0x%lx %d",
- 			  node->rcventry, node->mmu.addr, node->phys, ret);
- 		pci_unmap_single(dd->pcidev, phys, npages * PAGE_SIZE,
- 				 PCI_DMA_FROMDEVICE);
- 		kfree(node);
- 		return -EFAULT;
+ 	if (fd->use_mn) {
+ 		ret = mmu_interval_notifier_insert(
+ 			&node->notifier, fd->mm,
+ 			tbuf->vaddr + (pageidx * PAGE_SIZE), npages * PAGE_SIZE,
+ 			&tid_mn_ops);
+ 		if (ret)
+ 			goto out_unmap;
+ 		/*
+ 		 * FIXME: This is in the wrong order, the notifier should be
+ 		 * established before the pages are pinned by pin_rcv_pages.
+ 		 */
+ 		mmu_interval_read_begin(&node->notifier);
 	}
+ 	fd->entry_to_rb[node->rcventry - uctxt->expected_base] = node;
+
 	hfi1_put_tid(dd, rcventry, PT_EXPECTED, phys, ilog2(npages) + 1);
 	trace_hfi1_exp_tid_reg(uctxt->ctxt, fd->subctxt, rcventry, npages,
- 			       node->mmu.addr, node->phys, phys);
+ 			       node->notifier.interval_tree.start, node->phys,
+ 			       phys);
 	return 0;
+
+ out_unmap:
+ 	hfi1_cdbg(TID, "Failed to insert RB node %u 0x%lx, 0x%lx %d",
+ 		  node->rcventry, node->notifier.interval_tree.start,
+ 		  node->phys, ret);
+ 	pci_unmap_single(dd->pcidev, phys, npages * PAGE_SIZE,
+ 			 PCI_DMA_FROMDEVICE);
+ 	kfree(node);
+ 	return -EFAULT;
 }
 
 static int unprogram_rcvarray(struct hfi1_filedata *fd, u32 tidinfo,
···
 	if (grp)
 		*grp = node->grp;
 
- 	if (!fd->handler)
- 		cacheless_tid_rb_remove(fd, node);
- 	else
- 		hfi1_mmu_rb_remove(fd->handler, &node->mmu);
+ 	if (fd->use_mn)
+ 		mmu_interval_notifier_remove(&node->notifier);
+ 	cacheless_tid_rb_remove(fd, node);
 
 	return 0;
 }
···
 	struct hfi1_devdata *dd = uctxt->dd;
 
 	trace_hfi1_exp_tid_unreg(uctxt->ctxt, fd->subctxt, node->rcventry,
- 				 node->npages, node->mmu.addr, node->phys,
+ 				 node->npages,
+ 				 node->notifier.interval_tree.start, node->phys,
 				 node->dma_addr);
 
 	/*
···
 			if (!node || node->rcventry != rcventry)
 				continue;
 
+ 			if (fd->use_mn)
+ 				mmu_interval_notifier_remove(
+ 					&node->notifier);
 			cacheless_tid_rb_remove(fd, node);
 		}
 	}
 }
 
- /*
-  * Always return 0 from this function. A non-zero return indicates that the
-  * remove operation will be called and that memory should be unpinned.
-  * However, the driver cannot unpin out from under PSM. Instead, retain the
-  * memory (by returning 0) and inform PSM that the memory is going away. PSM
-  * will call back later when it has removed the memory from its list.
-  */
- static int tid_rb_invalidate(void *arg, struct mmu_rb_node *mnode)
+ static bool tid_rb_invalidate(struct mmu_interval_notifier *mni,
+ 			      const struct mmu_notifier_range *range,
+ 			      unsigned long cur_seq)
 {
- 	struct hfi1_filedata *fdata = arg;
- 	struct hfi1_ctxtdata *uctxt = fdata->uctxt;
 	struct tid_rb_node *node =
- 		container_of(mnode, struct tid_rb_node, mmu);
+ 		container_of(mni, struct tid_rb_node, notifier);
+ 	struct hfi1_filedata *fdata = node->fdata;
+ 	struct hfi1_ctxtdata *uctxt = fdata->uctxt;
 
 	if (node->freed)
- 		return 0;
+ 		return true;
 
- 	trace_hfi1_exp_tid_inval(uctxt->ctxt, fdata->subctxt, node->mmu.addr,
+ 	trace_hfi1_exp_tid_inval(uctxt->ctxt, fdata->subctxt,
+ 				 node->notifier.interval_tree.start,
 				 node->rcventry, node->npages, node->dma_addr);
 	node->freed = true;
 
···
 		fdata->invalid_tid_idx++;
 	}
 	spin_unlock(&fdata->invalid_lock);
- 	return 0;
- }
-
- static int tid_rb_insert(void *arg, struct mmu_rb_node *node)
- {
- 	struct hfi1_filedata *fdata = arg;
- 	struct tid_rb_node *tnode =
- 		container_of(node, struct tid_rb_node, mmu);
- 	u32 base = fdata->uctxt->expected_base;
-
- 	fdata->entry_to_rb[tnode->rcventry - base] = tnode;
- 	return 0;
+ 	return true;
 }
 
 static void cacheless_tid_rb_remove(struct hfi1_filedata *fdata,
···
 	fdata->entry_to_rb[tnode->rcventry - base] = NULL;
 	clear_tid_node(fdata, tnode);
 }
-
- static void tid_rb_remove(void *arg, struct mmu_rb_node *node)
- {
- 	struct hfi1_filedata *fdata = arg;
- 	struct tid_rb_node *tnode =
- 		container_of(node, struct tid_rb_node, mmu);
-
- 	cacheless_tid_rb_remove(fdata, tnode);
- }
+2 -1
drivers/infiniband/hw/hfi1/user_exp_rcv.h
···
 };
 
 struct tid_rb_node {
- 	struct mmu_rb_node mmu;
+ 	struct mmu_interval_notifier notifier;
+ 	struct hfi1_filedata *fdata;
 	unsigned long phys;
 	struct tid_group *grp;
 	u32 rcventry;
+2 -5
drivers/infiniband/hw/mlx5/mlx5_ib.h
···
 void mlx5_ib_odp_cleanup_one(struct mlx5_ib_dev *ibdev);
 int __init mlx5_ib_odp_init(void);
 void mlx5_ib_odp_cleanup(void);
- void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
- 			      unsigned long end);
 void mlx5_odp_init_mr_cache_entry(struct mlx5_cache_ent *ent);
 void mlx5_odp_populate_klm(struct mlx5_klm *pklm, size_t offset,
 			   size_t nentries, struct mlx5_ib_mr *mr, int flags);
···
 {
 	return -EOPNOTSUPP;
 }
- static inline void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp,
- 					    unsigned long start,
- 					    unsigned long end){};
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
+ extern const struct mmu_interval_notifier_ops mlx5_mn_ops;
 
 /* Needed for rep profile */
 void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
+2 -1
drivers/infiniband/hw/mlx5/mr.c
···
 	if (access_flags & IB_ACCESS_ON_DEMAND) {
 		struct ib_umem_odp *odp;
 
- 		odp = ib_umem_odp_get(udata, start, length, access_flags);
+ 		odp = ib_umem_odp_get(udata, start, length, access_flags,
+ 				      &mlx5_mn_ops);
 		if (IS_ERR(odp)) {
 			mlx5_ib_dbg(dev, "umem get failed (%ld)\n",
 				    PTR_ERR(odp));
+23 -27
drivers/infiniband/hw/mlx5/odp.c
···
 	xa_unlock(&imr->implicit_children);
 }
 
- void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
- 			      unsigned long end)
+ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
+ 				     const struct mmu_notifier_range *range,
+ 				     unsigned long cur_seq)
 {
+ 	struct ib_umem_odp *umem_odp =
+ 		container_of(mni, struct ib_umem_odp, notifier);
 	struct mlx5_ib_mr *mr;
 	const u64 umr_block_mask = (MLX5_UMR_MTT_ALIGNMENT /
 				    sizeof(struct mlx5_mtt)) - 1;
 	u64 idx = 0, blk_start_idx = 0;
 	u64 invalidations = 0;
+ 	unsigned long start;
+ 	unsigned long end;
 	int in_block = 0;
 	u64 addr;
 
+ 	if (!mmu_notifier_range_blockable(range))
+ 		return false;
+
 	mutex_lock(&umem_odp->umem_mutex);
+ 	mmu_interval_set_seq(mni, cur_seq);
 	/*
 	 * If npages is zero then umem_odp->private may not be setup yet. This
 	 * does not complete until after the first page is mapped for DMA.
···
 		goto out;
 	mr = umem_odp->private;
 
- 	start = max_t(u64, ib_umem_start(umem_odp), start);
- 	end = min_t(u64, ib_umem_end(umem_odp), end);
+ 	start = max_t(u64, ib_umem_start(umem_odp), range->start);
+ 	end = min_t(u64, ib_umem_end(umem_odp), range->end);
 
 	/*
 	 * Iteration one - zap the HW's MTTs. The notifiers_count ensures that
···
 		destroy_unused_implicit_child_mr(mr);
 out:
 	mutex_unlock(&umem_odp->umem_mutex);
+ 	return true;
 }
+
+ const struct mmu_interval_notifier_ops mlx5_mn_ops = {
+ 	.invalidate = mlx5_ib_invalidate_range,
+ };
 
 void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev)
 {
···
 	odp = ib_umem_odp_alloc_child(to_ib_umem_odp(imr->umem),
 				      idx * MLX5_IMR_MTT_SIZE,
- 				      MLX5_IMR_MTT_SIZE);
+ 				      MLX5_IMR_MTT_SIZE, &mlx5_mn_ops);
 	if (IS_ERR(odp))
 		return ERR_CAST(odp);
···
 			u64 user_va, size_t bcnt, u32 *bytes_mapped,
 			u32 flags)
 {
- 	int current_seq, page_shift, ret, np;
+ 	int page_shift, ret, np;
 	bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
+ 	unsigned long current_seq;
 	u64 access_mask;
 	u64 start_idx, page_mask;
···
 	if (odp->umem.writable && !downgrade)
 		access_mask |= ODP_WRITE_ALLOWED_BIT;
 
- 	current_seq = READ_ONCE(odp->notifiers_seq);
- 	/*
- 	 * Ensure the sequence number is valid for some time before we call
- 	 * gup.
- 	 */
- 	smp_rmb();
+ 	current_seq = mmu_interval_read_begin(&odp->notifier);
 
 	np = ib_umem_odp_map_dma_pages(odp, user_va, bcnt, access_mask,
 				       current_seq);
···
 		return np;
 
 	mutex_lock(&odp->umem_mutex);
- 	if (!ib_umem_mmu_notifier_retry(odp, current_seq)) {
+ 	if (!mmu_interval_read_retry(&odp->notifier, current_seq)) {
 		/*
 		 * No need to check whether the MTTs really belong to
 		 * this MR, since ib_umem_odp_map_dma_pages already
···
 	return np << (page_shift - PAGE_SHIFT);
 
 out:
- 	if (ret == -EAGAIN) {
- 		unsigned long timeout = msecs_to_jiffies(MMU_NOTIFIER_TIMEOUT);
-
- 		if (!wait_for_completion_timeout(&odp->notifier_completion,
- 						 timeout)) {
- 			mlx5_ib_warn(
- 				mr->dev,
- 				"timeout waiting for mmu notifier. seq %d against %d. notifiers_count=%d\n",
- 				current_seq, odp->notifiers_seq,
- 				odp->notifiers_count);
- 		}
- 	}
-
 	return ret;
 }
···
 static const struct ib_device_ops mlx5_ib_dev_odp_ops = {
 	.advise_mr = mlx5_ib_advise_mr,
- 	.invalidate_range = mlx5_ib_invalidate_range,
 };
 
 int mlx5_ib_odp_init_one(struct mlx5_ib_dev *dev)
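The converted mlx5 page-fault path above is the canonical mmu_interval_notifier collision-retry loop: take a sequence with `mmu_interval_read_begin()`, map the pages, then under the driver lock check `mmu_interval_read_retry()` and redo the work if an invalidation raced. A minimal single-threaded userspace model of that sequence-count contract (the `toy_*` names are illustrative assumptions, not the kernel implementation, which also handles in-progress invalidations and blocking):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy notifier: one counter bumped by every invalidation. */
struct toy_notifier {
	unsigned long invalidate_seq;
};

/* Reader side: snapshot the sequence before doing work. */
static unsigned long toy_read_begin(struct toy_notifier *n)
{
	return n->invalidate_seq;
}

/* Invalidate side: called (under the driver lock in the real
 * scheme) whenever the CPU page tables for the range change. */
static void toy_invalidate(struct toy_notifier *n)
{
	n->invalidate_seq++;
}

/* Reader side, under the driver lock: true means an invalidation
 * ran since toy_read_begin() and the work must be redone. */
static bool toy_read_retry(struct toy_notifier *n, unsigned long seq)
{
	return n->invalidate_seq != seq;
}
```

A driver loops `seq = toy_read_begin(); do_work(); lock(); if (toy_read_retry(n, seq)) { unlock(); continue; } commit(); unlock();` until the commit lands without a racing invalidation.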
+1 -7
drivers/xen/gntdev-common.h
···
 struct gntdev_priv {
 	/* Maps with visible offsets in the file descriptor. */
 	struct list_head maps;
- 	/*
- 	 * Maps that are not visible; will be freed on munmap.
- 	 * Only populated if populate_freeable_maps == 1
- 	 */
- 	struct list_head freeable_maps;
 	/* lock protects maps and freeable_maps. */
 	struct mutex lock;
- 	struct mm_struct *mm;
- 	struct mmu_notifier mn;
 
 #ifdef CONFIG_XEN_GRANT_DMA_ALLOC
 	/* Device for which DMA memory is allocated. */
···
 };
 
 struct gntdev_grant_map {
+ 	struct mmu_interval_notifier notifier;
 	struct list_head next;
 	struct vm_area_struct *vma;
 	int index;
+48 -131
drivers/xen/gntdev.c
···
 static atomic_t pages_mapped = ATOMIC_INIT(0);
 
 static int use_ptemod;
- #define populate_freeable_maps use_ptemod
 
 static int unmap_grant_pages(struct gntdev_grant_map *map,
 			     int offset, int pages);
···
 	if (map->notify.flags & UNMAP_NOTIFY_SEND_EVENT) {
 		notify_remote_via_evtchn(map->notify.event);
 		evtchn_put(map->notify.event);
 	}
-
- 	if (populate_freeable_maps && priv) {
- 		mutex_lock(&priv->lock);
- 		list_del(&map->next);
- 		mutex_unlock(&priv->lock);
- 	}
 
 	if (map->pages && !use_ptemod)
···
 	pr_debug("gntdev_vma_close %p\n", vma);
 	if (use_ptemod) {
- 		/* It is possible that an mmu notifier could be running
- 		 * concurrently, so take priv->lock to ensure that the vma won't
- 		 * vanishing during the unmap_grant_pages call, since we will
- 		 * spin here until that completes. Such a concurrent call will
- 		 * not do any unmapping, since that has been done prior to
- 		 * closing the vma, but it may still iterate the unmap_ops list.
- 		 */
- 		mutex_lock(&priv->lock);
+ 		WARN_ON(map->vma != vma);
+ 		mmu_interval_notifier_remove(&map->notifier);
 		map->vma = NULL;
- 		mutex_unlock(&priv->lock);
 	}
 	vma->vm_private_data = NULL;
 	gntdev_put_map(priv, map);
···
 /* ------------------------------------------------------------------ */
 
- static bool in_range(struct gntdev_grant_map *map,
- 		     unsigned long start, unsigned long end)
+ static bool gntdev_invalidate(struct mmu_interval_notifier *mn,
+ 			      const struct mmu_notifier_range *range,
+ 			      unsigned long cur_seq)
 {
- 	if (!map->vma)
- 		return false;
- 	if (map->vma->vm_start >= end)
- 		return false;
- 	if (map->vma->vm_end <= start)
- 		return false;
-
- 	return true;
- }
-
- static int unmap_if_in_range(struct gntdev_grant_map *map,
- 			     unsigned long start, unsigned long end,
- 			     bool blockable)
- {
+ 	struct gntdev_grant_map *map =
+ 		container_of(mn, struct gntdev_grant_map, notifier);
 	unsigned long mstart, mend;
 	int err;
 
- 	if (!in_range(map, start, end))
- 		return 0;
+ 	if (!mmu_notifier_range_blockable(range))
+ 		return false;
 
- 	if (!blockable)
- 		return -EAGAIN;
+ 	/*
+ 	 * If the VMA is split or otherwise changed the notifier is not
+ 	 * updated, but we don't want to process VA's outside the modified
+ 	 * VMA. FIXME: It would be much more understandable to just prevent
+ 	 * modifying the VMA in the first place.
+ 	 */
+ 	if (map->vma->vm_start >= range->end ||
+ 	    map->vma->vm_end <= range->start)
+ 		return true;
 
- 	mstart = max(start, map->vma->vm_start);
- 	mend = min(end, map->vma->vm_end);
+ 	mstart = max(range->start, map->vma->vm_start);
+ 	mend = min(range->end, map->vma->vm_end);
 	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
 		 map->index, map->count,
 		 map->vma->vm_start, map->vma->vm_end,
- 		 start, end, mstart, mend);
+ 		 range->start, range->end, mstart, mend);
 	err = unmap_grant_pages(map,
 				(mstart - map->vma->vm_start) >> PAGE_SHIFT,
 				(mend - mstart) >> PAGE_SHIFT);
 	WARN_ON(err);
 
- 	return 0;
+ 	return true;
 }
 
- static int mn_invl_range_start(struct mmu_notifier *mn,
- 			       const struct mmu_notifier_range *range)
- {
- 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
- 	struct gntdev_grant_map *map;
- 	int ret = 0;
-
- 	if (mmu_notifier_range_blockable(range))
- 		mutex_lock(&priv->lock);
- 	else if (!mutex_trylock(&priv->lock))
- 		return -EAGAIN;
-
- 	list_for_each_entry(map, &priv->maps, next) {
- 		ret = unmap_if_in_range(map, range->start, range->end,
- 					mmu_notifier_range_blockable(range));
- 		if (ret)
- 			goto out_unlock;
- 	}
- 	list_for_each_entry(map, &priv->freeable_maps, next) {
- 		ret = unmap_if_in_range(map, range->start, range->end,
- 					mmu_notifier_range_blockable(range));
- 		if (ret)
- 			goto out_unlock;
- 	}
-
- out_unlock:
- 	mutex_unlock(&priv->lock);
-
- 	return ret;
- }
-
- static void mn_release(struct mmu_notifier *mn,
- 		       struct mm_struct *mm)
- {
- 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
- 	struct gntdev_grant_map *map;
- 	int err;
-
- 	mutex_lock(&priv->lock);
- 	list_for_each_entry(map, &priv->maps, next) {
- 		if (!map->vma)
- 			continue;
- 		pr_debug("map %d+%d (%lx %lx)\n",
- 			 map->index, map->count,
- 			 map->vma->vm_start, map->vma->vm_end);
- 		err = unmap_grant_pages(map, /* offset */ 0, map->count);
- 		WARN_ON(err);
- 	}
- 	list_for_each_entry(map, &priv->freeable_maps, next) {
- 		if (!map->vma)
- 			continue;
- 		pr_debug("map %d+%d (%lx %lx)\n",
- 			 map->index, map->count,
- 			 map->vma->vm_start, map->vma->vm_end);
- 		err = unmap_grant_pages(map, /* offset */ 0, map->count);
- 		WARN_ON(err);
- 	}
- 	mutex_unlock(&priv->lock);
- }
-
- static const struct mmu_notifier_ops gntdev_mmu_ops = {
- 	.release = mn_release,
- 	.invalidate_range_start = mn_invl_range_start,
+ static const struct mmu_interval_notifier_ops gntdev_mmu_ops = {
+ 	.invalidate = gntdev_invalidate,
 };
 
 /* ------------------------------------------------------------------ */
···
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&priv->maps);
- 	INIT_LIST_HEAD(&priv->freeable_maps);
 	mutex_init(&priv->lock);
 
 #ifdef CONFIG_XEN_GNTDEV_DMABUF
···
 		return ret;
 	}
 #endif
-
- 	if (use_ptemod) {
- 		priv->mm = get_task_mm(current);
- 		if (!priv->mm) {
- 			kfree(priv);
- 			return -ENOMEM;
- 		}
- 		priv->mn.ops = &gntdev_mmu_ops;
- 		ret = mmu_notifier_register(&priv->mn, priv->mm);
- 		mmput(priv->mm);
- 	}
 
 	if (ret) {
 		kfree(priv);
···
 		list_del(&map->next);
 		gntdev_put_map(NULL /* already removed */, map);
 	}
- 	WARN_ON(!list_empty(&priv->freeable_maps));
 	mutex_unlock(&priv->lock);
 
 #ifdef CONFIG_XEN_GNTDEV_DMABUF
 	gntdev_dmabuf_fini(priv->dmabuf_priv);
 #endif
-
- 	if (use_ptemod)
- 		mmu_notifier_unregister(&priv->mn, priv->mm);
 
 	kfree(priv);
 	return 0;
···
 	map = gntdev_find_map_index(priv, op.index >> PAGE_SHIFT, op.count);
 	if (map) {
 		list_del(&map->next);
- 		if (populate_freeable_maps)
- 			list_add_tail(&map->next, &priv->freeable_maps);
 		err = 0;
 	}
 	mutex_unlock(&priv->lock);
···
 		goto unlock_out;
 	if (use_ptemod && map->vma)
 		goto unlock_out;
- 	if (use_ptemod && priv->mm != vma->vm_mm) {
- 		pr_warn("Huh? Other mm?\n");
- 		goto unlock_out;
- 	}
-
 	refcount_inc(&map->users);
 
 	vma->vm_ops = &gntdev_vmops;
···
 	vma->vm_flags |= VM_DONTCOPY;
 
 	vma->vm_private_data = map;
-
- 	if (use_ptemod)
- 		map->vma = vma;
-
 	if (map->flags) {
 		if ((vma->vm_flags & VM_WRITE) &&
 		    (map->flags & GNTMAP_readonly))
···
 			map->flags |= GNTMAP_readonly;
 	}
 
+ 	if (use_ptemod) {
+ 		map->vma = vma;
+ 		err = mmu_interval_notifier_insert_locked(
+ 			&map->notifier, vma->vm_mm, vma->vm_start,
+ 			vma->vm_end - vma->vm_start, &gntdev_mmu_ops);
+ 		if (err)
+ 			goto out_unlock_put;
+ 	}
 	mutex_unlock(&priv->lock);
+
+ 	/*
+ 	 * gntdev takes the address of the PTE in find_grant_ptes() and passes
+ 	 * it to the hypervisor in gntdev_map_grant_pages(). The purpose of
+ 	 * the notifier is to prevent the hypervisor pointer to the PTE from
+ 	 * going stale.
+ 	 *
+ 	 * Since this vma's mappings can't be touched without the mmap_sem,
+ 	 * and we are holding it now, there is no need for the notifier_range
+ 	 * locking pattern.
+ 	 */
+ 	mmu_interval_read_begin(&map->notifier);
 
 	if (use_ptemod) {
 		map->pages_vm_start = vma->vm_start;
···
 	mutex_unlock(&priv->lock);
 out_put_map:
 	if (use_ptemod) {
- 		map->vma = NULL;
 		unmap_grant_pages(map, 0, map->count);
+ 		if (map->vma) {
+ 			mmu_interval_notifier_remove(&map->notifier);
+ 			map->vma = NULL;
+ 		}
 	}
 	gntdev_put_map(priv, map);
 	return err;
+14 -176
include/linux/hmm.h
···
 #include <linux/kconfig.h>
 #include <asm/pgtable.h>
 
- #ifdef CONFIG_HMM_MIRROR
-
 #include <linux/device.h>
 #include <linux/migrate.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
 #include <linux/mmu_notifier.h>
-
-
- /*
-  * struct hmm - HMM per mm struct
-  *
-  * @mm: mm struct this HMM struct is bound to
-  * @lock: lock protecting ranges list
-  * @ranges: list of range being snapshotted
-  * @mirrors: list of mirrors for this mm
-  * @mmu_notifier: mmu notifier to track updates to CPU page table
-  * @mirrors_sem: read/write semaphore protecting the mirrors list
-  * @wq: wait queue for user waiting on a range invalidation
-  * @notifiers: count of active mmu notifiers
-  */
- struct hmm {
- 	struct mmu_notifier mmu_notifier;
- 	spinlock_t ranges_lock;
- 	struct list_head ranges;
- 	struct list_head mirrors;
- 	struct rw_semaphore mirrors_sem;
- 	wait_queue_head_t wq;
- 	long notifiers;
- };
 
 /*
  * hmm_pfn_flag_e - HMM flag enums
···
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+  * @notifier: a mmu_interval_notifier that includes the start/end
+  * @notifier_seq: result of mmu_interval_read_begin()
  * @hmm: the core HMM structure this range is active against
  * @vma: the vm area struct for the range
  * @list: all range lock are on a list
···
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
- 	struct hmm *hmm;
- 	struct list_head list;
+ 	struct mmu_interval_notifier *notifier;
+ 	unsigned long notifier_seq;
 	unsigned long start;
 	unsigned long end;
 	uint64_t *pfns;
···
 	uint64_t default_flags;
 	uint64_t pfn_flags_mask;
 	uint8_t pfn_shift;
- 	bool valid;
 };
-
- /*
-  * hmm_range_wait_until_valid() - wait for range to be valid
-  * @range: range affected by invalidation to wait on
-  * @timeout: time out for wait in ms (ie abort wait after that period of time)
-  * Return: true if the range is valid, false otherwise.
-  */
- static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
- 					      unsigned long timeout)
- {
- 	return wait_event_timeout(range->hmm->wq, range->valid,
- 				  msecs_to_jiffies(timeout)) != 0;
- }
-
- /*
-  * hmm_range_valid() - test if a range is valid or not
-  * @range: range
-  * Return: true if the range is valid, false otherwise.
-  */
- static inline bool hmm_range_valid(struct hmm_range *range)
- {
- 	return range->valid;
- }
 
 /*
  * hmm_device_entry_to_page() - return struct page pointed to by a device entry
···
 }
 
- /*
-  * Mirroring: how to synchronize device page table with CPU page table.
-  *
-  * A device driver that is participating in HMM mirroring must always
-  * synchronize with CPU page table updates. For this, device drivers can either
-  * directly use mmu_notifier APIs or they can use the hmm_mirror API. Device
-  * drivers can decide to register one mirror per device per process, or just
-  * one mirror per process for a group of devices. The pattern is:
-  *
-  * int device_bind_address_space(..., struct mm_struct *mm, ...)
-  * {
-  *	struct device_address_space *das;
-  *
-  *	// Device driver specific initialization, and allocation of das
-  *	// which contains an hmm_mirror struct as one of its fields.
-  *	...
-  *
-  *	ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
-  *	if (ret) {
-  *		// Cleanup on error
-  *		return ret;
-  *	}
-  *
-  *	// Other device driver specific initialization
-  *	...
-  * }
-  *
-  * Once an hmm_mirror is registered for an address space, the device driver
-  * will get callbacks through sync_cpu_device_pagetables() operation (see
-  * hmm_mirror_ops struct).
-  *
-  * Device driver must not free the struct containing the hmm_mirror struct
-  * before calling hmm_mirror_unregister(). The expected usage is to do that when
-  * the device driver is unbinding from an address space.
-  *
-  *
-  * void device_unbind_address_space(struct device_address_space *das)
-  * {
-  *	// Device driver specific cleanup
-  *	...
-  *
-  *	hmm_mirror_unregister(&das->mirror);
-  *
-  *	// Other device driver specific cleanup, and now das can be freed
-  *	...
-  * }
-  */
-
- struct hmm_mirror;
-
- /*
-  * struct hmm_mirror_ops - HMM mirror device operations callback
-  *
-  * @update: callback to update range on a device
-  */
- struct hmm_mirror_ops {
- 	/* release() - release hmm_mirror
- 	 *
- 	 * @mirror: pointer to struct hmm_mirror
- 	 *
- 	 * This is called when the mm_struct is being released. The callback
- 	 * must ensure that all access to any pages obtained from this mirror
- 	 * is halted before the callback returns. All future access should
- 	 * fault.
- 	 */
- 	void (*release)(struct hmm_mirror *mirror);
-
- 	/* sync_cpu_device_pagetables() - synchronize page tables
- 	 *
- 	 * @mirror: pointer to struct hmm_mirror
- 	 * @update: update information (see struct mmu_notifier_range)
- 	 * Return: -EAGAIN if mmu_notifier_range_blockable(update) is false
- 	 * and callback needs to block, 0 otherwise.
- 	 *
- 	 * This callback ultimately originates from mmu_notifiers when the CPU
- 	 * page table is updated. The device driver must update its page table
- 	 * in response to this callback. The update argument tells what action
- 	 * to perform.
- 	 *
- 	 * The device driver must not return from this callback until the device
- 	 * page tables are completely updated (TLBs flushed, etc); this is a
- 	 * synchronous call.
- 	 */
- 	int (*sync_cpu_device_pagetables)(
- 		struct hmm_mirror *mirror,
- 		const struct mmu_notifier_range *update);
- };
-
- /*
-  * struct hmm_mirror - mirror struct for a device driver
-  *
-  * @hmm: pointer to struct hmm (which is unique per mm_struct)
-  * @ops: device driver callback for HMM mirror operations
-  * @list: for list of mirrors of a given mm
-  *
-  * Each address space (mm_struct) being mirrored by a device must register one
-  * instance of an hmm_mirror struct with HMM. HMM will track the list of all
-  * mirrors for each mm_struct.
-  */
- struct hmm_mirror {
- 	struct hmm *hmm;
- 	const struct hmm_mirror_ops *ops;
- 	struct list_head list;
- };
-
- int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
- void hmm_mirror_unregister(struct hmm_mirror *mirror);
-
- /*
-  * Please see Documentation/vm/hmm.rst for how to use the range API.
-  */
- int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror);
- void hmm_range_unregister(struct hmm_range *range);
-
 /*
  * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that case.
  */
 #define HMM_FAULT_ALLOW_RETRY	(1 << 0)
···
 /* Don't fault in missing PTEs, just snapshot the current state. */
 #define HMM_FAULT_SNAPSHOT	(1 << 1)
 
+ #ifdef CONFIG_HMM_MIRROR
+ /*
+  * Please see Documentation/vm/hmm.rst for how to use the range API.
+  */
 long hmm_range_fault(struct hmm_range *range, unsigned int flags);
-
- long hmm_range_dma_map(struct hmm_range *range,
- 		       struct device *device,
- 		       dma_addr_t *daddrs,
- 		       unsigned int flags);
- long hmm_range_dma_unmap(struct hmm_range *range,
- 			 struct device *device,
- 			 dma_addr_t *daddrs,
- 			 bool dirty);
+ #else
+ static inline long hmm_range_fault(struct hmm_range *range, unsigned int flags)
+ {
+ 	return -EOPNOTSUPP;
+ }
+ #endif
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
···
  * wait already.
  */
 #define HMM_RANGE_DEFAULT_TIMEOUT 1000
-
- #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 #endif /* LINUX_HMM_H */
include/linux/mmu_notifier.h  +118 -29
···
 #include <linux/spinlock.h>
 #include <linux/mm_types.h>
 #include <linux/srcu.h>
+ #include <linux/interval_tree.h>

+ struct mmu_notifier_mm;
 struct mmu_notifier;
- struct mmu_notifier_ops;
+ struct mmu_notifier_range;
+ struct mmu_interval_notifier;

 /**
  * enum mmu_notifier_event - reason for the mmu notifier callback
···
  * access flags). User should soft dirty the page in the end callback to make
  * sure that anyone relying on soft dirtyness catch pages that might be written
  * through non CPU mappings.
+ *
+ * @MMU_NOTIFY_RELEASE: used during mmu_interval_notifier invalidate to signal
+ * that the mm refcount is zero and the range is no longer accessible.
  */
 enum mmu_notifier_event {
     MMU_NOTIFY_UNMAP = 0,
···
     MMU_NOTIFY_PROTECTION_VMA,
     MMU_NOTIFY_PROTECTION_PAGE,
     MMU_NOTIFY_SOFT_DIRTY,
- };
-
- #ifdef CONFIG_MMU_NOTIFIER
-
- #ifdef CONFIG_LOCKDEP
- extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
- #endif
-
- /*
-  * The mmu notifier_mm structure is allocated and installed in
-  * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
-  * critical section and it's released only when mm_count reaches zero
-  * in mmdrop().
-  */
- struct mmu_notifier_mm {
-     /* all mmu notifiers registerd in this mm are queued in this list */
-     struct hlist_head list;
-     /* to serialize the list modifications and hlist_unhashed */
-     spinlock_t lock;
+     MMU_NOTIFY_RELEASE,
 };

 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
-
- struct mmu_notifier_range {
-     struct vm_area_struct *vma;
-     struct mm_struct *mm;
-     unsigned long start;
-     unsigned long end;
-     unsigned flags;
-     enum mmu_notifier_event event;
- };

 struct mmu_notifier_ops {
     /*
···
     unsigned int users;
 };

+ /**
+  * struct mmu_interval_notifier_ops
+  * @invalidate: Upon return the caller must stop using any SPTEs within this
+  *              range. This function can sleep. Return false only if sleeping
+  *              was required but mmu_notifier_range_blockable(range) is false.
+  */
+ struct mmu_interval_notifier_ops {
+     bool (*invalidate)(struct mmu_interval_notifier *mni,
+                        const struct mmu_notifier_range *range,
+                        unsigned long cur_seq);
+ };
+
+ struct mmu_interval_notifier {
+     struct interval_tree_node interval_tree;
+     const struct mmu_interval_notifier_ops *ops;
+     struct mm_struct *mm;
+     struct hlist_node deferred_item;
+     unsigned long invalidate_seq;
+ };
+
+ #ifdef CONFIG_MMU_NOTIFIER
+
+ #ifdef CONFIG_LOCKDEP
+ extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
+ #endif
+
+ struct mmu_notifier_range {
+     struct vm_area_struct *vma;
+     struct mm_struct *mm;
+     unsigned long start;
+     unsigned long end;
+     unsigned flags;
+     enum mmu_notifier_event event;
+ };
+
 static inline int mm_has_notifiers(struct mm_struct *mm)
 {
     return unlikely(mm->mmu_notifier_mm);
···
                                    struct mm_struct *mm);
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
                                     struct mm_struct *mm);
+
+ unsigned long mmu_interval_read_begin(struct mmu_interval_notifier *mni);
+ int mmu_interval_notifier_insert(struct mmu_interval_notifier *mni,
+                                  struct mm_struct *mm, unsigned long start,
+                                  unsigned long length,
+                                  const struct mmu_interval_notifier_ops *ops);
+ int mmu_interval_notifier_insert_locked(
+     struct mmu_interval_notifier *mni, struct mm_struct *mm,
+     unsigned long start, unsigned long length,
+     const struct mmu_interval_notifier_ops *ops);
+ void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni);
+
+ /**
+  * mmu_interval_set_seq - Save the invalidation sequence
+  * @mni - The mni passed to invalidate
+  * @cur_seq - The cur_seq passed to the invalidate() callback
+  *
+  * This must be called unconditionally from the invalidate callback of a
+  * struct mmu_interval_notifier_ops under the same lock that is used to call
+  * mmu_interval_read_retry(). It updates the sequence number for later use by
+  * mmu_interval_read_retry(). The provided cur_seq will always be odd.
+  *
+  * If the caller does not call mmu_interval_read_begin() or
+  * mmu_interval_read_retry() then this call is not required.
+  */
+ static inline void mmu_interval_set_seq(struct mmu_interval_notifier *mni,
+                                         unsigned long cur_seq)
+ {
+     WRITE_ONCE(mni->invalidate_seq, cur_seq);
+ }
+
+ /**
+  * mmu_interval_read_retry - End a read side critical section against a VA range
+  * mni: The range
+  * seq: The return of the paired mmu_interval_read_begin()
+  *
+  * This MUST be called under a user provided lock that is also held
+  * unconditionally by op->invalidate() when it calls mmu_interval_set_seq().
+  *
+  * Each call should be paired with a single mmu_interval_read_begin() and
+  * should be used to conclude the read side.
+  *
+  * Returns true if an invalidation collided with this critical section, and
+  * the caller should retry.
+  */
+ static inline bool mmu_interval_read_retry(struct mmu_interval_notifier *mni,
+                                            unsigned long seq)
+ {
+     return mni->invalidate_seq != seq;
+ }
+
+ /**
+  * mmu_interval_check_retry - Test if a collision has occurred
+  * mni: The range
+  * seq: The return of the matching mmu_interval_read_begin()
+  *
+  * This can be used in the critical section between mmu_interval_read_begin()
+  * and mmu_interval_read_retry(). A return of true indicates an invalidation
+  * has collided with this critical region and a future
+  * mmu_interval_read_retry() will return true.
+  *
+  * False is not reliable and only suggests a collision may not have
+  * occured. It can be called many times and does not have to hold the user
+  * provided lock.
+  *
+  * This call can be used as part of loops and other expensive operations to
+  * expedite a retry.
+  */
+ static inline bool mmu_interval_check_retry(struct mmu_interval_notifier *mni,
+                                             unsigned long seq)
+ {
+     /* Pairs with the WRITE_ONCE in mmu_interval_set_seq() */
+     return READ_ONCE(mni->invalidate_seq) != seq;
+ }
+
 extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
include/rdma/ib_umem_odp.h  +14 -54
···

 #include <rdma/ib_umem.h>
 #include <rdma/ib_verbs.h>
- #include <linux/interval_tree.h>

 struct ib_umem_odp {
     struct ib_umem umem;
-     struct ib_ucontext_per_mm *per_mm;
+     struct mmu_interval_notifier notifier;
+     struct pid *tgid;

     /*
      * An array of the pages included in the on-demand paging umem.
···
     struct mutex umem_mutex;
     void *private; /* for the HW driver to use. */

-     int notifiers_seq;
-     int notifiers_count;
     int npages;
-
-     /* Tree tracking */
-     struct interval_tree_node interval_tree;

     /*
      * An implicit odp umem cannot be DMA mapped, has 0 length, and serves
···
      */
     bool is_implicit_odp;

-     struct completion notifier_completion;
     unsigned int page_shift;
 };

···
 /* Returns the first page of an ODP umem. */
 static inline unsigned long ib_umem_start(struct ib_umem_odp *umem_odp)
 {
-     return umem_odp->interval_tree.start;
+     return umem_odp->notifier.interval_tree.start;
 }

 /* Returns the address of the page after the last one of an ODP umem. */
 static inline unsigned long ib_umem_end(struct ib_umem_odp *umem_odp)
 {
-     return umem_odp->interval_tree.last + 1;
+     return umem_odp->notifier.interval_tree.last + 1;
 }

 static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
···

 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING

- struct ib_ucontext_per_mm {
-     struct mmu_notifier mn;
-     struct pid *tgid;
-
-     struct rb_root_cached umem_tree;
-     /* Protects umem_tree */
-     struct rw_semaphore umem_rwsem;
- };
-
- struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata, unsigned long addr,
-                                     size_t size, int access);
+ struct ib_umem_odp *
+ ib_umem_odp_get(struct ib_udata *udata, unsigned long addr, size_t size,
+                 int access, const struct mmu_interval_notifier_ops *ops);
 struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_udata *udata,
                                                int access);
- struct ib_umem_odp *ib_umem_odp_alloc_child(struct ib_umem_odp *root_umem,
-                                             unsigned long addr, size_t size);
+ struct ib_umem_odp *
+ ib_umem_odp_alloc_child(struct ib_umem_odp *root_umem, unsigned long addr,
+                         size_t size,
+                         const struct mmu_interval_notifier_ops *ops);
 void ib_umem_odp_release(struct ib_umem_odp *umem_odp);

 int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 start_offset,
···
 void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 start_offset,
                                  u64 bound);

- typedef int (*umem_call_back)(struct ib_umem_odp *item, u64 start, u64 end,
-                               void *cookie);
- /*
-  * Call the callback on each ib_umem in the range. Returns the logical or of
-  * the return values of the functions called.
-  */
- int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
-                                   u64 start, u64 end,
-                                   umem_call_back cb,
-                                   bool blockable, void *cookie);
-
- static inline int ib_umem_mmu_notifier_retry(struct ib_umem_odp *umem_odp,
-                                              unsigned long mmu_seq)
- {
-     /*
-      * This code is strongly based on the KVM code from
-      * mmu_notifier_retry. Should be called with
-      * the relevant locks taken (umem_odp->umem_mutex
-      * and the ucontext umem_mutex semaphore locked for read).
-      */
-
-     if (unlikely(umem_odp->notifiers_count))
-         return 1;
-     if (umem_odp->notifiers_seq != mmu_seq)
-         return 1;
-     return 0;
- }
-
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */

- static inline struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata,
-                                                   unsigned long addr,
-                                                   size_t size, int access)
+ static inline struct ib_umem_odp *
+ ib_umem_odp_get(struct ib_udata *udata, unsigned long addr, size_t size,
+                 int access, const struct mmu_interval_notifier_ops *ops)
 {
     return ERR_PTR(-EINVAL);
 }
include/rdma/ib_verbs.h  -2
···
                           u64 iova);
     int (*unmap_fmr)(struct list_head *fmr_list);
     int (*dealloc_fmr)(struct ib_fmr *fmr);
-     void (*invalidate_range)(struct ib_umem_odp *umem_odp,
-                              unsigned long start, unsigned long end);
     int (*attach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid);
     int (*detach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid);
     struct ib_xrcd *(*alloc_xrcd)(struct ib_device *device,
kernel/fork.c  -1
···
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
- #include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
mm/Kconfig  +1 -1
···
 config MMU_NOTIFIER
     bool
     select SRCU
+     select INTERVAL_TREE

 config KSM
     bool "Enable KSM for page merging"
···
 config HMM_MIRROR
     bool
     depends on MMU
-     depends on MMU_NOTIFIER

 config DEVICE_PRIVATE
     bool "Unaddressable device memory (GPU memory, ...)"
mm/hmm.c  +63 -468
···
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>

- static struct mmu_notifier *hmm_alloc_notifier(struct mm_struct *mm)
- {
-     struct hmm *hmm;
-
-     hmm = kzalloc(sizeof(*hmm), GFP_KERNEL);
-     if (!hmm)
-         return ERR_PTR(-ENOMEM);
-
-     init_waitqueue_head(&hmm->wq);
-     INIT_LIST_HEAD(&hmm->mirrors);
-     init_rwsem(&hmm->mirrors_sem);
-     INIT_LIST_HEAD(&hmm->ranges);
-     spin_lock_init(&hmm->ranges_lock);
-     hmm->notifiers = 0;
-     return &hmm->mmu_notifier;
- }
-
- static void hmm_free_notifier(struct mmu_notifier *mn)
- {
-     struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-
-     WARN_ON(!list_empty(&hmm->ranges));
-     WARN_ON(!list_empty(&hmm->mirrors));
-     kfree(hmm);
- }
-
- static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
- {
-     struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-     struct hmm_mirror *mirror;
-
-     /*
-      * Since hmm_range_register() holds the mmget() lock hmm_release() is
-      * prevented as long as a range exists.
-      */
-     WARN_ON(!list_empty_careful(&hmm->ranges));
-
-     down_read(&hmm->mirrors_sem);
-     list_for_each_entry(mirror, &hmm->mirrors, list) {
-         /*
-          * Note: The driver is not allowed to trigger
-          * hmm_mirror_unregister() from this thread.
-          */
-         if (mirror->ops->release)
-             mirror->ops->release(mirror);
-     }
-     up_read(&hmm->mirrors_sem);
- }
-
- static void notifiers_decrement(struct hmm *hmm)
- {
-     unsigned long flags;
-
-     spin_lock_irqsave(&hmm->ranges_lock, flags);
-     hmm->notifiers--;
-     if (!hmm->notifiers) {
-         struct hmm_range *range;
-
-         list_for_each_entry(range, &hmm->ranges, list) {
-             if (range->valid)
-                 continue;
-             range->valid = true;
-         }
-         wake_up_all(&hmm->wq);
-     }
-     spin_unlock_irqrestore(&hmm->ranges_lock, flags);
- }
-
- static int hmm_invalidate_range_start(struct mmu_notifier *mn,
-                                       const struct mmu_notifier_range *nrange)
- {
-     struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-     struct hmm_mirror *mirror;
-     struct hmm_range *range;
-     unsigned long flags;
-     int ret = 0;
-
-     spin_lock_irqsave(&hmm->ranges_lock, flags);
-     hmm->notifiers++;
-     list_for_each_entry(range, &hmm->ranges, list) {
-         if (nrange->end < range->start || nrange->start >= range->end)
-             continue;
-
-         range->valid = false;
-     }
-     spin_unlock_irqrestore(&hmm->ranges_lock, flags);
-
-     if (mmu_notifier_range_blockable(nrange))
-         down_read(&hmm->mirrors_sem);
-     else if (!down_read_trylock(&hmm->mirrors_sem)) {
-         ret = -EAGAIN;
-         goto out;
-     }
-
-     list_for_each_entry(mirror, &hmm->mirrors, list) {
-         int rc;
-
-         rc = mirror->ops->sync_cpu_device_pagetables(mirror, nrange);
-         if (rc) {
-             if (WARN_ON(mmu_notifier_range_blockable(nrange) ||
-                         rc != -EAGAIN))
-                 continue;
-             ret = -EAGAIN;
-             break;
-         }
-     }
-     up_read(&hmm->mirrors_sem);
-
- out:
-     if (ret)
-         notifiers_decrement(hmm);
-     return ret;
- }
-
- static void hmm_invalidate_range_end(struct mmu_notifier *mn,
-                                      const struct mmu_notifier_range *nrange)
- {
-     struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-
-     notifiers_decrement(hmm);
- }
-
- static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
-     .release = hmm_release,
-     .invalidate_range_start = hmm_invalidate_range_start,
-     .invalidate_range_end = hmm_invalidate_range_end,
-     .alloc_notifier = hmm_alloc_notifier,
-     .free_notifier = hmm_free_notifier,
- };
-
- /*
-  * hmm_mirror_register() - register a mirror against an mm
-  *
-  * @mirror: new mirror struct to register
-  * @mm: mm to register against
-  * Return: 0 on success, -ENOMEM if no memory, -EINVAL if invalid arguments
-  *
-  * To start mirroring a process address space, the device driver must register
-  * an HMM mirror struct.
-  *
-  * The caller cannot unregister the hmm_mirror while any ranges are
-  * registered.
-  *
-  * Callers using this function must put a call to mmu_notifier_synchronize()
-  * in their module exit functions.
-  */
- int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
- {
-     struct mmu_notifier *mn;
-
-     lockdep_assert_held_write(&mm->mmap_sem);
-
-     /* Sanity check */
-     if (!mm || !mirror || !mirror->ops)
-         return -EINVAL;
-
-     mn = mmu_notifier_get_locked(&hmm_mmu_notifier_ops, mm);
-     if (IS_ERR(mn))
-         return PTR_ERR(mn);
-     mirror->hmm = container_of(mn, struct hmm, mmu_notifier);
-
-     down_write(&mirror->hmm->mirrors_sem);
-     list_add(&mirror->list, &mirror->hmm->mirrors);
-     up_write(&mirror->hmm->mirrors_sem);
-
-     return 0;
- }
- EXPORT_SYMBOL(hmm_mirror_register);
-
- /*
-  * hmm_mirror_unregister() - unregister a mirror
-  *
-  * @mirror: mirror struct to unregister
-  *
-  * Stop mirroring a process address space, and cleanup.
-  */
- void hmm_mirror_unregister(struct hmm_mirror *mirror)
- {
-     struct hmm *hmm = mirror->hmm;
-
-     down_write(&hmm->mirrors_sem);
-     list_del(&mirror->list);
-     up_write(&hmm->mirrors_sem);
-     mmu_notifier_put(&hmm->mmu_notifier);
- }
- EXPORT_SYMBOL(hmm_mirror_unregister);
-
 struct hmm_vma_walk {
     struct hmm_range *range;
     struct dev_pagemap *pgmap;
···
     return -EFAULT;
 }

- static int hmm_pfns_bad(unsigned long addr,
-                         unsigned long end,
-                         struct mm_walk *walk)
+ static int hmm_pfns_fill(unsigned long addr, unsigned long end,
+                          struct hmm_range *range, enum hmm_pfn_value_e value)
 {
-     struct hmm_vma_walk *hmm_vma_walk = walk->private;
-     struct hmm_range *range = hmm_vma_walk->range;
     uint64_t *pfns = range->pfns;
     unsigned long i;

     i = (addr - range->start) >> PAGE_SHIFT;
     for (; addr < end; addr += PAGE_SIZE, i++)
-         pfns[i] = range->values[HMM_PFN_ERROR];
+         pfns[i] = range->values[value];

     return 0;
 }
···
         if (unlikely(!hmm_vma_walk->pgmap))
             return -EBUSY;
     } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
-         *pfn = range->values[HMM_PFN_SPECIAL];
-         return -EFAULT;
+         if (!is_zero_pfn(pte_pfn(pte))) {
+             *pfn = range->values[HMM_PFN_SPECIAL];
+             return -EFAULT;
+         }
+         /*
+          * Since each architecture defines a struct page for the zero
+          * page, just fall through and treat it like a normal page.
+          */
     }

     *pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;
···
         }
         return 0;
     } else if (!pmd_present(pmd))
-         return hmm_pfns_bad(start, end, walk);
+         return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);

     if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) {
         /*
···
      * recover.
      */
     if (pmd_bad(pmd))
-         return hmm_pfns_bad(start, end, walk);
+         return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);

     ptep = pte_offset_map(pmdp, addr);
     i = (addr - range->start) >> PAGE_SHIFT;
···
 #define hmm_vma_walk_hugetlb_entry NULL
 #endif /* CONFIG_HUGETLB_PAGE */

- static void hmm_pfns_clear(struct hmm_range *range,
-                            uint64_t *pfns,
-                            unsigned long addr,
-                            unsigned long end)
+ static int hmm_vma_walk_test(unsigned long start, unsigned long end,
+                              struct mm_walk *walk)
 {
-     for (; addr < end; addr += PAGE_SIZE, pfns++)
-         *pfns = range->values[HMM_PFN_NONE];
- }
-
- /*
-  * hmm_range_register() - start tracking change to CPU page table over a range
-  * @range: range
-  * @mm: the mm struct for the range of virtual address
-  *
-  * Return: 0 on success, -EFAULT if the address space is no longer valid
-  *
-  * Track updates to the CPU page table see include/linux/hmm.h
-  */
- int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror)
- {
-     struct hmm *hmm = mirror->hmm;
-     unsigned long flags;
-
-     range->valid = false;
-     range->hmm = NULL;
-
-     if ((range->start & (PAGE_SIZE - 1)) || (range->end & (PAGE_SIZE - 1)))
-         return -EINVAL;
-     if (range->start >= range->end)
-         return -EINVAL;
-
-     /* Prevent hmm_release() from running while the range is valid */
-     if (!mmget_not_zero(hmm->mmu_notifier.mm))
-         return -EFAULT;
-
-     /* Initialize range to track CPU page table updates. */
-     spin_lock_irqsave(&hmm->ranges_lock, flags);
-
-     range->hmm = hmm;
-     list_add(&range->list, &hmm->ranges);
+     struct hmm_vma_walk *hmm_vma_walk = walk->private;
+     struct hmm_range *range = hmm_vma_walk->range;
+     struct vm_area_struct *vma = walk->vma;

     /*
-      * If there are any concurrent notifiers we have to wait for them for
-      * the range to be valid (see hmm_range_wait_until_valid()).
+      * Skip vma ranges that don't have struct page backing them or
+      * map I/O devices directly.
      */
-     if (!hmm->notifiers)
-         range->valid = true;
-     spin_unlock_irqrestore(&hmm->ranges_lock, flags);
+     if (vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))
+         return -EFAULT;
+
+     /*
+      * If the vma does not allow read access, then assume that it does not
+      * allow write access either. HMM does not support architectures
+      * that allow write without read.
+      */
+     if (!(vma->vm_flags & VM_READ)) {
+         bool fault, write_fault;
+
+         /*
+          * Check to see if a fault is requested for any page in the
+          * range.
+          */
+         hmm_range_need_fault(hmm_vma_walk, range->pfns +
+                              ((start - range->start) >> PAGE_SHIFT),
+                              (end - start) >> PAGE_SHIFT,
+                              0, &fault, &write_fault);
+         if (fault || write_fault)
+             return -EFAULT;
+
+         hmm_pfns_fill(start, end, range, HMM_PFN_NONE);
+         hmm_vma_walk->last = end;
+
+         /* Skip this vma and continue processing the next vma. */
+         return 1;
+     }

     return 0;
 }
- EXPORT_SYMBOL(hmm_range_register);
-
- /*
-  * hmm_range_unregister() - stop tracking change to CPU page table over a range
-  * @range: range
-  *
-  * Range struct is used to track updates to the CPU page table after a call to
-  * hmm_range_register(). See include/linux/hmm.h for how to use it.
-  */
- void hmm_range_unregister(struct hmm_range *range)
- {
-     struct hmm *hmm = range->hmm;
-     unsigned long flags;
-
-     spin_lock_irqsave(&hmm->ranges_lock, flags);
-     list_del_init(&range->list);
-     spin_unlock_irqrestore(&hmm->ranges_lock, flags);
-
-     /* Drop reference taken by hmm_range_register() */
-     mmput(hmm->mmu_notifier.mm);
-
-     /*
-      * The range is now invalid and the ref on the hmm is dropped, so
-      * poison the pointer. Leave other fields in place, for the caller's
-      * use.
-      */
-     range->valid = false;
-     memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
- }
- EXPORT_SYMBOL(hmm_range_unregister);

 static const struct mm_walk_ops hmm_walk_ops = {
     .pud_entry = hmm_vma_walk_pud,
     .pmd_entry = hmm_vma_walk_pmd,
     .pte_hole = hmm_vma_walk_hole,
     .hugetlb_entry = hmm_vma_walk_hugetlb_entry,
+     .test_walk = hmm_vma_walk_test,
 };

 /**
···
  */
 long hmm_range_fault(struct hmm_range *range, unsigned int flags)
 {
-     const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
-     unsigned long start = range->start, end;
-     struct hmm_vma_walk hmm_vma_walk;
-     struct hmm *hmm = range->hmm;
-     struct vm_area_struct *vma;
+     struct hmm_vma_walk hmm_vma_walk = {
+         .range = range,
+         .last = range->start,
+         .flags = flags,
+     };
+     struct mm_struct *mm = range->notifier->mm;
     int ret;

-     lockdep_assert_held(&hmm->mmu_notifier.mm->mmap_sem);
+     lockdep_assert_held(&mm->mmap_sem);

     do {
         /* If range is no longer valid force retry. */
-         if (!range->valid)
+         if (mmu_interval_check_retry(range->notifier,
+                                      range->notifier_seq))
             return -EBUSY;
+         ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
+                               &hmm_walk_ops, &hmm_vma_walk);
+     } while (ret == -EBUSY);

-         vma = find_vma(hmm->mmu_notifier.mm, start);
-         if (vma == NULL || (vma->vm_flags & device_vma))
-             return -EFAULT;
-
-         if (!(vma->vm_flags & VM_READ)) {
-             /*
-              * If vma do not allow read access, then assume that it
-              * does not allow write access, either. HMM does not
-              * support architecture that allow write without read.
-              */
-             hmm_pfns_clear(range, range->pfns,
-                            range->start, range->end);
-             return -EPERM;
-         }
-
-         hmm_vma_walk.pgmap = NULL;
-         hmm_vma_walk.last = start;
-         hmm_vma_walk.flags = flags;
-         hmm_vma_walk.range = range;
-         end = min(range->end, vma->vm_end);
-
-         walk_page_range(vma->vm_mm, start, end, &hmm_walk_ops,
-                         &hmm_vma_walk);
-
-         do {
-             ret = walk_page_range(vma->vm_mm, start, end,
-                                   &hmm_walk_ops, &hmm_vma_walk);
-             start = hmm_vma_walk.last;
-
-             /* Keep trying while the range is valid. */
-         } while (ret == -EBUSY && range->valid);
-
-         if (ret) {
-             unsigned long i;
-
-             i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-             hmm_pfns_clear(range, &range->pfns[i],
-                            hmm_vma_walk.last, range->end);
-             return ret;
-         }
-         start = end;
-
-     } while (start < range->end);
-
+     if (ret)
+         return ret;
     return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
 EXPORT_SYMBOL(hmm_range_fault);
-
- /**
-  * hmm_range_dma_map - hmm_range_fault() and dma map page all in one.
-  * @range: range being faulted
-  * @device: device to map page to
-  * @daddrs: array of dma addresses for the mapped pages
-  * @flags: HMM_FAULT_*
-  *
-  * Return: the number of pages mapped on success (including zero), or any
-  * status return from hmm_range_fault() otherwise.
-  */
- long hmm_range_dma_map(struct hmm_range *range, struct device *device,
-                        dma_addr_t *daddrs, unsigned int flags)
- {
-     unsigned long i, npages, mapped;
-     long ret;
-
-     ret = hmm_range_fault(range, flags);
-     if (ret <= 0)
-         return ret ? ret : -EBUSY;
-
-     npages = (range->end - range->start) >> PAGE_SHIFT;
-     for (i = 0, mapped = 0; i < npages; ++i) {
-         enum dma_data_direction dir = DMA_TO_DEVICE;
-         struct page *page;
-
-         /*
-          * FIXME need to update DMA API to provide invalid DMA address
-          * value instead of a function to test dma address value. This
-          * would remove lot of dumb code duplicated accross many arch.
-          *
-          * For now setting it to 0 here is good enough as the pfns[]
-          * value is what is use to check what is valid and what isn't.
-          */
-         daddrs[i] = 0;
-
-         page = hmm_device_entry_to_page(range, range->pfns[i]);
-         if (page == NULL)
-             continue;
-
-         /* Check if range is being invalidated */
-         if (!range->valid) {
-             ret = -EBUSY;
-             goto unmap;
-         }
-
-         /* If it is read and write than map bi-directional. */
-         if (range->pfns[i] & range->flags[HMM_PFN_WRITE])
-             dir = DMA_BIDIRECTIONAL;
-
-         daddrs[i] = dma_map_page(device, page, 0, PAGE_SIZE, dir);
-         if (dma_mapping_error(device, daddrs[i])) {
-             ret = -EFAULT;
-             goto unmap;
-         }
-
-         mapped++;
-     }
-
-     return mapped;
-
- unmap:
-     for (npages = i, i = 0; (i < npages) && mapped; ++i) {
-         enum dma_data_direction dir = DMA_TO_DEVICE;
-         struct page *page;
-
-         page = hmm_device_entry_to_page(range, range->pfns[i]);
-         if (page == NULL)
-             continue;
-
-         if (dma_mapping_error(device, daddrs[i]))
-             continue;
-
-         /* If it is read and write than map bi-directional. */
-         if (range->pfns[i] & range->flags[HMM_PFN_WRITE])
-             dir = DMA_BIDIRECTIONAL;
-
-         dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
-         mapped--;
-     }
-
-     return ret;
- }
- EXPORT_SYMBOL(hmm_range_dma_map);
-
- /**
-  * hmm_range_dma_unmap() - unmap range of that was map with hmm_range_dma_map()
-  * @range: range being unmapped
-  * @device: device against which dma map was done
-  * @daddrs: dma address of mapped pages
-  * @dirty: dirty page if it had the write flag set
-  * Return: number of page unmapped on success, -EINVAL otherwise
-  *
-  * Note that caller MUST abide by mmu notifier or use HMM mirror and abide
-  * to the sync_cpu_device_pagetables() callback so that it is safe here to
-  * call set_page_dirty(). Caller must also take appropriate locks to avoid
-  * concurrent mmu notifier or sync_cpu_device_pagetables() to make progress.
-  */
- long hmm_range_dma_unmap(struct hmm_range *range,
-                          struct device *device,
-                          dma_addr_t *daddrs,
-                          bool dirty)
- {
-     unsigned long i, npages;
-     long cpages = 0;
-
-     /* Sanity check. */
-     if (range->end <= range->start)
-         return -EINVAL;
-     if (!daddrs)
-         return -EINVAL;
-     if (!range->pfns)
-         return -EINVAL;
-
-     npages = (range->end - range->start) >> PAGE_SHIFT;
-     for (i = 0; i < npages; ++i) {
-         enum dma_data_direction dir = DMA_TO_DEVICE;
-         struct page *page;
-
-         page = hmm_device_entry_to_page(range, range->pfns[i]);
-         if (page == NULL)
-             continue;
-
-         /* If it is read and write than map bi-directional. */
-         if (range->pfns[i] & range->flags[HMM_PFN_WRITE]) {
-             dir = DMA_BIDIRECTIONAL;
-
-             /*
-              * See comments in function description on why it is
-              * safe here to call set_page_dirty()
-              */
-             if (dirty)
-                 set_page_dirty(page);
-         }
-
-         /* Unmap and clear pfns/dma address */
-         dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
-         range->pfns[i] = range->values[HMM_PFN_NONE];
-         /* FIXME see comments in hmm_vma_dma_map() */
-         daddrs[i] = 0;
-         cpages++;
-     }
-
-     return cpages;
- }
- EXPORT_SYMBOL(hmm_range_dma_unmap);
+531 -26
mm/mmu_notifier.c
···
12 12   #include <linux/export.h>
13 13   #include <linux/mm.h>
14 14   #include <linux/err.h>
   15 + #include <linux/interval_tree.h>
15 16   #include <linux/srcu.h>
16 17   #include <linux/rcupdate.h>
17 18   #include <linux/sched.h>
···
29 28   #endif
30 29
31 30   /*
   31 +  * The mmu notifier_mm structure is allocated and installed in
   32 +  * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
   33 +  * critical section and it's released only when mm_count reaches zero
   34 +  * in mmdrop().
   35 +  */
   36 + struct mmu_notifier_mm {
   37 +     /* all mmu notifiers registered in this mm are queued in this list */
   38 +     struct hlist_head list;
   39 +     bool has_itree;
   40 +     /* to serialize the list modifications and hlist_unhashed */
   41 +     spinlock_t lock;
   42 +     unsigned long invalidate_seq;
   43 +     unsigned long active_invalidate_ranges;
   44 +     struct rb_root_cached itree;
   45 +     wait_queue_head_t wq;
   46 +     struct hlist_head deferred_list;
   47 + };
   48 +
   49 + /*
   50 +  * This is a collision-retry read-side/write-side 'lock', a lot like a
   51 +  * seqcount, however this allows multiple write-sides to hold it at
   52 +  * once. Conceptually the write side is protecting the values of the PTEs in
   53 +  * this mm, such that PTES cannot be read into SPTEs (shadow PTEs) while any
   54 +  * writer exists.
   55 +  *
   56 +  * Note that the core mm creates nested invalidate_range_start()/end() regions
   57 +  * within the same thread, and runs invalidate_range_start()/end() in parallel
   58 +  * on multiple CPUs. This is designed to not reduce concurrency or block
   59 +  * progress on the mm side.
   60 +  *
   61 +  * As a secondary function, holding the full write side also serves to prevent
   62 +  * writers for the itree, this is an optimization to avoid extra locking
   63 +  * during invalidate_range_start/end notifiers.
   64 +  *
   65 +  * The write side has two states, fully excluded:
   66 +  *  - mm->active_invalidate_ranges != 0
   67 +  *  - mnn->invalidate_seq & 1 == True (odd)
   68 +  *  - some range on the mm_struct is being invalidated
   69 +  *  - the itree is not allowed to change
   70 +  *
   71 +  * And partially excluded:
   72 +  *  - mm->active_invalidate_ranges != 0
   73 +  *  - mnn->invalidate_seq & 1 == False (even)
   74 +  *  - some range on the mm_struct is being invalidated
   75 +  *  - the itree is allowed to change
   76 +  *
   77 +  * Operations on mmu_notifier_mm->invalidate_seq (under spinlock):
   78 +  *    seq |= 1  # Begin writing
   79 +  *    seq++     # Release the writing state
   80 +  *    seq & 1   # True if a writer exists
   81 +  *
   82 +  * The later state avoids some expensive work on inv_end in the common case of
   83 +  * no mni monitoring the VA.
   84 +  */
   85 + static bool mn_itree_is_invalidating(struct mmu_notifier_mm *mmn_mm)
   86 + {
   87 +     lockdep_assert_held(&mmn_mm->lock);
   88 +     return mmn_mm->invalidate_seq & 1;
   89 + }
   90 +
   91 + static struct mmu_interval_notifier *
   92 + mn_itree_inv_start_range(struct mmu_notifier_mm *mmn_mm,
   93 +                          const struct mmu_notifier_range *range,
   94 +                          unsigned long *seq)
   95 + {
   96 +     struct interval_tree_node *node;
   97 +     struct mmu_interval_notifier *res = NULL;
   98 +
   99 +     spin_lock(&mmn_mm->lock);
  100 +     mmn_mm->active_invalidate_ranges++;
  101 +     node = interval_tree_iter_first(&mmn_mm->itree, range->start,
  102 +                                     range->end - 1);
  103 +     if (node) {
  104 +         mmn_mm->invalidate_seq |= 1;
  105 +         res = container_of(node, struct mmu_interval_notifier,
  106 +                            interval_tree);
  107 +     }
  108 +
  109 +     *seq = mmn_mm->invalidate_seq;
  110 +     spin_unlock(&mmn_mm->lock);
  111 +     return res;
  112 + }
  113 +
  114 + static struct mmu_interval_notifier *
  115 + mn_itree_inv_next(struct mmu_interval_notifier *mni,
  116 +                   const struct mmu_notifier_range *range)
  117 + {
  118 +     struct interval_tree_node *node;
  119 +
  120 +     node = interval_tree_iter_next(&mni->interval_tree, range->start,
  121 +                                    range->end - 1);
  122 +     if (!node)
  123 +         return NULL;
  124 +     return container_of(node, struct mmu_interval_notifier, interval_tree);
  125 + }
  126 +
  127 + static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
  128 + {
  129 +     struct mmu_interval_notifier *mni;
  130 +     struct hlist_node *next;
  131 +
  132 +     spin_lock(&mmn_mm->lock);
  133 +     if (--mmn_mm->active_invalidate_ranges ||
  134 +         !mn_itree_is_invalidating(mmn_mm)) {
  135 +         spin_unlock(&mmn_mm->lock);
  136 +         return;
  137 +     }
  138 +
  139 +     /* Make invalidate_seq even */
  140 +     mmn_mm->invalidate_seq++;
  141 +
  142 +     /*
  143 +      * The inv_end incorporates a deferred mechanism like rtnl_unlock().
  144 +      * Adds and removes are queued until the final inv_end happens then
  145 +      * they are progressed. This arrangement for tree updates is used to
  146 +      * avoid using a blocking lock during invalidate_range_start.
  147 +      */
  148 +     hlist_for_each_entry_safe(mni, next, &mmn_mm->deferred_list,
  149 +                               deferred_item) {
  150 +         if (RB_EMPTY_NODE(&mni->interval_tree.rb))
  151 +             interval_tree_insert(&mni->interval_tree,
  152 +                                  &mmn_mm->itree);
  153 +         else
  154 +             interval_tree_remove(&mni->interval_tree,
  155 +                                  &mmn_mm->itree);
  156 +         hlist_del(&mni->deferred_item);
  157 +     }
  158 +     spin_unlock(&mmn_mm->lock);
  159 +
  160 +     wake_up_all(&mmn_mm->wq);
  161 + }
  162 +
  163 + /**
  164 +  * mmu_interval_read_begin - Begin a read side critical section against a VA
  165 +  *                           range
  166 +  * mni: The range to use
  167 +  *
  168 +  * mmu_iterval_read_begin()/mmu_iterval_read_retry() implement a
  169 +  * collision-retry scheme similar to seqcount for the VA range under mni. If
  170 +  * the mm invokes invalidation during the critical section then
  171 +  * mmu_interval_read_retry() will return true.
  172 +  *
  173 +  * This is useful to obtain shadow PTEs where teardown or setup of the SPTEs
  174 +  * require a blocking context. The critical region formed by this can sleep,
  175 +  * and the required 'user_lock' can also be a sleeping lock.
  176 +  *
  177 +  * The caller is required to provide a 'user_lock' to serialize both teardown
  178 +  * and setup.
  179 +  *
  180 +  * The return value should be passed to mmu_interval_read_retry().
  181 +  */
  182 + unsigned long mmu_interval_read_begin(struct mmu_interval_notifier *mni)
  183 + {
  184 +     struct mmu_notifier_mm *mmn_mm = mni->mm->mmu_notifier_mm;
  185 +     unsigned long seq;
  186 +     bool is_invalidating;
  187 +
  188 +     /*
  189 +      * If the mni has a different seq value under the user_lock than we
  190 +      * started with then it has collided.
  191 +      *
  192 +      * If the mni currently has the same seq value as the mmn_mm seq, then
  193 +      * it is currently between invalidate_start/end and is colliding.
  194 +      *
  195 +      * The locking looks broadly like this:
  196 +      *       mn_tree_invalidate_start():       mmu_interval_read_begin():
  197 +      *                                         spin_lock
  198 +      *                                          seq = READ_ONCE(mni->invalidate_seq);
  199 +      *                                          seq == mmn_mm->invalidate_seq
  200 +      *                                         spin_unlock
  201 +      *        spin_lock
  202 +      *         seq = ++mmn_mm->invalidate_seq
  203 +      *        spin_unlock
  204 +      *        op->invalidate_range():
  205 +      *                                         user_lock
  206 +      *          mmu_interval_set_seq()
  207 +      *           mni->invalidate_seq = seq
  208 +      *                                         user_unlock
  209 +      *
  210 +      *                     [Required: mmu_interval_read_retry() == true]
  211 +      *
  212 +      *       mn_itree_inv_end():
  213 +      *        spin_lock
  214 +      *         seq = ++mmn_mm->invalidate_seq
  215 +      *        spin_unlock
  216 +      *
  217 +      *                                         user_lock
  218 +      *                                          mmu_interval_read_retry():
  219 +      *                                           mni->invalidate_seq != seq
  220 +      *                                         user_unlock
  221 +      *
  222 +      * Barriers are not needed here as any races here are closed by an
  223 +      * eventual mmu_interval_read_retry(), which provides a barrier via the
  224 +      * user_lock.
  225 +      */
  226 +     spin_lock(&mmn_mm->lock);
  227 +     /* Pairs with the WRITE_ONCE in mmu_interval_set_seq() */
  228 +     seq = READ_ONCE(mni->invalidate_seq);
  229 +     is_invalidating = seq == mmn_mm->invalidate_seq;
  230 +     spin_unlock(&mmn_mm->lock);
  231 +
  232 +     /*
  233 +      * mni->invalidate_seq must always be set to an odd value via
  234 +      * mmu_interval_set_seq() using the provided cur_seq from
  235 +      * mn_itree_inv_start_range(). This ensures that if seq does wrap we
  236 +      * will always clear the below sleep in some reasonable time as
  237 +      * mmn_mm->invalidate_seq is even in the idle state.
  238 +      */
  239 +     lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
  240 +     lock_map_release(&__mmu_notifier_invalidate_range_start_map);
  241 +     if (is_invalidating)
  242 +         wait_event(mmn_mm->wq,
  243 +                    READ_ONCE(mmn_mm->invalidate_seq) != seq);
  244 +
  245 +     /*
  246 +      * Notice that mmu_interval_read_retry() can already be true at this
  247 +      * point, avoiding loops here allows the caller to provide a global
  248 +      * time bound.
  249 +      */
  250 +
  251 +     return seq;
  252 + }
  253 + EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
  254 +
  255 + static void mn_itree_release(struct mmu_notifier_mm *mmn_mm,
  256 +                              struct mm_struct *mm)
  257 + {
  258 +     struct mmu_notifier_range range = {
  259 +         .flags = MMU_NOTIFIER_RANGE_BLOCKABLE,
  260 +         .event = MMU_NOTIFY_RELEASE,
  261 +         .mm = mm,
  262 +         .start = 0,
  263 +         .end = ULONG_MAX,
  264 +     };
  265 +     struct mmu_interval_notifier *mni;
  266 +     unsigned long cur_seq;
  267 +     bool ret;
  268 +
  269 +     for (mni = mn_itree_inv_start_range(mmn_mm, &range, &cur_seq); mni;
  270 +          mni = mn_itree_inv_next(mni, &range)) {
  271 +         ret = mni->ops->invalidate(mni, &range, cur_seq);
  272 +         WARN_ON(!ret);
  273 +     }
  274 +
  275 +     mn_itree_inv_end(mmn_mm);
  276 + }
  277 +
  278 + /*
 32 279  * This function can't run concurrently against mmu_notifier_register
 33 280  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
 34 281  * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
···
288  39  * can't go away from under us as exit_mmap holds an mm_count pin
289  40  * itself.
290  41  */
291     - void __mmu_notifier_release(struct mm_struct *mm)
     42 + static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
     43 +                              struct mm_struct *mm)
292  44  {
293  45      struct mmu_notifier *mn;
294  46      int id;
···
299  49       * ->release returns.
300  50       */
301  51      id = srcu_read_lock(&srcu);
302     -    hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
     52 +    hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
303  53          /*
304  54           * If ->release runs before mmu_notifier_unregister it must be
305  55           * handled, as it's the only way for the driver to flush all
···
309  59          if (mn->ops->release)
310  60              mn->ops->release(mn, mm);
311  61
312     -    spin_lock(&mm->mmu_notifier_mm->lock);
313     -    while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
314     -        mn = hlist_entry(mm->mmu_notifier_mm->list.first,
315     -                         struct mmu_notifier,
     62 +    spin_lock(&mmn_mm->lock);
     63 +    while (unlikely(!hlist_empty(&mmn_mm->list))) {
     64 +        mn = hlist_entry(mmn_mm->list.first, struct mmu_notifier,
316  65                           hlist);
317  66          /*
318  67           * We arrived before mmu_notifier_unregister so
···
321  72           */
322  73          hlist_del_init_rcu(&mn->hlist);
323  74      }
324     -    spin_unlock(&mm->mmu_notifier_mm->lock);
     75 +    spin_unlock(&mmn_mm->lock);
325  76      srcu_read_unlock(&srcu, id);
326  77
327  78      /*
···
334  85       * is held by exit_mmap.
335  86       */
336  87      synchronize_srcu(&srcu);
     88 + }
     89 +
     90 + void __mmu_notifier_release(struct mm_struct *mm)
     91 + {
     92 +    struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
     93 +
     94 +    if (mmn_mm->has_itree)
     95 +        mn_itree_release(mmn_mm, mm);
     96 +
     97 +    if (!hlist_empty(&mmn_mm->list))
     98 +        mn_hlist_release(mmn_mm, mm);
337  99  }
338 100
339 101  /*
···
419 159      srcu_read_unlock(&srcu, id);
420 160  }
421 161
422     - int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
    162 + static int mn_itree_invalidate(struct mmu_notifier_mm *mmn_mm,
    163 +                                const struct mmu_notifier_range *range)
    164 + {
    165 +    struct mmu_interval_notifier *mni;
    166 +    unsigned long cur_seq;
    167 +
    168 +    for (mni = mn_itree_inv_start_range(mmn_mm, range, &cur_seq); mni;
    169 +         mni = mn_itree_inv_next(mni, range)) {
    170 +        bool ret;
    171 +
    172 +        ret = mni->ops->invalidate(mni, range, cur_seq);
    173 +        if (!ret) {
    174 +            if (WARN_ON(mmu_notifier_range_blockable(range)))
    175 +                continue;
    176 +            goto out_would_block;
    177 +        }
    178 +    }
    179 +    return 0;
    180 +
    181 + out_would_block:
    182 +    /*
    183 +     * On -EAGAIN the non-blocking caller is not allowed to call
    184 +     * invalidate_range_end()
    185 +     */
    186 +    mn_itree_inv_end(mmn_mm);
    187 +    return -EAGAIN;
    188 + }
    189 +
    190 + static int mn_hlist_invalidate_range_start(struct mmu_notifier_mm *mmn_mm,
    191 +                                            struct mmu_notifier_range *range)
423 192  {
424 193      struct mmu_notifier *mn;
425 194      int ret = 0;
426 195      int id;
427 196
428 197      id = srcu_read_lock(&srcu);
429     -    hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
    198 +    hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist) {
430 199          if (mn->ops->invalidate_range_start) {
431 200              int _ret;
432 201
···
479 190      return ret;
480 191  }
481 192
482     - void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
483     -                                          bool only_end)
    193 + int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
    194 + {
    195 +    struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
    196 +    int ret;
    197 +
    198 +    if (mmn_mm->has_itree) {
    199 +        ret = mn_itree_invalidate(mmn_mm, range);
    200 +        if (ret)
    201 +            return ret;
    202 +    }
    203 +    if (!hlist_empty(&mmn_mm->list))
    204 +        return mn_hlist_invalidate_range_start(mmn_mm, range);
    205 +    return 0;
    206 + }
    207 +
    208 + static void mn_hlist_invalidate_end(struct mmu_notifier_mm *mmn_mm,
    209 +                                     struct mmu_notifier_range *range,
    210 +                                     bool only_end)
484 211  {
485 212      struct mmu_notifier *mn;
486 213      int id;
487 214
488     -    lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
489 215      id = srcu_read_lock(&srcu);
490     -    hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
    216 +    hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist) {
491 217          /*
492 218           * Call invalidate_range here too to avoid the need for the
493 219           * subsystem of having to register an invalidate_range_end
···
529 225          }
530 226      }
531 227      srcu_read_unlock(&srcu, id);
    228 + }
    229 +
    230 + void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
    231 +                                          bool only_end)
    232 + {
    233 +    struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
    234 +
    235 +    lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
    236 +    if (mmn_mm->has_itree)
    237 +        mn_itree_inv_end(mmn_mm);
    238 +
    239 +    if (!hlist_empty(&mmn_mm->list))
    240 +        mn_hlist_invalidate_end(mmn_mm, range, only_end);
532 241      lock_map_release(&__mmu_notifier_invalidate_range_start_map);
533 242  }
534 243
···
560 243  }
561 244
562 245  /*
563     - * Same as mmu_notifier_register but here the caller must hold the
564     - * mmap_sem in write mode.
    246 + * Same as mmu_notifier_register but here the caller must hold the mmap_sem in
    247 + * write mode. A NULL mn signals the notifier is being registered for itree
    248 + * mode.
565 249  */
566 250  int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
567 251  {
···
579 261          fs_reclaim_release(GFP_KERNEL);
580 262      }
581 263
582     -    mn->mm = mm;
583     -    mn->users = 1;
584     -
585 264      if (!mm->mmu_notifier_mm) {
586 265          /*
587 266           * kmalloc cannot be called under mm_take_all_locks(), but we
···
586 271           * the write side of the mmap_sem.
587 272           */
588 273          mmu_notifier_mm =
589     -            kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
    274 +            kzalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
590 275          if (!mmu_notifier_mm)
591 276              return -ENOMEM;
592 277
593 278          INIT_HLIST_HEAD(&mmu_notifier_mm->list);
594 279          spin_lock_init(&mmu_notifier_mm->lock);
    280 +        mmu_notifier_mm->invalidate_seq = 2;
    281 +        mmu_notifier_mm->itree = RB_ROOT_CACHED;
    282 +        init_waitqueue_head(&mmu_notifier_mm->wq);
    283 +        INIT_HLIST_HEAD(&mmu_notifier_mm->deferred_list);
595 284      }
596 285
597 286      ret = mm_take_all_locks(mm);
598 287      if (unlikely(ret))
599 288          goto out_clean;
600     -
601     -    /* Pairs with the mmdrop in mmu_notifier_unregister_* */
602     -    mmgrab(mm);
603 289
604 290      /*
605 291       * Serialize the update against mmu_notifier_unregister. A
···
609 293       * current->mm or explicitly with get_task_mm() or similar).
610 294       * We can't race against any other mmu notifier method either
611 295       * thanks to mm_take_all_locks().
    296 +     *
    297 +     * release semantics on the initialization of the mmu_notifier_mm's
    298 +     * contents are provided for unlocked readers. acquire can only be
    299 +     * used while holding the mmgrab or mmget, and is safe because once
    300 +     * created the mmu_notififer_mm is not freed until the mm is
    301 +     * destroyed. As above, users holding the mmap_sem or one of the
    302 +     * mm_take_all_locks() do not need to use acquire semantics.
612 303       */
613 304      if (mmu_notifier_mm)
614     -        mm->mmu_notifier_mm = mmu_notifier_mm;
    305 +        smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);
615 306
616     -    spin_lock(&mm->mmu_notifier_mm->lock);
617     -    hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
618     -    spin_unlock(&mm->mmu_notifier_mm->lock);
    307 +    if (mn) {
    308 +        /* Pairs with the mmdrop in mmu_notifier_unregister_* */
    309 +        mmgrab(mm);
    310 +        mn->mm = mm;
    311 +        mn->users = 1;
    312 +
    313 +        spin_lock(&mm->mmu_notifier_mm->lock);
    314 +        hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
    315 +        spin_unlock(&mm->mmu_notifier_mm->lock);
    316 +    } else
    317 +        mm->mmu_notifier_mm->has_itree = true;
619 318
620 319      mm_drop_all_locks(mm);
621 320      BUG_ON(atomic_read(&mm->mm_users) <= 0);
···
846 515      spin_unlock(&mm->mmu_notifier_mm->lock);
847 516  }
848 517  EXPORT_SYMBOL_GPL(mmu_notifier_put);
    518 +
    519 + static int __mmu_interval_notifier_insert(
    520 +    struct mmu_interval_notifier *mni, struct mm_struct *mm,
    521 +    struct mmu_notifier_mm *mmn_mm, unsigned long start,
    522 +    unsigned long length, const struct mmu_interval_notifier_ops *ops)
    523 + {
    524 +    mni->mm = mm;
    525 +    mni->ops = ops;
    526 +    RB_CLEAR_NODE(&mni->interval_tree.rb);
    527 +    mni->interval_tree.start = start;
    528 +    /*
    529 +     * Note that the representation of the intervals in the interval tree
    530 +     * considers the ending point as contained in the interval.
    531 +     */
    532 +    if (length == 0 ||
    533 +        check_add_overflow(start, length - 1, &mni->interval_tree.last))
    534 +        return -EOVERFLOW;
    535 +
    536 +    /* Must call with a mmget() held */
    537 +    if (WARN_ON(atomic_read(&mm->mm_count) <= 0))
    538 +        return -EINVAL;
    539 +
    540 +    /* pairs with mmdrop in mmu_interval_notifier_remove() */
    541 +    mmgrab(mm);
    542 +
    543 +    /*
    544 +     * If some invalidate_range_start/end region is going on in parallel
    545 +     * we don't know what VA ranges are affected, so we must assume this
    546 +     * new range is included.
    547 +     *
    548 +     * If the itree is invalidating then we are not allowed to change
    549 +     * it. Retrying until invalidation is done is tricky due to the
    550 +     * possibility for live lock, instead defer the add to
    551 +     * mn_itree_inv_end() so this algorithm is deterministic.
    552 +     *
    553 +     * In all cases the value for the mni->invalidate_seq should be
    554 +     * odd, see mmu_interval_read_begin()
    555 +     */
    556 +    spin_lock(&mmn_mm->lock);
    557 +    if (mmn_mm->active_invalidate_ranges) {
    558 +        if (mn_itree_is_invalidating(mmn_mm))
    559 +            hlist_add_head(&mni->deferred_item,
    560 +                           &mmn_mm->deferred_list);
    561 +        else {
    562 +            mmn_mm->invalidate_seq |= 1;
    563 +            interval_tree_insert(&mni->interval_tree,
    564 +                                 &mmn_mm->itree);
    565 +        }
    566 +        mni->invalidate_seq = mmn_mm->invalidate_seq;
    567 +    } else {
    568 +        WARN_ON(mn_itree_is_invalidating(mmn_mm));
    569 +        /*
    570 +         * The starting seq for a mni not under invalidation should be
    571 +         * odd, not equal to the current invalidate_seq and
    572 +         * invalidate_seq should not 'wrap' to the new seq any time
    573 +         * soon.
    574 +         */
    575 +        mni->invalidate_seq = mmn_mm->invalidate_seq - 1;
    576 +        interval_tree_insert(&mni->interval_tree, &mmn_mm->itree);
    577 +    }
    578 +    spin_unlock(&mmn_mm->lock);
    579 +    return 0;
    580 + }
    581 +
    582 + /**
    583 +  * mmu_interval_notifier_insert - Insert an interval notifier
    584 +  * @mni: Interval notifier to register
    585 +  * @start: Starting virtual address to monitor
    586 +  * @length: Length of the range to monitor
    587 +  * @mm : mm_struct to attach to
    588 +  *
    589 +  * This function subscribes the interval notifier for notifications from the
    590 +  * mm. Upon return the ops related to mmu_interval_notifier will be called
    591 +  * whenever an event that intersects with the given range occurs.
    592 +  *
    593 +  * Upon return the range_notifier may not be present in the interval tree yet.
    594 +  * The caller must use the normal interval notifier read flow via
    595 +  * mmu_interval_read_begin() to establish SPTEs for this range.
    596 +  */
    597 + int mmu_interval_notifier_insert(struct mmu_interval_notifier *mni,
    598 +                                  struct mm_struct *mm, unsigned long start,
    599 +                                  unsigned long length,
    600 +                                  const struct mmu_interval_notifier_ops *ops)
    601 + {
    602 +    struct mmu_notifier_mm *mmn_mm;
    603 +    int ret;
    604 +
    605 +    might_lock(&mm->mmap_sem);
    606 +
    607 +    mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
    608 +    if (!mmn_mm || !mmn_mm->has_itree) {
    609 +        ret = mmu_notifier_register(NULL, mm);
    610 +        if (ret)
    611 +            return ret;
    612 +        mmn_mm = mm->mmu_notifier_mm;
    613 +    }
    614 +    return __mmu_interval_notifier_insert(mni, mm, mmn_mm, start, length,
    615 +                                          ops);
    616 + }
    617 + EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert);
    618 +
    619 + int mmu_interval_notifier_insert_locked(
    620 +    struct mmu_interval_notifier *mni, struct mm_struct *mm,
    621 +    unsigned long start, unsigned long length,
    622 +    const struct mmu_interval_notifier_ops *ops)
    623 + {
    624 +    struct mmu_notifier_mm *mmn_mm;
    625 +    int ret;
    626 +
    627 +    lockdep_assert_held_write(&mm->mmap_sem);
    628 +
    629 +    mmn_mm = mm->mmu_notifier_mm;
    630 +    if (!mmn_mm || !mmn_mm->has_itree) {
    631 +        ret = __mmu_notifier_register(NULL, mm);
    632 +        if (ret)
    633 +            return ret;
    634 +        mmn_mm = mm->mmu_notifier_mm;
    635 +    }
    636 +    return __mmu_interval_notifier_insert(mni, mm, mmn_mm, start, length,
    637 +                                          ops);
    638 + }
    639 + EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
    640 +
    641 + /**
    642 +  * mmu_interval_notifier_remove - Remove a interval notifier
    643 +  * @mni: Interval notifier to unregister
    644 +  *
    645 +  * This function must be paired with mmu_interval_notifier_insert(). It cannot
    646 +  * be called from any ops callback.
    647 +  *
    648 +  * Once this returns ops callbacks are no longer running on other CPUs and
    649 +  * will not be called in future.
    650 +  */
    651 + void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni)
    652 + {
    653 +    struct mm_struct *mm = mni->mm;
    654 +    struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
    655 +    unsigned long seq = 0;
    656 +
    657 +    might_sleep();
    658 +
    659 +    spin_lock(&mmn_mm->lock);
    660 +    if (mn_itree_is_invalidating(mmn_mm)) {
    661 +        /*
    662 +         * remove is being called after insert put this on the
    663 +         * deferred list, but before the deferred list was processed.
    664 +         */
    665 +        if (RB_EMPTY_NODE(&mni->interval_tree.rb)) {
    666 +            hlist_del(&mni->deferred_item);
    667 +        } else {
    668 +            hlist_add_head(&mni->deferred_item,
    669 +                           &mmn_mm->deferred_list);
    670 +            seq = mmn_mm->invalidate_seq;
    671 +        }
    672 +    } else {
    673 +        WARN_ON(RB_EMPTY_NODE(&mni->interval_tree.rb));
    674 +        interval_tree_remove(&mni->interval_tree, &mmn_mm->itree);
    675 +    }
    676 +    spin_unlock(&mmn_mm->lock);
    677 +
    678 +    /*
    679 +     * The possible sleep on progress in the invalidation requires the
    680 +     * caller not hold any locks held by invalidation callbacks.
    681 +     */
    682 +    lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
    683 +    lock_map_release(&__mmu_notifier_invalidate_range_start_map);
    684 +    if (seq)
    685 +        wait_event(mmn_mm->wq,
    686 +                   READ_ONCE(mmn_mm->invalidate_seq) != seq);
    687 +
    688 +    /* pairs with mmgrab in mmu_interval_notifier_insert() */
    689 +    mmdrop(mm);
    690 + }
    691 + EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove);
849 692
850 693  /**
851 694  * mmu_notifier_synchronize - Ensure all mmu_notifiers are freed