
mm: rewrite vmap layer

Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).

The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the TLB. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.

Another problem is that the entire vmap subsystem works under a single
lock. It is an rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.

This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.

The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the allocator simply does not hand those
addresses out again until a subsequent TLB flush. A single TLB flush can
then cover the vunmaps accumulated on every CPU.

Xen and PAT do not like deferred TLB flushing because they cannot always
handle multiple aliasing virtual addresses mapping to the same physical
address. They now call vm_unmap_aliases() in order to flush any deferred
mappings. That call is expensive (well, actually not a lot more expensive
than a single vunmap under the old scheme), but it should be OK as long as
it is not called too often.

The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
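The lookup that benefits is "which extent contains this address". A minimal
sketch of that search, using a plain unbalanced BST in place of the kernel's
rbtree (the struct fields follow the patch's vmap_area, but insert/find here
are simplified stand-ins for __insert_vmap_area/__find_vmap_area):

```c
#include <assert.h>
#include <stddef.h>

struct vmap_area {
	unsigned long va_start, va_end;	/* [va_start, va_end) */
	struct vmap_area *left, *right;
};

static struct vmap_area *insert(struct vmap_area *root, struct vmap_area *va)
{
	if (!root)
		return va;
	/* extents never overlap, so comparing start addresses suffices */
	if (va->va_start < root->va_start)
		root->left = insert(root->left, va);
	else
		root->right = insert(root->right, va);
	return root;
}

/* O(log n) with a balanced tree, vs O(n) for the old vmlist walk */
static struct vmap_area *find(struct vmap_area *root, unsigned long addr)
{
	while (root) {
		if (addr < root->va_start)
			root = root->left;
		else if (addr >= root->va_end)
			root = root->right;
		else
			return root;
	}
	return NULL;
}
```

The same address-sorted ordering also lets the allocator walk forward from a
candidate hole to find free space, which is what alloc_vmap_area does.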

There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.

To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. vmalloc does not use these
interfaces at the moment, so it will not be quite as scalable (although it
will still use lazy TLB flushing).

As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping, then touching, then unmapping 4 pages. Different numbers
of threads were run in parallel on a 4-core, 2-socket Opteron. Results are
in nanoseconds per map+touch+unmap.

threads  vanilla  vmap rewrite
      1    14700          2900
      2    33600          3000
      4    49500          2800
      8    70631          2900

So with 8 threads, the rewritten version is already about 25x faster.

In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.

Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229

After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664

vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.

[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by Nick Piggin, committed by Linus Torvalds
db64fe02 cb8f488c
6 files changed, 862 insertions(+), 134 deletions(-)

arch/x86/mm/pageattr.c (+2):
 	/* Must avoid aliasing mappings in the highmem code */
 	kmap_flush_unused();
 
+	vm_unmap_aliases();
+
 	cpa.vaddr = addr;
 	cpa.numpages = numpages;
 	cpa.mask_set = mask_set;
arch/x86/xen/enlighten.c (+1):
 		/* make sure there are no stray mappings of
 		   this page */
 		kmap_flush_unused();
+		vm_unmap_aliases();
 	}
 }
arch/x86/xen/mmu.c (+1):
 	/* re-enable interrupts for kmap_flush_unused */
 	xen_mc_issue(0);
 	kmap_flush_unused();
+	vm_unmap_aliases();
 	xen_mc_batch();
 }
include/linux/vmalloc.h (+14 -1):
 #define _LINUX_VMALLOC_H
 
 #include <linux/spinlock.h>
+#include <linux/init.h>
 #include <asm/page.h>		/* pgprot_t */
 
 struct vm_area_struct;		/* vma defining user mapping in mm_types.h */
···
 #endif
 
 struct vm_struct {
-	/* keep next,addr,size together to speedup lookups */
 	struct vm_struct *next;
 	void *addr;
 	unsigned long size;
···
 /*
  * Highlevel APIs for driver use
  */
+extern void vm_unmap_ram(const void *mem, unsigned int count);
+extern void *vm_map_ram(struct page **pages, unsigned int count,
+				int node, pgprot_t prot);
+extern void vm_unmap_aliases(void);
+
+#ifdef CONFIG_MMU
+extern void __init vmalloc_init(void);
+#else
+static inline void vmalloc_init(void)
+{
+}
+#endif
+
 extern void *vmalloc(unsigned long size);
 extern void *vmalloc_user(unsigned long size);
 extern void *vmalloc_node(unsigned long size, int node);
init/main.c (+2):
 #include <linux/gfp.h>
 #include <linux/percpu.h>
 #include <linux/kmod.h>
+#include <linux/vmalloc.h>
 #include <linux/kernel_stat.h>
 #include <linux/start_kernel.h>
 #include <linux/security.h>
···
 		initrd_start = 0;
 	}
 #endif
+	vmalloc_init();
 	vfs_caches_init_early();
 	cpuset_init_early();
 	mem_init();
mm/vmalloc.c (+842 -133): the rewrite itself. The page-table helpers
(vmap_pte_range etc.) now take a flat pages[] array with a running index;
the vmlist rwlock is replaced by a spinlock-protected rbtree of vmap_area
structs with RCU-freed nodes and lazy purging (__purge_vmap_area_lazy);
and a per-CPU vmap block allocator (vb_alloc/vb_free, indexed by a radix
tree) backs the new vm_map_ram, vm_unmap_ram and vm_unmap_aliases
interfaces, with unmap_kernel_range and map_vm_area reimplemented on top.
+EXPORT_SYMBOL_GPL(map_vm_area);
+
+/*** Old vmalloc interfaces ***/
+DEFINE_RWLOCK(vmlist_lock);
+struct vm_struct *vmlist;
+
+static struct vm_struct *__get_vm_area_node(unsigned long size,
+		unsigned long flags, unsigned long start, unsigned long end,
+		int node, gfp_t gfp_mask, void *caller)
+{
+	struct vmap_area *va;
+	struct vm_struct *area;
+	struct vm_struct *tmp, **p;
+	unsigned long align = 1;
 
 	BUG_ON(in_interrupt());
 	if (flags & VM_IOREMAP) {
···
 		align = 1ul << bit;
 	}
-	addr = ALIGN(start, align);
+
 	size = PAGE_ALIGN(size);
 	if (unlikely(!size))
 		return NULL;
 
 	area = kmalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
-
 	if (unlikely(!area))
 		return NULL;
 
···
 	 */
 	size += PAGE_SIZE;
 
-	write_lock(&vmlist_lock);
-	for (p = &vmlist; (tmp = *p) != NULL ;p = &tmp->next) {
-		if ((unsigned long)tmp->addr < addr) {
-			if((unsigned long)tmp->addr + tmp->size >= addr)
-				addr = ALIGN(tmp->size +
-					     (unsigned long)tmp->addr, align);
-			continue;
-		}
-		if ((size + addr) < addr)
-			goto out;
-		if (size + addr <= (unsigned long)tmp->addr)
-			goto found;
-		addr = ALIGN(tmp->size + (unsigned long)tmp->addr, align);
-		if (addr > end - size)
-			goto out;
+	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
+	if (IS_ERR(va)) {
+		kfree(area);
+		return NULL;
 	}
-	if ((size + addr) < addr)
-		goto out;
-	if (addr > end - size)
-		goto out;
-
-found:
-	area->next = *p;
-	*p = area;
 
 	area->flags = flags;
-	area->addr = (void *)addr;
+	area->addr = (void *)va->va_start;
 	area->size = size;
 	area->pages = NULL;
 	area->nr_pages = 0;
 	area->phys_addr = 0;
 	area->caller = caller;
+	va->private = area;
+	va->flags |= VM_VM_AREA;
+
+	write_lock(&vmlist_lock);
+	for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
+		if (tmp->addr >= area->addr)
+			break;
+	}
+	area->next = *p;
+	*p = area;
 	write_unlock(&vmlist_lock);
 
 	return area;
-
-out:
-	write_unlock(&vmlist_lock);
-	kfree(area);
-	if (printk_ratelimit())
-		printk(KERN_WARNING "allocation failed: out of vmalloc space - use vmalloc=<size> to increase size.\n");
-	return NULL;
 }
 
 struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
···
 			  gfp_mask, __builtin_return_address(0));
 }
 
-/* Caller must hold vmlist_lock */
-static struct vm_struct *__find_vm_area(const void *addr)
+static struct vm_struct *find_vm_area(const void *addr)
 {
-	struct vm_struct *tmp;
+	struct vmap_area *va;
 
-	for (tmp = vmlist; tmp != NULL; tmp = tmp->next) {
-		if (tmp->addr == addr)
-			break;
-	}
+	va = find_vmap_area((unsigned long)addr);
+	if (va && va->flags & VM_VM_AREA)
+		return va->private;
 
-	return tmp;
-}
-
-/* Caller must hold vmlist_lock */
-static struct vm_struct *__remove_vm_area(const void *addr)
-{
-	struct vm_struct **p, *tmp;
-
-	for (p = &vmlist ; (tmp = *p) != NULL ;p = &tmp->next) {
-		if (tmp->addr == addr)
-			goto found;
-	}
 	return NULL;
-
-found:
-	unmap_vm_area(tmp);
-	*p = tmp->next;
-
-	/*
-	 * Remove the guard page.
-	 */
-	tmp->size -= PAGE_SIZE;
-	return tmp;
 }
 
 /**
···
 */
 struct vm_struct *remove_vm_area(const void *addr)
 {
-	struct vm_struct *v;
-	write_lock(&vmlist_lock);
-	v = __remove_vm_area(addr);
-	write_unlock(&vmlist_lock);
-	return v;
+	struct vmap_area *va;
+
+	va = find_vmap_area((unsigned long)addr);
+	if (va && va->flags & VM_VM_AREA) {
+		struct vm_struct *vm = va->private;
+		struct vm_struct *tmp, **p;
+		free_unmap_vmap_area(va);
+		vm->size -= PAGE_SIZE;
+
+		write_lock(&vmlist_lock);
+		for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next)
+			;
+		*p = tmp->next;
+		write_unlock(&vmlist_lock);
+
+		return vm;
+	}
+	return NULL;
 }
 
 static void __vunmap(const void *addr, int deallocate_pages)
···
 }
 EXPORT_SYMBOL(vmap);
 
+static void *__vmalloc_node(unsigned long size, gfp_t gfp_mask, pgprot_t prot,
+			    int node, void *caller);
 static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 				 pgprot_t prot, int node, void *caller)
 {
···
 
 	ret = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 	if (ret) {
-		write_lock(&vmlist_lock);
-		area = __find_vm_area(ret);
+		area = find_vm_area(ret);
 		area->flags |= VM_USERMAP;
-		write_unlock(&vmlist_lock);
 	}
 	return ret;
 }
···
 
 	ret = __vmalloc(size, GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL);
 	if (ret) {
-		write_lock(&vmlist_lock);
-		area = __find_vm_area(ret);
+		area = find_vm_area(ret);
 		area->flags |= VM_USERMAP;
-		write_unlock(&vmlist_lock);
 	}
 	return ret;
 }
···
 	struct vm_struct *area;
 	unsigned long uaddr = vma->vm_start;
 	unsigned long usize = vma->vm_end - vma->vm_start;
-	int ret;
 
 	if ((PAGE_SIZE-1) & (unsigned long)addr)
 		return -EINVAL;
 
-	read_lock(&vmlist_lock);
-	area = __find_vm_area(addr);
+	area = find_vm_area(addr);
 	if (!area)
-		goto out_einval_locked;
+		return -EINVAL;
 
 	if (!(area->flags & VM_USERMAP))
-		goto out_einval_locked;
+		return -EINVAL;
 
 	if (usize + (pgoff << PAGE_SHIFT) > area->size - PAGE_SIZE)
-		goto out_einval_locked;
-	read_unlock(&vmlist_lock);
+		return -EINVAL;
 
 	addr += pgoff << PAGE_SHIFT;
 	do {
 		struct page *page = vmalloc_to_page(addr);
+		int ret;
+
 		ret = vm_insert_page(vma, uaddr, page);
 		if (ret)
 			return ret;
···
 	/* Prevent "things" like memory migration? VM_flags need a cleanup... */
 	vma->vm_flags |= VM_RESERVED;
 
-	return ret;
-
-out_einval_locked:
-	read_unlock(&vmlist_lock);
-	return -EINVAL;
+	return 0;
 }
 EXPORT_SYMBOL(remap_vmalloc_range);