Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (Andrew's patch-bomb)

Merge patches from Andrew Morton:
"A few misc things and very nearly all of the MM tree. A tremendous
amount of stuff (again), including a significant rbtree library
rework."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (160 commits)
sparc64: Support transparent huge pages.
mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd().
mm: Add and use update_mmu_cache_pmd() in transparent huge page code.
sparc64: Document PGD and PMD layout.
sparc64: Eliminate PTE table memory wastage.
sparc64: Halve the size of PTE tables
sparc64: Only support 4MB huge pages and 8KB base pages.
memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning
mm: memcg: clean up mm_match_cgroup() signature
mm: document PageHuge somewhat
mm: use %pK for /proc/vmallocinfo
mm, thp: fix mlock statistics
mm, thp: fix mapped pages avoiding unevictable list on mlock
memory-hotplug: update memory block's state and notify userspace
memory-hotplug: preparation to notify memory block's state at memory hot remove
mm: avoid section mismatch warning for memblock_type_name
make GFP_NOTRACK definition unconditional
cma: decrease cc.nr_migratepages after reclaiming pagelist
CMA: migrate mlocked pages
kpageflags: fix wrong KPF_THP on non-huge compound pages
...

+5013 -3577
-2
Documentation/00-INDEX
··· 270 270 - info on locking under a preemptive kernel. 271 271 printk-formats.txt 272 272 - how to get printk format specifiers right 273 - prio_tree.txt 274 - - info on radix-priority-search-tree use for indexing vmas. 275 273 ramoops.txt 276 274 - documentation of the ramoops oops/panic logging module. 277 275 rbtree.txt
-22
Documentation/ABI/obsolete/proc-pid-oom_adj
··· 1 - What: /proc/<pid>/oom_adj 2 - When: August 2012 3 - Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's 4 - badness heuristic used to determine which task to kill when the kernel 5 - is out of memory. 6 - 7 - The badness heuristic has since been rewritten since the introduction of 8 - this tunable such that its meaning is deprecated. The value was 9 - implemented as a bitshift on a score generated by the badness() 10 - function that did not have any precise units of measure. With the 11 - rewrite, the score is given as a proportion of available memory to the 12 - task allocating pages, so using a bitshift which grows the score 13 - exponentially is, thus, impossible to tune with fine granularity. 14 - 15 - A much more powerful interface, /proc/<pid>/oom_score_adj, was 16 - introduced with the oom killer rewrite that allows users to increase or 17 - decrease the badness score linearly. This interface will replace 18 - /proc/<pid>/oom_adj. 19 - 20 - A warning will be emitted to the kernel log if an application uses this 21 - deprecated interface. After it is printed once, future warnings will be 22 - suppressed until the kernel is rebooted.
+45 -45
Documentation/cgroups/memory.txt
··· 18 18 uses of the memory controller. The memory controller can be used to 19 19 20 20 a. Isolate an application or a group of applications 21 - Memory hungry applications can be isolated and limited to a smaller 21 + Memory-hungry applications can be isolated and limited to a smaller 22 22 amount of memory. 23 - b. Create a cgroup with limited amount of memory, this can be used 23 + b. Create a cgroup with a limited amount of memory; this can be used 24 24 as a good alternative to booting with mem=XXXX. 25 25 c. Virtualization solutions can control the amount of memory they want 26 26 to assign to a virtual machine instance. 27 27 d. A CD/DVD burner could control the amount of memory used by the 28 28 rest of the system to ensure that burning does not fail due to lack 29 29 of available memory. 30 - e. There are several other use cases, find one or use the controller just 30 + e. There are several other use cases; find one or use the controller just 31 31 for fun (to learn and hack on the VM subsystem). 32 32 33 33 Current Status: linux-2.6.34-mmotm(development version of 2010/April) ··· 38 38 - optionally, memory+swap usage can be accounted and limited. 39 39 - hierarchical accounting 40 40 - soft limit 41 - - moving(recharging) account at moving a task is selectable. 41 + - moving (recharging) account at moving a task is selectable. 42 42 - usage threshold notifier 43 43 - oom-killer disable knob and oom-notifier 44 44 - Root cgroup has no limit controls. 45 45 46 - Kernel memory support is work in progress, and the current version provides 46 + Kernel memory support is a work in progress, and the current version provides 47 47 basic functionality. (See Section 2.7) 48 48 49 49 Brief summary of control files. ··· 144 144 3.
Each page has a pointer to the page_cgroup, which in turn knows the 145 145 cgroup it belongs to 146 146 147 - The accounting is done as follows: mem_cgroup_charge() is invoked to setup 147 + The accounting is done as follows: mem_cgroup_charge() is invoked to set up 148 148 the necessary data structures and check if the cgroup that is being charged 149 - is over its limit. If it is then reclaim is invoked on the cgroup. 149 + is over its limit. If it is, then reclaim is invoked on the cgroup. 150 150 More details can be found in the reclaim section of this document. 151 151 If everything goes well, a page meta-data-structure called page_cgroup is 152 152 updated. page_cgroup has its own LRU on cgroup. ··· 163 163 inserted into inode (radix-tree). While it's mapped into the page tables of 164 164 processes, duplicate accounting is carefully avoided. 165 165 166 - A RSS page is unaccounted when it's fully unmapped. A PageCache page is 166 + An RSS page is unaccounted when it's fully unmapped. A PageCache page is 167 167 unaccounted when it's removed from radix-tree. Even if RSS pages are fully 168 168 unmapped (by kswapd), they may exist as SwapCache in the system until they 169 - are really freed. Such SwapCaches also also accounted. 169 + are really freed. Such SwapCaches are also accounted. 170 170 A swapped-in page is not accounted until it's mapped. 171 171 172 - Note: The kernel does swapin-readahead and read multiple swaps at once. 172 + Note: The kernel does swapin-readahead and reads multiple swaps at once. 173 173 This means swapped-in pages may contain pages for other tasks than a task 174 174 causing page fault. So, we avoid accounting at swap-in I/O. 175 175 ··· 209 209 Example: Assume a system with 4G of swap. A task which allocates 6G of memory 210 210 (by mistake) under 2G memory limitation will use all swap. 211 211 In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. 
212 - By using memsw limit, you can avoid system OOM which can be caused by swap 212 + By using the memsw limit, you can avoid system OOM which can be caused by swap 213 213 shortage. 214 214 215 215 * why 'memory+swap' rather than swap. ··· 217 217 to move account from memory to swap...there is no change in usage of 218 218 memory+swap. In other words, when we want to limit the usage of swap without 219 219 affecting global LRU, memory+swap limit is better than just limiting swap from 220 - OS point of view. 220 + an OS point of view. 221 221 222 222 * What happens when a cgroup hits memory.memsw.limit_in_bytes 223 223 When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out ··· 236 236 cgroup. (See 10. OOM Control below.) 237 237 238 238 The reclaim algorithm has not been modified for cgroups, except that 239 - pages that are selected for reclaiming come from the per cgroup LRU 239 + pages that are selected for reclaiming come from the per-cgroup LRU 240 240 list. 241 241 242 242 NOTE: Reclaim does not work for the root cgroup, since we cannot set any ··· 316 316 # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes 317 317 1216512 318 318 319 - A successful write to this file does not guarantee a successful set of 319 + A successful write to this file does not guarantee a successful setting of 320 320 this limit to the value written into the file. This can be due to a 321 321 number of factors, such as rounding up to page boundaries or the total 322 322 availability of memory on the system. The user is required to re-read ··· 350 350 4.1 Troubleshooting 351 351 352 352 Sometimes a user might find that the application under a cgroup is 353 - terminated by OOM killer. There are several causes for this: 353 + terminated by the OOM killer. There are several causes for this: 354 354 355 355 1. The cgroup limit is too low (just too low to do anything useful) 356 356 2. 
The user is using anonymous memory and swap is turned off or too low ··· 358 358 A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of 359 359 some of the pages cached in the cgroup (page cache pages). 360 360 361 - To know what happens, disable OOM_Kill by 10. OOM Control(see below) and 361 + To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and 362 362 seeing what happens will be helpful. 363 363 364 364 4.2 Task migration ··· 399 399 400 400 Almost all pages tracked by this memory cgroup will be unmapped and freed. 401 401 Some pages cannot be freed because they are locked or in-use. Such pages are 402 - moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this 402 + moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this 403 403 cgroup will be empty. 404 404 405 - Typical use case of this interface is that calling this before rmdir(). 405 + The typical use case for this interface is before calling rmdir(). 406 406 Because rmdir() moves all pages to parent, some out-of-use page caches can be 407 407 moved to the parent. If you want to avoid that, force_empty will be useful. 408 408 ··· 486 486 487 487 For efficiency, as other kernel components, memory cgroup uses some optimization 488 488 to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the 489 - method and doesn't show 'exact' value of memory(and swap) usage, it's an fuzz 489 + method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz 490 490 value for efficient access. (Of course, when necessary, it's synchronized.) 491 491 If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) 492 492 value in memory.stat(see 5.2). ··· 496 496 This is similar to numa_maps but operates on a per-memcg basis. 
This is 497 497 useful for providing visibility into the numa locality information within 498 498 a memcg since the pages are allowed to be allocated from any physical 499 - node. One of the usecases is evaluating application performance by 500 - combining this information with the application's cpu allocation. 499 + node. One of the use cases is evaluating application performance by 500 + combining this information with the application's CPU allocation. 501 501 502 502 We export "total", "file", "anon" and "unevictable" pages per-node for 503 503 each memcg. The output format of memory.numa_stat is: ··· 561 561 group is very high, they are pushed back as much as possible to make 562 562 sure that one control group does not starve the others of memory. 563 563 564 - Please note that soft limits is a best effort feature, it comes with 564 + Please note that soft limits are a best-effort feature; it comes with 565 565 no guarantees, but it does its best to make sure that when memory is 566 566 heavily contended for, memory is allocated based on the soft limit 567 - hints/setup. Currently soft limit based reclaim is setup such that 567 + hints/setup. Currently soft limit based reclaim is set up such that 568 568 it gets invoked from balance_pgdat (kswapd). 569 569 570 570 7.1 Interface ··· 592 592 593 593 8.1 Interface 594 594 595 - This feature is disabled by default. It can be enabled(and disabled again) by 595 + This feature is disabled by default. It can be enabled (and disabled again) by 596 596 writing to memory.move_charge_at_immigrate of the destination cgroup. 597 597 598 598 If you want to enable it: ··· 601 601 602 602 Note: Each bits of move_charge_at_immigrate has its own meaning about what type 603 603 of charges should be moved. See 8.2 for details. 604 - Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread 605 - group.
604 + Note: Charges are moved only when you move mm->owner, in other words, 605 + a leader of a thread group. 606 606 Note: If we cannot find enough space for the task in the destination cgroup, we 607 607 try to make space by reclaiming memory. Task migration may fail if we 608 608 cannot make enough space. ··· 612 612 613 613 # echo 0 > memory.move_charge_at_immigrate 614 614 615 - 8.2 Type of charges which can be move 615 + 8.2 Type of charges which can be moved 616 616 617 - Each bits of move_charge_at_immigrate has its own meaning about what type of 618 - charges should be moved. But in any cases, it must be noted that an account of 619 - a page or a swap can be moved only when it is charged to the task's current(old) 620 - memory cgroup. 617 + Each bit in move_charge_at_immigrate has its own meaning about what type of 618 + charges should be moved. But in any case, it must be noted that an account of 619 + a page or a swap can be moved only when it is charged to the task's current 620 + (old) memory cgroup. 621 621 622 622 bit | what type of charges would be moved ? 623 623 -----+------------------------------------------------------------------------ 624 - 0 | A charge of an anonymous page(or swap of it) used by the target task. 625 - | You must enable Swap Extension(see 2.4) to enable move of swap charges. 624 + 0 | A charge of an anonymous page (or swap of it) used by the target task. 625 + | You must enable Swap Extension (see 2.4) to enable move of swap charges. 626 626 -----+------------------------------------------------------------------------ 627 - 1 | A charge of file pages(normal file, tmpfs file(e.g. ipc shared memory) 627 + 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) 628 628 | and swaps of tmpfs file) mmapped by the target task. 
Unlike the case of 629 - | anonymous pages, file pages(and swaps) in the range mmapped by the task 629 + | anonymous pages, file pages (and swaps) in the range mmapped by the task 630 630 | will be moved even if the task hasn't done page fault, i.e. they might 631 631 | not be the task's "RSS", but other task's "RSS" that maps the same file. 632 - | And mapcount of the page is ignored(the page can be moved even if 633 - | page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to 632 + | And mapcount of the page is ignored (the page can be moved even if 633 + | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to 634 634 | enable move of swap charges. 635 635 636 636 8.3 TODO ··· 640 640 641 641 9. Memory thresholds 642 642 643 - Memory cgroup implements memory thresholds using cgroups notification 643 + Memory cgroup implements memory thresholds using the cgroups notification 644 644 API (see cgroups.txt). It allows to register multiple memory and memsw 645 645 thresholds and gets notifications when it crosses. 646 646 647 - To register a threshold application need: 647 + To register a threshold, an application must: 648 648 - create an eventfd using eventfd(2); 649 649 - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; 650 650 - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to ··· 659 659 660 660 memory.oom_control file is for OOM notification and other controls. 661 661 662 - Memory cgroup implements OOM notifier using cgroup notification 662 + Memory cgroup implements OOM notifier using the cgroup notification 663 663 API (See cgroups.txt). It allows to register multiple OOM notification 664 664 delivery and gets notification when OOM happens. 
665 665 666 - To register a notifier, application need: 666 + To register a notifier, an application must: 667 667 - create an eventfd using eventfd(2) 668 668 - open memory.oom_control file 669 669 - write string like "<event_fd> <fd of memory.oom_control>" to 670 670 cgroup.event_control 671 671 672 - Application will be notified through eventfd when OOM happens. 673 - OOM notification doesn't work for root cgroup. 672 + The application will be notified through eventfd when OOM happens. 673 + OOM notification doesn't work for the root cgroup. 674 674 675 - You can disable OOM-killer by writing "1" to memory.oom_control file, as: 675 + You can disable the OOM-killer by writing "1" to memory.oom_control file, as: 676 676 677 677 #echo 1 > memory.oom_control 678 678 679 - This operation is only allowed to the top cgroup of sub-hierarchy. 679 + This operation is only allowed to the top cgroup of a sub-hierarchy. 680 680 If OOM-killer is disabled, tasks under cgroup will hang/sleep 681 681 in memory cgroup's OOM-waitqueue when they request accountable memory. 682 682
+4 -18
Documentation/filesystems/proc.txt
··· 33 33 2 Modifying System Parameters 34 34 35 35 3 Per-Process Parameters 36 - 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer 36 + 3.1 /proc/<pid>/oom_score_adj - Adjust the oom-killer 37 37 score 38 38 3.2 /proc/<pid>/oom_score - Display current oom-killer score 39 39 3.3 /proc/<pid>/io - Display the IO accounting fields ··· 1320 1320 CHAPTER 3: PER-PROCESS PARAMETERS 1321 1321 ------------------------------------------------------------------------------ 1322 1322 1323 - 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score 1323 + 3.1 /proc/<pid>/oom_score_adj - Adjust the oom-killer score 1324 1324 -------------------------------------------------------------------------------- 1325 1325 1326 - These file can be used to adjust the badness heuristic used to select which 1326 + This file can be used to adjust the badness heuristic used to select which 1327 1327 process gets killed in out of memory conditions. 1328 1328 1329 1329 The badness heuristic assigns a value to each candidate task ranging from 0 ··· 1361 1361 equivalent to discounting 50% of the task's allowed memory from being considered 1362 1362 as scoring against the task. 1363 1363 1364 - For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also 1365 - be used to tune the badness score. Its acceptable values range from -16 1366 - (OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 1367 - (OOM_DISABLE) to disable oom killing entirely for that task. Its value is 1368 - scaled linearly with /proc/<pid>/oom_score_adj. 1369 - 1370 - Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the 1371 - other with its scaled value. 1372 - 1373 1364 The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last 1374 1365 value set by a CAP_SYS_RESOURCE process. To reduce the value any lower 1375 1366 requires CAP_SYS_RESOURCE.
1376 - 1377 - NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see 1378 - Documentation/feature-removal-schedule.txt. 1379 1367 1380 1368 Caveat: when a parent task is selected, the oom killer will sacrifice any first 1381 1369 generation children with separate address spaces instead, if possible. This ··· 1375 1387 ------------------------------------------------------------- 1376 1388 1377 1389 This file can be used to check the current score used by the oom-killer for 1378 - any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which 1379 - process should be killed in an out-of-memory situation. 1380 - 1390 + any given <pid>. 1381 1391 1382 1392 3.3 /proc/<pid>/io - Display the IO accounting fields 1383 1393 -------------------------------------------------------
-33
Documentation/memory.txt
··· 1 - There are several classic problems related to memory on Linux 2 - systems. 3 - 4 - 1) There are some motherboards that will not cache above 5 - a certain quantity of memory. If you have one of these 6 - motherboards, your system will be SLOWER, not faster 7 - as you add more memory. Consider exchanging your 8 - motherboard. 9 - 10 - All of these problems can be addressed with the "mem=XXXM" boot option 11 - (where XXX is the size of RAM to use in megabytes). 12 - It can also tell Linux to use less memory than is actually installed. 13 - If you use "mem=" on a machine with PCI, consider using "memmap=" to avoid 14 - physical address space collisions. 15 - 16 - See the documentation of your boot loader (LILO, grub, loadlin, etc.) about 17 - how to pass options to the kernel. 18 - 19 - There are other memory problems which Linux cannot deal with. Random 20 - corruption of memory is usually a sign of serious hardware trouble. 21 - Try: 22 - 23 - * Reducing memory settings in the BIOS to the most conservative 24 - timings. 25 - 26 - * Adding a cooling fan. 27 - 28 - * Not overclocking your CPU. 29 - 30 - * Having the memory tested in a memory tester or exchanged 31 - with the vendor. Consider testing it with memtest86 yourself. 32 - 33 - * Exchanging your CPU, cache, or motherboard for one that works.
-107
Documentation/prio_tree.txt
··· 1 - The prio_tree.c code indexes vmas using 3 different indexes: 2 - * heap_index = vm_pgoff + vm_size_in_pages : end_vm_pgoff 3 - * radix_index = vm_pgoff : start_vm_pgoff 4 - * size_index = vm_size_in_pages 5 - 6 - A regular radix-priority-search-tree indexes vmas using only heap_index and 7 - radix_index. The conditions for indexing are: 8 - * ->heap_index >= ->left->heap_index && 9 - ->heap_index >= ->right->heap_index 10 - * if (->heap_index == ->left->heap_index) 11 - then ->radix_index < ->left->radix_index; 12 - * if (->heap_index == ->right->heap_index) 13 - then ->radix_index < ->right->radix_index; 14 - * nodes are hashed to left or right subtree using radix_index 15 - similar to a pure binary radix tree. 16 - 17 - A regular radix-priority-search-tree helps to store and query 18 - intervals (vmas). However, a regular radix-priority-search-tree is only 19 - suitable for storing vmas with different radix indices (vm_pgoff). 20 - 21 - Therefore, the prio_tree.c extends the regular radix-priority-search-tree 22 - to handle many vmas with the same vm_pgoff. Such vmas are handled in 23 - 2 different ways: 1) All vmas with the same radix _and_ heap indices are 24 - linked using vm_set.list, 2) if there are many vmas with the same radix 25 - index, but different heap indices and if the regular radix-priority-search 26 - tree cannot index them all, we build an overflow-sub-tree that indexes such 27 - vmas using heap and size indices instead of heap and radix indices. For 28 - example, in the figure below some vmas with vm_pgoff = 0 (zero) are 29 - indexed by regular radix-priority-search-tree whereas others are pushed 30 - into an overflow-subtree. Note that all vmas in an overflow-sub-tree have 31 - the same vm_pgoff (radix_index) and if necessary we build different 32 - overflow-sub-trees to handle each possible radix_index. For example, 33 - in figure we have 3 overflow-sub-trees corresponding to radix indices 34 - 0, 2, and 4. 
35 - 36 - In the final tree the first few (prio_tree_root->index_bits) levels 37 - are indexed using heap and radix indices whereas the overflow-sub-trees below 38 - those levels (i.e. levels prio_tree_root->index_bits + 1 and higher) are 39 - indexed using heap and size indices. In overflow-sub-trees the size_index 40 - is used for hashing the nodes to appropriate places. 41 - 42 - Now, an example prio_tree: 43 - 44 - vmas are represented [radix_index, size_index, heap_index] 45 - i.e., [start_vm_pgoff, vm_size_in_pages, end_vm_pgoff] 46 - 47 - level prio_tree_root->index_bits = 3 48 - ----- 49 - _ 50 - 0 [0,7,7] | 51 - / \ | 52 - ------------------ ------------ | Regular 53 - / \ | radix priority 54 - 1 [1,6,7] [4,3,7] | search tree 55 - / \ / \ | 56 - ------- ----- ------ ----- | heap-and-radix 57 - / \ / \ | indexed 58 - 2 [0,6,6] [2,5,7] [5,2,7] [6,1,7] | 59 - / \ / \ / \ / \ | 60 - 3 [0,5,5] [1,5,6] [2,4,6] [3,4,7] [4,2,6] [5,1,6] [6,0,6] [7,0,7] | 61 - / / / _ 62 - / / / _ 63 - 4 [0,4,4] [2,3,5] [4,1,5] | 64 - / / / | 65 - 5 [0,3,3] [2,2,4] [4,0,4] | Overflow-sub-trees 66 - / / | 67 - 6 [0,2,2] [2,1,3] | heap-and-size 68 - / / | indexed 69 - 7 [0,1,1] [2,0,2] | 70 - / | 71 - 8 [0,0,0] | 72 - _ 73 - 74 - Note that we use prio_tree_root->index_bits to optimize the height 75 - of the heap-and-radix indexed tree. Since prio_tree_root->index_bits is 76 - set according to the maximum end_vm_pgoff mapped, we are sure that all 77 - bits (in vm_pgoff) above prio_tree_root->index_bits are 0 (zero). Therefore, 78 - we only use the first prio_tree_root->index_bits as radix_index. 79 - Whenever index_bits is increased in prio_tree_expand, we shuffle the tree 80 - to make sure that the first prio_tree_root->index_bits levels of the tree 81 - is indexed properly using heap and radix indices. 82 - 83 - We do not optimize the height of overflow-sub-trees using index_bits. 
84 - The reason is: there can be many such overflow-sub-trees and all of 85 - them have to be suffled whenever the index_bits increases. This may involve 86 - walking the whole prio_tree in prio_tree_insert->prio_tree_expand code 87 - path which is not desirable. Hence, we do not optimize the height of the 88 - heap-and-size indexed overflow-sub-trees using prio_tree->index_bits. 89 - Instead the overflow sub-trees are indexed using full BITS_PER_LONG bits 90 - of size_index. This may lead to skewed sub-trees because most of the 91 - higher significant bits of the size_index are likely to be 0 (zero). In 92 - the example above, all 3 overflow-sub-trees are skewed. This may marginally 93 - affect the performance. However, processes rarely map many vmas with the 94 - same start_vm_pgoff but different end_vm_pgoffs. Therefore, we normally 95 - do not require overflow-sub-trees to index all vmas. 96 - 97 - From the above discussion it is clear that the maximum height of 98 - a prio_tree can be prio_tree_root->index_bits + BITS_PER_LONG. 99 - However, in most of the common cases we do not need overflow-sub-trees, 100 - so the tree height in the common cases will be prio_tree_root->index_bits. 101 - 102 - It is fair to mention here that the prio_tree_root->index_bits 103 - is increased on demand, however, the index_bits is not decreased when 104 - vmas are removed from the prio_tree. That's tricky to do. Hence, it's 105 - left as a home work problem. 106 - 107 -
+170 -33
Documentation/rbtree.txt
··· 193 193 Support for Augmented rbtrees 194 194 ----------------------------- 195 195 196 - Augmented rbtree is an rbtree with "some" additional data stored in each node. 197 - This data can be used to augment some new functionality to rbtree. 198 - Augmented rbtree is an optional feature built on top of basic rbtree 199 - infrastructure. An rbtree user who wants this feature will have to call the 200 - augmentation functions with the user provided augmentation callback 201 - when inserting and erasing nodes. 196 + Augmented rbtree is an rbtree with "some" additional data stored in 197 + each node, where the additional data for node N must be a function of 198 + the contents of all nodes in the subtree rooted at N. This data can 199 + be used to augment some new functionality to rbtree. Augmented rbtree 200 + is an optional feature built on top of basic rbtree infrastructure. 201 + An rbtree user who wants this feature will have to call the augmentation 202 + functions with the user provided augmentation callback when inserting 203 + and erasing nodes. 202 204 203 - On insertion, the user must call rb_augment_insert() once the new node is in 204 - place. This will cause the augmentation function callback to be called for 205 - each node between the new node and the root which has been affected by the 206 - insertion. 205 + C files implementing augmented rbtree manipulation must include 206 + <linux/rbtree_augmented.h> instead of <linux/rbtree.h>. Note that 207 + linux/rbtree_augmented.h exposes some rbtree implementation details 208 + you are not expected to rely on; please stick to the documented APIs 209 + there and do not include <linux/rbtree_augmented.h> from header files 210 + either so as to minimize chances of your users accidentally relying on 211 + such implementation details. 207 212 208 - When erasing a node, the user must call rb_augment_erase_begin() first to 209 - retrieve the deepest node on the rebalance path.
Then, after erasing the 210 - original node, the user must call rb_augment_erase_end() with the deepest 211 - node found earlier. This will cause the augmentation function to be called 212 - for each affected node between the deepest node and the root. 213 + On insertion, the user must update the augmented information on the path 214 + leading to the inserted node, then call rb_link_node() as usual and 215 + rb_insert_augmented() instead of the usual rb_insert_color() call. 216 + If rb_insert_augmented() rebalances the rbtree, it will call back into 217 + a user provided function to update the augmented information on the 218 + affected subtrees. 213 219 220 + When erasing a node, the user must call rb_erase_augmented() instead of 221 + rb_erase(). rb_erase_augmented() calls back into user provided functions 222 + to update the augmented information on affected subtrees. 223 + 224 + In both cases, the callbacks are provided through struct rb_augment_callbacks. 225 + 3 callbacks must be defined: 226 + 227 + - A propagation callback, which updates the augmented value for a given 228 + node and its ancestors, up to a given stop point (or NULL to update 229 + all the way to the root). 230 + 231 + - A copy callback, which copies the augmented value for a given subtree 232 + to a newly assigned subtree root. 233 + 234 + - A tree rotation callback, which copies the augmented value for a given 235 + subtree to a newly assigned subtree root AND recomputes the augmented 236 + information for the former subtree root. 237 + 238 + The compiled code for rb_erase_augmented() may inline the propagation and 239 + copy callbacks, which results in a large function, so each augmented rbtree 240 + user should have a single rb_erase_augmented() call site in order to limit 241 + compiled code size. 242 + 243 + 244 + Sample usage: 214 245 215 246 Interval tree is an example of augmented rb tree. Reference - 216 247 "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein.
··· 261 230 for lowest match (lowest start address among all possible matches) 262 231 with something like: 263 232 264 - find_lowest_match(lo, hi, node) 233 + struct interval_tree_node * 234 + interval_tree_first_match(struct rb_root *root, 235 + unsigned long start, unsigned long last) 265 236 { 266 - lowest_match = NULL; 267 - while (node) { 268 - if (max_hi(node->left) > lo) { 269 - // Lowest overlap if any must be on left side 270 - node = node->left; 271 - } else if (overlap(lo, hi, node)) { 272 - lowest_match = node; 273 - break; 274 - } else if (lo > node->lo) { 275 - // Lowest overlap if any must be on right side 276 - node = node->right; 277 - } else { 278 - break; 237 + struct interval_tree_node *node; 238 + 239 + if (!root->rb_node) 240 + return NULL; 241 + node = rb_entry(root->rb_node, struct interval_tree_node, rb); 242 + 243 + while (true) { 244 + if (node->rb.rb_left) { 245 + struct interval_tree_node *left = 246 + rb_entry(node->rb.rb_left, 247 + struct interval_tree_node, rb); 248 + if (left->__subtree_last >= start) { 249 + /* 250 + * Some nodes in left subtree satisfy Cond2. 251 + * Iterate to find the leftmost such node N. 252 + * If it also satisfies Cond1, that's the match 253 + * we are looking for. Otherwise, there is no 254 + * matching interval as nodes to the right of N 255 + * can't satisfy Cond1 either. 
256 + */ 257 + node = left; 258 + continue; 259 + } 279 260 } 261 + if (node->start <= last) { /* Cond1 */ 262 + if (node->last >= start) /* Cond2 */ 263 + return node; /* node is leftmost match */ 264 + if (node->rb.rb_right) { 265 + node = rb_entry(node->rb.rb_right, 266 + struct interval_tree_node, rb); 267 + if (node->__subtree_last >= start) 268 + continue; 269 + } 270 + } 271 + return NULL; /* No match */ 280 272 } 281 - return lowest_match; 282 273 } 283 274 284 - Finding exact match will be to first find lowest match and then to follow 285 - successor nodes looking for exact match, until the start of a node is beyond 286 - the hi value we are looking for. 275 + Insertion/removal are defined using the following augmented callbacks: 276 + 277 + static inline unsigned long 278 + compute_subtree_last(struct interval_tree_node *node) 279 + { 280 + unsigned long max = node->last, subtree_last; 281 + if (node->rb.rb_left) { 282 + subtree_last = rb_entry(node->rb.rb_left, 283 + struct interval_tree_node, rb)->__subtree_last; 284 + if (max < subtree_last) 285 + max = subtree_last; 286 + } 287 + if (node->rb.rb_right) { 288 + subtree_last = rb_entry(node->rb.rb_right, 289 + struct interval_tree_node, rb)->__subtree_last; 290 + if (max < subtree_last) 291 + max = subtree_last; 292 + } 293 + return max; 294 + } 295 + 296 + static void augment_propagate(struct rb_node *rb, struct rb_node *stop) 297 + { 298 + while (rb != stop) { 299 + struct interval_tree_node *node = 300 + rb_entry(rb, struct interval_tree_node, rb); 301 + unsigned long subtree_last = compute_subtree_last(node); 302 + if (node->__subtree_last == subtree_last) 303 + break; 304 + node->__subtree_last = subtree_last; 305 + rb = rb_parent(&node->rb); 306 + } 307 + } 308 + 309 + static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new) 310 + { 311 + struct interval_tree_node *old = 312 + rb_entry(rb_old, struct interval_tree_node, rb); 313 + struct interval_tree_node *new = 314 + 
rb_entry(rb_new, struct interval_tree_node, rb); 315 + 316 + new->__subtree_last = old->__subtree_last; 317 + } 318 + 319 + static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new) 320 + { 321 + struct interval_tree_node *old = 322 + rb_entry(rb_old, struct interval_tree_node, rb); 323 + struct interval_tree_node *new = 324 + rb_entry(rb_new, struct interval_tree_node, rb); 325 + 326 + new->__subtree_last = old->__subtree_last; 327 + old->__subtree_last = compute_subtree_last(old); 328 + } 329 + 330 + static const struct rb_augment_callbacks augment_callbacks = { 331 + augment_propagate, augment_copy, augment_rotate 332 + }; 333 + 334 + void interval_tree_insert(struct interval_tree_node *node, 335 + struct rb_root *root) 336 + { 337 + struct rb_node **link = &root->rb_node, *rb_parent = NULL; 338 + unsigned long start = node->start, last = node->last; 339 + struct interval_tree_node *parent; 340 + 341 + while (*link) { 342 + rb_parent = *link; 343 + parent = rb_entry(rb_parent, struct interval_tree_node, rb); 344 + if (parent->__subtree_last < last) 345 + parent->__subtree_last = last; 346 + if (start < parent->start) 347 + link = &parent->rb.rb_left; 348 + else 349 + link = &parent->rb.rb_right; 350 + } 351 + 352 + node->__subtree_last = last; 353 + rb_link_node(&node->rb, rb_parent, link); 354 + rb_insert_augmented(&node->rb, root, &augment_callbacks); 355 + } 356 + 357 + void interval_tree_remove(struct interval_tree_node *node, 358 + struct rb_root *root) 359 + { 360 + rb_erase_augmented(&node->rb, root, &augment_callbacks); 361 + }
+5 -9
Documentation/vm/unevictable-lru.txt
··· 197 197 freeing them. 198 198 199 199 page_evictable() also checks for mlocked pages by testing an additional page 200 - flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked, 201 - and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is 202 - VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and 203 - update the appropriate statistics if the vma is VM_LOCKED. This method allows 204 - efficient "culling" of pages in the fault path that are being faulted in to 205 - VM_LOCKED VMAs. 200 + flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is 201 + faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED. 206 202 207 203 208 204 VMSCAN'S HANDLING OF UNEVICTABLE PAGES ··· 367 371 mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to 368 372 allocate the huge pages and populate the ptes. 369 373 370 - 3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of 371 - kernel pages, such as the VDSO page, relay channel pages, etc. These pages 374 + 3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages, 375 + such as the VDSO page, relay channel pages, etc. These pages 372 376 are inherently unevictable and are not managed on the LRU lists. 373 377 mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls 374 378 make_pages_present() to populate the ptes. ··· 647 651 ------------------------------- 648 652 649 653 shrink_active_list() culls any obviously unevictable pages - i.e. 650 - !page_evictable(page, NULL) - diverting these to the unevictable list. 654 + !page_evictable(page) - diverting these to the unevictable list. 651 655 However, shrink_active_list() only sees unevictable pages that made it onto the 652 656 active/inactive lru lists. Note that these pages do not have PageUnevictable 653 657 set - otherwise they would be on the unevictable list and shrink_active_list
+8
MAINTAINERS
··· 7039 7039 F: Documentation/svga.txt 7040 7040 F: arch/x86/boot/video* 7041 7041 7042 + SWIOTLB SUBSYSTEM 7043 + M: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 7044 + L: linux-kernel@vger.kernel.org 7045 + S: Supported 7046 + F: lib/swiotlb.c 7047 + F: arch/*/kernel/pci-swiotlb.c 7048 + F: include/linux/swiotlb.h 7049 + 7042 7050 SYSV FILESYSTEM 7043 7051 M: Christoph Hellwig <hch@infradead.org> 7044 7052 S: Maintained
+3
arch/Kconfig
··· 313 313 Archs need to ensure they use a high enough resolution clock to 314 314 support irq time accounting and then call enable_sched_clock_irqtime(). 315 315 316 + config HAVE_ARCH_TRANSPARENT_HUGEPAGE 317 + bool 318 + 316 319 source "kernel/gcov/Kconfig"
+1 -1
arch/alpha/kernel/pci-sysfs.c
··· 26 26 base = sparse ? hose->sparse_io_base : hose->dense_io_base; 27 27 28 28 vma->vm_pgoff += base >> PAGE_SHIFT; 29 - vma->vm_flags |= (VM_IO | VM_RESERVED); 29 + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; 30 30 31 31 return io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, 32 32 vma->vm_end - vma->vm_start,
+2
arch/arm/Kconfig
··· 25 25 select HAVE_FUNCTION_GRAPH_TRACER if (!THUMB2_KERNEL) 26 26 select ARCH_BINFMT_ELF_RANDOMIZE_PIE 27 27 select HAVE_GENERIC_DMA_COHERENT 28 + select HAVE_DEBUG_KMEMLEAK 28 29 select HAVE_KERNEL_GZIP 29 30 select HAVE_KERNEL_LZO 30 31 select HAVE_KERNEL_LZMA ··· 40 39 select HARDIRQS_SW_RESEND 41 40 select GENERIC_IRQ_PROBE 42 41 select GENERIC_IRQ_SHOW 42 + select HAVE_UID16 43 43 select ARCH_WANT_IPC_PARSE_VERSION 44 44 select HARDIRQS_SW_RESEND 45 45 select CPU_PM if (SUSPEND || CPU_IDLE)
+1 -2
arch/arm/mm/fault-armv.c
··· 134 134 { 135 135 struct mm_struct *mm = vma->vm_mm; 136 136 struct vm_area_struct *mpnt; 137 - struct prio_tree_iter iter; 138 137 unsigned long offset; 139 138 pgoff_t pgoff; 140 139 int aliases = 0; ··· 146 147 * cache coherency. 147 148 */ 148 149 flush_dcache_mmap_lock(mapping); 149 - vma_prio_tree_foreach(mpnt, &iter, &mapping->i_mmap, pgoff, pgoff) { 150 + vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) { 150 151 /* 151 152 * If this VMA is not in our MM, we can ignore it. 152 153 * Note that we intentionally mask out the VMA
+1
arch/arm/mm/fault.c
··· 336 336 /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk 337 337 * of starvation. */ 338 338 flags &= ~FAULT_FLAG_ALLOW_RETRY; 339 + flags |= FAULT_FLAG_TRIED; 339 340 goto retry; 340 341 } 341 342 }
+1 -2
arch/arm/mm/flush.c
··· 196 196 { 197 197 struct mm_struct *mm = current->active_mm; 198 198 struct vm_area_struct *mpnt; 199 - struct prio_tree_iter iter; 200 199 pgoff_t pgoff; 201 200 202 201 /* ··· 207 208 pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 208 209 209 210 flush_dcache_mmap_lock(mapping); 210 - vma_prio_tree_foreach(mpnt, &iter, &mapping->i_mmap, pgoff, pgoff) { 211 + vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) { 211 212 unsigned long offset; 212 213 213 214 /*
+4
arch/arm64/Kconfig
··· 10 10 select GENERIC_TIME_VSYSCALL 11 11 select HARDIRQS_SW_RESEND 12 12 select HAVE_ARCH_TRACEHOOK 13 + select HAVE_DEBUG_BUGVERBOSE 14 + select HAVE_DEBUG_KMEMLEAK 13 15 select HAVE_DMA_API_DEBUG 14 16 select HAVE_DMA_ATTRS 15 17 select HAVE_GENERIC_DMA_COHERENT ··· 28 26 select PERF_USE_VMALLOC 29 27 select RTC_LIB 30 28 select SPARSE_IRQ 29 + select SYSCTL_EXCEPTION_TRACE 31 30 help 32 31 ARM 64-bit (AArch64) Linux support. 33 32 ··· 196 193 bool "Kernel support for 32-bit EL0" 197 194 depends on !ARM64_64K_PAGES 198 195 select COMPAT_BINFMT_ELF 196 + select HAVE_UID16 199 197 help 200 198 This option enables support for a 32-bit EL0 running under a 64-bit 201 199 kernel at EL1. AArch32-specific components such as system calls,
+1
arch/avr32/mm/fault.c
··· 152 152 tsk->min_flt++; 153 153 if (fault & VM_FAULT_RETRY) { 154 154 flags &= ~FAULT_FLAG_ALLOW_RETRY; 155 + flags |= FAULT_FLAG_TRIED; 155 156 156 157 /* 157 158 * No need to up_read(&mm->mmap_sem) as we would have
+1
arch/blackfin/Kconfig
··· 33 33 select HAVE_PERF_EVENTS 34 34 select ARCH_HAVE_CUSTOM_GPIO_H 35 35 select ARCH_WANT_OPTIONAL_GPIOLIB 36 + select HAVE_UID16 36 37 select ARCH_WANT_IPC_PARSE_VERSION 37 38 select HAVE_GENERIC_HARDIRQS 38 39 select GENERIC_ATOMIC64
+1
arch/cris/Kconfig
··· 42 42 select HAVE_IDE 43 43 select GENERIC_ATOMIC64 44 44 select HAVE_GENERIC_HARDIRQS 45 + select HAVE_UID16 45 46 select ARCH_WANT_IPC_PARSE_VERSION 46 47 select GENERIC_IRQ_SHOW 47 48 select GENERIC_IOMAP
+1
arch/cris/mm/fault.c
··· 186 186 tsk->min_flt++; 187 187 if (fault & VM_FAULT_RETRY) { 188 188 flags &= ~FAULT_FLAG_ALLOW_RETRY; 189 + flags |= FAULT_FLAG_TRIED; 189 190 190 191 /* 191 192 * No need to up_read(&mm->mmap_sem) as we would
+2
arch/frv/Kconfig
··· 5 5 select HAVE_ARCH_TRACEHOOK 6 6 select HAVE_IRQ_WORK 7 7 select HAVE_PERF_EVENTS 8 + select HAVE_UID16 8 9 select HAVE_GENERIC_HARDIRQS 9 10 select GENERIC_IRQ_SHOW 11 + select HAVE_DEBUG_BUGVERBOSE 10 12 select ARCH_HAVE_NMI_SAFE_CMPXCHG 11 13 select GENERIC_CPU_DEVICES 12 14 select ARCH_WANT_IPC_PARSE_VERSION
+1
arch/h8300/Kconfig
··· 3 3 default y 4 4 select HAVE_IDE 5 5 select HAVE_GENERIC_HARDIRQS 6 + select HAVE_UID16 6 7 select ARCH_WANT_IPC_PARSE_VERSION 7 8 select GENERIC_IRQ_SHOW 8 9 select GENERIC_CPU_DEVICES
+1
arch/hexagon/mm/vm_fault.c
··· 113 113 current->min_flt++; 114 114 if (fault & VM_FAULT_RETRY) { 115 115 flags &= ~FAULT_FLAG_ALLOW_RETRY; 116 + flags |= FAULT_FLAG_TRIED; 116 117 goto retry; 117 118 } 118 119 }
+4
arch/ia64/include/asm/hugetlb.h
··· 77 77 { 78 78 } 79 79 80 + static inline void arch_clear_hugepage_flags(struct page *page) 81 + { 82 + } 83 + 80 84 #endif /* _ASM_IA64_HUGETLB_H */
+1 -1
arch/ia64/kernel/perfmon.c
··· 2307 2307 */ 2308 2308 vma->vm_mm = mm; 2309 2309 vma->vm_file = get_file(filp); 2310 - vma->vm_flags = VM_READ| VM_MAYREAD |VM_RESERVED; 2310 + vma->vm_flags = VM_READ|VM_MAYREAD|VM_DONTEXPAND|VM_DONTDUMP; 2311 2311 vma->vm_page_prot = PAGE_READONLY; /* XXX may need to change */ 2312 2312 2313 2313 /*
+1
arch/ia64/mm/fault.c
··· 184 184 current->min_flt++; 185 185 if (fault & VM_FAULT_RETRY) { 186 186 flags &= ~FAULT_FLAG_ALLOW_RETRY; 187 + flags |= FAULT_FLAG_TRIED; 187 188 188 189 /* No need to up_read(&mm->mmap_sem) as we would 189 190 * have already released it in __lock_page_or_retry
+3 -1
arch/ia64/mm/init.c
··· 138 138 vma->vm_mm = current->mm; 139 139 vma->vm_end = PAGE_SIZE; 140 140 vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT); 141 - vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED; 141 + vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | 142 + VM_DONTEXPAND | VM_DONTDUMP; 142 143 down_write(&current->mm->mmap_sem); 143 144 if (insert_vm_struct(current->mm, vma)) { 144 145 up_write(&current->mm->mmap_sem); ··· 637 636 638 637 high_memory = __va(max_low_pfn * PAGE_SIZE); 639 638 639 + reset_zone_present_pages(); 640 640 for_each_online_pgdat(pgdat) 641 641 if (pgdat->bdata->node_bootmem_map) 642 642 totalram_pages += free_all_bootmem_node(pgdat);
+1
arch/m32r/Kconfig
··· 8 8 select HAVE_KERNEL_BZIP2 9 9 select HAVE_KERNEL_LZMA 10 10 select ARCH_WANT_IPC_PARSE_VERSION 11 + select HAVE_DEBUG_BUGVERBOSE 11 12 select HAVE_GENERIC_HARDIRQS 12 13 select GENERIC_IRQ_PROBE 13 14 select GENERIC_IRQ_SHOW
+2
arch/m68k/Kconfig
··· 3 3 default y 4 4 select HAVE_IDE 5 5 select HAVE_AOUT if MMU 6 + select HAVE_DEBUG_BUGVERBOSE 6 7 select HAVE_GENERIC_HARDIRQS 7 8 select GENERIC_IRQ_SHOW 8 9 select GENERIC_ATOMIC64 10 + select HAVE_UID16 9 11 select ARCH_HAVE_NMI_SAFE_CMPXCHG if RMW_INSNS 10 12 select GENERIC_CPU_DEVICES 11 13 select GENERIC_STRNCPY_FROM_USER if MMU
+1
arch/m68k/mm/fault.c
··· 170 170 /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk 171 171 * of starvation. */ 172 172 flags &= ~FAULT_FLAG_ALLOW_RETRY; 173 + flags |= FAULT_FLAG_TRIED; 173 174 174 175 /* 175 176 * No need to up_read(&mm->mmap_sem) as we would
+1
arch/microblaze/Kconfig
··· 16 16 select OF 17 17 select OF_EARLY_FLATTREE 18 18 select ARCH_WANT_IPC_PARSE_VERSION 19 + select HAVE_DEBUG_KMEMLEAK 19 20 select IRQ_DOMAIN 20 21 select HAVE_GENERIC_HARDIRQS 21 22 select GENERIC_IRQ_PROBE
+1
arch/microblaze/include/asm/atomic.h
··· 22 22 23 23 return res; 24 24 } 25 + #define atomic_dec_if_positive atomic_dec_if_positive 25 26 26 27 #endif /* _ASM_MICROBLAZE_ATOMIC_H */
+1
arch/microblaze/mm/fault.c
··· 233 233 current->min_flt++; 234 234 if (fault & VM_FAULT_RETRY) { 235 235 flags &= ~FAULT_FLAG_ALLOW_RETRY; 236 + flags |= FAULT_FLAG_TRIED; 236 237 237 238 /* 238 239 * No need to up_read(&mm->mmap_sem) as we would
+1
arch/mips/Kconfig
··· 17 17 select HAVE_FUNCTION_GRAPH_TRACER 18 18 select HAVE_KPROBES 19 19 select HAVE_KRETPROBES 20 + select HAVE_DEBUG_KMEMLEAK 20 21 select ARCH_BINFMT_ELF_RANDOMIZE_PIE 21 22 select RTC_LIB if !MACH_LOONGSON 22 23 select GENERIC_ATOMIC64 if !64BIT
+4
arch/mips/include/asm/hugetlb.h
··· 112 112 { 113 113 } 114 114 115 + static inline void arch_clear_hugepage_flags(struct page *page) 116 + { 117 + } 118 + 115 119 #endif /* __ASM_HUGETLB_H */
+1
arch/mips/mm/fault.c
··· 171 171 } 172 172 if (fault & VM_FAULT_RETRY) { 173 173 flags &= ~FAULT_FLAG_ALLOW_RETRY; 174 + flags |= FAULT_FLAG_TRIED; 174 175 175 176 /* 176 177 * No need to up_read(&mm->mmap_sem) as we would
+1
arch/openrisc/mm/fault.c
··· 183 183 tsk->min_flt++; 184 184 if (fault & VM_FAULT_RETRY) { 185 185 flags &= ~FAULT_FLAG_ALLOW_RETRY; 186 + flags |= FAULT_FLAG_TRIED; 186 187 187 188 /* No need to up_read(&mm->mmap_sem) as we would 188 189 * have already released it in __lock_page_or_retry
+1 -2
arch/parisc/kernel/cache.c
··· 276 276 { 277 277 struct address_space *mapping = page_mapping(page); 278 278 struct vm_area_struct *mpnt; 279 - struct prio_tree_iter iter; 280 279 unsigned long offset; 281 280 unsigned long addr, old_addr = 0; 282 281 pgoff_t pgoff; ··· 298 299 * to flush one address here for them all to become coherent */ 299 300 300 301 flush_dcache_mmap_lock(mapping); 301 - vma_prio_tree_foreach(mpnt, &iter, &mapping->i_mmap, pgoff, pgoff) { 302 + vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) { 302 303 offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT; 303 304 addr = mpnt->vm_start + offset; 304 305
+2
arch/powerpc/Kconfig
··· 99 99 select HAVE_DYNAMIC_FTRACE 100 100 select HAVE_FUNCTION_TRACER 101 101 select HAVE_FUNCTION_GRAPH_TRACER 102 + select SYSCTL_EXCEPTION_TRACE 102 103 select ARCH_WANT_OPTIONAL_GPIOLIB 103 104 select HAVE_IDE 104 105 select HAVE_IOREMAP_PROT ··· 114 113 select HAVE_DMA_API_DEBUG 115 114 select USE_GENERIC_SMP_HELPERS if SMP 116 115 select HAVE_OPROFILE 116 + select HAVE_DEBUG_KMEMLEAK 117 117 select HAVE_SYSCALL_WRAPPERS if PPC64 118 118 select GENERIC_ATOMIC64 if PPC32 119 119 select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
+1
arch/powerpc/include/asm/atomic.h
··· 268 268 269 269 return t; 270 270 } 271 + #define atomic_dec_if_positive atomic_dec_if_positive 271 272 272 273 #define smp_mb__before_atomic_dec() smp_mb() 273 274 #define smp_mb__after_atomic_dec() smp_mb()
+4
arch/powerpc/include/asm/hugetlb.h
··· 151 151 { 152 152 } 153 153 154 + static inline void arch_clear_hugepage_flags(struct page *page) 155 + { 156 + } 157 + 154 158 #else /* ! CONFIG_HUGETLB_PAGE */ 155 159 static inline void flush_hugetlb_page(struct vm_area_struct *vma, 156 160 unsigned long vmaddr)
+1 -1
arch/powerpc/kvm/book3s_hv.c
··· 1183 1183 1184 1184 static int kvm_rma_mmap(struct file *file, struct vm_area_struct *vma) 1185 1185 { 1186 - vma->vm_flags |= VM_RESERVED; 1186 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 1187 1187 vma->vm_ops = &kvm_rma_vm_ops; 1188 1188 return 0; 1189 1189 }
+1
arch/powerpc/mm/fault.c
··· 451 451 /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk 452 452 * of starvation. */ 453 453 flags &= ~FAULT_FLAG_ALLOW_RETRY; 454 + flags |= FAULT_FLAG_TRIED; 454 455 goto retry; 455 456 } 456 457 }
+4 -11
arch/powerpc/oprofile/cell/spu_task_sync.c
··· 304 304 return cookie; 305 305 } 306 306 307 - /* Look up the dcookie for the task's first VM_EXECUTABLE mapping, 307 + /* Look up the dcookie for the task's mm->exe_file, 308 308 * which corresponds loosely to "application name". Also, determine 309 309 * the offset for the SPU ELF object. If computed offset is 310 310 * non-zero, it implies an embedded SPU object; otherwise, it's a ··· 321 321 { 322 322 unsigned long app_cookie = 0; 323 323 unsigned int my_offset = 0; 324 - struct file *app = NULL; 325 324 struct vm_area_struct *vma; 326 325 struct mm_struct *mm = spu->mm; 327 326 ··· 329 330 330 331 down_read(&mm->mmap_sem); 331 332 332 - for (vma = mm->mmap; vma; vma = vma->vm_next) { 333 - if (!vma->vm_file) 334 - continue; 335 - if (!(vma->vm_flags & VM_EXECUTABLE)) 336 - continue; 337 - app_cookie = fast_get_dcookie(&vma->vm_file->f_path); 333 + if (mm->exe_file) { 334 + app_cookie = fast_get_dcookie(&mm->exe_file->f_path); 338 335 pr_debug("got dcookie for %s\n", 339 - vma->vm_file->f_dentry->d_name.name); 340 - app = vma->vm_file; 341 - break; 336 + mm->exe_file->f_dentry->d_name.name); 342 337 } 343 338 344 339 for (vma = mm->mmap; vma; vma = vma->vm_next) {
+9 -4
arch/powerpc/platforms/pseries/hotplug-memory.c
··· 77 77 { 78 78 unsigned long start, start_pfn; 79 79 struct zone *zone; 80 - int ret; 80 + int i, ret; 81 + int sections_to_remove; 81 82 82 83 start_pfn = base >> PAGE_SHIFT; 83 84 ··· 98 97 * to sysfs "state" file and we can't remove sysfs entries 99 98 * while writing to it. So we have to defer it to here. 100 99 */ 101 - ret = __remove_pages(zone, start_pfn, memblock_size >> PAGE_SHIFT); 102 - if (ret) 103 - return ret; 100 + sections_to_remove = (memblock_size >> PAGE_SHIFT) / PAGES_PER_SECTION; 101 + for (i = 0; i < sections_to_remove; i++) { 102 + unsigned long pfn = start_pfn + i * PAGES_PER_SECTION; 103 + ret = __remove_pages(zone, pfn, PAGES_PER_SECTION); 104 + if (ret) 105 + return ret; 106 + } 104 107 105 108 /* 106 109 * Update memory regions for memory remove
+3
arch/s390/Kconfig
··· 68 68 select HAVE_FTRACE_MCOUNT_RECORD 69 69 select HAVE_C_RECORDMCOUNT 70 70 select HAVE_SYSCALL_TRACEPOINTS 71 + select SYSCTL_EXCEPTION_TRACE 71 72 select HAVE_DYNAMIC_FTRACE 72 73 select HAVE_FUNCTION_GRAPH_TRACER 73 74 select HAVE_REGS_AND_STACK_ACCESS_API ··· 81 80 select HAVE_IRQ_WORK 82 81 select HAVE_PERF_EVENTS 83 82 select ARCH_HAVE_NMI_SAFE_CMPXCHG 83 + select HAVE_DEBUG_KMEMLEAK 84 84 select HAVE_KERNEL_GZIP 85 85 select HAVE_KERNEL_BZIP2 86 86 select HAVE_KERNEL_LZMA ··· 128 126 select ARCH_INLINE_WRITE_UNLOCK_BH 129 127 select ARCH_INLINE_WRITE_UNLOCK_IRQ 130 128 select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE 129 + select HAVE_UID16 if 32BIT 131 130 select ARCH_WANT_IPC_PARSE_VERSION 132 131 select GENERIC_SMP_IDLE_THREAD 133 132 select GENERIC_TIME_VSYSCALL
+2 -17
arch/s390/include/asm/hugetlb.h
··· 33 33 } 34 34 35 35 #define hugetlb_prefault_arch_hook(mm) do { } while (0) 36 + #define arch_clear_hugepage_flags(page) do { } while (0) 36 37 37 38 int arch_prepare_hugepage(struct page *page); 38 39 void arch_release_hugepage(struct page *page); ··· 78 77 " csp %1,%3" 79 78 : "=m" (*pmdp) 80 79 : "d" (reg2), "d" (reg3), "d" (reg4), "m" (*pmdp) : "cc"); 81 - pmd_val(*pmdp) = _SEGMENT_ENTRY_INV | _SEGMENT_ENTRY; 82 - } 83 - 84 - static inline void __pmd_idte(unsigned long address, pmd_t *pmdp) 85 - { 86 - unsigned long sto = (unsigned long) pmdp - 87 - pmd_index(address) * sizeof(pmd_t); 88 - 89 - if (!(pmd_val(*pmdp) & _SEGMENT_ENTRY_INV)) { 90 - asm volatile( 91 - " .insn rrf,0xb98e0000,%2,%3,0,0" 92 - : "=m" (*pmdp) 93 - : "m" (*pmdp), "a" (sto), 94 - "a" ((address & HPAGE_MASK)) 95 - ); 96 - } 97 - pmd_val(*pmdp) = _SEGMENT_ENTRY_INV | _SEGMENT_ENTRY; 98 80 } 99 81 100 82 static inline void huge_ptep_invalidate(struct mm_struct *mm, ··· 89 105 __pmd_idte(address, pmdp); 90 106 else 91 107 __pmd_csp(pmdp); 108 + pmd_val(*pmdp) = _SEGMENT_ENTRY_INV | _SEGMENT_ENTRY; 92 109 } 93 110 94 111 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
+210
arch/s390/include/asm/pgtable.h
··· 42 42 * tables contain all the necessary information. 43 43 */ 44 44 #define update_mmu_cache(vma, address, ptep) do { } while (0) 45 + #define update_mmu_cache_pmd(vma, address, ptep) do { } while (0) 45 46 46 47 /* 47 48 * ZERO_PAGE is a global shared page that is always zero; used ··· 348 347 349 348 #define _SEGMENT_ENTRY_LARGE 0x400 /* STE-format control, large page */ 350 349 #define _SEGMENT_ENTRY_CO 0x100 /* change-recording override */ 350 + #define _SEGMENT_ENTRY_SPLIT_BIT 0 /* THP splitting bit number */ 351 + #define _SEGMENT_ENTRY_SPLIT (1UL << _SEGMENT_ENTRY_SPLIT_BIT) 352 + 353 + /* Set of bits not changed in pmd_modify */ 354 + #define _SEGMENT_CHG_MASK (_SEGMENT_ENTRY_ORIGIN | _SEGMENT_ENTRY_LARGE \ 355 + | _SEGMENT_ENTRY_SPLIT | _SEGMENT_ENTRY_CO) 351 356 352 357 /* Page status table bits for virtualization */ 353 358 #define RCP_ACC_BITS 0xf000000000000000UL ··· 511 504 { 512 505 unsigned long mask = ~_SEGMENT_ENTRY_ORIGIN & ~_SEGMENT_ENTRY_INV; 513 506 return (pmd_val(pmd) & mask) != _SEGMENT_ENTRY; 507 + } 508 + 509 + #define __HAVE_ARCH_PMDP_SPLITTING_FLUSH 510 + extern void pmdp_splitting_flush(struct vm_area_struct *vma, 511 + unsigned long addr, pmd_t *pmdp); 512 + 513 + #define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS 514 + extern int pmdp_set_access_flags(struct vm_area_struct *vma, 515 + unsigned long address, pmd_t *pmdp, 516 + pmd_t entry, int dirty); 517 + 518 + #define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH 519 + extern int pmdp_clear_flush_young(struct vm_area_struct *vma, 520 + unsigned long address, pmd_t *pmdp); 521 + 522 + #define __HAVE_ARCH_PMD_WRITE 523 + static inline int pmd_write(pmd_t pmd) 524 + { 525 + return (pmd_val(pmd) & _SEGMENT_ENTRY_RO) == 0; 526 + } 527 + 528 + static inline int pmd_young(pmd_t pmd) 529 + { 530 + return 0; 514 531 } 515 532 516 533 static inline int pte_none(pte_t pte) ··· 1189 1158 #define pte_offset_kernel(pmd, address) pte_offset(pmd,address) 1190 1159 #define pte_offset_map(pmd, address) 
pte_offset_kernel(pmd, address) 1191 1160 #define pte_unmap(pte) do { } while (0) 1161 + 1162 + static inline void __pmd_idte(unsigned long address, pmd_t *pmdp) 1163 + { 1164 + unsigned long sto = (unsigned long) pmdp - 1165 + pmd_index(address) * sizeof(pmd_t); 1166 + 1167 + if (!(pmd_val(*pmdp) & _SEGMENT_ENTRY_INV)) { 1168 + asm volatile( 1169 + " .insn rrf,0xb98e0000,%2,%3,0,0" 1170 + : "=m" (*pmdp) 1171 + : "m" (*pmdp), "a" (sto), 1172 + "a" ((address & HPAGE_MASK)) 1173 + : "cc" 1174 + ); 1175 + } 1176 + } 1177 + 1178 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 1179 + #define __HAVE_ARCH_PGTABLE_DEPOSIT 1180 + extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable); 1181 + 1182 + #define __HAVE_ARCH_PGTABLE_WITHDRAW 1183 + extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm); 1184 + 1185 + static inline int pmd_trans_splitting(pmd_t pmd) 1186 + { 1187 + return pmd_val(pmd) & _SEGMENT_ENTRY_SPLIT; 1188 + } 1189 + 1190 + static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, 1191 + pmd_t *pmdp, pmd_t entry) 1192 + { 1193 + *pmdp = entry; 1194 + } 1195 + 1196 + static inline unsigned long massage_pgprot_pmd(pgprot_t pgprot) 1197 + { 1198 + unsigned long pgprot_pmd = 0; 1199 + 1200 + if (pgprot_val(pgprot) & _PAGE_INVALID) { 1201 + if (pgprot_val(pgprot) & _PAGE_SWT) 1202 + pgprot_pmd |= _HPAGE_TYPE_NONE; 1203 + pgprot_pmd |= _SEGMENT_ENTRY_INV; 1204 + } 1205 + if (pgprot_val(pgprot) & _PAGE_RO) 1206 + pgprot_pmd |= _SEGMENT_ENTRY_RO; 1207 + return pgprot_pmd; 1208 + } 1209 + 1210 + static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) 1211 + { 1212 + pmd_val(pmd) &= _SEGMENT_CHG_MASK; 1213 + pmd_val(pmd) |= massage_pgprot_pmd(newprot); 1214 + return pmd; 1215 + } 1216 + 1217 + static inline pmd_t pmd_mkhuge(pmd_t pmd) 1218 + { 1219 + pmd_val(pmd) |= _SEGMENT_ENTRY_LARGE; 1220 + return pmd; 1221 + } 1222 + 1223 + static inline pmd_t pmd_mkwrite(pmd_t pmd) 1224 + { 1225 + pmd_val(pmd) &= ~_SEGMENT_ENTRY_RO; 
1226 + return pmd; 1227 + } 1228 + 1229 + static inline pmd_t pmd_wrprotect(pmd_t pmd) 1230 + { 1231 + pmd_val(pmd) |= _SEGMENT_ENTRY_RO; 1232 + return pmd; 1233 + } 1234 + 1235 + static inline pmd_t pmd_mkdirty(pmd_t pmd) 1236 + { 1237 + /* No dirty bit in the segment table entry. */ 1238 + return pmd; 1239 + } 1240 + 1241 + static inline pmd_t pmd_mkold(pmd_t pmd) 1242 + { 1243 + /* No referenced bit in the segment table entry. */ 1244 + return pmd; 1245 + } 1246 + 1247 + static inline pmd_t pmd_mkyoung(pmd_t pmd) 1248 + { 1249 + /* No referenced bit in the segment table entry. */ 1250 + return pmd; 1251 + } 1252 + 1253 + #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG 1254 + static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, 1255 + unsigned long address, pmd_t *pmdp) 1256 + { 1257 + unsigned long pmd_addr = pmd_val(*pmdp) & HPAGE_MASK; 1258 + long tmp, rc; 1259 + int counter; 1260 + 1261 + rc = 0; 1262 + if (MACHINE_HAS_RRBM) { 1263 + counter = PTRS_PER_PTE >> 6; 1264 + asm volatile( 1265 + "0: .insn rre,0xb9ae0000,%0,%3\n" /* rrbm */ 1266 + " ogr %1,%0\n" 1267 + " la %3,0(%4,%3)\n" 1268 + " brct %2,0b\n" 1269 + : "=&d" (tmp), "+&d" (rc), "+d" (counter), 1270 + "+a" (pmd_addr) 1271 + : "a" (64 * 4096UL) : "cc"); 1272 + rc = !!rc; 1273 + } else { 1274 + counter = PTRS_PER_PTE; 1275 + asm volatile( 1276 + "0: rrbe 0,%2\n" 1277 + " la %2,0(%3,%2)\n" 1278 + " brc 12,1f\n" 1279 + " lhi %0,1\n" 1280 + "1: brct %1,0b\n" 1281 + : "+d" (rc), "+d" (counter), "+a" (pmd_addr) 1282 + : "a" (4096UL) : "cc"); 1283 + } 1284 + return rc; 1285 + } 1286 + 1287 + #define __HAVE_ARCH_PMDP_GET_AND_CLEAR 1288 + static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, 1289 + unsigned long address, pmd_t *pmdp) 1290 + { 1291 + pmd_t pmd = *pmdp; 1292 + 1293 + __pmd_idte(address, pmdp); 1294 + pmd_clear(pmdp); 1295 + return pmd; 1296 + } 1297 + 1298 + #define __HAVE_ARCH_PMDP_CLEAR_FLUSH 1299 + static inline pmd_t pmdp_clear_flush(struct vm_area_struct *vma, 1300 + 
unsigned long address, pmd_t *pmdp) 1301 + { 1302 + return pmdp_get_and_clear(vma->vm_mm, address, pmdp); 1303 + } 1304 + 1305 + #define __HAVE_ARCH_PMDP_INVALIDATE 1306 + static inline void pmdp_invalidate(struct vm_area_struct *vma, 1307 + unsigned long address, pmd_t *pmdp) 1308 + { 1309 + __pmd_idte(address, pmdp); 1310 + } 1311 + 1312 + static inline pmd_t mk_pmd_phys(unsigned long physpage, pgprot_t pgprot) 1313 + { 1314 + pmd_t __pmd; 1315 + pmd_val(__pmd) = physpage + massage_pgprot_pmd(pgprot); 1316 + return __pmd; 1317 + } 1318 + 1319 + #define pfn_pmd(pfn, pgprot) mk_pmd_phys(__pa((pfn) << PAGE_SHIFT), (pgprot)) 1320 + #define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot)) 1321 + 1322 + static inline int pmd_trans_huge(pmd_t pmd) 1323 + { 1324 + return pmd_val(pmd) & _SEGMENT_ENTRY_LARGE; 1325 + } 1326 + 1327 + static inline int has_transparent_hugepage(void) 1328 + { 1329 + return MACHINE_HAS_HPAGE ? 1 : 0; 1330 + } 1331 + 1332 + static inline unsigned long pmd_pfn(pmd_t pmd) 1333 + { 1334 + if (pmd_trans_huge(pmd)) 1335 + return pmd_val(pmd) >> HPAGE_SHIFT; 1336 + else 1337 + return pmd_val(pmd) >> PAGE_SHIFT; 1338 + } 1339 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 1192 1340 1193 1341 /* 1194 1342 * 31 bit swap entry format:
+4 -1
arch/s390/include/asm/setup.h
··· 81 81 #define MACHINE_FLAG_SPP (1UL << 13) 82 82 #define MACHINE_FLAG_TOPOLOGY (1UL << 14) 83 83 #define MACHINE_FLAG_TE (1UL << 15) 84 + #define MACHINE_FLAG_RRBM (1UL << 16) 84 85 85 86 #define MACHINE_IS_VM (S390_lowcore.machine_flags & MACHINE_FLAG_VM) 86 87 #define MACHINE_IS_KVM (S390_lowcore.machine_flags & MACHINE_FLAG_KVM) ··· 100 99 #define MACHINE_HAS_PFMF (0) 101 100 #define MACHINE_HAS_SPP (0) 102 101 #define MACHINE_HAS_TOPOLOGY (0) 103 - #define MACHINE_HAS_TE (0) 102 + #define MACHINE_HAS_TE (0) 103 + #define MACHINE_HAS_RRBM (0) 104 104 #else /* CONFIG_64BIT */ 105 105 #define MACHINE_HAS_IEEE (1) 106 106 #define MACHINE_HAS_CSP (1) ··· 114 112 #define MACHINE_HAS_SPP (S390_lowcore.machine_flags & MACHINE_FLAG_SPP) 115 113 #define MACHINE_HAS_TOPOLOGY (S390_lowcore.machine_flags & MACHINE_FLAG_TOPOLOGY) 116 114 #define MACHINE_HAS_TE (S390_lowcore.machine_flags & MACHINE_FLAG_TE) 115 + #define MACHINE_HAS_RRBM (S390_lowcore.machine_flags & MACHINE_FLAG_RRBM) 117 116 #endif /* CONFIG_64BIT */ 118 117 119 118 #define ZFCPDUMP_HSA_SIZE (32UL<<20)
+1
arch/s390/include/asm/tlb.h
··· 137 137 #define tlb_start_vma(tlb, vma) do { } while (0) 138 138 #define tlb_end_vma(tlb, vma) do { } while (0) 139 139 #define tlb_remove_tlb_entry(tlb, ptep, addr) do { } while (0) 140 + #define tlb_remove_pmd_tlb_entry(tlb, pmdp, addr) do { } while (0) 140 141 #define tlb_migrate_finish(mm) do { } while (0) 141 142 142 143 #endif /* _S390_TLB_H */
+2
arch/s390/kernel/early.c
··· 388 388 S390_lowcore.machine_flags |= MACHINE_FLAG_SPP; 389 389 if (test_facility(50) && test_facility(73)) 390 390 S390_lowcore.machine_flags |= MACHINE_FLAG_TE; 391 + if (test_facility(66)) 392 + S390_lowcore.machine_flags |= MACHINE_FLAG_RRBM; 391 393 #endif 392 394 } 393 395
+1
arch/s390/mm/fault.c
··· 367 367 /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk 368 368 * of starvation. */ 369 369 flags &= ~FAULT_FLAG_ALLOW_RETRY; 370 + flags |= FAULT_FLAG_TRIED; 370 371 down_read(&mm->mmap_sem); 371 372 goto retry; 372 373 }
+10 -1
arch/s390/mm/gup.c
··· 115 115 pmd = *pmdp; 116 116 barrier(); 117 117 next = pmd_addr_end(addr, end); 118 - if (pmd_none(pmd)) 118 + /* 119 + * The pmd_trans_splitting() check below explains why 120 + * pmdp_splitting_flush() has to serialize with 121 + * smp_call_function() against our disabled IRQs, to stop 122 + * this gup-fast code from running while we set the 123 + * splitting bit in the pmd. Returning zero will take 124 + * the slow path that will call wait_split_huge_page() 125 + * if the pmd is still in splitting state. 126 + */ 127 + if (pmd_none(pmd) || pmd_trans_splitting(pmd)) 119 128 return 0; 120 129 if (unlikely(pmd_huge(pmd))) { 121 130 if (!gup_huge_pmd(pmdp, pmd, addr, next,
+108
arch/s390/mm/pgtable.c
··· 787 787 tlb_table_flush(tlb); 788 788 } 789 789 790 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 791 + void thp_split_vma(struct vm_area_struct *vma) 792 + { 793 + unsigned long addr; 794 + struct page *page; 795 + 796 + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { 797 + page = follow_page(vma, addr, FOLL_SPLIT); 798 + } 799 + } 800 + 801 + void thp_split_mm(struct mm_struct *mm) 802 + { 803 + struct vm_area_struct *vma = mm->mmap; 804 + 805 + while (vma != NULL) { 806 + thp_split_vma(vma); 807 + vma->vm_flags &= ~VM_HUGEPAGE; 808 + vma->vm_flags |= VM_NOHUGEPAGE; 809 + vma = vma->vm_next; 810 + } 811 + } 812 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 813 + 790 814 /* 791 815 * switch on pgstes for its userspace process (for kvm) 792 816 */ ··· 847 823 tsk->mm->context.alloc_pgste = 0; 848 824 if (!mm) 849 825 return -ENOMEM; 826 + 827 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 828 + /* split thp mappings and disable thp for future mappings */ 829 + thp_split_mm(mm); 830 + mm->def_flags |= VM_NOHUGEPAGE; 831 + #endif 850 832 851 833 /* Now lets check again if something happened */ 852 834 task_lock(tsk); ··· 896 866 return cc == 0; 897 867 } 898 868 #endif /* CONFIG_HIBERNATION && CONFIG_DEBUG_PAGEALLOC */ 869 + 870 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 871 + int pmdp_clear_flush_young(struct vm_area_struct *vma, unsigned long address, 872 + pmd_t *pmdp) 873 + { 874 + VM_BUG_ON(address & ~HPAGE_PMD_MASK); 875 + /* No need to flush TLB 876 + * On s390 reference bits are in storage key and never in TLB */ 877 + return pmdp_test_and_clear_young(vma, address, pmdp); 878 + } 879 + 880 + int pmdp_set_access_flags(struct vm_area_struct *vma, 881 + unsigned long address, pmd_t *pmdp, 882 + pmd_t entry, int dirty) 883 + { 884 + VM_BUG_ON(address & ~HPAGE_PMD_MASK); 885 + 886 + if (pmd_same(*pmdp, entry)) 887 + return 0; 888 + pmdp_invalidate(vma, address, pmdp); 889 + set_pmd_at(vma->vm_mm, address, pmdp, entry); 890 + return 1; 891 + } 892 + 893 + static void 
pmdp_splitting_flush_sync(void *arg) 894 + { 895 + /* Simply deliver the interrupt */ 896 + } 897 + 898 + void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address, 899 + pmd_t *pmdp) 900 + { 901 + VM_BUG_ON(address & ~HPAGE_PMD_MASK); 902 + if (!test_and_set_bit(_SEGMENT_ENTRY_SPLIT_BIT, 903 + (unsigned long *) pmdp)) { 904 + /* need to serialize against gup-fast (IRQ disabled) */ 905 + smp_call_function(pmdp_splitting_flush_sync, NULL, 1); 906 + } 907 + } 908 + 909 + void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable) 910 + { 911 + struct list_head *lh = (struct list_head *) pgtable; 912 + 913 + assert_spin_locked(&mm->page_table_lock); 914 + 915 + /* FIFO */ 916 + if (!mm->pmd_huge_pte) 917 + INIT_LIST_HEAD(lh); 918 + else 919 + list_add(lh, (struct list_head *) mm->pmd_huge_pte); 920 + mm->pmd_huge_pte = pgtable; 921 + } 922 + 923 + pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm) 924 + { 925 + struct list_head *lh; 926 + pgtable_t pgtable; 927 + pte_t *ptep; 928 + 929 + assert_spin_locked(&mm->page_table_lock); 930 + 931 + /* FIFO */ 932 + pgtable = mm->pmd_huge_pte; 933 + lh = (struct list_head *) pgtable; 934 + if (list_empty(lh)) 935 + mm->pmd_huge_pte = NULL; 936 + else { 937 + mm->pmd_huge_pte = (pgtable_t) lh->next; 938 + list_del(lh); 939 + } 940 + ptep = (pte_t *) pgtable; 941 + pte_val(*ptep) = _PAGE_TYPE_EMPTY; 942 + ptep++; 943 + pte_val(*ptep) = _PAGE_TYPE_EMPTY; 944 + return pgtable; 945 + } 946 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+3
arch/sh/Kconfig
··· 13 13 select HAVE_DMA_ATTRS 14 14 select HAVE_IRQ_WORK 15 15 select HAVE_PERF_EVENTS 16 + select HAVE_DEBUG_BUGVERBOSE 16 17 select ARCH_HAVE_CUSTOM_GPIO_H 17 18 select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A) 18 19 select PERF_USE_VMALLOC 20 + select HAVE_DEBUG_KMEMLEAK 19 21 select HAVE_KERNEL_GZIP 20 22 select HAVE_KERNEL_BZIP2 21 23 select HAVE_KERNEL_LZMA 22 24 select HAVE_KERNEL_XZ 23 25 select HAVE_KERNEL_LZO 26 + select HAVE_UID16 24 27 select ARCH_WANT_IPC_PARSE_VERSION 25 28 select HAVE_SYSCALL_TRACEPOINTS 26 29 select HAVE_REGS_AND_STACK_ACCESS_API
+6
arch/sh/include/asm/hugetlb.h
··· 1 1 #ifndef _ASM_SH_HUGETLB_H 2 2 #define _ASM_SH_HUGETLB_H 3 3 4 + #include <asm/cacheflush.h> 4 5 #include <asm/page.h> 5 6 6 7 ··· 88 87 89 88 static inline void arch_release_hugepage(struct page *page) 90 89 { 90 + } 91 + 92 + static inline void arch_clear_hugepage_flags(struct page *page) 93 + { 94 + clear_bit(PG_dcache_clean, &page->flags); 91 95 } 92 96 93 97 #endif /* _ASM_SH_HUGETLB_H */
+1
arch/sh/mm/fault.c
··· 504 504 } 505 505 if (fault & VM_FAULT_RETRY) { 506 506 flags &= ~FAULT_FLAG_ALLOW_RETRY; 507 + flags |= FAULT_FLAG_TRIED; 507 508 508 509 /* 509 510 * No need to up_read(&mm->mmap_sem) as we would
+5 -36
arch/sparc/Kconfig
··· 18 18 select HAVE_OPROFILE 19 19 select HAVE_ARCH_KGDB if !SMP || SPARC64 20 20 select HAVE_ARCH_TRACEHOOK 21 + select SYSCTL_EXCEPTION_TRACE 21 22 select ARCH_WANT_OPTIONAL_GPIOLIB 22 23 select RTC_CLASS 23 24 select RTC_DRV_M48T59 ··· 33 32 select GENERIC_PCI_IOMAP 34 33 select HAVE_NMI_WATCHDOG if SPARC64 35 34 select HAVE_BPF_JIT 35 + select HAVE_DEBUG_BUGVERBOSE 36 36 select GENERIC_SMP_IDLE_THREAD 37 37 select GENERIC_CMOS_UPDATE 38 38 select GENERIC_CLOCKEVENTS ··· 44 42 def_bool !64BIT 45 43 select GENERIC_ATOMIC64 46 44 select CLZ_TAB 45 + select HAVE_UID16 47 46 48 47 config SPARC64 49 48 def_bool 64BIT ··· 62 59 select HAVE_DYNAMIC_FTRACE 63 60 select HAVE_FTRACE_MCOUNT_RECORD 64 61 select HAVE_SYSCALL_TRACEPOINTS 62 + select HAVE_DEBUG_KMEMLEAK 65 63 select RTC_DRV_CMOS 66 64 select RTC_DRV_BQ4802 67 65 select RTC_DRV_SUN4V ··· 230 226 help 231 227 Say Y here to enable a faster early framebuffer boot console. 232 228 233 - choice 234 - prompt "Kernel page size" if SPARC64 235 - default SPARC64_PAGE_SIZE_8KB 236 - 237 - config SPARC64_PAGE_SIZE_8KB 238 - bool "8KB" 239 - help 240 - This lets you select the page size of the kernel. 241 - 242 - 8KB and 64KB work quite well, since SPARC ELF sections 243 - provide for up to 64KB alignment. 244 - 245 - If you don't know what to do, choose 8KB. 
246 - 247 - config SPARC64_PAGE_SIZE_64KB 248 - bool "64KB" 249 - 250 - endchoice 251 - 252 229 config SECCOMP 253 230 bool "Enable seccomp to safely compute untrusted bytecode" 254 231 depends on SPARC64 && PROC_FS ··· 300 315 bool 301 316 default y 302 317 depends on SPARC64 && SMP && PREEMPT 303 - 304 - choice 305 - prompt "SPARC64 Huge TLB Page Size" 306 - depends on SPARC64 && HUGETLB_PAGE 307 - default HUGETLB_PAGE_SIZE_4MB 308 - 309 - config HUGETLB_PAGE_SIZE_4MB 310 - bool "4MB" 311 - 312 - config HUGETLB_PAGE_SIZE_512K 313 - bool "512K" 314 - 315 - config HUGETLB_PAGE_SIZE_64K 316 - depends on !SPARC64_PAGE_SIZE_64KB 317 - bool "64K" 318 - 319 - endchoice 320 318 321 319 config NUMA 322 320 bool "NUMA support" ··· 539 571 depends on SPARC64 540 572 default y 541 573 select COMPAT_BINFMT_ELF 574 + select HAVE_UID16 542 575 select ARCH_WANT_OLD_COMPAT_IPC 543 576 544 577 config SYSVIPC_COMPAT
+8 -1
arch/sparc/include/asm/hugetlb.h
··· 10 10 pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, 11 11 pte_t *ptep); 12 12 13 - void hugetlb_prefault_arch_hook(struct mm_struct *mm); 13 + static inline void hugetlb_prefault_arch_hook(struct mm_struct *mm) 14 + { 15 + hugetlb_setup(mm); 16 + } 14 17 15 18 static inline int is_hugepage_only_range(struct mm_struct *mm, 16 19 unsigned long addr, ··· 82 79 } 83 80 84 81 static inline void arch_release_hugepage(struct page *page) 82 + { 83 + } 84 + 85 + static inline void arch_clear_hugepage_flags(struct page *page) 85 86 { 86 87 } 87 88
+3 -16
arch/sparc/include/asm/mmu_64.h
··· 30 30 #define CTX_PGSZ_MASK ((CTX_PGSZ_BITS << CTX_PGSZ0_SHIFT) | \ 31 31 (CTX_PGSZ_BITS << CTX_PGSZ1_SHIFT)) 32 32 33 - #if defined(CONFIG_SPARC64_PAGE_SIZE_8KB) 34 33 #define CTX_PGSZ_BASE CTX_PGSZ_8KB 35 - #elif defined(CONFIG_SPARC64_PAGE_SIZE_64KB) 36 - #define CTX_PGSZ_BASE CTX_PGSZ_64KB 37 - #else 38 - #error No page size specified in kernel configuration 39 - #endif 40 - 41 - #if defined(CONFIG_HUGETLB_PAGE_SIZE_4MB) 42 - #define CTX_PGSZ_HUGE CTX_PGSZ_4MB 43 - #elif defined(CONFIG_HUGETLB_PAGE_SIZE_512K) 44 - #define CTX_PGSZ_HUGE CTX_PGSZ_512KB 45 - #elif defined(CONFIG_HUGETLB_PAGE_SIZE_64K) 46 - #define CTX_PGSZ_HUGE CTX_PGSZ_64KB 47 - #endif 48 - 34 + #define CTX_PGSZ_HUGE CTX_PGSZ_4MB 49 35 #define CTX_PGSZ_KERN CTX_PGSZ_4MB 50 36 51 37 /* Thus, when running on UltraSPARC-III+ and later, we use the following ··· 82 96 83 97 #define MM_TSB_BASE 0 84 98 85 - #ifdef CONFIG_HUGETLB_PAGE 99 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 86 100 #define MM_TSB_HUGE 1 87 101 #define MM_NUM_TSBS 2 88 102 #else ··· 93 107 spinlock_t lock; 94 108 unsigned long sparc64_ctx_val; 95 109 unsigned long huge_pte_count; 110 + struct page *pgtable_page; 96 111 struct tsb_config tsb_block[MM_NUM_TSBS]; 97 112 struct hv_tsb_descr tsb_descr[MM_NUM_TSBS]; 98 113 } mm_context_t;
+1 -1
arch/sparc/include/asm/mmu_context_64.h
··· 36 36 { 37 37 __tsb_context_switch(__pa(mm->pgd), 38 38 &mm->context.tsb_block[0], 39 - #ifdef CONFIG_HUGETLB_PAGE 39 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 40 40 (mm->context.tsb_block[1].tsb ? 41 41 &mm->context.tsb_block[1] : 42 42 NULL)
+7 -14
arch/sparc/include/asm/page_64.h
··· 3 3 4 4 #include <linux/const.h> 5 5 6 - #if defined(CONFIG_SPARC64_PAGE_SIZE_8KB) 7 6 #define PAGE_SHIFT 13 8 - #elif defined(CONFIG_SPARC64_PAGE_SIZE_64KB) 9 - #define PAGE_SHIFT 16 10 - #else 11 - #error No page size specified in kernel configuration 12 - #endif 13 7 14 8 #define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT) 15 9 #define PAGE_MASK (~(PAGE_SIZE-1)) ··· 15 21 #define DCACHE_ALIASING_POSSIBLE 16 22 #endif 17 23 18 - #if defined(CONFIG_HUGETLB_PAGE_SIZE_4MB) 19 24 #define HPAGE_SHIFT 22 20 - #elif defined(CONFIG_HUGETLB_PAGE_SIZE_512K) 21 - #define HPAGE_SHIFT 19 22 - #elif defined(CONFIG_HUGETLB_PAGE_SIZE_64K) 23 - #define HPAGE_SHIFT 16 24 - #endif 25 25 26 - #ifdef CONFIG_HUGETLB_PAGE 26 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 27 27 #define HPAGE_SIZE (_AC(1,UL) << HPAGE_SHIFT) 28 28 #define HPAGE_MASK (~(HPAGE_SIZE - 1UL)) 29 29 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) ··· 25 37 #endif 26 38 27 39 #ifndef __ASSEMBLY__ 40 + 41 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 42 + struct mm_struct; 43 + extern void hugetlb_setup(struct mm_struct *mm); 44 + #endif 28 45 29 46 #define WANT_PAGE_VIRTUAL 30 47 ··· 91 98 92 99 #endif /* (STRICT_MM_TYPECHECKS) */ 93 100 94 - typedef struct page *pgtable_t; 101 + typedef pte_t *pgtable_t; 95 102 96 103 #define TASK_UNMAPPED_BASE (test_thread_flag(TIF_32BIT) ? \ 97 104 (_AC(0x0000000070000000,UL)) : \
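After this hunk the sparc64 page geometry is fixed rather than configurable: 8KB base pages (`PAGE_SHIFT 13`) and 4MB huge pages (`HPAGE_SHIFT 22`). A quick sanity-check sketch of the derived constants:

```c
/* Fixed sparc64 geometry per the patch: 8KB base, 4MB huge pages. */
#define PAGE_SHIFT  13
#define HPAGE_SHIFT 22
#define PAGE_SIZE   (1UL << PAGE_SHIFT)
#define HPAGE_SIZE  (1UL << HPAGE_SHIFT)
/* allocation order of one huge page in base pages */
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
```

So one 4MB huge page covers 2^9 = 512 base pages, which is also the factor used by the RSS accounting in `fault_64.c` further down.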
+12 -44
arch/sparc/include/asm/pgalloc_64.h
··· 38 38 kmem_cache_free(pgtable_cache, pmd); 39 39 } 40 40 41 - static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, 42 - unsigned long address) 43 - { 44 - return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO); 45 - } 41 + extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, 42 + unsigned long address); 43 + extern pgtable_t pte_alloc_one(struct mm_struct *mm, 44 + unsigned long address); 45 + extern void pte_free_kernel(struct mm_struct *mm, pte_t *pte); 46 + extern void pte_free(struct mm_struct *mm, pgtable_t ptepage); 46 47 47 - static inline pgtable_t pte_alloc_one(struct mm_struct *mm, 48 - unsigned long address) 49 - { 50 - struct page *page; 51 - pte_t *pte; 52 - 53 - pte = pte_alloc_one_kernel(mm, address); 54 - if (!pte) 55 - return NULL; 56 - page = virt_to_page(pte); 57 - pgtable_page_ctor(page); 58 - return page; 59 - } 60 - 61 - static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) 62 - { 63 - free_page((unsigned long)pte); 64 - } 65 - 66 - static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage) 67 - { 68 - pgtable_page_dtor(ptepage); 69 - __free_page(ptepage); 70 - } 71 - 72 - #define pmd_populate_kernel(MM, PMD, PTE) pmd_set(PMD, PTE) 73 - #define pmd_populate(MM,PMD,PTE_PAGE) \ 74 - pmd_populate_kernel(MM,PMD,page_address(PTE_PAGE)) 75 - #define pmd_pgtable(pmd) pmd_page(pmd) 48 + #define pmd_populate_kernel(MM, PMD, PTE) pmd_set(MM, PMD, PTE) 49 + #define pmd_populate(MM, PMD, PTE) pmd_set(MM, PMD, PTE) 50 + #define pmd_pgtable(PMD) ((pte_t *)__pmd_page(PMD)) 76 51 77 52 #define check_pgt_cache() do { } while (0) 78 53 79 - static inline void pgtable_free(void *table, bool is_page) 80 - { 81 - if (is_page) 82 - free_page((unsigned long)table); 83 - else 84 - kmem_cache_free(pgtable_cache, table); 85 - } 54 + extern void pgtable_free(void *table, bool is_page); 86 55 87 56 #ifdef CONFIG_SMP 88 57 ··· 82 113 } 83 114 #endif /* !CONFIG_SMP */ 84 115 85 - static inline void 
__pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage, 116 + static inline void __pte_free_tlb(struct mmu_gather *tlb, pte_t *pte, 86 117 unsigned long address) 87 118 { 88 - pgtable_page_dtor(ptepage); 89 - pgtable_free_tlb(tlb, page_address(ptepage), true); 119 + pgtable_free_tlb(tlb, pte, true); 90 120 } 91 121 92 122 #define __pmd_free_tlb(tlb, pmd, addr) \
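The pgalloc changes above move PTE-table allocation out of line because tables are now half a page (4K of a 8K page), so one page backs two tables. A hedged user-space sketch of that two-halves scheme (function names and the malloc-based backing are illustrative only; the real code caches the page in `mm->context.pgtable_page` under `mm->page_table_lock`):

```c
#include <stdlib.h>

#define PAGE_SIZE 8192UL

/* Second half of the most recently allocated page, if still unused. */
static void *cached_half;

static void *pte_table_alloc(void)
{
    void *p;

    if (cached_half) {              /* hand out the saved second half */
        p = cached_half;
        cached_half = NULL;
        return p;
    }
    p = calloc(1, PAGE_SIZE);       /* fresh zeroed page */
    if (!p)
        return NULL;
    cached_half = (char *)p + PAGE_SIZE / 2;
    return p;                       /* caller gets the first half */
}
```

Two consecutive allocations thus come from the same page, 4K apart, halving per-process page-table memory for sparse address spaces.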
+193 -58
arch/sparc/include/asm/pgtable_64.h
··· 45 45 46 46 #define vmemmap ((struct page *)VMEMMAP_BASE) 47 47 48 - /* XXX All of this needs to be rethought so we can take advantage 49 - * XXX cheetah's full 64-bit virtual address space, ie. no more hole 50 - * XXX in the middle like on spitfire. -DaveM 51 - */ 52 - /* 53 - * Given a virtual address, the lowest PAGE_SHIFT bits determine offset 54 - * into the page; the next higher PAGE_SHIFT-3 bits determine the pte# 55 - * in the proper pagetable (the -3 is from the 8 byte ptes, and each page 56 - * table is a single page long). The next higher PMD_BITS determine pmd# 57 - * in the proper pmdtable (where we must have PMD_BITS <= (PAGE_SHIFT-2) 58 - * since the pmd entries are 4 bytes, and each pmd page is a single page 59 - * long). Finally, the higher few bits determine pgde#. 60 - */ 61 - 62 48 /* PMD_SHIFT determines the size of the area a second-level page 63 49 * table can map 64 50 */ 65 - #define PMD_SHIFT (PAGE_SHIFT + (PAGE_SHIFT-3)) 51 + #define PMD_SHIFT (PAGE_SHIFT + (PAGE_SHIFT-4)) 66 52 #define PMD_SIZE (_AC(1,UL) << PMD_SHIFT) 67 53 #define PMD_MASK (~(PMD_SIZE-1)) 68 54 #define PMD_BITS (PAGE_SHIFT - 2) 69 55 70 56 /* PGDIR_SHIFT determines what a third-level page table entry can map */ 71 - #define PGDIR_SHIFT (PAGE_SHIFT + (PAGE_SHIFT-3) + PMD_BITS) 57 + #define PGDIR_SHIFT (PAGE_SHIFT + (PAGE_SHIFT-4) + PMD_BITS) 72 58 #define PGDIR_SIZE (_AC(1,UL) << PGDIR_SHIFT) 73 59 #define PGDIR_MASK (~(PGDIR_SIZE-1)) 74 60 #define PGDIR_BITS (PAGE_SHIFT - 2) 61 + 62 + #if (PGDIR_SHIFT + PGDIR_BITS) != 44 63 + #error Page table parameters do not cover virtual address space properly. 64 + #endif 65 + 66 + #if (PMD_SHIFT != HPAGE_SHIFT) 67 + #error PMD_SHIFT must equal HPAGE_SHIFT for transparent huge pages. 68 + #endif 69 + 70 + /* PMDs point to PTE tables which are 4K aligned. 
*/ 71 + #define PMD_PADDR _AC(0xfffffffe,UL) 72 + #define PMD_PADDR_SHIFT _AC(11,UL) 73 + 74 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 75 + #define PMD_ISHUGE _AC(0x00000001,UL) 76 + 77 + /* This is the PMD layout when PMD_ISHUGE is set. With 4MB huge 78 + * pages, this frees up a bunch of bits in the layout that we can 79 + * use for the protection settings and software metadata. 80 + */ 81 + #define PMD_HUGE_PADDR _AC(0xfffff800,UL) 82 + #define PMD_HUGE_PROTBITS _AC(0x000007ff,UL) 83 + #define PMD_HUGE_PRESENT _AC(0x00000400,UL) 84 + #define PMD_HUGE_WRITE _AC(0x00000200,UL) 85 + #define PMD_HUGE_DIRTY _AC(0x00000100,UL) 86 + #define PMD_HUGE_ACCESSED _AC(0x00000080,UL) 87 + #define PMD_HUGE_EXEC _AC(0x00000040,UL) 88 + #define PMD_HUGE_SPLITTING _AC(0x00000020,UL) 89 + #endif 90 + 91 + /* PGDs point to PMD tables which are 8K aligned. */ 92 + #define PGD_PADDR _AC(0xfffffffc,UL) 93 + #define PGD_PADDR_SHIFT _AC(11,UL) 75 94 76 95 #ifndef __ASSEMBLY__ 77 96 78 97 #include <linux/sched.h> 79 98 80 99 /* Entries per page directory level. 
*/ 81 - #define PTRS_PER_PTE (1UL << (PAGE_SHIFT-3)) 100 + #define PTRS_PER_PTE (1UL << (PAGE_SHIFT-4)) 82 101 #define PTRS_PER_PMD (1UL << PMD_BITS) 83 102 #define PTRS_PER_PGD (1UL << PGDIR_BITS) 84 103 ··· 179 160 #define _PAGE_SZ8K_4V _AC(0x0000000000000000,UL) /* 8K Page */ 180 161 #define _PAGE_SZALL_4V _AC(0x0000000000000007,UL) /* All pgsz bits */ 181 162 182 - #if PAGE_SHIFT == 13 183 163 #define _PAGE_SZBITS_4U _PAGE_SZ8K_4U 184 164 #define _PAGE_SZBITS_4V _PAGE_SZ8K_4V 185 - #elif PAGE_SHIFT == 16 186 - #define _PAGE_SZBITS_4U _PAGE_SZ64K_4U 187 - #define _PAGE_SZBITS_4V _PAGE_SZ64K_4V 188 - #else 189 - #error Wrong PAGE_SHIFT specified 190 - #endif 191 165 192 - #if defined(CONFIG_HUGETLB_PAGE_SIZE_4MB) 193 166 #define _PAGE_SZHUGE_4U _PAGE_SZ4MB_4U 194 167 #define _PAGE_SZHUGE_4V _PAGE_SZ4MB_4V 195 - #elif defined(CONFIG_HUGETLB_PAGE_SIZE_512K) 196 - #define _PAGE_SZHUGE_4U _PAGE_SZ512K_4U 197 - #define _PAGE_SZHUGE_4V _PAGE_SZ512K_4V 198 - #elif defined(CONFIG_HUGETLB_PAGE_SIZE_64K) 199 - #define _PAGE_SZHUGE_4U _PAGE_SZ64K_4U 200 - #define _PAGE_SZHUGE_4V _PAGE_SZ64K_4V 201 - #endif 202 168 203 169 /* These are actually filled in at boot time by sun4{u,v}_pgprot_init() */ 204 170 #define __P000 __pgprot(0) ··· 222 218 223 219 extern unsigned long pg_iobits; 224 220 extern unsigned long _PAGE_ALL_SZ_BITS; 225 - extern unsigned long _PAGE_SZBITS; 226 221 227 222 extern struct page *mem_map_zero; 228 223 #define ZERO_PAGE(vaddr) (mem_map_zero) ··· 234 231 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot) 235 232 { 236 233 unsigned long paddr = pfn << PAGE_SHIFT; 237 - unsigned long sz_bits; 238 234 239 - sz_bits = 0UL; 240 - if (_PAGE_SZBITS_4U != 0UL || _PAGE_SZBITS_4V != 0UL) { 241 - __asm__ __volatile__( 242 - "\n661: sethi %%uhi(%1), %0\n" 243 - " sllx %0, 32, %0\n" 244 - " .section .sun4v_2insn_patch, \"ax\"\n" 245 - " .word 661b\n" 246 - " mov %2, %0\n" 247 - " nop\n" 248 - " .previous\n" 249 - : "=r" (sz_bits) 250 - : "i" 
(_PAGE_SZBITS_4U), "i" (_PAGE_SZBITS_4V)); 251 - } 252 - return __pte(paddr | sz_bits | pgprot_val(prot)); 235 + BUILD_BUG_ON(_PAGE_SZBITS_4U != 0UL || _PAGE_SZBITS_4V != 0UL); 236 + return __pte(paddr | pgprot_val(prot)); 253 237 } 254 238 #define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) 239 + 240 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 241 + extern pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot); 242 + #define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot)) 243 + 244 + extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot); 245 + 246 + static inline pmd_t pmd_mkhuge(pmd_t pmd) 247 + { 248 + /* Do nothing, mk_pmd() does this part. */ 249 + return pmd; 250 + } 251 + #endif 255 252 256 253 /* This one can be done with two shifts. */ 257 254 static inline unsigned long pte_pfn(pte_t pte) ··· 289 286 * Note: We encode this into 3 sun4v 2-insn patch sequences. 290 287 */ 291 288 289 + BUILD_BUG_ON(_PAGE_SZBITS_4U != 0UL || _PAGE_SZBITS_4V != 0UL); 292 290 __asm__ __volatile__( 293 291 "\n661: sethi %%uhi(%2), %1\n" 294 292 " sethi %%hi(%2), %0\n" ··· 311 307 : "=r" (mask), "=r" (tmp) 312 308 : "i" (_PAGE_PADDR_4U | _PAGE_MODIFIED_4U | _PAGE_ACCESSED_4U | 313 309 _PAGE_CP_4U | _PAGE_CV_4U | _PAGE_E_4U | _PAGE_PRESENT_4U | 314 - _PAGE_SZBITS_4U | _PAGE_SPECIAL), 310 + _PAGE_SPECIAL), 315 311 "i" (_PAGE_PADDR_4V | _PAGE_MODIFIED_4V | _PAGE_ACCESSED_4V | 316 312 _PAGE_CP_4V | _PAGE_CV_4V | _PAGE_E_4V | _PAGE_PRESENT_4V | 317 - _PAGE_SZBITS_4V | _PAGE_SPECIAL)); 313 + _PAGE_SPECIAL)); 318 314 319 315 return __pte((pte_val(pte) & mask) | (pgprot_val(prot) & ~mask)); 320 316 } ··· 622 618 return pte_val(pte) & _PAGE_SPECIAL; 623 619 } 624 620 625 - #define pmd_set(pmdp, ptep) \ 626 - (pmd_val(*(pmdp)) = (__pa((unsigned long) (ptep)) >> 11UL)) 621 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 622 + static inline int pmd_young(pmd_t pmd) 623 + { 624 + return pmd_val(pmd) & PMD_HUGE_ACCESSED; 625 + } 626 + 627 + static inline int pmd_write(pmd_t pmd) 628 + { 
629 + return pmd_val(pmd) & PMD_HUGE_WRITE; 630 + } 631 + 632 + static inline unsigned long pmd_pfn(pmd_t pmd) 633 + { 634 + unsigned long val = pmd_val(pmd) & PMD_HUGE_PADDR; 635 + 636 + return val >> (PAGE_SHIFT - PMD_PADDR_SHIFT); 637 + } 638 + 639 + static inline int pmd_large(pmd_t pmd) 640 + { 641 + return (pmd_val(pmd) & (PMD_ISHUGE | PMD_HUGE_PRESENT)) == 642 + (PMD_ISHUGE | PMD_HUGE_PRESENT); 643 + } 644 + 645 + static inline int pmd_trans_splitting(pmd_t pmd) 646 + { 647 + return (pmd_val(pmd) & (PMD_ISHUGE|PMD_HUGE_SPLITTING)) == 648 + (PMD_ISHUGE|PMD_HUGE_SPLITTING); 649 + } 650 + 651 + static inline int pmd_trans_huge(pmd_t pmd) 652 + { 653 + return pmd_val(pmd) & PMD_ISHUGE; 654 + } 655 + 656 + #define has_transparent_hugepage() 1 657 + 658 + static inline pmd_t pmd_mkold(pmd_t pmd) 659 + { 660 + pmd_val(pmd) &= ~PMD_HUGE_ACCESSED; 661 + return pmd; 662 + } 663 + 664 + static inline pmd_t pmd_wrprotect(pmd_t pmd) 665 + { 666 + pmd_val(pmd) &= ~PMD_HUGE_WRITE; 667 + return pmd; 668 + } 669 + 670 + static inline pmd_t pmd_mkdirty(pmd_t pmd) 671 + { 672 + pmd_val(pmd) |= PMD_HUGE_DIRTY; 673 + return pmd; 674 + } 675 + 676 + static inline pmd_t pmd_mkyoung(pmd_t pmd) 677 + { 678 + pmd_val(pmd) |= PMD_HUGE_ACCESSED; 679 + return pmd; 680 + } 681 + 682 + static inline pmd_t pmd_mkwrite(pmd_t pmd) 683 + { 684 + pmd_val(pmd) |= PMD_HUGE_WRITE; 685 + return pmd; 686 + } 687 + 688 + static inline pmd_t pmd_mknotpresent(pmd_t pmd) 689 + { 690 + pmd_val(pmd) &= ~PMD_HUGE_PRESENT; 691 + return pmd; 692 + } 693 + 694 + static inline pmd_t pmd_mksplitting(pmd_t pmd) 695 + { 696 + pmd_val(pmd) |= PMD_HUGE_SPLITTING; 697 + return pmd; 698 + } 699 + 700 + extern pgprot_t pmd_pgprot(pmd_t entry); 701 + #endif 702 + 703 + static inline int pmd_present(pmd_t pmd) 704 + { 705 + return pmd_val(pmd) != 0U; 706 + } 707 + 708 + #define pmd_none(pmd) (!pmd_val(pmd)) 709 + 710 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 711 + extern void set_pmd_at(struct mm_struct *mm, unsigned long 
addr, 712 + pmd_t *pmdp, pmd_t pmd); 713 + #else 714 + static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, 715 + pmd_t *pmdp, pmd_t pmd) 716 + { 717 + *pmdp = pmd; 718 + } 719 + #endif 720 + 721 + static inline void pmd_set(struct mm_struct *mm, pmd_t *pmdp, pte_t *ptep) 722 + { 723 + unsigned long val = __pa((unsigned long) (ptep)) >> PMD_PADDR_SHIFT; 724 + 725 + pmd_val(*pmdp) = val; 726 + } 727 + 627 728 #define pud_set(pudp, pmdp) \ 628 - (pud_val(*(pudp)) = (__pa((unsigned long) (pmdp)) >> 11UL)) 629 - #define __pmd_page(pmd) \ 630 - ((unsigned long) __va((((unsigned long)pmd_val(pmd))<<11UL))) 729 + (pud_val(*(pudp)) = (__pa((unsigned long) (pmdp)) >> PGD_PADDR_SHIFT)) 730 + static inline unsigned long __pmd_page(pmd_t pmd) 731 + { 732 + unsigned long paddr = (unsigned long) pmd_val(pmd); 733 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 734 + if (pmd_val(pmd) & PMD_ISHUGE) 735 + paddr &= PMD_HUGE_PADDR; 736 + #endif 737 + paddr <<= PMD_PADDR_SHIFT; 738 + return ((unsigned long) __va(paddr)); 739 + } 631 740 #define pmd_page(pmd) virt_to_page((void *)__pmd_page(pmd)) 632 741 #define pud_page_vaddr(pud) \ 633 - ((unsigned long) __va((((unsigned long)pud_val(pud))<<11UL))) 742 + ((unsigned long) __va((((unsigned long)pud_val(pud))<<PGD_PADDR_SHIFT))) 634 743 #define pud_page(pud) virt_to_page((void *)pud_page_vaddr(pud)) 635 - #define pmd_none(pmd) (!pmd_val(pmd)) 636 744 #define pmd_bad(pmd) (0) 637 - #define pmd_present(pmd) (pmd_val(pmd) != 0U) 638 745 #define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0U) 639 746 #define pud_none(pud) (!pud_val(pud)) 640 747 #define pud_bad(pud) (0) ··· 778 663 /* Actual page table PTE updates. 
*/ 779 664 extern void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, 780 665 pte_t *ptep, pte_t orig, int fullmm); 666 + 667 + #define __HAVE_ARCH_PMDP_GET_AND_CLEAR 668 + static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, 669 + unsigned long addr, 670 + pmd_t *pmdp) 671 + { 672 + pmd_t pmd = *pmdp; 673 + set_pmd_at(mm, addr, pmdp, __pmd(0U)); 674 + return pmd; 675 + } 781 676 782 677 static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, 783 678 pte_t *ptep, pte_t pte, int fullmm) ··· 844 719 845 720 struct vm_area_struct; 846 721 extern void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t *); 722 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 723 + extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr, 724 + pmd_t *pmd); 725 + 726 + #define __HAVE_ARCH_PGTABLE_DEPOSIT 727 + extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable); 728 + 729 + #define __HAVE_ARCH_PGTABLE_WITHDRAW 730 + extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm); 731 + #endif 847 732 848 733 /* Encode and de-code a swap entry */ 849 734 #define __swp_type(entry) (((entry).val >> PAGE_SHIFT) & 0xffUL)
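The new huge-PMD accessors above are all mask operations on the software bits the patch defines (`PMD_HUGE_ACCESSED`, `PMD_HUGE_WRITE`, ...). A sketch of the same pattern on a plain integer, using the masks from the hunk (no real page tables involved):

```c
/* Masks copied from the patch's PMD layout. */
#define PMD_HUGE_WRITE    0x00000200UL
#define PMD_HUGE_ACCESSED 0x00000080UL

typedef unsigned long pmdval_t;

static pmdval_t pmd_mkyoung(pmdval_t v)   { return v |  PMD_HUGE_ACCESSED; }
static pmdval_t pmd_mkold(pmdval_t v)     { return v & ~PMD_HUGE_ACCESSED; }
static pmdval_t pmd_mkwrite(pmdval_t v)   { return v |  PMD_HUGE_WRITE; }
static pmdval_t pmd_wrprotect(pmdval_t v) { return v & ~PMD_HUGE_WRITE; }
static int pmd_young(pmdval_t v)          { return !!(v & PMD_HUGE_ACCESSED); }
static int pmd_write(pmdval_t v)          { return !!(v & PMD_HUGE_WRITE); }
```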
+93 -13
arch/sparc/include/asm/tsb.h
··· 147 147 brz,pn REG1, FAIL_LABEL; \ 148 148 sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \ 149 149 srlx REG2, 64 - PAGE_SHIFT, REG2; \ 150 - sllx REG1, 11, REG1; \ 150 + sllx REG1, PGD_PADDR_SHIFT, REG1; \ 151 151 andn REG2, 0x3, REG2; \ 152 152 lduwa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \ 153 153 brz,pn REG1, FAIL_LABEL; \ 154 154 sllx VADDR, 64 - PMD_SHIFT, REG2; \ 155 - srlx REG2, 64 - PAGE_SHIFT, REG2; \ 156 - sllx REG1, 11, REG1; \ 155 + srlx REG2, 64 - (PAGE_SHIFT - 1), REG2; \ 156 + sllx REG1, PMD_PADDR_SHIFT, REG1; \ 157 157 andn REG2, 0x7, REG2; \ 158 158 add REG1, REG2, REG1; 159 159 160 - /* Do a user page table walk in MMU globals. Leaves physical PTE 161 - * pointer in REG1. Jumps to FAIL_LABEL on early page table walk 162 - * termination. Physical base of page tables is in PHYS_PGD which 163 - * will not be modified. 160 + /* This macro exists only to make the PMD translator below easier 161 + * to read. It hides the ELF section switch for the sun4v code 162 + * patching. 163 + */ 164 + #define OR_PTE_BIT(REG, NAME) \ 165 + 661: or REG, _PAGE_##NAME##_4U, REG; \ 166 + .section .sun4v_1insn_patch, "ax"; \ 167 + .word 661b; \ 168 + or REG, _PAGE_##NAME##_4V, REG; \ 169 + .previous; 170 + 171 + /* Load into REG the PTE value for VALID, CACHE, and SZHUGE. */ 172 + #define BUILD_PTE_VALID_SZHUGE_CACHE(REG) \ 173 + 661: sethi %uhi(_PAGE_VALID|_PAGE_SZHUGE_4U), REG; \ 174 + .section .sun4v_1insn_patch, "ax"; \ 175 + .word 661b; \ 176 + sethi %uhi(_PAGE_VALID), REG; \ 177 + .previous; \ 178 + sllx REG, 32, REG; \ 179 + 661: or REG, _PAGE_CP_4U|_PAGE_CV_4U, REG; \ 180 + .section .sun4v_1insn_patch, "ax"; \ 181 + .word 661b; \ 182 + or REG, _PAGE_CP_4V|_PAGE_CV_4V|_PAGE_SZHUGE_4V, REG; \ 183 + .previous; 184 + 185 + /* PMD has been loaded into REG1, interpret the value, seeing 186 + * if it is a HUGE PMD or a normal one. If it is not valid 187 + * then jump to FAIL_LABEL. If it is a HUGE PMD, and it 188 + * translates to a valid PTE, branch to PTE_LABEL. 
189 + * 190 + * We translate the PMD by hand, one bit at a time, 191 + * constructing the huge PTE. 192 + * 193 + * So we construct the PTE in REG2 as follows: 194 + * 195 + * 1) Extract the PMD PFN from REG1 and place it into REG2. 196 + * 197 + * 2) Translate PMD protection bits in REG1 into REG2, one bit 198 + * at a time using andcc tests on REG1 and OR's into REG2. 199 + * 200 + * Only two bits to be concerned with here, EXEC and WRITE. 201 + * Now REG1 is freed up and we can use it as a temporary. 202 + * 203 + * 3) Construct the VALID, CACHE, and page size PTE bits in 204 + * REG1, OR with REG2 to form final PTE. 205 + */ 206 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 207 + #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ 208 + brz,pn REG1, FAIL_LABEL; \ 209 + andcc REG1, PMD_ISHUGE, %g0; \ 210 + be,pt %xcc, 700f; \ 211 + and REG1, PMD_HUGE_PRESENT|PMD_HUGE_ACCESSED, REG2; \ 212 + cmp REG2, PMD_HUGE_PRESENT|PMD_HUGE_ACCESSED; \ 213 + bne,pn %xcc, FAIL_LABEL; \ 214 + andn REG1, PMD_HUGE_PROTBITS, REG2; \ 215 + sllx REG2, PMD_PADDR_SHIFT, REG2; \ 216 + /* REG2 now holds PFN << PAGE_SHIFT */ \ 217 + andcc REG1, PMD_HUGE_EXEC, %g0; \ 218 + bne,a,pt %xcc, 1f; \ 219 + OR_PTE_BIT(REG2, EXEC); \ 220 + 1: andcc REG1, PMD_HUGE_WRITE, %g0; \ 221 + bne,a,pt %xcc, 1f; \ 222 + OR_PTE_BIT(REG2, W); \ 223 + /* REG1 can now be clobbered, build final PTE */ \ 224 + 1: BUILD_PTE_VALID_SZHUGE_CACHE(REG1); \ 225 + ba,pt %xcc, PTE_LABEL; \ 226 + or REG1, REG2, REG1; \ 227 + 700: 228 + #else 229 + #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ 230 + brz,pn REG1, FAIL_LABEL; \ 231 + nop; 232 + #endif 233 + 234 + /* Do a user page table walk in MMU globals. Leaves final, 235 + * valid, PTE value in REG1. Jumps to FAIL_LABEL on early 236 + * page table walk termination or if the PTE is not valid. 237 + * 238 + * Physical base of page tables is in PHYS_PGD which will not 239 + * be modified. 
164 240 * 165 241 * VADDR will not be clobbered, but REG1 and REG2 will. 166 242 */ ··· 248 172 brz,pn REG1, FAIL_LABEL; \ 249 173 sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \ 250 174 srlx REG2, 64 - PAGE_SHIFT, REG2; \ 251 - sllx REG1, 11, REG1; \ 175 + sllx REG1, PGD_PADDR_SHIFT, REG1; \ 252 176 andn REG2, 0x3, REG2; \ 253 177 lduwa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \ 254 - brz,pn REG1, FAIL_LABEL; \ 255 - sllx VADDR, 64 - PMD_SHIFT, REG2; \ 256 - srlx REG2, 64 - PAGE_SHIFT, REG2; \ 257 - sllx REG1, 11, REG1; \ 178 + USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \ 179 + sllx VADDR, 64 - PMD_SHIFT, REG2; \ 180 + srlx REG2, 64 - (PAGE_SHIFT - 1), REG2; \ 181 + sllx REG1, PMD_PADDR_SHIFT, REG1; \ 258 182 andn REG2, 0x7, REG2; \ 259 - add REG1, REG2, REG1; 183 + add REG1, REG2, REG1; \ 184 + ldxa [REG1] ASI_PHYS_USE_EC, REG1; \ 185 + brgez,pn REG1, FAIL_LABEL; \ 186 + nop; \ 187 + 800: 260 188 261 189 /* Lookup a OBP mapping on VADDR in the prom_trans[] table at TL>0. 262 190 * If no entry is found, FAIL_LABEL will be branched to. On success
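The `USER_PGTABLE_CHECK_PMD_HUGE` macro translates a huge PMD into a PTE by hand, in the three numbered steps its comment describes. A C rendering of that control flow, using the PMD masks from the patch; the `FAKE_PAGE_*` PTE bits are stand-ins, not the real sun4u/sun4v encodings, which the assembly patches in at boot:

```c
#define PMD_HUGE_PROTBITS 0x000007ffUL
#define PMD_HUGE_PRESENT  0x00000400UL
#define PMD_HUGE_ACCESSED 0x00000080UL
#define PMD_HUGE_WRITE    0x00000200UL
#define PMD_HUGE_EXEC     0x00000040UL
#define PMD_PADDR_SHIFT   11

#define FAKE_PAGE_VALID (1UL << 63)   /* illustrative PTE bits only */
#define FAKE_PAGE_EXEC  (1UL << 62)
#define FAKE_PAGE_W     (1UL << 61)

/* Returns 0 (the FAIL_LABEL case) unless present and accessed. */
static unsigned long huge_pmd_to_pte(unsigned long pmd)
{
    unsigned long pte;

    if ((pmd & (PMD_HUGE_PRESENT | PMD_HUGE_ACCESSED)) !=
        (PMD_HUGE_PRESENT | PMD_HUGE_ACCESSED))
        return 0;
    /* 1) extract the physical address: strip prot bits, shift up */
    pte = (pmd & ~PMD_HUGE_PROTBITS) << PMD_PADDR_SHIFT;
    /* 2) translate the protection bits one at a time */
    if (pmd & PMD_HUGE_EXEC)
        pte |= FAKE_PAGE_EXEC;
    if (pmd & PMD_HUGE_WRITE)
        pte |= FAKE_PAGE_W;
    /* 3) OR in the valid/cache/size bits to form the final PTE */
    return pte | FAKE_PAGE_VALID;
}
```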
+1 -1
arch/sparc/kernel/pci.c
··· 779 779 static void __pci_mmap_set_flags(struct pci_dev *dev, struct vm_area_struct *vma, 780 780 enum pci_mmap_state mmap_state) 781 781 { 782 - vma->vm_flags |= (VM_IO | VM_RESERVED); 782 + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; 783 783 } 784 784 785 785 /* Set vm_page_prot of VMA, as appropriate for this architecture, for a pci
+1 -1
arch/sparc/kernel/sun4v_tlb_miss.S
··· 176 176 177 177 sub %g2, TRAP_PER_CPU_FAULT_INFO, %g2 178 178 179 - #ifdef CONFIG_HUGETLB_PAGE 179 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 180 180 mov SCRATCHPAD_UTSBREG2, %g5 181 181 ldxa [%g5] ASI_SCRATCHPAD, %g5 182 182 cmp %g5, -1
+3 -6
arch/sparc/kernel/tsb.S
··· 49 49 /* Before committing to a full page table walk, 50 50 * check the huge page TSB. 51 51 */ 52 - #ifdef CONFIG_HUGETLB_PAGE 52 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 53 53 54 54 661: ldx [%g7 + TRAP_PER_CPU_TSB_HUGE], %g5 55 55 nop ··· 110 110 tsb_miss_page_table_walk_sun4v_fastpath: 111 111 USER_PGTABLE_WALK_TL1(%g4, %g7, %g5, %g2, tsb_do_fault) 112 112 113 - /* Load and check PTE. */ 114 - ldxa [%g5] ASI_PHYS_USE_EC, %g5 115 - brgez,pn %g5, tsb_do_fault 116 - nop 113 + /* Valid PTE is now in %g5. */ 117 114 118 - #ifdef CONFIG_HUGETLB_PAGE 115 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 119 116 661: sethi %uhi(_PAGE_SZALL_4U), %g7 120 117 sllx %g7, 32, %g7 121 118 .section .sun4v_2insn_patch, "ax"
+1
arch/sparc/mm/fault_32.c
··· 265 265 } 266 266 if (fault & VM_FAULT_RETRY) { 267 267 flags &= ~FAULT_FLAG_ALLOW_RETRY; 268 + flags |= FAULT_FLAG_TRIED; 268 269 269 270 /* No need to up_read(&mm->mmap_sem) as we would 270 271 * have already released it in __lock_page_or_retry
+3 -2
arch/sparc/mm/fault_64.c
··· 452 452 } 453 453 if (fault & VM_FAULT_RETRY) { 454 454 flags &= ~FAULT_FLAG_ALLOW_RETRY; 455 + flags |= FAULT_FLAG_TRIED; 455 456 456 457 /* No need to up_read(&mm->mmap_sem) as we would 457 458 * have already released it in __lock_page_or_retry ··· 465 464 up_read(&mm->mmap_sem); 466 465 467 466 mm_rss = get_mm_rss(mm); 468 - #ifdef CONFIG_HUGETLB_PAGE 467 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 469 468 mm_rss -= (mm->context.huge_pte_count * (HPAGE_SIZE / PAGE_SIZE)); 470 469 #endif 471 470 if (unlikely(mm_rss > 472 471 mm->context.tsb_block[MM_TSB_BASE].tsb_rss_limit)) 473 472 tsb_grow(mm, MM_TSB_BASE, mm_rss); 474 - #ifdef CONFIG_HUGETLB_PAGE 473 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 475 474 mm_rss = mm->context.huge_pte_count; 476 475 if (unlikely(mm_rss > 477 476 mm->context.tsb_block[MM_TSB_HUGE].tsb_rss_limit))
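The fault-path hunk above subtracts the huge-page contribution from `mm_rss` before comparing it against the base-page TSB limit, since huge mappings are tracked in their own TSB. The arithmetic, as a sketch (constants per the patch's fixed 8KB/4MB geometry):

```c
#define PAGE_SIZE  8192UL        /* 8KB base pages */
#define HPAGE_SIZE (4UL << 20)   /* 4MB huge pages */

/* RSS attributable to base pages only: each huge PTE accounts for
 * HPAGE_SIZE / PAGE_SIZE (= 512) base pages in get_mm_rss(). */
static unsigned long base_page_rss(unsigned long mm_rss,
                                   unsigned long huge_pte_count)
{
    return mm_rss - huge_pte_count * (HPAGE_SIZE / PAGE_SIZE);
}
```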
-50
arch/sparc/mm/hugetlbpage.c
··· 303 303 { 304 304 return NULL; 305 305 } 306 - 307 - static void context_reload(void *__data) 308 - { 309 - struct mm_struct *mm = __data; 310 - 311 - if (mm == current->mm) 312 - load_secondary_context(mm); 313 - } 314 - 315 - void hugetlb_prefault_arch_hook(struct mm_struct *mm) 316 - { 317 - struct tsb_config *tp = &mm->context.tsb_block[MM_TSB_HUGE]; 318 - 319 - if (likely(tp->tsb != NULL)) 320 - return; 321 - 322 - tsb_grow(mm, MM_TSB_HUGE, 0); 323 - tsb_context_switch(mm); 324 - smp_tsb_sync(mm); 325 - 326 - /* On UltraSPARC-III+ and later, configure the second half of 327 - * the Data-TLB for huge pages. 328 - */ 329 - if (tlb_type == cheetah_plus) { 330 - unsigned long ctx; 331 - 332 - spin_lock(&ctx_alloc_lock); 333 - ctx = mm->context.sparc64_ctx_val; 334 - ctx &= ~CTX_PGSZ_MASK; 335 - ctx |= CTX_PGSZ_BASE << CTX_PGSZ0_SHIFT; 336 - ctx |= CTX_PGSZ_HUGE << CTX_PGSZ1_SHIFT; 337 - 338 - if (ctx != mm->context.sparc64_ctx_val) { 339 - /* When changing the page size fields, we 340 - * must perform a context flush so that no 341 - * stale entries match. This flush must 342 - * occur with the original context register 343 - * settings. 344 - */ 345 - do_flush_tlb_mm(mm); 346 - 347 - /* Reload the context register of all processors 348 - * also executing in this address space. 349 - */ 350 - mm->context.sparc64_ctx_val = ctx; 351 - on_each_cpu(context_reload, mm, 0); 352 - } 353 - spin_unlock(&ctx_alloc_lock); 354 - } 355 - }
+298 -16
arch/sparc/mm/init_64.c
··· 276 276 } 277 277 278 278 unsigned long _PAGE_ALL_SZ_BITS __read_mostly; 279 - unsigned long _PAGE_SZBITS __read_mostly; 280 279 281 280 static void flush_dcache(unsigned long pfn) 282 281 { ··· 306 307 } 307 308 } 308 309 310 + /* mm->context.lock must be held */ 311 + static void __update_mmu_tsb_insert(struct mm_struct *mm, unsigned long tsb_index, 312 + unsigned long tsb_hash_shift, unsigned long address, 313 + unsigned long tte) 314 + { 315 + struct tsb *tsb = mm->context.tsb_block[tsb_index].tsb; 316 + unsigned long tag; 317 + 318 + tsb += ((address >> tsb_hash_shift) & 319 + (mm->context.tsb_block[tsb_index].tsb_nentries - 1UL)); 320 + tag = (address >> 22UL); 321 + tsb_insert(tsb, tag, tte); 322 + } 323 + 309 324 void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) 310 325 { 326 + unsigned long tsb_index, tsb_hash_shift, flags; 311 327 struct mm_struct *mm; 312 - struct tsb *tsb; 313 - unsigned long tag, flags; 314 - unsigned long tsb_index, tsb_hash_shift; 315 328 pte_t pte = *ptep; 316 329 317 330 if (tlb_type != hypervisor) { ··· 340 329 341 330 spin_lock_irqsave(&mm->context.lock, flags); 342 331 343 - #ifdef CONFIG_HUGETLB_PAGE 332 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 344 333 if (mm->context.tsb_block[MM_TSB_HUGE].tsb != NULL) { 345 334 if ((tlb_type == hypervisor && 346 335 (pte_val(pte) & _PAGE_SZALL_4V) == _PAGE_SZHUGE_4V) || ··· 352 341 } 353 342 #endif 354 343 355 - tsb = mm->context.tsb_block[tsb_index].tsb; 356 - tsb += ((address >> tsb_hash_shift) & 357 - (mm->context.tsb_block[tsb_index].tsb_nentries - 1UL)); 358 - tag = (address >> 22UL); 359 - tsb_insert(tsb, tag, pte_val(pte)); 344 + __update_mmu_tsb_insert(mm, tsb_index, tsb_hash_shift, 345 + address, pte_val(pte)); 360 346 361 347 spin_unlock_irqrestore(&mm->context.lock, flags); 362 348 } ··· 2283 2275 __ACCESS_BITS_4U | _PAGE_E_4U); 2284 2276 2285 2277 #ifdef CONFIG_DEBUG_PAGEALLOC 2286 - kern_linear_pte_xor[0] = 
(_PAGE_VALID | _PAGE_SZBITS_4U) ^ 2287 - 0xfffff80000000000UL; 2278 + kern_linear_pte_xor[0] = _PAGE_VALID ^ 0xfffff80000000000UL; 2288 2279 #else 2289 2280 kern_linear_pte_xor[0] = (_PAGE_VALID | _PAGE_SZ4MB_4U) ^ 2290 2281 0xfffff80000000000UL; ··· 2294 2287 for (i = 1; i < 4; i++) 2295 2288 kern_linear_pte_xor[i] = kern_linear_pte_xor[0]; 2296 2289 2297 - _PAGE_SZBITS = _PAGE_SZBITS_4U; 2298 2290 _PAGE_ALL_SZ_BITS = (_PAGE_SZ4MB_4U | _PAGE_SZ512K_4U | 2299 2291 _PAGE_SZ64K_4U | _PAGE_SZ8K_4U | 2300 2292 _PAGE_SZ32MB_4U | _PAGE_SZ256MB_4U); ··· 2330 2324 _PAGE_CACHE = _PAGE_CACHE_4V; 2331 2325 2332 2326 #ifdef CONFIG_DEBUG_PAGEALLOC 2333 - kern_linear_pte_xor[0] = (_PAGE_VALID | _PAGE_SZBITS_4V) ^ 2334 - 0xfffff80000000000UL; 2327 + kern_linear_pte_xor[0] = _PAGE_VALID ^ 0xfffff80000000000UL; 2335 2328 #else 2336 2329 kern_linear_pte_xor[0] = (_PAGE_VALID | _PAGE_SZ4MB_4V) ^ 2337 2330 0xfffff80000000000UL; ··· 2344 2339 pg_iobits = (_PAGE_VALID | _PAGE_PRESENT_4V | __DIRTY_BITS_4V | 2345 2340 __ACCESS_BITS_4V | _PAGE_E_4V); 2346 2341 2347 - _PAGE_SZBITS = _PAGE_SZBITS_4V; 2348 2342 _PAGE_ALL_SZ_BITS = (_PAGE_SZ16GB_4V | _PAGE_SZ2GB_4V | 2349 2343 _PAGE_SZ256MB_4V | _PAGE_SZ32MB_4V | 2350 2344 _PAGE_SZ4MB_4V | _PAGE_SZ512K_4V | ··· 2476 2472 __asm__ __volatile__("wrpr %0, 0, %%pstate" 2477 2473 : : "r" (pstate)); 2478 2474 } 2475 + 2476 + static pte_t *get_from_cache(struct mm_struct *mm) 2477 + { 2478 + struct page *page; 2479 + pte_t *ret; 2480 + 2481 + spin_lock(&mm->page_table_lock); 2482 + page = mm->context.pgtable_page; 2483 + ret = NULL; 2484 + if (page) { 2485 + void *p = page_address(page); 2486 + 2487 + mm->context.pgtable_page = NULL; 2488 + 2489 + ret = (pte_t *) (p + (PAGE_SIZE / 2)); 2490 + } 2491 + spin_unlock(&mm->page_table_lock); 2492 + 2493 + return ret; 2494 + } 2495 + 2496 + static struct page *__alloc_for_cache(struct mm_struct *mm) 2497 + { 2498 + struct page *page = alloc_page(GFP_KERNEL | __GFP_NOTRACK | 2499 + __GFP_REPEAT | __GFP_ZERO); 
2500 + 2501 + if (page) { 2502 + spin_lock(&mm->page_table_lock); 2503 + if (!mm->context.pgtable_page) { 2504 + atomic_set(&page->_count, 2); 2505 + mm->context.pgtable_page = page; 2506 + } 2507 + spin_unlock(&mm->page_table_lock); 2508 + } 2509 + return page; 2510 + } 2511 + 2512 + pte_t *pte_alloc_one_kernel(struct mm_struct *mm, 2513 + unsigned long address) 2514 + { 2515 + struct page *page; 2516 + pte_t *pte; 2517 + 2518 + pte = get_from_cache(mm); 2519 + if (pte) 2520 + return pte; 2521 + 2522 + page = __alloc_for_cache(mm); 2523 + if (page) 2524 + pte = (pte_t *) page_address(page); 2525 + 2526 + return pte; 2527 + } 2528 + 2529 + pgtable_t pte_alloc_one(struct mm_struct *mm, 2530 + unsigned long address) 2531 + { 2532 + struct page *page; 2533 + pte_t *pte; 2534 + 2535 + pte = get_from_cache(mm); 2536 + if (pte) 2537 + return pte; 2538 + 2539 + page = __alloc_for_cache(mm); 2540 + if (page) { 2541 + pgtable_page_ctor(page); 2542 + pte = (pte_t *) page_address(page); 2543 + } 2544 + 2545 + return pte; 2546 + } 2547 + 2548 + void pte_free_kernel(struct mm_struct *mm, pte_t *pte) 2549 + { 2550 + struct page *page = virt_to_page(pte); 2551 + if (put_page_testzero(page)) 2552 + free_hot_cold_page(page, 0); 2553 + } 2554 + 2555 + static void __pte_free(pgtable_t pte) 2556 + { 2557 + struct page *page = virt_to_page(pte); 2558 + if (put_page_testzero(page)) { 2559 + pgtable_page_dtor(page); 2560 + free_hot_cold_page(page, 0); 2561 + } 2562 + } 2563 + 2564 + void pte_free(struct mm_struct *mm, pgtable_t pte) 2565 + { 2566 + __pte_free(pte); 2567 + } 2568 + 2569 + void pgtable_free(void *table, bool is_page) 2570 + { 2571 + if (is_page) 2572 + __pte_free(table); 2573 + else 2574 + kmem_cache_free(pgtable_cache, table); 2575 + } 2576 + 2577 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 2578 + static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot, bool for_modify) 2579 + { 2580 + if (pgprot_val(pgprot) & _PAGE_VALID) 2581 + pmd_val(pmd) |= PMD_HUGE_PRESENT; 2582 + if 
(tlb_type == hypervisor) { 2583 + if (pgprot_val(pgprot) & _PAGE_WRITE_4V) 2584 + pmd_val(pmd) |= PMD_HUGE_WRITE; 2585 + if (pgprot_val(pgprot) & _PAGE_EXEC_4V) 2586 + pmd_val(pmd) |= PMD_HUGE_EXEC; 2587 + 2588 + if (!for_modify) { 2589 + if (pgprot_val(pgprot) & _PAGE_ACCESSED_4V) 2590 + pmd_val(pmd) |= PMD_HUGE_ACCESSED; 2591 + if (pgprot_val(pgprot) & _PAGE_MODIFIED_4V) 2592 + pmd_val(pmd) |= PMD_HUGE_DIRTY; 2593 + } 2594 + } else { 2595 + if (pgprot_val(pgprot) & _PAGE_WRITE_4U) 2596 + pmd_val(pmd) |= PMD_HUGE_WRITE; 2597 + if (pgprot_val(pgprot) & _PAGE_EXEC_4U) 2598 + pmd_val(pmd) |= PMD_HUGE_EXEC; 2599 + 2600 + if (!for_modify) { 2601 + if (pgprot_val(pgprot) & _PAGE_ACCESSED_4U) 2602 + pmd_val(pmd) |= PMD_HUGE_ACCESSED; 2603 + if (pgprot_val(pgprot) & _PAGE_MODIFIED_4U) 2604 + pmd_val(pmd) |= PMD_HUGE_DIRTY; 2605 + } 2606 + } 2607 + 2608 + return pmd; 2609 + } 2610 + 2611 + pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot) 2612 + { 2613 + pmd_t pmd; 2614 + 2615 + pmd_val(pmd) = (page_nr << ((PAGE_SHIFT - PMD_PADDR_SHIFT))); 2616 + pmd_val(pmd) |= PMD_ISHUGE; 2617 + pmd = pmd_set_protbits(pmd, pgprot, false); 2618 + return pmd; 2619 + } 2620 + 2621 + pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) 2622 + { 2623 + pmd_val(pmd) &= ~(PMD_HUGE_PRESENT | 2624 + PMD_HUGE_WRITE | 2625 + PMD_HUGE_EXEC); 2626 + pmd = pmd_set_protbits(pmd, newprot, true); 2627 + return pmd; 2628 + } 2629 + 2630 + pgprot_t pmd_pgprot(pmd_t entry) 2631 + { 2632 + unsigned long pte = 0; 2633 + 2634 + if (pmd_val(entry) & PMD_HUGE_PRESENT) 2635 + pte |= _PAGE_VALID; 2636 + 2637 + if (tlb_type == hypervisor) { 2638 + if (pmd_val(entry) & PMD_HUGE_PRESENT) 2639 + pte |= _PAGE_PRESENT_4V; 2640 + if (pmd_val(entry) & PMD_HUGE_EXEC) 2641 + pte |= _PAGE_EXEC_4V; 2642 + if (pmd_val(entry) & PMD_HUGE_WRITE) 2643 + pte |= _PAGE_W_4V; 2644 + if (pmd_val(entry) & PMD_HUGE_ACCESSED) 2645 + pte |= _PAGE_ACCESSED_4V; 2646 + if (pmd_val(entry) & PMD_HUGE_DIRTY) 2647 + pte |= _PAGE_MODIFIED_4V; 2648 + 
pte |= _PAGE_CP_4V|_PAGE_CV_4V; 2649 + } else { 2650 + if (pmd_val(entry) & PMD_HUGE_PRESENT) 2651 + pte |= _PAGE_PRESENT_4U; 2652 + if (pmd_val(entry) & PMD_HUGE_EXEC) 2653 + pte |= _PAGE_EXEC_4U; 2654 + if (pmd_val(entry) & PMD_HUGE_WRITE) 2655 + pte |= _PAGE_W_4U; 2656 + if (pmd_val(entry) & PMD_HUGE_ACCESSED) 2657 + pte |= _PAGE_ACCESSED_4U; 2658 + if (pmd_val(entry) & PMD_HUGE_DIRTY) 2659 + pte |= _PAGE_MODIFIED_4U; 2660 + pte |= _PAGE_CP_4U|_PAGE_CV_4U; 2661 + } 2662 + 2663 + return __pgprot(pte); 2664 + } 2665 + 2666 + void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr, 2667 + pmd_t *pmd) 2668 + { 2669 + unsigned long pte, flags; 2670 + struct mm_struct *mm; 2671 + pmd_t entry = *pmd; 2672 + pgprot_t prot; 2673 + 2674 + if (!pmd_large(entry) || !pmd_young(entry)) 2675 + return; 2676 + 2677 + pte = (pmd_val(entry) & ~PMD_HUGE_PROTBITS); 2678 + pte <<= PMD_PADDR_SHIFT; 2679 + pte |= _PAGE_VALID; 2680 + 2681 + prot = pmd_pgprot(entry); 2682 + 2683 + if (tlb_type == hypervisor) 2684 + pgprot_val(prot) |= _PAGE_SZHUGE_4V; 2685 + else 2686 + pgprot_val(prot) |= _PAGE_SZHUGE_4U; 2687 + 2688 + pte |= pgprot_val(prot); 2689 + 2690 + mm = vma->vm_mm; 2691 + 2692 + spin_lock_irqsave(&mm->context.lock, flags); 2693 + 2694 + if (mm->context.tsb_block[MM_TSB_HUGE].tsb != NULL) 2695 + __update_mmu_tsb_insert(mm, MM_TSB_HUGE, HPAGE_SHIFT, 2696 + addr, pte); 2697 + 2698 + spin_unlock_irqrestore(&mm->context.lock, flags); 2699 + } 2700 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 2701 + 2702 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 2703 + static void context_reload(void *__data) 2704 + { 2705 + struct mm_struct *mm = __data; 2706 + 2707 + if (mm == current->mm) 2708 + load_secondary_context(mm); 2709 + } 2710 + 2711 + void hugetlb_setup(struct mm_struct *mm) 2712 + { 2713 + struct tsb_config *tp = &mm->context.tsb_block[MM_TSB_HUGE]; 2714 + 2715 + if (likely(tp->tsb != NULL)) 2716 + return; 2717 + 2718 + tsb_grow(mm, 
MM_TSB_HUGE, 0); 2719 + tsb_context_switch(mm); 2720 + smp_tsb_sync(mm); 2721 + 2722 + /* On UltraSPARC-III+ and later, configure the second half of 2723 + * the Data-TLB for huge pages. 2724 + */ 2725 + if (tlb_type == cheetah_plus) { 2726 + unsigned long ctx; 2727 + 2728 + spin_lock(&ctx_alloc_lock); 2729 + ctx = mm->context.sparc64_ctx_val; 2730 + ctx &= ~CTX_PGSZ_MASK; 2731 + ctx |= CTX_PGSZ_BASE << CTX_PGSZ0_SHIFT; 2732 + ctx |= CTX_PGSZ_HUGE << CTX_PGSZ1_SHIFT; 2733 + 2734 + if (ctx != mm->context.sparc64_ctx_val) { 2735 + /* When changing the page size fields, we 2736 + * must perform a context flush so that no 2737 + * stale entries match. This flush must 2738 + * occur with the original context register 2739 + * settings. 2740 + */ 2741 + do_flush_tlb_mm(mm); 2742 + 2743 + /* Reload the context register of all processors 2744 + * also executing in this address space. 2745 + */ 2746 + mm->context.sparc64_ctx_val = ctx; 2747 + on_each_cpu(context_reload, mm, 0); 2748 + } 2749 + spin_unlock(&ctx_alloc_lock); 2750 + } 2751 + } 2752 + #endif
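The sparc64 hunk above halves PTE tables by carving one page into two half-page tables: the first allocation hands out the front half and parks the back half in `mm->context.pgtable_page`, with the page's `_count` preset to 2 so each half carries its own reference. A minimal userspace sketch of that idea (hypothetical `fake_page`/`half_to_page` names; the real code uses `struct page` refcounts, `virt_to_page()`, and `mm->page_table_lock` for the race the sketch ignores):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE  8192UL
#define PAGE_ALIGN (2 * PAGE_SIZE)	/* power of two >= container size */

struct fake_page {
	int refcount;
	char data[PAGE_SIZE];
};

static struct fake_page *cached;	/* stands in for mm->context.pgtable_page */

/* Userspace stand-in for virt_to_page(): the container is
 * PAGE_ALIGN-aligned, so masking low bits recovers it from either half. */
static struct fake_page *half_to_page(void *half)
{
	return (struct fake_page *)((uintptr_t)half & ~(PAGE_ALIGN - 1));
}

static void *pte_table_alloc(void)
{
	struct fake_page *page;

	if (cached) {			/* hand out the parked second half */
		page = cached;
		cached = NULL;
		return page->data + PAGE_SIZE / 2;
	}
	page = aligned_alloc(PAGE_ALIGN, PAGE_ALIGN);
	if (!page)
		return NULL;
	page->refcount = 2;		/* one reference per half */
	cached = page;
	return page->data;		/* first half */
}

/* Returns 1 when the last half is dropped and the page is freed. */
static int pte_table_free(void *half)
{
	struct fake_page *page = half_to_page(half);

	if (--page->refcount == 0) {
		free(page);
		return 1;
	}
	return 0;
}
```

Both halves of one page are freed independently; only the second free actually releases the backing page, mirroring `put_page_testzero()` in `__pte_free()`.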
+111 -25
arch/sparc/mm/tlb.c
··· 43 43 put_cpu_var(tlb_batch); 44 44 } 45 45 46 - void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, 47 - pte_t *ptep, pte_t orig, int fullmm) 46 + static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr, 47 + bool exec) 48 48 { 49 49 struct tlb_batch *tb = &get_cpu_var(tlb_batch); 50 50 unsigned long nr; 51 51 52 52 vaddr &= PAGE_MASK; 53 - if (pte_exec(orig)) 53 + if (exec) 54 54 vaddr |= 0x1UL; 55 55 56 + nr = tb->tlb_nr; 57 + 58 + if (unlikely(nr != 0 && mm != tb->mm)) { 59 + flush_tlb_pending(); 60 + nr = 0; 61 + } 62 + 63 + if (nr == 0) 64 + tb->mm = mm; 65 + 66 + tb->vaddrs[nr] = vaddr; 67 + tb->tlb_nr = ++nr; 68 + if (nr >= TLB_BATCH_NR) 69 + flush_tlb_pending(); 70 + 71 + put_cpu_var(tlb_batch); 72 + } 73 + 74 + void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, 75 + pte_t *ptep, pte_t orig, int fullmm) 76 + { 56 77 if (tlb_type != hypervisor && 57 78 pte_dirty(orig)) { 58 79 unsigned long paddr, pfn = pte_pfn(orig); ··· 98 77 } 99 78 100 79 no_cache_flush: 101 - 102 - if (fullmm) { 103 - put_cpu_var(tlb_batch); 104 - return; 105 - } 106 - 107 - nr = tb->tlb_nr; 108 - 109 - if (unlikely(nr != 0 && mm != tb->mm)) { 110 - flush_tlb_pending(); 111 - nr = 0; 112 - } 113 - 114 - if (nr == 0) 115 - tb->mm = mm; 116 - 117 - tb->vaddrs[nr] = vaddr; 118 - tb->tlb_nr = ++nr; 119 - if (nr >= TLB_BATCH_NR) 120 - flush_tlb_pending(); 121 - 122 - put_cpu_var(tlb_batch); 80 + if (!fullmm) 81 + tlb_batch_add_one(mm, vaddr, pte_exec(orig)); 123 82 } 83 + 84 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 85 + static void tlb_batch_pmd_scan(struct mm_struct *mm, unsigned long vaddr, 86 + pmd_t pmd, bool exec) 87 + { 88 + unsigned long end; 89 + pte_t *pte; 90 + 91 + pte = pte_offset_map(&pmd, vaddr); 92 + end = vaddr + HPAGE_SIZE; 93 + while (vaddr < end) { 94 + if (pte_val(*pte) & _PAGE_VALID) 95 + tlb_batch_add_one(mm, vaddr, exec); 96 + pte++; 97 + vaddr += PAGE_SIZE; 98 + } 99 + pte_unmap(pte); 100 + } 101 + 102 + void set_pmd_at(struct 
mm_struct *mm, unsigned long addr, 103 + pmd_t *pmdp, pmd_t pmd) 104 + { 105 + pmd_t orig = *pmdp; 106 + 107 + *pmdp = pmd; 108 + 109 + if (mm == &init_mm) 110 + return; 111 + 112 + if ((pmd_val(pmd) ^ pmd_val(orig)) & PMD_ISHUGE) { 113 + if (pmd_val(pmd) & PMD_ISHUGE) 114 + mm->context.huge_pte_count++; 115 + else 116 + mm->context.huge_pte_count--; 117 + if (mm->context.huge_pte_count == 1) 118 + hugetlb_setup(mm); 119 + } 120 + 121 + if (!pmd_none(orig)) { 122 + bool exec = ((pmd_val(orig) & PMD_HUGE_EXEC) != 0); 123 + 124 + addr &= HPAGE_MASK; 125 + if (pmd_val(orig) & PMD_ISHUGE) 126 + tlb_batch_add_one(mm, addr, exec); 127 + else 128 + tlb_batch_pmd_scan(mm, addr, orig, exec); 129 + } 130 + } 131 + 132 + void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable) 133 + { 134 + struct list_head *lh = (struct list_head *) pgtable; 135 + 136 + assert_spin_locked(&mm->page_table_lock); 137 + 138 + /* FIFO */ 139 + if (!mm->pmd_huge_pte) 140 + INIT_LIST_HEAD(lh); 141 + else 142 + list_add(lh, (struct list_head *) mm->pmd_huge_pte); 143 + mm->pmd_huge_pte = pgtable; 144 + } 145 + 146 + pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm) 147 + { 148 + struct list_head *lh; 149 + pgtable_t pgtable; 150 + 151 + assert_spin_locked(&mm->page_table_lock); 152 + 153 + /* FIFO */ 154 + pgtable = mm->pmd_huge_pte; 155 + lh = (struct list_head *) pgtable; 156 + if (list_empty(lh)) 157 + mm->pmd_huge_pte = NULL; 158 + else { 159 + mm->pmd_huge_pte = (pgtable_t) lh->next; 160 + list_del(lh); 161 + } 162 + pte_val(pgtable[0]) = 0; 163 + pte_val(pgtable[1]) = 0; 164 + 165 + return pgtable; 166 + } 167 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
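The tlb.c refactor above factors the queueing logic into `tlb_batch_add_one()`: addresses are page-aligned, tagged executable in bit 0, and flushed whenever the batch fills or a different mm shows up. A single-threaded sketch of that batching (hypothetical `flush_count` stands in for the real cross-CPU flush, and the per-CPU `get_cpu_var()`/`put_cpu_var()` handling is omitted):

```c
#include <assert.h>
#include <stddef.h>

#define TLB_BATCH_NR 8

struct tlb_batch {
	const void *mm;			/* owner of the queued addresses */
	unsigned long nr;
	unsigned long vaddrs[TLB_BATCH_NR];
};

static struct tlb_batch batch;
static unsigned long flush_count;

static void flush_tlb_pending(void)
{
	if (batch.nr)
		flush_count++;		/* the kernel flushes batch.vaddrs here */
	batch.nr = 0;
}

static void tlb_batch_add_one(const void *mm, unsigned long vaddr, int exec)
{
	vaddr &= ~0x1fffUL;		/* page-align (8K base pages) */
	if (exec)
		vaddr |= 0x1UL;		/* low bit marks executable mappings */

	if (batch.nr != 0 && mm != batch.mm)
		flush_tlb_pending();	/* never mix two mms in one batch */
	if (batch.nr == 0)
		batch.mm = mm;

	batch.vaddrs[batch.nr++] = vaddr;
	if (batch.nr >= TLB_BATCH_NR)
		flush_tlb_pending();
}
```

The `set_pmd_at()` path in the diff then reuses this helper: a huge pmd queues one entry at the `HPAGE_MASK`-aligned address, while a demoted pmd walks its PTE page and queues each valid 8K page.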
+16 -24
arch/sparc/mm/tsb.c
··· 78 78 base = __pa(base); 79 79 __flush_tsb_one(tb, PAGE_SHIFT, base, nentries); 80 80 81 - #ifdef CONFIG_HUGETLB_PAGE 81 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 82 82 if (mm->context.tsb_block[MM_TSB_HUGE].tsb) { 83 83 base = (unsigned long) mm->context.tsb_block[MM_TSB_HUGE].tsb; 84 84 nentries = mm->context.tsb_block[MM_TSB_HUGE].tsb_nentries; ··· 90 90 spin_unlock_irqrestore(&mm->context.lock, flags); 91 91 } 92 92 93 - #if defined(CONFIG_SPARC64_PAGE_SIZE_8KB) 94 93 #define HV_PGSZ_IDX_BASE HV_PGSZ_IDX_8K 95 94 #define HV_PGSZ_MASK_BASE HV_PGSZ_MASK_8K 96 - #elif defined(CONFIG_SPARC64_PAGE_SIZE_64KB) 97 - #define HV_PGSZ_IDX_BASE HV_PGSZ_IDX_64K 98 - #define HV_PGSZ_MASK_BASE HV_PGSZ_MASK_64K 99 - #else 100 - #error Broken base page size setting... 101 - #endif 102 95 103 - #ifdef CONFIG_HUGETLB_PAGE 104 - #if defined(CONFIG_HUGETLB_PAGE_SIZE_64K) 105 - #define HV_PGSZ_IDX_HUGE HV_PGSZ_IDX_64K 106 - #define HV_PGSZ_MASK_HUGE HV_PGSZ_MASK_64K 107 - #elif defined(CONFIG_HUGETLB_PAGE_SIZE_512K) 108 - #define HV_PGSZ_IDX_HUGE HV_PGSZ_IDX_512K 109 - #define HV_PGSZ_MASK_HUGE HV_PGSZ_MASK_512K 110 - #elif defined(CONFIG_HUGETLB_PAGE_SIZE_4MB) 96 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 111 97 #define HV_PGSZ_IDX_HUGE HV_PGSZ_IDX_4MB 112 98 #define HV_PGSZ_MASK_HUGE HV_PGSZ_MASK_4MB 113 - #else 114 - #error Broken huge page size setting... 
115 - #endif 116 99 #endif 117 100 118 101 static void setup_tsb_params(struct mm_struct *mm, unsigned long tsb_idx, unsigned long tsb_bytes) ··· 190 207 case MM_TSB_BASE: 191 208 hp->pgsz_idx = HV_PGSZ_IDX_BASE; 192 209 break; 193 - #ifdef CONFIG_HUGETLB_PAGE 210 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 194 211 case MM_TSB_HUGE: 195 212 hp->pgsz_idx = HV_PGSZ_IDX_HUGE; 196 213 break; ··· 205 222 case MM_TSB_BASE: 206 223 hp->pgsz_mask = HV_PGSZ_MASK_BASE; 207 224 break; 208 - #ifdef CONFIG_HUGETLB_PAGE 225 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 209 226 case MM_TSB_HUGE: 210 227 hp->pgsz_mask = HV_PGSZ_MASK_HUGE; 211 228 break; ··· 427 444 428 445 int init_new_context(struct task_struct *tsk, struct mm_struct *mm) 429 446 { 430 - #ifdef CONFIG_HUGETLB_PAGE 447 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 431 448 unsigned long huge_pte_count; 432 449 #endif 433 450 unsigned int i; ··· 436 453 437 454 mm->context.sparc64_ctx_val = 0UL; 438 455 439 - #ifdef CONFIG_HUGETLB_PAGE 456 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 440 457 /* We reset it to zero because the fork() page copying 441 458 * will re-increment the counters as the parent PTEs are 442 459 * copied into the child address space. 
··· 444 461 huge_pte_count = mm->context.huge_pte_count; 445 462 mm->context.huge_pte_count = 0; 446 463 #endif 464 + 465 + mm->context.pgtable_page = NULL; 447 466 448 467 /* copy_mm() copies over the parent's mm_struct before calling 449 468 * us, so we need to zero out the TSB pointer or else tsb_grow() ··· 459 474 */ 460 475 tsb_grow(mm, MM_TSB_BASE, get_mm_rss(mm)); 461 476 462 - #ifdef CONFIG_HUGETLB_PAGE 477 + #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 463 478 if (unlikely(huge_pte_count)) 464 479 tsb_grow(mm, MM_TSB_HUGE, huge_pte_count); 465 480 #endif ··· 485 500 void destroy_context(struct mm_struct *mm) 486 501 { 487 502 unsigned long flags, i; 503 + struct page *page; 488 504 489 505 for (i = 0; i < MM_NUM_TSBS; i++) 490 506 tsb_destroy_one(&mm->context.tsb_block[i]); 507 + 508 + page = mm->context.pgtable_page; 509 + if (page && put_page_testzero(page)) { 510 + pgtable_page_dtor(page); 511 + free_hot_cold_page(page, 0); 512 + } 491 513 492 514 spin_lock_irqsave(&ctx_alloc_lock, flags); 493 515
+3
arch/tile/Kconfig
··· 7 7 select HAVE_DMA_API_DEBUG 8 8 select HAVE_KVM if !TILEGX 9 9 select GENERIC_FIND_FIRST_BIT 10 + select SYSCTL_EXCEPTION_TRACE 10 11 select USE_GENERIC_SMP_HELPERS 11 12 select CC_OPTIMIZE_FOR_SIZE 13 + select HAVE_DEBUG_KMEMLEAK 12 14 select HAVE_GENERIC_HARDIRQS 13 15 select GENERIC_IRQ_PROBE 14 16 select GENERIC_PENDING_IRQ if SMP 15 17 select GENERIC_IRQ_SHOW 18 + select HAVE_DEBUG_BUGVERBOSE 16 19 select HAVE_SYSCALL_WRAPPERS if TILEGX 17 20 select SYS_HYPERVISOR 18 21 select ARCH_HAVE_NMI_SAFE_CMPXCHG
+4
arch/tile/include/asm/hugetlb.h
··· 106 106 { 107 107 } 108 108 109 + static inline void arch_clear_hugepage_flags(struct page *page) 110 + { 111 + } 112 + 109 113 #ifdef CONFIG_HUGETLB_SUPER_PAGES 110 114 static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, 111 115 struct page *page, int writable)
+7 -12
arch/tile/mm/elf.c
··· 36 36 } while (c); 37 37 } 38 38 39 - static int notify_exec(void) 39 + static int notify_exec(struct mm_struct *mm) 40 40 { 41 41 int retval = 0; /* failure */ 42 - struct vm_area_struct *vma = current->mm->mmap; 43 - while (vma) { 44 - if ((vma->vm_flags & VM_EXECUTABLE) && vma->vm_file) 45 - break; 46 - vma = vma->vm_next; 47 - } 48 - if (vma) { 42 + 43 + if (mm->exe_file) { 49 44 char *buf = (char *) __get_free_page(GFP_KERNEL); 50 45 if (buf) { 51 - char *path = d_path(&vma->vm_file->f_path, 46 + char *path = d_path(&mm->exe_file->f_path, 52 47 buf, PAGE_SIZE); 53 48 if (!IS_ERR(path)) { 54 49 sim_notify_exec(path); ··· 101 106 unsigned long vdso_base; 102 107 int retval = 0; 103 108 109 + down_write(&mm->mmap_sem); 110 + 104 111 /* 105 112 * Notify the simulator that an exec just occurred. 106 113 * If we can't find the filename of the mapping, just use 107 114 * whatever was passed as the linux_binprm filename. 108 115 */ 109 - if (!notify_exec()) 116 + if (!notify_exec(mm)) 110 117 sim_notify_exec(bprm->filename); 111 - 112 - down_write(&mm->mmap_sem); 113 118 114 119 /* 115 120 * MAYWRITE to allow gdb to COW and set breakpoints
+1
arch/tile/mm/fault.c
··· 454 454 tsk->min_flt++; 455 455 if (fault & VM_FAULT_RETRY) { 456 456 flags &= ~FAULT_FLAG_ALLOW_RETRY; 457 + flags |= FAULT_FLAG_TRIED; 457 458 458 459 /* 459 460 * No need to up_read(&mm->mmap_sem) as we would
+1
arch/um/Kconfig.common
··· 7 7 bool 8 8 default y 9 9 select HAVE_GENERIC_HARDIRQS 10 + select HAVE_UID16 10 11 select GENERIC_IRQ_SHOW 11 12 select GENERIC_CPU_DEVICES 12 13 select GENERIC_IO
+1
arch/um/kernel/trap.c
··· 89 89 current->min_flt++; 90 90 if (fault & VM_FAULT_RETRY) { 91 91 flags &= ~FAULT_FLAG_ALLOW_RETRY; 92 + flags |= FAULT_FLAG_TRIED; 92 93 93 94 goto retry; 94 95 }
+1 -1
arch/unicore32/kernel/process.c
··· 380 380 return install_special_mapping(mm, 0xffff0000, PAGE_SIZE, 381 381 VM_READ | VM_EXEC | 382 382 VM_MAYREAD | VM_MAYEXEC | 383 - VM_RESERVED, 383 + VM_DONTEXPAND | VM_DONTDUMP, 384 384 NULL); 385 385 } 386 386
+5
arch/x86/Kconfig
··· 10 10 def_bool y 11 11 depends on !64BIT 12 12 select CLKSRC_I8253 13 + select HAVE_UID16 13 14 14 15 config X86_64 15 16 def_bool y ··· 47 46 select HAVE_FUNCTION_GRAPH_FP_TEST 48 47 select HAVE_FUNCTION_TRACE_MCOUNT_TEST 49 48 select HAVE_SYSCALL_TRACEPOINTS 49 + select SYSCTL_EXCEPTION_TRACE 50 50 select HAVE_KVM 51 51 select HAVE_ARCH_KGDB 52 52 select HAVE_ARCH_TRACEHOOK ··· 67 65 select HAVE_PERF_EVENTS_NMI 68 66 select HAVE_PERF_REGS 69 67 select HAVE_PERF_USER_STACK_DUMP 68 + select HAVE_DEBUG_KMEMLEAK 70 69 select ANON_INODES 71 70 select HAVE_ALIGNED_STRUCT_PAGE if SLUB && !M386 72 71 select HAVE_CMPXCHG_LOCAL if !M386 ··· 88 85 select IRQ_FORCED_THREADING 89 86 select USE_GENERIC_SMP_HELPERS if SMP 90 87 select HAVE_BPF_JIT if X86_64 88 + select HAVE_ARCH_TRANSPARENT_HUGEPAGE 91 89 select CLKEVT_I8253 92 90 select ARCH_HAVE_NMI_SAFE_CMPXCHG 93 91 select GENERIC_IOMAP ··· 2172 2168 bool "IA32 Emulation" 2173 2169 depends on X86_64 2174 2170 select COMPAT_BINFMT_ELF 2171 + select HAVE_UID16 2175 2172 ---help--- 2176 2173 Include code to run legacy 32-bit programs under a 2177 2174 64-bit kernel. You should likely turn this on, unless you're
-24
arch/x86/include/asm/atomic.h
··· 240 240 return c; 241 241 } 242 242 243 - 244 - /* 245 - * atomic_dec_if_positive - decrement by 1 if old value positive 246 - * @v: pointer of type atomic_t 247 - * 248 - * The function returns the old value of *v minus 1, even if 249 - * the atomic variable, v, was not decremented. 250 - */ 251 - static inline int atomic_dec_if_positive(atomic_t *v) 252 - { 253 - int c, old, dec; 254 - c = atomic_read(v); 255 - for (;;) { 256 - dec = c - 1; 257 - if (unlikely(dec < 0)) 258 - break; 259 - old = atomic_cmpxchg((v), c, dec); 260 - if (likely(old == c)) 261 - break; 262 - c = old; 263 - } 264 - return dec; 265 - } 266 - 267 243 /** 268 244 * atomic_inc_short - increment of a short integer 269 245 * @v: pointer to type int
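The `atomic_dec_if_positive()` removed from the x86 header above is a classic cmpxchg retry loop, presumably dropped in favor of a generic version elsewhere in this series. The same pattern re-expressed with portable C11 atomics (a sketch, not the kernel's implementation); note it returns the old value minus one even when no decrement happened, exactly like the deleted code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Decrement *v by 1 only if the result stays non-negative.
 * Returns old - 1 regardless of whether the store happened. */
static int atomic_dec_if_positive(atomic_int *v)
{
	int c = atomic_load(v);

	for (;;) {
		int dec = c - 1;

		if (dec < 0)		/* would go negative: leave v alone */
			break;
		/* On failure (or a spurious weak-CAS miss), c is reloaded
		 * with the current value and we simply retry. */
		if (atomic_compare_exchange_weak(v, &c, dec))
			return dec;
	}
	return c - 1;
}
```

Using `compare_exchange_weak` inside an unconditional loop is the idiomatic C11 shape of the kernel's `atomic_cmpxchg()` retry.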
+4
arch/x86/include/asm/hugetlb.h
··· 90 90 { 91 91 } 92 92 93 + static inline void arch_clear_hugepage_flags(struct page *page) 94 + { 95 + } 96 + 93 97 #endif /* _ASM_X86_HUGETLB_H */
+8 -3
arch/x86/include/asm/pgtable.h
··· 146 146 147 147 static inline int pmd_large(pmd_t pte) 148 148 { 149 - return (pmd_flags(pte) & (_PAGE_PSE | _PAGE_PRESENT)) == 150 - (_PAGE_PSE | _PAGE_PRESENT); 149 + return pmd_flags(pte) & _PAGE_PSE; 151 150 } 152 151 153 152 #ifdef CONFIG_TRANSPARENT_HUGEPAGE ··· 414 415 415 416 static inline int pmd_present(pmd_t pmd) 416 417 { 417 - return pmd_flags(pmd) & _PAGE_PRESENT; 418 + /* 419 + * Checking for _PAGE_PSE is needed too because 420 + * split_huge_page will temporarily clear the present bit (but 421 + * the _PAGE_PSE flag will remain set at all times while the 422 + * _PAGE_PRESENT bit is clear). 423 + */ 424 + return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE); 418 425 } 419 426 420 427 static inline int pmd_none(pmd_t pmd)
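The `pmd_present()` change above exists because `split_huge_page` briefly clears `_PAGE_PRESENT` on a huge pmd while `_PAGE_PSE` stays set; testing only the present bit would make the pmd look absent mid-split. A tiny flag-level illustration (the bit values here are illustrative stand-ins, not the real x86 definitions):

```c
#include <assert.h>

/* Illustrative bit positions only. */
#define _PAGE_PRESENT	0x001UL
#define _PAGE_PSE	0x080UL
#define _PAGE_PROTNONE	0x100UL

/* Mirrors the widened check in the diff: any of the three bits
 * means the pmd is logically present. */
static int pmd_present(unsigned long pmd_flags)
{
	return (pmd_flags & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE)) != 0;
}
```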
+1
arch/x86/include/asm/pgtable_32.h
··· 71 71 * tables contain all the necessary information. 72 72 */ 73 73 #define update_mmu_cache(vma, address, ptep) do { } while (0) 74 + #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0) 74 75 75 76 #endif /* !__ASSEMBLY__ */ 76 77
+1
arch/x86/include/asm/pgtable_64.h
··· 143 143 #define pte_unmap(pte) ((void)(pte))/* NOP */ 144 144 145 145 #define update_mmu_cache(vma, address, ptep) do { } while (0) 146 + #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0) 146 147 147 148 /* Encode and de-code a swap entry */ 148 149 #if _PAGE_BIT_FILE < _PAGE_BIT_PROTNONE
+1
arch/x86/mm/fault.c
··· 1220 1220 /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk 1221 1221 * of starvation. */ 1222 1222 flags &= ~FAULT_FLAG_ALLOW_RETRY; 1223 + flags |= FAULT_FLAG_TRIED; 1223 1224 goto retry; 1224 1225 } 1225 1226 }
+1 -2
arch/x86/mm/hugetlbpage.c
··· 71 71 struct address_space *mapping = vma->vm_file->f_mapping; 72 72 pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + 73 73 vma->vm_pgoff; 74 - struct prio_tree_iter iter; 75 74 struct vm_area_struct *svma; 76 75 unsigned long saddr; 77 76 pte_t *spte = NULL; ··· 80 81 return (pte_t *)pmd_alloc(mm, pud, addr); 81 82 82 83 mutex_lock(&mapping->i_mmap_mutex); 83 - vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) { 84 + vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { 84 85 if (svma == vma) 85 86 continue; 86 87
+62 -25
arch/x86/mm/pat.c
··· 664 664 } 665 665 666 666 /* 667 - * track_pfn_vma_copy is called when vma that is covering the pfnmap gets 667 + * track_pfn_copy is called when vma that is covering the pfnmap gets 668 668 * copied through copy_page_range(). 669 669 * 670 670 * If the vma has a linear pfn mapping for the entire range, we get the prot 671 671 * from pte and reserve the entire vma range with single reserve_pfn_range call. 672 672 */ 673 - int track_pfn_vma_copy(struct vm_area_struct *vma) 673 + int track_pfn_copy(struct vm_area_struct *vma) 674 674 { 675 675 resource_size_t paddr; 676 676 unsigned long prot; 677 677 unsigned long vma_size = vma->vm_end - vma->vm_start; 678 678 pgprot_t pgprot; 679 679 680 - if (is_linear_pfn_mapping(vma)) { 680 + if (vma->vm_flags & VM_PAT) { 681 681 /* 682 682 * reserve the whole chunk covered by vma. We need the 683 683 * starting address and protection from pte. ··· 694 694 } 695 695 696 696 /* 697 - * track_pfn_vma_new is called when a _new_ pfn mapping is being established 698 - * for physical range indicated by pfn and size. 699 - * 700 697 * prot is passed in as a parameter for the new mapping. If the vma has a 701 698 * linear pfn mapping for the entire range reserve the entire vma range with 702 699 * single reserve_pfn_range call. 
703 700 */ 704 - int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t *prot, 705 - unsigned long pfn, unsigned long size) 701 + int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, 702 + unsigned long pfn, unsigned long addr, unsigned long size) 706 703 { 704 + resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT; 707 705 unsigned long flags; 708 - resource_size_t paddr; 709 - unsigned long vma_size = vma->vm_end - vma->vm_start; 710 706 711 - if (is_linear_pfn_mapping(vma)) { 712 - /* reserve the whole chunk starting from vm_pgoff */ 713 - paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT; 714 - return reserve_pfn_range(paddr, vma_size, prot, 0); 707 + /* reserve the whole chunk starting from paddr */ 708 + if (addr == vma->vm_start && size == (vma->vm_end - vma->vm_start)) { 709 + int ret; 710 + 711 + ret = reserve_pfn_range(paddr, size, prot, 0); 712 + if (!ret) 713 + vma->vm_flags |= VM_PAT; 714 + return ret; 715 715 } 716 716 717 717 if (!pat_enabled) 718 718 return 0; 719 719 720 - /* for vm_insert_pfn and friends, we set prot based on lookup */ 721 - flags = lookup_memtype(pfn << PAGE_SHIFT); 720 + /* 721 + * For anything smaller than the vma size we set prot based on the 722 + * lookup. 
723 + */ 724 + flags = lookup_memtype(paddr); 725 + 726 + /* Check memtype for the remaining pages */ 727 + while (size > PAGE_SIZE) { 728 + size -= PAGE_SIZE; 729 + paddr += PAGE_SIZE; 730 + if (flags != lookup_memtype(paddr)) 731 + return -EINVAL; 732 + } 733 + 734 + *prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) | 735 + flags); 736 + 737 + return 0; 738 + } 739 + 740 + int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, 741 + unsigned long pfn) 742 + { 743 + unsigned long flags; 744 + 745 + if (!pat_enabled) 746 + return 0; 747 + 748 + /* Set prot based on lookup */ 749 + flags = lookup_memtype((resource_size_t)pfn << PAGE_SHIFT); 722 750 *prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) | 723 751 flags); 724 752 ··· 754 726 } 755 727 756 728 /* 757 - * untrack_pfn_vma is called while unmapping a pfnmap for a region. 729 + * untrack_pfn is called while unmapping a pfnmap for a region. 758 730 * untrack can be called for a specific region indicated by pfn and size or 759 - * can be for the entire vma (in which case size can be zero). 731 + * can be for the entire vma (in which case pfn, size are zero). 
760 732 */ 761 - void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn, 762 - unsigned long size) 733 + void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn, 734 + unsigned long size) 763 735 { 764 736 resource_size_t paddr; 765 - unsigned long vma_size = vma->vm_end - vma->vm_start; 737 + unsigned long prot; 766 738 767 - if (is_linear_pfn_mapping(vma)) { 768 - /* free the whole chunk starting from vm_pgoff */ 769 - paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT; 770 - free_pfn_range(paddr, vma_size); 739 + if (!(vma->vm_flags & VM_PAT)) 771 740 return; 741 + 742 + /* free the chunk starting from pfn or the whole chunk */ 743 + paddr = (resource_size_t)pfn << PAGE_SHIFT; 744 + if (!paddr && !size) { 745 + if (follow_phys(vma, vma->vm_start, 0, &prot, &paddr)) { 746 + WARN_ON_ONCE(1); 747 + return; 748 + } 749 + 750 + size = vma->vm_end - vma->vm_start; 772 751 } 752 + free_pfn_range(paddr, size); 753 + vma->vm_flags &= ~VM_PAT; 773 754 } 774 755 775 756 pgprot_t pgprot_writecombine(pgprot_t prot)
+14 -20
arch/x86/mm/pat_rbtree.c
··· 12 12 #include <linux/debugfs.h> 13 13 #include <linux/kernel.h> 14 14 #include <linux/module.h> 15 - #include <linux/rbtree.h> 15 + #include <linux/rbtree_augmented.h> 16 16 #include <linux/sched.h> 17 17 #include <linux/gfp.h> 18 18 ··· 54 54 return ret; 55 55 } 56 56 57 - /* Update 'subtree_max_end' for a node, based on node and its children */ 58 - static void memtype_rb_augment_cb(struct rb_node *node, void *__unused) 57 + static u64 compute_subtree_max_end(struct memtype *data) 59 58 { 60 - struct memtype *data; 61 - u64 max_end, child_max_end; 59 + u64 max_end = data->end, child_max_end; 62 60 63 - if (!node) 64 - return; 65 - 66 - data = container_of(node, struct memtype, rb); 67 - max_end = data->end; 68 - 69 - child_max_end = get_subtree_max_end(node->rb_right); 61 + child_max_end = get_subtree_max_end(data->rb.rb_right); 70 62 if (child_max_end > max_end) 71 63 max_end = child_max_end; 72 64 73 - child_max_end = get_subtree_max_end(node->rb_left); 65 + child_max_end = get_subtree_max_end(data->rb.rb_left); 74 66 if (child_max_end > max_end) 75 67 max_end = child_max_end; 76 68 77 - data->subtree_max_end = max_end; 69 + return max_end; 78 70 } 71 + 72 + RB_DECLARE_CALLBACKS(static, memtype_rb_augment_cb, struct memtype, rb, 73 + u64, subtree_max_end, compute_subtree_max_end) 79 74 80 75 /* Find the first (lowest start addr) overlapping range from rb tree */ 81 76 static struct memtype *memtype_rb_lowest_match(struct rb_root *root, ··· 174 179 struct memtype *data = container_of(*node, struct memtype, rb); 175 180 176 181 parent = *node; 182 + if (data->subtree_max_end < newdata->end) 183 + data->subtree_max_end = newdata->end; 177 184 if (newdata->start <= data->start) 178 185 node = &((*node)->rb_left); 179 186 else if (newdata->start > data->start) 180 187 node = &((*node)->rb_right); 181 188 } 182 189 190 + newdata->subtree_max_end = newdata->end; 183 191 rb_link_node(&newdata->rb, parent, node); 184 - rb_insert_color(&newdata->rb, root); 185 - 
rb_augment_insert(&newdata->rb, memtype_rb_augment_cb, NULL); 192 + rb_insert_augmented(&newdata->rb, root, &memtype_rb_augment_cb); 186 193 } 187 194 188 195 int rbt_memtype_check_insert(struct memtype *new, unsigned long *ret_type) ··· 206 209 207 210 struct memtype *rbt_memtype_erase(u64 start, u64 end) 208 211 { 209 - struct rb_node *deepest; 210 212 struct memtype *data; 211 213 212 214 data = memtype_rb_exact_match(&memtype_rbroot, start, end); 213 215 if (!data) 214 216 goto out; 215 217 216 - deepest = rb_augment_erase_begin(&data->rb); 217 - rb_erase(&data->rb, &memtype_rbroot); 218 - rb_augment_erase_end(deepest, memtype_rb_augment_cb, NULL); 218 + rb_erase_augmented(&data->rb, &memtype_rbroot, &memtype_rb_augment_cb); 219 219 out: 220 220 return data; 221 221 }
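The pat_rbtree.c conversion above is part of the series' rbtree rework: instead of post-hoc `rb_augment_insert()` fixups, `RB_DECLARE_CALLBACKS` keeps `subtree_max_end` up to date inside the insert/erase rebalancing itself. The augmentation and the overlap search it enables can be sketched on a plain (unbalanced) binary tree standing in for the rbtree; `insert`/`lowest_match` are hypothetical simplifications of `memtype_rb_insert`/`memtype_rb_lowest_match`:

```c
#include <assert.h>
#include <stddef.h>

struct memtype {
	unsigned long start, end;
	unsigned long subtree_max_end;	/* max end over this whole subtree */
	struct memtype *left, *right;
};

static unsigned long get_subtree_max_end(struct memtype *node)
{
	return node ? node->subtree_max_end : 0;
}

/* Same shape as compute_subtree_max_end() in the diff. */
static unsigned long compute_subtree_max_end(struct memtype *data)
{
	unsigned long max_end = data->end, child;

	child = get_subtree_max_end(data->right);
	if (child > max_end)
		max_end = child;
	child = get_subtree_max_end(data->left);
	if (child > max_end)
		max_end = child;
	return max_end;
}

static struct memtype *insert(struct memtype *root, struct memtype *n)
{
	if (!root) {
		n->subtree_max_end = n->end;
		return n;
	}
	if (n->start <= root->start)
		root->left = insert(root->left, n);
	else
		root->right = insert(root->right, n);
	root->subtree_max_end = compute_subtree_max_end(root);
	return root;
}

/* Lowest-start overlap lookup: subtree_max_end prunes whole branches. */
static struct memtype *lowest_match(struct memtype *node,
				    unsigned long start, unsigned long end)
{
	while (node) {
		if (get_subtree_max_end(node->left) > start)
			node = node->left;	/* an overlap may hide on the left */
		else if (node->start < end && node->end > start)
			return node;		/* this node overlaps */
		else if (start >= node->start)
			node = node->right;
		else
			return NULL;
	}
	return NULL;
}
```

The pruning is the whole point of the augmentation: a subtree whose cached `subtree_max_end` is at or below the query start can be skipped without visiting its nodes.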
+1 -2
arch/x86/xen/mmu.c
··· 2451 2451 2452 2452 prot = __pgprot(pgprot_val(prot) | _PAGE_IOMAP); 2453 2453 2454 - BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_RESERVED | VM_IO)) == 2455 - (VM_PFNMAP | VM_RESERVED | VM_IO))); 2454 + BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | VM_IO))); 2456 2455 2457 2456 rmd.mfn = mfn; 2458 2457 rmd.prot = prot;
+1
arch/xtensa/mm/fault.c
··· 126 126 current->min_flt++; 127 127 if (fault & VM_FAULT_RETRY) { 128 128 flags &= ~FAULT_FLAG_ALLOW_RETRY; 129 + flags |= FAULT_FLAG_TRIED; 129 130 130 131 /* No need to up_read(&mm->mmap_sem) as we would 131 132 * have already released it in __lock_page_or_retry
+30 -10
drivers/base/memory.c
··· 248 248 static int 249 249 memory_block_action(unsigned long phys_index, unsigned long action) 250 250 { 251 - unsigned long start_pfn, start_paddr; 251 + unsigned long start_pfn; 252 252 unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; 253 253 struct page *first_page; 254 254 int ret; 255 255 256 256 first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT); 257 + start_pfn = page_to_pfn(first_page); 257 258 258 259 switch (action) { 259 260 case MEM_ONLINE: 260 - start_pfn = page_to_pfn(first_page); 261 - 262 261 if (!pages_correctly_reserved(start_pfn, nr_pages)) 263 262 return -EBUSY; 264 263 265 264 ret = online_pages(start_pfn, nr_pages); 266 265 break; 267 266 case MEM_OFFLINE: 268 - start_paddr = page_to_pfn(first_page) << PAGE_SHIFT; 269 - ret = remove_memory(start_paddr, 270 - nr_pages << PAGE_SHIFT); 267 + ret = offline_pages(start_pfn, nr_pages); 271 268 break; 272 269 default: 273 270 WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: " ··· 275 278 return ret; 276 279 } 277 280 278 - static int memory_block_change_state(struct memory_block *mem, 281 + static int __memory_block_change_state(struct memory_block *mem, 279 282 unsigned long to_state, unsigned long from_state_req) 280 283 { 281 284 int ret = 0; 282 - 283 - mutex_lock(&mem->state_mutex); 284 285 285 286 if (mem->state != from_state_req) { 286 287 ret = -EINVAL; ··· 307 312 break; 308 313 } 309 314 out: 310 - mutex_unlock(&mem->state_mutex); 311 315 return ret; 312 316 } 313 317 318 + static int memory_block_change_state(struct memory_block *mem, 319 + unsigned long to_state, unsigned long from_state_req) 320 + { 321 + int ret; 322 + 323 + mutex_lock(&mem->state_mutex); 324 + ret = __memory_block_change_state(mem, to_state, from_state_req); 325 + mutex_unlock(&mem->state_mutex); 326 + 327 + return ret; 328 + } 314 329 static ssize_t 315 330 store_mem_state(struct device *dev, 316 331 struct device_attribute *attr, const char *buf, size_t count) ··· 658 653 return -EINVAL; 
659 654 660 655 return remove_memory_block(0, section, 0); 656 + } 657 + 658 + /* 659 + * offline one memory block. If the memory block has been offlined, do nothing. 660 + */ 661 + int offline_memory_block(struct memory_block *mem) 662 + { 663 + int ret = 0; 664 + 665 + mutex_lock(&mem->state_mutex); 666 + if (mem->state != MEM_OFFLINE) 667 + ret = __memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE); 668 + mutex_unlock(&mem->state_mutex); 669 + 670 + return ret; 661 671 } 662 672 663 673 /*
+1 -1
drivers/char/mbcs.c
··· 507 507 508 508 vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); 509 509 510 - /* Remap-pfn-range will mark the range VM_IO and VM_RESERVED */ 510 + /* Remap-pfn-range will mark the range VM_IO */ 511 511 if (remap_pfn_range(vma, 512 512 vma->vm_start, 513 513 __pa(soft->gscr_addr) >> PAGE_SHIFT,
+1 -1
drivers/char/mem.c
··· 322 322 323 323 vma->vm_ops = &mmap_mem_ops; 324 324 325 - /* Remap-pfn-range will mark the range VM_IO and VM_RESERVED */ 325 + /* Remap-pfn-range will mark the range VM_IO */ 326 326 if (remap_pfn_range(vma, 327 327 vma->vm_start, 328 328 vma->vm_pgoff,
+1 -1
drivers/char/mspec.c
··· 286 286 atomic_set(&vdata->refcnt, 1); 287 287 vma->vm_private_data = vdata; 288 288 289 - vma->vm_flags |= (VM_IO | VM_RESERVED | VM_PFNMAP | VM_DONTEXPAND); 289 + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP; 290 290 if (vdata->type == MSPEC_FETCHOP || vdata->type == MSPEC_UNCACHED) 291 291 vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); 292 292 vma->vm_ops = &mspec_vm_ops;
+1 -1
drivers/gpu/drm/drm_gem.c
··· 706 706 goto out_unlock; 707 707 } 708 708 709 - vma->vm_flags |= VM_RESERVED | VM_IO | VM_PFNMAP | VM_DONTEXPAND; 709 + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP; 710 710 vma->vm_ops = obj->dev->driver->gem_vm_ops; 711 711 vma->vm_private_data = map->handle; 712 712 vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
+2 -8
drivers/gpu/drm/drm_vm.c
··· 514 514 515 515 vma->vm_ops = &drm_vm_dma_ops; 516 516 517 - vma->vm_flags |= VM_RESERVED; /* Don't swap */ 518 - vma->vm_flags |= VM_DONTEXPAND; 517 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 519 518 520 519 drm_vm_open_locked(dev, vma); 521 520 return 0; ··· 642 643 case _DRM_SHM: 643 644 vma->vm_ops = &drm_vm_shm_ops; 644 645 vma->vm_private_data = (void *)map; 645 - /* Don't let this area swap. Change when 646 - DRM_KERNEL advisory is supported. */ 647 - vma->vm_flags |= VM_RESERVED; 648 646 break; 649 647 case _DRM_SCATTER_GATHER: 650 648 vma->vm_ops = &drm_vm_sg_ops; 651 649 vma->vm_private_data = (void *)map; 652 - vma->vm_flags |= VM_RESERVED; 653 650 vma->vm_page_prot = drm_dma_prot(map->type, vma); 654 651 break; 655 652 default: 656 653 return -EINVAL; /* This should never happen. */ 657 654 } 658 - vma->vm_flags |= VM_RESERVED; /* Don't swap */ 659 - vma->vm_flags |= VM_DONTEXPAND; 655 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 660 656 661 657 drm_vm_open_locked(dev, vma); 662 658 return 0;
+1 -1
drivers/gpu/drm/exynos/exynos_drm_gem.c
··· 500 500 501 501 DRM_DEBUG_KMS("%s\n", __FILE__); 502 502 503 - vma->vm_flags |= (VM_IO | VM_RESERVED); 503 + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; 504 504 505 505 update_vm_cache_attr(exynos_gem_obj, vma); 506 506
+1 -2
drivers/gpu/drm/gma500/framebuffer.c
··· 178 178 */ 179 179 vma->vm_ops = &psbfb_vm_ops; 180 180 vma->vm_private_data = (void *)psbfb; 181 - vma->vm_flags |= VM_RESERVED | VM_IO | 182 - VM_MIXEDMAP | VM_DONTEXPAND; 181 + vma->vm_flags |= VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP; 183 182 return 0; 184 183 } 185 184
+2 -2
drivers/gpu/drm/ttm/ttm_bo_vm.c
··· 285 285 */ 286 286 287 287 vma->vm_private_data = bo; 288 - vma->vm_flags |= VM_RESERVED | VM_IO | VM_MIXEDMAP | VM_DONTEXPAND; 288 + vma->vm_flags |= VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP; 289 289 return 0; 290 290 out_unref: 291 291 ttm_bo_unref(&bo); ··· 300 300 301 301 vma->vm_ops = &ttm_bo_vm_ops; 302 302 vma->vm_private_data = ttm_bo_reference(bo); 303 - vma->vm_flags |= VM_RESERVED | VM_IO | VM_MIXEDMAP | VM_DONTEXPAND; 303 + vma->vm_flags |= VM_IO | VM_MIXEDMAP | VM_DONTEXPAND; 304 304 return 0; 305 305 } 306 306 EXPORT_SYMBOL(ttm_fbdev_mmap);
+1 -1
drivers/gpu/drm/udl/udl_fb.c
··· 243 243 size = 0; 244 244 } 245 245 246 - vma->vm_flags |= VM_RESERVED; /* avoid to swap out this VMA */ 246 + /* VM_IO | VM_DONTEXPAND | VM_DONTDUMP are set by remap_pfn_range() */ 247 247 return 0; 248 248 } 249 249
+2 -2
drivers/infiniband/hw/ehca/ehca_uverbs.c
··· 117 117 physical = galpas->user.fw_handle; 118 118 vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); 119 119 ehca_gen_dbg("vsize=%llx physical=%llx", vsize, physical); 120 - /* VM_IO | VM_RESERVED are set by remap_pfn_range() */ 120 + /* VM_IO | VM_DONTEXPAND | VM_DONTDUMP are set by remap_pfn_range() */ 121 121 ret = remap_4k_pfn(vma, vma->vm_start, physical >> EHCA_PAGESHIFT, 122 122 vma->vm_page_prot); 123 123 if (unlikely(ret)) { ··· 139 139 u64 start, ofs; 140 140 struct page *page; 141 141 142 - vma->vm_flags |= VM_RESERVED; 142 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 143 143 start = vma->vm_start; 144 144 for (ofs = 0; ofs < queue->queue_length; ofs += PAGE_SIZE) { 145 145 u64 virt_addr = (u64)ipz_qeit_calc(queue, ofs);
+1 -1
drivers/infiniband/hw/ipath/ipath_file_ops.c
··· 1225 1225 1226 1226 vma->vm_pgoff = (unsigned long) addr >> PAGE_SHIFT; 1227 1227 vma->vm_ops = &ipath_file_vm_ops; 1228 - vma->vm_flags |= VM_RESERVED | VM_DONTEXPAND; 1228 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 1229 1229 ret = 1; 1230 1230 1231 1231 bail:
+1 -1
drivers/infiniband/hw/qib/qib_file_ops.c
··· 971 971 972 972 vma->vm_pgoff = (unsigned long) addr >> PAGE_SHIFT; 973 973 vma->vm_ops = &qib_file_vm_ops; 974 - vma->vm_flags |= VM_RESERVED | VM_DONTEXPAND; 974 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 975 975 ret = 1; 976 976 977 977 bail:
+1 -1
drivers/media/pci/meye/meye.c
··· 1647 1647 1648 1648 vma->vm_ops = &meye_vm_ops; 1649 1649 vma->vm_flags &= ~VM_IO; /* not I/O memory */ 1650 - vma->vm_flags |= VM_RESERVED; /* avoid to swap out this VMA */ 1650 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 1651 1651 vma->vm_private_data = (void *) (offset / gbufsize); 1652 1652 meye_vm_open(vma); 1653 1653
+1 -1
drivers/media/platform/omap/omap_vout.c
··· 911 911 912 912 q->bufs[i]->baddr = vma->vm_start; 913 913 914 - vma->vm_flags |= VM_RESERVED; 914 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 915 915 vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot); 916 916 vma->vm_ops = &omap_vout_vm_ops; 917 917 vma->vm_private_data = (void *) vout;
+1 -1
drivers/media/platform/vino.c
··· 3950 3950 3951 3951 fb->map_count = 1; 3952 3952 3953 - vma->vm_flags |= VM_DONTEXPAND | VM_RESERVED; 3953 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 3954 3954 vma->vm_flags &= ~VM_IO; 3955 3955 vma->vm_private_data = fb; 3956 3956 vma->vm_file = file;
+1 -2
drivers/media/usb/sn9c102/sn9c102_core.c
··· 2126 2126 return -EINVAL; 2127 2127 } 2128 2128 2129 - vma->vm_flags |= VM_IO; 2130 - vma->vm_flags |= VM_RESERVED; 2129 + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; 2131 2130 2132 2131 pos = cam->frame[i].bufmem; 2133 2132 while (size > 0) { /* size is page-aligned */
+1 -2
drivers/media/usb/usbvision/usbvision-video.c
··· 1108 1108 } 1109 1109 1110 1110 /* VM_IO is eventually going to replace PageReserved altogether */ 1111 - vma->vm_flags |= VM_IO; 1112 - vma->vm_flags |= VM_RESERVED; /* avoid to swap out this VMA */ 1111 + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; 1113 1112 1114 1113 pos = usbvision->frame[i].data; 1115 1114 while (size > 0) {
+1 -1
drivers/media/v4l2-core/videobuf-dma-sg.c
··· 582 582 map->count = 1; 583 583 map->q = q; 584 584 vma->vm_ops = &videobuf_vm_ops; 585 - vma->vm_flags |= VM_DONTEXPAND | VM_RESERVED; 585 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 586 586 vma->vm_flags &= ~VM_IO; /* using shared anonymous pages */ 587 587 vma->vm_private_data = map; 588 588 dprintk(1, "mmap %p: q=%p %08lx-%08lx pgoff %08lx bufs %d-%d\n",
+1 -1
drivers/media/v4l2-core/videobuf-vmalloc.c
··· 270 270 } 271 271 272 272 vma->vm_ops = &videobuf_vm_ops; 273 - vma->vm_flags |= VM_DONTEXPAND | VM_RESERVED; 273 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 274 274 vma->vm_private_data = map; 275 275 276 276 dprintk(1, "mmap %p: q=%p %08lx-%08lx (%lx) pgoff %08lx buf %d\n",
+1 -1
drivers/media/v4l2-core/videobuf2-memops.c
··· 163 163 return ret; 164 164 } 165 165 166 - vma->vm_flags |= VM_DONTEXPAND | VM_RESERVED; 166 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 167 167 vma->vm_private_data = priv; 168 168 vma->vm_ops = vm_ops; 169 169
-2
drivers/misc/carma/carma-fpga.c
··· 1243 1243 return -EINVAL; 1244 1244 } 1245 1245 1246 - /* IO memory (stop cacheing) */ 1247 - vma->vm_flags |= VM_IO | VM_RESERVED; 1248 1246 vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); 1249 1247 1250 1248 return io_remap_pfn_range(vma, vma->vm_start, addr, vsize,
+2 -3
drivers/misc/sgi-gru/grufile.c
··· 108 108 vma->vm_end & (GRU_GSEG_PAGESIZE - 1)) 109 109 return -EINVAL; 110 110 111 - vma->vm_flags |= 112 - (VM_IO | VM_DONTCOPY | VM_LOCKED | VM_DONTEXPAND | VM_PFNMAP | 113 - VM_RESERVED); 111 + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_LOCKED | 112 + VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP; 114 113 vma->vm_page_prot = PAGE_SHARED; 115 114 vma->vm_ops = &gru_vm_ops; 116 115
+1 -1
drivers/mtd/mtdchar.c
··· 1182 1182 return -EINVAL; 1183 1183 if (set_vm_offset(vma, off) < 0) 1184 1184 return -EINVAL; 1185 - vma->vm_flags |= VM_IO | VM_RESERVED; 1185 + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; 1186 1186 1187 1187 #ifdef pgprot_noncached 1188 1188 if (file->f_flags & O_DSYNC || off >= __pa(high_memory))
+2 -4
drivers/mtd/mtdcore.c
··· 1056 1056 * until the request succeeds or until the allocation size falls below 1057 1057 * the system page size. This attempts to make sure it does not adversely 1058 1058 * impact system performance, so when allocating more than one page, we 1059 - * ask the memory allocator to avoid re-trying, swapping, writing back 1060 - * or performing I/O. 1059 + * ask the memory allocator to avoid re-trying. 1061 1060 * 1062 1061 * Note, this function also makes sure that the allocated buffer is aligned to 1063 1062 * the MTD device's min. I/O unit, i.e. the "mtd->writesize" value. ··· 1070 1071 */ 1071 1072 void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size) 1072 1073 { 1073 - gfp_t flags = __GFP_NOWARN | __GFP_WAIT | 1074 - __GFP_NORETRY | __GFP_NO_KSWAPD; 1074 + gfp_t flags = __GFP_NOWARN | __GFP_WAIT | __GFP_NORETRY; 1075 1075 size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE); 1076 1076 void *kbuf; 1077 1077
+3 -14
drivers/oprofile/buffer_sync.c
··· 216 216 } 217 217 218 218 219 - /* Look up the dcookie for the task's first VM_EXECUTABLE mapping, 219 + /* Look up the dcookie for the task's mm->exe_file, 220 220 * which corresponds loosely to "application name". This is 221 221 * not strictly necessary but allows oprofile to associate 222 222 * shared-library samples with particular applications ··· 224 224 static unsigned long get_exec_dcookie(struct mm_struct *mm) 225 225 { 226 226 unsigned long cookie = NO_COOKIE; 227 - struct vm_area_struct *vma; 228 227 229 - if (!mm) 230 - goto out; 228 + if (mm && mm->exe_file) 229 + cookie = fast_get_dcookie(&mm->exe_file->f_path); 231 230 232 - for (vma = mm->mmap; vma; vma = vma->vm_next) { 233 - if (!vma->vm_file) 234 - continue; 235 - if (!(vma->vm_flags & VM_EXECUTABLE)) 236 - continue; 237 - cookie = fast_get_dcookie(&vma->vm_file->f_path); 238 - break; 239 - } 240 - 241 - out: 242 231 return cookie; 243 232 } 244 233
+1 -1
drivers/scsi/sg.c
··· 1257 1257 } 1258 1258 1259 1259 sfp->mmap_called = 1; 1260 - vma->vm_flags |= VM_RESERVED; 1260 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 1261 1261 vma->vm_private_data = sfp; 1262 1262 vma->vm_ops = &sg_mmap_vm_ops; 1263 1263 return 0;
-1
drivers/staging/android/ashmem.c
··· 332 332 if (vma->vm_file) 333 333 fput(vma->vm_file); 334 334 vma->vm_file = asma->file; 335 - vma->vm_flags |= VM_CAN_NONLINEAR; 336 335 337 336 out: 338 337 mutex_unlock(&ashmem_mutex);
+1 -1
drivers/staging/omapdrm/omap_gem_dmabuf.c
··· 160 160 goto out_unlock; 161 161 } 162 162 163 - vma->vm_flags |= VM_RESERVED | VM_IO | VM_PFNMAP | VM_DONTEXPAND; 163 + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP; 164 164 vma->vm_ops = obj->dev->driver->gem_vm_ops; 165 165 vma->vm_private_data = obj; 166 166 vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
+1 -1
drivers/staging/tidspbridge/rmgr/drv_interface.c
··· 261 261 { 262 262 u32 status; 263 263 264 - vma->vm_flags |= VM_RESERVED | VM_IO; 264 + /* VM_IO | VM_DONTEXPAND | VM_DONTDUMP are set by remap_pfn_range() */ 265 265 vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); 266 266 267 267 dev_dbg(bridge, "%s: vm filp %p start %lx end %lx page_prot %ulx "
+1 -3
drivers/uio/uio.c
··· 653 653 if (mi < 0) 654 654 return -EINVAL; 655 655 656 - vma->vm_flags |= VM_IO | VM_RESERVED; 657 - 658 656 vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); 659 657 660 658 return remap_pfn_range(vma, ··· 664 666 665 667 static int uio_mmap_logical(struct vm_area_struct *vma) 666 668 { 667 - vma->vm_flags |= VM_RESERVED; 669 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 668 670 vma->vm_ops = &uio_vm_ops; 669 671 uio_vma_open(vma); 670 672 return 0;
+1 -1
drivers/usb/mon/mon_bin.c
··· 1247 1247 { 1248 1248 /* don't do anything here: "fault" will set up page table entries */ 1249 1249 vma->vm_ops = &mon_bin_vm_ops; 1250 - vma->vm_flags |= VM_RESERVED; 1250 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 1251 1251 vma->vm_private_data = filp->private_data; 1252 1252 mon_bin_vma_open(vma); 1253 1253 return 0;
+1 -1
drivers/video/68328fb.c
··· 400 400 #ifndef MMU 401 401 /* this is uClinux (no MMU) specific code */ 402 402 403 - vma->vm_flags |= VM_RESERVED; 403 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 404 404 vma->vm_start = videomemory; 405 405 406 406 return 0;
+1 -2
drivers/video/aty/atyfb_base.c
··· 1942 1942 off = vma->vm_pgoff << PAGE_SHIFT; 1943 1943 size = vma->vm_end - vma->vm_start; 1944 1944 1945 - /* To stop the swapper from even considering these pages. */ 1946 - vma->vm_flags |= (VM_IO | VM_RESERVED); 1945 + /* VM_IO | VM_DONTEXPAND | VM_DONTDUMP are set by remap_pfn_range() */ 1947 1946 1948 1947 if (((vma->vm_pgoff == 0) && (size == info->fix.smem_len)) || 1949 1948 ((off == info->fix.smem_len) && (size == PAGE_SIZE)))
+1 -2
drivers/video/fb-puv3.c
··· 653 653 vma->vm_page_prot)) 654 654 return -EAGAIN; 655 655 656 - vma->vm_flags |= VM_RESERVED; /* avoid to swap out this VMA */ 656 + /* VM_IO | VM_DONTEXPAND | VM_DONTDUMP are set by remap_pfn_range() */ 657 657 return 0; 658 - 659 658 } 660 659 661 660 static struct fb_ops unifb_ops = {
+1 -1
drivers/video/fb_defio.c
··· 166 166 static int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma) 167 167 { 168 168 vma->vm_ops = &fb_deferred_io_vm_ops; 169 - vma->vm_flags |= ( VM_RESERVED | VM_DONTEXPAND ); 169 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 170 170 if (!(info->flags & FBINFO_VIRTFB)) 171 171 vma->vm_flags |= VM_IO; 172 172 vma->vm_private_data = info;
+1 -2
drivers/video/fbmem.c
··· 1410 1410 return -EINVAL; 1411 1411 off += start; 1412 1412 vma->vm_pgoff = off >> PAGE_SHIFT; 1413 - /* This is an IO map - tell maydump to skip this VMA */ 1414 - vma->vm_flags |= VM_IO | VM_RESERVED; 1413 + /* VM_IO | VM_DONTEXPAND | VM_DONTDUMP are set by io_remap_pfn_range()*/ 1415 1414 vma->vm_page_prot = vm_get_page_prot(vma->vm_flags); 1416 1415 fb_pgprotect(file, vma, off); 1417 1416 if (io_remap_pfn_range(vma, vma->vm_start, off >> PAGE_SHIFT,
+1 -1
drivers/video/gbefb.c
··· 1024 1024 pgprot_val(vma->vm_page_prot) = 1025 1025 pgprot_fb(pgprot_val(vma->vm_page_prot)); 1026 1026 1027 - vma->vm_flags |= VM_IO | VM_RESERVED; 1027 + /* VM_IO | VM_DONTEXPAND | VM_DONTDUMP are set by remap_pfn_range() */ 1028 1028 1029 1029 /* look for the starting tile */ 1030 1030 tile = &gbe_tiles.cpu[offset >> TILE_SHIFT];
+1 -1
drivers/video/omap2/omapfb/omapfb-main.c
··· 1128 1128 DBG("user mmap region start %lx, len %d, off %lx\n", start, len, off); 1129 1129 1130 1130 vma->vm_pgoff = off >> PAGE_SHIFT; 1131 - vma->vm_flags |= VM_IO | VM_RESERVED; 1131 + /* VM_IO | VM_DONTEXPAND | VM_DONTDUMP are set by remap_pfn_range() */ 1132 1132 vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot); 1133 1133 vma->vm_ops = &mmap_user_ops; 1134 1134 vma->vm_private_data = rg;
+2 -3
drivers/video/sbuslib.c
··· 57 57 58 58 off = vma->vm_pgoff << PAGE_SHIFT; 59 59 60 - /* To stop the swapper from even considering these pages */ 61 - vma->vm_flags |= (VM_IO | VM_RESERVED); 62 - 60 + /* VM_IO | VM_DONTEXPAND | VM_DONTDUMP are set by remap_pfn_range() */ 61 + 63 62 vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); 64 63 65 64 /* Each page, see which map applies */
-1
drivers/video/smscufx.c
··· 803 803 size = 0; 804 804 } 805 805 806 - vma->vm_flags |= VM_RESERVED; /* avoid to swap out this VMA */ 807 806 return 0; 808 807 } 809 808
-1
drivers/video/udlfb.c
··· 345 345 size = 0; 346 346 } 347 347 348 - vma->vm_flags |= VM_RESERVED; /* avoid to swap out this VMA */ 349 348 return 0; 350 349 } 351 350
-1
drivers/video/vermilion/vermilion.c
··· 1018 1018 offset += vinfo->vram_start; 1019 1019 pgprot_val(vma->vm_page_prot) |= _PAGE_PCD; 1020 1020 pgprot_val(vma->vm_page_prot) &= ~_PAGE_PWT; 1021 - vma->vm_flags |= VM_RESERVED | VM_IO; 1022 1021 if (remap_pfn_range(vma, vma->vm_start, offset >> PAGE_SHIFT, 1023 1022 size, vma->vm_page_prot)) 1024 1023 return -EAGAIN;
-1
drivers/video/vfb.c
··· 439 439 size = 0; 440 440 } 441 441 442 - vma->vm_flags |= VM_RESERVED; /* avoid to swap out this VMA */ 443 442 return 0; 444 443 445 444 }
+1 -1
drivers/xen/gntalloc.c
··· 535 535 536 536 vma->vm_private_data = vm_priv; 537 537 538 - vma->vm_flags |= VM_RESERVED | VM_DONTEXPAND; 538 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 539 539 540 540 vma->vm_ops = &gntalloc_vmops; 541 541
+1 -1
drivers/xen/gntdev.c
··· 720 720 721 721 vma->vm_ops = &gntdev_vmops; 722 722 723 - vma->vm_flags |= VM_RESERVED|VM_DONTEXPAND; 723 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 724 724 725 725 if (use_ptemod) 726 726 vma->vm_flags |= VM_DONTCOPY;
+2 -1
drivers/xen/privcmd.c
··· 455 455 { 456 456 /* DONTCOPY is essential for Xen because copy_page_range doesn't know 457 457 * how to recreate these mappings */ 458 - vma->vm_flags |= VM_RESERVED | VM_IO | VM_DONTCOPY | VM_PFNMAP; 458 + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTCOPY | 459 + VM_DONTEXPAND | VM_DONTDUMP; 459 460 vma->vm_ops = &privcmd_vm_ops; 460 461 vma->vm_private_data = NULL; 461 462
+1
fs/9p/vfs_file.c
··· 738 738 static const struct vm_operations_struct v9fs_file_vm_ops = { 739 739 .fault = filemap_fault, 740 740 .page_mkwrite = v9fs_vm_page_mkwrite, 741 + .remap_pages = generic_file_remap_pages, 741 742 }; 742 743 743 744
+2 -2
fs/binfmt_elf.c
··· 1123 1123 if (always_dump_vma(vma)) 1124 1124 goto whole; 1125 1125 1126 - if (vma->vm_flags & VM_NODUMP) 1126 + if (vma->vm_flags & VM_DONTDUMP) 1127 1127 return 0; 1128 1128 1129 1129 /* Hugetlb memory check */ ··· 1135 1135 } 1136 1136 1137 1137 /* Do not dump I/O mapped devices or special mappings */ 1138 - if (vma->vm_flags & (VM_IO | VM_RESERVED)) 1138 + if (vma->vm_flags & VM_IO) 1139 1139 return 0; 1140 1140 1141 1141 /* By default, dump shared memory if mapped from an anonymous file. */
+1 -1
fs/binfmt_elf_fdpic.c
··· 1205 1205 int dump_ok; 1206 1206 1207 1207 /* Do not dump I/O mapped devices or special mappings */ 1208 - if (vma->vm_flags & (VM_IO | VM_RESERVED)) { 1208 + if (vma->vm_flags & VM_IO) { 1209 1209 kdcore("%08lx: %08lx: no (IO)", vma->vm_start, vma->vm_flags); 1210 1210 return 0; 1211 1211 }
+1 -1
fs/btrfs/file.c
··· 1599 1599 static const struct vm_operations_struct btrfs_file_vm_ops = { 1600 1600 .fault = filemap_fault, 1601 1601 .page_mkwrite = btrfs_page_mkwrite, 1602 + .remap_pages = generic_file_remap_pages, 1602 1603 }; 1603 1604 1604 1605 static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma) ··· 1611 1610 1612 1611 file_accessed(filp); 1613 1612 vma->vm_ops = &btrfs_file_vm_ops; 1614 - vma->vm_flags |= VM_CAN_NONLINEAR; 1615 1613 1616 1614 return 0; 1617 1615 }
+1 -1
fs/ceph/addr.c
··· 1224 1224 static struct vm_operations_struct ceph_vmops = { 1225 1225 .fault = filemap_fault, 1226 1226 .page_mkwrite = ceph_page_mkwrite, 1227 + .remap_pages = generic_file_remap_pages, 1227 1228 }; 1228 1229 1229 1230 int ceph_mmap(struct file *file, struct vm_area_struct *vma) ··· 1235 1234 return -ENOEXEC; 1236 1235 file_accessed(file); 1237 1236 vma->vm_ops = &ceph_vmops; 1238 - vma->vm_flags |= VM_CAN_NONLINEAR; 1239 1237 return 0; 1240 1238 }
+1
fs/cifs/file.c
··· 3003 3003 static struct vm_operations_struct cifs_file_vm_ops = { 3004 3004 .fault = filemap_fault, 3005 3005 .page_mkwrite = cifs_page_mkwrite, 3006 + .remap_pages = generic_file_remap_pages, 3006 3007 }; 3007 3008 3008 3009 int cifs_file_strict_mmap(struct file *file, struct vm_area_struct *vma)
+1 -1
fs/exec.c
··· 603 603 * process cleanup to remove whatever mess we made. 604 604 */ 605 605 if (length != move_page_tables(vma, old_start, 606 - vma, new_start, length)) 606 + vma, new_start, length, false)) 607 607 return -ENOMEM; 608 608 609 609 lru_add_drain();
+1 -1
fs/ext4/file.c
··· 207 207 static const struct vm_operations_struct ext4_file_vm_ops = { 208 208 .fault = filemap_fault, 209 209 .page_mkwrite = ext4_page_mkwrite, 210 + .remap_pages = generic_file_remap_pages, 210 211 }; 211 212 212 213 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) ··· 218 217 return -ENOEXEC; 219 218 file_accessed(file); 220 219 vma->vm_ops = &ext4_file_vm_ops; 221 - vma->vm_flags |= VM_CAN_NONLINEAR; 222 220 return 0; 223 221 } 224 222
+3 -4
fs/fs-writeback.c
··· 439 439 * setting I_SYNC flag and calling inode_sync_complete() to clear it. 440 440 */ 441 441 static int 442 - __writeback_single_inode(struct inode *inode, struct bdi_writeback *wb, 443 - struct writeback_control *wbc) 442 + __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) 444 443 { 445 444 struct address_space *mapping = inode->i_mapping; 446 445 long nr_to_write = wbc->nr_to_write; ··· 526 527 inode->i_state |= I_SYNC; 527 528 spin_unlock(&inode->i_lock); 528 529 529 - ret = __writeback_single_inode(inode, wb, wbc); 530 + ret = __writeback_single_inode(inode, wbc); 530 531 531 532 spin_lock(&wb->list_lock); 532 533 spin_lock(&inode->i_lock); ··· 669 670 * We use I_SYNC to pin the inode in memory. While it is set 670 671 * evict_inode() will wait so the inode cannot be freed. 671 672 */ 672 - __writeback_single_inode(inode, wb, &wbc); 673 + __writeback_single_inode(inode, &wbc); 673 674 674 675 work->nr_pages -= write_chunk - wbc.nr_to_write; 675 676 wrote += write_chunk - wbc.nr_to_write;
+1
fs/fuse/file.c
··· 1379 1379 .close = fuse_vma_close, 1380 1380 .fault = filemap_fault, 1381 1381 .page_mkwrite = fuse_page_mkwrite, 1382 + .remap_pages = generic_file_remap_pages, 1382 1383 }; 1383 1384 1384 1385 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
+1 -1
fs/gfs2/file.c
··· 492 492 static const struct vm_operations_struct gfs2_vm_ops = { 493 493 .fault = filemap_fault, 494 494 .page_mkwrite = gfs2_page_mkwrite, 495 + .remap_pages = generic_file_remap_pages, 495 496 }; 496 497 497 498 /** ··· 527 526 return error; 528 527 } 529 528 vma->vm_ops = &gfs2_vm_ops; 530 - vma->vm_flags |= VM_CAN_NONLINEAR; 531 529 532 530 return 0; 533 531 }
+5 -6
fs/hugetlbfs/inode.c
··· 110 110 * way when do_mmap_pgoff unwinds (may be important on powerpc 111 111 * and ia64). 112 112 */ 113 - vma->vm_flags |= VM_HUGETLB | VM_RESERVED; 113 + vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND | VM_DONTDUMP; 114 114 vma->vm_ops = &hugetlb_vm_ops; 115 115 116 116 if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT)) ··· 397 397 } 398 398 399 399 static inline void 400 - hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff) 400 + hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff) 401 401 { 402 402 struct vm_area_struct *vma; 403 - struct prio_tree_iter iter; 404 403 405 - vma_prio_tree_foreach(vma, &iter, root, pgoff, ULONG_MAX) { 404 + vma_interval_tree_foreach(vma, root, pgoff, ULONG_MAX) { 406 405 unsigned long v_offset; 407 406 408 407 /* 409 408 * Can the expression below overflow on 32-bit arches? 410 - * No, because the prio_tree returns us only those vmas 409 + * No, because the interval tree returns us only those vmas 411 410 * which overlap the truncated area starting at pgoff, 412 411 * and no vma on a 32-bit arch can span beyond the 4GB. 413 412 */ ··· 431 432 432 433 i_size_write(inode, offset); 433 434 mutex_lock(&mapping->i_mmap_mutex); 434 - if (!prio_tree_empty(&mapping->i_mmap)) 435 + if (!RB_EMPTY_ROOT(&mapping->i_mmap)) 435 436 hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff); 436 437 mutex_unlock(&mapping->i_mmap_mutex); 437 438 truncate_hugepages(inode, offset);
+1 -1
fs/inode.c
··· 348 348 mutex_init(&mapping->i_mmap_mutex); 349 349 INIT_LIST_HEAD(&mapping->private_list); 350 350 spin_lock_init(&mapping->private_lock); 351 - INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap); 351 + mapping->i_mmap = RB_ROOT; 352 352 INIT_LIST_HEAD(&mapping->i_mmap_nonlinear); 353 353 } 354 354 EXPORT_SYMBOL(address_space_init_once);
+8 -5
fs/jffs2/readinode.c
··· 394 394 } 395 395 396 396 /* Trivial function to remove the last node in the tree. Which by definition 397 - has no right-hand -- so can be removed just by making its only child (if 398 - any) take its place under its parent. */ 397 + has no right-hand child — so can be removed just by making its left-hand 398 + child (if any) take its place under its parent. Since this is only done 399 + when we're consuming the whole tree, there's no need to use rb_erase() 400 + and let it worry about adjusting colours and balancing the tree. That 401 + would just be a waste of time. */ 399 402 static void eat_last(struct rb_root *root, struct rb_node *node) 400 403 { 401 404 struct rb_node *parent = rb_parent(node); ··· 415 412 link = &parent->rb_right; 416 413 417 414 *link = node->rb_left; 418 - /* Colour doesn't matter now. Only the parent pointer. */ 419 415 if (node->rb_left) 420 - node->rb_left->rb_parent_color = node->rb_parent_color; 416 + node->rb_left->__rb_parent_color = node->__rb_parent_color; 421 417 } 422 418 423 - /* We put this in reverse order, so we can just use eat_last */ 419 + /* We put the version tree in reverse order, so we can use the same eat_last() 420 + function that we use to consume the tmpnode tree (tn_root). */ 424 421 static void ver_insert(struct rb_root *ver_root, struct jffs2_tmp_dnode_info *tn) 425 422 { 426 423 struct rb_node **link = &ver_root->rb_node;
+1
fs/nfs/file.c
··· 578 578 static const struct vm_operations_struct nfs_file_vm_ops = { 579 579 .fault = filemap_fault, 580 580 .page_mkwrite = nfs_vm_page_mkwrite, 581 + .remap_pages = generic_file_remap_pages, 581 582 }; 582 583 583 584 static int nfs_need_sync_write(struct file *filp, struct inode *inode)
+1 -1
fs/nilfs2/file.c
··· 135 135 static const struct vm_operations_struct nilfs_file_vm_ops = { 136 136 .fault = filemap_fault, 137 137 .page_mkwrite = nilfs_page_mkwrite, 138 + .remap_pages = generic_file_remap_pages, 138 139 }; 139 140 140 141 static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma) 141 142 { 142 143 file_accessed(file); 143 144 vma->vm_ops = &nilfs_file_vm_ops; 144 - vma->vm_flags |= VM_CAN_NONLINEAR; 145 145 return 0; 146 146 } 147 147
+1 -1
fs/ocfs2/mmap.c
··· 173 173 static const struct vm_operations_struct ocfs2_file_vm_ops = { 174 174 .fault = ocfs2_fault, 175 175 .page_mkwrite = ocfs2_page_mkwrite, 176 + .remap_pages = generic_file_remap_pages, 176 177 }; 177 178 178 179 int ocfs2_mmap(struct file *file, struct vm_area_struct *vma) ··· 189 188 ocfs2_inode_unlock(file->f_dentry->d_inode, lock_level); 190 189 out: 191 190 vma->vm_ops = &ocfs2_file_vm_ops; 192 - vma->vm_flags |= VM_CAN_NONLINEAR; 193 191 return 0; 194 192 } 195 193
+1 -116
fs/proc/base.c
··· 873 873 .release = mem_release, 874 874 }; 875 875 876 - static ssize_t oom_adjust_read(struct file *file, char __user *buf, 877 - size_t count, loff_t *ppos) 878 - { 879 - struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); 880 - char buffer[PROC_NUMBUF]; 881 - size_t len; 882 - int oom_adjust = OOM_DISABLE; 883 - unsigned long flags; 884 - 885 - if (!task) 886 - return -ESRCH; 887 - 888 - if (lock_task_sighand(task, &flags)) { 889 - oom_adjust = task->signal->oom_adj; 890 - unlock_task_sighand(task, &flags); 891 - } 892 - 893 - put_task_struct(task); 894 - 895 - len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust); 896 - 897 - return simple_read_from_buffer(buf, count, ppos, buffer, len); 898 - } 899 - 900 - static ssize_t oom_adjust_write(struct file *file, const char __user *buf, 901 - size_t count, loff_t *ppos) 902 - { 903 - struct task_struct *task; 904 - char buffer[PROC_NUMBUF]; 905 - int oom_adjust; 906 - unsigned long flags; 907 - int err; 908 - 909 - memset(buffer, 0, sizeof(buffer)); 910 - if (count > sizeof(buffer) - 1) 911 - count = sizeof(buffer) - 1; 912 - if (copy_from_user(buffer, buf, count)) { 913 - err = -EFAULT; 914 - goto out; 915 - } 916 - 917 - err = kstrtoint(strstrip(buffer), 0, &oom_adjust); 918 - if (err) 919 - goto out; 920 - if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) && 921 - oom_adjust != OOM_DISABLE) { 922 - err = -EINVAL; 923 - goto out; 924 - } 925 - 926 - task = get_proc_task(file->f_path.dentry->d_inode); 927 - if (!task) { 928 - err = -ESRCH; 929 - goto out; 930 - } 931 - 932 - task_lock(task); 933 - if (!task->mm) { 934 - err = -EINVAL; 935 - goto err_task_lock; 936 - } 937 - 938 - if (!lock_task_sighand(task, &flags)) { 939 - err = -ESRCH; 940 - goto err_task_lock; 941 - } 942 - 943 - if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) { 944 - err = -EACCES; 945 - goto err_sighand; 946 - } 947 - 948 - /* 949 - * Warn that /proc/pid/oom_adj is deprecated, 
see 950 - * Documentation/feature-removal-schedule.txt. 951 - */ 952 - printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n", 953 - current->comm, task_pid_nr(current), task_pid_nr(task), 954 - task_pid_nr(task)); 955 - task->signal->oom_adj = oom_adjust; 956 - /* 957 - * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum 958 - * value is always attainable. 959 - */ 960 - if (task->signal->oom_adj == OOM_ADJUST_MAX) 961 - task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX; 962 - else 963 - task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) / 964 - -OOM_DISABLE; 965 - trace_oom_score_adj_update(task); 966 - err_sighand: 967 - unlock_task_sighand(task, &flags); 968 - err_task_lock: 969 - task_unlock(task); 970 - put_task_struct(task); 971 - out: 972 - return err < 0 ? err : count; 973 - } 974 - 975 - static const struct file_operations proc_oom_adjust_operations = { 976 - .read = oom_adjust_read, 977 - .write = oom_adjust_write, 978 - .llseek = generic_file_llseek, 979 - }; 980 - 981 876 static ssize_t oom_score_adj_read(struct file *file, char __user *buf, 982 877 size_t count, loff_t *ppos) 983 878 { ··· 946 1051 if (has_capability_noaudit(current, CAP_SYS_RESOURCE)) 947 1052 task->signal->oom_score_adj_min = oom_score_adj; 948 1053 trace_oom_score_adj_update(task); 949 - /* 950 - * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is 951 - * always attainable. 
952 - */ 953 - if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) 954 - task->signal->oom_adj = OOM_DISABLE; 955 - else 956 - task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) / 957 - OOM_SCORE_ADJ_MAX; 1054 + 958 1055 err_sighand: 959 1056 unlock_task_sighand(task, &flags); 960 1057 err_task_lock: ··· 2597 2710 REG("cgroup", S_IRUGO, proc_cgroup_operations), 2598 2711 #endif 2599 2712 INF("oom_score", S_IRUGO, proc_oom_score), 2600 - REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), 2601 2713 REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), 2602 2714 #ifdef CONFIG_AUDITSYSCALL 2603 2715 REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), ··· 2963 3077 REG("cgroup", S_IRUGO, proc_cgroup_operations), 2964 3078 #endif 2965 3079 INF("oom_score", S_IRUGO, proc_oom_score), 2966 - REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), 2967 3080 REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), 2968 3081 #ifdef CONFIG_AUDITSYSCALL 2969 3082 REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
+7 -1
fs/proc/page.c
··· 115 115 u |= 1 << KPF_COMPOUND_TAIL; 116 116 if (PageHuge(page)) 117 117 u |= 1 << KPF_HUGE; 118 - else if (PageTransCompound(page)) 118 + /* 119 + * PageTransCompound can be true for non-huge compound pages (slab 120 + * pages or pages allocated by drivers with __GFP_COMP) because it 121 + * just checks PG_head/PG_tail, so we need to check PageLRU to make 122 + * sure a given page is a thp, not a non-huge compound page. 123 + */ 124 + else if (PageTransCompound(page) && PageLRU(compound_trans_head(page))) 119 125 u |= 1 << KPF_THP; 120 126 121 127 /*
+2 -3
fs/proc/proc_sysctl.c
··· 142 142 } 143 143 144 144 rb_link_node(node, parent, p); 145 + rb_insert_color(node, &head->parent->root); 145 146 return 0; 146 147 } 147 148 ··· 169 168 head->node = node; 170 169 if (node) { 171 170 struct ctl_table *entry; 172 - for (entry = table; entry->procname; entry++, node++) { 173 - rb_init_node(&node->node); 171 + for (entry = table; entry->procname; entry++, node++) 174 172 node->header = head; 175 - } 176 173 } 177 174 } 178 175
+1 -1
fs/proc/task_mmu.c
··· 54 54 "VmPTE:\t%8lu kB\n" 55 55 "VmSwap:\t%8lu kB\n", 56 56 hiwater_vm << (PAGE_SHIFT-10), 57 - (total_vm - mm->reserved_vm) << (PAGE_SHIFT-10), 57 + total_vm << (PAGE_SHIFT-10), 58 58 mm->locked_vm << (PAGE_SHIFT-10), 59 59 mm->pinned_vm << (PAGE_SHIFT-10), 60 60 hiwater_rss << (PAGE_SHIFT-10),
+1
fs/ubifs/file.c
··· 1536 1536 static const struct vm_operations_struct ubifs_file_vm_ops = { 1537 1537 .fault = filemap_fault, 1538 1538 .page_mkwrite = ubifs_vm_page_mkwrite, 1539 + .remap_pages = generic_file_remap_pages, 1539 1540 }; 1540 1541 1541 1542 static int ubifs_file_mmap(struct file *file, struct vm_area_struct *vma)
+1 -1
fs/xfs/xfs_file.c
··· 940 940 struct vm_area_struct *vma) 941 941 { 942 942 vma->vm_ops = &xfs_file_vm_ops; 943 - vma->vm_flags |= VM_CAN_NONLINEAR; 944 943 945 944 file_accessed(filp); 946 945 return 0; ··· 1442 1443 static const struct vm_operations_struct xfs_file_vm_ops = { 1443 1444 .fault = filemap_fault, 1444 1445 .page_mkwrite = xfs_vm_page_mkwrite, 1446 + .remap_pages = generic_file_remap_pages, 1445 1447 };
+48 -24
include/asm-generic/pgtable.h
··· 87 87 pmd_t *pmdp) 88 88 { 89 89 pmd_t pmd = *pmdp; 90 - pmd_clear(mm, address, pmdp); 90 + pmd_clear(pmdp); 91 91 return pmd; 92 92 } 93 93 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ ··· 160 160 #ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH 161 161 extern void pmdp_splitting_flush(struct vm_area_struct *vma, 162 162 unsigned long address, pmd_t *pmdp); 163 + #endif 164 + 165 + #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT 166 + extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable); 167 + #endif 168 + 169 + #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW 170 + extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm); 171 + #endif 172 + 173 + #ifndef __HAVE_ARCH_PMDP_INVALIDATE 174 + extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, 175 + pmd_t *pmdp); 163 176 #endif 164 177 165 178 #ifndef __HAVE_ARCH_PTE_SAME ··· 394 381 395 382 #ifndef __HAVE_PFNMAP_TRACKING 396 383 /* 397 - * Interface that can be used by architecture code to keep track of 398 - * memory type of pfn mappings (remap_pfn_range, vm_insert_pfn) 399 - * 400 - * track_pfn_vma_new is called when a _new_ pfn mapping is being established 401 - * for physical range indicated by pfn and size. 384 + * Interfaces that can be used by architecture code to keep track of 385 + * memory type of pfn mappings specified by the remap_pfn_range, 386 + * vm_insert_pfn. 402 387 */ 403 - static inline int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t *prot, 404 - unsigned long pfn, unsigned long size) 388 + 389 + /* 390 + * track_pfn_remap is called when a _new_ pfn mapping is being established 391 + * by remap_pfn_range() for physical range indicated by pfn and size. 
392 + */ 393 + static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, 394 + unsigned long pfn, unsigned long addr, 395 + unsigned long size) 405 396 { 406 397 return 0; 407 398 } 408 399 409 400 /* 410 - * Interface that can be used by architecture code to keep track of 411 - * memory type of pfn mappings (remap_pfn_range, vm_insert_pfn) 412 - * 413 - * track_pfn_vma_copy is called when vma that is covering the pfnmap gets 401 + * track_pfn_insert is called when a _new_ single pfn is established 402 + * by vm_insert_pfn(). 403 + */ 404 + static inline int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, 405 + unsigned long pfn) 406 + { 407 + return 0; 408 + } 409 + 410 + /* 411 + * track_pfn_copy is called when vma that is covering the pfnmap gets 414 412 * copied through copy_page_range(). 415 413 */ 416 - static inline int track_pfn_vma_copy(struct vm_area_struct *vma) 414 + static inline int track_pfn_copy(struct vm_area_struct *vma) 417 415 { 418 416 return 0; 419 417 } 420 418 421 419 /* 422 - * Interface that can be used by architecture code to keep track of 423 - * memory type of pfn mappings (remap_pfn_range, vm_insert_pfn) 424 - * 425 420 * untrack_pfn_vma is called while unmapping a pfnmap for a region. 426 421 * untrack can be called for a specific region indicated by pfn and size or 427 - * can be for the entire vma (in which case size can be zero). 422 + * can be for the entire vma (in which case pfn, size are zero). 
428 423 */ 429 - static inline void untrack_pfn_vma(struct vm_area_struct *vma, 430 - unsigned long pfn, unsigned long size) 424 + static inline void untrack_pfn(struct vm_area_struct *vma, 425 + unsigned long pfn, unsigned long size) 431 426 { 432 427 } 433 428 #else 434 - extern int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t *prot, 435 - unsigned long pfn, unsigned long size); 436 - extern int track_pfn_vma_copy(struct vm_area_struct *vma); 437 - extern void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn, 438 - unsigned long size); 429 + extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot, 430 + unsigned long pfn, unsigned long addr, 431 + unsigned long size); 432 + extern int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, 433 + unsigned long pfn); 434 + extern int track_pfn_copy(struct vm_area_struct *vma); 435 + extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn, 436 + unsigned long size); 439 437 #endif 440 438 441 439 #ifdef CONFIG_MMU
+25
include/linux/atomic.h
··· 86 86 } 87 87 #endif 88 88 89 + /* 90 + * atomic_dec_if_positive - decrement by 1 if old value positive 91 + * @v: pointer of type atomic_t 92 + * 93 + * The function returns the old value of *v minus 1, even if 94 + * the atomic variable, v, was not decremented. 95 + */ 96 + #ifndef atomic_dec_if_positive 97 + static inline int atomic_dec_if_positive(atomic_t *v) 98 + { 99 + int c, old, dec; 100 + c = atomic_read(v); 101 + for (;;) { 102 + dec = c - 1; 103 + if (unlikely(dec < 0)) 104 + break; 105 + old = atomic_cmpxchg((v), c, dec); 106 + if (likely(old == c)) 107 + break; 108 + c = old; 109 + } 110 + return dec; 111 + } 112 + #endif 113 + 89 114 #ifndef CONFIG_ARCH_HAS_ATOMIC_OR 90 115 static inline void atomic_or(int i, atomic_t *v) 91 116 {
+17 -2
include/linux/compaction.h
··· 22 22 extern int fragmentation_index(struct zone *zone, unsigned int order); 23 23 extern unsigned long try_to_compact_pages(struct zonelist *zonelist, 24 24 int order, gfp_t gfp_mask, nodemask_t *mask, 25 - bool sync, bool *contended); 25 + bool sync, bool *contended, struct page **page); 26 26 extern int compact_pgdat(pg_data_t *pgdat, int order); 27 + extern void reset_isolation_suitable(pg_data_t *pgdat); 27 28 extern unsigned long compaction_suitable(struct zone *zone, int order); 28 29 29 30 /* Do not skip compaction more than 64 times */ ··· 62 61 return zone->compact_considered < defer_limit; 63 62 } 64 63 64 + /* Returns true if restarting compaction after many failures */ 65 + static inline bool compaction_restarting(struct zone *zone, int order) 66 + { 67 + if (order < zone->compact_order_failed) 68 + return false; 69 + 70 + return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT && 71 + zone->compact_considered >= 1UL << zone->compact_defer_shift; 72 + } 73 + 65 74 #else 66 75 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist, 67 76 int order, gfp_t gfp_mask, nodemask_t *nodemask, 68 - bool sync, bool *contended) 77 + bool sync, bool *contended, struct page **page) 69 78 { 70 79 return COMPACT_CONTINUE; 71 80 } ··· 83 72 static inline int compact_pgdat(pg_data_t *pgdat, int order) 84 73 { 85 74 return COMPACT_CONTINUE; 75 + } 76 + 77 + static inline void reset_isolation_suitable(pg_data_t *pgdat) 78 + { 86 79 } 87 80 88 81 static inline unsigned long compaction_suitable(struct zone *zone, int order)
+5 -3
include/linux/fs.h
··· 401 401 #include <linux/cache.h> 402 402 #include <linux/list.h> 403 403 #include <linux/radix-tree.h> 404 - #include <linux/prio_tree.h> 404 + #include <linux/rbtree.h> 405 405 #include <linux/init.h> 406 406 #include <linux/pid.h> 407 407 #include <linux/bug.h> ··· 669 669 struct radix_tree_root page_tree; /* radix tree of all pages */ 670 670 spinlock_t tree_lock; /* and lock protecting it */ 671 671 unsigned int i_mmap_writable;/* count VM_SHARED mappings */ 672 - struct prio_tree_root i_mmap; /* tree of private and shared mappings */ 672 + struct rb_root i_mmap; /* tree of private and shared mappings */ 673 673 struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ 674 674 struct mutex i_mmap_mutex; /* protect tree, count, list */ 675 675 /* Protected by tree_lock together with the radix tree */ ··· 741 741 */ 742 742 static inline int mapping_mapped(struct address_space *mapping) 743 743 { 744 - return !prio_tree_empty(&mapping->i_mmap) || 744 + return !RB_EMPTY_ROOT(&mapping->i_mmap) || 745 745 !list_empty(&mapping->i_mmap_nonlinear); 746 746 } 747 747 ··· 2552 2552 2553 2553 extern int generic_file_mmap(struct file *, struct vm_area_struct *); 2554 2554 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); 2555 + extern int generic_file_remap_pages(struct vm_area_struct *, unsigned long addr, 2556 + unsigned long size, pgoff_t pgoff); 2555 2557 extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size); 2556 2558 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); 2557 2559 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+1 -8
include/linux/gfp.h
··· 30 30 #define ___GFP_HARDWALL 0x20000u 31 31 #define ___GFP_THISNODE 0x40000u 32 32 #define ___GFP_RECLAIMABLE 0x80000u 33 - #ifdef CONFIG_KMEMCHECK 34 33 #define ___GFP_NOTRACK 0x200000u 35 - #else 36 - #define ___GFP_NOTRACK 0 37 - #endif 38 - #define ___GFP_NO_KSWAPD 0x400000u 39 34 #define ___GFP_OTHER_NODE 0x800000u 40 35 #define ___GFP_WRITE 0x1000000u 41 36 ··· 85 90 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */ 86 91 #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */ 87 92 88 - #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD) 89 93 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ 90 94 #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ 91 95 ··· 114 120 __GFP_MOVABLE) 115 121 #define GFP_IOFS (__GFP_IO | __GFP_FS) 116 122 #define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \ 117 - __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \ 118 - __GFP_NO_KSWAPD) 123 + __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) 119 124 120 125 #ifdef CONFIG_NUMA 121 126 #define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
+1 -2
include/linux/huge_mm.h
··· 11 11 extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, 12 12 unsigned long address, pmd_t *pmd, 13 13 pmd_t orig_pmd); 14 - extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm); 15 - extern struct page *follow_trans_huge_pmd(struct mm_struct *mm, 14 + extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, 16 15 unsigned long addr, 17 16 pmd_t *pmd, 18 17 unsigned int flags);
+27
include/linux/interval_tree.h
··· 1 + #ifndef _LINUX_INTERVAL_TREE_H 2 + #define _LINUX_INTERVAL_TREE_H 3 + 4 + #include <linux/rbtree.h> 5 + 6 + struct interval_tree_node { 7 + struct rb_node rb; 8 + unsigned long start; /* Start of interval */ 9 + unsigned long last; /* Last location _in_ interval */ 10 + unsigned long __subtree_last; 11 + }; 12 + 13 + extern void 14 + interval_tree_insert(struct interval_tree_node *node, struct rb_root *root); 15 + 16 + extern void 17 + interval_tree_remove(struct interval_tree_node *node, struct rb_root *root); 18 + 19 + extern struct interval_tree_node * 20 + interval_tree_iter_first(struct rb_root *root, 21 + unsigned long start, unsigned long last); 22 + 23 + extern struct interval_tree_node * 24 + interval_tree_iter_next(struct interval_tree_node *node, 25 + unsigned long start, unsigned long last); 26 + 27 + #endif /* _LINUX_INTERVAL_TREE_H */
+191
include/linux/interval_tree_generic.h
··· 1 + /* 2 + Interval Trees 3 + (C) 2012 Michel Lespinasse <walken@google.com> 4 + 5 + This program is free software; you can redistribute it and/or modify 6 + it under the terms of the GNU General Public License as published by 7 + the Free Software Foundation; either version 2 of the License, or 8 + (at your option) any later version. 9 + 10 + This program is distributed in the hope that it will be useful, 11 + but WITHOUT ANY WARRANTY; without even the implied warranty of 12 + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 + GNU General Public License for more details. 14 + 15 + You should have received a copy of the GNU General Public License 16 + along with this program; if not, write to the Free Software 17 + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 18 + 19 + include/linux/interval_tree_generic.h 20 + */ 21 + 22 + #include <linux/rbtree_augmented.h> 23 + 24 + /* 25 + * Template for implementing interval trees 26 + * 27 + * ITSTRUCT: struct type of the interval tree nodes 28 + * ITRB: name of struct rb_node field within ITSTRUCT 29 + * ITTYPE: type of the interval endpoints 30 + * ITSUBTREE: name of ITTYPE field within ITSTRUCT holding last-in-subtree 31 + * ITSTART(n): start endpoint of ITSTRUCT node n 32 + * ITLAST(n): last endpoint of ITSTRUCT node n 33 + * ITSTATIC: 'static' or empty 34 + * ITPREFIX: prefix to use for the inline tree definitions 35 + * 36 + * Note - before using this, please consider if non-generic version 37 + * (interval_tree.h) would work for you... 
38 + */ 39 + 40 + #define INTERVAL_TREE_DEFINE(ITSTRUCT, ITRB, ITTYPE, ITSUBTREE, \ 41 + ITSTART, ITLAST, ITSTATIC, ITPREFIX) \ 42 + \ 43 + /* Callbacks for augmented rbtree insert and remove */ \ 44 + \ 45 + static inline ITTYPE ITPREFIX ## _compute_subtree_last(ITSTRUCT *node) \ 46 + { \ 47 + ITTYPE max = ITLAST(node), subtree_last; \ 48 + if (node->ITRB.rb_left) { \ 49 + subtree_last = rb_entry(node->ITRB.rb_left, \ 50 + ITSTRUCT, ITRB)->ITSUBTREE; \ 51 + if (max < subtree_last) \ 52 + max = subtree_last; \ 53 + } \ 54 + if (node->ITRB.rb_right) { \ 55 + subtree_last = rb_entry(node->ITRB.rb_right, \ 56 + ITSTRUCT, ITRB)->ITSUBTREE; \ 57 + if (max < subtree_last) \ 58 + max = subtree_last; \ 59 + } \ 60 + return max; \ 61 + } \ 62 + \ 63 + RB_DECLARE_CALLBACKS(static, ITPREFIX ## _augment, ITSTRUCT, ITRB, \ 64 + ITTYPE, ITSUBTREE, ITPREFIX ## _compute_subtree_last) \ 65 + \ 66 + /* Insert / remove interval nodes from the tree */ \ 67 + \ 68 + ITSTATIC void ITPREFIX ## _insert(ITSTRUCT *node, struct rb_root *root) \ 69 + { \ 70 + struct rb_node **link = &root->rb_node, *rb_parent = NULL; \ 71 + ITTYPE start = ITSTART(node), last = ITLAST(node); \ 72 + ITSTRUCT *parent; \ 73 + \ 74 + while (*link) { \ 75 + rb_parent = *link; \ 76 + parent = rb_entry(rb_parent, ITSTRUCT, ITRB); \ 77 + if (parent->ITSUBTREE < last) \ 78 + parent->ITSUBTREE = last; \ 79 + if (start < ITSTART(parent)) \ 80 + link = &parent->ITRB.rb_left; \ 81 + else \ 82 + link = &parent->ITRB.rb_right; \ 83 + } \ 84 + \ 85 + node->ITSUBTREE = last; \ 86 + rb_link_node(&node->ITRB, rb_parent, link); \ 87 + rb_insert_augmented(&node->ITRB, root, &ITPREFIX ## _augment); \ 88 + } \ 89 + \ 90 + ITSTATIC void ITPREFIX ## _remove(ITSTRUCT *node, struct rb_root *root) \ 91 + { \ 92 + rb_erase_augmented(&node->ITRB, root, &ITPREFIX ## _augment); \ 93 + } \ 94 + \ 95 + /* \ 96 + * Iterate over intervals intersecting [start;last] \ 97 + * \ 98 + * Note that a node's interval intersects [start;last] iff: \ 99 + 
* Cond1: ITSTART(node) <= last \ 100 + * and \ 101 + * Cond2: start <= ITLAST(node) \ 102 + */ \ 103 + \ 104 + static ITSTRUCT * \ 105 + ITPREFIX ## _subtree_search(ITSTRUCT *node, ITTYPE start, ITTYPE last) \ 106 + { \ 107 + while (true) { \ 108 + /* \ 109 + * Loop invariant: start <= node->ITSUBTREE \ 110 + * (Cond2 is satisfied by one of the subtree nodes) \ 111 + */ \ 112 + if (node->ITRB.rb_left) { \ 113 + ITSTRUCT *left = rb_entry(node->ITRB.rb_left, \ 114 + ITSTRUCT, ITRB); \ 115 + if (start <= left->ITSUBTREE) { \ 116 + /* \ 117 + * Some nodes in left subtree satisfy Cond2. \ 118 + * Iterate to find the leftmost such node N. \ 119 + * If it also satisfies Cond1, that's the \ 120 + * match we are looking for. Otherwise, there \ 121 + * is no matching interval as nodes to the \ 122 + * right of N can't satisfy Cond1 either. \ 123 + */ \ 124 + node = left; \ 125 + continue; \ 126 + } \ 127 + } \ 128 + if (ITSTART(node) <= last) { /* Cond1 */ \ 129 + if (start <= ITLAST(node)) /* Cond2 */ \ 130 + return node; /* node is leftmost match */ \ 131 + if (node->ITRB.rb_right) { \ 132 + node = rb_entry(node->ITRB.rb_right, \ 133 + ITSTRUCT, ITRB); \ 134 + if (start <= node->ITSUBTREE) \ 135 + continue; \ 136 + } \ 137 + } \ 138 + return NULL; /* No match */ \ 139 + } \ 140 + } \ 141 + \ 142 + ITSTATIC ITSTRUCT * \ 143 + ITPREFIX ## _iter_first(struct rb_root *root, ITTYPE start, ITTYPE last) \ 144 + { \ 145 + ITSTRUCT *node; \ 146 + \ 147 + if (!root->rb_node) \ 148 + return NULL; \ 149 + node = rb_entry(root->rb_node, ITSTRUCT, ITRB); \ 150 + if (node->ITSUBTREE < start) \ 151 + return NULL; \ 152 + return ITPREFIX ## _subtree_search(node, start, last); \ 153 + } \ 154 + \ 155 + ITSTATIC ITSTRUCT * \ 156 + ITPREFIX ## _iter_next(ITSTRUCT *node, ITTYPE start, ITTYPE last) \ 157 + { \ 158 + struct rb_node *rb = node->ITRB.rb_right, *prev; \ 159 + \ 160 + while (true) { \ 161 + /* \ 162 + * Loop invariants: \ 163 + * Cond1: ITSTART(node) <= last \ 164 + * rb == 
node->ITRB.rb_right \ 165 + * \ 166 + * First, search right subtree if suitable \ 167 + */ \ 168 + if (rb) { \ 169 + ITSTRUCT *right = rb_entry(rb, ITSTRUCT, ITRB); \ 170 + if (start <= right->ITSUBTREE) \ 171 + return ITPREFIX ## _subtree_search(right, \ 172 + start, last); \ 173 + } \ 174 + \ 175 + /* Move up the tree until we come from a node's left child */ \ 176 + do { \ 177 + rb = rb_parent(&node->ITRB); \ 178 + if (!rb) \ 179 + return NULL; \ 180 + prev = &node->ITRB; \ 181 + node = rb_entry(rb, ITSTRUCT, ITRB); \ 182 + rb = node->ITRB.rb_right; \ 183 + } while (prev == rb); \ 184 + \ 185 + /* Check if the node intersects [start;last] */ \ 186 + if (last < ITSTART(node)) /* !Cond1 */ \ 187 + return NULL; \ 188 + else if (start <= ITLAST(node)) /* Cond2 */ \ 189 + return node; \ 190 + } \ 191 + }
+1 -2
include/linux/memblock.h
··· 70 70 * @p_end: ptr to ulong for end pfn of the range, can be %NULL 71 71 * @p_nid: ptr to int for nid of the range, can be %NULL 72 72 * 73 - * Walks over configured memory ranges. Available after early_node_map is 74 - * populated. 73 + * Walks over configured memory ranges. 75 74 */ 76 75 #define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid) \ 77 76 for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
+7 -7
include/linux/memcontrol.h
··· 84 84 extern struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont); 85 85 86 86 static inline 87 - int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup) 87 + bool mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *memcg) 88 88 { 89 - struct mem_cgroup *memcg; 90 - int match; 89 + struct mem_cgroup *task_memcg; 90 + bool match; 91 91 92 92 rcu_read_lock(); 93 - memcg = mem_cgroup_from_task(rcu_dereference((mm)->owner)); 94 - match = __mem_cgroup_same_or_subtree(cgroup, memcg); 93 + task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); 94 + match = __mem_cgroup_same_or_subtree(memcg, task_memcg); 95 95 rcu_read_unlock(); 96 96 return match; 97 97 } ··· 258 258 return NULL; 259 259 } 260 260 261 - static inline int mm_match_cgroup(struct mm_struct *mm, 261 + static inline bool mm_match_cgroup(struct mm_struct *mm, 262 262 struct mem_cgroup *memcg) 263 263 { 264 - return 1; 264 + return true; 265 265 } 266 266 267 267 static inline int task_in_mem_cgroup(struct task_struct *task,
+3
include/linux/memory_hotplug.h
··· 10 10 struct zone; 11 11 struct pglist_data; 12 12 struct mem_section; 13 + struct memory_block; 13 14 14 15 #ifdef CONFIG_MEMORY_HOTPLUG 15 16 ··· 234 233 extern int mem_online_node(int nid); 235 234 extern int add_memory(int nid, u64 start, u64 size); 236 235 extern int arch_add_memory(int nid, u64 start, u64 size); 236 + extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); 237 + extern int offline_memory_block(struct memory_block *mem); 237 238 extern int remove_memory(u64 start, u64 size); 238 239 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn, 239 240 int nr_pages);
+2 -2
include/linux/mempolicy.h
··· 188 188 189 189 struct shared_policy { 190 190 struct rb_root root; 191 - spinlock_t lock; 191 + struct mutex mutex; 192 192 }; 193 193 194 194 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol); ··· 239 239 /* Check if a vma is migratable */ 240 240 static inline int vma_migratable(struct vm_area_struct *vma) 241 241 { 242 - if (vma->vm_flags & (VM_IO|VM_HUGETLB|VM_PFNMAP|VM_RESERVED)) 242 + if (vma->vm_flags & (VM_IO | VM_HUGETLB | VM_PFNMAP)) 243 243 return 0; 244 244 /* 245 245 * Migration allocates pages in the highest zone. If we cannot
+82 -56
include/linux/mm.h
··· 10 10 #include <linux/list.h> 11 11 #include <linux/mmzone.h> 12 12 #include <linux/rbtree.h> 13 - #include <linux/prio_tree.h> 14 13 #include <linux/atomic.h> 15 14 #include <linux/debug_locks.h> 16 15 #include <linux/mm_types.h> ··· 20 21 21 22 struct mempolicy; 22 23 struct anon_vma; 24 + struct anon_vma_chain; 23 25 struct file_ra_state; 24 26 struct user_struct; 25 27 struct writeback_control; ··· 70 70 /* 71 71 * vm_flags in vm_area_struct, see mm_types.h. 72 72 */ 73 + #define VM_NONE 0x00000000 74 + 73 75 #define VM_READ 0x00000001 /* currently active flags */ 74 76 #define VM_WRITE 0x00000002 75 77 #define VM_EXEC 0x00000004 ··· 84 82 #define VM_MAYSHARE 0x00000080 85 83 86 84 #define VM_GROWSDOWN 0x00000100 /* general info on the segment */ 87 - #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64) 88 - #define VM_GROWSUP 0x00000200 89 - #else 90 - #define VM_GROWSUP 0x00000000 91 - #define VM_NOHUGEPAGE 0x00000200 /* MADV_NOHUGEPAGE marked this vma */ 92 - #endif 93 85 #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ 94 86 #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. 
*/ 95 87 96 - #define VM_EXECUTABLE 0x00001000 97 88 #define VM_LOCKED 0x00002000 98 89 #define VM_IO 0x00004000 /* Memory mapped I/O or similar */ 99 90 ··· 96 101 97 102 #define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */ 98 103 #define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */ 99 - #define VM_RESERVED 0x00080000 /* Count as reserved_vm like IO */ 100 104 #define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */ 101 105 #define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */ 102 106 #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */ 103 107 #define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */ 104 - #ifndef CONFIG_TRANSPARENT_HUGEPAGE 105 - #define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */ 106 - #else 107 - #define VM_HUGEPAGE 0x01000000 /* MADV_HUGEPAGE marked this vma */ 108 - #endif 109 - #define VM_INSERTPAGE 0x02000000 /* The vma has had "vm_insert_page()" done on it */ 110 - #define VM_NODUMP 0x04000000 /* Do not include in the core dump */ 108 + #define VM_ARCH_1 0x01000000 /* Architecture-specific flag */ 109 + #define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */ 111 110 112 - #define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */ 113 111 #define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */ 114 - #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */ 115 - #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */ 112 + #define VM_HUGEPAGE 0x20000000 /* MADV_HUGEPAGE marked this vma */ 113 + #define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */ 116 114 #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */ 115 + 116 + #if defined(CONFIG_X86) 117 + # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ 118 + #elif defined(CONFIG_PPC) 119 + # define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */ 120 + #elif 
defined(CONFIG_PARISC) 121 + # define VM_GROWSUP VM_ARCH_1 122 + #elif defined(CONFIG_IA64) 123 + # define VM_GROWSUP VM_ARCH_1 124 + #elif !defined(CONFIG_MMU) 125 + # define VM_MAPPED_COPY VM_ARCH_1 /* T if mapped copy of data (nommu mmap) */ 126 + #endif 127 + 128 + #ifndef VM_GROWSUP 129 + # define VM_GROWSUP VM_NONE 130 + #endif 117 131 118 132 /* Bits set in the VMA until the stack is in its final location */ 119 133 #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ) ··· 147 143 * Special vmas that are non-mergable, non-mlock()able. 148 144 * Note: mm/huge_memory.c VM_NO_THP depends on this definition. 149 145 */ 150 - #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP) 146 + #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP) 151 147 152 148 /* 153 149 * mapping from the currently active vm_flags protection bits (the ··· 161 157 #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ 162 158 #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ 163 159 #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ 164 - 165 - /* 166 - * This interface is used by x86 PAT code to identify a pfn mapping that is 167 - * linear over entire vma. This is to optimize PAT code that deals with 168 - * marking the physical region with a particular prot. This is not for generic 169 - * mm use. Note also that this check will not work if the pfn mapping is 170 - * linear for a vma starting at physical address 0. In which case PAT code 171 - * falls back to slow path of reserving physical range page by page. 
172 - */ 173 - static inline int is_linear_pfn_mapping(struct vm_area_struct *vma) 174 - { 175 - return !!(vma->vm_flags & VM_PFN_AT_MMAP); 176 - } 177 - 178 - static inline int is_pfn_mapping(struct vm_area_struct *vma) 179 - { 180 - return !!(vma->vm_flags & VM_PFNMAP); 181 - } 160 + #define FAULT_FLAG_TRIED 0x40 /* second try */ 182 161 183 162 /* 184 163 * vm_fault is filled by the the pagefault handler and passed to the vma's ··· 169 182 * of VM_FAULT_xxx flags that give details about how the fault was handled. 170 183 * 171 184 * pgoff should be used in favour of virtual_address, if possible. If pgoff 172 - * is used, one may set VM_CAN_NONLINEAR in the vma->vm_flags to get nonlinear 173 - * mapping support. 185 + * is used, one may implement ->remap_pages to get nonlinear mapping support. 174 186 */ 175 187 struct vm_fault { 176 188 unsigned int flags; /* FAULT_FLAG_xxx flags */ ··· 227 241 int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from, 228 242 const nodemask_t *to, unsigned long flags); 229 243 #endif 244 + /* called by sys_remap_file_pages() to populate non-linear mapping */ 245 + int (*remap_pages)(struct vm_area_struct *vma, unsigned long addr, 246 + unsigned long size, pgoff_t pgoff); 230 247 }; 231 248 232 249 struct mmu_gather; ··· 237 248 238 249 #define page_private(page) ((page)->private) 239 250 #define set_page_private(page, v) ((page)->private = (v)) 251 + 252 + /* It's valid only if the page is free path or free_list */ 253 + static inline void set_freepage_migratetype(struct page *page, int migratetype) 254 + { 255 + page->index = migratetype; 256 + } 257 + 258 + /* It's valid only if the page is free path or free_list */ 259 + static inline int get_freepage_migratetype(struct page *page) 260 + { 261 + return page->index; 262 + } 240 263 241 264 /* 242 265 * FIXME: take this include out, include page-flags.h in ··· 455 454 456 455 void split_page(struct page *page, unsigned int order); 457 456 int split_free_page(struct 
page *page); 457 + int capture_free_page(struct page *page, int alloc_order, int migratetype); 458 458 459 459 /* 460 460 * Compound pages have a destructor function. Provide a ··· 1073 1071 1074 1072 extern unsigned long move_page_tables(struct vm_area_struct *vma, 1075 1073 unsigned long old_addr, struct vm_area_struct *new_vma, 1076 - unsigned long new_addr, unsigned long len); 1074 + unsigned long new_addr, unsigned long len, 1075 + bool need_rmap_locks); 1077 1076 extern unsigned long do_mremap(unsigned long addr, 1078 1077 unsigned long old_len, unsigned long new_len, 1079 1078 unsigned long flags, unsigned long new_addr); ··· 1369 1366 extern atomic_long_t mmap_pages_allocated; 1370 1367 extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t); 1371 1368 1372 - /* prio_tree.c */ 1373 - void vma_prio_tree_add(struct vm_area_struct *, struct vm_area_struct *old); 1374 - void vma_prio_tree_insert(struct vm_area_struct *, struct prio_tree_root *); 1375 - void vma_prio_tree_remove(struct vm_area_struct *, struct prio_tree_root *); 1376 - struct vm_area_struct *vma_prio_tree_next(struct vm_area_struct *vma, 1377 - struct prio_tree_iter *iter); 1369 + /* interval_tree.c */ 1370 + void vma_interval_tree_insert(struct vm_area_struct *node, 1371 + struct rb_root *root); 1372 + void vma_interval_tree_insert_after(struct vm_area_struct *node, 1373 + struct vm_area_struct *prev, 1374 + struct rb_root *root); 1375 + void vma_interval_tree_remove(struct vm_area_struct *node, 1376 + struct rb_root *root); 1377 + struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root *root, 1378 + unsigned long start, unsigned long last); 1379 + struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node, 1380 + unsigned long start, unsigned long last); 1378 1381 1379 - #define vma_prio_tree_foreach(vma, iter, root, begin, end) \ 1380 - for (prio_tree_iter_init(iter, root, begin, end), vma = NULL; \ 1381 - (vma = vma_prio_tree_next(vma, iter)); ) 
1382 + #define vma_interval_tree_foreach(vma, root, start, last) \ 1383 + for (vma = vma_interval_tree_iter_first(root, start, last); \ 1384 + vma; vma = vma_interval_tree_iter_next(vma, start, last)) 1382 1385 1383 1386 static inline void vma_nonlinear_insert(struct vm_area_struct *vma, 1384 1387 struct list_head *list) 1385 1388 { 1386 - vma->shared.vm_set.parent = NULL; 1387 - list_add_tail(&vma->shared.vm_set.list, list); 1389 + list_add_tail(&vma->shared.nonlinear, list); 1388 1390 } 1391 + 1392 + void anon_vma_interval_tree_insert(struct anon_vma_chain *node, 1393 + struct rb_root *root); 1394 + void anon_vma_interval_tree_remove(struct anon_vma_chain *node, 1395 + struct rb_root *root); 1396 + struct anon_vma_chain *anon_vma_interval_tree_iter_first( 1397 + struct rb_root *root, unsigned long start, unsigned long last); 1398 + struct anon_vma_chain *anon_vma_interval_tree_iter_next( 1399 + struct anon_vma_chain *node, unsigned long start, unsigned long last); 1400 + #ifdef CONFIG_DEBUG_VM_RB 1401 + void anon_vma_interval_tree_verify(struct anon_vma_chain *node); 1402 + #endif 1403 + 1404 + #define anon_vma_interval_tree_foreach(avc, root, start, last) \ 1405 + for (avc = anon_vma_interval_tree_iter_first(root, start, last); \ 1406 + avc; avc = anon_vma_interval_tree_iter_next(avc, start, last)) 1389 1407 1390 1408 /* mmap.c */ 1391 1409 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin); ··· 1424 1400 struct rb_node **, struct rb_node *); 1425 1401 extern void unlink_file_vma(struct vm_area_struct *); 1426 1402 extern struct vm_area_struct *copy_vma(struct vm_area_struct **, 1427 - unsigned long addr, unsigned long len, pgoff_t pgoff); 1403 + unsigned long addr, unsigned long len, pgoff_t pgoff, 1404 + bool *need_rmap_locks); 1428 1405 extern void exit_mmap(struct mm_struct *); 1429 1406 1430 1407 extern int mm_take_all_locks(struct mm_struct *mm); 1431 1408 extern void mm_drop_all_locks(struct mm_struct *mm); 1432 1409 1433 - 
/* From fs/proc/base.c. callers must _not_ hold the mm's exe_file_lock */ 1434 - extern void added_exe_file_vma(struct mm_struct *mm); 1435 - extern void removed_exe_file_vma(struct mm_struct *mm); 1436 1410 extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file); 1437 1411 extern struct file *get_mm_exe_file(struct mm_struct *mm); 1438 1412 ··· 1683 1661 static inline unsigned int debug_guardpage_minorder(void) { return 0; } 1684 1662 static inline bool page_is_guard(struct page *page) { return false; } 1685 1663 #endif /* CONFIG_DEBUG_PAGEALLOC */ 1664 + 1665 + extern void reset_zone_present_pages(void); 1666 + extern void fixup_zone_present_pages(int nid, unsigned long start_pfn, 1667 + unsigned long end_pfn); 1686 1668 1687 1669 #endif /* __KERNEL__ */ 1688 1670 #endif /* _LINUX_MM_H */
+5 -11
include/linux/mm_types.h
··· 6 6 #include <linux/threads.h> 7 7 #include <linux/list.h> 8 8 #include <linux/spinlock.h> 9 - #include <linux/prio_tree.h> 10 9 #include <linux/rbtree.h> 11 10 #include <linux/rwsem.h> 12 11 #include <linux/completion.h> ··· 239 240 240 241 /* 241 242 * For areas with an address space and backing store, 242 - * linkage into the address_space->i_mmap prio tree, or 243 - * linkage to the list of like vmas hanging off its node, or 243 + * linkage into the address_space->i_mmap interval tree, or 244 244 * linkage of vma in the address_space->i_mmap_nonlinear list. 245 245 */ 246 246 union { 247 247 struct { 248 - struct list_head list; 249 - void *parent; /* aligns with prio_tree_node parent */ 250 - struct vm_area_struct *head; 251 - } vm_set; 252 - 253 - struct raw_prio_tree_node prio_tree_node; 248 + struct rb_node rb; 249 + unsigned long rb_subtree_last; 250 + } linear; 251 + struct list_head nonlinear; 254 252 } shared; 255 253 256 254 /* ··· 345 349 unsigned long shared_vm; /* Shared pages (files) */ 346 350 unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */ 347 351 unsigned long stack_vm; /* VM_GROWSUP/DOWN */ 348 - unsigned long reserved_vm; /* VM_RESERVED|VM_IO pages */ 349 352 unsigned long def_flags; 350 353 unsigned long nr_ptes; /* Page table pages */ 351 354 unsigned long start_code, end_code, start_data, end_data; ··· 389 394 390 395 /* store ref to file /proc/<pid>/exe symlink points to */ 391 396 struct file *exe_file; 392 - unsigned long num_exe_file_vmas; 393 397 #ifdef CONFIG_MMU_NOTIFIER 394 398 struct mmu_notifier_mm *mmu_notifier_mm; 395 399 #endif
-1
include/linux/mman.h
··· 86 86 { 87 87 return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | 88 88 _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) | 89 - _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) | 90 89 _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ); 91 90 } 92 91 #endif /* __KERNEL__ */
+12 -48
include/linux/mmu_notifier.h
··· 4 4 #include <linux/list.h> 5 5 #include <linux/spinlock.h> 6 6 #include <linux/mm_types.h> 7 + #include <linux/srcu.h> 7 8 8 9 struct mmu_notifier; 9 10 struct mmu_notifier_ops; ··· 246 245 __mmu_notifier_mm_destroy(mm); 247 246 } 248 247 249 - /* 250 - * These two macros will sometime replace ptep_clear_flush. 251 - * ptep_clear_flush is implemented as macro itself, so this also is 252 - * implemented as a macro until ptep_clear_flush will converted to an 253 - * inline function, to diminish the risk of compilation failure. The 254 - * invalidate_page method over time can be moved outside the PT lock 255 - * and these two macros can be later removed. 256 - */ 257 - #define ptep_clear_flush_notify(__vma, __address, __ptep) \ 258 - ({ \ 259 - pte_t __pte; \ 260 - struct vm_area_struct *___vma = __vma; \ 261 - unsigned long ___address = __address; \ 262 - __pte = ptep_clear_flush(___vma, ___address, __ptep); \ 263 - mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ 264 - __pte; \ 265 - }) 266 - 267 - #define pmdp_clear_flush_notify(__vma, __address, __pmdp) \ 268 - ({ \ 269 - pmd_t __pmd; \ 270 - struct vm_area_struct *___vma = __vma; \ 271 - unsigned long ___address = __address; \ 272 - VM_BUG_ON(__address & ~HPAGE_PMD_MASK); \ 273 - mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address, \ 274 - (__address)+HPAGE_PMD_SIZE);\ 275 - __pmd = pmdp_clear_flush(___vma, ___address, __pmdp); \ 276 - mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address, \ 277 - (__address)+HPAGE_PMD_SIZE); \ 278 - __pmd; \ 279 - }) 280 - 281 - #define pmdp_splitting_flush_notify(__vma, __address, __pmdp) \ 282 - ({ \ 283 - struct vm_area_struct *___vma = __vma; \ 284 - unsigned long ___address = __address; \ 285 - VM_BUG_ON(__address & ~HPAGE_PMD_MASK); \ 286 - mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address, \ 287 - (__address)+HPAGE_PMD_SIZE);\ 288 - pmdp_splitting_flush(___vma, ___address, __pmdp); \ 289 - 
mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address, \ 290 - (__address)+HPAGE_PMD_SIZE); \ 291 - }) 292 - 293 248 #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ 294 249 ({ \ 295 250 int __young; \ ··· 268 311 __young; \ 269 312 }) 270 313 314 + /* 315 + * set_pte_at_notify() sets the pte _after_ running the notifier. 316 + * This is safe to start by updating the secondary MMUs, because the primary MMU 317 + * pte invalidate must have already happened with a ptep_clear_flush() before 318 + * set_pte_at_notify() has been invoked. Updating the secondary MMUs first is 319 + * required when we change both the protection of the mapping from read-only to 320 + * read-write and the pfn (like during copy on write page faults). Otherwise the 321 + * old page would remain mapped readonly in the secondary MMUs after the new 322 + * page is already writable by some CPU through the primary MMU. 323 + */ 271 324 #define set_pte_at_notify(__mm, __address, __ptep, __pte) \ 272 325 ({ \ 273 326 struct mm_struct *___mm = __mm; \ 274 327 unsigned long ___address = __address; \ 275 328 pte_t ___pte = __pte; \ 276 329 \ 277 - set_pte_at(___mm, ___address, __ptep, ___pte); \ 278 330 mmu_notifier_change_pte(___mm, ___address, ___pte); \ 331 + set_pte_at(___mm, ___address, __ptep, ___pte); \ 279 332 }) 280 333 281 334 #else /* CONFIG_MMU_NOTIFIER */ ··· 336 369 337 370 #define ptep_clear_flush_young_notify ptep_clear_flush_young 338 371 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young 339 - #define ptep_clear_flush_notify ptep_clear_flush 340 - #define pmdp_clear_flush_notify pmdp_clear_flush 341 - #define pmdp_splitting_flush_notify pmdp_splitting_flush 342 372 #define set_pte_at_notify set_pte_at 343 373 344 374 #endif /* CONFIG_MMU_NOTIFIER */
+9 -1
include/linux/mmzone.h
··· 142 142 NUMA_OTHER, /* allocation from other node */ 143 143 #endif 144 144 NR_ANON_TRANSPARENT_HUGEPAGES, 145 + NR_FREE_CMA_PAGES, 145 146 NR_VM_ZONE_STAT_ITEMS }; 146 147 147 148 /* ··· 218 217 #define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x2) 219 218 /* Isolate for asynchronous migration */ 220 219 #define ISOLATE_ASYNC_MIGRATE ((__force isolate_mode_t)0x4) 220 + /* Isolate unevictable pages */ 221 + #define ISOLATE_UNEVICTABLE ((__force isolate_mode_t)0x8) 221 222 222 223 /* LRU Isolation modes. */ 223 224 typedef unsigned __bitwise__ isolate_mode_t; ··· 372 369 spinlock_t lock; 373 370 int all_unreclaimable; /* All pages pinned */ 374 371 #if defined CONFIG_COMPACTION || defined CONFIG_CMA 375 - /* pfn where the last incremental compaction isolated free pages */ 372 + /* Set to true when the PG_migrate_skip bits should be cleared */ 373 + bool compact_blockskip_flush; 374 + 375 + /* pfns where compaction scanners should start */ 376 376 unsigned long compact_cached_free_pfn; 377 + unsigned long compact_cached_migrate_pfn; 377 378 #endif 378 379 #ifdef CONFIG_MEMORY_HOTPLUG 379 380 /* see spanned/present_pages for more description */ ··· 711 704 unsigned long node_spanned_pages; /* total size of physical page 712 705 range, including holes */ 713 706 int node_id; 707 + nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */ 714 708 wait_queue_head_t kswapd_wait; 715 709 wait_queue_head_t pfmemalloc_wait; 716 710 struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */
-11
include/linux/oom.h
··· 2 2 #define __INCLUDE_LINUX_OOM_H 3 3 4 4 /* 5 - * /proc/<pid>/oom_adj is deprecated, see 6 - * Documentation/feature-removal-schedule.txt. 7 - * 8 - * /proc/<pid>/oom_adj set to -17 protects from the oom-killer 9 - */ 10 - #define OOM_DISABLE (-17) 11 - /* inclusive */ 12 - #define OOM_ADJUST_MIN (-16) 13 - #define OOM_ADJUST_MAX 15 14 - 15 - /* 16 5 * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for 17 6 * pid. 18 7 */
+6 -1
include/linux/page-isolation.h
··· 6 6 void set_pageblock_migratetype(struct page *page, int migratetype); 7 7 int move_freepages_block(struct zone *zone, struct page *page, 8 8 int migratetype); 9 + int move_freepages(struct zone *zone, 10 + struct page *start_page, struct page *end_page, 11 + int migratetype); 12 + 9 13 /* 10 14 * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE. 11 15 * If specified range includes migrate types other than MOVABLE or CMA, ··· 41 37 */ 42 38 int set_migratetype_isolate(struct page *page); 43 39 void unset_migratetype_isolate(struct page *page, unsigned migratetype); 44 - 40 + struct page *alloc_migrate_target(struct page *page, unsigned long private, 41 + int **resultp); 45 42 46 43 #endif
+17 -2
include/linux/pageblock-flags.h
··· 30 30 PB_migrate, 31 31 PB_migrate_end = PB_migrate + 3 - 1, 32 32 /* 3 bits required for migrate types */ 33 + #ifdef CONFIG_COMPACTION 34 + PB_migrate_skip,/* If set the block is skipped by compaction */ 35 + #endif /* CONFIG_COMPACTION */ 33 36 NR_PAGEBLOCK_BITS 34 37 }; 35 38 ··· 68 65 void set_pageblock_flags_group(struct page *page, unsigned long flags, 69 66 int start_bitidx, int end_bitidx); 70 67 68 + #ifdef CONFIG_COMPACTION 69 + #define get_pageblock_skip(page) \ 70 + get_pageblock_flags_group(page, PB_migrate_skip, \ 71 + PB_migrate_skip + 1) 72 + #define clear_pageblock_skip(page) \ 73 + set_pageblock_flags_group(page, 0, PB_migrate_skip, \ 74 + PB_migrate_skip + 1) 75 + #define set_pageblock_skip(page) \ 76 + set_pageblock_flags_group(page, 1, PB_migrate_skip, \ 77 + PB_migrate_skip + 1) 78 + #endif /* CONFIG_COMPACTION */ 79 + 71 80 #define get_pageblock_flags(page) \ 72 - get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1) 81 + get_pageblock_flags_group(page, 0, PB_migrate_end) 73 82 #define set_pageblock_flags(page, flags) \ 74 83 set_pageblock_flags_group(page, flags, \ 75 - 0, NR_PAGEBLOCK_BITS-1) 84 + 0, PB_migrate_end) 76 85 77 86 #endif /* PAGEBLOCK_FLAGS_H */
-120
include/linux/prio_tree.h
··· 1 - #ifndef _LINUX_PRIO_TREE_H 2 - #define _LINUX_PRIO_TREE_H 3 - 4 - /* 5 - * K&R 2nd ed. A8.3 somewhat obliquely hints that initial sequences of struct 6 - * fields with identical types should end up at the same location. We'll use 7 - * this until we can scrap struct raw_prio_tree_node. 8 - * 9 - * Note: all this could be done more elegantly by using unnamed union/struct 10 - * fields. However, gcc 2.95.3 and apparently also gcc 3.0.4 don't support this 11 - * language extension. 12 - */ 13 - 14 - struct raw_prio_tree_node { 15 - struct prio_tree_node *left; 16 - struct prio_tree_node *right; 17 - struct prio_tree_node *parent; 18 - }; 19 - 20 - struct prio_tree_node { 21 - struct prio_tree_node *left; 22 - struct prio_tree_node *right; 23 - struct prio_tree_node *parent; 24 - unsigned long start; 25 - unsigned long last; /* last location _in_ interval */ 26 - }; 27 - 28 - struct prio_tree_root { 29 - struct prio_tree_node *prio_tree_node; 30 - unsigned short index_bits; 31 - unsigned short raw; 32 - /* 33 - * 0: nodes are of type struct prio_tree_node 34 - * 1: nodes are of type raw_prio_tree_node 35 - */ 36 - }; 37 - 38 - struct prio_tree_iter { 39 - struct prio_tree_node *cur; 40 - unsigned long mask; 41 - unsigned long value; 42 - int size_level; 43 - 44 - struct prio_tree_root *root; 45 - pgoff_t r_index; 46 - pgoff_t h_index; 47 - }; 48 - 49 - static inline void prio_tree_iter_init(struct prio_tree_iter *iter, 50 - struct prio_tree_root *root, pgoff_t r_index, pgoff_t h_index) 51 - { 52 - iter->root = root; 53 - iter->r_index = r_index; 54 - iter->h_index = h_index; 55 - iter->cur = NULL; 56 - } 57 - 58 - #define __INIT_PRIO_TREE_ROOT(ptr, _raw) \ 59 - do { \ 60 - (ptr)->prio_tree_node = NULL; \ 61 - (ptr)->index_bits = 1; \ 62 - (ptr)->raw = (_raw); \ 63 - } while (0) 64 - 65 - #define INIT_PRIO_TREE_ROOT(ptr) __INIT_PRIO_TREE_ROOT(ptr, 0) 66 - #define INIT_RAW_PRIO_TREE_ROOT(ptr) __INIT_PRIO_TREE_ROOT(ptr, 1) 67 - 68 - #define 
INIT_PRIO_TREE_NODE(ptr) \ 69 - do { \ 70 - (ptr)->left = (ptr)->right = (ptr)->parent = (ptr); \ 71 - } while (0) 72 - 73 - #define INIT_PRIO_TREE_ITER(ptr) \ 74 - do { \ 75 - (ptr)->cur = NULL; \ 76 - (ptr)->mask = 0UL; \ 77 - (ptr)->value = 0UL; \ 78 - (ptr)->size_level = 0; \ 79 - } while (0) 80 - 81 - #define prio_tree_entry(ptr, type, member) \ 82 - ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member))) 83 - 84 - static inline int prio_tree_empty(const struct prio_tree_root *root) 85 - { 86 - return root->prio_tree_node == NULL; 87 - } 88 - 89 - static inline int prio_tree_root(const struct prio_tree_node *node) 90 - { 91 - return node->parent == node; 92 - } 93 - 94 - static inline int prio_tree_left_empty(const struct prio_tree_node *node) 95 - { 96 - return node->left == node; 97 - } 98 - 99 - static inline int prio_tree_right_empty(const struct prio_tree_node *node) 100 - { 101 - return node->right == node; 102 - } 103 - 104 - 105 - struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root, 106 - struct prio_tree_node *old, struct prio_tree_node *node); 107 - struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root, 108 - struct prio_tree_node *node); 109 - void prio_tree_remove(struct prio_tree_root *root, struct prio_tree_node *node); 110 - struct prio_tree_node *prio_tree_next(struct prio_tree_iter *iter); 111 - 112 - #define raw_prio_tree_replace(root, old, node) \ 113 - prio_tree_replace(root, (struct prio_tree_node *) (old), \ 114 - (struct prio_tree_node *) (node)) 115 - #define raw_prio_tree_insert(root, node) \ 116 - prio_tree_insert(root, (struct prio_tree_node *) (node)) 117 - #define raw_prio_tree_remove(root, node) \ 118 - prio_tree_remove(root, (struct prio_tree_node *) (node)) 119 - 120 - #endif /* _LINUX_PRIO_TREE_H */
+13 -106
include/linux/rbtree.h
··· 23 23 I know it's not the cleaner way, but in C (not in C++) to get 24 24 performances and genericity... 25 25 26 - Some example of insert and search follows here. The search is a plain 27 - normal search over an ordered tree. The insert instead must be implemented 28 - in two steps: First, the code must insert the element in order as a red leaf 29 - in the tree, and then the support library function rb_insert_color() must 30 - be called. Such function will do the not trivial work to rebalance the 31 - rbtree, if necessary. 32 - 33 - ----------------------------------------------------------------------- 34 - static inline struct page * rb_search_page_cache(struct inode * inode, 35 - unsigned long offset) 36 - { 37 - struct rb_node * n = inode->i_rb_page_cache.rb_node; 38 - struct page * page; 39 - 40 - while (n) 41 - { 42 - page = rb_entry(n, struct page, rb_page_cache); 43 - 44 - if (offset < page->offset) 45 - n = n->rb_left; 46 - else if (offset > page->offset) 47 - n = n->rb_right; 48 - else 49 - return page; 50 - } 51 - return NULL; 52 - } 53 - 54 - static inline struct page * __rb_insert_page_cache(struct inode * inode, 55 - unsigned long offset, 56 - struct rb_node * node) 57 - { 58 - struct rb_node ** p = &inode->i_rb_page_cache.rb_node; 59 - struct rb_node * parent = NULL; 60 - struct page * page; 61 - 62 - while (*p) 63 - { 64 - parent = *p; 65 - page = rb_entry(parent, struct page, rb_page_cache); 66 - 67 - if (offset < page->offset) 68 - p = &(*p)->rb_left; 69 - else if (offset > page->offset) 70 - p = &(*p)->rb_right; 71 - else 72 - return page; 73 - } 74 - 75 - rb_link_node(node, parent, p); 76 - 77 - return NULL; 78 - } 79 - 80 - static inline struct page * rb_insert_page_cache(struct inode * inode, 81 - unsigned long offset, 82 - struct rb_node * node) 83 - { 84 - struct page * ret; 85 - if ((ret = __rb_insert_page_cache(inode, offset, node))) 86 - goto out; 87 - rb_insert_color(node, &inode->i_rb_page_cache); 88 - out: 89 - return ret; 90 - } 
91 - ----------------------------------------------------------------------- 26 + See Documentation/rbtree.txt for documentation and samples. 92 27 */ 93 28 94 29 #ifndef _LINUX_RBTREE_H ··· 32 97 #include <linux/kernel.h> 33 98 #include <linux/stddef.h> 34 99 35 - struct rb_node 36 - { 37 - unsigned long rb_parent_color; 38 - #define RB_RED 0 39 - #define RB_BLACK 1 100 + struct rb_node { 101 + unsigned long __rb_parent_color; 40 102 struct rb_node *rb_right; 41 103 struct rb_node *rb_left; 42 104 } __attribute__((aligned(sizeof(long)))); 43 105 /* The alignment might seem pointless, but allegedly CRIS needs it */ 44 106 45 - struct rb_root 46 - { 107 + struct rb_root { 47 108 struct rb_node *rb_node; 48 109 }; 49 110 50 111 51 - #define rb_parent(r) ((struct rb_node *)((r)->rb_parent_color & ~3)) 52 - #define rb_color(r) ((r)->rb_parent_color & 1) 53 - #define rb_is_red(r) (!rb_color(r)) 54 - #define rb_is_black(r) rb_color(r) 55 - #define rb_set_red(r) do { (r)->rb_parent_color &= ~1; } while (0) 56 - #define rb_set_black(r) do { (r)->rb_parent_color |= 1; } while (0) 57 - 58 - static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p) 59 - { 60 - rb->rb_parent_color = (rb->rb_parent_color & 3) | (unsigned long)p; 61 - } 62 - static inline void rb_set_color(struct rb_node *rb, int color) 63 - { 64 - rb->rb_parent_color = (rb->rb_parent_color & ~1) | color; 65 - } 112 + #define rb_parent(r) ((struct rb_node *)((r)->__rb_parent_color & ~3)) 66 113 67 114 #define RB_ROOT (struct rb_root) { NULL, } 68 115 #define rb_entry(ptr, type, member) container_of(ptr, type, member) 69 116 70 - #define RB_EMPTY_ROOT(root) ((root)->rb_node == NULL) 71 - #define RB_EMPTY_NODE(node) (rb_parent(node) == node) 72 - #define RB_CLEAR_NODE(node) (rb_set_parent(node, node)) 117 + #define RB_EMPTY_ROOT(root) ((root)->rb_node == NULL) 73 118 74 - static inline void rb_init_node(struct rb_node *rb) 75 - { 76 - rb->rb_parent_color = 0; 77 - rb->rb_right = NULL; 78 - 
rb->rb_left = NULL; 79 - RB_CLEAR_NODE(rb); 80 - } 119 + /* 'empty' nodes are nodes that are known not to be inserted in an rbtree */ 120 + #define RB_EMPTY_NODE(node) \ 121 + ((node)->__rb_parent_color == (unsigned long)(node)) 122 + #define RB_CLEAR_NODE(node) \ 123 + ((node)->__rb_parent_color = (unsigned long)(node)) 124 + 81 125 82 126 extern void rb_insert_color(struct rb_node *, struct rb_root *); 83 127 extern void rb_erase(struct rb_node *, struct rb_root *); 84 128 85 - typedef void (*rb_augment_f)(struct rb_node *node, void *data); 86 - 87 - extern void rb_augment_insert(struct rb_node *node, 88 - rb_augment_f func, void *data); 89 - extern struct rb_node *rb_augment_erase_begin(struct rb_node *node); 90 - extern void rb_augment_erase_end(struct rb_node *node, 91 - rb_augment_f func, void *data); 92 129 93 130 /* Find logical next and previous nodes in a tree */ 94 131 extern struct rb_node *rb_next(const struct rb_node *); ··· 75 168 static inline void rb_link_node(struct rb_node * node, struct rb_node * parent, 76 169 struct rb_node ** rb_link) 77 170 { 78 - node->rb_parent_color = (unsigned long )parent; 171 + node->__rb_parent_color = (unsigned long)parent; 79 172 node->rb_left = node->rb_right = NULL; 80 173 81 174 *rb_link = node;
+223
include/linux/rbtree_augmented.h
··· 1 + /* 2 + Red Black Trees 3 + (C) 1999 Andrea Arcangeli <andrea@suse.de> 4 + (C) 2002 David Woodhouse <dwmw2@infradead.org> 5 + (C) 2012 Michel Lespinasse <walken@google.com> 6 + 7 + This program is free software; you can redistribute it and/or modify 8 + it under the terms of the GNU General Public License as published by 9 + the Free Software Foundation; either version 2 of the License, or 10 + (at your option) any later version. 11 + 12 + This program is distributed in the hope that it will be useful, 13 + but WITHOUT ANY WARRANTY; without even the implied warranty of 14 + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 15 + GNU General Public License for more details. 16 + 17 + You should have received a copy of the GNU General Public License 18 + along with this program; if not, write to the Free Software 19 + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 20 + 21 + linux/include/linux/rbtree_augmented.h 22 + */ 23 + 24 + #ifndef _LINUX_RBTREE_AUGMENTED_H 25 + #define _LINUX_RBTREE_AUGMENTED_H 26 + 27 + #include <linux/rbtree.h> 28 + 29 + /* 30 + * Please note - only struct rb_augment_callbacks and the prototypes for 31 + * rb_insert_augmented() and rb_erase_augmented() are intended to be public. 32 + * The rest are implementation details you are not expected to depend on. 33 + * 34 + * See Documentation/rbtree.txt for documentation and samples. 
35 + */ 36 + 37 + struct rb_augment_callbacks { 38 + void (*propagate)(struct rb_node *node, struct rb_node *stop); 39 + void (*copy)(struct rb_node *old, struct rb_node *new); 40 + void (*rotate)(struct rb_node *old, struct rb_node *new); 41 + }; 42 + 43 + extern void __rb_insert_augmented(struct rb_node *node, struct rb_root *root, 44 + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)); 45 + static inline void 46 + rb_insert_augmented(struct rb_node *node, struct rb_root *root, 47 + const struct rb_augment_callbacks *augment) 48 + { 49 + __rb_insert_augmented(node, root, augment->rotate); 50 + } 51 + 52 + #define RB_DECLARE_CALLBACKS(rbstatic, rbname, rbstruct, rbfield, \ 53 + rbtype, rbaugmented, rbcompute) \ 54 + static inline void \ 55 + rbname ## _propagate(struct rb_node *rb, struct rb_node *stop) \ 56 + { \ 57 + while (rb != stop) { \ 58 + rbstruct *node = rb_entry(rb, rbstruct, rbfield); \ 59 + rbtype augmented = rbcompute(node); \ 60 + if (node->rbaugmented == augmented) \ 61 + break; \ 62 + node->rbaugmented = augmented; \ 63 + rb = rb_parent(&node->rbfield); \ 64 + } \ 65 + } \ 66 + static inline void \ 67 + rbname ## _copy(struct rb_node *rb_old, struct rb_node *rb_new) \ 68 + { \ 69 + rbstruct *old = rb_entry(rb_old, rbstruct, rbfield); \ 70 + rbstruct *new = rb_entry(rb_new, rbstruct, rbfield); \ 71 + new->rbaugmented = old->rbaugmented; \ 72 + } \ 73 + static void \ 74 + rbname ## _rotate(struct rb_node *rb_old, struct rb_node *rb_new) \ 75 + { \ 76 + rbstruct *old = rb_entry(rb_old, rbstruct, rbfield); \ 77 + rbstruct *new = rb_entry(rb_new, rbstruct, rbfield); \ 78 + new->rbaugmented = old->rbaugmented; \ 79 + old->rbaugmented = rbcompute(old); \ 80 + } \ 81 + rbstatic const struct rb_augment_callbacks rbname = { \ 82 + rbname ## _propagate, rbname ## _copy, rbname ## _rotate \ 83 + }; 84 + 85 + 86 + #define RB_RED 0 87 + #define RB_BLACK 1 88 + 89 + #define __rb_parent(pc) ((struct rb_node *)(pc & ~3)) 90 + 91 + #define 
__rb_color(pc) ((pc) & 1) 92 + #define __rb_is_black(pc) __rb_color(pc) 93 + #define __rb_is_red(pc) (!__rb_color(pc)) 94 + #define rb_color(rb) __rb_color((rb)->__rb_parent_color) 95 + #define rb_is_red(rb) __rb_is_red((rb)->__rb_parent_color) 96 + #define rb_is_black(rb) __rb_is_black((rb)->__rb_parent_color) 97 + 98 + static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p) 99 + { 100 + rb->__rb_parent_color = rb_color(rb) | (unsigned long)p; 101 + } 102 + 103 + static inline void rb_set_parent_color(struct rb_node *rb, 104 + struct rb_node *p, int color) 105 + { 106 + rb->__rb_parent_color = (unsigned long)p | color; 107 + } 108 + 109 + static inline void 110 + __rb_change_child(struct rb_node *old, struct rb_node *new, 111 + struct rb_node *parent, struct rb_root *root) 112 + { 113 + if (parent) { 114 + if (parent->rb_left == old) 115 + parent->rb_left = new; 116 + else 117 + parent->rb_right = new; 118 + } else 119 + root->rb_node = new; 120 + } 121 + 122 + extern void __rb_erase_color(struct rb_node *parent, struct rb_root *root, 123 + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)); 124 + 125 + static __always_inline void 126 + rb_erase_augmented(struct rb_node *node, struct rb_root *root, 127 + const struct rb_augment_callbacks *augment) 128 + { 129 + struct rb_node *child = node->rb_right, *tmp = node->rb_left; 130 + struct rb_node *parent, *rebalance; 131 + unsigned long pc; 132 + 133 + if (!tmp) { 134 + /* 135 + * Case 1: node to erase has no more than 1 child (easy!) 136 + * 137 + * Note that if there is one child it must be red due to 5) 138 + * and node must be black due to 4). We adjust colors locally 139 + * so as to bypass __rb_erase_color() later on. 140 + */ 141 + pc = node->__rb_parent_color; 142 + parent = __rb_parent(pc); 143 + __rb_change_child(node, child, parent, root); 144 + if (child) { 145 + child->__rb_parent_color = pc; 146 + rebalance = NULL; 147 + } else 148 + rebalance = __rb_is_black(pc) ? 
parent : NULL; 149 + tmp = parent; 150 + } else if (!child) { 151 + /* Still case 1, but this time the child is node->rb_left */ 152 + tmp->__rb_parent_color = pc = node->__rb_parent_color; 153 + parent = __rb_parent(pc); 154 + __rb_change_child(node, tmp, parent, root); 155 + rebalance = NULL; 156 + tmp = parent; 157 + } else { 158 + struct rb_node *successor = child, *child2; 159 + tmp = child->rb_left; 160 + if (!tmp) { 161 + /* 162 + * Case 2: node's successor is its right child 163 + * 164 + * (n) (s) 165 + * / \ / \ 166 + * (x) (s) -> (x) (c) 167 + * \ 168 + * (c) 169 + */ 170 + parent = successor; 171 + child2 = successor->rb_right; 172 + augment->copy(node, successor); 173 + } else { 174 + /* 175 + * Case 3: node's successor is leftmost under 176 + * node's right child subtree 177 + * 178 + * (n) (s) 179 + * / \ / \ 180 + * (x) (y) -> (x) (y) 181 + * / / 182 + * (p) (p) 183 + * / / 184 + * (s) (c) 185 + * \ 186 + * (c) 187 + */ 188 + do { 189 + parent = successor; 190 + successor = tmp; 191 + tmp = tmp->rb_left; 192 + } while (tmp); 193 + parent->rb_left = child2 = successor->rb_right; 194 + successor->rb_right = child; 195 + rb_set_parent(child, successor); 196 + augment->copy(node, successor); 197 + augment->propagate(parent, successor); 198 + } 199 + 200 + successor->rb_left = tmp = node->rb_left; 201 + rb_set_parent(tmp, successor); 202 + 203 + pc = node->__rb_parent_color; 204 + tmp = __rb_parent(pc); 205 + __rb_change_child(node, successor, tmp, root); 206 + if (child2) { 207 + successor->__rb_parent_color = pc; 208 + rb_set_parent_color(child2, parent, RB_BLACK); 209 + rebalance = NULL; 210 + } else { 211 + unsigned long pc2 = successor->__rb_parent_color; 212 + successor->__rb_parent_color = pc; 213 + rebalance = __rb_is_black(pc2) ? 
parent : NULL; 214 + } 215 + tmp = successor; 216 + } 217 + 218 + augment->propagate(tmp, NULL); 219 + if (rebalance) 220 + __rb_erase_color(rebalance, root, augment->rotate); 221 + } 222 + 223 + #endif /* _LINUX_RBTREE_AUGMENTED_H */
+20 -16
include/linux/rmap.h
··· 37 37 atomic_t refcount; 38 38 39 39 /* 40 - * NOTE: the LSB of the head.next is set by 40 + * NOTE: the LSB of the rb_root.rb_node is set by 41 41 * mm_take_all_locks() _after_ taking the above lock. So the 42 - * head must only be read/written after taking the above lock 42 + * rb_root must only be read/written after taking the above lock 43 43 * to be sure to see a valid next pointer. The LSB bit itself 44 44 * is serialized by a system wide lock only visible to 45 45 * mm_take_all_locks() (mm_all_locks_mutex). 46 46 */ 47 - struct list_head head; /* Chain of private "related" vmas */ 47 + struct rb_root rb_root; /* Interval tree of private "related" vmas */ 48 48 }; 49 49 50 50 /* ··· 57 57 * with a VMA, or the VMAs associated with an anon_vma. 58 58 * The "same_vma" list contains the anon_vma_chains linking 59 59 * all the anon_vmas associated with this VMA. 60 - * The "same_anon_vma" list contains the anon_vma_chains 60 + * The "rb" field indexes on an interval tree the anon_vma_chains 61 61 * which link all the VMAs associated with this anon_vma. 
62 62 */ 63 63 struct anon_vma_chain { 64 64 struct vm_area_struct *vma; 65 65 struct anon_vma *anon_vma; 66 66 struct list_head same_vma; /* locked by mmap_sem & page_table_lock */ 67 - struct list_head same_anon_vma; /* locked by anon_vma->mutex */ 67 + struct rb_node rb; /* locked by anon_vma->mutex */ 68 + unsigned long rb_subtree_last; 69 + #ifdef CONFIG_DEBUG_VM_RB 70 + unsigned long cached_vma_start, cached_vma_last; 71 + #endif 72 + }; 73 + 74 + enum ttu_flags { 75 + TTU_UNMAP = 0, /* unmap mode */ 76 + TTU_MIGRATION = 1, /* migration mode */ 77 + TTU_MUNLOCK = 2, /* munlock mode */ 78 + TTU_ACTION_MASK = 0xff, 79 + 80 + TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */ 81 + TTU_IGNORE_ACCESS = (1 << 9), /* don't age */ 82 + TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */ 68 83 }; 69 84 70 85 #ifdef CONFIG_MMU ··· 135 120 int anon_vma_prepare(struct vm_area_struct *); 136 121 void unlink_anon_vmas(struct vm_area_struct *); 137 122 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *); 138 - void anon_vma_moveto_tail(struct vm_area_struct *); 139 123 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *); 140 124 141 125 static inline void anon_vma_merge(struct vm_area_struct *vma, ··· 175 161 int page_referenced_one(struct page *, struct vm_area_struct *, 176 162 unsigned long address, unsigned int *mapcount, unsigned long *vm_flags); 177 163 178 - enum ttu_flags { 179 - TTU_UNMAP = 0, /* unmap mode */ 180 - TTU_MIGRATION = 1, /* migration mode */ 181 - TTU_MUNLOCK = 2, /* munlock mode */ 182 - TTU_ACTION_MASK = 0xff, 183 - 184 - TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */ 185 - TTU_IGNORE_ACCESS = (1 << 9), /* don't age */ 186 - TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */ 187 - }; 188 164 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK) 189 165 190 166 int try_to_unmap(struct page *, enum ttu_flags flags);
-1
include/linux/sched.h
··· 671 671 struct rw_semaphore group_rwsem; 672 672 #endif 673 673 674 - int oom_adj; /* OOM kill score adjustment (bit shift) */ 675 674 int oom_score_adj; /* OOM kill score adjustment */ 676 675 int oom_score_adj_min; /* OOM kill score adjustment minimum value. 677 676 * Only settable by CAP_SYS_RESOURCE. */
+1 -1
include/linux/swap.h
··· 281 281 } 282 282 #endif 283 283 284 - extern int page_evictable(struct page *page, struct vm_area_struct *vma); 284 + extern int page_evictable(struct page *page); 285 285 extern void check_move_unevictable_pages(struct page **, int nr_pages); 286 286 287 287 extern unsigned long scan_unevictable_pages;
+1 -1
include/linux/timerqueue.h
··· 39 39 40 40 static inline void timerqueue_init(struct timerqueue_node *node) 41 41 { 42 - rb_init_node(&node->node); 42 + RB_CLEAR_NODE(&node->node); 43 43 } 44 44 45 45 static inline void timerqueue_init_head(struct timerqueue_head *head)
-1
include/linux/vm_event_item.h
··· 52 52 UNEVICTABLE_PGMUNLOCKED, 53 53 UNEVICTABLE_PGCLEARED, /* on COW, page truncate */ 54 54 UNEVICTABLE_PGSTRANDED, /* unable to isolate on unlock */ 55 - UNEVICTABLE_MLOCKFREED, 56 55 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 57 56 THP_FAULT_ALLOC, 58 57 THP_FAULT_FALLBACK,
+12
include/linux/vmstat.h
··· 198 198 void refresh_cpu_vm_stats(int); 199 199 void refresh_zone_stat_thresholds(void); 200 200 201 + void drain_zonestat(struct zone *zone, struct per_cpu_pageset *); 202 + 201 203 int calculate_pressure_threshold(struct zone *zone); 202 204 int calculate_normal_threshold(struct zone *zone); 203 205 void set_pgdat_percpu_threshold(pg_data_t *pgdat, ··· 253 251 static inline void refresh_cpu_vm_stats(int cpu) { } 254 252 static inline void refresh_zone_stat_thresholds(void) { } 255 253 254 + static inline void drain_zonestat(struct zone *zone, 255 + struct per_cpu_pageset *pset) { } 256 256 #endif /* CONFIG_SMP */ 257 + 258 + static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages, 259 + int migratetype) 260 + { 261 + __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages); 262 + if (is_migrate_cma(migratetype)) 263 + __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages); 264 + } 257 265 258 266 extern const char * const vmstat_text[]; 259 267
-1
include/trace/events/gfpflags.h
··· 36 36 {(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \ 37 37 {(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"}, \ 38 38 {(unsigned long)__GFP_NOTRACK, "GFP_NOTRACK"}, \ 39 - {(unsigned long)__GFP_NO_KSWAPD, "GFP_NO_KSWAPD"}, \ 40 39 {(unsigned long)__GFP_OTHER_NODE, "GFP_OTHER_NODE"} \ 41 40 ) : "GFP_NOWAIT" 42 41
+9 -2
init/Kconfig
··· 1125 1125 environments which can tolerate a "non-standard" kernel. 1126 1126 Only use this if you really know what you are doing. 1127 1127 1128 + config HAVE_UID16 1129 + bool 1130 + 1128 1131 config UID16 1129 1132 bool "Enable 16-bit UID system calls" if EXPERT 1130 - depends on ARM || BLACKFIN || CRIS || FRV || H8300 || X86_32 || M68K || (S390 && !64BIT) || SUPERH || SPARC32 || (SPARC64 && COMPAT) || UML || (X86_64 && IA32_EMULATION) \ 1131 - || AARCH32_EMULATION 1133 + depends on HAVE_UID16 1132 1134 default y 1133 1135 help 1134 1136 This enables the legacy 16-bit UID syscall wrappers. ··· 1151 1149 making your kernel marginally smaller. 1152 1150 1153 1151 If unsure say N here. 1152 + 1153 + config SYSCTL_EXCEPTION_TRACE 1154 + bool 1155 + help 1156 + Enable support for /proc/sys/debug/exception-trace. 1154 1157 1155 1158 config KALLSYMS 1156 1159 bool "Load all symbols for debugging/ksymoops" if EXPERT
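The init/Kconfig change replaces the long per-architecture `depends on` list with a bare HAVE_UID16 symbol that each architecture selects, the same pattern this series applies to HAVE_DEBUG_KMEMLEAK and HAVE_DEBUG_BUGVERBOSE below. Under it, an architecture opts in once in its own Kconfig instead of being appended to a central list (hypothetical arch fragment; symbol names from this series):

```
# Hypothetical arch/<arch>/Kconfig fragment: the architecture opts in
# once, instead of being listed in init/Kconfig's old depends line.
config MYARCH
	bool
	default y
	select HAVE_UID16
	select HAVE_DEBUG_KMEMLEAK
	select HAVE_DEBUG_BUGVERBOSE
```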
-2
init/main.c
··· 86 86 extern void fork_init(unsigned long); 87 87 extern void mca_init(void); 88 88 extern void sbus_init(void); 89 - extern void prio_tree_init(void); 90 89 extern void radix_tree_init(void); 91 90 #ifndef CONFIG_DEBUG_RODATA 92 91 static inline void mark_rodata_ro(void) { } ··· 546 547 /* init some links before init_ISA_irqs() */ 547 548 early_irq_init(); 548 549 init_IRQ(); 549 - prio_tree_init(); 550 550 init_timers(); 551 551 hrtimers_init(); 552 552 softirq_init();
-3
ipc/mqueue.c
··· 142 142 leaf = kmalloc(sizeof(*leaf), GFP_ATOMIC); 143 143 if (!leaf) 144 144 return -ENOMEM; 145 - rb_init_node(&leaf->rb_node); 146 145 INIT_LIST_HEAD(&leaf->msg_list); 147 146 info->qsize += sizeof(*leaf); 148 147 } ··· 1012 1013 1013 1014 if (!info->node_cache && new_leaf) { 1014 1015 /* Save our speculative allocation into the cache */ 1015 - rb_init_node(&new_leaf->rb_node); 1016 1016 INIT_LIST_HEAD(&new_leaf->msg_list); 1017 1017 info->node_cache = new_leaf; 1018 1018 info->qsize += sizeof(*new_leaf); ··· 1119 1121 1120 1122 if (!info->node_cache && new_leaf) { 1121 1123 /* Save our speculative allocation into the cache */ 1122 - rb_init_node(&new_leaf->rb_node); 1123 1124 INIT_LIST_HEAD(&new_leaf->msg_list); 1124 1125 info->node_cache = new_leaf; 1125 1126 info->qsize += sizeof(*new_leaf);
+2 -11
kernel/auditsc.c
··· 1151 1151 const struct cred *cred; 1152 1152 char name[sizeof(tsk->comm)]; 1153 1153 struct mm_struct *mm = tsk->mm; 1154 - struct vm_area_struct *vma; 1155 1154 char *tty; 1156 1155 1157 1156 if (!ab) ··· 1190 1191 1191 1192 if (mm) { 1192 1193 down_read(&mm->mmap_sem); 1193 - vma = mm->mmap; 1194 - while (vma) { 1195 - if ((vma->vm_flags & VM_EXECUTABLE) && 1196 - vma->vm_file) { 1197 - audit_log_d_path(ab, " exe=", 1198 - &vma->vm_file->f_path); 1199 - break; 1200 - } 1201 - vma = vma->vm_next; 1202 - } 1194 + if (mm->exe_file) 1195 + audit_log_d_path(ab, " exe=", &mm->exe_file->f_path); 1203 1196 up_read(&mm->mmap_sem); 1204 1197 } 1205 1198 audit_log_task_context(ab);
+4
kernel/cpu.c
··· 80 80 if (cpu_hotplug.active_writer == current) 81 81 return; 82 82 mutex_lock(&cpu_hotplug.lock); 83 + 84 + if (WARN_ON(!cpu_hotplug.refcount)) 85 + cpu_hotplug.refcount++; /* try to fix things up */ 86 + 83 87 if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer)) 84 88 wake_up_process(cpu_hotplug.active_writer); 85 89 mutex_unlock(&cpu_hotplug.lock);
+1 -1
kernel/events/core.c
··· 3671 3671 atomic_inc(&event->mmap_count); 3672 3672 mutex_unlock(&event->mmap_mutex); 3673 3673 3674 - vma->vm_flags |= VM_RESERVED; 3674 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 3675 3675 vma->vm_ops = &perf_mmap_vmops; 3676 3676 3677 3677 return ret;
+6 -2
kernel/events/uprobes.c
··· 141 141 spinlock_t *ptl; 142 142 pte_t *ptep; 143 143 int err; 144 + /* For mmu_notifiers */ 145 + const unsigned long mmun_start = addr; 146 + const unsigned long mmun_end = addr + PAGE_SIZE; 144 147 145 148 /* For try_to_free_swap() and munlock_vma_page() below */ 146 149 lock_page(page); 147 150 151 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 148 152 err = -EAGAIN; 149 153 ptep = page_check_address(page, mm, addr, &ptl, 0); 150 154 if (!ptep) ··· 177 173 178 174 err = 0; 179 175 unlock: 176 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 180 177 unlock_page(page); 181 178 return err; 182 179 } ··· 740 735 build_map_info(struct address_space *mapping, loff_t offset, bool is_register) 741 736 { 742 737 unsigned long pgoff = offset >> PAGE_SHIFT; 743 - struct prio_tree_iter iter; 744 738 struct vm_area_struct *vma; 745 739 struct map_info *curr = NULL; 746 740 struct map_info *prev = NULL; ··· 748 744 749 745 again: 750 746 mutex_lock(&mapping->i_mmap_mutex); 751 - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { 747 + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { 752 748 if (!valid_vma(vma, is_register)) 753 749 continue; 754 750
+7 -25
kernel/fork.c
··· 423 423 mapping->i_mmap_writable++; 424 424 flush_dcache_mmap_lock(mapping); 425 425 /* insert tmp into the share list, just after mpnt */ 426 - vma_prio_tree_add(tmp, mpnt); 426 + if (unlikely(tmp->vm_flags & VM_NONLINEAR)) 427 + vma_nonlinear_insert(tmp, 428 + &mapping->i_mmap_nonlinear); 429 + else 430 + vma_interval_tree_insert_after(tmp, mpnt, 431 + &mapping->i_mmap); 427 432 flush_dcache_mmap_unlock(mapping); 428 433 mutex_unlock(&mapping->i_mmap_mutex); 429 434 } ··· 627 622 } 628 623 EXPORT_SYMBOL_GPL(mmput); 629 624 630 - /* 631 - * We added or removed a vma mapping the executable. The vmas are only mapped 632 - * during exec and are not mapped with the mmap system call. 633 - * Callers must hold down_write() on the mm's mmap_sem for these 634 - */ 635 - void added_exe_file_vma(struct mm_struct *mm) 636 - { 637 - mm->num_exe_file_vmas++; 638 - } 639 - 640 - void removed_exe_file_vma(struct mm_struct *mm) 641 - { 642 - mm->num_exe_file_vmas--; 643 - if ((mm->num_exe_file_vmas == 0) && mm->exe_file) { 644 - fput(mm->exe_file); 645 - mm->exe_file = NULL; 646 - } 647 - 648 - } 649 - 650 625 void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) 651 626 { 652 627 if (new_exe_file) ··· 634 649 if (mm->exe_file) 635 650 fput(mm->exe_file); 636 651 mm->exe_file = new_exe_file; 637 - mm->num_exe_file_vmas = 0; 638 652 } 639 653 640 654 struct file *get_mm_exe_file(struct mm_struct *mm) 641 655 { 642 656 struct file *exe_file; 643 657 644 - /* We need mmap_sem to protect against races with removal of 645 - * VM_EXECUTABLE vmas */ 658 + /* We need mmap_sem to protect against races with removal of exe_file */ 646 659 down_read(&mm->mmap_sem); 647 660 exe_file = mm->exe_file; 648 661 if (exe_file) ··· 1061 1078 init_rwsem(&sig->group_rwsem); 1062 1079 #endif 1063 1080 1064 - sig->oom_adj = current->signal->oom_adj; 1065 1081 sig->oom_score_adj = current->signal->oom_score_adj; 1066 1082 sig->oom_score_adj_min = current->signal->oom_score_adj_min; 
1067 1083
+1 -2
kernel/sysctl.c
··· 1549 1549 }; 1550 1550 1551 1551 static struct ctl_table debug_table[] = { 1552 - #if defined(CONFIG_X86) || defined(CONFIG_PPC) || defined(CONFIG_SPARC) || \ 1553 - defined(CONFIG_S390) || defined(CONFIG_TILE) || defined(CONFIG_ARM64) 1552 + #ifdef CONFIG_SYSCTL_EXCEPTION_TRACE 1554 1553 { 1555 1554 .procname = "exception-trace", 1556 1555 .data = &show_unhandled_signals,
+30 -8
lib/Kconfig.debug
··· 450 450 out which slabs are relevant to a particular load. 451 451 Try running: slabinfo -DA 452 452 453 + config HAVE_DEBUG_KMEMLEAK 454 + bool 455 + 453 456 config DEBUG_KMEMLEAK 454 457 bool "Kernel memory leak detector" 455 - depends on DEBUG_KERNEL && EXPERIMENTAL && \ 456 - (X86 || ARM || PPC || MIPS || S390 || SPARC64 || SUPERH || \ 457 - MICROBLAZE || TILE || ARM64) 458 - 458 + depends on DEBUG_KERNEL && EXPERIMENTAL && HAVE_DEBUG_KMEMLEAK 459 459 select DEBUG_FS 460 460 select STACKTRACE if STACKTRACE_SUPPORT 461 461 select KALLSYMS ··· 751 751 This options enables addition error checking for high memory systems. 752 752 Disable for production systems. 753 753 754 + config HAVE_DEBUG_BUGVERBOSE 755 + bool 756 + 754 757 config DEBUG_BUGVERBOSE 755 758 bool "Verbose BUG() reporting (adds 70K)" if DEBUG_KERNEL && EXPERT 756 - depends on BUG 757 - depends on ARM || AVR32 || M32R || M68K || SPARC32 || SPARC64 || \ 758 - FRV || SUPERH || GENERIC_BUG || BLACKFIN || MN10300 || \ 759 - TILE || ARM64 759 + depends on BUG && (GENERIC_BUG || HAVE_DEBUG_BUGVERBOSE) 760 760 default y 761 761 help 762 762 Say Y here to make BUG() panics output the file name and line number ··· 795 795 help 796 796 Enable this to turn on extended checks in the virtual-memory system 797 797 that may impact performance. 798 + 799 + If unsure, say N. 800 + 801 + config DEBUG_VM_RB 802 + bool "Debug VM red-black trees" 803 + depends on DEBUG_VM 804 + help 805 + Enable this to turn on more extended checks in the virtual-memory 806 + system that may impact performance. 798 807 799 808 If unsure, say N. 800 809 ··· 1290 1281 1291 1282 source mm/Kconfig.debug 1292 1283 source kernel/trace/Kconfig 1284 + 1285 + config RBTREE_TEST 1286 + tristate "Red-Black tree test" 1287 + depends on m && DEBUG_KERNEL 1288 + help 1289 + A benchmark measuring the performance of the rbtree library. 1290 + Also includes rbtree invariant checks. 
1291 + 1292 + config INTERVAL_TREE_TEST 1293 + tristate "Interval tree test" 1294 + depends on m && DEBUG_KERNEL 1295 + help 1296 + A benchmark measuring the performance of the interval tree library 1293 1297 1294 1298 config PROVIDE_OHCI1394_DMA_INIT 1295 1299 bool "Remote debugging over FireWire early on boot"
+6 -1
lib/Makefile
··· 9 9 10 10 lib-y := ctype.o string.o vsprintf.o cmdline.o \ 11 11 rbtree.o radix-tree.o dump_stack.o timerqueue.o\ 12 - idr.o int_sqrt.o extable.o prio_tree.o \ 12 + idr.o int_sqrt.o extable.o \ 13 13 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ 14 14 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ 15 15 is_single_threaded.o plist.o decompress.o ··· 139 139 $(foreach file, $(libfdt_files), \ 140 140 $(eval CFLAGS_$(file) = -I$(src)/../scripts/dtc/libfdt)) 141 141 lib-$(CONFIG_LIBFDT) += $(libfdt_files) 142 + 143 + obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o 144 + obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o 145 + 146 + interval_tree_test-objs := interval_tree_test_main.o interval_tree.o 142 147 143 148 hostprogs-y := gen_crc32table 144 149 clean-files := crc32table.h
+10
lib/interval_tree.c
··· 1 + #include <linux/init.h> 2 + #include <linux/interval_tree.h> 3 + #include <linux/interval_tree_generic.h> 4 + 5 + #define START(node) ((node)->start) 6 + #define LAST(node) ((node)->last) 7 + 8 + INTERVAL_TREE_DEFINE(struct interval_tree_node, rb, 9 + unsigned long, __subtree_last, 10 + START, LAST,, interval_tree)
+105
lib/interval_tree_test_main.c
··· 1 + #include <linux/module.h> 2 + #include <linux/interval_tree.h> 3 + #include <linux/random.h> 4 + #include <asm/timex.h> 5 + 6 + #define NODES 100 7 + #define PERF_LOOPS 100000 8 + #define SEARCHES 100 9 + #define SEARCH_LOOPS 10000 10 + 11 + static struct rb_root root = RB_ROOT; 12 + static struct interval_tree_node nodes[NODES]; 13 + static u32 queries[SEARCHES]; 14 + 15 + static struct rnd_state rnd; 16 + 17 + static inline unsigned long 18 + search(unsigned long query, struct rb_root *root) 19 + { 20 + struct interval_tree_node *node; 21 + unsigned long results = 0; 22 + 23 + for (node = interval_tree_iter_first(root, query, query); node; 24 + node = interval_tree_iter_next(node, query, query)) 25 + results++; 26 + return results; 27 + } 28 + 29 + static void init(void) 30 + { 31 + int i; 32 + for (i = 0; i < NODES; i++) { 33 + u32 a = prandom32(&rnd), b = prandom32(&rnd); 34 + if (a <= b) { 35 + nodes[i].start = a; 36 + nodes[i].last = b; 37 + } else { 38 + nodes[i].start = b; 39 + nodes[i].last = a; 40 + } 41 + } 42 + for (i = 0; i < SEARCHES; i++) 43 + queries[i] = prandom32(&rnd); 44 + } 45 + 46 + static int interval_tree_test_init(void) 47 + { 48 + int i, j; 49 + unsigned long results; 50 + cycles_t time1, time2, time; 51 + 52 + printk(KERN_ALERT "interval tree insert/remove"); 53 + 54 + prandom32_seed(&rnd, 3141592653589793238ULL); 55 + init(); 56 + 57 + time1 = get_cycles(); 58 + 59 + for (i = 0; i < PERF_LOOPS; i++) { 60 + for (j = 0; j < NODES; j++) 61 + interval_tree_insert(nodes + j, &root); 62 + for (j = 0; j < NODES; j++) 63 + interval_tree_remove(nodes + j, &root); 64 + } 65 + 66 + time2 = get_cycles(); 67 + time = time2 - time1; 68 + 69 + time = div_u64(time, PERF_LOOPS); 70 + printk(" -> %llu cycles\n", (unsigned long long)time); 71 + 72 + printk(KERN_ALERT "interval tree search"); 73 + 74 + for (j = 0; j < NODES; j++) 75 + interval_tree_insert(nodes + j, &root); 76 + 77 + time1 = get_cycles(); 78 + 79 + results = 0; 80 + for (i = 0; i < SEARCH_LOOPS; i++) 81 + for (j = 0; j < SEARCHES; j++) 82 + results += search(queries[j], &root); 83 + 84 + time2 = get_cycles(); 85 + time = time2 - time1; 86 + 87 + time = div_u64(time, SEARCH_LOOPS); 88 + results = div_u64(results, SEARCH_LOOPS); 89 + printk(" -> %llu cycles (%lu results)\n", 90 + (unsigned long long)time, results); 91 + 92 + return -EAGAIN; /* Fail will directly unload the module */ 93 + } 94 + 95 + static void interval_tree_test_exit(void) 96 + { 97 + printk(KERN_ALERT "test exit\n"); 98 + } 99 + 100 + module_init(interval_tree_test_init) 101 + module_exit(interval_tree_test_exit) 102 + 103 + MODULE_LICENSE("GPL"); 104 + MODULE_AUTHOR("Michel Lespinasse"); 105 + MODULE_DESCRIPTION("Interval Tree test");
-466
lib/prio_tree.c
··· 1 - /* 2 - * lib/prio_tree.c - priority search tree 3 - * 4 - * Copyright (C) 2004, Rajesh Venkatasubramanian <vrajesh@umich.edu> 5 - * 6 - * This file is released under the GPL v2. 7 - * 8 - * Based on the radix priority search tree proposed by Edward M. McCreight 9 - * SIAM Journal of Computing, vol. 14, no.2, pages 257-276, May 1985 10 - * 11 - * 02Feb2004 Initial version 12 - */ 13 - 14 - #include <linux/init.h> 15 - #include <linux/mm.h> 16 - #include <linux/prio_tree.h> 17 - 18 - /* 19 - * A clever mix of heap and radix trees forms a radix priority search tree (PST) 20 - * which is useful for storing intervals, e.g, we can consider a vma as a closed 21 - * interval of file pages [offset_begin, offset_end], and store all vmas that 22 - * map a file in a PST. Then, using the PST, we can answer a stabbing query, 23 - * i.e., selecting a set of stored intervals (vmas) that overlap with (map) a 24 - * given input interval X (a set of consecutive file pages), in "O(log n + m)" 25 - * time where 'log n' is the height of the PST, and 'm' is the number of stored 26 - * intervals (vmas) that overlap (map) with the input interval X (the set of 27 - * consecutive file pages). 28 - * 29 - * In our implementation, we store closed intervals of the form [radix_index, 30 - * heap_index]. We assume that always radix_index <= heap_index. McCreight's PST 31 - * is designed for storing intervals with unique radix indices, i.e., each 32 - * interval have different radix_index. However, this limitation can be easily 33 - * overcome by using the size, i.e., heap_index - radix_index, as part of the 34 - * index, so we index the tree using [(radix_index,size), heap_index]. 35 - * 36 - * When the above-mentioned indexing scheme is used, theoretically, in a 32 bit 37 - * machine, the maximum height of a PST can be 64. 
We can use a balanced version 38 - * of the priority search tree to optimize the tree height, but the balanced 39 - * tree proposed by McCreight is too complex and memory-hungry for our purpose. 40 - */ 41 - 42 - /* 43 - * The following macros are used for implementing prio_tree for i_mmap 44 - */ 45 - 46 - #define RADIX_INDEX(vma) ((vma)->vm_pgoff) 47 - #define VMA_SIZE(vma) (((vma)->vm_end - (vma)->vm_start) >> PAGE_SHIFT) 48 - /* avoid overflow */ 49 - #define HEAP_INDEX(vma) ((vma)->vm_pgoff + (VMA_SIZE(vma) - 1)) 50 - 51 - 52 - static void get_index(const struct prio_tree_root *root, 53 - const struct prio_tree_node *node, 54 - unsigned long *radix, unsigned long *heap) 55 - { 56 - if (root->raw) { 57 - struct vm_area_struct *vma = prio_tree_entry( 58 - node, struct vm_area_struct, shared.prio_tree_node); 59 - 60 - *radix = RADIX_INDEX(vma); 61 - *heap = HEAP_INDEX(vma); 62 - } 63 - else { 64 - *radix = node->start; 65 - *heap = node->last; 66 - } 67 - } 68 - 69 - static unsigned long index_bits_to_maxindex[BITS_PER_LONG]; 70 - 71 - void __init prio_tree_init(void) 72 - { 73 - unsigned int i; 74 - 75 - for (i = 0; i < ARRAY_SIZE(index_bits_to_maxindex) - 1; i++) 76 - index_bits_to_maxindex[i] = (1UL << (i + 1)) - 1; 77 - index_bits_to_maxindex[ARRAY_SIZE(index_bits_to_maxindex) - 1] = ~0UL; 78 - } 79 - 80 - /* 81 - * Maximum heap_index that can be stored in a PST with index_bits bits 82 - */ 83 - static inline unsigned long prio_tree_maxindex(unsigned int bits) 84 - { 85 - return index_bits_to_maxindex[bits - 1]; 86 - } 87 - 88 - static void prio_set_parent(struct prio_tree_node *parent, 89 - struct prio_tree_node *child, bool left) 90 - { 91 - if (left) 92 - parent->left = child; 93 - else 94 - parent->right = child; 95 - 96 - child->parent = parent; 97 - } 98 - 99 - /* 100 - * Extend a priority search tree so that it can store a node with heap_index 101 - * max_heap_index. In the worst case, this algorithm takes O((log n)^2). 
102 - * However, this function is used rarely and the common case performance is 103 - * not bad. 104 - */ 105 - static struct prio_tree_node *prio_tree_expand(struct prio_tree_root *root, 106 - struct prio_tree_node *node, unsigned long max_heap_index) 107 - { 108 - struct prio_tree_node *prev; 109 - 110 - if (max_heap_index > prio_tree_maxindex(root->index_bits)) 111 - root->index_bits++; 112 - 113 - prev = node; 114 - INIT_PRIO_TREE_NODE(node); 115 - 116 - while (max_heap_index > prio_tree_maxindex(root->index_bits)) { 117 - struct prio_tree_node *tmp = root->prio_tree_node; 118 - 119 - root->index_bits++; 120 - 121 - if (prio_tree_empty(root)) 122 - continue; 123 - 124 - prio_tree_remove(root, root->prio_tree_node); 125 - INIT_PRIO_TREE_NODE(tmp); 126 - 127 - prio_set_parent(prev, tmp, true); 128 - prev = tmp; 129 - } 130 - 131 - if (!prio_tree_empty(root)) 132 - prio_set_parent(prev, root->prio_tree_node, true); 133 - 134 - root->prio_tree_node = node; 135 - return node; 136 - } 137 - 138 - /* 139 - * Replace a prio_tree_node with a new node and return the old node 140 - */ 141 - struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root, 142 - struct prio_tree_node *old, struct prio_tree_node *node) 143 - { 144 - INIT_PRIO_TREE_NODE(node); 145 - 146 - if (prio_tree_root(old)) { 147 - BUG_ON(root->prio_tree_node != old); 148 - /* 149 - * We can reduce root->index_bits here. However, it is complex 150 - * and does not help much to improve performance (IMO). 151 - */ 152 - root->prio_tree_node = node; 153 - } else 154 - prio_set_parent(old->parent, node, old->parent->left == old); 155 - 156 - if (!prio_tree_left_empty(old)) 157 - prio_set_parent(node, old->left, true); 158 - 159 - if (!prio_tree_right_empty(old)) 160 - prio_set_parent(node, old->right, false); 161 - 162 - return old; 163 - } 164 - 165 - /* 166 - * Insert a prio_tree_node @node into a radix priority search tree @root. 
The 167 - * algorithm typically takes O(log n) time where 'log n' is the number of bits 168 - * required to represent the maximum heap_index. In the worst case, the algo 169 - * can take O((log n)^2) - check prio_tree_expand. 170 - * 171 - * If a prior node with same radix_index and heap_index is already found in 172 - * the tree, then returns the address of the prior node. Otherwise, inserts 173 - * @node into the tree and returns @node. 174 - */ 175 - struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root, 176 - struct prio_tree_node *node) 177 - { 178 - struct prio_tree_node *cur, *res = node; 179 - unsigned long radix_index, heap_index; 180 - unsigned long r_index, h_index, index, mask; 181 - int size_flag = 0; 182 - 183 - get_index(root, node, &radix_index, &heap_index); 184 - 185 - if (prio_tree_empty(root) || 186 - heap_index > prio_tree_maxindex(root->index_bits)) 187 - return prio_tree_expand(root, node, heap_index); 188 - 189 - cur = root->prio_tree_node; 190 - mask = 1UL << (root->index_bits - 1); 191 - 192 - while (mask) { 193 - get_index(root, cur, &r_index, &h_index); 194 - 195 - if (r_index == radix_index && h_index == heap_index) 196 - return cur; 197 - 198 - if (h_index < heap_index || 199 - (h_index == heap_index && r_index > radix_index)) { 200 - struct prio_tree_node *tmp = node; 201 - node = prio_tree_replace(root, cur, node); 202 - cur = tmp; 203 - /* swap indices */ 204 - index = r_index; 205 - r_index = radix_index; 206 - radix_index = index; 207 - index = h_index; 208 - h_index = heap_index; 209 - heap_index = index; 210 - } 211 - 212 - if (size_flag) 213 - index = heap_index - radix_index; 214 - else 215 - index = radix_index; 216 - 217 - if (index & mask) { 218 - if (prio_tree_right_empty(cur)) { 219 - INIT_PRIO_TREE_NODE(node); 220 - prio_set_parent(cur, node, false); 221 - return res; 222 - } else 223 - cur = cur->right; 224 - } else { 225 - if (prio_tree_left_empty(cur)) { 226 - INIT_PRIO_TREE_NODE(node); 227 - 
prio_set_parent(cur, node, true); 228 - return res; 229 - } else 230 - cur = cur->left; 231 - } 232 - 233 - mask >>= 1; 234 - 235 - if (!mask) { 236 - mask = 1UL << (BITS_PER_LONG - 1); 237 - size_flag = 1; 238 - } 239 - } 240 - /* Should not reach here */ 241 - BUG(); 242 - return NULL; 243 - } 244 - 245 - /* 246 - * Remove a prio_tree_node @node from a radix priority search tree @root. The 247 - * algorithm takes O(log n) time where 'log n' is the number of bits required 248 - * to represent the maximum heap_index. 249 - */ 250 - void prio_tree_remove(struct prio_tree_root *root, struct prio_tree_node *node) 251 - { 252 - struct prio_tree_node *cur; 253 - unsigned long r_index, h_index_right, h_index_left; 254 - 255 - cur = node; 256 - 257 - while (!prio_tree_left_empty(cur) || !prio_tree_right_empty(cur)) { 258 - if (!prio_tree_left_empty(cur)) 259 - get_index(root, cur->left, &r_index, &h_index_left); 260 - else { 261 - cur = cur->right; 262 - continue; 263 - } 264 - 265 - if (!prio_tree_right_empty(cur)) 266 - get_index(root, cur->right, &r_index, &h_index_right); 267 - else { 268 - cur = cur->left; 269 - continue; 270 - } 271 - 272 - /* both h_index_left and h_index_right cannot be 0 */ 273 - if (h_index_left >= h_index_right) 274 - cur = cur->left; 275 - else 276 - cur = cur->right; 277 - } 278 - 279 - if (prio_tree_root(cur)) { 280 - BUG_ON(root->prio_tree_node != cur); 281 - __INIT_PRIO_TREE_ROOT(root, root->raw); 282 - return; 283 - } 284 - 285 - if (cur->parent->right == cur) 286 - cur->parent->right = cur->parent; 287 - else 288 - cur->parent->left = cur->parent; 289 - 290 - while (cur != node) 291 - cur = prio_tree_replace(root, cur->parent, cur); 292 - } 293 - 294 - static void iter_walk_down(struct prio_tree_iter *iter) 295 - { 296 - iter->mask >>= 1; 297 - if (iter->mask) { 298 - if (iter->size_level) 299 - iter->size_level++; 300 - return; 301 - } 302 - 303 - if (iter->size_level) { 304 - BUG_ON(!prio_tree_left_empty(iter->cur)); 305 - 
BUG_ON(!prio_tree_right_empty(iter->cur)); 306 - iter->size_level++; 307 - iter->mask = ULONG_MAX; 308 - } else { 309 - iter->size_level = 1; 310 - iter->mask = 1UL << (BITS_PER_LONG - 1); 311 - } 312 - } 313 - 314 - static void iter_walk_up(struct prio_tree_iter *iter) 315 - { 316 - if (iter->mask == ULONG_MAX) 317 - iter->mask = 1UL; 318 - else if (iter->size_level == 1) 319 - iter->mask = 1UL; 320 - else 321 - iter->mask <<= 1; 322 - if (iter->size_level) 323 - iter->size_level--; 324 - if (!iter->size_level && (iter->value & iter->mask)) 325 - iter->value ^= iter->mask; 326 - } 327 - 328 - /* 329 - * Following functions help to enumerate all prio_tree_nodes in the tree that 330 - * overlap with the input interval X [radix_index, heap_index]. The enumeration 331 - * takes O(log n + m) time where 'log n' is the height of the tree (which is 332 - * proportional to # of bits required to represent the maximum heap_index) and 333 - * 'm' is the number of prio_tree_nodes that overlap the interval X. 
334 - */ 335 - 336 - static struct prio_tree_node *prio_tree_left(struct prio_tree_iter *iter, 337 - unsigned long *r_index, unsigned long *h_index) 338 - { 339 - if (prio_tree_left_empty(iter->cur)) 340 - return NULL; 341 - 342 - get_index(iter->root, iter->cur->left, r_index, h_index); 343 - 344 - if (iter->r_index <= *h_index) { 345 - iter->cur = iter->cur->left; 346 - iter_walk_down(iter); 347 - return iter->cur; 348 - } 349 - 350 - return NULL; 351 - } 352 - 353 - static struct prio_tree_node *prio_tree_right(struct prio_tree_iter *iter, 354 - unsigned long *r_index, unsigned long *h_index) 355 - { 356 - unsigned long value; 357 - 358 - if (prio_tree_right_empty(iter->cur)) 359 - return NULL; 360 - 361 - if (iter->size_level) 362 - value = iter->value; 363 - else 364 - value = iter->value | iter->mask; 365 - 366 - if (iter->h_index < value) 367 - return NULL; 368 - 369 - get_index(iter->root, iter->cur->right, r_index, h_index); 370 - 371 - if (iter->r_index <= *h_index) { 372 - iter->cur = iter->cur->right; 373 - iter_walk_down(iter); 374 - return iter->cur; 375 - } 376 - 377 - return NULL; 378 - } 379 - 380 - static struct prio_tree_node *prio_tree_parent(struct prio_tree_iter *iter) 381 - { 382 - iter->cur = iter->cur->parent; 383 - iter_walk_up(iter); 384 - return iter->cur; 385 - } 386 - 387 - static inline int overlap(struct prio_tree_iter *iter, 388 - unsigned long r_index, unsigned long h_index) 389 - { 390 - return iter->h_index >= r_index && iter->r_index <= h_index; 391 - } 392 - 393 - /* 394 - * prio_tree_first: 395 - * 396 - * Get the first prio_tree_node that overlaps with the interval [radix_index, 397 - * heap_index]. Note that always radix_index <= heap_index. We do a pre-order 398 - * traversal of the tree. 
399 - */ 400 - static struct prio_tree_node *prio_tree_first(struct prio_tree_iter *iter) 401 - { 402 - struct prio_tree_root *root; 403 - unsigned long r_index, h_index; 404 - 405 - INIT_PRIO_TREE_ITER(iter); 406 - 407 - root = iter->root; 408 - if (prio_tree_empty(root)) 409 - return NULL; 410 - 411 - get_index(root, root->prio_tree_node, &r_index, &h_index); 412 - 413 - if (iter->r_index > h_index) 414 - return NULL; 415 - 416 - iter->mask = 1UL << (root->index_bits - 1); 417 - iter->cur = root->prio_tree_node; 418 - 419 - while (1) { 420 - if (overlap(iter, r_index, h_index)) 421 - return iter->cur; 422 - 423 - if (prio_tree_left(iter, &r_index, &h_index)) 424 - continue; 425 - 426 - if (prio_tree_right(iter, &r_index, &h_index)) 427 - continue; 428 - 429 - break; 430 - } 431 - return NULL; 432 - } 433 - 434 - /* 435 - * prio_tree_next: 436 - * 437 - * Get the next prio_tree_node that overlaps with the input interval in iter 438 - */ 439 - struct prio_tree_node *prio_tree_next(struct prio_tree_iter *iter) 440 - { 441 - unsigned long r_index, h_index; 442 - 443 - if (iter->cur == NULL) 444 - return prio_tree_first(iter); 445 - 446 - repeat: 447 - while (prio_tree_left(iter, &r_index, &h_index)) 448 - if (overlap(iter, r_index, h_index)) 449 - return iter->cur; 450 - 451 - while (!prio_tree_right(iter, &r_index, &h_index)) { 452 - while (!prio_tree_root(iter->cur) && 453 - iter->cur->parent->right == iter->cur) 454 - prio_tree_parent(iter); 455 - 456 - if (prio_tree_root(iter->cur)) 457 - return NULL; 458 - 459 - prio_tree_parent(iter); 460 - } 461 - 462 - if (overlap(iter, r_index, h_index)) 463 - return iter->cur; 464 - 465 - goto repeat; 466 - }
+378 -334
lib/rbtree.c
··· 2 2 Red Black Trees 3 3 (C) 1999 Andrea Arcangeli <andrea@suse.de> 4 4 (C) 2002 David Woodhouse <dwmw2@infradead.org> 5 - 5 + (C) 2012 Michel Lespinasse <walken@google.com> 6 + 6 7 This program is free software; you can redistribute it and/or modify 7 8 it under the terms of the GNU General Public License as published by 8 9 the Free Software Foundation; either version 2 of the License, or ··· 21 20 linux/lib/rbtree.c 22 21 */ 23 22 24 - #include <linux/rbtree.h> 23 + #include <linux/rbtree_augmented.h> 25 24 #include <linux/export.h> 26 25 27 - static void __rb_rotate_left(struct rb_node *node, struct rb_root *root) 26 + /* 27 + * red-black trees properties: http://en.wikipedia.org/wiki/Rbtree 28 + * 29 + * 1) A node is either red or black 30 + * 2) The root is black 31 + * 3) All leaves (NULL) are black 32 + * 4) Both children of every red node are black 33 + * 5) Every simple path from root to leaves contains the same number 34 + * of black nodes. 35 + * 36 + * 4 and 5 give the O(log n) guarantee, since 4 implies you cannot have two 37 + * consecutive red nodes in a path and every red node is therefore followed by 38 + * a black. So if B is the number of black nodes on every simple path (as per 39 + * 5), then the longest possible path due to 4 is 2B. 40 + * 41 + * We shall indicate color with case, where black nodes are uppercase and red 42 + * nodes will be lowercase. Unknown color nodes shall be drawn as red within 43 + * parentheses and have some accompanying text comment. 
44 + */ 45 + 46 + static inline void rb_set_black(struct rb_node *rb) 28 47 { 29 - struct rb_node *right = node->rb_right; 30 - struct rb_node *parent = rb_parent(node); 31 - 32 - if ((node->rb_right = right->rb_left)) 33 - rb_set_parent(right->rb_left, node); 34 - right->rb_left = node; 35 - 36 - rb_set_parent(right, parent); 37 - 38 - if (parent) 39 - { 40 - if (node == parent->rb_left) 41 - parent->rb_left = right; 42 - else 43 - parent->rb_right = right; 44 - } 45 - else 46 - root->rb_node = right; 47 - rb_set_parent(node, right); 48 + rb->__rb_parent_color |= RB_BLACK; 48 49 } 49 50 50 - static void __rb_rotate_right(struct rb_node *node, struct rb_root *root) 51 + static inline struct rb_node *rb_red_parent(struct rb_node *red) 51 52 { 52 - struct rb_node *left = node->rb_left; 53 - struct rb_node *parent = rb_parent(node); 54 - 55 - if ((node->rb_left = left->rb_right)) 56 - rb_set_parent(left->rb_right, node); 57 - left->rb_right = node; 58 - 59 - rb_set_parent(left, parent); 60 - 61 - if (parent) 62 - { 63 - if (node == parent->rb_right) 64 - parent->rb_right = left; 65 - else 66 - parent->rb_left = left; 67 - } 68 - else 69 - root->rb_node = left; 70 - rb_set_parent(node, left); 53 + return (struct rb_node *)red->__rb_parent_color; 71 54 } 55 + 56 + /* 57 + * Helper function for rotations: 58 + * - old's parent and color get assigned to new 59 + * - old gets assigned new as a parent and 'color' as a color. 
60 + */ 61 + static inline void 62 + __rb_rotate_set_parents(struct rb_node *old, struct rb_node *new, 63 + struct rb_root *root, int color) 64 + { 65 + struct rb_node *parent = rb_parent(old); 66 + new->__rb_parent_color = old->__rb_parent_color; 67 + rb_set_parent_color(old, new, color); 68 + __rb_change_child(old, new, parent, root); 69 + } 70 + 71 + static __always_inline void 72 + __rb_insert(struct rb_node *node, struct rb_root *root, 73 + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)) 74 + { 75 + struct rb_node *parent = rb_red_parent(node), *gparent, *tmp; 76 + 77 + while (true) { 78 + /* 79 + * Loop invariant: node is red 80 + * 81 + * If there is a black parent, we are done. 82 + * Otherwise, take some corrective action as we don't 83 + * want a red root or two consecutive red nodes. 84 + */ 85 + if (!parent) { 86 + rb_set_parent_color(node, NULL, RB_BLACK); 87 + break; 88 + } else if (rb_is_black(parent)) 89 + break; 90 + 91 + gparent = rb_red_parent(parent); 92 + 93 + tmp = gparent->rb_right; 94 + if (parent != tmp) { /* parent == gparent->rb_left */ 95 + if (tmp && rb_is_red(tmp)) { 96 + /* 97 + * Case 1 - color flips 98 + * 99 + * G g 100 + * / \ / \ 101 + * p u --> P U 102 + * / / 103 + * n N 104 + * 105 + * However, since g's parent might be red, and 106 + * 4) does not allow this, we need to recurse 107 + * at g. 108 + */ 109 + rb_set_parent_color(tmp, gparent, RB_BLACK); 110 + rb_set_parent_color(parent, gparent, RB_BLACK); 111 + node = gparent; 112 + parent = rb_parent(node); 113 + rb_set_parent_color(node, parent, RB_RED); 114 + continue; 115 + } 116 + 117 + tmp = parent->rb_right; 118 + if (node == tmp) { 119 + /* 120 + * Case 2 - left rotate at parent 121 + * 122 + * G G 123 + * / \ / \ 124 + * p U --> n U 125 + * \ / 126 + * n p 127 + * 128 + * This still leaves us in violation of 4), the 129 + * continuation into Case 3 will fix that. 
130 + */ 131 + parent->rb_right = tmp = node->rb_left; 132 + node->rb_left = parent; 133 + if (tmp) 134 + rb_set_parent_color(tmp, parent, 135 + RB_BLACK); 136 + rb_set_parent_color(parent, node, RB_RED); 137 + augment_rotate(parent, node); 138 + parent = node; 139 + tmp = node->rb_right; 140 + } 141 + 142 + /* 143 + * Case 3 - right rotate at gparent 144 + * 145 + * G P 146 + * / \ / \ 147 + * p U --> n g 148 + * / \ 149 + * n U 150 + */ 151 + gparent->rb_left = tmp; /* == parent->rb_right */ 152 + parent->rb_right = gparent; 153 + if (tmp) 154 + rb_set_parent_color(tmp, gparent, RB_BLACK); 155 + __rb_rotate_set_parents(gparent, parent, root, RB_RED); 156 + augment_rotate(gparent, parent); 157 + break; 158 + } else { 159 + tmp = gparent->rb_left; 160 + if (tmp && rb_is_red(tmp)) { 161 + /* Case 1 - color flips */ 162 + rb_set_parent_color(tmp, gparent, RB_BLACK); 163 + rb_set_parent_color(parent, gparent, RB_BLACK); 164 + node = gparent; 165 + parent = rb_parent(node); 166 + rb_set_parent_color(node, parent, RB_RED); 167 + continue; 168 + } 169 + 170 + tmp = parent->rb_left; 171 + if (node == tmp) { 172 + /* Case 2 - right rotate at parent */ 173 + parent->rb_left = tmp = node->rb_right; 174 + node->rb_right = parent; 175 + if (tmp) 176 + rb_set_parent_color(tmp, parent, 177 + RB_BLACK); 178 + rb_set_parent_color(parent, node, RB_RED); 179 + augment_rotate(parent, node); 180 + parent = node; 181 + tmp = node->rb_left; 182 + } 183 + 184 + /* Case 3 - left rotate at gparent */ 185 + gparent->rb_right = tmp; /* == parent->rb_left */ 186 + parent->rb_left = gparent; 187 + if (tmp) 188 + rb_set_parent_color(tmp, gparent, RB_BLACK); 189 + __rb_rotate_set_parents(gparent, parent, root, RB_RED); 190 + augment_rotate(gparent, parent); 191 + break; 192 + } 193 + } 194 + } 195 + 196 + __always_inline void 197 + __rb_erase_color(struct rb_node *parent, struct rb_root *root, 198 + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)) 199 + { 200 + struct rb_node 
*node = NULL, *sibling, *tmp1, *tmp2; 201 + 202 + while (true) { 203 + /* 204 + * Loop invariants: 205 + * - node is black (or NULL on first iteration) 206 + * - node is not the root (parent is not NULL) 207 + * - All leaf paths going through parent and node have a 208 + * black node count that is 1 lower than other leaf paths. 209 + */ 210 + sibling = parent->rb_right; 211 + if (node != sibling) { /* node == parent->rb_left */ 212 + if (rb_is_red(sibling)) { 213 + /* 214 + * Case 1 - left rotate at parent 215 + * 216 + * P S 217 + * / \ / \ 218 + * N s --> p Sr 219 + * / \ / \ 220 + * Sl Sr N Sl 221 + */ 222 + parent->rb_right = tmp1 = sibling->rb_left; 223 + sibling->rb_left = parent; 224 + rb_set_parent_color(tmp1, parent, RB_BLACK); 225 + __rb_rotate_set_parents(parent, sibling, root, 226 + RB_RED); 227 + augment_rotate(parent, sibling); 228 + sibling = tmp1; 229 + } 230 + tmp1 = sibling->rb_right; 231 + if (!tmp1 || rb_is_black(tmp1)) { 232 + tmp2 = sibling->rb_left; 233 + if (!tmp2 || rb_is_black(tmp2)) { 234 + /* 235 + * Case 2 - sibling color flip 236 + * (p could be either color here) 237 + * 238 + * (p) (p) 239 + * / \ / \ 240 + * N S --> N s 241 + * / \ / \ 242 + * Sl Sr Sl Sr 243 + * 244 + * This leaves us violating 5) which 245 + * can be fixed by flipping p to black 246 + * if it was red, or by recursing at p. 247 + * p is red when coming from Case 1. 
248 + */ 249 + rb_set_parent_color(sibling, parent, 250 + RB_RED); 251 + if (rb_is_red(parent)) 252 + rb_set_black(parent); 253 + else { 254 + node = parent; 255 + parent = rb_parent(node); 256 + if (parent) 257 + continue; 258 + } 259 + break; 260 + } 261 + /* 262 + * Case 3 - right rotate at sibling 263 + * (p could be either color here) 264 + * 265 + * (p) (p) 266 + * / \ / \ 267 + * N S --> N Sl 268 + * / \ \ 269 + * sl Sr s 270 + * \ 271 + * Sr 272 + */ 273 + sibling->rb_left = tmp1 = tmp2->rb_right; 274 + tmp2->rb_right = sibling; 275 + parent->rb_right = tmp2; 276 + if (tmp1) 277 + rb_set_parent_color(tmp1, sibling, 278 + RB_BLACK); 279 + augment_rotate(sibling, tmp2); 280 + tmp1 = sibling; 281 + sibling = tmp2; 282 + } 283 + /* 284 + * Case 4 - left rotate at parent + color flips 285 + * (p and sl could be either color here. 286 + * After rotation, p becomes black, s acquires 287 + * p's color, and sl keeps its color) 288 + * 289 + * (p) (s) 290 + * / \ / \ 291 + * N S --> P Sr 292 + * / \ / \ 293 + * (sl) sr N (sl) 294 + */ 295 + parent->rb_right = tmp2 = sibling->rb_left; 296 + sibling->rb_left = parent; 297 + rb_set_parent_color(tmp1, sibling, RB_BLACK); 298 + if (tmp2) 299 + rb_set_parent(tmp2, parent); 300 + __rb_rotate_set_parents(parent, sibling, root, 301 + RB_BLACK); 302 + augment_rotate(parent, sibling); 303 + break; 304 + } else { 305 + sibling = parent->rb_left; 306 + if (rb_is_red(sibling)) { 307 + /* Case 1 - right rotate at parent */ 308 + parent->rb_left = tmp1 = sibling->rb_right; 309 + sibling->rb_right = parent; 310 + rb_set_parent_color(tmp1, parent, RB_BLACK); 311 + __rb_rotate_set_parents(parent, sibling, root, 312 + RB_RED); 313 + augment_rotate(parent, sibling); 314 + sibling = tmp1; 315 + } 316 + tmp1 = sibling->rb_left; 317 + if (!tmp1 || rb_is_black(tmp1)) { 318 + tmp2 = sibling->rb_right; 319 + if (!tmp2 || rb_is_black(tmp2)) { 320 + /* Case 2 - sibling color flip */ 321 + rb_set_parent_color(sibling, parent, 322 + RB_RED); 323 + 
if (rb_is_red(parent)) 324 + rb_set_black(parent); 325 + else { 326 + node = parent; 327 + parent = rb_parent(node); 328 + if (parent) 329 + continue; 330 + } 331 + break; 332 + } 333 + /* Case 3 - left rotate at sibling */ 334 + sibling->rb_right = tmp1 = tmp2->rb_left; 335 + tmp2->rb_left = sibling; 336 + parent->rb_left = tmp2; 337 + if (tmp1) 338 + rb_set_parent_color(tmp1, sibling, 339 + RB_BLACK); 340 + augment_rotate(sibling, tmp2); 341 + tmp1 = sibling; 342 + sibling = tmp2; 343 + } 344 + /* Case 4 - right rotate at parent + color flips */ 345 + parent->rb_left = tmp2 = sibling->rb_right; 346 + sibling->rb_right = parent; 347 + rb_set_parent_color(tmp1, sibling, RB_BLACK); 348 + if (tmp2) 349 + rb_set_parent(tmp2, parent); 350 + __rb_rotate_set_parents(parent, sibling, root, 351 + RB_BLACK); 352 + augment_rotate(parent, sibling); 353 + break; 354 + } 355 + } 356 + } 357 + EXPORT_SYMBOL(__rb_erase_color); 358 + 359 + /* 360 + * Non-augmented rbtree manipulation functions. 361 + * 362 + * We use dummy augmented callbacks here, and have the compiler optimize them 363 + * out of the rb_insert_color() and rb_erase() function definitions.
364 + */ 365 + 366 + static inline void dummy_propagate(struct rb_node *node, struct rb_node *stop) {} 367 + static inline void dummy_copy(struct rb_node *old, struct rb_node *new) {} 368 + static inline void dummy_rotate(struct rb_node *old, struct rb_node *new) {} 369 + 370 + static const struct rb_augment_callbacks dummy_callbacks = { 371 + dummy_propagate, dummy_copy, dummy_rotate 372 + }; 72 373 73 374 void rb_insert_color(struct rb_node *node, struct rb_root *root) 74 375 { 75 - struct rb_node *parent, *gparent; 76 - 77 - while ((parent = rb_parent(node)) && rb_is_red(parent)) 78 - { 79 - gparent = rb_parent(parent); 80 - 81 - if (parent == gparent->rb_left) 82 - { 83 - { 84 - register struct rb_node *uncle = gparent->rb_right; 85 - if (uncle && rb_is_red(uncle)) 86 - { 87 - rb_set_black(uncle); 88 - rb_set_black(parent); 89 - rb_set_red(gparent); 90 - node = gparent; 91 - continue; 92 - } 93 - } 94 - 95 - if (parent->rb_right == node) 96 - { 97 - register struct rb_node *tmp; 98 - __rb_rotate_left(parent, root); 99 - tmp = parent; 100 - parent = node; 101 - node = tmp; 102 - } 103 - 104 - rb_set_black(parent); 105 - rb_set_red(gparent); 106 - __rb_rotate_right(gparent, root); 107 - } else { 108 - { 109 - register struct rb_node *uncle = gparent->rb_left; 110 - if (uncle && rb_is_red(uncle)) 111 - { 112 - rb_set_black(uncle); 113 - rb_set_black(parent); 114 - rb_set_red(gparent); 115 - node = gparent; 116 - continue; 117 - } 118 - } 119 - 120 - if (parent->rb_left == node) 121 - { 122 - register struct rb_node *tmp; 123 - __rb_rotate_right(parent, root); 124 - tmp = parent; 125 - parent = node; 126 - node = tmp; 127 - } 128 - 129 - rb_set_black(parent); 130 - rb_set_red(gparent); 131 - __rb_rotate_left(gparent, root); 132 - } 133 - } 134 - 135 - rb_set_black(root->rb_node); 376 + __rb_insert(node, root, dummy_rotate); 136 377 } 137 378 EXPORT_SYMBOL(rb_insert_color); 138 379 139 - static void __rb_erase_color(struct rb_node *node, struct rb_node *parent, 140 
- struct rb_root *root) 141 - { 142 - struct rb_node *other; 143 - 144 - while ((!node || rb_is_black(node)) && node != root->rb_node) 145 - { 146 - if (parent->rb_left == node) 147 - { 148 - other = parent->rb_right; 149 - if (rb_is_red(other)) 150 - { 151 - rb_set_black(other); 152 - rb_set_red(parent); 153 - __rb_rotate_left(parent, root); 154 - other = parent->rb_right; 155 - } 156 - if ((!other->rb_left || rb_is_black(other->rb_left)) && 157 - (!other->rb_right || rb_is_black(other->rb_right))) 158 - { 159 - rb_set_red(other); 160 - node = parent; 161 - parent = rb_parent(node); 162 - } 163 - else 164 - { 165 - if (!other->rb_right || rb_is_black(other->rb_right)) 166 - { 167 - rb_set_black(other->rb_left); 168 - rb_set_red(other); 169 - __rb_rotate_right(other, root); 170 - other = parent->rb_right; 171 - } 172 - rb_set_color(other, rb_color(parent)); 173 - rb_set_black(parent); 174 - rb_set_black(other->rb_right); 175 - __rb_rotate_left(parent, root); 176 - node = root->rb_node; 177 - break; 178 - } 179 - } 180 - else 181 - { 182 - other = parent->rb_left; 183 - if (rb_is_red(other)) 184 - { 185 - rb_set_black(other); 186 - rb_set_red(parent); 187 - __rb_rotate_right(parent, root); 188 - other = parent->rb_left; 189 - } 190 - if ((!other->rb_left || rb_is_black(other->rb_left)) && 191 - (!other->rb_right || rb_is_black(other->rb_right))) 192 - { 193 - rb_set_red(other); 194 - node = parent; 195 - parent = rb_parent(node); 196 - } 197 - else 198 - { 199 - if (!other->rb_left || rb_is_black(other->rb_left)) 200 - { 201 - rb_set_black(other->rb_right); 202 - rb_set_red(other); 203 - __rb_rotate_left(other, root); 204 - other = parent->rb_left; 205 - } 206 - rb_set_color(other, rb_color(parent)); 207 - rb_set_black(parent); 208 - rb_set_black(other->rb_left); 209 - __rb_rotate_right(parent, root); 210 - node = root->rb_node; 211 - break; 212 - } 213 - } 214 - } 215 - if (node) 216 - rb_set_black(node); 217 - } 218 - 219 380 void rb_erase(struct rb_node *node, 
struct rb_root *root) 220 381 { 221 - struct rb_node *child, *parent; 222 - int color; 223 - 224 - if (!node->rb_left) 225 - child = node->rb_right; 226 - else if (!node->rb_right) 227 - child = node->rb_left; 228 - else 229 - { 230 - struct rb_node *old = node, *left; 231 - 232 - node = node->rb_right; 233 - while ((left = node->rb_left) != NULL) 234 - node = left; 235 - 236 - if (rb_parent(old)) { 237 - if (rb_parent(old)->rb_left == old) 238 - rb_parent(old)->rb_left = node; 239 - else 240 - rb_parent(old)->rb_right = node; 241 - } else 242 - root->rb_node = node; 243 - 244 - child = node->rb_right; 245 - parent = rb_parent(node); 246 - color = rb_color(node); 247 - 248 - if (parent == old) { 249 - parent = node; 250 - } else { 251 - if (child) 252 - rb_set_parent(child, parent); 253 - parent->rb_left = child; 254 - 255 - node->rb_right = old->rb_right; 256 - rb_set_parent(old->rb_right, node); 257 - } 258 - 259 - node->rb_parent_color = old->rb_parent_color; 260 - node->rb_left = old->rb_left; 261 - rb_set_parent(old->rb_left, node); 262 - 263 - goto color; 264 - } 265 - 266 - parent = rb_parent(node); 267 - color = rb_color(node); 268 - 269 - if (child) 270 - rb_set_parent(child, parent); 271 - if (parent) 272 - { 273 - if (parent->rb_left == node) 274 - parent->rb_left = child; 275 - else 276 - parent->rb_right = child; 277 - } 278 - else 279 - root->rb_node = child; 280 - 281 - color: 282 - if (color == RB_BLACK) 283 - __rb_erase_color(child, parent, root); 382 + rb_erase_augmented(node, root, &dummy_callbacks); 284 383 } 285 384 EXPORT_SYMBOL(rb_erase); 286 385 287 - static void rb_augment_path(struct rb_node *node, rb_augment_f func, void *data) 288 - { 289 - struct rb_node *parent; 290 - 291 - up: 292 - func(node, data); 293 - parent = rb_parent(node); 294 - if (!parent) 295 - return; 296 - 297 - if (node == parent->rb_left && parent->rb_right) 298 - func(parent->rb_right, data); 299 - else if (parent->rb_left) 300 - func(parent->rb_left, data); 301 - 302 
- node = parent; 303 - goto up; 304 - } 305 - 306 386 /* 307 - * after inserting @node into the tree, update the tree to account for 308 - * both the new entry and any damage done by rebalance 387 + * Augmented rbtree manipulation functions. 388 + * 389 + * This instantiates the same __always_inline functions as in the non-augmented 390 + * case, but this time with user-defined callbacks. 309 391 */ 310 - void rb_augment_insert(struct rb_node *node, rb_augment_f func, void *data) 392 + 393 + void __rb_insert_augmented(struct rb_node *node, struct rb_root *root, 394 + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)) 311 395 { 312 - if (node->rb_left) 313 - node = node->rb_left; 314 - else if (node->rb_right) 315 - node = node->rb_right; 316 - 317 - rb_augment_path(node, func, data); 396 + __rb_insert(node, root, augment_rotate); 318 397 } 319 - EXPORT_SYMBOL(rb_augment_insert); 320 - 321 - /* 322 - * before removing the node, find the deepest node on the rebalance path 323 - * that will still be there after @node gets removed 324 - */ 325 - struct rb_node *rb_augment_erase_begin(struct rb_node *node) 326 - { 327 - struct rb_node *deepest; 328 - 329 - if (!node->rb_right && !node->rb_left) 330 - deepest = rb_parent(node); 331 - else if (!node->rb_right) 332 - deepest = node->rb_left; 333 - else if (!node->rb_left) 334 - deepest = node->rb_right; 335 - else { 336 - deepest = rb_next(node); 337 - if (deepest->rb_right) 338 - deepest = deepest->rb_right; 339 - else if (rb_parent(deepest) != node) 340 - deepest = rb_parent(deepest); 341 - } 342 - 343 - return deepest; 344 - } 345 - EXPORT_SYMBOL(rb_augment_erase_begin); 346 - 347 - /* 348 - * after removal, update the tree to account for the removed entry 349 - * and any rebalance damage. 
350 - */ 351 - void rb_augment_erase_end(struct rb_node *node, rb_augment_f func, void *data) 352 - { 353 - if (node) 354 - rb_augment_path(node, func, data); 355 - } 356 - EXPORT_SYMBOL(rb_augment_erase_end); 398 + EXPORT_SYMBOL(__rb_insert_augmented); 357 399 358 400 /* 359 401 * This function returns the first node (in sort order) of the tree. ··· 431 387 { 432 388 struct rb_node *parent; 433 389 434 - if (rb_parent(node) == node) 390 + if (RB_EMPTY_NODE(node)) 435 391 return NULL; 436 392 437 - /* If we have a right-hand child, go down and then left as far 438 - as we can. */ 393 + /* 394 + * If we have a right-hand child, go down and then left as far 395 + * as we can. 396 + */ 439 397 if (node->rb_right) { 440 398 node = node->rb_right; 441 399 while (node->rb_left) ··· 445 399 return (struct rb_node *)node; 446 400 } 447 401 448 - /* No right-hand children. Everything down and left is 449 - smaller than us, so any 'next' node must be in the general 450 - direction of our parent. Go up the tree; any time the 451 - ancestor is a right-hand child of its parent, keep going 452 - up. First time it's a left-hand child of its parent, said 453 - parent is our 'next' node. */ 402 + /* 403 + * No right-hand children. Everything down and left is smaller than us, 404 + * so any 'next' node must be in the general direction of our parent. 405 + * Go up the tree; any time the ancestor is a right-hand child of its 406 + * parent, keep going up. First time it's a left-hand child of its 407 + * parent, said parent is our 'next' node. 408 + */ 454 409 while ((parent = rb_parent(node)) && node == parent->rb_right) 455 410 node = parent; 456 411 ··· 463 416 { 464 417 struct rb_node *parent; 465 418 466 - if (rb_parent(node) == node) 419 + if (RB_EMPTY_NODE(node)) 467 420 return NULL; 468 421 469 - /* If we have a left-hand child, go down and then right as far 470 - as we can. */ 422 + /* 423 + * If we have a left-hand child, go down and then right as far 424 + * as we can. 
425 + */ 471 426 if (node->rb_left) { 472 427 node = node->rb_left; 473 428 while (node->rb_right) ··· 477 428 return (struct rb_node *)node; 478 429 } 479 430 480 - /* No left-hand children. Go up till we find an ancestor which 481 - is a right-hand child of its parent */ 431 + /* 432 + * No left-hand children. Go up till we find an ancestor which 433 + * is a right-hand child of its parent. 434 + */ 482 435 while ((parent = rb_parent(node)) && node == parent->rb_left) 483 436 node = parent; 484 437 ··· 494 443 struct rb_node *parent = rb_parent(victim); 495 444 496 445 /* Set the surrounding nodes to point to the replacement */ 497 - if (parent) { 498 - if (victim == parent->rb_left) 499 - parent->rb_left = new; 500 - else 501 - parent->rb_right = new; 502 - } else { 503 - root->rb_node = new; 504 - } 446 + __rb_change_child(victim, new, parent, root); 505 447 if (victim->rb_left) 506 448 rb_set_parent(victim->rb_left, new); 507 449 if (victim->rb_right)
+234
lib/rbtree_test.c
··· 1 + #include <linux/module.h> 2 + #include <linux/rbtree_augmented.h> 3 + #include <linux/random.h> 4 + #include <asm/timex.h> 5 + 6 + #define NODES 100 7 + #define PERF_LOOPS 100000 8 + #define CHECK_LOOPS 100 9 + 10 + struct test_node { 11 + struct rb_node rb; 12 + u32 key; 13 + 14 + /* following fields used for testing augmented rbtree functionality */ 15 + u32 val; 16 + u32 augmented; 17 + }; 18 + 19 + static struct rb_root root = RB_ROOT; 20 + static struct test_node nodes[NODES]; 21 + 22 + static struct rnd_state rnd; 23 + 24 + static void insert(struct test_node *node, struct rb_root *root) 25 + { 26 + struct rb_node **new = &root->rb_node, *parent = NULL; 27 + u32 key = node->key; 28 + 29 + while (*new) { 30 + parent = *new; 31 + if (key < rb_entry(parent, struct test_node, rb)->key) 32 + new = &parent->rb_left; 33 + else 34 + new = &parent->rb_right; 35 + } 36 + 37 + rb_link_node(&node->rb, parent, new); 38 + rb_insert_color(&node->rb, root); 39 + } 40 + 41 + static inline void erase(struct test_node *node, struct rb_root *root) 42 + { 43 + rb_erase(&node->rb, root); 44 + } 45 + 46 + static inline u32 augment_recompute(struct test_node *node) 47 + { 48 + u32 max = node->val, child_augmented; 49 + if (node->rb.rb_left) { 50 + child_augmented = rb_entry(node->rb.rb_left, struct test_node, 51 + rb)->augmented; 52 + if (max < child_augmented) 53 + max = child_augmented; 54 + } 55 + if (node->rb.rb_right) { 56 + child_augmented = rb_entry(node->rb.rb_right, struct test_node, 57 + rb)->augmented; 58 + if (max < child_augmented) 59 + max = child_augmented; 60 + } 61 + return max; 62 + } 63 + 64 + RB_DECLARE_CALLBACKS(static, augment_callbacks, struct test_node, rb, 65 + u32, augmented, augment_recompute) 66 + 67 + static void insert_augmented(struct test_node *node, struct rb_root *root) 68 + { 69 + struct rb_node **new = &root->rb_node, *rb_parent = NULL; 70 + u32 key = node->key; 71 + u32 val = node->val; 72 + struct test_node *parent; 73 + 74 + while 
(*new) { 75 + rb_parent = *new; 76 + parent = rb_entry(rb_parent, struct test_node, rb); 77 + if (parent->augmented < val) 78 + parent->augmented = val; 79 + if (key < parent->key) 80 + new = &parent->rb.rb_left; 81 + else 82 + new = &parent->rb.rb_right; 83 + } 84 + 85 + node->augmented = val; 86 + rb_link_node(&node->rb, rb_parent, new); 87 + rb_insert_augmented(&node->rb, root, &augment_callbacks); 88 + } 89 + 90 + static void erase_augmented(struct test_node *node, struct rb_root *root) 91 + { 92 + rb_erase_augmented(&node->rb, root, &augment_callbacks); 93 + } 94 + 95 + static void init(void) 96 + { 97 + int i; 98 + for (i = 0; i < NODES; i++) { 99 + nodes[i].key = prandom32(&rnd); 100 + nodes[i].val = prandom32(&rnd); 101 + } 102 + } 103 + 104 + static bool is_red(struct rb_node *rb) 105 + { 106 + return !(rb->__rb_parent_color & 1); 107 + } 108 + 109 + static int black_path_count(struct rb_node *rb) 110 + { 111 + int count; 112 + for (count = 0; rb; rb = rb_parent(rb)) 113 + count += !is_red(rb); 114 + return count; 115 + } 116 + 117 + static void check(int nr_nodes) 118 + { 119 + struct rb_node *rb; 120 + int count = 0; 121 + int blacks; 122 + u32 prev_key = 0; 123 + 124 + for (rb = rb_first(&root); rb; rb = rb_next(rb)) { 125 + struct test_node *node = rb_entry(rb, struct test_node, rb); 126 + WARN_ON_ONCE(node->key < prev_key); 127 + WARN_ON_ONCE(is_red(rb) && 128 + (!rb_parent(rb) || is_red(rb_parent(rb)))); 129 + if (!count) 130 + blacks = black_path_count(rb); 131 + else 132 + WARN_ON_ONCE((!rb->rb_left || !rb->rb_right) && 133 + blacks != black_path_count(rb)); 134 + prev_key = node->key; 135 + count++; 136 + } 137 + WARN_ON_ONCE(count != nr_nodes); 138 + } 139 + 140 + static void check_augmented(int nr_nodes) 141 + { 142 + struct rb_node *rb; 143 + 144 + check(nr_nodes); 145 + for (rb = rb_first(&root); rb; rb = rb_next(rb)) { 146 + struct test_node *node = rb_entry(rb, struct test_node, rb); 147 + WARN_ON_ONCE(node->augmented != 
augment_recompute(node)); 148 + } 149 + } 150 + 151 + static int rbtree_test_init(void) 152 + { 153 + int i, j; 154 + cycles_t time1, time2, time; 155 + 156 + printk(KERN_ALERT "rbtree testing"); 157 + 158 + prandom32_seed(&rnd, 3141592653589793238ULL); 159 + init(); 160 + 161 + time1 = get_cycles(); 162 + 163 + for (i = 0; i < PERF_LOOPS; i++) { 164 + for (j = 0; j < NODES; j++) 165 + insert(nodes + j, &root); 166 + for (j = 0; j < NODES; j++) 167 + erase(nodes + j, &root); 168 + } 169 + 170 + time2 = get_cycles(); 171 + time = time2 - time1; 172 + 173 + time = div_u64(time, PERF_LOOPS); 174 + printk(" -> %llu cycles\n", (unsigned long long)time); 175 + 176 + for (i = 0; i < CHECK_LOOPS; i++) { 177 + init(); 178 + for (j = 0; j < NODES; j++) { 179 + check(j); 180 + insert(nodes + j, &root); 181 + } 182 + for (j = 0; j < NODES; j++) { 183 + check(NODES - j); 184 + erase(nodes + j, &root); 185 + } 186 + check(0); 187 + } 188 + 189 + printk(KERN_ALERT "augmented rbtree testing"); 190 + 191 + init(); 192 + 193 + time1 = get_cycles(); 194 + 195 + for (i = 0; i < PERF_LOOPS; i++) { 196 + for (j = 0; j < NODES; j++) 197 + insert_augmented(nodes + j, &root); 198 + for (j = 0; j < NODES; j++) 199 + erase_augmented(nodes + j, &root); 200 + } 201 + 202 + time2 = get_cycles(); 203 + time = time2 - time1; 204 + 205 + time = div_u64(time, PERF_LOOPS); 206 + printk(" -> %llu cycles\n", (unsigned long long)time); 207 + 208 + for (i = 0; i < CHECK_LOOPS; i++) { 209 + init(); 210 + for (j = 0; j < NODES; j++) { 211 + check_augmented(j); 212 + insert_augmented(nodes + j, &root); 213 + } 214 + for (j = 0; j < NODES; j++) { 215 + check_augmented(NODES - j); 216 + erase_augmented(nodes + j, &root); 217 + } 218 + check_augmented(0); 219 + } 220 + 221 + return -EAGAIN; /* Fail will directly unload the module */ 222 + } 223 + 224 + static void rbtree_test_exit(void) 225 + { 226 + printk(KERN_ALERT "test exit\n"); 227 + } 228 + 229 + module_init(rbtree_test_init) 230 + 
module_exit(rbtree_test_exit) 231 + 232 + MODULE_LICENSE("GPL"); 233 + MODULE_AUTHOR("Michel Lespinasse"); 234 + MODULE_DESCRIPTION("Red Black Tree test");
+2 -1
mm/Kconfig
··· 191 191 # support for memory compaction 192 192 config COMPACTION 193 193 bool "Allow for memory compaction" 194 + def_bool y 194 195 select MIGRATION 195 196 depends on MMU 196 197 help ··· 319 318 320 319 config TRANSPARENT_HUGEPAGE 321 320 bool "Transparent Hugepage Support" 322 - depends on X86 && MMU 321 + depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE 323 322 select COMPACTION 324 323 help 325 324 Transparent Hugepages allows the kernel to use huge pages and
+2 -2
mm/Makefile
··· 14 14 obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ 15 15 maccess.o page_alloc.o page-writeback.o \ 16 16 readahead.o swap.o truncate.o vmscan.o shmem.o \ 17 - prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ 17 + util.o mmzone.o vmstat.o backing-dev.o \ 18 18 mm_init.o mmu_context.o percpu.o slab_common.o \ 19 - compaction.o $(mmu-y) 19 + compaction.o interval_tree.o $(mmu-y) 20 20 21 21 obj-y += init-mm.o 22 22
+9 -1
mm/bootmem.c
··· 198 198 int order = ilog2(BITS_PER_LONG); 199 199 200 200 __free_pages_bootmem(pfn_to_page(start), order); 201 + fixup_zone_present_pages(page_to_nid(pfn_to_page(start)), 202 + start, start + BITS_PER_LONG); 201 203 count += BITS_PER_LONG; 202 204 start += BITS_PER_LONG; 203 205 } else { ··· 210 208 if (vec & 1) { 211 209 page = pfn_to_page(start + off); 212 210 __free_pages_bootmem(page, 0); 211 + fixup_zone_present_pages( 212 + page_to_nid(page), 213 + start + off, start + off + 1); 213 214 count++; 214 215 } 215 216 vec >>= 1; ··· 226 221 pages = bdata->node_low_pfn - bdata->node_min_pfn; 227 222 pages = bootmem_bootmap_pages(pages); 228 223 count += pages; 229 - while (pages--) 224 + while (pages--) { 225 + fixup_zone_present_pages(page_to_nid(page), 226 + page_to_pfn(page), page_to_pfn(page) + 1); 230 227 __free_pages_bootmem(page++, 0); 228 + } 231 229 232 230 bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count); 233 231
+386 -176
mm/compaction.c
··· 50 50 return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE; 51 51 } 52 52 53 + #ifdef CONFIG_COMPACTION 54 + /* Returns true if the pageblock should be scanned for pages to isolate. */ 55 + static inline bool isolation_suitable(struct compact_control *cc, 56 + struct page *page) 57 + { 58 + if (cc->ignore_skip_hint) 59 + return true; 60 + 61 + return !get_pageblock_skip(page); 62 + } 63 + 64 + /* 65 + * This function is called to clear all cached information on pageblocks that 66 + * should be skipped for page isolation when the migrate and free page scanner 67 + * meet. 68 + */ 69 + static void __reset_isolation_suitable(struct zone *zone) 70 + { 71 + unsigned long start_pfn = zone->zone_start_pfn; 72 + unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages; 73 + unsigned long pfn; 74 + 75 + zone->compact_cached_migrate_pfn = start_pfn; 76 + zone->compact_cached_free_pfn = end_pfn; 77 + zone->compact_blockskip_flush = false; 78 + 79 + /* Walk the zone and mark every pageblock as suitable for isolation */ 80 + for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) { 81 + struct page *page; 82 + 83 + cond_resched(); 84 + 85 + if (!pfn_valid(pfn)) 86 + continue; 87 + 88 + page = pfn_to_page(pfn); 89 + if (zone != page_zone(page)) 90 + continue; 91 + 92 + clear_pageblock_skip(page); 93 + } 94 + } 95 + 96 + void reset_isolation_suitable(pg_data_t *pgdat) 97 + { 98 + int zoneid; 99 + 100 + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { 101 + struct zone *zone = &pgdat->node_zones[zoneid]; 102 + if (!populated_zone(zone)) 103 + continue; 104 + 105 + /* Only flush if a full compaction finished recently */ 106 + if (zone->compact_blockskip_flush) 107 + __reset_isolation_suitable(zone); 108 + } 109 + } 110 + 111 + /* 112 + * If no pages were isolated then mark this pageblock to be skipped in the 113 + * future. The information is later cleared by __reset_isolation_suitable(). 
114 + */ 115 + static void update_pageblock_skip(struct compact_control *cc, 116 + struct page *page, unsigned long nr_isolated, 117 + bool migrate_scanner) 118 + { 119 + struct zone *zone = cc->zone; 120 + if (!page) 121 + return; 122 + 123 + if (!nr_isolated) { 124 + unsigned long pfn = page_to_pfn(page); 125 + set_pageblock_skip(page); 126 + 127 + /* Update where compaction should restart */ 128 + if (migrate_scanner) { 129 + if (!cc->finished_update_migrate && 130 + pfn > zone->compact_cached_migrate_pfn) 131 + zone->compact_cached_migrate_pfn = pfn; 132 + } else { 133 + if (!cc->finished_update_free && 134 + pfn < zone->compact_cached_free_pfn) 135 + zone->compact_cached_free_pfn = pfn; 136 + } 137 + } 138 + } 139 + #else 140 + static inline bool isolation_suitable(struct compact_control *cc, 141 + struct page *page) 142 + { 143 + return true; 144 + } 145 + 146 + static void update_pageblock_skip(struct compact_control *cc, 147 + struct page *page, unsigned long nr_isolated, 148 + bool migrate_scanner) 149 + { 150 + } 151 + #endif /* CONFIG_COMPACTION */ 152 + 153 + static inline bool should_release_lock(spinlock_t *lock) 154 + { 155 + return need_resched() || spin_is_contended(lock); 156 + } 157 + 53 158 /* 54 159 * Compaction requires the taking of some coarse locks that are potentially 55 160 * very heavily contended. 
Check if the process needs to be scheduled or ··· 167 62 static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags, 168 63 bool locked, struct compact_control *cc) 169 64 { 170 - if (need_resched() || spin_is_contended(lock)) { 65 + if (should_release_lock(lock)) { 171 66 if (locked) { 172 67 spin_unlock_irqrestore(lock, *flags); 173 68 locked = false; ··· 175 70 176 71 /* async aborts if taking too long or contended */ 177 72 if (!cc->sync) { 178 - if (cc->contended) 179 - *cc->contended = true; 73 + cc->contended = true; 180 74 return false; 181 75 } 182 76 183 77 cond_resched(); 184 - if (fatal_signal_pending(current)) 185 - return false; 186 78 } 187 79 188 80 if (!locked) ··· 193 91 return compact_checklock_irqsave(lock, flags, false, cc); 194 92 } 195 93 94 + /* Returns true if the page is within a block suitable for migration to */ 95 + static bool suitable_migration_target(struct page *page) 96 + { 97 + int migratetype = get_pageblock_migratetype(page); 98 + 99 + /* Don't interfere with memory hot-remove or the min_free_kbytes blocks */ 100 + if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE) 101 + return false; 102 + 103 + /* If the page is a large free page, then allow migration */ 104 + if (PageBuddy(page) && page_order(page) >= pageblock_order) 105 + return true; 106 + 107 + /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */ 108 + if (migrate_async_suitable(migratetype)) 109 + return true; 110 + 111 + /* Otherwise skip the block */ 112 + return false; 113 + } 114 + 115 + static void compact_capture_page(struct compact_control *cc) 116 + { 117 + unsigned long flags; 118 + int mtype, mtype_low, mtype_high; 119 + 120 + if (!cc->page || *cc->page) 121 + return; 122 + 123 + /* 124 + * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP 125 + * regardless of the migratetype of the freelist it is captured from.
126 + * This is fine because the order for a high-order MIGRATE_MOVABLE 127 + * allocation is typically at least a pageblock size and overall 128 + * fragmentation is not impaired. Other allocation types must 129 + * capture pages from their own migratelist because otherwise they 130 + * could pollute other pageblocks like MIGRATE_MOVABLE with 131 + * difficult to move pages and making fragmentation worse overall. 132 + */ 133 + if (cc->migratetype == MIGRATE_MOVABLE) { 134 + mtype_low = 0; 135 + mtype_high = MIGRATE_PCPTYPES; 136 + } else { 137 + mtype_low = cc->migratetype; 138 + mtype_high = cc->migratetype + 1; 139 + } 140 + 141 + /* Speculatively examine the free lists without zone lock */ 142 + for (mtype = mtype_low; mtype < mtype_high; mtype++) { 143 + int order; 144 + for (order = cc->order; order < MAX_ORDER; order++) { 145 + struct page *page; 146 + struct free_area *area; 147 + area = &(cc->zone->free_area[order]); 148 + if (list_empty(&area->free_list[mtype])) 149 + continue; 150 + 151 + /* Take the lock and attempt capture of the page */ 152 + if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc)) 153 + return; 154 + if (!list_empty(&area->free_list[mtype])) { 155 + page = list_entry(area->free_list[mtype].next, 156 + struct page, lru); 157 + if (capture_free_page(page, cc->order, mtype)) { 158 + spin_unlock_irqrestore(&cc->zone->lock, 159 + flags); 160 + *cc->page = page; 161 + return; 162 + } 163 + } 164 + spin_unlock_irqrestore(&cc->zone->lock, flags); 165 + } 166 + } 167 + } 168 + 196 169 /* 197 170 * Isolate free pages onto a private freelist. Caller must hold zone->lock. 198 171 * If @strict is true, will abort returning 0 on any invalid PFNs or non-free 199 172 * pages inside of the pageblock (even though it may still end up isolating 200 173 * some pages). 
201 174 */ 202 - static unsigned long isolate_freepages_block(unsigned long blockpfn, 175 + static unsigned long isolate_freepages_block(struct compact_control *cc, 176 + unsigned long blockpfn, 203 177 unsigned long end_pfn, 204 178 struct list_head *freelist, 205 179 bool strict) 206 180 { 207 181 int nr_scanned = 0, total_isolated = 0; 208 - struct page *cursor; 182 + struct page *cursor, *valid_page = NULL; 183 + unsigned long nr_strict_required = end_pfn - blockpfn; 184 + unsigned long flags; 185 + bool locked = false; 209 186 210 187 cursor = pfn_to_page(blockpfn); 211 188 212 - /* Isolate free pages. This assumes the block is valid */ 189 + /* Isolate free pages. */ 213 190 for (; blockpfn < end_pfn; blockpfn++, cursor++) { 214 191 int isolated, i; 215 192 struct page *page = cursor; 216 193 217 - if (!pfn_valid_within(blockpfn)) { 218 - if (strict) 219 - return 0; 220 - continue; 221 - } 222 194 nr_scanned++; 223 - 224 - if (!PageBuddy(page)) { 225 - if (strict) 226 - return 0; 195 + if (!pfn_valid_within(blockpfn)) 227 196 continue; 228 - } 197 + if (!valid_page) 198 + valid_page = page; 199 + if (!PageBuddy(page)) 200 + continue; 201 + 202 + /* 203 + * The zone lock must be held to isolate freepages. 204 + * Unfortunately this is a very coarse lock and can be 205 + * heavily contended if there are parallel allocations 206 + * or parallel compactions. For async compaction do not 207 + * spin on the lock and we acquire the lock as late as 208 + * possible. 
209 + */ 210 + locked = compact_checklock_irqsave(&cc->zone->lock, &flags, 211 + locked, cc); 212 + if (!locked) 213 + break; 214 + 215 + /* Recheck this is a suitable migration target under lock */ 216 + if (!strict && !suitable_migration_target(page)) 217 + break; 218 + 219 + /* Recheck this is a buddy page under lock */ 220 + if (!PageBuddy(page)) 221 + continue; 229 222 230 223 /* Found a free page, break it into order-0 pages */ 231 224 isolated = split_free_page(page); 232 225 if (!isolated && strict) 233 - return 0; 226 + break; 234 227 total_isolated += isolated; 235 228 for (i = 0; i < isolated; i++) { 236 229 list_add(&page->lru, freelist); ··· 340 143 } 341 144 342 145 trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated); 146 + 147 + /* 148 + * If strict isolation is requested by CMA then check that all the 149 + * pages requested were isolated. If there were any failures, 0 is 150 + * returned and CMA will fail. 151 + */ 152 + if (strict && nr_strict_required != total_isolated) 153 + total_isolated = 0; 154 + 155 + if (locked) 156 + spin_unlock_irqrestore(&cc->zone->lock, flags); 157 + 158 + /* Update the pageblock-skip if the whole pageblock was scanned */ 159 + if (blockpfn == end_pfn) 160 + update_pageblock_skip(cc, valid_page, total_isolated, false); 161 + 343 162 return total_isolated; 344 163 } 345 164 ··· 373 160 * a free page). 
374 161 */ 375 162 unsigned long 376 - isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn) 163 + isolate_freepages_range(struct compact_control *cc, 164 + unsigned long start_pfn, unsigned long end_pfn) 377 165 { 378 - unsigned long isolated, pfn, block_end_pfn, flags; 379 - struct zone *zone = NULL; 166 + unsigned long isolated, pfn, block_end_pfn; 380 167 LIST_HEAD(freelist); 381 168 382 - if (pfn_valid(start_pfn)) 383 - zone = page_zone(pfn_to_page(start_pfn)); 384 - 385 169 for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) { 386 - if (!pfn_valid(pfn) || zone != page_zone(pfn_to_page(pfn))) 170 + if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn))) 387 171 break; 388 172 389 173 /* ··· 390 180 block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages); 391 181 block_end_pfn = min(block_end_pfn, end_pfn); 392 182 393 - spin_lock_irqsave(&zone->lock, flags); 394 - isolated = isolate_freepages_block(pfn, block_end_pfn, 183 + isolated = isolate_freepages_block(cc, pfn, block_end_pfn, 395 184 &freelist, true); 396 - spin_unlock_irqrestore(&zone->lock, flags); 397 185 398 186 /* 399 187 * In strict mode, isolate_freepages_block() returns 0 if ··· 461 253 * @cc: Compaction control structure. 462 254 * @low_pfn: The first PFN of the range. 463 255 * @end_pfn: The one-past-the-last PFN of the range. 256 + * @unevictable: true if it allows to isolate unevictable pages 464 257 * 465 258 * Isolate all pages that can be migrated from the range specified by 466 259 * [low_pfn, end_pfn). 
Returns zero if there is a fatal signal ··· 477 268 */ 478 269 unsigned long 479 270 isolate_migratepages_range(struct zone *zone, struct compact_control *cc, 480 - unsigned long low_pfn, unsigned long end_pfn) 271 + unsigned long low_pfn, unsigned long end_pfn, bool unevictable) 481 272 { 482 273 unsigned long last_pageblock_nr = 0, pageblock_nr; 483 274 unsigned long nr_scanned = 0, nr_isolated = 0; ··· 485 276 isolate_mode_t mode = 0; 486 277 struct lruvec *lruvec; 487 278 unsigned long flags; 488 - bool locked; 279 + bool locked = false; 280 + struct page *page = NULL, *valid_page = NULL; 489 281 490 282 /* 491 283 * Ensure that there are not too many pages isolated from the LRU ··· 506 296 507 297 /* Time to isolate some pages for migration */ 508 298 cond_resched(); 509 - spin_lock_irqsave(&zone->lru_lock, flags); 510 - locked = true; 511 299 for (; low_pfn < end_pfn; low_pfn++) { 512 - struct page *page; 513 - 514 300 /* give a chance to irqs before checking need_resched() */ 515 - if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) { 516 - spin_unlock_irqrestore(&zone->lru_lock, flags); 517 - locked = false; 301 + if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) { 302 + if (should_release_lock(&zone->lru_lock)) { 303 + spin_unlock_irqrestore(&zone->lru_lock, flags); 304 + locked = false; 305 + } 518 306 } 519 - 520 - /* Check if it is ok to still hold the lock */ 521 - locked = compact_checklock_irqsave(&zone->lru_lock, &flags, 522 - locked, cc); 523 - if (!locked) 524 - break; 525 307 526 308 /* 527 309 * migrate_pfn does not necessarily start aligned to a ··· 542 340 if (page_zone(page) != zone) 543 341 continue; 544 342 343 + if (!valid_page) 344 + valid_page = page; 345 + 346 + /* If isolation recently failed, do not retry */ 347 + pageblock_nr = low_pfn >> pageblock_order; 348 + if (!isolation_suitable(cc, page)) 349 + goto next_pageblock; 350 + 545 351 /* Skip if free */ 546 352 if (PageBuddy(page)) 547 353 continue; ··· 559 349 * migration is optimistic to see if 
the minimum amount of work 560 350 * satisfies the allocation 561 351 */ 562 - pageblock_nr = low_pfn >> pageblock_order; 563 352 if (!cc->sync && last_pageblock_nr != pageblock_nr && 564 353 !migrate_async_suitable(get_pageblock_migratetype(page))) { 565 - low_pfn += pageblock_nr_pages; 566 - low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1; 567 - last_pageblock_nr = pageblock_nr; 568 - continue; 354 + cc->finished_update_migrate = true; 355 + goto next_pageblock; 569 356 } 570 357 358 + /* Check may be lockless but that's ok as we recheck later */ 571 359 if (!PageLRU(page)) 572 360 continue; 573 361 574 362 /* 575 - * PageLRU is set, and lru_lock excludes isolation, 576 - * splitting and collapsing (collapsing has already 577 - * happened if PageLRU is set). 363 + * PageLRU is set. lru_lock normally excludes isolation 364 + * splitting and collapsing (collapsing has already happened 365 + * if PageLRU is set) but the lock is not necessarily taken 366 + * here and it is wasteful to take it just to check transhuge. 367 + * Check TransHuge without lock and skip the whole pageblock if 368 + * it's either a transhuge or hugetlbfs page, as calling 369 + * compound_order() without preventing THP from splitting the 370 + * page underneath us may return surprising results. 
578 371 */ 372 + if (PageTransHuge(page)) { 373 + if (!locked) 374 + goto next_pageblock; 375 + low_pfn += (1 << compound_order(page)) - 1; 376 + continue; 377 + } 378 + 379 + /* Check if it is ok to still hold the lock */ 380 + locked = compact_checklock_irqsave(&zone->lru_lock, &flags, 381 + locked, cc); 382 + if (!locked || fatal_signal_pending(current)) 383 + break; 384 + 385 + /* Recheck PageLRU and PageTransHuge under lock */ 386 + if (!PageLRU(page)) 387 + continue; 579 388 if (PageTransHuge(page)) { 580 389 low_pfn += (1 << compound_order(page)) - 1; 581 390 continue; ··· 602 373 603 374 if (!cc->sync) 604 375 mode |= ISOLATE_ASYNC_MIGRATE; 376 + 377 + if (unevictable) 378 + mode |= ISOLATE_UNEVICTABLE; 605 379 606 380 lruvec = mem_cgroup_page_lruvec(page, zone); 607 381 ··· 615 383 VM_BUG_ON(PageTransCompound(page)); 616 384 617 385 /* Successfully isolated */ 386 + cc->finished_update_migrate = true; 618 387 del_page_from_lru_list(page, lruvec, page_lru(page)); 619 388 list_add(&page->lru, migratelist); 620 389 cc->nr_migratepages++; ··· 626 393 ++low_pfn; 627 394 break; 628 395 } 396 + 397 + continue; 398 + 399 + next_pageblock: 400 + low_pfn += pageblock_nr_pages; 401 + low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1; 402 + last_pageblock_nr = pageblock_nr; 629 403 } 630 404 631 405 acct_isolated(zone, locked, cc); 632 406 633 407 if (locked) 634 408 spin_unlock_irqrestore(&zone->lru_lock, flags); 409 + 410 + /* Update the pageblock-skip if the whole pageblock was scanned */ 411 + if (low_pfn == end_pfn) 412 + update_pageblock_skip(cc, valid_page, nr_isolated, true); 635 413 636 414 trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated); 637 415 ··· 651 407 652 408 #endif /* CONFIG_COMPACTION || CONFIG_CMA */ 653 409 #ifdef CONFIG_COMPACTION 654 - 655 - /* Returns true if the page is within a block suitable for migration to */ 656 - static bool suitable_migration_target(struct page *page) 657 - { 658 - 659 - int migratetype = 
get_pageblock_migratetype(page); 660 - 661 - /* Don't interfere with memory hot-remove or the min_free_kbytes blocks */ 662 - if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE) 663 - return false; 664 - 665 - /* If the page is a large free page, then allow migration */ 666 - if (PageBuddy(page) && page_order(page) >= pageblock_order) 667 - return true; 668 - 669 - /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */ 670 - if (migrate_async_suitable(migratetype)) 671 - return true; 672 - 673 - /* Otherwise skip the block */ 674 - return false; 675 - } 676 - 677 - /* 678 - * Returns the start pfn of the last page block in a zone. This is the starting 679 - * point for full compaction of a zone. Compaction searches for free pages from 680 - * the end of each zone, while isolate_freepages_block scans forward inside each 681 - * page block. 682 - */ 683 - static unsigned long start_free_pfn(struct zone *zone) 684 - { 685 - unsigned long free_pfn; 686 - free_pfn = zone->zone_start_pfn + zone->spanned_pages; 687 - free_pfn &= ~(pageblock_nr_pages-1); 688 - return free_pfn; 689 - } 690 - 691 410 /* 692 411 * Based on information in the current compact_control, find blocks 693 412 * suitable for isolating free pages from and then isolate them. ··· 660 453 { 661 454 struct page *page; 662 455 unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn; 663 - unsigned long flags; 664 456 int nr_freepages = cc->nr_freepages; 665 457 struct list_head *freelist = &cc->freepages; 666 458 ··· 707 501 if (!suitable_migration_target(page)) 708 502 continue; 709 503 710 - /* 711 - * Found a block suitable for isolating free pages from. Now 712 - * we disabled interrupts, double check things are ok and 713 - * isolate the pages. 
This is to minimise the time IRQs 714 - * are disabled 715 - */ 716 - isolated = 0; 504 + /* If isolation recently failed, do not retry */ 505 + if (!isolation_suitable(cc, page)) 506 + continue; 717 507 718 - /* 719 - * The zone lock must be held to isolate freepages. This 720 - * unfortunately this is a very coarse lock and can be 721 - * heavily contended if there are parallel allocations 722 - * or parallel compactions. For async compaction do not 723 - * spin on the lock 724 - */ 725 - if (!compact_trylock_irqsave(&zone->lock, &flags, cc)) 726 - break; 727 - if (suitable_migration_target(page)) { 728 - end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn); 729 - isolated = isolate_freepages_block(pfn, end_pfn, 730 - freelist, false); 731 - nr_freepages += isolated; 732 - } 733 - spin_unlock_irqrestore(&zone->lock, flags); 508 + /* Found a block suitable for isolating free pages from */ 509 + isolated = 0; 510 + end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn); 511 + isolated = isolate_freepages_block(cc, pfn, end_pfn, 512 + freelist, false); 513 + nr_freepages += isolated; 734 514 735 515 /* 736 516 * Record the highest PFN we isolated pages from. When next ··· 724 532 * page migration may have returned some pages to the allocator 725 533 */ 726 534 if (isolated) { 535 + cc->finished_update_free = true; 727 536 high_pfn = max(high_pfn, pfn); 728 - 729 - /* 730 - * If the free scanner has wrapped, update 731 - * compact_cached_free_pfn to point to the highest 732 - * pageblock with free pages. 
This reduces excessive 733 - * scanning of full pageblocks near the end of the 734 - * zone 735 - */ 736 - if (cc->order > 0 && cc->wrapped) 737 - zone->compact_cached_free_pfn = high_pfn; 738 537 } 739 538 } 740 539 ··· 734 551 735 552 cc->free_pfn = high_pfn; 736 553 cc->nr_freepages = nr_freepages; 737 - 738 - /* If compact_cached_free_pfn is reset then set it now */ 739 - if (cc->order > 0 && !cc->wrapped && 740 - zone->compact_cached_free_pfn == start_free_pfn(zone)) 741 - zone->compact_cached_free_pfn = high_pfn; 742 554 } 743 555 744 556 /* ··· 811 633 } 812 634 813 635 /* Perform the isolation */ 814 - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn); 815 - if (!low_pfn) 636 + low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false); 637 + if (!low_pfn || cc->contended) 816 638 return ISOLATE_ABORT; 817 639 818 640 cc->migrate_pfn = low_pfn; ··· 823 645 static int compact_finished(struct zone *zone, 824 646 struct compact_control *cc) 825 647 { 826 - unsigned int order; 827 648 unsigned long watermark; 828 649 829 650 if (fatal_signal_pending(current)) 830 651 return COMPACT_PARTIAL; 831 652 832 - /* 833 - * A full (order == -1) compaction run starts at the beginning and 834 - * end of a zone; it completes when the migrate and free scanner meet. 835 - * A partial (order > 0) compaction can start with the free scanner 836 - * at a random point in the zone, and may have to restart. 837 - */ 653 + /* Compaction run completes if the migrate and free scanner meet */ 838 654 if (cc->free_pfn <= cc->migrate_pfn) { 839 - if (cc->order > 0 && !cc->wrapped) { 840 - /* We started partway through; restart at the end. */ 841 - unsigned long free_pfn = start_free_pfn(zone); 842 - zone->compact_cached_free_pfn = free_pfn; 843 - cc->free_pfn = free_pfn; 844 - cc->wrapped = 1; 845 - return COMPACT_CONTINUE; 846 - } 655 + /* 656 + * Mark that the PG_migrate_skip information should be cleared 657 + * by kswapd when it goes to sleep. 
kswapd does not set the 658 + * flag itself as the decision to be clear should be directly 659 + * based on an allocation request. 660 + */ 661 + if (!current_is_kswapd()) 662 + zone->compact_blockskip_flush = true; 663 + 847 664 return COMPACT_COMPLETE; 848 665 } 849 - 850 - /* We wrapped around and ended up where we started. */ 851 - if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn) 852 - return COMPACT_COMPLETE; 853 666 854 667 /* 855 668 * order == -1 is expected when compacting via ··· 857 688 return COMPACT_CONTINUE; 858 689 859 690 /* Direct compactor: Is a suitable page free? */ 860 - for (order = cc->order; order < MAX_ORDER; order++) { 861 - /* Job done if page is free of the right migratetype */ 862 - if (!list_empty(&zone->free_area[order].free_list[cc->migratetype])) 691 + if (cc->page) { 692 + /* Was a suitable page captured? */ 693 + if (*cc->page) 863 694 return COMPACT_PARTIAL; 695 + } else { 696 + unsigned int order; 697 + for (order = cc->order; order < MAX_ORDER; order++) { 698 + struct free_area *area = &zone->free_area[cc->order]; 699 + /* Job done if page is free of the right migratetype */ 700 + if (!list_empty(&area->free_list[cc->migratetype])) 701 + return COMPACT_PARTIAL; 864 702 865 - /* Job done if allocation would set block type */ 866 - if (order >= pageblock_order && zone->free_area[order].nr_free) 867 - return COMPACT_PARTIAL; 703 + /* Job done if allocation would set block type */ 704 + if (cc->order >= pageblock_order && area->nr_free) 705 + return COMPACT_PARTIAL; 706 + } 868 707 } 869 708 870 709 return COMPACT_CONTINUE; ··· 931 754 static int compact_zone(struct zone *zone, struct compact_control *cc) 932 755 { 933 756 int ret; 757 + unsigned long start_pfn = zone->zone_start_pfn; 758 + unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages; 934 759 935 760 ret = compaction_suitable(zone, cc->order); 936 761 switch (ret) { ··· 945 766 ; 946 767 } 947 768 948 - /* Setup to move all movable pages to the end of 
the zone */ 949 - cc->migrate_pfn = zone->zone_start_pfn; 950 - 951 - if (cc->order > 0) { 952 - /* Incremental compaction. Start where the last one stopped. */ 953 - cc->free_pfn = zone->compact_cached_free_pfn; 954 - cc->start_free_pfn = cc->free_pfn; 955 - } else { 956 - /* Order == -1 starts at the end of the zone. */ 957 - cc->free_pfn = start_free_pfn(zone); 769 + /* 770 + * Setup to move all movable pages to the end of the zone. Used cached 771 + * information on where the scanners should start but check that it 772 + * is initialised by ensuring the values are within zone boundaries. 773 + */ 774 + cc->migrate_pfn = zone->compact_cached_migrate_pfn; 775 + cc->free_pfn = zone->compact_cached_free_pfn; 776 + if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) { 777 + cc->free_pfn = end_pfn & ~(pageblock_nr_pages-1); 778 + zone->compact_cached_free_pfn = cc->free_pfn; 958 779 } 780 + if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) { 781 + cc->migrate_pfn = start_pfn; 782 + zone->compact_cached_migrate_pfn = cc->migrate_pfn; 783 + } 784 + 785 + /* 786 + * Clear pageblock skip if there were failures recently and compaction 787 + * is about to be retried after being deferred. kswapd does not do 788 + * this reset as it'll reset the cached information when going to sleep. 
789 + */ 790 + if (compaction_restarting(zone, cc->order) && !current_is_kswapd()) 791 + __reset_isolation_suitable(zone); 959 792 960 793 migrate_prep_local(); 961 794 ··· 978 787 switch (isolate_migratepages(zone, cc)) { 979 788 case ISOLATE_ABORT: 980 789 ret = COMPACT_PARTIAL; 790 + putback_lru_pages(&cc->migratepages); 791 + cc->nr_migratepages = 0; 981 792 goto out; 982 793 case ISOLATE_NONE: 983 794 continue; ··· 1010 817 goto out; 1011 818 } 1012 819 } 820 + 821 + /* Capture a page now if it is a suitable size */ 822 + compact_capture_page(cc); 1013 823 } 1014 824 1015 825 out: ··· 1025 829 1026 830 static unsigned long compact_zone_order(struct zone *zone, 1027 831 int order, gfp_t gfp_mask, 1028 - bool sync, bool *contended) 832 + bool sync, bool *contended, 833 + struct page **page) 1029 834 { 835 + unsigned long ret; 1030 836 struct compact_control cc = { 1031 837 .nr_freepages = 0, 1032 838 .nr_migratepages = 0, ··· 1036 838 .migratetype = allocflags_to_migratetype(gfp_mask), 1037 839 .zone = zone, 1038 840 .sync = sync, 1039 - .contended = contended, 841 + .page = page, 1040 842 }; 1041 843 INIT_LIST_HEAD(&cc.freepages); 1042 844 INIT_LIST_HEAD(&cc.migratepages); 1043 845 1044 - return compact_zone(zone, &cc); 846 + ret = compact_zone(zone, &cc); 847 + 848 + VM_BUG_ON(!list_empty(&cc.freepages)); 849 + VM_BUG_ON(!list_empty(&cc.migratepages)); 850 + 851 + *contended = cc.contended; 852 + return ret; 1045 853 } 1046 854 1047 855 int sysctl_extfrag_threshold = 500; ··· 1059 855 * @gfp_mask: The GFP mask of the current allocation 1060 856 * @nodemask: The allowed nodes to allocate from 1061 857 * @sync: Whether migration is synchronous or not 858 + * @contended: Return value that is true if compaction was aborted due to lock contention 859 + * @page: Optionally capture a free page of the requested order during compaction 1062 860 * 1063 861 * This is the main entry point for direct page compaction. 
1064 862 */ 1065 863 unsigned long try_to_compact_pages(struct zonelist *zonelist, 1066 864 int order, gfp_t gfp_mask, nodemask_t *nodemask, 1067 - bool sync, bool *contended) 865 + bool sync, bool *contended, struct page **page) 1068 866 { 1069 867 enum zone_type high_zoneidx = gfp_zone(gfp_mask); 1070 868 int may_enter_fs = gfp_mask & __GFP_FS; ··· 1074 868 struct zoneref *z; 1075 869 struct zone *zone; 1076 870 int rc = COMPACT_SKIPPED; 871 + int alloc_flags = 0; 1077 872 1078 - /* 1079 - * Check whether it is worth even starting compaction. The order check is 1080 - * made because an assumption is made that the page allocator can satisfy 1081 - * the "cheaper" orders without taking special steps 1082 - */ 873 + /* Check if the GFP flags allow compaction */ 1083 874 if (!order || !may_enter_fs || !may_perform_io) 1084 875 return rc; 1085 876 1086 877 count_vm_event(COMPACTSTALL); 1087 878 879 + #ifdef CONFIG_CMA 880 + if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) 881 + alloc_flags |= ALLOC_CMA; 882 + #endif 1088 883 /* Compact each zone in the list */ 1089 884 for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, 1090 885 nodemask) { 1091 886 int status; 1092 887 1093 888 status = compact_zone_order(zone, order, gfp_mask, sync, 1094 - contended); 889 + contended, page); 1095 890 rc = max(status, rc); 1096 891 1097 892 /* If a normal allocation would succeed, stop compacting */ 1098 - if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) 893 + if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 894 + alloc_flags)) 1099 895 break; 1100 896 } 1101 897 ··· 1148 940 struct compact_control cc = { 1149 941 .order = order, 1150 942 .sync = false, 943 + .page = NULL, 1151 944 }; 1152 945 1153 946 return __compact_pgdat(pgdat, &cc); ··· 1159 950 struct compact_control cc = { 1160 951 .order = -1, 1161 952 .sync = true, 953 + .page = NULL, 1162 954 }; 1163 955 1164 956 return __compact_pgdat(NODE_DATA(nid), &cc);
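The compact_zone() hunk above drops the old wrapped-scanner bookkeeping (`start_free_pfn()`, `cc->wrapped`) in favour of cached scanner positions that are validated against the zone span before use. A minimal userspace sketch of that clamping logic follows; the struct and constant names (`toy_zone`, `PAGEBLOCK_NR_PAGES`) are invented for illustration, the pageblock size assumes order-9 blocks as on x86, and this is not kernel code:

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES 512UL /* assumed: order-9 pageblocks (2MB with 4KB pages) */

struct toy_zone {
        unsigned long start_pfn, end_pfn;  /* zone span */
        unsigned long cached_migrate_pfn;  /* where the migrate scanner resumes */
        unsigned long cached_free_pfn;     /* where the free scanner resumes */
};

/*
 * Mirrors the cached-pfn validation in the compact_zone() hunk: start both
 * scanners from their cached positions, but reset any scanner whose cached
 * value fell outside the zone boundaries. The free scanner restarts from the
 * last pageblock boundary at the end of the zone; the migrate scanner
 * restarts from the beginning.
 */
static void init_scanners(struct toy_zone *z,
                          unsigned long *migrate_pfn, unsigned long *free_pfn)
{
        *migrate_pfn = z->cached_migrate_pfn;
        *free_pfn = z->cached_free_pfn;
        if (*free_pfn < z->start_pfn || *free_pfn > z->end_pfn) {
                *free_pfn = z->end_pfn & ~(PAGEBLOCK_NR_PAGES - 1);
                z->cached_free_pfn = *free_pfn;
        }
        if (*migrate_pfn < z->start_pfn || *migrate_pfn > z->end_pfn) {
                *migrate_pfn = z->start_pfn;
                z->cached_migrate_pfn = *migrate_pfn;
        }
}
```

Because the cached values persist in the zone between compaction attempts, a later run resumes where the previous one made progress instead of rescanning the whole zone.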
+3 -3
mm/filemap.c
··· 1607 1607 * Do we have something in the page cache already? 1608 1608 */ 1609 1609 page = find_get_page(mapping, offset); 1610 - if (likely(page)) { 1610 + if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) { 1611 1611 /* 1612 1612 * We found the page, so try async readahead before 1613 1613 * waiting for the lock. 1614 1614 */ 1615 1615 do_async_mmap_readahead(vma, ra, file, page, offset); 1616 - } else { 1616 + } else if (!page) { 1617 1617 /* No page in the page cache at all */ 1618 1618 do_sync_mmap_readahead(vma, ra, file, offset); 1619 1619 count_vm_event(PGMAJFAULT); ··· 1737 1737 const struct vm_operations_struct generic_file_vm_ops = { 1738 1738 .fault = filemap_fault, 1739 1739 .page_mkwrite = filemap_page_mkwrite, 1740 + .remap_pages = generic_file_remap_pages, 1740 1741 }; 1741 1742 1742 1743 /* This is used for a general mmap of a disk file */ ··· 1750 1749 return -ENOEXEC; 1751 1750 file_accessed(file); 1752 1751 vma->vm_ops = &generic_file_vm_ops; 1753 - vma->vm_flags |= VM_CAN_NONLINEAR; 1754 1752 return 0; 1755 1753 } 1756 1754
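The filemap_fault() change above keys readahead off FAULT_FLAG_TRIED: a retried fault that finds the page already cached skips readahead, since it was issued on the first attempt, while a missing page still takes the synchronous-readahead major-fault path. A hedged sketch of just that decision; the flag value and enum are invented for the example and do not match the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>

#define FAULT_FLAG_TRIED 0x10 /* assumed value for illustration only */

enum fault_action { DO_ASYNC_RA, DO_SYNC_RA_MAJOR, NO_RA };

/*
 * Decision shape from the filemap_fault() hunk:
 *  - page cached, first attempt  -> async readahead before locking
 *  - page not cached             -> sync readahead, counts as a major fault
 *  - page cached, retried fault  -> no readahead at all
 */
static enum fault_action fault_readahead(bool page_cached, unsigned int flags)
{
        if (page_cached && !(flags & FAULT_FLAG_TRIED))
                return DO_ASYNC_RA;
        else if (!page_cached)
                return DO_SYNC_RA_MAJOR;
        return NO_RA;
}
```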
+6 -4
mm/filemap_xip.c
··· 167 167 { 168 168 struct vm_area_struct *vma; 169 169 struct mm_struct *mm; 170 - struct prio_tree_iter iter; 171 170 unsigned long address; 172 171 pte_t *pte; 173 172 pte_t pteval; ··· 183 184 184 185 retry: 185 186 mutex_lock(&mapping->i_mmap_mutex); 186 - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { 187 + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { 187 188 mm = vma->vm_mm; 188 189 address = vma->vm_start + 189 190 ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); ··· 192 193 if (pte) { 193 194 /* Nuke the page table entry. */ 194 195 flush_cache_page(vma, address, pte_pfn(*pte)); 195 - pteval = ptep_clear_flush_notify(vma, address, pte); 196 + pteval = ptep_clear_flush(vma, address, pte); 196 197 page_remove_rmap(page); 197 198 dec_mm_counter(mm, MM_FILEPAGES); 198 199 BUG_ON(pte_dirty(pteval)); 199 200 pte_unmap_unlock(pte, ptl); 201 + /* must invalidate_page _before_ freeing the page */ 202 + mmu_notifier_invalidate_page(mm, address); 200 203 page_cache_release(page); 201 204 } 202 205 } ··· 306 305 static const struct vm_operations_struct xip_file_vm_ops = { 307 306 .fault = xip_file_fault, 308 307 .page_mkwrite = filemap_page_mkwrite, 308 + .remap_pages = generic_file_remap_pages, 309 309 }; 310 310 311 311 int xip_file_mmap(struct file * file, struct vm_area_struct * vma) ··· 315 313 316 314 file_accessed(file); 317 315 vma->vm_ops = &xip_file_vm_ops; 318 - vma->vm_flags |= VM_CAN_NONLINEAR | VM_MIXEDMAP; 316 + vma->vm_flags |= VM_MIXEDMAP; 319 317 return 0; 320 318 } 321 319 EXPORT_SYMBOL_GPL(xip_file_mmap);
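With the prio tree replaced by an interval tree, __xip_unmap() above walks every vma whose file-offset range overlaps the page's pgoff, then converts that offset into a user virtual address. A toy model of the overlap predicate and the address arithmetic; 4KB pages are assumed and the struct is illustrative, not the kernel's vm_area_struct:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12 /* assumed 4KB pages */

struct toy_vma {
        unsigned long vm_start, vm_end; /* user virtual address range */
        unsigned long vm_pgoff;         /* file offset of vm_start, in pages */
};

/* Does this vma map file page @pgoff? (what the interval-tree walk selects) */
static bool vma_maps_pgoff(const struct toy_vma *vma, unsigned long pgoff)
{
        unsigned long nr_pages = (vma->vm_end - vma->vm_start) >> TOY_PAGE_SHIFT;
        return pgoff >= vma->vm_pgoff && pgoff < vma->vm_pgoff + nr_pages;
}

/* The address computation used inside the __xip_unmap() loop */
static unsigned long vma_address_of(const struct toy_vma *vma, unsigned long pgoff)
{
        return vma->vm_start + ((pgoff - vma->vm_pgoff) << TOY_PAGE_SHIFT);
}
```

Only vmas for which the predicate holds get their pte for that address nuked; note the diff also reorders teardown so mmu_notifier_invalidate_page() runs before the page is freed.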
+9 -7
mm/fremap.c
··· 5 5 * 6 6 * started by Ingo Molnar, Copyright (C) 2002, 2003 7 7 */ 8 + #include <linux/export.h> 8 9 #include <linux/backing-dev.h> 9 10 #include <linux/mm.h> 10 11 #include <linux/swap.h> ··· 81 80 return err; 82 81 } 83 82 84 - static int populate_range(struct mm_struct *mm, struct vm_area_struct *vma, 85 - unsigned long addr, unsigned long size, pgoff_t pgoff) 83 + int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr, 84 + unsigned long size, pgoff_t pgoff) 86 85 { 86 + struct mm_struct *mm = vma->vm_mm; 87 87 int err; 88 88 89 89 do { ··· 97 95 pgoff++; 98 96 } while (size); 99 97 100 - return 0; 101 - 98 + return 0; 102 99 } 100 + EXPORT_SYMBOL(generic_file_remap_pages); 103 101 104 102 /** 105 103 * sys_remap_file_pages - remap arbitrary pages of an existing VM_SHARED vma ··· 169 167 if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)) 170 168 goto out; 171 169 172 - if (!(vma->vm_flags & VM_CAN_NONLINEAR)) 170 + if (!vma->vm_ops->remap_pages) 173 171 goto out; 174 172 175 173 if (start < vma->vm_start || start + size > vma->vm_end) ··· 214 212 mutex_lock(&mapping->i_mmap_mutex); 215 213 flush_dcache_mmap_lock(mapping); 216 214 vma->vm_flags |= VM_NONLINEAR; 217 - vma_prio_tree_remove(vma, &mapping->i_mmap); 215 + vma_interval_tree_remove(vma, &mapping->i_mmap); 218 216 vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); 219 217 flush_dcache_mmap_unlock(mapping); 220 218 mutex_unlock(&mapping->i_mmap_mutex); ··· 230 228 } 231 229 232 230 mmu_notifier_invalidate_range_start(mm, start, start + size); 233 - err = populate_range(mm, vma, start, size, pgoff); 231 + err = vma->vm_ops->remap_pages(vma, start, size, pgoff); 234 232 mmu_notifier_invalidate_range_end(mm, start, start + size); 235 233 if (!err && !(flags & MAP_NONBLOCK)) { 236 234 if (vma->vm_flags & VM_LOCKED) {
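generic_file_remap_pages() above installs one file pte per page, mapping consecutive virtual pages to consecutive file offsets starting at @pgoff. A sketch of the loop's shape that records the offsets instead of touching page tables; TOY_PAGE_SIZE and the helper name are assumptions for the example:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL /* assumed 4KB pages */

/*
 * Same iteration pattern as the install_file_pte() loop in
 * generic_file_remap_pages(): step addr and pgoff forward one page at a
 * time until @size is consumed. Here each step just records the pgoff
 * that would have been encoded into the nonlinear pte.
 */
static size_t remap_offsets(unsigned long addr, unsigned long size,
                            unsigned long pgoff, unsigned long *out, size_t cap)
{
        size_t n = 0;

        while (size && n < cap) {
                out[n++] = pgoff; /* stand-in for install_file_pte(..., addr, pgoff, ...) */
                size -= TOY_PAGE_SIZE;
                addr += TOY_PAGE_SIZE;
                pgoff++;
        }
        return n;
}
```

Routing this through vma->vm_ops->remap_pages (rather than the old VM_CAN_NONLINEAR flag plus a private helper) lets each filesystem opt in by providing the operation.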
+211 -231
mm/huge_memory.c
··· 102 102 unsigned long recommended_min; 103 103 extern int min_free_kbytes; 104 104 105 - if (!test_bit(TRANSPARENT_HUGEPAGE_FLAG, 106 - &transparent_hugepage_flags) && 107 - !test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, 108 - &transparent_hugepage_flags)) 105 + if (!khugepaged_enabled()) 109 106 return 0; 110 107 111 108 for_each_populated_zone(zone) ··· 136 139 { 137 140 int err = 0; 138 141 if (khugepaged_enabled()) { 139 - int wakeup; 140 - if (unlikely(!mm_slot_cache || !mm_slots_hash)) { 141 - err = -ENOMEM; 142 - goto out; 143 - } 144 - mutex_lock(&khugepaged_mutex); 145 142 if (!khugepaged_thread) 146 143 khugepaged_thread = kthread_run(khugepaged, NULL, 147 144 "khugepaged"); ··· 145 154 err = PTR_ERR(khugepaged_thread); 146 155 khugepaged_thread = NULL; 147 156 } 148 - wakeup = !list_empty(&khugepaged_scan.mm_head); 149 - mutex_unlock(&khugepaged_mutex); 150 - if (wakeup) 157 + 158 + if (!list_empty(&khugepaged_scan.mm_head)) 151 159 wake_up_interruptible(&khugepaged_wait); 152 160 153 161 set_recommended_min_free_kbytes(); 154 - } else 155 - /* wakeup to exit */ 156 - wake_up_interruptible(&khugepaged_wait); 157 - out: 162 + } else if (khugepaged_thread) { 163 + kthread_stop(khugepaged_thread); 164 + khugepaged_thread = NULL; 165 + } 166 + 158 167 return err; 159 168 } 160 169 ··· 215 224 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG); 216 225 217 226 if (ret > 0) { 218 - int err = start_khugepaged(); 227 + int err; 228 + 229 + mutex_lock(&khugepaged_mutex); 230 + err = start_khugepaged(); 231 + mutex_unlock(&khugepaged_mutex); 232 + 219 233 if (err) 220 234 ret = err; 221 235 } 222 - 223 - if (ret > 0 && 224 - (test_bit(TRANSPARENT_HUGEPAGE_FLAG, 225 - &transparent_hugepage_flags) || 226 - test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, 227 - &transparent_hugepage_flags))) 228 - set_recommended_min_free_kbytes(); 229 236 230 237 return ret; 231 238 } ··· 559 570 560 571 start_khugepaged(); 561 572 562 - set_recommended_min_free_kbytes(); 563 - 564 573 return 0; 
565 574 out: 566 575 hugepage_exit_sysfs(hugepage_kobj); ··· 597 610 return ret; 598 611 } 599 612 __setup("transparent_hugepage=", setup_transparent_hugepage); 600 - 601 - static void prepare_pmd_huge_pte(pgtable_t pgtable, 602 - struct mm_struct *mm) 603 - { 604 - assert_spin_locked(&mm->page_table_lock); 605 - 606 - /* FIFO */ 607 - if (!mm->pmd_huge_pte) 608 - INIT_LIST_HEAD(&pgtable->lru); 609 - else 610 - list_add(&pgtable->lru, &mm->pmd_huge_pte->lru); 611 - mm->pmd_huge_pte = pgtable; 612 - } 613 613 614 614 static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) 615 615 { ··· 639 665 */ 640 666 page_add_new_anon_rmap(page, vma, haddr); 641 667 set_pmd_at(mm, haddr, pmd, entry); 642 - prepare_pmd_huge_pte(pgtable, mm); 668 + pgtable_trans_huge_deposit(mm, pgtable); 643 669 add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR); 644 670 mm->nr_ptes++; 645 671 spin_unlock(&mm->page_table_lock); ··· 765 791 pmdp_set_wrprotect(src_mm, addr, src_pmd); 766 792 pmd = pmd_mkold(pmd_wrprotect(pmd)); 767 793 set_pmd_at(dst_mm, addr, dst_pmd, pmd); 768 - prepare_pmd_huge_pte(pgtable, dst_mm); 794 + pgtable_trans_huge_deposit(dst_mm, pgtable); 769 795 dst_mm->nr_ptes++; 770 796 771 797 ret = 0; ··· 774 800 spin_unlock(&dst_mm->page_table_lock); 775 801 out: 776 802 return ret; 777 - } 778 - 779 - /* no "address" argument so destroys page coloring of some arch */ 780 - pgtable_t get_pmd_huge_pte(struct mm_struct *mm) 781 - { 782 - pgtable_t pgtable; 783 - 784 - assert_spin_locked(&mm->page_table_lock); 785 - 786 - /* FIFO */ 787 - pgtable = mm->pmd_huge_pte; 788 - if (list_empty(&pgtable->lru)) 789 - mm->pmd_huge_pte = NULL; 790 - else { 791 - mm->pmd_huge_pte = list_entry(pgtable->lru.next, 792 - struct page, lru); 793 - list_del(&pgtable->lru); 794 - } 795 - return pgtable; 796 803 } 797 804 798 805 static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, ··· 787 832 pmd_t _pmd; 788 833 int ret = 0, i; 789 834 struct page **pages; 835 + unsigned long 
mmun_start; /* For mmu_notifiers */ 836 + unsigned long mmun_end; /* For mmu_notifiers */ 790 837 791 838 pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR, 792 839 GFP_KERNEL); ··· 825 868 cond_resched(); 826 869 } 827 870 871 + mmun_start = haddr; 872 + mmun_end = haddr + HPAGE_PMD_SIZE; 873 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 874 + 828 875 spin_lock(&mm->page_table_lock); 829 876 if (unlikely(!pmd_same(*pmd, orig_pmd))) 830 877 goto out_free_pages; 831 878 VM_BUG_ON(!PageHead(page)); 832 879 833 - pmdp_clear_flush_notify(vma, haddr, pmd); 880 + pmdp_clear_flush(vma, haddr, pmd); 834 881 /* leave pmd empty until pte is filled */ 835 882 836 - pgtable = get_pmd_huge_pte(mm); 883 + pgtable = pgtable_trans_huge_withdraw(mm); 837 884 pmd_populate(mm, &_pmd, pgtable); 838 885 839 886 for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { ··· 857 896 page_remove_rmap(page); 858 897 spin_unlock(&mm->page_table_lock); 859 898 899 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 900 + 860 901 ret |= VM_FAULT_WRITE; 861 902 put_page(page); 862 903 ··· 867 904 868 905 out_free_pages: 869 906 spin_unlock(&mm->page_table_lock); 907 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 870 908 mem_cgroup_uncharge_start(); 871 909 for (i = 0; i < HPAGE_PMD_NR; i++) { 872 910 mem_cgroup_uncharge_page(pages[i]); ··· 884 920 int ret = 0; 885 921 struct page *page, *new_page; 886 922 unsigned long haddr; 923 + unsigned long mmun_start; /* For mmu_notifiers */ 924 + unsigned long mmun_end; /* For mmu_notifiers */ 887 925 888 926 VM_BUG_ON(!vma->anon_vma); 889 927 spin_lock(&mm->page_table_lock); ··· 900 934 entry = pmd_mkyoung(orig_pmd); 901 935 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); 902 936 if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1)) 903 - update_mmu_cache(vma, address, entry); 937 + update_mmu_cache_pmd(vma, address, pmd); 904 938 ret |= VM_FAULT_WRITE; 905 939 goto out_unlock; 906 940 } ··· 936 970 
copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR); 937 971 __SetPageUptodate(new_page); 938 972 973 + mmun_start = haddr; 974 + mmun_end = haddr + HPAGE_PMD_SIZE; 975 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 976 + 939 977 spin_lock(&mm->page_table_lock); 940 978 put_page(page); 941 979 if (unlikely(!pmd_same(*pmd, orig_pmd))) { 942 980 spin_unlock(&mm->page_table_lock); 943 981 mem_cgroup_uncharge_page(new_page); 944 982 put_page(new_page); 945 - goto out; 983 + goto out_mn; 946 984 } else { 947 985 pmd_t entry; 948 986 VM_BUG_ON(!PageHead(page)); 949 987 entry = mk_pmd(new_page, vma->vm_page_prot); 950 988 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); 951 989 entry = pmd_mkhuge(entry); 952 - pmdp_clear_flush_notify(vma, haddr, pmd); 990 + pmdp_clear_flush(vma, haddr, pmd); 953 991 page_add_new_anon_rmap(new_page, vma, haddr); 954 992 set_pmd_at(mm, haddr, pmd, entry); 955 - update_mmu_cache(vma, address, entry); 993 + update_mmu_cache_pmd(vma, address, pmd); 956 994 page_remove_rmap(page); 957 995 put_page(page); 958 996 ret |= VM_FAULT_WRITE; 959 997 } 998 + spin_unlock(&mm->page_table_lock); 999 + out_mn: 1000 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 1001 + out: 1002 + return ret; 960 1003 out_unlock: 961 1004 spin_unlock(&mm->page_table_lock); 962 - out: 963 1005 return ret; 964 1006 } 965 1007 966 - struct page *follow_trans_huge_pmd(struct mm_struct *mm, 1008 + struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, 967 1009 unsigned long addr, 968 1010 pmd_t *pmd, 969 1011 unsigned int flags) 970 1012 { 1013 + struct mm_struct *mm = vma->vm_mm; 971 1014 struct page *page = NULL; 972 1015 973 1016 assert_spin_locked(&mm->page_table_lock); ··· 999 1024 _pmd = pmd_mkyoung(pmd_mkdirty(*pmd)); 1000 1025 set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd); 1001 1026 } 1027 + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { 1028 + if (page->mapping && trylock_page(page)) { 1029 + 
lru_add_drain(); 1030 + if (page->mapping) 1031 + mlock_vma_page(page); 1032 + unlock_page(page); 1033 + } 1034 + } 1002 1035 page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; 1003 1036 VM_BUG_ON(!PageCompound(page)); 1004 1037 if (flags & FOLL_GET) ··· 1024 1041 if (__pmd_trans_huge_lock(pmd, vma) == 1) { 1025 1042 struct page *page; 1026 1043 pgtable_t pgtable; 1027 - pgtable = get_pmd_huge_pte(tlb->mm); 1028 - page = pmd_page(*pmd); 1029 - pmd_clear(pmd); 1044 + pmd_t orig_pmd; 1045 + pgtable = pgtable_trans_huge_withdraw(tlb->mm); 1046 + orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd); 1047 + page = pmd_page(orig_pmd); 1030 1048 tlb_remove_pmd_tlb_entry(tlb, pmd, addr); 1031 1049 page_remove_rmap(page); 1032 1050 VM_BUG_ON(page_mapcount(page) < 0); ··· 1191 1207 struct mm_struct *mm = vma->vm_mm; 1192 1208 pmd_t *pmd; 1193 1209 int ret = 0; 1210 + /* For mmu_notifiers */ 1211 + const unsigned long mmun_start = address; 1212 + const unsigned long mmun_end = address + HPAGE_PMD_SIZE; 1194 1213 1214 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 1195 1215 spin_lock(&mm->page_table_lock); 1196 1216 pmd = page_check_address_pmd(page, mm, address, 1197 1217 PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG); ··· 1207 1219 * and it won't wait on the anon_vma->root->mutex to 1208 1220 * serialize against split_huge_page*. 
1209 1221 */ 1210 - pmdp_splitting_flush_notify(vma, address, pmd); 1222 + pmdp_splitting_flush(vma, address, pmd); 1211 1223 ret = 1; 1212 1224 } 1213 1225 spin_unlock(&mm->page_table_lock); 1226 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 1214 1227 1215 1228 return ret; 1216 1229 } ··· 1347 1358 pmd = page_check_address_pmd(page, mm, address, 1348 1359 PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); 1349 1360 if (pmd) { 1350 - pgtable = get_pmd_huge_pte(mm); 1361 + pgtable = pgtable_trans_huge_withdraw(mm); 1351 1362 pmd_populate(mm, &_pmd, pgtable); 1352 1363 1353 - for (i = 0, haddr = address; i < HPAGE_PMD_NR; 1354 - i++, haddr += PAGE_SIZE) { 1364 + haddr = address; 1365 + for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { 1355 1366 pte_t *pte, entry; 1356 1367 BUG_ON(PageCompound(page+i)); 1357 1368 entry = mk_pte(page + i, vma->vm_page_prot); ··· 1395 1406 * SMP TLB and finally we write the non-huge version 1396 1407 * of the pmd entry with pmd_populate. 1397 1408 */ 1398 - set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd)); 1399 - flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); 1409 + pmdp_invalidate(vma, address, pmd); 1400 1410 pmd_populate(mm, pmd, pgtable); 1401 1411 ret = 1; 1402 1412 } ··· 1409 1421 struct anon_vma *anon_vma) 1410 1422 { 1411 1423 int mapcount, mapcount2; 1424 + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 1412 1425 struct anon_vma_chain *avc; 1413 1426 1414 1427 BUG_ON(!PageHead(page)); 1415 1428 BUG_ON(PageTail(page)); 1416 1429 1417 1430 mapcount = 0; 1418 - list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { 1431 + anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) { 1419 1432 struct vm_area_struct *vma = avc->vma; 1420 1433 unsigned long addr = vma_address(page, vma); 1421 1434 BUG_ON(is_vma_temporary_stack(vma)); 1422 - if (addr == -EFAULT) 1423 - continue; 1424 1435 mapcount += __split_huge_page_splitting(page, vma, addr); 1425 1436 } 1426 1437 /* ··· 
1440 1453 __split_huge_page_refcount(page); 1441 1454 1442 1455 mapcount2 = 0; 1443 - list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { 1456 + anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) { 1444 1457 struct vm_area_struct *vma = avc->vma; 1445 1458 unsigned long addr = vma_address(page, vma); 1446 1459 BUG_ON(is_vma_temporary_stack(vma)); 1447 - if (addr == -EFAULT) 1448 - continue; 1449 1460 mapcount2 += __split_huge_page_map(page, vma, addr); 1450 1461 } 1451 1462 if (mapcount != mapcount2) ··· 1476 1491 return ret; 1477 1492 } 1478 1493 1479 - #define VM_NO_THP (VM_SPECIAL|VM_INSERTPAGE|VM_MIXEDMAP|VM_SAO| \ 1480 - VM_HUGETLB|VM_SHARED|VM_MAYSHARE) 1494 + #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE) 1481 1495 1482 1496 int hugepage_madvise(struct vm_area_struct *vma, 1483 1497 unsigned long *vm_flags, int advice) 1484 1498 { 1499 + struct mm_struct *mm = vma->vm_mm; 1500 + 1485 1501 switch (advice) { 1486 1502 case MADV_HUGEPAGE: 1487 1503 /* 1488 1504 * Be somewhat over-protective like KSM for now! 1489 1505 */ 1490 1506 if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) 1507 + return -EINVAL; 1508 + if (mm->def_flags & VM_NOHUGEPAGE) 1491 1509 return -EINVAL; 1492 1510 *vm_flags &= ~VM_NOHUGEPAGE; 1493 1511 *vm_flags |= VM_HUGEPAGE; ··· 1643 1655 if (vma->vm_ops) 1644 1656 /* khugepaged not yet working on file or special mappings */ 1645 1657 return 0; 1646 - /* 1647 - * If is_pfn_mapping() is true is_learn_pfn_mapping() must be 1648 - * true too, verify it here. 
1649 - */ 1650 - VM_BUG_ON(is_linear_pfn_mapping(vma) || vma->vm_flags & VM_NO_THP); 1658 + VM_BUG_ON(vma->vm_flags & VM_NO_THP); 1651 1659 hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; 1652 1660 hend = vma->vm_end & HPAGE_PMD_MASK; 1653 1661 if (hstart < hend) ··· 1817 1833 } 1818 1834 } 1819 1835 1820 - static void collapse_huge_page(struct mm_struct *mm, 1821 - unsigned long address, 1822 - struct page **hpage, 1823 - struct vm_area_struct *vma, 1824 - int node) 1836 + static void khugepaged_alloc_sleep(void) 1825 1837 { 1826 - pgd_t *pgd; 1827 - pud_t *pud; 1828 - pmd_t *pmd, _pmd; 1829 - pte_t *pte; 1830 - pgtable_t pgtable; 1831 - struct page *new_page; 1832 - spinlock_t *ptl; 1833 - int isolated; 1834 - unsigned long hstart, hend; 1838 + wait_event_freezable_timeout(khugepaged_wait, false, 1839 + msecs_to_jiffies(khugepaged_alloc_sleep_millisecs)); 1840 + } 1835 1841 1836 - VM_BUG_ON(address & ~HPAGE_PMD_MASK); 1837 - #ifndef CONFIG_NUMA 1838 - up_read(&mm->mmap_sem); 1839 - VM_BUG_ON(!*hpage); 1840 - new_page = *hpage; 1841 - #else 1842 + #ifdef CONFIG_NUMA 1843 + static bool khugepaged_prealloc_page(struct page **hpage, bool *wait) 1844 + { 1845 + if (IS_ERR(*hpage)) { 1846 + if (!*wait) 1847 + return false; 1848 + 1849 + *wait = false; 1850 + *hpage = NULL; 1851 + khugepaged_alloc_sleep(); 1852 + } else if (*hpage) { 1853 + put_page(*hpage); 1854 + *hpage = NULL; 1855 + } 1856 + 1857 + return true; 1858 + } 1859 + 1860 + static struct page 1861 + *khugepaged_alloc_page(struct page **hpage, struct mm_struct *mm, 1862 + struct vm_area_struct *vma, unsigned long address, 1863 + int node) 1864 + { 1842 1865 VM_BUG_ON(*hpage); 1843 1866 /* 1844 1867 * Allocate the page while the vma is still valid and under ··· 1857 1866 * mmap_sem in read mode is good idea also to allow greater 1858 1867 * scalability. 
1859 1868 */ 1860 - new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address, 1869 + *hpage = alloc_hugepage_vma(khugepaged_defrag(), vma, address, 1861 1870 node, __GFP_OTHER_NODE); 1862 1871 1863 1872 /* ··· 1865 1874 * preparation for taking it in write mode. 1866 1875 */ 1867 1876 up_read(&mm->mmap_sem); 1868 - if (unlikely(!new_page)) { 1877 + if (unlikely(!*hpage)) { 1869 1878 count_vm_event(THP_COLLAPSE_ALLOC_FAILED); 1870 1879 *hpage = ERR_PTR(-ENOMEM); 1871 - return; 1880 + return NULL; 1872 1881 } 1873 - #endif 1874 1882 1875 1883 count_vm_event(THP_COLLAPSE_ALLOC); 1876 - if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) { 1877 - #ifdef CONFIG_NUMA 1878 - put_page(new_page); 1884 + return *hpage; 1885 + } 1886 + #else 1887 + static struct page *khugepaged_alloc_hugepage(bool *wait) 1888 + { 1889 + struct page *hpage; 1890 + 1891 + do { 1892 + hpage = alloc_hugepage(khugepaged_defrag()); 1893 + if (!hpage) { 1894 + count_vm_event(THP_COLLAPSE_ALLOC_FAILED); 1895 + if (!*wait) 1896 + return NULL; 1897 + 1898 + *wait = false; 1899 + khugepaged_alloc_sleep(); 1900 + } else 1901 + count_vm_event(THP_COLLAPSE_ALLOC); 1902 + } while (unlikely(!hpage) && likely(khugepaged_enabled())); 1903 + 1904 + return hpage; 1905 + } 1906 + 1907 + static bool khugepaged_prealloc_page(struct page **hpage, bool *wait) 1908 + { 1909 + if (!*hpage) 1910 + *hpage = khugepaged_alloc_hugepage(wait); 1911 + 1912 + if (unlikely(!*hpage)) 1913 + return false; 1914 + 1915 + return true; 1916 + } 1917 + 1918 + static struct page 1919 + *khugepaged_alloc_page(struct page **hpage, struct mm_struct *mm, 1920 + struct vm_area_struct *vma, unsigned long address, 1921 + int node) 1922 + { 1923 + up_read(&mm->mmap_sem); 1924 + VM_BUG_ON(!*hpage); 1925 + return *hpage; 1926 + } 1879 1927 #endif 1928 + 1929 + static void collapse_huge_page(struct mm_struct *mm, 1930 + unsigned long address, 1931 + struct page **hpage, 1932 + struct vm_area_struct *vma, 1933 + int node) 
1934 + { 1935 + pgd_t *pgd; 1936 + pud_t *pud; 1937 + pmd_t *pmd, _pmd; 1938 + pte_t *pte; 1939 + pgtable_t pgtable; 1940 + struct page *new_page; 1941 + spinlock_t *ptl; 1942 + int isolated; 1943 + unsigned long hstart, hend; 1944 + unsigned long mmun_start; /* For mmu_notifiers */ 1945 + unsigned long mmun_end; /* For mmu_notifiers */ 1946 + 1947 + VM_BUG_ON(address & ~HPAGE_PMD_MASK); 1948 + 1949 + /* release the mmap_sem read lock. */ 1950 + new_page = khugepaged_alloc_page(hpage, mm, vma, address, node); 1951 + if (!new_page) 1880 1952 return; 1881 - } 1953 + 1954 + if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) 1955 + return; 1882 1956 1883 1957 /* 1884 1958 * Prevent all access to pagetables with the exception of ··· 1968 1912 goto out; 1969 1913 if (is_vma_temporary_stack(vma)) 1970 1914 goto out; 1971 - /* 1972 - * If is_pfn_mapping() is true is_learn_pfn_mapping() must be 1973 - * true too, verify it here. 1974 - */ 1975 - VM_BUG_ON(is_linear_pfn_mapping(vma) || vma->vm_flags & VM_NO_THP); 1915 + VM_BUG_ON(vma->vm_flags & VM_NO_THP); 1976 1916 1977 1917 pgd = pgd_offset(mm, address); 1978 1918 if (!pgd_present(*pgd)) ··· 1988 1936 pte = pte_offset_map(pmd, address); 1989 1937 ptl = pte_lockptr(mm, pmd); 1990 1938 1939 + mmun_start = address; 1940 + mmun_end = address + HPAGE_PMD_SIZE; 1941 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 1991 1942 spin_lock(&mm->page_table_lock); /* probably unnecessary */ 1992 1943 /* 1993 1944 * After this gup_fast can't run anymore. This also removes ··· 1998 1943 * huge and small TLB entries for the same virtual address 1999 1944 * to avoid the risk of CPU bugs in that area. 
2000 1945 */ 2001 - _pmd = pmdp_clear_flush_notify(vma, address, pmd); 1946 + _pmd = pmdp_clear_flush(vma, address, pmd); 2002 1947 spin_unlock(&mm->page_table_lock); 1948 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 2003 1949 2004 1950 spin_lock(ptl); 2005 1951 isolated = __collapse_huge_page_isolate(vma, address, pte); ··· 2026 1970 pte_unmap(pte); 2027 1971 __SetPageUptodate(new_page); 2028 1972 pgtable = pmd_pgtable(_pmd); 2029 - VM_BUG_ON(page_count(pgtable) != 1); 2030 - VM_BUG_ON(page_mapcount(pgtable) != 0); 2031 1973 2032 1974 _pmd = mk_pmd(new_page, vma->vm_page_prot); 2033 1975 _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma); ··· 2042 1988 BUG_ON(!pmd_none(*pmd)); 2043 1989 page_add_new_anon_rmap(new_page, vma, address); 2044 1990 set_pmd_at(mm, address, pmd, _pmd); 2045 - update_mmu_cache(vma, address, _pmd); 2046 - prepare_pmd_huge_pte(pgtable, mm); 1991 + update_mmu_cache_pmd(vma, address, pmd); 1992 + pgtable_trans_huge_deposit(mm, pgtable); 2047 1993 spin_unlock(&mm->page_table_lock); 2048 1994 2049 - #ifndef CONFIG_NUMA 2050 1995 *hpage = NULL; 2051 - #endif 1996 + 2052 1997 khugepaged_pages_collapsed++; 2053 1998 out_up_write: 2054 1999 up_write(&mm->mmap_sem); ··· 2055 2002 2056 2003 out: 2057 2004 mem_cgroup_uncharge_page(new_page); 2058 - #ifdef CONFIG_NUMA 2059 - put_page(new_page); 2060 - #endif 2061 2005 goto out_up_write; 2062 2006 } 2063 2007 ··· 2204 2154 goto skip; 2205 2155 if (is_vma_temporary_stack(vma)) 2206 2156 goto skip; 2207 - /* 2208 - * If is_pfn_mapping() is true is_learn_pfn_mapping() 2209 - * must be true too, verify it here. 
2210 - */ 2211 - VM_BUG_ON(is_linear_pfn_mapping(vma) || 2212 - vma->vm_flags & VM_NO_THP); 2157 + VM_BUG_ON(vma->vm_flags & VM_NO_THP); 2213 2158 2214 2159 hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; 2215 2160 hend = vma->vm_end & HPAGE_PMD_MASK; ··· 2279 2234 static int khugepaged_wait_event(void) 2280 2235 { 2281 2236 return !list_empty(&khugepaged_scan.mm_head) || 2282 - !khugepaged_enabled(); 2237 + kthread_should_stop(); 2283 2238 } 2284 2239 2285 - static void khugepaged_do_scan(struct page **hpage) 2240 + static void khugepaged_do_scan(void) 2286 2241 { 2242 + struct page *hpage = NULL; 2287 2243 unsigned int progress = 0, pass_through_head = 0; 2288 2244 unsigned int pages = khugepaged_pages_to_scan; 2245 + bool wait = true; 2289 2246 2290 2247 barrier(); /* write khugepaged_pages_to_scan to local stack */ 2291 2248 2292 2249 while (progress < pages) { 2293 - cond_resched(); 2294 - 2295 - #ifndef CONFIG_NUMA 2296 - if (!*hpage) { 2297 - *hpage = alloc_hugepage(khugepaged_defrag()); 2298 - if (unlikely(!*hpage)) { 2299 - count_vm_event(THP_COLLAPSE_ALLOC_FAILED); 2300 - break; 2301 - } 2302 - count_vm_event(THP_COLLAPSE_ALLOC); 2303 - } 2304 - #else 2305 - if (IS_ERR(*hpage)) 2250 + if (!khugepaged_prealloc_page(&hpage, &wait)) 2306 2251 break; 2307 - #endif 2252 + 2253 + cond_resched(); 2308 2254 2309 2255 if (unlikely(kthread_should_stop() || freezing(current))) 2310 2256 break; ··· 2306 2270 if (khugepaged_has_work() && 2307 2271 pass_through_head < 2) 2308 2272 progress += khugepaged_scan_mm_slot(pages - progress, 2309 - hpage); 2273 + &hpage); 2310 2274 else 2311 2275 progress = pages; 2312 2276 spin_unlock(&khugepaged_mm_lock); 2313 2277 } 2278 + 2279 + if (!IS_ERR_OR_NULL(hpage)) 2280 + put_page(hpage); 2314 2281 } 2315 2282 2316 - static void khugepaged_alloc_sleep(void) 2283 + static void khugepaged_wait_work(void) 2317 2284 { 2318 - wait_event_freezable_timeout(khugepaged_wait, false, 2319 - 
msecs_to_jiffies(khugepaged_alloc_sleep_millisecs)); 2320 - } 2285 + try_to_freeze(); 2321 2286 2322 - #ifndef CONFIG_NUMA 2323 - static struct page *khugepaged_alloc_hugepage(void) 2324 - { 2325 - struct page *hpage; 2287 + if (khugepaged_has_work()) { 2288 + if (!khugepaged_scan_sleep_millisecs) 2289 + return; 2326 2290 2327 - do { 2328 - hpage = alloc_hugepage(khugepaged_defrag()); 2329 - if (!hpage) { 2330 - count_vm_event(THP_COLLAPSE_ALLOC_FAILED); 2331 - khugepaged_alloc_sleep(); 2332 - } else 2333 - count_vm_event(THP_COLLAPSE_ALLOC); 2334 - } while (unlikely(!hpage) && 2335 - likely(khugepaged_enabled())); 2336 - return hpage; 2337 - } 2338 - #endif 2339 - 2340 - static void khugepaged_loop(void) 2341 - { 2342 - struct page *hpage; 2343 - 2344 - #ifdef CONFIG_NUMA 2345 - hpage = NULL; 2346 - #endif 2347 - while (likely(khugepaged_enabled())) { 2348 - #ifndef CONFIG_NUMA 2349 - hpage = khugepaged_alloc_hugepage(); 2350 - if (unlikely(!hpage)) 2351 - break; 2352 - #else 2353 - if (IS_ERR(hpage)) { 2354 - khugepaged_alloc_sleep(); 2355 - hpage = NULL; 2356 - } 2357 - #endif 2358 - 2359 - khugepaged_do_scan(&hpage); 2360 - #ifndef CONFIG_NUMA 2361 - if (hpage) 2362 - put_page(hpage); 2363 - #endif 2364 - try_to_freeze(); 2365 - if (unlikely(kthread_should_stop())) 2366 - break; 2367 - if (khugepaged_has_work()) { 2368 - if (!khugepaged_scan_sleep_millisecs) 2369 - continue; 2370 - wait_event_freezable_timeout(khugepaged_wait, false, 2371 - msecs_to_jiffies(khugepaged_scan_sleep_millisecs)); 2372 - } else if (khugepaged_enabled()) 2373 - wait_event_freezable(khugepaged_wait, 2374 - khugepaged_wait_event()); 2291 + wait_event_freezable_timeout(khugepaged_wait, 2292 + kthread_should_stop(), 2293 + msecs_to_jiffies(khugepaged_scan_sleep_millisecs)); 2294 + return; 2375 2295 } 2296 + 2297 + if (khugepaged_enabled()) 2298 + wait_event_freezable(khugepaged_wait, khugepaged_wait_event()); 2376 2299 } 2377 2300 2378 2301 static int khugepaged(void *none) ··· 2341 2346 
set_freezable(); 2342 2347 set_user_nice(current, 19); 2343 2348 2344 - /* serialize with start_khugepaged() */ 2345 - mutex_lock(&khugepaged_mutex); 2346 - 2347 - for (;;) { 2348 - mutex_unlock(&khugepaged_mutex); 2349 - VM_BUG_ON(khugepaged_thread != current); 2350 - khugepaged_loop(); 2351 - VM_BUG_ON(khugepaged_thread != current); 2352 - 2353 - mutex_lock(&khugepaged_mutex); 2354 - if (!khugepaged_enabled()) 2355 - break; 2356 - if (unlikely(kthread_should_stop())) 2357 - break; 2349 + while (!kthread_should_stop()) { 2350 + khugepaged_do_scan(); 2351 + khugepaged_wait_work(); 2358 2352 } 2359 2353 2360 2354 spin_lock(&khugepaged_mm_lock); ··· 2352 2368 if (mm_slot) 2353 2369 collect_mm_slot(mm_slot); 2354 2370 spin_unlock(&khugepaged_mm_lock); 2355 - 2356 - khugepaged_thread = NULL; 2357 - mutex_unlock(&khugepaged_mutex); 2358 - 2359 2371 return 0; 2360 2372 } 2361 2373
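The new pgtable_trans_huge_deposit()/pgtable_trans_huge_withdraw() pair used throughout this file replaces the open-coded prepare_pmd_huge_pte()/get_pmd_huge_pte() helpers while keeping the same idea: stash preallocated PTE tables per-mm and hand them back in FIFO order when a huge pmd is split or zapped. A minimal userspace sketch of that discipline (the struct names and the plain singly-linked queue are illustrative, not the kernel implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for struct page / mm_struct; not kernel types. */
struct pgtable {
	int id;
	struct pgtable *next;
};

struct pgtable_stash {
	struct pgtable *head;	/* oldest deposit, withdrawn first */
	struct pgtable *tail;	/* newest deposit */
};

/* deposit: stash a preallocated page table for a later huge-pmd split */
static void deposit(struct pgtable_stash *s, struct pgtable *pt)
{
	pt->next = NULL;
	if (s->tail)
		s->tail->next = pt;
	else
		s->head = pt;
	s->tail = pt;
}

/* withdraw: hand back the oldest stashed table, NULL when empty */
static struct pgtable *withdraw(struct pgtable_stash *s)
{
	struct pgtable *pt = s->head;

	if (!pt)
		return NULL;
	s->head = pt->next;
	if (!s->head)
		s->tail = NULL;
	return pt;
}
```

Centralizing this behind an arch-overridable interface is what lets sparc64 (earlier in this series) supply its own deposit/withdraw implementation.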
+22 -12
mm/hugetlb.c
··· 30 30 #include <linux/hugetlb.h> 31 31 #include <linux/hugetlb_cgroup.h> 32 32 #include <linux/node.h> 33 - #include <linux/hugetlb_cgroup.h> 34 33 #include "internal.h" 35 34 36 35 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; ··· 636 637 h->surplus_huge_pages--; 637 638 h->surplus_huge_pages_node[nid]--; 638 639 } else { 640 + arch_clear_hugepage_flags(page); 639 641 enqueue_huge_page(h, page); 640 642 } 641 643 spin_unlock(&hugetlb_lock); ··· 671 671 } 672 672 } 673 673 674 + /* 675 + * PageHuge() only returns true for hugetlbfs pages, but not for normal or 676 + * transparent huge pages. See the PageTransHuge() documentation for more 677 + * details. 678 + */ 674 679 int PageHuge(struct page *page) 675 680 { 676 681 compound_page_dtor *dtor; ··· 2360 2355 struct page *page; 2361 2356 struct hstate *h = hstate_vma(vma); 2362 2357 unsigned long sz = huge_page_size(h); 2358 + const unsigned long mmun_start = start; /* For mmu_notifiers */ 2359 + const unsigned long mmun_end = end; /* For mmu_notifiers */ 2363 2360 2364 2361 WARN_ON(!is_vm_hugetlb_page(vma)); 2365 2362 BUG_ON(start & ~huge_page_mask(h)); 2366 2363 BUG_ON(end & ~huge_page_mask(h)); 2367 2364 2368 2365 tlb_start_vma(tlb, vma); 2369 - mmu_notifier_invalidate_range_start(mm, start, end); 2366 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 2370 2367 again: 2371 2368 spin_lock(&mm->page_table_lock); 2372 2369 for (address = start; address < end; address += sz) { ··· 2432 2425 if (address < end && !ref_page) 2433 2426 goto again; 2434 2427 } 2435 - mmu_notifier_invalidate_range_end(mm, start, end); 2428 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 2436 2429 tlb_end_vma(tlb, vma); 2437 2430 } 2438 2431 ··· 2480 2473 struct hstate *h = hstate_vma(vma); 2481 2474 struct vm_area_struct *iter_vma; 2482 2475 struct address_space *mapping; 2483 - struct prio_tree_iter iter; 2484 2476 pgoff_t pgoff; 2485 2477 2486 2478 /* ··· 2487 2481 * from page cache 
lookup which is in HPAGE_SIZE units. 2488 2482 */ 2489 2483 address = address & huge_page_mask(h); 2490 - pgoff = vma_hugecache_offset(h, vma, address); 2484 + pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + 2485 + vma->vm_pgoff; 2491 2486 mapping = vma->vm_file->f_dentry->d_inode->i_mapping; 2492 2487 2493 2488 /* ··· 2497 2490 * __unmap_hugepage_range() is called as the lock is already held 2498 2491 */ 2499 2492 mutex_lock(&mapping->i_mmap_mutex); 2500 - vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) { 2493 + vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) { 2501 2494 /* Do not unmap the current VMA */ 2502 2495 if (iter_vma == vma) 2503 2496 continue; ··· 2532 2525 struct page *old_page, *new_page; 2533 2526 int avoidcopy; 2534 2527 int outside_reserve = 0; 2528 + unsigned long mmun_start; /* For mmu_notifiers */ 2529 + unsigned long mmun_end; /* For mmu_notifiers */ 2535 2530 2536 2531 old_page = pte_page(pte); 2537 2532 ··· 2620 2611 pages_per_huge_page(h)); 2621 2612 __SetPageUptodate(new_page); 2622 2613 2614 + mmun_start = address & huge_page_mask(h); 2615 + mmun_end = mmun_start + huge_page_size(h); 2616 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 2623 2617 /* 2624 2618 * Retake the page_table_lock to check for racing updates 2625 2619 * before the page tables are altered ··· 2631 2619 ptep = huge_pte_offset(mm, address & huge_page_mask(h)); 2632 2620 if (likely(pte_same(huge_ptep_get(ptep), pte))) { 2633 2621 /* Break COW */ 2634 - mmu_notifier_invalidate_range_start(mm, 2635 - address & huge_page_mask(h), 2636 - (address & huge_page_mask(h)) + huge_page_size(h)); 2637 2622 huge_ptep_clear_flush(vma, address, ptep); 2638 2623 set_huge_pte_at(mm, address, ptep, 2639 2624 make_huge_pte(vma, new_page, 1)); ··· 2638 2629 hugepage_add_new_anon_rmap(new_page, vma, address); 2639 2630 /* Make the old page be freed below */ 2640 2631 new_page = old_page; 2641 - 
mmu_notifier_invalidate_range_end(mm, 2642 - address & huge_page_mask(h), 2643 - (address & huge_page_mask(h)) + huge_page_size(h)); 2644 2632 } 2633 + spin_unlock(&mm->page_table_lock); 2634 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 2635 + /* Caller expects lock to be held */ 2636 + spin_lock(&mm->page_table_lock); 2645 2637 page_cache_release(new_page); 2646 2638 page_cache_release(old_page); 2647 2639 return 0;
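In the hugetlb hunk above, unmap_ref_private() now open-codes the file offset instead of calling vma_hugecache_offset(), because the new vma_interval_tree is indexed in base-page units rather than huge-page units. The arithmetic is just: pages into the VMA, plus the VMA's own offset into the file. A sketch, with an assumed 4KB base page for the example:

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumes 4KB base pages; only for the example */

/*
 * File page offset (in base-page units) that a virtual address maps to:
 * distance into the VMA, converted to pages, plus the VMA's file offset.
 */
static unsigned long addr_to_pgoff(unsigned long address,
				   unsigned long vm_start,
				   unsigned long vm_pgoff)
{
	return ((address - vm_start) >> PAGE_SHIFT) + vm_pgoff;
}
```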
+32 -20
mm/internal.h
··· 118 118 unsigned long nr_freepages; /* Number of isolated free pages */ 119 119 unsigned long nr_migratepages; /* Number of pages to migrate */ 120 120 unsigned long free_pfn; /* isolate_freepages search base */ 121 - unsigned long start_free_pfn; /* where we started the search */ 122 121 unsigned long migrate_pfn; /* isolate_migratepages search base */ 123 122 bool sync; /* Synchronous migration */ 124 - bool wrapped; /* Order > 0 compactions are 125 - incremental, once free_pfn 126 - and migrate_pfn meet, we restart 127 - from the top of the zone; 128 - remember we wrapped around. */ 123 + bool ignore_skip_hint; /* Scan blocks even if marked skip */ 124 + bool finished_update_free; /* True when the zone cached pfns are 125 + * no longer being updated 126 + */ 127 + bool finished_update_migrate; 129 128 130 129 int order; /* order a direct compactor needs */ 131 130 int migratetype; /* MOVABLE, RECLAIMABLE etc */ 132 131 struct zone *zone; 133 - bool *contended; /* True if a lock was contended */ 132 + bool contended; /* True if a lock was contended */ 133 + struct page **page; /* Page captured of requested size */ 134 134 }; 135 135 136 136 unsigned long 137 - isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn); 137 + isolate_freepages_range(struct compact_control *cc, 138 + unsigned long start_pfn, unsigned long end_pfn); 138 139 unsigned long 139 140 isolate_migratepages_range(struct zone *zone, struct compact_control *cc, 140 - unsigned long low_pfn, unsigned long end_pfn); 141 + unsigned long low_pfn, unsigned long end_pfn, bool unevictable); 141 142 142 143 #endif 143 144 ··· 168 167 } 169 168 170 169 /* 171 - * Called only in fault path via page_evictable() for a new page 172 - * to determine if it's being mapped into a LOCKED vma. 173 - * If so, mark page as mlocked. 170 + * Called only in fault path, to determine if a new page is being 171 + * mapped into a LOCKED vma. If it is, mark page as mlocked. 
174 172 */ 175 173 static inline int mlocked_vma_newpage(struct vm_area_struct *vma, 176 174 struct page *page) ··· 180 180 return 0; 181 181 182 182 if (!TestSetPageMlocked(page)) { 183 - inc_zone_page_state(page, NR_MLOCK); 183 + mod_zone_page_state(page_zone(page), NR_MLOCK, 184 + hpage_nr_pages(page)); 184 185 count_vm_event(UNEVICTABLE_PGMLOCKED); 185 186 } 186 187 return 1; ··· 202 201 * If called for a page that is still mapped by mlocked vmas, all we do 203 202 * is revert to lazy LRU behaviour -- semantics are not broken. 204 203 */ 205 - extern void __clear_page_mlock(struct page *page); 206 - static inline void clear_page_mlock(struct page *page) 207 - { 208 - if (unlikely(TestClearPageMlocked(page))) 209 - __clear_page_mlock(page); 210 - } 204 + extern void clear_page_mlock(struct page *page); 211 205 212 206 /* 213 207 * mlock_migrate_page - called only from migrate_page_copy() to ··· 336 340 #define ZONE_RECLAIM_FULL -1 337 341 #define ZONE_RECLAIM_SOME 0 338 342 #define ZONE_RECLAIM_SUCCESS 1 339 - #endif 340 343 341 344 extern int hwpoison_filter(struct page *p); 342 345 ··· 351 356 unsigned long, unsigned long); 352 357 353 358 extern void set_pageblock_order(void); 359 + unsigned long reclaim_clean_pages_from_list(struct zone *zone, 360 + struct list_head *page_list); 361 + /* The ALLOC_WMARK bits are used as an index to zone->watermark */ 362 + #define ALLOC_WMARK_MIN WMARK_MIN 363 + #define ALLOC_WMARK_LOW WMARK_LOW 364 + #define ALLOC_WMARK_HIGH WMARK_HIGH 365 + #define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */ 366 + 367 + /* Mask to get the watermark bits */ 368 + #define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1) 369 + 370 + #define ALLOC_HARDER 0x10 /* try to alloc harder */ 371 + #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ 372 + #define ALLOC_CPUSET 0x40 /* check for correct cpuset */ 373 + #define ALLOC_CMA 0x80 /* allow allocations from CMA areas */ 374 + 375 + #endif /* __MM_INTERNAL_H */
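The ALLOC_* definitions moved into mm/internal.h above encode two things in one word: the low bits are an index into zone->watermark[] (so the ALLOC_WMARK_* values must stay below ALLOC_NO_WATERMARKS), and the high bits are independent boolean flags that ALLOC_WMARK_MASK strips off. A sketch of that split; the WMARK_* indices are assumed to be 0/1/2 as in mmzone.h:

```c
#include <assert.h>

/* Assumed watermark indices, as defined in include/linux/mmzone.h. */
enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH };

/* Values copied from the hunk above. */
#define ALLOC_WMARK_MIN		WMARK_MIN
#define ALLOC_WMARK_LOW		WMARK_LOW
#define ALLOC_WMARK_HIGH	WMARK_HIGH
#define ALLOC_NO_WATERMARKS	0x04	/* don't check watermarks at all */
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS - 1)	/* == 0x03 */
#define ALLOC_HARDER		0x10	/* try to alloc harder */
#define ALLOC_HIGH		0x20	/* __GFP_HIGH set */

/* Which watermark does an alloc_flags word ask to be checked against? */
static int wmark_index(int alloc_flags)
{
	return alloc_flags & ALLOC_WMARK_MASK;
}
```

The mask is why the flag values start at 0x10: 0x04 and 0x08 would collide with (or sit inside) the watermark index space.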
+112
mm/interval_tree.c
··· 1 + /* 2 + * mm/interval_tree.c - interval tree for mapping->i_mmap 3 + * 4 + * Copyright (C) 2012, Michel Lespinasse <walken@google.com> 5 + * 6 + * This file is released under the GPL v2. 7 + */ 8 + 9 + #include <linux/mm.h> 10 + #include <linux/fs.h> 11 + #include <linux/rmap.h> 12 + #include <linux/interval_tree_generic.h> 13 + 14 + static inline unsigned long vma_start_pgoff(struct vm_area_struct *v) 15 + { 16 + return v->vm_pgoff; 17 + } 18 + 19 + static inline unsigned long vma_last_pgoff(struct vm_area_struct *v) 20 + { 21 + return v->vm_pgoff + ((v->vm_end - v->vm_start) >> PAGE_SHIFT) - 1; 22 + } 23 + 24 + INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.linear.rb, 25 + unsigned long, shared.linear.rb_subtree_last, 26 + vma_start_pgoff, vma_last_pgoff,, vma_interval_tree) 27 + 28 + /* Insert node immediately after prev in the interval tree */ 29 + void vma_interval_tree_insert_after(struct vm_area_struct *node, 30 + struct vm_area_struct *prev, 31 + struct rb_root *root) 32 + { 33 + struct rb_node **link; 34 + struct vm_area_struct *parent; 35 + unsigned long last = vma_last_pgoff(node); 36 + 37 + VM_BUG_ON(vma_start_pgoff(node) != vma_start_pgoff(prev)); 38 + 39 + if (!prev->shared.linear.rb.rb_right) { 40 + parent = prev; 41 + link = &prev->shared.linear.rb.rb_right; 42 + } else { 43 + parent = rb_entry(prev->shared.linear.rb.rb_right, 44 + struct vm_area_struct, shared.linear.rb); 45 + if (parent->shared.linear.rb_subtree_last < last) 46 + parent->shared.linear.rb_subtree_last = last; 47 + while (parent->shared.linear.rb.rb_left) { 48 + parent = rb_entry(parent->shared.linear.rb.rb_left, 49 + struct vm_area_struct, shared.linear.rb); 50 + if (parent->shared.linear.rb_subtree_last < last) 51 + parent->shared.linear.rb_subtree_last = last; 52 + } 53 + link = &parent->shared.linear.rb.rb_left; 54 + } 55 + 56 + node->shared.linear.rb_subtree_last = last; 57 + rb_link_node(&node->shared.linear.rb, &parent->shared.linear.rb, link); 58 + 
rb_insert_augmented(&node->shared.linear.rb, root, 59 + &vma_interval_tree_augment); 60 + } 61 + 62 + static inline unsigned long avc_start_pgoff(struct anon_vma_chain *avc) 63 + { 64 + return vma_start_pgoff(avc->vma); 65 + } 66 + 67 + static inline unsigned long avc_last_pgoff(struct anon_vma_chain *avc) 68 + { 69 + return vma_last_pgoff(avc->vma); 70 + } 71 + 72 + INTERVAL_TREE_DEFINE(struct anon_vma_chain, rb, unsigned long, rb_subtree_last, 73 + avc_start_pgoff, avc_last_pgoff, 74 + static inline, __anon_vma_interval_tree) 75 + 76 + void anon_vma_interval_tree_insert(struct anon_vma_chain *node, 77 + struct rb_root *root) 78 + { 79 + #ifdef CONFIG_DEBUG_VM_RB 80 + node->cached_vma_start = avc_start_pgoff(node); 81 + node->cached_vma_last = avc_last_pgoff(node); 82 + #endif 83 + __anon_vma_interval_tree_insert(node, root); 84 + } 85 + 86 + void anon_vma_interval_tree_remove(struct anon_vma_chain *node, 87 + struct rb_root *root) 88 + { 89 + __anon_vma_interval_tree_remove(node, root); 90 + } 91 + 92 + struct anon_vma_chain * 93 + anon_vma_interval_tree_iter_first(struct rb_root *root, 94 + unsigned long first, unsigned long last) 95 + { 96 + return __anon_vma_interval_tree_iter_first(root, first, last); 97 + } 98 + 99 + struct anon_vma_chain * 100 + anon_vma_interval_tree_iter_next(struct anon_vma_chain *node, 101 + unsigned long first, unsigned long last) 102 + { 103 + return __anon_vma_interval_tree_iter_next(node, first, last); 104 + } 105 + 106 + #ifdef CONFIG_DEBUG_VM_RB 107 + void anon_vma_interval_tree_verify(struct anon_vma_chain *node) 108 + { 109 + WARN_ON_ONCE(node->cached_vma_start != avc_start_pgoff(node)); 110 + WARN_ON_ONCE(node->cached_vma_last != avc_last_pgoff(node)); 111 + } 112 + #endif
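The endpoints that INTERVAL_TREE_DEFINE indexes on in the new mm/interval_tree.c are closed, which is why vma_last_pgoff() subtracts one: a VMA covering N pages of a file spans pgoffs [vm_pgoff, vm_pgoff + N - 1]. A standalone sketch of the two accessors (stub struct, assumed 4KB pages):

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumes 4KB base pages; only for the example */

struct vma_stub {	/* illustrative subset of vm_area_struct */
	unsigned long vm_start, vm_end;	/* [start, end) virtual range */
	unsigned long vm_pgoff;		/* offset into the file, in pages */
};

static unsigned long vma_start_pgoff(const struct vma_stub *v)
{
	return v->vm_pgoff;
}

/* Last file page covered: closed endpoint, hence the trailing -1. */
static unsigned long vma_last_pgoff(const struct vma_stub *v)
{
	return v->vm_pgoff + ((v->vm_end - v->vm_start) >> PAGE_SHIFT) - 1;
}
```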
+50 -48
mm/kmemleak.c
··· 29 29 * - kmemleak_lock (rwlock): protects the object_list modifications and 30 30 * accesses to the object_tree_root. The object_list is the main list 31 31 * holding the metadata (struct kmemleak_object) for the allocated memory 32 - * blocks. The object_tree_root is a priority search tree used to look-up 32 + * blocks. The object_tree_root is a red black tree used to look-up 33 33 * metadata based on a pointer to the corresponding memory block. The 34 34 * kmemleak_object structures are added to the object_list and 35 35 * object_tree_root in the create_object() function called from the ··· 71 71 #include <linux/delay.h> 72 72 #include <linux/export.h> 73 73 #include <linux/kthread.h> 74 - #include <linux/prio_tree.h> 74 + #include <linux/rbtree.h> 75 75 #include <linux/fs.h> 76 76 #include <linux/debugfs.h> 77 77 #include <linux/seq_file.h> ··· 132 132 * Structure holding the metadata for each allocated memory block. 133 133 * Modifications to such objects should be made while holding the 134 134 * object->lock. Insertions or deletions from object_list, gray_list or 135 - * tree_node are already protected by the corresponding locks or mutex (see 135 + * rb_node are already protected by the corresponding locks or mutex (see 136 136 * the notes on locking above). These objects are reference-counted 137 137 * (use_count) and freed using the RCU mechanism. 
138 138 */ ··· 141 141 unsigned long flags; /* object status flags */ 142 142 struct list_head object_list; 143 143 struct list_head gray_list; 144 - struct prio_tree_node tree_node; 144 + struct rb_node rb_node; 145 145 struct rcu_head rcu; /* object_list lockless traversal */ 146 146 /* object usage count; object freed when use_count == 0 */ 147 147 atomic_t use_count; ··· 182 182 static LIST_HEAD(object_list); 183 183 /* the list of gray-colored objects (see color_gray comment below) */ 184 184 static LIST_HEAD(gray_list); 185 - /* prio search tree for object boundaries */ 186 - static struct prio_tree_root object_tree_root; 187 - /* rw_lock protecting the access to object_list and prio_tree_root */ 185 + /* search tree for object boundaries */ 186 + static struct rb_root object_tree_root = RB_ROOT; 187 + /* rw_lock protecting the access to object_list and object_tree_root */ 188 188 static DEFINE_RWLOCK(kmemleak_lock); 189 189 190 190 /* allocation caches for kmemleak internal data */ ··· 380 380 trace.entries = object->trace; 381 381 382 382 pr_notice("Object 0x%08lx (size %zu):\n", 383 - object->tree_node.start, object->size); 383 + object->pointer, object->size); 384 384 pr_notice(" comm \"%s\", pid %d, jiffies %lu\n", 385 385 object->comm, object->pid, object->jiffies); 386 386 pr_notice(" min_count = %d\n", object->min_count); ··· 392 392 } 393 393 394 394 /* 395 - * Look-up a memory block metadata (kmemleak_object) in the priority search 395 + * Look-up a memory block metadata (kmemleak_object) in the object search 396 396 * tree based on a pointer value. If alias is 0, only values pointing to the 397 397 * beginning of the memory block are allowed. The kmemleak_lock must be held 398 398 * when calling this function. 
399 399 */ 400 400 static struct kmemleak_object *lookup_object(unsigned long ptr, int alias) 401 401 { 402 - struct prio_tree_node *node; 403 - struct prio_tree_iter iter; 404 - struct kmemleak_object *object; 402 + struct rb_node *rb = object_tree_root.rb_node; 405 403 406 - prio_tree_iter_init(&iter, &object_tree_root, ptr, ptr); 407 - node = prio_tree_next(&iter); 408 - if (node) { 409 - object = prio_tree_entry(node, struct kmemleak_object, 410 - tree_node); 411 - if (!alias && object->pointer != ptr) { 404 + while (rb) { 405 + struct kmemleak_object *object = 406 + rb_entry(rb, struct kmemleak_object, rb_node); 407 + if (ptr < object->pointer) 408 + rb = object->rb_node.rb_left; 409 + else if (object->pointer + object->size <= ptr) 410 + rb = object->rb_node.rb_right; 411 + else if (object->pointer == ptr || alias) 412 + return object; 413 + else { 412 414 kmemleak_warn("Found object by alias at 0x%08lx\n", 413 415 ptr); 414 416 dump_object_info(object); 415 - object = NULL; 417 + break; 416 418 } 417 - } else 418 - object = NULL; 419 - 420 - return object; 419 + } 420 + return NULL; 421 421 } 422 422 423 423 /* ··· 471 471 } 472 472 473 473 /* 474 - * Look up an object in the prio search tree and increase its use_count. 474 + * Look up an object in the object search tree and increase its use_count. 
475 475 */ 476 476 static struct kmemleak_object *find_and_get_object(unsigned long ptr, int alias) 477 477 { ··· 516 516 int min_count, gfp_t gfp) 517 517 { 518 518 unsigned long flags; 519 - struct kmemleak_object *object; 520 - struct prio_tree_node *node; 519 + struct kmemleak_object *object, *parent; 520 + struct rb_node **link, *rb_parent; 521 521 522 522 object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp)); 523 523 if (!object) { ··· 560 560 /* kernel backtrace */ 561 561 object->trace_len = __save_stack_trace(object->trace); 562 562 563 - INIT_PRIO_TREE_NODE(&object->tree_node); 564 - object->tree_node.start = ptr; 565 - object->tree_node.last = ptr + size - 1; 566 - 567 563 write_lock_irqsave(&kmemleak_lock, flags); 568 564 569 565 min_addr = min(min_addr, ptr); 570 566 max_addr = max(max_addr, ptr + size); 571 - node = prio_tree_insert(&object_tree_root, &object->tree_node); 572 - /* 573 - * The code calling the kernel does not yet have the pointer to the 574 - * memory block to be able to free it. However, we still hold the 575 - * kmemleak_lock here in case parts of the kernel started freeing 576 - * random memory blocks. 
577 - */ 578 - if (node != &object->tree_node) { 579 - kmemleak_stop("Cannot insert 0x%lx into the object search tree " 580 - "(already existing)\n", ptr); 581 - object = lookup_object(ptr, 1); 582 - spin_lock(&object->lock); 583 - dump_object_info(object); 584 - spin_unlock(&object->lock); 585 - 586 - goto out; 567 + link = &object_tree_root.rb_node; 568 + rb_parent = NULL; 569 + while (*link) { 570 + rb_parent = *link; 571 + parent = rb_entry(rb_parent, struct kmemleak_object, rb_node); 572 + if (ptr + size <= parent->pointer) 573 + link = &parent->rb_node.rb_left; 574 + else if (parent->pointer + parent->size <= ptr) 575 + link = &parent->rb_node.rb_right; 576 + else { 577 + kmemleak_stop("Cannot insert 0x%lx into the object " 578 + "search tree (overlaps existing)\n", 579 + ptr); 580 + kmem_cache_free(object_cache, object); 581 + object = parent; 582 + spin_lock(&object->lock); 583 + dump_object_info(object); 584 + spin_unlock(&object->lock); 585 + goto out; 586 + } 587 587 } 588 + rb_link_node(&object->rb_node, rb_parent, link); 589 + rb_insert_color(&object->rb_node, &object_tree_root); 590 + 588 591 list_add_tail_rcu(&object->object_list, &object_list); 589 592 out: 590 593 write_unlock_irqrestore(&kmemleak_lock, flags); ··· 603 600 unsigned long flags; 604 601 605 602 write_lock_irqsave(&kmemleak_lock, flags); 606 - prio_tree_remove(&object_tree_root, &object->tree_node); 603 + rb_erase(&object->rb_node, &object_tree_root); 607 604 list_del_rcu(&object->object_list); 608 605 write_unlock_irqrestore(&kmemleak_lock, flags); 609 606 ··· 1769 1766 1770 1767 object_cache = KMEM_CACHE(kmemleak_object, SLAB_NOLEAKTRACE); 1771 1768 scan_area_cache = KMEM_CACHE(kmemleak_scan_area, SLAB_NOLEAKTRACE); 1772 - INIT_PRIO_TREE_ROOT(&object_tree_root); 1773 1769 1774 1770 if (crt_early_log >= ARRAY_SIZE(early_log)) 1775 1771 pr_warning("Early log buffer exceeded (%d), please increase "
+32 -8
mm/ksm.c
··· 709 709 spinlock_t *ptl; 710 710 int swapped; 711 711 int err = -EFAULT; 712 + unsigned long mmun_start; /* For mmu_notifiers */ 713 + unsigned long mmun_end; /* For mmu_notifiers */ 712 714 713 715 addr = page_address_in_vma(page, vma); 714 716 if (addr == -EFAULT) 715 717 goto out; 716 718 717 719 BUG_ON(PageTransCompound(page)); 720 + 721 + mmun_start = addr; 722 + mmun_end = addr + PAGE_SIZE; 723 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 724 + 718 725 ptep = page_check_address(page, mm, addr, &ptl, 0); 719 726 if (!ptep) 720 - goto out; 727 + goto out_mn; 721 728 722 729 if (pte_write(*ptep) || pte_dirty(*ptep)) { 723 730 pte_t entry; ··· 759 752 760 753 out_unlock: 761 754 pte_unmap_unlock(ptep, ptl); 755 + out_mn: 756 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 762 757 out: 763 758 return err; 764 759 } ··· 785 776 spinlock_t *ptl; 786 777 unsigned long addr; 787 778 int err = -EFAULT; 779 + unsigned long mmun_start; /* For mmu_notifiers */ 780 + unsigned long mmun_end; /* For mmu_notifiers */ 788 781 789 782 addr = page_address_in_vma(page, vma); 790 783 if (addr == -EFAULT) ··· 805 794 if (!pmd_present(*pmd)) 806 795 goto out; 807 796 797 + mmun_start = addr; 798 + mmun_end = addr + PAGE_SIZE; 799 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 800 + 808 801 ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); 809 802 if (!pte_same(*ptep, orig_pte)) { 810 803 pte_unmap_unlock(ptep, ptl); 811 - goto out; 804 + goto out_mn; 812 805 } 813 806 814 807 get_page(kpage); ··· 829 814 830 815 pte_unmap_unlock(ptep, ptl); 831 816 err = 0; 817 + out_mn: 818 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 832 819 out: 833 820 return err; 834 821 } ··· 1486 1469 */ 1487 1470 if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE | 1488 1471 VM_PFNMAP | VM_IO | VM_DONTEXPAND | 1489 - VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE | 1490 - VM_NONLINEAR | VM_MIXEDMAP | VM_SAO)) 1472 + VM_HUGETLB | 
VM_NONLINEAR | VM_MIXEDMAP)) 1491 1473 return 0; /* just ignore the advice */ 1474 + 1475 + #ifdef VM_SAO 1476 + if (*vm_flags & VM_SAO) 1477 + return 0; 1478 + #endif 1492 1479 1493 1480 if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) { 1494 1481 err = __ksm_enter(mm); ··· 1603 1582 SetPageSwapBacked(new_page); 1604 1583 __set_page_locked(new_page); 1605 1584 1606 - if (page_evictable(new_page, vma)) 1585 + if (!mlocked_vma_newpage(vma, new_page)) 1607 1586 lru_cache_add_lru(new_page, LRU_ACTIVE_ANON); 1608 1587 else 1609 1588 add_page_to_unevictable_list(new_page); ··· 1635 1614 struct vm_area_struct *vma; 1636 1615 1637 1616 anon_vma_lock(anon_vma); 1638 - list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) { 1617 + anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 1618 + 0, ULONG_MAX) { 1639 1619 vma = vmac->vma; 1640 1620 if (rmap_item->address < vma->vm_start || 1641 1621 rmap_item->address >= vma->vm_end) ··· 1689 1667 struct vm_area_struct *vma; 1690 1668 1691 1669 anon_vma_lock(anon_vma); 1692 - list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) { 1670 + anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 1671 + 0, ULONG_MAX) { 1693 1672 vma = vmac->vma; 1694 1673 if (rmap_item->address < vma->vm_start || 1695 1674 rmap_item->address >= vma->vm_end) ··· 1742 1719 struct vm_area_struct *vma; 1743 1720 1744 1721 anon_vma_lock(anon_vma); 1745 - list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) { 1722 + anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 1723 + 0, ULONG_MAX) { 1746 1724 vma = vmac->vma; 1747 1725 if (rmap_item->address < vma->vm_start || 1748 1726 rmap_item->address >= vma->vm_end)
+6 -2
mm/madvise.c
··· 69 69 new_flags &= ~VM_DONTCOPY; 70 70 break; 71 71 case MADV_DONTDUMP: 72 - new_flags |= VM_NODUMP; 72 + new_flags |= VM_DONTDUMP; 73 73 break; 74 74 case MADV_DODUMP: 75 - new_flags &= ~VM_NODUMP; 75 + if (new_flags & VM_SPECIAL) { 76 + error = -EINVAL; 77 + goto out; 78 + } 79 + new_flags &= ~VM_DONTDUMP; 76 80 break; 77 81 case MADV_MERGEABLE: 78 82 case MADV_UNMERGEABLE:
+3 -2
mm/memblock.c
··· 41 41 static int memblock_reserved_in_slab __initdata_memblock = 0; 42 42 43 43 /* inline so we don't get a warning when pr_debug is compiled out */ 44 - static inline const char *memblock_type_name(struct memblock_type *type) 44 + static __init_memblock const char * 45 + memblock_type_name(struct memblock_type *type) 45 46 { 46 47 if (type == &memblock.memory) 47 48 return "memory"; ··· 757 756 return ret; 758 757 759 758 for (i = start_rgn; i < end_rgn; i++) 760 - type->regions[i].nid = nid; 759 + memblock_set_region_node(&type->regions[i], nid); 761 760 762 761 memblock_merge_regions(type); 763 762 return 0;
+9 -15
mm/memcontrol.c
··· 51 51 #include <linux/oom.h> 52 52 #include "internal.h" 53 53 #include <net/sock.h> 54 + #include <net/ip.h> 54 55 #include <net/tcp_memcontrol.h> 55 56 56 57 #include <asm/uaccess.h> ··· 327 326 struct mem_cgroup_stat_cpu nocpu_base; 328 327 spinlock_t pcp_counter_lock; 329 328 330 - #ifdef CONFIG_INET 329 + #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) 331 330 struct tcp_memcontrol tcp_mem; 332 331 #endif 333 332 }; ··· 412 411 return container_of(s, struct mem_cgroup, css); 413 412 } 414 413 415 - /* Writing them here to avoid exposing memcg's inner layout */ 416 - #ifdef CONFIG_MEMCG_KMEM 417 - #include <net/sock.h> 418 - #include <net/ip.h> 414 + static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) 415 + { 416 + return (memcg == root_mem_cgroup); 417 + } 419 418 420 - static bool mem_cgroup_is_root(struct mem_cgroup *memcg); 419 + /* Writing them here to avoid exposing memcg's inner layout */ 420 + #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) 421 + 421 422 void sock_update_memcg(struct sock *sk) 422 423 { 423 424 if (mem_cgroup_sockets_enabled) { ··· 464 461 } 465 462 } 466 463 467 - #ifdef CONFIG_INET 468 464 struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg) 469 465 { 470 466 if (!memcg || mem_cgroup_is_root(memcg)) ··· 472 470 return &memcg->tcp_mem.cg_proto; 473 471 } 474 472 EXPORT_SYMBOL(tcp_proto_cgroup); 475 - #endif /* CONFIG_INET */ 476 - #endif /* CONFIG_MEMCG_KMEM */ 477 473 478 - #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) 479 474 static void disarm_sock_keys(struct mem_cgroup *memcg) 480 475 { 481 476 if (!memcg_proto_activated(&memcg->tcp_mem.cg_proto)) ··· 1014 1015 for (iter = mem_cgroup_iter(NULL, NULL, NULL); \ 1015 1016 iter != NULL; \ 1016 1017 iter = mem_cgroup_iter(NULL, iter, NULL)) 1017 - 1018 - static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) 1019 - { 1020 - return (memcg == root_mem_cgroup); 1021 - } 1022 1018 1023 1019 void mem_cgroup_count_vm_event(struct 
mm_struct *mm, enum vm_event_item idx) 1024 1020 {
+5 -3
mm/memory-failure.c
··· 400 400 struct vm_area_struct *vma; 401 401 struct task_struct *tsk; 402 402 struct anon_vma *av; 403 + pgoff_t pgoff; 403 404 404 405 av = page_lock_anon_vma(page); 405 406 if (av == NULL) /* Not actually mapped anymore */ 406 407 return; 407 408 409 + pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 408 410 read_lock(&tasklist_lock); 409 411 for_each_process (tsk) { 410 412 struct anon_vma_chain *vmac; 411 413 412 414 if (!task_early_kill(tsk)) 413 415 continue; 414 - list_for_each_entry(vmac, &av->head, same_anon_vma) { 416 + anon_vma_interval_tree_foreach(vmac, &av->rb_root, 417 + pgoff, pgoff) { 415 418 vma = vmac->vma; 416 419 if (!page_mapped_in_vma(page, vma)) 417 420 continue; ··· 434 431 { 435 432 struct vm_area_struct *vma; 436 433 struct task_struct *tsk; 437 - struct prio_tree_iter iter; 438 434 struct address_space *mapping = page->mapping; 439 435 440 436 mutex_lock(&mapping->i_mmap_mutex); ··· 444 442 if (!task_early_kill(tsk)) 445 443 continue; 446 444 447 - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, 445 + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, 448 446 pgoff) { 449 447 /* 450 448 * Send early kill signal to tasks where a vma covers
+66 -51
mm/memory.c
··· 712 712 add_taint(TAINT_BAD_PAGE); 713 713 } 714 714 715 - static inline int is_cow_mapping(vm_flags_t flags) 715 + static inline bool is_cow_mapping(vm_flags_t flags) 716 716 { 717 717 return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; 718 718 } ··· 1039 1039 unsigned long next; 1040 1040 unsigned long addr = vma->vm_start; 1041 1041 unsigned long end = vma->vm_end; 1042 + unsigned long mmun_start; /* For mmu_notifiers */ 1043 + unsigned long mmun_end; /* For mmu_notifiers */ 1044 + bool is_cow; 1042 1045 int ret; 1043 1046 1044 1047 /* ··· 1050 1047 * readonly mappings. The tradeoff is that copy_page_range is more 1051 1048 * efficient than faulting. 1052 1049 */ 1053 - if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { 1050 + if (!(vma->vm_flags & (VM_HUGETLB | VM_NONLINEAR | 1051 + VM_PFNMAP | VM_MIXEDMAP))) { 1054 1052 if (!vma->anon_vma) 1055 1053 return 0; 1056 1054 } ··· 1059 1055 if (is_vm_hugetlb_page(vma)) 1060 1056 return copy_hugetlb_page_range(dst_mm, src_mm, vma); 1061 1057 1062 - if (unlikely(is_pfn_mapping(vma))) { 1058 + if (unlikely(vma->vm_flags & VM_PFNMAP)) { 1063 1059 /* 1064 1060 * We do not free on error cases below as remove_vma 1065 1061 * gets called on error from higher level routine 1066 1062 */ 1067 - ret = track_pfn_vma_copy(vma); 1063 + ret = track_pfn_copy(vma); 1068 1064 if (ret) 1069 1065 return ret; 1070 1066 } ··· 1075 1071 * parent mm. And a permission downgrade will only happen if 1076 1072 * is_cow_mapping() returns true. 
1077 1073 */ 1078 - if (is_cow_mapping(vma->vm_flags)) 1079 - mmu_notifier_invalidate_range_start(src_mm, addr, end); 1074 + is_cow = is_cow_mapping(vma->vm_flags); 1075 + mmun_start = addr; 1076 + mmun_end = end; 1077 + if (is_cow) 1078 + mmu_notifier_invalidate_range_start(src_mm, mmun_start, 1079 + mmun_end); 1080 1080 1081 1081 ret = 0; 1082 1082 dst_pgd = pgd_offset(dst_mm, addr); ··· 1096 1088 } 1097 1089 } while (dst_pgd++, src_pgd++, addr = next, addr != end); 1098 1090 1099 - if (is_cow_mapping(vma->vm_flags)) 1100 - mmu_notifier_invalidate_range_end(src_mm, 1101 - vma->vm_start, end); 1091 + if (is_cow) 1092 + mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end); 1102 1093 return ret; 1103 1094 } 1104 1095 ··· 1334 1327 if (vma->vm_file) 1335 1328 uprobe_munmap(vma, start, end); 1336 1329 1337 - if (unlikely(is_pfn_mapping(vma))) 1338 - untrack_pfn_vma(vma, 0, 0); 1330 + if (unlikely(vma->vm_flags & VM_PFNMAP)) 1331 + untrack_pfn(vma, 0, 0); 1339 1332 1340 1333 if (start != end) { 1341 1334 if (unlikely(is_vm_hugetlb_page(vma))) { ··· 1528 1521 spin_unlock(&mm->page_table_lock); 1529 1522 wait_split_huge_page(vma->anon_vma, pmd); 1530 1523 } else { 1531 - page = follow_trans_huge_pmd(mm, address, 1524 + page = follow_trans_huge_pmd(vma, address, 1532 1525 pmd, flags); 1533 1526 spin_unlock(&mm->page_table_lock); 1534 1527 goto out; ··· 1583 1576 if (page->mapping && trylock_page(page)) { 1584 1577 lru_add_drain(); /* push cached pages to LRU */ 1585 1578 /* 1586 - * Because we lock page here and migration is 1587 - * blocked by the pte's page reference, we need 1588 - * only check for file-cache page truncation. 1579 + * Because we lock page here, and migration is 1580 + * blocked by the pte's page reference, and we 1581 + * know the page is still mapped, we don't even 1582 + * need to check for file-cache page truncation. 
1589 1583 */ 1590 - if (page->mapping) 1591 - mlock_vma_page(page); 1584 + mlock_vma_page(page); 1592 1585 unlock_page(page); 1593 1586 } 1594 1587 } ··· 2092 2085 * ask for a shared writable mapping! 2093 2086 * 2094 2087 * The page does not need to be reserved. 2088 + * 2089 + * Usually this function is called from f_op->mmap() handler 2090 + * under mm->mmap_sem write-lock, so it can change vma->vm_flags. 2091 + * Caller must set VM_MIXEDMAP on vma if it wants to call this 2092 + * function from other places, for example from page-fault handler. 2095 2093 */ 2096 2094 int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, 2097 2095 struct page *page) ··· 2105 2093 return -EFAULT; 2106 2094 if (!page_count(page)) 2107 2095 return -EINVAL; 2108 - vma->vm_flags |= VM_INSERTPAGE; 2096 + if (!(vma->vm_flags & VM_MIXEDMAP)) { 2097 + BUG_ON(down_read_trylock(&vma->vm_mm->mmap_sem)); 2098 + BUG_ON(vma->vm_flags & VM_PFNMAP); 2099 + vma->vm_flags |= VM_MIXEDMAP; 2100 + } 2109 2101 return insert_page(vma, addr, page, vma->vm_page_prot); 2110 2102 } 2111 2103 EXPORT_SYMBOL(vm_insert_page); ··· 2148 2132 * @addr: target user address of this page 2149 2133 * @pfn: source kernel pfn 2150 2134 * 2151 - * Similar to vm_inert_page, this allows drivers to insert individual pages 2135 + * Similar to vm_insert_page, this allows drivers to insert individual pages 2152 2136 * they've allocated into a user vma. Same comments apply. 
2153 2137 * 2154 2138 * This function should only be called from a vm_ops->fault handler, and ··· 2178 2162 2179 2163 if (addr < vma->vm_start || addr >= vma->vm_end) 2180 2164 return -EFAULT; 2181 - if (track_pfn_vma_new(vma, &pgprot, pfn, PAGE_SIZE)) 2165 + if (track_pfn_insert(vma, &pgprot, pfn)) 2182 2166 return -EINVAL; 2183 2167 2184 2168 ret = insert_pfn(vma, addr, pfn, pgprot); 2185 - 2186 - if (ret) 2187 - untrack_pfn_vma(vma, pfn, PAGE_SIZE); 2188 2169 2189 2170 return ret; 2190 2171 } ··· 2303 2290 * rest of the world about it: 2304 2291 * VM_IO tells people not to look at these pages 2305 2292 * (accesses can have side effects). 2306 - * VM_RESERVED is specified all over the place, because 2307 - * in 2.4 it kept swapout's vma scan off this vma; but 2308 - * in 2.6 the LRU scan won't even find its pages, so this 2309 - * flag means no more than count its pages in reserved_vm, 2310 - * and omit it from core dump, even when VM_IO turned off. 2311 2293 * VM_PFNMAP tells the core MM that the base pages are just 2312 2294 * raw PFN mappings, and do not have a "struct page" associated 2313 2295 * with them. 2296 + * VM_DONTEXPAND 2297 + * Disable vma merging and expanding with mremap(). 2298 + * VM_DONTDUMP 2299 + * Omit vma from core dump, even when VM_IO turned off. 2314 2300 * 2315 2301 * There's a horrible special case to handle copy-on-write 2316 2302 * behaviour that some programs depend on. We mark the "original" 2317 2303 * un-COW'ed pages by matching them up with "vma->vm_pgoff". 2304 + * See vm_normal_page() for details. 
2318 2305 */ 2319 - if (addr == vma->vm_start && end == vma->vm_end) { 2306 + if (is_cow_mapping(vma->vm_flags)) { 2307 + if (addr != vma->vm_start || end != vma->vm_end) 2308 + return -EINVAL; 2320 2309 vma->vm_pgoff = pfn; 2321 - vma->vm_flags |= VM_PFN_AT_MMAP; 2322 - } else if (is_cow_mapping(vma->vm_flags)) 2323 - return -EINVAL; 2324 - 2325 - vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP; 2326 - 2327 - err = track_pfn_vma_new(vma, &prot, pfn, PAGE_ALIGN(size)); 2328 - if (err) { 2329 - /* 2330 - * To indicate that track_pfn related cleanup is not 2331 - * needed from higher level routine calling unmap_vmas 2332 - */ 2333 - vma->vm_flags &= ~(VM_IO | VM_RESERVED | VM_PFNMAP); 2334 - vma->vm_flags &= ~VM_PFN_AT_MMAP; 2335 - return -EINVAL; 2336 2310 } 2311 + 2312 + err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size)); 2313 + if (err) 2314 + return -EINVAL; 2315 + 2316 + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP; 2337 2317 2338 2318 BUG_ON(addr >= end); 2339 2319 pfn -= addr >> PAGE_SHIFT; ··· 2341 2335 } while (pgd++, addr = next, addr != end); 2342 2336 2343 2337 if (err) 2344 - untrack_pfn_vma(vma, pfn, PAGE_ALIGN(size)); 2338 + untrack_pfn(vma, pfn, PAGE_ALIGN(size)); 2345 2339 2346 2340 return err; 2347 2341 } ··· 2522 2516 spinlock_t *ptl, pte_t orig_pte) 2523 2517 __releases(ptl) 2524 2518 { 2525 - struct page *old_page, *new_page; 2519 + struct page *old_page, *new_page = NULL; 2526 2520 pte_t entry; 2527 2521 int ret = 0; 2528 2522 int page_mkwrite = 0; 2529 2523 struct page *dirty_page = NULL; 2524 + unsigned long mmun_start; /* For mmu_notifiers */ 2525 + unsigned long mmun_end; /* For mmu_notifiers */ 2526 + bool mmun_called = false; /* For mmu_notifiers */ 2530 2527 2531 2528 old_page = vm_normal_page(vma, address, orig_pte); 2532 2529 if (!old_page) { ··· 2707 2698 if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)) 2708 2699 goto oom_free_new; 2709 2700 2701 + mmun_start = address & PAGE_MASK; 2702 + 
mmun_end = (address & PAGE_MASK) + PAGE_SIZE; 2703 + mmun_called = true; 2704 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 2705 + 2710 2706 /* 2711 2707 * Re-check the pte - we dropped the lock 2712 2708 */ ··· 2778 2764 page_cache_release(new_page); 2779 2765 unlock: 2780 2766 pte_unmap_unlock(page_table, ptl); 2767 + if (mmun_called) 2768 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 2781 2769 if (old_page) { 2782 2770 /* 2783 2771 * Don't let another task, with possibly unlocked vma, ··· 2817 2801 zap_page_range_single(vma, start_addr, end_addr - start_addr, details); 2818 2802 } 2819 2803 2820 - static inline void unmap_mapping_range_tree(struct prio_tree_root *root, 2804 + static inline void unmap_mapping_range_tree(struct rb_root *root, 2821 2805 struct zap_details *details) 2822 2806 { 2823 2807 struct vm_area_struct *vma; 2824 - struct prio_tree_iter iter; 2825 2808 pgoff_t vba, vea, zba, zea; 2826 2809 2827 - vma_prio_tree_foreach(vma, &iter, root, 2810 + vma_interval_tree_foreach(vma, root, 2828 2811 details->first_index, details->last_index) { 2829 2812 2830 2813 vba = vma->vm_pgoff; ··· 2854 2839 * across *all* the pages in each nonlinear VMA, not just the pages 2855 2840 * whose virtual address lies outside the file truncation point. 2856 2841 */ 2857 - list_for_each_entry(vma, head, shared.vm_set.list) { 2842 + list_for_each_entry(vma, head, shared.nonlinear) { 2858 2843 details->nonlinear_vma = vma; 2859 2844 unmap_mapping_range_vma(vma, vma->vm_start, vma->vm_end, details); 2860 2845 } ··· 2898 2883 2899 2884 2900 2885 mutex_lock(&mapping->i_mmap_mutex); 2901 - if (unlikely(!prio_tree_empty(&mapping->i_mmap))) 2886 + if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) 2902 2887 unmap_mapping_range_tree(&mapping->i_mmap, &details); 2903 2888 if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) 2904 2889 unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
+63 -14
mm/memory_hotplug.c
··· 106 106 void __ref put_page_bootmem(struct page *page) 107 107 { 108 108 unsigned long type; 109 + struct zone *zone; 109 110 110 111 type = (unsigned long) page->lru.next; 111 112 BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || ··· 117 116 set_page_private(page, 0); 118 117 INIT_LIST_HEAD(&page->lru); 119 118 __free_pages_bootmem(page, 0); 119 + 120 + zone = page_zone(page); 121 + zone_span_writelock(zone); 122 + zone->present_pages++; 123 + zone_span_writeunlock(zone); 124 + totalram_pages++; 120 125 } 121 126 122 127 } ··· 369 362 BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK); 370 363 BUG_ON(nr_pages % PAGES_PER_SECTION); 371 364 365 + release_mem_region(phys_start_pfn << PAGE_SHIFT, nr_pages * PAGE_SIZE); 366 + 372 367 sections_to_remove = nr_pages / PAGES_PER_SECTION; 373 368 for (i = 0; i < sections_to_remove; i++) { 374 369 unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION; 375 - release_mem_region(pfn << PAGE_SHIFT, 376 - PAGES_PER_SECTION << PAGE_SHIFT); 377 370 ret = __remove_section(zone, __pfn_to_section(pfn)); 378 371 if (ret) 379 372 break; ··· 763 756 return 0; 764 757 } 765 758 766 - static struct page * 767 - hotremove_migrate_alloc(struct page *page, unsigned long private, int **x) 768 - { 769 - /* This should be improooooved!! */ 770 - return alloc_page(GFP_HIGHUSER_MOVABLE); 771 - } 772 - 773 759 #define NR_OFFLINE_AT_ONCE_PAGES (256) 774 760 static int 775 761 do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) ··· 813 813 putback_lru_pages(&source); 814 814 goto out; 815 815 } 816 - /* this function returns # of failed pages */ 817 - ret = migrate_pages(&source, hotremove_migrate_alloc, 0, 816 + 817 + /* 818 + * alloc_migrate_target should be improooooved!! 819 + * migrate_pages returns # of failed pages. 
820 + */ 821 + ret = migrate_pages(&source, alloc_migrate_target, 0, 818 822 true, MIGRATE_SYNC); 819 823 if (ret) 820 824 putback_lru_pages(&source); ··· 874 870 return offlined; 875 871 } 876 872 877 - static int __ref offline_pages(unsigned long start_pfn, 873 + static int __ref __offline_pages(unsigned long start_pfn, 878 874 unsigned long end_pfn, unsigned long timeout) 879 875 { 880 876 unsigned long pfn, nr_pages, expire; ··· 974 970 975 971 init_per_zone_wmark_min(); 976 972 977 - if (!populated_zone(zone)) 973 + if (!populated_zone(zone)) { 978 974 zone_pcp_reset(zone); 975 + mutex_lock(&zonelists_mutex); 976 + build_all_zonelists(NULL, NULL); 977 + mutex_unlock(&zonelists_mutex); 978 + } else 979 + zone_pcp_update(zone); 979 980 980 981 if (!node_present_pages(node)) { 981 982 node_clear_state(node, N_HIGH_MEMORY); ··· 1007 998 return ret; 1008 999 } 1009 1000 1001 + int offline_pages(unsigned long start_pfn, unsigned long nr_pages) 1002 + { 1003 + return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ); 1004 + } 1005 + 1010 1006 int remove_memory(u64 start, u64 size) 1011 1007 { 1008 + struct memory_block *mem = NULL; 1009 + struct mem_section *section; 1012 1010 unsigned long start_pfn, end_pfn; 1011 + unsigned long pfn, section_nr; 1012 + int ret; 1013 1013 1014 1014 start_pfn = PFN_DOWN(start); 1015 1015 end_pfn = start_pfn + PFN_DOWN(size); 1016 - return offline_pages(start_pfn, end_pfn, 120 * HZ); 1016 + 1017 + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) { 1018 + section_nr = pfn_to_section_nr(pfn); 1019 + if (!present_section_nr(section_nr)) 1020 + continue; 1021 + 1022 + section = __nr_to_section(section_nr); 1023 + /* same memblock? 
*/ 1024 + if (mem) 1025 + if ((section_nr >= mem->start_section_nr) && 1026 + (section_nr <= mem->end_section_nr)) 1027 + continue; 1028 + 1029 + mem = find_memory_block_hinted(section, mem); 1030 + if (!mem) 1031 + continue; 1032 + 1033 + ret = offline_memory_block(mem); 1034 + if (ret) { 1035 + kobject_put(&mem->dev.kobj); 1036 + return ret; 1037 + } 1038 + } 1039 + 1040 + if (mem) 1041 + kobject_put(&mem->dev.kobj); 1042 + 1043 + return 0; 1017 1044 } 1018 1045 #else 1046 + int offline_pages(unsigned long start_pfn, unsigned long nr_pages) 1047 + { 1048 + return -EINVAL; 1049 + } 1019 1050 int remove_memory(u64 start, u64 size) 1020 1051 { 1021 1052 return -EINVAL;
+95 -53
mm/mempolicy.c
··· 607 607 return first; 608 608 } 609 609 610 + /* 611 + * Apply policy to a single VMA 612 + * This must be called with the mmap_sem held for writing. 613 + */ 614 + static int vma_replace_policy(struct vm_area_struct *vma, 615 + struct mempolicy *pol) 616 + { 617 + int err; 618 + struct mempolicy *old; 619 + struct mempolicy *new; 620 + 621 + pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n", 622 + vma->vm_start, vma->vm_end, vma->vm_pgoff, 623 + vma->vm_ops, vma->vm_file, 624 + vma->vm_ops ? vma->vm_ops->set_policy : NULL); 625 + 626 + new = mpol_dup(pol); 627 + if (IS_ERR(new)) 628 + return PTR_ERR(new); 629 + 630 + if (vma->vm_ops && vma->vm_ops->set_policy) { 631 + err = vma->vm_ops->set_policy(vma, new); 632 + if (err) 633 + goto err_out; 634 + } 635 + 636 + old = vma->vm_policy; 637 + vma->vm_policy = new; /* protected by mmap_sem */ 638 + mpol_put(old); 639 + 640 + return 0; 641 + err_out: 642 + mpol_put(new); 643 + return err; 644 + } 645 + 610 646 /* Step 2: apply policy to a range and do splits. */ 611 647 static int mbind_range(struct mm_struct *mm, unsigned long start, 612 648 unsigned long end, struct mempolicy *new_pol) ··· 691 655 if (err) 692 656 goto out; 693 657 } 694 - 695 - /* 696 - * Apply policy to a single VMA. The reference counting of 697 - * policy for vma_policy linkages has already been handled by 698 - * vma_merge and split_vma as necessary. If this is a shared 699 - * policy then ->set_policy will increment the reference count 700 - * for an sp node. 701 - */ 702 - pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n", 703 - vma->vm_start, vma->vm_end, vma->vm_pgoff, 704 - vma->vm_ops, vma->vm_file, 705 - vma->vm_ops ? 
vma->vm_ops->set_policy : NULL); 706 - if (vma->vm_ops && vma->vm_ops->set_policy) { 707 - err = vma->vm_ops->set_policy(vma, new_pol); 708 - if (err) 709 - goto out; 710 - } 658 + err = vma_replace_policy(vma, new_pol); 659 + if (err) 660 + goto out; 711 661 } 712 662 713 663 out: ··· 946 924 nodemask_t nmask; 947 925 LIST_HEAD(pagelist); 948 926 int err = 0; 949 - struct vm_area_struct *vma; 950 927 951 928 nodes_clear(nmask); 952 929 node_set(source, nmask); 953 930 954 - vma = check_range(mm, mm->mmap->vm_start, mm->task_size, &nmask, 931 + /* 932 + * This does not "check" the range but isolates all pages that 933 + * need migration. Between passing in the full user address 934 + * space range and MPOL_MF_DISCONTIG_OK, this call can not fail. 935 + */ 936 + VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))); 937 + check_range(mm, mm->mmap->vm_start, mm->task_size, &nmask, 955 938 flags | MPOL_MF_DISCONTIG_OK, &pagelist); 956 - if (IS_ERR(vma)) 957 - return PTR_ERR(vma); 958 939 959 940 if (!list_empty(&pagelist)) { 960 941 err = migrate_pages(&pagelist, new_node_page, dest, ··· 1555 1530 addr); 1556 1531 if (vpol) 1557 1532 pol = vpol; 1558 - } else if (vma->vm_policy) 1533 + } else if (vma->vm_policy) { 1559 1534 pol = vma->vm_policy; 1535 + 1536 + /* 1537 + * shmem_alloc_page() passes MPOL_F_SHARED policy with 1538 + * a pseudo vma whose vma->vm_ops=NULL. 
Take a reference 1539 + * count on these policies which will be dropped by 1540 + * mpol_cond_put() later 1541 + */ 1542 + if (mpol_needs_cond_ref(pol)) 1543 + mpol_get(pol); 1544 + } 1560 1545 } 1561 1546 if (!pol) 1562 1547 pol = &default_policy; ··· 2096 2061 */ 2097 2062 2098 2063 /* lookup first element intersecting start-end */ 2099 - /* Caller holds sp->lock */ 2064 + /* Caller holds sp->mutex */ 2100 2065 static struct sp_node * 2101 2066 sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end) 2102 2067 { ··· 2160 2125 2161 2126 if (!sp->root.rb_node) 2162 2127 return NULL; 2163 - spin_lock(&sp->lock); 2128 + mutex_lock(&sp->mutex); 2164 2129 sn = sp_lookup(sp, idx, idx+1); 2165 2130 if (sn) { 2166 2131 mpol_get(sn->policy); 2167 2132 pol = sn->policy; 2168 2133 } 2169 - spin_unlock(&sp->lock); 2134 + mutex_unlock(&sp->mutex); 2170 2135 return pol; 2136 + } 2137 + 2138 + static void sp_free(struct sp_node *n) 2139 + { 2140 + mpol_put(n->policy); 2141 + kmem_cache_free(sn_cache, n); 2171 2142 } 2172 2143 2173 2144 static void sp_delete(struct shared_policy *sp, struct sp_node *n) 2174 2145 { 2175 2146 pr_debug("deleting %lx-l%lx\n", n->start, n->end); 2176 2147 rb_erase(&n->nd, &sp->root); 2177 - mpol_put(n->policy); 2178 - kmem_cache_free(sn_cache, n); 2148 + sp_free(n); 2179 2149 } 2180 2150 2181 2151 static struct sp_node *sp_alloc(unsigned long start, unsigned long end, 2182 2152 struct mempolicy *pol) 2183 2153 { 2184 - struct sp_node *n = kmem_cache_alloc(sn_cache, GFP_KERNEL); 2154 + struct sp_node *n; 2155 + struct mempolicy *newpol; 2185 2156 2157 + n = kmem_cache_alloc(sn_cache, GFP_KERNEL); 2186 2158 if (!n) 2187 2159 return NULL; 2160 + 2161 + newpol = mpol_dup(pol); 2162 + if (IS_ERR(newpol)) { 2163 + kmem_cache_free(sn_cache, n); 2164 + return NULL; 2165 + } 2166 + newpol->flags |= MPOL_F_SHARED; 2167 + 2188 2168 n->start = start; 2189 2169 n->end = end; 2190 - mpol_get(pol); 2191 - pol->flags |= MPOL_F_SHARED; /* for unref 
*/ 2192 - n->policy = pol; 2170 + n->policy = newpol; 2171 + 2193 2172 return n; 2194 2173 } 2195 2174 ··· 2211 2162 static int shared_policy_replace(struct shared_policy *sp, unsigned long start, 2212 2163 unsigned long end, struct sp_node *new) 2213 2164 { 2214 - struct sp_node *n, *new2 = NULL; 2165 + struct sp_node *n; 2166 + int ret = 0; 2215 2167 2216 - restart: 2217 - spin_lock(&sp->lock); 2168 + mutex_lock(&sp->mutex); 2218 2169 n = sp_lookup(sp, start, end); 2219 2170 /* Take care of old policies in the same range. */ 2220 2171 while (n && n->start < end) { ··· 2227 2178 } else { 2228 2179 /* Old policy spanning whole new range. */ 2229 2180 if (n->end > end) { 2181 + struct sp_node *new2; 2182 + new2 = sp_alloc(end, n->end, n->policy); 2230 2183 if (!new2) { 2231 - spin_unlock(&sp->lock); 2232 - new2 = sp_alloc(end, n->end, n->policy); 2233 - if (!new2) 2234 - return -ENOMEM; 2235 - goto restart; 2184 + ret = -ENOMEM; 2185 + goto out; 2236 2186 } 2237 2187 n->end = start; 2238 2188 sp_insert(sp, new2); 2239 - new2 = NULL; 2240 2189 break; 2241 2190 } else 2242 2191 n->end = start; ··· 2245 2198 } 2246 2199 if (new) 2247 2200 sp_insert(sp, new); 2248 - spin_unlock(&sp->lock); 2249 - if (new2) { 2250 - mpol_put(new2->policy); 2251 - kmem_cache_free(sn_cache, new2); 2252 - } 2253 - return 0; 2201 + out: 2202 + mutex_unlock(&sp->mutex); 2203 + return ret; 2254 2204 } 2255 2205 2256 2206 /** ··· 2265 2221 int ret; 2266 2222 2267 2223 sp->root = RB_ROOT; /* empty tree == default mempolicy */ 2268 - spin_lock_init(&sp->lock); 2224 + mutex_init(&sp->mutex); 2269 2225 2270 2226 if (mpol) { 2271 2227 struct vm_area_struct pvma; ··· 2319 2275 } 2320 2276 err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new); 2321 2277 if (err && new) 2322 - kmem_cache_free(sn_cache, new); 2278 + sp_free(new); 2323 2279 return err; 2324 2280 } 2325 2281 ··· 2331 2287 2332 2288 if (!p->root.rb_node) 2333 2289 return; 2334 - spin_lock(&p->lock); 2290 + 
mutex_lock(&p->mutex); 2335 2291 next = rb_first(&p->root); 2336 2292 while (next) { 2337 2293 n = rb_entry(next, struct sp_node, nd); 2338 2294 next = rb_next(&n->nd); 2339 - rb_erase(&n->nd, &p->root); 2340 - mpol_put(n->policy); 2341 - kmem_cache_free(sn_cache, n); 2295 + sp_delete(p, n); 2342 2296 } 2343 - spin_unlock(&p->lock); 2297 + mutex_unlock(&p->mutex); 2344 2298 } 2345 2299 2346 2300 /* assumes fs == KERNEL_DS */
+10 -17
mm/mlock.c
··· 51 51 /* 52 52 * LRU accounting for clear_page_mlock() 53 53 */ 54 - void __clear_page_mlock(struct page *page) 54 + void clear_page_mlock(struct page *page) 55 55 { 56 - VM_BUG_ON(!PageLocked(page)); 57 - 58 - if (!page->mapping) { /* truncated ? */ 56 + if (!TestClearPageMlocked(page)) 59 57 return; 60 - } 61 58 62 - dec_zone_page_state(page, NR_MLOCK); 59 + mod_zone_page_state(page_zone(page), NR_MLOCK, 60 + -hpage_nr_pages(page)); 63 61 count_vm_event(UNEVICTABLE_PGCLEARED); 64 62 if (!isolate_lru_page(page)) { 65 63 putback_lru_page(page); ··· 79 81 BUG_ON(!PageLocked(page)); 80 82 81 83 if (!TestSetPageMlocked(page)) { 82 - inc_zone_page_state(page, NR_MLOCK); 84 + mod_zone_page_state(page_zone(page), NR_MLOCK, 85 + hpage_nr_pages(page)); 83 86 count_vm_event(UNEVICTABLE_PGMLOCKED); 84 87 if (!isolate_lru_page(page)) 85 88 putback_lru_page(page); ··· 107 108 BUG_ON(!PageLocked(page)); 108 109 109 110 if (TestClearPageMlocked(page)) { 110 - dec_zone_page_state(page, NR_MLOCK); 111 + mod_zone_page_state(page_zone(page), NR_MLOCK, 112 + -hpage_nr_pages(page)); 111 113 if (!isolate_lru_page(page)) { 112 114 int ret = SWAP_AGAIN; 113 115 ··· 227 227 if (vma->vm_flags & (VM_IO | VM_PFNMAP)) 228 228 goto no_mlock; 229 229 230 - if (!((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) || 230 + if (!((vma->vm_flags & VM_DONTEXPAND) || 231 231 is_vm_hugetlb_page(vma) || 232 232 vma == get_gate_vma(current->mm))) { 233 233 ··· 290 290 page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP); 291 291 if (page && !IS_ERR(page)) { 292 292 lock_page(page); 293 - /* 294 - * Like in __mlock_vma_pages_range(), 295 - * because we lock page here and migration is 296 - * blocked by the elevated reference, we need 297 - * only check for file-cache page truncation. 298 - */ 299 - if (page->mapping) 300 - munlock_vma_page(page); 293 + munlock_vma_page(page); 301 294 unlock_page(page); 302 295 put_page(page); 303 296 }
+116 -97
mm/mmap.c
··· 51 51 struct vm_area_struct *vma, struct vm_area_struct *prev, 52 52 unsigned long start, unsigned long end); 53 53 54 - /* 55 - * WARNING: the debugging will use recursive algorithms so never enable this 56 - * unless you know what you are doing. 57 - */ 58 - #undef DEBUG_MM_RB 59 - 60 54 /* description of effects of mapping type and prot in current implementation. 61 55 * this is due to the limited x86 page protection hardware. The expected 62 56 * behavior is in parens: ··· 193 199 194 200 flush_dcache_mmap_lock(mapping); 195 201 if (unlikely(vma->vm_flags & VM_NONLINEAR)) 196 - list_del_init(&vma->shared.vm_set.list); 202 + list_del_init(&vma->shared.nonlinear); 197 203 else 198 - vma_prio_tree_remove(vma, &mapping->i_mmap); 204 + vma_interval_tree_remove(vma, &mapping->i_mmap); 199 205 flush_dcache_mmap_unlock(mapping); 200 206 } 201 207 202 208 /* 203 - * Unlink a file-based vm structure from its prio_tree, to hide 209 + * Unlink a file-based vm structure from its interval tree, to hide 204 210 * vma from rmap and vmtruncate before freeing its page tables. 
205 211 */ 206 212 void unlink_file_vma(struct vm_area_struct *vma) ··· 225 231 might_sleep(); 226 232 if (vma->vm_ops && vma->vm_ops->close) 227 233 vma->vm_ops->close(vma); 228 - if (vma->vm_file) { 234 + if (vma->vm_file) 229 235 fput(vma->vm_file); 230 - if (vma->vm_flags & VM_EXECUTABLE) 231 - removed_exe_file_vma(vma->vm_mm); 232 - } 233 236 mpol_put(vma_policy(vma)); 234 237 kmem_cache_free(vm_area_cachep, vma); 235 238 return next; ··· 297 306 return retval; 298 307 } 299 308 300 - #ifdef DEBUG_MM_RB 309 + #ifdef CONFIG_DEBUG_VM_RB 301 310 static int browse_rb(struct rb_root *root) 302 311 { 303 312 int i = 0, j; ··· 331 340 { 332 341 int bug = 0; 333 342 int i = 0; 334 - struct vm_area_struct *tmp = mm->mmap; 335 - while (tmp) { 336 - tmp = tmp->vm_next; 343 + struct vm_area_struct *vma = mm->mmap; 344 + while (vma) { 345 + struct anon_vma_chain *avc; 346 + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) 347 + anon_vma_interval_tree_verify(avc); 348 + vma = vma->vm_next; 337 349 i++; 338 350 } 339 351 if (i != mm->map_count) ··· 350 356 #define validate_mm(mm) do { } while (0) 351 357 #endif 352 358 353 - static struct vm_area_struct * 354 - find_vma_prepare(struct mm_struct *mm, unsigned long addr, 355 - struct vm_area_struct **pprev, struct rb_node ***rb_link, 356 - struct rb_node ** rb_parent) 359 + /* 360 + * vma has some anon_vma assigned, and is already inserted on that 361 + * anon_vma's interval trees. 362 + * 363 + * Before updating the vma's vm_start / vm_end / vm_pgoff fields, the 364 + * vma must be removed from the anon_vma's interval trees using 365 + * anon_vma_interval_tree_pre_update_vma(). 366 + * 367 + * After the update, the vma will be reinserted using 368 + * anon_vma_interval_tree_post_update_vma(). 369 + * 370 + * The entire update must be protected by exclusive mmap_sem and by 371 + * the root anon_vma's mutex. 
372 + */ 373 + static inline void 374 + anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma) 357 375 { 358 - struct vm_area_struct * vma; 359 - struct rb_node ** __rb_link, * __rb_parent, * rb_prev; 376 + struct anon_vma_chain *avc; 377 + 378 + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) 379 + anon_vma_interval_tree_remove(avc, &avc->anon_vma->rb_root); 380 + } 381 + 382 + static inline void 383 + anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma) 384 + { 385 + struct anon_vma_chain *avc; 386 + 387 + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) 388 + anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root); 389 + } 390 + 391 + static int find_vma_links(struct mm_struct *mm, unsigned long addr, 392 + unsigned long end, struct vm_area_struct **pprev, 393 + struct rb_node ***rb_link, struct rb_node **rb_parent) 394 + { 395 + struct rb_node **__rb_link, *__rb_parent, *rb_prev; 360 396 361 397 __rb_link = &mm->mm_rb.rb_node; 362 398 rb_prev = __rb_parent = NULL; 363 - vma = NULL; 364 399 365 400 while (*__rb_link) { 366 401 struct vm_area_struct *vma_tmp; ··· 398 375 vma_tmp = rb_entry(__rb_parent, struct vm_area_struct, vm_rb); 399 376 400 377 if (vma_tmp->vm_end > addr) { 401 - vma = vma_tmp; 402 - if (vma_tmp->vm_start <= addr) 403 - break; 378 + /* Fail if an existing vma overlaps the area */ 379 + if (vma_tmp->vm_start < end) 380 + return -ENOMEM; 404 381 __rb_link = &__rb_parent->rb_left; 405 382 } else { 406 383 rb_prev = __rb_parent; ··· 413 390 *pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb); 414 391 *rb_link = __rb_link; 415 392 *rb_parent = __rb_parent; 416 - return vma; 393 + return 0; 417 394 } 418 395 419 396 void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma, ··· 440 417 if (unlikely(vma->vm_flags & VM_NONLINEAR)) 441 418 vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); 442 419 else 443 - vma_prio_tree_insert(vma, &mapping->i_mmap); 420 + vma_interval_tree_insert(vma, 
&mapping->i_mmap); 444 421 flush_dcache_mmap_unlock(mapping); 445 422 } 446 423 } ··· 478 455 479 456 /* 480 457 * Helper for vma_adjust() in the split_vma insert case: insert a vma into the 481 - * mm's list and rbtree. It has already been inserted into the prio_tree. 458 + * mm's list and rbtree. It has already been inserted into the interval tree. 482 459 */ 483 460 static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) 484 461 { 485 - struct vm_area_struct *__vma, *prev; 462 + struct vm_area_struct *prev; 486 463 struct rb_node **rb_link, *rb_parent; 487 464 488 - __vma = find_vma_prepare(mm, vma->vm_start,&prev, &rb_link, &rb_parent); 489 - BUG_ON(__vma && __vma->vm_start < vma->vm_end); 465 + if (find_vma_links(mm, vma->vm_start, vma->vm_end, 466 + &prev, &rb_link, &rb_parent)) 467 + BUG(); 490 468 __vma_link(mm, vma, prev, rb_link, rb_parent); 491 469 mm->map_count++; 492 470 } ··· 520 496 struct vm_area_struct *next = vma->vm_next; 521 497 struct vm_area_struct *importer = NULL; 522 498 struct address_space *mapping = NULL; 523 - struct prio_tree_root *root = NULL; 499 + struct rb_root *root = NULL; 524 500 struct anon_vma *anon_vma = NULL; 525 501 struct file *file = vma->vm_file; 526 502 long adjust_next = 0; ··· 583 559 mutex_lock(&mapping->i_mmap_mutex); 584 560 if (insert) { 585 561 /* 586 - * Put into prio_tree now, so instantiated pages 562 + * Put into interval tree now, so instantiated pages 587 563 * are visible to arm/parisc __flush_dcache_page 588 564 * throughout; but we cannot insert into address 589 565 * space until vma start or end is updated. ··· 594 570 595 571 vma_adjust_trans_huge(vma, start, end, adjust_next); 596 572 597 - /* 598 - * When changing only vma->vm_end, we don't really need anon_vma 599 - * lock. This is a fairly rare case by itself, but the anon_vma 600 - * lock may be shared between many sibling processes. Skipping 601 - * the lock for brk adjustments makes a difference sometimes. 
602 - */ 603 - if (vma->anon_vma && (importer || start != vma->vm_start)) { 604 - anon_vma = vma->anon_vma; 573 + anon_vma = vma->anon_vma; 574 + if (!anon_vma && adjust_next) 575 + anon_vma = next->anon_vma; 576 + if (anon_vma) { 577 + VM_BUG_ON(adjust_next && next->anon_vma && 578 + anon_vma != next->anon_vma); 605 579 anon_vma_lock(anon_vma); 580 + anon_vma_interval_tree_pre_update_vma(vma); 581 + if (adjust_next) 582 + anon_vma_interval_tree_pre_update_vma(next); 606 583 } 607 584 608 585 if (root) { 609 586 flush_dcache_mmap_lock(mapping); 610 - vma_prio_tree_remove(vma, root); 587 + vma_interval_tree_remove(vma, root); 611 588 if (adjust_next) 612 - vma_prio_tree_remove(next, root); 589 + vma_interval_tree_remove(next, root); 613 590 } 614 591 615 592 vma->vm_start = start; ··· 623 598 624 599 if (root) { 625 600 if (adjust_next) 626 - vma_prio_tree_insert(next, root); 627 - vma_prio_tree_insert(vma, root); 601 + vma_interval_tree_insert(next, root); 602 + vma_interval_tree_insert(vma, root); 628 603 flush_dcache_mmap_unlock(mapping); 629 604 } 630 605 ··· 645 620 __insert_vm_struct(mm, insert); 646 621 } 647 622 648 - if (anon_vma) 623 + if (anon_vma) { 624 + anon_vma_interval_tree_post_update_vma(vma); 625 + if (adjust_next) 626 + anon_vma_interval_tree_post_update_vma(next); 649 627 anon_vma_unlock(anon_vma); 628 + } 650 629 if (mapping) 651 630 mutex_unlock(&mapping->i_mmap_mutex); 652 631 ··· 665 636 if (file) { 666 637 uprobe_munmap(next, next->vm_start, next->vm_end); 667 638 fput(file); 668 - if (next->vm_flags & VM_EXECUTABLE) 669 - removed_exe_file_vma(mm); 670 639 } 671 640 if (next->anon_vma) 672 641 anon_vma_merge(vma, next); ··· 696 669 static inline int is_mergeable_vma(struct vm_area_struct *vma, 697 670 struct file *file, unsigned long vm_flags) 698 671 { 699 - /* VM_CAN_NONLINEAR may get set later by f_op->mmap() */ 700 - if ((vma->vm_flags ^ vm_flags) & ~VM_CAN_NONLINEAR) 672 + if (vma->vm_flags ^ vm_flags) 701 673 return 0; 702 674 if 
(vma->vm_file != file) 703 675 return 0; ··· 977 951 mm->exec_vm += pages; 978 952 } else if (flags & stack_flags) 979 953 mm->stack_vm += pages; 980 - if (flags & (VM_RESERVED|VM_IO)) 981 - mm->reserved_vm += pages; 982 954 } 983 955 #endif /* CONFIG_PROC_FS */ 984 956 ··· 1214 1190 return 0; 1215 1191 1216 1192 /* Specialty mapping? */ 1217 - if (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) 1193 + if (vm_flags & VM_PFNMAP) 1218 1194 return 0; 1219 1195 1220 1196 /* Can the mapping track the dirty pages? */ ··· 1253 1229 /* Clear old maps */ 1254 1230 error = -ENOMEM; 1255 1231 munmap_back: 1256 - vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent); 1257 - if (vma && vma->vm_start < addr + len) { 1232 + if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) { 1258 1233 if (do_munmap(mm, addr, len)) 1259 1234 return -ENOMEM; 1260 1235 goto munmap_back; ··· 1328 1305 error = file->f_op->mmap(file, vma); 1329 1306 if (error) 1330 1307 goto unmap_and_free_vma; 1331 - if (vm_flags & VM_EXECUTABLE) 1332 - added_exe_file_vma(mm); 1333 1308 1334 1309 /* Can addr have changed?? 
1335 1310 * ··· 1778 1757 if (vma->vm_pgoff + (size >> PAGE_SHIFT) >= vma->vm_pgoff) { 1779 1758 error = acct_stack_growth(vma, size, grow); 1780 1759 if (!error) { 1760 + anon_vma_interval_tree_pre_update_vma(vma); 1781 1761 vma->vm_end = address; 1762 + anon_vma_interval_tree_post_update_vma(vma); 1782 1763 perf_event_mmap(vma); 1783 1764 } 1784 1765 } 1785 1766 } 1786 1767 vma_unlock_anon_vma(vma); 1787 1768 khugepaged_enter_vma_merge(vma); 1769 + validate_mm(vma->vm_mm); 1788 1770 return error; 1789 1771 } 1790 1772 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */ ··· 1831 1807 if (grow <= vma->vm_pgoff) { 1832 1808 error = acct_stack_growth(vma, size, grow); 1833 1809 if (!error) { 1810 + anon_vma_interval_tree_pre_update_vma(vma); 1834 1811 vma->vm_start = address; 1835 1812 vma->vm_pgoff -= grow; 1813 + anon_vma_interval_tree_post_update_vma(vma); 1836 1814 perf_event_mmap(vma); 1837 1815 } 1838 1816 } 1839 1817 } 1840 1818 vma_unlock_anon_vma(vma); 1841 1819 khugepaged_enter_vma_merge(vma); 1820 + validate_mm(vma->vm_mm); 1842 1821 return error; 1843 1822 } 1844 1823 ··· 2015 1988 if (anon_vma_clone(new, vma)) 2016 1989 goto out_free_mpol; 2017 1990 2018 - if (new->vm_file) { 1991 + if (new->vm_file) 2019 1992 get_file(new->vm_file); 2020 - if (vma->vm_flags & VM_EXECUTABLE) 2021 - added_exe_file_vma(mm); 2022 - } 2023 1993 2024 1994 if (new->vm_ops && new->vm_ops->open) 2025 1995 new->vm_ops->open(new); ··· 2034 2010 /* Clean everything up if vma_adjust failed. */ 2035 2011 if (new->vm_ops && new->vm_ops->close) 2036 2012 new->vm_ops->close(new); 2037 - if (new->vm_file) { 2038 - if (vma->vm_flags & VM_EXECUTABLE) 2039 - removed_exe_file_vma(mm); 2013 + if (new->vm_file) 2040 2014 fput(new->vm_file); 2041 - } 2042 2015 unlink_anon_vmas(new); 2043 2016 out_free_mpol: 2044 2017 mpol_put(pol); ··· 2220 2199 * Clear old maps. 
this also does some error checking for us 2221 2200 */ 2222 2201 munmap_back: 2223 - vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent); 2224 - if (vma && vma->vm_start < addr + len) { 2202 + if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) { 2225 2203 if (do_munmap(mm, addr, len)) 2226 2204 return -ENOMEM; 2227 2205 goto munmap_back; ··· 2334 2314 * and into the inode's i_mmap tree. If vm_file is non-NULL 2335 2315 * then i_mmap_mutex is taken here. 2336 2316 */ 2337 - int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) 2317 + int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) 2338 2318 { 2339 - struct vm_area_struct * __vma, * prev; 2340 - struct rb_node ** rb_link, * rb_parent; 2319 + struct vm_area_struct *prev; 2320 + struct rb_node **rb_link, *rb_parent; 2341 2321 2342 2322 /* 2343 2323 * The vm_pgoff of a purely anonymous vma should be irrelevant ··· 2355 2335 BUG_ON(vma->anon_vma); 2356 2336 vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; 2357 2337 } 2358 - __vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent); 2359 - if (__vma && __vma->vm_start < vma->vm_end) 2338 + if (find_vma_links(mm, vma->vm_start, vma->vm_end, 2339 + &prev, &rb_link, &rb_parent)) 2360 2340 return -ENOMEM; 2361 2341 if ((vma->vm_flags & VM_ACCOUNT) && 2362 2342 security_vm_enough_memory_mm(mm, vma_pages(vma))) ··· 2371 2351 * prior to moving page table entries, to effect an mremap move. 
2372 2352 */ 2373 2353 struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, 2374 - unsigned long addr, unsigned long len, pgoff_t pgoff) 2354 + unsigned long addr, unsigned long len, pgoff_t pgoff, 2355 + bool *need_rmap_locks) 2375 2356 { 2376 2357 struct vm_area_struct *vma = *vmap; 2377 2358 unsigned long vma_start = vma->vm_start; ··· 2391 2370 faulted_in_anon_vma = false; 2392 2371 } 2393 2372 2394 - find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent); 2373 + if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) 2374 + return NULL; /* should never get here */ 2395 2375 new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags, 2396 2376 vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma)); 2397 2377 if (new_vma) { ··· 2414 2392 * linear if there are no pages mapped yet. 2415 2393 */ 2416 2394 VM_BUG_ON(faulted_in_anon_vma); 2417 - *vmap = new_vma; 2418 - } else 2419 - anon_vma_moveto_tail(new_vma); 2395 + *vmap = vma = new_vma; 2396 + } 2397 + *need_rmap_locks = (new_vma->vm_pgoff <= vma->vm_pgoff); 2420 2398 } else { 2421 2399 new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); 2422 2400 if (new_vma) { 2423 2401 *new_vma = *vma; 2424 - pol = mpol_dup(vma_policy(vma)); 2425 - if (IS_ERR(pol)) 2426 - goto out_free_vma; 2427 - INIT_LIST_HEAD(&new_vma->anon_vma_chain); 2428 - if (anon_vma_clone(new_vma, vma)) 2429 - goto out_free_mempol; 2430 - vma_set_policy(new_vma, pol); 2431 2402 new_vma->vm_start = addr; 2432 2403 new_vma->vm_end = addr + len; 2433 2404 new_vma->vm_pgoff = pgoff; 2434 - if (new_vma->vm_file) { 2405 + pol = mpol_dup(vma_policy(vma)); 2406 + if (IS_ERR(pol)) 2407 + goto out_free_vma; 2408 + vma_set_policy(new_vma, pol); 2409 + INIT_LIST_HEAD(&new_vma->anon_vma_chain); 2410 + if (anon_vma_clone(new_vma, vma)) 2411 + goto out_free_mempol; 2412 + if (new_vma->vm_file) 2435 2413 get_file(new_vma->vm_file); 2436 - 2437 - if (vma->vm_flags & VM_EXECUTABLE) 2438 - added_exe_file_vma(mm); 2439 - } 2440 2414 if 
(new_vma->vm_ops && new_vma->vm_ops->open) 2441 2415 new_vma->vm_ops->open(new_vma); 2442 2416 vma_link(mm, new_vma, prev, rb_link, rb_parent); 2417 + *need_rmap_locks = false; 2443 2418 } 2444 2419 } 2445 2420 return new_vma; ··· 2554 2535 2555 2536 static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) 2556 2537 { 2557 - if (!test_bit(0, (unsigned long *) &anon_vma->root->head.next)) { 2538 + if (!test_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) { 2558 2539 /* 2559 2540 * The LSB of head.next can't change from under us 2560 2541 * because we hold the mm_all_locks_mutex. ··· 2570 2551 * anon_vma->root->mutex. 2571 2552 */ 2572 2553 if (__test_and_set_bit(0, (unsigned long *) 2573 - &anon_vma->root->head.next)) 2554 + &anon_vma->root->rb_root.rb_node)) 2574 2555 BUG(); 2575 2556 } 2576 2557 } ··· 2611 2592 * A single task can't take more than one mm_take_all_locks() in a row 2612 2593 * or it would deadlock. 2613 2594 * 2614 - * The LSB in anon_vma->head.next and the AS_MM_ALL_LOCKS bitflag in 2595 + * The LSB in anon_vma->rb_root.rb_node and the AS_MM_ALL_LOCKS bitflag in 2615 2596 * mapping->flags avoid to take the same lock twice, if more than one 2616 2597 * vma in this mm is backed by the same anon_vma or address_space. 2617 2598 * ··· 2658 2639 2659 2640 static void vm_unlock_anon_vma(struct anon_vma *anon_vma) 2660 2641 { 2661 - if (test_bit(0, (unsigned long *) &anon_vma->root->head.next)) { 2642 + if (test_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_node)) { 2662 2643 /* 2663 2644 * The LSB of head.next can't change to 0 from under 2664 2645 * us because we hold the mm_all_locks_mutex. 2665 2646 * 2666 2647 * We must however clear the bitflag before unlocking 2667 - * the vma so the users using the anon_vma->head will 2648 + * the vma so the users using the anon_vma->rb_root will 2668 2649 * never see our bitflag. 
2669 2650 * 2670 2651 * No need of atomic instructions here, head.next ··· 2672 2653 * anon_vma->root->mutex. 2673 2654 */ 2674 2655 if (!__test_and_clear_bit(0, (unsigned long *) 2675 - &anon_vma->root->head.next)) 2656 + &anon_vma->root->rb_root.rb_node)) 2676 2657 BUG(); 2677 2658 anon_vma_unlock(anon_vma); 2678 2659 }
+60 -43
mm/mmu_notifier.c
··· 14 14 #include <linux/export.h> 15 15 #include <linux/mm.h> 16 16 #include <linux/err.h> 17 + #include <linux/srcu.h> 17 18 #include <linux/rcupdate.h> 18 19 #include <linux/sched.h> 19 20 #include <linux/slab.h> 21 + 22 + /* global SRCU for all MMs */ 23 + static struct srcu_struct srcu; 20 24 21 25 /* 22 26 * This function can't run concurrently against mmu_notifier_register ··· 29 25 * in parallel despite there being no task using this mm any more, 30 26 * through the vmas outside of the exit_mmap context, such as with 31 27 * vmtruncate. This serializes against mmu_notifier_unregister with 32 - * the mmu_notifier_mm->lock in addition to RCU and it serializes 33 - * against the other mmu notifiers with RCU. struct mmu_notifier_mm 28 + * the mmu_notifier_mm->lock in addition to SRCU and it serializes 29 + * against the other mmu notifiers with SRCU. struct mmu_notifier_mm 34 30 * can't go away from under us as exit_mmap holds an mm_count pin 35 31 * itself. 36 32 */ ··· 38 34 { 39 35 struct mmu_notifier *mn; 40 36 struct hlist_node *n; 37 + int id; 41 38 42 39 /* 43 - * RCU here will block mmu_notifier_unregister until 40 + * SRCU here will block mmu_notifier_unregister until 44 41 * ->release returns. 
45 42 */ 46 - rcu_read_lock(); 43 + id = srcu_read_lock(&srcu); 47 44 hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) 48 45 /* 49 46 * if ->release runs before mmu_notifier_unregister it ··· 55 50 */ 56 51 if (mn->ops->release) 57 52 mn->ops->release(mn, mm); 58 - rcu_read_unlock(); 53 + srcu_read_unlock(&srcu, id); 59 54 60 55 spin_lock(&mm->mmu_notifier_mm->lock); 61 56 while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { ··· 73 68 spin_unlock(&mm->mmu_notifier_mm->lock); 74 69 75 70 /* 76 - * synchronize_rcu here prevents mmu_notifier_release to 71 + * synchronize_srcu here prevents mmu_notifier_release to 77 72 * return to exit_mmap (which would proceed freeing all pages 78 73 * in the mm) until the ->release method returns, if it was 79 74 * invoked by mmu_notifier_unregister. ··· 81 76 * The mmu_notifier_mm can't go away from under us because one 82 77 * mm_count is hold by exit_mmap. 83 78 */ 84 - synchronize_rcu(); 79 + synchronize_srcu(&srcu); 85 80 } 86 81 87 82 /* ··· 94 89 { 95 90 struct mmu_notifier *mn; 96 91 struct hlist_node *n; 97 - int young = 0; 92 + int young = 0, id; 98 93 99 - rcu_read_lock(); 94 + id = srcu_read_lock(&srcu); 100 95 hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { 101 96 if (mn->ops->clear_flush_young) 102 97 young |= mn->ops->clear_flush_young(mn, mm, address); 103 98 } 104 - rcu_read_unlock(); 99 + srcu_read_unlock(&srcu, id); 105 100 106 101 return young; 107 102 } ··· 111 106 { 112 107 struct mmu_notifier *mn; 113 108 struct hlist_node *n; 114 - int young = 0; 109 + int young = 0, id; 115 110 116 - rcu_read_lock(); 111 + id = srcu_read_lock(&srcu); 117 112 hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { 118 113 if (mn->ops->test_young) { 119 114 young = mn->ops->test_young(mn, mm, address); ··· 121 116 break; 122 117 } 123 118 } 124 - rcu_read_unlock(); 119 + srcu_read_unlock(&srcu, id); 125 120 126 121 return young; 127 122 } ··· 131 126 { 132 127 struct 
mmu_notifier *mn; 133 128 struct hlist_node *n; 129 + int id; 134 130 135 - rcu_read_lock(); 131 + id = srcu_read_lock(&srcu); 136 132 hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { 137 133 if (mn->ops->change_pte) 138 134 mn->ops->change_pte(mn, mm, address, pte); 139 - /* 140 - * Some drivers don't have change_pte, 141 - * so we must call invalidate_page in that case. 142 - */ 143 - else if (mn->ops->invalidate_page) 144 - mn->ops->invalidate_page(mn, mm, address); 145 135 } 146 - rcu_read_unlock(); 136 + srcu_read_unlock(&srcu, id); 147 137 } 148 138 149 139 void __mmu_notifier_invalidate_page(struct mm_struct *mm, ··· 146 146 { 147 147 struct mmu_notifier *mn; 148 148 struct hlist_node *n; 149 + int id; 149 150 150 - rcu_read_lock(); 151 + id = srcu_read_lock(&srcu); 151 152 hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { 152 153 if (mn->ops->invalidate_page) 153 154 mn->ops->invalidate_page(mn, mm, address); 154 155 } 155 - rcu_read_unlock(); 156 + srcu_read_unlock(&srcu, id); 156 157 } 157 158 158 159 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, ··· 161 160 { 162 161 struct mmu_notifier *mn; 163 162 struct hlist_node *n; 163 + int id; 164 164 165 - rcu_read_lock(); 165 + id = srcu_read_lock(&srcu); 166 166 hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { 167 167 if (mn->ops->invalidate_range_start) 168 168 mn->ops->invalidate_range_start(mn, mm, start, end); 169 169 } 170 - rcu_read_unlock(); 170 + srcu_read_unlock(&srcu, id); 171 171 } 172 172 173 173 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, ··· 176 174 { 177 175 struct mmu_notifier *mn; 178 176 struct hlist_node *n; 177 + int id; 179 178 180 - rcu_read_lock(); 179 + id = srcu_read_lock(&srcu); 181 180 hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { 182 181 if (mn->ops->invalidate_range_end) 183 182 mn->ops->invalidate_range_end(mn, mm, start, end); 184 183 } 185 - rcu_read_unlock(); 
184 + srcu_read_unlock(&srcu, id); 186 185 } 187 186 188 187 static int do_mmu_notifier_register(struct mmu_notifier *mn, ··· 195 192 196 193 BUG_ON(atomic_read(&mm->mm_users) <= 0); 197 194 198 - ret = -ENOMEM; 199 - mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); 200 - if (unlikely(!mmu_notifier_mm)) 201 - goto out; 195 + /* 196 + * Verify that mmu_notifier_init() already run and the global srcu is 197 + * initialized. 198 + */ 199 + BUG_ON(!srcu.per_cpu_ref); 202 200 203 201 if (take_mmap_sem) 204 202 down_write(&mm->mmap_sem); 205 203 ret = mm_take_all_locks(mm); 206 204 if (unlikely(ret)) 207 - goto out_cleanup; 205 + goto out; 208 206 209 207 if (!mm_has_notifiers(mm)) { 208 + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), 209 + GFP_KERNEL); 210 + if (unlikely(!mmu_notifier_mm)) { 211 + ret = -ENOMEM; 212 + goto out_of_mem; 213 + } 210 214 INIT_HLIST_HEAD(&mmu_notifier_mm->list); 211 215 spin_lock_init(&mmu_notifier_mm->lock); 216 + 212 217 mm->mmu_notifier_mm = mmu_notifier_mm; 213 - mmu_notifier_mm = NULL; 214 218 } 215 219 atomic_inc(&mm->mm_count); 216 220 ··· 233 223 hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); 234 224 spin_unlock(&mm->mmu_notifier_mm->lock); 235 225 226 + out_of_mem: 236 227 mm_drop_all_locks(mm); 237 - out_cleanup: 228 + out: 238 229 if (take_mmap_sem) 239 230 up_write(&mm->mmap_sem); 240 - /* kfree() does nothing if mmu_notifier_mm is NULL */ 241 - kfree(mmu_notifier_mm); 242 - out: 231 + 243 232 BUG_ON(atomic_read(&mm->mm_users) <= 0); 244 233 return ret; 245 234 } ··· 283 274 /* 284 275 * This releases the mm_count pin automatically and frees the mm 285 276 * structure if it was the last user of it. It serializes against 286 - * running mmu notifiers with RCU and against mmu_notifier_unregister 287 - * with the unregister lock + RCU. All sptes must be dropped before 277 + * running mmu notifiers with SRCU and against mmu_notifier_unregister 278 + * with the unregister lock + SRCU. 
All sptes must be dropped before 288 279 * calling mmu_notifier_unregister. ->release or any other notifier 289 280 * method may be invoked concurrently with mmu_notifier_unregister, 290 281 * and only after mmu_notifier_unregister returned we're guaranteed ··· 296 287 297 288 if (!hlist_unhashed(&mn->hlist)) { 298 289 /* 299 - * RCU here will force exit_mmap to wait ->release to finish 290 + * SRCU here will force exit_mmap to wait ->release to finish 300 291 * before freeing the pages. 301 292 */ 302 - rcu_read_lock(); 293 + int id; 303 294 295 + id = srcu_read_lock(&srcu); 304 296 /* 305 297 * exit_mmap will block in mmu_notifier_release to 306 298 * guarantee ->release is called before freeing the ··· 309 299 */ 310 300 if (mn->ops->release) 311 301 mn->ops->release(mn, mm); 312 - rcu_read_unlock(); 302 + srcu_read_unlock(&srcu, id); 313 303 314 304 spin_lock(&mm->mmu_notifier_mm->lock); 315 305 hlist_del_rcu(&mn->hlist); ··· 320 310 * Wait any running method to finish, of course including 321 311 * ->release if it was run by mmu_notifier_relase instead of us. 322 312 */ 323 - synchronize_rcu(); 313 + synchronize_srcu(&srcu); 324 314 325 315 BUG_ON(atomic_read(&mm->mm_count) <= 0); 326 316 327 317 mmdrop(mm); 328 318 } 329 319 EXPORT_SYMBOL_GPL(mmu_notifier_unregister); 320 + 321 + static int __init mmu_notifier_init(void) 322 + { 323 + return init_srcu_struct(&srcu); 324 + } 325 + 326 + module_init(mmu_notifier_init);
+47 -26
mm/mremap.c
··· 71 71 static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, 72 72 unsigned long old_addr, unsigned long old_end, 73 73 struct vm_area_struct *new_vma, pmd_t *new_pmd, 74 - unsigned long new_addr) 74 + unsigned long new_addr, bool need_rmap_locks) 75 75 { 76 76 struct address_space *mapping = NULL; 77 + struct anon_vma *anon_vma = NULL; 77 78 struct mm_struct *mm = vma->vm_mm; 78 79 pte_t *old_pte, *new_pte, pte; 79 80 spinlock_t *old_ptl, *new_ptl; 80 81 81 - if (vma->vm_file) { 82 - /* 83 - * Subtle point from Rajesh Venkatasubramanian: before 84 - * moving file-based ptes, we must lock truncate_pagecache 85 - * out, since it might clean the dst vma before the src vma, 86 - * and we propagate stale pages into the dst afterward. 87 - */ 88 - mapping = vma->vm_file->f_mapping; 89 - mutex_lock(&mapping->i_mmap_mutex); 82 + /* 83 + * When need_rmap_locks is true, we take the i_mmap_mutex and anon_vma 84 + * locks to ensure that rmap will always observe either the old or the 85 + * new ptes. This is the easiest way to avoid races with 86 + * truncate_pagecache(), page migration, etc... 87 + * 88 + * When need_rmap_locks is false, we use other ways to avoid 89 + * such races: 90 + * 91 + * - During exec() shift_arg_pages(), we use a specially tagged vma 92 + * which rmap call sites look for using is_vma_temporary_stack(). 93 + * 94 + * - During mremap(), new_vma is often known to be placed after vma 95 + * in rmap traversal order. This ensures rmap will always observe 96 + * either the old pte, or the new pte, or both (the page table locks 97 + * serialize access to individual ptes, but only rmap traversal 98 + * order guarantees that we won't miss both the old and new ptes). 
99 + */ 100 + if (need_rmap_locks) { 101 + if (vma->vm_file) { 102 + mapping = vma->vm_file->f_mapping; 103 + mutex_lock(&mapping->i_mmap_mutex); 104 + } 105 + if (vma->anon_vma) { 106 + anon_vma = vma->anon_vma; 107 + anon_vma_lock(anon_vma); 108 + } 90 109 } 91 110 92 111 /* ··· 133 114 spin_unlock(new_ptl); 134 115 pte_unmap(new_pte - 1); 135 116 pte_unmap_unlock(old_pte - 1, old_ptl); 117 + if (anon_vma) 118 + anon_vma_unlock(anon_vma); 136 119 if (mapping) 137 120 mutex_unlock(&mapping->i_mmap_mutex); 138 121 } ··· 143 122 144 123 unsigned long move_page_tables(struct vm_area_struct *vma, 145 124 unsigned long old_addr, struct vm_area_struct *new_vma, 146 - unsigned long new_addr, unsigned long len) 125 + unsigned long new_addr, unsigned long len, 126 + bool need_rmap_locks) 147 127 { 148 128 unsigned long extent, next, old_end; 149 129 pmd_t *old_pmd, *new_pmd; 150 130 bool need_flush = false; 131 + unsigned long mmun_start; /* For mmu_notifiers */ 132 + unsigned long mmun_end; /* For mmu_notifiers */ 151 133 152 134 old_end = old_addr + len; 153 135 flush_cache_range(vma, old_addr, old_end); 154 136 155 - mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end); 137 + mmun_start = old_addr; 138 + mmun_end = old_end; 139 + mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end); 156 140 157 141 for (; old_addr < old_end; old_addr += extent, new_addr += extent) { 158 142 cond_resched(); ··· 195 169 if (extent > LATENCY_LIMIT) 196 170 extent = LATENCY_LIMIT; 197 171 move_ptes(vma, old_pmd, old_addr, old_addr + extent, 198 - new_vma, new_pmd, new_addr); 172 + new_vma, new_pmd, new_addr, need_rmap_locks); 199 173 need_flush = true; 200 174 } 201 175 if (likely(need_flush)) 202 176 flush_tlb_range(vma, old_end-len, old_addr); 203 177 204 - mmu_notifier_invalidate_range_end(vma->vm_mm, old_end-len, old_end); 178 + mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); 205 179 206 180 return len + old_addr - old_end; /* how much 
done */ 207 181 } ··· 219 193 unsigned long hiwater_vm; 220 194 int split = 0; 221 195 int err; 196 + bool need_rmap_locks; 222 197 223 198 /* 224 199 * We'd prefer to avoid failure later on in do_munmap: ··· 241 214 return err; 242 215 243 216 new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT); 244 - new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff); 217 + new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff, 218 + &need_rmap_locks); 245 219 if (!new_vma) 246 220 return -ENOMEM; 247 221 248 - moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len); 222 + moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len, 223 + need_rmap_locks); 249 224 if (moved_len < old_len) { 250 - /* 251 - * Before moving the page tables from the new vma to 252 - * the old vma, we need to be sure the old vma is 253 - * queued after new vma in the same_anon_vma list to 254 - * prevent SMP races with rmap_walk (that could lead 255 - * rmap_walk to miss some page table). 256 - */ 257 - anon_vma_moveto_tail(vma); 258 - 259 225 /* 260 226 * On error, move entries back from new area to old, 261 227 * which will succeed since page tables still there, 262 228 * and then proceed to unmap new area instead of old. 263 229 */ 264 - move_page_tables(new_vma, new_addr, vma, old_addr, moved_len); 230 + move_page_tables(new_vma, new_addr, vma, old_addr, moved_len, 231 + true); 265 232 vma = new_vma; 266 233 old_len = new_len; 267 234 old_addr = new_addr;
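The new need_rmap_locks argument threaded through move_ptes() and move_page_tables() above controls whether the file-mapping and anon_vma locks are taken around the pte copy, or whether the caller relies on rmap traversal order instead. A user-space sketch of that conditional locking, with plain counters standing in for the kernel lock primitives:

```c
/* Sketch of the conditional rmap locking added to move_ptes(): both
 * locks are taken only when need_rmap_locks is set, and dropped in
 * reverse order.  The "locks" here are counters, not kernel types. */
#include <stdbool.h>
#include <stddef.h>

struct sketch_lock { int held; };

struct sketch_vma {
	struct sketch_lock *i_mmap_mutex;  /* non-NULL if file-backed */
	struct sketch_lock *anon_vma_lock; /* non-NULL if vma has anon pages */
};

/* Returns how many rmap locks were taken; 0 means the caller relies
 * on rmap traversal order, as the big comment in the diff explains. */
static int move_ptes_sketch(struct sketch_vma *vma, bool need_rmap_locks)
{
	struct sketch_lock *mapping = NULL, *anon = NULL;
	int taken = 0;

	if (need_rmap_locks) {
		if (vma->i_mmap_mutex) {
			mapping = vma->i_mmap_mutex;
			mapping->held = 1;
			taken++;
		}
		if (vma->anon_vma_lock) {
			anon = vma->anon_vma_lock;
			anon->held = 1;
			taken++;
		}
	}

	/* ... move ptes from the old range to the new range here ... */

	if (anon)
		anon->held = 0;		/* anon_vma lock released first */
	if (mapping)
		mapping->held = 0;
	return taken;
}
```

Note how the error path in move_vma() passes need_rmap_locks=true unconditionally when moving entries back, since the new vma is then traversed before the old one.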
+3 -2
mm/nobootmem.c
··· 116 116 return 0; 117 117 118 118 __free_pages_memory(start_pfn, end_pfn); 119 + fixup_zone_present_pages(pfn_to_nid(start >> PAGE_SHIFT), 120 + start_pfn, end_pfn); 119 121 120 122 return end_pfn - start_pfn; 121 123 } ··· 128 126 phys_addr_t start, end, size; 129 127 u64 i; 130 128 129 + reset_zone_present_pages(); 131 130 for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) 132 131 count += __free_memory_core(start, end); 133 132 ··· 165 162 * We need to use MAX_NUMNODES instead of NODE_DATA(0)->node_id 166 163 * because in some case like Node0 doesn't have RAM installed 167 164 * low ram will be on Node1 168 - * Use MAX_NUMNODES will make sure all ranges in early_node_map[] 169 - * will be used instead of only Node0 related 170 165 */ 171 166 return free_low_memory_core_early(MAX_NUMNODES); 172 167 }
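The fixup_zone_present_pages() call added above (its body lands in mm/page_alloc.c later in this same series) credits each zone with the portion of the freed [start_pfn, end_pfn) range that intersects the zone's span. The intersection arithmetic, pulled out as a standalone sketch:

```c
/* Overlap length of two half-open pfn ranges, as used by
 * fixup_zone_present_pages(): ranges intersect unless one ends
 * before the other begins, and the overlap is min(end) - max(start). */
static unsigned long range_intersection(unsigned long start_pfn,
					unsigned long end_pfn,
					unsigned long zone_start_pfn,
					unsigned long zone_end_pfn)
{
	if (zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn)
		return 0;
	return (end_pfn < zone_end_pfn ? end_pfn : zone_end_pfn) -
	       (start_pfn > zone_start_pfn ? start_pfn : zone_start_pfn);
}
```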
+15 -18
mm/nommu.c
··· 698 698 699 699 mutex_lock(&mapping->i_mmap_mutex); 700 700 flush_dcache_mmap_lock(mapping); 701 - vma_prio_tree_insert(vma, &mapping->i_mmap); 701 + vma_interval_tree_insert(vma, &mapping->i_mmap); 702 702 flush_dcache_mmap_unlock(mapping); 703 703 mutex_unlock(&mapping->i_mmap_mutex); 704 704 } ··· 764 764 765 765 mutex_lock(&mapping->i_mmap_mutex); 766 766 flush_dcache_mmap_lock(mapping); 767 - vma_prio_tree_remove(vma, &mapping->i_mmap); 767 + vma_interval_tree_remove(vma, &mapping->i_mmap); 768 768 flush_dcache_mmap_unlock(mapping); 769 769 mutex_unlock(&mapping->i_mmap_mutex); 770 770 } ··· 789 789 kenter("%p", vma); 790 790 if (vma->vm_ops && vma->vm_ops->close) 791 791 vma->vm_ops->close(vma); 792 - if (vma->vm_file) { 792 + if (vma->vm_file) 793 793 fput(vma->vm_file); 794 - if (vma->vm_flags & VM_EXECUTABLE) 795 - removed_exe_file_vma(mm); 796 - } 797 794 put_nommu_region(vma->vm_region); 798 795 kmem_cache_free(vm_area_cachep, vma); 799 796 } ··· 1281 1284 if (file) { 1282 1285 region->vm_file = get_file(file); 1283 1286 vma->vm_file = get_file(file); 1284 - if (vm_flags & VM_EXECUTABLE) { 1285 - added_exe_file_vma(current->mm); 1286 - vma->vm_mm = current->mm; 1287 - } 1288 1287 } 1289 1288 1290 1289 down_write(&nommu_region_sem); ··· 1433 1440 kmem_cache_free(vm_region_jar, region); 1434 1441 if (vma->vm_file) 1435 1442 fput(vma->vm_file); 1436 - if (vma->vm_flags & VM_EXECUTABLE) 1437 - removed_exe_file_vma(vma->vm_mm); 1438 1443 kmem_cache_free(vm_area_cachep, vma); 1439 1444 kleave(" = %d", ret); 1440 1445 return ret; ··· 1811 1820 if (addr != (pfn << PAGE_SHIFT)) 1812 1821 return -EINVAL; 1813 1822 1814 - vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP; 1823 + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP; 1815 1824 return 0; 1816 1825 } 1817 1826 EXPORT_SYMBOL(remap_pfn_range); ··· 1952 1961 } 1953 1962 EXPORT_SYMBOL(filemap_fault); 1954 1963 1964 + int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long 
addr, 1965 + unsigned long size, pgoff_t pgoff) 1966 + { 1967 + BUG(); 1968 + return 0; 1969 + } 1970 + EXPORT_SYMBOL(generic_file_remap_pages); 1971 + 1955 1972 static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, 1956 1973 unsigned long addr, void *buf, int len, int write) 1957 1974 { ··· 2044 2045 size_t newsize) 2045 2046 { 2046 2047 struct vm_area_struct *vma; 2047 - struct prio_tree_iter iter; 2048 2048 struct vm_region *region; 2049 2049 pgoff_t low, high; 2050 2050 size_t r_size, r_top; ··· 2055 2057 mutex_lock(&inode->i_mapping->i_mmap_mutex); 2056 2058 2057 2059 /* search for VMAs that fall within the dead zone */ 2058 - vma_prio_tree_foreach(vma, &iter, &inode->i_mapping->i_mmap, 2059 - low, high) { 2060 + vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) { 2060 2061 /* found one - only interested if it's shared out of the page 2061 2062 * cache */ 2062 2063 if (vma->vm_flags & VM_SHARED) { ··· 2071 2074 * we don't check for any regions that start beyond the EOF as there 2072 2075 * shouldn't be any 2073 2076 */ 2074 - vma_prio_tree_foreach(vma, &iter, &inode->i_mapping->i_mmap, 2075 - 0, ULONG_MAX) { 2077 + vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, 2078 + 0, ULONG_MAX) { 2076 2079 if (!(vma->vm_flags & VM_SHARED)) 2077 2080 continue; 2078 2081
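The vma_prio_tree_* calls above become vma_interval_tree_* as part of the series' rbtree/interval-tree rework. The query both structures answer is the same: which vmas overlap the page-offset range [low, high]? A linear sketch of that overlap predicate, with illustrative names:

```c
/* The per-vma closed interval [pgoff_start, pgoff_last] intersects a
 * query interval [low, high] iff neither lies entirely before the
 * other.  This is the test the interval tree evaluates efficiently. */
#include <stdbool.h>

struct sketch_vma_range {
	unsigned long pgoff_start;	/* first page offset mapped */
	unsigned long pgoff_last;	/* last page offset mapped, inclusive */
};

static bool vma_overlaps(const struct sketch_vma_range *v,
			 unsigned long low, unsigned long high)
{
	return v->pgoff_start <= high && v->pgoff_last >= low;
}
```

An interval tree finds all overlapping vmas in O(log n + matches) rather than scanning every vma, which is why the foreach loops above keep the same (low, high) arguments.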
+2 -2
mm/oom_kill.c
··· 428 428 { 429 429 task_lock(current); 430 430 pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, " 431 - "oom_adj=%d, oom_score_adj=%d\n", 432 - current->comm, gfp_mask, order, current->signal->oom_adj, 431 + "oom_score_adj=%d\n", 432 + current->comm, gfp_mask, order, 433 433 current->signal->oom_score_adj); 434 434 cpuset_print_task_mems_allowed(current); 435 435 task_unlock(current);
+201 -116
mm/page_alloc.c
··· 558 558 if (page_is_guard(buddy)) { 559 559 clear_page_guard_flag(buddy); 560 560 set_page_private(page, 0); 561 - __mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order); 561 + __mod_zone_freepage_state(zone, 1 << order, 562 + migratetype); 562 563 } else { 563 564 list_del(&buddy->lru); 564 565 zone->free_area[order].nr_free--; ··· 596 595 list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); 597 596 out: 598 597 zone->free_area[order].nr_free++; 599 - } 600 - 601 - /* 602 - * free_page_mlock() -- clean up attempts to free and mlocked() page. 603 - * Page should not be on lru, so no need to fix that up. 604 - * free_pages_check() will verify... 605 - */ 606 - static inline void free_page_mlock(struct page *page) 607 - { 608 - __dec_zone_page_state(page, NR_MLOCK); 609 - __count_vm_event(UNEVICTABLE_MLOCKFREED); 610 598 } 611 599 612 600 static inline int free_pages_check(struct page *page) ··· 658 668 batch_free = to_free; 659 669 660 670 do { 671 + int mt; /* migratetype of the to-be-freed page */ 672 + 661 673 page = list_entry(list->prev, struct page, lru); 662 674 /* must delete as __free_one_page list manipulates */ 663 675 list_del(&page->lru); 676 + mt = get_freepage_migratetype(page); 664 677 /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */ 665 - __free_one_page(page, zone, 0, page_private(page)); 666 - trace_mm_page_pcpu_drain(page, 0, page_private(page)); 678 + __free_one_page(page, zone, 0, mt); 679 + trace_mm_page_pcpu_drain(page, 0, mt); 680 + if (is_migrate_cma(mt)) 681 + __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1); 667 682 } while (--to_free && --batch_free && !list_empty(list)); 668 683 } 669 684 __mod_zone_page_state(zone, NR_FREE_PAGES, count); ··· 683 688 zone->pages_scanned = 0; 684 689 685 690 __free_one_page(page, zone, order, migratetype); 686 - __mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order); 691 + if (unlikely(migratetype != MIGRATE_ISOLATE)) 692 + __mod_zone_freepage_state(zone, 1 << order, 
migratetype); 687 693 spin_unlock(&zone->lock); 688 694 } 689 695 ··· 717 721 static void __free_pages_ok(struct page *page, unsigned int order) 718 722 { 719 723 unsigned long flags; 720 - int wasMlocked = __TestClearPageMlocked(page); 724 + int migratetype; 721 725 722 726 if (!free_pages_prepare(page, order)) 723 727 return; 724 728 725 729 local_irq_save(flags); 726 - if (unlikely(wasMlocked)) 727 - free_page_mlock(page); 728 730 __count_vm_events(PGFREE, 1 << order); 729 - free_one_page(page_zone(page), page, order, 730 - get_pageblock_migratetype(page)); 731 + migratetype = get_pageblock_migratetype(page); 732 + set_freepage_migratetype(page, migratetype); 733 + free_one_page(page_zone(page), page, order, migratetype); 731 734 local_irq_restore(flags); 732 735 } 733 736 ··· 806 811 set_page_guard_flag(&page[size]); 807 812 set_page_private(&page[size], high); 808 813 /* Guard pages are not available for any usage */ 809 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << high)); 814 + __mod_zone_freepage_state(zone, -(1 << high), 815 + migratetype); 810 816 continue; 811 817 } 812 818 #endif ··· 911 915 * Note that start_page and end_pages are not aligned on a pageblock 912 916 * boundary. 
If alignment is required, use move_freepages_block() 913 917 */ 914 - static int move_freepages(struct zone *zone, 918 + int move_freepages(struct zone *zone, 915 919 struct page *start_page, struct page *end_page, 916 920 int migratetype) 917 921 { ··· 947 951 order = page_order(page); 948 952 list_move(&page->lru, 949 953 &zone->free_area[order].free_list[migratetype]); 954 + set_freepage_migratetype(page, migratetype); 950 955 page += 1 << order; 951 956 pages_moved += 1 << order; 952 957 } ··· 1132 1135 if (!is_migrate_cma(mt) && mt != MIGRATE_ISOLATE) 1133 1136 mt = migratetype; 1134 1137 } 1135 - set_page_private(page, mt); 1138 + set_freepage_migratetype(page, mt); 1136 1139 list = &page->lru; 1140 + if (is_migrate_cma(mt)) 1141 + __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1142 + -(1 << order)); 1137 1143 } 1138 1144 __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order)); 1139 1145 spin_unlock(&zone->lock); ··· 1296 1296 struct per_cpu_pages *pcp; 1297 1297 unsigned long flags; 1298 1298 int migratetype; 1299 - int wasMlocked = __TestClearPageMlocked(page); 1300 1299 1301 1300 if (!free_pages_prepare(page, 0)) 1302 1301 return; 1303 1302 1304 1303 migratetype = get_pageblock_migratetype(page); 1305 - set_page_private(page, migratetype); 1304 + set_freepage_migratetype(page, migratetype); 1306 1305 local_irq_save(flags); 1307 - if (unlikely(wasMlocked)) 1308 - free_page_mlock(page); 1309 1306 __count_vm_event(PGFREE); 1310 1307 1311 1308 /* ··· 1377 1380 } 1378 1381 1379 1382 /* 1380 - * Similar to split_page except the page is already free. As this is only 1381 - * being used for migration, the migratetype of the block also changes. 1382 - * As this is called with interrupts disabled, the caller is responsible 1383 - * for calling arch_alloc_page() and kernel_map_page() after interrupts 1384 - * are enabled. 1385 - * 1386 - * Note: this is probably too low level an operation for use in drivers. 
1387 - * Please consult with lkml before using this in your driver. 1383 + * Similar to the split_page family of functions except that the page 1384 + * required at the given order and being isolated now to prevent races 1385 + * with parallel allocators 1388 1386 */ 1389 - int split_free_page(struct page *page) 1387 + int capture_free_page(struct page *page, int alloc_order, int migratetype) 1390 1388 { 1391 1389 unsigned int order; 1392 1390 unsigned long watermark; 1393 1391 struct zone *zone; 1392 + int mt; 1394 1393 1395 1394 BUG_ON(!PageBuddy(page)); 1396 1395 ··· 1402 1409 list_del(&page->lru); 1403 1410 zone->free_area[order].nr_free--; 1404 1411 rmv_page_order(page); 1405 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order)); 1406 1412 1407 - /* Split into individual pages */ 1408 - set_page_refcounted(page); 1409 - split_page(page, order); 1413 + mt = get_pageblock_migratetype(page); 1414 + if (unlikely(mt != MIGRATE_ISOLATE)) 1415 + __mod_zone_freepage_state(zone, -(1UL << order), mt); 1410 1416 1417 + if (alloc_order != order) 1418 + expand(zone, page, alloc_order, order, 1419 + &zone->free_area[order], migratetype); 1420 + 1421 + /* Set the pageblock if the captured page is at least a pageblock */ 1411 1422 if (order >= pageblock_order - 1) { 1412 1423 struct page *endpage = page + (1 << order) - 1; 1413 1424 for (; page < endpage; page += pageblock_nr_pages) { ··· 1422 1425 } 1423 1426 } 1424 1427 1425 - return 1 << order; 1428 + return 1UL << order; 1429 + } 1430 + 1431 + /* 1432 + * Similar to split_page except the page is already free. As this is only 1433 + * being used for migration, the migratetype of the block also changes. 1434 + * As this is called with interrupts disabled, the caller is responsible 1435 + * for calling arch_alloc_page() and kernel_map_page() after interrupts 1436 + * are enabled. 1437 + * 1438 + * Note: this is probably too low level an operation for use in drivers. 
1439 + * Please consult with lkml before using this in your driver. 1440 + */ 1441 + int split_free_page(struct page *page) 1442 + { 1443 + unsigned int order; 1444 + int nr_pages; 1445 + 1446 + BUG_ON(!PageBuddy(page)); 1447 + order = page_order(page); 1448 + 1449 + nr_pages = capture_free_page(page, order, 0); 1450 + if (!nr_pages) 1451 + return 0; 1452 + 1453 + /* Split into individual pages */ 1454 + set_page_refcounted(page); 1455 + split_page(page, order); 1456 + return nr_pages; 1426 1457 } 1427 1458 1428 1459 /* ··· 1509 1484 spin_unlock(&zone->lock); 1510 1485 if (!page) 1511 1486 goto failed; 1512 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order)); 1487 + __mod_zone_freepage_state(zone, -(1 << order), 1488 + get_pageblock_migratetype(page)); 1513 1489 } 1514 1490 1515 1491 __count_zone_vm_events(PGALLOC, zone, 1 << order); ··· 1526 1500 local_irq_restore(flags); 1527 1501 return NULL; 1528 1502 } 1529 - 1530 - /* The ALLOC_WMARK bits are used as an index to zone->watermark */ 1531 - #define ALLOC_WMARK_MIN WMARK_MIN 1532 - #define ALLOC_WMARK_LOW WMARK_LOW 1533 - #define ALLOC_WMARK_HIGH WMARK_HIGH 1534 - #define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */ 1535 - 1536 - /* Mask to get the watermark bits */ 1537 - #define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1) 1538 - 1539 - #define ALLOC_HARDER 0x10 /* try to alloc harder */ 1540 - #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ 1541 - #define ALLOC_CPUSET 0x40 /* check for correct cpuset */ 1542 1503 1543 1504 #ifdef CONFIG_FAIL_PAGE_ALLOC 1544 1505 ··· 1621 1608 min -= min / 2; 1622 1609 if (alloc_flags & ALLOC_HARDER) 1623 1610 min -= min / 4; 1624 - 1611 + #ifdef CONFIG_CMA 1612 + /* If allocation can't use CMA areas don't use free CMA pages */ 1613 + if (!(alloc_flags & ALLOC_CMA)) 1614 + free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES); 1615 + #endif 1625 1616 if (free_pages <= min + lowmem_reserve) 1626 1617 return false; 1627 1618 for (o = 0; o < order; o++) { ··· 1799 
1782 bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); 1800 1783 } 1801 1784 1785 + static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) 1786 + { 1787 + return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes); 1788 + } 1789 + 1790 + static void __paginginit init_zone_allows_reclaim(int nid) 1791 + { 1792 + int i; 1793 + 1794 + for_each_online_node(i) 1795 + if (node_distance(nid, i) <= RECLAIM_DISTANCE) { 1796 + node_set(i, NODE_DATA(nid)->reclaim_nodes); 1797 + zone_reclaim_mode = 1; 1798 + } 1799 + } 1800 + 1802 1801 #else /* CONFIG_NUMA */ 1803 1802 1804 1803 static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags) ··· 1833 1800 } 1834 1801 1835 1802 static void zlc_clear_zones_full(struct zonelist *zonelist) 1803 + { 1804 + } 1805 + 1806 + static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) 1807 + { 1808 + return true; 1809 + } 1810 + 1811 + static inline void init_zone_allows_reclaim(int nid) 1836 1812 { 1837 1813 } 1838 1814 #endif /* CONFIG_NUMA */ ··· 1928 1886 did_zlc_setup = 1; 1929 1887 } 1930 1888 1931 - if (zone_reclaim_mode == 0) 1889 + if (zone_reclaim_mode == 0 || 1890 + !zone_allows_reclaim(preferred_zone, zone)) 1932 1891 goto this_zone_full; 1933 1892 1934 1893 /* ··· 2148 2105 bool *contended_compaction, bool *deferred_compaction, 2149 2106 unsigned long *did_some_progress) 2150 2107 { 2151 - struct page *page; 2108 + struct page *page = NULL; 2152 2109 2153 2110 if (!order) 2154 2111 return NULL; ··· 2161 2118 current->flags |= PF_MEMALLOC; 2162 2119 *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask, 2163 2120 nodemask, sync_migration, 2164 - contended_compaction); 2121 + contended_compaction, &page); 2165 2122 current->flags &= ~PF_MEMALLOC; 2166 - if (*did_some_progress != COMPACT_SKIPPED) { 2167 2123 2124 + /* If compaction captured a page, prep and use it */ 2125 + if (page) { 2126 + prep_new_page(page, order, gfp_mask); 2127 + goto got_page; 
2128 + } 2129 + 2130 + if (*did_some_progress != COMPACT_SKIPPED) { 2168 2131 /* Page migration frees to the PCP lists but we want merging */ 2169 2132 drain_pages(get_cpu()); 2170 2133 put_cpu(); ··· 2180 2131 alloc_flags & ~ALLOC_NO_WATERMARKS, 2181 2132 preferred_zone, migratetype); 2182 2133 if (page) { 2134 + got_page: 2135 + preferred_zone->compact_blockskip_flush = false; 2183 2136 preferred_zone->compact_considered = 0; 2184 2137 preferred_zone->compact_defer_shift = 0; 2185 2138 if (order >= preferred_zone->compact_order_failed) ··· 2366 2315 unlikely(test_thread_flag(TIF_MEMDIE)))) 2367 2316 alloc_flags |= ALLOC_NO_WATERMARKS; 2368 2317 } 2369 - 2318 + #ifdef CONFIG_CMA 2319 + if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) 2320 + alloc_flags |= ALLOC_CMA; 2321 + #endif 2370 2322 return alloc_flags; 2371 2323 } 2372 2324 ··· 2416 2362 goto nopage; 2417 2363 2418 2364 restart: 2419 - if (!(gfp_mask & __GFP_NO_KSWAPD)) 2420 - wake_all_kswapd(order, zonelist, high_zoneidx, 2421 - zone_idx(preferred_zone)); 2365 + wake_all_kswapd(order, zonelist, high_zoneidx, 2366 + zone_idx(preferred_zone)); 2422 2367 2423 2368 /* 2424 2369 * OK, we're below the kswapd watermark and have kicked background ··· 2494 2441 * system then fail the allocation instead of entering direct reclaim. 
2495 2442 */ 2496 2443 if ((deferred_compaction || contended_compaction) && 2497 - (gfp_mask & __GFP_NO_KSWAPD)) 2444 + (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE) 2498 2445 goto nopage; 2499 2446 2500 2447 /* Try direct reclaim and then allocating */ ··· 2594 2541 struct page *page = NULL; 2595 2542 int migratetype = allocflags_to_migratetype(gfp_mask); 2596 2543 unsigned int cpuset_mems_cookie; 2544 + int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET; 2597 2545 2598 2546 gfp_mask &= gfp_allowed_mask; 2599 2547 ··· 2623 2569 if (!preferred_zone) 2624 2570 goto out; 2625 2571 2572 + #ifdef CONFIG_CMA 2573 + if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) 2574 + alloc_flags |= ALLOC_CMA; 2575 + #endif 2626 2576 /* First allocation attempt */ 2627 2577 page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, 2628 - zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET, 2578 + zonelist, high_zoneidx, alloc_flags, 2629 2579 preferred_zone, migratetype); 2630 2580 if (unlikely(!page)) 2631 2581 page = __alloc_pages_slowpath(gfp_mask, order, ··· 2910 2852 " unevictable:%lu" 2911 2853 " dirty:%lu writeback:%lu unstable:%lu\n" 2912 2854 " free:%lu slab_reclaimable:%lu slab_unreclaimable:%lu\n" 2913 - " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n", 2855 + " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" 2856 + " free_cma:%lu\n", 2914 2857 global_page_state(NR_ACTIVE_ANON), 2915 2858 global_page_state(NR_INACTIVE_ANON), 2916 2859 global_page_state(NR_ISOLATED_ANON), ··· 2928 2869 global_page_state(NR_FILE_MAPPED), 2929 2870 global_page_state(NR_SHMEM), 2930 2871 global_page_state(NR_PAGETABLE), 2931 - global_page_state(NR_BOUNCE)); 2872 + global_page_state(NR_BOUNCE), 2873 + global_page_state(NR_FREE_CMA_PAGES)); 2932 2874 2933 2875 for_each_populated_zone(zone) { 2934 2876 int i; ··· 2961 2901 " pagetables:%lukB" 2962 2902 " unstable:%lukB" 2963 2903 " bounce:%lukB" 2904 + " free_cma:%lukB" 2964 2905 " writeback_tmp:%lukB" 2965 
2906 " pages_scanned:%lu" 2966 2907 " all_unreclaimable? %s" ··· 2991 2930 K(zone_page_state(zone, NR_PAGETABLE)), 2992 2931 K(zone_page_state(zone, NR_UNSTABLE_NFS)), 2993 2932 K(zone_page_state(zone, NR_BOUNCE)), 2933 + K(zone_page_state(zone, NR_FREE_CMA_PAGES)), 2994 2934 K(zone_page_state(zone, NR_WRITEBACK_TEMP)), 2995 2935 zone->pages_scanned, 2996 2936 (zone->all_unreclaimable ? "yes" : "no") ··· 3390 3328 j = 0; 3391 3329 3392 3330 while ((node = find_next_best_node(local_node, &used_mask)) >= 0) { 3393 - int distance = node_distance(local_node, node); 3394 - 3395 - /* 3396 - * If another node is sufficiently far away then it is better 3397 - * to reclaim pages in a zone before going off node. 3398 - */ 3399 - if (distance > RECLAIM_DISTANCE) 3400 - zone_reclaim_mode = 1; 3401 - 3402 3331 /* 3403 3332 * We don't want to pressure a particular node. 3404 3333 * So adding penalty to the first node in same 3405 3334 * distance group to make it round-robin. 3406 3335 */ 3407 - if (distance != node_distance(local_node, prev_node)) 3336 + if (node_distance(local_node, node) != 3337 + node_distance(local_node, prev_node)) 3408 3338 node_load[node] = load; 3409 3339 3410 3340 prev_node = node; ··· 4492 4438 4493 4439 zone->spanned_pages = size; 4494 4440 zone->present_pages = realsize; 4495 - #if defined CONFIG_COMPACTION || defined CONFIG_CMA 4496 - zone->compact_cached_free_pfn = zone->zone_start_pfn + 4497 - zone->spanned_pages; 4498 - zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1); 4499 - #endif 4500 4441 #ifdef CONFIG_NUMA 4501 4442 zone->node = nid; 4502 4443 zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) ··· 4570 4521 4571 4522 pgdat->node_id = nid; 4572 4523 pgdat->node_start_pfn = node_start_pfn; 4524 + init_zone_allows_reclaim(nid); 4573 4525 calculate_node_totalpages(pgdat, zones_size, zholes_size); 4574 4526 4575 4527 alloc_node_mem_map(pgdat); ··· 4929 4879 zone_movable_pfn[i] << PAGE_SHIFT); 4930 4880 } 4931 4881 4932 - /* 
Print out the early_node_map[] */ 4882 + /* Print out the early node map */ 4933 4883 printk("Early memory node ranges\n"); 4934 4884 for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) 4935 4885 printk(" node %3d: [mem %#010lx-%#010lx]\n", nid, ··· 5669 5619 pageblock_nr_pages)); 5670 5620 } 5671 5621 5672 - static struct page * 5673 - __alloc_contig_migrate_alloc(struct page *page, unsigned long private, 5674 - int **resultp) 5675 - { 5676 - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE; 5677 - 5678 - if (PageHighMem(page)) 5679 - gfp_mask |= __GFP_HIGHMEM; 5680 - 5681 - return alloc_page(gfp_mask); 5682 - } 5683 - 5684 5622 /* [start, end) must belong to a single zone. */ 5685 - static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) 5623 + static int __alloc_contig_migrate_range(struct compact_control *cc, 5624 + unsigned long start, unsigned long end) 5686 5625 { 5687 5626 /* This function is based on compact_zone() from compaction.c. */ 5688 - 5627 + unsigned long nr_reclaimed; 5689 5628 unsigned long pfn = start; 5690 5629 unsigned int tries = 0; 5691 5630 int ret = 0; 5692 5631 5693 - struct compact_control cc = { 5694 - .nr_migratepages = 0, 5695 - .order = -1, 5696 - .zone = page_zone(pfn_to_page(start)), 5697 - .sync = true, 5698 - }; 5699 - INIT_LIST_HEAD(&cc.migratepages); 5700 - 5701 5632 migrate_prep_local(); 5702 5633 5703 - while (pfn < end || !list_empty(&cc.migratepages)) { 5634 + while (pfn < end || !list_empty(&cc->migratepages)) { 5704 5635 if (fatal_signal_pending(current)) { 5705 5636 ret = -EINTR; 5706 5637 break; 5707 5638 } 5708 5639 5709 - if (list_empty(&cc.migratepages)) { 5710 - cc.nr_migratepages = 0; 5711 - pfn = isolate_migratepages_range(cc.zone, &cc, 5712 - pfn, end); 5640 + if (list_empty(&cc->migratepages)) { 5641 + cc->nr_migratepages = 0; 5642 + pfn = isolate_migratepages_range(cc->zone, cc, 5643 + pfn, end, true); 5713 5644 if (!pfn) { 5714 5645 ret = -EINTR; 5715 5646 break; ··· 5701 5670 
break; 5702 5671 } 5703 5672 5704 - ret = migrate_pages(&cc.migratepages, 5705 - __alloc_contig_migrate_alloc, 5673 + nr_reclaimed = reclaim_clean_pages_from_list(cc->zone, 5674 + &cc->migratepages); 5675 + cc->nr_migratepages -= nr_reclaimed; 5676 + 5677 + ret = migrate_pages(&cc->migratepages, 5678 + alloc_migrate_target, 5706 5679 0, false, MIGRATE_SYNC); 5707 5680 } 5708 5681 5709 - putback_lru_pages(&cc.migratepages); 5682 + putback_lru_pages(&cc->migratepages); 5710 5683 return ret > 0 ? 0 : ret; 5711 5684 } 5712 5685 ··· 5789 5754 unsigned long outer_start, outer_end; 5790 5755 int ret = 0, order; 5791 5756 5757 + struct compact_control cc = { 5758 + .nr_migratepages = 0, 5759 + .order = -1, 5760 + .zone = page_zone(pfn_to_page(start)), 5761 + .sync = true, 5762 + .ignore_skip_hint = true, 5763 + }; 5764 + INIT_LIST_HEAD(&cc.migratepages); 5765 + 5792 5766 /* 5793 5767 * What we do here is we mark all pageblocks in range as 5794 5768 * MIGRATE_ISOLATE. Because pageblock and max order pages may ··· 5827 5783 if (ret) 5828 5784 goto done; 5829 5785 5830 - ret = __alloc_contig_migrate_range(start, end); 5786 + ret = __alloc_contig_migrate_range(&cc, start, end); 5831 5787 if (ret) 5832 5788 goto done; 5833 5789 ··· 5876 5832 __reclaim_pages(zone, GFP_HIGHUSER_MOVABLE, end-start); 5877 5833 5878 5834 /* Grab isolated pages from freelists. 
*/ 5879 - outer_end = isolate_freepages_range(outer_start, end); 5835 + outer_end = isolate_freepages_range(&cc, outer_start, end); 5880 5836 if (!outer_end) { 5881 5837 ret = -EBUSY; 5882 5838 goto done; ··· 5918 5874 local_irq_save(flags); 5919 5875 if (pcp->count > 0) 5920 5876 free_pcppages_bulk(zone, pcp->count, pcp); 5877 + drain_zonestat(zone, pset); 5921 5878 setup_pageset(pset, batch); 5922 5879 local_irq_restore(flags); 5923 5880 } ··· 5935 5890 void zone_pcp_reset(struct zone *zone) 5936 5891 { 5937 5892 unsigned long flags; 5893 + int cpu; 5894 + struct per_cpu_pageset *pset; 5938 5895 5939 5896 /* avoid races with drain_pages() */ 5940 5897 local_irq_save(flags); 5941 5898 if (zone->pageset != &boot_pageset) { 5899 + for_each_online_cpu(cpu) { 5900 + pset = per_cpu_ptr(zone->pageset, cpu); 5901 + drain_zonestat(zone, pset); 5902 + } 5942 5903 free_percpu(zone->pageset); 5943 5904 zone->pageset = &boot_pageset; 5944 5905 } ··· 6097 6046 page->mapping, page->index); 6098 6047 dump_page_flags(page->flags); 6099 6048 mem_cgroup_print_bad_page(page); 6049 + } 6050 + 6051 + /* reset zone->present_pages */ 6052 + void reset_zone_present_pages(void) 6053 + { 6054 + struct zone *z; 6055 + int i, nid; 6056 + 6057 + for_each_node_state(nid, N_HIGH_MEMORY) { 6058 + for (i = 0; i < MAX_NR_ZONES; i++) { 6059 + z = NODE_DATA(nid)->node_zones + i; 6060 + z->present_pages = 0; 6061 + } 6062 + } 6063 + } 6064 + 6065 + /* calculate zone's present pages in buddy system */ 6066 + void fixup_zone_present_pages(int nid, unsigned long start_pfn, 6067 + unsigned long end_pfn) 6068 + { 6069 + struct zone *z; 6070 + unsigned long zone_start_pfn, zone_end_pfn; 6071 + int i; 6072 + 6073 + for (i = 0; i < MAX_NR_ZONES; i++) { 6074 + z = NODE_DATA(nid)->node_zones + i; 6075 + zone_start_pfn = z->zone_start_pfn; 6076 + zone_end_pfn = zone_start_pfn + z->spanned_pages; 6077 + 6078 + /* if the two regions intersect */ 6079 + if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= 
start_pfn)) 6080 + z->present_pages += min(end_pfn, zone_end_pfn) - 6081 + max(start_pfn, zone_start_pfn); 6082 + } 6100 6083 }
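Many page_alloc.c hunks above replace open-coded NR_FREE_PAGES updates with __mod_zone_freepage_state(), which additionally maintains the new NR_FREE_CMA_PAGES counter when the pages belong to a CMA pageblock. A user-space sketch of that dual accounting; the enum values and struct are stand-ins, not the kernel's:

```c
/* Sketch of __mod_zone_freepage_state(): NR_FREE_PAGES always moves,
 * NR_FREE_CMA_PAGES moves too when the migratetype is CMA.
 * Illustrative types and values. */
#include <stdbool.h>

enum sketch_migratetype { MT_UNMOVABLE, MT_MOVABLE, MT_CMA, MT_ISOLATE };

struct sketch_zone {
	long nr_free_pages;
	long nr_free_cma_pages;
};

static bool is_migrate_cma_sketch(enum sketch_migratetype mt)
{
	return mt == MT_CMA;
}

static void mod_zone_freepage_state(struct sketch_zone *z, long nr_pages,
				    enum sketch_migratetype mt)
{
	z->nr_free_pages += nr_pages;
	if (is_migrate_cma_sketch(mt))
		z->nr_free_cma_pages += nr_pages;
}
```

Keeping a separate CMA free count is what lets __zone_watermark_ok() above subtract NR_FREE_CMA_PAGES for allocations that lack ALLOC_CMA, so non-movable allocations cannot be satisfied on paper by pages they may not take.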
+38 -5
mm/page_isolation.c
··· 76 76 77 77 out: 78 78 if (!ret) { 79 + unsigned long nr_pages; 80 + int migratetype = get_pageblock_migratetype(page); 81 + 79 82 set_pageblock_isolate(page); 80 - move_freepages_block(zone, page, MIGRATE_ISOLATE); 83 + nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE); 84 + 85 + __mod_zone_freepage_state(zone, -nr_pages, migratetype); 81 86 } 82 87 83 88 spin_unlock_irqrestore(&zone->lock, flags); ··· 94 89 void unset_migratetype_isolate(struct page *page, unsigned migratetype) 95 90 { 96 91 struct zone *zone; 97 - unsigned long flags; 92 + unsigned long flags, nr_pages; 93 + 98 94 zone = page_zone(page); 99 95 spin_lock_irqsave(&zone->lock, flags); 100 96 if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) 101 97 goto out; 102 - move_freepages_block(zone, page, migratetype); 98 + nr_pages = move_freepages_block(zone, page, migratetype); 99 + __mod_zone_freepage_state(zone, nr_pages, migratetype); 103 100 restore_pageblock_isolate(page, migratetype); 104 101 out: 105 102 spin_unlock_irqrestore(&zone->lock, flags); ··· 200 193 continue; 201 194 } 202 195 page = pfn_to_page(pfn); 203 - if (PageBuddy(page)) 196 + if (PageBuddy(page)) { 197 + /* 198 + * If race between isolatation and allocation happens, 199 + * some free pages could be in MIGRATE_MOVABLE list 200 + * although pageblock's migratation type of the page 201 + * is MIGRATE_ISOLATE. Catch it and move the page into 202 + * MIGRATE_ISOLATE list. 
203 + */ 204 + if (get_freepage_migratetype(page) != MIGRATE_ISOLATE) { 205 + struct page *end_page; 206 + 207 + end_page = page + (1 << page_order(page)) - 1; 208 + move_freepages(page_zone(page), page, end_page, 209 + MIGRATE_ISOLATE); 210 + } 204 211 pfn += 1 << page_order(page); 212 + } 205 213 else if (page_count(page) == 0 && 206 - page_private(page) == MIGRATE_ISOLATE) 214 + get_freepage_migratetype(page) == MIGRATE_ISOLATE) 207 215 pfn += 1; 208 216 else 209 217 break; ··· 254 232 ret = __test_page_isolated_in_pageblock(start_pfn, end_pfn); 255 233 spin_unlock_irqrestore(&zone->lock, flags); 256 234 return ret ? 0 : -EBUSY; 235 + } 236 + 237 + struct page *alloc_migrate_target(struct page *page, unsigned long private, 238 + int **resultp) 239 + { 240 + gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE; 241 + 242 + if (PageHighMem(page)) 243 + gfp_mask |= __GFP_HIGHMEM; 244 + 245 + return alloc_page(gfp_mask); 257 246 }
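The new alloc_migrate_target() picks a GFP mask for the CMA migration target: always movable user pages, with `__GFP_HIGHMEM` added only when the source page was highmem. A sketch of that flag selection, using made-up flag values (the real bit values live in the kernel's gfp.h):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for GFP bits; values are NOT the kernel's. */
enum {
	SK_GFP_USER    = 1u << 0,
	SK_GFP_MOVABLE = 1u << 1,
	SK_GFP_HIGHMEM = 1u << 2,
};

/* Sketch of the mask selection in alloc_migrate_target(): migration
 * targets are movable user pages, and a highmem source may also be
 * satisfied from highmem. */
static unsigned sk_migrate_gfp(bool page_is_highmem)
{
	unsigned gfp_mask = SK_GFP_USER | SK_GFP_MOVABLE;

	if (page_is_highmem)
		gfp_mask |= SK_GFP_HIGHMEM;
	return gfp_mask;
}
```

Keeping `__GFP_MOVABLE` set matters here: the freshly allocated target stays in a migratable pageblock, so a later CMA or memory-hotremove pass can move it again.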
+50
mm/pgtable-generic.c
··· 120 120 } 121 121 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 122 122 #endif 123 + 124 + #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT 125 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 126 + void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable) 127 + { 128 + assert_spin_locked(&mm->page_table_lock); 129 + 130 + /* FIFO */ 131 + if (!mm->pmd_huge_pte) 132 + INIT_LIST_HEAD(&pgtable->lru); 133 + else 134 + list_add(&pgtable->lru, &mm->pmd_huge_pte->lru); 135 + mm->pmd_huge_pte = pgtable; 136 + } 137 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 138 + #endif 139 + 140 + #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW 141 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 142 + /* no "address" argument so destroys page coloring of some arch */ 143 + pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm) 144 + { 145 + pgtable_t pgtable; 146 + 147 + assert_spin_locked(&mm->page_table_lock); 148 + 149 + /* FIFO */ 150 + pgtable = mm->pmd_huge_pte; 151 + if (list_empty(&pgtable->lru)) 152 + mm->pmd_huge_pte = NULL; 153 + else { 154 + mm->pmd_huge_pte = list_entry(pgtable->lru.next, 155 + struct page, lru); 156 + list_del(&pgtable->lru); 157 + } 158 + return pgtable; 159 + } 160 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 161 + #endif 162 + 163 + #ifndef __HAVE_ARCH_PMDP_INVALIDATE 164 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 165 + void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, 166 + pmd_t *pmdp) 167 + { 168 + set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(*pmdp)); 169 + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); 170 + } 171 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 172 + #endif
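The generic deposit/withdraw pair added here stashes preallocated PTE tables for a huge pmd and hands them back later (the `/* FIFO */` comments state the intended discipline). A minimal userspace sketch of that intent, assuming a plain singly-linked queue; the kernel version instead threads the tables through `page->lru` anchored at `mm->pmd_huge_pte` under `page_table_lock`, so its list bookkeeping differs from this toy:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a deposited page table. */
struct sk_pgtable {
	struct sk_pgtable *next;
};

static struct sk_pgtable *sk_anchor;	/* stand-in for mm->pmd_huge_pte */

/* Deposit: queue a preallocated table; the oldest stays at the head. */
static void sk_deposit(struct sk_pgtable *pg)
{
	struct sk_pgtable *p;

	pg->next = NULL;
	if (!sk_anchor) {
		sk_anchor = pg;
		return;
	}
	for (p = sk_anchor; p->next; p = p->next)
		;
	p->next = pg;
}

/* Withdraw: hand back a previously deposited table, oldest first. */
static struct sk_pgtable *sk_withdraw(void)
{
	struct sk_pgtable *pg = sk_anchor;

	if (pg)
		sk_anchor = pg->next;
	return pg;
}
```

Architectures that care about page coloring can override both hooks (`__HAVE_ARCH_PGTABLE_DEPOSIT` / `__HAVE_ARCH_PGTABLE_WITHDRAW`), which is why the generic withdraw notes it takes no address argument.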
-208
mm/prio_tree.c
··· 1 - /* 2 - * mm/prio_tree.c - priority search tree for mapping->i_mmap 3 - * 4 - * Copyright (C) 2004, Rajesh Venkatasubramanian <vrajesh@umich.edu> 5 - * 6 - * This file is released under the GPL v2. 7 - * 8 - * Based on the radix priority search tree proposed by Edward M. McCreight 9 - * SIAM Journal of Computing, vol. 14, no.2, pages 257-276, May 1985 10 - * 11 - * 02Feb2004 Initial version 12 - */ 13 - 14 - #include <linux/mm.h> 15 - #include <linux/prio_tree.h> 16 - #include <linux/prefetch.h> 17 - 18 - /* 19 - * See lib/prio_tree.c for details on the general radix priority search tree 20 - * code. 21 - */ 22 - 23 - /* 24 - * The following #defines are mirrored from lib/prio_tree.c. They're only used 25 - * for debugging, and should be removed (along with the debugging code using 26 - * them) when switching also VMAs to the regular prio_tree code. 27 - */ 28 - 29 - #define RADIX_INDEX(vma) ((vma)->vm_pgoff) 30 - #define VMA_SIZE(vma) (((vma)->vm_end - (vma)->vm_start) >> PAGE_SHIFT) 31 - /* avoid overflow */ 32 - #define HEAP_INDEX(vma) ((vma)->vm_pgoff + (VMA_SIZE(vma) - 1)) 33 - 34 - /* 35 - * Radix priority search tree for address_space->i_mmap 36 - * 37 - * For each vma that map a unique set of file pages i.e., unique [radix_index, 38 - * heap_index] value, we have a corresponding priority search tree node. If 39 - * multiple vmas have identical [radix_index, heap_index] value, then one of 40 - * them is used as a tree node and others are stored in a vm_set list. The tree 41 - * node points to the first vma (head) of the list using vm_set.head. 42 - * 43 - * prio_tree_root 44 - * | 45 - * A vm_set.head 46 - * / \ / 47 - * L R -> H-I-J-K-M-N-O-P-Q-S 48 - * ^ ^ <-- vm_set.list --> 49 - * tree nodes 50 - * 51 - * We need some way to identify whether a vma is a tree node, head of a vm_set 52 - * list, or just a member of a vm_set list. We cannot use vm_flags to store 53 - * such information. 
The reason is, in the above figure, it is possible that 54 - * vm_flags' of R and H are covered by the different mmap_sems. When R is 55 - * removed under R->mmap_sem, H replaces R as a tree node. Since we do not hold 56 - * H->mmap_sem, we cannot use H->vm_flags for marking that H is a tree node now. 57 - * That's why some trick involving shared.vm_set.parent is used for identifying 58 - * tree nodes and list head nodes. 59 - * 60 - * vma radix priority search tree node rules: 61 - * 62 - * vma->shared.vm_set.parent != NULL ==> a tree node 63 - * vma->shared.vm_set.head != NULL ==> list of others mapping same range 64 - * vma->shared.vm_set.head == NULL ==> no others map the same range 65 - * 66 - * vma->shared.vm_set.parent == NULL 67 - * vma->shared.vm_set.head != NULL ==> list head of vmas mapping same range 68 - * vma->shared.vm_set.head == NULL ==> a list node 69 - */ 70 - 71 - /* 72 - * Add a new vma known to map the same set of pages as the old vma: 73 - * useful for fork's dup_mmap as well as vma_prio_tree_insert below. 74 - * Note that it just happens to work correctly on i_mmap_nonlinear too. 
75 - */ 76 - void vma_prio_tree_add(struct vm_area_struct *vma, struct vm_area_struct *old) 77 - { 78 - /* Leave these BUG_ONs till prio_tree patch stabilizes */ 79 - BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old)); 80 - BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old)); 81 - 82 - vma->shared.vm_set.head = NULL; 83 - vma->shared.vm_set.parent = NULL; 84 - 85 - if (!old->shared.vm_set.parent) 86 - list_add(&vma->shared.vm_set.list, 87 - &old->shared.vm_set.list); 88 - else if (old->shared.vm_set.head) 89 - list_add_tail(&vma->shared.vm_set.list, 90 - &old->shared.vm_set.head->shared.vm_set.list); 91 - else { 92 - INIT_LIST_HEAD(&vma->shared.vm_set.list); 93 - vma->shared.vm_set.head = old; 94 - old->shared.vm_set.head = vma; 95 - } 96 - } 97 - 98 - void vma_prio_tree_insert(struct vm_area_struct *vma, 99 - struct prio_tree_root *root) 100 - { 101 - struct prio_tree_node *ptr; 102 - struct vm_area_struct *old; 103 - 104 - vma->shared.vm_set.head = NULL; 105 - 106 - ptr = raw_prio_tree_insert(root, &vma->shared.prio_tree_node); 107 - if (ptr != (struct prio_tree_node *) &vma->shared.prio_tree_node) { 108 - old = prio_tree_entry(ptr, struct vm_area_struct, 109 - shared.prio_tree_node); 110 - vma_prio_tree_add(vma, old); 111 - } 112 - } 113 - 114 - void vma_prio_tree_remove(struct vm_area_struct *vma, 115 - struct prio_tree_root *root) 116 - { 117 - struct vm_area_struct *node, *head, *new_head; 118 - 119 - if (!vma->shared.vm_set.head) { 120 - if (!vma->shared.vm_set.parent) 121 - list_del_init(&vma->shared.vm_set.list); 122 - else 123 - raw_prio_tree_remove(root, &vma->shared.prio_tree_node); 124 - } else { 125 - /* Leave this BUG_ON till prio_tree patch stabilizes */ 126 - BUG_ON(vma->shared.vm_set.head->shared.vm_set.head != vma); 127 - if (vma->shared.vm_set.parent) { 128 - head = vma->shared.vm_set.head; 129 - if (!list_empty(&head->shared.vm_set.list)) { 130 - new_head = list_entry( 131 - head->shared.vm_set.list.next, 132 - struct vm_area_struct, 133 - 
shared.vm_set.list); 134 - list_del_init(&head->shared.vm_set.list); 135 - } else 136 - new_head = NULL; 137 - 138 - raw_prio_tree_replace(root, &vma->shared.prio_tree_node, 139 - &head->shared.prio_tree_node); 140 - head->shared.vm_set.head = new_head; 141 - if (new_head) 142 - new_head->shared.vm_set.head = head; 143 - 144 - } else { 145 - node = vma->shared.vm_set.head; 146 - if (!list_empty(&vma->shared.vm_set.list)) { 147 - new_head = list_entry( 148 - vma->shared.vm_set.list.next, 149 - struct vm_area_struct, 150 - shared.vm_set.list); 151 - list_del_init(&vma->shared.vm_set.list); 152 - node->shared.vm_set.head = new_head; 153 - new_head->shared.vm_set.head = node; 154 - } else 155 - node->shared.vm_set.head = NULL; 156 - } 157 - } 158 - } 159 - 160 - /* 161 - * Helper function to enumerate vmas that map a given file page or a set of 162 - * contiguous file pages. The function returns vmas that at least map a single 163 - * page in the given range of contiguous file pages. 164 - */ 165 - struct vm_area_struct *vma_prio_tree_next(struct vm_area_struct *vma, 166 - struct prio_tree_iter *iter) 167 - { 168 - struct prio_tree_node *ptr; 169 - struct vm_area_struct *next; 170 - 171 - if (!vma) { 172 - /* 173 - * First call is with NULL vma 174 - */ 175 - ptr = prio_tree_next(iter); 176 - if (ptr) { 177 - next = prio_tree_entry(ptr, struct vm_area_struct, 178 - shared.prio_tree_node); 179 - prefetch(next->shared.vm_set.head); 180 - return next; 181 - } else 182 - return NULL; 183 - } 184 - 185 - if (vma->shared.vm_set.parent) { 186 - if (vma->shared.vm_set.head) { 187 - next = vma->shared.vm_set.head; 188 - prefetch(next->shared.vm_set.list.next); 189 - return next; 190 - } 191 - } else { 192 - next = list_entry(vma->shared.vm_set.list.next, 193 - struct vm_area_struct, shared.vm_set.list); 194 - if (!next->shared.vm_set.head) { 195 - prefetch(next->shared.vm_set.list.next); 196 - return next; 197 - } 198 - } 199 - 200 - ptr = prio_tree_next(iter); 201 - if (ptr) { 
202 - next = prio_tree_entry(ptr, struct vm_area_struct, 203 - shared.prio_tree_node); 204 - prefetch(next->shared.vm_set.head); 205 - return next; 206 - } else 207 - return NULL; 208 - }
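The deleted prio_tree code is replaced throughout this series by an rbtree-based interval tree indexing vmas by `[start_pgoff, last_pgoff]`. The core query both structures answer is the closed-interval overlap ("stabbing") test, sketched here with illustrative names:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the closed-interval overlap test behind
 * vma_interval_tree_foreach(..., first_pgoff, last_pgoff):
 * two closed ranges intersect iff each starts no later than
 * the other ends. */
static bool sk_intervals_overlap(unsigned long start1, unsigned long last1,
				 unsigned long start2, unsigned long last2)
{
	return start1 <= last2 && start2 <= last1;
}
```

Because the interval tree only ever yields vmas that actually overlap the queried page range, the callers converted later in this merge can drop their `address == -EFAULT` checks.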
+60 -99
mm/rmap.c
··· 127 127 avc->vma = vma; 128 128 avc->anon_vma = anon_vma; 129 129 list_add(&avc->same_vma, &vma->anon_vma_chain); 130 - 131 - /* 132 - * It's critical to add new vmas to the tail of the anon_vma, 133 - * see comment in huge_memory.c:__split_huge_page(). 134 - */ 135 - list_add_tail(&avc->same_anon_vma, &anon_vma->head); 130 + anon_vma_interval_tree_insert(avc, &anon_vma->rb_root); 136 131 } 137 132 138 133 /** ··· 264 269 } 265 270 266 271 /* 267 - * Some rmap walk that needs to find all ptes/hugepmds without false 268 - * negatives (like migrate and split_huge_page) running concurrent 269 - * with operations that copy or move pagetables (like mremap() and 270 - * fork()) to be safe. They depend on the anon_vma "same_anon_vma" 271 - * list to be in a certain order: the dst_vma must be placed after the 272 - * src_vma in the list. This is always guaranteed by fork() but 273 - * mremap() needs to call this function to enforce it in case the 274 - * dst_vma isn't newly allocated and chained with the anon_vma_clone() 275 - * function but just an extension of a pre-existing vma through 276 - * vma_merge. 277 - * 278 - * NOTE: the same_anon_vma list can still be changed by other 279 - * processes while mremap runs because mremap doesn't hold the 280 - * anon_vma mutex to prevent modifications to the list while it 281 - * runs. All we need to enforce is that the relative order of this 282 - * process vmas isn't changing (we don't care about other vmas 283 - * order). Each vma corresponds to an anon_vma_chain structure so 284 - * there's no risk that other processes calling anon_vma_moveto_tail() 285 - * and changing the same_anon_vma list under mremap() will screw with 286 - * the relative order of this process vmas in the list, because we 287 - * they can't alter the order of any vma that belongs to this 288 - * process. 
And there can't be another anon_vma_moveto_tail() running 289 - * concurrently with mremap() coming from this process because we hold 290 - * the mmap_sem for the whole mremap(). fork() ordering dependency 291 - * also shouldn't be affected because fork() only cares that the 292 - * parent vmas are placed in the list before the child vmas and 293 - * anon_vma_moveto_tail() won't reorder vmas from either the fork() 294 - * parent or child. 295 - */ 296 - void anon_vma_moveto_tail(struct vm_area_struct *dst) 297 - { 298 - struct anon_vma_chain *pavc; 299 - struct anon_vma *root = NULL; 300 - 301 - list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) { 302 - struct anon_vma *anon_vma = pavc->anon_vma; 303 - VM_BUG_ON(pavc->vma != dst); 304 - root = lock_anon_vma_root(root, anon_vma); 305 - list_del(&pavc->same_anon_vma); 306 - list_add_tail(&pavc->same_anon_vma, &anon_vma->head); 307 - } 308 - unlock_anon_vma_root(root); 309 - } 310 - 311 - /* 312 272 * Attach vma to its own anon_vma, as well as to the anon_vmas that 313 273 * the corresponding VMA in the parent process is attached to. 314 274 * Returns 0 on success, non-zero on failure. ··· 331 381 struct anon_vma *anon_vma = avc->anon_vma; 332 382 333 383 root = lock_anon_vma_root(root, anon_vma); 334 - list_del(&avc->same_anon_vma); 384 + anon_vma_interval_tree_remove(avc, &anon_vma->rb_root); 335 385 336 386 /* 337 387 * Leave empty anon_vmas on the list - we'll need 338 388 * to free them outside the lock. 339 389 */ 340 - if (list_empty(&anon_vma->head)) 390 + if (RB_EMPTY_ROOT(&anon_vma->rb_root)) 341 391 continue; 342 392 343 393 list_del(&avc->same_vma); ··· 366 416 367 417 mutex_init(&anon_vma->mutex); 368 418 atomic_set(&anon_vma->refcount, 0); 369 - INIT_LIST_HEAD(&anon_vma->head); 419 + anon_vma->rb_root = RB_ROOT; 370 420 } 371 421 372 422 void __init anon_vma_init(void) ··· 510 560 511 561 /* 512 562 * At what user virtual address is page expected in @vma? 
513 - * Returns virtual address or -EFAULT if page's index/offset is not 514 - * within the range mapped the @vma. 515 563 */ 516 - inline unsigned long 517 - vma_address(struct page *page, struct vm_area_struct *vma) 564 + static inline unsigned long 565 + __vma_address(struct page *page, struct vm_area_struct *vma) 518 566 { 519 567 pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 520 - unsigned long address; 521 568 522 569 if (unlikely(is_vm_hugetlb_page(vma))) 523 570 pgoff = page->index << huge_page_order(page_hstate(page)); 524 - address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); 525 - if (unlikely(address < vma->vm_start || address >= vma->vm_end)) { 526 - /* page should be within @vma mapping range */ 527 - return -EFAULT; 528 - } 571 + 572 + return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); 573 + } 574 + 575 + inline unsigned long 576 + vma_address(struct page *page, struct vm_area_struct *vma) 577 + { 578 + unsigned long address = __vma_address(page, vma); 579 + 580 + /* page should be within @vma mapping range */ 581 + VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); 582 + 529 583 return address; 530 584 } 531 585 ··· 539 585 */ 540 586 unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) 541 587 { 588 + unsigned long address; 542 589 if (PageAnon(page)) { 543 590 struct anon_vma *page__anon_vma = page_anon_vma(page); 544 591 /* ··· 555 600 return -EFAULT; 556 601 } else 557 602 return -EFAULT; 558 - return vma_address(page, vma); 603 + address = __vma_address(page, vma); 604 + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) 605 + return -EFAULT; 606 + return address; 559 607 } 560 608 561 609 /* ··· 632 674 pte_t *pte; 633 675 spinlock_t *ptl; 634 676 635 - address = vma_address(page, vma); 636 - if (address == -EFAULT) /* out of vma range */ 677 + address = __vma_address(page, vma); 678 + if (unlikely(address < vma->vm_start || address >= 
vma->vm_end)) 637 679 return 0; 638 680 pte = page_check_address(page, vma->vm_mm, address, &ptl, 1); 639 681 if (!pte) /* the page is not in this mm */ ··· 727 769 { 728 770 unsigned int mapcount; 729 771 struct anon_vma *anon_vma; 772 + pgoff_t pgoff; 730 773 struct anon_vma_chain *avc; 731 774 int referenced = 0; 732 775 ··· 736 777 return referenced; 737 778 738 779 mapcount = page_mapcount(page); 739 - list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { 780 + pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 781 + anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) { 740 782 struct vm_area_struct *vma = avc->vma; 741 783 unsigned long address = vma_address(page, vma); 742 - if (address == -EFAULT) 743 - continue; 744 784 /* 745 785 * If we are reclaiming on behalf of a cgroup, skip 746 786 * counting on behalf of references from different ··· 778 820 struct address_space *mapping = page->mapping; 779 821 pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 780 822 struct vm_area_struct *vma; 781 - struct prio_tree_iter iter; 782 823 int referenced = 0; 783 824 784 825 /* ··· 803 846 */ 804 847 mapcount = page_mapcount(page); 805 848 806 - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { 849 + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { 807 850 unsigned long address = vma_address(page, vma); 808 - if (address == -EFAULT) 809 - continue; 810 851 /* 811 852 * If we are reclaiming on behalf of a cgroup, skip 812 853 * counting on behalf of references from different ··· 884 929 pte_t entry; 885 930 886 931 flush_cache_page(vma, address, pte_pfn(*pte)); 887 - entry = ptep_clear_flush_notify(vma, address, pte); 932 + entry = ptep_clear_flush(vma, address, pte); 888 933 entry = pte_wrprotect(entry); 889 934 entry = pte_mkclean(entry); 890 935 set_pte_at(mm, address, pte, entry); ··· 892 937 } 893 938 894 939 pte_unmap_unlock(pte, ptl); 940 + 941 + if (ret) 942 + 
mmu_notifier_invalidate_page(mm, address); 895 943 out: 896 944 return ret; 897 945 } ··· 903 945 { 904 946 pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 905 947 struct vm_area_struct *vma; 906 - struct prio_tree_iter iter; 907 948 int ret = 0; 908 949 909 950 BUG_ON(PageAnon(page)); 910 951 911 952 mutex_lock(&mapping->i_mmap_mutex); 912 - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { 953 + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { 913 954 if (vma->vm_flags & VM_SHARED) { 914 955 unsigned long address = vma_address(page, vma); 915 - if (address == -EFAULT) 916 - continue; 917 956 ret += page_mkclean_one(page, vma, address); 918 957 } 919 958 } ··· 1083 1128 else 1084 1129 __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES); 1085 1130 __page_set_anon_rmap(page, vma, address, 1); 1086 - if (page_evictable(page, vma)) 1131 + if (!mlocked_vma_newpage(vma, page)) 1087 1132 lru_cache_add_lru(page, LRU_ACTIVE_ANON); 1088 1133 else 1089 1134 add_page_to_unevictable_list(page); ··· 1158 1203 } else { 1159 1204 __dec_zone_page_state(page, NR_FILE_MAPPED); 1160 1205 mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED); 1206 + mem_cgroup_end_update_page_stat(page, &locked, &flags); 1161 1207 } 1208 + if (unlikely(PageMlocked(page))) 1209 + clear_page_mlock(page); 1162 1210 /* 1163 1211 * It would be tidy to reset the PageAnon mapping here, 1164 1212 * but that might overwrite a racing page_add_anon_rmap ··· 1171 1213 * Leaving it set also helps swapoff to reinstate ptes 1172 1214 * faster for those pages still in swapcache. 1173 1215 */ 1216 + return; 1174 1217 out: 1175 1218 if (!anon) 1176 1219 mem_cgroup_end_update_page_stat(page, &locked, &flags); ··· 1215 1256 1216 1257 /* Nuke the page table entry. 
*/ 1217 1258 flush_cache_page(vma, address, page_to_pfn(page)); 1218 - pteval = ptep_clear_flush_notify(vma, address, pte); 1259 + pteval = ptep_clear_flush(vma, address, pte); 1219 1260 1220 1261 /* Move the dirty bit to the physical page now the pte is gone. */ 1221 1262 if (pte_dirty(pteval)) ··· 1277 1318 1278 1319 out_unmap: 1279 1320 pte_unmap_unlock(pte, ptl); 1321 + if (ret != SWAP_FAIL) 1322 + mmu_notifier_invalidate_page(mm, address); 1280 1323 out: 1281 1324 return ret; 1282 1325 ··· 1343 1382 spinlock_t *ptl; 1344 1383 struct page *page; 1345 1384 unsigned long address; 1385 + unsigned long mmun_start; /* For mmu_notifiers */ 1386 + unsigned long mmun_end; /* For mmu_notifiers */ 1346 1387 unsigned long end; 1347 1388 int ret = SWAP_AGAIN; 1348 1389 int locked_vma = 0; ··· 1367 1404 pmd = pmd_offset(pud, address); 1368 1405 if (!pmd_present(*pmd)) 1369 1406 return ret; 1407 + 1408 + mmun_start = address; 1409 + mmun_end = end; 1410 + mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 1370 1411 1371 1412 /* 1372 1413 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED, ··· 1405 1438 1406 1439 /* Nuke the page table entry. */ 1407 1440 flush_cache_page(vma, address, pte_pfn(*pte)); 1408 - pteval = ptep_clear_flush_notify(vma, address, pte); 1441 + pteval = ptep_clear_flush(vma, address, pte); 1409 1442 1410 1443 /* If nonlinear, store the file page offset in the pte. 
*/ 1411 1444 if (page->index != linear_page_index(vma, address)) ··· 1421 1454 (*mapcount)--; 1422 1455 } 1423 1456 pte_unmap_unlock(pte - 1, ptl); 1457 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 1424 1458 if (locked_vma) 1425 1459 up_read(&vma->vm_mm->mmap_sem); 1426 1460 return ret; ··· 1460 1492 static int try_to_unmap_anon(struct page *page, enum ttu_flags flags) 1461 1493 { 1462 1494 struct anon_vma *anon_vma; 1495 + pgoff_t pgoff; 1463 1496 struct anon_vma_chain *avc; 1464 1497 int ret = SWAP_AGAIN; 1465 1498 ··· 1468 1499 if (!anon_vma) 1469 1500 return ret; 1470 1501 1471 - list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { 1502 + pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 1503 + anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) { 1472 1504 struct vm_area_struct *vma = avc->vma; 1473 1505 unsigned long address; 1474 1506 ··· 1486 1516 continue; 1487 1517 1488 1518 address = vma_address(page, vma); 1489 - if (address == -EFAULT) 1490 - continue; 1491 1519 ret = try_to_unmap_one(page, vma, address, flags); 1492 1520 if (ret != SWAP_AGAIN || !page_mapped(page)) 1493 1521 break; ··· 1515 1547 struct address_space *mapping = page->mapping; 1516 1548 pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 1517 1549 struct vm_area_struct *vma; 1518 - struct prio_tree_iter iter; 1519 1550 int ret = SWAP_AGAIN; 1520 1551 unsigned long cursor; 1521 1552 unsigned long max_nl_cursor = 0; ··· 1522 1555 unsigned int mapcount; 1523 1556 1524 1557 mutex_lock(&mapping->i_mmap_mutex); 1525 - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { 1558 + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { 1526 1559 unsigned long address = vma_address(page, vma); 1527 - if (address == -EFAULT) 1528 - continue; 1529 1560 ret = try_to_unmap_one(page, vma, address, flags); 1530 1561 if (ret != SWAP_AGAIN || !page_mapped(page)) 1531 1562 goto out; ··· 1541 1576 goto out; 1542 1577 1543 
1578 list_for_each_entry(vma, &mapping->i_mmap_nonlinear, 1544 - shared.vm_set.list) { 1579 + shared.nonlinear) { 1545 1580 cursor = (unsigned long) vma->vm_private_data; 1546 1581 if (cursor > max_nl_cursor) 1547 1582 max_nl_cursor = cursor; ··· 1573 1608 1574 1609 do { 1575 1610 list_for_each_entry(vma, &mapping->i_mmap_nonlinear, 1576 - shared.vm_set.list) { 1611 + shared.nonlinear) { 1577 1612 cursor = (unsigned long) vma->vm_private_data; 1578 1613 while ( cursor < max_nl_cursor && 1579 1614 cursor < vma->vm_end - vma->vm_start) { ··· 1596 1631 * in locked vmas). Reset cursor on all unreserved nonlinear 1597 1632 * vmas, now forgetting on which ones it had fallen behind. 1598 1633 */ 1599 - list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) 1634 + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.nonlinear) 1600 1635 vma->vm_private_data = NULL; 1601 1636 out: 1602 1637 mutex_unlock(&mapping->i_mmap_mutex); ··· 1681 1716 struct vm_area_struct *, unsigned long, void *), void *arg) 1682 1717 { 1683 1718 struct anon_vma *anon_vma; 1719 + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 1684 1720 struct anon_vma_chain *avc; 1685 1721 int ret = SWAP_AGAIN; 1686 1722 ··· 1695 1729 if (!anon_vma) 1696 1730 return ret; 1697 1731 anon_vma_lock(anon_vma); 1698 - list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { 1732 + anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) { 1699 1733 struct vm_area_struct *vma = avc->vma; 1700 1734 unsigned long address = vma_address(page, vma); 1701 - if (address == -EFAULT) 1702 - continue; 1703 1735 ret = rmap_one(page, vma, address, arg); 1704 1736 if (ret != SWAP_AGAIN) 1705 1737 break; ··· 1712 1748 struct address_space *mapping = page->mapping; 1713 1749 pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); 1714 1750 struct vm_area_struct *vma; 1715 - struct prio_tree_iter iter; 1716 1751 int ret = SWAP_AGAIN; 1717 1752 1718 1753 if (!mapping) 1719 
1754 return ret; 1720 1755 mutex_lock(&mapping->i_mmap_mutex); 1721 - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { 1756 + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { 1722 1757 unsigned long address = vma_address(page, vma); 1723 - if (address == -EFAULT) 1724 - continue; 1725 1758 ret = rmap_one(page, vma, address, arg); 1726 1759 if (ret != SWAP_AGAIN) 1727 1760 break;
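The rmap conversion splits out __vma_address(), which does the raw pgoff-to-user-address arithmetic with no range check; vma_address() now VM_BUG_ONs out-of-range results instead of returning -EFAULT, since interval-tree lookups only produce overlapping vmas. A sketch of that arithmetic, assuming 4K pages (names are illustrative):

```c
#include <assert.h>

#define SK_PAGE_SHIFT 12	/* 4K pages, for illustration */

/* Sketch of __vma_address(): given a vma mapping file offsets starting
 * at vm_pgoff from user address vm_start, return the virtual address
 * where file page 'pgoff' lands. No bounds check here, exactly like
 * the new __vma_address(). */
static unsigned long sk_vma_address(unsigned long vm_start,
				    unsigned long vm_pgoff,
				    unsigned long pgoff)
{
	return vm_start + ((pgoff - vm_pgoff) << SK_PAGE_SHIFT);
}
```

page_address_in_vma() keeps an explicit range check on the raw result because its callers can legitimately pass a vma that does not map the page.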
+1 -2
mm/shmem.c
··· 1339 1339 { 1340 1340 file_accessed(file); 1341 1341 vma->vm_ops = &shmem_vm_ops; 1342 - vma->vm_flags |= VM_CAN_NONLINEAR; 1343 1342 return 0; 1344 1343 } 1345 1344 ··· 2642 2643 .set_policy = shmem_set_policy, 2643 2644 .get_policy = shmem_get_policy, 2644 2645 #endif 2646 + .remap_pages = generic_file_remap_pages, 2645 2647 }; 2646 2648 2647 2649 static struct dentry *shmem_mount(struct file_system_type *fs_type, ··· 2836 2836 fput(vma->vm_file); 2837 2837 vma->vm_file = file; 2838 2838 vma->vm_ops = &shmem_vm_ops; 2839 - vma->vm_flags |= VM_CAN_NONLINEAR; 2840 2839 return 0; 2841 2840 } 2842 2841
+11 -2
mm/swap.c
··· 446 446 } 447 447 EXPORT_SYMBOL(mark_page_accessed); 448 448 449 + /* 450 + * Order of operations is important: flush the pagevec when it's already 451 + * full, not when adding the last page, to make sure that last page is 452 + * not added to the LRU directly when passed to this function. Because 453 + * mark_page_accessed() (called after this when writing) only activates 454 + * pages that are on the LRU, linear writes in subpage chunks would see 455 + * every PAGEVEC_SIZE page activated, which is unexpected. 456 + */ 449 457 void __lru_cache_add(struct page *page, enum lru_list lru) 450 458 { 451 459 struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru]; 452 460 453 461 page_cache_get(page); 454 - if (!pagevec_add(pvec, page)) 462 + if (!pagevec_space(pvec)) 455 463 __pagevec_lru_add(pvec, lru); 464 + pagevec_add(pvec, page); 456 465 put_cpu_var(lru_add_pvecs); 457 466 } 458 467 EXPORT_SYMBOL(__lru_cache_add); ··· 751 742 752 743 SetPageLRU(page_tail); 753 744 754 - if (page_evictable(page_tail, NULL)) { 745 + if (page_evictable(page_tail)) { 755 746 if (PageActive(page)) { 756 747 SetPageActive(page_tail); 757 748 active = 1;
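The reordered __lru_cache_add() drains the per-cpu pagevec when it is *already* full, before adding, so the page just added always remains in the vector rather than going straight to the LRU. A userspace sketch of that flush-before-add ordering (SK_PVEC_SIZE and the `sk_` names are illustrative):

```c
#include <assert.h>

#define SK_PVEC_SIZE 4	/* stand-in for PAGEVEC_SIZE */

struct sk_pvec {
	int nr;
	int pages[SK_PVEC_SIZE];
};

static int sk_drained;	/* pages flushed to the "LRU" so far */

static void sk_drain(struct sk_pvec *pv)
{
	sk_drained += pv->nr;
	pv->nr = 0;
}

/* Sketch of the new __lru_cache_add() ordering: flush first if full,
 * then add, so the page passed in is never immediately on the "LRU". */
static void sk_lru_cache_add(struct sk_pvec *pv, int page)
{
	if (pv->nr == SK_PVEC_SIZE)
		sk_drain(pv);
	pv->pages[pv->nr++] = page;
}
```

With the old add-then-flush ordering, every SK_PVEC_SIZE-th page would land on the LRU at the moment it was added, which is what made mark_page_accessed() spuriously activate it during subpage-sized linear writes.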
-3
mm/truncate.c
··· 107 107 108 108 cancel_dirty_page(page, PAGE_CACHE_SIZE); 109 109 110 - clear_page_mlock(page); 111 110 ClearPageMappedToDisk(page); 112 111 delete_from_page_cache(page); 113 112 return 0; ··· 131 132 if (page_has_private(page) && !try_to_release_page(page, 0)) 132 133 return 0; 133 134 134 - clear_page_mlock(page); 135 135 ret = remove_mapping(mapping, page); 136 136 137 137 return ret; ··· 396 398 if (PageDirty(page)) 397 399 goto failed; 398 400 399 - clear_page_mlock(page); 400 401 BUG_ON(page_has_private(page)); 401 402 __delete_from_page_cache(page); 402 403 spin_unlock_irq(&mapping->tree_lock);
+2 -3
mm/vmalloc.c
··· 2163 2163 usize -= PAGE_SIZE; 2164 2164 } while (usize > 0); 2165 2165 2166 - /* Prevent "things" like memory migration? VM_flags need a cleanup... */ 2167 - vma->vm_flags |= VM_RESERVED; 2166 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 2168 2167 2169 2168 return 0; 2170 2169 } ··· 2571 2572 { 2572 2573 struct vm_struct *v = p; 2573 2574 2574 - seq_printf(m, "0x%p-0x%p %7ld", 2575 + seq_printf(m, "0x%pK-0x%pK %7ld", 2575 2576 v->addr, v->addr + v->size, v->size); 2576 2577 2577 2578 if (v->caller)
+83 -28
mm/vmscan.c
··· 553 553 redo: 554 554 ClearPageUnevictable(page); 555 555 556 - if (page_evictable(page, NULL)) { 556 + if (page_evictable(page)) { 557 557 /* 558 558 * For evictable pages, we can use the cache. 559 559 * In event of a race, worst case is we end up with an ··· 587 587 * page is on unevictable list, it never be freed. To avoid that, 588 588 * check after we added it to the list, again. 589 589 */ 590 - if (lru == LRU_UNEVICTABLE && page_evictable(page, NULL)) { 590 + if (lru == LRU_UNEVICTABLE && page_evictable(page)) { 591 591 if (!isolate_lru_page(page)) { 592 592 put_page(page); 593 593 goto redo; ··· 674 674 static unsigned long shrink_page_list(struct list_head *page_list, 675 675 struct zone *zone, 676 676 struct scan_control *sc, 677 + enum ttu_flags ttu_flags, 677 678 unsigned long *ret_nr_dirty, 678 - unsigned long *ret_nr_writeback) 679 + unsigned long *ret_nr_writeback, 680 + bool force_reclaim) 679 681 { 680 682 LIST_HEAD(ret_pages); 681 683 LIST_HEAD(free_pages); ··· 691 689 692 690 mem_cgroup_uncharge_start(); 693 691 while (!list_empty(page_list)) { 694 - enum page_references references; 695 692 struct address_space *mapping; 696 693 struct page *page; 697 694 int may_enter_fs; 695 + enum page_references references = PAGEREF_RECLAIM_CLEAN; 698 696 699 697 cond_resched(); 700 698 ··· 709 707 710 708 sc->nr_scanned++; 711 709 712 - if (unlikely(!page_evictable(page, NULL))) 710 + if (unlikely(!page_evictable(page))) 713 711 goto cull_mlocked; 714 712 715 713 if (!sc->may_unmap && page_mapped(page)) ··· 760 758 wait_on_page_writeback(page); 761 759 } 762 760 763 - references = page_check_references(page, sc); 761 + if (!force_reclaim) 762 + references = page_check_references(page, sc); 763 + 764 764 switch (references) { 765 765 case PAGEREF_ACTIVATE: 766 766 goto activate_locked; ··· 792 788 * processes. Try to unmap it here. 
793 789 */ 794 790 if (page_mapped(page) && mapping) { 795 - switch (try_to_unmap(page, TTU_UNMAP)) { 791 + switch (try_to_unmap(page, ttu_flags)) { 796 792 case SWAP_FAIL: 797 793 goto activate_locked; 798 794 case SWAP_AGAIN: ··· 964 960 return nr_reclaimed; 965 961 } 966 962 963 + unsigned long reclaim_clean_pages_from_list(struct zone *zone, 964 + struct list_head *page_list) 965 + { 966 + struct scan_control sc = { 967 + .gfp_mask = GFP_KERNEL, 968 + .priority = DEF_PRIORITY, 969 + .may_unmap = 1, 970 + }; 971 + unsigned long ret, dummy1, dummy2; 972 + struct page *page, *next; 973 + LIST_HEAD(clean_pages); 974 + 975 + list_for_each_entry_safe(page, next, page_list, lru) { 976 + if (page_is_file_cache(page) && !PageDirty(page)) { 977 + ClearPageActive(page); 978 + list_move(&page->lru, &clean_pages); 979 + } 980 + } 981 + 982 + ret = shrink_page_list(&clean_pages, zone, &sc, 983 + TTU_UNMAP|TTU_IGNORE_ACCESS, 984 + &dummy1, &dummy2, true); 985 + list_splice(&clean_pages, page_list); 986 + __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret); 987 + return ret; 988 + } 989 + 967 990 /* 968 991 * Attempt to remove the specified page from its LRU. Only take this page 969 992 * if it is of the appropriate PageActive status. 
Pages which are being ··· 1009 978 if (!PageLRU(page)) 1010 979 return ret; 1011 980 1012 - /* Do not give back unevictable pages for compaction */ 1013 - if (PageUnevictable(page)) 981 + /* Compaction should not handle unevictable pages but CMA can do so */ 982 + if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE)) 1014 983 return ret; 1015 984 1016 985 ret = -EBUSY; ··· 1217 1186 1218 1187 VM_BUG_ON(PageLRU(page)); 1219 1188 list_del(&page->lru); 1220 - if (unlikely(!page_evictable(page, NULL))) { 1189 + if (unlikely(!page_evictable(page))) { 1221 1190 spin_unlock_irq(&zone->lru_lock); 1222 1191 putback_lru_page(page); 1223 1192 spin_lock_irq(&zone->lru_lock); ··· 1309 1278 if (nr_taken == 0) 1310 1279 return 0; 1311 1280 1312 - nr_reclaimed = shrink_page_list(&page_list, zone, sc, 1313 - &nr_dirty, &nr_writeback); 1281 + nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP, 1282 + &nr_dirty, &nr_writeback, false); 1314 1283 1315 1284 spin_lock_irq(&zone->lru_lock); 1316 1285 ··· 1470 1439 page = lru_to_page(&l_hold); 1471 1440 list_del(&page->lru); 1472 1441 1473 - if (unlikely(!page_evictable(page, NULL))) { 1442 + if (unlikely(!page_evictable(page))) { 1474 1443 putback_lru_page(page); 1475 1444 continue; 1476 1445 } ··· 1760 1729 return false; 1761 1730 } 1762 1731 1732 + #ifdef CONFIG_COMPACTION 1733 + /* 1734 + * If compaction is deferred for sc->order then scale the number of pages 1735 + * reclaimed based on the number of consecutive allocation failures 1736 + */ 1737 + static unsigned long scale_for_compaction(unsigned long pages_for_compaction, 1738 + struct lruvec *lruvec, struct scan_control *sc) 1739 + { 1740 + struct zone *zone = lruvec_zone(lruvec); 1741 + 1742 + if (zone->compact_order_failed <= sc->order) 1743 + pages_for_compaction <<= zone->compact_defer_shift; 1744 + return pages_for_compaction; 1745 + } 1746 + #else 1747 + static unsigned long scale_for_compaction(unsigned long pages_for_compaction, 1748 + struct lruvec 
*lruvec, struct scan_control *sc) 1749 + { 1750 + return pages_for_compaction; 1751 + } 1752 + #endif 1753 + 1763 1754 /* 1764 1755 * Reclaim/compaction is used for high-order allocation requests. It reclaims 1765 1756 * order-0 pages before compacting the zone. should_continue_reclaim() returns ··· 1829 1776 * inactive lists are large enough, continue reclaiming 1830 1777 */ 1831 1778 pages_for_compaction = (2UL << sc->order); 1779 + 1780 + pages_for_compaction = scale_for_compaction(pages_for_compaction, 1781 + lruvec, sc); 1832 1782 inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE); 1833 1783 if (nr_swap_pages > 0) 1834 1784 inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON); ··· 2895 2839 */ 2896 2840 set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold); 2897 2841 2842 + /* 2843 + * Compaction records what page blocks it recently failed to 2844 + * isolate pages from and skips them in the future scanning. 2845 + * When kswapd is going to sleep, it is reasonable to assume 2846 + * that pages and compaction may succeed so reset the cache. 2847 + */ 2848 + reset_isolation_suitable(pgdat); 2849 + 2898 2850 if (!kthread_should_stop()) 2899 2851 schedule(); 2900 2852 ··· 3165 3101 if (IS_ERR(pgdat->kswapd)) { 3166 3102 /* failure at boot is fatal */ 3167 3103 BUG_ON(system_state == SYSTEM_BOOTING); 3168 - printk("Failed to start kswapd on node %d\n",nid); 3169 3104 pgdat->kswapd = NULL; 3170 - ret = -1; 3105 + pr_err("Failed to start kswapd on node %d\n", nid); 3106 + ret = PTR_ERR(pgdat->kswapd); 3171 3107 } 3172 3108 return ret; 3173 3109 } ··· 3414 3350 /* 3415 3351 * page_evictable - test whether a page is evictable 3416 3352 * @page: the page to test 3417 - * @vma: the VMA in which the page is or will be mapped, may be NULL 3418 3353 * 3419 3354 * Test whether page is evictable--i.e., should be placed on active/inactive 3420 - * lists vs unevictable list. 
The vma argument is !NULL when called from the 3421 - * fault path to determine how to instantate a new page. 3355 + * lists vs unevictable list. 3422 3356 * 3423 3357 * Reasons page might not be evictable: 3424 3358 * (1) page's mapping marked unevictable 3425 3359 * (2) page is part of an mlocked VMA 3426 3360 * 3427 3361 */ 3428 - int page_evictable(struct page *page, struct vm_area_struct *vma) 3362 + int page_evictable(struct page *page) 3429 3363 { 3430 - 3431 - if (mapping_unevictable(page_mapping(page))) 3432 - return 0; 3433 - 3434 - if (PageMlocked(page) || (vma && mlocked_vma_newpage(vma, page))) 3435 - return 0; 3436 - 3437 - return 1; 3364 + return !mapping_unevictable(page_mapping(page)) && !PageMlocked(page); 3438 3365 } 3439 3366 3440 3367 #ifdef CONFIG_SHMEM ··· 3463 3408 if (!PageLRU(page) || !PageUnevictable(page)) 3464 3409 continue; 3465 3410 3466 - if (page_evictable(page, NULL)) { 3411 + if (page_evictable(page)) { 3467 3412 enum lru_list lru = page_lru_base_type(page); 3468 3413 3469 3414 VM_BUG_ON(PageActive(page));
+13 -1
mm/vmstat.c
··· 495 495 atomic_long_add(global_diff[i], &vm_stat[i]); 496 496 } 497 497 498 + void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) 499 + { 500 + int i; 501 + 502 + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) 503 + if (pset->vm_stat_diff[i]) { 504 + int v = pset->vm_stat_diff[i]; 505 + pset->vm_stat_diff[i] = 0; 506 + atomic_long_add(v, &zone->vm_stat[i]); 507 + atomic_long_add(v, &vm_stat[i]); 508 + } 509 + } 498 510 #endif 499 511 500 512 #ifdef CONFIG_NUMA ··· 734 722 "numa_other", 735 723 #endif 736 724 "nr_anon_transparent_hugepages", 725 + "nr_free_cma", 737 726 "nr_dirty_threshold", 738 727 "nr_dirty_background_threshold", 739 728 ··· 794 781 "unevictable_pgs_munlocked", 795 782 "unevictable_pgs_cleared", 796 783 "unevictable_pgs_stranded", 797 - "unevictable_pgs_mlockfreed", 798 784 799 785 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 800 786 "thp_fault_alloc",
-1
net/ceph/osd_client.c
··· 221 221 kref_init(&req->r_kref); 222 222 init_completion(&req->r_completion); 223 223 init_completion(&req->r_safe_completion); 224 - rb_init_node(&req->r_node); 225 224 INIT_LIST_HEAD(&req->r_unsafe_item); 226 225 INIT_LIST_HEAD(&req->r_linger_item); 227 226 INIT_LIST_HEAD(&req->r_linger_osd);
+1 -1
security/selinux/selinuxfs.c
··· 485 485 return -EACCES; 486 486 } 487 487 488 - vma->vm_flags |= VM_RESERVED; 488 + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 489 489 vma->vm_ops = &sel_mmap_policy_ops; 490 490 491 491 return 0;
+2 -7
security/tomoyo/util.c
··· 949 949 const char *tomoyo_get_exe(void) 950 950 { 951 951 struct mm_struct *mm = current->mm; 952 - struct vm_area_struct *vma; 953 952 const char *cp = NULL; 954 953 955 954 if (!mm) 956 955 return NULL; 957 956 down_read(&mm->mmap_sem); 958 - for (vma = mm->mmap; vma; vma = vma->vm_next) { 959 - if ((vma->vm_flags & VM_EXECUTABLE) && vma->vm_file) { 960 - cp = tomoyo_realpath_from_path(&vma->vm_file->f_path); 961 - break; 962 - } 963 - } 957 + if (mm->exe_file) 958 + cp = tomoyo_realpath_from_path(&mm->exe_file->f_path); 964 959 up_read(&mm->mmap_sem); 965 960 return cp; 966 961 }
+3 -3
sound/core/pcm_native.c
··· 3039 3039 return -EINVAL; 3040 3040 area->vm_ops = &snd_pcm_vm_ops_status; 3041 3041 area->vm_private_data = substream; 3042 - area->vm_flags |= VM_RESERVED; 3042 + area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 3043 3043 return 0; 3044 3044 } 3045 3045 ··· 3076 3076 return -EINVAL; 3077 3077 area->vm_ops = &snd_pcm_vm_ops_control; 3078 3078 area->vm_private_data = substream; 3079 - area->vm_flags |= VM_RESERVED; 3079 + area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 3080 3080 return 0; 3081 3081 } 3082 3082 #else /* ! coherent mmap */ ··· 3170 3170 int snd_pcm_lib_default_mmap(struct snd_pcm_substream *substream, 3171 3171 struct vm_area_struct *area) 3172 3172 { 3173 - area->vm_flags |= VM_RESERVED; 3173 + area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 3174 3174 #ifdef ARCH_HAS_DMA_MMAP_COHERENT 3175 3175 if (!substream->ops->page && 3176 3176 substream->dma_buffer.dev.type == SNDRV_DMA_TYPE_DEV)
+1 -1
sound/usb/usx2y/us122l.c
··· 262 262 } 263 263 264 264 area->vm_ops = &usb_stream_hwdep_vm_ops; 265 - area->vm_flags |= VM_RESERVED; 265 + area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 266 266 area->vm_private_data = us122l; 267 267 atomic_inc(&us122l->mmap_count); 268 268 out:
+1 -1
sound/usb/usx2y/usX2Yhwdep.c
··· 82 82 us428->us428ctls_sharedmem->CtlSnapShotLast = -2; 83 83 } 84 84 area->vm_ops = &us428ctls_vm_ops; 85 - area->vm_flags |= VM_RESERVED | VM_DONTEXPAND; 85 + area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 86 86 area->vm_private_data = hw->private_data; 87 87 return 0; 88 88 }
+1 -1
sound/usb/usx2y/usx2yhwdeppcm.c
··· 723 723 return -ENODEV; 724 724 } 725 725 area->vm_ops = &snd_usX2Y_hwdep_pcm_vm_ops; 726 - area->vm_flags |= VM_RESERVED | VM_DONTEXPAND; 726 + area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; 727 727 area->vm_private_data = hw->private_data; 728 728 return 0; 729 729 }
+1
tools/perf/util/include/linux/rbtree.h
··· 1 + #include <stdbool.h> 1 2 #include "../../../../include/linux/rbtree.h"