Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: keep page cache radix tree nodes in check

Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after a significant amount of their cache has been reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.

3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.

[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Johannes Weiner and committed by Linus Torvalds
449dd698 139e5616

+363 -47
+7 -1
include/linux/list_lru.h
···
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
 	LRU_REMOVED,		/* item removed from list */
+	LRU_REMOVED_RETRY,	/* item removed, but lock has been
+				   dropped and reacquired */
 	LRU_ROTATE,		/* item referenced, give another pass */
 	LRU_SKIP,		/* item cannot be locked, skip */
 	LRU_RETRY,		/* item not freeable. May drop the lock
···
 };

 void list_lru_destroy(struct list_lru *lru);
-int list_lru_init(struct list_lru *lru);
+int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key);
+static inline int list_lru_init(struct list_lru *lru)
+{
+	return list_lru_init_key(lru, NULL);
+}

 /**
  * list_lru_add: add an element to the lru list's tail
+1
include/linux/mmzone.h
···
 #endif
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_NODERECLAIM,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
+28 -12
include/linux/radix-tree.h
···
 #define RADIX_TREE_TAG_LONGS	\
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)

-struct radix_tree_node {
-	unsigned int	height;		/* Height from the bottom */
-	unsigned int	count;
-	union {
-		struct radix_tree_node *parent;	/* Used when ascending tree */
-		struct rcu_head	rcu_head;	/* Used when freeing node */
-	};
-	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
-	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
-};
-
 #define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
 #define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
 					  RADIX_TREE_MAP_SHIFT))
+
+/* Height component in node->path */
+#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
+#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
+
+/* Internally used bits of node->count */
+#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
+#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
+
+struct radix_tree_node {
+	unsigned int	path;	/* Offset in parent & height from the bottom */
+	unsigned int	count;
+	union {
+		struct {
+			/* Used when ascending tree */
+			struct radix_tree_node *parent;
+			/* For tree user */
+			void *private_data;
+		};
+		/* Used when freeing node */
+		struct rcu_head	rcu_head;
+	};
+	/* For tree user */
+	struct list_head private_list;
+	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
+	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+};

 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
···
 			  struct radix_tree_node **nodep, void ***slotp);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
 			      struct radix_tree_node *node);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+31
include/linux/swap.h
···
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+extern struct list_lru workingset_shadow_nodes;
+
+static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
+{
+	return node->count & RADIX_TREE_COUNT_MASK;
+}
+
+static inline void workingset_node_pages_inc(struct radix_tree_node *node)
+{
+	node->count++;
+}
+
+static inline void workingset_node_pages_dec(struct radix_tree_node *node)
+{
+	node->count--;
+}
+
+static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
+{
+	return node->count >> RADIX_TREE_COUNT_SHIFT;
+}
+
+static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
+{
+	node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+}
+
+static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
+{
+	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+}

 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
+22 -14
lib/radix-tree.c
···

 	/* Increase the height.  */
 	newheight = root->height+1;
-	node->height = newheight;
+	BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
+	node->path = newheight;
 	node->count = 1;
 	node->parent = NULL;
 	slot = root->rnode;
···
 		/* Have to add a child node.  */
 		if (!(slot = radix_tree_node_alloc(root)))
 			return -ENOMEM;
-		slot->height = height;
+		slot->path = height;
 		slot->parent = node;
 		if (node) {
 			rcu_assign_pointer(node->slots[offset], slot);
 			node->count++;
+			slot->path |= offset << RADIX_TREE_HEIGHT_SHIFT;
 		} else
 			rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
 	}
···
 	}
 	node = indirect_to_ptr(node);

-	height = node->height;
+	height = node->path & RADIX_TREE_HEIGHT_MASK;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
···
 		return (index == 0);
 	node = indirect_to_ptr(node);

-	height = node->height;
+	height = node->path & RADIX_TREE_HEIGHT_MASK;
 	if (index > radix_tree_maxindex(height))
 		return 0;
···
 {
 	unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK;
 	struct radix_tree_node *rnode, *node;
-	unsigned long index, offset;
+	unsigned long index, offset, height;

 	if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag))
 		return NULL;
···
 		return NULL;

 restart:
-	shift = (rnode->height - 1) * RADIX_TREE_MAP_SHIFT;
+	height = rnode->path & RADIX_TREE_HEIGHT_MASK;
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	offset = index >> shift;

 	/* Index outside of the tree */
···
 	unsigned int shift, height;
 	unsigned long i;

-	height = slot->height;
+	height = slot->path & RADIX_TREE_HEIGHT_MASK;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;

 	for ( ; height > 1; height--) {
···
 	}

 	node = indirect_to_ptr(node);
-	max_index = radix_tree_maxindex(node->height);
+	max_index = radix_tree_maxindex(node->path &
+					RADIX_TREE_HEIGHT_MASK);
 	if (cur_index > max_index) {
 		rcu_read_unlock();
 		break;
···
 *
 *	Returns %true if @node was freed, %false otherwise.
 */
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
 			      struct radix_tree_node *node)
 {
 	bool deleted = false;
···

 		parent = node->parent;
 		if (parent) {
-			index >>= RADIX_TREE_MAP_SHIFT;
+			unsigned int offset;

-			parent->slots[index & RADIX_TREE_MAP_MASK] = NULL;
+			offset = node->path >> RADIX_TREE_HEIGHT_SHIFT;
+			parent->slots[offset] = NULL;
 			parent->count--;
 		} else {
 			root_tag_clear_all(root);
···
 	node->slots[offset] = NULL;
 	node->count--;

-	__radix_tree_delete_node(root, index, node);
+	__radix_tree_delete_node(root, node);

 	return entry;
 }
···
 EXPORT_SYMBOL(radix_tree_tagged);

 static void
-radix_tree_node_ctor(void *node)
+radix_tree_node_ctor(void *arg)
 {
-	memset(node, 0, sizeof(struct radix_tree_node));
+	struct radix_tree_node *node = arg;
+
+	memset(node, 0, sizeof(*node));
+	INIT_LIST_HEAD(&node->private_list);
 }

 static __init unsigned long __maxindex(unsigned int height)
+74 -16
mm/filemap.c
···
 static void page_cache_tree_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
-	if (shadow) {
-		void **slot;
+	struct radix_tree_node *node;
+	unsigned long index;
+	unsigned int offset;
+	unsigned int tag;
+	void **slot;

-		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
-		radix_tree_replace_slot(slot, shadow);
+	VM_BUG_ON(!PageLocked(page));
+
+	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+	if (shadow) {
 		mapping->nrshadows++;
 		/*
 		 * Make sure the nrshadows update is committed before
···
 		 * same time and miss a shadow entry.
 		 */
 		smp_wmb();
-	} else
-		radix_tree_delete(&mapping->page_tree, page->index);
+	}
 	mapping->nrpages--;
+
+	if (!node) {
+		/* Clear direct pointer tags in root node */
+		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+		radix_tree_replace_slot(slot, shadow);
+		return;
+	}
+
+	/* Clear tree tags for the removed page */
+	index = page->index;
+	offset = index & RADIX_TREE_MAP_MASK;
+	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+		if (test_bit(offset, node->tags[tag]))
+			radix_tree_tag_clear(&mapping->page_tree, index, tag);
+	}
+
+	/* Delete page, swap shadow entry */
+	radix_tree_replace_slot(slot, shadow);
+	workingset_node_pages_dec(node);
+	if (shadow)
+		workingset_node_shadows_inc(node);
+	else
+		if (__radix_tree_delete_node(&mapping->page_tree, node))
+			return;
+
+	/*
+	 * Track node that only contains shadow entries.
+	 *
+	 * Avoid acquiring the list_lru lock if already tracked.  The
+	 * list_empty() test is safe as node->private_list is
+	 * protected by mapping->tree_lock.
+	 */
+	if (!workingset_node_pages(node) &&
+	    list_empty(&node->private_list)) {
+		node->private_data = mapping;
+		list_lru_add(&workingset_shadow_nodes, &node->private_list);
+	}
 }

 /*
···
 static int page_cache_tree_insert(struct address_space *mapping,
 				  struct page *page, void **shadowp)
 {
+	struct radix_tree_node *node;
 	void **slot;
 	int error;

-	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
-	if (slot) {
+	error = __radix_tree_create(&mapping->page_tree, page->index,
+				    &node, &slot);
+	if (error)
+		return error;
+	if (*slot) {
 		void *p;

 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
-		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
-		mapping->nrpages++;
 		if (shadowp)
 			*shadowp = p;
-		return 0;
+		mapping->nrshadows--;
+		if (node)
+			workingset_node_shadows_dec(node);
 	}
-	error = radix_tree_insert(&mapping->page_tree, page->index, page);
-	if (!error)
-		mapping->nrpages++;
-	return error;
+	radix_tree_replace_slot(slot, page);
+	mapping->nrpages++;
+	if (node) {
+		workingset_node_pages_inc(node);
+		/*
+		 * Don't track node that contains actual pages.
+		 *
+		 * Avoid acquiring the list_lru lock if already
+		 * untracked.  The list_empty() test is safe as
+		 * node->private_list is protected by
+		 * mapping->tree_lock.
+		 */
+		if (!list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+				     &node->private_list);
+	}
+	return 0;
 }

 static int __add_to_page_cache_locked(struct page *page,
+14 -2
mm/list_lru.c
···

 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
+		case LRU_REMOVED_RETRY:
+			assert_spin_locked(&nlru->lock);
 		case LRU_REMOVED:
 			if (--nlru->nr_items == 0)
 				node_clear(nid, lru->active_nodes);
 			WARN_ON_ONCE(nlru->nr_items < 0);
 			isolated++;
+			/*
+			 * If the lru lock has been dropped, our list
+			 * traversal is now invalid and so we have to
+			 * restart from scratch.
+			 */
+			if (ret == LRU_REMOVED_RETRY)
+				goto restart;
 			break;
 		case LRU_ROTATE:
 			list_move_tail(item, &nlru->list);
···
 			 * The lru lock has been dropped, our list traversal is
 			 * now invalid and so we have to restart from scratch.
 			 */
+			assert_spin_locked(&nlru->lock);
 			goto restart;
 		default:
 			BUG();
···
 }
 EXPORT_SYMBOL_GPL(list_lru_walk_node);

-int list_lru_init(struct list_lru *lru)
+int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
 {
 	int i;
 	size_t size = sizeof(*lru->node) * nr_node_ids;
···
 	nodes_clear(lru->active_nodes);
 	for (i = 0; i < nr_node_ids; i++) {
 		spin_lock_init(&lru->node[i].lock);
+		if (key)
+			lockdep_set_class(&lru->node[i].lock, key);
 		INIT_LIST_HEAD(&lru->node[i].list);
 		lru->node[i].nr_items = 0;
 	}
 	return 0;
 }
-EXPORT_SYMBOL_GPL(list_lru_init);
+EXPORT_SYMBOL_GPL(list_lru_init_key);

 void list_lru_destroy(struct list_lru *lru)
 {
+24 -2
mm/truncate.c
···
 static void clear_exceptional_entry(struct address_space *mapping,
 				    pgoff_t index, void *entry)
 {
+	struct radix_tree_node *node;
+	void **slot;
+
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
 		return;
···
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
-		mapping->nrshadows--;
+	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+	radix_tree_replace_slot(slot, NULL);
+	mapping->nrshadows--;
+	if (!node)
+		goto unlock;
+	workingset_node_shadows_dec(node);
+	/*
+	 * Don't track node without shadow entries.
+	 *
+	 * Avoid acquiring the list_lru lock if already untracked.
+	 * The list_empty() test is safe as node->private_list is
+	 * protected by mapping->tree_lock.
+	 */
+	if (!workingset_node_shadows(node) &&
+	    !list_empty(&node->private_list))
+		list_lru_del(&workingset_shadow_nodes, &node->private_list);
+	__radix_tree_delete_node(&mapping->page_tree, node);
+unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
+1
mm/vmstat.c
···
 #endif
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_nodereclaim",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
+161
mm/workingset.c
···
 {
 	atomic_long_inc(&page_zone(page)->inactive_age);
 }
+
+/*
+ * Shadow entries reflect the share of the working set that does not
+ * fit into memory, so their number depends on the access pattern of
+ * the workload.  In most cases, they will refault or get reclaimed
+ * along with the inode, but a (malicious) workload that streams
+ * through files with a total size several times that of available
+ * memory, while preventing the inodes from being reclaimed, can
+ * create excessive amounts of shadow nodes.  To keep a lid on this,
+ * track shadow nodes and reclaim them when they grow way past the
+ * point where they would still be useful.
+ */
+
+struct list_lru workingset_shadow_nodes;
+
+static unsigned long count_shadow_nodes(struct shrinker *shrinker,
+					struct shrink_control *sc)
+{
+	unsigned long shadow_nodes;
+	unsigned long max_nodes;
+	unsigned long pages;
+
+	/* list_lru lock nests inside IRQ-safe mapping->tree_lock */
+	local_irq_disable();
+	shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
+	local_irq_enable();
+
+	pages = node_present_pages(sc->nid);
+	/*
+	 * Active cache pages are limited to 50% of memory, and shadow
+	 * entries that represent a refault distance bigger than that
+	 * do not have any effect.  Limit the number of shadow nodes
+	 * such that shadow entries do not exceed the number of active
+	 * cache pages, assuming a worst-case node population density
+	 * of 1/8th on average.
+	 *
+	 * On 64-bit with 7 radix_tree_nodes per page and 64 slots
+	 * each, this will reclaim shadow entries when they consume
+	 * ~2% of available memory:
+	 *
+	 * PAGE_SIZE / radix_tree_nodes / node_entries / PAGE_SIZE
+	 */
+	max_nodes = pages >> (1 + RADIX_TREE_MAP_SHIFT - 3);
+
+	if (shadow_nodes <= max_nodes)
+		return 0;
+
+	return shadow_nodes - max_nodes;
+}
+
+static enum lru_status shadow_lru_isolate(struct list_head *item,
+					  spinlock_t *lru_lock,
+					  void *arg)
+{
+	struct address_space *mapping;
+	struct radix_tree_node *node;
+	unsigned int i;
+	int ret;
+
+	/*
+	 * Page cache insertions and deletions synchronously maintain
+	 * the shadow node LRU under the mapping->tree_lock and the
+	 * lru_lock.  Because the page cache tree is emptied before
+	 * the inode can be destroyed, holding the lru_lock pins any
+	 * address_space that has radix tree nodes on the LRU.
+	 *
+	 * We can then safely transition to the mapping->tree_lock to
+	 * pin only the address_space of the particular node we want
+	 * to reclaim, take the node off-LRU, and drop the lru_lock.
+	 */
+
+	node = container_of(item, struct radix_tree_node, private_list);
+	mapping = node->private_data;
+
+	/* Coming from the list, invert the lock order */
+	if (!spin_trylock(&mapping->tree_lock)) {
+		spin_unlock(lru_lock);
+		ret = LRU_RETRY;
+		goto out;
+	}
+
+	list_del_init(item);
+	spin_unlock(lru_lock);
+
+	/*
+	 * The nodes should only contain one or more shadow entries,
+	 * no pages, so we expect to be able to remove them all and
+	 * delete and free the empty node afterwards.
+	 */
+
+	BUG_ON(!node->count);
+	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
+
+	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
+		if (node->slots[i]) {
+			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
+			node->slots[i] = NULL;
+			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
+			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+			BUG_ON(!mapping->nrshadows);
+			mapping->nrshadows--;
+		}
+	}
+	BUG_ON(node->count);
+	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+	if (!__radix_tree_delete_node(&mapping->page_tree, node))
+		BUG();
+
+	spin_unlock(&mapping->tree_lock);
+	ret = LRU_REMOVED_RETRY;
+out:
+	local_irq_enable();
+	cond_resched();
+	local_irq_disable();
+	spin_lock(lru_lock);
+	return ret;
+}
+
+static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
+				       struct shrink_control *sc)
+{
+	unsigned long ret;
+
+	/* list_lru lock nests inside IRQ-safe mapping->tree_lock */
+	local_irq_disable();
+	ret = list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
+				 shadow_lru_isolate, NULL, &sc->nr_to_scan);
+	local_irq_enable();
+	return ret;
+}
+
+static struct shrinker workingset_shadow_shrinker = {
+	.count_objects = count_shadow_nodes,
+	.scan_objects = scan_shadow_nodes,
+	.seeks = DEFAULT_SEEKS,
+	.flags = SHRINKER_NUMA_AWARE,
+};
+
+/*
+ * Our list_lru->lock is IRQ-safe as it nests inside the IRQ-safe
+ * mapping->tree_lock.
+ */
+static struct lock_class_key shadow_nodes_key;
+
+static int __init workingset_init(void)
+{
+	int ret;
+
+	ret = list_lru_init_key(&workingset_shadow_nodes, &shadow_nodes_key);
+	if (ret)
+		goto err;
+	ret = register_shrinker(&workingset_shadow_shrinker);
+	if (ret)
+		goto err_list_lru;
+	return 0;
+err_list_lru:
+	list_lru_destroy(&workingset_shadow_nodes);
+err:
+	return ret;
+}
+module_init(workingset_init);