Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

fix soft lockup at NFS mount via per-SB LRU-list of unused dentries

[Summary]

Split the LRU list of unused dentries into one list per superblock, to
avoid soft lockups during NFS mounts and during remounting of any
filesystem.

Previously I posted here:
http://lkml.org/lkml/2008/3/5/590

[Descriptions]

- background

dentry_unused is a global list of dentries that are not referenced. It
grows as references on directories and files are released, and it can
become very long when plenty of free memory is available.

- the problem

When shrink_dcache_sb() is called, it scans the whole dentry_unused
list linearly under spin_lock(), skipping every dentry whose d_sb is
different from the given superblock. This scan is very expensive when
there are many entries, and very inefficient when there are many
superblocks.

IOW, we only need to shrink the unused dentries of one superblock, but
we scan the unused dentries of every superblock in the system. For
example, we may need to scan only 500 dentries to unmount a
filesystem, yet we also scan 1,000,000 or more unused dentries
belonging to other superblocks.

In our case, when mounting NFS*, shrink_dcache_sb() is called to
shrink the unused dentries of the NFS superblock, but it scans
100,000,000 unused dentries on other superblocks in the system, such
as local ext3 filesystems. I have heard that NFS mounting took 1
minute on some systems in use.

* : NFS uses a virtual filesystem in the rpc layer, so NFS is affected
by this problem.

100,000,000 is a realistic number on large systems.

A per-superblock LRU of unused dentries reduces this cost to a
reasonable level.

- How to fix

I found that this problem is solved by David Chinner's "Per-superblock
unused dentry LRU lists V3"(1), so I rebased it and added a fix to
reclaim with fairness, following Andrew Morton's comments(2).

1) http://lkml.org/lkml/2006/5/25/318
2) http://lkml.org/lkml/2006/5/25/320

Split the LRU list of unused dentries into one list per superblock.
NFS mounting will then check only the dentries under its own
superblock instead of all of them. However, this split breaks the
global LRU ordering of unused dentries, so I attempt to reclaim unused
dentries with fairness by calculating the number of dentries to scan
on each superblock as follows:

number of dentries to scan on this sb =
    count * (number of dentries on this sb / number of dentries in the machine)

- ToDo
- I have to measure performance numbers and run stress tests.

- If an unmount occurs while prune_dcache() is scanning a superblock,
we cannot reach the next superblock because the current one has gone
away. We then restart scanning from the first superblock, which makes
reclaim of unused dentries on the first superblock unfair. But I think
this happens very rarely.

- Test Results

Results on a 6GB box with an excessive number of unused dentries.

Without the patch:

$ cat /proc/sys/fs/dentry-state
10181835 10180203 45 0 0 0
# time mount -t nfs 10.124.60.70:/work/kernel-src nfs

real    0m1.830s
user    0m0.001s
sys     0m1.653s

With the patch:

$ cat /proc/sys/fs/dentry-state
10236610 10234751 45 0 0 0
# time mount -t nfs 10.124.60.70:/work/kernel-src nfs

real    0m0.106s
user    0m0.002s
sys     0m0.032s

[akpm@linux-foundation.org: fix comments]
Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Chinner <dgc@sgi.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by: Kentaro Makita
Committed by: Linus Torvalds
da3bbdd4 3c82d0ce

189 insertions(+), 159 deletions(-):

 fs/dcache.c        | +184 -159
 fs/super.c         |   +1
 include/linux/fs.h |   +4

fs/dcache.c

···
 static unsigned int d_hash_mask __read_mostly;
 static unsigned int d_hash_shift __read_mostly;
 static struct hlist_head *dentry_hashtable __read_mostly;
-static LIST_HEAD(dentry_unused);
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
···
         call_rcu(&dentry->d_u.d_rcu, d_callback);
 }
 
-static void dentry_lru_remove(struct dentry *dentry)
-{
-        if (!list_empty(&dentry->d_lru)) {
-                list_del_init(&dentry->d_lru);
-                dentry_stat.nr_unused--;
-        }
-}
-
 /*
  * Release the dentry's inode, using the filesystem
  * d_iput() operation if defined.
···
         } else {
                 spin_unlock(&dentry->d_lock);
                 spin_unlock(&dcache_lock);
+        }
+}
+
+/*
+ * dentry_lru_(add|add_tail|del|del_init) must be called with dcache_lock held.
+ */
+static void dentry_lru_add(struct dentry *dentry)
+{
+        list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
+        dentry->d_sb->s_nr_dentry_unused++;
+        dentry_stat.nr_unused++;
+}
+
+static void dentry_lru_add_tail(struct dentry *dentry)
+{
+        list_add_tail(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
+        dentry->d_sb->s_nr_dentry_unused++;
+        dentry_stat.nr_unused++;
+}
+
+static void dentry_lru_del(struct dentry *dentry)
+{
+        if (!list_empty(&dentry->d_lru)) {
+                list_del(&dentry->d_lru);
+                dentry->d_sb->s_nr_dentry_unused--;
+                dentry_stat.nr_unused--;
+        }
+}
+
+static void dentry_lru_del_init(struct dentry *dentry)
+{
+        if (likely(!list_empty(&dentry->d_lru))) {
+                list_del_init(&dentry->d_lru);
+                dentry->d_sb->s_nr_dentry_unused--;
+                dentry_stat.nr_unused--;
         }
 }
···
                 goto kill_it;
         if (list_empty(&dentry->d_lru)) {
                 dentry->d_flags |= DCACHE_REFERENCED;
-                list_add(&dentry->d_lru, &dentry_unused);
-                dentry_stat.nr_unused++;
+                dentry_lru_add(dentry);
         }
         spin_unlock(&dentry->d_lock);
         spin_unlock(&dcache_lock);
···
 unhash_it:
         __d_drop(dentry);
 kill_it:
-        dentry_lru_remove(dentry);
+        /* if dentry was on the d_lru list delete it from there */
+        dentry_lru_del(dentry);
         dentry = d_kill(dentry);
         if (dentry)
                 goto repeat;
···
 static inline struct dentry * __dget_locked(struct dentry *dentry)
 {
         atomic_inc(&dentry->d_count);
-        dentry_lru_remove(dentry);
+        dentry_lru_del_init(dentry);
         return dentry;
 }
···
 
         if (dentry->d_op && dentry->d_op->d_delete)
                 dentry->d_op->d_delete(dentry);
-        dentry_lru_remove(dentry);
+        dentry_lru_del_init(dentry);
         __d_drop(dentry);
         dentry = d_kill(dentry);
         spin_lock(&dcache_lock);
 }
···
-/**
- * prune_dcache - shrink the dcache
- * @count: number of entries to try and free
- * @sb: if given, ignore dentries for other superblocks
- *      which are being unmounted.
- *
- * Shrink the dcache. This is done when we need
- * more memory, or simply when we need to unmount
- * something (at which point we need to unuse
- * all dentries).
- *
- * This function may fail to free any resources if
- * all the dentries are in use.
+/*
+ * Shrink the dentry LRU on a given superblock.
+ * @sb   : superblock to shrink dentry LRU.
+ * @count: If count is NULL, we prune all dentries on superblock.
+ * @flags: If flags is non-zero, we need to do special processing based on
+ *  which flags are set. This means we don't need to maintain multiple
+ *  similar copies of this loop.
  */
-
-static void prune_dcache(int count, struct super_block *sb)
+static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
 {
+        LIST_HEAD(referenced);
+        LIST_HEAD(tmp);
+        struct dentry *dentry;
+        int cnt = 0;
+
+        BUG_ON(!sb);
+        BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
         spin_lock(&dcache_lock);
-        for (; count ; count--) {
-                struct dentry *dentry;
-                struct list_head *tmp;
-                struct rw_semaphore *s_umount;
+        if (count != NULL)
+                /* called from prune_dcache() and shrink_dcache_parent() */
+                cnt = *count;
+restart:
+        if (count == NULL)
+                list_splice_init(&sb->s_dentry_lru, &tmp);
+        else {
+                while (!list_empty(&sb->s_dentry_lru)) {
+                        dentry = list_entry(sb->s_dentry_lru.prev,
+                                        struct dentry, d_lru);
+                        BUG_ON(dentry->d_sb != sb);
 
-                cond_resched_lock(&dcache_lock);
-
-                tmp = dentry_unused.prev;
-                if (sb) {
-                        /* Try to find a dentry for this sb, but don't try
-                         * too hard, if they aren't near the tail they will
-                         * be moved down again soon
+                        spin_lock(&dentry->d_lock);
+                        /*
+                         * If we are honouring the DCACHE_REFERENCED flag and
+                         * the dentry has this flag set, don't free it. Clear
+                         * the flag and put it back on the LRU.
                          */
-                        int skip = count;
-                        while (skip && tmp != &dentry_unused &&
-                            list_entry(tmp, struct dentry, d_lru)->d_sb != sb) {
-                                skip--;
-                                tmp = tmp->prev;
+                        if ((flags & DCACHE_REFERENCED)
+                                && (dentry->d_flags & DCACHE_REFERENCED)) {
+                                dentry->d_flags &= ~DCACHE_REFERENCED;
+                                list_move_tail(&dentry->d_lru, &referenced);
+                                spin_unlock(&dentry->d_lock);
+                        } else {
+                                list_move_tail(&dentry->d_lru, &tmp);
+                                spin_unlock(&dentry->d_lock);
+                                cnt--;
+                                if (!cnt)
+                                        break;
                         }
                 }
-                if (tmp == &dentry_unused)
-                        break;
-                list_del_init(tmp);
-                prefetch(dentry_unused.prev);
-                dentry_stat.nr_unused--;
-                dentry = list_entry(tmp, struct dentry, d_lru);
-
-                spin_lock(&dentry->d_lock);
+        }
+        while (!list_empty(&tmp)) {
+                dentry = list_entry(tmp.prev, struct dentry, d_lru);
+                dentry_lru_del_init(dentry);
+                spin_lock(&dentry->d_lock);
                 /*
                  * We found an inuse dentry which was not removed from
-                 * dentry_unused because of laziness during lookup. Do not free
-                 * it - just keep it off the dentry_unused list.
+                 * the LRU because of laziness during lookup. Do not free
+                 * it - just keep it off the LRU list.
                  */
                 if (atomic_read(&dentry->d_count)) {
                         spin_unlock(&dentry->d_lock);
                         continue;
                 }
-                /* If the dentry was recently referenced, don't free it. */
-                if (dentry->d_flags & DCACHE_REFERENCED) {
-                        dentry->d_flags &= ~DCACHE_REFERENCED;
-                        list_add(&dentry->d_lru, &dentry_unused);
-                        dentry_stat.nr_unused++;
-                        spin_unlock(&dentry->d_lock);
-                        continue;
-                }
-                /*
-                 * If the dentry is not DCACHED_REFERENCED, it is time
-                 * to remove it from the dcache, provided the super block is
-                 * NULL (which means we are trying to reclaim memory)
-                 * or this dentry belongs to the same super block that
-                 * we want to shrink.
-                 */
-                /*
-                 * If this dentry is for "my" filesystem, then I can prune it
-                 * without taking the s_umount lock (I already hold it).
-                 */
-                if (sb && dentry->d_sb == sb) {
-                        prune_one_dentry(dentry);
-                        continue;
-                }
-                /*
-                 * ...otherwise we need to be sure this filesystem isn't being
-                 * unmounted, otherwise we could race with
-                 * generic_shutdown_super(), and end up holding a reference to
-                 * an inode while the filesystem is unmounted.
-                 * So we try to get s_umount, and make sure s_root isn't NULL.
-                 * (Take a local copy of s_umount to avoid a use-after-free of
-                 * `dentry').
-                 */
-                s_umount = &dentry->d_sb->s_umount;
-                if (down_read_trylock(s_umount)) {
-                        if (dentry->d_sb->s_root != NULL) {
-                                prune_one_dentry(dentry);
-                                up_read(s_umount);
-                                continue;
-                        }
-                        up_read(s_umount);
-                }
-                spin_unlock(&dentry->d_lock);
-                /*
-                 * Insert dentry at the head of the list as inserting at the
-                 * tail leads to a cycle.
-                 */
-                list_add(&dentry->d_lru, &dentry_unused);
-                dentry_stat.nr_unused++;
+                prune_one_dentry(dentry);
+                /* dentry->d_lock was dropped in prune_one_dentry() */
+                cond_resched_lock(&dcache_lock);
         }
+        if (count == NULL && !list_empty(&sb->s_dentry_lru))
+                goto restart;
+        if (count != NULL)
+                *count = cnt;
+        if (!list_empty(&referenced))
+                list_splice(&referenced, &sb->s_dentry_lru);
         spin_unlock(&dcache_lock);
 }
 
-/*
- * Shrink the dcache for the specified super block.
- * This allows us to unmount a device without disturbing
- * the dcache for the other devices.
+/**
+ * prune_dcache - shrink the dcache
+ * @count: number of entries to try to free
  *
- * This implementation makes just two traversals of the
- * unused list. On the first pass we move the selected
- * dentries to the most recent end, and on the second
- * pass we free them. The second pass must restart after
- * each dput(), but since the target dentries are all at
- * the end, it's really just a single traversal.
+ * Shrink the dcache. This is done when we need more memory, or simply when we
+ * need to unmount something (at which point we need to unuse all dentries).
+ *
+ * This function may fail to free any resources if all the dentries are in use.
  */
+static void prune_dcache(int count)
+{
+        struct super_block *sb;
+        int w_count;
+        int unused = dentry_stat.nr_unused;
+        int prune_ratio;
+        int pruned;
+
+        if (unused == 0 || count == 0)
+                return;
+        spin_lock(&dcache_lock);
+restart:
+        if (count >= unused)
+                prune_ratio = 1;
+        else
+                prune_ratio = unused / count;
+        spin_lock(&sb_lock);
+        list_for_each_entry(sb, &super_blocks, s_list) {
+                if (sb->s_nr_dentry_unused == 0)
+                        continue;
+                sb->s_count++;
+                /* Now, we reclaim unused dentries with fairness.
+                 * We reclaim them same percentage from each superblock.
+                 * We calculate number of dentries to scan on this sb
+                 * as follows, but the implementation is arranged to avoid
+                 * overflows:
+                 * number of dentries to scan on this sb =
+                 * count * (number of dentries on this sb /
+                 * number of dentries in the machine)
+                 */
+                spin_unlock(&sb_lock);
+                if (prune_ratio != 1)
+                        w_count = (sb->s_nr_dentry_unused / prune_ratio) + 1;
+                else
+                        w_count = sb->s_nr_dentry_unused;
+                pruned = w_count;
+                /*
+                 * We need to be sure this filesystem isn't being unmounted,
+                 * otherwise we could race with generic_shutdown_super(), and
+                 * end up holding a reference to an inode while the filesystem
+                 * is unmounted. So we try to get s_umount, and make sure
+                 * s_root isn't NULL.
+                 */
+                if (down_read_trylock(&sb->s_umount)) {
+                        if ((sb->s_root != NULL) &&
+                            (!list_empty(&sb->s_dentry_lru))) {
+                                spin_unlock(&dcache_lock);
+                                __shrink_dcache_sb(sb, &w_count,
+                                                DCACHE_REFERENCED);
+                                pruned -= w_count;
+                                spin_lock(&dcache_lock);
+                        }
+                        up_read(&sb->s_umount);
+                }
+                spin_lock(&sb_lock);
+                count -= pruned;
+                /*
+                 * restart only when sb is no longer on the list and
+                 * we have more work to do.
+                 */
+                if (__put_super_and_need_restart(sb) && count > 0) {
+                        spin_unlock(&sb_lock);
+                        goto restart;
+                }
+        }
+        spin_unlock(&sb_lock);
+        spin_unlock(&dcache_lock);
+}
 
 /**
  * shrink_dcache_sb - shrink dcache for a superblock
···
  * is used to free the dcache before unmounting a file
  * system
  */
-
 void shrink_dcache_sb(struct super_block * sb)
 {
-        struct list_head *tmp, *next;
-        struct dentry *dentry;
-
-        /*
-         * Pass one ... move the dentries for the specified
-         * superblock to the most recent end of the unused list.
-         */
-        spin_lock(&dcache_lock);
-        list_for_each_prev_safe(tmp, next, &dentry_unused) {
-                dentry = list_entry(tmp, struct dentry, d_lru);
-                if (dentry->d_sb != sb)
-                        continue;
-                list_move_tail(tmp, &dentry_unused);
-        }
-
-        /*
-         * Pass two ... free the dentries for this superblock.
-         */
-repeat:
-        list_for_each_prev_safe(tmp, next, &dentry_unused) {
-                dentry = list_entry(tmp, struct dentry, d_lru);
-                if (dentry->d_sb != sb)
-                        continue;
-                dentry_stat.nr_unused--;
-                list_del_init(tmp);
-                spin_lock(&dentry->d_lock);
-                if (atomic_read(&dentry->d_count)) {
-                        spin_unlock(&dentry->d_lock);
-                        continue;
-                }
-                prune_one_dentry(dentry);
-                cond_resched_lock(&dcache_lock);
-                goto repeat;
-        }
-        spin_unlock(&dcache_lock);
+        __shrink_dcache_sb(sb, NULL, 0);
 }
···
 
         /* detach this root from the system */
         spin_lock(&dcache_lock);
-        dentry_lru_remove(dentry);
+        dentry_lru_del_init(dentry);
         __d_drop(dentry);
         spin_unlock(&dcache_lock);
···
                 spin_lock(&dcache_lock);
                 list_for_each_entry(loop, &dentry->d_subdirs,
                                     d_u.d_child) {
-                        dentry_lru_remove(loop);
+                        dentry_lru_del_init(loop);
                         __d_drop(loop);
                         cond_resched_lock(&dcache_lock);
                 }
···
                 struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
                 next = tmp->next;
 
-                dentry_lru_remove(dentry);
+                dentry_lru_del_init(dentry);
                 /*
                  * move only zero ref count dentries to the end
                  * of the unused list for prune_dcache
                  */
                 if (!atomic_read(&dentry->d_count)) {
-                        list_add_tail(&dentry->d_lru, &dentry_unused);
-                        dentry_stat.nr_unused++;
+                        dentry_lru_add_tail(dentry);
                         found++;
                 }
···
 
 void shrink_dcache_parent(struct dentry * parent)
 {
+        struct super_block *sb = parent->d_sb;
         int found;
 
         while ((found = select_parent(parent)) != 0)
-                prune_dcache(found, parent->d_sb);
+                __shrink_dcache_sb(sb, &found, 0);
 }
···
         if (nr) {
                 if (!(gfp_mask & __GFP_FS))
                         return -1;
-                prune_dcache(nr, NULL);
+                prune_dcache(nr);
         }
         return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
···
  * rcu_read_lock() and rcu_read_unlock() are used to disable preemption while
  * lookup is going on.
  *
- * dentry_unused list is not updated even if lookup finds the required dentry
+ * The dentry unused LRU is not updated even if lookup finds the required dentry
  * in there. It is updated in places such as prune_dcache, shrink_dcache_sb,
  * select_parent and __dget_locked. This laziness saves lookup from dcache_lock
  * acquisition.

fs/super.c

···
         INIT_LIST_HEAD(&s->s_instances);
         INIT_HLIST_HEAD(&s->s_anon);
         INIT_LIST_HEAD(&s->s_inodes);
+        INIT_LIST_HEAD(&s->s_dentry_lru);
         init_rwsem(&s->s_umount);
         mutex_init(&s->s_lock);
         lockdep_set_class(&s->s_umount, &type->s_umount_key);

include/linux/fs.h

···
 extern struct list_head super_blocks;
 extern spinlock_t sb_lock;
 
+#define sb_entry(list)  list_entry((list), struct super_block, s_list)
 #define S_BIAS (1<<30)
 struct super_block {
         struct list_head        s_list;         /* Keep this first */
···
         struct list_head        s_more_io;      /* parked for more writeback */
         struct hlist_head       s_anon;         /* anonymous dentries for (nfs) exporting */
         struct list_head        s_files;
+        /* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
+        struct list_head        s_dentry_lru;   /* unused dentry lru */
+        int                     s_nr_dentry_unused;     /* # of dentry on lru */
 
         struct block_device     *s_bdev;
         struct mtd_info         *s_mtd;