
Merge tag 'mm-slub-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux

Pull SLUB updates from Vlastimil Babka:
"SLUB: reduce irq disabled scope and make it RT compatible

This series was initially inspired by Mel's pcplist local_lock
rewrite, and also by an interest in better understanding SLUB's
locking and the new primitives, their RT variants and implications. It
makes SLUB compatible with PREEMPT_RT and generally more
preemption-friendly, apparently without significant regressions, as
the fast paths are not affected.

The main changes to SLUB by this series:

- irq disabling is now only done for the minimum amount of time needed
to protect the strict kmem_cache_cpu fields, and as part of spin lock,
local lock and bit lock operations to make them irq-safe

- SLUB is fully PREEMPT_RT compatible

The series should now be sufficiently tested in both RT and !RT
configs, mainly thanks to Mike.

The RFC/v1 version also got basic performance screening by Mel that
didn't show major regressions. Mike's testing with hackbench of v2 on
!RT reported negligible differences [6]:

virgin(ish) tip
5.13.0.g60ab3ed-tip
7,320.67 msec task-clock # 7.792 CPUs utilized ( +- 0.31% )
221,215 context-switches # 0.030 M/sec ( +- 3.97% )
16,234 cpu-migrations # 0.002 M/sec ( +- 4.07% )
13,233 page-faults # 0.002 M/sec ( +- 0.91% )
27,592,205,252 cycles # 3.769 GHz ( +- 0.32% )
8,309,495,040 instructions # 0.30 insn per cycle ( +- 0.37% )
1,555,210,607 branches # 212.441 M/sec ( +- 0.42% )
5,484,209 branch-misses # 0.35% of all branches ( +- 2.13% )

0.93949 +- 0.00423 seconds time elapsed ( +- 0.45% )
0.94608 +- 0.00384 seconds time elapsed ( +- 0.41% ) (repeat)
0.94422 +- 0.00410 seconds time elapsed ( +- 0.43% )

5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
7,343.57 msec task-clock # 7.776 CPUs utilized ( +- 0.44% )
223,044 context-switches # 0.030 M/sec ( +- 3.02% )
16,057 cpu-migrations # 0.002 M/sec ( +- 4.03% )
13,164 page-faults # 0.002 M/sec ( +- 0.97% )
27,684,906,017 cycles # 3.770 GHz ( +- 0.45% )
8,323,273,871 instructions # 0.30 insn per cycle ( +- 0.28% )
1,556,106,680 branches # 211.901 M/sec ( +- 0.31% )
5,463,468 branch-misses # 0.35% of all branches ( +- 1.33% )

0.94440 +- 0.00352 seconds time elapsed ( +- 0.37% )
0.94830 +- 0.00228 seconds time elapsed ( +- 0.24% ) (repeat)
0.93813 +- 0.00440 seconds time elapsed ( +- 0.47% ) (repeat)

RT configs showed some throughput regressions, but that's an expected
tradeoff for the preemption improvements through the RT mutex. It
didn't prevent v2 from being incorporated into the 5.13 RT tree [7],
leading to testing exposure and bugfixes.

Before the series, SLUB is lockless in both the allocation and free
fast paths, but elsewhere it disables irqs for considerable periods of
time - especially in the allocation slowpath and in bulk allocation,
where IRQs are re-enabled only when a new page from the page allocator
is needed and the context allows blocking. The irq disabled sections
can then include deactivate_slab(), which walks a full freelist and
frees the slab back to the page allocator, or unfreeze_partials(),
which goes through a list of percpu partial slabs. The RT tree
currently has some patches mitigating these, but we can do much better
in mainline too.

Patches 1-6 are straightforward improvements or cleanups that could
exist outside of this series too, but are prerequisites.

Patches 7-9 are also preparatory code changes without functional
changes, but not so useful without the rest of the series.

Patch 10 simplifies the fast paths on systems with preemption, based
on the (hopefully correct) observation that the current loops to
verify tid are unnecessary.
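As a rough illustration of the tid scheme the fast paths rely on, here is a user-space C sketch (not the kernel implementation; the packed 64-bit head and the index-based freelist are simplifications of this_cpu_cmpxchg_double() on the real (freelist, tid) pair). A cmpxchg that carries a stale tid fails and is retried, which is how an operation started on another cpu or interleaved with a slowpath is detected:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NIL 0xffffffffu

/* Objects are array slots; next_idx[i] is the object below i on the stack. */
static uint32_t next_idx[8];

/* Head packs (top object index << 32) | transaction id into one word so
 * both can be swapped with a single cmpxchg, analogous to what the kernel
 * does with this_cpu_cmpxchg_double() on (freelist, tid). */
static _Atomic uint64_t head = (uint64_t)NIL << 32;

static void push(uint32_t idx)
{
	uint64_t old = atomic_load(&head), new;

	do {
		next_idx[idx] = (uint32_t)(old >> 32);
		/* every successful update bumps the tid in the low bits */
		new = ((uint64_t)idx << 32) | ((uint32_t)old + 1);
	} while (!atomic_compare_exchange_weak(&head, &old, new));
}

static uint32_t pop(void)
{
	uint64_t old = atomic_load(&head), new;
	uint32_t top;

	do {
		top = (uint32_t)(old >> 32);
		if (top == NIL)
			return NIL;
		new = ((uint64_t)next_idx[top] << 32) | ((uint32_t)old + 1);
	} while (!atomic_compare_exchange_weak(&head, &old, new));

	return top;
}
```

Because the tid changes on every update, the retry loop alone suffices; patch 10's point is that an extra loop re-verifying the tid on top of this is redundant.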

Patches 11-20 focus on reducing irq disabled scope in the allocation
slowpath:

- patch 11 moves disabling of irqs into ___slab_alloc() from its
callers, which are the allocation slowpath and bulk allocation.
Instead, these callers only disable preemption to stabilize the cpu.

- The following patches then gradually reduce the scope of disabled
irqs in ___slab_alloc() and the functions called from there. As of
patch 14, the re-enabling of irqs based on gfp flags before calling
the page allocator is removed from allocate_slab(). As of patch 17,
it's possible to reach the page allocator (in case of existing
slabs depleted) without disabling and re-enabling irqs a single
time.
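The reworked control flow can be sketched in user-space C (hypothetical helpers with trailing underscores; the depth counter merely models irq state, and alloc_new_page() stands in for the page allocator). The key pattern is dropping the critical section before the potentially blocking allocation, then revalidating after re-entering it:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch, not kernel code: the "irq disabled" section
 * (modeled by irq_depth) covers only the cpu-slab field accesses and is
 * fully exited before calling into the page allocator, which may block. */

static int irq_depth;                       /* 0 means "irqs enabled" */
static void *cpu_page;                      /* stand-in for c->page */
static int allocator_called_with_irqs_off = -1;

static void local_irq_disable_(void) { irq_depth++; }
static void local_irq_enable_(void)  { irq_depth--; }

static void *alloc_new_page(void)
{
	static char page[64];

	/* record whether we were (wrongly) entered with irqs disabled */
	allocator_called_with_irqs_off = irq_depth;
	return page;
}

static void *slab_alloc_slowpath(void)
{
	void *page;

	local_irq_disable_();
	page = cpu_page;
	local_irq_enable_();

	if (!page) {
		/* the slow allocation happens with irqs enabled */
		page = alloc_new_page();

		local_irq_disable_();
		/* revalidate: another context may have installed a page */
		if (cpu_page)
			page = cpu_page;
		else
			cpu_page = page;
		local_irq_enable_();
	}
	return page;
}
```

The revalidation step is the price of the shorter critical section: nothing read before the allocator call can be trusted afterwards.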

Patches 21-26 reduce the scope of disabled irqs in functions related
to unfreezing percpu partial slabs.

Patch 27 is preparatory. Patch 28 is adopted from the RT tree and
converts the flushing of percpu slabs on all cpus from using an IPI to
a workqueue, so that the processing doesn't happen with irqs disabled
in the IPI handler. The flushing is not performance critical, so the
overhead should be acceptable.
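The IPI-to-workqueue idea can be sketched with pthreads (a user-space analogue, not kernel code; flush_work_fn and flush_all_ are hypothetical stand-ins for the per-cpu work items and the schedule_work_on()/flush_work() pairing):

```c
#include <assert.h>
#include <pthread.h>

/* Instead of flushing each cpu's slab from an IPI handler (which runs
 * with irqs disabled), queue a work item per cpu and wait for all of
 * them, so the flushing runs in normal, schedulable context. */

#define NCPUS 4

static int flushed[NCPUS];

static void *flush_work_fn(void *arg)
{
	int cpu = *(int *)arg;

	/* runs in thread context; "irqs" stay enabled throughout */
	flushed[cpu] = 1;
	return NULL;
}

static void flush_all_(void)
{
	pthread_t workers[NCPUS];
	int ids[NCPUS];
	int cpu;

	/* analogue of schedule_work_on() for each cpu ... */
	for (cpu = 0; cpu < NCPUS; cpu++) {
		ids[cpu] = cpu;
		pthread_create(&workers[cpu], NULL, flush_work_fn, &ids[cpu]);
	}
	/* ... followed by flush_work() on each before returning */
	for (cpu = 0; cpu < NCPUS; cpu++)
		pthread_join(workers[cpu], NULL);
}
```

The join step mirrors why a mutex plus cpus_read_lock() appears in the real patch: the caller must not return until every cpu has actually flushed.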

Patch 29 also comes from RT tree and makes object_map_lock RT
compatible.

Patch 30 makes slab_lock irq-safe on RT, where we cannot rely on irqs
being disabled by the list_lock spin lock usage.
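Since slab_lock is a bit spinlock on one bit of page->flags, the change amounts to wrapping it with an irq save/restore on RT. A user-space sketch (not kernel code; the boolean irqs_enabled and the config_preempt_rt constant are stand-ins for real irq state and the PREEMPT_RT config):

```c
#include <assert.h>
#include <stdatomic.h>

#define PG_LOCKED_BIT 0

static _Atomic unsigned long page_flags;
static int irqs_enabled = 1;
static const int config_preempt_rt = 1;    /* assumption: RT config */

static void bit_spin_lock_(int bit, _Atomic unsigned long *flags)
{
	/* spin until the bit was observed clear when we set it */
	while (atomic_fetch_or(flags, 1UL << bit) & (1UL << bit))
		;
}

static void bit_spin_unlock_(int bit, _Atomic unsigned long *flags)
{
	atomic_fetch_and(flags, ~(1UL << bit));
}

static void slab_lock_(unsigned long *irqflags)
{
	if (config_preempt_rt) {           /* local_irq_save() analogue */
		*irqflags = irqs_enabled;
		irqs_enabled = 0;
	}
	bit_spin_lock_(PG_LOCKED_BIT, &page_flags);
}

static void slab_unlock_(unsigned long *irqflags)
{
	bit_spin_unlock_(PG_LOCKED_BIT, &page_flags);
	if (config_preempt_rt)             /* local_irq_restore() analogue */
		irqs_enabled = *irqflags;
}
```

On !RT the surrounding _irqsave lock variants already provide the disabled-irq guarantee, so the wrapper is compiled to the plain bit spinlock there.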

Patch 31 changes kmem_cache_cpu->partial handling in put_cpu_partial()
from a cmpxchg loop to a short irq disabled section, which is what all
other code modifying the field uses. This addresses a theoretical race
scenario pointed out by Jann, and makes the critical section safe with
respect to RT local_lock semantics after the conversion in patch 35.
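The motivation can be shown with a minimal user-space sketch (not the kernel code; a pthread mutex stands in for local_lock_irqsave(), and struct page_ and the counters are simplified): the partial list head and its companion counters cannot all be updated by one cmpxchg, so a short critical section keeps them consistent where a cmpxchg loop on ->partial alone could not.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct page_ {
	struct page_ *next;
	int objects;
};

static pthread_mutex_t cpu_slab_lock = PTHREAD_MUTEX_INITIALIZER;
static struct page_ *cpu_partial;
static int cpu_partial_objects;

static void put_cpu_partial_(struct page_ *page)
{
	pthread_mutex_lock(&cpu_slab_lock);    /* local_lock_irqsave() */
	page->next = cpu_partial;
	cpu_partial = page;
	/* both fields change together; no window where they disagree */
	cpu_partial_objects += page->objects;
	pthread_mutex_unlock(&cpu_slab_lock);  /* local_unlock_irqrestore() */
}
```

With a cmpxchg loop, a second writer could slip in between the list update and the counter update; the locked section closes that window, and it maps directly onto a sleeping local_lock on RT.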

Patch 32 changes preempt disable to migrate disable, so that the
nested list_lock spinlock is safe to take on RT. Because
migrate_disable() is a function call even on !RT, a small set of
private wrappers is introduced to keep using the cheaper
preempt_disable() on !PREEMPT_RT configurations. As of this patch,
SLUB should already be compatible with RT's lock semantics.
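The wrapper pattern looks roughly like this user-space sketch (illustrative only; the counters are stand-ins that merely make the config-dependent selection observable, and CONFIG_PREEMPT_RT is checked the same way the real macros do):

```c
#include <assert.h>

static int preempt_disabled_calls;
static int migrate_disabled_calls;
static int cpu_var;

#ifndef CONFIG_PREEMPT_RT
/* !RT: keep the cheap inline preempt_disable()-based get_cpu_ptr() */
#define slub_get_cpu_ptr(var) \
	(preempt_disabled_calls++, &(var))
#else
/* RT: pay the migrate_disable() function call, then take the pointer */
#define slub_get_cpu_ptr(var) \
	(migrate_disabled_calls++, &(var))
#endif
```

The point of the indirection is purely cost: migrate_disable() is an out-of-line call even on !RT, so the wrappers let !RT configurations keep the cheaper inline path.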

Finally, patch 33 replaces the irq disabled sections that protect
kmem_cache_cpu fields in the slow paths with a local lock. However, on
PREEMPT_RT it means the lockless fast paths can now preempt slow paths
which don't expect that, so the local lock has to be taken also in the
fast paths and they are no longer lockless. RT folks seem to not mind
this tradeoff. The patch also updates the locking documentation in the
file's comment"

Mike Galbraith and Mel Gorman verified that their earlier testing
observations still hold for the final series:

Link: https://lore.kernel.org/lkml/89ba4f783114520c167cc915ba949ad2c04d6790.camel@gmx.de/
Link: https://lore.kernel.org/lkml/20210907082010.GB3959@techsingularity.net/

* tag 'mm-slub-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux: (33 commits)
mm, slub: convert kmem_cpu_slab protection to local_lock
mm, slub: use migrate_disable() on PREEMPT_RT
mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
mm, slub: make slab_lock() disable irqs with PREEMPT_RT
mm: slub: make object_map_lock a raw_spinlock_t
mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context
mm, slab: split out the cpu offline variant of flush_slab()
mm, slub: don't disable irqs in slub_cpu_dead()
mm, slub: only disable irq with spin_lock in __unfreeze_partials()
mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing
mm, slub: detach whole partial list at once in unfreeze_partials()
mm, slub: discard slabs in unfreeze_partials() without irqs disabled
mm, slub: move irq control into unfreeze_partials()
mm, slub: call deactivate_slab() without disabling irqs
mm, slub: make locking in deactivate_slab() irq-safe
mm, slub: move reset of c->page and freelist out of deactivate_slab()
mm, slub: stop disabling irqs around get_partial()
mm, slub: check new pages with restored irqs
mm, slub: validate slab from partial list or page allocator before making it cpu slab
mm, slub: restore irqs around calling new_slab()
...

+571 -259
+9
include/linux/page-flags.h
··· 778 778 return PageActive(page); 779 779 } 780 780 781 + /* 782 + * A version of PageSlabPfmemalloc() for opportunistic checks where the page 783 + * might have been freed under us and not be a PageSlab anymore. 784 + */ 785 + static inline int __PageSlabPfmemalloc(struct page *page) 786 + { 787 + return PageActive(page); 788 + } 789 + 781 790 static inline void SetPageSlabPfmemalloc(struct page *page) 782 791 { 783 792 VM_BUG_ON_PAGE(!PageSlab(page), page);
+6
include/linux/slub_def.h
··· 10 10 #include <linux/kfence.h> 11 11 #include <linux/kobject.h> 12 12 #include <linux/reciprocal_div.h> 13 + #include <linux/local_lock.h> 13 14 14 15 enum stat_item { 15 16 ALLOC_FASTPATH, /* Allocation from cpu slab */ ··· 41 40 CPU_PARTIAL_DRAIN, /* Drain cpu partial to node partial */ 42 41 NR_SLUB_STAT_ITEMS }; 43 42 43 + /* 44 + * When changing the layout, make sure freelist and tid are still compatible 45 + * with this_cpu_cmpxchg_double() alignment requirements. 46 + */ 44 47 struct kmem_cache_cpu { 45 48 void **freelist; /* Pointer to next available object */ 46 49 unsigned long tid; /* Globally unique transaction id */ ··· 52 47 #ifdef CONFIG_SLUB_CPU_PARTIAL 53 48 struct page *partial; /* Partially allocated frozen slabs */ 54 49 #endif 50 + local_lock_t lock; /* Protects the fields above */ 55 51 #ifdef CONFIG_SLUB_STATS 56 52 unsigned stat[NR_SLUB_STAT_ITEMS]; 57 53 #endif
+2
mm/slab_common.c
··· 502 502 if (unlikely(!s)) 503 503 return; 504 504 505 + cpus_read_lock(); 505 506 mutex_lock(&slab_mutex); 506 507 507 508 s->refcount--; ··· 517 516 } 518 517 out_unlock: 519 518 mutex_unlock(&slab_mutex); 519 + cpus_read_unlock(); 520 520 } 521 521 EXPORT_SYMBOL(kmem_cache_destroy); 522 522
+554 -259
mm/slub.c
··· 46 46 /* 47 47 * Lock order: 48 48 * 1. slab_mutex (Global Mutex) 49 - * 2. node->list_lock 50 - * 3. slab_lock(page) (Only on some arches and for debugging) 49 + * 2. node->list_lock (Spinlock) 50 + * 3. kmem_cache->cpu_slab->lock (Local lock) 51 + * 4. slab_lock(page) (Only on some arches or for debugging) 52 + * 5. object_map_lock (Only for debugging) 51 53 * 52 54 * slab_mutex 53 55 * 54 56 * The role of the slab_mutex is to protect the list of all the slabs 55 57 * and to synchronize major metadata changes to slab cache structures. 58 + * Also synchronizes memory hotplug callbacks. 59 + * 60 + * slab_lock 61 + * 62 + * The slab_lock is a wrapper around the page lock, thus it is a bit 63 + * spinlock. 56 64 * 57 65 * The slab_lock is only used for debugging and on arches that do not 58 66 * have the ability to do a cmpxchg_double. It only protects: ··· 69 61 * C. page->objects -> Number of objects in page 70 62 * D. page->frozen -> frozen state 71 63 * 64 + * Frozen slabs 65 + * 72 66 * If a slab is frozen then it is exempt from list management. It is not 73 67 * on any list except per cpu partial list. The processor that froze the 74 68 * slab is the one who can perform list operations on the page. Other 75 69 * processors may put objects onto the freelist but the processor that 76 70 * froze the slab is the only one that can retrieve the objects from the 77 71 * page's freelist. 72 + * 73 + * list_lock 78 74 * 79 75 * The list_lock protects the partial and full list on each node and 80 76 * the partial slab counter. If taken then no new slabs may be added or ··· 91 79 * slabs, operations can continue without any centralized lock. F.e. 92 80 * allocating a long series of objects that fill up slabs does not require 93 81 * the list lock. 94 - * Interrupts are disabled during allocation and deallocation in order to 95 - * make the slab allocator safe to use in the context of an irq. 
In addition 96 - * interrupts are disabled to ensure that the processor does not change 97 - * while handling per_cpu slabs, due to kernel preemption. 82 + * 83 + * cpu_slab->lock local lock 84 + * 85 + * This locks protect slowpath manipulation of all kmem_cache_cpu fields 86 + * except the stat counters. This is a percpu structure manipulated only by 87 + * the local cpu, so the lock protects against being preempted or interrupted 88 + * by an irq. Fast path operations rely on lockless operations instead. 89 + * On PREEMPT_RT, the local lock does not actually disable irqs (and thus 90 + * prevent the lockless operations), so fastpath operations also need to take 91 + * the lock and are no longer lockless. 92 + * 93 + * lockless fastpaths 94 + * 95 + * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free()) 96 + * are fully lockless when satisfied from the percpu slab (and when 97 + * cmpxchg_double is possible to use, otherwise slab_lock is taken). 98 + * They also don't disable preemption or migration or irqs. They rely on 99 + * the transaction id (tid) field to detect being preempted or moved to 100 + * another cpu. 101 + * 102 + * irq, preemption, migration considerations 103 + * 104 + * Interrupts are disabled as part of list_lock or local_lock operations, or 105 + * around the slab_lock operation, in order to make the slab allocator safe 106 + * to use in the context of an irq. 107 + * 108 + * In addition, preemption (or migration on PREEMPT_RT) is disabled in the 109 + * allocation slowpath, bulk allocation, and put_cpu_partial(), so that the 110 + * local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer 111 + * doesn't have to be revalidated in each section protected by the local lock. 98 112 * 99 113 * SLUB assigns one slab for allocation to each processor. 100 114 * Allocations only occur from these slabs called cpu slabs. ··· 155 117 * options set. 
This moves slab handling out of 156 118 * the fast path and disables lockless freelists. 157 119 */ 120 + 121 + /* 122 + * We could simply use migrate_disable()/enable() but as long as it's a 123 + * function call even on !PREEMPT_RT, use inline preempt_disable() there. 124 + */ 125 + #ifndef CONFIG_PREEMPT_RT 126 + #define slub_get_cpu_ptr(var) get_cpu_ptr(var) 127 + #define slub_put_cpu_ptr(var) put_cpu_ptr(var) 128 + #else 129 + #define slub_get_cpu_ptr(var) \ 130 + ({ \ 131 + migrate_disable(); \ 132 + this_cpu_ptr(var); \ 133 + }) 134 + #define slub_put_cpu_ptr(var) \ 135 + do { \ 136 + (void)(var); \ 137 + migrate_enable(); \ 138 + } while (0) 139 + #endif 158 140 159 141 #ifdef CONFIG_SLUB_DEBUG 160 142 #ifdef CONFIG_SLUB_DEBUG_ON ··· 417 359 /* 418 360 * Per slab locking using the pagelock 419 361 */ 420 - static __always_inline void slab_lock(struct page *page) 362 + static __always_inline void __slab_lock(struct page *page) 421 363 { 422 364 VM_BUG_ON_PAGE(PageTail(page), page); 423 365 bit_spin_lock(PG_locked, &page->flags); 424 366 } 425 367 426 - static __always_inline void slab_unlock(struct page *page) 368 + static __always_inline void __slab_unlock(struct page *page) 427 369 { 428 370 VM_BUG_ON_PAGE(PageTail(page), page); 429 371 __bit_spin_unlock(PG_locked, &page->flags); 430 372 } 431 373 432 - /* Interrupts must be disabled (for the fallback code to work right) */ 374 + static __always_inline void slab_lock(struct page *page, unsigned long *flags) 375 + { 376 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) 377 + local_irq_save(*flags); 378 + __slab_lock(page); 379 + } 380 + 381 + static __always_inline void slab_unlock(struct page *page, unsigned long *flags) 382 + { 383 + __slab_unlock(page); 384 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) 385 + local_irq_restore(*flags); 386 + } 387 + 388 + /* 389 + * Interrupts must be disabled (for the fallback code to work right), typically 390 + * by an _irqsave() lock variant. 
Except on PREEMPT_RT where locks are different 391 + * so we disable interrupts as part of slab_[un]lock(). 392 + */ 433 393 static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page, 434 394 void *freelist_old, unsigned long counters_old, 435 395 void *freelist_new, unsigned long counters_new, 436 396 const char *n) 437 397 { 438 - VM_BUG_ON(!irqs_disabled()); 398 + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) 399 + lockdep_assert_irqs_disabled(); 439 400 #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ 440 401 defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) 441 402 if (s->flags & __CMPXCHG_DOUBLE) { ··· 465 388 } else 466 389 #endif 467 390 { 468 - slab_lock(page); 391 + /* init to 0 to prevent spurious warnings */ 392 + unsigned long flags = 0; 393 + 394 + slab_lock(page, &flags); 469 395 if (page->freelist == freelist_old && 470 396 page->counters == counters_old) { 471 397 page->freelist = freelist_new; 472 398 page->counters = counters_new; 473 - slab_unlock(page); 399 + slab_unlock(page, &flags); 474 400 return true; 475 401 } 476 - slab_unlock(page); 402 + slab_unlock(page, &flags); 477 403 } 478 404 479 405 cpu_relax(); ··· 507 427 unsigned long flags; 508 428 509 429 local_irq_save(flags); 510 - slab_lock(page); 430 + __slab_lock(page); 511 431 if (page->freelist == freelist_old && 512 432 page->counters == counters_old) { 513 433 page->freelist = freelist_new; 514 434 page->counters = counters_new; 515 - slab_unlock(page); 435 + __slab_unlock(page); 516 436 local_irq_restore(flags); 517 437 return true; 518 438 } 519 - slab_unlock(page); 439 + __slab_unlock(page); 520 440 local_irq_restore(flags); 521 441 } 522 442 ··· 532 452 533 453 #ifdef CONFIG_SLUB_DEBUG 534 454 static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)]; 535 - static DEFINE_SPINLOCK(object_map_lock); 455 + static DEFINE_RAW_SPINLOCK(object_map_lock); 456 + 457 + static void __fill_map(unsigned long *obj_map, struct kmem_cache *s, 458 + struct page *page) 459 + { 460 + void 
*addr = page_address(page); 461 + void *p; 462 + 463 + bitmap_zero(obj_map, page->objects); 464 + 465 + for (p = page->freelist; p; p = get_freepointer(s, p)) 466 + set_bit(__obj_to_index(s, addr, p), obj_map); 467 + } 536 468 537 469 #if IS_ENABLED(CONFIG_KUNIT) 538 470 static bool slab_add_kunit_errors(void) ··· 575 483 static unsigned long *get_map(struct kmem_cache *s, struct page *page) 576 484 __acquires(&object_map_lock) 577 485 { 578 - void *p; 579 - void *addr = page_address(page); 580 - 581 486 VM_BUG_ON(!irqs_disabled()); 582 487 583 - spin_lock(&object_map_lock); 488 + raw_spin_lock(&object_map_lock); 584 489 585 - bitmap_zero(object_map, page->objects); 586 - 587 - for (p = page->freelist; p; p = get_freepointer(s, p)) 588 - set_bit(__obj_to_index(s, addr, p), object_map); 490 + __fill_map(object_map, s, page); 589 491 590 492 return object_map; 591 493 } ··· 587 501 static void put_map(unsigned long *map) __releases(&object_map_lock) 588 502 { 589 503 VM_BUG_ON(map != object_map); 590 - spin_unlock(&object_map_lock); 504 + raw_spin_unlock(&object_map_lock); 591 505 } 592 506 593 507 static inline unsigned int size_from_object(struct kmem_cache *s) ··· 1089 1003 { 1090 1004 int maxobj; 1091 1005 1092 - VM_BUG_ON(!irqs_disabled()); 1093 - 1094 1006 if (!PageSlab(page)) { 1095 1007 slab_err(s, page, "Not a valid slab page"); 1096 1008 return 0; ··· 1349 1265 struct kmem_cache_node *n = get_node(s, page_to_nid(page)); 1350 1266 void *object = head; 1351 1267 int cnt = 0; 1352 - unsigned long flags; 1268 + unsigned long flags, flags2; 1353 1269 int ret = 0; 1354 1270 1355 1271 spin_lock_irqsave(&n->list_lock, flags); 1356 - slab_lock(page); 1272 + slab_lock(page, &flags2); 1357 1273 1358 1274 if (s->flags & SLAB_CONSISTENCY_CHECKS) { 1359 1275 if (!check_slab(s, page)) ··· 1386 1302 slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n", 1387 1303 bulk_cnt, cnt); 1388 1304 1389 - slab_unlock(page); 1305 + slab_unlock(page, &flags2); 1390 1306 
spin_unlock_irqrestore(&n->list_lock, flags); 1391 1307 if (!ret) 1392 1308 slab_fix(s, "Object at 0x%p not freed", object); ··· 1669 1585 { 1670 1586 kmemleak_free_recursive(x, s->flags); 1671 1587 1672 - /* 1673 - * Trouble is that we may no longer disable interrupts in the fast path 1674 - * So in order to make the debug calls that expect irqs to be 1675 - * disabled we need to disable interrupts temporarily. 1676 - */ 1677 - #ifdef CONFIG_LOCKDEP 1678 - { 1679 - unsigned long flags; 1588 + debug_check_no_locks_freed(x, s->object_size); 1680 1589 1681 - local_irq_save(flags); 1682 - debug_check_no_locks_freed(x, s->object_size); 1683 - local_irq_restore(flags); 1684 - } 1685 - #endif 1686 1590 if (!(s->flags & SLAB_DEBUG_OBJECTS)) 1687 1591 debug_check_no_obj_freed(x, s->object_size); 1688 1592 ··· 1887 1815 1888 1816 flags &= gfp_allowed_mask; 1889 1817 1890 - if (gfpflags_allow_blocking(flags)) 1891 - local_irq_enable(); 1892 - 1893 1818 flags |= s->allocflags; 1894 1819 1895 1820 /* ··· 1945 1876 page->frozen = 1; 1946 1877 1947 1878 out: 1948 - if (gfpflags_allow_blocking(flags)) 1949 - local_irq_disable(); 1950 1879 if (!page) 1951 1880 return NULL; 1952 1881 ··· 1957 1890 { 1958 1891 if (unlikely(flags & GFP_SLAB_BUG_MASK)) 1959 1892 flags = kmalloc_fix_flags(flags); 1893 + 1894 + WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO)); 1960 1895 1961 1896 return allocate_slab(s, 1962 1897 flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node); ··· 2083 2014 return freelist; 2084 2015 } 2085 2016 2017 + #ifdef CONFIG_SLUB_CPU_PARTIAL 2086 2018 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain); 2019 + #else 2020 + static inline void put_cpu_partial(struct kmem_cache *s, struct page *page, 2021 + int drain) { } 2022 + #endif 2087 2023 static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags); 2088 2024 2089 2025 /* 2090 2026 * Try to allocate a partial slab from a specific node. 
2091 2027 */ 2092 2028 static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n, 2093 - struct kmem_cache_cpu *c, gfp_t flags) 2029 + struct page **ret_page, gfp_t gfpflags) 2094 2030 { 2095 2031 struct page *page, *page2; 2096 2032 void *object = NULL; 2097 2033 unsigned int available = 0; 2034 + unsigned long flags; 2098 2035 int objects; 2099 2036 2100 2037 /* ··· 2112 2037 if (!n || !n->nr_partial) 2113 2038 return NULL; 2114 2039 2115 - spin_lock(&n->list_lock); 2040 + spin_lock_irqsave(&n->list_lock, flags); 2116 2041 list_for_each_entry_safe(page, page2, &n->partial, slab_list) { 2117 2042 void *t; 2118 2043 2119 - if (!pfmemalloc_match(page, flags)) 2044 + if (!pfmemalloc_match(page, gfpflags)) 2120 2045 continue; 2121 2046 2122 2047 t = acquire_slab(s, n, page, object == NULL, &objects); ··· 2125 2050 2126 2051 available += objects; 2127 2052 if (!object) { 2128 - c->page = page; 2053 + *ret_page = page; 2129 2054 stat(s, ALLOC_FROM_PARTIAL); 2130 2055 object = t; 2131 2056 } else { ··· 2137 2062 break; 2138 2063 2139 2064 } 2140 - spin_unlock(&n->list_lock); 2065 + spin_unlock_irqrestore(&n->list_lock, flags); 2141 2066 return object; 2142 2067 } 2143 2068 ··· 2145 2070 * Get a page from somewhere. Search in increasing NUMA distances. 2146 2071 */ 2147 2072 static void *get_any_partial(struct kmem_cache *s, gfp_t flags, 2148 - struct kmem_cache_cpu *c) 2073 + struct page **ret_page) 2149 2074 { 2150 2075 #ifdef CONFIG_NUMA 2151 2076 struct zonelist *zonelist; ··· 2187 2112 2188 2113 if (n && cpuset_zone_allowed(zone, flags) && 2189 2114 n->nr_partial > s->min_partial) { 2190 - object = get_partial_node(s, n, c, flags); 2115 + object = get_partial_node(s, n, ret_page, flags); 2191 2116 if (object) { 2192 2117 /* 2193 2118 * Don't check read_mems_allowed_retry() ··· 2209 2134 * Get a partial page, lock it and return it. 
2210 2135 */ 2211 2136 static void *get_partial(struct kmem_cache *s, gfp_t flags, int node, 2212 - struct kmem_cache_cpu *c) 2137 + struct page **ret_page) 2213 2138 { 2214 2139 void *object; 2215 2140 int searchnode = node; ··· 2217 2142 if (node == NUMA_NO_NODE) 2218 2143 searchnode = numa_mem_id(); 2219 2144 2220 - object = get_partial_node(s, get_node(s, searchnode), c, flags); 2145 + object = get_partial_node(s, get_node(s, searchnode), ret_page, flags); 2221 2146 if (object || node != NUMA_NO_NODE) 2222 2147 return object; 2223 2148 2224 - return get_any_partial(s, flags, c); 2149 + return get_any_partial(s, flags, ret_page); 2225 2150 } 2226 2151 2227 2152 #ifdef CONFIG_PREEMPTION ··· 2288 2213 static void init_kmem_cache_cpus(struct kmem_cache *s) 2289 2214 { 2290 2215 int cpu; 2216 + struct kmem_cache_cpu *c; 2291 2217 2292 - for_each_possible_cpu(cpu) 2293 - per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu); 2218 + for_each_possible_cpu(cpu) { 2219 + c = per_cpu_ptr(s->cpu_slab, cpu); 2220 + local_lock_init(&c->lock); 2221 + c->tid = init_tid(cpu); 2222 + } 2294 2223 } 2295 2224 2296 2225 /* 2297 - * Remove the cpu slab 2226 + * Finishes removing the cpu slab. Merges cpu's freelist with page's freelist, 2227 + * unfreezes the slabs and puts it on the proper list. 2228 + * Assumes the slab has been already safely taken away from kmem_cache_cpu 2229 + * by the caller. 
2298 2230 */ 2299 2231 static void deactivate_slab(struct kmem_cache *s, struct page *page, 2300 - void *freelist, struct kmem_cache_cpu *c) 2232 + void *freelist) 2301 2233 { 2302 2234 enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE }; 2303 2235 struct kmem_cache_node *n = get_node(s, page_to_nid(page)); ··· 2312 2230 enum slab_modes l = M_NONE, m = M_NONE; 2313 2231 void *nextfree, *freelist_iter, *freelist_tail; 2314 2232 int tail = DEACTIVATE_TO_HEAD; 2233 + unsigned long flags = 0; 2315 2234 struct page new; 2316 2235 struct page old; 2317 2236 ··· 2388 2305 * that acquire_slab() will see a slab page that 2389 2306 * is frozen 2390 2307 */ 2391 - spin_lock(&n->list_lock); 2308 + spin_lock_irqsave(&n->list_lock, flags); 2392 2309 } 2393 2310 } else { 2394 2311 m = M_FULL; ··· 2399 2316 * slabs from diagnostic functions will not see 2400 2317 * any frozen slabs. 2401 2318 */ 2402 - spin_lock(&n->list_lock); 2319 + spin_lock_irqsave(&n->list_lock, flags); 2403 2320 } 2404 2321 } 2405 2322 ··· 2416 2333 } 2417 2334 2418 2335 l = m; 2419 - if (!__cmpxchg_double_slab(s, page, 2336 + if (!cmpxchg_double_slab(s, page, 2420 2337 old.freelist, old.counters, 2421 2338 new.freelist, new.counters, 2422 2339 "unfreezing slab")) 2423 2340 goto redo; 2424 2341 2425 2342 if (lock) 2426 - spin_unlock(&n->list_lock); 2343 + spin_unlock_irqrestore(&n->list_lock, flags); 2427 2344 2428 2345 if (m == M_PARTIAL) 2429 2346 stat(s, tail); ··· 2434 2351 discard_slab(s, page); 2435 2352 stat(s, FREE_SLAB); 2436 2353 } 2437 - 2438 - c->page = NULL; 2439 - c->freelist = NULL; 2440 2354 } 2441 2355 2442 - /* 2443 - * Unfreeze all the cpu partial slabs. 2444 - * 2445 - * This function must be called with interrupts disabled 2446 - * for the cpu using c (or some other guarantee must be there 2447 - * to guarantee no concurrent accesses). 
2448 - */ 2449 - static void unfreeze_partials(struct kmem_cache *s, 2450 - struct kmem_cache_cpu *c) 2451 - { 2452 2356 #ifdef CONFIG_SLUB_CPU_PARTIAL 2357 + static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page) 2358 + { 2453 2359 struct kmem_cache_node *n = NULL, *n2 = NULL; 2454 2360 struct page *page, *discard_page = NULL; 2361 + unsigned long flags = 0; 2455 2362 2456 - while ((page = slub_percpu_partial(c))) { 2363 + while (partial_page) { 2457 2364 struct page new; 2458 2365 struct page old; 2459 2366 2460 - slub_set_percpu_partial(c, page); 2367 + page = partial_page; 2368 + partial_page = page->next; 2461 2369 2462 2370 n2 = get_node(s, page_to_nid(page)); 2463 2371 if (n != n2) { 2464 2372 if (n) 2465 - spin_unlock(&n->list_lock); 2373 + spin_unlock_irqrestore(&n->list_lock, flags); 2466 2374 2467 2375 n = n2; 2468 - spin_lock(&n->list_lock); 2376 + spin_lock_irqsave(&n->list_lock, flags); 2469 2377 } 2470 2378 2471 2379 do { ··· 2485 2411 } 2486 2412 2487 2413 if (n) 2488 - spin_unlock(&n->list_lock); 2414 + spin_unlock_irqrestore(&n->list_lock, flags); 2489 2415 2490 2416 while (discard_page) { 2491 2417 page = discard_page; ··· 2495 2421 discard_slab(s, page); 2496 2422 stat(s, FREE_SLAB); 2497 2423 } 2498 - #endif /* CONFIG_SLUB_CPU_PARTIAL */ 2424 + } 2425 + 2426 + /* 2427 + * Unfreeze all the cpu partial slabs. 
2428 + */ 2429 + static void unfreeze_partials(struct kmem_cache *s) 2430 + { 2431 + struct page *partial_page; 2432 + unsigned long flags; 2433 + 2434 + local_lock_irqsave(&s->cpu_slab->lock, flags); 2435 + partial_page = this_cpu_read(s->cpu_slab->partial); 2436 + this_cpu_write(s->cpu_slab->partial, NULL); 2437 + local_unlock_irqrestore(&s->cpu_slab->lock, flags); 2438 + 2439 + if (partial_page) 2440 + __unfreeze_partials(s, partial_page); 2441 + } 2442 + 2443 + static void unfreeze_partials_cpu(struct kmem_cache *s, 2444 + struct kmem_cache_cpu *c) 2445 + { 2446 + struct page *partial_page; 2447 + 2448 + partial_page = slub_percpu_partial(c); 2449 + c->partial = NULL; 2450 + 2451 + if (partial_page) 2452 + __unfreeze_partials(s, partial_page); 2499 2453 } 2500 2454 2501 2455 /* ··· 2535 2433 */ 2536 2434 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) 2537 2435 { 2538 - #ifdef CONFIG_SLUB_CPU_PARTIAL 2539 2436 struct page *oldpage; 2540 - int pages; 2541 - int pobjects; 2437 + struct page *page_to_unfreeze = NULL; 2438 + unsigned long flags; 2439 + int pages = 0; 2440 + int pobjects = 0; 2542 2441 2543 - preempt_disable(); 2544 - do { 2545 - pages = 0; 2546 - pobjects = 0; 2547 - oldpage = this_cpu_read(s->cpu_slab->partial); 2442 + local_lock_irqsave(&s->cpu_slab->lock, flags); 2548 2443 2549 - if (oldpage) { 2444 + oldpage = this_cpu_read(s->cpu_slab->partial); 2445 + 2446 + if (oldpage) { 2447 + if (drain && oldpage->pobjects > slub_cpu_partial(s)) { 2448 + /* 2449 + * Partial array is full. Move the existing set to the 2450 + * per node partial list. Postpone the actual unfreezing 2451 + * outside of the critical section. 2452 + */ 2453 + page_to_unfreeze = oldpage; 2454 + oldpage = NULL; 2455 + } else { 2550 2456 pobjects = oldpage->pobjects; 2551 2457 pages = oldpage->pages; 2552 - if (drain && pobjects > slub_cpu_partial(s)) { 2553 - unsigned long flags; 2554 - /* 2555 - * partial array is full. 
Move the existing 2556 - * set to the per node partial list. 2557 - */ 2558 - local_irq_save(flags); 2559 - unfreeze_partials(s, this_cpu_ptr(s->cpu_slab)); 2560 - local_irq_restore(flags); 2561 - oldpage = NULL; 2562 - pobjects = 0; 2563 - pages = 0; 2564 - stat(s, CPU_PARTIAL_DRAIN); 2565 - } 2566 2458 } 2567 - 2568 - pages++; 2569 - pobjects += page->objects - page->inuse; 2570 - 2571 - page->pages = pages; 2572 - page->pobjects = pobjects; 2573 - page->next = oldpage; 2574 - 2575 - } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) 2576 - != oldpage); 2577 - if (unlikely(!slub_cpu_partial(s))) { 2578 - unsigned long flags; 2579 - 2580 - local_irq_save(flags); 2581 - unfreeze_partials(s, this_cpu_ptr(s->cpu_slab)); 2582 - local_irq_restore(flags); 2583 2459 } 2584 - preempt_enable(); 2585 - #endif /* CONFIG_SLUB_CPU_PARTIAL */ 2460 + 2461 + pages++; 2462 + pobjects += page->objects - page->inuse; 2463 + 2464 + page->pages = pages; 2465 + page->pobjects = pobjects; 2466 + page->next = oldpage; 2467 + 2468 + this_cpu_write(s->cpu_slab->partial, page); 2469 + 2470 + local_unlock_irqrestore(&s->cpu_slab->lock, flags); 2471 + 2472 + if (page_to_unfreeze) { 2473 + __unfreeze_partials(s, page_to_unfreeze); 2474 + stat(s, CPU_PARTIAL_DRAIN); 2475 + } 2586 2476 } 2477 + 2478 + #else /* CONFIG_SLUB_CPU_PARTIAL */ 2479 + 2480 + static inline void unfreeze_partials(struct kmem_cache *s) { } 2481 + static inline void unfreeze_partials_cpu(struct kmem_cache *s, 2482 + struct kmem_cache_cpu *c) { } 2483 + 2484 + #endif /* CONFIG_SLUB_CPU_PARTIAL */ 2587 2485 2588 2486 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) 2589 2487 { 2590 - stat(s, CPUSLAB_FLUSH); 2591 - deactivate_slab(s, c->page, c->freelist, c); 2488 + unsigned long flags; 2489 + struct page *page; 2490 + void *freelist; 2592 2491 2492 + local_lock_irqsave(&s->cpu_slab->lock, flags); 2493 + 2494 + page = c->page; 2495 + freelist = c->freelist; 2496 + 2497 + c->page = NULL; 
2498 + c->freelist = NULL; 2593 2499 c->tid = next_tid(c->tid); 2500 + 2501 + local_unlock_irqrestore(&s->cpu_slab->lock, flags); 2502 + 2503 + if (page) { 2504 + deactivate_slab(s, page, freelist); 2505 + stat(s, CPUSLAB_FLUSH); 2506 + } 2594 2507 } 2508 + 2509 + static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) 2510 + { 2511 + struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu); 2512 + void *freelist = c->freelist; 2513 + struct page *page = c->page; 2514 + 2515 + c->page = NULL; 2516 + c->freelist = NULL; 2517 + c->tid = next_tid(c->tid); 2518 + 2519 + if (page) { 2520 + deactivate_slab(s, page, freelist); 2521 + stat(s, CPUSLAB_FLUSH); 2522 + } 2523 + 2524 + unfreeze_partials_cpu(s, c); 2525 + } 2526 + 2527 + struct slub_flush_work { 2528 + struct work_struct work; 2529 + struct kmem_cache *s; 2530 + bool skip; 2531 + }; 2595 2532 2596 2533 /* 2597 2534 * Flush cpu slab. 2598 2535 * 2599 - * Called from IPI handler with interrupts disabled. 2536 + * Called from CPU work handler with migration disabled. 
  */
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
+static void flush_cpu_slab(struct work_struct *w)
 {
-	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+	struct kmem_cache *s;
+	struct kmem_cache_cpu *c;
+	struct slub_flush_work *sfw;
+
+	sfw = container_of(w, struct slub_flush_work, work);
+
+	s = sfw->s;
+	c = this_cpu_ptr(s->cpu_slab);

 	if (c->page)
 		flush_slab(s, c);

-	unfreeze_partials(s, c);
+	unfreeze_partials(s);
 }

-static void flush_cpu_slab(void *d)
+static bool has_cpu_slab(int cpu, struct kmem_cache *s)
 {
-	struct kmem_cache *s = d;
-
-	__flush_cpu_slab(s, smp_processor_id());
-}
-
-static bool has_cpu_slab(int cpu, void *info)
-{
-	struct kmem_cache *s = info;
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

 	return c->page || slub_percpu_partial(c);
 }

+static DEFINE_MUTEX(flush_lock);
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
+
+static void flush_all_cpus_locked(struct kmem_cache *s)
+{
+	struct slub_flush_work *sfw;
+	unsigned int cpu;
+
+	lockdep_assert_cpus_held();
+	mutex_lock(&flush_lock);
+
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
+		if (!has_cpu_slab(cpu, s)) {
+			sfw->skip = true;
+			continue;
+		}
+		INIT_WORK(&sfw->work, flush_cpu_slab);
+		sfw->skip = false;
+		sfw->s = s;
+		schedule_work_on(cpu, &sfw->work);
+	}
+
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
+		if (sfw->skip)
+			continue;
+		flush_work(&sfw->work);
+	}
+
+	mutex_unlock(&flush_lock);
+}
+
 static void flush_all(struct kmem_cache *s)
 {
-	on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
+	cpus_read_lock();
+	flush_all_cpus_locked(s);
+	cpus_read_unlock();
 }

 /*
···
 static int slub_cpu_dead(unsigned int cpu)
 {
 	struct kmem_cache *s;
-	unsigned long flags;

 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list) {
-		local_irq_save(flags);
+	list_for_each_entry(s, &slab_caches, list)
 		__flush_cpu_slab(s, cpu);
-		local_irq_restore(flags);
-	}
 	mutex_unlock(&slab_mutex);
 	return 0;
 }
···
 #endif
 }

-static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
-			int node, struct kmem_cache_cpu **pc)
-{
-	void *freelist;
-	struct kmem_cache_cpu *c = *pc;
-	struct page *page;
-
-	WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
-
-	freelist = get_partial(s, flags, node, c);
-
-	if (freelist)
-		return freelist;
-
-	page = new_slab(s, flags, node);
-	if (page) {
-		c = raw_cpu_ptr(s->cpu_slab);
-		if (c->page)
-			flush_slab(s, c);
-
-		/*
-		 * No other reference to the page yet so we can
-		 * muck around with it freely without cmpxchg
-		 */
-		freelist = page->freelist;
-		page->freelist = NULL;
-
-		stat(s, ALLOC_SLAB);
-		c->page = page;
-		*pc = c;
-	}
-
-	return freelist;
-}
-
 static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
 {
 	if (unlikely(PageSlabPfmemalloc(page)))
+		return gfp_pfmemalloc_allowed(gfpflags);
+
+	return true;
+}
+
+/*
+ * A variant of pfmemalloc_match() that tests page flags without asserting
+ * PageSlab. Intended for opportunistic checks before taking a lock and
+ * rechecking that nobody else freed the page under us.
+ */
+static inline bool pfmemalloc_match_unsafe(struct page *page, gfp_t gfpflags)
+{
+	if (unlikely(__PageSlabPfmemalloc(page)))
 		return gfp_pfmemalloc_allowed(gfpflags);

 	return true;
···
  * The page is still frozen if the return value is not NULL.
  *
  * If this function returns NULL then the page has been unfrozen.
- *
- * This function must be called with interrupt disabled.
  */
 static inline void *get_freelist(struct kmem_cache *s, struct page *page)
 {
 	struct page new;
 	unsigned long counters;
 	void *freelist;
+
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));

 	do {
 		freelist = page->freelist;
···
  * we need to allocate a new slab. This is the slowest path since it involves
  * a call to the page allocator and the setup of a new slab.
  *
- * Version of __slab_alloc to use when we know that interrupts are
+ * Version of __slab_alloc to use when we know that preemption is
  * already disabled (which is the case for bulk allocation).
  */
 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
···
 {
 	void *freelist;
 	struct page *page;
+	unsigned long flags;

 	stat(s, ALLOC_SLOWPATH);

-	page = c->page;
+reread_page:
+
+	page = READ_ONCE(c->page);
 	if (!page) {
 		/*
 		 * if the node is not online or has no normal memory, just
···
 			goto redo;
 		} else {
 			stat(s, ALLOC_NODE_MISMATCH);
-			deactivate_slab(s, page, c->freelist, c);
-			goto new_slab;
+			goto deactivate_slab;
 		}
 	}
···
 	 * PFMEMALLOC but right now, we are losing the pfmemalloc
 	 * information when the page leaves the per-cpu allocator
 	 */
-	if (unlikely(!pfmemalloc_match(page, gfpflags))) {
-		deactivate_slab(s, page, c->freelist, c);
-		goto new_slab;
-	}
+	if (unlikely(!pfmemalloc_match_unsafe(page, gfpflags)))
+		goto deactivate_slab;

-	/* must check again c->freelist in case of cpu migration or IRQ */
+	/* must check again c->page in case we got preempted and it changed */
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	if (unlikely(page != c->page)) {
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		goto reread_page;
+	}
 	freelist = c->freelist;
 	if (freelist)
 		goto load_freelist;
···

 	if (!freelist) {
 		c->page = NULL;
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
 	}
···
 	stat(s, ALLOC_REFILL);

 load_freelist:
+
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
+
 	/*
 	 * freelist is pointing to the list of objects to be used.
 	 * page is pointing to the page from which the objects are obtained.
···
 	VM_BUG_ON(!c->page->frozen);
 	c->freelist = get_freepointer(s, freelist);
 	c->tid = next_tid(c->tid);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 	return freelist;
+
+deactivate_slab:
+
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	if (page != c->page) {
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		goto reread_page;
+	}
+	freelist = c->freelist;
+	c->page = NULL;
+	c->freelist = NULL;
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+	deactivate_slab(s, page, freelist);

 new_slab:

 	if (slub_percpu_partial(c)) {
+		local_lock_irqsave(&s->cpu_slab->lock, flags);
+		if (unlikely(c->page)) {
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+			goto reread_page;
+		}
+		if (unlikely(!slub_percpu_partial(c))) {
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+			/* we were preempted and partial list got empty */
+			goto new_objects;
+		}
+
 		page = c->page = slub_percpu_partial(c);
 		slub_set_percpu_partial(c, page);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		stat(s, CPU_PARTIAL_ALLOC);
 		goto redo;
 	}

-	freelist = new_slab_objects(s, gfpflags, node, &c);
+new_objects:

-	if (unlikely(!freelist)) {
+	freelist = get_partial(s, gfpflags, node, &page);
+	if (freelist)
+		goto check_new_page;
+
+	slub_put_cpu_ptr(s->cpu_slab);
+	page = new_slab(s, gfpflags, node);
+	c = slub_get_cpu_ptr(s->cpu_slab);
+
+	if (unlikely(!page)) {
 		slab_out_of_memory(s, gfpflags, node);
 		return NULL;
 	}

-	page = c->page;
-	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
-		goto load_freelist;
+	/*
+	 * No other reference to the page yet so we can
+	 * muck around with it freely without cmpxchg
+	 */
+	freelist = page->freelist;
+	page->freelist = NULL;

-	/* Only entered in the debug case */
-	if (kmem_cache_debug(s) &&
-			!alloc_debug_processing(s, page, freelist, addr))
-		goto new_slab;	/* Slab failed checks. Next slab needed */
+	stat(s, ALLOC_SLAB);

-	deactivate_slab(s, page, get_freepointer(s, freelist), c);
+check_new_page:
+
+	if (kmem_cache_debug(s)) {
+		if (!alloc_debug_processing(s, page, freelist, addr)) {
+			/* Slab failed checks. Next slab needed */
+			goto new_slab;
+		} else {
+			/*
+			 * For debug case, we don't load freelist so that all
+			 * allocations go through alloc_debug_processing()
+			 */
+			goto return_single;
+		}
+	}
+
+	if (unlikely(!pfmemalloc_match(page, gfpflags)))
+		/*
+		 * For !pfmemalloc_match() case we don't load freelist so that
+		 * we don't make further mismatched allocations easier.
+		 */
+		goto return_single;
+
+retry_load_page:
+
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	if (unlikely(c->page)) {
+		void *flush_freelist = c->freelist;
+		struct page *flush_page = c->page;
+
+		c->page = NULL;
+		c->freelist = NULL;
+		c->tid = next_tid(c->tid);
+
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+
+		deactivate_slab(s, flush_page, flush_freelist);
+
+		stat(s, CPUSLAB_FLUSH);
+
+		goto retry_load_page;
+	}
+	c->page = page;
+
+	goto load_freelist;
+
+return_single:
+
+	deactivate_slab(s, page, get_freepointer(s, freelist));
 	return freelist;
 }

 /*
- * Another one that disabled interrupt and compensates for possible
- * cpu changes by refetching the per cpu area pointer.
+ * A wrapper for ___slab_alloc() for contexts where preemption is not yet
+ * disabled. Compensates for possible cpu changes by refetching the per cpu area
+ * pointer.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			  unsigned long addr, struct kmem_cache_cpu *c)
 {
 	void *p;
-	unsigned long flags;

-	local_irq_save(flags);
-#ifdef CONFIG_PREEMPTION
+#ifdef CONFIG_PREEMPT_COUNT
 	/*
 	 * We may have been preempted and rescheduled on a different
-	 * cpu before disabling interrupts. Need to reload cpu area
+	 * cpu before disabling preemption. Need to reload cpu area
 	 * pointer.
 	 */
-	c = this_cpu_ptr(s->cpu_slab);
+	c = slub_get_cpu_ptr(s->cpu_slab);
 #endif

 	p = ___slab_alloc(s, gfpflags, node, addr, c);
-	local_irq_restore(flags);
+#ifdef CONFIG_PREEMPT_COUNT
+	slub_put_cpu_ptr(s->cpu_slab);
+#endif
 	return p;
 }

···
 	 * reading from one cpu area. That does not matter as long
 	 * as we end up on the original cpu again when doing the cmpxchg.
 	 *
-	 * We should guarantee that tid and kmem_cache are retrieved on
-	 * the same cpu. It could be different if CONFIG_PREEMPTION so we need
-	 * to check if it is matched or not.
+	 * We must guarantee that tid and kmem_cache_cpu are retrieved on the
+	 * same cpu. We read first the kmem_cache_cpu pointer and use it to read
+	 * the tid. If we are preempted and switched to another cpu between the
+	 * two reads, it's OK as the two are still associated with the same cpu
+	 * and cmpxchg later will validate the cpu.
	 */
-	do {
-		tid = this_cpu_read(s->cpu_slab->tid);
-		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPTION) &&
-		 unlikely(tid != READ_ONCE(c->tid)));
+	c = raw_cpu_ptr(s->cpu_slab);
+	tid = READ_ONCE(c->tid);

 	/*
 	 * Irqless object alloc/free algorithm used here depends on sequence
···

 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !page || !node_match(page, node))) {
+	/*
+	 * We cannot use the lockless fastpath on PREEMPT_RT because if a
+	 * slowpath has taken the local_lock_irqsave(), it is not protected
+	 * against a fast path operation in an irq handler. So we need to take
+	 * the slow path which uses local_lock. It is still relatively fast if
+	 * there is a suitable cpu freelist.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
+	    unlikely(!object || !page || !node_match(page, node))) {
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
···
 	 * data is retrieved via this pointer. If we are on the same cpu
 	 * during the cmpxchg then the free will succeed.
	 */
-	do {
-		tid = this_cpu_read(s->cpu_slab->tid);
-		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPTION) &&
-		 unlikely(tid != READ_ONCE(c->tid)));
+	c = raw_cpu_ptr(s->cpu_slab);
+	tid = READ_ONCE(c->tid);

 	/* Same with comment on barrier() in slab_alloc_node() */
 	barrier();

 	if (likely(page == c->page)) {
+#ifndef CONFIG_PREEMPT_RT
 		void **freelist = READ_ONCE(c->freelist);

 		set_freepointer(s, tail_obj, freelist);
···
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
+#else /* CONFIG_PREEMPT_RT */
+		/*
+		 * We cannot use the lockless fastpath on PREEMPT_RT because if
+		 * a slowpath has taken the local_lock_irqsave(), it is not
+		 * protected against a fast path operation in an irq handler. So
+		 * we need to take the local_lock. We shouldn't simply defer to
+		 * __slab_free() as that wouldn't use the cpu freelist at all.
+		 */
+		void **freelist;
+
+		local_lock(&s->cpu_slab->lock);
+		c = this_cpu_ptr(s->cpu_slab);
+		if (unlikely(page != c->page)) {
+			local_unlock(&s->cpu_slab->lock);
+			goto redo;
+		}
+		tid = c->tid;
+		freelist = c->freelist;
+
+		set_freepointer(s, tail_obj, freelist);
+		c->freelist = head;
+		c->tid = next_tid(tid);
+
+		local_unlock(&s->cpu_slab->lock);
+#endif
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, head, tail_obj, cnt, addr);

···
 	 * IRQs, which protects against PREEMPT and interrupts
 	 * handlers invoking normal fastpath.
	 */
-	local_irq_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+	c = slub_get_cpu_ptr(s->cpu_slab);
+	local_lock_irq(&s->cpu_slab->lock);

 	for (i = 0; i < size; i++) {
 		void *object = kfence_alloc(s, s->object_size, flags);
···
 			 */
 			c->tid = next_tid(c->tid);

+			local_unlock_irq(&s->cpu_slab->lock);
+
 			/*
 			 * Invoking slow path likely have side-effect
 			 * of re-populating per CPU c->freelist
···
 			c = this_cpu_ptr(s->cpu_slab);
 			maybe_wipe_obj_freeptr(s, p[i]);

+			local_lock_irq(&s->cpu_slab->lock);
+
 			continue; /* goto for-loop */
 		}
 		c->freelist = get_freepointer(s, object);
···
 		maybe_wipe_obj_freeptr(s, p[i]);
 	}
 	c->tid = next_tid(c->tid);
-	local_irq_enable();
+	local_unlock_irq(&s->cpu_slab->lock);
+	slub_put_cpu_ptr(s->cpu_slab);

 	/*
 	 * memcg and kmem_cache debug support and memory initialization.
···
 				slab_want_init_on_alloc(flags, s));
 	return i;
 error:
-	local_irq_enable();
+	slub_put_cpu_ptr(s->cpu_slab);
 	slab_post_alloc_hook(s, objcg, flags, i, p, false);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
···
 {
 #ifdef CONFIG_SLUB_DEBUG
 	void *addr = page_address(page);
+	unsigned long flags;
 	unsigned long *map;
 	void *p;

 	slab_err(s, page, text, s->name);
-	slab_lock(page);
+	slab_lock(page, &flags);

 	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects) {
···
 		}
 	}
 	put_map(map);
-	slab_unlock(page);
+	slab_unlock(page, &flags);
 #endif
 }

···
 	int node;
 	struct kmem_cache_node *n;

-	flush_all(s);
+	flush_all_cpus_locked(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
 		free_partial(s, n);
···
 * being allocated from last increasing the chance that the last objects
 * are freed in them.
4557 4281 */ 4558 - int __kmem_cache_shrink(struct kmem_cache *s) 4282 + static int __kmem_cache_do_shrink(struct kmem_cache *s) 4559 4283 { 4560 4284 int node; 4561 4285 int i; ··· 4567 4291 unsigned long flags; 4568 4292 int ret = 0; 4569 4293 4570 - flush_all(s); 4571 4294 for_each_kmem_cache_node(s, node, n) { 4572 4295 INIT_LIST_HEAD(&discard); 4573 4296 for (i = 0; i < SHRINK_PROMOTE_MAX; i++) ··· 4616 4341 return ret; 4617 4342 } 4618 4343 4344 + int __kmem_cache_shrink(struct kmem_cache *s) 4345 + { 4346 + flush_all(s); 4347 + return __kmem_cache_do_shrink(s); 4348 + } 4349 + 4619 4350 static int slab_mem_going_offline_callback(void *arg) 4620 4351 { 4621 4352 struct kmem_cache *s; 4622 4353 4623 4354 mutex_lock(&slab_mutex); 4624 - list_for_each_entry(s, &slab_caches, list) 4625 - __kmem_cache_shrink(s); 4355 + list_for_each_entry(s, &slab_caches, list) { 4356 + flush_all_cpus_locked(s); 4357 + __kmem_cache_do_shrink(s); 4358 + } 4626 4359 mutex_unlock(&slab_mutex); 4627 4360 4628 4361 return 0; ··· 4956 4673 #endif 4957 4674 4958 4675 #ifdef CONFIG_SLUB_DEBUG 4959 - static void validate_slab(struct kmem_cache *s, struct page *page) 4676 + static void validate_slab(struct kmem_cache *s, struct page *page, 4677 + unsigned long *obj_map) 4960 4678 { 4961 4679 void *p; 4962 4680 void *addr = page_address(page); 4963 - unsigned long *map; 4681 + unsigned long flags; 4964 4682 4965 - slab_lock(page); 4683 + slab_lock(page, &flags); 4966 4684 4967 4685 if (!check_slab(s, page) || !on_freelist(s, page, NULL)) 4968 4686 goto unlock; 4969 4687 4970 4688 /* Now we know that a valid freelist exists */ 4971 - map = get_map(s, page); 4689 + __fill_map(obj_map, s, page); 4972 4690 for_each_object(p, s, addr, page->objects) { 4973 - u8 val = test_bit(__obj_to_index(s, addr, p), map) ? 4691 + u8 val = test_bit(__obj_to_index(s, addr, p), obj_map) ? 
 			 SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;

 		if (!check_object(s, page, p, val))
 			break;
 	}
-	put_map(map);
 unlock:
-	slab_unlock(page);
+	slab_unlock(page, &flags);
 }

 static int validate_slab_node(struct kmem_cache *s,
-		struct kmem_cache_node *n)
+		struct kmem_cache_node *n, unsigned long *obj_map)
 {
 	unsigned long count = 0;
 	struct page *page;
···
 	spin_lock_irqsave(&n->list_lock, flags);

 	list_for_each_entry(page, &n->partial, slab_list) {
-		validate_slab(s, page);
+		validate_slab(s, page, obj_map);
 		count++;
 	}
 	if (count != n->nr_partial) {
···
 		goto out;

 	list_for_each_entry(page, &n->full, slab_list) {
-		validate_slab(s, page);
+		validate_slab(s, page, obj_map);
 		count++;
 	}
 	if (count != atomic_long_read(&n->nr_slabs)) {
···
 	int node;
 	unsigned long count = 0;
 	struct kmem_cache_node *n;
+	unsigned long *obj_map;
+
+	obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+	if (!obj_map)
+		return -ENOMEM;

 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n)
-		count += validate_slab_node(s, n);
+		count += validate_slab_node(s, n, obj_map);
+
+	bitmap_free(obj_map);

 	return count;
 }
···
 }

 static void process_slab(struct loc_track *t, struct kmem_cache *s,
-		struct page *page, enum track_item alloc)
+		struct page *page, enum track_item alloc,
+		unsigned long *obj_map)
 {
 	void *addr = page_address(page);
 	void *p;
-	unsigned long *map;

-	map = get_map(s, page);
+	__fill_map(obj_map, s, page);
+
 	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(__obj_to_index(s, addr, p), map))
+		if (!test_bit(__obj_to_index(s, addr, p), obj_map))
 			add_location(t, s, get_track(s, p, alloc));
-	put_map(map);
 }
 #endif	/* CONFIG_DEBUG_FS */
 #endif	/* CONFIG_SLUB_DEBUG */
···
 	struct loc_track *t = __seq_open_private(filep, &slab_debugfs_sops,
 						sizeof(struct loc_track));
 	struct kmem_cache *s = file_inode(filep)->i_private;
+	unsigned long *obj_map;
+
+	obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+	if (!obj_map)
+		return -ENOMEM;

 	if (strcmp(filep->f_path.dentry->d_name.name, "alloc_traces") == 0)
 		alloc = TRACK_ALLOC;
 	else
 		alloc = TRACK_FREE;

-	if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL))
+	if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL)) {
+		bitmap_free(obj_map);
 		return -ENOMEM;
-
-	/* Push back cpu slabs */
-	flush_all(s);
+	}

 	for_each_kmem_cache_node(s, node, n) {
 		unsigned long flags;
···

 		spin_lock_irqsave(&n->list_lock, flags);
 		list_for_each_entry(page, &n->partial, slab_list)
-			process_slab(t, s, page, alloc);
+			process_slab(t, s, page, alloc, obj_map);
 		list_for_each_entry(page, &n->full, slab_list)
-			process_slab(t, s, page, alloc);
+			process_slab(t, s, page, alloc, obj_map);
 		spin_unlock_irqrestore(&n->list_lock, flags);
 	}

+	bitmap_free(obj_map);
 	return 0;
 }