Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/slub: enable debugging memory wasting of kmalloc

kmalloc's API family is critical for mm; one of its characteristics is
that it rounds the requested size up to a fixed bucket size (mostly a
power of 2). So when a user requests '2^n + 1' bytes, 2^(n+1) bytes may
actually be allocated, and in the worst case around 50% of the memory
is wasted.

The wastage is not a big issue for requests that get allocated/freed
quickly, but it may cause problems for objects with a longer lifetime.

We hit a kernel boot OOM panic (on v5.10), and the dumped slab info
showed:

[ 26.062145] kmalloc-2k 814056KB 814056KB

Debugging showed a huge number of 'struct iova_magazine' objects, whose
size is 1032 bytes (1024 + 8), so each allocation wastes 1016 bytes.
Though the issue was solved by provisioning the right (bigger) amount
of RAM, it would still be nice to optimize the size (either use a
kmalloc-friendly size or create a dedicated slab for it).

And from the lkml archive, there was another crash-kernel OOM case [1]
back in 2019 which seems related to a similar slab-waste situation, as
the log looks alike:

[ 4.332648] iommu: Adding device 0000:20:02.0 to group 16
[ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
...
[ 4.857565] kmalloc-2048 59164KB 59164KB

The crash kernel only had 256M of memory, and 59M is pretty big there.
(Note: the related code has been changed and optimised in recent
kernels [2]; these logs are just picked to demonstrate the problem.
Also, a patch changing the structure's size to 1024 bytes has since
been merged.)

So add a way to track each kmalloc's memory-waste info, and leverage
the existing SLUB debug framework (specifically SLUB_STORE_USER) to
show the call stack of the original allocation, so that users can
evaluate the waste situation, identify hot spots, and optimize
accordingly for better memory utilization.

The waste info is integrated into the existing interface
'/sys/kernel/debug/slab/kmalloc-xx/alloc_traces'; one example from
'kmalloc-4k' after boot is:

126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
__kmem_cache_alloc_node+0x11f/0x4e0
__kmalloc_node+0x4e/0x140
ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
ixgbe_probe+0x165f/0x1d20 [ixgbe]
local_pci_probe+0x78/0xc0
work_for_cpu_fn+0x26/0x40
...

which means that in the 'kmalloc-4k' slab there are 126 requests of
2240 bytes each that got a 4KB slot (wasting 1856 bytes each and
233856 bytes in total), all from ixgbe_alloc_q_vector().

And when the system starts real workloads, like multiple docker
instances, the waste can be even more severe.

[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/

[Thanks Hyeonggon for pointing out several bugs about sorting/format]
[Thanks Vlastimil for suggesting way to reduce memory usage of
orig_size and keep it only for kmalloc objects]

Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Authored by Feng Tang, committed by Vlastimil Babka
6edf2576 1f04b07d

+142 -50 total

Documentation/mm/slub.rst (+21 -12)
   allocated objects. The output is sorted by frequency of each trace.

   Information in the output:
-  Number of objects, allocating function, minimal/average/maximal jiffies since alloc,
-  pid range of the allocating processes, cpu mask of allocating cpus, and stack trace.
+  Number of objects, allocating function, possible memory wastage of
+  kmalloc objects(total/per-object), minimal/average/maximal jiffies
+  since alloc, pid range of the allocating processes, cpu mask of
+  allocating cpus, numa node mask of origins of memory, and stack trace.

   Example:::

-  1085 populate_error_injection_list+0x97/0x110 age=166678/166680/166682 pid=1 cpus=1::
-      __slab_alloc+0x6d/0x90
-      kmem_cache_alloc_trace+0x2eb/0x300
-      populate_error_injection_list+0x97/0x110
-      init_error_injection+0x1b/0x71
-      do_one_initcall+0x5f/0x2d0
-      kernel_init_freeable+0x26f/0x2d7
-      kernel_init+0xe/0x118
-      ret_from_fork+0x22/0x30
-
+  338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1
+      __kmem_cache_alloc_node+0x11f/0x4e0
+      kmalloc_trace+0x26/0xa0
+      pci_alloc_dev+0x2c/0xa0
+      pci_scan_single_device+0xd2/0x150
+      pci_scan_slot+0xf7/0x2d0
+      pci_scan_child_bus_extend+0x4e/0x360
+      acpi_pci_root_create+0x32e/0x3b0
+      pci_acpi_scan_root+0x2b9/0x2d0
+      acpi_pci_root_add.cold.11+0x110/0xb0a
+      acpi_bus_attach+0x262/0x3f0
+      device_for_each_child+0xb7/0x110
+      acpi_dev_for_each_child+0x77/0xa0
+      acpi_bus_attach+0x108/0x3f0
+      device_for_each_child+0xb7/0x110
+      acpi_dev_for_each_child+0x77/0xa0
+      acpi_bus_attach+0x108/0x3f0

   2. free_traces::
include/linux/slab.h (+2)
 #define SLAB_RED_ZONE		((slab_flags_t __force)0x00000400U)
 /* DEBUG: Poison objects */
 #define SLAB_POISON		((slab_flags_t __force)0x00000800U)
+/* Indicate a kmalloc slab */
+#define SLAB_KMALLOC		((slab_flags_t __force)0x00001000U)
 /* Align objs on cache lines */
 #define SLAB_HWCACHE_ALIGN	((slab_flags_t __force)0x00002000U)
 /* Use GFP_DMA memory */
mm/slab_common.c (+2 -1)
 	if (!s)
 		panic("Out of memory when creating slab %s\n", name);

-	create_boot_cache(s, name, size, flags, useroffset, usersize);
+	create_boot_cache(s, name, size, flags | SLAB_KMALLOC, useroffset,
+			  usersize);
 	kasan_cache_create_kmalloc(s);
 	list_add(&s->list, &slab_caches);
 	s->refcount = 1;
mm/slub.c (+117 -37)
 #endif
 #endif /* CONFIG_SLUB_DEBUG */

+/* Structure holding parameters for get_partial() call chain */
+struct partial_context {
+	struct slab **slab;
+	gfp_t flags;
+	unsigned int orig_size;
+};
+
 static inline bool kmem_cache_debug(struct kmem_cache *s)
 {
 	return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
+}
+
+static inline bool slub_debug_orig_size(struct kmem_cache *s)
+{
+	return (kmem_cache_debug_flags(s, SLAB_STORE_USER) &&
+		(s->flags & SLAB_KMALLOC));
 }

 void *fixup_red_left(struct kmem_cache *s, void *p)
···
 			folio_flags(folio, 0));
 }

+/*
+ * kmalloc caches has fixed sizes (mostly power of 2), and kmalloc() API
+ * family will round up the real request size to these fixed ones, so
+ * there could be an extra area than what is requested. Save the original
+ * request size in the meta data area, for better debug and sanity check.
+ */
+static inline void set_orig_size(struct kmem_cache *s,
+				void *object, unsigned int orig_size)
+{
+	void *p = kasan_reset_tag(object);
+
+	if (!slub_debug_orig_size(s))
+		return;
+
+	p += get_info_end(s);
+	p += sizeof(struct track) * 2;
+
+	*(unsigned int *)p = orig_size;
+}
+
+static inline unsigned int get_orig_size(struct kmem_cache *s, void *object)
+{
+	void *p = kasan_reset_tag(object);
+
+	if (!slub_debug_orig_size(s))
+		return s->object_size;
+
+	p += get_info_end(s);
+	p += sizeof(struct track) * 2;
+
+	return *(unsigned int *)p;
+}
+
 static void slab_bug(struct kmem_cache *s, char *fmt, ...)
 {
 	struct va_format vaf;
···

 	if (s->flags & SLAB_STORE_USER)
 		off += 2 * sizeof(struct track);
+
+	if (slub_debug_orig_size(s))
+		off += sizeof(unsigned int);

 	off += kasan_metadata_size(s);

···
  *
  * A. Free pointer (if we cannot overwrite object on free)
  * B. Tracking data for SLAB_STORE_USER
- * C. Padding to reach required alignment boundary or at minimum
+ * C. Original request size for kmalloc object (SLAB_STORE_USER enabled)
+ * D. Padding to reach required alignment boundary or at minimum
  *	one word if debugging is on to be able to detect writes
  *	before the word boundary.
  *
···
 {
 	unsigned long off = get_info_end(s);	/* The end of info */

-	if (s->flags & SLAB_STORE_USER)
+	if (s->flags & SLAB_STORE_USER) {
 		/* We also have user information there */
 		off += 2 * sizeof(struct track);
+
+		if (s->flags & SLAB_KMALLOC)
+			off += sizeof(unsigned int);
+	}

 	off += kasan_metadata_size(s);

···
 }

 static noinline int alloc_debug_processing(struct kmem_cache *s,
-			struct slab *slab, void *object)
+			struct slab *slab, void *object, int orig_size)
 {
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
 		if (!alloc_consistency_checks(s, slab, object))
···

 	/* Success. Perform special debug activities for allocs */
 	trace(s, slab, object, 1);
+	set_orig_size(s, object, orig_size);
 	init_object(s, object, SLUB_RED_ACTIVE);
 	return 1;

···
 void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}

 static inline int alloc_debug_processing(struct kmem_cache *s,
-	struct slab *slab, void *object) { return 0; }
+	struct slab *slab, void *object, int orig_size) { return 0; }

 static inline void free_debug_processing(
 	struct kmem_cache *s, struct slab *slab,
···
  * it to full list if it was the last free object.
  */
 static void *alloc_single_from_partial(struct kmem_cache *s,
-		struct kmem_cache_node *n, struct slab *slab)
+		struct kmem_cache_node *n, struct slab *slab, int orig_size)
 {
 	void *object;

···
 	slab->freelist = get_freepointer(s, object);
 	slab->inuse++;

-	if (!alloc_debug_processing(s, slab, object)) {
+	if (!alloc_debug_processing(s, slab, object, orig_size)) {
 		remove_partial(n, slab);
 		return NULL;
 	}
···
  * and put the slab to the partial (or full) list.
  */
 static void *alloc_single_from_new_slab(struct kmem_cache *s,
-		struct slab *slab)
+		struct slab *slab, int orig_size)
 {
 	int nid = slab_nid(slab);
 	struct kmem_cache_node *n = get_node(s, nid);
···
 	slab->freelist = get_freepointer(s, object);
 	slab->inuse = 1;

-	if (!alloc_debug_processing(s, slab, object))
+	if (!alloc_debug_processing(s, slab, object, orig_size))
 		/*
 		 * It's not really expected that this would fail on a
 		 * freshly allocated slab, but a concurrent memory
···
  * Try to allocate a partial slab from a specific node.
  */
 static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
-			      struct slab **ret_slab, gfp_t gfpflags)
+			      struct partial_context *pc)
 {
 	struct slab *slab, *slab2;
 	void *object = NULL;
···
 	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
 		void *t;

-		if (!pfmemalloc_match(slab, gfpflags))
+		if (!pfmemalloc_match(slab, pc->flags))
 			continue;

 		if (kmem_cache_debug(s)) {
-			object = alloc_single_from_partial(s, n, slab);
+			object = alloc_single_from_partial(s, n, slab,
+							pc->orig_size);
 			if (object)
 				break;
 			continue;
···
 			break;

 		if (!object) {
-			*ret_slab = slab;
+			*pc->slab = slab;
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
 		} else {
···
 /*
  * Get a slab from somewhere. Search in increasing NUMA distances.
  */
-static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
-			     struct slab **ret_slab)
+static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type highest_zoneidx = gfp_zone(flags);
+	enum zone_type highest_zoneidx = gfp_zone(pc->flags);
 	void *object;
 	unsigned int cpuset_mems_cookie;

···

 	do {
 		cpuset_mems_cookie = read_mems_allowed_begin();
-		zonelist = node_zonelist(mempolicy_slab_node(), flags);
+		zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
 		for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
 			struct kmem_cache_node *n;

 			n = get_node(s, zone_to_nid(zone));

-			if (n && cpuset_zone_allowed(zone, flags) &&
+			if (n && cpuset_zone_allowed(zone, pc->flags) &&
 					n->nr_partial > s->min_partial) {
-				object = get_partial_node(s, n, ret_slab, flags);
+				object = get_partial_node(s, n, pc);
 				if (object) {
 					/*
 					 * Don't check read_mems_allowed_retry()
···
 /*
  * Get a partial slab, lock it and return it.
  */
-static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
-			 struct slab **ret_slab)
+static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
 {
 	void *object;
 	int searchnode = node;
···
 	if (node == NUMA_NO_NODE)
 		searchnode = numa_mem_id();

-	object = get_partial_node(s, get_node(s, searchnode), ret_slab, flags);
+	object = get_partial_node(s, get_node(s, searchnode), pc);
 	if (object || node != NUMA_NO_NODE)
 		return object;

-	return get_any_partial(s, flags, ret_slab);
+	return get_any_partial(s, pc);
 }

 #ifdef CONFIG_PREEMPTION
···
  * already disabled (which is the case for bulk allocation).
  */
 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			   unsigned long addr, struct kmem_cache_cpu *c)
+			   unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
 {
 	void *freelist;
 	struct slab *slab;
 	unsigned long flags;
+	struct partial_context pc;

 	stat(s, ALLOC_SLOWPATH);

···

 new_objects:

-	freelist = get_partial(s, gfpflags, node, &slab);
+	pc.flags = gfpflags;
+	pc.slab = &slab;
+	pc.orig_size = orig_size;
+	freelist = get_partial(s, node, &pc);
 	if (freelist)
 		goto check_new_slab;

···
 	stat(s, ALLOC_SLAB);

 	if (kmem_cache_debug(s)) {
-		freelist = alloc_single_from_new_slab(s, slab);
+		freelist = alloc_single_from_new_slab(s, slab, orig_size);

 		if (unlikely(!freelist))
 			goto new_objects;
···
 	 */
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, freelist, TRACK_ALLOC, addr);
+
 	return freelist;
 }

···
  * pointer.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
 {
 	void *p;

···
 	c = slub_get_cpu_ptr(s->cpu_slab);
 #endif

-	p = ___slab_alloc(s, gfpflags, node, addr, c);
+	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
 #ifdef CONFIG_PREEMPT_COUNT
 	slub_put_cpu_ptr(s->cpu_slab);
 #endif
···

 	if (!USE_LOCKLESS_FAST_PATH() ||
 	    unlikely(!object || !slab || !node_match(slab, node))) {
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);

···
 			 * of re-populating per CPU c->freelist
 			 */
 			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
-					     _RET_IP_, c);
+					     _RET_IP_, c, s->object_size);
 			if (unlikely(!p[i]))
 				goto error;

···
 }

 #ifdef CONFIG_SLUB_DEBUG
-	if (flags & SLAB_STORE_USER)
+	if (flags & SLAB_STORE_USER) {
 		/*
 		 * Need to store information about allocs and frees after
 		 * the object.
 		 */
 		size += 2 * sizeof(struct track);
+
+		/* Save the original kmalloc request size */
+		if (flags & SLAB_KMALLOC)
+			size += sizeof(unsigned int);
+	}
 #endif

 	kasan_cache_create(s, &size, &s->flags);
···
 	depot_stack_handle_t handle;
 	unsigned long count;
 	unsigned long addr;
+	unsigned long waste;
 	long long sum_time;
 	long min_time;
 	long max_time;
···
 }

 static int add_location(struct loc_track *t, struct kmem_cache *s,
-			const struct track *track)
+			const struct track *track,
+			unsigned int orig_size)
 {
 	long start, end, pos;
 	struct location *l;
-	unsigned long caddr, chandle;
+	unsigned long caddr, chandle, cwaste;
 	unsigned long age = jiffies - track->when;
 	depot_stack_handle_t handle = 0;
+	unsigned int waste = s->object_size - orig_size;

 #ifdef CONFIG_STACKDEPOT
 	handle = READ_ONCE(track->handle);
···
 		if (pos == end)
 			break;

-		caddr = t->loc[pos].addr;
-		chandle = t->loc[pos].handle;
-		if ((track->addr == caddr) && (handle == chandle)) {
+		l = &t->loc[pos];
+		caddr = l->addr;
+		chandle = l->handle;
+		cwaste = l->waste;
+		if ((track->addr == caddr) && (handle == chandle) &&
+		    (waste == cwaste)) {

-			l = &t->loc[pos];
 			l->count++;
 			if (track->when) {
 				l->sum_time += age;
···
 			end = pos;
 		else if (track->addr == caddr && handle < chandle)
 			end = pos;
+		else if (track->addr == caddr && handle == chandle &&
+			 waste < cwaste)
+			end = pos;
 		else
 			start = pos;
 	}
···
 	l->min_pid = track->pid;
 	l->max_pid = track->pid;
 	l->handle = handle;
+	l->waste = waste;
 	cpumask_clear(to_cpumask(l->cpus));
 	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
 	nodes_clear(l->nodes);
···
 		unsigned long *obj_map)
 {
 	void *addr = slab_address(slab);
+	bool is_alloc = (alloc == TRACK_ALLOC);
 	void *p;

 	__fill_map(obj_map, s, slab);

 	for_each_object(p, s, addr, slab->objects)
 		if (!test_bit(__obj_to_index(s, addr, p), obj_map))
-			add_location(t, s, get_track(s, p, alloc));
+			add_location(t, s, get_track(s, p, alloc),
+				     is_alloc ? get_orig_size(s, p) :
+						s->object_size);
 }
 #endif  /* CONFIG_DEBUG_FS   */
 #endif	/* CONFIG_SLUB_DEBUG */
···
 		seq_printf(seq, "%pS", (void *)l->addr);
 	else
 		seq_puts(seq, "<not-available>");
+
+	if (l->waste)
+		seq_printf(seq, " waste=%lu/%lu",
+			l->count * l->waste, l->waste);

 	if (l->sum_time != l->min_time) {
 		seq_printf(seq, " age=%ld/%llu/%ld",