
selftests/bpf: Add benchmark for local_storage get

Add benchmarks to demonstrate the performance cliff for local_storage
get as the number of local_storage maps increases beyond the current
local_storage implementation's cache size.

Two benchmarks, "sequential get" and "interleaved get", are added. Both
do many bpf_task_storage_get calls on sets of task local_storage maps of
various counts, while considering a single specific map to be
'important' and counting bpf_task_storage_get calls to the important map
separately, in addition to the normal 'hits' count of all gets. The goal
here is to mimic a scenario where a particular program using one map -
the important one - is running on a system where many other
local_storage maps exist and are accessed often.

While the "sequential get" benchmark does bpf_task_storage_get for maps
0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
interleaves 4 bpf_task_storage_get calls for the important map into
every 10 map gets. This is meant to highlight the performance difference
when the important map is accessed far more frequently than the
non-important maps.

A "hashmap control" benchmark is also included for easy comparison of
standard bpf hashmap lookup vs local_storage get. The benchmark is
similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
instead of local storage. Only one inner map is created - a hashmap
meant to hold a tid -> data mapping for all tasks. The size of the
hashmap is hardcoded to my system's PID_MAX_LIMIT (4,194,304). The
number of these keys which are actually fetched as part of the benchmark
is configurable.

Addition of this benchmark is inspired by conversation with Alexei in a
previous patchset's thread [0], which highlighted the need for such a
benchmark to motivate and validate improvements to local_storage
implementation. My approach in that series focused on improving
performance for explicitly-marked 'important' maps and was rejected
with feedback to make more generally-applicable improvements while
avoiding explicitly marking maps as important. Thus the benchmark
reports both general and important-map-focused metrics, so effect of
future work on both is clear.

Benchmark results on a powerful system (Skylake, 20 cores, 256 GB RAM):

Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s

num keys: 1000
hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s

num keys: 10000
hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s

num keys: 100000
hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s

num keys: 4194304
hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s

Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s

num_maps: 10
local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s

num_maps: 16
local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s

num_maps: 17
local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s

num_maps: 24
local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s

num_maps: 32
local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s

num_maps: 100
local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s

num_maps: 1000
local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s

Looking at the "sequential get" results, it's clear that as the
number of task local_storage maps grows beyond the current cache size
(16), there's a significant reduction in hits throughput. Note that the
current local_storage implementation assigns a cache_idx to maps as they
are created. Since "sequential get" creates maps 0..n in order and then
does bpf_task_storage_get calls in the same order, the benchmark
effectively ensures that a map will not be in cache when the program
tries to access it.

For the "interleaved get" results, important-map hits throughput is
greatly increased, as the important map is more likely to be in cache by
virtue of being accessed far more frequently. Throughput still degrades
as the number of maps increases, though.

To get a sense of the overhead of the benchmark program, I
commented out bpf_task_storage_get/bpf_map_lookup_elem in
local_storage_bench.c and ran the benchmark on the same host as the
'real' run. Results:

Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s

num keys: 1000
hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s

num keys: 10000
hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s

num keys: 100000
hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s

num keys: 4194304
hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s

Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s
local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s

num_maps: 10
local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s
local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s

num_maps: 16
local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s
local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s

num_maps: 17
local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s
local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s

num_maps: 24
local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s
local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s

num_maps: 32
local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s
local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s

num_maps: 100
local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s
local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s

num_maps: 1000
local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s

Adjusting for overhead, latency numbers for "hashmap control" and
"sequential get" are:

hashmap_control_1k: ~53.8ns
hashmap_control_10k: ~124.2ns
hashmap_control_100k: ~206.5ns
sequential_get_1: ~12.1ns
sequential_get_10: ~16.0ns
sequential_get_16: ~13.8ns
sequential_get_17: ~16.8ns
sequential_get_24: ~40.9ns
sequential_get_32: ~65.2ns
sequential_get_100: ~148.2ns
sequential_get_1000: ~2204ns

Clearly demonstrating a cliff.

In the discussion for v1 of this patch, Alexei noted that local_storage
was 2.5x faster than a large hashmap when initially implemented [1]. The
benchmark results show a local_storage advantage of 5-10x for realistic
workloads: a long-running BPF application putting some pid-specific info
into a hashmap for each pid it sees will probably see on the order of
10-100k pids, and bench numbers for hashmaps of this size are ~10x
slower than sequential_get_16. But as the number of local_storage maps
grows far past the local_storage cache size, the performance advantage
shrinks and eventually reverses.

When running the benchmarks it may be necessary to bump the 'open files'
ulimit for a successful run.

[0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
[1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

---
 7 files changed, 494 insertions(+), 1 deletion(-)
tools/testing/selftests/bpf/Makefile (+3 -1)

···
 $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
 $(OUTPUT)/bench_bpf_hashmap_full_update.o: $(OUTPUT)/bpf_hashmap_full_update_bench.skel.h
+$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
···
 		$(OUTPUT)/bench_bloom_filter_map.o \
 		$(OUTPUT)/bench_bpf_loop.o \
 		$(OUTPUT)/bench_strncmp.o \
-		$(OUTPUT)/bench_bpf_hashmap_full_update.o
+		$(OUTPUT)/bench_bpf_hashmap_full_update.o \
+		$(OUTPUT)/bench_local_storage.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
tools/testing/selftests/bpf/bench.c (+55)

···
 	printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
 }

+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns)
+{
+	double important_hits_per_sec, hits_per_sec;
+	double delta_sec = delta_ns / 1000000000.0;
+
+	hits_per_sec = res->hits / 1000000.0 / delta_sec;
+	important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
+
+	printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
+
+	printf("hits %8.3lfM/s ", hits_per_sec);
+	printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
+}
+
+void local_storage_report_final(struct bench_res res[], int res_cnt)
+{
+	double important_hits_mean = 0.0, important_hits_stddev = 0.0;
+	double hits_mean = 0.0, hits_stddev = 0.0;
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+		important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
+	}
+
+	if (res_cnt > 1) {
+		for (i = 0; i < res_cnt; i++) {
+			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
+				       (hits_mean - res[i].hits / 1000000.0) /
+				       (res_cnt - 1.0);
+			important_hits_stddev +=
+				(important_hits_mean - res[i].important_hits / 1000000.0) *
+				(important_hits_mean - res[i].important_hits / 1000000.0) /
+				(res_cnt - 1.0);
+		}
+
+		hits_stddev = sqrt(hits_stddev);
+		important_hits_stddev = sqrt(important_hits_stddev);
+	}
+	printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
+	       hits_mean, hits_stddev);
+	printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
+	printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
+	       important_hits_mean, important_hits_stddev);
+}
+
 const char *argp_program_version = "benchmark";
 const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
 const char argp_program_doc[] =
···
 extern struct argp bench_ringbufs_argp;
 extern struct argp bench_bloom_map_argp;
 extern struct argp bench_bpf_loop_argp;
+extern struct argp bench_local_storage_argp;
 extern struct argp bench_strncmp_argp;

 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
 	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
+	{ &bench_local_storage_argp, 0, "local_storage benchmark", 0 },
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
 	{},
 };
···
 extern const struct bench bench_strncmp_no_helper;
 extern const struct bench bench_strncmp_helper;
 extern const struct bench bench_bpf_hashmap_full_update;
+extern const struct bench bench_local_storage_cache_seq_get;
+extern const struct bench bench_local_storage_cache_interleaved_get;
+extern const struct bench bench_local_storage_cache_hashmap_control;

 static const struct bench *benchs[] = {
 	&bench_count_global,
···
 	&bench_strncmp_no_helper,
 	&bench_strncmp_helper,
 	&bench_bpf_hashmap_full_update,
+	&bench_local_storage_cache_seq_get,
+	&bench_local_storage_cache_interleaved_get,
+	&bench_local_storage_cache_hashmap_control,
 };

 static void setup_benchmark()
tools/testing/selftests/bpf/bench.h (+4)

···
 	long hits;
 	long drops;
 	long false_hits;
+	long important_hits;
 };

 struct bench {
···
 void false_hits_report_final(struct bench_res res[], int res_cnt);
 void ops_report_progress(int iter, struct bench_res *res, long delta_ns);
 void ops_report_final(struct bench_res res[], int res_cnt);
+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns);
+void local_storage_report_final(struct bench_res res[], int res_cnt);

 static inline __u64 get_time_ns(void)
 {
tools/testing/selftests/bpf/benchs/bench_local_storage.c (new file, +287)

// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */

#include <argp.h>
#include <linux/btf.h>

#include "local_storage_bench.skel.h"
#include "bench.h"

#include <test_btf.h>

static struct {
	__u32 nr_maps;
	__u32 hashmap_nr_keys_used;
} args = {
	.nr_maps = 1000,
	.hashmap_nr_keys_used = 1000,
};

enum {
	ARG_NR_MAPS = 6000,
	ARG_HASHMAP_NR_KEYS_USED = 6001,
};

static const struct argp_option opts[] = {
	{ "nr_maps", ARG_NR_MAPS, "NR_MAPS", 0,
		"Set number of local_storage maps"},
	{ "hashmap_nr_keys_used", ARG_HASHMAP_NR_KEYS_USED, "NR_KEYS",
		0, "When doing hashmap test, set number of hashmap keys test uses"},
	{},
};

static error_t parse_arg(int key, char *arg, struct argp_state *state)
{
	long ret;

	switch (key) {
	case ARG_NR_MAPS:
		ret = strtol(arg, NULL, 10);
		if (ret < 1 || ret > UINT_MAX) {
			fprintf(stderr, "invalid nr_maps");
			argp_usage(state);
		}
		args.nr_maps = ret;
		break;
	case ARG_HASHMAP_NR_KEYS_USED:
		ret = strtol(arg, NULL, 10);
		if (ret < 1 || ret > UINT_MAX) {
			fprintf(stderr, "invalid hashmap_nr_keys_used");
			argp_usage(state);
		}
		args.hashmap_nr_keys_used = ret;
		break;
	default:
		return ARGP_ERR_UNKNOWN;
	}

	return 0;
}

const struct argp bench_local_storage_argp = {
	.options = opts,
	.parser = parse_arg,
};

/* Keep in sync w/ array of maps in bpf */
#define MAX_NR_MAPS 1000
/* keep in sync w/ same define in bpf */
#define HASHMAP_SZ 4194304

static void validate(void)
{
	if (env.producer_cnt != 1) {
		fprintf(stderr, "benchmark doesn't support multi-producer!\n");
		exit(1);
	}
	if (env.consumer_cnt != 1) {
		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
		exit(1);
	}

	if (args.nr_maps > MAX_NR_MAPS) {
		fprintf(stderr, "nr_maps must be <= 1000\n");
		exit(1);
	}

	if (args.hashmap_nr_keys_used > HASHMAP_SZ) {
		fprintf(stderr, "hashmap_nr_keys_used must be <= %u\n", HASHMAP_SZ);
		exit(1);
	}
}

static struct {
	struct local_storage_bench *skel;
	void *bpf_obj;
	struct bpf_map *array_of_maps;
} ctx;

static void prepopulate_hashmap(int fd)
{
	int i, key, val;

	/* local_storage gets will have BPF_LOCAL_STORAGE_GET_F_CREATE flag set, so
	 * populate the hashmap for a similar comparison
	 */
	for (i = 0; i < HASHMAP_SZ; i++) {
		key = val = i;
		if (bpf_map_update_elem(fd, &key, &val, 0)) {
			fprintf(stderr, "Error prepopulating hashmap (key %d)\n", key);
			exit(1);
		}
	}
}

static void __setup(struct bpf_program *prog, bool hashmap)
{
	struct bpf_map *inner_map;
	int i, fd, mim_fd, err;

	LIBBPF_OPTS(bpf_map_create_opts, create_opts);

	if (!hashmap)
		create_opts.map_flags = BPF_F_NO_PREALLOC;

	ctx.skel->rodata->num_maps = args.nr_maps;
	ctx.skel->rodata->hashmap_num_keys = args.hashmap_nr_keys_used;
	inner_map = bpf_map__inner_map(ctx.array_of_maps);
	create_opts.btf_key_type_id = bpf_map__btf_key_type_id(inner_map);
	create_opts.btf_value_type_id = bpf_map__btf_value_type_id(inner_map);

	err = local_storage_bench__load(ctx.skel);
	if (err) {
		fprintf(stderr, "Error loading skeleton\n");
		goto err_out;
	}

	create_opts.btf_fd = bpf_object__btf_fd(ctx.skel->obj);

	mim_fd = bpf_map__fd(ctx.array_of_maps);
	if (mim_fd < 0) {
		fprintf(stderr, "Error getting map_in_map fd\n");
		goto err_out;
	}

	for (i = 0; i < args.nr_maps; i++) {
		if (hashmap)
			fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(int),
					    sizeof(int), HASHMAP_SZ, &create_opts);
		else
			fd = bpf_map_create(BPF_MAP_TYPE_TASK_STORAGE, NULL, sizeof(int),
					    sizeof(int), 0, &create_opts);
		if (fd < 0) {
			fprintf(stderr, "Error creating map %d: %d\n", i, fd);
			goto err_out;
		}

		if (hashmap)
			prepopulate_hashmap(fd);

		err = bpf_map_update_elem(mim_fd, &i, &fd, 0);
		if (err) {
			fprintf(stderr, "Error updating array-of-maps w/ map %d\n", i);
			goto err_out;
		}
	}

	if (!bpf_program__attach(prog)) {
		fprintf(stderr, "Error attaching bpf program\n");
		goto err_out;
	}

	return;
err_out:
	exit(1);
}

static void hashmap_setup(void)
{
	struct local_storage_bench *skel;

	setup_libbpf();

	skel = local_storage_bench__open();
	ctx.skel = skel;
	ctx.array_of_maps = skel->maps.array_of_hash_maps;
	skel->rodata->use_hashmap = 1;
	skel->rodata->interleave = 0;

	__setup(skel->progs.get_local, true);
}

static void local_storage_cache_get_setup(void)
{
	struct local_storage_bench *skel;

	setup_libbpf();

	skel = local_storage_bench__open();
	ctx.skel = skel;
	ctx.array_of_maps = skel->maps.array_of_local_storage_maps;
	skel->rodata->use_hashmap = 0;
	skel->rodata->interleave = 0;

	__setup(skel->progs.get_local, false);
}

static void local_storage_cache_get_interleaved_setup(void)
{
	struct local_storage_bench *skel;

	setup_libbpf();

	skel = local_storage_bench__open();
	ctx.skel = skel;
	ctx.array_of_maps = skel->maps.array_of_local_storage_maps;
	skel->rodata->use_hashmap = 0;
	skel->rodata->interleave = 1;

	__setup(skel->progs.get_local, false);
}

static void measure(struct bench_res *res)
{
	res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
	res->important_hits = atomic_swap(&ctx.skel->bss->important_hits, 0);
}

static inline void trigger_bpf_program(void)
{
	syscall(__NR_getpgid);
}

static void *consumer(void *input)
{
	return NULL;
}

static void *producer(void *input)
{
	while (true)
		trigger_bpf_program();

	return NULL;
}

/* cache sequential and interleaved get benchs test local_storage get
 * performance, specifically they demonstrate performance cliff of
 * current list-plus-cache local_storage model.
 *
 * cache sequential get: call bpf_task_storage_get on n maps in order
 * cache interleaved get: like "sequential get", but interleave 4 calls to the
 *	'important' map (idx 0 in array_of_maps) for every 10 calls. Goal
 *	is to mimic environment where many progs are accessing their local_storage
 *	maps, with 'our' prog needing to access its map more often than others
 */
const struct bench bench_local_storage_cache_seq_get = {
	.name = "local-storage-cache-seq-get",
	.validate = validate,
	.setup = local_storage_cache_get_setup,
	.producer_thread = producer,
	.consumer_thread = consumer,
	.measure = measure,
	.report_progress = local_storage_report_progress,
	.report_final = local_storage_report_final,
};

const struct bench bench_local_storage_cache_interleaved_get = {
	.name = "local-storage-cache-int-get",
	.validate = validate,
	.setup = local_storage_cache_get_interleaved_setup,
	.producer_thread = producer,
	.consumer_thread = consumer,
	.measure = measure,
	.report_progress = local_storage_report_progress,
	.report_final = local_storage_report_final,
};

const struct bench bench_local_storage_cache_hashmap_control = {
	.name = "local-storage-cache-hashmap-control",
	.validate = validate,
	.setup = hashmap_setup,
	.producer_thread = producer,
	.consumer_thread = consumer,
	.measure = measure,
	.report_progress = local_storage_report_progress,
	.report_final = local_storage_report_final,
};
tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh (new file, +24)

#!/bin/bash
# SPDX-License-Identifier: GPL-2.0

source ./benchs/run_common.sh

set -eufo pipefail

header "Hashmap Control"
for i in 10 1000 10000 100000 4194304; do
	subtitle "num keys: $i"
	summarize_local_storage "hashmap (control) sequential get: "\
		"$(./bench --nr_maps 1 --hashmap_nr_keys_used=$i local-storage-cache-hashmap-control)"
	printf "\n"
done

header "Local Storage"
for i in 1 10 16 17 24 32 100 1000; do
	subtitle "num_maps: $i"
	summarize_local_storage "local_storage cache sequential get: "\
		"$(./bench --nr_maps $i local-storage-cache-seq-get)"
	summarize_local_storage "local_storage cache interleaved get: "\
		"$(./bench --nr_maps $i local-storage-cache-int-get)"
	printf "\n"
done
tools/testing/selftests/bpf/benchs/run_common.sh (+17)

···
 	echo "$*" | sed -E "s/.*latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
 }

+function local_storage()
+{
+	echo -n "hits throughput: "
+	echo -n "$*" | sed -E "s/.* hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+	echo -n -e ", hits latency: "
+	echo -n "$*" | sed -E "s/.* hits latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
+	echo -n ", important_hits throughput: "
+	echo "$*" | sed -E "s/.*important_hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+}
+
 function total()
 {
 	echo "$*" | sed -E "s/.*total operations\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
···
 	bench="$1"
 	summary=$(echo $2 | tail -n1)
 	printf "%-20s %s\n" "$bench" "$(ops $summary)"
+}
+
+function summarize_local_storage()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s\n" "$bench" "$(local_storage $summary)"
 }

 function summarize_total()
tools/testing/selftests/bpf/progs/local_storage_bench.c (new file, +104)

// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include "bpf_misc.h"

#define HASHMAP_SZ 4194304

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__type(key, int);
	__type(value, int);
	__array(values, struct {
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, int);
	});
} array_of_local_storage_maps SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__type(key, int);
	__type(value, int);
	__array(values, struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, HASHMAP_SZ);
		__type(key, int);
		__type(value, int);
	});
} array_of_hash_maps SEC(".maps");

long important_hits;
long hits;

/* set from user-space */
const volatile unsigned int use_hashmap;
const volatile unsigned int hashmap_num_keys;
const volatile unsigned int num_maps;
const volatile unsigned int interleave;

struct loop_ctx {
	struct task_struct *task;
	long loop_hits;
	long loop_important_hits;
};

static int do_lookup(unsigned int elem, struct loop_ctx *lctx)
{
	void *map, *inner_map;
	int idx = 0;

	if (use_hashmap)
		map = &array_of_hash_maps;
	else
		map = &array_of_local_storage_maps;

	inner_map = bpf_map_lookup_elem(map, &elem);
	if (!inner_map)
		return -1;

	if (use_hashmap) {
		idx = bpf_get_prandom_u32() % hashmap_num_keys;
		bpf_map_lookup_elem(inner_map, &idx);
	} else {
		bpf_task_storage_get(inner_map, lctx->task, &idx,
				     BPF_LOCAL_STORAGE_GET_F_CREATE);
	}

	lctx->loop_hits++;
	if (!elem)
		lctx->loop_important_hits++;
	return 0;
}

static long loop(u32 index, void *ctx)
{
	struct loop_ctx *lctx = (struct loop_ctx *)ctx;
	unsigned int map_idx = index % num_maps;

	do_lookup(map_idx, lctx);
	if (interleave && map_idx % 3 == 0)
		do_lookup(0, lctx);
	return 0;
}

SEC("fentry/" SYS_PREFIX "sys_getpgid")
int get_local(void *ctx)
{
	struct loop_ctx lctx;

	lctx.task = bpf_get_current_task_btf();
	lctx.loop_hits = 0;
	lctx.loop_important_hits = 0;
	bpf_loop(10000, &loop, &lctx, 0);
	__sync_add_and_fetch(&hits, lctx.loop_hits);
	__sync_add_and_fetch(&important_hits, lctx.loop_important_hits);
	return 0;
}

char _license[] SEC("license") = "GPL";