cpumask: re-introduce constant-sized cpumask optimizations

Commit aa47a7c215e7 ("lib/cpumask: deprecate nr_cpumask_bits") resulted
in the cpumask operations potentially becoming hugely less efficient,
because suddenly the cpumask was always considered to be variable-sized.

The optimization was then later added back in a limited form by commit
6f9c07be9d02 ("lib/cpumask: add FORCE_NR_CPUS config option"), but that
FORCE_NR_CPUS option is not useful in a generic kernel and more of a
special case for embedded situations with fixed hardware.

Instead, just re-introduce the optimization, with some changes.

Instead of depending on CPUMASK_OFFSTACK being false, and then always
using the full constant cpumask width, this introduces three different
cpumask "sizes":

- the exact size (nr_cpumask_bits) remains identical to nr_cpu_ids.

  This is used for situations where we should use the exact size.

- the "small" size (small_cpumask_bits) is the NR_CPUS constant if it
  fits in a single word and the bitmap operations thus end up able
  to trigger the "small_const_nbits()" optimizations.

  This is used for the operations that have optimized single-word
  cases that get inlined, notably the bit find and scanning functions
  (a sketch of that single-word fast path follows this list).

- the "large" size (large_cpumask_bits) is the NR_CPUS constant if it
  is a sufficiently small constant that makes simple "copy" and
  "clear" operations more efficient.

  This is arbitrarily set at four words or less.
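
For reference, the single-word fast path works roughly like this. This
is a simplified stand-alone sketch, not the kernel's verbatim code: the
real small_const_nbits() lives in include/asm-generic/bitsperlong.h and
the inline find_first_bit() wrapper in include/linux/find.h, and
__builtin_ctzl() stands in here for the kernel's __ffs():

	#define BITS_PER_LONG (8 * (int)sizeof(long))

	/* True only for compile-time constants that fit in one word */
	#define small_const_nbits(nbits) \
		(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)

	unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);

	static inline unsigned long find_first_bit(const unsigned long *addr,
						   unsigned long size)
	{
		if (small_const_nbits(size)) {
			/* mask off bits past 'size', then one inlined word scan */
			unsigned long val = *addr & (~0UL >> (BITS_PER_LONG - size));

			return val ? (unsigned long)__builtin_ctzl(val) : size;
		}
		/* variable or multi-word size: out-of-line word-by-word loop */
		return _find_first_bit(addr, size);
	}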

As an example of this situation, without this fixed-size optimization,
cpumask_clear() will generate code like

	movl	nr_cpu_ids(%rip), %edx
	addq	$63, %rdx
	shrq	$3, %rdx
	andl	$-8, %edx
	callq	memset@PLT

on x86-64, because it would calculate the "exact" number of longwords
that need to be cleared.

In contrast, with this patch, using an NR_CPUS of 64 (which is quite a
reasonable value to use), the above becomes a single

	movq	$0,cpumask

instruction instead, because instead of caring to figure out exactly how
many CPUs the system has, it just knows that the cpumask will be a
single word and can just clear it all.
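
The same effect is easy to reproduce outside the kernel. In this
stand-alone sketch (all names here are invented for illustration),
building with -O2 turns clear_constant() into a single store while
clear_variable() keeps the memset() call, matching the two code
sequences above:

	#include <string.h>

	#define FIXED_NBITS 64U		/* stand-in for a one-word NR_CPUS */
	unsigned int runtime_nbits;	/* stand-in for nr_cpu_ids */

	unsigned long mask[FIXED_NBITS / 64];

	void clear_constant(void)
	{
		/* constant size: compiles down to a single "movq $0, mask" */
		memset(mask, 0, sizeof(mask));
	}

	void clear_variable(void)
	{
		/* size computed at run time: the memset() call survives */
		memset(mask, 0, ((runtime_nbits + 63) / 64) * sizeof(long));
	}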

Note that this does end up tightening the rules a bit from the original
version in another way: operations that set bits in the cpumask are now
limited to the actual nr_cpu_ids limit, whereas we used to do the
nr_cpumask_bits thing almost everywhere in the cpumask code.

But if you just clear bits, or scan for bits, we can use the simpler
compile-time constants.
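
To see why the setters need the exact limit, consider a hypothetical
box where the cpumask is one 64-bit word but only 40 CPUs are present
(the numbers are invented for illustration). If cpumask_setall() kept
filling whole words, bits above nr_cpu_ids would leak into every
operation that now scans the bigger compile-time constant:

	#include <stdio.h>

	static unsigned int nr_cpu_ids = 40;	/* what the hardware has */

	int main(void)
	{
		unsigned long mask;	/* one word, i.e. NR_CPUS == 64 */

		/* bitmap_fill()-style: fills the whole word, sets stray bits */
		mask = ~0UL;
		printf("fill: weight = %d\n", __builtin_popcountl(mask)); /* 64 */

		/* bitmap_set()-style: exactly nr_cpu_ids bits, nothing stray */
		mask = (1UL << nr_cpu_ids) - 1;
		printf("set:  weight = %d\n", __builtin_popcountl(mask)); /* 40 */

		return 0;
	}

This is why cpumask_setall() is switched from bitmap_fill() to
bitmap_set() with the exact bit count in the cpumask.h change below.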

In the process, remove 'cpumask_complement()' and 'for_each_cpu_not()'
which were not useful, and which fundamentally have to be limited to
'nr_cpu_ids'. Better remove them now than have somebody introduce use
of them later.

Of course, on x86-64 with MAXSMP there is no sane small compile-time
constant for the cpumask sizes, and we end up using the actual CPU bits,
and will generate the above kind of horrors regardless. Please don't
use MAXSMP unless you really expect to have machines with thousands of
cores.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Changed files: +72 -72

.clang-format: -1
···
   - 'for_each_console_srcu'
   - 'for_each_cpu'
   - 'for_each_cpu_and'
-  - 'for_each_cpu_not'
   - 'for_each_cpu_wrap'
   - 'for_each_dapm_widgets'
   - 'for_each_dedup_cand'

arch/ia64/kernel/acpi.c: +1 -3
···
 
 static int _acpi_map_lsapic(acpi_handle handle, int physid, int *pcpu)
 {
-	cpumask_t tmp_map;
 	int cpu;
 
-	cpumask_complement(&tmp_map, cpu_present_mask);
-	cpu = cpumask_first(&tmp_map);
+	cpu = cpumask_first_zero(cpu_present_mask);
 	if (cpu >= nr_cpu_ids)
 		return -EINVAL;
 

include/linux/cpumask.h: +70 -55
···
 #endif
 }
 
-/* Deprecated. Always use nr_cpu_ids. */
-#define nr_cpumask_bits nr_cpu_ids
+/*
+ * We have several different "preferred sizes" for the cpumask
+ * operations, depending on operation.
+ *
+ * For example, the bitmap scanning and operating operations have
+ * optimized routines that work for the single-word case, but only when
+ * the size is constant. So if NR_CPUS fits in one single word, we are
+ * better off using that small constant, in order to trigger the
+ * optimized bit finding. That is 'small_cpumask_bits'.
+ *
+ * The clearing and copying operations will similarly perform better
+ * with a constant size, but we limit that size arbitrarily to four
+ * words. We call this 'large_cpumask_bits'.
+ *
+ * Finally, some operations just want the exact limit, either because
+ * they set bits or just don't have any faster fixed-sized versions. We
+ * call this just 'nr_cpumask_bits'.
+ *
+ * Note that these optional constants are always guaranteed to be at
+ * least as big as 'nr_cpu_ids' itself is, and all our cpumask
+ * allocations are at least that size (see cpumask_size()). The
+ * optimization comes from being able to potentially use a compile-time
+ * constant instead of a run-time generated exact number of CPUs.
+ */
+#if NR_CPUS <= BITS_PER_LONG
+  #define small_cpumask_bits ((unsigned int)NR_CPUS)
+  #define large_cpumask_bits ((unsigned int)NR_CPUS)
+#elif NR_CPUS <= 4*BITS_PER_LONG
+  #define small_cpumask_bits nr_cpu_ids
+  #define large_cpumask_bits ((unsigned int)NR_CPUS)
+#else
+  #define small_cpumask_bits nr_cpu_ids
+  #define large_cpumask_bits nr_cpu_ids
+#endif
+#define nr_cpumask_bits nr_cpu_ids
 
 /*
  * The following particular system cpumasks and operations manage
···
  */
 static inline unsigned int cpumask_first(const struct cpumask *srcp)
 {
-	return find_first_bit(cpumask_bits(srcp), nr_cpumask_bits);
+	return find_first_bit(cpumask_bits(srcp), small_cpumask_bits);
 }
 
 /**
···
  */
 static inline unsigned int cpumask_first_zero(const struct cpumask *srcp)
 {
-	return find_first_zero_bit(cpumask_bits(srcp), nr_cpumask_bits);
+	return find_first_zero_bit(cpumask_bits(srcp), small_cpumask_bits);
 }
 
 /**
···
 static inline
 unsigned int cpumask_first_and(const struct cpumask *srcp1, const struct cpumask *srcp2)
 {
-	return find_first_and_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
+	return find_first_and_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), small_cpumask_bits);
 }
 
 /**
···
  */
 static inline unsigned int cpumask_last(const struct cpumask *srcp)
 {
-	return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
+	return find_last_bit(cpumask_bits(srcp), small_cpumask_bits);
 }
 
 /**
···
 	/* -1 is a legal arg here. */
 	if (n != -1)
 		cpumask_check(n);
-	return find_next_bit(cpumask_bits(srcp), nr_cpumask_bits, n + 1);
+	return find_next_bit(cpumask_bits(srcp), small_cpumask_bits, n + 1);
 }
 
 /**
···
 	/* -1 is a legal arg here. */
 	if (n != -1)
 		cpumask_check(n);
-	return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
+	return find_next_zero_bit(cpumask_bits(srcp), small_cpumask_bits, n+1);
 }
 
 #if NR_CPUS == 1
···
 	if (n != -1)
 		cpumask_check(n);
 	return find_next_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
-		nr_cpumask_bits, n + 1);
+		small_cpumask_bits, n + 1);
 }
 
 /**
···
  * After the loop, cpu is >= nr_cpu_ids.
  */
 #define for_each_cpu(cpu, mask)				\
-	for_each_set_bit(cpu, cpumask_bits(mask), nr_cpumask_bits)
-
-/**
- * for_each_cpu_not - iterate over every cpu in a complemented mask
- * @cpu: the (optionally unsigned) integer iterator
- * @mask: the cpumask pointer
- *
- * After the loop, cpu is >= nr_cpu_ids.
- */
-#define for_each_cpu_not(cpu, mask)		\
-	for_each_clear_bit(cpu, cpumask_bits(mask), nr_cpumask_bits)
+	for_each_set_bit(cpu, cpumask_bits(mask), small_cpumask_bits)
 
 #if NR_CPUS == 1
 static inline
···
  * After the loop, cpu is >= nr_cpu_ids.
  */
 #define for_each_cpu_wrap(cpu, mask, start)				\
-	for_each_set_bit_wrap(cpu, cpumask_bits(mask), nr_cpumask_bits, start)
+	for_each_set_bit_wrap(cpu, cpumask_bits(mask), small_cpumask_bits, start)
 
 /**
  * for_each_cpu_and - iterate over every cpu in both masks
···
  * After the loop, cpu is >= nr_cpu_ids.
  */
 #define for_each_cpu_and(cpu, mask1, mask2)				\
-	for_each_and_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), nr_cpumask_bits)
+	for_each_and_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
 
 /**
  * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
···
  * After the loop, cpu is >= nr_cpu_ids.
  */
 #define for_each_cpu_andnot(cpu, mask1, mask2)				\
-	for_each_andnot_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), nr_cpumask_bits)
+	for_each_andnot_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
 
 /**
  * cpumask_any_but - return a "random" in a cpumask, but not this one.
···
  */
 static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
 {
-	return find_nth_bit(cpumask_bits(srcp), nr_cpumask_bits, cpumask_check(cpu));
+	return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
 }
 
 /**
···
 				const struct cpumask *srcp2)
 {
 	return find_nth_and_bit(cpumask_bits(srcp1), cpumask_bits(srcp2),
-				nr_cpumask_bits, cpumask_check(cpu));
+				small_cpumask_bits, cpumask_check(cpu));
 }
 
 /**
···
 				const struct cpumask *srcp2)
 {
 	return find_nth_andnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2),
-				nr_cpumask_bits, cpumask_check(cpu));
+				small_cpumask_bits, cpumask_check(cpu));
 }
 
 /**
···
 	return find_nth_and_andnot_bit(cpumask_bits(srcp1),
 					cpumask_bits(srcp2),
 					cpumask_bits(srcp3),
-					nr_cpumask_bits, cpumask_check(cpu));
+					small_cpumask_bits, cpumask_check(cpu));
 }
 
 #define CPU_BITS_NONE						\
···
 /**
  * cpumask_setall - set all cpus (< nr_cpu_ids) in a cpumask
  * @dstp: the cpumask pointer
+ *
+ * Note: since we set bits, we should use the tighter 'bitmap_set()' with
+ * the exact number of bits, not 'bitmap_fill()' that will fill past the
+ * end.
  */
 static inline void cpumask_setall(struct cpumask *dstp)
 {
-	bitmap_fill(cpumask_bits(dstp), nr_cpumask_bits);
+	bitmap_set(cpumask_bits(dstp), 0, nr_cpumask_bits);
 }
 
 /**
···
  */
 static inline void cpumask_clear(struct cpumask *dstp)
 {
-	bitmap_zero(cpumask_bits(dstp), nr_cpumask_bits);
+	bitmap_zero(cpumask_bits(dstp), large_cpumask_bits);
 }
 
 /**
···
 			       const struct cpumask *src2p)
 {
 	return bitmap_and(cpumask_bits(dstp), cpumask_bits(src1p),
-			  cpumask_bits(src2p), nr_cpumask_bits);
+			  cpumask_bits(src2p), small_cpumask_bits);
 }
 
 /**
···
 			      const struct cpumask *src2p)
 {
 	bitmap_or(cpumask_bits(dstp), cpumask_bits(src1p),
-		  cpumask_bits(src2p), nr_cpumask_bits);
+		  cpumask_bits(src2p), small_cpumask_bits);
 }
 
 /**
···
 			       const struct cpumask *src2p)
 {
 	bitmap_xor(cpumask_bits(dstp), cpumask_bits(src1p),
-		   cpumask_bits(src2p), nr_cpumask_bits);
+		   cpumask_bits(src2p), small_cpumask_bits);
 }
 
 /**
···
 			      const struct cpumask *src2p)
 {
 	return bitmap_andnot(cpumask_bits(dstp), cpumask_bits(src1p),
-			     cpumask_bits(src2p), nr_cpumask_bits);
-}
-
-/**
- * cpumask_complement - *dstp = ~*srcp
- * @dstp: the cpumask result
- * @srcp: the input to invert
- */
-static inline void cpumask_complement(struct cpumask *dstp,
-				      const struct cpumask *srcp)
-{
-	bitmap_complement(cpumask_bits(dstp), cpumask_bits(srcp),
-			  nr_cpumask_bits);
+			     cpumask_bits(src2p), small_cpumask_bits);
 }
 
 /**
···
 			const struct cpumask *src2p)
 {
 	return bitmap_equal(cpumask_bits(src1p), cpumask_bits(src2p),
-			    nr_cpumask_bits);
+			    small_cpumask_bits);
 }
 
 /**
···
 			   const struct cpumask *src3p)
 {
 	return bitmap_or_equal(cpumask_bits(src1p), cpumask_bits(src2p),
-			       cpumask_bits(src3p), nr_cpumask_bits);
+			       cpumask_bits(src3p), small_cpumask_bits);
 }
 
 /**
···
 				   const struct cpumask *src2p)
 {
 	return bitmap_intersects(cpumask_bits(src1p), cpumask_bits(src2p),
-				 nr_cpumask_bits);
+				 small_cpumask_bits);
 }
 
 /**
···
 			 const struct cpumask *src2p)
 {
 	return bitmap_subset(cpumask_bits(src1p), cpumask_bits(src2p),
-			     nr_cpumask_bits);
+			     small_cpumask_bits);
 }
 
 /**
···
  */
 static inline bool cpumask_empty(const struct cpumask *srcp)
 {
-	return bitmap_empty(cpumask_bits(srcp), nr_cpumask_bits);
+	return bitmap_empty(cpumask_bits(srcp), small_cpumask_bits);
 }
 
 /**
···
  */
 static inline unsigned int cpumask_weight(const struct cpumask *srcp)
 {
-	return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
+	return bitmap_weight(cpumask_bits(srcp), small_cpumask_bits);
 }
 
 /**
···
 static inline unsigned int cpumask_weight_and(const struct cpumask *srcp1,
 						const struct cpumask *srcp2)
 {
-	return bitmap_weight_and(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
+	return bitmap_weight_and(cpumask_bits(srcp1), cpumask_bits(srcp2), small_cpumask_bits);
 }
 
 /**
···
 				       const struct cpumask *srcp, int n)
 {
 	bitmap_shift_right(cpumask_bits(dstp), cpumask_bits(srcp), n,
-			   nr_cpumask_bits);
+			   small_cpumask_bits);
 }
 
 /**
···
 static inline void cpumask_copy(struct cpumask *dstp,
 				const struct cpumask *srcp)
 {
-	bitmap_copy(cpumask_bits(dstp), cpumask_bits(srcp), nr_cpumask_bits);
+	bitmap_copy(cpumask_bits(dstp), cpumask_bits(srcp), large_cpumask_bits);
 }
 
 /**
···
  */
 static inline unsigned int cpumask_size(void)
 {
-	return BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long);
+	return BITS_TO_LONGS(large_cpumask_bits) * sizeof(long);
 }
 
 /*

lib/cpumask_kunit.c: +1 -13
···
 		KUNIT_EXPECT_EQ_MSG((test), mask_weight, iter, MASK_MSG(mask));	\
 	} while (0)
 
-#define EXPECT_FOR_EACH_CPU_NOT_EQ(test, mask)					\
-	do {									\
-		const cpumask_t *m = (mask);					\
-		int mask_weight = cpumask_weight(m);				\
-		int cpu, iter = 0;						\
-		for_each_cpu_not(cpu, m)					\
-			iter++;							\
-		KUNIT_EXPECT_EQ_MSG((test), nr_cpu_ids - mask_weight, iter, MASK_MSG(mask));	\
-	} while (0)
-
 #define EXPECT_FOR_EACH_CPU_OP_EQ(test, op, mask1, mask2)			\
 	do {									\
 		const cpumask_t *m1 = (mask1);					\
···
 	KUNIT_EXPECT_EQ_MSG(test, 0, cpumask_weight(&mask_empty), MASK_MSG(&mask_empty));
 	KUNIT_EXPECT_EQ_MSG(test, nr_cpu_ids, cpumask_weight(cpu_possible_mask),
 			    MASK_MSG(cpu_possible_mask));
-	KUNIT_EXPECT_EQ_MSG(test, nr_cpumask_bits, cpumask_weight(&mask_all), MASK_MSG(&mask_all));
+	KUNIT_EXPECT_EQ_MSG(test, nr_cpu_ids, cpumask_weight(&mask_all), MASK_MSG(&mask_all));
 }
 
 static void test_cpumask_first(struct kunit *test)
···
 static void test_cpumask_iterators(struct kunit *test)
 {
 	EXPECT_FOR_EACH_CPU_EQ(test, &mask_empty);
-	EXPECT_FOR_EACH_CPU_NOT_EQ(test, &mask_empty);
 	EXPECT_FOR_EACH_CPU_WRAP_EQ(test, &mask_empty);
 	EXPECT_FOR_EACH_CPU_OP_EQ(test, and, &mask_empty, &mask_empty);
 	EXPECT_FOR_EACH_CPU_OP_EQ(test, and, cpu_possible_mask, &mask_empty);
 	EXPECT_FOR_EACH_CPU_OP_EQ(test, andnot, &mask_empty, &mask_empty);
 
 	EXPECT_FOR_EACH_CPU_EQ(test, cpu_possible_mask);
-	EXPECT_FOR_EACH_CPU_NOT_EQ(test, cpu_possible_mask);
 	EXPECT_FOR_EACH_CPU_WRAP_EQ(test, cpu_possible_mask);
 	EXPECT_FOR_EACH_CPU_OP_EQ(test, and, cpu_possible_mask, cpu_possible_mask);
 	EXPECT_FOR_EACH_CPU_OP_EQ(test, andnot, cpu_possible_mask, &mask_empty);