bitops: Optimize fns() for improved performance

The current fns() repeatedly uses __ffs() to find the index of the
least significant bit and then clears the corresponding bit using
__clear_bit(). The method for clearing the least significant bit can be
optimized by using word &= word - 1 instead.
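
As a minimal sketch of the identity this relies on (the helper name clear_lowest_set_bit() is hypothetical and is not part of the patch): subtracting 1 flips the lowest set bit and every bit below it, so ANDing with the original value clears exactly that bit.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only: clears the least
 * significant set bit of word. For word == 0 the result stays 0. */
static unsigned long clear_lowest_set_bit(unsigned long word)
{
	return word & (word - 1);
}

/* Walks 0x2C (binary 101100) down to zero, one set bit at a time. */
static void clear_lowest_set_bit_demo(void)
{
	unsigned long word = 0x2CUL;

	word = clear_lowest_set_bit(word);	/* 101100 -> 101000 */
	assert(word == 0x28UL);
	word = clear_lowest_set_bit(word);	/* 101000 -> 100000 */
	assert(word == 0x20UL);
	word = clear_lowest_set_bit(word);	/* 100000 -> 000000 */
	assert(word == 0UL);
}
```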

Typically, one __ffs() plus one __clear_bit() takes longer than one
subtraction plus one bitwise AND. To improve performance, the loop now
clears the n lowest set bits with word &= word - 1 and then calls
__ffs() once to obtain the answer. This reduces the number of __ffs()
calls from up to n + 1 to at most one.
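
The two approaches can be compared in user space with a sketch like the one below. It is not the kernel code itself. It assumes GCC or Clang, uses __builtin_ctzl() as a stand-in for __ffs(), and replaces __clear_bit() with a plain mask. The names ffs_sketch(), fns_old(), fns_new() and fns_agree() are made up for this illustration.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Stand-in for __ffs(): index of the lowest set bit.
 * Undefined for word == 0, so callers must check first. */
static inline unsigned int ffs_sketch(unsigned long word)
{
	return (unsigned int)__builtin_ctzl(word);
}

/* Previous approach: one find-first-set per loop iteration. */
static unsigned int fns_old(unsigned long word, unsigned int n)
{
	while (word) {
		unsigned int bit = ffs_sketch(word);

		if (n-- == 0)
			return bit;
		word &= ~(1UL << bit);	/* stand-in for __clear_bit() */
	}

	return BITS_PER_LONG;
}

/* New approach: drop the n lowest set bits, then find-first-set once. */
static unsigned int fns_new(unsigned long word, unsigned int n)
{
	while (word && n--)
		word &= word - 1;

	return word ? ffs_sketch(word) : BITS_PER_LONG;
}

/* Returns 1 if both versions agree on word for every n in
 * 0..BITS_PER_LONG, and 0 otherwise. */
static int fns_agree(unsigned long word)
{
	for (unsigned int n = 0; n <= BITS_PER_LONG; n++)
		if (fns_old(word, n) != fns_new(word, n))
			return 0;

	return 1;
}

/* Spot check on 0x2C (binary 101100), whose set bits are 2, 3 and 5. */
static void fns_spot_check(void)
{
	assert(fns_new(0x2CUL, 0) == 2);
	assert(fns_new(0x2CUL, 1) == 3);
	assert(fns_new(0x2CUL, 2) == 5);
	assert(fns_new(0x2CUL, 3) == BITS_PER_LONG);
}
```

Both versions return BITS_PER_LONG when word has n or fewer set bits, so callers see the same results.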

In the test_bitops benchmark, fns() becomes approximately 7.6 times
faster. In the find_bit benchmark, find_nth_bit() becomes approximately
26% faster.

Before:
test_bitops: fns: 58033164 ns
find_nth_bit: 4254313 ns, 16525 iterations

After:
test_bitops: fns: 7637268 ns
find_nth_bit: 3362863 ns, 16501 iterations

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>

authored by Kuan-Wei Chiu and committed by Yury Norov 1c2aa561 0a2c6664

+3 -9
include/linux/bitops.h
@@ -254,16 +254,10 @@
  */
 static inline unsigned long fns(unsigned long word, unsigned int n)
 {
-	unsigned int bit;
+	while (word && n--)
+		word &= word - 1;
 
-	while (word) {
-		bit = __ffs(word);
-		if (n-- == 0)
-			return bit;
-		__clear_bit(bit, &word);
-	}
-
-	return BITS_PER_LONG;
+	return word ? __ffs(word) : BITS_PER_LONG;
 }
 
 /**