Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf: bpf_csum_diff: Optimize and homogenize for all archs

1. Optimization
---------------

The current implementation copies the 'from' and 'to' buffers to a
scratchpad, taking the bitwise NOT of the 'from' buffer while copying.
csum_partial() is then called on this scratchpad.

So, mathematically, the current implementation is doing:

result = csum(to - from)

Here, 'to' and '~from' are copied into the scratchpad buffer. The
scratchpad is needed because csum_partial() takes a single contiguous
buffer, not two disjoint buffers like 'to' and 'from'.

We can rewrite this equation to:

result = csum(to) - csum(from)

using the distributive property of csum().

This allows 'to' and 'from' to be at different locations, so the
scratchpad and the copying are no longer needed.

In C code, this looks like:

result = csum_sub(csum_partial(to, to_size, seed),
                  csum_partial(from, from_size, 0));

2. Homogenization
-----------------

The bpf_csum_diff() helper calls csum_partial(), which is implemented by
some architectures like arm and x86, while other architectures rely on
the generic implementation in lib/checksum.c.

The generic implementation in lib/checksum.c returns a 16-bit value, but
the arch-specific implementations can return more than 16 bits. This
works out in most places because, before the result is used, it is
passed through csum_fold(), which turns it into a 16-bit value.

bpf_csum_diff() directly returns the value from csum_partial(), so the
returned value can differ across architectures. See the discussion
in [1]:

For the int value 28, the calculated checksums are:

x86                    :    -29 : 0xffffffe3
generic (arm64, riscv) :  65507 : 0x0000ffe3
arm                    : 131042 : 0x0001ffe2

Pass the result of bpf_csum_diff() through from32to16() before
returning it, to homogenize the result across all architectures.

NOTE: from32to16() is used instead of csum_fold() because csum_fold()
does from32to16() plus a bitwise NOT of the result, which is not what we
want to do here.

[1] https://lore.kernel.org/bpf/CAJ+HfNiQbOcqCLxFUP2FMm5QrLXUUaj852Fxe3hn_2JNiucn6g@mail.gmail.com/

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20241026125339.26459-3-puranjay@kernel.org

authored by Puranjay Mohan, committed by Daniel Borkmann
6a4794d5 db71aae7
+11 -28
net/core/filter.c
···
 	bpf_prog_destroy(prog);
 }
 
-struct bpf_scratchpad {
-	union {
-		__be32 diff[MAX_BPF_STACK / sizeof(__be32)];
-		u8 buff[MAX_BPF_STACK];
-	};
-	local_lock_t bh_lock;
-};
-
-static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp) = {
-	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
-};
-
 static inline int __bpf_try_make_writable(struct sk_buff *skb,
 					  unsigned int write_len)
 {
···
 BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
 	   __be32 *, to, u32, to_size, __wsum, seed)
 {
-	struct bpf_scratchpad *sp = this_cpu_ptr(&bpf_sp);
-	u32 diff_size = from_size + to_size;
-	int i, j = 0;
-	__wsum ret;
-
 	/* This is quite flexible, some examples:
 	 *
 	 * from_size == 0, to_size > 0, seed := csum --> pushing data
···
 	 *
 	 * Even for diffing, from_size and to_size don't need to be equal.
 	 */
-	if (unlikely(((from_size | to_size) & (sizeof(__be32) - 1)) ||
-		     diff_size > sizeof(sp->diff)))
-		return -EINVAL;
 
-	local_lock_nested_bh(&bpf_sp.bh_lock);
-	for (i = 0; i < from_size / sizeof(__be32); i++, j++)
-		sp->diff[j] = ~from[i];
-	for (i = 0; i < to_size / sizeof(__be32); i++, j++)
-		sp->diff[j] = to[i];
+	__wsum ret = seed;
 
-	ret = csum_partial(sp->diff, diff_size, seed);
-	local_unlock_nested_bh(&bpf_sp.bh_lock);
-	return ret;
+	if (from_size && to_size)
+		ret = csum_sub(csum_partial(to, to_size, ret),
+			       csum_partial(from, from_size, 0));
+	else if (to_size)
+		ret = csum_partial(to, to_size, ret);
+
+	else if (from_size)
+		ret = ~csum_partial(from, from_size, ~ret);
+
+	return csum_from32to16((__force unsigned int)ret);
 }
 
 static const struct bpf_func_proto bpf_csum_diff_proto = {