Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

lib/lzo: implement run-length encoding

Patch series "lib/lzo: run-length encoding support", v5.

Following on from the previous lzo-rle patchset:

https://lkml.org/lkml/2018/11/30/972

This patchset contains only the RLE patches, and should be applied on
top of the non-RLE patches ( https://lkml.org/lkml/2019/2/5/366 ).

Previously, some questions were raised around the RLE patches. I've
done some additional benchmarking to answer these questions. In short:

- RLE offers significant additional performance (data-dependent)

- I didn't measure any regressions that were clearly outside the noise

One concern with this patchset was around performance - specifically,
measuring RLE impact separately from Matt Sealey's patches (CTZ & fast
copy). I have done some additional benchmarking which I hope clarifies
the benefits of each part of the patchset.

Firstly, I've captured some memory via /dev/fmem from a Chromebook with
many tabs open which is starting to swap, and then split this into 4178
4k pages. I've excluded the all-zero pages (as zram does), and also the
no-zero pages (which won't tell us anything about RLE performance).
This should give a realistic test dataset for zram. What I found was
that the data is VERY bimodal: 44% of pages in this dataset contain 5%
or fewer zeros, and 44% contain over 90% zeros (30% if you include the
no-zero pages). This supports the idea of special-casing zeros in zram.

Next, I've benchmarked four variants of lzo on these pages (on 64-bit
Arm at max frequency): baseline LZO; baseline + Matt Sealey's patches
(aka MS); baseline + RLE only; baseline + MS + RLE. Numbers are for
weighted roundtrip throughput (the weighting reflects that zram does
more compression than decompression).

https://drive.google.com/file/d/1VLtLjRVxgUNuWFOxaGPwJYhl_hMQXpHe/view?usp=sharing

Matt's patches help in all cases on Arm (and have no effect on Intel), as
expected.

RLE also behaves as expected: with few zeros present, it makes no
difference; above ~75% zeros, it gives a good improvement (50-300 MB/s on
top of the benefit from Matt's patches).

Best performance is seen with both MS and RLE patches.

Finally, I have benchmarked the same dataset on an x86-64 device. Here,
the MS patches make no difference (as expected), while RLE helps much as
it does on Arm. There were no definite regressions; allowing for observational
error, 0.1% (3/4178) of cases had a regression > 1 standard deviation,
of which the largest was 4.6% (1.2 standard deviations). I think this
is probably within the noise.

https://drive.google.com/file/d/1xCUVwmiGD0heEMx5gcVEmLBI4eLaageV/view?usp=sharing

One point to note is that the graphs show RLE appears to help very
slightly with no zeros present! This is because the extra code causes
the clang optimiser to change code layout in a way that happens to have
a significant benefit. Taking baseline LZO and adding a do-nothing line
like "__builtin_prefetch(out_len);" immediately before the "goto next"
has the same effect. So this is a real, but basically spurious effect -
it's small enough not to upset the overall findings.

This patch (of 3):

When using zram, we frequently encounter long runs of zero bytes. This
adds a special case which identifies runs of zeros and encodes them
using run-length encoding.

This is faster for both compression and decompression. For high-entropy
data which doesn't hit this case, the impact is minimal.

Compression ratio is within a few percent in all cases.

This modifies the bitstream in a way which is backwards compatible
(i.e., we can decompress old bitstreams, but old versions of lzo cannot
decompress new bitstreams).

Link: http://lkml.kernel.org/r/20190205155944.16007-2-dave.rodgman@arm.com
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Markus F.X.J. Oberhumer <markus@oberhumer.com>
Cc: Matt Sealey <matt.sealey@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <nitingupta910@gmail.com>
Cc: Richard Purdie <rpurdie@openedhand.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Sonny Rao <sonnyrao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by Dave Rodgman, committed by Linus Torvalds
5ee4014a 761b3238

+183 -45
+28 -7
Documentation/lzo.txt
···
 is an implementation design choice independent on the algorithm or
 encoding.
 
+Versions
+
+0: Original version
+1: LZO-RLE
+
+Version 1 of LZO implements an extension to encode runs of zeros using run
+length encoding. This improves speed for data with many zeros, which is a
+common case for zram. This modifies the bitstream in a backwards compatible way
+(v1 can correctly decompress v0 compressed data, but v0 cannot read v1 data).
+
 Byte sequences
 ==============
 
 First byte encoding::
 
-  0..17   : follow regular instruction encoding, see below. It is worth
-            noting that codes 16 and 17 will represent a block copy from
-            the dictionary which is empty, and that they will always be
+  0..16   : follow regular instruction encoding, see below. It is worth
+            noting that code 16 will represent a block copy from the
+            dictionary which is empty, and that it will always be
             invalid at this place.
+
+      17  : bitstream version. If the first byte is 17, the next byte
+            gives the bitstream version. If the first byte is not 17,
+            the bitstream version is 0.
 
   18..21  : copy 0..3 literals
             state = (byte - 17) = 0..3  [ copy <state> literals ]
···
            state = S (copy S literals after this block)
            End of stream is reached if distance == 16384
 
+        In version 1, this instruction is also used to encode a run of zeros if
+        distance = 0xbfff, i.e. H = 1 and the D bits are all 1.
+        In this case, it is followed by a fourth byte, X.
+        run length = ((X << 3) | (0 0 0 0 0 L L L)) + 4.
+
   0 0 1 L L L L L   (32..63)
            Copy of small block within 16kB distance (preferably less than 34B)
            length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
···
 =======
 
 This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
-analysis of the decompression code available in Linux 3.16-rc5. The code is
-tricky, it is possible that this document contains mistakes or that a few
-corner cases were overlooked. In any case, please report any doubt, fix, or
-proposed updates to the author(s) so that the document can be updated.
+analysis of the decompression code available in Linux 3.16-rc5, and updated
+by Dave Rodgman <dave.rodgman@arm.com> on 2018/10/30 to introduce run-length
+encoding. The code is tricky, it is possible that this document contains
+mistakes or that a few corner cases were overlooked. In any case, please
+report any doubt, fix, or proposed updates to the author(s) so that the
+document can be updated.
+1 -1
include/linux/lzo.h
···
 #define LZO1X_1_MEM_COMPRESS	(8192 * sizeof(unsigned short))
 #define LZO1X_MEM_COMPRESS	LZO1X_1_MEM_COMPRESS
 
-#define lzo1x_worst_compress(x) ((x) + ((x) / 16) + 64 + 3)
+#define lzo1x_worst_compress(x) ((x) + ((x) / 16) + 64 + 3 + 2)
 
 /* This requires 'wrkmem' of size LZO1X_1_MEM_COMPRESS */
 int lzo1x_1_compress(const unsigned char *src, size_t src_len,
+89 -11
lib/lzo/lzo1x_compress.c
···
 static noinline size_t
 lzo1x_1_do_compress(const unsigned char *in, size_t in_len,
 		    unsigned char *out, size_t *out_len,
-		    size_t ti, void *wrkmem)
+		    size_t ti, void *wrkmem, signed char *state_offset)
 {
 	const unsigned char *ip;
 	unsigned char *op;
···
 	ip += ti < 4 ? 4 - ti : 0;
 
 	for (;;) {
-		const unsigned char *m_pos;
+		const unsigned char *m_pos = NULL;
 		size_t t, m_len, m_off;
 		u32 dv;
+		u32 run_length = 0;
 literal:
 		ip += 1 + ((ip - ii) >> 5);
 next:
 		if (unlikely(ip >= ip_end))
 			break;
 		dv = get_unaligned_le32(ip);
+
+		if (dv == 0) {
+			const unsigned char *ir = ip + 4;
+			const unsigned char *limit = ip_end
+				< (ip + MAX_ZERO_RUN_LENGTH + 1)
+				? ip_end : ip + MAX_ZERO_RUN_LENGTH + 1;
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && \
+	defined(LZO_FAST_64BIT_MEMORY_ACCESS)
+			u64 dv64;
+
+			for (; (ir + 32) <= limit; ir += 32) {
+				dv64 = get_unaligned((u64 *)ir);
+				dv64 |= get_unaligned((u64 *)ir + 1);
+				dv64 |= get_unaligned((u64 *)ir + 2);
+				dv64 |= get_unaligned((u64 *)ir + 3);
+				if (dv64)
+					break;
+			}
+			for (; (ir + 8) <= limit; ir += 8) {
+				dv64 = get_unaligned((u64 *)ir);
+				if (dv64) {
+# if defined(__LITTLE_ENDIAN)
+					ir += __builtin_ctzll(dv64) >> 3;
+# elif defined(__BIG_ENDIAN)
+					ir += __builtin_clzll(dv64) >> 3;
+# else
+# error "missing endian definition"
+# endif
+					break;
+				}
+			}
+#else
+			while ((ir < (const unsigned char *)
+					ALIGN((uintptr_t)ir, 4)) &&
+					(ir < limit) && (*ir == 0))
+				ir++;
+			for (; (ir + 4) <= limit; ir += 4) {
+				dv = *((u32 *)ir);
+				if (dv) {
+# if defined(__LITTLE_ENDIAN)
+					ir += __builtin_ctz(dv) >> 3;
+# elif defined(__BIG_ENDIAN)
+					ir += __builtin_clz(dv) >> 3;
+# else
+# error "missing endian definition"
+# endif
+					break;
+				}
+			}
+#endif
+			while (likely(ir < limit) && unlikely(*ir == 0))
+				ir++;
+			run_length = ir - ip;
+			if (run_length > MAX_ZERO_RUN_LENGTH)
+				run_length = MAX_ZERO_RUN_LENGTH;
+		} else {
+			t = ((dv * 0x1824429d) >> (32 - D_BITS)) & D_MASK;
+			m_pos = in + dict[t];
+			dict[t] = (lzo_dict_t) (ip - in);
+			if (unlikely(dv != get_unaligned_le32(m_pos)))
+				goto literal;
+		}
 
 		ii -= ti;
 		ti = 0;
 		t = ip - ii;
 		if (t != 0) {
 			if (t <= 3) {
-				op[-2] |= t;
+				op[*state_offset] |= t;
 				COPY4(op, ii);
 				op += t;
 			} else if (t <= 16) {
···
 					*op++ = *ii++;
 				} while (--t > 0);
 			}
+		}
+
+		if (unlikely(run_length)) {
+			ip += run_length;
+			run_length -= MIN_ZERO_RUN_LENGTH;
+			put_unaligned_le32((run_length << 21) | 0xfffc18
+					   | (run_length & 0x7), op);
+			op += 4;
+			run_length = 0;
+			*state_offset = -3;
+			goto finished_writing_instruction;
 		}
 
 		m_len = 4;
···
 
 		m_off = ip - m_pos;
 		ip += m_len;
-		ii = ip;
 		if (m_len <= M2_MAX_LEN && m_off <= M2_MAX_OFFSET) {
 			m_off -= 1;
 			*op++ = (((m_len - 1) << 5) | ((m_off & 7) << 2));
···
 			*op++ = (m_off << 2);
 			*op++ = (m_off >> 6);
 		}
+		*state_offset = -2;
+finished_writing_instruction:
+		ii = ip;
 		goto next;
 	}
 	*out_len = op - out;
···
 	unsigned char *op = out;
 	size_t l = in_len;
 	size_t t = 0;
+	signed char state_offset = -2;
+
+	// LZO v0 will never write 17 as first byte,
+	// so this is used to version the bitstream
+	*op++ = 17;
+	*op++ = LZO_VERSION;
 
 	while (l > 20) {
 		size_t ll = l <= (M4_MAX_OFFSET + 1) ? l : (M4_MAX_OFFSET + 1);
···
 			break;
 		BUILD_BUG_ON(D_SIZE * sizeof(lzo_dict_t) > LZO1X_1_MEM_COMPRESS);
 		memset(wrkmem, 0, D_SIZE * sizeof(lzo_dict_t));
-		t = lzo1x_1_do_compress(ip, ll, op, out_len, t, wrkmem);
+		t = lzo1x_1_do_compress(ip, ll, op, out_len,
+					t, wrkmem, &state_offset);
 		ip += ll;
 		op += *out_len;
 		l -= ll;
···
 	if (op == out && t <= 238) {
 		*op++ = (17 + t);
 	} else if (t <= 3) {
-		op[-2] |= t;
+		op[state_offset] |= t;
 	} else if (t <= 18) {
 		*op++ = (t - 3);
 	} else {
+54 -25
lib/lzo/lzo1x_decompress_safe.c
···
 	const unsigned char * const ip_end = in + in_len;
 	unsigned char * const op_end = out + *out_len;
 
+	unsigned char bitstream_version;
+
 	op = out;
 	ip = in;
 
 	if (unlikely(in_len < 3))
 		goto input_overrun;
+
+	if (likely(*ip == 17)) {
+		bitstream_version = ip[1];
+		ip += 2;
+		if (unlikely(in_len < 5))
+			goto input_overrun;
+	} else {
+		bitstream_version = 0;
+	}
+
 	if (*ip > 17) {
 		t = *ip++ - 17;
 		if (t < 4) {
···
 				m_pos -= next >> 2;
 				next &= 3;
 			} else {
-				m_pos = op;
-				m_pos -= (t & 8) << 11;
-				t = (t & 7) + (3 - 1);
-				if (unlikely(t == 2)) {
-					size_t offset;
-					const unsigned char *ip_last = ip;
-
-					while (unlikely(*ip == 0)) {
-						ip++;
-						NEED_IP(1);
-					}
-					offset = ip - ip_last;
-					if (unlikely(offset > MAX_255_COUNT))
-						return LZO_E_ERROR;
-
-					offset = (offset << 8) - offset;
-					t += offset + 7 + *ip++;
-					NEED_IP(2);
-				}
+				NEED_IP(2);
 				next = get_unaligned_le16(ip);
-				ip += 2;
-				m_pos -= next >> 2;
-				next &= 3;
-				if (m_pos == op)
-					goto eof_found;
-				m_pos -= 0x4000;
+				if (((next & 0xfffc) == 0xfffc) &&
+				    ((t & 0xf8) == 0x18) &&
+				    likely(bitstream_version)) {
+					NEED_IP(3);
+					t &= 7;
+					t |= ip[2] << 3;
+					t += MIN_ZERO_RUN_LENGTH;
+					NEED_OP(t);
+					memset(op, 0, t);
+					op += t;
+					next &= 3;
+					ip += 3;
+					goto match_next;
+				} else {
+					m_pos = op;
+					m_pos -= (t & 8) << 11;
+					t = (t & 7) + (3 - 1);
+					if (unlikely(t == 2)) {
+						size_t offset;
+						const unsigned char *ip_last = ip;
+
+						while (unlikely(*ip == 0)) {
+							ip++;
+							NEED_IP(1);
+						}
+						offset = ip - ip_last;
+						if (unlikely(offset > MAX_255_COUNT))
+							return LZO_E_ERROR;
+
+						offset = (offset << 8) - offset;
+						t += offset + 7 + *ip++;
+						NEED_IP(2);
+						next = get_unaligned_le16(ip);
+					}
+					ip += 2;
+					m_pos -= next >> 2;
+					next &= 3;
+					if (m_pos == op)
+						goto eof_found;
+					m_pos -= 0x4000;
+				}
 			}
 			TEST_LB(m_pos);
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+11 -1
lib/lzo/lzodefs.h
···
  */
 
 
+/* Version
+ * 0: original lzo version
+ * 1: lzo with support for RLE
+ */
+#define LZO_VERSION 1
+
 #define COPY4(dst, src)	\
 	put_unaligned(get_unaligned((const u32 *)(src)), (u32 *)(dst))
 #if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
···
 #elif defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
 #define LZO_USE_CTZ64	1
 #define LZO_USE_CTZ32	1
+#define LZO_FAST_64BIT_MEMORY_ACCESS
 #elif defined(CONFIG_X86) || defined(CONFIG_PPC)
 #define LZO_USE_CTZ32	1
 #elif defined(CONFIG_ARM) && (__LINUX_ARM_ARCH__ >= 5)
···
 #define M1_MAX_OFFSET	0x0400
 #define M2_MAX_OFFSET	0x0800
 #define M3_MAX_OFFSET	0x4000
-#define M4_MAX_OFFSET	0xbfff
+#define M4_MAX_OFFSET	0xbffe
 
 #define M1_MIN_LEN	2
 #define M1_MAX_LEN	2
···
 #define M2_MARKER	64
 #define M3_MARKER	32
 #define M4_MARKER	16
+
+#define MIN_ZERO_RUN_LENGTH	4
+#define MAX_ZERO_RUN_LENGTH	(2047 + MIN_ZERO_RUN_LENGTH)
 
 #define lzo_dict_t      unsigned short
 #define D_BITS		13