Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

Documentation: lzo: document part of the encoding

Add a complete description of the LZO format as processed by the
decompressor. I have not found a public specification of this format
hence this analysis, which will be used to better understand the code.

Cc: Willem Pinckaers <willem@lekkertech.net>
Cc: "Don A. Bailey" <donb@securitymouse.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

authored by

Willy Tarreau and committed by
Greg Kroah-Hartman
d98a0526 9e82bf01

+164
+164
Documentation/lzo.txt
··· 1 + 2 + LZO stream format as understood by Linux's LZO decompressor 3 + =========================================================== 4 + 5 + Introduction 6 + 7 + This is not a specification. No specification seems to be publicly available 8 + for the LZO stream format. This document describes what input format the LZO 9 + decompressor as implemented in the Linux kernel understands. The file subject 10 + of this analysis is lib/lzo/lzo1x_decompress_safe.c. No analysis was made on 11 + the compressor nor on any other implementations though it seems likely that 12 + the format matches the standard one. The purpose of this document is to 13 + better understand what the code does in order to propose more efficient fixes 14 + for future bug reports. 15 + 16 + Description 17 + 18 + The stream is composed of a series of instructions, operands, and data. The 19 + instructions consist in a few bits representing an opcode, and bits forming 20 + the operands for the instruction, whose size and position depend on the 21 + opcode and on the number of literals copied by previous instruction. The 22 + operands are used to indicate : 23 + 24 + - a distance when copying data from the dictionary (past output buffer) 25 + - a length (number of bytes to copy from dictionary) 26 + - the number of literals to copy, which is retained in variable "state" 27 + as a piece of information for next instructions. 28 + 29 + Optionally depending on the opcode and operands, extra data may follow. These 30 + extra data can be a complement for the operand (eg: a length or a distance 31 + encoded on larger values), or a literal to be copied to the output buffer. 32 + 33 + The first byte of the block follows a different encoding from other bytes, it 34 + seems to be optimized for literal use only, since there is no dictionary yet 35 + prior to that byte. 36 + 37 + Lengths are always encoded on a variable size starting with a small number 38 + of bits in the operand. If the number of bits isn't enough to represent the 39 + length, up to 255 may be added in increments by consuming more bytes with a 40 + rate of at most 255 per extra byte (thus the compression ratio cannot exceed 41 + around 255:1). The variable length encoding using #bits is always the same : 42 + 43 + length = byte & ((1 << #bits) - 1) 44 + if (!length) { 45 + length = ((1 << #bits) - 1) 46 + length += 255*(number of zero bytes) 47 + length += first-non-zero-byte 48 + } 49 + length += constant (generally 2 or 3) 50 + 51 + For references to the dictionary, distances are relative to the output 52 + pointer. Distances are encoded using very few bits belonging to certain 53 + ranges, resulting in multiple copy instructions using different encodings. 54 + Certain encodings involve one extra byte, others involve two extra bytes 55 + forming a little-endian 16-bit quantity (marked LE16 below). 56 + 57 + After any instruction except the large literal copy, 0, 1, 2 or 3 literals 58 + are copied before starting the next instruction. The number of literals that 59 + were copied may change the meaning and behaviour of the next instruction. In 60 + practice, only one instruction needs to know whether 0, less than 4, or more 61 + literals were copied. This is the information stored in the <state> variable 62 + in this implementation. This number of immediate literals to be copied is 63 + generally encoded in the last two bits of the instruction but may also be 64 + taken from the last two bits of an extra operand (eg: distance). 65 + 66 + End of stream is declared when a block copy of distance 0 is seen. Only one 67 + instruction may encode this distance (0001HLLL), it takes one LE16 operand 68 + for the distance, thus requiring 3 bytes. 69 + 70 + IMPORTANT NOTE : in the code some length checks are missing because certain 71 + instructions are called under the assumption that a certain number of bytes 72 + follow because it has already been garanteed before parsing the instructions. 73 + They just have to "refill" this credit if they consume extra bytes. This is 74 + an implementation design choice independant on the algorithm or encoding. 75 + 76 + Byte sequences 77 + 78 + First byte encoding : 79 + 80 + 0..17 : follow regular instruction encoding, see below. It is worth 81 + noting that codes 16 and 17 will represent a block copy from 82 + the dictionary which is empty, and that they will always be 83 + invalid at this place. 84 + 85 + 18..21 : copy 0..3 literals 86 + state = (byte - 17) = 0..3 [ copy <state> literals ] 87 + skip byte 88 + 89 + 22..255 : copy literal string 90 + length = (byte - 17) = 4..238 91 + state = 4 [ don't copy extra literals ] 92 + skip byte 93 + 94 + Instruction encoding : 95 + 96 + 0 0 0 0 X X X X (0..15) 97 + Depends on the number of literals copied by the last instruction. 98 + If last instruction did not copy any literal (state == 0), this 99 + encoding will be a copy of 4 or more literal, and must be interpreted 100 + like this : 101 + 102 + 0 0 0 0 L L L L (0..15) : copy long literal string 103 + length = 3 + (L ?: 15 + (zero_bytes * 255) + non_zero_byte) 104 + state = 4 (no extra literals are copied) 105 + 106 + If last instruction used to copy between 1 to 3 literals (encoded in 107 + the instruction's opcode or distance), the instruction is a copy of a 108 + 2-byte block from the dictionary within a 1kB distance. It is worth 109 + noting that this instruction provides little savings since it uses 2 110 + bytes to encode a copy of 2 other bytes but it encodes the number of 111 + following literals for free. It must be interpreted like this : 112 + 113 + 0 0 0 0 D D S S (0..15) : copy 2 bytes from <= 1kB distance 114 + length = 2 115 + state = S (copy S literals after this block) 116 + Always followed by exactly one byte : H H H H H H H H 117 + distance = (H << 2) + D + 1 118 + 119 + If last instruction used to copy 4 or more literals (as detected by 120 + state == 4), the instruction becomes a copy of a 3-byte block from the 121 + dictionary from a 2..3kB distance, and must be interpreted like this : 122 + 123 + 0 0 0 0 D D S S (0..15) : copy 3 bytes from 2..3 kB distance 124 + length = 3 125 + state = S (copy S literals after this block) 126 + Always followed by exactly one byte : H H H H H H H H 127 + distance = (H << 2) + D + 2049 128 + 129 + 0 0 0 1 H L L L (16..31) 130 + Copy of a block within 16..48kB distance (preferably less than 10B) 131 + length = 2 + (L ?: 7 + (zero_bytes * 255) + non_zero_byte) 132 + Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S 133 + distance = 16384 + (H << 14) + D 134 + state = S (copy S literals after this block) 135 + End of stream is reached if distance == 16384 136 + 137 + 0 0 1 L L L L L (32..63) 138 + Copy of small block within 16kB distance (preferably less than 34B) 139 + length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte) 140 + Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S 141 + distance = D + 1 142 + state = S (copy S literals after this block) 143 + 144 + 0 1 L D D D S S (64..127) 145 + Copy 3-4 bytes from block within 2kB distance 146 + state = S (copy S literals after this block) 147 + length = 3 + L 148 + Always followed by exactly one byte : H H H H H H H H 149 + distance = (H << 3) + D + 1 150 + 151 + 1 L L D D D S S (128..255) 152 + Copy 5-8 bytes from block within 2kB distance 153 + state = S (copy S literals after this block) 154 + length = 5 + L 155 + Always followed by exactly one byte : H H H H H H H H 156 + distance = (H << 3) + D + 1 157 + 158 + Authors 159 + 160 + This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an 161 + analysis of the decompression code available in Linux 3.16-rc5. The code is 162 + tricky, it is possible that this document contains mistakes or that a few 163 + corner cases were overlooked. In any case, please report any doubt, fix, or 164 + proposed updates to the author(s) so that the document can be updated.