# token-string

[![Crates.io Version](https://img.shields.io/crates/v/token-string)](https://crates.io/crates/token-string)
[![docs.rs](https://img.shields.io/docsrs/token-string)](https://docs.rs/token-string/latest/token_string/)
[![MPL v2.0 license](https://img.shields.io/badge/License-MPL--2.0-green)](./LICENSE)
[![Built with cargo-make](https://sagiegurari.github.io/cargo-make/assets/badges/cargo-make.svg)](https://sagiegurari.github.io/cargo-make)

Short (up to 65,535 bytes) immutable strings, e.g. to parse tokens, implemented in Rust. These are sometimes called "German Strings", because Germans have written the paper mentioning them, see [The Paper](#the-paper) for details.

- [Why?](#why)
- [The Compromise](#the-compromise)
- [How?](#how)
  - [Memory Layout of a TokenString](#memory-layout-of-a-tokenstring)
- [Usage of the Library](#usage-of-the-library)
  - [Installation](#installation)
  - [Examples](#examples)
  - [String Concatenation](#string-concatenation)
    - [Examples 2](#examples-2)
- [Performance](#performance)
- [Building and Testing the Source](#building-and-testing-the-source)
- [The Paper](#the-paper)
- [License](#license)

## Why?

When working with single words or even just single letters -- which are Unicode grapheme clusters and not necessarily single bytes -- we can most of the time get by with a bit (what a pun!) more than 10 bytes of UTF-8. English words are about 5 letters long on average, each letter needing exactly one byte; Finnish words average about 8 letters, most of which need at most 2 bytes (the Euro sign € needs 3 bytes (`0xE2 0x82 0xAC`), for example). Chinese symbols take 3 to 4 bytes, but most[^1] words are up to two symbols long. As a pointer on 64-bit architectures[^2] is already 8 bytes, we could store almost the whole string in the pointer itself without needing further indirection or heap allocation.
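These byte counts are easy to verify in plain Rust, independent of this crate: `str::len` returns the length in UTF-8 bytes, not in characters.

```rust
fn main() {
    // `str::len` counts UTF-8 bytes, not characters or grapheme clusters.
    assert_eq!("token".len(), 5); // 5 ASCII letters, one byte each
    assert_eq!("€".len(), 3); // the Euro sign needs 3 bytes
    assert_eq!("€".as_bytes(), &[0xE2, 0x82, 0xAC]); // exactly these 3 bytes
    assert_eq!("字".len(), 3); // a Chinese symbol from the BMP needs 3 bytes
    println!("all byte counts check out");
}
```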
This library is an attempt to use the string struct `TokenString`, which has a size of 16 bytes (128 bits), to store such small strings itself.

[^1]: Source for all of these average word lengths: "the internet"
[^2]: This does not work with 32 bit (or 128 bit) pointer types.

And immutable strings, because they make life easier -- except when concatenating -- and in many use cases the strings don't change after being generated anyway. E.g. after tokenization the scanned strings don't change any more.

## The Compromise

I have chosen to use 2 bytes -- 16 bits -- for the length of such a "small" string, so the maximum string length is `2^16 - 1 = 65,535` bytes. This leaves a maximum of 14 bytes for strings saved in the string struct itself. In other words, there is space for up to 14 ASCII (single byte) letters or three 4 byte Unicode scalar values, so e.g. 3 Chinese symbols.

Japanese Hiragana or Katakana need 3 bytes per symbol, so there is space for 4 of them, which is not enough to store "most" words. By using one more byte for the length, we could save up to 15 bytes or 5 Katakana symbols, but the maximum size of a string would be only 255 bytes.

## How?

Strings which are at most 14 bytes long are short, or "small", `TokenString`s.
Such small strings are directly contained in the `TokenString` struct, so no
memory is allocated on the heap and no indirection is needed to access the
string's data. Every string which is longer than 14 bytes -- up to the maximum of
65,535 bytes -- is allocated on the heap, but a copy of its first 6 bytes, the so
called "prefix", is stored in the struct itself and is directly accessible without
additional indirection through a pointer.

So, the first 6 bytes of every `TokenString`, the "prefix", are used to speed up
the comparison of strings and similar tasks which may not need to access the
whole string in all cases. E.g.
comparing two strings of the same length for
equality can fail fast if the prefixes of both strings differ, without the need to
de-reference a pointer to a string's data.

### Memory Layout of a TokenString

A `TokenString` is guaranteed to be aligned to 64 bits (8 bytes) and has a size of 16 bytes (128 bits).

Below is a diagram depicting the memory used by `TokenString`s and small `TokenString`s.

![Memory layout diagram](https://codeberg.org/Release-Candidate/token-string/raw/branch/main/images/TokenString.svg)

The first two bytes of the struct always hold the length of the `TokenString` and
the next 6 bytes the prefix, which is the first 6 bytes of the string. A small
`TokenString` contains the bytes after the prefix in the following 8 bytes, up
to the maximum size of 14 bytes for a small `TokenString`. A bigger `TokenString`
contains a pointer to the whole string data (including the prefix) in the
8 bytes after the prefix.

## Usage of the Library

The crate at crates.io: [crates.io: token_string](https://crates.io/crates/token-string).

Full documentation at [Docs.rs: token_string](https://docs.rs/token-string/latest/token_string/).

### Installation

Run the following Cargo command in your project directory:

```sh
cargo add token-string
```

Or add the following line to your Cargo.toml:

```toml
token-string = "0.8.1"
```

### Examples

A few short examples of the library, just to get a feel for how it works.
Detailed documentation is available at [docs.rs](https://docs.rs/token-string/latest/token_string/).

These are immutable strings, so don't use `mut`.

```rust
use token_string::TokenString;

// The string we pass may be too big for a `TokenString`, in which case an
// error is returned.
let s1 = TokenString::try_from("Hello, world!").unwrap();

// We can also create a `TokenString` from bytes:
let s2 = TokenString::try_from(b"Hello, world!".as_slice()).unwrap();

// Or a `String`.
let s3_string = "Hello, world!".to_owned();
let s3 = TokenString::try_from(&s3_string).unwrap();

// We can print `TokenString`s:
println!("{}", &s1);

// We can compare `TokenString`s:
assert!(s1 == s2);

// and compare them to `&str`, `&[u8]` or `&String`:
assert!(&s1 == "Hello, world!");
assert!(&s1 == b"Hello, world!".as_slice());
assert!(s1 == s3_string);

// We can also convert them to `&str`, `&[u8]`, `String` or a vector of `char`s:
let s1_str: &str = s1.as_str();
let s1_bytes: &[u8] = s1.as_bytes();
let s1_string: String = s1.as_string();
let s1_chars: Vec<char> = s1.as_chars();

// We can iterate through the bytes of the string.
// CAUTION: these are bytes, not Unicode scalar values or grapheme clusters, so
// not every iterated item "is" some kind of character.
for i in &s1 {
    println!("Byte: {i}");
}
```

### String Concatenation

Because `TokenString`s are immutable, we cannot concatenate two or more strings
by directly appending to the first string, as this would mutate that string. To
avoid allocating temporary strings when concatenating more than two, all
strings to concatenate are added to an array of `TokenString`s on the stack.
Only when calling the `Builder`'s method to actually concatenate all
strings is a new `TokenString` generated and memory allocated, if necessary.

#### Examples 2

```rust
use token_string::{Builder, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world!").unwrap();

// The number of strings to concatenate must be given to the `Builder` as a
// const generic value. We are concatenating 2 strings, so this is 2.
// The first parameter can be `'_`, to let the Rust compiler infer the lifetime.
let mut builder = Builder::<'_, 2>::new(&s1);

// Append `s2` to `s1`, "world!" to "Hello, ".
// An error is returned if the concatenated string is too big.
builder.concat_checked(&s2).unwrap();

// Create the new `TokenString`. Again, an error may occur.
let s1_s2 = builder.collect_checked().unwrap();

// Check if the result is actually "Hello, world!".
assert!(&s1_s2 == "Hello, world!");
```

As checking every step like above is quite cumbersome, the two traits `Concat`
and `Collect` make this easier by returning early if a step returns an error.
Instead of using `concat_checked` and `collect_checked`, we use the trait
methods `Concat::concat` and `Collect::collect`.

```rust
use token_string::{Builder, Collect as _, Concat as _, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world").unwrap();
let s3 = TokenString::try_from("!").unwrap();

// The number of strings to concatenate must be given to the `Builder` as a
// const generic value. We are concatenating 3 strings, so this is 3.
// The first parameter can be `'_`, to let the Rust compiler infer the lifetime.
let s1_s2_s3 = Builder::<'_, 3>::new(&s1)
    .concat(&s2)
    .concat(&s3)
    .collect()
    .unwrap();

// Check if the result is actually "Hello, world!".
assert!(&s1_s2_s3 == "Hello, world!");
```

This is better, but having to manually set the number of strings to concatenate
is tedious and error-prone. So let's use a macro to do all of that for us -- the
`token_string::concat!` macro internally does the same as the example above.
Just be sure to use `token_string::concat!` and not another `concat!` macro.

```rust
use token_string::{Builder, Collect as _, Concat as _, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world").unwrap();
let s3 = TokenString::try_from("!").unwrap();

// Fully qualify `token_string::concat!` to get the right `concat!` macro.
// Again, an error may be returned if the concatenated string would be too big.
let s1_s2_s3 = token_string::concat!(&s1, &s2, &s3).unwrap();

// Check if the result is actually "Hello, world!".
assert!(&s1_s2_s3 == "Hello, world!");
```

## Performance

You have to test for yourself! Whether this string representation is a measurable optimization -- much less one significant enough to be noticeable -- depends on your actual usage.

## Building and Testing the Source

This project uses [cargo-make](https://sagiegurari.github.io/cargo-make) to avoid having to type long cargo command lines. The file [./Makefile.toml](./Makefile.toml) contains its configuration.

But there is no need to use `cargo-make`; all commands can be entered manually too -- after having installed the needed programs.

The grouped list of `cargo-make`'s build targets and the cargo command lines they invoke:

- `build`: `cargo build` - Build the library with debug information.
- `doc`: `cargo doc --all-features` - Generate API documentation using rustdoc.
- `example`: `cargo run --example example` - Run the example binary, source: [./examples/main.rs](./examples/main.rs).
- `clean`: `cargo clean` - Delete files generated by cargo.
- `lint`: `cargo clippy` - Run Clippy, the Rust linter, on the sources. Needs Clippy installed.

Tests:

- `test`: `cargo nextest run --all-features` - Run all tests except for doc-tests. Needs [cargo-nextest](https://nexte.st/).
- `doc-test`: `cargo test --doc` - Run all doc-tests.
- `cov`: `cargo llvm-cov nextest --all-features --lcov --output-path lcov.info` - Run tests with coverage information, generate LCOV output at `./lcov.info`.
  Needs [cargo-llvm-cov](https://github.com/taiki-e/cargo-llvm-cov).
- `mutation`: `cargo mutants --test-tool=nextest -- --all-features` - Run mutation tests using [cargo-mutants](https://mutants.rs/).
- `mutation-iterate`: `cargo mutants --test-tool=nextest --iterate -- --all-features` - Run mutation tests with [cargo-mutants](https://mutants.rs/), using the data from previous runs to not run passed tests again.
- `miri`: `PROPTEST_DISABLE_FAILURE_PERSISTENCE=true MIRIFLAGS='-Zmiri-env-forward=PROPTEST_DISABLE_FAILURE_PERSISTENCE -Zmiri-backtrace=full -Zmiri-tree-borrows' cargo miri nextest run -j 10` - Run Miri on all tests. Needs Miri installed. **Warning:** this takes more than 10 hours if the test `concat_many` in [./tests/test_builder.rs](./tests/test_builder.rs) is run, compared to about 1 hour without that test.
- `miri-example`: `MIRIFLAGS='-Zmiri-backtrace=full -Zmiri-tree-borrows' cargo miri run --example example` - Run Miri on the example executable. Needs Miri installed.

## The Paper

[*Neumann, Freitag*: **Umbra: A Disk-Based System with In-Memory Performance**](https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf)
mentions comparable small strings in chapter **3.1 String Handling** on page 5.

The main difference to the variant I've implemented is that I use only 2 bytes
(16 bits) for the string length instead of the 4 bytes (32 bits) used in the paper. So the
maximum length of a `TokenString` is 65,535 bytes, compared to 4 GB in the paper.
But we gain 2 bytes of length for the short strings contained in the struct
itself: 14 bytes vs. 12 bytes in the paper's version.

## License

The source in this repository is licensed under the Mozilla Public
License version 2.0, see file [./LICENSE](./LICENSE).