# token-string

[![Crates.io Version](https://img.shields.io/crates/v/token-string)](https://crates.io/crates/token-string)
[![docs.rs](https://img.shields.io/docsrs/token-string)](https://docs.rs/token-string/latest/token_string/)
[![MPL v2.0 license](https://img.shields.io/badge/License-MPL--2.0-green)](./LICENSE)
[![Built with cargo-make](https://sagiegurari.github.io/cargo-make/assets/badges/cargo-make.svg)](https://sagiegurari.github.io/cargo-make)

Short (up to 65,535 bytes) immutable strings, e.g. to parse tokens, implemented in Rust. These are sometimes called "German Strings", because Germans have written the paper mentioning them, see [The Paper](#the-paper) for details.

- [Why?](#why)
- [The Compromise](#the-compromise)
- [How?](#how)
  - [Memory Layout of a TokenString](#memory-layout-of-a-tokenstring)
- [Usage of the Library](#usage-of-the-library)
  - [Installation](#installation)
  - [Examples](#examples)
  - [String Concatenation](#string-concatenation)
    - [Examples 2](#examples-2)
- [Performance](#performance)
- [Building and Testing the Source](#building-and-testing-the-source)
- [The Paper](#the-paper)
- [License](#license)

## Why?

When working with single words or even just single letters -- which are Unicode grapheme clusters and not necessarily single bytes -- we can most of the time get by with a bit (what a pun!) more than 10 bytes of UTF-8. English words are about 5 letters long on average, each letter needing exactly one byte; Finnish words average about 8 letters, most of which need at most 2 bytes (the Euro sign € needs 3 bytes (`0xE2 0x82 0xAC`), for example). Chinese symbols take 3 to 4 bytes, but most[^1] words are up to two symbols long. As a pointer on 64-bit architectures[^2] is already 8 bytes, we could store almost the whole string in the pointer itself without needing further indirection or heap allocation.
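These byte counts are easy to verify in plain Rust, independent of this crate: `str::len` returns the length in UTF-8 bytes, not in characters.

```rust
fn main() {
    // `str::len` counts UTF-8 bytes, not characters or grapheme clusters.
    assert_eq!("token".len(), 5); // 5 ASCII letters, one byte each
    assert_eq!("€".len(), 3); // the Euro sign needs 3 bytes
    assert_eq!("€".as_bytes(), &[0xE2, 0x82, 0xAC]); // exactly these 3 bytes
    assert_eq!("字".len(), 3); // a Chinese symbol from the BMP needs 3 bytes
    println!("all byte counts check out");
}
```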
This library is an attempt to use the string struct `TokenString`, which has a size of 16 bytes (128 bits), to store such small strings itself.

[^1]: Source for all of these average word lengths: "the internet"
[^2]: This does not work with 32 bit (or 128 bit) pointer types.

And immutable strings, because they make life easier -- except when concatenating -- and in many use cases the strings don't change after being generated anyway. E.g. after tokenization the scanned strings don't change any more.

## The Compromise

I have chosen to use 2 bytes -- 16 bits -- for the length of such a "small" string, so the maximum string length is `2^16 - 1 = 65,535` bytes. This leaves a maximum of 14 bytes for strings saved in the string struct itself. In other words, there is space for up to 14 ASCII (single byte) letters or three 4 byte Unicode scalar values, so e.g. 3 Chinese symbols.

Japanese Hiragana or Katakana need 3 bytes per symbol, so there is space for 4 of them, which is not enough to store "most" words. By using one more byte for the length, we could save up to 15 bytes or 5 Katakana symbols, but the maximum size of a string would be only 255 bytes.

## How?

Strings which are at most 14 bytes long are short, or "small", `TokenString`s.
Such small strings are directly contained in the `TokenString` struct, so no
memory is allocated on the heap and no indirection is needed to access the
string's data. Every string which is longer than 14 bytes -- up to the maximum of
65,535 bytes -- is allocated on the heap, but a copy of its first 6 bytes, the so
called "prefix", is stored in the struct itself and is directly accessible without
additional indirection through a pointer.

So, the first 6 bytes of every `TokenString`, the "prefix", are used to speed up
the comparison of strings and similar tasks which may not need to access the
whole string in all cases. E.g.
comparing two strings of the same length for
equality can fail fast if the prefixes of both strings differ, without the need to
de-reference a pointer to a string's data.

### Memory Layout of a TokenString

A `TokenString` is guaranteed to be aligned to 64 bits (8 bytes) and has a size of 16 bytes (128 bits).

Below is a diagram depicting the memory used by `TokenString`s and small `TokenString`s.

![Memory layout diagram](https://codeberg.org/Release-Candidate/token-string/raw/branch/main/images/TokenString.svg)

The first two bytes of the struct always hold the length of the `TokenString` and
the next 6 bytes the prefix, which is the first 6 bytes of the string. A small
`TokenString` contains the bytes after the prefix in the following 8 bytes, up
to the maximum size of 14 bytes for a small `TokenString`. A bigger `TokenString`
contains a pointer to the whole string data (including the prefix) in the
8 bytes after the prefix.

## Usage of the Library

The crate at crates.io: [crates.io: token_string](https://crates.io/crates/token-string).

Full documentation at [Docs.rs: token_string](https://docs.rs/token-string/latest/token_string/).

### Installation

Run the following Cargo command in your project directory:

```sh
cargo add token-string
```

Or add the following line to your Cargo.toml:

```toml
token-string = "0.8.1"
```

### Examples

A few short examples of the library, just to get a feel for how it works.
Detailed documentation is available at [docs.rs](https://docs.rs/token-string/latest/token_string/).

These are immutable strings, so don't use `mut`.

```rust
use token_string::TokenString;

// The string we pass may be too big for a `TokenString`, in which case an
// error is returned.
let s1 = TokenString::try_from("Hello, world!").unwrap();

// We can also create a `TokenString` from bytes:
let s2 = TokenString::try_from(b"Hello, world!".as_slice()).unwrap();

// Or a `String`.
let s3_string = "Hello, world!".to_owned();
let s3 = TokenString::try_from(&s3_string).unwrap();

// We can print `TokenString`s:
println!("{}", &s1);

// We can compare `TokenString`s:
assert!(s1 == s2);

// and compare them to `&str`, `&[u8]` or `&String`:
assert!(&s1 == "Hello, world!");
assert!(&s1 == b"Hello, world!".as_slice());
assert!(s1 == s3_string);

// We can also convert them to `&str`, `&[u8]`, `String` or a vector of `char`s:
let s1_str: &str = s1.as_str();
let s1_bytes: &[u8] = s1.as_bytes();
let s1_string: String = s1.as_string();
let s1_chars: Vec<char> = s1.as_chars();

// We can iterate through the bytes of the string.
// CAUTION: these are bytes, not Unicode scalar values or grapheme clusters, so
// not every iterated item "is" some kind of character.
for i in &s1 {
    println!("Byte: {i}");
}
```

### String Concatenation

Because `TokenString`s are immutable, we cannot concatenate two or more strings
by directly appending to the first string, as this would mutate that string. To
avoid allocating temporary strings when concatenating more than two, all
strings to concatenate are added to an array of `TokenString`s on the stack.
Only when calling the `Builder`'s method to actually concatenate all
strings is a new `TokenString` generated and memory allocated, if necessary.

#### Examples 2

```rust
use token_string::{Builder, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world!").unwrap();

// The number of strings to concatenate must be given to the `Builder` as a
// const generic value. We are concatenating 2 strings, so this is 2.
// The first parameter can be `'_`, to let the Rust compiler infer the lifetime.
let mut builder = Builder::<'_, 2>::new(&s1);

// Append `s2` to `s1`, "world!" to "Hello, ".
// An error is returned if the concatenated string is too big.
builder.concat_checked(&s2).unwrap();

// Create the new `TokenString`. Again, an error may occur.
let s1_s2 = builder.collect_checked().unwrap();

// Check if the result is actually "Hello, world!".
assert!(&s1_s2 == "Hello, world!");
```

As checking every step like above is quite cumbersome, the two traits `Concat`
and `Collect` make this easier by returning early if a step returns an error.
Instead of using `concat_checked` and `collect_checked`, we use the trait
methods `Concat::concat` and `Collect::collect`.

```rust
use token_string::{Builder, Collect as _, Concat as _, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world").unwrap();
let s3 = TokenString::try_from("!").unwrap();

// The number of strings to concatenate must be given to the `Builder` as a
// const generic value. We are concatenating 3 strings, so this is 3.
// The first parameter can be `'_`, to let the Rust compiler infer the lifetime.
let s1_s2_s3 = Builder::<'_, 3>::new(&s1)
    .concat(&s2)
    .concat(&s3)
    .collect()
    .unwrap();

// Check if the result is actually "Hello, world!".
assert!(&s1_s2_s3 == "Hello, world!");
```

This is better, but having to manually set the number of strings to concatenate
is tedious and error-prone. So let's use a macro to do all of that for us -- the
`token_string::concat!` macro internally does the same as the example above.
Just be sure to use `token_string::concat!` and not another `concat!` macro.

```rust
use token_string::{Builder, Collect as _, Concat as _, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world").unwrap();
let s3 = TokenString::try_from("!").unwrap();

// Fully qualify `token_string::concat!` to get the right `concat!` macro.
// Again, an error may be returned if the concatenated string would be too big.
let s1_s2_s3 = token_string::concat!(&s1, &s2, &s3).unwrap();

// Check if the result is actually "Hello, world!".
assert!(&s1_s2_s3 == "Hello, world!");
```

## Performance

You have to test for yourself! Whether this string representation is a measurable optimization -- much less one significant enough to be noticeable -- depends on your actual usage.

## Building and Testing the Source

This project uses [cargo-make](https://sagiegurari.github.io/cargo-make) to avoid having to type long cargo command lines. The file [./Makefile.toml](./Makefile.toml) contains its configuration.

But there is no need to use `cargo-make`; all commands can be entered manually too -- after having installed the needed programs.

The grouped list of `cargo-make`'s build targets and the cargo command lines they invoke:

- `build`: `cargo build` - Build the library with debug information.
- `doc`: `cargo doc --all-features` - Generate API documentation using rustdoc.
- `example`: `cargo run --example example` - Run the example binary, source: [./examples/main.rs](./examples/main.rs).
- `clean`: `cargo clean` - Delete files generated by cargo.
- `lint`: `cargo clippy` - Run Clippy, the Rust linter, on the sources. Needs Clippy installed.

Tests:

- `test`: `cargo nextest run --all-features` - Run all tests except for doc-tests. Needs [cargo-nextest](https://nexte.st/).
- `doc-test`: `cargo test --doc` - Run all doc-tests.
- `cov`: `cargo llvm-cov nextest --all-features --lcov --output-path lcov.info` - Run tests with coverage information, generate LCOV output at `./lcov.info`.
  Needs [cargo-llvm-cov](https://github.com/taiki-e/cargo-llvm-cov).
- `mutation`: `cargo mutants --test-tool=nextest -- --all-features` - Run mutation tests using [cargo-mutants](https://mutants.rs/).
- `mutation-iterate`: `cargo mutants --test-tool=nextest --iterate -- --all-features` - Run mutation tests with [cargo-mutants](https://mutants.rs/), using the data from previous runs to not run passed tests again.
- `miri`: `PROPTEST_DISABLE_FAILURE_PERSISTENCE=true MIRIFLAGS='-Zmiri-env-forward=PROPTEST_DISABLE_FAILURE_PERSISTENCE -Zmiri-backtrace=full -Zmiri-tree-borrows' cargo miri nextest run -j 10` - Run Miri on all tests. Needs Miri installed. **Warning:** this takes more than 10 hours if the test `concat_many` in [./tests/test_builder.rs](./tests/test_builder.rs) is run, compared to about 1 hour without that test.
- `miri-example`: `MIRIFLAGS='-Zmiri-backtrace=full -Zmiri-tree-borrows' cargo miri run --example example` - Run Miri on the example executable. Needs Miri installed.

## The Paper

[*Neumann, Freitag*: **Umbra: A Disk-Based System with In-Memory Performance**](https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf)
mentions comparable small strings in chapter **3.1 String Handling** on page 5.

The main difference to the variant I've implemented is that I use only 2 bytes
(16 bits) for the string length instead of the 4 bytes (32 bits) used in the paper. So the
maximum length of a `TokenString` is 65,535 bytes, compared to 4 GB in the paper.
But we gain 2 bytes of length for the short strings contained in the struct
itself: 14 bytes vs. 12 bytes in the paper's version.

## License

The source in this repository is licensed under the Mozilla Public
License version 2.0, see file [./LICENSE](./LICENSE).