# repo-stream

A robust CAR file -> MST walker for atproto

[![Crates.io][crates-badge]](https://crates.io/crates/repo-stream)
[![Documentation][docs-badge]](https://docs.rs/repo-stream)

[crates-badge]: https://img.shields.io/crates/v/repo-stream.svg
[docs-badge]: https://docs.rs/repo-stream/badge.svg


todo

- [ ] get an *empty* CAR for the test suite
- [ ] implement a max size-on-disk limit


-----

older stuff (to clean up):


current CAR processing times (records processed into their length as a usize, on phil's dev machine):

- 128MiB CAR file: `347ms`
- 5.0MiB: `6.1ms`
- 279KiB: `139us`
- 3.4KiB: `4.9us`


running the huge-car benchmark

- to avoid committing a huge CAR to the repo, you have to pass its path in through the env for now:

  ```bash
  HUGE_CAR=~/Downloads/did_plc_redacted.car cargo bench -- huge-car
  ```


todo

- [x] CAR file test fixtures & validation tests
- [x] make sure we can get the did and signature out for verification
  -> yeah, the commit is returned from init
- [ ] spec compliance todos
  - [x] assert that keys are ordered and fail if not
  - [x] verify node MST depth from key (possibly pending [interop test fixes](https://github.com/bluesky-social/atproto-interop-tests/issues/5))
- [ ] performance todos
  - [x] consume the serialized nodes into a mutable, efficient format
    - [ ] maybe customize the deserialize impl to do that directly?
  - [x] benchmark and profile
- [ ] robustness todos
  - [ ] swap the blocks hashmap for a BlockStore trait that can be dumped to redb
    - [ ] maybe keep the redb functionality behind a feature flag?
  - [ ] can we assert a max size for node blocks?
  - [x] figure out why asserting the upper nibble of the fourth byte of a node fails fingerprinting
    -> because it's the upper 3 bits, not the upper 4-bit nibble, oops.
  - [ ] max MST depth (there is actually a hard limit, but a malicious repo could do anything)
  - [ ] i don't *think* we need a max recursion depth for processing cbor contents, since we leave records to the user to decode

newer ideas

- interleaving the MST walk with block loading actually does perform ok: the walker just needs to tell the block loader which block it needs next, so the block loader can load every block up to that one without checking back with the walker. so i think we're streaming-order ready!


later ideas

- just buffering all the blocks is 2.5x faster than interleaving optimistic walking
  - at least, this is true on huge CARs with the current (stream-unfriendly) PDS export behaviour

- the transform function is a little tricky because we can't *know* whether a block is a record or a node until we actually walk the tree to it (after all the blocks are buffered in memory anyway)
  - still might as well benchmark a test with optimistic block probing + transform on the way in


original ideas:

- tries to walk and emit the MST *while streaming in the CAR*
- drops intermediate MST blocks after reading to reduce total memory
- user-provided transform function on record blocks from IPLD

future work:

- flush to disk if needed (sqlite? redb?) https://bsky.app/profile/divy.zone/post/3m2mf3jqx3k2w
  - either generally, to handle huge CARs, or as a fallback when streaming fails

redb has an in-memory backend, so it would be possible to *always* use it for block caching. the user could choose whether to allow disk or stay memory-only, and then "spilling" from the cache to disk would be mostly free? (a rough sketch of that abstraction is below)
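To make the BlockStore idea a bit more concrete, here is a minimal sketch of what that abstraction could look like. Everything in it (the `BlockStore` trait, `MemoryStore`, the `CidBytes` alias) is hypothetical and not part of the crate today; a redb-backed store, possibly behind a feature flag, would just be another implementation of the same trait.

```rust
use std::collections::HashMap;

/// Hypothetical key type: the raw CID bytes of a block.
type CidBytes = Vec<u8>;

/// A minimal storage abstraction the walker could use instead of a bare
/// HashMap. An in-memory impl covers the common case; a redb-backed impl
/// could implement the same trait to spill huge repos to disk.
trait BlockStore {
    fn put(&mut self, cid: CidBytes, block: Vec<u8>);
    fn get(&self, cid: &[u8]) -> Option<&[u8]>;
    /// Dropping intermediate MST nodes after they are walked keeps memory bounded.
    fn remove(&mut self, cid: &[u8]) -> Option<Vec<u8>>;
    /// Total bytes currently held, so a caller can enforce a max size limit
    /// and bail out (or switch to a disk-backed store) when it is exceeded.
    fn stored_bytes(&self) -> usize;
}

/// Plain in-memory backend: roughly what the current hashmap usage would become.
#[derive(Default)]
struct MemoryStore {
    blocks: HashMap<CidBytes, Vec<u8>>,
    bytes: usize,
}

impl BlockStore for MemoryStore {
    fn put(&mut self, cid: CidBytes, block: Vec<u8>) {
        self.bytes += block.len();
        if let Some(old) = self.blocks.insert(cid, block) {
            self.bytes -= old.len();
        }
    }

    fn get(&self, cid: &[u8]) -> Option<&[u8]> {
        self.blocks.get(cid).map(|b| b.as_slice())
    }

    fn remove(&mut self, cid: &[u8]) -> Option<Vec<u8>> {
        let removed = self.blocks.remove(cid);
        if let Some(block) = &removed {
            self.bytes -= block.len();
        }
        removed
    }

    fn stored_bytes(&self) -> usize {
        self.bytes
    }
}

fn main() {
    let mut store = MemoryStore::default();
    store.put(b"bafy-example".to_vec(), vec![0u8; 1024]);
    assert!(store.get(b"bafy-example").is_some());
    assert_eq!(store.stored_bytes(), 1024);
    // once stored_bytes() crosses a configured limit, a caller could start
    // spilling to a disk-backed BlockStore instead of buffering more in memory
    let _ = store.remove(b"bafy-example");
    assert_eq!(store.stored_bytes(), 0);
}
```

Whether the size accounting lives in the store or in the walker is an open question; the point is only that the walker needs nothing beyond get/put/remove, so a disk-backed cache could slot in without touching the MST-walking code.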

## license

This work is dual-licensed under MIT and Apache 2.0. You may choose either one when using this work.

`SPDX-License-Identifier: MIT OR Apache-2.0`