Fast and robust atproto CAR file processing in Rust


get ready for release

authored by bad-example.com and committed by tangled.org e64b0e07 458d4429

+27 -11
+2 -2
Cargo.toml
```diff
 [package]
 name = "repo-stream"
-version = "0.2.2"
+version = "0.3.0"
 edition = "2024"
 license = "MIT OR Apache-2.0"
-description = "A robust CAR file -> MST walker for atproto"
+description = "Fast and robust atproto CAR file processing"
 repository = "https://tangled.org/@microcosm.blue/repo-stream"

 [dependencies]
```
+11
changelog.md
```diff
+# v0.3.0
+
+_2026-01-15_
+
+- drop sqlite, pick up fjall v3 for some speeeeeeed (and code simplification and easier build requirements)
+- no more `Processable` trait, process functions are just `Vec<u8> -> Vec<u8>` now (bring your own ser/de). there's a potential small cost here where processors now actually have to go through serialization even for in-memory car walking, but i think zero-copy approaches (eg. rkyv) are low-cost enough
+- custom deserialize for MST nodes that does as much depth calculation and rkey validation as possible in-line (not clear if it actually made anything faster)
+- check MST depth at every node properly (previously it could do some walking before being able to check, and included some assumptions)
+- check MST for empty leaf nodes (which are not allowed)
+- shave 0.6 nanoseconds (really) from MST depth calculation (don't ask)
+- drop and swap some dependencies: `bincode`, `futures`, and `futures-core` dropped; `ipld-core` -> `cid` + `multibase`; `rusqlite` -> `fjall`. and add `hashbrown` bc it benchmarked a bit faster. (we hash on user-controlled CIDs -- is the lower DoS-resistance a risk to worry about?)
```
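The `Vec<u8> -> Vec<u8>` shape described in the changelog can be illustrated with a minimal sketch. The function name and the size cutoff below are invented for illustration and are not part of repo-stream's actual API:

```rust
// Hypothetical post-0.3.0 process function: with the `Processable` trait gone,
// a processor is just bytes-in, bytes-out, and any ser/de (dag-cbor, rkyv, ...)
// is the caller's choice. `drop_oversized` and the 1 MiB limit are examples.
fn drop_oversized(record: Vec<u8>) -> Vec<u8> {
    const MAX: usize = 1024 * 1024; // arbitrary example limit
    if record.len() > MAX {
        Vec::new() // signal "nothing to keep" with an empty buffer
    } else {
        record // pass small records through untouched
    }
}

fn main() {
    assert_eq!(drop_oversized(vec![7u8; 4]), vec![7u8; 4]);
    assert!(drop_oversized(vec![0u8; 2 * 1024 * 1024]).is_empty());
    println!("ok");
}
```

Because the function owns its input and can return it unchanged, the common small-record path costs only a move, which is why a zero-copy deserializer on top of the raw bytes stays cheap.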
+14 -9
readme.md
````diff
 ```

 more recent todo
+- [ ] add a zero-copy rkyv process function example
 - [ ] repo car slices
 - [ ] lazy-value stream (rkey -> CID diffing for tap-like `#sync` handling)
 - [x] get an *empty* car for the test suite
 - [x] implement a max size on disk limit

+some ideas
+- [ ] since the disk k/v get/set interface is now so similar to HashMap (blocking, no transactions), it's probably possible to make a single `Driver` and move the thread stuff from the disk one to generic helper functions. (might create async footguns though)
+- [ ] fork iroh-car into a sync version so we can drop tokio as a hard requirement, and offer async via wrapper helpers
+- [ ] feature-flag the sha2 crate for hmac-sha256? if someone wanted fewer deps?? then maybe make `hashbrown` also optional vs the builtin hashmap?

 -----
···
 - [x] car file test fixtures & validation tests
 - [x] make sure we can get the did and signature out for verification
   -> yeah the commit is returned from init
-- [ ] spec compliance todos
+- [x] spec compliance todos
   - [x] assert that keys are ordered and fail if not
   - [x] verify node mst depth from key (possibly pending [interop test fixes](https://github.com/bluesky-social/atproto-interop-tests/issues/5))
-- [ ] performance todos
+- [x] performance todos
   - [x] consume the serialized nodes into a mutable efficient format
-    - [ ] maybe customize the deserialize impl to do that directly?
+    - [x] maybe customize the deserialize impl to do that directly?
   - [x] benchmark and profile
-- [ ] robustness todos
-  - [ ] swap the blocks hashmap for a BlockStore trait that can be dumped to redb
-    - [ ] maybe keep the redb function behind a feature flag?
-  - [ ] can we assert a max size for node blocks?
+- [x] robustness todos
+  - [x] swap the blocks hashmap for a BlockStore trait that can be dumped to redb
+    - [x] maybe keep the redb function behind a feature flag?
+  - [ ] can we assert a max size of entries for node blocks?
   - [x] figure out why asserting the upper nibble of the fourth byte of a node fails fingerprinting
     -> because it's the upper 3 bytes, not upper 4 byte nibble, oops.
-  - [ ] max mst depth (there is actually a hard limit but a malicious repo could do anything)
-  - [ ] i don't *think* we need a max recursion depth for processing cbor contents since we leave records to the user to decode
+  - [x] max mst depth (too expensive to attack actually)
+  - [x] i don't *think* we need a max recursion depth for processing cbor contents since we leave records to the user to decode

 newer ideas

````
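The "single `Driver` over a HashMap-like get/set" idea in the todo list above could look roughly like the sketch below. The trait name, method signatures, and `roundtrip` helper are invented for illustration; repo-stream's real `BlockStore` abstraction may differ:

```rust
use std::collections::HashMap;

// Hypothetical blocking k/v interface shared by the in-memory and on-disk
// stores: just get/set, no transactions, mirroring the HashMap-like shape
// the todo item describes.
trait BlockStore {
    fn get(&self, cid: &[u8]) -> Option<Vec<u8>>;
    fn set(&mut self, cid: Vec<u8>, block: Vec<u8>);
}

// The in-memory store is literally a HashMap behind the trait.
impl BlockStore for HashMap<Vec<u8>, Vec<u8>> {
    fn get(&self, cid: &[u8]) -> Option<Vec<u8>> {
        HashMap::get(self, cid).cloned()
    }
    fn set(&mut self, cid: Vec<u8>, block: Vec<u8>) {
        self.insert(cid, block);
    }
}

// A single generic driver would only depend on the trait, so the disk-backed
// store (fjall, etc.) could slot in without a second driver implementation.
fn roundtrip<S: BlockStore>(store: &mut S) -> Option<Vec<u8>> {
    store.set(b"cid-1".to_vec(), b"block-bytes".to_vec());
    store.get(b"cid-1")
}

fn main() {
    let mut mem: HashMap<Vec<u8>, Vec<u8>> = HashMap::new();
    assert_eq!(roundtrip(&mut mem), Some(b"block-bytes".to_vec()));
    println!("ok");
}
```

Moving the thread-offload logic out of the disk store and into generic helpers over such a trait is what would collapse the two drivers into one, at the risk of the async footguns the note mentions (blocking calls on an async executor).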