A better Rust ATProto crate
# Lexicon Codegen Plan

## Goal
Generate idiomatic Rust types from AT Protocol lexicon schemas with minimal nesting/indirection.

## Existing Infrastructure

### Already Implemented
- **lexicon.rs**: Complete lexicon parsing types (`LexiconDoc`, `LexUserType`, `LexObject`, etc)
- **fs.rs**: Directory walking for finding `.json` lexicon files
- **schema.rs**: `find_ref_unions()` - collects union fields from a single lexicon
- **output.rs**: Partial - has string type mapping and doc comment generation

### Attribute Macros
- `#[lexicon]` - adds `extra_data` field to structs
- `#[open_union]` - adds `Unknown(Data<'s>)` variant to enums

## Design Decisions

### Module/File Structure
- NSID `app.bsky.feed.post` → `app_bsky/feed/post.rs`
- Flat module names (no `app::bsky`, just `app_bsky`)
- Parent modules: `app_bsky/feed.rs` with `pub mod post;`

### Type Naming
- **Main def**: Use last segment of NSID
  - `app.bsky.feed.post#main` → `Post`
- **Other defs**: Pascal-case the def name
  - `replyRef` → `ReplyRef`
- **Union variants**: Use last segment of ref NSID
  - `app.bsky.embed.images` → `Images`
  - Collisions resolved by module path, not type name
- **No proliferation of `Main` types** like atrium has

### Type Generation

#### Records (lexRecord)
```rust
#[lexicon]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(rename_all = "camelCase")]
pub struct Post<'s> {
    /// Client-declared timestamp...
    pub created_at: Datetime,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub embed: Option<RecordEmbed<'s>>,
    pub text: CowStr<'s>,
}
```

#### Objects (lexObject)
Same as records but without `#[lexicon]` if inline/not a top-level def.
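
The module path and type naming rules above can be sketched as a few small helpers. This is a minimal illustration, not the crate's actual implementation: the function names (`nsid_to_path`, `type_name`, `to_pascal_case`, `to_snake_case`) are hypothetical, and two details are assumptions not stated in the plan: that the first two NSID segments always form the flat root module, and that camelCase segments are converted to snake_case for file names.

```rust
/// Converts a camelCase segment to snake_case (assumption: generated
/// module/file names should follow Rust naming conventions).
fn to_snake_case(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 4);
    for (i, c) in name.chars().enumerate() {
        if c.is_ascii_uppercase() {
            if i > 0 {
                out.push('_');
            }
            out.push(c.to_ascii_lowercase());
        } else {
            out.push(c);
        }
    }
    out
}

/// Upper-cases the first character, e.g. `replyRef` -> `ReplyRef`.
fn to_pascal_case(name: &str) -> String {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Maps an NSID to a generated file path. The first two segments are
/// joined with `_` (flat root module); the rest become nested modules.
fn nsid_to_path(nsid: &str) -> Option<String> {
    let segments: Vec<&str> = nsid.split('.').collect();
    if segments.len() < 3 {
        return None; // need at least a two-segment authority plus a name
    }
    let root = format!("{}_{}", segments[0], segments[1]);
    let rest: Vec<String> = segments[2..].iter().map(|s| to_snake_case(s)).collect();
    Some(format!("{}/{}.rs", root, rest.join("/")))
}

/// Picks the Rust type name for a ref such as `app.bsky.feed.post`,
/// `app.bsky.feed.post#main`, or `app.bsky.feed.post#replyRef`.
fn type_name(reference: &str) -> String {
    let (nsid, fragment) = match reference.split_once('#') {
        Some((nsid, frag)) => (nsid, Some(frag)),
        None => (reference, None),
    };
    match fragment {
        // Non-main defs: Pascal-case the def name.
        Some(def) if def != "main" => to_pascal_case(def),
        // Main def (explicit or implied): last NSID segment.
        _ => to_pascal_case(nsid.rsplit('.').next().unwrap_or(nsid)),
    }
}
```

With these helpers, `nsid_to_path("app.bsky.feed.post")` yields `app_bsky/feed/post.rs` and `type_name("app.bsky.feed.post#replyRef")` yields `ReplyRef`, matching the examples in the naming rules. Collisions are deliberately not handled here, since the plan resolves them by module path.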

#### Unions (lexRefUnion)
```rust
#[open_union]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(tag = "$type")]
pub enum RecordEmbed<'s> {
    #[serde(rename = "app.bsky.embed.images")]
    Images(Box<jacquard_api::app_bsky::embed::Images<'s>>),
    #[serde(rename = "app.bsky.embed.video")]
    Video(Box<jacquard_api::app_bsky::embed::Video<'s>>),
}
```

- Use `Box<T>` for all variants (handles circular refs)
- `#[open_union]` adds `Unknown(Data<'s>)` catch-all

#### Queries (lexXrpcQuery)
```rust
pub struct GetAuthorFeedParams<'s> {
    pub actor: AtIdentifier<'s>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'s>>,
}

pub struct GetAuthorFeedOutput<'s> {
    pub cursor: Option<CowStr<'s>>,
    pub feed: Vec<FeedViewPost<'s>>,
}
```

- Flat params/output structs
- No nesting like `Input { params: {...} }`

#### Procedures (lexXrpcProcedure)
Same as queries but with both `Input` and `Output` structs.

### Field Handling

#### Optional Fields
- Fields not in `required: []` → `Option<T>`
- Add `#[serde(skip_serializing_if = "Option::is_none")]`

#### Lifetimes
- All types have `'a` lifetime for borrowing from input
- `#[serde(borrow)]` where needed for zero-copy

#### Type Mapping
- `LexString` with format → specific types (`Datetime`, `Did`, etc)
- `LexString` without format → `CowStr<'a>`
- `LexInteger` → `i64`
- `LexBoolean` → `bool`
- `LexBytes` → `Bytes`
- `LexCidLink` → `CidLink<'a>`
- `LexBlob` → `Blob<'a>`
- `LexRef` → resolve to actual type path
- `LexRefUnion` → generate enum
- `LexArray` → `Vec<T>`
- `LexUnknown` → `Data<'a>`

### Reference Resolution

#### Known Refs
- Check corpus for ref existence
- `#ref: "app.bsky.embed.images"` → `jacquard_api::app_bsky::embed::Images<'a>`
- Handle fragments: `#ref: "com.example.foo#bar"` → `jacquard_api::com_example::foo::Bar<'a>`

#### Unknown Refs
- **In struct fields**: use `Data<'a>` as fallback type
- **In union variants**: handled by `Unknown(Data<'a>)` variant from `#[open_union]`
- Optional: log warnings for missing refs

## Implementation Phases

### Phase 1: Corpus Loading & Registry
**Goal**: Load all lexicons into memory for ref resolution

**Tasks**:
1. Create `LexiconCorpus` struct
   - `BTreeMap<SmolStr, LexiconDoc<'static>>` - NSID → doc
   - Methods: `load_from_dir()`, `get()`, `resolve_ref()`
2. Load all `.json` files from lexicon directory
3. Parse into `LexiconDoc` and insert into registry
4. Handle fragments in refs (`nsid#def`)

**Output**: Corpus registry that can resolve any ref

### Phase 2: Ref Analysis & Union Collection
**Goal**: Build complete picture of what refs exist and what unions need

**Tasks**:
1. Extend `find_ref_unions()` to work across entire corpus
2. For each union, collect all refs and check existence
3. Build `UnionRegistry`:
   - Union name → list of (known refs, unknown refs)
4. Detect circular refs (optional - or just Box everything)

**Output**: Complete list of unions to generate with their variants

### Phase 3: Code Generation - Core Types
**Goal**: Generate Rust code for individual types

**Tasks**:
1. Implement type generators:
   - `generate_struct()` for records/objects
   - `generate_enum()` for unions
   - `generate_field()` for object properties
   - `generate_type()` for primitives/refs
2. Handle optional fields (`required` list)
3. Add doc comments from `description`
4. Apply `#[lexicon]` / `#[open_union]` macros
5. Add serde attributes

**Output**: `TokenStream` for each type

### Phase 4: Module Organization
**Goal**: Organize generated types into module hierarchy

**Tasks**:
1. Parse NSID into components: `["app", "bsky", "feed", "post"]`
2. Determine file paths: `app_bsky/feed/post.rs`
3. Generate module files: `app_bsky/feed.rs` with `pub mod post;`
4. Generate root module: `app_bsky.rs`
5. Handle re-exports if needed

**Output**: File path → generated code mapping

### Phase 5: File Writing
**Goal**: Write generated code to filesystem

**Tasks**:
1. Format code with `prettyplease`
2. Create directory structure
3. Write module files
4. Write type files
5. Optional: run `rustfmt`

**Output**: Generated code on disk

### Phase 6: Testing & Validation
**Goal**: Ensure generated code compiles and works

**Tasks**:
1. Generate code for test lexicons
2. Compile generated code
3. Test serialization/deserialization
4. Test union variant matching
5. Test extra_data capture

## Edge Cases & Considerations

### Circular References
- **Simple approach**: Union variants always use `Box<T>` → handles all circular refs
- **Alternative**: DFS cycle detection to only Box when needed
  - Track visited refs and recursion stack
  - If ref appears in rec_stack → cycle detected
  - Algorithm:
    ```rust
    // Sketch only: parameter types are assumptions, and
    // `collect_refs_from_def` is a placeholder helper that does not exist yet.
    fn has_cycle<'c>(
        corpus: &'c LexiconCorpus,
        start_ref: &'c str,
        visited: &mut HashSet<&'c str>,
        rec_stack: &mut HashSet<&'c str>,
    ) -> bool {
        visited.insert(start_ref);
        rec_stack.insert(start_ref);

        for child_ref in collect_refs_from_def(corpus.resolve_ref(start_ref)) {
            if !visited.contains(child_ref) {
                if has_cycle(corpus, child_ref, visited, rec_stack) {
                    return true;
                }
            } else if rec_stack.contains(child_ref) {
                return true; // back edge = cycle
            }
        }

        rec_stack.remove(start_ref);
        false
    }
    ```
  - Only box variants that participate in cycles
- **Recommendation**: Start with simple (always Box), optimize later if needed

### Name Collisions
- Multiple types with same name in different lexicons
- Module path disambiguates: `app_bsky::feed::Post` vs `com_example::feed::Post`

### Unknown Refs
- Fallback to `Data<'s>` in struct fields
- Caught by `Unknown` variant in unions
- Warn during generation

### Inline Defs
- Nested objects/unions in same lexicon
- Generate as separate types in same file
- Keep names scoped to parent (e.g., `PostReplyRef`)

### Arrays
- `Vec<T>` for arrays
- Handle nested unions in arrays

### Tokens
- Simple marker types
- Generate as unit structs or type aliases?

## Traits for Generated Types

### Collection Trait (Records)
Records implement the existing `Collection` trait from jacquard-common:

```rust
pub struct Post<'a> {
    // ... fields
}

impl<'p> Collection for Post<'p> {
    const NSID: &'static str = "app.bsky.feed.post";
    type Record = Post<'p>;
}
```

### XrpcRequest Trait (Queries/Procedures)
New trait for XRPC endpoints:

```rust
pub trait XrpcRequest<'x> {
    /// The NSID for this XRPC method
    const NSID: &'static str;

    /// XRPC method (query/GET, procedure/POST)
    const METHOD: XrpcMethod;

    /// Input encoding (MIME type, e.g., "application/json")
    /// None for queries (no body)
    const INPUT_ENCODING: Option<&'static str>;

    /// Output encoding (MIME type)
    const OUTPUT_ENCODING: &'static str;

    /// Request parameters type (query params or body)
    type Params: Serialize;

    /// Response output type
    type Output: Deserialize<'x>;

    type Err: Error;
}

pub enum XrpcMethod {
    Query,     // GET
    Procedure, // POST
}
```

**Generated implementation:**
```rust
pub struct GetAuthorFeedParams<'a> {
    pub actor: AtIdentifier<'a>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'a>>,
}

pub struct GetAuthorFeedOutput<'a> {
    pub cursor: Option<CowStr<'a>>,
    pub feed: Vec<FeedViewPost<'a>>,
}

// The trait takes a lifetime parameter, so the impl must name it.
impl<'x> XrpcRequest<'x> for GetAuthorFeedParams<'_> {
    const NSID: &'static str = "app.bsky.feed.getAuthorFeed";
    const METHOD: XrpcMethod = XrpcMethod::Query;
    const INPUT_ENCODING: Option<&'static str> = None; // queries have no body
    const OUTPUT_ENCODING: &'static str = "application/json";

    type Params = Self;
    type Output = GetAuthorFeedOutput<'static>;
    type Err = GetAuthorFeedError;
}
```

**Encoding variations:**
- Most procedures: `"application/json"` for input/output
- Blob uploads: `"*/*"` or specific MIME type for input
- CAR files: `"application/vnd.ipld.car"` for repo operations
- Read from lexicon's `input.encoding` and `output.encoding` fields

**Trait benefits:**
- Allows monomorphization (static dispatch) for performance
- Client code can be generic over `impl XrpcRequest`
- Caveat: as written the trait cannot be used as `dyn XrpcRequest`, because traits with associated consts are not object-safe; dynamic dispatch would need a separate object-safe trait

#### XRPC Errors
Lexicons declare the errors each endpoint can return, and the trait carries an associated error type (`Err`).
For each endpoint, generate an error enum deriving `thiserror::Error` and `miette::Diagnostic`,
with variants and messages based on the lexicon's error definitions.

### Subscriptions
WebSocket streams - defer for now. Will need separate trait with message types.

## Open Questions

1. **Validation**: Generate runtime validation (min/max length, regex, etc)?
2. **Tokens**: How to represent token types?
3. **Errors**: How to handle codegen errors (missing refs, invalid schemas)?
4. **Incremental**: Support incremental codegen (only changed lexicons)?
5. **Formatting**: Always run rustfmt or rely on prettyplease?
6. **XrpcRequest location**: Should trait live in jacquard-common or separate jacquard-xrpc crate?
7. **Import shortening**: Track imports and shorten ref paths in generated code
   - Instead of `jacquard_api::app_bsky::richtext::Facet<'a>` emit `use jacquard_api::app_bsky::richtext::Facet;` and just `Facet<'a>`
   - Would require threading `ImportTracker` through all generate functions or post-processing token stream
   - Long paths are ugly but explicit - revisit once basic codegen is confirmed working
8. **Web-based lexicon resolution**: Fetch lexicons from the web instead of requiring local files
   - Implement [lexicon publication and resolution](https://atproto.com/specs/lexicon#lexicon-publication-and-resolution) spec
   - `LexiconCorpus::fetch_from_web(nsids: &[&str])` - fetch specific NSIDs
   - `LexiconCorpus::fetch_from_authority(authority: &str)` - fetch all from DID/domain
   - Resolution: `https://{authority}/.well-known/atproto/lexicon/{nsid}.json`
   - Recursively fetch refs, handle redirects/errors
   - Use `reqwest` for HTTP - still fits in jacquard-lexicon as it's corpus loading

## Success Criteria

- [ ] Generates code for all official AT Protocol lexicons
- [ ] Generated code compiles without errors
- [ ] No `Main` proliferation
- [ ] Union variants have readable names
- [ ] Unknown refs handled gracefully
- [ ] `#[lexicon]` and `#[open_union]` applied correctly
- [ ] Serialization round-trips correctly
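
## Appendix: Cycle Detection Sketch

As a more concrete take on the DFS check described under Circular References, here is a minimal, self-contained sketch. It assumes a toy `RefGraph` (a plain `HashMap` from a ref string to the refs its definition mentions) in place of the real `LexiconCorpus`; `RefGraph` and `needs_box` are illustrative names, not existing jacquard APIs.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-in for the corpus: each ref maps to the refs its def mentions.
type RefGraph<'a> = HashMap<&'a str, Vec<&'a str>>;

/// Returns true if a cycle is reachable from `start`
/// (depth-first search with an explicit recursion-stack set).
fn has_cycle<'a>(
    graph: &RefGraph<'a>,
    start: &'a str,
    visited: &mut HashSet<&'a str>,
    rec_stack: &mut HashSet<&'a str>,
) -> bool {
    visited.insert(start);
    rec_stack.insert(start);

    // Refs missing from the graph (unknown refs) simply have no children.
    if let Some(children) = graph.get(start) {
        for &child in children {
            if rec_stack.contains(child) {
                return true; // back edge: `child` is on the current DFS path
            }
            if !visited.contains(child) && has_cycle(graph, child, visited, rec_stack) {
                return true;
            }
        }
    }

    rec_stack.remove(start);
    false
}

/// Hypothetical wrapper: should a variant pointing at `reference` be boxed?
fn needs_box<'a>(graph: &RefGraph<'a>, reference: &'a str) -> bool {
    has_cycle(graph, reference, &mut HashSet::new(), &mut HashSet::new())
}
```

Note that this reports any cycle reachable from the starting ref, not only cycles that pass through it, so it may box slightly more variants than strictly necessary. That errs on the safe side, in line with the recommendation to start with always-Box and optimize later.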