# Lexicon Codegen Plan

## Goal

Generate idiomatic Rust types from AT Protocol lexicon schemas with minimal nesting/indirection.

## Existing Infrastructure

### Already Implemented

- **lexicon.rs**: Complete lexicon parsing types (`LexiconDoc`, `LexUserType`, `LexObject`, etc.)
- **fs.rs**: Directory walking for finding `.json` lexicon files
- **schema.rs**: `find_ref_unions()` - collects union fields from a single lexicon
- **output.rs**: Partial - has string type mapping and doc comment generation

### Attribute Macros

- `#[lexicon]` - adds an `extra_data` field to structs
- `#[open_union]` - adds an `Unknown(Data<'s>)` variant to enums

## Design Decisions

### Module/File Structure

- NSID `app.bsky.feed.post` → `app_bsky/feed/post.rs`
- Flat module names (no `app::bsky`, just `app_bsky`)
- Parent modules: `app_bsky/feed.rs` with `pub mod post;`

### Type Naming

- **Main def**: use the last segment of the NSID - `app.bsky.feed.post#main` → `Post`
- **Other defs**: Pascal-case the def name - `replyRef` → `ReplyRef`
- **Union variants**: use the last segment of the ref NSID - `app.bsky.embed.images` → `Images`
- Collisions are resolved by module path, not type name
- **No proliferation of `Main` types** like atrium has

### Type Generation

#### Records (lexRecord)

```rust
#[lexicon]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(rename_all = "camelCase")]
pub struct Post<'s> {
    /// Client-declared timestamp...
    pub created_at: Datetime,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub embed: Option<RecordEmbed<'s>>,
    pub text: CowStr<'s>,
}
```

#### Objects (lexObject)

Same as records, but without `#[lexicon]` if inline/not a top-level def.

#### Unions (lexRefUnion)

```rust
#[open_union]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(tag = "$type")]
pub enum RecordEmbed<'s> {
    #[serde(rename = "app.bsky.embed.images")]
    Images(Box<Images<'s>>),
    #[serde(rename = "app.bsky.embed.video")]
    Video(Box<Video<'s>>),
}
```

- Use `Box` for all variants (handles circular refs)
- `#[open_union]` adds an `Unknown(Data<'s>)` catch-all

#### Queries (lexXrpcQuery)

```rust
pub struct GetAuthorFeedParams<'s> {
    pub actor: AtIdentifier<'s>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'s>>,
}

pub struct GetAuthorFeedOutput<'s> {
    pub cursor: Option<CowStr<'s>>,
    pub feed: Vec<FeedViewPost<'s>>,
}
```

- Flat params/output structs
- No nesting like `Input { params: {...} }`

#### Procedures (lexXrpcProcedure)

Same as queries, but with both `Input` and `Output` structs.
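For illustration, a hypothetical input/output pair sketched from `com.atproto.repo.createRecord` (field set abridged; `Nsid`, `AtUri`, and `Cid` are assumed to exist alongside the types above, and derives/serde attributes are elided):

```rust
pub struct CreateRecordInput<'s> {
    /// The handle or DID of the repo to write to.
    pub repo: AtIdentifier<'s>,
    /// The NSID of the record collection.
    pub collection: Nsid<'s>,
    /// The record body; schema unknown at codegen time, so `Data` per the
    /// type mapping below.
    pub record: Data<'s>,
}

pub struct CreateRecordOutput<'s> {
    pub uri: AtUri<'s>,
    pub cid: Cid<'s>,
}
```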
### Field Handling

#### Optional Fields

- Fields not in `required: []` → `Option<T>`
- Add `#[serde(skip_serializing_if = "Option::is_none")]`

#### Lifetimes

- All types have a `'a` lifetime for borrowing from input
- `#[serde(borrow)]` where needed for zero-copy

#### Type Mapping

- `LexString` with format → specific types (`Datetime`, `Did`, etc.)
- `LexString` without format → `CowStr<'a>`
- `LexInteger` → `i64`
- `LexBoolean` → `bool`
- `LexBytes` → `Bytes`
- `LexCidLink` → `CidLink<'a>`
- `LexBlob` → `Blob<'a>`
- `LexRef` → resolve to the actual type path
- `LexRefUnion` → generate an enum
- `LexArray` → `Vec<T>`
- `LexUnknown` → `Data<'a>`

### Reference Resolution

#### Known Refs

- Check the corpus for ref existence
- `#ref: "app.bsky.embed.images"` → `jacquard_api::app_bsky::embed::Images<'a>`
- Handle fragments: `#ref: "com.example.foo#bar"` → `jacquard_api::com_example::foo::Bar<'a>`

#### Unknown Refs

- **In struct fields**: use `Data<'a>` as the fallback type
- **In union variants**: handled by the `Unknown(Data<'a>)` variant from `#[open_union]`
- Optional: log warnings for missing refs

## Implementation Phases

### Phase 1: Corpus Loading & Registry

**Goal**: Load all lexicons into memory for ref resolution (see the sketch after Phase 6)

**Tasks**:
1. Create a `LexiconCorpus` struct
   - `BTreeMap<Nsid, LexiconDoc>` - NSID → doc
   - Methods: `load_from_dir()`, `get()`, `resolve_ref()`
2. Load all `.json` files from the lexicon directory
3. Parse into `LexiconDoc` and insert into the registry
4. Handle fragments in refs (`nsid#def`)

**Output**: Corpus registry that can resolve any ref

### Phase 2: Ref Analysis & Union Collection

**Goal**: Build a complete picture of which refs exist and which unions are needed

**Tasks**:
1. Extend `find_ref_unions()` to work across the entire corpus
2. For each union, collect all refs and check existence
3. Build a `UnionRegistry`:
   - Union name → list of (known refs, unknown refs)
4. Detect circular refs (optional - or just `Box` everything)

**Output**: Complete list of unions to generate, with their variants

### Phase 3: Code Generation - Core Types

**Goal**: Generate Rust code for individual types

**Tasks**:
1. Implement type generators:
   - `generate_struct()` for records/objects
   - `generate_enum()` for unions
   - `generate_field()` for object properties
   - `generate_type()` for primitives/refs
2. Handle optional fields (`required` list)
3. Add doc comments from `description`
4. Apply the `#[lexicon]` / `#[open_union]` macros
5. Add serde attributes

**Output**: `TokenStream` for each type

### Phase 4: Module Organization

**Goal**: Organize generated types into a module hierarchy

**Tasks**:
1. Parse the NSID into components: `["app", "bsky", "feed", "post"]`
2. Determine file paths: `app_bsky/feed/post.rs`
3. Generate module files: `app_bsky/feed.rs` with `pub mod post;`
4. Generate the root module: `app_bsky.rs`
5. Handle re-exports if needed

**Output**: File path → generated code mapping

### Phase 5: File Writing

**Goal**: Write generated code to the filesystem

**Tasks**:
1. Format code with `prettyplease`
2. Create the directory structure
3. Write module files
4. Write type files
5. Optional: run `rustfmt`

**Output**: Generated code on disk

### Phase 6: Testing & Validation

**Goal**: Ensure generated code compiles and works

**Tasks**:
1. Generate code for test lexicons
2. Compile the generated code
3. Test serialization/deserialization
4. Test union variant matching
5. Test `extra_data` capture
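Returning to Phase 1, a minimal sketch of the corpus registry. It assumes an owned `LexiconDoc` from lexicon.rs with an `id` field; `walk_json_files` is a hypothetical name standing in for the fs.rs directory walker:

```rust
use std::collections::BTreeMap;
use std::error::Error;
use std::path::Path;

/// Phase 1 sketch: every parsed lexicon in memory, keyed by NSID.
pub struct LexiconCorpus {
    docs: BTreeMap<String, LexiconDoc>,
}

impl LexiconCorpus {
    /// Walk `dir`, parse every `.json` file, and index each doc by its `id`.
    pub fn load_from_dir(dir: &Path) -> Result<Self, Box<dyn Error>> {
        let mut docs = BTreeMap::new();
        for path in walk_json_files(dir)? {
            let text = std::fs::read_to_string(&path)?;
            let doc: LexiconDoc = serde_json::from_str(&text)?;
            docs.insert(doc.id.to_string(), doc);
        }
        Ok(Self { docs })
    }

    /// Resolve `nsid` or `nsid#def` to (document, def name). A bare NSID
    /// points at its `main` def; local refs like `#replyRef` must instead
    /// be resolved against the current document.
    pub fn resolve_ref(&self, r: &str) -> Option<(&LexiconDoc, &str)> {
        let (nsid, def) = r.split_once('#').unwrap_or((r, "main"));
        self.docs.get(nsid).map(|doc| (doc, def))
    }
}
```

If the real `LexiconDoc` borrows from its input (per the zero-copy design above), `load_from_dir` would additionally need to keep the file contents alive, e.g. by storing or leaking them.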
## Edge Cases & Considerations

### Circular References

- **Simple approach**: union variants always use `Box` → handles all circular refs
- **Alternative**: DFS cycle detection, boxing only when needed
  - Track visited refs and a recursion stack
  - If a ref reappears in the recursion stack → cycle detected (back edge)
  - Only box variants that participate in cycles
- **Recommendation**: start with the simple approach (always `Box`), optimize later if needed

The detection algorithm, typed up:

```rust
fn has_cycle(
    corpus: &LexiconCorpus,
    start_ref: &str,
    visited: &mut HashSet<String>,
    rec_stack: &mut HashSet<String>,
) -> bool {
    visited.insert(start_ref.to_owned());
    rec_stack.insert(start_ref.to_owned());
    for child_ref in collect_refs_from_def(corpus.resolve(start_ref)) {
        if !visited.contains(&child_ref) {
            if has_cycle(corpus, &child_ref, visited, rec_stack) {
                return true;
            }
        } else if rec_stack.contains(&child_ref) {
            return true; // back edge = cycle
        }
    }
    rec_stack.remove(start_ref);
    false
}
```

### Name Collisions

- Multiple types with the same name in different lexicons
- The module path disambiguates: `app_bsky::feed::Post` vs `com_example::feed::Post`

### Unknown Refs

- Fall back to `Data<'s>` in struct fields
- Caught by the `Unknown` variant in unions
- Warn during generation

### Inline Defs

- Nested objects/unions in the same lexicon
- Generate as separate types in the same file
- Keep names scoped to the parent (e.g., `PostReplyRef`)

### Arrays

- `Vec<T>` for arrays
- Handle nested unions in arrays

### Tokens

- Simple marker types
- Generate as unit structs or type aliases?

## Traits for Generated Types

### Collection Trait (Records)

Records implement the existing `Collection` trait from jacquard-common:

```rust
pub struct Post<'a> {
    // ... fields
}

impl<'a> Collection for Post<'a> {
    const NSID: &'static str = "app.bsky.feed.post";
    type Record = Post<'a>;
}
```

### XrpcRequest Trait (Queries/Procedures)

New trait for XRPC endpoints:

```rust
pub trait XrpcRequest<'x> {
    /// The NSID for this XRPC method
    const NSID: &'static str;
    /// XRPC method (query/GET, procedure/POST)
    const METHOD: XrpcMethod;
    /// Input encoding (MIME type, e.g., "application/json");
    /// `None` for queries (no body)
    const INPUT_ENCODING: Option<&'static str>;
    /// Output encoding (MIME type)
    const OUTPUT_ENCODING: &'static str;
    /// Request parameters type (query params or body)
    type Params: Serialize;
    /// Response output type
    type Output: Deserialize<'x>;
    type Err: Error;
}

pub enum XrpcMethod {
    Query,     // GET
    Procedure, // POST
}
```

**Generated implementation:**

```rust
pub struct GetAuthorFeedParams<'a> {
    pub actor: AtIdentifier<'a>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'a>>,
}

pub struct GetAuthorFeedOutput<'a> {
    pub cursor: Option<CowStr<'a>>,
    pub feed: Vec<FeedViewPost<'a>>,
}

impl<'x> XrpcRequest<'x> for GetAuthorFeedParams<'_> {
    const NSID: &'static str = "app.bsky.feed.getAuthorFeed";
    const METHOD: XrpcMethod = XrpcMethod::Query;
    const INPUT_ENCODING: Option<&'static str> = None; // queries have no body
    const OUTPUT_ENCODING: &'static str = "application/json";

    type Params = Self;
    type Output = GetAuthorFeedOutput<'static>;
    type Err = GetAuthorFeedError;
}
```

**Encoding variations:**
- Most procedures: `"application/json"` for input/output
- Blob uploads: `"*/*"` or a specific MIME type for input
- CAR files: `"application/vnd.ipld.car"` for repo operations
- Read from the lexicon's `input.encoding` and `output.encoding` fields

**Trait benefits:**
- Allows monomorphization (static dispatch) for performance
- Also supports `dyn XrpcRequest` for dynamic dispatch if needed
- Client code can be generic over `impl XrpcRequest`

#### XRPC Errors

Lexicons declare the errors each method can return. The trait carries an associated error type: codegen emits an error enum with `thiserror::Error` and `miette::Diagnostic` derives, with its content generated from the lexicon's error info.
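A sketch of what that could look like for `GetAuthorFeedError` above. The variant names are taken from the getAuthorFeed lexicon's `errors` array; the messages and catch-all variant are illustrative:

```rust
use miette::Diagnostic;
use thiserror::Error;

/// Sketch of the enum codegen could emit for app.bsky.feed.getAuthorFeed.
#[derive(Debug, Error, Diagnostic)]
pub enum GetAuthorFeedError {
    #[error("BlockedActor")]
    BlockedActor,
    #[error("BlockedByActor")]
    BlockedByActor,
    /// Servers can return errors the lexicon doesn't declare; a catch-all
    /// keeps response handling infallible.
    #[error("unknown XRPC error: {0}")]
    Unknown(String),
}
```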
### Subscriptions

WebSocket streams - deferred for now. These will need a separate trait with message types.

## Open Questions

1. **Validation**: Generate runtime validation (min/max length, regex, etc.)?
2. **Tokens**: How to represent token types?
3. **Errors**: How to handle codegen errors (missing refs, invalid schemas)?
4. **Incremental**: Support incremental codegen (only changed lexicons)?
5. **Formatting**: Always run `rustfmt`, or rely on `prettyplease`?
6. **XrpcRequest location**: Should the trait live in jacquard-common or in a separate jacquard-xrpc crate?
7. **Import shortening**: Track imports and shorten ref paths in generated code
   - Instead of `jacquard_api::app_bsky::richtext::Facet<'a>`, emit `use jacquard_api::app_bsky::richtext::Facet;` and just `Facet<'a>`
   - Would require threading an `ImportTracker` through all generate functions, or post-processing the token stream
   - Long paths are ugly but explicit - revisit once basic codegen is confirmed working
8. **Web-based lexicon resolution**: Fetch lexicons from the web instead of requiring local files
   - Implement the [lexicon publication and resolution](https://atproto.com/specs/lexicon#lexicon-publication-and-resolution) spec
   - `LexiconCorpus::fetch_from_web(nsids: &[&str])` - fetch specific NSIDs
   - `LexiconCorpus::fetch_from_authority(authority: &str)` - fetch all from a DID/domain
   - Resolution: `https://{authority}/.well-known/atproto/lexicon/{nsid}.json`
   - Recursively fetch refs; handle redirects/errors
   - Use `reqwest` for HTTP - still fits in jacquard-lexicon since it's corpus loading

## Success Criteria

- [ ] Generates code for all official AT Protocol lexicons
- [ ] Generated code compiles without errors
- [ ] No `Main` proliferation
- [ ] Union variants have readable names
- [ ] Unknown refs handled gracefully
- [ ] `#[lexicon]` and `#[open_union]` applied correctly
- [ ] Serialization round-trips correctly
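A sketch of the check behind the last two criteria, assuming `serde_json` as the test harness and the generated `Post` record from above:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    /// Fields the schema doesn't know about should be captured by the
    /// `extra_data` field that #[lexicon] adds, so serializing again
    /// reproduces the full input.
    #[test]
    fn post_round_trips_with_extra_data() {
        let input = r#"{"text":"hi","createdAt":"2024-01-01T00:00:00Z","futureField":42}"#;
        let post: Post<'_> = serde_json::from_str(input).expect("deserialize");
        let output = serde_json::to_string(&post).expect("serialize");
        // Compare as values: key order may differ between input and output.
        let a: serde_json::Value = serde_json::from_str(input).unwrap();
        let b: serde_json::Value = serde_json::from_str(&output).unwrap();
        assert_eq!(a, b);
    }
}
```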