# Lexicon Codegen Plan

## Goal
Generate idiomatic Rust types from AT Protocol lexicon schemas with minimal nesting/indirection.

## Existing Infrastructure

### Already Implemented
- **lexicon.rs**: Complete lexicon parsing types (`LexiconDoc`, `LexUserType`, `LexObject`, etc)
- **fs.rs**: Directory walking for finding `.json` lexicon files
- **schema.rs**: `find_ref_unions()` - collects union fields from a single lexicon
- **output.rs**: Partial - has string type mapping and doc comment generation

### Attribute Macros
- `#[lexicon]` - adds `extra_data` field to structs
- `#[open_union]` - adds `Unknown(Data<'s>)` variant to enums

## Design Decisions

### Module/File Structure
- NSID `app.bsky.feed.post` → `app_bsky/feed/post.rs`
- Flat module names (no `app::bsky`, just `app_bsky`)
- Parent modules: `app_bsky/feed.rs` with `pub mod post;`

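The path rule above can be sketched as a small string transformation. This is a minimal illustration, not the planned implementation: `nsid_to_path` is a hypothetical helper name, and the sketch assumes the first two NSID segments always form the flat namespace.

```rust
/// Maps an NSID to a relative file path: the first two segments are joined
/// with `_`, the remaining segments become path components.
/// Returns `None` for NSIDs with fewer than three segments or empty segments.
/// (`nsid_to_path` is a hypothetical name used for illustration.)
fn nsid_to_path(nsid: &str) -> Option<String> {
    let segments: Vec<&str> = nsid.split('.').collect();
    if segments.len() < 3 || segments.iter().any(|s| s.is_empty()) {
        return None;
    }
    // "app" + "bsky" -> "app_bsky"
    let namespace = format!("{}_{}", segments[0], segments[1]);
    // Remaining segments: directories, with the last one as the file stem.
    let rest = segments[2..].join("/");
    Some(format!("{namespace}/{rest}.rs"))
}
```

Name segments such as `getAuthorFeed` are camelCase; whether file and module names should be snake_cased is left open in this sketch.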
### Type Naming
- **Main def**: Use last segment of NSID
  - `app.bsky.feed.post#main` → `Post`
- **Other defs**: Pascal-case the def name
  - `replyRef` → `ReplyRef`
- **Union variants**: Use last segment of ref NSID
  - `app.bsky.embed.images` → `Images`
  - Collisions resolved by module path, not type name
- **No proliferation of `Main` types** like atrium has

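A rough sketch of the naming rules, assuming def names arrive as camelCase strings so that upper-casing the first character is enough. Both function names are hypothetical.

```rust
/// Upper-cases the first character, turning `replyRef` into `ReplyRef`.
fn pascal_case(name: &str) -> String {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Picks the Rust type name for the def `def_name` inside lexicon `nsid`.
/// (Hypothetical helper; the real generator may take parsed types instead.)
fn type_name(nsid: &str, def_name: &str) -> String {
    if def_name == "main" {
        // Main def: use the last NSID segment instead of emitting `Main`.
        let last = nsid.rsplit('.').next().unwrap_or(nsid);
        pascal_case(last)
    } else {
        pascal_case(def_name)
    }
}
```

Union variant names would follow the same path: call `type_name` with the referenced NSID and its def name.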
### Type Generation

#### Records (lexRecord)
```rust
#[lexicon]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(rename_all = "camelCase")]
pub struct Post<'s> {
    /// Client-declared timestamp...
    pub created_at: Datetime,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub embed: Option<RecordEmbed<'s>>,
    pub text: CowStr<'s>,
}
```

#### Objects (lexObject)
Same as records but without `#[lexicon]` if inline/not a top-level def.

#### Unions (lexRefUnion)
```rust
#[open_union]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(tag = "$type")]
pub enum RecordEmbed<'s> {
    #[serde(rename = "app.bsky.embed.images")]
    Images(Box<jacquard_api::app_bsky::embed::Images<'s>>),
    #[serde(rename = "app.bsky.embed.video")]
    Video(Box<jacquard_api::app_bsky::embed::Video<'s>>),
}
```

- Use `Box<T>` for all variants (handles circular refs)
- `#[open_union]` adds `Unknown(Data<'s>)` catch-all

#### Queries (lexXrpcQuery)
```rust
pub struct GetAuthorFeedParams<'s> {
    pub actor: AtIdentifier<'s>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'s>>,
}

pub struct GetAuthorFeedOutput<'s> {
    pub cursor: Option<CowStr<'s>>,
    pub feed: Vec<FeedViewPost<'s>>,
}
```

- Flat params/output structs
- No nesting like `Input { params: {...} }`

#### Procedures (lexXrpcProcedure)
Same as queries but with both `Input` and `Output` structs.

### Field Handling

#### Optional Fields
- Fields not in `required: []` → `Option<T>`
- Add `#[serde(skip_serializing_if = "Option::is_none")]`

#### Lifetimes
- All generated types take a lifetime parameter for borrowing from input (written `'a` below; the examples above use `'s`)
- `#[serde(borrow)]` where needed for zero-copy

#### Type Mapping
- `LexString` with format → specific types (`Datetime`, `Did`, etc)
- `LexString` without format → `CowStr<'a>`
- `LexInteger` → `i64`
- `LexBoolean` → `bool`
- `LexBytes` → `Bytes`
- `LexCidLink` → `CidLink<'a>`
- `LexBlob` → `Blob<'a>`
- `LexRef` → resolve to actual type path
- `LexRefUnion` → generate enum
- `LexArray` → `Vec<T>`
- `LexUnknown` → `Data<'a>`

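The mapping and the optional-field rule can be sketched together as plain string generation. The `LexType` enum below is a simplified stand-in invented for this example (the real parsed types live in `lexicon.rs`), refs and unions are left out, and only two string formats are spelled out.

```rust
/// Simplified stand-in for the property types parsed in `lexicon.rs`.
enum LexType {
    String { format: Option<&'static str> },
    Integer,
    Boolean,
    Bytes,
    CidLink,
    Blob,
    Unknown,
    Array(Box<LexType>),
}

/// Maps a property type to the Rust type named in the list above.
fn rust_type(ty: &LexType) -> String {
    match ty {
        LexType::String { format: Some("datetime") } => "Datetime".into(),
        LexType::String { format: Some("at-identifier") } => "AtIdentifier<'a>".into(),
        // Other formats would get their own types; this sketch falls back to `CowStr`.
        LexType::String { .. } => "CowStr<'a>".into(),
        LexType::Integer => "i64".into(),
        LexType::Boolean => "bool".into(),
        LexType::Bytes => "Bytes".into(),
        LexType::CidLink => "CidLink<'a>".into(),
        LexType::Blob => "Blob<'a>".into(),
        LexType::Unknown => "Data<'a>".into(),
        LexType::Array(item) => format!("Vec<{}>", rust_type(item)),
    }
}

/// Wraps the type in `Option` when `name` is missing from the `required` list.
fn field_type(ty: &LexType, name: &str, required: &[&str]) -> String {
    let inner = rust_type(ty);
    if required.contains(&name) {
        inner
    } else {
        format!("Option<{inner}>")
    }
}
```

A real generator would emit tokens rather than strings, but the decision points are the same.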
### Reference Resolution

#### Known Refs
- Check corpus for ref existence
- `#ref: "app.bsky.embed.images"` → `jacquard_api::app_bsky::embed::Images<'a>`
- Handle fragments: `#ref: "com.example.foo#bar"` → `jacquard_api::com_example::foo::Bar<'a>`

#### Unknown Refs
- **In struct fields**: use `Data<'a>` as fallback type
- **In union variants**: handled by `Unknown(Data<'a>)` variant from `#[open_union]`
- Optional: log warnings for missing refs

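One way to read the two examples above: with a fragment, every NSID segment after the namespace is a module and the fragment names the type; without one, the last segment names the type. The sketch below encodes that reading, which is an inference from the examples rather than a stated rule. `ref_to_rust_path` is a hypothetical helper, the known-ref set stands in for the corpus, and local refs such as `#replyRef` are not handled.

```rust
use std::collections::BTreeSet;

/// Upper-cases the first character (`bar` -> `Bar`).
fn pascal_case(name: &str) -> String {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Turns a ref string into a Rust type path, or `Data<'a>` if the ref is unknown.
fn ref_to_rust_path(r: &str, known: &BTreeSet<&str>) -> String {
    let fallback = "Data<'a>".to_string();
    if !known.contains(r) {
        return fallback; // unknown ref in a struct field
    }
    let (nsid, fragment) = match r.split_once('#') {
        Some((nsid, frag)) => (nsid, Some(frag)),
        None => (r, None),
    };
    let segments: Vec<&str> = nsid.split('.').collect();
    if segments.len() < 3 {
        return fallback;
    }
    let namespace = format!("{}_{}", segments[0], segments[1]);
    let (modules, type_name) = match fragment {
        // `nsid#def`: remaining segments are modules, the def names the type.
        Some(frag) => (&segments[2..], pascal_case(frag)),
        // Main def: the last segment names the type.
        None => (
            &segments[2..segments.len() - 1],
            pascal_case(segments[segments.len() - 1]),
        ),
    };
    let mut path = format!("jacquard_api::{namespace}");
    for module in modules {
        path.push_str("::");
        path.push_str(module);
    }
    format!("{path}::{type_name}<'a>")
}
```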
## Implementation Phases

### Phase 1: Corpus Loading & Registry
**Goal**: Load all lexicons into memory for ref resolution

**Tasks**:
1. Create `LexiconCorpus` struct
   - `BTreeMap<SmolStr, LexiconDoc<'static>>` - NSID → doc
   - Methods: `load_from_dir()`, `get()`, `resolve_ref()`
2. Load all `.json` files from lexicon directory
3. Parse into `LexiconDoc` and insert into registry
4. Handle fragments in refs (`nsid#def`)

**Output**: Corpus registry that can resolve any ref

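A minimal in-memory sketch of the registry, assuming a ref without a fragment points at the `main` def. To stay free of external crates it uses `String` keys instead of `SmolStr` and a placeholder `Doc` type instead of `LexiconDoc`; `load_from_dir()` is omitted.

```rust
use std::collections::BTreeMap;

/// Placeholder for a parsed document; the real `LexiconDoc` comes from `lexicon.rs`.
struct Doc {
    /// Def name -> def kind (a `String` here, purely for illustration).
    defs: BTreeMap<String, String>,
}

/// NSID -> document registry.
#[derive(Default)]
struct LexiconCorpus {
    docs: BTreeMap<String, Doc>,
}

impl LexiconCorpus {
    fn insert(&mut self, nsid: &str, doc: Doc) {
        self.docs.insert(nsid.to_string(), doc);
    }

    fn get(&self, nsid: &str) -> Option<&Doc> {
        self.docs.get(nsid)
    }

    /// Resolves `nsid` or `nsid#def`; a missing fragment means the `main` def.
    fn resolve_ref(&self, r: &str) -> Option<&String> {
        let (nsid, def) = r.split_once('#').unwrap_or((r, "main"));
        self.get(nsid)?.defs.get(def)
    }
}
```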
### Phase 2: Ref Analysis & Union Collection
**Goal**: Build complete picture of what refs exist and what unions need

**Tasks**:
1. Extend `find_ref_unions()` to work across entire corpus
2. For each union, collect all refs and check existence
3. Build `UnionRegistry`:
   - Union name → list of (known refs, unknown refs)
4. Detect circular refs (optional - or just Box everything)

**Output**: Complete list of unions to generate with their variants

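The registry in task 3 could look roughly like this. The shape of `UnionRefs`, the `build_union_registry` function, and the use of a plain set of ref strings in place of the corpus are all assumptions made for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Refs of one union, split by whether the corpus contains them.
#[derive(Debug, Default, PartialEq)]
struct UnionRefs {
    known: Vec<String>,
    unknown: Vec<String>,
}

/// Hypothetical registry shape: union name -> (known refs, unknown refs).
type UnionRegistry = BTreeMap<String, UnionRefs>;

/// Partitions each union's refs against the set of refs present in the corpus.
fn build_union_registry(
    unions: &[(&str, Vec<&str>)],
    corpus_refs: &BTreeSet<&str>,
) -> UnionRegistry {
    let mut registry = UnionRegistry::new();
    for (name, refs) in unions {
        let entry = registry.entry(name.to_string()).or_default();
        for r in refs {
            if corpus_refs.contains(r) {
                entry.known.push(r.to_string());
            } else {
                // Covered at runtime by the `Unknown` variant; worth a warning here.
                entry.unknown.push(r.to_string());
            }
        }
    }
    registry
}
```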
### Phase 3: Code Generation - Core Types
**Goal**: Generate Rust code for individual types

**Tasks**:
1. Implement type generators:
   - `generate_struct()` for records/objects
   - `generate_enum()` for unions
   - `generate_field()` for object properties
   - `generate_type()` for primitives/refs
2. Handle optional fields (`required` list)
3. Add doc comments from `description`
4. Apply `#[lexicon]` / `#[open_union]` macros
5. Add serde attributes

**Output**: `TokenStream` for each type

### Phase 4: Module Organization
**Goal**: Organize generated types into module hierarchy

**Tasks**:
1. Parse NSID into components: `["app", "bsky", "feed", "post"]`
2. Determine file paths: `app_bsky/feed/post.rs`
3. Generate module files: `app_bsky/feed.rs` with `pub mod post;`
4. Generate root module: `app_bsky.rs`
5. Handle re-exports if needed

**Output**: File path → generated code mapping

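Tasks 1 to 4 can be sketched as a pass that records, for every parent module file, which child modules it must declare. Function names are hypothetical, and the sketch skips NSIDs with fewer than three segments.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// For each parent module file, collects the child modules it must declare.
fn module_files(nsids: &[&str]) -> BTreeMap<String, BTreeSet<String>> {
    let mut files: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for nsid in nsids {
        let segments: Vec<&str> = nsid.split('.').collect();
        if segments.len() < 3 {
            continue;
        }
        // Path components, e.g. ["app_bsky", "feed", "post"].
        let mut parts = vec![format!("{}_{}", segments[0], segments[1])];
        parts.extend(segments[2..].iter().map(|s| s.to_string()));
        // Every proper prefix is a parent module file declaring the next component.
        for i in 1..parts.len() {
            let parent = format!("{}.rs", parts[..i].join("/"));
            files.entry(parent).or_default().insert(parts[i].clone());
        }
    }
    files
}

/// Renders the `pub mod` lines for one parent module file.
fn render_module(children: &BTreeSet<String>) -> String {
    children.iter().map(|c| format!("pub mod {c};\n")).collect()
}
```

Using `BTreeMap` and `BTreeSet` keeps the output ordering deterministic, which makes regenerated files diff cleanly.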
### Phase 5: File Writing
**Goal**: Write generated code to filesystem

**Tasks**:
1. Format code with `prettyplease`
2. Create directory structure
3. Write module files
4. Write type files
5. Optional: run `rustfmt`

**Output**: Generated code on disk

### Phase 6: Testing & Validation
**Goal**: Ensure generated code compiles and works

**Tasks**:
1. Generate code for test lexicons
2. Compile generated code
3. Test serialization/deserialization
4. Test union variant matching
5. Test extra_data capture

## Edge Cases & Considerations

### Circular References
- **Simple approach**: Union variants always use `Box<T>` → handles all circular refs
- **Alternative**: DFS cycle detection to only Box when needed
  - Track visited refs and recursion stack
  - If ref appears in rec_stack → cycle detected
  - Algorithm:
    ```rust
    use std::collections::HashSet;

    // `resolve` and `collect_refs_from_def` are placeholders for corpus lookups.
    fn has_cycle<'c>(
        corpus: &'c LexiconCorpus,
        start_ref: &'c str,
        visited: &mut HashSet<&'c str>,
        rec_stack: &mut HashSet<&'c str>,
    ) -> bool {
        visited.insert(start_ref);
        rec_stack.insert(start_ref);

        for child_ref in collect_refs_from_def(resolve(corpus, start_ref)) {
            if !visited.contains(child_ref) {
                if has_cycle(corpus, child_ref, visited, rec_stack) {
                    return true;
                }
            } else if rec_stack.contains(child_ref) {
                return true; // back edge = cycle
            }
        }

        rec_stack.remove(start_ref);
        false
    }
    ```
  - Only box variants that participate in cycles
- **Recommendation**: Start with simple (always Box), optimize later if needed

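To decide which individual variants need a `Box`, a complementary check is whether a given ref can reach itself. The sketch below uses an iterative DFS over a plain adjacency map rather than the recursive form above. `RefGraph` and `participates_in_cycle` are hypothetical names, and the graph is assumed to have been built from the corpus beforehand.

```rust
use std::collections::{HashMap, HashSet};

/// Ref graph: each ref maps to the refs its definition mentions.
type RefGraph<'g> = HashMap<&'g str, Vec<&'g str>>;

/// True if `target` can reach itself by following refs, i.e. it lies on a cycle.
fn participates_in_cycle(graph: &RefGraph<'_>, target: &str) -> bool {
    let mut visited: HashSet<&str> = HashSet::new();
    // Start from the direct children so that reaching `target` again means a cycle.
    let mut stack: Vec<&str> = graph.get(target).cloned().unwrap_or_default();
    while let Some(node) = stack.pop() {
        if node == target {
            return true;
        }
        // `insert` returns false if the node was already explored.
        if visited.insert(node) {
            if let Some(children) = graph.get(node) {
                stack.extend(children.iter().copied());
            }
        }
    }
    false
}
```

A ref that merely points into a cycle (without being on it) returns `false` here, so it would not need boxing on that account.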
### Name Collisions
- Multiple types with same name in different lexicons
- Module path disambiguates: `app_bsky::feed::Post` vs `com_example::feed::Post`

### Unknown Refs
- Fallback to `Data<'s>` in struct fields
- Caught by `Unknown` variant in unions
- Warn during generation

### Inline Defs
- Nested objects/unions in same lexicon
- Generate as separate types in same file
- Keep names scoped to parent (e.g., `PostReplyRef`)

### Arrays
- `Vec<T>` for arrays
- Handle nested unions in arrays

### Tokens
- Simple marker types
- Generate as unit structs or type aliases?

## Traits for Generated Types

### Collection Trait (Records)
Records implement the existing `Collection` trait from jacquard-common:

```rust
pub struct Post<'a> {
    // ... fields
}

impl<'a> Collection for Post<'a> {
    const NSID: &'static str = "app.bsky.feed.post";
    type Record = Post<'a>;
}
```

### XrpcRequest Trait (Queries/Procedures)
New trait for XRPC endpoints:

```rust
pub trait XrpcRequest<'x> {
    /// The NSID for this XRPC method
    const NSID: &'static str;

    /// XRPC method (query/GET, procedure/POST)
    const METHOD: XrpcMethod;

    /// Input encoding (MIME type, e.g., "application/json")
    /// None for queries (no body)
    const INPUT_ENCODING: Option<&'static str>;

    /// Output encoding (MIME type)
    const OUTPUT_ENCODING: &'static str;

    /// Request parameters type (query params or body)
    type Params: Serialize;

    /// Response output type
    type Output: Deserialize<'x>;

    type Err: Error;
}

pub enum XrpcMethod {
    Query,     // GET
    Procedure, // POST
}
```

**Generated implementation:**
```rust
pub struct GetAuthorFeedParams<'a> {
    pub actor: AtIdentifier<'a>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'a>>,
}

pub struct GetAuthorFeedOutput<'a> {
    pub cursor: Option<CowStr<'a>>,
    pub feed: Vec<FeedViewPost<'a>>,
}

// The trait takes a lifetime parameter, so the impl must name it.
impl<'x> XrpcRequest<'x> for GetAuthorFeedParams<'_> {
    const NSID: &'static str = "app.bsky.feed.getAuthorFeed";
    const METHOD: XrpcMethod = XrpcMethod::Query;
    const INPUT_ENCODING: Option<&'static str> = None; // queries have no body
    const OUTPUT_ENCODING: &'static str = "application/json";

    type Params = Self;
    type Output = GetAuthorFeedOutput<'static>;
    type Err = GetAuthorFeedError;
}
```

**Encoding variations:**
- Most procedures: `"application/json"` for input/output
- Blob uploads: `"*/*"` or specific MIME type for input
- CAR files: `"application/vnd.ipld.car"` for repo operations
- Read from lexicon's `input.encoding` and `output.encoding` fields

**Trait benefits:**
- Allows monomorphization (static dispatch) for performance
- Client code can be generic over `impl XrpcRequest`
- Caveat: `dyn XrpcRequest` is not available as written, because a trait with associated constants is not dyn-compatible (object-safe); dynamic dispatch would need a separate object-safe wrapper trait

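A small illustration of the static-dispatch point, using a trimmed-down stand-in for the trait: the lifetime parameter, serde bounds, encodings, and error type are dropped so the sketch compiles with the standard library alone. `request_line` is a hypothetical client helper; the `/xrpc/` path prefix follows the XRPC convention.

```rust
enum XrpcMethod {
    Query,     // GET
    Procedure, // POST
}

/// Reduced version of the trait: only the associated constants are kept.
trait XrpcRequest {
    const NSID: &'static str;
    const METHOD: XrpcMethod;
}

/// Field-less stand-in for the generated params struct.
struct GetAuthorFeedParams;

impl XrpcRequest for GetAuthorFeedParams {
    const NSID: &'static str = "app.bsky.feed.getAuthorFeed";
    const METHOD: XrpcMethod = XrpcMethod::Query;
}

/// Generic over the request type: the compiler emits one copy per `R`,
/// with `R::NSID` and `R::METHOD` known at compile time.
fn request_line<R: XrpcRequest>(_req: &R) -> String {
    let verb = match R::METHOD {
        XrpcMethod::Query => "GET",
        XrpcMethod::Procedure => "POST",
    };
    format!("{verb} /xrpc/{}", R::NSID)
}
```

Because the constants hang off the type rather than a value, no vtable is involved.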
#### XRPC Errors
Lexicons declare the errors each endpoint can return. The trait carries an
associated error type: an error enum with `thiserror::Error` and
`miette::Diagnostic` derives, with variants and messages generated from the lexicon's error info.

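A sketch of what a generated error enum might look like. To keep it free of external crates, hand-written `Display` and `Error` impls stand in for the `thiserror` and `miette` derives named above. The variant names are illustrative assumptions, as is the `Other` catch-all and the `from_name` helper, which maps the `error` string of an XRPC error response to a variant.

```rust
use std::error::Error;
use std::fmt;

/// Variant names are illustrative; the generator would take them from the
/// lexicon's declared errors.
#[derive(Debug, PartialEq)]
enum GetAuthorFeedError {
    BlockedActor,
    BlockedByActor,
    /// Any error name the lexicon does not declare (assumed catch-all).
    Other(String),
}

impl GetAuthorFeedError {
    /// Maps an error name from a response body to a variant.
    fn from_name(name: &str) -> Self {
        match name {
            "BlockedActor" => Self::BlockedActor,
            "BlockedByActor" => Self::BlockedByActor,
            other => Self::Other(other.to_string()),
        }
    }
}

impl fmt::Display for GetAuthorFeedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::BlockedActor => write!(f, "BlockedActor"),
            Self::BlockedByActor => write!(f, "BlockedByActor"),
            Self::Other(name) => write!(f, "unrecognized XRPC error: {name}"),
        }
    }
}

impl Error for GetAuthorFeedError {}
```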
### Subscriptions
WebSocket streams - defer for now. Will need separate trait with message types.

## Open Questions

1. **Validation**: Generate runtime validation (min/max length, regex, etc)?
2. **Tokens**: How to represent token types?
3. **Errors**: How to handle codegen errors (missing refs, invalid schemas)?
4. **Incremental**: Support incremental codegen (only changed lexicons)?
5. **Formatting**: Always run rustfmt or rely on prettyplease?
6. **XrpcRequest location**: Should trait live in jacquard-common or separate jacquard-xrpc crate?
7. **Import shortening**: Track imports and shorten ref paths in generated code
   - Instead of `jacquard_api::app_bsky::richtext::Facet<'a>` emit `use jacquard_api::app_bsky::richtext::Facet;` and just `Facet<'a>`
   - Would require threading `ImportTracker` through all generate functions or post-processing token stream
   - Long paths are ugly but explicit - revisit once basic codegen is confirmed working
8. **Web-based lexicon resolution**: Fetch lexicons from the web instead of requiring local files
   - Implement [lexicon publication and resolution](https://atproto.com/specs/lexicon#lexicon-publication-and-resolution) spec
   - `LexiconCorpus::fetch_from_web(nsids: &[&str])` - fetch specific NSIDs
   - `LexiconCorpus::fetch_from_authority(authority: &str)` - fetch all from DID/domain
   - Resolution: `https://{authority}/.well-known/atproto/lexicon/{nsid}.json`
   - Recursively fetch refs, handle redirects/errors
   - Use `reqwest` for HTTP - still fits in jacquard-lexicon as it's corpus loading

## Success Criteria

- [ ] Generates code for all official AT Protocol lexicons
- [ ] Generated code compiles without errors
- [ ] No `Main` proliferation
- [ ] Union variants have readable names
- [ ] Unknown refs handled gracefully
- [ ] `#[lexicon]` and `#[open_union]` applied correctly
- [ ] Serialization round-trips correctly