### OAuth 2.0 Endpoints with AIP
The AIP server implements the following OAuth 2.0 endpoints:
- `GET ${AIP_BASE_URL}/oauth/authorize` - Authorization endpoint for OAuth flows
- `POST ${AIP_BASE_URL}/oauth/token` - Token endpoint for exchanging authorization codes for access tokens
- `POST ${AIP_BASE_URL}/oauth/par` - Pushed Authorization Request endpoint (RFC 9126)
- `POST ${AIP_BASE_URL}/oauth/clients/register` - Dynamic Client Registration endpoint (RFC 7591)
- `GET ${AIP_BASE_URL}/oauth/atp/callback` - ATProtocol OAuth callback handler
- `GET ${AIP_BASE_URL}/.well-known/oauth-authorization-server` - OAuth server metadata discovery (RFC 8414)
- `GET ${AIP_BASE_URL}/.well-known/oauth-protected-resource` - Protected resource metadata
- `GET ${AIP_BASE_URL}/.well-known/jwks.json` - JSON Web Key Set for token verification
- `GET ${AIP_BASE_URL}/oauth/userinfo` - UserInfo endpoint returning claims, where `sub` is the user's atproto DID
- `GET ${AIP_BASE_URL}/api/atproto/session` - Returns atproto session data
## Error Handling

All error strings must use this format:

    error-aip-<domain>-<number> <message>: <details>
Example errors:

- error-aip-resolve-1 Multiple DIDs resolved for method
- error-aip-plc-1 HTTP request failed: https://google.com/ Not Found
- error-aip-key-1 Error decoding key: invalid
Errors should be represented as enums using the `thiserror` library when possible, using `src/errors.rs` as a reference and example.

Avoid creating new errors with the `anyhow!(...)` macro.
## Time, Date, and Duration

Use the `chrono` crate for time, date, and duration logic.

Use the `duration_str` crate for parsing string duration values.

All stored dates and times must be in UTC. UTC should be used whenever determining the current time and computing values like expiration.
## HTTP Handler Organization

HTTP handlers should be organized as Rust source files in the `src/http` directory and should have the `handler_` prefix. Each handler should have its own request and response types and helper functionality.

Example handler: `handler_index.rs`

- After updating, run `cargo check` and fix any errors and warnings
- Don't use dead code; if it's not used, remove it
- Use htmx and hyperscript when possible; if not, JavaScript in a `<script>` tag is OK
## Lexicon

Lexicon is a schema definition language used to describe atproto records, HTTP endpoints (XRPC), and event stream messages. It builds on top of the atproto Data Model.

The schema language is similar to JSON Schema and OpenAPI, but includes some atproto-specific features and semantics.

This specification describes version 1 of the Lexicon definition language.
## Overview of Types

| Lexicon Type | Data Model Type | Category  |
|--------------|-----------------|-----------|
| null         | Null            | concrete  |
| boolean      | Boolean         | concrete  |
| integer      | Integer         | concrete  |
| string       | String          | concrete  |
| bytes        | Bytes           | concrete  |
| cid-link     | Link            | concrete  |
| blob         | Blob            | concrete  |
| array        | Array           | container |
| object       | Object          | container |
| params       |                 | container |
| token        |                 | meta      |
| ref          |                 | meta      |
| union        |                 | meta      |
| unknown      |                 | meta      |
| record       |                 | primary   |
| query        |                 | primary   |
| procedure    |                 | primary   |
| subscription |                 | primary   |

## Lexicon Files

Lexicons are JSON files associated with a single NSID. A file contains one or more definitions, each with a distinct short name. A definition with the name `main` optionally describes the "primary" definition for the entire file. A Lexicon with zero definitions is invalid.
A Lexicon JSON file is an object with the following fields:

- `lexicon` (integer, required): indicates Lexicon language version. In this version, a fixed value of `1`
- `id` (string, required): the NSID of the Lexicon
- `description` (string, optional): short overview of the Lexicon, usually one or two sentences
- `defs` (map of strings-to-objects, required): set of definitions, each with a distinct name (key)

Schema definitions under `defs` all have a `type` field to distinguish their type. A file can have at most one definition with one of the "primary" types. Primary types should always have the name `main`. It is possible for `main` to describe a non-primary type.
References to specific definitions within a Lexicon use fragment syntax, like `com.example.defs#someView`. If a `main` definition exists, it can be referenced without a fragment, just using the NSID. For references in `$type` fields in data objects themselves (eg, records or contents of a union), the bare NSID must be used (a `#main` suffix is invalid). For example, `com.example.record`, not `com.example.record#main`.

Related Lexicons are often grouped together in the NSID hierarchy. As a convention, any definitions used by multiple Lexicons are defined in a dedicated `*.defs` Lexicon (eg, `com.atproto.server.defs`) within the group. A `*.defs` Lexicon should generally not include a definition named `main`, though it is not strictly invalid to do so.
## Primary Type Definitions

The primary types are:

- `query`: describes an XRPC Query (HTTP GET)
- `procedure`: describes an XRPC Procedure (HTTP POST)
- `subscription`: Event Stream (WebSocket)
- `record`: describes an object that can be stored in a repository record

Each primary definition schema object includes these fields:

- `type` (string, required): the type value (eg, `record` for records)
- `description` (string, optional): short, usually only a sentence or two

Record Type-specific fields:

- `key` (string, required): specifies the Record Key type
- `record` (object, required): a schema definition with type `object`, which specifies this type of record

Query and Procedure (HTTP API) Type-specific fields:

- `parameters` (object, optional): a schema definition with type `params`, describing the HTTP query parameters for this endpoint
- `output` (object, optional): describes the HTTP response body
  - `description` (string, optional): short description
  - `encoding` (string, required): MIME type for body contents. Use `application/json` for JSON responses.
  - `schema` (object, optional): schema definition, either an `object`, a `ref`, or a `union` of refs. Used to describe JSON encoded responses, though `schema` is optional even for JSON responses.
- `input` (object, optional, only for `procedure`): describes HTTP request body schema, with the same format as the `output` field
- `errors` (array of objects, optional): set of string error codes which might be returned
  - `name` (string, required): short name for the error type, with no whitespace
  - `description` (string, optional): short description, one or two sentences

Subscription (Event Stream) Type-specific fields:
- `parameters` (object, optional): same as Query and Procedure
- `message` (object, optional): specifies what messages can be sent
  - `description` (string, optional): short description
  - `schema` (object, required): schema definition, which must be a `union` of refs
- `errors` (array of objects, optional): same as Query and Procedure

Subscription schemas (referenced by the `schema` field under `message`) must be a union of refs, not an `object` type.
## Field Type Definitions

As with the primary definitions, every schema object includes these fields:

- `type` (string, required): fixed value for each type
- `description` (string, optional): short, usually only a sentence or two

### null

No additional fields.

### boolean

Type-specific fields:

- `default` (boolean, optional): a default value for this field
- `const` (boolean, optional): a fixed (constant) value for this field

When included as an HTTP query parameter, should be rendered as `true` or `false` (no quotes).
### integer

A signed integer number.

Type-specific fields:

- `minimum` (integer, optional): minimum acceptable value
- `maximum` (integer, optional): maximum acceptable value
- `enum` (array of integers, optional): a closed set of allowed values
- `default` (integer, optional): a default value for this field
- `const` (integer, optional): a fixed (constant) value for this field

### string

Type-specific fields:
- `format` (string, optional): string format restriction
- `maxLength` (integer, optional): maximum length of value, in UTF-8 bytes
- `minLength` (integer, optional): minimum length of value, in UTF-8 bytes
- `maxGraphemes` (integer, optional): maximum length of value, counted as Unicode Grapheme Clusters
- `minGraphemes` (integer, optional): minimum length of value, counted as Unicode Grapheme Clusters
- `knownValues` (array of strings, optional): a set of suggested or common values for this field. Values are not limited to this set (aka, not a closed enum).
- `enum` (array of strings, optional): a closed set of allowed values
- `default` (string, optional): a default value for this field
- `const` (string, optional): a fixed (constant) value for this field

Strings are Unicode. For non-Unicode encodings, use `bytes` instead. The basic `minLength`/`maxLength` validation constraints are counted as UTF-8 bytes. Note that JavaScript stores strings as UTF-16 by default, and it is necessary to re-encode to count accurately. The `minGraphemes`/`maxGraphemes` validation constraints work with Grapheme Clusters, which have a complex technical and linguistic definition, but loosely correspond to "distinct visual characters" like Latin letters, CJK characters, punctuation, digits, or emoji (which might comprise multiple Unicode codepoints and many UTF-8 bytes).
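The byte-counting rule above can be illustrated in Rust, where `str::len` already returns the UTF-8 byte length (unlike JavaScript's UTF-16 `String.length`):

```rust
// maxLength/minLength count UTF-8 bytes, not characters or UTF-16 code
// units. In Rust, `str::len` gives the UTF-8 byte count directly.
fn utf8_byte_len(s: &str) -> usize {
    s.len()
}
```

For example, `"é"` is one grapheme but two UTF-8 bytes, and many emoji are one grapheme but four or more bytes, which is why the spec offers separate `*Graphemes` constraints.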
`format` constrains the string format and provides additional semantic context. Refer to the Data Model specification for the available format types and their definitions.

`const` and `default` are mutually exclusive.
### bytes

Type-specific fields:

- `minLength` (integer, optional): minimum size of value, as raw bytes with no encoding
- `maxLength` (integer, optional): maximum size of value, as raw bytes with no encoding

### cid-link

No type-specific fields.

See the Data Model spec for CID restrictions.
### array

Type-specific fields:

- `items` (object, required): describes the schema of the elements of this array
- `minLength` (integer, optional): minimum count of elements in array
- `maxLength` (integer, optional): maximum count of elements in array

In theory arrays have homogeneous types (meaning every element has the same type). However, with union types this restriction is meaningless, so implementations cannot assume that all the elements have the same type.
### object

A generic object schema which can be nested inside other definitions by reference.

Type-specific fields:

- `properties` (map of strings-to-objects, required): defines the properties (fields) by name, each with their own schema
- `required` (array of strings, optional): indicates which properties are required
- `nullable` (array of strings, optional): indicates which properties can have `null` as a value

As described in the data model specification, there is a semantic difference in data between omitting a field; including the field with the value `null`; and including the field with a "false-y" value (`false`, `0`, empty array, etc).
### blob

Type-specific fields:

- `accept` (array of strings, optional): list of acceptable MIME types. Each may end in `*` as a glob pattern (eg, `image/*`). Use `*/*` to indicate that any MIME type is accepted.
- `maxSize` (integer, optional): maximum size in bytes

### params

This is a limited-scope type which is only ever used for the `parameters` field on `query`, `procedure`, and `subscription` primary types. These map to HTTP query parameters.

Type-specific fields:

- `required` (array of strings, optional): same semantics as field on `object`
- `properties`: similar to `properties` under `object`, but can only include the types `boolean`, `integer`, `string`, and `unknown`; or an array of one of these types

Note that unlike `object`, there is no `nullable` field on `params`.
### token

Tokens are empty data values which exist only to be referenced by name. They are used to define a set of values with specific meanings. The `description` field should clarify the meaning of the token. Tokens encode as string data, with the string being the fully-qualified reference to the token itself (NSID followed by an optional fragment).

Tokens are similar to the concept of a "symbol" in some programming languages, distinct from strings, variables, built-in keywords, or other identifiers.

For example, tokens could be defined to represent the state of an entity (in a state machine), or to enumerate a list of categories.

No type-specific fields.
### ref

Type-specific fields:

- `ref` (string, required): reference to another schema definition

Refs are a mechanism for re-using a schema definition in multiple places. The `ref` string can be a global reference to a Lexicon type definition (an NSID, optionally with a `#`-delimited name indicating a definition other than `main`), or can indicate a local definition within the same Lexicon file (a `#` followed by a name).
### union

Type-specific fields:

- `refs` (array of strings, required): references to schema definitions
- `closed` (boolean, optional): indicates if a union is "open" or "closed". Defaults to `false` (open union)

Unions represent that multiple possible types could be present at this location in the schema. The references follow the same syntax as `ref`, allowing references to both global and local schema definitions. Actual data will validate against a single specific type: the union does not combine fields from multiple schemas, or define a new hybrid data type. The different types are referred to as variants.

By default unions are "open", meaning that future revisions of the schema could add more types to the list of refs (though they can not remove types). This means that implementations should be permissive when validating, in case they do not have the most recent version of the Lexicon. The `closed` flag (boolean) can indicate that the set of types is fixed and can not be extended in the future.

A union schema definition with no refs is allowed and is similar to `unknown`, as long as the `closed` flag is false (the default). The main difference is that the data would be required to have the `$type` field. An empty refs list with `closed` set to true is an invalid schema.

The schema definitions pointed to by a union are objects or types with a clear mapping to an object, like a `record`. All the variants must be represented by a CBOR map (or JSON Object) and must include a `$type` field indicating the variant type. Because the data must be an object, unions can not reference `token` (which would correspond to string data).
### unknown

Indicates that any data object could appear at this location, with no specific validation. The top-level data must be an object (not a string, boolean, etc). As with all other data types, the value `null` is not allowed unless the field is specifically marked as nullable.

The data object may contain a `$type` field indicating the schema of the data, but this is not currently required. The top-level data object must not have the structure of a compound data type, like a blob (`$type: blob`) or CID link (`$link`).

The (nested) contents of the data object must still be valid under the atproto data model. For example, it should not contain floats. Nested compound types like blobs and CID links should be validated and transformed as expected.

Lexicon designers are strongly recommended to not use `unknown` fields in record objects for now.

No type-specific fields.
## String Formats

Strings can optionally be constrained to one of the following format types:

- `at-identifier`: either a Handle or a DID, details described below
- `at-uri`: AT-URI
- `cid`: CID in string format, details specified in Data Model
- `datetime`: timestamp, details specified below
- `did`: generic DID Identifier
- `handle`: Handle Identifier
- `nsid`: Namespaced Identifier
- `tid`: Timestamp Identifier (TID)
- `record-key`: Record Key, matching the general syntax ("any")
- `uri`: generic URI, details specified below
- `language`: language code, details specified below

For the various identifier formats, when doing Lexicon schema validation the most expansive identifier syntax format should be permitted. Problems with identifiers which do pass basic syntax validation should be reported as application errors, not Lexicon data validation errors. For example, data with any kind of DID in a `did` format string field should pass Lexicon validation, with unsupported DID methods being raised separately as an application error.
### at-identifier

A string type which is either a DID (format `did`) or a handle (format `handle`). Mostly used in XRPC query parameters. It is unambiguous whether an at-identifier is a handle or a DID because a DID always starts with `did:`, and the colon character (`:`) is not allowed in handles.
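The disambiguation rule above reduces to a single prefix check; a minimal sketch (the function name is illustrative):

```rust
// A DID always begins with "did:", and handles can never contain ':',
// so one prefix check is enough to classify an at-identifier.
fn is_did(at_identifier: &str) -> bool {
    at_identifier.starts_with("did:")
}
```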
### datetime

Full-precision date and time, with timezone information.

This format is intended for use with computer-generated timestamps in the modern computing era (eg, after the UNIX epoch). If you need to represent historical or ancient events, ambiguity, or far-future times, a different format is probably more appropriate. Datetimes before the Current Era (year zero) are specifically disallowed.
Datetime format standards are notoriously flexible and overlapping. Datetime strings in atproto should meet the intersecting requirements of the RFC 3339, ISO 8601, and WHATWG HTML datetime standards.

The character separating "date" and "time" parts must be an upper-case `T`.

Timezone specification is required. It is strongly preferred to use the UTC timezone, and to represent the timezone with a simple capital `Z` suffix (lower-case is not allowed). While hour/minute suffix syntax (like `+01:00` or `-10:30`) is supported, "negative zero" (`-00:00`) is specifically disallowed (by ISO 8601).

Whole seconds precision is required, and arbitrary fractional precision digits are allowed. Best practice is to use at least millisecond precision, and to pad with zeros to the generated precision (eg, trailing `:12.340Z` instead of `:12.34Z`). Not all datetime formatting libraries support trailing zero formatting. Both millisecond and microsecond precision have reasonable cross-language support; nanosecond precision does not.
Implementations should be aware of two ambiguities when round-tripping records containing datetimes: loss of precision, and ambiguity with trailing fractional-second zeros. If de-serializing Lexicon records into native types, and then re-serializing, the string representation may not be the same, which could result in broken hash references, sanity check failures, or repository update churn. A safer approach is to deserialize the datetime as a simple string, which ensures round-trip re-serialization.

Implementations "should" validate that the semantics of the datetime are valid. For example, a month or day `00` is invalid.
Valid examples:

```
# preferred
1985-04-12T23:20:50.123Z
1985-04-12T23:20:50.123456Z
1985-04-12T23:20:50.120Z
1985-04-12T23:20:50.120000Z

# supported
1985-04-12T23:20:50.12345678912345Z
1985-04-12T23:20:50Z
1985-04-12T23:20:50.0Z
1985-04-12T23:20:50.123+00:00
1985-04-12T23:20:50.123-07:00
```

Invalid examples:

```
1985-04-12
1985-04-12T23:20Z
1985-04-12T23:20:5Z
1985-04-12T23:20:50.123
+001985-04-12T23:20:50.123Z
23:20:50.123Z
-1985-04-12T23:20:50.123Z
1985-4-12T23:20:50.123Z
01985-04-12T23:20:50.123Z
1985-04-12T23:20:50.123+00
1985-04-12T23:20:50.123+0000

# ISO 8601 strict capitalization
1985-04-12t23:20:50.123Z
1985-04-12T23:20:50.123z

# RFC 3339, but not ISO 8601
1985-04-12T23:20:50.123-00:00
1985-04-12 23:20:50.123Z

# timezone is required
1985-04-12T23:20:50.123

# syntax looks ok, but datetime is not valid
1985-04-12T23:99:50.123Z
1985-00-12T23:20:50.123Z
```
### uri

Flexible to any URI schema, following the generic RFC 3986 on URIs. This includes, but isn't limited to: `did`, `https`, `wss`, `ipfs` (for CIDs), `dns`, and of course `at`. Maximum length in Lexicons is 8 KBytes.
### language

An IETF Language Tag string, compliant with BCP 47, defined in RFC 5646 ("Tags for Identifying Languages"). This is the same standard used to identify languages in HTTP, HTML, and other web standards. The Lexicon string must validate as a "well-formed" language tag, as defined in the RFC. Clients should ignore language strings which are "well-formed" but not "valid" according to the RFC.

As specified in the RFC, ISO 639 two-character and three-character language codes can be used on their own, lower-cased, such as `ja` (Japanese) or `ban` (Balinese). Regional sub-tags can be added, like `pt-BR` (Brazilian Portuguese). Additional subtags can also be added, such as `hy-Latn-IT-arevela`.

Language codes generally need to be parsed, normalized, and matched semantically, not simply string-compared. For example, a search engine might simplify language tags to ISO 639 codes for indexing and filtering, while a client application (user agent) would retain the full language code for presentation (text rendering) locally.
## When to use `$type`

Data objects sometimes include a `$type` field which indicates their Lexicon type. The general principle is that this field needs to be included any time there could be ambiguity about the content type when validating data.

The specific rules are:

- `record` objects must always include `$type`. While the type is often known from context (eg, the collection part of the path for records stored in a repository), record objects can also be passed around outside of repositories and need to be self-describing
- `union` variants must always include `$type`, except at the top level of subscription messages

Note that `blob` objects always include `$type`, which allows generic processing.

As a reminder, `main` types must be referenced in `$type` fields as just the NSID, not including a `#main` suffix.
## Lexicon Evolution

Lexicons are allowed to change over time, within some bounds to ensure both forwards and backwards compatibility. The basic principle is that all old data must still be valid under the updated Lexicon, and new data must be valid under the old Lexicon.

- Any new fields must be optional
- Non-optional fields can not be removed. A best practice is to retain all fields in the Lexicon and mark them as deprecated if they are no longer used.
- Types can not change
- Fields can not be renamed

If larger breaking changes are necessary, a new Lexicon name must be used.

It can be ambiguous when a Lexicon has been published and becomes "set in stone". At a minimum, public adoption and implementation by a third party, even without explicit permission, indicates that the Lexicon has been released and should not break compatibility. A best practice is to clearly indicate in the Lexicon type name any experimental or development status. Eg, `com.corp.experimental.newRecord`.
## Authority and Control

The authority for a Lexicon is determined by the NSID, and rooted in DNS control of the domain authority. That authority has ultimate control over the Lexicon definition, and responsibility for maintenance and distribution of Lexicon schema definitions.

In a crisis, such as unintentional loss of DNS control to a bad actor, the protocol ecosystem could decide to disregard this chain of authority. This should only be done in exceptional circumstances, and not as a mechanism to subvert an active authority. The primary mechanism for resolving protocol disputes is to fork Lexicons into a new namespace.
Protocol implementations should generally consider data which fails to validate against the Lexicon to be entirely invalid, and should not try to repair it or do partial processing on the individual piece of data.

Unexpected fields in data which otherwise conforms to the Lexicon should be ignored. When doing schema validation, they should be treated at worst as warnings. This is necessary to allow evolution of the schema by the controlling authority, and to be robust in the case of out-of-date Lexicons.

Third parties can technically insert any additional fields they want into data. This is not the recommended way to extend applications, but it is not specifically disallowed. One danger with this is that the Lexicon may be updated to include fields with the same field names but different types, which would make existing data invalid.
## Lexicon Publication and Resolution

Lexicon schemas are published publicly as records in atproto repositories, using the `com.atproto.lexicon.schema` type. The domain name authority for an NSID is linked to a specific atproto repository (identified by DID) by a DNS TXT record (`_lexicon`), similar to but distinct from the handle resolution system.

The `com.atproto.lexicon.schema` Lexicon itself is very minimal: it only requires the `lexicon` integer field, which must be `1` for this version of the Lexicon language. In practice, the same fields as Lexicon Files should be included, along with `$type`. The record key is the NSID of the schema.
A summary of record fields:

- `$type`: must be `com.atproto.lexicon.schema` (as with all atproto records)
- `lexicon`: integer, indicates the overall version of the Lexicon (currently `1`)
- `id`: the NSID of this Lexicon. Must be a simple NSID (no fragment), and must match the record key
- `defs`: the schema definitions themselves, as a map-of-objects. Names should not include a `#` prefix.
- `description`: optional description of the overall schema; though descriptions are best included on individual defs, not the overall schema.

The `com.atproto.lexicon.schema` meta-schema is somewhat unlike other Lexicons, in that it is defined and governed as part of the protocol. Future versions of the language and protocol might not follow the evolution rules. It is an intentional decision to not express the Lexicon schema language itself recursively, using the schema language.
Authority for NSID namespaces is handled at the "group" level, meaning that all NSIDs which differ only by the final "name" part are published in the same repository. Lexicon resolution of NSIDs is not hierarchical: DNS TXT records must be created for each authority section, and resolvers should not recurse up or down the DNS hierarchy looking for TXT records.

As an example, the NSID `edu.university.dept.lab.blogging.getBlogPost` has a "name" `getBlogPost`. Removing the name and reversing the rest of the NSID gives an "authority domain name" of `blogging.lab.dept.university.edu`. To link the authority to a specific DID (say `did:plc:ewvi7nxzyoun6zhxrhs64oiz`), a DNS TXT record with the name `_lexicon.blogging.lab.dept.university.edu` and value `did=did:plc:ewvi7nxzyoun6zhxrhs64oiz` (note the `did=` prefix) would be created. Then a record with collection `com.atproto.lexicon.schema` and record-key `edu.university.dept.lab.blogging.getBlogPost` would be created in that account's repository.
A resolving service would start with the NSID (`edu.university.dept.lab.blogging.getBlogPost`) and do a DNS TXT resolution for `_lexicon.blogging.lab.dept.university.edu`. Finding the DID, it would proceed with atproto DID resolution, look for a PDS, and then fetch the relevant record. The overall AT-URI for the record would be `at://did:plc:ewvi7nxzyoun6zhxrhs64oiz/com.atproto.lexicon.schema/edu.university.dept.lab.blogging.getBlogPost`.
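The first step of that resolution, deriving the `_lexicon` TXT record name from an NSID, can be sketched as a small pure function (the function name is illustrative):

```rust
// Derive the _lexicon DNS TXT record name for an NSID: drop the final
// "name" segment, reverse the remaining authority segments, and prefix
// with "_lexicon.".
fn lexicon_txt_name(nsid: &str) -> Option<String> {
    let segments: Vec<&str> = nsid.split('.').collect();
    if segments.len() < 3 {
        // An NSID needs at least two authority segments plus a name.
        return None;
    }
    let authority: Vec<&str> = segments[..segments.len() - 1]
        .iter()
        .rev()
        .cloned()
        .collect();
    Some(format!("_lexicon.{}", authority.join(".")))
}
```

Note this is only the DNS step; the resolver still has to resolve the DID found in the TXT record, locate the PDS, and fetch the `com.atproto.lexicon.schema` record.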
If the DNS TXT resolution for `_lexicon.blogging.lab.dept.university.edu` failed, the resolving service would NOT try `_lexicon.lab.dept.university.edu` or `_lexicon.getBlogPost.blogging.lab.dept.university.edu` or `_lexicon.university.edu`, or any other domain name. The Lexicon resolution would simply fail.

If another NSID `edu.university.dept.lab.blogging.getBlogComments` was created, it would have the same authority name, and must be published in the same atproto repository (with a different record key). If a Lexicon for `edu.university.dept.lab.gallery.photo` was published, a new DNS TXT record would be required (`_lexicon.gallery.lab.dept.university.edu`); it could point at the same repository (DID), or a different repository.

As a simpler example, an NSID `app.toy.record` would resolve via `_lexicon.toy.app`.
A single repository can host Lexicons for multiple authority domains, possibly across multiple registered domains and TLDs. Resolution DNS records can change over time, moving schema resolution to different repositories, though it may take time for DNS and cache changes to propagate.

Note that Lexicon record operations are broadcast over repository event streams ("firehose"), but DNS resolution changes are not (unlike handle changes). Resolving services should not cache DNS resolution results for long time periods.
## Usage and Implementation Guidelines

It should be possible to translate Lexicon schemas to JSON Schema or OpenAPI and use tools and libraries from those ecosystems to work with atproto data in JSON format.

Implementations which serialize and deserialize data from JSON or CBOR into structures derived from specific Lexicons should be aware of the risk of "clobbering" unexpected fields. For example, if a Lexicon is updated to add a new (optional) field, old implementations would not be aware of that field, and might accidentally strip the data when de-serializing and then re-serializing. Depending on the context, one way to avoid this problem is to retain any "extra" fields, or to pass through the original data object instead of re-serializing it.
## Possible Future Changes

The validation rules for unexpected additional fields may change. For example, a mechanism for Lexicons to indicate that the schema is "closed" and unexpected fields are not allowed, or a convention around field name prefixes (`x-`) to indicate unofficial extension.
api/lexicons.zip
This is a binary file and will not be displayed.
api/migrations/001_initial.sql
-- AT Protocol Indexer Database Schema
-- Single table approach for maximum flexibility across arbitrary lexicons

CREATE TABLE IF NOT EXISTS "record" (
    "uri" TEXT PRIMARY KEY NOT NULL,
    "cid" TEXT NOT NULL,
    "did" TEXT NOT NULL,
    "collection" TEXT NOT NULL,
    "json" JSONB NOT NULL,
    "indexedAt" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

-- Essential indexes for performance
CREATE INDEX IF NOT EXISTS idx_record_collection ON "record"("collection");
CREATE INDEX IF NOT EXISTS idx_record_did ON "record"("did");
CREATE INDEX IF NOT EXISTS idx_record_indexed_at ON "record"("indexedAt");
CREATE INDEX IF NOT EXISTS idx_record_json_gin ON "record" USING GIN("json");

-- Collection-specific indexes for common queries
CREATE INDEX IF NOT EXISTS idx_record_collection_did ON "record"("collection", "did");
CREATE INDEX IF NOT EXISTS idx_record_cid ON "record"("cid");
api/migrations/002_lexicons.sql
-- Add lexicons table for storing AT Protocol lexicon schemas
CREATE TABLE IF NOT EXISTS "lexicons" (
    "nsid" TEXT PRIMARY KEY NOT NULL,
    "definitions" JSONB NOT NULL,
    "created_at" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    "updated_at" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

-- Note: the primary key on "nsid" already provides a unique index
CREATE INDEX IF NOT EXISTS idx_lexicons_definitions ON "lexicons" USING GIN("definitions");
+9
api/migrations/003_actors.sql
···000000000
···1+-- Add actors table for storing AT Protocol actor/profile data
2+CREATE TABLE IF NOT EXISTS "actor" (
3+ "did" TEXT PRIMARY KEY NOT NULL,
4+ "handle" TEXT,
5+ "indexedAt" TEXT NOT NULL
6+);
7+8+CREATE INDEX IF NOT EXISTS idx_actor_handle ON "actor"("handle");
9+CREATE INDEX IF NOT EXISTS idx_actor_indexed_at ON "actor"("indexedAt");
···1+use chrono::Utc;
2+use reqwest::Client;
3+use serde::Deserialize;
4+use serde_json::Value;
5+use tracing::{error, info};
6+7+use crate::database::Database;
8+use crate::errors::SyncError;
9+use crate::models::{Actor, Record};
10+use crate::utils::is_primary_collection;
11+12+13+#[derive(Debug, Deserialize)]
14+struct AtProtoRecord {
15+ uri: String,
16+ cid: String,
17+ value: Value,
18+}
19+20+#[derive(Debug, Deserialize)]
21+struct ListRecordsResponse {
22+ records: Vec<AtProtoRecord>,
23+ cursor: Option<String>,
24+}
25+26+27+#[derive(Debug, Deserialize)]
28+struct ListReposByCollectionResponse {
29+ repos: Vec<RepoRef>,
30+}
31+32+#[derive(Debug, Deserialize)]
33+struct RepoRef {
34+ did: String,
35+}
36+37+#[derive(Debug, Deserialize)]
38+struct DidDocument {
39+ service: Option<Vec<Service>>,
40+}
41+42+#[derive(Debug, Deserialize)]
43+struct Service {
44+ #[serde(rename = "type")]
45+ service_type: String,
46+ #[serde(rename = "serviceEndpoint")]
47+ service_endpoint: String,
48+}
49+50+#[derive(Debug, Clone)]
51+struct AtpData {
52+ did: String,
53+ pds: String,
54+ handle: Option<String>,
55+}
56+57+#[derive(Clone)]
58+pub struct SyncService {
59+ client: Client,
60+ database: Database,
61+}
62+63+impl SyncService {
64+ pub fn new(database: Database) -> Self {
65+ Self {
66+ client: Client::new(),
67+ database,
68+ }
69+ }
70+71+ // Public entry point: sync a single repo's collections
72+ pub async fn sync_repo(&self, did: &str, collections: Option<&[String]>) -> Result<i64, SyncError> {
73+ info!("🔄 Starting sync for DID: {}", did);
74+75+ let total_records = self.listrecords_sync(did, collections).await?;
76+77+ info!("✅ Sync completed for {}: {} records", did, total_records);
78+ Ok(total_records)
79+ }
80+81+82+ // Fetch each requested collection for this repo via listRecords
83+ async fn listrecords_sync(&self, did: &str, collections: Option<&[String]>) -> Result<i64, SyncError> {
84+ let collections_to_sync = match collections {
85+ Some(cols) => cols,
86+ None => return Ok(0), // No collections specified = no records
87+ };
88+89+ // Get ATP data for this single repo
90+ let atp_map = self.get_atp_map_for_repos(&[did.to_string()]).await?;
91+92+ let mut total_records = 0;
93+ for collection in collections_to_sync {
94+ match self.fetch_records_for_repo_collection_with_atp_map(did, collection, &atp_map).await {
95+ Ok(records) => {
96+ if !records.is_empty() {
97+ info!("📋 Fallback sync: {} records for {}/{}", records.len(), did, collection);
98+ self.database.batch_insert_records(&records).await?;
99+ total_records += records.len() as i64;
100+ }
101+ }
102+ Err(e) => {
103+ error!("Failed fallback sync for {}/{}: {}", did, collection, e);
104+ }
105+ }
106+ }
107+108+ Ok(total_records)
109+ }
110+111+112+ pub async fn backfill_collections(&self, collections: &[String], repos: Option<&[String]>) -> Result<(), SyncError> {
113+ info!("🔄 Starting backfill operation");
114+ info!("📚 Processing {} collections: {}", collections.len(), collections.join(", "));
115+116+ let all_repos = if let Some(provided_repos) = repos {
117+ info!("📋 Using {} provided repositories", provided_repos.len());
118+ provided_repos.to_vec()
119+ } else {
120+ info!("📊 Fetching repositories for collections...");
121+ let mut unique_repos = std::collections::HashSet::new();
122+123+ // Separate primary and external collections
124+ let (primary_collections, _external_collections): (Vec<_>, Vec<_>) = collections
125+ .iter()
126+ .partition(|collection| is_primary_collection(collection));
127+128+ // First, get all repos from primary collections
129+ let mut primary_repos = std::collections::HashSet::new();
130+ for collection in &primary_collections {
131+ match self.get_repos_for_collection(collection).await {
132+ Ok(repos) => {
133+ info!("✓ Found {} repositories for primary collection \"{}\"", repos.len(), collection);
134+ primary_repos.extend(repos);
135+ },
136+ Err(e) => {
137+ error!("Failed to get repos for primary collection {}: {}", collection, e);
138+ }
139+ }
140+ }
141+142+ info!("📋 Found {} unique repositories from primary collections", primary_repos.len());
143+144+ // Use primary repos for syncing (both primary and external collections)
145+ unique_repos.extend(primary_repos);
146+147+ let repos: Vec<String> = unique_repos.into_iter().collect();
148+ info!("📋 Processing {} unique repositories", repos.len());
149+ repos
150+ };
151+152+ // Get ATP data for all repos at once
153+ info!("🔍 Resolving ATP data for repositories...");
154+ let atp_map = self.get_atp_map_for_repos(&all_repos).await?;
155+ info!("✓ Resolved ATP data for {}/{} repositories", atp_map.len(), all_repos.len());
156+157+ // Only sync repos that have valid ATP data
158+ let valid_repos: Vec<String> = atp_map.keys().cloned().collect();
159+ let failed_resolutions = all_repos.len() - valid_repos.len();
160+161+ if failed_resolutions > 0 {
162+ info!("⚠️ {} repositories failed DID resolution and will be skipped", failed_resolutions);
163+ }
164+165+ info!("🧠 Starting sync for {} repositories...", valid_repos.len());
166+167+ // Create parallel fetch tasks for repos with valid ATP data only
168+ let mut fetch_tasks = Vec::new();
169+ for repo in &valid_repos {
170+ for collection in collections {
171+ let repo_clone = repo.clone();
172+ let collection_clone = collection.clone();
173+ let sync_service = self.clone();
174+ let atp_map_clone = atp_map.clone();
175+176+ let task = tokio::spawn(async move {
177+ match sync_service.fetch_records_for_repo_collection_with_atp_map(&repo_clone, &collection_clone, &atp_map_clone).await {
178+ Ok(records) => {
179+ Ok((repo_clone, collection_clone, records))
180+ }
181+ Err(e) => {
182+ // Handle common "not error" scenarios as empty results
183+ match &e {
184+ SyncError::ListRecords { status } => {
185+ if *status == 404 || *status == 400 {
186+ // Collection doesn't exist for this repo - return empty
187+ Ok((repo_clone, collection_clone, vec![]))
188+ } else {
189+ Err(e)
190+ }
191+ }
192+ SyncError::HttpRequest(_) => {
193+ // Network errors - treat as empty (like TypeScript version)
194+ Ok((repo_clone, collection_clone, vec![]))
195+ }
196+ _ => Err(e)
197+ }
198+ }
199+ }
200+ });
201+ fetch_tasks.push(task);
202+ }
203+ }
204+205+ info!("📥 Fetching records for repositories and collections...");
206+ info!("🔧 Debug: Created {} fetch tasks for {} repos × {} collections", fetch_tasks.len(), valid_repos.len(), collections.len());
207+208+ // Collect all results
209+ let mut all_records = Vec::new();
210+ let mut successful_tasks = 0;
211+ let mut failed_tasks = 0;
212+ for task in fetch_tasks {
213+ match task.await {
214+ Ok(Ok((_repo, _collection, records))) => {
215+ all_records.extend(records);
216+ successful_tasks += 1;
217+ }
218+ Ok(Err(_)) => {
219+ failed_tasks += 1;
220+ }
221+ Err(_e) => {
222+ failed_tasks += 1;
223+ }
224+ }
225+ }
226+227+ info!("🔧 Debug: {} successful tasks, {} failed tasks", successful_tasks, failed_tasks);
228+229+ let total_records = all_records.len() as i64;
230+ info!("✓ Fetched {} total records", total_records);
231+232+ // Index actors first (like the TypeScript version)
233+ info!("📝 Indexing actors...");
234+ self.index_actors(&valid_repos, &atp_map).await?;
235+ info!("✓ Indexed {} actors", valid_repos.len());
236+237+ // Single batch insert for all records
238+ if !all_records.is_empty() {
239+ info!("📝 Indexing {} records...", total_records);
240+ self.database.batch_insert_records(&all_records).await?;
241+ }
242+243+ info!("✅ Backfill complete!");
244+245+ Ok(())
246+ }
247+248+ async fn get_repos_for_collection(&self, collection: &str) -> Result<Vec<String>, SyncError> {
249+ let response = self.client
250+ .get("https://relay1.us-west.bsky.network/xrpc/com.atproto.sync.listReposByCollection")
251+ .query(&[("collection", collection)])
252+ .send()
253+ .await?;
254+255+ if !response.status().is_success() {
256+ return Err(SyncError::ListRepos { status: response.status().as_u16() });
257+ }
258+259+ let repos_response: ListReposByCollectionResponse = response.json().await?;
260+ Ok(repos_response.repos.into_iter().map(|r| r.did).collect())
261+ }
262+263+ async fn fetch_records_for_repo_collection_with_atp_map(&self, repo: &str, collection: &str, atp_map: &std::collections::HashMap<String, AtpData>) -> Result<Vec<Record>, SyncError> {
264+ let atp_data = atp_map.get(repo).ok_or_else(|| SyncError::Generic(format!("No ATP data found for repo: {}", repo)))?;
265+ self.fetch_records_for_repo_collection(repo, collection, &atp_data.pds).await
266+ }
267+268+ async fn fetch_records_for_repo_collection(&self, repo: &str, collection: &str, pds_url: &str) -> Result<Vec<Record>, SyncError> {
269+ let mut records = Vec::new();
270+ let mut cursor: Option<String> = None;
271+272+ loop {
273+ let mut params = vec![("repo", repo), ("collection", collection), ("limit", "100")];
274+ if let Some(ref c) = cursor {
275+ params.push(("cursor", c));
276+ }
277+278+ let response = self.client
279+ .get(&format!("{}/xrpc/com.atproto.repo.listRecords", pds_url))
280+ .query(&params)
281+ .send()
282+ .await?;
283+284+ if !response.status().is_success() {
285+ return Err(SyncError::ListRecords { status: response.status().as_u16() });
286+ }
287+288+ let list_response: ListRecordsResponse = response.json().await?;
289+290+ for atproto_record in list_response.records {
291+ let record = Record {
292+ uri: atproto_record.uri,
293+ cid: atproto_record.cid,
294+ did: repo.to_string(),
295+ collection: collection.to_string(),
296+ json: atproto_record.value,
297+ indexed_at: Utc::now(),
298+ };
299+ records.push(record);
300+ }
301+302+ cursor = list_response.cursor;
303+ if cursor.is_none() {
304+ break;
305+ }
306+ }
307+308+ Ok(records)
309+ }
310+311+ async fn get_atp_map_for_repos(&self, repos: &[String]) -> Result<std::collections::HashMap<String, AtpData>, SyncError> {
312+ let mut atp_map = std::collections::HashMap::new();
313+314+ for repo in repos {
315+ if let Ok(atp_data) = self.resolve_atp_data(repo).await {
316+ atp_map.insert(atp_data.did.clone(), atp_data);
317+ }
318+ }
319+320+ Ok(atp_map)
321+ }
322+323+ async fn resolve_atp_data(&self, did: &str) -> Result<AtpData, SyncError> {
324+ let pds = if did.starts_with("did:plc:") {
325+ // Resolve PLC DID
326+ let response = self.client
327+ .get(&format!("https://plc.directory/{}", did))
328+ .send()
329+ .await?;
330+331+ if response.status().is_success() {
332+ let did_doc: DidDocument = response.json().await?;
333+ if let Some(services) = did_doc.service {
334+ for service in services {
335+ if service.service_type == "AtprotoPersonalDataServer" {
336+ return Ok(AtpData {
337+ did: did.to_string(),
338+ pds: service.service_endpoint,
339+ handle: None,
340+ });
341+ }
342+ }
343+ }
344+ }
345+346+ // Fallback to bsky.social
347+ "https://bsky.social".to_string()
348+ } else {
349+ // Fallback to bsky.social for non-PLC DIDs
350+ "https://bsky.social".to_string()
351+ };
352+353+ Ok(AtpData {
354+ did: did.to_string(),
355+ pds,
356+ handle: None,
357+ })
358+ }
359+360+ async fn index_actors(&self, repos: &[String], atp_map: &std::collections::HashMap<String, AtpData>) -> Result<(), SyncError> {
361+ let mut actors = Vec::new();
362+ let now = Utc::now().to_rfc3339();
363+364+ for repo in repos {
365+ if let Some(atp_data) = atp_map.get(repo) {
366+ actors.push(Actor {
367+ did: atp_data.did.clone(),
368+ handle: atp_data.handle.clone(),
369+ indexed_at: now.clone(),
370+ });
371+ }
372+ }
373+374+ if !actors.is_empty() {
375+ self.database.batch_insert_actors(&actors).await?;
376+ }
377+378+ Ok(())
379+ }
380+}
+6
api/src/utils.rs
···000000
···1+/// Determines if a collection NSID is considered "primary" vs "external"
2+/// Primary collections are social.grain.* domain
3+/// Everything else is considered external
4+pub fn is_primary_collection(nsid: &str) -> bool {
5+ nsid.starts_with("social.grain.")
6+}