lightrail: listReposByCollection service#
status: almost working well but not stable yet!!
Lightrail uses the adjacent keys in firehose commit CAR slices to detect first-record-added-to and last-record-removed-from collections in atproto repos, statelessly. Since most commits don't change repos' collection lists, this eliminates most of the work to maintain an accurate repos-by-collection index.
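A rough sketch of the adjacent-keys idea, not lightrail's actual code. It assumes repo keys have the form collection/rkey and are sorted, so all records of one collection sit next to each other. The function names are hypothetical, and obtaining the neighbouring keys from a commit's CAR slice is left out.

```rust
/// Collection (NSID) part of a repo key of the form `collection/rkey`.
fn collection_of(key: &str) -> &str {
    key.split('/').next().unwrap_or(key)
}

/// True when neither adjacent key belongs to the same collection as `key`.
/// For a created record this means "first record added to the collection";
/// for a deleted record it means "last record removed from the collection".
fn is_only_record_in_collection(key: &str, prev: Option<&str>, next: Option<&str>) -> bool {
    let collection = collection_of(key);
    let shares = |neighbor: Option<&str>| neighbor.map_or(false, |k| collection_of(k) == collection);
    !shares(prev) && !shares(next)
}
```

Only commits where this check returns true need to touch the repos-by-collection index; every other commit can be skipped.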
Compared to Bluesky's collectiondir service, lightrail:
- validates sync1.1 commit proofs for index integrity
- handles sync1.1 #sync events, catching significant repo changes
- actually removes repos from the index when their last record from a collection is removed
- while doing less work overall
Lightrail's main priorities are accuracy and correctness.
Backfill-by-collection assister#
Sync utilities in atproto like Tap and Hydrant can typically synchronize subsets of the atmosphere, filtering repositories by collection. The com.atproto.sync.listReposByCollection query answers "which repos already have content relevant to the filtered subset?", so the sync utility can backfill existing relevant network data.
You usually want to call listReposByCollection on the relay you subscribe to, to filter the same view of the network that your firehose delivers. But relays don't usually implement listReposByCollection themselves: instead they proxy the request to a helper service, like lightrail!
                                               ___________
     __________                               [ lightrail ]......
    [ your app ]                               ‾‾‾‾^‾‾‾‾‾‾      :
     ‾‾‾‾‾^‾‾‾‾                                  __|____ (subscribeRepos)
          |       .---listReposByCollection---->|--+    |       :
        __|__ ___/                              | relay |<.......
       [ tap ]<----------subscribeRepos---------|       |
        ‾‾‾‾‾                                    ‾‾‾‾‾‾‾
Subscribing lightrail to the same relay it's assisting keeps its network view consistent.
API#
GET /xrpc/com.atproto.sync.listReposByCollection: Query docs#
Lightrail implements some proposed changes to this query:
- collection parameter with zero values (absent) returns all repos
- repeated collection parameter returns repos from any of the specified collections
Quirks:
- limit can be up to 10,000 (lexicon specifies 2,000 max). This matches collectiondir's limit.
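A minimal sketch of building a request URL for this query from a client, using only the standard library. The base URL is a hypothetical local instance, the cursor parameter is assumed from the standard lexicon, and clamping limit on the client side is just one way to stay inside the documented maximum.

```rust
/// Lightrail's documented upper bound for `limit` (the lexicon specifies 2,000).
const MAX_LIMIT: u32 = 10_000;

/// Percent-encode everything outside the RFC 3986 unreserved set.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Build a listReposByCollection URL. An empty `collections` slice omits the
/// parameter entirely, which lightrail treats as "all repos"; several entries
/// become a repeated `collection` parameter.
fn list_repos_by_collection_url(
    base: &str, // hypothetical, e.g. "http://localhost:2511"
    collections: &[&str],
    limit: u32,
    cursor: Option<&str>,
) -> String {
    let mut params: Vec<String> = collections
        .iter()
        .map(|c| format!("collection={}", percent_encode(c)))
        .collect();
    params.push(format!("limit={}", limit.min(MAX_LIMIT)));
    if let Some(c) = cursor {
        params.push(format!("cursor={}", percent_encode(c)));
    }
    format!(
        "{}/xrpc/com.atproto.sync.listReposByCollection?{}",
        base.trim_end_matches('/'),
        params.join("&")
    )
}
```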
GET /xrpc/com.atproto.sync.listRepos: Query docs#
GET /xrpc/com.atproto.sync.getRepoStatus: Query docs#
Lightrail server quick start#
(one day we'll have pre-built binaries)
Lightrail is written in Rust. Installing rustup will get you everything you need to build and run it.
cargo run --release -- --upstream relay.fire.hose.cam
relay.fire.hose.cam is one of microcosm's full-network relays. Lightrail works with a relay or PDS host upstream, or any other service that implements at least:
- com.atproto.sync.subscribeRepos,
- com.atproto.sync.listRepos, and
- com.atproto.sync.getRepoStatus
Key configs#
- --db-path, default ./lightrail.db: where to write lightrail's fjall db
- --listen, default 0.0.0.0:2511: host and port to bind
# you can list *all* available options with:
cargo run -- --help
Environment vars can be used for all configs, prefixed with LIGHTRAIL_, like LIGHTRAIL_DB_PATH=/path/to/wherever.db.
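As an illustration of the naming pattern, here is a small helper that derives the variable name from a long flag. Only the LIGHTRAIL_DB_PATH example is confirmed above; the general rule (strip the leading dashes, uppercase, turn hyphens into underscores) is an assumption, and the helper itself is hypothetical.

```rust
/// Derive the environment variable name for a long flag such as `--db-path`.
/// Assumed rule: strip leading dashes, replace '-' with '_', uppercase, and
/// add the LIGHTRAIL_ prefix.
fn env_var_for_flag(flag: &str) -> String {
    let name = flag.trim_start_matches('-').replace('-', "_").to_uppercase();
    format!("LIGHTRAIL_{name}")
}
```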
Atmosphere configs#
- --plc-url, default: https://plc.directory: where to resolve did:plc identities. To use microcosm's mirror: --plc-url https://plc.wtf.
- --slingshot-url, default: https://slingshot.microcosm.blue: enables slingshot for identity resolution (PLC directory acts as fallback).
- --deep-crawl, default: [unset]. enumerate hosts from upstream with com.atproto.sync.listHosts and then crawl those hosts each directly with com.atproto.sync.listRepos.
Operational configs#
- --metrics-listen, default: 0.0.0.0:6789: enable prometheus-style metrics collection and serving at this address
- --max-resync-workers, default: 16: max backfill and repo resync concurrency. increase to use more resources to speed up backfill.
more knobs you can twist:
- --ident-cache-size, default: 2_000_000: identity resolution provides repo signing keys and PDS hostnames. a larger cache reduces outbound resolution requests at the cost of more memory used.
- --max-firehose-workers, default: 6: max firehose event processing concurrency.
- --cursor-save-interval-secs, default 1
- --describe-repo-fetch-timeout-secs, default 30
- --get-repo-fetch-timeout-secs, default 300
- --max-deep-crawl-workers, default 4: host-crawling concurrency for --deep-crawl
quirks#
- Lightrail's ordering of DIDs in the listReposByCollection response is different from collectiondir
  - collectiondir always inserts new DIDs at the end of the paginated response
  - Lightrail makes no ordering guarantee except that the paged response will not contain duplicates

  You should start your firehose listener before crawling listReposByCollection, because DIDs newly indexed by lightrail after the first page is requested are not guaranteed to be present in the total paged set.
- If you see a log line like

      ... WARN ... error=identity resolution failed: jacquard: unsupported DID method: did:web:...

  it just means the did:web resolution failed. lightrail supports did:web, but a tiny current bug in jacquard surfaces this message
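One way a consumer might handle the ordering quirk, sketched under assumptions: it collects DIDs from both the firehose (started first) and the paged crawl into one set, so a repo that lightrail indexes mid-crawl is still picked up from the firehose side. The type and method names are illustrative only.

```rust
use std::collections::HashSet;

/// Collects DIDs for a backfill from two sources: the live firehose and the
/// paged listReposByCollection crawl.
#[derive(Default)]
struct BackfillSet {
    dids: HashSet<String>,
}

impl BackfillSet {
    /// Record a DID seen in a firehose commit touching a wanted collection.
    /// Returns true if the DID was not known yet.
    fn from_firehose(&mut self, did: &str) -> bool {
        self.dids.insert(did.to_string())
    }

    /// Record one page of crawl results; returns how many DIDs were new.
    fn add_page(&mut self, page: &[&str]) -> usize {
        let mut added = 0;
        for did in page {
            if self.dids.insert(did.to_string()) {
                added += 1;
            }
        }
        added
    }
}
```

Because the set ignores insertion order, it does not matter where in the paged response lightrail places any given DID.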
Backfill#
Lightrail currently uses com.atproto.repo.describeRepo, like Bluesky's collectiondir. This is not as robust as we wish it was, and could soon be replaced by probing authenticated repo contents (see ./authenticated-collection-list.md).
The two reasons describeRepo isn't robust:
- the results are not authenticated (PDS bugs or quirks could lead to incorrect index)
- the response lacks the repo rev, so even if the list is accurate, it's not possible to prove that the next firehose commit follows without gaps
To mitigate the second, we always call com.atproto.sync.getRecord before describeRepo. This establishes a rev prior to the list, for eventual-(usually fast)-consistency after cutting over to the firehose.
The sync.getRecord response also includes a CAR slice that we can use: for very small repos, it might actually include a full repository export, in which case we can resync directly (and robustly!) from that and exit early. If it's a partial CAR, it will still include some keys whose presence we can assert when processing the describeRepo response to maybe catch a PDS bug.
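The key-presence assertion could look roughly like this. It is a sketch, not lightrail's implementation: it assumes the keys from the partial CAR have already been decoded into collection/rkey strings, and the function name is hypothetical.

```rust
use std::collections::BTreeSet;

/// Collections proven to exist by keys in the CAR slice but absent from the
/// collection list that describeRepo returned. A non-empty result hints at a
/// PDS bug or a stale describeRepo response.
fn collections_missing_from_describe<'a>(
    car_keys: &[&'a str],
    described: &[&str],
) -> BTreeSet<&'a str> {
    let described: BTreeSet<&str> = described.iter().copied().collect();
    car_keys
        .iter()
        .copied()
        // keep only the collection part of each `collection/rkey` key
        .filter_map(|key| key.split_once('/').map(|(collection, _rkey)| collection))
        .filter(|collection| !described.contains(collection))
        .collect()
}
```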
Future sync.getRecord work: since every provable partial CAR must contain at least the MST root node, we can make a very rough estimate of the full-repo export size, and go ahead and sync.getRepo instead of describeRepo when it's expected to be very small, for better accuracy without much additional bandwidth overhead.
When we call sync.getRecord, we provide a made-up collection and rkey, which works for our purposes because the response will contain a proof of absence if the key doesn't exist in the repo: a CAR slice (with rev + data from the commit object!) containing adjacent keys (that we'll use!). Unfortunately, not every PDS implements proof-of-absence responses; notably, bridgy currently returns an error for non-existent keys.
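For illustration, the probe's query parameters might be assembled like this. The collection and rkey values are placeholders invented for this sketch (the values lightrail actually sends are not stated here), and URL encoding is left to whatever HTTP client is used.

```rust
/// Hypothetical probe values: any valid NSID and rkey that no repo is
/// expected to contain would serve the same purpose.
const PROBE_COLLECTION: &str = "com.example.probe";
const PROBE_RKEY: &str = "does-not-exist";

/// Query parameters for a com.atproto.sync.getRecord absence probe.
fn absence_probe_params(did: &str) -> [(&'static str, String); 3] {
    [
        ("did", did.to_string()),
        ("collection", PROBE_COLLECTION.to_string()),
        ("rkey", PROBE_RKEY.to_string()),
    ]
}
```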
Resync fallback with sync.getRepo#
If the describeRepo approach fails for any reason, lightrail attempts to resync from a full repo export.
E.g., Bridgy's PDS often fails the sync.getRecord probe (see above) but its repos are still indexed by lightrail via this full-repo fallback.
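The backfill flow described in this section can be summarized as a small decision function. This is one reading of the prose above, with made-up type names, not a copy of lightrail's control flow.

```rust
/// Outcome of the sync.getRecord probe (illustrative names).
#[derive(Debug, PartialEq)]
enum Probe {
    /// The CAR turned out to hold the whole (very small) repo.
    FullRepo,
    /// A partial CAR: rev plus some adjacent keys.
    PartialCar,
    /// The host returned an error, e.g. no proof-of-absence support.
    Failed,
}

/// Where the collection list for a repo ends up coming from.
#[derive(Debug, PartialEq)]
enum Source {
    ProbeCar,
    DescribeRepo,
    FullExport,
}

fn choose_source(probe: Probe, describe_repo_ok: bool) -> Source {
    match (probe, describe_repo_ok) {
        // Small repo: resync directly from the probe's CAR and exit early.
        (Probe::FullRepo, _) => Source::ProbeCar,
        // Normal path: rev from the probe, collection list from describeRepo.
        (Probe::PartialCar, true) => Source::DescribeRepo,
        // Anything else falls back to a full export via sync.getRepo.
        _ => Source::FullExport,
    }
}
```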
Sync1.1#
plz remind fig to write this up: the strictness ratchet, any handling of lenient hosts we end up needing, and proof re: correctness of the adjacent keys approach.
wishlist features (probably doable?):#
- DONE accept multiple collections for listReposByCollection (merge + dedup by DID; works bc key is <collection>||<did>)
- DONE "wildcard" for listReposByCollection by omitting the collection query param entirely
- not doing: listReposByCollectionPrefix, either with additional indexes up the NSID hierarchy, or via merge+dedup.
- subscribe to multiple relays
- use authenticated repo contents for backfill instead of com.atproto.repo.describeRepo (see ./authenticated-collection-list.md)
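To show why the <collection>||<did> key layout makes multi-collection queries a merge + dedup, here is a toy in-memory version. A BTreeSet of (collection, did) tuples stands in for the real fjall keyspace; the function names are hypothetical and the real index may differ in details such as key encoding.

```rust
use std::collections::BTreeSet;

/// Index entries ordered by (collection, did), mirroring a `<collection>||<did>` key.
type Index = BTreeSet<(String, String)>;

/// All DIDs with at least one record in `collection`, via one ordered range scan.
fn dids_in<'a>(index: &'a Index, collection: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    index
        // start at the smallest possible key for this collection
        .range((collection.to_string(), String::new())..)
        // stop as soon as the scan leaves the collection
        .take_while(move |(c, _)| c == collection)
        .map(|(_, did)| did.as_str())
}

/// Merge the scans for several collections and dedup by DID.
fn dids_in_any(index: &Index, collections: &[&str]) -> BTreeSet<String> {
    collections
        .iter()
        .flat_map(|&c| dids_in(index, c))
        .map(str::to_string)
        .collect()
}
```

Each collection is a contiguous key range, so a repeated collection parameter costs one range scan per collection plus a set union.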
contributing#
see './hacking.md' for style, implementation, and architecture notes.
license#
This work is dual-licensed under MIT and Apache 2.0. You can choose either one if you use this work.
SPDX-License-Identifier: MIT OR Apache-2.0