lightrail: `listReposByCollection` service#

status: almost working well but not stable yet!!

Lightrail uses the adjacent keys in firehose commit CAR slices to detect first-record-added-to and last-record-removed-from collections in atproto repos, statelessly. Since most commits don't change repos' collection lists, this eliminates most of the work to maintain an accurate repos-by-collection index.
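The adjacent-keys idea can be sketched roughly as follows. This is a simplified illustration, not lightrail's actual code: the `Op`, `CollectionChange`, and `classify` names are hypothetical, and it assumes a single record operation whose left and right neighbor keys (if any) are already known from the CAR slice. Commits that touch several records in the same collection need more care than this shows.

```rust
/// What a single record operation means for the repo's collection list.
#[derive(Debug, PartialEq)]
enum CollectionChange {
    Unchanged,
    FirstRecordAdded,
    LastRecordRemoved,
}

/// Hypothetical, simplified view of a record operation in a commit.
/// Updates are left out because they never change the collection list.
#[derive(Debug, Clone, Copy)]
enum Op {
    Create,
    Delete,
}

/// Repo keys look like `<collection>/<rkey>`; return the collection part.
fn collection_of(key: &str) -> &str {
    key.split_once('/').map_or(key, |(collection, _rkey)| collection)
}

/// Keys are sorted, so every key of one collection sits in a contiguous run.
/// If neither neighbor of `key` belongs to the same collection, then `key`
/// is (or was) the only record in that collection.
fn classify(op: Op, key: &str, prev: Option<&str>, next: Option<&str>) -> CollectionChange {
    let collection = collection_of(key);
    let has_sibling = prev
        .into_iter()
        .chain(next)
        .any(|neighbor| collection_of(neighbor) == collection);
    match (op, has_sibling) {
        (_, true) => CollectionChange::Unchanged,
        (Op::Create, false) => CollectionChange::FirstRecordAdded,
        (Op::Delete, false) => CollectionChange::LastRecordRemoved,
    }
}
```

Only the two non-`Unchanged` outcomes require touching the index, which is where the savings come from.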

Compared to Bluesky's collectiondir service, lightrail:

validates sync1.1 commit proofs for index integrity
handles sync1.1 #sync events, catching significant repo changes
actually removes repos from the index when their last record from a collection is removed
while doing less work overall

Lightrail's main priorities are accuracy and correctness.

Backfill-by-collection assister#

Sync utilities in atproto like Tap and Hydrant can typically synchronize subsets of the atmosphere, filtering repositories by collection. The com.atproto.sync.listReposByCollection query answers "which repos already have content relevant to the filtered subset?", so the sync utility can backfill existing relevant network data.

You usually want to call listReposByCollection on the relay you subscribe to, to filter the same view of the network that your firehose delivers. But relays don't usually implement listReposByCollection themselves: instead they proxy the request to a helper service, like lightrail!

                                       ___________
 ___________                          [ lightrail ]......
[ your app  ]                          ‾‾‾‾^‾‾‾‾‾‾       :
 ‾‾‾‾‾^‾‾‾‾‾                             __|____  (subscribeRepos)
      |       .-listReposByCollection-->|--+    |        :
    __|__ ___/                          | relay |<.......
   [ tap ]<--------subscribeRepos-------|       |
    ‾‾‾‾‾                                ‾‾‾‾‾‾‾

Subscribing lightrail to the same relay it's assisting keeps its network view consistent.

API#

`GET /xrpc/com.atproto.sync.listReposByCollection`: Query docs #

Lightrail implements some proposed changes to this query:

collection parameter with zero values (absent) returns all repos
repeated collection parameter returns repos from any of the specified collections

Quirks:

limit can be up to 10,000 (lexicon specifies 2,000 max). This matches collectiondir's limit.
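As a rough illustration of the query shape described above, here is a sketch of a URL builder with a repeated `collection` parameter. The function names and the base URL in the test are made up for the example. `limit` and `cursor` are the usual lexicon parameters, and the clamp reflects the 10,000 maximum noted above.

```rust
/// Largest `limit` lightrail accepts, per the quirk noted above.
const MAX_LIMIT: u32 = 10_000;

/// Percent-encode everything except RFC 3986 unreserved characters.
fn encode(value: &str) -> String {
    let mut out = String::new();
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Hypothetical helper: build a listReposByCollection request URL.
/// An empty `collections` slice omits the parameter, which lightrail
/// treats as "all repos"; several entries mean "any of these".
fn list_repos_by_collection_url(
    base: &str,
    collections: &[&str],
    limit: u32,
    cursor: Option<&str>,
) -> String {
    let mut url = format!(
        "{}/xrpc/com.atproto.sync.listReposByCollection?limit={}",
        base.trim_end_matches('/'),
        limit.min(MAX_LIMIT)
    );
    for collection in collections {
        url.push_str("&collection=");
        url.push_str(&encode(collection));
    }
    if let Some(cursor) = cursor {
        url.push_str("&cursor=");
        url.push_str(&encode(cursor));
    }
    url
}
```

Clients that also talk to other implementations may prefer to stay within the lexicon's 2,000 limit.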

`GET /xrpc/com.atproto.sync.listRepos`: Query docs #

`GET /xrpc/com.atproto.sync.getRepoStatus`: Query docs #

Lightrail server quick start#

(one day we'll have pre-built binaries)

Lightrail is written in rust. Installing rustup will get you everything you need to build and run it.

cargo run --release -- --upstream relay.fire.hose.cam

relay.fire.hose.cam is one of microcosm's full-network relays. Lightrail works with a relay or PDS host upstream, or any other service that implements at least:

com.atproto.sync.subscribeRepos,
com.atproto.sync.listRepos, and
com.atproto.sync.getRepoStatus

Key configs#

--db-path, default ./lightrail.db: where to write lightrail's fjall db
--listen, default 0.0.0.0:2511: host and port to bind

# you can list *all* available options with:
cargo run -- --help

Environment vars can be used for all configs, prefixed with LIGHTRAIL_, like LIGHTRAIL_DB_PATH=/path/to/wherever.db.
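The naming pattern implied by that example can be sketched as a small helper. The function is hypothetical, and it assumes every flag follows the same upper-snake-case mapping that `--db-path` does; check `--help` for the authoritative names.

```rust
/// Hypothetical helper: derive the environment variable name for a flag,
/// assuming the `--db-path` -> `LIGHTRAIL_DB_PATH` pattern holds for all flags.
fn env_var_for(flag: &str) -> String {
    let name = flag.trim_start_matches("--").replace('-', "_").to_uppercase();
    format!("LIGHTRAIL_{}", name)
}
```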

Atmosphere configs#

--plc-url, default: https://plc.directory: where to resolve did:plc identities. To use microcosm's mirror: --plc-url https://plc.wtf.
--slingshot-url, default: https://slingshot.microcosm.blue: enables slingshot for identity resolution (PLC directory acts as fallback).
--deep-crawl, default: [unset]. enumerate hosts from upstream with com.atproto.sync.listHosts and then crawl those hosts each directly with com.atproto.sync.listRepos.
--heavy, default: [unset]. always resync with com.atproto.sync.getRepo instead of trying the lightweight com.atproto.repo.describeRepo query first. this pulls a lot more data!!

Operational configs#

--metrics-listen, default: 0.0.0.0:6789: enable prometheus-style metrics collection and serving at this address
--max-resync-workers, default: 16: max backfill and repo resync concurrency. increase to use more resources to speed up backfill.

more knobs you can twist:

--ident-cache-size, default: 2_000_000: identity resolution provides repo signing keys and PDS hostnames. a larger cache reduces outbound resolution requests at the cost of more memory used. turn down on memory-constrained systems
--fjall-cache-mb, default 256: the database cache size. turn down on memory-constrained systems.
--max-firehose-workers, default: 6: max firehose event processing concurrency.
--cursor-save-interval-secs, default 1
--describe-repo-fetch-timeout-secs, default 30
--get-repo-fetch-timeout-secs, default 300
--max-deep-crawl-workers, default 4: host-crawling concurrency for --deep-crawl

quirks#

Lightrail's ordering of DIDs in the listReposByCollection response is different from collectiondir's:
- collectiondir always inserts new DIDs at the end of the paginated response
- Lightrail makes no ordering guarantee except that the paged response will not contain duplicates
You should start your firehose listener before crawling listReposByCollection, because DIDs newly indexed by lightrail after the first page is requested are not guaranteed to be present in the total paged set.
If you see a log line like
```
... WARN ... error=identity resolution failed: jacquard: unsupported DID method: did:web:...
```
it just means the did:web resolution failed. lightrail supports did:web, but a tiny current bug in jacquard surfaces this message
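For the paging quirk above, a client-side crawl loop might look like this sketch. `Page` and `fetch_page` are hypothetical stand-ins for whatever HTTP client you use. It assumes the response carries a list of DIDs plus an optional cursor, with the cursor absent on the last page.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified shape of one listReposByCollection page.
struct Page {
    repos: Vec<String>,
    cursor: Option<String>,
}

/// Follow cursors until the server stops returning one, collecting DIDs.
/// `fetch_page` is a caller-supplied function that performs the request.
fn crawl_all<F>(mut fetch_page: F) -> BTreeSet<String>
where
    F: FnMut(Option<&str>) -> Page,
{
    let mut seen = BTreeSet::new();
    let mut cursor: Option<String> = None;
    loop {
        let page = fetch_page(cursor.as_deref());
        seen.extend(page.repos);
        match page.cursor {
            Some(next) => cursor = Some(next),
            None => break,
        }
    }
    seen
}
```

Collecting into a set means DIDs that your already-running firehose listener discovers during the crawl can be merged into the same set without creating duplicates.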

Backfill#

Lightrail currently uses com.atproto.repo.describeRepo, like Bluesky's collectiondir. This is not as robust as we wish it was, and could soon be replaced by probing authenticated repo contents (see ./authenticated-collection-list.md).

The two reasons describeRepo isn't robust:

the results are not authenticated (PDS bugs or quirks could lead to incorrect index)
the response lacks the repo rev, so even if the list is accurate, it's not possible to prove that the next firehose commit follows without gaps

To mitigate the second, we always call com.atproto.sync.getRecord before describeRepo. This establishes a rev prior to the list, for eventual-(usually fast)-consistency after cutting over to the firehose.

The sync.getRecord response also includes a CAR slice that we can use: for very small repos, it might actually include a full repository export, in which case we can resync directly (and robustly!) from that and exit early. If it's a partial CAR, it will still include some keys whose presence we can assert when processing the describeRepo response to maybe catch a PDS bug.

Future sync.getRecord work: since every provable partial CAR must contain at least the MST root node, we can make a very rough estimate of the full-repo export size, and go ahead and sync.getRepo instead of describeRepo when it's expected to be very small, for better accuracy without much additional bandwidth overhead.

When we call sync.getRecord, we provide a made-up collection and rkey, which works for our purposes because the response will contain a proof of absence if the key doesn't exist in the repo: a CAR slice (with rev + data from the commit object!) containing adjacent keys (that we'll use!). Unfortunately, not every PDS implements proof-of-absence responses; notably, bridgy currently returns an error for non-existent keys.

Resync fallback with `sync.getRepo`#

If the describeRepo approach fails for any reason, lightrail attempts to resync from a full repo export.

E.g., Bridgy's PDS often fails the sync.getRecord probe (see above) but its repos are still indexed by lightrail via this full-repo fallback.
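The backfill decision flow described in this section and the previous one can be summarized in a sketch like the following. The `Probe` and `Plan` types and the `plan_backfill` function are hypothetical and much simpler than the real thing. In particular, treating a failed key-presence assertion as a reason to fall back to the full export is an assumption made for this example.

```rust
/// Outcome of the `sync.getRecord` probe (hypothetical, simplified).
enum Probe {
    /// The CAR slice happened to contain the whole repo.
    FullRepo,
    /// A partial CAR: the collections seen among the keys in the slice.
    Partial { seen_collections: Vec<String> },
    /// The host returned an error instead of a proof.
    Failed,
}

/// Which data source ends up feeding the index.
#[derive(Debug, PartialEq)]
enum Plan {
    IndexFromProbe,
    IndexFromDescribeRepo,
    FullExport,
}

/// `described` is the collection list from describeRepo, or `None` if that
/// call failed. `heavy` mirrors the `--heavy` flag.
fn plan_backfill(heavy: bool, probe: &Probe, described: Option<&[String]>) -> Plan {
    if heavy {
        return Plan::FullExport;
    }
    match (probe, described) {
        // Small repo: the probe already delivered everything, exit early.
        (Probe::FullRepo, _) => Plan::IndexFromProbe,
        // Trust describeRepo only if it lists every collection the
        // authenticated slice showed us.
        (Probe::Partial { seen_collections }, Some(listed))
            if seen_collections.iter().all(|c| listed.contains(c)) =>
        {
            Plan::IndexFromDescribeRepo
        }
        // Probe error, describeRepo error, or a mismatch: sync.getRepo.
        _ => Plan::FullExport,
    }
}
```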

Sync1.1#

plz remind fig to write this up: the strictness ratchet, any handling of lenient hosts we end up needing, and proof re: correctness of the adjacent keys approach.

wishlist features (probably doable?):#

DONE accept multiple collections for listReposByCollection (merge + dedup by DID; works bc key is <collection>||<did>)
DONE "wildcard" for listReposByCollection by omitting the collection query param entirely
~~listReposByCollectionPrefix, either with additional indexes up the NSID hierarchy, or via merge+dedup.~~ not doing
subscribe to multiple relays
use authenticated repo contents for backfill instead of com.atproto.repo.describeRepo (see ./authenticated-collection-list.md)
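The merge + dedup note above can be illustrated with an in-memory stand-in for the index. This sketch uses a `BTreeSet` rather than fjall, and the NUL separator between collection and DID is an assumption made for the example, not lightrail's actual key encoding.

```rust
use std::collections::BTreeSet;

/// Separator between collection and DID in an index key (an assumption).
const SEP: char = '\0';

/// Build a `<collection>||<did>` style key for the sorted index.
fn index_key(collection: &str, did: &str) -> String {
    format!("{}{}{}", collection, SEP, did)
}

/// Prefix-scan each requested collection and merge the DIDs into one
/// deduplicated list. An empty `collections` slice returns every repo.
fn list_repos(index: &BTreeSet<String>, collections: &[&str]) -> Vec<String> {
    let mut dids = BTreeSet::new();
    if collections.is_empty() {
        for key in index {
            if let Some((_collection, did)) = key.split_once(SEP) {
                dids.insert(did.to_string());
            }
        }
    }
    for collection in collections {
        let prefix = format!("{}{}", collection, SEP);
        // Keys are sorted, so all keys for one collection are contiguous.
        for key in index.range(prefix.clone()..) {
            match key.strip_prefix(prefix.as_str()) {
                Some(did) => {
                    dids.insert(did.to_string());
                }
                None => break, // walked past this collection's run
            }
        }
    }
    dids.into_iter().collect()
}
```

Because the separator is part of the scanned prefix, a query for `app.bsky.feed` does not match `app.bsky.feed.post`, which is consistent with prefix queries being a separate (dropped) wishlist item.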

contributing#

see './hacking.md' for style, implementation, and architecture notes.

license#

This work is dual-licensed under MIT and Apache 2.0. You can choose either one if you use this work.

SPDX-License-Identifier: MIT OR Apache-2.0
