···
`hydrant` is an AT Protocol indexer built on the `fjall` database that handles sync for you. it's flexible: it supports both full-network indexing and filtered indexing (e.g., by DID), allows querying with XRPCs, and provides an ordered event stream with cursor support.
-you can see [random.wisp.place](https://tangled.org/did:plc:dfl62fgb7wtjj3fcbb72naae/random.wisp.place) for an example on how to use hydrant.
+you can see [random.wisp.place](https://tangled.org/did:plc:dfl62fgb7wtjj3fcbb72naae/random.wisp.place) (a standalone binary using the http API) or the [statusphere example](./examples/statusphere.rs) (hydrant-as-library) for examples of how to use hydrant.

**WARNING: *the db format is not stable yet.*** hydrant is in active development, so don't rely on the database format staying stable (e.g., for query features). this doesn't matter if you run in ephemeral mode, or if you don't mind losing already-backfilled data that you have already processed.

## vs `tap`

-while [`tap`](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) is designed as a firehose consumer and simply propagates events while handling sync, `hydrant` is flexible: it lets you query the database for records directly, and it provides an ordered view of events, so a cursor can be used to fetch events from a specific point in time.
+while [`tap`](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) is designed as a firehose consumer and simply propagates events while handling sync, `hydrant` is flexible: it lets you query the database for records directly, and it provides an ordered view of events, so a cursor can be used to fetch events from a specific point. it can act as an indexer or as an ephemeral view of a window of events.

### stream behavior

···
### multiple relay support

-`hydrant` supports connecting to multiple relays simultaneously for both firehose ingestion and crawling. when `RELAY_HOSTS` is configured with multiple URLs:
+`hydrant` supports connecting to multiple relays simultaneously for firehose ingestion. when `RELAY_HOSTS` is configured with multiple URLs:

- one independent firehose stream loop is spawned per relay
-- one independent crawling loop is spawned per relay
-- each relay maintains its own firehose / crawler cursor state
-- all ingestion loops and crawlers share the same worker pool and database
-- all crawlers share the same pending queue for backfill
+- each relay maintains its own firehose cursor state
+- all ingestion loops share the same worker pool and database

commit events are de-duplicated according to the repo `rev`. account / identity events are de-duplicated using the `time` field.
todo: decide what to do on relay-side account takedowns or if relays set the `time` field.

+### crawler sources
+
+the crawler is configured separately from the firehose via `CRAWLER_URLS`. each source is a `[mode::]url` entry; the mode prefix is optional and defaults to `by_collection` in filter mode or `relay` in full-network mode.
+
+- `relay`: enumerates the network via `com.atproto.sync.listRepos`, then checks each repo's collections via `describeRepo`. used for full-network discovery.
+- `by_collection`: queries `com.atproto.sync.listReposByCollection` for each configured signal. more efficient for filtered indexing, since it only surfaces repos that have matching records. cursors are stored per collection.
+
+```
+CRAWLER_URLS=by_collection::https://lightrail.microcosm.blue,relay::wss://bsky.network
+```
+
+each source maintains its own cursor, so restarts resume mid-pass.
+
## configuration

`hydrant` is configured via environment variables. all variables are prefixed with `HYDRANT_` (except `RUST_LOG`).
···
| :--- | :--- | :--- |
| `DATABASE_PATH` | `./hydrant.db` | path to the database folder. |
| `RUST_LOG` | `info` | log filter directives (e.g., `debug`, `hydrant=trace`). [`tracing` env-filter syntax](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html). |
-| `RELAY_HOST` | `wss://relay.fire.hose.cam/` | URL of the relay. |
-| `RELAY_HOSTS` | | comma-separated list of relay URLs. if unset, falls back to `RELAY_HOST`. |
+| `RELAY_HOST` | `wss://relay.fire.hose.cam/` | URL of the relay (firehose only). |
+| `RELAY_HOSTS` | | comma-separated list of relay URLs (firehose only). if unset, falls back to `RELAY_HOST`. |
+| `CRAWLER_URLS` | relay hosts in full-network mode, `https://lightrail.microcosm.blue` in filter mode | comma-separated list of `[mode::]url` crawler sources. mode is `relay` or `by_collection`; bare URLs use the default mode. set to an empty string to disable crawling. |
| `PLC_URL` | `https://plc.wtf` (`https://plc.directory` in full-network mode) | base URL(s) of the PLC directory (comma-separated for multiple). |
| `EPHEMERAL` | `false` | if enabled, no records are stored and events are deleted after `EPHEMERAL_TTL`. |
| `EPHEMERAL_TTL` | `60min` | how long events are kept before deletion. |
···
| `ENABLE_DEBUG` | `false` | enable debug endpoints. |
| `DEBUG_PORT` | `API_PORT + 1` | port for debug endpoints (if enabled). |
| `ENABLE_FIREHOSE` | `true` | whether to ingest relay subscriptions. |
-| `ENABLE_CRAWLER` | `false` (in filter mode), `true` (in full-network mode) | whether to actively query the network for unknown repositories. |
+| `ENABLE_CRAWLER` | `true` if full-network mode or crawler sources are configured, `false` otherwise | whether to actively query the network for unknown repositories. |
| `CRAWLER_MAX_PENDING_REPOS` | `2000` | max pending repos for the crawler. |
| `CRAWLER_RESUME_PENDING_REPOS` | `1000` | resume threshold for crawler pending repos. |

···
- `POST /db/train`: train zstd compression dictionaries for the `repos`, `blocks`, and `events` keyspaces. dictionaries are written to disk; a restart is required to apply them. the crawler, firehose, and backfill worker are paused for the duration and restored on completion.
- `POST /db/compact`: trigger a full major compaction of all database keyspaces in parallel. the crawler, firehose, and backfill worker are paused for the duration and restored on completion.
+- `DELETE /cursors`: reset all stored cursors for a given URL. body: `{ "key": "..." }`, where the key is a URL. clears the relay crawler cursor and any by-collection cursors associated with that URL; the next crawler pass restarts from the beginning.

#### filter mode

+27-8
examples/statusphere.rs
···
//!
//! the database persists records across restarts. on each start the full event
//! history is replayed from the database to rebuild the in-memory index.
+//! (a more robust app could, for example, use hydrant's ephemeral mode together
+//! with its own database, or use hydrant to backfill multiple instances of the app.)

+use std::str::FromStr;
use std::sync::Arc;
use std::time::Duration;

+use chrono::DateTime;
use futures::StreamExt;
use hydrant::config::Config;
use hydrant::control::{EventStream, Hydrant, ReposControl};
use hydrant::filter::FilterMode;
+use jacquard_common::types::tid::Tid;
use scc::HashMap;

const COLLECTION: &str = "xyz.statusphere.status";
···
        }
    }

-    fn set(&self, did: String, emoji: String, created_at: String) -> bool {
+    fn set(&self, did: String, emoji: String, created_at: &str) -> bool {
        let is_newer = self
            .current
-            .read_sync(&did, |_, e| created_at > e.created_at)
+            .read_sync(&did, |_, e| created_at > e.created_at.as_str())
            .unwrap_or(true);
        if is_newer {
-            self.current
-                .upsert_sync(did, StatusEntry { emoji, created_at });
+            self.current.upsert_sync(
+                did,
+                StatusEntry {
+                    emoji,
+                    created_at: created_at.to_owned(),
+                },
+            );
        }
        is_newer
    }
···
                let created_at = record
                    .get("createdAt")
                    .and_then(|v| v.as_str())
-                    .unwrap_or("")
-                    .to_owned();
+                    .unwrap_or("");
                if index.set(did.clone(), emoji.clone(), created_at) {
                    let name = repos
                        .get(&rec.did)
···
                        .flatten()
                        .and_then(|info| info.handle)
                        .unwrap_or(did);
-                    println!("[{}] {name} set status: {emoji}", event.id);
+                    println!("[{created_at}] {name}: {emoji}");
                }
            }
            "delete" => {
···
                    .and_then(|info| info.handle)
                    .unwrap_or(did.clone());
                index.delete(&did);
-                println!("[{}] {name} cleared status", event.id);
+                let date = Tid::from_str(&rec.rkey)
+                    .ok()
+                    .and_then(|tid| DateTime::from_timestamp_micros(tid.timestamp() as i64))
+                    .map(|date| date.to_string())
+                    .unwrap_or_else(|| "invalid rkey".to_string());
+                println!("[{date}] {name} cleared status");
            }
            _ => {}
        }
···
        .with_env_filter("hydrant=info")
        .init();

+    // config is loaded from environment variables (all prefixed with HYDRANT_).
+    // key defaults for this example:
+    //   DATABASE_PATH=./hydrant.db                    | where to store the database.
+    //   RELAY_HOST=wss://relay.fire.hose.cam/         | firehose source.
+    //   CRAWLER_URLS=https://lightrail.microcosm.blue | crawler sources. in filter mode this defaults to `by_collection` mode.
    let cfg = Config::from_env()?;
    let hydrant = Hydrant::new(cfg).await?;

···
    let crawler_default = match config.enable_crawler {
        Some(b) => b,
-        // default: enabled in full-network mode, disabled in filter mode
-        None => filter_config.mode == crate::filter::FilterMode::Full,
+        // default: enabled in full-network mode, or if crawler sources are configured
+        None => {
+            filter_config.mode == crate::filter::FilterMode::Full
+                || !config.crawler_sources.is_empty()
+        }
    };

    let filter = new_handle(filter_config);
+120
tests/collection_index_test.nu
+#!/usr/bin/env nu
+# tests that the collection-index crawler (listReposByCollection) correctly discovers
+# and backfills repos from a lightrail-style index server.
+#
+# usage: nu tests/collection_index_test.nu
+#
+# requires network access to lightrail.microcosm.blue.
+use common.nu *
+
+def main [] {
+    let port = 3015
+    let url = $"http://localhost:($port)"
+    let db_path = (mktemp -d -t hydrant_collection_index_test.XXXXXX)
+    let collection = "app.bsky.graph.starterpack"
+    let index_url = "https://lightrail.microcosm.blue"
+
+    print $"database path: ($db_path)"
+
+    # fetch a small known set of repos from the collection index so we can verify
+    # they appear in hydrant after the discovery pass
+    print $"fetching known repos for ($collection) from ($index_url)..."
+    let index_resp = (http get $"($index_url)/xrpc/com.atproto.sync.listReposByCollection?collection=($collection)&limit=3")
+    let known_dids = ($index_resp.repos | each { |r| $r.did })
+
+    if ($known_dids | is-empty) {
+        print "SKIP: collection index returned no repos, cannot verify discovery"
+        rm -rf $db_path
+        exit 0
+    }
+
+    print $"will verify these DIDs are discovered: ($known_dids | str join ', ')"
+
+    # start hydrant in filter mode with the test collection as a signal.
+    # HYDRANT_ENABLE_COLLECTION_INDEX is true by default when signals are set,
+    # so no explicit override is needed.
+    $env.HYDRANT_FILTER_SIGNALS = $collection
+    $env.HYDRANT_COLLECTION_INDEX_URL = $index_url
+    # keep the pending queue very small so throttling doesn't interfere with the test
+    $env.HYDRANT_CRAWLER_MAX_PENDING_REPOS = "500"
+    $env.HYDRANT_CRAWLER_RESUME_PENDING_REPOS = "200"
+    # no relay needed for collection-index testing
+    $env.HYDRANT_ENABLE_FIREHOSE = "false"
+
+    let binary = build-hydrant
+    let instance = start-hydrant $binary $db_path $port
+
+    mut test_passed = false
+
+    if not (wait-for-api $url) {
+        print "ERROR: hydrant failed to start"
+        try { kill -9 $instance.pid }
+        rm -rf $db_path
+        exit 1
+    }
+
+    # verify the filter was applied correctly
+    let filter = (http get $"($url)/filter")
+    print $"filter state: ($filter | to json)"
+    if not ($filter.signals | any { |s| $s == $collection }) {
+        print $"FAILED: ($collection) not in signals — filter not configured"
+        try { kill -9 $instance.pid }
+        rm -rf $db_path
+        exit 1
+    }
+
+    # wait for the collection-index pass to discover and enqueue repos.
+    # the first pass runs immediately on startup; repos should appear within ~30s
+    # depending on pagination speed and backfill concurrency.
+    print "waiting for collection-index discovery pass..."
+    mut discovered = false
+    for i in 1..60 {
+        let stats = (try { (http get $"($url)/stats?accurate=true").counts } catch { {} })
+        let repos = ($stats | get --optional repos | default 0 | into int)
+        print $"[($i)/60] repos: ($repos)"
+        if $repos >= ($known_dids | length) {
+            $discovered = true
+            break
+        }
+        sleep 2sec
+    }
+
+    if not $discovered {
+        print "FAILED: collection-index did not discover the expected repos within the timeout"
+        try { kill -9 $instance.pid }
+        rm -rf $db_path
+        exit 1
+    }
+
+    # verify each known DID appears in the repos API
+    print "verifying known DIDs were discovered..."
+    mut all_found = true
+    for did in $known_dids {
+        let repo = (try {
+            http get $"($url)/repos/($did)"
+        } catch {
+            null
+        })
+        if ($repo | is-empty) {
+            print $"FAILED: ($did) not found in repos API"
+            $all_found = false
+        } else {
+            print $"ok: ($did) — status: ($repo.status)"
+        }
+    }
+
+    if $all_found {
+        print "test PASSED: collection-index correctly discovered repos from listReposByCollection"
+        $test_passed = true
+    }
+
+    print "stopping hydrant..."
+    try { kill -9 $instance.pid }
+    rm -rf $db_path
+
+    if $test_passed {
+        exit 0
+    } else {
+        exit 1
+    }
+}