···
# check performance via logfire

use `mcp__logfire__arbitrary_query` with `age` in minutes (max 43200 = 30 days).

note: `duration` is in seconds (DOUBLE PRECISION), multiply by 1000 for ms.

## latency percentiles by endpoint
```sql
SELECT span_name,
       COUNT(*) as count,
       ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration) * 1000, 2) as p50_ms,
       ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration) * 1000, 2) as p95_ms,
       ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration) * 1000, 2) as p99_ms
FROM records
WHERE span_name LIKE 'http.%'
GROUP BY span_name
ORDER BY count DESC
```

## slow requests with trace IDs
```sql
SELECT span_name, duration * 1000 as ms, trace_id, start_timestamp
FROM records
WHERE span_name LIKE 'http.%' AND duration > 0.1
ORDER BY duration DESC
LIMIT 20
```

## trace breakdown (drill into slow request)
```sql
SELECT span_name, duration * 1000 as ms, message, attributes->>'sql' as sql
FROM records
WHERE trace_id = '<TRACE_ID>'
ORDER BY start_timestamp
```

## database comparison (turso vs local)
```sql
SELECT
  CASE WHEN span_name = 'db.query' THEN 'turso'
       WHEN span_name = 'db.local.query' THEN 'local' END as db,
  COUNT(*) as queries,
  ROUND(AVG(duration) * 1000, 2) as avg_ms,
  ROUND(MAX(duration) * 1000, 2) as max_ms
FROM records
WHERE span_name IN ('db.query', 'db.local.query')
GROUP BY db
```

## recent errors
```sql
SELECT start_timestamp, span_name, exception_type, exception_message
FROM records
WHERE exception_type IS NOT NULL
ORDER BY start_timestamp DESC
LIMIT 10
```

## traffic pattern (requests per minute)
```sql
SELECT date_trunc('minute', start_timestamp) as minute,
       COUNT(*) as requests
FROM records
WHERE span_name LIKE 'http.%'
GROUP BY minute
ORDER BY minute DESC
LIMIT 30
```

## search query distribution
```sql
SELECT attributes->>'query' as query, COUNT(*) as count
FROM records
WHERE span_name = 'http.search' AND attributes->>'query' IS NOT NULL
GROUP BY query
ORDER BY count DESC
LIMIT 20
```

## typical workflow
1. run latency percentiles to get a baseline
2. if p95/p99 are high, find slow requests with trace IDs
3. drill into a specific trace to see which child spans are slow
4. check the db comparison to see if turso calls are the bottleneck
+23
.claude/commands/check-prod.md
···
# check prod health

## quick status
```bash
curl -s https://leaflet-search-backend.fly.dev/health
curl -s https://leaflet-search-backend.fly.dev/stats | jq
```

## observability
use the logfire MCP server to query traces and logs:
- `mcp__logfire__arbitrary_query` - run SQL against traces/spans
- `mcp__logfire__find_exceptions_in_file` - recent exceptions by file
- `mcp__logfire__schema_reference` - see available columns

## database
use the turso CLI for direct SQL:
```bash
turso db shell leaflet-search "SELECT COUNT(*) FROM documents"
turso db shell leaflet-search "SELECT * FROM documents ORDER BY created_at DESC LIMIT 5"
```

## tap status
from `tap/` directory: `just check`
···
# leaflet-search notes

## deployment
- **backend**: push to `main` touching `backend/**` → auto-deploys via GitHub Actions
- **frontend**: manual deploy only (`wrangler pages deploy site --project-name leaflet-search`)
- **tap**: manual deploy from `tap/` directory (`fly deploy --app leaflet-search-tap`)

## remotes
- `origin`: tangled.sh:zzstoatzz.io/leaflet-search
- `github`: github.com/zzstoatzz/leaflet-search (CI runs here)
- push to both: `git push origin main && git push github main`

## architecture
- **backend** (Zig): HTTP API, FTS5 search, vector similarity
- **tap**: firehose sync via bluesky-social/indigo tap
- **site**: static frontend on Cloudflare Pages
- **db**: Turso (source of truth) + local SQLite read replica (FTS queries)

## platforms
- leaflet, pckt, offprint, greengale: known platforms (detected via basePath)
- other: site.standard.* documents not from a known platform

## search ranking
- hybrid BM25 + recency: `ORDER BY rank + (days_old / 30)` (see the sketch below)
- OR between terms for recall, prefix on last word
- unicode61 tokenizer (non-alphanumeric = separator)
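
a minimal sketch of that ordering in FTS5 terms — the table and column names (`documents_fts`, `documents`, `created_at`) are illustrative, and the real queries live in `backend/src/search.zig`:

```sql
-- sketch only: BM25 relevance plus an age penalty of roughly 1 point per 30 days.
-- bm25() is more negative for better matches, so adding the penalty pushes old docs down.
SELECT d.uri, d.title
FROM documents_fts
JOIN documents d ON d.rowid = documents_fts.rowid
WHERE documents_fts MATCH 'crypto OR casino*'
ORDER BY bm25(documents_fts)
       + (julianday('now') - julianday(d.created_at)) / 30.0
LIMIT 20;
```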

## tap operations
- from `tap/` directory: `just check` (status), `just turbo` (catch-up), `just normal` (steady state)
- see `docs/tap.md` for memory tuning and debugging

## common tasks
- check indexing: `curl -s https://leaflet-search-backend.fly.dev/api/dashboard | jq`
+36-35
README.md
···
-# leaflet-search
+# pub search

 by [@zzstoatzz.io](https://bsky.app/profile/zzstoatzz.io)

-search for [leaflet](https://leaflet.pub).
+search ATProto publishing platforms ([leaflet](https://leaflet.pub), [pckt](https://pckt.blog), [offprint](https://offprint.app), [greengale](https://greengale.app), and others using [standard.site](https://standard.site)).

-**live:** [leaflet-search.pages.dev](https://leaflet-search.pages.dev)
+**live:** [pub-search.waow.tech](https://pub-search.waow.tech)
+
+> formerly "leaflet-search" - generalized to support multiple publishing platforms

 ## how it works

-1. **tap** syncs leaflet content from the network
+1. **[tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)** syncs content from ATProto firehose (signals on `site.standard.document`, filters `pub.leaflet.*` + `site.standard.*`)
 2. **backend** indexes content into SQLite FTS5 via [Turso](https://turso.tech), serves search API
 3. **site** static frontend on Cloudflare Pages
···
 search is also exposed as an MCP server for AI agents like Claude Code:

 ```bash
-claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}'
+claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
 ```

 see [mcp/README.md](mcp/README.md) for local setup and usage details.
···
 ## api

 ```
-GET /search?q=<query>&tag=<tag>   # full-text search with query, tag, or both
-GET /similar?uri=<at-uri>         # find similar documents via vector embeddings
-GET /tags                         # list all tags with counts
-GET /popular                      # popular search queries
-GET /stats                        # document/publication counts
-GET /health                       # health check
+GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date>   # full-text search
+GET /similar?uri=<at-uri>         # find similar documents
+GET /tags                         # list all tags with counts
+GET /popular                      # popular search queries
+GET /stats                        # counts + request latency (p50/p95)
+GET /health                       # health check
 ```

-search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). tag filtering applies to documents only.
+search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, offprint, greengale, or other). tag and platform filtering apply to documents only.
+
+**ranking**: results use hybrid BM25 + recency scoring. text relevance is primary, but recent documents get a boost (~1 point per 30 days). the `since` parameter filters to documents created after the given ISO date (e.g., `since=2025-01-01`).

 `/similar` uses [Voyage AI](https://voyageai.com) embeddings with brute-force cosine similarity (~0.15s for 3500 docs).

-## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)
+## configuration

-- [Fly.io](https://fly.io) hosts backend + tap
-- [Turso](https://turso.tech) cloud SQLite with vector support
-- [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
-- [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs leaflet content from ATProto firehose
-- [Zig](https://ziglang.org) HTTP server, search API, content indexing
-- [Cloudflare Pages](https://pages.cloudflare.com) static frontend
+the backend is fully configurable via environment variables:

-## embeddings
+| variable | default | description |
+|----------|---------|-------------|
+| `APP_NAME` | `leaflet-search` | name shown in startup logs |
+| `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
+| `TAP_HOST` | `leaflet-search-tap.fly.dev` | tap websocket host |
+| `TAP_PORT` | `443` | tap websocket port |
+| `PORT` | `3000` | HTTP server port |
+| `TURSO_URL` | - | Turso database URL (required) |
+| `TURSO_TOKEN` | - | Turso auth token (required) |
+| `VOYAGE_API_KEY` | - | Voyage AI API key (for embeddings) |

-documents are embedded using Voyage AI's `voyage-3-lite` model (512 dimensions). new documents from the firehose don't automatically get embeddings - they need to be backfilled periodically.
+the backend indexes multiple ATProto platforms - currently `pub.leaflet.*` and `site.standard.*` collections. platform is stored per-document and returned in search results.

-### backfill embeddings
+## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)

-requires `TURSO_URL`, `TURSO_TOKEN`, and `VOYAGE_API_KEY` in `.env`:
+- [Fly.io](https://fly.io) hosts [Zig](https://ziglang.org) search API and content indexing
+- [Turso](https://turso.tech) cloud SQLite with [Voyage AI](https://voyageai.com) vector support
+- [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
+- [Cloudflare Pages](https://pages.cloudflare.com) static frontend

-```bash
-# check how many docs need embeddings
-./scripts/backfill-embeddings --dry-run
+## embeddings

-# run the backfill (uses batching + concurrency)
-./scripts/backfill-embeddings --batch-size 50
-```
-
-the script:
-- fetches docs where `embedding IS NULL`
-- batches them to Voyage API (50 docs/batch default)
-- writes embeddings to Turso in batched transactions
-- runs 8 concurrent workers
+documents are embedded using Voyage AI's `voyage-3-lite` model (512 dimensions). the backend automatically generates embeddings for new documents via a background worker - no manual backfill needed.

 **note:** we use brute-force cosine similarity instead of a vector index. Turso's DiskANN index has ~60s write latency per row, making it impractical for incremental updates. brute-force on 3500 vectors runs in ~0.15s which is fine for this scale.
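
as a rough illustration of the brute-force approach, a query along these lines can be run with libSQL's vector functions — the column and parameter names here (`embedding`, `:source_uri`) are assumptions, not the backend's actual query:

```sql
-- sketch only: compare every stored embedding against the source document
SELECT uri, title,
       vector_distance_cos(
         embedding,
         (SELECT embedding FROM documents WHERE uri = :source_uri)
       ) AS distance
FROM documents
WHERE embedding IS NOT NULL AND uri != :source_uri
ORDER BY distance   -- cosine distance: smaller means more similar
LIMIT 10;
```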
···
 const std = @import("std");
+const posix = std.posix;
 const schema = @import("schema.zig");
 const result = @import("result.zig");
+const sync = @import("sync.zig");

 // re-exports
 pub const Client = @import("Client.zig");
+pub const LocalDb = @import("LocalDb.zig");
 pub const Row = result.Row;
 pub const Result = result.Result;
 pub const BatchResult = result.BatchResult;
···
 // global state
 var gpa: std.heap.GeneralPurposeAllocator(.{}) = .{};
 var client: ?Client = null;
+var local_db: ?LocalDb = null;

-pub fn init() !void {
+/// Initialize Turso client only (fast, call synchronously at startup)
+pub fn initTurso() !void {
     client = try Client.init(gpa.allocator());
     try schema.init(&client.?);
 }

+/// Initialize local SQLite replica (slow, call in background thread)
+pub fn initLocalDb() void {
+    initLocal() catch |err| {
+        std.debug.print("local db init failed (will use turso only): {}\n", .{err});
+    };
+}
+
+pub fn init() !void {
+    try initTurso();
+    initLocalDb();
+}
+
+fn initLocal() !void {
+    // check if local db is disabled
+    if (posix.getenv("LOCAL_DB_ENABLED")) |val| {
+        if (std.mem.eql(u8, val, "false") or std.mem.eql(u8, val, "0")) {
+            std.debug.print("local db disabled via LOCAL_DB_ENABLED\n", .{});
+            return;
+        }
+    }
+
+    local_db = LocalDb.init(gpa.allocator());
+    try local_db.?.open();
+}
+
 pub fn getClient() ?*Client {
     if (client) |*c| return c;
     return null;
 }
+
+/// Get local db if ready (synced and available)
+pub fn getLocalDb() ?*LocalDb {
+    if (local_db) |*l| {
+        if (l.isReady()) return l;
+    }
+    return null;
+}
+
+/// Get local db even if not ready (for sync operations)
+pub fn getLocalDbRaw() ?*LocalDb {
+    if (local_db) |*l| return l;
+    return null;
+}
+
+/// Start background sync thread (call from main after db.init)
+pub fn startSync() void {
+    const c = getClient() orelse {
+        std.debug.print("sync: no turso client, skipping\n", .{});
+        return;
+    };
+    const local = getLocalDbRaw() orelse {
+        std.debug.print("sync: no local db, skipping\n", .{});
+        return;
+    };
+
+    const thread = std.Thread.spawn(.{}, syncLoop, .{ c, local }) catch |err| {
+        std.debug.print("sync: failed to start thread: {}\n", .{err});
+        return;
+    };
+    thread.detach();
+    std.debug.print("sync: background thread started\n", .{});
+}
+
+fn syncLoop(turso: *Client, local: *LocalDb) void {
+    // full sync on startup
+    sync.fullSync(turso, local) catch |err| {
+        std.debug.print("sync: initial full sync failed: {}\n", .{err});
+    };
+
+    // get sync interval from env (default 5 minutes)
+    const interval_secs: u64 = blk: {
+        const env_val = posix.getenv("SYNC_INTERVAL_SECS") orelse "300";
+        break :blk std.fmt.parseInt(u64, env_val, 10) catch 300;
+    };
+
+    std.debug.print("sync: incremental sync every {d} seconds\n", .{interval_secs});
+
+    // periodic incremental sync
+    while (true) {
+        std.Thread.sleep(interval_secs * std.time.ns_per_s);
+        sync.incrementalSync(turso, local) catch |err| {
+            std.debug.print("sync: incremental sync failed: {}\n", .{err});
+        };
+    }
+}
+105-1
backend/src/db/schema.zig
···
     \\CREATE VIRTUAL TABLE IF NOT EXISTS publications_fts USING fts5(
     \\  uri UNINDEXED,
     \\  name,
-    \\  description
+    \\  description,
+    \\  base_path
     \\)
     , &.{});
···
     client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
     client.exec("UPDATE documents SET source_collection = 'pub.leaflet.document' WHERE source_collection IS NULL", &.{}) catch {};

+    // multi-platform support for publications
+    client.exec("ALTER TABLE publications ADD COLUMN platform TEXT DEFAULT 'leaflet'", &.{}) catch {};
+    client.exec("ALTER TABLE publications ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.publication'", &.{}) catch {};
+    client.exec("UPDATE publications SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
+    client.exec("UPDATE publications SET source_collection = 'pub.leaflet.publication' WHERE source_collection IS NULL", &.{}) catch {};
+
     // vector embeddings column already added by backfill script
+
+    // dedupe index: same (did, rkey) across collections = same document
+    // e.g., pub.leaflet.document/abc and site.standard.document/abc are the same content
+    client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_documents_did_rkey ON documents(did, rkey)", &.{}) catch {};
+    client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_publications_did_rkey ON publications(did, rkey)", &.{}) catch {};
+
+    // backfill platform from source_collection for records indexed before platform detection fix
+    client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform = 'unknown' AND source_collection LIKE 'pub.leaflet.%'", &.{}) catch {};
+    client.exec("UPDATE documents SET platform = 'pckt' WHERE platform = 'unknown' AND source_collection LIKE 'blog.pckt.%'", &.{}) catch {};
+
+    // rename 'standardsite' to 'other' (standardsite was a misnomer - it's a lexicon, not a platform)
+    // documents using site.standard.* that don't match a known platform are simply "other"
+    client.exec("UPDATE documents SET platform = 'other' WHERE platform = 'standardsite'", &.{}) catch {};
+
+    // detect platform from publication basePath (site.standard.* is a lexicon, not a platform)
+    // known platforms (pckt, leaflet, offprint) use site.standard.* but have distinct basePaths
+    client.exec(
+        \\UPDATE documents SET platform = 'pckt'
+        \\WHERE platform IN ('other', 'unknown')
+        \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%pckt.blog%')
+    , &.{}) catch {};
+
+    client.exec(
+        \\UPDATE documents SET platform = 'leaflet'
+        \\WHERE platform IN ('other', 'unknown')
+        \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%leaflet.pub%')
+    , &.{}) catch {};
+
+    client.exec(
+        \\UPDATE documents SET platform = 'offprint'
+        \\WHERE platform IN ('other', 'unknown')
+        \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%offprint.app%' OR base_path LIKE '%offprint.test%')
+    , &.{}) catch {};
+
+    client.exec(
+        \\UPDATE documents SET platform = 'greengale'
+        \\WHERE platform IN ('other', 'unknown')
+        \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%greengale.app%')
+    , &.{}) catch {};
+
+    // URL path field for documents (e.g., "/001" for zat.dev)
+    // used to build full URL: publication.url + document.path
+    client.exec("ALTER TABLE documents ADD COLUMN path TEXT", &.{}) catch {};
+
+    // denormalized columns for query performance (avoids per-row subqueries)
+    client.exec("ALTER TABLE documents ADD COLUMN base_path TEXT DEFAULT ''", &.{}) catch {};
+    client.exec("ALTER TABLE documents ADD COLUMN has_publication INTEGER DEFAULT 0", &.{}) catch {};
+
+    // backfill base_path from publications (idempotent - only updates empty values)
+    client.exec(
+        \\UPDATE documents SET base_path = COALESCE(
+        \\  (SELECT p.base_path FROM publications p WHERE p.uri = documents.publication_uri),
+        \\  (SELECT p.base_path FROM publications p WHERE p.did = documents.did LIMIT 1),
+        \\  ''
+        \\) WHERE base_path IS NULL OR base_path = ''
+    , &.{}) catch {};
+
+    // backfill has_publication (idempotent)
+    client.exec(
+        "UPDATE documents SET has_publication = CASE WHEN publication_uri != '' THEN 1 ELSE 0 END WHERE has_publication = 0 AND publication_uri != ''",
+        &.{},
+    ) catch {};
+
+    // note: publications_fts was rebuilt with base_path column via scripts/rebuild-pub-fts
+    // new publications will include base_path via insertPublication in indexer.zig
+
+    // 2026-01-22: clean up stale publication/self records that were deleted from ATProto
+    // these cause incorrect basePath lookups for greengale documents
+    // specifically: did:plc:27ivzcszryxp6mehutodmcxo had publication/self with basePath 'greengale.app'
+    // but that publication was deleted, and the correct one is 'greengale.app/3fz.org'
+    client.exec(
+        \\DELETE FROM publications WHERE rkey = 'self'
+        \\AND base_path = 'greengale.app'
+        \\AND did = 'did:plc:27ivzcszryxp6mehutodmcxo'
+    , &.{}) catch {};
+    client.exec(
+        \\DELETE FROM publications_fts WHERE uri IN (
+        \\  SELECT 'at://' || did || '/site.standard.publication/self'
+        \\  FROM publications WHERE rkey = 'self' AND base_path = 'greengale.app'
+        \\)
+    , &.{}) catch {};
+
+    // re-derive basePath for greengale documents that got wrong basePath
+    // match documents to greengale publications (basePath contains greengale.app)
+    // prefer more specific basePaths (with subdomain)
+    client.exec(
+        \\UPDATE documents SET base_path = (
+        \\  SELECT p.base_path FROM publications p
+        \\  WHERE p.did = documents.did
+        \\  AND p.base_path LIKE 'greengale.app/%'
+        \\  ORDER BY LENGTH(p.base_path) DESC
+        \\  LIMIT 1
+        \\)
+        \\WHERE platform = 'greengale'
+        \\AND (base_path = 'greengale.app' OR base_path LIKE '%pckt.blog%')
+        \\AND did IN (SELECT did FROM publications WHERE base_path LIKE 'greengale.app/%')
+    , &.{}) catch {};
 }
···
# API reference

base URL: `https://leaflet-search-backend.fly.dev`

## endpoints

### search

```
GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date>
```

full-text search across documents and publications.

**parameters:**

| param | type | required | description |
|-------|------|----------|-------------|
| `q` | string | no* | search query (titles and content) |
| `tag` | string | no | filter by tag (documents only) |
| `platform` | string | no | filter by platform: `leaflet`, `pckt`, `offprint`, `greengale`, `other` |
| `since` | string | no | ISO date, filter to documents created after |

*at least one of `q` or `tag` is required

**response:**
```json
[
  {
    "type": "article|looseleaf|publication",
    "uri": "at://did:plc:.../collection/rkey",
    "did": "did:plc:...",
    "title": "document title",
    "snippet": "...matched text...",
    "createdAt": "2025-01-15T...",
    "rkey": "abc123",
    "basePath": "gyst.leaflet.pub",
    "platform": "leaflet",
    "path": "/001"
  }
]
```

**result types:**
- `article`: document in a publication
- `looseleaf`: standalone document (no publication)
- `publication`: the publication itself (only returned for text queries, not tag/platform filters)

**ranking:** hybrid BM25 + recency. text relevance is primary, recent docs are boosted (~1 point per 30 days).

### similar

```
GET /similar?uri=<at-uri>
```

find semantically similar documents using vector similarity (voyage-3-lite embeddings).

**parameters:**

| param | type | required | description |
|-------|------|----------|-------------|
| `uri` | string | yes | AT-URI of source document |

**response:** same format as search (array of results)

### tags

```
GET /tags
```

list all tags with document counts, sorted by popularity.

**response:**
```json
[
  {"tag": "programming", "count": 42},
  {"tag": "rust", "count": 15}
]
```

### popular

```
GET /popular
```

popular search queries.

**response:**
```json
[
  {"query": "rust async", "count": 12},
  {"query": "leaflet", "count": 8}
]
```

### platforms

```
GET /platforms
```

document counts by platform.

**response:**
```json
[
  {"platform": "leaflet", "count": 2500},
  {"platform": "pckt", "count": 800},
  {"platform": "greengale", "count": 150},
  {"platform": "offprint", "count": 50},
  {"platform": "other", "count": 100}
]
```

### stats

```
GET /stats
```

index statistics and request timing.

**response:**
```json
{
  "documents": 3500,
  "publications": 120,
  "embeddings": 3200,
  "searches": 5000,
  "errors": 5,
  "cache_hits": 1200,
  "cache_misses": 800,
  "timing": {
    "search": {"count": 1000, "avg_ms": 25, "p50_ms": 20, "p95_ms": 50, "p99_ms": 80, "max_ms": 150},
    "similar": {"count": 200, "avg_ms": 150, "p50_ms": 140, "p95_ms": 200, "p99_ms": 250, "max_ms": 300},
    "tags": {"count": 500, "avg_ms": 5, "p50_ms": 4, "p95_ms": 10, "p99_ms": 15, "max_ms": 25},
    "popular": {"count": 300, "avg_ms": 3, "p50_ms": 2, "p95_ms": 5, "p99_ms": 8, "max_ms": 12}
  }
}
```

### activity

```
GET /activity
```

hourly activity counts (last 24 hours).

**response:**
```json
[12, 8, 5, 3, 2, 1, 0, 0, 1, 5, 15, 25, 30, 28, 22, 18, 20, 25, 30, 35, 28, 20, 15, 10]
```

### dashboard

```
GET /api/dashboard
```

rich dashboard data for analytics UI.

**response:**
```json
{
  "startedAt": 1705000000,
  "searches": 5000,
  "publications": 120,
  "documents": 3500,
  "platforms": [{"platform": "leaflet", "count": 2500}],
  "tags": [{"tag": "programming", "count": 42}],
  "timeline": [{"date": "2025-01-15", "count": 25}],
  "topPubs": [{"name": "gyst", "basePath": "gyst.leaflet.pub", "count": 150}],
  "timing": {...}
}
```

### health

```
GET /health
```

**response:**
```json
{"status": "ok"}
```

## building URLs

documents can be accessed on the web via their `basePath` and `rkey` (see the sketch below):
- articles: `https://{basePath}/{rkey}`, or `https://{basePath}{path}` if path is set
- publications: `https://{basePath}`

examples:
- `https://gyst.leaflet.pub/3ldasifz7bs2l`
- `https://greengale.app/3fz.org/001`
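
for illustration, the same rule expressed as a query over the denormalized columns (`base_path`, `path`, `rkey`); this is a sketch, not the code the backend actually runs:

```sql
-- sketch only: article URL is basePath + path when set, otherwise basePath + /rkey
SELECT uri,
       'https://' || base_path ||
       CASE WHEN path IS NOT NULL AND path != '' THEN path
            ELSE '/' || rkey
       END AS url
FROM documents
WHERE base_path != '';
```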
+99
docs/content-extraction.md
···
# content extraction for site.standard.document

lessons learned from implementing cross-platform content extraction.

## the problem

[eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y):

> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.

short answer: yes. but once you handle `content.pages` extraction, it's straightforward.

## textContent: platform-dependent

`site.standard.document` has a `textContent` field for pre-flattened plaintext:

```json
{
  "title": "my post",
  "textContent": "the full text content, ready for indexing...",
  "content": {
    "$type": "blog.pckt.content",
    "items": [ /* platform-specific blocks */ ]
  }
}
```

**pckt, offprint, greengale** populate `textContent`. extraction is trivial.

**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`.

## extraction strategy

priority order (in `extractor.zig`):

1. `textContent` - use if present
2. `pages` - top-level blocks (pub.leaflet.document)
3. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)

```zig
// try textContent first
if (zat.json.getString(record, "textContent")) |text| {
    return text;
}

// fall back to block parsing
const pages = zat.json.getArray(record, "pages") orelse
    zat.json.getArray(record, "content.pages");
```

the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls.

## deduplication

documents can appear in both collections with identical `(did, rkey)`:
- `site.standard.document`
- `pub.leaflet.document`

handle with `ON CONFLICT`:

```sql
INSERT INTO documents (uri, ...)
ON CONFLICT(uri) DO UPDATE SET ...
```

note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat.

## platform detection

collection name doesn't indicate platform for `site.standard.*` records. detection order (see the sketch after this list):

1. **basePath** - infer from publication basePath:

| basePath contains | platform |
|-------------------|----------|
| `leaflet.pub` | leaflet |
| `pckt.blog` | pckt |
| `offprint.app` | offprint |
| `greengale.app` | greengale |

2. **content.$type** - fallback for custom domains (e.g., `cailean.journal.ewancroft.uk`):

| content.$type starts with | platform |
|---------------------------|----------|
| `pub.leaflet.` | leaflet |

3. if neither matches → `other`
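
a compact way to picture that precedence as a query — the column names (`base_path`, `content_type`) are borrowed from the schema and extractor notes, and this is a sketch rather than the actual `indexer.zig` logic:

```sql
-- sketch only: basePath wins, content.$type is the fallback, else 'other'
SELECT uri,
       CASE
         WHEN base_path LIKE '%leaflet.pub%'    THEN 'leaflet'
         WHEN base_path LIKE '%pckt.blog%'      THEN 'pckt'
         WHEN base_path LIKE '%offprint.app%'   THEN 'offprint'
         WHEN base_path LIKE '%greengale.app%'  THEN 'greengale'
         WHEN content_type LIKE 'pub.leaflet.%' THEN 'leaflet'
         ELSE 'other'
       END AS platform
FROM documents;
```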

## summary

- **pckt/offprint/greengale**: use `textContent` directly
- **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
- **deduplication**: `ON CONFLICT` on `(did, rkey)` or `uri`
- **platform**: infer from basePath, fall back to content.$type for custom domains

## code references

- `backend/src/extractor.zig` - content extraction logic, content_type field
- `backend/src/indexer.zig:99-118` - platform detection from basePath + content_type
+226
docs/scratch/leaflet-publishing-plan.md
···
# publishing to leaflet.pub

## goal

publish markdown docs to both:
1. `site.standard.document` (for search/interop) - already working
2. `pub.leaflet.document` (for leaflet.pub display) - this plan

## the mapping

### block types

| markdown | leaflet block |
|----------|---------------|
| `# heading` | `pub.leaflet.blocks.header` (level 1-6) |
| paragraph | `pub.leaflet.blocks.text` |
| ``` code ``` | `pub.leaflet.blocks.code` |
| `> quote` | `pub.leaflet.blocks.blockquote` |
| `---` | `pub.leaflet.blocks.horizontalRule` |
| `- item` | `pub.leaflet.blocks.unorderedList` |
| `![alt](url)` | `pub.leaflet.blocks.image` (requires blob upload) |
| `[text](url)` (standalone) | `pub.leaflet.blocks.website` |

### inline formatting (facets)

leaflet uses byte-indexed facets for inline formatting within text blocks:

```json
{
  "$type": "pub.leaflet.blocks.text",
  "plaintext": "hello world with bold text",
  "facets": [{
    "index": { "byteStart": 17, "byteEnd": 21 },
    "features": [{ "$type": "pub.leaflet.richtext.facet#bold" }]
  }]
}
```

| markdown | facet type |
|----------|------------|
| `**bold**` | `pub.leaflet.richtext.facet#bold` |
| `*italic*` | `pub.leaflet.richtext.facet#italic` |
| `` `code` `` | `pub.leaflet.richtext.facet#code` |
| `[text](url)` | `pub.leaflet.richtext.facet#link` |
| `~~strike~~` | `pub.leaflet.richtext.facet#strikethrough` |

## record structure

```json
{
  "$type": "pub.leaflet.document",
  "author": "did:plc:...",
  "title": "document title",
  "description": "optional description",
  "publishedAt": "2026-01-06T00:00:00Z",
  "publication": "at://did:plc:.../pub.leaflet.publication/rkey",
  "tags": ["tag1", "tag2"],
  "pages": [{
    "$type": "pub.leaflet.pages.linearDocument",
    "id": "page-uuid",
    "blocks": [
      {
        "$type": "pub.leaflet.pages.linearDocument#block",
        "block": { /* one of the block types above */ }
      }
    ]
  }]
}
```

## implementation plan

### phase 1: markdown parser

add a simple markdown block parser to zat or the publish script:

```zig
const BlockType = enum {
    heading,
    paragraph,
    code,
    blockquote,
    horizontal_rule,
    unordered_list,
    image,
};

const Block = struct {
    type: BlockType,
    content: []const u8,
    level: ?u8 = null, // for headings
    language: ?[]const u8 = null, // for code blocks
    alt: ?[]const u8 = null, // for images
    src: ?[]const u8 = null, // for images
};

fn parseMarkdownBlocks(allocator: Allocator, markdown: []const u8) ![]Block
```

parsing approach:
- split on blank lines to get blocks
- identify block type by first characters:
  - `#` → heading (count `#` for level)
  - ``` → code block (capture until closing ```)
  - `>` → blockquote
  - `---` → horizontal rule
  - `-` or `*` at start → list item
  - `![` → image
  - else → paragraph

### phase 2: inline facet extraction

for text blocks, extract inline formatting:

```zig
const Facet = struct {
    byte_start: usize,
    byte_end: usize,
    feature: FacetFeature,
};

const FacetFeature = union(enum) {
    bold,
    italic,
    code,
    link: []const u8, // url
    strikethrough,
};

fn extractFacets(allocator: Allocator, text: []const u8) !struct {
    plaintext: []const u8,
    facets: []Facet,
}
```

approach:
- scan for `**`, `*`, `` ` ``, `[`, `~~`
- track byte positions as we strip markers
- build facet list with adjusted indices

### phase 3: image blob upload

images need to be uploaded as blobs before referencing:

```zig
fn uploadImageBlob(client: *XrpcClient, allocator: Allocator, image_path: []const u8) !BlobRef
```

for now, could skip images or require them to already be uploaded.

### phase 4: json serialization

build the full `pub.leaflet.document` record:

```zig
const LeafletDocument = struct {
    @"$type": []const u8 = "pub.leaflet.document",
    author: []const u8,
    title: []const u8,
    description: ?[]const u8 = null,
    publishedAt: []const u8,
    publication: ?[]const u8 = null,
    tags: ?[][]const u8 = null,
    pages: []Page,
};

const Page = struct {
    @"$type": []const u8 = "pub.leaflet.pages.linearDocument",
    id: []const u8,
    blocks: []BlockWrapper,
};
```

### phase 5: integrate into publish-docs.zig

update the publish script to:
1. parse markdown into blocks
2. convert to leaflet structure
3. publish `pub.leaflet.document` alongside `site.standard.document`

```zig
// existing: publish site.standard.document
try putRecord(&client, allocator, session.did, "site.standard.document", tid.str(), doc_record);

// new: also publish pub.leaflet.document
const leaflet_record = try markdownToLeaflet(allocator, content, title, session.did, pub_uri);
try putRecord(&client, allocator, session.did, "pub.leaflet.document", tid.str(), leaflet_record);
```

## complexity estimate

| component | complexity | notes |
|-----------|------------|-------|
| block parsing | medium | regex-free, line-by-line |
| facet extraction | medium | byte index tracking is fiddly |
| image upload | low | already have blob upload in xrpc |
| json serialization | low | std.json handles it |
| integration | low | add to existing publish flow |

total: ~300-500 lines of zig

## open questions

1. **publication record**: do we need a `pub.leaflet.publication` too, or just documents?
   - leaflet allows standalone documents without publications
   - could skip publication for now

2. **image handling**:
   - option A: skip images initially (just text content)
   - option B: require images to be URLs (no blob upload)
   - option C: full blob upload support

3. **deduplication**: same rkey for both record types?
   - pro: easy to correlate
   - con: different collections, might not matter

4. **validation**: leaflet has a validate endpoint
   - could call `/api/unstable_validate` to check records before publish
   - probably skip for v1

## references

- [pub.leaflet.document schema](/tmp/leaflet/lexicons/pub/leaflet/document.json)
- [leaflet publishToPublication.ts](/tmp/leaflet/actions/publishToPublication.ts) - how leaflet creates records
- [site.standard.document schema](/tmp/standard.site/app/data/lexicons/document.json)
- paul's site: fetches records, doesn't publish them
+272
docs/scratch/logfire-zig-adoption.md
···
# logfire-zig adoption guide for leaflet-search

guide for integrating logfire-zig into the leaflet-search backend.

## 1. add dependency

in `backend/build.zig.zon`:

```zig
.dependencies = .{
    // ... existing deps ...
    .logfire = .{
        .url = "https://tangled.sh/zzstoatzz.io/logfire-zig/archive/main",
        .hash = "...", // run zig build to get hash
    },
},
```

in `backend/build.zig`, add the import:

```zig
const logfire = b.dependency("logfire", .{
    .target = target,
    .optimize = optimize,
});
exe.root_module.addImport("logfire", logfire.module("logfire"));
```

## 2. configure in main.zig

```zig
const std = @import("std");
const logfire = @import("logfire");
// ... other imports ...

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // configure logfire early
    // reads LOGFIRE_WRITE_TOKEN from env automatically
    const lf = try logfire.configure(.{
        .service_name = "leaflet-search",
        .service_version = "0.0.1",
        .environment = std.posix.getenv("FLY_APP_NAME") orelse "development",
    });
    defer lf.shutdown();

    logfire.info("starting leaflet-search on port {d}", .{port});

    // ... rest of main ...
}
```

## 3. replace timing.zig with spans

current pattern in server.zig:

```zig
fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
    const start_time = std.time.microTimestamp();
    defer timing.record(.search, start_time);
    // ...
}
```

with logfire:

```zig
fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
    const span = logfire.span("search.handle", .{});
    defer span.end();

    // parse params
    const query = parseQueryParam(alloc, target, "q") catch "";

    // add attributes after parsing
    span.setAttribute("query", query);
    span.setAttribute("tag", tag_filter orelse "");

    // ...
}
```

for nested operations:

```zig
fn search(alloc: Allocator, query: []const u8, ...) ![]Result {
    const span = logfire.span("search.execute", .{
        .query_length = @intCast(query.len),
    });
    defer span.end();

    // FTS query
    {
        const fts_span = logfire.span("search.fts", .{});
        defer fts_span.end();
        // ... FTS logic ...
    }

    // vector search fallback
    if (results.len < limit) {
        const vec_span = logfire.span("search.vector", .{});
        defer vec_span.end();
        // ... vector search ...
    }

    return results;
}
```

## 4. add structured logging

replace `std.debug.print` with logfire:

```zig
// before
std.debug.print("accept error: {}\n", .{err});

// after
logfire.err("accept error: {}", .{err});
```

```zig
// before
std.debug.print("{s} listening on http://0.0.0.0:{d}\n", .{app_name, port});

// after
logfire.info("{s} listening on port {d}", .{app_name, port});
```

for sync operations in tap.zig:

```zig
logfire.info("sync complete", .{});
logfire.debug("processed {d} events", .{event_count});
```

for errors (note the `{s}` specifier, since `@errorName` returns a string slice):

```zig
logfire.err("turso query failed: {s}", .{@errorName(err)});
```

## 5. add metrics

replace stats.zig counters with logfire metrics:

```zig
// before (in stats.zig)
pub fn recordSearch(query: []const u8) void {
    total_searches.fetchAdd(1, .monotonic);
    // ...
}

// with logfire (in server.zig or stats.zig)
pub fn recordSearch(query: []const u8) void {
    logfire.counter("search.total", 1);
    // existing logic...
}
```

for gauges (e.g., active connections, document counts):

```zig
logfire.gaugeInt("documents.indexed", doc_count);
logfire.gaugeInt("connections.active", active_count);
```

for latency histograms (more detail than a counter):

```zig
// after search completes
logfire.metric(.{
    .name = "search.latency_ms",
    .unit = "ms",
    .data = .{
        .histogram = .{
            .data_points = &[_]logfire.HistogramDataPoint{.{
                .start_time_ns = start_ns,
                .time_ns = std.time.nanoTimestamp(),
                .count = 1,
                .sum = latency_ms,
                .bucket_counts = ...,
                .explicit_bounds = ...,
                .min = latency_ms,
                .max = latency_ms,
            }},
        },
    },
});
```

## 6. deployment

add to fly.toml secrets:

```bash
fly secrets set LOGFIRE_WRITE_TOKEN=pylf_v1_us_xxxxx --app leaflet-search-backend
```

logfire-zig reads from `LOGFIRE_WRITE_TOKEN` or `LOGFIRE_TOKEN` automatically.

## 7. what to keep from existing code

**keep timing.zig** - it provides local latency histograms for the dashboard API. logfire spans complement this with distributed tracing.

**keep stats.zig** - local counters are still useful for the `/stats` endpoint. logfire metrics add remote observability.

**keep activity.zig** - tracks recent activity for the dashboard. orthogonal to logfire.

the pattern is: local state for the dashboard UI, logfire for observability.

## 8. migration order

1. add dependency, configure in main.zig
2. add spans to request handlers (search, similar, tags, popular)
3. add structured logging for errors and important events
4. add metrics for key counters
5. gradually replace `std.debug.print` with logfire logging
6. consider removing timing.zig if logfire histograms are sufficient

## 9. example: full search handler

```zig
fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
    const span = logfire.span("http.search", .{});
    defer span.end();

    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const alloc = arena.allocator();

    const query = parseQueryParam(alloc, target, "q") catch "";
    const tag_filter = parseQueryParam(alloc, target, "tag") catch null;

    if (query.len == 0 and tag_filter == null) {
        logfire.debug("empty search request", .{});
        try sendJson(request, "{\"error\":\"enter a search term\"}");
        return;
    }

    const results = search.search(alloc, query, tag_filter, null, null) catch |err| {
        logfire.err("search failed: {s}", .{@errorName(err)});
        stats.recordError();
        return err;
    };

    logfire.counter("search.requests", 1);
    logfire.info("search completed", .{});

    // ... send response ...
}
```

## 10. verifying it works

run locally:

```bash
LOGFIRE_WRITE_TOKEN=pylf_v1_us_xxx zig build run
```

check the logfire dashboard for traces from the `leaflet-search` service.

without a token (console fallback):

```bash
zig build run
# prints [span], [info], [metric] to stderr
```
+350
docs/scratch/standard-search-planning.md
···
# standard-search planning

expanding leaflet-search to index all standard.site records.

## references

- [standard.site](https://standard.site/) - shared lexicons for long-form publishing on ATProto
- [leaflet.pub](https://leaflet.pub/) - implements `pub.leaflet.*` lexicons
- [pckt.blog](https://pckt.blog/) - implements `blog.pckt.*` lexicons
- [offprint.app](https://offprint.app/) - implements `app.offprint.*` lexicons
- [ATProto docs](https://atproto.com/docs) - protocol documentation

## context

discussion with pckt.blog team about building global search for the standard.site ecosystem.
current leaflet-search is tightly coupled to `pub.leaflet.*` lexicons.

### recent work (2026-01-05)

added similarity cache to improve `/similar` endpoint performance (sketched below):
- `similarity_cache` table stores computed results keyed by `(source_uri, doc_count)`
- cache auto-invalidates when document count changes
- `/stats` endpoint now shows `cache_hits` and `cache_misses`
- first request ~3s (cold), cached requests ~0.15s
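
a hedged sketch of what such a cache table can look like — the actual column layout isn't shown in this doc, only the `(source_uri, doc_count)` key and the invalidation rule:

```sql
-- sketch only: cached /similar results keyed by source doc + corpus size
CREATE TABLE IF NOT EXISTS similarity_cache (
  source_uri TEXT NOT NULL,
  doc_count  INTEGER NOT NULL,   -- total documents when computed; a new count invalidates the row
  results    TEXT NOT NULL,      -- serialized result list (e.g. JSON)
  created_at TEXT DEFAULT (datetime('now')),
  PRIMARY KEY (source_uri, doc_count)
);

-- lookup: only hits while the document count still matches
SELECT results FROM similarity_cache
WHERE source_uri = :uri
  AND doc_count = (SELECT COUNT(*) FROM documents);
```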

also added loading indicator for "related to" results in frontend.

### recent work (2026-01-06)

- merged PR1: multi-platform schema (platform + source_collection columns)
- added `loading.js` - portable loading state handler for dashboards
  - skeleton shimmer while loading
  - "waking up" toast after 2s threshold (fly.io cold start handling)
  - designed to be copied to other projects
- fixed pluralization ("1 result" vs "2 results")

## what we know

### standard.site lexicons

two shared lexicons for long-form publishing on ATProto:
- `site.standard.document` - document content and metadata
- `site.standard.publication` - publication/blog metadata

implementing platforms:
- leaflet.pub (`pub.leaflet.*`)
- pckt.blog (`blog.pckt.*`)
- offprint.app (`app.offprint.*`)

### site.standard.document schema

examined real records from pckt.blog. key fields:

```
textContent  - PRE-FLATTENED TEXT FOR SEARCH (the holy grail)
content      - platform-specific block structure
  .$type     - identifies platform (e.g., "blog.pckt.content")
title        - document title
tags         - array of strings
site         - AT-URI reference to site.standard.publication
path         - URL path (e.g., "/my-post-abc123")
publishedAt  - ISO timestamp
updatedAt    - ISO timestamp
coverImage   - blob reference
```

### the textContent field

this is huge. platforms flatten their block content into a single text field:

```json
{
  "content": {
    "$type": "blog.pckt.content",
    "items": [ /* platform-specific blocks */ ]
  },
  "textContent": "i have been writing a lot of atproto things in zig!..."
}
```

no need to parse platform-specific blocks - just index `textContent` directly.

### platform detection

derive platform from `content.$type` prefix:
- `blog.pckt.content` → pckt
- `pub.leaflet.content` → leaflet (TBD - need to verify)
- `app.offprint.content` → offprint (TBD - need to verify)

### current leaflet-search architecture

```
ATProto firehose (via tap)
  ↓
tap.zig - subscribes to pub.leaflet.document/publication
  ↓
indexer.zig - extracts content from nested pages[].blocks[] structure
  ↓
turso (sqlite) - documents table + FTS5 + embeddings
  ↓
search.zig - FTS5 queries + vector similarity
  ↓
server.zig - HTTP API (/search, /similar, /stats)
```

leaflet-specific code:
- tap.zig lines 10-11: hardcoded collection names
- tap.zig lines 234-268: block type extraction (pub.leaflet.blocks.*)
- recursive page/block traversal logic

generalizable code:
- database schema (FTS5, tags, stats, similarity cache)
- search/similar logic
- HTTP API
- embedding pipeline

## proposed architecture for standard-search

### ingestion changes

subscribe to:
- `site.standard.document`
- `site.standard.publication`

optionally also subscribe to platform-specific collections for richer data:
- `pub.leaflet.document/publication`
- `blog.pckt.document/publication` (if they have these)
- `app.offprint.document/publication` (if they have these)

### content extraction

for `site.standard.document`:
1. use `textContent` field directly - no block parsing!
2. fall back to title + description if textContent missing

for platform-specific records (if needed):
- keep existing leaflet block parser
- add parsers for other platforms as needed

### database changes

add to documents table (sketched below):
- `platform` TEXT - derived from content.$type (leaflet, pckt, offprint)
- `source_collection` TEXT - the actual lexicon (site.standard.document, pub.leaflet.document)
- `standard_uri` TEXT - if platform-specific record, link to corresponding site.standard.document
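
a sketch of the corresponding migration, mirroring the `ALTER TABLE` pattern already used in `schema.zig` — the `standard_uri` column is still hypothetical at this point in the plan:

```sql
-- sketch only: additive columns for multi-platform support (PR1)
ALTER TABLE documents ADD COLUMN platform TEXT DEFAULT 'leaflet';
ALTER TABLE documents ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.document';
ALTER TABLE documents ADD COLUMN standard_uri TEXT;  -- link to the matching site.standard.document, if any

-- backfill the existing ~3500 rows
UPDATE documents SET platform = 'leaflet' WHERE platform IS NULL;
UPDATE documents SET source_collection = 'pub.leaflet.document' WHERE source_collection IS NULL;
```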
146146+147147+### API changes
148148+149149+- `/search?q=...&platform=leaflet` - optional platform filter
150150+- results include `platform` field
151151+- `/similar` works across all platforms
152152+153153+### naming/deployment
154154+155155+options:
156156+1. rename leaflet-search โ standard-search (breaking change)
157157+2. new repo/deployment, keep leaflet-search as-is
158158+3. branch and generalize, decide naming later
159159+160160+leaning toward option 3 for now.
161161+162162+## findings from exploration
163163+164164+### pckt.blog - READY
165165+- writes `site.standard.document` records
166166+- has `textContent` field (pre-flattened)
167167+- `content.$type` = `blog.pckt.content`
168168+- 6+ records found on pckt.blog service account
169169+170170+### leaflet.pub - NOT YET MIGRATED
171171+- still using `pub.leaflet.document` only
172172+- no `site.standard.document` records found
173173+- no `textContent` field - content is in nested `pages[].blocks[]`
174174+- will need to continue parsing blocks OR wait for migration
175175+176176+### offprint.app - NOW INDEXED (2026-01-22)
177177+- writes `site.standard.document` records with `app.offprint.content` blocks
178178+- has `textContent` field (pre-flattened)
179179+- platform detected via basePath (`*.offprint.app`, `*.offprint.test`)
180180+- now fully supported alongside leaflet and pckt
181181+182182+### greengale.app - NOW INDEXED (2026-01-22)
183183+- writes `site.standard.document` records
184184+- has `textContent` field (pre-flattened)
185185+- platform detected via basePath (`greengale.app/*`)
186186+- ~29 documents indexed at time of discovery
187187+- now fully supported alongside leaflet, pckt, and offprint
188188+189189+### implication for architecture
190190+191191+two paths:
192192+193193+**path A: wait for leaflet migration**
194194+- simpler: just index `site.standard.document` with `textContent`
195195+- all platforms converge on same schema
196196+- downside: loses existing leaflet search until they migrate
197197+198198+**path B: hybrid approach**
199199+- index `site.standard.document` (pckt, future leaflet, offprint)
200200+- ALSO index `pub.leaflet.document` with existing block parser
201201+- dedupe by URI or store both with `source_collection` indicator
202202+- more complex but maintains backwards compat
203203+204204+leaning toward **path B** - can't lose 3500 leaflet docs.
205205+206206+## open questions
207207+208208+- [x] does leaflet write site.standard.document records? **NO, not yet**
209209+- [x] does offprint write site.standard.document records? **UNKNOWN - no public content yet**
210210+- [ ] when will leaflet migrate to standard.site?
211211+- [ ] should we dedupe platform-specific vs standard records?
212212+- [ ] embeddings: regenerate for all, or use same model?
213213+214214+## implementation plan (PRs)
215215+216216+breaking work into reviewable chunks:
217217+218218+### PR1: database schema for multi-platform โ MERGED
219219+- add `platform TEXT` column to documents (default 'leaflet')
220220+- add `source_collection TEXT` column (default 'pub.leaflet.document')
221221+- backfill existing ~3500 records
222222+- no behavior change, just schema prep
223223+- https://github.com/zzstoatzz/leaflet-search/pull/1
224224+225225+### PR2: generalized content extraction
226226+- new `extractor.zig` module with platform-agnostic interface
227227+- `textContent` extraction for standard.site records
228228+- keep existing block parser for `pub.leaflet.*`
229229+- platform detection from `content.$type`
230230+231231+### PR3: tap subscriber for site.standard.document
232232+- subscribe to `site.standard.document` + `site.standard.publication`
233233+- route to appropriate extractor
234234+- starts ingesting pckt.blog content
235235+236236+### PR4: API platform filter
237237+- add `?platform=` query param to `/search`
238238+- include `platform` field in results
239239+- frontend: show platform badge, optional filter
240240+241241+### PR5 (optional, separate track): witness cache
242242+- `witness_cache` table for raw records
243243+- replay tooling for backfills
244244+- independent of above work
245245+246246+## operational notes
247247+248248+- **cloudflare pages**: `leaflet-search` does NOT auto-deploy from git. manual deploy required:
249249+ ```bash
250250+ wrangler pages deploy site --project-name leaflet-search
251251+ ```
252252+- **fly.io backend**: deploy from backend directory:
253253+ ```bash
254254+ cd backend && fly deploy
255255+ ```
256256+- **git remotes**: push to both `origin` (tangled.sh) and `github` (for MCP + PRs)
257257+258258+## next steps
259259+260260+1. ~~verify leaflet's site.standard.document structure~~ (done - they don't have any)
261261+2. ~~find and examine offprint records~~ (done - no public content yet)
262262+3. ~~PR1: database schema~~ (merged)
263263+4. PR2: generalized content extraction
264264+5. PR3: tap subscriber
265265+6. PR4: API platform filter
266266+7. consider witness cache architecture (see below)
267267+268268+---
269269+270270+## architectural consideration: witness cache
271271+272272+[paul frazee's post on witness caches](https://bsky.app/profile/pfrazee.com/post/3lfarplxvcs2e) (2026-01-05):
273273+274274+> I'm increasingly convinced that many Atmosphere backends start with a local "witness cache" of the repositories.
275275+>
276276+> A witness cache is a copy of the repository records, plus a timestamp of when the record was indexed (the "witness time") which you want to keep
277277+>
278278+> The key feature is: you can replay it
279279+280280+> With local replay, you can add new tables or indexes to your backend and quickly backfill the data. If you don't have a witness cache, you would have to do backfill from the network, which is slow
281281+282282+### current leaflet-search architecture (no witness cache)
283283+284284+```
285285+Firehose โ tap โ Parse & Transform โ Store DERIVED data โ Discard raw record
286286+```
287287+288288+we store:
289289+- `uri`, `did`, `rkey`
290290+- `title` (extracted)
291291+- `content` (flattened from blocks)
292292+- `created_at`, `publication_uri`
293293+294294+we discard: the raw record JSON
295295+296296+### witness cache architecture
297297+298298+```
299299+Firehose โ Store RAW record + witness_time โ Derive indexes on demand (replayable)
300300+```
301301+302302+would store:
303303+- `uri`, `collection`, `rkey`
304304+- `raw_record` (full JSON blob)
305305+- `witness_time` (when we indexed it)
306306+307307+then derive FTS, embeddings, etc. from local data via replay.
308308+309309+### comparison
310310+311311+| scenario | current (no cache) | with witness cache |
312312+|----------|-------------------|-------------------|
313313+| add new parser (offprint) | re-crawl network | replay local |
314314+| leaflet adds textContent | wait for new records | replay & re-extract |
315315+| fix parsing bug | re-crawl affected | replay & re-derive |
316316+| change embedding model | re-fetch content | replay local |
317317+| add new index/table | backfill from network | replay locally |
318318+319319+### trade-offs
320320+321321+**storage cost:**
322322+- ~3500 docs × ~10KB avg = ~35MB (not huge)
323323+- turso free tier: 9GB, so plenty of room
324324+325325+**complexity:**
326326+- two-phase: store raw, then derive
327327+- vs current one-phase: derive immediately
328328+329329+**benefits for standard-search:**
330330+- could add offprint/pckt parsers and replay existing data
331331+- when leaflet migrates to standard.site, re-derive without network
332332+- embedding backfill becomes local-only (no voyage API for content fetch)
333333+334334+### implementation options
335335+336336+1. **add `raw_record TEXT` column to existing tables**
337337+ - simple, backwards compatible
338338+ - can migrate incrementally
339339+340340+2. **separate `witness_cache` table**
341341+ - `(uri PRIMARY KEY, collection, raw_record, witness_time)`
342342+ - cleaner separation of concerns
343343+ - documents/publications tables become derived views
344344+345345+3. **use duckdb/clickhouse for witness cache** (paul's suggestion)
346346+ - better compression for JSON blobs
347347+ - good for analytics queries
348348+ - adds operational complexity
349349+350350+for our scale, option 1 or 2 with turso is probably fine.
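to make option 2 concrete, here's a rough sketch of the table plus a replay query (table name, column names, and json paths are illustrative only — nothing here is implemented):

```sql
-- witness cache: raw records plus when we first saw them
CREATE TABLE IF NOT EXISTS witness_cache (
    uri          TEXT PRIMARY KEY,
    collection   TEXT NOT NULL,
    raw_record   TEXT NOT NULL,  -- full JSON blob as received from tap
    witness_time TEXT NOT NULL   -- when we indexed it
);

-- replay: re-derive searchable fields locally, e.g. for site.standard.document
-- records that carry a pre-flattened textContent field
INSERT OR REPLACE INTO documents (uri, title, content, source_collection)
SELECT uri,
       json_extract(raw_record, '$.title'),
       json_extract(raw_record, '$.textContent'),
       collection
FROM witness_cache
WHERE collection = 'site.standard.document';
```

with something like this, backfilling a new index or re-running a fixed parser becomes a local query instead of a network crawl.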
+124
docs/search-architecture.md
···11+# search architecture
22+33+current state, rationale, and future options.
44+55+## current: SQLite FTS5
66+77+we use SQLite's built-in full-text search (FTS5) via Turso.
88+99+### why FTS5 works for now
1010+1111+- **scale**: ~3500 documents. FTS5 handles this trivially.
1212+- **latency**: 10-50ms for search queries. fine for our use case.
1313+- **cost**: $0. included with Turso free tier.
1414+- **ops**: zero. no separate service to run.
1515+- **simplicity**: one database for everything (docs, FTS, vectors, cache).
1616+1717+### how it works
1818+1919+```
2020+user query: "crypto-casino"
2121+ ↓
2222+buildFtsQuery(): "crypto OR casino*"
2323+ ↓
2424+FTS5 MATCH query with BM25 + recency decay
2525+ ↓
2626+results with snippet()
2727+```
2828+2929+key decisions:
3030+- **OR between terms** for better recall (deliberate, see commit 35ad4b5)
3131+- **prefix match on last word** for type-ahead feel
3232+- **unicode61 tokenizer** splits on non-alphanumeric (we match this in buildFtsQuery)
3333+- **recency decay** boosts recent docs: `ORDER BY rank + (days_old / 30)`
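as a concrete illustration of how those decisions combine, here's a hand-written sketch of the kind of query search.zig issues (the join and column names are assumed for illustration; the real queries may differ):

```sql
-- sketch only: assumes documents_fts(title, content) mirrors documents by rowid
SELECT d.uri,
       d.title,
       snippet(documents_fts, 1, '<b>', '</b>', '…', 12) AS snip
FROM documents_fts
JOIN documents d ON d.rowid = documents_fts.rowid
WHERE documents_fts MATCH 'crypto OR casino*'  -- output of buildFtsQuery()
ORDER BY rank + ((julianday('now') - julianday(d.created_at)) / 30.0)  -- BM25 + recency decay
LIMIT 40
```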
3434+3535+### what's coupled to FTS5
3636+3737+all in `backend/src/search.zig`:
3838+3939+| component | FTS5-specific |
4040+|-----------|---------------|
4141+| 10 query definitions | `MATCH`, `snippet()`, `ORDER BY rank` |
4242+| `buildFtsQuery()` | constructs FTS5 syntax |
4343+| schema | `documents_fts`, `publications_fts` virtual tables |
4444+4545+### what's already decoupled
4646+4747+- result types (`SearchResultJson`, `Doc`, `Pub`)
4848+- similarity search (uses `vector_distance_cos`, not FTS5)
4949+- caching logic
5050+- HTTP layer (server.zig just calls `search()`)
5151+5252+### known limitations
5353+5454+- **no typo tolerance**: "leafet" won't find "leaflet"
5555+- **no relevance tuning**: can't boost title vs content
5656+- **single writer**: SQLite write lock
5757+- **no horizontal scaling**: single database
5858+5959+these aren't problems at current scale.
6060+6161+## future: if we need to scale
6262+6363+### when to consider switching
6464+6565+- search latency consistently >100ms
6666+- write contention from indexing
6767+- need typo tolerance or better relevance
6868+- millions of documents
6969+7070+### recommended: Elasticsearch
7171+7272+Elasticsearch is the battle-tested choice for production search:
7373+7474+- proven at massive scale (Wikipedia, GitHub, Stack Overflow)
7575+- rich query DSL, analyzers, aggregations
7676+- typo tolerance via fuzzy matching
7777+- horizontal scaling built-in
7878+- extensive tooling and community
7979+8080+trade-offs:
8181+- operational complexity (JVM, cluster management)
8282+- resource hungry (~2GB+ RAM minimum)
8383+- cost: $50-500/month depending on scale
8484+8585+### alternatives considered
8686+8787+**Meilisearch/Typesense**: simpler, lighter, great defaults. good for straightforward search but less proven at scale. would work fine for this use case but Elasticsearch has more headroom.
8888+8989+**Algolia**: fully managed, excellent but expensive. makes sense if you want zero ops.
9090+9191+**PostgreSQL full-text**: if already on Postgres. not as good as FTS5 or Elasticsearch but one less system.
9292+9393+### migration path
9494+9595+1. keep Turso as source of truth
9696+2. add Elasticsearch as search index
9797+3. sync documents to ES on write (async)
9898+4. point `/search` at Elasticsearch
9999+5. keep `/similar` on Turso (vector search)
100100+101101+the `search()` function would change from SQL queries to ES client calls. result types stay the same. HTTP layer unchanged.
102102+103103+estimated effort: 1-2 days to swap search backend.
104104+105105+### vector search scaling
106106+107107+similarity search currently uses brute-force `vector_distance_cos` with caching. at scale:
108108+109109+- **Elasticsearch**: has vector search (dense_vector + kNN)
110110+- **dedicated vector DB**: Qdrant, Pinecone, Weaviate
111111+- **pgvector**: if on Postgres
112112+113113+could consolidate text + vector in Elasticsearch, or keep them separate.
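for reference, the current brute-force approach corresponds to a query shaped roughly like this (assuming an `embedding` column on documents; the real query and caching live in search.zig):

```sql
-- sketch of the brute-force /similar lookup using vector_distance_cos
SELECT d.uri,
       d.title,
       vector_distance_cos(d.embedding, src.embedding) AS distance
FROM documents d,
     (SELECT embedding FROM documents WHERE uri = '<SOURCE_URI>') AS src
WHERE d.uri != '<SOURCE_URI>'
  AND d.embedding IS NOT NULL
ORDER BY distance ASC
LIMIT 10
```

this scans every embedding per request, which is why the similarity cache matters; any of the options above replaces the scan with a proper index.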
114114+115115+## summary
116116+117117+| scale | recommendation |
118118+|-------|----------------|
119119+| <10k docs | keep FTS5 (current) |
120120+| 10k-100k docs | still probably fine, monitor latency |
121121+| 100k+ docs | consider Elasticsearch |
122122+| millions + sub-ms latency | Elasticsearch cluster + caching layer |
123123+124124+we're in the "keep FTS5" zone. the code is structured to swap later if needed.
-343
docs/standard-search-planning.md
···11-# standard-search planning
22-33-expanding leaflet-search to index all standard.site records.
44-55-## references
66-77-- [standard.site](https://standard.site/) - shared lexicons for long-form publishing on ATProto
88-- [leaflet.pub](https://leaflet.pub/) - implements `pub.leaflet.*` lexicons
99-- [pckt.blog](https://pckt.blog/) - implements `blog.pckt.*` lexicons
1010-- [offprint.app](https://offprint.app/) - implements `app.offprint.*` lexicons (early beta)
1111-- [ATProto docs](https://atproto.com/docs) - protocol documentation
1212-1313-## context
1414-1515-discussion with pckt.blog team about building global search for standard.site ecosystem.
1616-current leaflet-search is tightly coupled to `pub.leaflet.*` lexicons.
1717-1818-### recent work (2026-01-05)
1919-2020-added similarity cache to improve `/similar` endpoint performance:
2121-- `similarity_cache` table stores computed results keyed by `(source_uri, doc_count)`
2222-- cache auto-invalidates when document count changes
2323-- `/stats` endpoint now shows `cache_hits` and `cache_misses`
2424-- first request ~3s (cold), cached requests ~0.15s
2525-2626-also added loading indicator for "related to" results in frontend.
2727-2828-### recent work (2026-01-06)
2929-3030-- merged PR1: multi-platform schema (platform + source_collection columns)
3131-- added `loading.js` - portable loading state handler for dashboards
3232- - skeleton shimmer while loading
3333- - "waking up" toast after 2s threshold (fly.io cold start handling)
3434- - designed to be copied to other projects
3535-- fixed pluralization ("1 result" vs "2 results")
3636-3737-## what we know
3838-3939-### standard.site lexicons
4040-4141-two shared lexicons for long-form publishing on ATProto:
4242-- `site.standard.document` - document content and metadata
4343-- `site.standard.publication` - publication/blog metadata
4444-4545-implementing platforms:
4646-- leaflet.pub (`pub.leaflet.*`)
4747-- pckt.blog (`blog.pckt.*`)
4848-- offprint.app (`app.offprint.*`)
4949-5050-### site.standard.document schema
5151-5252-examined real records from pckt.blog. key fields:
5353-5454-```
5555-textContent - PRE-FLATTENED TEXT FOR SEARCH (the holy grail)
5656-content - platform-specific block structure
5757- .$type - identifies platform (e.g., "blog.pckt.content")
5858-title - document title
5959-tags - array of strings
6060-site - AT-URI reference to site.standard.publication
6161-path - URL path (e.g., "/my-post-abc123")
6262-publishedAt - ISO timestamp
6363-updatedAt - ISO timestamp
6464-coverImage - blob reference
6565-```
6666-6767-### the textContent field
6868-6969-this is huge. platforms flatten their block content into a single text field:
7070-7171-```json
7272-{
7373- "content": {
7474- "$type": "blog.pckt.content",
7575- "items": [ /* platform-specific blocks */ ]
7676- },
7777- "textContent": "i have been writing a lot of atproto things in zig!..."
7878-}
7979-```
8080-8181-no need to parse platform-specific blocks - just index `textContent` directly.
8282-8383-### platform detection
8484-8585-derive platform from `content.$type` prefix:
8686-- `blog.pckt.content` → pckt
8787-- `pub.leaflet.content` → leaflet (TBD - need to verify)
8888-- `app.offprint.content` → offprint (TBD - need to verify)
8989-9090-### current leaflet-search architecture
9191-9292-```
9393-ATProto firehose (via tap)
9494- ↓
9595-tap.zig - subscribes to pub.leaflet.document/publication
9696- ↓
9797-indexer.zig - extracts content from nested pages[].blocks[] structure
9898- ↓
9999-turso (sqlite) - documents table + FTS5 + embeddings
100100- ↓
101101-search.zig - FTS5 queries + vector similarity
102102- ↓
103103-server.zig - HTTP API (/search, /similar, /stats)
104104-```
105105-106106-leaflet-specific code:
107107-- tap.zig lines 10-11: hardcoded collection names
108108-- tap.zig lines 234-268: block type extraction (pub.leaflet.blocks.*)
109109-- recursive page/block traversal logic
110110-111111-generalizable code:
112112-- database schema (FTS5, tags, stats, similarity cache)
113113-- search/similar logic
114114-- HTTP API
115115-- embedding pipeline
116116-117117-## proposed architecture for standard-search
118118-119119-### ingestion changes
120120-121121-subscribe to:
122122-- `site.standard.document`
123123-- `site.standard.publication`
124124-125125-optionally also subscribe to platform-specific collections for richer data:
126126-- `pub.leaflet.document/publication`
127127-- `blog.pckt.document/publication` (if they have these)
128128-- `app.offprint.document/publication` (if they have these)
129129-130130-### content extraction
131131-132132-for `site.standard.document`:
133133-1. use `textContent` field directly - no block parsing!
134134-2. fall back to title + description if textContent missing
135135-136136-for platform-specific records (if needed):
137137-- keep existing leaflet block parser
138138-- add parsers for other platforms as needed
139139-140140-### database changes
141141-142142-add to documents table:
143143-- `platform` TEXT - derived from content.$type (leaflet, pckt, offprint)
144144-- `source_collection` TEXT - the actual lexicon (site.standard.document, pub.leaflet.document)
145145-- `standard_uri` TEXT - if platform-specific record, link to corresponding site.standard.document
146146-147147-### API changes
148148-149149-- `/search?q=...&platform=leaflet` - optional platform filter
150150-- results include `platform` field
151151-- `/similar` works across all platforms
152152-153153-### naming/deployment
154154-155155-options:
156156-1. rename leaflet-search → standard-search (breaking change)
157157-2. new repo/deployment, keep leaflet-search as-is
158158-3. branch and generalize, decide naming later
159159-160160-leaning toward option 3 for now.
161161-162162-## findings from exploration
163163-164164-### pckt.blog - READY
165165-- writes `site.standard.document` records
166166-- has `textContent` field (pre-flattened)
167167-- `content.$type` = `blog.pckt.content`
168168-- 6+ records found on pckt.blog service account
169169-170170-### leaflet.pub - NOT YET MIGRATED
171171-- still using `pub.leaflet.document` only
172172-- no `site.standard.document` records found
173173-- no `textContent` field - content is in nested `pages[].blocks[]`
174174-- will need to continue parsing blocks OR wait for migration
175175-176176-### offprint.app - LIKELY EARLY BETA
177177-- no `site.standard.document` records found on offprint.app account
178178-- no `app.offprint.document` collection visible
179179-- website shows no example users/content
180180-- probably in early/private beta - no public records yet
181181-182182-### implication for architecture
183183-184184-two paths:
185185-186186-**path A: wait for leaflet migration**
187187-- simpler: just index `site.standard.document` with `textContent`
188188-- all platforms converge on same schema
189189-- downside: loses existing leaflet search until they migrate
190190-191191-**path B: hybrid approach**
192192-- index `site.standard.document` (pckt, future leaflet, offprint)
193193-- ALSO index `pub.leaflet.document` with existing block parser
194194-- dedupe by URI or store both with `source_collection` indicator
195195-- more complex but maintains backwards compat
196196-197197-leaning toward **path B** - can't lose 3500 leaflet docs.
198198-199199-## open questions
200200-201201-- [x] does leaflet write site.standard.document records? **NO, not yet**
202202-- [x] does offprint write site.standard.document records? **UNKNOWN - no public content yet**
203203-- [ ] when will leaflet migrate to standard.site?
204204-- [ ] should we dedupe platform-specific vs standard records?
205205-- [ ] embeddings: regenerate for all, or use same model?
206206-207207-## implementation plan (PRs)
208208-209209-breaking work into reviewable chunks:
210210-211211-### PR1: database schema for multi-platform ✅ MERGED
212212-- add `platform TEXT` column to documents (default 'leaflet')
213213-- add `source_collection TEXT` column (default 'pub.leaflet.document')
214214-- backfill existing ~3500 records
215215-- no behavior change, just schema prep
216216-- https://github.com/zzstoatzz/leaflet-search/pull/1
217217-218218-### PR2: generalized content extraction
219219-- new `extractor.zig` module with platform-agnostic interface
220220-- `textContent` extraction for standard.site records
221221-- keep existing block parser for `pub.leaflet.*`
222222-- platform detection from `content.$type`
223223-224224-### PR3: TAP subscriber for site.standard.document
225225-- subscribe to `site.standard.document` + `site.standard.publication`
226226-- route to appropriate extractor
227227-- starts ingesting pckt.blog content
228228-229229-### PR4: API platform filter
230230-- add `?platform=` query param to `/search`
231231-- include `platform` field in results
232232-- frontend: show platform badge, optional filter
233233-234234-### PR5 (optional, separate track): witness cache
235235-- `witness_cache` table for raw records
236236-- replay tooling for backfills
237237-- independent of above work
238238-239239-## operational notes
240240-241241-- **cloudflare pages**: `leaflet-search` does NOT auto-deploy from git. manual deploy required:
242242- ```bash
243243- wrangler pages deploy site --project-name leaflet-search
244244- ```
245245-- **fly.io backend**: deploy from backend directory:
246246- ```bash
247247- cd backend && fly deploy
248248- ```
249249-- **git remotes**: push to both `origin` (tangled.sh) and `github` (for MCP + PRs)
250250-251251-## next steps
252252-253253-1. ~~verify leaflet's site.standard.document structure~~ (done - they don't have any)
254254-2. ~~find and examine offprint records~~ (done - no public content yet)
255255-3. ~~PR1: database schema~~ (merged)
256256-4. PR2: generalized content extraction
257257-5. PR3: TAP subscriber
258258-6. PR4: API platform filter
259259-7. consider witness cache architecture (see below)
260260-261261----
262262-263263-## architectural consideration: witness cache
264264-265265-[paul frazee's post on witness caches](https://bsky.app/profile/pfrazee.com/post/3lfarplxvcs2e) (2026-01-05):
266266-267267-> I'm increasingly convinced that many Atmosphere backends start with a local "witness cache" of the repositories.
268268->
269269-> A witness cache is a copy of the repository records, plus a timestamp of when the record was indexed (the "witness time") which you want to keep
270270->
271271-> The key feature is: you can replay it
272272-273273-> With local replay, you can add new tables or indexes to your backend and quickly backfill the data. If you don't have a witness cache, you would have to do backfill from the network, which is slow
274274-275275-### current leaflet-search architecture (no witness cache)
276276-277277-```
278278-Firehose → TAP → Parse & Transform → Store DERIVED data → Discard raw record
279279-```
280280-281281-we store:
282282-- `uri`, `did`, `rkey`
283283-- `title` (extracted)
284284-- `content` (flattened from blocks)
285285-- `created_at`, `publication_uri`
286286-287287-we discard: the raw record JSON
288288-289289-### witness cache architecture
290290-291291-```
292292-Firehose → Store RAW record + witness_time → Derive indexes on demand (replayable)
293293-```
294294-295295-would store:
296296-- `uri`, `collection`, `rkey`
297297-- `raw_record` (full JSON blob)
298298-- `witness_time` (when we indexed it)
299299-300300-then derive FTS, embeddings, etc. from local data via replay.
301301-302302-### comparison
303303-304304-| scenario | current (no cache) | with witness cache |
305305-|----------|-------------------|-------------------|
306306-| add new parser (offprint) | re-crawl network | replay local |
307307-| leaflet adds textContent | wait for new records | replay & re-extract |
308308-| fix parsing bug | re-crawl affected | replay & re-derive |
309309-| change embedding model | re-fetch content | replay local |
310310-| add new index/table | backfill from network | replay locally |
311311-312312-### trade-offs
313313-314314-**storage cost:**
315315-- ~3500 docs × ~10KB avg = ~35MB (not huge)
316316-- turso free tier: 9GB, so plenty of room
317317-318318-**complexity:**
319319-- two-phase: store raw, then derive
320320-- vs current one-phase: derive immediately
321321-322322-**benefits for standard-search:**
323323-- could add offprint/pckt parsers and replay existing data
324324-- when leaflet migrates to standard.site, re-derive without network
325325-- embedding backfill becomes local-only (no voyage API for content fetch)
326326-327327-### implementation options
328328-329329-1. **add `raw_record TEXT` column to existing tables**
330330- - simple, backwards compatible
331331- - can migrate incrementally
332332-333333-2. **separate `witness_cache` table**
334334- - `(uri PRIMARY KEY, collection, raw_record, witness_time)`
335335- - cleaner separation of concerns
336336- - documents/publications tables become derived views
337337-338338-3. **use duckdb/clickhouse for witness cache** (paul's suggestion)
339339- - better compression for JSON blobs
340340- - good for analytics queries
341341- - adds operational complexity
342342-343343-for our scale, option 1 or 2 with turso is probably fine.
+215
docs/tap.md
···11+# tap (firehose sync)
22+33+leaflet-search uses [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.
44+55+## what is tap?
66+77+tap subscribes to the ATProto firehose, filters for specific collections (e.g., `site.standard.document`), and broadcasts matching events to websocket clients. it also does initial crawling/backfilling of existing records.
88+99+key behavior: **tap backfills historical data when repos are added**. when a repo is added to tracking:
1010+1. tap fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
1111+2. live firehose events during backfill are buffered in memory
1212+3. historical events (marked `live: false`) are delivered first
1313+4. after historical events complete, buffered live events are released
1414+5. subsequent firehose events arrive immediately marked as `live: true`
1515+1616+tap enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.
1717+1818+## message format
1919+2020+tap sends JSON messages over websocket. record events look like:
2121+2222+```json
2323+{
2424+ "type": "record",
2525+ "record": {
2626+ "live": true,
2727+ "did": "did:plc:abc123...",
2828+ "rev": "3mbspmpaidl2a",
2929+ "collection": "site.standard.document",
3030+ "rkey": "3lzyrj6q6gs27",
3131+ "action": "create",
3232+ "record": { ... },
3333+ "cid": "bafyrei..."
3434+ }
3535+}
3636+```
3737+3838+### field types (important!)
3939+4040+| field | type | values | notes |
4141+|-------|------|--------|-------|
4242+| type | string | "record", "identity", "account" | message type |
4343+| action | **string** | "create", "update", "delete" | NOT an enum! |
4444+| live | bool | true/false | true = firehose, false = resync |
4545+| collection | string | e.g., "site.standard.document" | lexicon collection |
4646+4747+## gotchas
4848+4949+1. **action is a string, not an enum** - tap sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.
5050+5151+2. **collection filters apply during processing** - `TAP_COLLECTION_FILTERS` controls which records tap processes and sends to clients, both during live commits and resync CAR walks. records from other collections are skipped entirely.
5252+5353+3. **signal collection vs collection filters** - `TAP_SIGNAL_COLLECTION` controls auto-discovery of repos (which repos to track), while `TAP_COLLECTION_FILTERS` controls which records from those repos to output. a repo must either be auto-discovered via signal collection OR manually added via `/repos/add`.
5454+5555+4. **silent extraction failures** - if using zat's `extractAt`, enable debug logging to see why parsing fails:
5656+ ```zig
5757+ pub const std_options = .{
5858+ .log_scope_levels = &.{.{ .scope = .zat, .level = .debug }},
5959+ };
6060+ ```
6161+ this will show messages like:
6262+ ```
6363+ debug(zat): extractAt: parse failed for Op at path { "op" }: InvalidEnumTag
6464+ ```
6565+6666+## memory and performance tuning
6767+6868+tap loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.
6969+7070+### recommended settings for leaflet-search
7171+7272+```toml
7373+[[vm]]
7474+ memory = '2gb' # 1gb is not enough
7575+7676+[env]
7777+ TAP_RESYNC_PARALLELISM = '1' # only one repo CAR in memory at a time (default: 5)
7878+ TAP_FIREHOSE_PARALLELISM = '5' # concurrent event processors (default: 10)
7979+ TAP_OUTBOX_CAPACITY = '10000' # event buffer size (default: 100000)
8080+ TAP_IDENT_CACHE_SIZE = '10000' # identity cache entries (default: 2000000)
8181+```
8282+8383+### why these values?
8484+8585+- **2GB memory**: 1GB causes OOM kills when resyncing large repos
8686+- **resync parallelism 1**: prevents multiple large CARs in memory simultaneously
8787+- **lower firehose/outbox**: we track ~1000 repos, not millions - defaults are overkill
8888+- **smaller ident cache**: we don't need 2M cached identities
8989+9090+if tap keeps OOM'ing, check logs for large repo resyncs:
9191+```bash
9292+fly logs -a leaflet-search-tap | grep "parsing repo CAR" | grep -E "size\":[0-9]{8,}"
9393+```
9494+9595+## quick status check
9696+9797+from the `tap/` directory:
9898+```bash
9999+just check
100100+```
101101+102102+shows tap machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.
103103+104104+example output:
105105+```
106106+=== tap status ===
107107+app 781417db604d48 23 ewr started ...
108108+109109+=== Recent Indexing Activity ===
110110+Last indexed: 2026-01-08 (14 docs)
111111+Today: 2026-01-11
112112+Docs: 3742 | Pubs: 1231
113113+114114+=== Timeline (last 7 days) ===
115115+2026-01-08: 14 docs
116116+2026-01-07: 29 docs
117117+...
118118+```
119119+120120+if "Last indexed" is more than a day behind "Today", tap may be down or catching up.
121121+122122+## checking catch-up progress
123123+124124+when tap restarts after downtime, it replays the firehose from its saved cursor. to check progress:
125125+126126+```bash
127127+# see current firehose position (look for timestamps in log messages)
128128+fly logs -a leaflet-search-tap | grep -E '"time".*"seq"' | tail -3
129129+```
130130+131131+the `"time"` field in log messages shows how far behind tap is. compare to current time to estimate catch-up.
132132+133133+catch-up speed varies:
134134+- **~0.3x** when resync queue is full (large repos being fetched)
135135+- **~1x or faster** once resyncs clear
136136+137137+## debugging
138138+139139+### check tap connection
140140+```bash
141141+fly logs -a leaflet-search-tap --no-tail | tail -30
142142+```
143143+144144+look for:
145145+- `"connected to firehose"` - successfully connected to bsky relay
146146+- `"websocket connected"` - backend connected to tap
147147+- `"dialing failed"` / `"i/o timeout"` - network issues
148148+149149+### check backend is receiving
150150+```bash
151151+fly logs -a leaflet-search-backend --no-tail | grep -E "(tap|indexed)"
152152+```
153153+154154+look for:
155155+- `tap connected!` - connected to tap
156156+- `tap: msg_type=record` - receiving messages
157157+- `indexed document:` - successfully processing
158158+159159+### common issues
160160+161161+| symptom | cause | fix |
162162+|---------|-------|-----|
163163+| tap machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
164164+| `websocket handshake failed: error.Timeout` | tap not running or network issue | restart tap, check regions match |
165165+| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart tap, transient network issue |
166166+| messages received but not indexed | extraction failing (type mismatch) | enable zat debug logging, check field types |
167167+| repo shows `records: 0` after adding | resync failed or collection not in filters | check tap logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
168168+| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart tap |
169169+| indexing stopped, tap shows "started" | tap catching up from downtime | check firehose position in logs, wait for catch-up |
170170+171171+## tap API endpoints
172172+173173+tap exposes HTTP endpoints for monitoring and control:
174174+175175+| endpoint | description |
176176+|----------|-------------|
177177+| `/health` | health check |
178178+| `/stats/repo-count` | number of tracked repos |
179179+| `/stats/record-count` | total records processed |
180180+| `/stats/outbox-buffer` | events waiting to be sent |
181181+| `/stats/resync-buffer` | buffered commits for repos currently resyncing (NOT the resync queue) |
182182+| `/stats/cursors` | firehose cursor position |
183183+| `/info/:did` | repo status: `{"did":"...","state":"active","records":N}` |
184184+| `/repos/add` | POST with `{"dids":["did:plc:..."]}` to add repos |
185185+| `/repos/remove` | POST with `{"dids":["did:plc:..."]}` to remove repos |
186186+187187+example: check repo status
188188+```bash
189189+fly ssh console -a leaflet-search-tap -C "curl -s localhost:2480/info/did:plc:abc123"
190190+```
191191+192192+example: manually add a repo for backfill
193193+```bash
194194+fly ssh console -a leaflet-search-tap -C 'curl -X POST -H "Content-Type: application/json" -d "{\"dids\":[\"did:plc:abc123\"]}" localhost:2480/repos/add'
195195+```
196196+197197+## fly.io deployment
198198+199199+both tap and backend should be in the same region for internal networking:
200200+201201+```bash
202202+# check current regions
203203+fly status -a leaflet-search-tap
204204+fly status -a leaflet-search-backend
205205+206206+# restart tap if needed
207207+fly machine restart -a leaflet-search-tap <machine-id>
208208+```
209209+210210+note: changing `primary_region` in fly.toml only affects new machines. to move existing machines, clone to new region and destroy old one.
211211+212212+## references
213213+214214+- [tap source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
215215+- [ATProto firehose docs](https://atproto.com/specs/sync#firehose)
+5-5
mcp/README.md
···11-# leaflet-mcp
11+# pub search MCP
2233-MCP server for [Leaflet](https://leaflet.pub) - search decentralized publications on ATProto.
33+MCP server for [pub search](https://pub-search.waow.tech) - search ATProto publishing platforms (Leaflet, pckt, standard.site).
4455## usage
6677### hosted (recommended)
8899```bash
1010-claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}'
1010+claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
1111```
12121313### local
···1515run the MCP server locally with `uvx`:
16161717```bash
1818-uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp leaflet-mcp
1818+uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp pub-search
1919```
20202121to add it to claude code as a local stdio server:
22222323```bash
2424-claude mcp add leaflet -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' leaflet-mcp
2424+claude mcp add pub-search -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' pub-search
2525```
26262727## workflow
···11+#!/usr/bin/env python3
22+"""Test the pub-search MCP server."""
33+44+import asyncio
55+import sys
66+77+from fastmcp import Client
88+from fastmcp.client.transports import FastMCPTransport
99+1010+from pub_search.server import mcp
1111+1212+1313+async def main():
1414+ # use local transport for testing, or live URL if --live flag
1515+ if "--live" in sys.argv:
1616+ print("testing against live Horizon server...")
1717+ client = Client("https://pub-search-by-zzstoatzz.fastmcp.app/mcp")
1818+ else:
1919+ print("testing locally with FastMCPTransport...")
2020+ client = Client(transport=FastMCPTransport(mcp))
2121+2222+ async with client:
2323+ # list tools
2424+ print("=== tools ===")
2525+ tools = await client.list_tools()
2626+ for t in tools:
2727+ print(f" {t.name}")
2828+2929+ # test search with new platform filter
3030+ print("\n=== search(query='zig', platform='leaflet', limit=3) ===")
3131+ result = await client.call_tool(
3232+ "search", {"query": "zig", "platform": "leaflet", "limit": 3}
3333+ )
3434+ for item in result.content:
3535+ print(f" {item.text[:200]}...")
3636+3737+ # test search with since filter
3838+ print("\n=== search(query='python', since='2025-01-01', limit=2) ===")
3939+ result = await client.call_tool(
4040+ "search", {"query": "python", "since": "2025-01-01", "limit": 2}
4141+ )
4242+ for item in result.content:
4343+ print(f" {item.text[:200]}...")
4444+4545+ # test get_tags
4646+ print("\n=== get_tags() ===")
4747+ result = await client.call_tool("get_tags", {})
4848+ for item in result.content:
4949+ print(f" {item.text[:150]}...")
5050+5151+ # test get_stats
5252+ print("\n=== get_stats() ===")
5353+ result = await client.call_tool("get_stats", {})
5454+ for item in result.content:
5555+ print(f" {item.text}")
5656+5757+ # test get_popular
5858+ print("\n=== get_popular(limit=3) ===")
5959+ result = await client.call_tool("get_popular", {"limit": 3})
6060+ for item in result.content:
6161+ print(f" {item.text[:100]}...")
6262+6363+ print("\n=== all tests passed ===")
6464+6565+6666+if __name__ == "__main__":
6767+ asyncio.run(main())
-5
mcp/src/leaflet_mcp/__init__.py
···11-"""Leaflet MCP server - search decentralized publications on ATProto."""
22-33-from leaflet_mcp.server import main, mcp
44-55-__all__ = ["main", "mcp"]
-58
mcp/src/leaflet_mcp/_types.py
···11-"""Type definitions for Leaflet MCP responses."""
22-33-from typing import Literal
44-55-from pydantic import BaseModel, computed_field
66-77-88-class SearchResult(BaseModel):
99- """A search result from the Leaflet API."""
1010-1111- type: Literal["article", "looseleaf", "publication"]
1212- uri: str
1313- did: str
1414- title: str
1515- snippet: str
1616- createdAt: str = ""
1717- rkey: str
1818- basePath: str = ""
1919-2020- @computed_field
2121- @property
2222- def url(self) -> str:
2323- """web URL for this document."""
2424- if self.basePath:
2525- return f"https://{self.basePath}/{self.rkey}"
2626- return ""
2727-2828-2929-class Tag(BaseModel):
3030- """A tag with document count."""
3131-3232- tag: str
3333- count: int
3434-3535-3636-class PopularSearch(BaseModel):
3737- """A popular search query with count."""
3838-3939- query: str
4040- count: int
4141-4242-4343-class Stats(BaseModel):
4444- """Leaflet index statistics."""
4545-4646- documents: int
4747- publications: int
4848-4949-5050-class Document(BaseModel):
5151- """Full document content from ATProto."""
5252-5353- uri: str
5454- title: str
5555- content: str
5656- createdAt: str = ""
5757- tags: list[str] = []
5858- publicationUri: str = ""
-21
mcp/src/leaflet_mcp/client.py
···11-"""HTTP client for Leaflet search API."""
22-33-import os
44-from contextlib import asynccontextmanager
55-from typing import AsyncIterator
66-77-import httpx
88-99-# configurable via env var, defaults to production
1010-LEAFLET_API_URL = os.getenv("LEAFLET_API_URL", "https://leaflet-search-backend.fly.dev")
1111-1212-1313-@asynccontextmanager
1414-async def get_http_client() -> AsyncIterator[httpx.AsyncClient]:
1515- """Get an async HTTP client for Leaflet API requests."""
1616- async with httpx.AsyncClient(
1717- base_url=LEAFLET_API_URL,
1818- timeout=30.0,
1919- headers={"Accept": "application/json"},
2020- ) as client:
2121- yield client
-289
mcp/src/leaflet_mcp/server.py
···11-"""Leaflet MCP server implementation using fastmcp."""
22-33-from __future__ import annotations
44-55-from typing import Any
66-77-from fastmcp import FastMCP
88-99-from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag
1010-from leaflet_mcp.client import get_http_client
1111-1212-mcp = FastMCP("leaflet")
1313-1414-1515-# -----------------------------------------------------------------------------
1616-# prompts
1717-# -----------------------------------------------------------------------------
1818-1919-2020-@mcp.prompt("usage_guide")
2121-def usage_guide() -> str:
2222- """instructions for using leaflet MCP tools."""
2323- return """\
2424-# Leaflet MCP server usage guide
2525-2626-Leaflet is a decentralized publishing platform on ATProto (the protocol behind Bluesky).
2727-This MCP server provides search and discovery tools for Leaflet publications.
2828-2929-## core tools
3030-3131-- `search(query, tag)` - search documents and publications by text or tag
3232-- `get_document(uri)` - get the full content of a document by its AT-URI
3333-- `find_similar(uri)` - find documents similar to a given document
3434-- `get_tags()` - list all available tags with document counts
3535-- `get_stats()` - get index statistics (document/publication counts)
3636-- `get_popular()` - see popular search queries
3737-3838-## workflow for research
3939-4040-1. use `search("your topic")` to find relevant documents
4141-2. use `get_document(uri)` to retrieve full content of interesting results
4242-3. use `find_similar(uri)` to discover related content
4343-4444-## result types
4545-4646-search returns three types of results:
4747-- **publication**: a collection of articles (like a blog or magazine)
4848-- **article**: a document that belongs to a publication
4949-- **looseleaf**: a standalone document not part of a publication
5050-5151-## AT-URIs
5252-5353-documents are identified by AT-URIs like:
5454- `at://did:plc:abc123/pub.leaflet.document/xyz789`
5555-5656-you can also browse documents on the web at leaflet.pub
5757-"""
5858-5959-6060-@mcp.prompt("search_tips")
6161-def search_tips() -> str:
6262- """tips for effective searching."""
6363- return """\
6464-# Leaflet search tips
6565-6666-## text search
6767-- searches both document titles and content
6868-- uses FTS5 full-text search with prefix matching
6969-- the last word gets prefix matching: "cat dog" matches "cat dogs"
7070-7171-## tag filtering
7272-- combine text search with tag filter: `search("python", tag="programming")`
7373-- use `get_tags()` to discover available tags
7474-- tags are only applied to documents, not publications
7575-7676-## finding related content
7777-- after finding an interesting document, use `find_similar(uri)`
7878-- similarity is based on semantic embeddings (voyage-3-lite)
7979-- great for exploring related topics
8080-8181-## browsing by popularity
8282-- use `get_popular()` to see what others are searching for
8383-- can inspire new research directions
8484-"""
8585-8686-8787-# -----------------------------------------------------------------------------
8888-# tools
8989-# -----------------------------------------------------------------------------
9090-9191-9292-@mcp.tool
9393-async def search(
9494- query: str = "",
9595- tag: str | None = None,
9696- limit: int = 5,
9797-) -> list[SearchResult]:
9898- """search leaflet documents and publications.
9999-100100- searches the full text of documents (titles and content) and publications.
101101- results include a snippet showing where the match was found.
102102-103103- args:
104104- query: search query (searches titles and content)
105105- tag: optional tag to filter by (only applies to documents)
106106- limit: max results to return (default 5, max 40)
107107-108108- returns:
109109- list of search results with uri, title, snippet, and metadata
110110- """
111111- if not query and not tag:
112112- return []
113113-114114- params: dict[str, Any] = {}
115115- if query:
116116- params["q"] = query
117117- if tag:
118118- params["tag"] = tag
119119-120120- async with get_http_client() as client:
121121- response = await client.get("/search", params=params)
122122- response.raise_for_status()
123123- results = response.json()
124124-125125- # apply client-side limit since API returns up to 40
126126- return [SearchResult(**r) for r in results[:limit]]
127127-128128-129129-@mcp.tool
130130-async def get_document(uri: str) -> Document:
131131- """get the full content of a document by its AT-URI.
132132-133133- fetches the complete document from ATProto, including full text content.
134134- use this after finding documents via search to get the complete text.
135135-136136- args:
137137- uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...)
138138-139139- returns:
140140- document with full content, title, tags, and metadata
141141- """
142142- # use pdsx to fetch the actual record from ATProto
143143- try:
144144- from pdsx._internal.operations import get_record
145145- from pdsx.mcp.client import get_atproto_client
146146- except ImportError as e:
147147- raise RuntimeError(
148148- "pdsx is required for fetching full documents. install with: uv add pdsx"
149149- ) from e
150150-151151- # extract repo from URI for PDS discovery
152152- # at://did:plc:xxx/collection/rkey
153153- parts = uri.replace("at://", "").split("/")
154154- if len(parts) < 3:
155155- raise ValueError(f"invalid AT-URI: {uri}")
156156-157157- repo = parts[0]
158158-159159- async with get_atproto_client(target_repo=repo) as client:
160160- record = await get_record(client, uri)
161161-162162- value = record.value
163163- # DotDict doesn't have a working .get(), convert to dict first
164164- if hasattr(value, "to_dict") and callable(value.to_dict):
165165- value = value.to_dict()
166166- elif not isinstance(value, dict):
167167- value = dict(value)
168168-169169- # extract content from leaflet's block structure
170170- # pages[].blocks[].block.plaintext
171171- content_parts = []
172172- for page in value.get("pages", []):
173173- for block_wrapper in page.get("blocks", []):
174174- block = block_wrapper.get("block", {})
175175- plaintext = block.get("plaintext", "")
176176- if plaintext:
177177- content_parts.append(plaintext)
178178-179179- content = "\n\n".join(content_parts)
180180-181181- return Document(
182182- uri=record.uri,
183183- title=value.get("title", ""),
184184- content=content,
185185- createdAt=value.get("publishedAt", "") or value.get("createdAt", ""),
186186- tags=value.get("tags", []),
187187- publicationUri=value.get("publication", ""),
188188- )
189189-190190-191191-@mcp.tool
192192-async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]:
193193- """find documents similar to a given document.
194194-195195- uses vector similarity (voyage-3-lite embeddings) to find semantically
196196- related documents. great for discovering related content after finding
197197- an interesting document.
198198-199199- args:
200200- uri: the AT-URI of the document to find similar content for
201201- limit: max similar documents to return (default 5)
202202-203203- returns:
204204- list of similar documents with uri, title, and metadata
205205- """
206206- async with get_http_client() as client:
207207- response = await client.get("/similar", params={"uri": uri})
208208- response.raise_for_status()
209209- results = response.json()
210210-211211- return [SearchResult(**r) for r in results[:limit]]
212212-213213-214214-@mcp.tool
215215-async def get_tags() -> list[Tag]:
216216- """list all available tags with document counts.
217217-218218- returns tags sorted by document count (most popular first).
219219- useful for discovering topics and filtering searches.
220220-221221- returns:
222222- list of tags with their document counts
223223- """
224224- async with get_http_client() as client:
225225- response = await client.get("/tags")
226226- response.raise_for_status()
227227- results = response.json()
228228-229229- return [Tag(**t) for t in results]
230230-231231-232232-@mcp.tool
233233-async def get_stats() -> Stats:
234234- """get leaflet index statistics.
235235-236236- returns:
237237- document and publication counts
238238- """
239239- async with get_http_client() as client:
240240- response = await client.get("/stats")
241241- response.raise_for_status()
242242- return Stats(**response.json())
243243-244244-245245-@mcp.tool
246246-async def get_popular(limit: int = 5) -> list[PopularSearch]:
247247- """get popular search queries.
248248-249249- see what others are searching for on leaflet.
250250- can inspire new research directions.
251251-252252- args:
253253- limit: max queries to return (default 5)
254254-255255- returns:
256256- list of popular queries with search counts
257257- """
258258- async with get_http_client() as client:
259259- response = await client.get("/popular")
260260- response.raise_for_status()
261261- results = response.json()
262262-263263- return [PopularSearch(**p) for p in results[:limit]]
264264-265265-266266-# -----------------------------------------------------------------------------
267267-# resources
268268-# -----------------------------------------------------------------------------
269269-270270-271271-@mcp.resource("leaflet://stats")
272272-async def stats_resource() -> str:
273273- """current leaflet index statistics."""
274274- stats = await get_stats()
275275- return f"Leaflet index: {stats.documents} documents, {stats.publications} publications"
276276-277277-278278-# -----------------------------------------------------------------------------
279279-# entrypoint
280280-# -----------------------------------------------------------------------------
281281-282282-283283-def main() -> None:
284284- """run the MCP server."""
285285- mcp.run()
286286-287287-288288-if __name__ == "__main__":
289289- main()
+5
mcp/src/pub_search/__init__.py
···11+"""MCP server for searching ATProto publishing platforms."""
22+33+from pub_search.server import main, mcp
44+55+__all__ = ["main", "mcp"]
+59
mcp/src/pub_search/_types.py
···11+"""Type definitions for Leaflet MCP responses."""
22+33+from typing import Literal
44+55+from pydantic import BaseModel, computed_field
66+77+88+class SearchResult(BaseModel):
99+ """A search result from the Leaflet API."""
1010+1111+ type: Literal["article", "looseleaf", "publication"]
1212+ uri: str
1313+ did: str
1414+ title: str
1515+ snippet: str
1616+ createdAt: str = ""
1717+ rkey: str
1818+ basePath: str = ""
1919+ platform: Literal["leaflet", "pckt", "offprint", "greengale", "other"] = "leaflet"
2020+2121+ @computed_field
2222+ @property
2323+ def url(self) -> str:
2424+ """web URL for this document."""
2525+ if self.basePath:
2626+ return f"https://{self.basePath}/{self.rkey}"
2727+ return ""
2828+2929+3030+class Tag(BaseModel):
3131+ """A tag with document count."""
3232+3333+ tag: str
3434+ count: int
3535+3636+3737+class PopularSearch(BaseModel):
3838+ """A popular search query with count."""
3939+4040+ query: str
4141+ count: int
4242+4343+4444+class Stats(BaseModel):
4545+ """Leaflet index statistics."""
4646+4747+ documents: int
4848+ publications: int
4949+5050+5151+class Document(BaseModel):
5252+ """Full document content from ATProto."""
5353+5454+ uri: str
5555+ title: str
5656+ content: str
5757+ createdAt: str = ""
5858+ tags: list[str] = []
5959+ publicationUri: str = ""
+21
mcp/src/pub_search/client.py
···11+"""HTTP client for leaflet-search API."""
22+33+import os
44+from contextlib import asynccontextmanager
55+from typing import AsyncIterator
66+77+import httpx
88+99+# configurable via env var, defaults to production
1010+API_URL = os.getenv("LEAFLET_SEARCH_API_URL", "https://leaflet-search-backend.fly.dev")
1111+1212+1313+@asynccontextmanager
1414+async def get_http_client() -> AsyncIterator[httpx.AsyncClient]:
1515+ """Get an async HTTP client for API requests."""
1616+ async with httpx.AsyncClient(
1717+ base_url=API_URL,
1818+ timeout=30.0,
1919+ headers={"Accept": "application/json"},
2020+ ) as client:
2121+ yield client
+276
mcp/src/pub_search/server.py
···11+"""MCP server for searching ATProto publishing platforms."""
22+33+from __future__ import annotations
44+55+from typing import Any, Literal
66+77+from fastmcp import FastMCP
88+99+from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag
1010+from pub_search.client import get_http_client
1111+1212+mcp = FastMCP("pub-search")
1313+1414+1515+# -----------------------------------------------------------------------------
1616+# prompts
1717+# -----------------------------------------------------------------------------
1818+1919+2020+@mcp.prompt("usage_guide")
2121+def usage_guide() -> str:
2222+ """instructions for using pub-search MCP tools."""
2323+ return """\
2424+# pub-search MCP
2525+2626+search ATProto publishing platforms: leaflet, pckt, offprint, greengale.
2727+2828+## tools
2929+3030+- `search(query, tag, platform, since)` - full-text search with filters
3131+- `get_document(uri)` - fetch full content by AT-URI
3232+- `find_similar(uri)` - semantic similarity search
3333+- `get_tags()` - available tags
3434+- `get_stats()` - index statistics
3535+- `get_popular()` - popular queries
3636+3737+## workflow
3838+3939+1. `search("topic")` or `search("topic", platform="leaflet")`
4040+2. `get_document(uri)` for full text
4141+3. `find_similar(uri)` for related content
4242+4343+## result types
4444+4545+- **article**: document in a publication
4646+- **looseleaf**: standalone document
4747+- **publication**: the publication itself
4848+4949+results include a `url` field for web access.
5050+"""
5151+5252+5353+@mcp.prompt("search_tips")
5454+def search_tips() -> str:
5555+ """tips for effective searching."""
5656+ return """\
5757+# search tips
5858+5959+- prefix matching on last word: "cat dog" matches "cat dogs"
6060+- combine filters: `search("python", tag="tutorial", platform="leaflet")`
6161+- use `since="2025-01-01"` for recent content
6262+- `find_similar(uri)` for semantic similarity (voyage-3-lite embeddings)
6363+- `get_tags()` to discover available tags
6464+"""
6565+6666+6767+# -----------------------------------------------------------------------------
6868+# tools
6969+# -----------------------------------------------------------------------------
7070+7171+7272+Platform = Literal["leaflet", "pckt", "offprint", "greengale", "other"]
7373+7474+7575+@mcp.tool
7676+async def search(
7777+ query: str = "",
7878+ tag: str | None = None,
7979+ platform: Platform | None = None,
8080+ since: str | None = None,
8181+ limit: int = 5,
8282+) -> list[SearchResult]:
8383+ """search documents and publications.
8484+8585+ args:
8686+ query: search query (titles and content)
8787+ tag: filter by tag
8888+ platform: filter by platform (leaflet, pckt, offprint, greengale, other)
8989+ since: ISO date - only documents created after this date
9090+ limit: max results (default 5, max 40)
9191+9292+ returns:
9393+ list of results with uri, title, snippet, platform, and web url
9494+ """
9595+ if not query and not tag:
9696+ return []
9797+9898+ params: dict[str, Any] = {}
9999+ if query:
100100+ params["q"] = query
101101+ if tag:
102102+ params["tag"] = tag
103103+ if platform:
104104+ params["platform"] = platform
105105+ if since:
106106+ params["since"] = since
107107+108108+ async with get_http_client() as client:
109109+ response = await client.get("/search", params=params)
110110+ response.raise_for_status()
111111+ results = response.json()
112112+113113+ return [SearchResult(**r) for r in results[:limit]]
114114+115115+116116+@mcp.tool
117117+async def get_document(uri: str) -> Document:
118118+ """get the full content of a document by its AT-URI.
119119+120120+ fetches the complete document from ATProto, including full text content.
121121+ use this after finding documents via search to get the complete text.
122122+123123+ args:
124124+ uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...)
125125+126126+ returns:
127127+ document with full content, title, tags, and metadata
128128+ """
129129+ # use pdsx to fetch the actual record from ATProto
130130+ try:
131131+ from pdsx._internal.operations import get_record
132132+ from pdsx.mcp.client import get_atproto_client
133133+ except ImportError as e:
134134+ raise RuntimeError(
135135+ "pdsx is required for fetching full documents. install with: uv add pdsx"
136136+ ) from e
137137+138138+ # extract repo from URI for PDS discovery
139139+ # at://did:plc:xxx/collection/rkey
140140+ parts = uri.replace("at://", "").split("/")
141141+ if len(parts) < 3:
142142+ raise ValueError(f"invalid AT-URI: {uri}")
143143+144144+ repo = parts[0]
145145+146146+ async with get_atproto_client(target_repo=repo) as client:
147147+ record = await get_record(client, uri)
148148+149149+ value = record.value
150150+ # DotDict doesn't have a working .get(), convert to dict first
151151+ if hasattr(value, "to_dict") and callable(value.to_dict):
152152+ value = value.to_dict()
153153+ elif not isinstance(value, dict):
154154+ value = dict(value)
155155+156156+ # extract content from leaflet's block structure
157157+ # pages[].blocks[].block.plaintext
158158+ content_parts = []
159159+ for page in value.get("pages", []):
160160+ for block_wrapper in page.get("blocks", []):
161161+ block = block_wrapper.get("block", {})
162162+ plaintext = block.get("plaintext", "")
163163+ if plaintext:
164164+ content_parts.append(plaintext)
165165+166166+ content = "\n\n".join(content_parts)
167167+168168+ return Document(
169169+ uri=record.uri,
170170+ title=value.get("title", ""),
171171+ content=content,
172172+ createdAt=value.get("publishedAt", "") or value.get("createdAt", ""),
173173+ tags=value.get("tags", []),
174174+ publicationUri=value.get("publication", ""),
175175+ )
176176+177177+178178+@mcp.tool
179179+async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]:
180180+ """find documents similar to a given document.
181181+182182+ uses vector similarity (voyage-3-lite embeddings) to find semantically
183183+ related documents. great for discovering related content after finding
184184+ an interesting document.
185185+186186+ args:
187187+ uri: the AT-URI of the document to find similar content for
188188+ limit: max similar documents to return (default 5)
189189+190190+ returns:
191191+ list of similar documents with uri, title, and metadata
192192+ """
193193+ async with get_http_client() as client:
194194+ response = await client.get("/similar", params={"uri": uri})
195195+ response.raise_for_status()
196196+ results = response.json()
197197+198198+ return [SearchResult(**r) for r in results[:limit]]
199199+200200+201201+@mcp.tool
202202+async def get_tags() -> list[Tag]:
203203+ """list all available tags with document counts.
204204+205205+ returns tags sorted by document count (most popular first).
206206+ useful for discovering topics and filtering searches.
207207+208208+ returns:
209209+ list of tags with their document counts
210210+ """
211211+ async with get_http_client() as client:
212212+ response = await client.get("/tags")
213213+ response.raise_for_status()
214214+ results = response.json()
215215+216216+ return [Tag(**t) for t in results]
217217+218218+219219+@mcp.tool
220220+async def get_stats() -> Stats:
221221+ """get index statistics.
222222+223223+ returns:
224224+ document and publication counts
225225+ """
226226+ async with get_http_client() as client:
227227+ response = await client.get("/stats")
228228+ response.raise_for_status()
229229+ return Stats(**response.json())
230230+231231+232232+@mcp.tool
233233+async def get_popular(limit: int = 5) -> list[PopularSearch]:
234234+ """get popular search queries.
235235+236236+ see what others are searching for.
237237+ can inspire new research directions.
238238+239239+ args:
240240+ limit: max queries to return (default 5)
241241+242242+ returns:
243243+ list of popular queries with search counts
244244+ """
245245+ async with get_http_client() as client:
246246+ response = await client.get("/popular")
247247+ response.raise_for_status()
248248+ results = response.json()
249249+250250+ return [PopularSearch(**p) for p in results[:limit]]
251251+252252+253253+# -----------------------------------------------------------------------------
254254+# resources
255255+# -----------------------------------------------------------------------------
256256+257257+258258+@mcp.resource("pub-search://stats")
259259+async def stats_resource() -> str:
260260+ """current index statistics."""
261261+ stats = await get_stats()
262262+ return f"pub search index: {stats.documents} documents, {stats.publications} publications"
263263+264264+265265+# -----------------------------------------------------------------------------
266266+# entrypoint
267267+# -----------------------------------------------------------------------------
268268+269269+270270+def main() -> None:
271271+ """run the MCP server."""
272272+ mcp.run()
273273+274274+275275+if __name__ == "__main__":
276276+ main()
+12-9
mcp/tests/test_mcp.py
···11-"""tests for leaflet MCP server."""
11+"""tests for pub-search MCP server."""
2233import pytest
44from mcp.types import TextContent
···66from fastmcp.client import Client
77from fastmcp.client.transports import FastMCPTransport
8899-from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag
1010-from leaflet_mcp.server import mcp
99+from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag
1010+from pub_search.server import mcp
111112121313class TestTypes:
···2323 snippet="this is a test...",
2424 createdAt="2025-01-01T00:00:00Z",
2525 rkey="123",
2626- basePath="/blog",
2626+ basePath="gyst.leaflet.pub",
2727+ platform="leaflet",
2728 )
2829 assert r.type == "article"
2930 assert r.uri == "at://did:plc:abc/pub.leaflet.document/123"
3031 assert r.title == "test article"
3232+ assert r.platform == "leaflet"
3333+ assert r.url == "https://gyst.leaflet.pub/123"
31343235 def test_search_result_looseleaf(self):
3336 """SearchResult supports looseleaf type."""
···93969497 def test_mcp_server_imports(self):
9598 """mcp server can be imported without errors."""
9696- from leaflet_mcp import mcp
9999+ from pub_search import mcp
971009898- assert mcp.name == "leaflet"
101101+ assert mcp.name == "pub-search"
99102100103 def test_exports(self):
101104 """all expected exports are available."""
102102- from leaflet_mcp import main, mcp
105105+ from pub_search import main, mcp
103106104107 assert mcp is not None
105108 assert main is not None
···138141 resources = await client.list_resources()
139142140143 resource_uris = {str(r.uri) for r in resources}
141141- assert "leaflet://stats" in resource_uris
144144+ assert "pub-search://stats" in resource_uris
142145143146 async def test_usage_guide_prompt_content(self, client):
144147 """usage_guide prompt returns helpful content."""
···148151 assert len(result.messages) > 0
149152 content = result.messages[0].content
150153 assert isinstance(content, TextContent)
151151- assert "Leaflet" in content.text
154154+ assert "pub-search" in content.text
152155 assert "search" in content.text
153156154157 async def test_search_tips_prompt_content(self, client):