···11-# leaflet-search notes
22-33-## deployment
44-- **backend**: push to `main` touching `backend/**` → auto-deploys via GitHub Actions
55-- **frontend**: manual deploy only (`wrangler pages deploy site --project-name leaflet-search`)
66-- **tap**: manual deploy from `tap/` directory (`fly deploy --app leaflet-search-tap`)
77-88-## remotes
99-- `origin`: tangled.sh:zzstoatzz.io/leaflet-search
1010-- `github`: github.com/zzstoatzz/leaflet-search (CI runs here)
1111-- push to both: `git push origin main && git push github main`
1212-1313-## architecture
1414-- **backend** (Zig): HTTP API, FTS5 search, vector similarity
1515-- **tap**: firehose sync via bluesky-social/indigo tap
1616-- **site**: static frontend on Cloudflare Pages
1717-- **db**: Turso (source of truth) + local SQLite read replica (FTS queries)
1818-1919-## platforms
2020-- leaflet, pckt, offprint: known platforms (detected via basePath)
2121-- other: site.standard.* documents not from a known platform
2222-2323-## search ranking
2424-- hybrid BM25 + recency: `ORDER BY rank + (days_old / 30)`
2525-- OR between terms for recall, prefix on last word
2626-- unicode61 tokenizer (non-alphanumeric = separator)
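
rough shape of the FTS table this implies (illustrative only - the real definition lives in `backend/src/db/schema.zig`, and the column set here is an assumption):

```sql
-- sketch: FTS5 table with the unicode61 tokenizer,
-- so any non-alphanumeric character acts as a token separator
CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts USING fts5(
  uri UNINDEXED,
  title,
  content,
  tokenize = 'unicode61'
);
```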
2727-2828-## tap operations
2929-- from `tap/` directory: `just check` (status), `just turbo` (catch-up), `just normal` (steady state)
3030-- see `docs/tap.md` for memory tuning and debugging
3131-3232-## common tasks
3333-- backfill embeddings: `./scripts/backfill-embeddings`
3434-- check indexing: `curl -s https://leaflet-search-backend.fly.dev/api/dashboard | jq`
+13-34
README.md
···11-# pub search
11+# leaflet-search
2233by [@zzstoatzz.io](https://bsky.app/profile/zzstoatzz.io)
4455-search ATProto publishing platforms ([leaflet](https://leaflet.pub), [pckt](https://pckt.blog), [offprint](https://offprint.app), [greengale](https://greengale.app), and others using [standard.site](https://standard.site)).
66-77-**live:** [pub-search.waow.tech](https://pub-search.waow.tech)
55+search for [leaflet](https://leaflet.pub).
8699-> formerly "leaflet-search" - generalized to support multiple publishing platforms
77+**live:** [leaflet-search.pages.dev](https://leaflet-search.pages.dev)
108119## how it works
12101313-1. **tap** syncs content from ATProto firehose (signals on `pub.leaflet.document`, filters `pub.leaflet.*` + `site.standard.*`)
1111+1. **tap** syncs leaflet content from the network
14122. **backend** indexes content into SQLite FTS5 via [Turso](https://turso.tech), serves search API
15133. **site** static frontend on Cloudflare Pages
1614···1917search is also exposed as an MCP server for AI agents like Claude Code:
20182119```bash
2222-claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
2020+claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}'
2321```
24222523see [mcp/README.md](mcp/README.md) for local setup and usage details.
···2725## api
28262927```
3030-GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date> # full-text search
3131-GET /similar?uri=<at-uri> # find similar documents
3232-GET /tags # list all tags with counts
3333-GET /popular # popular search queries
3434-GET /stats # counts + request latency (p50/p95)
3535-GET /health # health check
2828+GET /search?q=<query>&tag=<tag> # full-text search with query, tag, or both
2929+GET /similar?uri=<at-uri> # find similar documents via vector embeddings
3030+GET /tags # list all tags with counts
3131+GET /popular # popular search queries
3232+GET /stats # document/publication counts
3333+GET /health # health check
3634```
37353838-search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, offprint, greengale, or other). tag and platform filtering apply to documents only.
3939-4040-**ranking**: results use hybrid BM25 + recency scoring. text relevance is primary, but recent documents get a boost (~1 point per 30 days). the `since` parameter filters to documents created after the given ISO date (e.g., `since=2025-01-01`).
3636+search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). tag filtering applies to documents only.
41374238`/similar` uses [Voyage AI](https://voyageai.com) embeddings with brute-force cosine similarity (~0.15s for 3500 docs).
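
under the hood this is a brute-force scan over stored embeddings - roughly the SQL below, shown only as a sketch (the embedding column name and exact query are assumptions, not the backend's actual statement):

```sql
-- sketch: brute-force cosine similarity with libSQL's vector_distance_cos
-- (smaller distance = more similar)
SELECT d.uri, d.title
FROM documents d
WHERE d.embedding IS NOT NULL
ORDER BY vector_distance_cos(
    d.embedding,
    (SELECT embedding FROM documents WHERE uri = :source_uri)
) ASC
LIMIT 10;
```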
43394444-## configuration
4545-4646-the backend is fully configurable via environment variables:
4747-4848-| variable | default | description |
4949-|----------|---------|-------------|
5050-| `APP_NAME` | `leaflet-search` | name shown in startup logs |
5151-| `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
5252-| `TAP_HOST` | `leaflet-search-tap.fly.dev` | tap websocket host |
5353-| `TAP_PORT` | `443` | tap websocket port |
5454-| `PORT` | `3000` | HTTP server port |
5555-| `TURSO_URL` | - | Turso database URL (required) |
5656-| `TURSO_TOKEN` | - | Turso auth token (required) |
5757-| `VOYAGE_API_KEY` | - | Voyage AI API key (for embeddings) |
5858-5959-the backend indexes multiple ATProto platforms - currently `pub.leaflet.*` and `site.standard.*` collections. platform is stored per-document and returned in search results.
6060-6140## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)
62416342- [Fly.io](https://fly.io) hosts backend + tap
6443- [Turso](https://turso.tech) cloud SQLite with vector support
6544- [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
6666-- [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
4545+- [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs leaflet content from ATProto firehose
6746- [Zig](https://ziglang.org) HTTP server, search API, content indexing
6847- [Cloudflare Pages](https://pages.cloudflare.com) static frontend
6948
···11const std = @import("std");
22-const posix = std.posix;
3243const schema = @import("schema.zig");
54const result = @import("result.zig");
66-const sync = @import("sync.zig");
7586// re-exports
97pub const Client = @import("Client.zig");
1010-pub const LocalDb = @import("LocalDb.zig");
118pub const Row = result.Row;
129pub const Result = result.Result;
1310pub const BatchResult = result.BatchResult;
···1512// global state
1613var gpa: std.heap.GeneralPurposeAllocator(.{}) = .{};
1714var client: ?Client = null;
1818-var local_db: ?LocalDb = null;
19152020-/// Initialize Turso client only (fast, call synchronously at startup)
2121-pub fn initTurso() !void {
1616+pub fn init() !void {
2217 client = try Client.init(gpa.allocator());
2318 try schema.init(&client.?);
2419}
25202626-/// Initialize local SQLite replica (slow, call in background thread)
2727-pub fn initLocalDb() void {
2828- initLocal() catch |err| {
2929- std.debug.print("local db init failed (will use turso only): {}\n", .{err});
3030- };
3131-}
3232-3333-pub fn init() !void {
3434- try initTurso();
3535- initLocalDb();
3636-}
3737-3838-fn initLocal() !void {
3939- // check if local db is disabled
4040- if (posix.getenv("LOCAL_DB_ENABLED")) |val| {
4141- if (std.mem.eql(u8, val, "false") or std.mem.eql(u8, val, "0")) {
4242- std.debug.print("local db disabled via LOCAL_DB_ENABLED\n", .{});
4343- return;
4444- }
4545- }
4646-4747- local_db = LocalDb.init(gpa.allocator());
4848- try local_db.?.open();
4949-}
5050-5121pub fn getClient() ?*Client {
5222 if (client) |*c| return c;
5323 return null;
5424}
5555-5656-/// Get local db if ready (synced and available)
5757-pub fn getLocalDb() ?*LocalDb {
5858- if (local_db) |*l| {
5959- if (l.isReady()) return l;
6060- }
6161- return null;
6262-}
6363-6464-/// Get local db even if not ready (for sync operations)
6565-pub fn getLocalDbRaw() ?*LocalDb {
6666- if (local_db) |*l| return l;
6767- return null;
6868-}
6969-7070-/// Start background sync thread (call from main after db.init)
7171-pub fn startSync() void {
7272- const c = getClient() orelse {
7373- std.debug.print("sync: no turso client, skipping\n", .{});
7474- return;
7575- };
7676- const local = getLocalDbRaw() orelse {
7777- std.debug.print("sync: no local db, skipping\n", .{});
7878- return;
7979- };
8080-8181- const thread = std.Thread.spawn(.{}, syncLoop, .{ c, local }) catch |err| {
8282- std.debug.print("sync: failed to start thread: {}\n", .{err});
8383- return;
8484- };
8585- thread.detach();
8686- std.debug.print("sync: background thread started\n", .{});
8787-}
8888-8989-fn syncLoop(turso: *Client, local: *LocalDb) void {
9090- // full sync on startup
9191- sync.fullSync(turso, local) catch |err| {
9292- std.debug.print("sync: initial full sync failed: {}\n", .{err});
9393- };
9494-9595- // get sync interval from env (default 5 minutes)
9696- const interval_secs: u64 = blk: {
9797- const env_val = posix.getenv("SYNC_INTERVAL_SECS") orelse "300";
9898- break :blk std.fmt.parseInt(u64, env_val, 10) catch 300;
9999- };
100100-101101- std.debug.print("sync: incremental sync every {d} seconds\n", .{interval_secs});
102102-103103- // periodic incremental sync
104104- while (true) {
105105- std.Thread.sleep(interval_secs * std.time.ns_per_s);
106106- sync.incrementalSync(turso, local) catch |err| {
107107- std.debug.print("sync: incremental sync failed: {}\n", .{err});
108108- };
109109- }
110110-}
+1-105
backend/src/db/schema.zig
···4444 \\CREATE VIRTUAL TABLE IF NOT EXISTS publications_fts USING fts5(
4545 \\ uri UNINDEXED,
4646 \\ name,
4747- \\ description,
4848- \\ base_path
4747+ \\ description
4948 \\)
5049 , &.{});
5150···128127 client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
129128 client.exec("UPDATE documents SET source_collection = 'pub.leaflet.document' WHERE source_collection IS NULL", &.{}) catch {};
130129131131- // multi-platform support for publications
132132- client.exec("ALTER TABLE publications ADD COLUMN platform TEXT DEFAULT 'leaflet'", &.{}) catch {};
133133- client.exec("ALTER TABLE publications ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.publication'", &.{}) catch {};
134134- client.exec("UPDATE publications SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
135135- client.exec("UPDATE publications SET source_collection = 'pub.leaflet.publication' WHERE source_collection IS NULL", &.{}) catch {};
136136-137130 // vector embeddings column already added by backfill script
138138-139139- // dedupe index: same (did, rkey) across collections = same document
140140- // e.g., pub.leaflet.document/abc and site.standard.document/abc are the same content
141141- client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_documents_did_rkey ON documents(did, rkey)", &.{}) catch {};
142142- client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_publications_did_rkey ON publications(did, rkey)", &.{}) catch {};
143143-144144- // backfill platform from source_collection for records indexed before platform detection fix
145145- client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform = 'unknown' AND source_collection LIKE 'pub.leaflet.%'", &.{}) catch {};
146146- client.exec("UPDATE documents SET platform = 'pckt' WHERE platform = 'unknown' AND source_collection LIKE 'blog.pckt.%'", &.{}) catch {};
147147-148148- // rename 'standardsite' to 'other' (standardsite was a misnomer - it's a lexicon, not a platform)
149149- // documents using site.standard.* that don't match a known platform are simply "other"
150150- client.exec("UPDATE documents SET platform = 'other' WHERE platform = 'standardsite'", &.{}) catch {};
151151-152152- // detect platform from publication basePath (site.standard.* is a lexicon, not a platform)
153153- // known platforms (pckt, leaflet, offprint) use site.standard.* but have distinct basePaths
154154- client.exec(
155155- \\UPDATE documents SET platform = 'pckt'
156156- \\WHERE platform IN ('other', 'unknown')
157157- \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%pckt.blog%')
158158- , &.{}) catch {};
159159-160160- client.exec(
161161- \\UPDATE documents SET platform = 'leaflet'
162162- \\WHERE platform IN ('other', 'unknown')
163163- \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%leaflet.pub%')
164164- , &.{}) catch {};
165165-166166- client.exec(
167167- \\UPDATE documents SET platform = 'offprint'
168168- \\WHERE platform IN ('other', 'unknown')
169169- \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%offprint.app%' OR base_path LIKE '%offprint.test%')
170170- , &.{}) catch {};
171171-172172- client.exec(
173173- \\UPDATE documents SET platform = 'greengale'
174174- \\WHERE platform IN ('other', 'unknown')
175175- \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%greengale.app%')
176176- , &.{}) catch {};
177177-178178- // URL path field for documents (e.g., "/001" for zat.dev)
179179- // used to build full URL: publication.url + document.path
180180- client.exec("ALTER TABLE documents ADD COLUMN path TEXT", &.{}) catch {};
181181-182182- // denormalized columns for query performance (avoids per-row subqueries)
183183- client.exec("ALTER TABLE documents ADD COLUMN base_path TEXT DEFAULT ''", &.{}) catch {};
184184- client.exec("ALTER TABLE documents ADD COLUMN has_publication INTEGER DEFAULT 0", &.{}) catch {};
185185-186186- // backfill base_path from publications (idempotent - only updates empty values)
187187- client.exec(
188188- \\UPDATE documents SET base_path = COALESCE(
189189- \\ (SELECT p.base_path FROM publications p WHERE p.uri = documents.publication_uri),
190190- \\ (SELECT p.base_path FROM publications p WHERE p.did = documents.did LIMIT 1),
191191- \\ ''
192192- \\) WHERE base_path IS NULL OR base_path = ''
193193- , &.{}) catch {};
194194-195195- // backfill has_publication (idempotent)
196196- client.exec(
197197- "UPDATE documents SET has_publication = CASE WHEN publication_uri != '' THEN 1 ELSE 0 END WHERE has_publication = 0 AND publication_uri != ''",
198198- &.{},
199199- ) catch {};
200200-201201- // note: publications_fts was rebuilt with base_path column via scripts/rebuild-pub-fts
202202- // new publications will include base_path via insertPublication in indexer.zig
203203-204204- // 2026-01-22: clean up stale publication/self records that were deleted from ATProto
205205- // these cause incorrect basePath lookups for greengale documents
206206- // specifically: did:plc:27ivzcszryxp6mehutodmcxo had publication/self with basePath 'greengale.app'
207207- // but that publication was deleted, and the correct one is 'greengale.app/3fz.org'
208208- client.exec(
209209- \\DELETE FROM publications WHERE rkey = 'self'
210210- \\AND base_path = 'greengale.app'
211211- \\AND did = 'did:plc:27ivzcszryxp6mehutodmcxo'
212212- , &.{}) catch {};
213213- client.exec(
214214- \\DELETE FROM publications_fts WHERE uri IN (
215215- \\ SELECT 'at://' || did || '/site.standard.publication/self'
216216- \\ FROM publications WHERE rkey = 'self' AND base_path = 'greengale.app'
217217- \\)
218218- , &.{}) catch {};
219219-220220- // re-derive basePath for greengale documents that got wrong basePath
221221- // match documents to greengale publications (basePath contains greengale.app)
222222- // prefer more specific basePaths (with subdomain)
223223- client.exec(
224224- \\UPDATE documents SET base_path = (
225225- \\ SELECT p.base_path FROM publications p
226226- \\ WHERE p.did = documents.did
227227- \\ AND p.base_path LIKE 'greengale.app/%'
228228- \\ ORDER BY LENGTH(p.base_path) DESC
229229- \\ LIMIT 1
230230- \\)
231231- \\WHERE platform = 'greengale'
232232- \\AND (base_path = 'greengale.app' OR base_path LIKE '%pckt.blog%')
233233- \\AND did IN (SELECT did FROM publications WHERE base_path LIKE 'greengale.app/%')
234234- , &.{}) catch {};
235131}
···11-# API reference
22-33-base URL: `https://leaflet-search-backend.fly.dev`
44-55-## endpoints
66-77-### search
88-99-```
1010-GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date>
1111-```
1212-1313-full-text search across documents and publications.
1414-1515-**parameters:**
1616-| param | type | required | description |
1717-|-------|------|----------|-------------|
1818-| `q` | string | no* | search query (titles and content) |
1919-| `tag` | string | no | filter by tag (documents only) |
2020-| `platform` | string | no | filter by platform: `leaflet`, `pckt`, `offprint`, `greengale`, `other` |
2121-| `since` | string | no | ISO date, filter to documents created after |
2222-2323-*at least one of `q` or `tag` required
2424-2525-**response:**
2626-```json
2727-[
2828- {
2929- "type": "article|looseleaf|publication",
3030- "uri": "at://did:plc:.../collection/rkey",
3131- "did": "did:plc:...",
3232- "title": "document title",
3333- "snippet": "...matched text...",
3434- "createdAt": "2025-01-15T...",
3535- "rkey": "abc123",
3636- "basePath": "gyst.leaflet.pub",
3737- "platform": "leaflet",
3838- "path": "/001"
3939- }
4040-]
4141-```
4242-4343-**result types:**
4444-- `article`: document in a publication
4545-- `looseleaf`: standalone document (no publication)
4646-- `publication`: the publication itself (only returned for text queries, not tag/platform filters)
4747-4848-**ranking:** hybrid BM25 + recency. text relevance primary, recent docs boosted (~1 point per 30 days).
4949-5050-### similar
5151-5252-```
5353-GET /similar?uri=<at-uri>
5454-```
5555-5656-find semantically similar documents using vector similarity (voyage-3-lite embeddings).
5757-5858-**parameters:**
5959-| param | type | required | description |
6060-|-------|------|----------|-------------|
6161-| `uri` | string | yes | AT-URI of source document |
6262-6363-**response:** same format as search (array of results)
6464-6565-### tags
6666-6767-```
6868-GET /tags
6969-```
7070-7171-list all tags with document counts, sorted by popularity.
7272-7373-**response:**
7474-```json
7575-[
7676- {"tag": "programming", "count": 42},
7777- {"tag": "rust", "count": 15}
7878-]
7979-```
8080-8181-### popular
8282-8383-```
8484-GET /popular
8585-```
8686-8787-popular search queries.
8888-8989-**response:**
9090-```json
9191-[
9292- {"query": "rust async", "count": 12},
9393- {"query": "leaflet", "count": 8}
9494-]
9595-```
9696-9797-### platforms
9898-9999-```
100100-GET /platforms
101101-```
102102-103103-document counts by platform.
104104-105105-**response:**
106106-```json
107107-[
108108- {"platform": "leaflet", "count": 2500},
109109- {"platform": "pckt", "count": 800},
110110- {"platform": "greengale", "count": 150},
111111- {"platform": "offprint", "count": 50},
112112- {"platform": "other", "count": 100}
113113-]
114114-```
115115-116116-### stats
117117-118118-```
119119-GET /stats
120120-```
121121-122122-index statistics and request timing.
123123-124124-**response:**
125125-```json
126126-{
127127- "documents": 3500,
128128- "publications": 120,
129129- "embeddings": 3200,
130130- "searches": 5000,
131131- "errors": 5,
132132- "cache_hits": 1200,
133133- "cache_misses": 800,
134134- "timing": {
135135- "search": {"count": 1000, "avg_ms": 25, "p50_ms": 20, "p95_ms": 50, "p99_ms": 80, "max_ms": 150},
136136- "similar": {"count": 200, "avg_ms": 150, "p50_ms": 140, "p95_ms": 200, "p99_ms": 250, "max_ms": 300},
137137- "tags": {"count": 500, "avg_ms": 5, "p50_ms": 4, "p95_ms": 10, "p99_ms": 15, "max_ms": 25},
138138- "popular": {"count": 300, "avg_ms": 3, "p50_ms": 2, "p95_ms": 5, "p99_ms": 8, "max_ms": 12}
139139- }
140140-}
141141-```
142142-143143-### activity
144144-145145-```
146146-GET /activity
147147-```
148148-149149-hourly activity counts (last 24 hours).
150150-151151-**response:**
152152-```json
153153-[12, 8, 5, 3, 2, 1, 0, 0, 1, 5, 15, 25, 30, 28, 22, 18, 20, 25, 30, 35, 28, 20, 15, 10]
154154-```
155155-156156-### dashboard
157157-158158-```
159159-GET /api/dashboard
160160-```
161161-162162-rich dashboard data for analytics UI.
163163-164164-**response:**
165165-```json
166166-{
167167- "startedAt": 1705000000,
168168- "searches": 5000,
169169- "publications": 120,
170170- "documents": 3500,
171171- "platforms": [{"platform": "leaflet", "count": 2500}],
172172- "tags": [{"tag": "programming", "count": 42}],
173173- "timeline": [{"date": "2025-01-15", "count": 25}],
174174- "topPubs": [{"name": "gyst", "basePath": "gyst.leaflet.pub", "count": 150}],
175175- "timing": {...}
176176-}
177177-```
178178-179179-### health
180180-181181-```
182182-GET /health
183183-```
184184-185185-**response:**
186186-```json
187187-{"status": "ok"}
188188-```
189189-190190-## building URLs
191191-192192-documents can be accessed on the web via their `basePath` and `rkey`:
193193-- articles: `https://{basePath}/{rkey}` or `https://{basePath}{path}` if path is set
194194-- publications: `https://{basePath}`
195195-196196-examples:
197197-- `https://gyst.leaflet.pub/3ldasifz7bs2l`
198198-- `https://greengale.app/3fz.org/001`
-90
docs/content-extraction.md
···11-# content extraction for site.standard.document
22-33-lessons learned from implementing cross-platform content extraction.
44-55-## the problem
66-77-[eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y):
88-99-> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.
1010-1111-short answer: yes. but once you handle `content.pages` extraction, it's straightforward.
1212-1313-## textContent: platform-dependent
1414-1515-`site.standard.document` has a `textContent` field for pre-flattened plaintext:
1616-1717-```json
1818-{
1919- "title": "my post",
2020- "textContent": "the full text content, ready for indexing...",
2121- "content": {
2222- "$type": "blog.pckt.content",
2323- "items": [ /* platform-specific blocks */ ]
2424- }
2525-}
2626-```
2727-2828-**pckt, offprint, greengale** populate `textContent`. extraction is trivial.
2929-3030-**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`.
3131-3232-## extraction strategy
3333-3434-priority order (in `extractor.zig`):
3535-3636-1. `textContent` - use if present
3737-2. `pages` - top-level blocks (pub.leaflet.document)
3838-3. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)
3939-4040-```zig
4141-// try textContent first
4242-if (zat.json.getString(record, "textContent")) |text| {
4343- return text;
4444-}
4545-4646-// fall back to block parsing
4747-const pages = zat.json.getArray(record, "pages") orelse
4848- zat.json.getArray(record, "content.pages");
4949-```
5050-5151-the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls.
5252-5353-## deduplication
5454-5555-documents can appear in both collections with identical `(did, rkey)`:
5656-- `site.standard.document`
5757-- `pub.leaflet.document`
5858-5959-handle with `ON CONFLICT`:
6060-6161-```sql
6262-INSERT INTO documents (uri, ...)
6363-ON CONFLICT(uri) DO UPDATE SET ...
6464-```
6565-6666-note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat.
6767-6868-## platform detection
6969-7070-collection name doesn't indicate platform for `site.standard.*` records. infer from publication `basePath`:
7171-7272-| basePath contains | platform |
7373-|-------------------|----------|
7474-| `leaflet.pub` | leaflet |
7575-| `pckt.blog` | pckt |
7676-| `offprint.app` | offprint |
7777-| `greengale.app` | greengale |
7878-| (none) | other |
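
the same mapping expressed as SQL over the publication's `base_path` (a sketch - the actual detection happens in `indexer.zig`):

```sql
-- illustrative only: derive a document's platform from its publication's base_path
SELECT
  d.uri,
  CASE
    WHEN p.base_path LIKE '%leaflet.pub%'   THEN 'leaflet'
    WHEN p.base_path LIKE '%pckt.blog%'     THEN 'pckt'
    WHEN p.base_path LIKE '%offprint.app%'  THEN 'offprint'
    WHEN p.base_path LIKE '%greengale.app%' THEN 'greengale'
    ELSE 'other'
  END AS platform
FROM documents d
LEFT JOIN publications p ON p.uri = d.publication_uri;
```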
7979-8080-## summary
8181-8282-- **pckt/offprint/greengale**: use `textContent` directly
8383-- **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
8484-- **deduplication**: `ON CONFLICT` on `(did, rkey)` or `uri`
8585-- **platform**: infer from publication basePath, not collection name
8686-8787-## code references
8888-8989-- `backend/src/extractor.zig` - content extraction logic
9090-- `backend/src/indexer.zig:99-112` - platform detection from basePath
-226
docs/scratch/leaflet-publishing-plan.md
···11-# publishing to leaflet.pub
22-33-## goal
44-55-publish markdown docs to both:
66-1. `site.standard.document` (for search/interop) - already working
77-2. `pub.leaflet.document` (for leaflet.pub display) - this plan
88-99-## the mapping
1010-1111-### block types
1212-1313-| markdown | leaflet block |
1414-|----------|---------------|
1515-| `# heading` | `pub.leaflet.blocks.header` (level 1-6) |
1616-| paragraph | `pub.leaflet.blocks.text` |
1717-| ``` code ``` | `pub.leaflet.blocks.code` |
1818-| `> quote` | `pub.leaflet.blocks.blockquote` |
1919-| `---` | `pub.leaflet.blocks.horizontalRule` |
2020-| `- item` | `pub.leaflet.blocks.unorderedList` |
2121-| `![alt](url)` | `pub.leaflet.blocks.image` (requires blob upload) |
2222-| `[text](url)` (standalone) | `pub.leaflet.blocks.website` |
2323-2424-### inline formatting (facets)
2525-2626-leaflet uses byte-indexed facets for inline formatting within text blocks:
2727-2828-```json
2929-{
3030- "$type": "pub.leaflet.blocks.text",
3131- "plaintext": "hello world with bold text",
3232- "facets": [{
3333- "index": { "byteStart": 17, "byteEnd": 21 },
3434- "features": [{ "$type": "pub.leaflet.richtext.facet#bold" }]
3535- }]
3636-}
3737-```
3838-3939-| markdown | facet type |
4040-|----------|------------|
4141-| `**bold**` | `pub.leaflet.richtext.facet#bold` |
4242-| `*italic*` | `pub.leaflet.richtext.facet#italic` |
4343-| `` `code` `` | `pub.leaflet.richtext.facet#code` |
4444-| `[text](url)` | `pub.leaflet.richtext.facet#link` |
4545-| `~~strike~~` | `pub.leaflet.richtext.facet#strikethrough` |
4646-4747-## record structure
4848-4949-```json
5050-{
5151- "$type": "pub.leaflet.document",
5252- "author": "did:plc:...",
5353- "title": "document title",
5454- "description": "optional description",
5555- "publishedAt": "2026-01-06T00:00:00Z",
5656- "publication": "at://did:plc:.../pub.leaflet.publication/rkey",
5757- "tags": ["tag1", "tag2"],
5858- "pages": [{
5959- "$type": "pub.leaflet.pages.linearDocument",
6060- "id": "page-uuid",
6161- "blocks": [
6262- {
6363- "$type": "pub.leaflet.pages.linearDocument#block",
6464- "block": { /* one of the block types above */ }
6565- }
6666- ]
6767- }]
6868-}
6969-```
7070-7171-## implementation plan
7272-7373-### phase 1: markdown parser
7474-7575-add a simple markdown block parser to zat or the publish script:
7676-7777-```zig
7878-const BlockType = enum {
7979- heading,
8080- paragraph,
8181- code,
8282- blockquote,
8383- horizontal_rule,
8484- unordered_list,
8585- image,
8686-};
8787-8888-const Block = struct {
8989- type: BlockType,
9090- content: []const u8,
9191- level: ?u8 = null, // for headings
9292- language: ?[]const u8 = null, // for code blocks
9393- alt: ?[]const u8 = null, // for images
9494- src: ?[]const u8 = null, // for images
9595-};
9696-9797-fn parseMarkdownBlocks(allocator: Allocator, markdown: []const u8) ![]Block
9898-```
9999-100100-parsing approach:
101101-- split on blank lines to get blocks
102102-- identify block type by first characters:
103103- - `#` → heading (count `#` for level)
104104- - ``` → code block (capture until closing ```)
105105- - `>` → blockquote
106106- - `---` → horizontal rule
107107- - `-` or `*` at start → list item
108108- - `![` → image
109109- - else → paragraph
110110-111111-### phase 2: inline facet extraction
112112-113113-for text blocks, extract inline formatting:
114114-115115-```zig
116116-const Facet = struct {
117117- byte_start: usize,
118118- byte_end: usize,
119119- feature: FacetFeature,
120120-};
121121-122122-const FacetFeature = union(enum) {
123123- bold,
124124- italic,
125125- code,
126126- link: []const u8, // url
127127- strikethrough,
128128-};
129129-130130-fn extractFacets(allocator: Allocator, text: []const u8) !struct {
131131- plaintext: []const u8,
132132- facets: []Facet,
133133-}
134134-```
135135-136136-approach:
137137-- scan for `**`, `*`, `` ` ``, `[`, `~~`
138138-- track byte positions as we strip markers
139139-- build facet list with adjusted indices
140140-141141-### phase 3: image blob upload
142142-143143-images need to be uploaded as blobs before referencing:
144144-145145-```zig
146146-fn uploadImageBlob(client: *XrpcClient, allocator: Allocator, image_path: []const u8) !BlobRef
147147-```
148148-149149-for now, could skip images or require them to already be uploaded.
150150-151151-### phase 4: json serialization
152152-153153-build the full `pub.leaflet.document` record:
154154-155155-```zig
156156-const LeafletDocument = struct {
157157- @"$type": []const u8 = "pub.leaflet.document",
158158- author: []const u8,
159159- title: []const u8,
160160- description: ?[]const u8 = null,
161161- publishedAt: []const u8,
162162- publication: ?[]const u8 = null,
163163- tags: ?[][]const u8 = null,
164164- pages: []Page,
165165-};
166166-167167-const Page = struct {
168168- @"$type": []const u8 = "pub.leaflet.pages.linearDocument",
169169- id: []const u8,
170170- blocks: []BlockWrapper,
171171-};
172172-```
173173-174174-### phase 5: integrate into publish-docs.zig
175175-176176-update the publish script to:
177177-1. parse markdown into blocks
178178-2. convert to leaflet structure
179179-3. publish `pub.leaflet.document` alongside `site.standard.document`
180180-181181-```zig
182182-// existing: publish site.standard.document
183183-try putRecord(&client, allocator, session.did, "site.standard.document", tid.str(), doc_record);
184184-185185-// new: also publish pub.leaflet.document
186186-const leaflet_record = try markdownToLeaflet(allocator, content, title, session.did, pub_uri);
187187-try putRecord(&client, allocator, session.did, "pub.leaflet.document", tid.str(), leaflet_record);
188188-```
189189-190190-## complexity estimate
191191-192192-| component | complexity | notes |
193193-|-----------|------------|-------|
194194-| block parsing | medium | regex-free, line-by-line |
195195-| facet extraction | medium | byte index tracking is fiddly |
196196-| image upload | low | already have blob upload in xrpc |
197197-| json serialization | low | std.json handles it |
198198-| integration | low | add to existing publish flow |
199199-200200-total: ~300-500 lines of zig
201201-202202-## open questions
203203-204204-1. **publication record**: do we need a `pub.leaflet.publication` too, or just documents?
205205- - leaflet allows standalone documents without publications
206206- - could skip publication for now
207207-208208-2. **image handling**:
209209- - option A: skip images initially (just text content)
210210- - option B: require images to be URLs (no blob upload)
211211- - option C: full blob upload support
212212-213213-3. **deduplication**: same rkey for both record types?
214214- - pro: easy to correlate
215215- - con: different collections, might not matter
216216-217217-4. **validation**: leaflet has a validate endpoint
218218- - could call `/api/unstable_validate` to check records before publish
219219- - probably skip for v1
220220-221221-## references
222222-223223-- [pub.leaflet.document schema](/tmp/leaflet/lexicons/pub/leaflet/document.json)
224224-- [leaflet publishToPublication.ts](/tmp/leaflet/actions/publishToPublication.ts) - how leaflet creates records
225225-- [site.standard.document schema](/tmp/standard.site/app/data/lexicons/document.json)
226226-- paul's site: fetches records, doesn't publish them
-272
docs/scratch/logfire-zig-adoption.md
···11-# logfire-zig adoption guide for leaflet-search
22-33-guide for integrating logfire-zig into the leaflet-search backend.
44-55-## 1. add dependency
66-77-in `backend/build.zig.zon`:
88-99-```zig
1010-.dependencies = .{
1111- // ... existing deps ...
1212- .logfire = .{
1313- .url = "https://tangled.sh/zzstoatzz.io/logfire-zig/archive/main",
1414- .hash = "...", // run zig build to get hash
1515- },
1616-},
1717-```
1818-1919-in `backend/build.zig`, add the import:
2020-2121-```zig
2222-const logfire = b.dependency("logfire", .{
2323- .target = target,
2424- .optimize = optimize,
2525-});
2626-exe.root_module.addImport("logfire", logfire.module("logfire"));
2727-```
2828-2929-## 2. configure in main.zig
3030-3131-```zig
3232-const std = @import("std");
3333-const logfire = @import("logfire");
3434-// ... other imports ...
3535-3636-pub fn main() !void {
3737- var gpa = std.heap.GeneralPurposeAllocator(.{}){};
3838- defer _ = gpa.deinit();
3939- const allocator = gpa.allocator();
4040-4141- // configure logfire early
4242- // reads LOGFIRE_WRITE_TOKEN from env automatically
4343- const lf = try logfire.configure(.{
4444- .service_name = "leaflet-search",
4545- .service_version = "0.0.1",
4646- .environment = std.posix.getenv("FLY_APP_NAME") orelse "development",
4747- });
4848- defer lf.shutdown();
4949-5050- logfire.info("starting leaflet-search on port {d}", .{port});
5151-5252- // ... rest of main ...
5353-}
5454-```
5555-5656-## 3. replace timing.zig with spans
5757-5858-current pattern in server.zig:
5959-6060-```zig
6161-fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
6262- const start_time = std.time.microTimestamp();
6363- defer timing.record(.search, start_time);
6464- // ...
6565-}
6666-```
6767-6868-with logfire:
6969-7070-```zig
7171-fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
7272- const span = logfire.span("search.handle", .{});
7373- defer span.end();
7474-7575- // parse params
7676- const query = parseQueryParam(alloc, target, "q") catch "";
7777-7878- // add attributes after parsing
7979- span.setAttribute("query", query);
8080- span.setAttribute("tag", tag_filter orelse "");
8181-8282- // ...
8383-}
8484-```
8585-8686-for nested operations:
8787-8888-```zig
8989-fn search(alloc: Allocator, query: []const u8, ...) ![]Result {
9090- const span = logfire.span("search.execute", .{
9191- .query_length = @intCast(query.len),
9292- });
9393- defer span.end();
9494-9595- // FTS query
9696- {
9797- const fts_span = logfire.span("search.fts", .{});
9898- defer fts_span.end();
9999- // ... FTS logic ...
100100- }
101101-102102- // vector search fallback
103103- if (results.len < limit) {
104104- const vec_span = logfire.span("search.vector", .{});
105105- defer vec_span.end();
106106- // ... vector search ...
107107- }
108108-109109- return results;
110110-}
111111-```
112112-113113-## 4. add structured logging
114114-115115-replace `std.debug.print` with logfire:
116116-117117-```zig
118118-// before
119119-std.debug.print("accept error: {}\n", .{err});
120120-121121-// after
122122-logfire.err("accept error: {}", .{err});
123123-```
124124-125125-```zig
126126-// before
127127-std.debug.print("{s} listening on http://0.0.0.0:{d}\n", .{app_name, port});
128128-129129-// after
130130-logfire.info("{s} listening on port {d}", .{app_name, port});
131131-```
132132-133133-for sync operations in tap.zig:
134134-135135-```zig
136136-logfire.info("sync complete", .{});
137137-logfire.debug("processed {d} events", .{event_count});
138138-```
139139-140140-for errors:
141141-142142-```zig
143143-logfire.err("turso query failed: {s}", .{@errorName(err)});
144144-```
145145-146146-## 5. add metrics
147147-148148-replace stats.zig counters with logfire metrics:
149149-150150-```zig
151151-// before (in stats.zig)
152152-pub fn recordSearch(query: []const u8) void {
153153- total_searches.fetchAdd(1, .monotonic);
154154- // ...
155155-}
156156-157157-// with logfire (in server.zig or stats.zig)
158158-pub fn recordSearch(query: []const u8) void {
159159- logfire.counter("search.total", 1);
160160- // existing logic...
161161-}
162162-```
163163-164164-for gauges (e.g., active connections, document counts):
165165-166166-```zig
167167-logfire.gaugeInt("documents.indexed", doc_count);
168168-logfire.gaugeInt("connections.active", active_count);
169169-```
170170-171171-for latency histograms (more detail than counter):
172172-173173-```zig
174174-// after search completes
175175-logfire.metric(.{
176176- .name = "search.latency_ms",
177177- .unit = "ms",
178178- .data = .{
179179- .histogram = .{
180180- .data_points = &[_]logfire.HistogramDataPoint{.{
181181- .start_time_ns = start_ns,
182182- .time_ns = std.time.nanoTimestamp(),
183183- .count = 1,
184184- .sum = latency_ms,
185185- .bucket_counts = ...,
186186- .explicit_bounds = ...,
187187- .min = latency_ms,
188188- .max = latency_ms,
189189- }},
190190- },
191191- },
192192-});
193193-```
194194-195195-## 6. deployment
196196-197197-add to fly.toml secrets:
198198-199199-```bash
200200-fly secrets set LOGFIRE_WRITE_TOKEN=pylf_v1_us_xxxxx --app leaflet-search-backend
201201-```
202202-203203-logfire-zig reads from `LOGFIRE_WRITE_TOKEN` or `LOGFIRE_TOKEN` automatically.
204204-205205-## 7. what to keep from existing code
206206-207207-**keep timing.zig** - it provides local latency histograms for the dashboard API. logfire spans complement this with distributed tracing.
208208-209209-**keep stats.zig** - local counters are still useful for the `/stats` endpoint. logfire metrics add remote observability.
210210-211211-**keep activity.zig** - tracks recent activity for the dashboard. orthogonal to logfire.
212212-213213-the pattern is: local state for dashboard UI, logfire for observability.
214214-215215-## 8. migration order
216216-217217-1. add dependency, configure in main.zig
218218-2. add spans to request handlers (search, similar, tags, popular)
219219-3. add structured logging for errors and important events
220220-4. add metrics for key counters
221221-5. gradually replace `std.debug.print` with logfire logging
222222-6. consider removing timing.zig if logfire histograms are sufficient
223223-224224-## 9. example: full search handler
225225-226226-```zig
227227-fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
228228- const span = logfire.span("http.search", .{});
229229- defer span.end();
230230-231231- var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
232232- defer arena.deinit();
233233- const alloc = arena.allocator();
234234-235235- const query = parseQueryParam(alloc, target, "q") catch "";
236236- const tag_filter = parseQueryParam(alloc, target, "tag") catch null;
237237-238238- if (query.len == 0 and tag_filter == null) {
239239- logfire.debug("empty search request", .{});
240240- try sendJson(request, "{\"error\":\"enter a search term\"}");
241241- return;
242242- }
243243-244244- const results = search.search(alloc, query, tag_filter, null, null) catch |err| {
245245- logfire.err("search failed: {}", .{@errorName(err)});
246246- stats.recordError();
247247- return err;
248248- };
249249-250250- logfire.counter("search.requests", 1);
251251- logfire.info("search completed", .{});
252252-253253- // ... send response ...
254254-}
255255-```
256256-257257-## 10. verifying it works
258258-259259-run locally:
260260-261261-```bash
262262-LOGFIRE_WRITE_TOKEN=pylf_v1_us_xxx zig build run
263263-```
264264-265265-check logfire dashboard for traces from `leaflet-search` service.
266266-267267-without token (console fallback):
268268-269269-```bash
270270-zig build run
271271-# prints [span], [info], [metric] to stderr
272272-```
-350
docs/scratch/standard-search-planning.md
···11-# standard-search planning
22-33-expanding leaflet-search to index all standard.site records.
44-55-## references
66-77-- [standard.site](https://standard.site/) - shared lexicons for long-form publishing on ATProto
88-- [leaflet.pub](https://leaflet.pub/) - implements `pub.leaflet.*` lexicons
99-- [pckt.blog](https://pckt.blog/) - implements `blog.pckt.*` lexicons
1010-- [offprint.app](https://offprint.app/) - implements `app.offprint.*` lexicons
1111-- [ATProto docs](https://atproto.com/docs) - protocol documentation
1212-1313-## context
1414-1515-discussion with pckt.blog team about building global search for standard.site ecosystem.
1616-current leaflet-search is tightly coupled to `pub.leaflet.*` lexicons.
1717-1818-### recent work (2026-01-05)
1919-2020-added similarity cache to improve `/similar` endpoint performance:
2121-- `similarity_cache` table stores computed results keyed by `(source_uri, doc_count)`
2222-- cache auto-invalidates when document count changes
2323-- `/stats` endpoint now shows `cache_hits` and `cache_misses`
2424-- first request ~3s (cold), cached requests ~0.15s
2525-2626-also added loading indicator for "related to" results in frontend.
2727-2828-### recent work (2026-01-06)
2929-3030-- merged PR1: multi-platform schema (platform + source_collection columns)
3131-- added `loading.js` - portable loading state handler for dashboards
3232- - skeleton shimmer while loading
3333- - "waking up" toast after 2s threshold (fly.io cold start handling)
3434- - designed to be copied to other projects
3535-- fixed pluralization ("1 result" vs "2 results")
3636-3737-## what we know
3838-3939-### standard.site lexicons
4040-4141-two shared lexicons for long-form publishing on ATProto:
4242-- `site.standard.document` - document content and metadata
4343-- `site.standard.publication` - publication/blog metadata
4444-4545-implementing platforms:
4646-- leaflet.pub (`pub.leaflet.*`)
4747-- pckt.blog (`blog.pckt.*`)
4848-- offprint.app (`app.offprint.*`)
4949-5050-### site.standard.document schema
5151-5252-examined real records from pckt.blog. key fields:
5353-5454-```
5555-textContent - PRE-FLATTENED TEXT FOR SEARCH (the holy grail)
5656-content - platform-specific block structure
5757- .$type - identifies platform (e.g., "blog.pckt.content")
5858-title - document title
5959-tags - array of strings
6060-site - AT-URI reference to site.standard.publication
6161-path - URL path (e.g., "/my-post-abc123")
6262-publishedAt - ISO timestamp
6363-updatedAt - ISO timestamp
6464-coverImage - blob reference
6565-```
6666-6767-### the textContent field
6868-6969-this is huge. platforms flatten their block content into a single text field:
7070-7171-```json
7272-{
7373- "content": {
7474- "$type": "blog.pckt.content",
7575- "items": [ /* platform-specific blocks */ ]
7676- },
7777- "textContent": "i have been writing a lot of atproto things in zig!..."
7878-}
7979-```
8080-8181-no need to parse platform-specific blocks - just index `textContent` directly.
8282-8383-### platform detection
8484-8585-derive platform from `content.$type` prefix:
8686-- `blog.pckt.content` → pckt
8787-- `pub.leaflet.content` → leaflet (TBD - need to verify)
8888-- `app.offprint.content` → offprint (TBD - need to verify)
8989-9090-### current leaflet-search architecture
9191-9292-```
9393-ATProto firehose (via tap)
9494- ↓
9595-tap.zig - subscribes to pub.leaflet.document/publication
9696- ↓
9797-indexer.zig - extracts content from nested pages[].blocks[] structure
9898- ↓
9999-turso (sqlite) - documents table + FTS5 + embeddings
100100- ↓
101101-search.zig - FTS5 queries + vector similarity
102102- ↓
103103-server.zig - HTTP API (/search, /similar, /stats)
104104-```
105105-106106-leaflet-specific code:
107107-- tap.zig lines 10-11: hardcoded collection names
108108-- tap.zig lines 234-268: block type extraction (pub.leaflet.blocks.*)
109109-- recursive page/block traversal logic
110110-111111-generalizable code:
112112-- database schema (FTS5, tags, stats, similarity cache)
113113-- search/similar logic
114114-- HTTP API
115115-- embedding pipeline
116116-117117-## proposed architecture for standard-search
118118-119119-### ingestion changes
120120-121121-subscribe to:
122122-- `site.standard.document`
123123-- `site.standard.publication`
124124-125125-optionally also subscribe to platform-specific collections for richer data:
126126-- `pub.leaflet.document/publication`
127127-- `blog.pckt.document/publication` (if they have these)
128128-- `app.offprint.document/publication` (if they have these)
129129-130130-### content extraction
131131-132132-for `site.standard.document`:
133133-1. use `textContent` field directly - no block parsing!
134134-2. fall back to title + description if textContent missing
135135-136136-for platform-specific records (if needed):
137137-- keep existing leaflet block parser
138138-- add parsers for other platforms as needed
139139-140140-### database changes
141141-142142-add to documents table:
143143-- `platform` TEXT - derived from content.$type (leaflet, pckt, offprint)
144144-- `source_collection` TEXT - the actual lexicon (site.standard.document, pub.leaflet.document)
145145-- `standard_uri` TEXT - if platform-specific record, link to corresponding site.standard.document
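
a sketch of the schema prep this implies (illustrative - the actual migration lives in `schema.zig`; defaults match the PR1 notes below):

```sql
-- illustrative multi-platform migration for the columns listed above
ALTER TABLE documents ADD COLUMN platform TEXT DEFAULT 'leaflet';
ALTER TABLE documents ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.document';
ALTER TABLE documents ADD COLUMN standard_uri TEXT;
```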
146146-147147-### API changes
148148-149149-- `/search?q=...&platform=leaflet` - optional platform filter
150150-- results include `platform` field
151151-- `/similar` works across all platforms
152152-153153-### naming/deployment
154154-155155-options:
156156-1. rename leaflet-search → standard-search (breaking change)
157157-2. new repo/deployment, keep leaflet-search as-is
158158-3. branch and generalize, decide naming later
159159-160160-leaning toward option 3 for now.
161161-162162-## findings from exploration
163163-164164-### pckt.blog - READY
165165-- writes `site.standard.document` records
166166-- has `textContent` field (pre-flattened)
167167-- `content.$type` = `blog.pckt.content`
168168-- 6+ records found on pckt.blog service account
169169-170170-### leaflet.pub - NOT YET MIGRATED
171171-- still using `pub.leaflet.document` only
172172-- no `site.standard.document` records found
173173-- no `textContent` field - content is in nested `pages[].blocks[]`
174174-- will need to continue parsing blocks OR wait for migration
175175-176176-### offprint.app - NOW INDEXED (2026-01-22)
177177-- writes `site.standard.document` records with `app.offprint.content` blocks
178178-- has `textContent` field (pre-flattened)
179179-- platform detected via basePath (`*.offprint.app`, `*.offprint.test`)
180180-- now fully supported alongside leaflet and pckt
181181-182182-### greengale.app - NOW INDEXED (2026-01-22)
183183-- writes `site.standard.document` records
184184-- has `textContent` field (pre-flattened)
185185-- platform detected via basePath (`greengale.app/*`)
186186-- ~29 documents indexed at time of discovery
187187-- now fully supported alongside leaflet, pckt, and offprint
188188-189189-### implication for architecture
190190-191191-two paths:
192192-193193-**path A: wait for leaflet migration**
194194-- simpler: just index `site.standard.document` with `textContent`
195195-- all platforms converge on same schema
196196-- downside: loses existing leaflet search until they migrate
197197-198198-**path B: hybrid approach**
199199-- index `site.standard.document` (pckt, future leaflet, offprint)
200200-- ALSO index `pub.leaflet.document` with existing block parser
201201-- dedupe by URI or store both with `source_collection` indicator
202202-- more complex but maintains backwards compat
203203-204204-leaning toward **path B** - can't lose 3500 leaflet docs.
205205-206206-## open questions
207207-208208-- [x] does leaflet write site.standard.document records? **NO, not yet**
209209-- [x] does offprint write site.standard.document records? **UNKNOWN - no public content yet**
210210-- [ ] when will leaflet migrate to standard.site?
211211-- [ ] should we dedupe platform-specific vs standard records?
212212-- [ ] embeddings: regenerate for all, or use same model?
213213-214214-## implementation plan (PRs)
215215-216216-breaking work into reviewable chunks:
217217-218218-### PR1: database schema for multi-platform โ MERGED
219219-- add `platform TEXT` column to documents (default 'leaflet')
220220-- add `source_collection TEXT` column (default 'pub.leaflet.document')
221221-- backfill existing ~3500 records
222222-- no behavior change, just schema prep
223223-- https://github.com/zzstoatzz/leaflet-search/pull/1
224224-225225-### PR2: generalized content extraction
226226-- new `extractor.zig` module with platform-agnostic interface
227227-- `textContent` extraction for standard.site records
228228-- keep existing block parser for `pub.leaflet.*`
229229-- platform detection from `content.$type`
230230-231231-### PR3: tap subscriber for site.standard.document
232232-- subscribe to `site.standard.document` + `site.standard.publication`
233233-- route to appropriate extractor
234234-- starts ingesting pckt.blog content
235235-236236-### PR4: API platform filter
237237-- add `?platform=` query param to `/search`
238238-- include `platform` field in results
239239-- frontend: show platform badge, optional filter
240240-241241-### PR5 (optional, separate track): witness cache
242242-- `witness_cache` table for raw records
243243-- replay tooling for backfills
244244-- independent of above work
245245-246246-## operational notes
247247-248248-- **cloudflare pages**: `leaflet-search` does NOT auto-deploy from git. manual deploy required:
249249- ```bash
250250- wrangler pages deploy site --project-name leaflet-search
251251- ```
252252-- **fly.io backend**: deploy from backend directory:
253253- ```bash
254254- cd backend && fly deploy
255255- ```
256256-- **git remotes**: push to both `origin` (tangled.sh) and `github` (for MCP + PRs)
257257-258258-## next steps
259259-260260-1. ~~verify leaflet's site.standard.document structure~~ (done - they don't have any)
261261-2. ~~find and examine offprint records~~ (done - no public content yet)
262262-3. ~~PR1: database schema~~ (merged)
263263-4. PR2: generalized content extraction
264264-5. PR3: tap subscriber
265265-6. PR4: API platform filter
266266-7. consider witness cache architecture (see below)
267267-268268----
269269-270270-## architectural consideration: witness cache
271271-272272-[paul frazee's post on witness caches](https://bsky.app/profile/pfrazee.com/post/3lfarplxvcs2e) (2026-01-05):
273273-274274-> I'm increasingly convinced that many Atmosphere backends start with a local "witness cache" of the repositories.
275275->
276276-> A witness cache is a copy of the repository records, plus a timestamp of when the record was indexed (the "witness time") which you want to keep
277277->
278278-> The key feature is: you can replay it
279279-280280-> With local replay, you can add new tables or indexes to your backend and quickly backfill the data. If you don't have a witness cache, you would have to do backfill from the network, which is slow
281281-282282-### current leaflet-search architecture (no witness cache)
283283-284284-```
285285-Firehose → tap → Parse & Transform → Store DERIVED data → Discard raw record
286286-```
287287-288288-we store:
289289-- `uri`, `did`, `rkey`
290290-- `title` (extracted)
291291-- `content` (flattened from blocks)
292292-- `created_at`, `publication_uri`
293293-294294-we discard: the raw record JSON
295295-296296-### witness cache architecture
297297-298298-```
299299-Firehose → Store RAW record + witness_time → Derive indexes on demand (replayable)
300300-```
301301-302302-would store:
303303-- `uri`, `collection`, `rkey`
304304-- `raw_record` (full JSON blob)
305305-- `witness_time` (when we indexed it)
306306-307307-then derive FTS, embeddings, etc. from local data via replay.
308308-309309-### comparison
310310-311311-| scenario | current (no cache) | with witness cache |
312312-|----------|-------------------|-------------------|
313313-| add new parser (offprint) | re-crawl network | replay local |
314314-| leaflet adds textContent | wait for new records | replay & re-extract |
315315-| fix parsing bug | re-crawl affected | replay & re-derive |
316316-| change embedding model | re-fetch content | replay local |
317317-| add new index/table | backfill from network | replay locally |
318318-319319-### trade-offs
320320-321321-**storage cost:**
322322-- ~3500 docs × ~10KB avg = ~35MB (not huge)
323323-- turso free tier: 9GB, so plenty of room
324324-325325-**complexity:**
326326-- two-phase: store raw, then derive
327327-- vs current one-phase: derive immediately
328328-329329-**benefits for standard-search:**
330330-- could add offprint/pckt parsers and replay existing data
331331-- when leaflet migrates to standard.site, re-derive without network
332332-- embedding backfill becomes local-only (no voyage API for content fetch)
333333-334334-### implementation options
335335-336336-1. **add `raw_record TEXT` column to existing tables**
337337- - simple, backwards compatible
338338- - can migrate incrementally
339339-340340-2. **separate `witness_cache` table**
341341- - `(uri PRIMARY KEY, collection, raw_record, witness_time)`
342342- - cleaner separation of concerns
343343- - documents/publications tables become derived views
344344-345345-3. **use duckdb/clickhouse for witness cache** (paul's suggestion)
346346- - better compression for JSON blobs
347347- - good for analytics queries
348348- - adds operational complexity
349349-350350-for our scale, option 1 or 2 with turso is probably fine.
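
a sketch of what option 2 above could look like (types are assumptions, not a committed design):

```sql
-- illustrative witness_cache per option 2: raw record + witness time, replayable
CREATE TABLE IF NOT EXISTS witness_cache (
  uri TEXT PRIMARY KEY,
  collection TEXT NOT NULL,
  raw_record TEXT NOT NULL,       -- full record JSON as received
  witness_time INTEGER NOT NULL   -- when we indexed it
);
```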
-124
docs/search-architecture.md
···11-# search architecture
22-33-current state, rationale, and future options.
44-55-## current: SQLite FTS5
66-77-we use SQLite's built-in full-text search (FTS5) via Turso.
88-99-### why FTS5 works for now
1010-1111-- **scale**: ~3500 documents. FTS5 handles this trivially.
1212-- **latency**: 10-50ms for search queries. fine for our use case.
1313-- **cost**: $0. included with Turso free tier.
1414-- **ops**: zero. no separate service to run.
1515-- **simplicity**: one database for everything (docs, FTS, vectors, cache).
1616-1717-### how it works
1818-1919-```
2020-user query: "crypto-casino"
2121- ↓
2222-buildFtsQuery(): "crypto OR casino*"
2323- ↓
2424-FTS5 MATCH query with BM25 + recency decay
2525- ↓
2626-results with snippet()
2727-```
2828-2929-key decisions:
3030-- **OR between terms** for better recall (deliberate, see commit 35ad4b5)
3131-- **prefix match on last word** for type-ahead feel
3232-- **unicode61 tokenizer** splits on non-alphanumeric (we match this in buildFtsQuery)
3333-- **recency decay** boosts recent docs: `ORDER BY rank + (days_old / 30)`
3434-3535-### what's coupled to FTS5
3636-3737-all in `backend/src/search.zig`:
3838-3939-| component | FTS5-specific |
4040-|-----------|---------------|
4141-| 10 query definitions | `MATCH`, `snippet()`, `ORDER BY rank` |
4242-| `buildFtsQuery()` | constructs FTS5 syntax |
4343-| schema | `documents_fts`, `publications_fts` virtual tables |
4444-4545-### what's already decoupled
4646-4747-- result types (`SearchResultJson`, `Doc`, `Pub`)
4848-- similarity search (uses `vector_distance_cos`, not FTS5)
4949-- caching logic
5050-- HTTP layer (server.zig just calls `search()`)
5151-5252-### known limitations
5353-5454-- **no typo tolerance**: "leafet" won't find "leaflet"
5555-- **no relevance tuning**: can't boost title vs content
5656-- **single writer**: SQLite write lock
5757-- **no horizontal scaling**: single database
5858-5959-these aren't problems at current scale.
6060-6161-## future: if we need to scale
6262-6363-### when to consider switching
6464-6565-- search latency consistently >100ms
6666-- write contention from indexing
6767-- need typo tolerance or better relevance
6868-- millions of documents
6969-7070-### recommended: Elasticsearch
7171-7272-Elasticsearch is the battle-tested choice for production search:
7373-7474-- proven at massive scale (Wikipedia, GitHub, Stack Overflow)
7575-- rich query DSL, analyzers, aggregations
7676-- typo tolerance via fuzzy matching
7777-- horizontal scaling built-in
7878-- extensive tooling and community
7979-8080-trade-offs:
8181-- operational complexity (JVM, cluster management)
8282-- resource hungry (~2GB+ RAM minimum)
8383-- cost: $50-500/month depending on scale
8484-8585-### alternatives considered
8686-8787-**Meilisearch/Typesense**: simpler, lighter, great defaults. good for straightforward search but less proven at scale. would work fine for this use case but Elasticsearch has more headroom.
8888-8989-**Algolia**: fully managed, excellent but expensive. makes sense if you want zero ops.
9090-9191-**PostgreSQL full-text**: if already on Postgres. not as good as FTS5 or Elasticsearch but one less system.
9292-9393-### migration path
9494-9595-1. keep Turso as source of truth
9696-2. add Elasticsearch as search index
9797-3. sync documents to ES on write (async)
9898-4. point `/search` at Elasticsearch
9999-5. keep `/similar` on Turso (vector search)
100100-101101-the `search()` function would change from SQL queries to ES client calls. result types stay the same. HTTP layer unchanged.
102102-103103-estimated effort: 1-2 days to swap search backend.
104104-105105-### vector search scaling
106106-107107-similarity search currently uses brute-force `vector_distance_cos` with caching. at scale:
108108-109109-- **Elasticsearch**: has vector search (dense_vector + kNN)
110110-- **dedicated vector DB**: Qdrant, Pinecone, Weaviate
111111-- **pgvector**: if on Postgres
112112-113113-could consolidate text + vector in Elasticsearch, or keep them separate.
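
for intuition, the current brute-force approach boils down to comparing the source document against every other embedding - a minimal python sketch (the backend does this in SQL via `vector_distance_cos`, the names here are illustrative):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def find_similar(source_uri: str, embeddings: dict[str, list[float]], k: int = 10):
    """brute force: compare the source against every other embedding - O(n) per
    query, which is why results are cached and why an ANN index becomes
    attractive at larger scale."""
    src = embeddings[source_uri]
    scored = [(uri, cosine_distance(src, vec))
              for uri, vec in embeddings.items() if uri != source_uri]
    return sorted(scored, key=lambda t: t[1])[:k]
```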
114114-115115-## summary
116116-117117-| scale | recommendation |
118118-|-------|----------------|
119119-| <10k docs | keep FTS5 (current) |
120120-| 10k-100k docs | still probably fine, monitor latency |
121121-| 100k+ docs | consider Elasticsearch |
122122-| millions + sub-ms latency | Elasticsearch cluster + caching layer |
123123-124124-we're in the "keep FTS5" zone. the code is structured to swap later if needed.
+343
docs/standard-search-planning.md
···11+# standard-search planning
22+33+expanding leaflet-search to index all standard.site records.
44+55+## references
66+77+- [standard.site](https://standard.site/) - shared lexicons for long-form publishing on ATProto
88+- [leaflet.pub](https://leaflet.pub/) - implements `pub.leaflet.*` lexicons
99+- [pckt.blog](https://pckt.blog/) - implements `blog.pckt.*` lexicons
1010+- [offprint.app](https://offprint.app/) - implements `app.offprint.*` lexicons (early beta)
1111+- [ATProto docs](https://atproto.com/docs) - protocol documentation
1212+1313+## context
1414+1515+discussion with pckt.blog team about building global search for standard.site ecosystem.
1616+current leaflet-search is tightly coupled to `pub.leaflet.*` lexicons.
1717+1818+### recent work (2026-01-05)
1919+2020+added similarity cache to improve `/similar` endpoint performance:
2121+- `similarity_cache` table stores computed results keyed by `(source_uri, doc_count)`
2222+- cache auto-invalidates when document count changes
2323+- `/stats` endpoint now shows `cache_hits` and `cache_misses`
2424+- first request ~3s (cold), cached requests ~0.15s
2525+2626+also added loading indicator for "related to" results in frontend.
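
a minimal sketch of the cache lookup described above (python/sqlite for illustration; the real backend is zig + turso, and the table/column names are an approximation):

```python
import json
import sqlite3

def similar_with_cache(db: sqlite3.Connection, source_uri: str, compute):
    """cache keyed by (source_uri, doc_count): when the document count changes,
    old keys simply stop matching, so stale results are never served."""
    (doc_count,) = db.execute("SELECT COUNT(*) FROM documents").fetchone()
    row = db.execute(
        "SELECT results FROM similarity_cache WHERE source_uri = ? AND doc_count = ?",
        (source_uri, doc_count),
    ).fetchone()
    if row:
        return json.loads(row[0])                  # warm path (~0.15s)
    results = compute(source_uri)                  # cold path (~3s)
    db.execute(
        "INSERT INTO similarity_cache (source_uri, doc_count, results) VALUES (?, ?, ?)",
        (source_uri, doc_count, json.dumps(results)),
    )
    db.commit()
    return results
```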
2727+2828+### recent work (2026-01-06)
2929+3030+- merged PR1: multi-platform schema (platform + source_collection columns)
3131+- added `loading.js` - portable loading state handler for dashboards
3232+ - skeleton shimmer while loading
3333+ - "waking up" toast after 2s threshold (fly.io cold start handling)
3434+ - designed to be copied to other projects
3535+- fixed pluralization ("1 result" vs "2 results")
3636+3737+## what we know
3838+3939+### standard.site lexicons
4040+4141+two shared lexicons for long-form publishing on ATProto:
4242+- `site.standard.document` - document content and metadata
4343+- `site.standard.publication` - publication/blog metadata
4444+4545+implementing platforms:
4646+- leaflet.pub (`pub.leaflet.*`)
4747+- pckt.blog (`blog.pckt.*`)
4848+- offprint.app (`app.offprint.*`)
4949+5050+### site.standard.document schema
5151+5252+examined real records from pckt.blog. key fields:
5353+5454+```
5555+textContent - PRE-FLATTENED TEXT FOR SEARCH (the holy grail)
5656+content - platform-specific block structure
5757+ .$type - identifies platform (e.g., "blog.pckt.content")
5858+title - document title
5959+tags - array of strings
6060+site - AT-URI reference to site.standard.publication
6161+path - URL path (e.g., "/my-post-abc123")
6262+publishedAt - ISO timestamp
6363+updatedAt - ISO timestamp
6464+coverImage - blob reference
6565+```
6666+6767+### the textContent field
6868+6969+this is huge. platforms flatten their block content into a single text field:
7070+7171+```json
7272+{
7373+ "content": {
7474+ "$type": "blog.pckt.content",
7575+ "items": [ /* platform-specific blocks */ ]
7676+ },
7777+ "textContent": "i have been writing a lot of atproto things in zig!..."
7878+}
7979+```
8080+8181+no need to parse platform-specific blocks - just index `textContent` directly.
8282+8383+### platform detection
8484+8585+derive platform from `content.$type` prefix:
8686+- `blog.pckt.content` → pckt
8787+- `pub.leaflet.content` → leaflet (TBD - need to verify)
8888+- `app.offprint.content` → offprint (TBD - need to verify)
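
a small sketch of that mapping (python; the real detection would live in the zig extractor, and the leaflet/offprint prefixes above are still unverified):

```python
def detect_platform(content_type: str) -> str:
    """map content.$type prefixes to platform names; unknown prefixes fall back to 'other'."""
    prefixes = {
        "blog.pckt.": "pckt",
        "pub.leaflet.": "leaflet",    # unverified
        "app.offprint.": "offprint",  # unverified
    }
    for prefix, platform in prefixes.items():
        if content_type.startswith(prefix):
            return platform
    return "other"

print(detect_platform("blog.pckt.content"))  # -> "pckt"
```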
8989+9090+### current leaflet-search architecture
9191+9292+```
9393+ATProto firehose (via tap)
9494+ ↓
9595+tap.zig - subscribes to pub.leaflet.document/publication
9696+ ↓
9797+indexer.zig - extracts content from nested pages[].blocks[] structure
9898+ ↓
9999+turso (sqlite) - documents table + FTS5 + embeddings
100100+ ↓
101101+search.zig - FTS5 queries + vector similarity
102102+ ↓
103103+server.zig - HTTP API (/search, /similar, /stats)
104104+```
105105+106106+leaflet-specific code:
107107+- tap.zig lines 10-11: hardcoded collection names
108108+- tap.zig lines 234-268: block type extraction (pub.leaflet.blocks.*)
109109+- recursive page/block traversal logic
110110+111111+generalizable code:
112112+- database schema (FTS5, tags, stats, similarity cache)
113113+- search/similar logic
114114+- HTTP API
115115+- embedding pipeline
116116+117117+## proposed architecture for standard-search
118118+119119+### ingestion changes
120120+121121+subscribe to:
122122+- `site.standard.document`
123123+- `site.standard.publication`
124124+125125+optionally also subscribe to platform-specific collections for richer data:
126126+- `pub.leaflet.document/publication`
127127+- `blog.pckt.document/publication` (if they have these)
128128+- `app.offprint.document/publication` (if they have these)
129129+130130+### content extraction
131131+132132+for `site.standard.document`:
133133+1. use `textContent` field directly - no block parsing!
134134+2. fall back to title + description if textContent missing
135135+136136+for platform-specific records (if needed):
137137+- keep existing leaflet block parser
138138+- add parsers for other platforms as needed
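
a sketch of the standard.site extraction path described above (python for illustration; the real code would be `extractor.zig`, and `description` as the fallback field is taken from the note above, not a verified lexicon field):

```python
def extract_text(record: dict) -> str:
    """site.standard.document: prefer the pre-flattened textContent field,
    fall back to title + description when it is missing."""
    text = record.get("textContent")
    if text:
        return text
    return " ".join(
        part for part in (record.get("title"), record.get("description")) if part
    )
```

pub.leaflet.document records keep going through the existing pages[].blocks[] parser.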
139139+140140+### database changes
141141+142142+add to documents table:
143143+- `platform` TEXT - derived from content.$type (leaflet, pckt, offprint)
144144+- `source_collection` TEXT - the actual lexicon (site.standard.document, pub.leaflet.document)
145145+- `standard_uri` TEXT - if platform-specific record, link to corresponding site.standard.document
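
a sketch of what that migration amounts to (python/sqlite here for illustration; the real database is turso/libSQL, which accepts the same DDL):

```python
import sqlite3

MIGRATION = """
ALTER TABLE documents ADD COLUMN platform TEXT DEFAULT 'leaflet';
ALTER TABLE documents ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.document';
ALTER TABLE documents ADD COLUMN standard_uri TEXT;
"""

def migrate(db_path: str) -> None:
    # constant defaults mean existing leaflet rows read back as
    # platform='leaflet' without an explicit backfill UPDATE
    with sqlite3.connect(db_path) as db:
        db.executescript(MIGRATION)
```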
146146+147147+### API changes
148148+149149+- `/search?q=...&platform=leaflet` - optional platform filter
150150+- results include `platform` field
151151+- `/similar` works across all platforms
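
illustrative usage once the filter lands (python/httpx against the existing backend URL; the `platform` param is the proposal above, not yet deployed):

```python
import httpx

# the /search endpoint already exists; the platform param is the new piece
resp = httpx.get(
    "https://leaflet-search-backend.fly.dev/search",
    params={"q": "zig", "platform": "pckt"},
    timeout=30.0,
)
for result in resp.json():
    print(result.get("platform"), result["title"])
```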
152152+153153+### naming/deployment
154154+155155+options:
156156+1. rename leaflet-search → standard-search (breaking change)
157157+2. new repo/deployment, keep leaflet-search as-is
158158+3. branch and generalize, decide naming later
159159+160160+leaning toward option 3 for now.
161161+162162+## findings from exploration
163163+164164+### pckt.blog - READY
165165+- writes `site.standard.document` records
166166+- has `textContent` field (pre-flattened)
167167+- `content.$type` = `blog.pckt.content`
168168+- 6+ records found on pckt.blog service account
169169+170170+### leaflet.pub - NOT YET MIGRATED
171171+- still using `pub.leaflet.document` only
172172+- no `site.standard.document` records found
173173+- no `textContent` field - content is in nested `pages[].blocks[]`
174174+- will need to continue parsing blocks OR wait for migration
175175+176176+### offprint.app - LIKELY EARLY BETA
177177+- no `site.standard.document` records found on offprint.app account
178178+- no `app.offprint.document` collection visible
179179+- website shows no example users/content
180180+- probably in early/private beta - no public records yet
181181+182182+### implication for architecture
183183+184184+two paths:
185185+186186+**path A: wait for leaflet migration**
187187+- simpler: just index `site.standard.document` with `textContent`
188188+- all platforms converge on same schema
189189+- downside: loses existing leaflet search until they migrate
190190+191191+**path B: hybrid approach**
192192+- index `site.standard.document` (pckt, future leaflet, offprint)
193193+- ALSO index `pub.leaflet.document` with existing block parser
194194+- dedupe by URI or store both with `source_collection` indicator
195195+- more complex but maintains backwards compat
196196+197197+leaning toward **path B** - can't lose 3500 leaflet docs.
198198+199199+## open questions
200200+201201+- [x] does leaflet write site.standard.document records? **NO, not yet**
202202+- [x] does offprint write site.standard.document records? **UNKNOWN - no public content yet**
203203+- [ ] when will leaflet migrate to standard.site?
204204+- [ ] should we dedupe platform-specific vs standard records?
205205+- [ ] embeddings: regenerate for all, or use same model?
206206+207207+## implementation plan (PRs)
208208+209209+breaking work into reviewable chunks:
211211+### PR1: database schema for multi-platform ✅ MERGED
212212+- add `platform TEXT` column to documents (default 'leaflet')
213213+- add `source_collection TEXT` column (default 'pub.leaflet.document')
214214+- backfill existing ~3500 records
215215+- no behavior change, just schema prep
216216+- https://github.com/zzstoatzz/leaflet-search/pull/1
217217+218218+### PR2: generalized content extraction
219219+- new `extractor.zig` module with platform-agnostic interface
220220+- `textContent` extraction for standard.site records
221221+- keep existing block parser for `pub.leaflet.*`
222222+- platform detection from `content.$type`
223223+224224+### PR3: TAP subscriber for site.standard.document
225225+- subscribe to `site.standard.document` + `site.standard.publication`
226226+- route to appropriate extractor
227227+- starts ingesting pckt.blog content
228228+229229+### PR4: API platform filter
230230+- add `?platform=` query param to `/search`
231231+- include `platform` field in results
232232+- frontend: show platform badge, optional filter
233233+234234+### PR5 (optional, separate track): witness cache
235235+- `witness_cache` table for raw records
236236+- replay tooling for backfills
237237+- independent of above work
238238+239239+## operational notes
240240+241241+- **cloudflare pages**: `leaflet-search` does NOT auto-deploy from git. manual deploy required:
242242+ ```bash
243243+ wrangler pages deploy site --project-name leaflet-search
244244+ ```
245245+- **fly.io backend**: deploy from backend directory:
246246+ ```bash
247247+ cd backend && fly deploy
248248+ ```
249249+- **git remotes**: push to both `origin` (tangled.sh) and `github` (for MCP + PRs)
250250+251251+## next steps
252252+253253+1. ~~verify leaflet's site.standard.document structure~~ (done - they don't have any)
254254+2. ~~find and examine offprint records~~ (done - no public content yet)
255255+3. ~~PR1: database schema~~ (merged)
256256+4. PR2: generalized content extraction
257257+5. PR3: TAP subscriber
258258+6. PR4: API platform filter
259259+7. consider witness cache architecture (see below)
260260+261261+---
262262+263263+## architectural consideration: witness cache
264264+265265+[paul frazee's post on witness caches](https://bsky.app/profile/pfrazee.com/post/3lfarplxvcs2e) (2026-01-05):
266266+267267+> I'm increasingly convinced that many Atmosphere backends start with a local "witness cache" of the repositories.
268268+>
269269+> A witness cache is a copy of the repository records, plus a timestamp of when the record was indexed (the "witness time") which you want to keep
270270+>
271271+> The key feature is: you can replay it
272272+273273+> With local replay, you can add new tables or indexes to your backend and quickly backfill the data. If you don't have a witness cache, you would have to do backfill from the network, which is slow
274274+275275+### current leaflet-search architecture (no witness cache)
276276+277277+```
278278+Firehose → TAP → Parse & Transform → Store DERIVED data → Discard raw record
279279+```
280280+281281+we store:
282282+- `uri`, `did`, `rkey`
283283+- `title` (extracted)
284284+- `content` (flattened from blocks)
285285+- `created_at`, `publication_uri`
286286+287287+we discard: the raw record JSON
288288+289289+### witness cache architecture
290290+291291+```
292292+Firehose → Store RAW record + witness_time → Derive indexes on demand (replayable)
293293+```
294294+295295+would store:
296296+- `uri`, `collection`, `rkey`
297297+- `raw_record` (full JSON blob)
298298+- `witness_time` (when we indexed it)
299299+300300+then derive FTS, embeddings, etc. from local data via replay.
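
a sketch of that storage plus replay (python/sqlite; the columns follow the "would store" list above, everything else is illustrative):

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS witness_cache (
    uri          TEXT PRIMARY KEY,
    collection   TEXT NOT NULL,
    raw_record   TEXT NOT NULL,   -- full JSON blob as witnessed
    witness_time TEXT NOT NULL    -- when we first indexed the record
);
"""

def replay(db: sqlite3.Connection, derive) -> None:
    """re-derive indexes (FTS rows, embeddings, new columns) from local data
    instead of re-crawling the network."""
    for uri, collection, raw in db.execute(
        "SELECT uri, collection, raw_record FROM witness_cache"
    ):
        derive(uri, collection, json.loads(raw))
```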
301301+302302+### comparison
303303+304304+| scenario | current (no cache) | with witness cache |
305305+|----------|-------------------|-------------------|
306306+| add new parser (offprint) | re-crawl network | replay local |
307307+| leaflet adds textContent | wait for new records | replay & re-extract |
308308+| fix parsing bug | re-crawl affected | replay & re-derive |
309309+| change embedding model | re-fetch content | replay local |
310310+| add new index/table | backfill from network | replay locally |
311311+312312+### trade-offs
313313+314314+**storage cost:**
315315+- ~3500 docs × ~10KB avg = ~35MB (not huge)
316316+- turso free tier: 9GB, so plenty of room
317317+318318+**complexity:**
319319+- two-phase: store raw, then derive
320320+- vs current one-phase: derive immediately
321321+322322+**benefits for standard-search:**
323323+- could add offprint/pckt parsers and replay existing data
324324+- when leaflet migrates to standard.site, re-derive without network
325325+- embedding backfill becomes local-only: content comes from the cache instead of being re-fetched from the network (only the voyage API call for embeddings remains)
326326+327327+### implementation options
328328+329329+1. **add `raw_record TEXT` column to existing tables**
330330+ - simple, backwards compatible
331331+ - can migrate incrementally
332332+333333+2. **separate `witness_cache` table**
334334+ - `(uri PRIMARY KEY, collection, raw_record, witness_time)`
335335+ - cleaner separation of concerns
336336+ - documents/publications tables become derived views
337337+338338+3. **use duckdb/clickhouse for witness cache** (paul's suggestion)
339339+ - better compression for JSON blobs
340340+ - good for analytics queries
341341+ - adds operational complexity
342342+343343+for our scale, option 1 or 2 with turso is probably fine.
-215
docs/tap.md
···11-# tap (firehose sync)
22-33-leaflet-search uses [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.
44-55-## what is tap?
66-77-tap subscribes to the ATProto firehose, filters for specific collections (e.g., `site.standard.document`), and broadcasts matching events to websocket clients. it also does initial crawling/backfilling of existing records.
88-99-key behavior: **tap backfills historical data when repos are added**. when a repo is added to tracking:
1010-1. tap fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
1111-2. live firehose events during backfill are buffered in memory
1212-3. historical events (marked `live: false`) are delivered first
1313-4. after historical events complete, buffered live events are released
1414-5. subsequent firehose events arrive immediately marked as `live: true`
1515-1616-tap enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.
1717-1818-## message format
1919-2020-tap sends JSON messages over websocket. record events look like:
2121-2222-```json
2323-{
2424- "type": "record",
2525- "record": {
2626- "live": true,
2727- "did": "did:plc:abc123...",
2828- "rev": "3mbspmpaidl2a",
2929- "collection": "site.standard.document",
3030- "rkey": "3lzyrj6q6gs27",
3131- "action": "create",
3232- "record": { ... },
3333- "cid": "bafyrei..."
3434- }
3535-}
3636-```
3737-3838-### field types (important!)
3939-4040-| field | type | values | notes |
4141-|-------|------|--------|-------|
4242-| type | string | "record", "identity", "account" | message type |
4343-| action | **string** | "create", "update", "delete" | NOT an enum! |
4444-| live | bool | true/false | true = firehose, false = resync |
4545-| collection | string | e.g., "site.standard.document" | lexicon collection |
4646-4747-## gotchas
4848-4949-1. **action is a string, not an enum** - tap sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.
5050-5151-2. **collection filters apply during processing** - `TAP_COLLECTION_FILTERS` controls which records tap processes and sends to clients, both during live commits and resync CAR walks. records from other collections are skipped entirely.
5252-5353-3. **signal collection vs collection filters** - `TAP_SIGNAL_COLLECTION` controls auto-discovery of repos (which repos to track), while `TAP_COLLECTION_FILTERS` controls which records from those repos to output. a repo must either be auto-discovered via signal collection OR manually added via `/repos/add`.
5454-5555-4. **silent extraction failures** - if using zat's `extractAt`, enable debug logging to see why parsing fails:
5656- ```zig
5757- pub const std_options = .{
5858- .log_scope_levels = &.{.{ .scope = .zat, .level = .debug }},
5959- };
6060- ```
6161- this will show messages like:
6262- ```
6363- debug(zat): extractAt: parse failed for Op at path { "op" }: InvalidEnumTag
6464- ```
6565-6666-## memory and performance tuning
6767-6868-tap loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.
6969-7070-### recommended settings for leaflet-search
7171-7272-```toml
7373-[[vm]]
7474- memory = '2gb' # 1gb is not enough
7575-7676-[env]
7777- TAP_RESYNC_PARALLELISM = '1' # only one repo CAR in memory at a time (default: 5)
7878- TAP_FIREHOSE_PARALLELISM = '5' # concurrent event processors (default: 10)
7979- TAP_OUTBOX_CAPACITY = '10000' # event buffer size (default: 100000)
8080- TAP_IDENT_CACHE_SIZE = '10000' # identity cache entries (default: 2000000)
8181-```
8282-8383-### why these values?
8484-8585-- **2GB memory**: 1GB causes OOM kills when resyncing large repos
8686-- **resync parallelism 1**: prevents multiple large CARs in memory simultaneously
8787-- **lower firehose/outbox**: we track ~1000 repos, not millions - defaults are overkill
8888-- **smaller ident cache**: we don't need 2M cached identities
8989-9090-if tap keeps OOM'ing, check logs for large repo resyncs:
9191-```bash
9292-fly logs -a leaflet-search-tap | grep "parsing repo CAR" | grep -E "size\":[0-9]{8,}"
9393-```
9494-9595-## quick status check
9696-9797-from the `tap/` directory:
9898-```bash
9999-just check
100100-```
101101-102102-shows tap machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.
103103-104104-example output:
105105-```
106106-=== tap status ===
107107-app 781417db604d48 23 ewr started ...
108108-109109-=== Recent Indexing Activity ===
110110-Last indexed: 2026-01-08 (14 docs)
111111-Today: 2026-01-11
112112-Docs: 3742 | Pubs: 1231
113113-114114-=== Timeline (last 7 days) ===
115115-2026-01-08: 14 docs
116116-2026-01-07: 29 docs
117117-...
118118-```
119119-120120-if "Last indexed" is more than a day behind "Today", tap may be down or catching up.
121121-122122-## checking catch-up progress
123123-124124-when tap restarts after downtime, it replays the firehose from its saved cursor. to check progress:
125125-126126-```bash
127127-# see current firehose position (look for timestamps in log messages)
128128-fly logs -a leaflet-search-tap | grep -E '"time".*"seq"' | tail -3
129129-```
130130-131131-the `"time"` field in log messages shows how far behind tap is. compare to current time to estimate catch-up.
132132-133133-catch-up speed varies:
134134-- **~0.3x** when resync queue is full (large repos being fetched)
135135-- **~1x or faster** once resyncs clear
136136-137137-## debugging
138138-139139-### check tap connection
140140-```bash
141141-fly logs -a leaflet-search-tap --no-tail | tail -30
142142-```
143143-144144-look for:
145145-- `"connected to firehose"` - successfully connected to bsky relay
146146-- `"websocket connected"` - backend connected to tap
147147-- `"dialing failed"` / `"i/o timeout"` - network issues
148148-149149-### check backend is receiving
150150-```bash
151151-fly logs -a leaflet-search-backend --no-tail | grep -E "(tap|indexed)"
152152-```
153153-154154-look for:
155155-- `tap connected!` - connected to tap
156156-- `tap: msg_type=record` - receiving messages
157157-- `indexed document:` - successfully processing
158158-159159-### common issues
160160-161161-| symptom | cause | fix |
162162-|---------|-------|-----|
163163-| tap machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
164164-| `websocket handshake failed: error.Timeout` | tap not running or network issue | restart tap, check regions match |
165165-| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart tap, transient network issue |
166166-| messages received but not indexed | extraction failing (type mismatch) | enable zat debug logging, check field types |
167167-| repo shows `records: 0` after adding | resync failed or collection not in filters | check tap logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
168168-| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart tap |
169169-| indexing stopped, tap shows "started" | tap catching up from downtime | check firehose position in logs, wait for catch-up |
170170-171171-## tap API endpoints
172172-173173-tap exposes HTTP endpoints for monitoring and control:
174174-175175-| endpoint | description |
176176-|----------|-------------|
177177-| `/health` | health check |
178178-| `/stats/repo-count` | number of tracked repos |
179179-| `/stats/record-count` | total records processed |
180180-| `/stats/outbox-buffer` | events waiting to be sent |
181181-| `/stats/resync-buffer` | buffered commits for repos currently resyncing (NOT the resync queue) |
182182-| `/stats/cursors` | firehose cursor position |
183183-| `/info/:did` | repo status: `{"did":"...","state":"active","records":N}` |
184184-| `/repos/add` | POST with `{"dids":["did:plc:..."]}` to add repos |
185185-| `/repos/remove` | POST with `{"dids":["did:plc:..."]}` to remove repos |
186186-187187-example: check repo status
188188-```bash
189189-fly ssh console -a leaflet-search-tap -C "curl -s localhost:2480/info/did:plc:abc123"
190190-```
191191-192192-example: manually add a repo for backfill
193193-```bash
194194-fly ssh console -a leaflet-search-tap -C 'curl -X POST -H "Content-Type: application/json" -d "{\"dids\":[\"did:plc:abc123\"]}" localhost:2480/repos/add'
195195-```
196196-197197-## fly.io deployment
198198-199199-both tap and backend should be in the same region for internal networking:
200200-201201-```bash
202202-# check current regions
203203-fly status -a leaflet-search-tap
204204-fly status -a leaflet-search-backend
205205-206206-# restart tap if needed
207207-fly machine restart -a leaflet-search-tap <machine-id>
208208-```
209209-210210-note: changing `primary_region` in fly.toml only affects new machines. to move existing machines, clone to new region and destroy old one.
211211-212212-## references
213213-214214-- [tap source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
215215-- [ATProto firehose docs](https://atproto.com/specs/sync#firehose)
+5-5
mcp/README.md
···11-# pub search MCP
11+# leaflet-mcp
2233-MCP server for [pub search](https://pub-search.waow.tech) - search ATProto publishing platforms (Leaflet, pckt, standard.site).
33+MCP server for [Leaflet](https://leaflet.pub) - search decentralized publications on ATProto.
4455## usage
6677### hosted (recommended)
8899```bash
1010-claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
1010+claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}'
1111```
12121313### local
···1515run the MCP server locally with `uvx`:
16161717```bash
1818-uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp pub-search
1818+uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp leaflet-mcp
1919```
20202121to add it to claude code as a local stdio server:
22222323```bash
2424-claude mcp add pub-search -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' pub-search
2424+claude mcp add leaflet -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' leaflet-mcp
2525```
26262727## workflow
···11-#!/usr/bin/env python3
22-"""Test the pub-search MCP server."""
33-44-import asyncio
55-import sys
66-77-from fastmcp import Client
88-from fastmcp.client.transports import FastMCPTransport
99-1010-from pub_search.server import mcp
1111-1212-1313-async def main():
1414- # use local transport for testing, or live URL if --live flag
1515- if "--live" in sys.argv:
1616- print("testing against live Horizon server...")
1717- client = Client("https://pub-search-by-zzstoatzz.fastmcp.app/mcp")
1818- else:
1919- print("testing locally with FastMCPTransport...")
2020- client = Client(transport=FastMCPTransport(mcp))
2121-2222- async with client:
2323- # list tools
2424- print("=== tools ===")
2525- tools = await client.list_tools()
2626- for t in tools:
2727- print(f" {t.name}")
2828-2929- # test search with new platform filter
3030- print("\n=== search(query='zig', platform='leaflet', limit=3) ===")
3131- result = await client.call_tool(
3232- "search", {"query": "zig", "platform": "leaflet", "limit": 3}
3333- )
3434- for item in result.content:
3535- print(f" {item.text[:200]}...")
3636-3737- # test search with since filter
3838- print("\n=== search(query='python', since='2025-01-01', limit=2) ===")
3939- result = await client.call_tool(
4040- "search", {"query": "python", "since": "2025-01-01", "limit": 2}
4141- )
4242- for item in result.content:
4343- print(f" {item.text[:200]}...")
4444-4545- # test get_tags
4646- print("\n=== get_tags() ===")
4747- result = await client.call_tool("get_tags", {})
4848- for item in result.content:
4949- print(f" {item.text[:150]}...")
5050-5151- # test get_stats
5252- print("\n=== get_stats() ===")
5353- result = await client.call_tool("get_stats", {})
5454- for item in result.content:
5555- print(f" {item.text}")
5656-5757- # test get_popular
5858- print("\n=== get_popular(limit=3) ===")
5959- result = await client.call_tool("get_popular", {"limit": 3})
6060- for item in result.content:
6161- print(f" {item.text[:100]}...")
6262-6363- print("\n=== all tests passed ===")
6464-6565-6666-if __name__ == "__main__":
6767- asyncio.run(main())
+5
mcp/src/leaflet_mcp/__init__.py
···11+"""Leaflet MCP server - search decentralized publications on ATProto."""
22+33+from leaflet_mcp.server import main, mcp
44+55+__all__ = ["main", "mcp"]
+58
mcp/src/leaflet_mcp/_types.py
···11+"""Type definitions for Leaflet MCP responses."""
22+33+from typing import Literal
44+55+from pydantic import BaseModel, computed_field
66+77+88+class SearchResult(BaseModel):
99+ """A search result from the Leaflet API."""
1010+1111+ type: Literal["article", "looseleaf", "publication"]
1212+ uri: str
1313+ did: str
1414+ title: str
1515+ snippet: str
1616+ createdAt: str = ""
1717+ rkey: str
1818+ basePath: str = ""
1919+2020+ @computed_field
2121+ @property
2222+ def url(self) -> str:
2323+ """web URL for this document."""
2424+ if self.basePath:
2525+ return f"https://{self.basePath}/{self.rkey}"
2626+ return ""
2727+2828+2929+class Tag(BaseModel):
3030+ """A tag with document count."""
3131+3232+ tag: str
3333+ count: int
3434+3535+3636+class PopularSearch(BaseModel):
3737+ """A popular search query with count."""
3838+3939+ query: str
4040+ count: int
4141+4242+4343+class Stats(BaseModel):
4444+ """Leaflet index statistics."""
4545+4646+ documents: int
4747+ publications: int
4848+4949+5050+class Document(BaseModel):
5151+ """Full document content from ATProto."""
5252+5353+ uri: str
5454+ title: str
5555+ content: str
5656+ createdAt: str = ""
5757+ tags: list[str] = []
5858+ publicationUri: str = ""
+21
mcp/src/leaflet_mcp/client.py
···11+"""HTTP client for Leaflet search API."""
22+33+import os
44+from contextlib import asynccontextmanager
55+from typing import AsyncIterator
66+77+import httpx
88+99+# configurable via env var, defaults to production
1010+LEAFLET_API_URL = os.getenv("LEAFLET_API_URL", "https://leaflet-search-backend.fly.dev")
1111+1212+1313+@asynccontextmanager
1414+async def get_http_client() -> AsyncIterator[httpx.AsyncClient]:
1515+ """Get an async HTTP client for Leaflet API requests."""
1616+ async with httpx.AsyncClient(
1717+ base_url=LEAFLET_API_URL,
1818+ timeout=30.0,
1919+ headers={"Accept": "application/json"},
2020+ ) as client:
2121+ yield client
+289
mcp/src/leaflet_mcp/server.py
···11+"""Leaflet MCP server implementation using fastmcp."""
22+33+from __future__ import annotations
44+55+from typing import Any
66+77+from fastmcp import FastMCP
88+99+from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag
1010+from leaflet_mcp.client import get_http_client
1111+1212+mcp = FastMCP("leaflet")
1313+1414+1515+# -----------------------------------------------------------------------------
1616+# prompts
1717+# -----------------------------------------------------------------------------
1818+1919+2020+@mcp.prompt("usage_guide")
2121+def usage_guide() -> str:
2222+ """instructions for using leaflet MCP tools."""
2323+ return """\
2424+# Leaflet MCP server usage guide
2525+2626+Leaflet is a decentralized publishing platform on ATProto (the protocol behind Bluesky).
2727+This MCP server provides search and discovery tools for Leaflet publications.
2828+2929+## core tools
3030+3131+- `search(query, tag)` - search documents and publications by text or tag
3232+- `get_document(uri)` - get the full content of a document by its AT-URI
3333+- `find_similar(uri)` - find documents similar to a given document
3434+- `get_tags()` - list all available tags with document counts
3535+- `get_stats()` - get index statistics (document/publication counts)
3636+- `get_popular()` - see popular search queries
3737+3838+## workflow for research
3939+4040+1. use `search("your topic")` to find relevant documents
4141+2. use `get_document(uri)` to retrieve full content of interesting results
4242+3. use `find_similar(uri)` to discover related content
4343+4444+## result types
4545+4646+search returns three types of results:
4747+- **publication**: a collection of articles (like a blog or magazine)
4848+- **article**: a document that belongs to a publication
4949+- **looseleaf**: a standalone document not part of a publication
5050+5151+## AT-URIs
5252+5353+documents are identified by AT-URIs like:
5454+ `at://did:plc:abc123/pub.leaflet.document/xyz789`
5555+5656+you can also browse documents on the web at leaflet.pub
5757+"""
5858+5959+6060+@mcp.prompt("search_tips")
6161+def search_tips() -> str:
6262+ """tips for effective searching."""
6363+ return """\
6464+# Leaflet search tips
6565+6666+## text search
6767+- searches both document titles and content
6868+- uses FTS5 full-text search with prefix matching
6969+- the last word gets prefix matching: "cat dog" matches "cat dogs"
7070+7171+## tag filtering
7272+- combine text search with tag filter: `search("python", tag="programming")`
7373+- use `get_tags()` to discover available tags
7474+- tags are only applied to documents, not publications
7575+7676+## finding related content
7777+- after finding an interesting document, use `find_similar(uri)`
7878+- similarity is based on semantic embeddings (voyage-3-lite)
7979+- great for exploring related topics
8080+8181+## browsing by popularity
8282+- use `get_popular()` to see what others are searching for
8383+- can inspire new research directions
8484+"""
8585+8686+8787+# -----------------------------------------------------------------------------
8888+# tools
8989+# -----------------------------------------------------------------------------
9090+9191+9292+@mcp.tool
9393+async def search(
9494+ query: str = "",
9595+ tag: str | None = None,
9696+ limit: int = 5,
9797+) -> list[SearchResult]:
9898+ """search leaflet documents and publications.
9999+100100+ searches the full text of documents (titles and content) and publications.
101101+ results include a snippet showing where the match was found.
102102+103103+ args:
104104+ query: search query (searches titles and content)
105105+ tag: optional tag to filter by (only applies to documents)
106106+ limit: max results to return (default 5, max 40)
107107+108108+ returns:
109109+ list of search results with uri, title, snippet, and metadata
110110+ """
111111+ if not query and not tag:
112112+ return []
113113+114114+ params: dict[str, Any] = {}
115115+ if query:
116116+ params["q"] = query
117117+ if tag:
118118+ params["tag"] = tag
119119+120120+ async with get_http_client() as client:
121121+ response = await client.get("/search", params=params)
122122+ response.raise_for_status()
123123+ results = response.json()
124124+125125+ # apply client-side limit since API returns up to 40
126126+ return [SearchResult(**r) for r in results[:limit]]
127127+128128+129129+@mcp.tool
130130+async def get_document(uri: str) -> Document:
131131+ """get the full content of a document by its AT-URI.
132132+133133+ fetches the complete document from ATProto, including full text content.
134134+ use this after finding documents via search to get the complete text.
135135+136136+ args:
137137+ uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...)
138138+139139+ returns:
140140+ document with full content, title, tags, and metadata
141141+ """
142142+ # use pdsx to fetch the actual record from ATProto
143143+ try:
144144+ from pdsx._internal.operations import get_record
145145+ from pdsx.mcp.client import get_atproto_client
146146+ except ImportError as e:
147147+ raise RuntimeError(
148148+ "pdsx is required for fetching full documents. install with: uv add pdsx"
149149+ ) from e
150150+151151+ # extract repo from URI for PDS discovery
152152+ # at://did:plc:xxx/collection/rkey
153153+ parts = uri.replace("at://", "").split("/")
154154+ if len(parts) < 3:
155155+ raise ValueError(f"invalid AT-URI: {uri}")
156156+157157+ repo = parts[0]
158158+159159+ async with get_atproto_client(target_repo=repo) as client:
160160+ record = await get_record(client, uri)
161161+162162+ value = record.value
163163+ # DotDict doesn't have a working .get(), convert to dict first
164164+ if hasattr(value, "to_dict") and callable(value.to_dict):
165165+ value = value.to_dict()
166166+ elif not isinstance(value, dict):
167167+ value = dict(value)
168168+169169+ # extract content from leaflet's block structure
170170+ # pages[].blocks[].block.plaintext
171171+ content_parts = []
172172+ for page in value.get("pages", []):
173173+ for block_wrapper in page.get("blocks", []):
174174+ block = block_wrapper.get("block", {})
175175+ plaintext = block.get("plaintext", "")
176176+ if plaintext:
177177+ content_parts.append(plaintext)
178178+179179+ content = "\n\n".join(content_parts)
180180+181181+ return Document(
182182+ uri=record.uri,
183183+ title=value.get("title", ""),
184184+ content=content,
185185+ createdAt=value.get("publishedAt", "") or value.get("createdAt", ""),
186186+ tags=value.get("tags", []),
187187+ publicationUri=value.get("publication", ""),
188188+ )
189189+190190+191191+@mcp.tool
192192+async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]:
193193+ """find documents similar to a given document.
194194+195195+ uses vector similarity (voyage-3-lite embeddings) to find semantically
196196+ related documents. great for discovering related content after finding
197197+ an interesting document.
198198+199199+ args:
200200+ uri: the AT-URI of the document to find similar content for
201201+ limit: max similar documents to return (default 5)
202202+203203+ returns:
204204+ list of similar documents with uri, title, and metadata
205205+ """
206206+ async with get_http_client() as client:
207207+ response = await client.get("/similar", params={"uri": uri})
208208+ response.raise_for_status()
209209+ results = response.json()
210210+211211+ return [SearchResult(**r) for r in results[:limit]]
212212+213213+214214+@mcp.tool
215215+async def get_tags() -> list[Tag]:
216216+ """list all available tags with document counts.
217217+218218+ returns tags sorted by document count (most popular first).
219219+ useful for discovering topics and filtering searches.
220220+221221+ returns:
222222+ list of tags with their document counts
223223+ """
224224+ async with get_http_client() as client:
225225+ response = await client.get("/tags")
226226+ response.raise_for_status()
227227+ results = response.json()
228228+229229+ return [Tag(**t) for t in results]
230230+231231+232232+@mcp.tool
233233+async def get_stats() -> Stats:
234234+ """get leaflet index statistics.
235235+236236+ returns:
237237+ document and publication counts
238238+ """
239239+ async with get_http_client() as client:
240240+ response = await client.get("/stats")
241241+ response.raise_for_status()
242242+ return Stats(**response.json())
243243+244244+245245+@mcp.tool
246246+async def get_popular(limit: int = 5) -> list[PopularSearch]:
247247+ """get popular search queries.
248248+249249+ see what others are searching for on leaflet.
250250+ can inspire new research directions.
251251+252252+ args:
253253+ limit: max queries to return (default 5)
254254+255255+ returns:
256256+ list of popular queries with search counts
257257+ """
258258+ async with get_http_client() as client:
259259+ response = await client.get("/popular")
260260+ response.raise_for_status()
261261+ results = response.json()
262262+263263+ return [PopularSearch(**p) for p in results[:limit]]
264264+265265+266266+# -----------------------------------------------------------------------------
267267+# resources
268268+# -----------------------------------------------------------------------------
269269+270270+271271+@mcp.resource("leaflet://stats")
272272+async def stats_resource() -> str:
273273+ """current leaflet index statistics."""
274274+ stats = await get_stats()
275275+ return f"Leaflet index: {stats.documents} documents, {stats.publications} publications"
276276+277277+278278+# -----------------------------------------------------------------------------
279279+# entrypoint
280280+# -----------------------------------------------------------------------------
281281+282282+283283+def main() -> None:
284284+ """run the MCP server."""
285285+ mcp.run()
286286+287287+288288+if __name__ == "__main__":
289289+ main()
-5
mcp/src/pub_search/__init__.py
···11-"""MCP server for searching ATProto publishing platforms."""
22-33-from pub_search.server import main, mcp
44-55-__all__ = ["main", "mcp"]
-59
mcp/src/pub_search/_types.py
···11-"""Type definitions for Leaflet MCP responses."""
22-33-from typing import Literal
44-55-from pydantic import BaseModel, computed_field
66-77-88-class SearchResult(BaseModel):
99- """A search result from the Leaflet API."""
1010-1111- type: Literal["article", "looseleaf", "publication"]
1212- uri: str
1313- did: str
1414- title: str
1515- snippet: str
1616- createdAt: str = ""
1717- rkey: str
1818- basePath: str = ""
1919- platform: Literal["leaflet", "pckt", "offprint", "greengale", "other"] = "leaflet"
2020-2121- @computed_field
2222- @property
2323- def url(self) -> str:
2424- """web URL for this document."""
2525- if self.basePath:
2626- return f"https://{self.basePath}/{self.rkey}"
2727- return ""
2828-2929-3030-class Tag(BaseModel):
3131- """A tag with document count."""
3232-3333- tag: str
3434- count: int
3535-3636-3737-class PopularSearch(BaseModel):
3838- """A popular search query with count."""
3939-4040- query: str
4141- count: int
4242-4343-4444-class Stats(BaseModel):
4545- """Leaflet index statistics."""
4646-4747- documents: int
4848- publications: int
4949-5050-5151-class Document(BaseModel):
5252- """Full document content from ATProto."""
5353-5454- uri: str
5555- title: str
5656- content: str
5757- createdAt: str = ""
5858- tags: list[str] = []
5959- publicationUri: str = ""
-21
mcp/src/pub_search/client.py
···11-"""HTTP client for leaflet-search API."""
22-33-import os
44-from contextlib import asynccontextmanager
55-from typing import AsyncIterator
66-77-import httpx
88-99-# configurable via env var, defaults to production
1010-API_URL = os.getenv("LEAFLET_SEARCH_API_URL", "https://leaflet-search-backend.fly.dev")
1111-1212-1313-@asynccontextmanager
1414-async def get_http_client() -> AsyncIterator[httpx.AsyncClient]:
1515- """Get an async HTTP client for API requests."""
1616- async with httpx.AsyncClient(
1717- base_url=API_URL,
1818- timeout=30.0,
1919- headers={"Accept": "application/json"},
2020- ) as client:
2121- yield client
-276
mcp/src/pub_search/server.py
···11-"""MCP server for searching ATProto publishing platforms."""
22-33-from __future__ import annotations
44-55-from typing import Any, Literal
66-77-from fastmcp import FastMCP
88-99-from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag
1010-from pub_search.client import get_http_client
1111-1212-mcp = FastMCP("pub-search")
1313-1414-1515-# -----------------------------------------------------------------------------
1616-# prompts
1717-# -----------------------------------------------------------------------------
1818-1919-2020-@mcp.prompt("usage_guide")
2121-def usage_guide() -> str:
2222- """instructions for using pub-search MCP tools."""
2323- return """\
2424-# pub-search MCP
2525-2626-search ATProto publishing platforms: leaflet, pckt, offprint, greengale.
2727-2828-## tools
2929-3030-- `search(query, tag, platform, since)` - full-text search with filters
3131-- `get_document(uri)` - fetch full content by AT-URI
3232-- `find_similar(uri)` - semantic similarity search
3333-- `get_tags()` - available tags
3434-- `get_stats()` - index statistics
3535-- `get_popular()` - popular queries
3636-3737-## workflow
3838-3939-1. `search("topic")` or `search("topic", platform="leaflet")`
4040-2. `get_document(uri)` for full text
4141-3. `find_similar(uri)` for related content
4242-4343-## result types
4444-4545-- **article**: document in a publication
4646-- **looseleaf**: standalone document
4747-- **publication**: the publication itself
4848-4949-results include a `url` field for web access.
5050-"""
5151-5252-5353-@mcp.prompt("search_tips")
5454-def search_tips() -> str:
5555- """tips for effective searching."""
5656- return """\
5757-# search tips
5858-5959-- prefix matching on last word: "cat dog" matches "cat dogs"
6060-- combine filters: `search("python", tag="tutorial", platform="leaflet")`
6161-- use `since="2025-01-01"` for recent content
6262-- `find_similar(uri)` for semantic similarity (voyage-3-lite embeddings)
6363-- `get_tags()` to discover available tags
6464-"""
6565-6666-6767-# -----------------------------------------------------------------------------
6868-# tools
6969-# -----------------------------------------------------------------------------
7070-7171-7272-Platform = Literal["leaflet", "pckt", "offprint", "greengale", "other"]
7373-7474-7575-@mcp.tool
7676-async def search(
7777- query: str = "",
7878- tag: str | None = None,
7979- platform: Platform | None = None,
8080- since: str | None = None,
8181- limit: int = 5,
8282-) -> list[SearchResult]:
8383- """search documents and publications.
8484-8585- args:
8686- query: search query (titles and content)
8787- tag: filter by tag
8888- platform: filter by platform (leaflet, pckt, offprint, greengale, other)
8989- since: ISO date - only documents created after this date
9090- limit: max results (default 5, max 40)
9191-9292- returns:
9393- list of results with uri, title, snippet, platform, and web url
9494- """
9595- if not query and not tag:
9696- return []
9797-9898- params: dict[str, Any] = {}
9999- if query:
100100- params["q"] = query
101101- if tag:
102102- params["tag"] = tag
103103- if platform:
104104- params["platform"] = platform
105105- if since:
106106- params["since"] = since
107107-108108- async with get_http_client() as client:
109109- response = await client.get("/search", params=params)
110110- response.raise_for_status()
111111- results = response.json()
112112-113113- return [SearchResult(**r) for r in results[:limit]]
114114-115115-116116-@mcp.tool
117117-async def get_document(uri: str) -> Document:
118118- """get the full content of a document by its AT-URI.
119119-120120- fetches the complete document from ATProto, including full text content.
121121- use this after finding documents via search to get the complete text.
122122-123123- args:
124124- uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...)
125125-126126- returns:
127127- document with full content, title, tags, and metadata
128128- """
129129- # use pdsx to fetch the actual record from ATProto
130130- try:
131131- from pdsx._internal.operations import get_record
132132- from pdsx.mcp.client import get_atproto_client
133133- except ImportError as e:
134134- raise RuntimeError(
135135- "pdsx is required for fetching full documents. install with: uv add pdsx"
136136- ) from e
137137-138138- # extract repo from URI for PDS discovery
139139- # at://did:plc:xxx/collection/rkey
140140- parts = uri.replace("at://", "").split("/")
141141- if len(parts) < 3:
142142- raise ValueError(f"invalid AT-URI: {uri}")
143143-144144- repo = parts[0]
145145-146146- async with get_atproto_client(target_repo=repo) as client:
147147- record = await get_record(client, uri)
148148-149149- value = record.value
150150- # DotDict doesn't have a working .get(), convert to dict first
151151- if hasattr(value, "to_dict") and callable(value.to_dict):
152152- value = value.to_dict()
153153- elif not isinstance(value, dict):
154154- value = dict(value)
155155-156156- # extract content from leaflet's block structure
157157- # pages[].blocks[].block.plaintext
158158- content_parts = []
159159- for page in value.get("pages", []):
160160- for block_wrapper in page.get("blocks", []):
161161- block = block_wrapper.get("block", {})
162162- plaintext = block.get("plaintext", "")
163163- if plaintext:
164164- content_parts.append(plaintext)
165165-166166- content = "\n\n".join(content_parts)
167167-168168- return Document(
169169- uri=record.uri,
170170- title=value.get("title", ""),
171171- content=content,
172172- createdAt=value.get("publishedAt", "") or value.get("createdAt", ""),
173173- tags=value.get("tags", []),
174174- publicationUri=value.get("publication", ""),
175175- )
176176-177177-178178-@mcp.tool
179179-async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]:
180180- """find documents similar to a given document.
181181-182182- uses vector similarity (voyage-3-lite embeddings) to find semantically
183183- related documents. great for discovering related content after finding
184184- an interesting document.
185185-186186- args:
187187- uri: the AT-URI of the document to find similar content for
188188- limit: max similar documents to return (default 5)
189189-190190- returns:
191191- list of similar documents with uri, title, and metadata
192192- """
193193- async with get_http_client() as client:
194194- response = await client.get("/similar", params={"uri": uri})
195195- response.raise_for_status()
196196- results = response.json()
197197-198198- return [SearchResult(**r) for r in results[:limit]]
199199-200200-201201-@mcp.tool
202202-async def get_tags() -> list[Tag]:
203203- """list all available tags with document counts.
204204-205205- returns tags sorted by document count (most popular first).
206206- useful for discovering topics and filtering searches.
207207-208208- returns:
209209- list of tags with their document counts
210210- """
211211- async with get_http_client() as client:
212212- response = await client.get("/tags")
213213- response.raise_for_status()
214214- results = response.json()
215215-216216- return [Tag(**t) for t in results]
217217-218218-219219-@mcp.tool
220220-async def get_stats() -> Stats:
221221- """get index statistics.
222222-223223- returns:
224224- document and publication counts
225225- """
226226- async with get_http_client() as client:
227227- response = await client.get("/stats")
228228- response.raise_for_status()
229229- return Stats(**response.json())
230230-231231-232232-@mcp.tool
233233-async def get_popular(limit: int = 5) -> list[PopularSearch]:
234234- """get popular search queries.
235235-236236- see what others are searching for.
237237- can inspire new research directions.
238238-239239- args:
240240- limit: max queries to return (default 5)
241241-242242- returns:
243243- list of popular queries with search counts
244244- """
245245- async with get_http_client() as client:
246246- response = await client.get("/popular")
247247- response.raise_for_status()
248248- results = response.json()
249249-250250- return [PopularSearch(**p) for p in results[:limit]]
251251-252252-253253-# -----------------------------------------------------------------------------
254254-# resources
255255-# -----------------------------------------------------------------------------
256256-257257-258258-@mcp.resource("pub-search://stats")
259259-async def stats_resource() -> str:
260260- """current index statistics."""
261261- stats = await get_stats()
262262- return f"pub search index: {stats.documents} documents, {stats.publications} publications"
263263-264264-265265-# -----------------------------------------------------------------------------
266266-# entrypoint
267267-# -----------------------------------------------------------------------------
268268-269269-270270-def main() -> None:
271271- """run the MCP server."""
272272- mcp.run()
273273-274274-275275-if __name__ == "__main__":
276276- main()
+9-12
mcp/tests/test_mcp.py
···11-"""tests for pub-search MCP server."""
11+"""tests for leaflet MCP server."""
2233import pytest
44from mcp.types import TextContent
···66from fastmcp.client import Client
77from fastmcp.client.transports import FastMCPTransport
8899-from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag
1010-from pub_search.server import mcp
99+from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag
1010+from leaflet_mcp.server import mcp
111112121313class TestTypes:
···2323 snippet="this is a test...",
2424 createdAt="2025-01-01T00:00:00Z",
2525 rkey="123",
2626- basePath="gyst.leaflet.pub",
2727- platform="leaflet",
2626+ basePath="/blog",
2827 )
2928 assert r.type == "article"
3029 assert r.uri == "at://did:plc:abc/pub.leaflet.document/123"
3130 assert r.title == "test article"
3232- assert r.platform == "leaflet"
3333- assert r.url == "https://gyst.leaflet.pub/123"
34313532 def test_search_result_looseleaf(self):
3633 """SearchResult supports looseleaf type."""
···96939794 def test_mcp_server_imports(self):
9895 """mcp server can be imported without errors."""
9999- from pub_search import mcp
9696+ from leaflet_mcp import mcp
10097101101- assert mcp.name == "pub-search"
9898+ assert mcp.name == "leaflet"
10299103100 def test_exports(self):
104101 """all expected exports are available."""
105105- from pub_search import main, mcp
102102+ from leaflet_mcp import main, mcp
106103107104 assert mcp is not None
108105 assert main is not None
···141138 resources = await client.list_resources()
142139143140 resource_uris = {str(r.uri) for r in resources}
144144- assert "pub-search://stats" in resource_uris
141141+ assert "leaflet://stats" in resource_uris
145142146143 async def test_usage_guide_prompt_content(self, client):
147144 """usage_guide prompt returns helpful content."""
···151148 assert len(result.messages) > 0
152149 content = result.messages[0].content
153150 assert isinstance(content, TextContent)
154154- assert "pub-search" in content.text
151151+ assert "Leaflet" in content.text
155152 assert "search" in content.text
156153157154 async def test_search_tips_prompt_content(self, client):