
+27
.github/workflows/deploy-backend.yml
···
+ name: Deploy Backend
+ 
+ on:
+   push:
+     branches: [main]
+     paths:
+       - 'backend/**'
+ 
+ concurrency:
+   group: backend-deploy
+   cancel-in-progress: true
+ 
+ env:
+   FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
+ 
+ jobs:
+   deploy:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+ 
+       - name: Setup flyctl
+         uses: superfly/flyctl-actions/setup-flyctl@master
+ 
+       - name: Deploy to Fly.io
+         working-directory: backend
+         run: flyctl deploy --remote-only
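once this workflow is on `main`, any push touching `backend/**` triggers a deploy; a quick way to check a run from the command line, as a sketch (assumes the GitHub CLI and flyctl are installed, and that the backend Fly app is named `leaflet-search-backend`, which this diff does not show):

```bash
# trigger: any push to main that touches backend/**
git push github main

# watch the workflow run (GitHub CLI)
gh run list --workflow "Deploy Backend" --limit 3

# confirm the app is healthy on Fly.io (app name assumed)
flyctl status --app leaflet-search-backend
```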
+30
CLAUDE.md
···
+ # leaflet-search notes
+ 
+ ## deployment
+ - **backend**: push to `main` touching `backend/**` → auto-deploys via GitHub Actions
+ - **frontend**: manual deploy only (`wrangler pages deploy site --project-name leaflet-search`)
+ - **tap**: manual deploy from `tap/` directory (`fly deploy --app leaflet-search-tap`)
+ 
+ ## remotes
+ - `origin`: tangled.sh:zzstoatzz.io/leaflet-search
+ - `github`: github.com/zzstoatzz/leaflet-search (CI runs here)
+ - push to both: `git push origin main && git push github main`
+ 
+ ## architecture
+ - **backend** (Zig): HTTP API, FTS5 search, vector similarity
+ - **tap**: firehose sync via bluesky-social/indigo tap
+ - **site**: static frontend on Cloudflare Pages
+ - **db**: Turso (SQLite) - FTS5 + embeddings
+ 
+ ## search ranking
+ - hybrid BM25 + recency: `ORDER BY rank + (days_old / 30)`
+ - OR between terms for recall, prefix on last word
+ - unicode61 tokenizer (non-alphanumeric = separator)
+ 
+ ## tap operations
+ - from `tap/` directory: `just check` (status), `just turbo` (catch-up), `just normal` (steady state)
+ - see `docs/tap.md` for memory tuning and debugging
+ 
+ ## common tasks
+ - backfill embeddings: `./scripts/backfill-embeddings`
+ - check indexing: `curl -s https://leaflet-search-backend.fly.dev/api/dashboard | jq`
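the notes above add up to one routine release flow; a sketch of a full deploy using only the commands already listed in the file:

```bash
# push to both remotes; the GitHub push triggers the backend deploy workflow
git push origin main && git push github main

# frontend is manual (Cloudflare Pages)
wrangler pages deploy site --project-name leaflet-search

# tap is manual, from the tap/ directory; then confirm it is healthy
cd tap && fly deploy --app leaflet-search-tap && just check
```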
+34 -13
README.md
···
- # leaflet-search
+ # pub search
  
  by [@zzstoatzz.io](https://bsky.app/profile/zzstoatzz.io)
  
- search for [leaflet](https://leaflet.pub).
+ search ATProto publishing platforms ([leaflet](https://leaflet.pub), [pckt](https://pckt.blog), and others using [standard.site](https://standard.site)).
  
- **live:** [leaflet-search.pages.dev](https://leaflet-search.pages.dev)
+ **live:** [pub-search.waow.tech](https://pub-search.waow.tech)
+ 
+ > formerly "leaflet-search" - generalized to support multiple publishing platforms
  
  ## how it works
  
- 1. **tap** syncs leaflet content from the network
+ 1. **tap** syncs content from ATProto firehose (signals on `pub.leaflet.document`, filters `pub.leaflet.*` + `site.standard.*`)
  2. **backend** indexes content into SQLite FTS5 via [Turso](https://turso.tech), serves search API
  3. **site** static frontend on Cloudflare Pages
···
  search is also exposed as an MCP server for AI agents like Claude Code:
  
  ```bash
- claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}'
+ claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
  ```
  
  see [mcp/README.md](mcp/README.md) for local setup and usage details.
···
  ## api
  
  ```
- GET /search?q=<query>&tag=<tag>   # full-text search with query, tag, or both
- GET /similar?uri=<at-uri>         # find similar documents via vector embeddings
- GET /tags                         # list all tags with counts
- GET /popular                      # popular search queries
- GET /stats                        # document/publication counts
- GET /health                       # health check
+ GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date>  # full-text search
+ GET /similar?uri=<at-uri>                                         # find similar documents
+ GET /tags                                                          # list all tags with counts
+ GET /popular                                                       # popular search queries
+ GET /stats                                                         # document/publication counts
+ GET /health                                                        # health check
  ```
  
- search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). tag filtering applies to documents only.
+ search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, etc). tag and platform filtering apply to documents only.
+ 
+ **ranking**: results use hybrid BM25 + recency scoring. text relevance is primary, but recent documents get a boost (~1 point per 30 days). the `since` parameter filters to documents created after the given ISO date (e.g., `since=2025-01-01`).
  
  `/similar` uses [Voyage AI](https://voyageai.com) embeddings with brute-force cosine similarity (~0.15s for 3500 docs).
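as a usage sketch of the endpoints and filters described above (the hostname is the backend URL used elsewhere in this repo, `leaflet-search-backend.fly.dev`; the AT-URI below is a placeholder):

```bash
# hybrid BM25 + recency search, restricted to one platform and to recent documents
curl -s 'https://leaflet-search-backend.fly.dev/search?q=zig+http+server&platform=leaflet&since=2025-01-01' | jq '.[0]'

# tag-only browse (documents only)
curl -s 'https://leaflet-search-backend.fly.dev/search?tag=atproto' | jq 'length'

# nearest neighbours by embedding for a known document (example URI)
curl -s 'https://leaflet-search-backend.fly.dev/similar?uri=at://did:plc:example/pub.leaflet.document/abc123' | jq
```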
  
+ ## configuration
+ 
+ the backend is fully configurable via environment variables:
+ 
+ | variable | default | description |
+ |----------|---------|-------------|
+ | `APP_NAME` | `leaflet-search` | name shown in startup logs |
+ | `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
+ | `TAP_HOST` | `leaflet-search-tap.fly.dev` | tap websocket host |
+ | `TAP_PORT` | `443` | tap websocket port |
+ | `PORT` | `3000` | HTTP server port |
+ | `TURSO_URL` | - | Turso database URL (required) |
+ | `TURSO_TOKEN` | - | Turso auth token (required) |
+ | `VOYAGE_API_KEY` | - | Voyage AI API key (for embeddings) |
+ 
+ the backend indexes multiple ATProto platforms - currently `pub.leaflet.*` and `site.standard.*` collections. platform is stored per-document and returned in search results.
+ 
  ## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)
  
  - [Fly.io](https://fly.io) hosts backend + tap
  - [Turso](https://turso.tech) cloud SQLite with vector support
  - [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
- - [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs leaflet content from ATProto firehose
+ - [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
  - [Zig](https://ziglang.org) HTTP server, search API, content indexing
  - [Cloudflare Pages](https://pages.cloudflare.com) static frontend
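the configuration table maps directly onto a local or alternate deployment; a hedged sketch of running the backend with those variables (the `zig build` step and the binary path are assumptions, not shown in this diff):

```bash
# required: Turso credentials; everything else falls back to the defaults in the table
export TURSO_URL='libsql://<your-db>.turso.io'
export TURSO_TOKEN='<token>'
export VOYAGE_API_KEY='<key>'   # optional, enables /similar embeddings
export APP_NAME=pub-search
export PORT=3000

cd backend && zig build && ./zig-out/bin/backend   # binary name assumed
```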
+2 -2
backend/build.zig.zon
···
          .hash = "zql-0.0.1-alpha-xNRI4IRNAABUb9gLat5FWUaZDD5HvxAxet_-elgR_A_y",
      },
      .zat = .{
-         .url = "https://tangled.sh/zzstoatzz.io/zat/archive/main",
-         .hash = "zat-0.1.0-5PuC7ntmAQA9_8rALQwWad2riXWTY9p_ohVOD54_Y-2c",
+         .url = "https://tangled.sh/zat.dev/zat/archive/main",
+         .hash = "zat-0.1.0-5PuC7heIAQA4j2UVmJT-oivQh5AwZTrFQ-NC4CJi2-_R",
      },
  },
  .paths = .{
+27 -18
backend/src/dashboard.zig
···
  const TagJson = struct { tag: []const u8, count: i64 };
  const TimelineJson = struct { date: []const u8, count: i64 };
  const PubJson = struct { name: []const u8, basePath: []const u8, count: i64 };
+ const PlatformJson = struct { platform: []const u8, count: i64 };
  
  /// All data needed to render the dashboard
  pub const Data = struct {
      started_at: i64,
      searches: i64,
      publications: i64,
-     articles: i64,
-     looseleafs: i64,
+     documents: i64,
      tags_json: []const u8,
      timeline_json: []const u8,
      top_pubs_json: []const u8,
+     platforms_json: []const u8,
  };
  
  // all dashboard queries batched into one request
···
  \\  (SELECT service_started_at FROM stats WHERE id = 1) as started_at
  ;
  
- const DOC_TYPES_SQL =
-     \\SELECT
-     \\  SUM(CASE WHEN publication_uri != '' THEN 1 ELSE 0 END) as articles,
-     \\  SUM(CASE WHEN publication_uri = '' OR publication_uri IS NULL THEN 1 ELSE 0 END) as looseleafs
+ const PLATFORMS_SQL =
+     \\SELECT platform, COUNT(*) as count
      \\FROM documents
+     \\GROUP BY platform
+     \\ORDER BY count DESC
  ;
  
  const TAGS_SQL =
···
  // batch all 5 queries into one HTTP request
  var batch = client.queryBatch(&.{
      .{ .sql = STATS_SQL },
-     .{ .sql = DOC_TYPES_SQL },
+     .{ .sql = PLATFORMS_SQL },
      .{ .sql = TAGS_SQL },
      .{ .sql = TIMELINE_SQL },
      .{ .sql = TOP_PUBS_SQL },
···
  const started_at = if (stats_row) |r| r.int(4) else 0;
  const searches = if (stats_row) |r| r.int(2) else 0;
  const publications = if (stats_row) |r| r.int(1) else 0;
- 
- // extract doc types (query 1)
- const doc_row = batch.getFirst(1);
- const articles = if (doc_row) |r| r.int(0) else 0;
- const looseleafs = if (doc_row) |r| r.int(1) else 0;
+ const documents = if (stats_row) |r| r.int(0) else 0;
  
  return .{
      .started_at = started_at,
      .searches = searches,
      .publications = publications,
-     .articles = articles,
-     .looseleafs = looseleafs,
+     .documents = documents,
      .tags_json = try formatTagsJson(alloc, batch.get(2)),
      .timeline_json = try formatTimelineJson(alloc, batch.get(3)),
      .top_pubs_json = try formatPubsJson(alloc, batch.get(4)),
+     .platforms_json = try formatPlatformsJson(alloc, batch.get(1)),
  };
  }
···
      return try output.toOwnedSlice();
  }
  
+ fn formatPlatformsJson(alloc: Allocator, rows: []const db.Row) ![]const u8 {
+     var output: std.Io.Writer.Allocating = .init(alloc);
+     errdefer output.deinit();
+     var jw: json.Stringify = .{ .writer = &output.writer };
+     try jw.beginArray();
+     for (rows) |row| try jw.write(PlatformJson{ .platform = row.text(0), .count = row.int(1) });
+     try jw.endArray();
+     return try output.toOwnedSlice();
+ }
+ 
  /// Generate dashboard data as JSON for API endpoint
  pub fn toJson(alloc: Allocator, data: Data) ![]const u8 {
      var output: std.Io.Writer.Allocating = .init(alloc);
···
      try jw.objectField("publications");
      try jw.write(data.publications);
  
-     try jw.objectField("articles");
-     try jw.write(data.articles);
+     try jw.objectField("documents");
+     try jw.write(data.documents);
  
-     try jw.objectField("looseleafs");
-     try jw.write(data.looseleafs);
+     try jw.objectField("platforms");
+     try jw.beginWriteRaw();
+     try jw.writer.writeAll(data.platforms_json);
+     jw.endWriteRaw();
  
      // use beginWriteRaw/endWriteRaw for pre-formatted JSON arrays
      try jw.objectField("tags");
+39 -1
backend/src/db/schema.zig
···
  \\CREATE VIRTUAL TABLE IF NOT EXISTS publications_fts USING fts5(
  \\  uri UNINDEXED,
  \\  name,
- \\  description
+ \\  description,
+ \\  base_path
  \\)
  , &.{});
···
  client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
  client.exec("UPDATE documents SET source_collection = 'pub.leaflet.document' WHERE source_collection IS NULL", &.{}) catch {};
  
+ // multi-platform support for publications
+ client.exec("ALTER TABLE publications ADD COLUMN platform TEXT DEFAULT 'leaflet'", &.{}) catch {};
+ client.exec("ALTER TABLE publications ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.publication'", &.{}) catch {};
+ client.exec("UPDATE publications SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
+ client.exec("UPDATE publications SET source_collection = 'pub.leaflet.publication' WHERE source_collection IS NULL", &.{}) catch {};
+ 
  // vector embeddings column already added by backfill script
+ 
+ // dedupe index: same (did, rkey) across collections = same document
+ // e.g., pub.leaflet.document/abc and site.standard.document/abc are the same content
+ client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_documents_did_rkey ON documents(did, rkey)", &.{}) catch {};
+ client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_publications_did_rkey ON publications(did, rkey)", &.{}) catch {};
+ 
+ // backfill platform from source_collection for records indexed before platform detection fix
+ client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform = 'unknown' AND source_collection LIKE 'pub.leaflet.%'", &.{}) catch {};
+ client.exec("UPDATE documents SET platform = 'pckt' WHERE platform = 'unknown' AND source_collection LIKE 'blog.pckt.%'", &.{}) catch {};
+ 
+ // detect platform from publication basePath (site.standard.* is a lexicon, not a platform)
+ // pckt uses site.standard.* lexicon but basePath contains pckt.blog
+ client.exec(
+     \\UPDATE documents SET platform = 'pckt'
+     \\WHERE platform IN ('standardsite', 'unknown')
+     \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%pckt.blog%')
+ , &.{}) catch {};
+ 
+ // leaflet also uses site.standard.* lexicon, detect by basePath
+ client.exec(
+     \\UPDATE documents SET platform = 'leaflet'
+     \\WHERE platform IN ('standardsite', 'unknown')
+     \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%leaflet.pub%')
+ , &.{}) catch {};
+ 
+ // URL path field for documents (e.g., "/001" for zat.dev)
+ // used to build full URL: publication.url + document.path
+ client.exec("ALTER TABLE documents ADD COLUMN path TEXT", &.{}) catch {};
+ 
+ // note: publications_fts was rebuilt with base_path column via scripts/rebuild-pub-fts
+ // new publications will include base_path via insertPublication in indexer.zig
  }
+46 -33
backend/src/extractor.zig
···
  const Allocator = mem.Allocator;
  const zat = @import("zat");
  
- /// Detected platform from content.$type
+ /// Detected platform from collection name
+ /// Note: pckt and other platforms use site.standard.* collections.
+ /// Platform detection from collection only distinguishes leaflet (custom lexicon)
+ /// from site.standard users. Actual platform (pckt vs others) is detected later
+ /// from publication basePath.
  pub const Platform = enum {
      leaflet,
-     pckt,
-     offprint,
+     standardsite, // pckt and others using site.standard.* lexicon
      unknown,
  
-     pub fn fromContentType(content_type: []const u8) Platform {
-         if (mem.startsWith(u8, content_type, "pub.leaflet.")) return .leaflet;
-         if (mem.startsWith(u8, content_type, "blog.pckt.")) return .pckt;
-         if (mem.startsWith(u8, content_type, "app.offprint.")) return .offprint;
+     pub fn fromCollection(collection: []const u8) Platform {
+         if (mem.startsWith(u8, collection, "pub.leaflet.")) return .leaflet;
+         if (mem.startsWith(u8, collection, "site.standard.")) return .standardsite;
          return .unknown;
      }
  
+     /// Internal name (for DB storage)
      pub fn name(self: Platform) []const u8 {
          return @tagName(self);
      }
+ 
+     /// Display name (for UI)
+     pub fn displayName(self: Platform) []const u8 {
+         return @tagName(self);
+     }
  };
  
  /// Extracted document data ready for indexing.
···
  tags: [][]const u8,
  platform: Platform,
  source_collection: []const u8,
+ path: ?[]const u8, // URL path from record (e.g., "/001" for zat.dev)
  
  pub fn deinit(self: *ExtractedDocument) void {
      self.allocator.free(self.content);
···
      .{ "pub.leaflet.blocks.code", {} },
  });
  
- /// Detect platform from record's content.$type field
- pub fn detectPlatform(record: json.ObjectMap) Platform {
-     const content = record.get("content") orelse return .unknown;
-     if (content != .object) return .unknown;
- 
-     const type_val = content.object.get("$type") orelse return .unknown;
-     if (type_val != .string) return .unknown;
- 
-     return Platform.fromContentType(type_val.string);
+ /// Detect platform from collection name
+ pub fn detectPlatform(collection: []const u8) Platform {
+     return Platform.fromCollection(collection);
  }
  
  /// Extract document content from a record.
···
      collection: []const u8,
  ) !ExtractedDocument {
      const record_val: json.Value = .{ .object = record };
-     const platform = detectPlatform(record);
+     const platform = detectPlatform(collection);
  
      // extract required fields
      const title = zat.json.getString(record_val, "title") orelse return error.MissingTitle;
···
      // extract optional fields
      const created_at = zat.json.getString(record_val, "publishedAt") orelse
          zat.json.getString(record_val, "createdAt");
+ 
+     // publication/site can be a string (direct URI) or strongRef object ({uri, cid})
+     // zat.json.getString supports paths like "publication.uri"
      const publication_uri = zat.json.getString(record_val, "publication") orelse
-         zat.json.getString(record_val, "site"); // site.standard uses "site"
+         zat.json.getString(record_val, "publication.uri") orelse
+         zat.json.getString(record_val, "site") orelse
+         zat.json.getString(record_val, "site.uri");
+ 
+     // extract URL path (site.standard.document uses "path" field like "/001")
+     const path = zat.json.getString(record_val, "path");
  
      // extract tags - allocate owned slice
      const tags = try extractTags(allocator, record_val);
···
      .tags = tags,
      .platform = platform,
      .source_collection = collection,
+     .path = path,
  };
  }
···
  
  // --- tests ---
  
- test "Platform.fromContentType: leaflet" {
-     try std.testing.expectEqual(Platform.leaflet, Platform.fromContentType("pub.leaflet.content"));
-     try std.testing.expectEqual(Platform.leaflet, Platform.fromContentType("pub.leaflet.blocks.text"));
+ test "Platform.fromCollection: leaflet" {
+     try std.testing.expectEqual(Platform.leaflet, Platform.fromCollection("pub.leaflet.document"));
+     try std.testing.expectEqual(Platform.leaflet, Platform.fromCollection("pub.leaflet.publication"));
  }
  
- test "Platform.fromContentType: pckt" {
-     try std.testing.expectEqual(Platform.pckt, Platform.fromContentType("blog.pckt.content"));
-     try std.testing.expectEqual(Platform.pckt, Platform.fromContentType("blog.pckt.blocks.whatever"));
+ test "Platform.fromCollection: standardsite" {
+     // pckt and others use site.standard.* collections
+     try std.testing.expectEqual(Platform.standardsite, Platform.fromCollection("site.standard.document"));
+     try std.testing.expectEqual(Platform.standardsite, Platform.fromCollection("site.standard.publication"));
  }
  
- test "Platform.fromContentType: offprint" {
-     try std.testing.expectEqual(Platform.offprint, Platform.fromContentType("app.offprint.content"));
- }
- 
- test "Platform.fromContentType: unknown" {
-     try std.testing.expectEqual(Platform.unknown, Platform.fromContentType("something.else"));
-     try std.testing.expectEqual(Platform.unknown, Platform.fromContentType(""));
+ test "Platform.fromCollection: unknown" {
+     try std.testing.expectEqual(Platform.unknown, Platform.fromCollection("something.else"));
+     try std.testing.expectEqual(Platform.unknown, Platform.fromCollection(""));
  }
  
  test "Platform.name" {
      try std.testing.expectEqualStrings("leaflet", Platform.leaflet.name());
-     try std.testing.expectEqualStrings("pckt", Platform.pckt.name());
-     try std.testing.expectEqualStrings("offprint", Platform.offprint.name());
+     try std.testing.expectEqualStrings("standardsite", Platform.standardsite.name());
      try std.testing.expectEqualStrings("unknown", Platform.unknown.name());
  }
+ 
+ test "Platform.displayName" {
+     try std.testing.expectEqualStrings("leaflet", Platform.leaflet.displayName());
+     try std.testing.expectEqualStrings("standardsite", Platform.standardsite.displayName());
+ }
+34 -5
backend/src/indexer.zig
···
      tags: []const []const u8,
      platform: []const u8,
      source_collection: []const u8,
+     path: ?[]const u8,
  ) !void {
      const c = db.getClient() orelse return error.NotInitialized;
  
+     // dedupe: if (did, rkey) exists with different uri, clean up old record first
+     // this handles cross-collection duplicates (e.g., pub.leaflet.document + site.standard.document)
+     if (c.query("SELECT uri FROM documents WHERE did = ? AND rkey = ?", &.{ did, rkey })) |result_val| {
+         var result = result_val;
+         defer result.deinit();
+         if (result.first()) |row| {
+             const old_uri = row.text(0);
+             if (!std.mem.eql(u8, old_uri, uri)) {
+                 c.exec("DELETE FROM documents_fts WHERE uri = ?", &.{old_uri}) catch {};
+                 c.exec("DELETE FROM document_tags WHERE document_uri = ?", &.{old_uri}) catch {};
+                 c.exec("DELETE FROM documents WHERE uri = ?", &.{old_uri}) catch {};
+             }
+         }
+     } else |_| {}
+ 
      try c.exec(
-         "INSERT OR REPLACE INTO documents (uri, did, rkey, title, content, created_at, publication_uri, platform, source_collection) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
-         &.{ uri, did, rkey, title, content, created_at orelse "", publication_uri orelse "", platform, source_collection },
+         "INSERT OR REPLACE INTO documents (uri, did, rkey, title, content, created_at, publication_uri, platform, source_collection, path) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
+         &.{ uri, did, rkey, title, content, created_at orelse "", publication_uri orelse "", platform, source_collection, path orelse "" },
      );
  
      // update FTS index
···
  ) !void {
      const c = db.getClient() orelse return error.NotInitialized;
  
+     // dedupe: if (did, rkey) exists with different uri, clean up old record first
+     if (c.query("SELECT uri FROM publications WHERE did = ? AND rkey = ?", &.{ did, rkey })) |result_val| {
+         var result = result_val;
+         defer result.deinit();
+         if (result.first()) |row| {
+             const old_uri = row.text(0);
+             if (!std.mem.eql(u8, old_uri, uri)) {
+                 c.exec("DELETE FROM publications_fts WHERE uri = ?", &.{old_uri}) catch {};
+                 c.exec("DELETE FROM publications WHERE uri = ?", &.{old_uri}) catch {};
+             }
+         }
+     } else |_| {}
+ 
      try c.exec(
          "INSERT OR REPLACE INTO publications (uri, did, rkey, name, description, base_path) VALUES (?, ?, ?, ?, ?, ?)",
          &.{ uri, did, rkey, name, description orelse "", base_path orelse "" },
      );
  
-     // update FTS index
+     // update FTS index (includes base_path for subdomain search)
      c.exec("DELETE FROM publications_fts WHERE uri = ?", &.{uri}) catch {};
      c.exec(
-         "INSERT INTO publications_fts (uri, name, description) VALUES (?, ?, ?)",
-         &.{ uri, name, description orelse "" },
+         "INSERT INTO publications_fts (uri, name, description, base_path) VALUES (?, ?, ?, ?)",
+         &.{ uri, name, description orelse "", base_path orelse "" },
      ) catch {};
  }
+2 -1
backend/src/main.zig
···
  var listener = try address.listen(.{ .reuse_address = true });
  defer listener.deinit();
  
- std.debug.print("leaflet-search listening on http://0.0.0.0:{d} (max {} workers)\n", .{ port, MAX_HTTP_WORKERS });
+ const app_name = posix.getenv("APP_NAME") orelse "leaflet-search";
+ std.debug.print("{s} listening on http://0.0.0.0:{d} (max {} workers)\n", .{ app_name, port, MAX_HTTP_WORKERS });
  
  while (true) {
      const conn = listener.accept() catch |err| {
+220 -31
backend/src/search.zig
···
      rkey: []const u8,
      basePath: []const u8,
      platform: []const u8,
+     path: []const u8 = "", // URL path from record (e.g., "/001")
  };
  
  /// Document search result (internal)
···
      basePath: []const u8,
      hasPublication: bool,
      platform: []const u8,
+     path: []const u8,
  
      fn fromRow(row: db.Row) Doc {
          return .{
···
              .basePath = row.text(6),
              .hasPublication = row.int(7) != 0,
              .platform = row.text(8),
+             .path = row.text(9),
          };
      }
···
          .rkey = self.rkey,
          .basePath = self.basePath,
          .platform = self.platform,
+         .path = self.path,
      };
  }
  };
  
  const DocsByTag = zql.Query(
      \\SELECT d.uri, d.did, d.title, '' as snippet,
-     \\  d.created_at, d.rkey, COALESCE(p.base_path, '') as base_path,
+     \\  d.created_at, d.rkey,
+     \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d.did LIMIT 1), '') as base_path,
      \\  CASE WHEN d.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
-     \\  d.platform
+     \\  d.platform, COALESCE(d.path, '') as path
      \\FROM documents d
      \\LEFT JOIN publications p ON d.publication_uri = p.uri
      \\JOIN document_tags dt ON d.uri = dt.document_uri
···
  );
  
  const DocsByFtsAndTag = zql.Query(
      \\SELECT f.uri, d.did, d.title,
      \\  snippet(documents_fts, 2, '', '', '...', 32) as snippet,
-     \\  d.created_at, d.rkey, COALESCE(p.base_path, '') as base_path,
+     \\  d.created_at, d.rkey,
+     \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d.did LIMIT 1), '') as base_path,
      \\  CASE WHEN d.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
-     \\  d.platform
+     \\  d.platform, COALESCE(d.path, '') as path
      \\FROM documents_fts f
      \\JOIN documents d ON f.uri = d.uri
      \\LEFT JOIN publications p ON d.publication_uri = p.uri
      \\JOIN document_tags dt ON d.uri = dt.document_uri
      \\WHERE documents_fts MATCH :query AND dt.tag = :tag
-     \\ORDER BY rank LIMIT 40
+     \\ORDER BY rank + (julianday('now') - julianday(d.created_at)) / 30.0 LIMIT 40
  );
  
  const DocsByFts = zql.Query(
      \\SELECT f.uri, d.did, d.title,
      \\  snippet(documents_fts, 2, '', '', '...', 32) as snippet,
-     \\  d.created_at, d.rkey, COALESCE(p.base_path, '') as base_path,
+     \\  d.created_at, d.rkey,
+     \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d.did LIMIT 1), '') as base_path,
      \\  CASE WHEN d.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
-     \\  d.platform
+     \\  d.platform, COALESCE(d.path, '') as path
      \\FROM documents_fts f
      \\JOIN documents d ON f.uri = d.uri
      \\LEFT JOIN publications p ON d.publication_uri = p.uri
      \\WHERE documents_fts MATCH :query
-     \\ORDER BY rank LIMIT 40
+     \\ORDER BY rank + (julianday('now') - julianday(d.created_at)) / 30.0 LIMIT 40
+ );
+ 
+ const DocsByFtsAndSince = zql.Query(
+     \\SELECT f.uri, d.did, d.title,
+     \\  snippet(documents_fts, 2, '', '', '...', 32) as snippet,
+     \\  d.created_at, d.rkey,
+     \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d.did LIMIT 1), '') as base_path,
+     \\  CASE WHEN d.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
+     \\  d.platform, COALESCE(d.path, '') as path
+     \\FROM documents_fts f
+     \\JOIN documents d ON f.uri = d.uri
+     \\LEFT JOIN publications p ON d.publication_uri = p.uri
+     \\WHERE documents_fts MATCH :query AND d.created_at >= :since
+     \\ORDER BY rank + (julianday('now') - julianday(d.created_at)) / 30.0 LIMIT 40
+ );
+ 
+ const DocsByFtsAndPlatform = zql.Query(
+     \\SELECT f.uri, d.did, d.title,
+     \\  snippet(documents_fts, 2, '', '', '...', 32) as snippet,
+     \\  d.created_at, d.rkey,
+     \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d.did LIMIT 1), '') as base_path,
+     \\  CASE WHEN d.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
+     \\  d.platform, COALESCE(d.path, '') as path
+     \\FROM documents_fts f
+     \\JOIN documents d ON f.uri = d.uri
+     \\LEFT JOIN publications p ON d.publication_uri = p.uri
+     \\WHERE documents_fts MATCH :query AND d.platform = :platform
+     \\ORDER BY rank + (julianday('now') - julianday(d.created_at)) / 30.0 LIMIT 40
+ );
+ 
+ const DocsByFtsAndPlatformAndSince = zql.Query(
+     \\SELECT f.uri, d.did, d.title,
+     \\  snippet(documents_fts, 2, '', '', '...', 32) as snippet,
+     \\  d.created_at, d.rkey,
+     \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d.did LIMIT 1), '') as base_path,
+     \\  CASE WHEN d.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
+     \\  d.platform, COALESCE(d.path, '') as path
+     \\FROM documents_fts f
+     \\JOIN documents d ON f.uri = d.uri
+     \\LEFT JOIN publications p ON d.publication_uri = p.uri
+     \\WHERE documents_fts MATCH :query AND d.platform = :platform AND d.created_at >= :since
+     \\ORDER BY rank + (julianday('now') - julianday(d.created_at)) / 30.0 LIMIT 40
+ );
+ 
+ const DocsByTagAndPlatform = zql.Query(
+     \\SELECT d.uri, d.did, d.title, '' as snippet,
+     \\  d.created_at, d.rkey,
+     \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d.did LIMIT 1), '') as base_path,
+     \\  CASE WHEN d.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
+     \\  d.platform, COALESCE(d.path, '') as path
+     \\FROM documents d
+     \\LEFT JOIN publications p ON d.publication_uri = p.uri
+     \\JOIN document_tags dt ON d.uri = dt.document_uri
+     \\WHERE dt.tag = :tag AND d.platform = :platform
+     \\ORDER BY d.created_at DESC LIMIT 40
+ );
+ 
+ const DocsByFtsAndTagAndPlatform = zql.Query(
+     \\SELECT f.uri, d.did, d.title,
+     \\  snippet(documents_fts, 2, '', '', '...', 32) as snippet,
+     \\  d.created_at, d.rkey,
+     \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d.did LIMIT 1), '') as base_path,
+     \\  CASE WHEN d.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
+     \\  d.platform, COALESCE(d.path, '') as path
+     \\FROM documents_fts f
+     \\JOIN documents d ON f.uri = d.uri
+     \\LEFT JOIN publications p ON d.publication_uri = p.uri
+     \\JOIN document_tags dt ON d.uri = dt.document_uri
+     \\WHERE documents_fts MATCH :query AND dt.tag = :tag AND d.platform = :platform
+     \\ORDER BY rank + (julianday('now') - julianday(d.created_at)) / 30.0 LIMIT 40
+ );
+ 
+ const DocsByPlatform = zql.Query(
+     \\SELECT d.uri, d.did, d.title, '' as snippet,
+     \\  d.created_at, d.rkey,
+     \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d.did LIMIT 1), '') as base_path,
+     \\  CASE WHEN d.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
+     \\  d.platform, COALESCE(d.path, '') as path
+     \\FROM documents d
+     \\LEFT JOIN publications p ON d.publication_uri = p.uri
+     \\WHERE d.platform = :platform
+     \\ORDER BY d.created_at DESC LIMIT 40
+ );
+ 
+ // Find documents by their publication's base_path (subdomain search)
+ // e.g., searching "gyst" finds all docs on gyst.leaflet.pub
+ // Uses recency decay: recent docs rank higher than old ones with same match
+ const DocsByPubBasePath = zql.Query(
+     \\SELECT d.uri, d.did, d.title, '' as snippet,
+     \\  d.created_at, d.rkey,
+     \\  p.base_path,
+     \\  1 as has_publication,
+     \\  d.platform, COALESCE(d.path, '') as path
+     \\FROM documents d
+     \\JOIN publications p ON d.publication_uri = p.uri
+     \\JOIN publications_fts pf ON p.uri = pf.uri
+     \\WHERE publications_fts MATCH :query
+     \\ORDER BY rank + (julianday('now') - julianday(d.created_at)) / 30.0 LIMIT 40
+ );
+ 
+ const DocsByPubBasePathAndPlatform = zql.Query(
+     \\SELECT d.uri, d.did, d.title, '' as snippet,
+     \\  d.created_at, d.rkey,
+     \\  p.base_path,
+     \\  1 as has_publication,
+     \\  d.platform, COALESCE(d.path, '') as path
+     \\FROM documents d
+     \\JOIN publications p ON d.publication_uri = p.uri
+     \\JOIN publications_fts pf ON p.uri = pf.uri
+     \\WHERE publications_fts MATCH :query AND d.platform = :platform
+     \\ORDER BY rank + (julianday('now') - julianday(d.created_at)) / 30.0 LIMIT 40
  );
  
  /// Publication search result (internal)
···
      snippet: []const u8,
      rkey: []const u8,
      basePath: []const u8,
+     platform: []const u8,
  
      fn fromRow(row: db.Row) Pub {
          return .{
···
              .snippet = row.text(3),
              .rkey = row.text(4),
              .basePath = row.text(5),
+             .platform = row.text(6),
          };
      }
···
          .snippet = self.snippet,
          .rkey = self.rkey,
          .basePath = self.basePath,
-         .platform = "leaflet", // publications are leaflet-only for now
+         .platform = self.platform,
      };
  }
  };
  
  const PubSearch = zql.Query(
      \\SELECT f.uri, p.did, p.name,
      \\  snippet(publications_fts, 2, '', '', '...', 32) as snippet,
-     \\  p.rkey, p.base_path
+     \\  p.rkey, p.base_path, p.platform
      \\FROM publications_fts f
      \\JOIN publications p ON f.uri = p.uri
      \\WHERE publications_fts MATCH :query
-     \\ORDER BY rank LIMIT 10
+     \\ORDER BY rank + (julianday('now') - julianday(p.created_at)) / 30.0 LIMIT 10
  );
  
- pub fn search(alloc: Allocator, query: []const u8, tag_filter: ?[]const u8, platform_filter: ?[]const u8) ![]const u8 {
+ pub fn search(alloc: Allocator, query: []const u8, tag_filter: ?[]const u8, platform_filter: ?[]const u8, since_filter: ?[]const u8) ![]const u8 {
      const c = db.getClient() orelse return error.NotInitialized;
  
      var output: std.Io.Writer.Allocating = .init(alloc);
···
      try jw.beginArray();
  
      const fts_query = try buildFtsQuery(alloc, query);
+     const has_query = query.len > 0;
+     const has_tag = tag_filter != null;
+     const has_platform = platform_filter != null;
+     const has_since = since_filter != null;
  
-     // search documents
-     var doc_result = if (query.len == 0 and tag_filter != null)
+     // track seen URIs for deduplication (content match + base_path match)
+     var seen_uris = std.StringHashMap(void).init(alloc);
+     defer seen_uris.deinit();
+ 
+     // search documents by content (title, content) - handle all filter combinations
+     // note: since filter only supported with query (not tag-only searches)
+     var doc_result = if (has_query and has_tag and has_platform)
+         c.query(DocsByFtsAndTagAndPlatform.positional, DocsByFtsAndTagAndPlatform.bind(.{
+             .query = fts_query,
+             .tag = tag_filter.?,
+             .platform = platform_filter.?,
+         })) catch null
+     else if (has_query and has_tag)
+         c.query(DocsByFtsAndTag.positional, DocsByFtsAndTag.bind(.{ .query = fts_query, .tag = tag_filter.? })) catch null
+     else if (has_query and has_platform and has_since)
+         c.query(DocsByFtsAndPlatformAndSince.positional, DocsByFtsAndPlatformAndSince.bind(.{ .query = fts_query, .platform = platform_filter.?, .since = since_filter.? })) catch null
+     else if (has_query and has_platform)
+         c.query(DocsByFtsAndPlatform.positional, DocsByFtsAndPlatform.bind(.{ .query = fts_query, .platform = platform_filter.? })) catch null
+     else if (has_query and has_since)
+         c.query(DocsByFtsAndSince.positional, DocsByFtsAndSince.bind(.{ .query = fts_query, .since = since_filter.? })) catch null
+     else if (has_query)
+         c.query(DocsByFts.positional, DocsByFts.bind(.{ .query = fts_query })) catch null
+     else if (has_tag and has_platform)
+         c.query(DocsByTagAndPlatform.positional, DocsByTagAndPlatform.bind(.{ .tag = tag_filter.?, .platform = platform_filter.? })) catch null
+     else if (has_tag)
          c.query(DocsByTag.positional, DocsByTag.bind(.{ .tag = tag_filter.? })) catch null
-     else if (tag_filter) |tag|
-         c.query(DocsByFtsAndTag.positional, DocsByFtsAndTag.bind(.{ .query = fts_query, .tag = tag })) catch null
+     else if (has_platform)
+         c.query(DocsByPlatform.positional, DocsByPlatform.bind(.{ .platform = platform_filter.? })) catch null
      else
-         c.query(DocsByFts.positional, DocsByFts.bind(.{ .query = fts_query })) catch null;
+         null; // no filters at all - return empty
  
      if (doc_result) |*res| {
          defer res.deinit();
          for (res.rows) |row| {
              const doc = Doc.fromRow(row);
-             // filter by platform if specified
-             if (platform_filter) |pf| {
-                 if (!std.mem.eql(u8, doc.platform, pf)) continue;
+             // dupe URI for hash map (outlives result)
+             const uri_dupe = try alloc.dupe(u8, doc.uri);
+             try seen_uris.put(uri_dupe, {});
+             try jw.write(doc.toJson());
+         }
+     }
+ 
+     // also search documents by publication base_path (subdomain search)
+     // e.g., "gyst" finds all docs on gyst.leaflet.pub even if content doesn't contain "gyst"
+     // skip if tag filter is set (tag filter is content-specific)
+     if (has_query and !has_tag) {
+         var basepath_result = if (has_platform)
+             c.query(DocsByPubBasePathAndPlatform.positional, DocsByPubBasePathAndPlatform.bind(.{
+                 .query = fts_query,
+                 .platform = platform_filter.?,
+             })) catch null
+         else
+             c.query(DocsByPubBasePath.positional, DocsByPubBasePath.bind(.{ .query = fts_query })) catch null;
+ 
+         if (basepath_result) |*res| {
+             defer res.deinit();
+             for (res.rows) |row| {
+                 const doc = Doc.fromRow(row);
+                 // deduplicate: skip if already found by content search
+                 if (!seen_uris.contains(doc.uri)) {
+                     try jw.write(doc.toJson());
+                 }
              }
-             try jw.write(doc.toJson());
          }
      }
  
-     // publications are excluded when filtering by tag or platform (only leaflet has publications)
-     if (tag_filter == null and (platform_filter == null or std.mem.eql(u8, platform_filter.?, "leaflet"))) {
+     // publications are excluded when filtering by tag or platform
+     // (platform filter is for documents only - publications don't have meaningful platform distinction)
+     if (tag_filter == null and platform_filter == null) {
          var pub_result = c.query(
              PubSearch.positional,
              PubSearch.bind(.{ .query = fts_query }),
···
          if (pub_result) |*res| {
              defer res.deinit();
-             for (res.rows) |row| try jw.write(Pub.fromRow(row).toJson());
+             for (res.rows) |row| {
+                 try jw.write(Pub.fromRow(row).toJson());
+             }
          }
      }
···
      // brute-force cosine similarity search (no vector index needed)
      var res = c.query(
          \\SELECT d2.uri, d2.did, d2.title, '' as snippet,
-         \\  d2.created_at, d2.rkey, COALESCE(p.base_path, '') as base_path,
+         \\  d2.created_at, d2.rkey,
+         \\  COALESCE(p.base_path, (SELECT base_path FROM publications WHERE did = d2.did LIMIT 1), '') as base_path,
          \\  CASE WHEN d2.publication_uri != '' THEN 1 ELSE 0 END as has_publication,
-         \\  d2.platform
+         \\  d2.platform, COALESCE(d2.path, '') as path
          \\FROM documents d1, documents d2
          \\LEFT JOIN publications p ON d2.publication_uri = p.uri
          \\WHERE d1.uri = ?
···
  /// Build FTS5 query with OR between terms: "cat dog" -> "cat OR dog*"
  /// Uses OR for better recall with BM25 ranking (more matches = higher score)
  /// Quoted queries are passed through as phrase matches: "exact phrase" -> "exact phrase"
+ /// Separators match FTS5 unicode61 tokenizer: any non-alphanumeric character
  pub fn buildFtsQuery(alloc: Allocator, query: []const u8) ![]const u8 {
      if (query.len == 0) return "";
···
      }
  
      // count words and total length
+     // match FTS5 unicode61 tokenizer: non-alphanumeric = separator
      var word_count: usize = 0;
      var total_word_len: usize = 0;
      var in_word = false;
      for (trimmed) |c| {
-         const is_sep = (c == ' ' or c == '.');
-         if (is_sep) {
+         const is_alnum = (c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or (c >= '0' and c <= '9');
+         if (!is_alnum) {
              in_word = false;
          } else {
              if (!in_word) word_count += 1;
···
      const buf = try alloc.alloc(u8, total_word_len + 1);
      var pos: usize = 0;
      for (trimmed) |c| {
-         if (c != ' ' and c != '.') {
+         const is_alnum = (c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or (c >= '0' and c <= '9');
+         if (is_alnum) {
              buf[pos] = c;
              pos += 1;
          }
···
      in_word = false;
  
      for (trimmed) |c| {
-         const is_sep = (c == ' ' or c == '.');
-         if (is_sep) {
+         const is_alnum = (c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or (c >= '0' and c <= '9');
+         if (!is_alnum) {
              if (in_word) {
                  // end of word - add " OR " if not last
                  current_word += 1;
···
      defer std.testing.allocator.free(result);
      try std.testing.expectEqualStrings("foo OR bar*", result);
  }
+ 
+ test "buildFtsQuery: hyphens as separators" {
+     const result = try buildFtsQuery(std.testing.allocator, "crypto-casino");
+     defer std.testing.allocator.free(result);
+     try std.testing.expectEqualStrings("crypto OR casino*", result);
+ }
+ 
+ test "buildFtsQuery: mixed punctuation" {
+     const result = try buildFtsQuery(std.testing.allocator, "don't@stop_now");
+     defer std.testing.allocator.free(result);
+     try std.testing.expectEqualStrings("don OR t OR stop OR now*", result);
+ }
+21 -4
backend/src/server.zig
···
          try sendJson(request, "{\"status\":\"ok\"}");
      } else if (mem.eql(u8, target, "/popular")) {
          try handlePopular(request);
+     } else if (mem.eql(u8, target, "/platforms")) {
+         try handlePlatforms(request);
      } else if (mem.eql(u8, target, "/dashboard")) {
          try handleDashboard(request);
      } else if (mem.eql(u8, target, "/api/dashboard")) {
···
      defer arena.deinit();
      const alloc = arena.allocator();
  
-     // parse query params: /search?q=something&tag=foo&platform=leaflet
+     // parse query params: /search?q=something&tag=foo&platform=leaflet&since=2025-01-01
      const query = parseQueryParam(alloc, target, "q") catch "";
      const tag_filter = parseQueryParam(alloc, target, "tag") catch null;
      const platform_filter = parseQueryParam(alloc, target, "platform") catch null;
+     const since_filter = parseQueryParam(alloc, target, "since") catch null;
  
      if (query.len == 0 and tag_filter == null) {
          try sendJson(request, "{\"error\":\"enter a search term\"}");
···
      }
  
      // perform FTS search - arena handles cleanup
-     const results = search.search(alloc, query, tag_filter, platform_filter) catch |err| {
+     const results = search.search(alloc, query, tag_filter, platform_filter, since_filter) catch |err| {
          stats.recordError();
          return err;
      };
···
      try sendJson(request, popular);
  }
  
+ fn handlePlatforms(request: *http.Server.Request) !void {
+     var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
+     defer arena.deinit();
+     const alloc = arena.allocator();
+ 
+     const data = try stats.getPlatformCounts(alloc);
+     try sendJson(request, data);
+ }
+ 
  fn parseQueryParam(alloc: std.mem.Allocator, target: []const u8, param: []const u8) ![]const u8 {
      // look for ?param= or &param=
      const patterns = [_][]const u8{ "?", "&" };
···
      var response: std.ArrayList(u8) = .{};
      defer response.deinit(alloc);
  
-     try response.print(alloc, "{{\"documents\":{d},\"publications\":{d},\"cache_hits\":{d},\"cache_misses\":{d}}}", .{ db_stats.documents, db_stats.publications, db_stats.cache_hits, db_stats.cache_misses });
+     try response.print(alloc, "{{\"documents\":{d},\"publications\":{d},\"embeddings\":{d},\"cache_hits\":{d},\"cache_misses\":{d}}}", .{ db_stats.documents, db_stats.publications, db_stats.embeddings, db_stats.cache_hits, db_stats.cache_misses });
  
      try sendJson(request, response.items);
  }
···
      try sendJson(request, json_response);
  }
  
+ fn getDashboardUrl() []const u8 {
+     return std.posix.getenv("DASHBOARD_URL") orelse "https://leaflet-search.pages.dev/dashboard.html";
+ }
+ 
  fn handleDashboard(request: *http.Server.Request) !void {
+     const dashboard_url = getDashboardUrl();
      try request.respond("", .{
          .status = .moved_permanently,
          .extra_headers = &.{
-             .{ .name = "location", .value = "https://leaflet-search.pages.dev/dashboard.html" },
+             .{ .name = "location", .value = dashboard_url },
          },
      });
  }
+64 -8
backend/src/stats.zig
···
  pub const Stats = struct {
      documents: i64,
      publications: i64,
+     embeddings: i64,
      searches: i64,
      errors: i64,
      started_at: i64,
···
      cache_misses: i64,
  };
  
+ const default_stats: Stats = .{ .documents = 0, .publications = 0, .embeddings = 0, .searches = 0, .errors = 0, .started_at = 0, .cache_hits = 0, .cache_misses = 0 };
+ 
  pub fn getStats() Stats {
-     const c = db.getClient() orelse return .{ .documents = 0, .publications = 0, .searches = 0, .errors = 0, .started_at = 0, .cache_hits = 0, .cache_misses = 0 };
+     const c = db.getClient() orelse return default_stats;
  
      var res = c.query(
          \\SELECT
          \\  (SELECT COUNT(*) FROM documents) as docs,
          \\  (SELECT COUNT(*) FROM publications) as pubs,
+         \\  (SELECT COUNT(*) FROM documents WHERE embedding IS NOT NULL) as embeddings,
          \\  (SELECT total_searches FROM stats WHERE id = 1) as searches,
          \\  (SELECT total_errors FROM stats WHERE id = 1) as errors,
          \\  (SELECT service_started_at FROM stats WHERE id = 1) as started_at,
          \\  (SELECT COALESCE(cache_hits, 0) FROM stats WHERE id = 1) as cache_hits,
          \\  (SELECT COALESCE(cache_misses, 0) FROM stats WHERE id = 1) as cache_misses
-     , &.{}) catch return .{ .documents = 0, .publications = 0, .searches = 0, .errors = 0, .started_at = 0, .cache_hits = 0, .cache_misses = 0 };
+     , &.{}) catch return default_stats;
      defer res.deinit();
  
-     const row = res.first() orelse return .{ .documents = 0, .publications = 0, .searches = 0, .errors = 0, .started_at = 0, .cache_hits = 0, .cache_misses = 0 };
+     const row = res.first() orelse return default_stats;
      return .{
          .documents = row.int(0),
          .publications = row.int(1),
-         .searches = row.int(2),
-         .errors = row.int(3),
-         .started_at = row.int(4),
-         .cache_hits = row.int(5),
-         .cache_misses = row.int(6),
+         .embeddings = row.int(2),
+         .searches = row.int(3),
+         .errors = row.int(4),
+         .started_at = row.int(5),
+         .cache_hits = row.int(6),
+         .cache_misses = row.int(7),
      };
  }
···
  pub fn recordCacheMiss() void {
      const c = db.getClient() orelse return;
      c.exec("UPDATE stats SET cache_misses = COALESCE(cache_misses, 0) + 1 WHERE id = 1", &.{}) catch {};
+ }
+ 
+ const PlatformCount = struct { platform: []const u8, count: i64 };
+ 
+ pub fn getPlatformCounts(alloc: Allocator) ![]const u8 {
+     const c = db.getClient() orelse return error.NotInitialized;
+ 
+     var output: std.Io.Writer.Allocating = .init(alloc);
+     errdefer output.deinit();
+ 
+     var jw: json.Stringify = .{ .writer = &output.writer };
+     try jw.beginObject();
+ 
+     // documents by platform
+     try jw.objectField("documents");
+     if (c.query("SELECT platform, COUNT(*) as count FROM documents GROUP BY platform ORDER BY count DESC", &.{})) |res_val| {
+         var res = res_val;
+         defer res.deinit();
+         try jw.beginArray();
+         for (res.rows) |row| try jw.write(PlatformCount{ .platform = row.text(0), .count = row.int(1) });
+         try jw.endArray();
+     } else |_| {
+         try jw.beginArray();
+         try jw.endArray();
+     }
+ 
+     // FTS document count
+     try jw.objectField("fts_count");
+     if (c.query("SELECT COUNT(*) FROM documents_fts", &.{})) |res_val| {
+         var res = res_val;
+         defer res.deinit();
+         if (res.first()) |row| {
+             try jw.write(row.int(0));
+         } else try jw.write(0);
+     } else |_| try jw.write(0);
+ 
+     // sample URIs from each platform (for debugging)
+     try jw.objectField("sample_standardsite");
+     if (c.query("SELECT uri FROM documents WHERE platform = 'standardsite' LIMIT 3", &.{})) |res_val| {
+         var res = res_val;
+         defer res.deinit();
+         try jw.beginArray();
+         for (res.rows) |row| try jw.write(row.text(0));
+         try jw.endArray();
+     } else |_| {
+         try jw.beginArray();
+         try jw.endArray();
+     }
+ 
+     try jw.endObject();
+     return try output.toOwnedSlice();
  }
  
  pub fn getPopular(alloc: Allocator, limit: usize) ![]const u8 {
+98 -35
backend/src/tap.zig
···
  const Handler = struct {
      allocator: Allocator,
+     client: *websocket.Client,
      msg_count: usize = 0,
+     ack_buf: [64]u8 = undefined,
  
      pub fn serverMessage(self: *Handler, data: []const u8) !void {
          self.msg_count += 1;
          if (self.msg_count % 100 == 1) {
              std.debug.print("tap: received {} messages\n", .{self.msg_count});
          }
+ 
+         // extract message ID for ACK
+         const msg_id = extractMessageId(self.allocator, data);
+ 
+         // process the message
          processMessage(self.allocator, data) catch |err| {
              std.debug.print("message processing error: {}\n", .{err});
+             // still ACK even on error to avoid infinite retries
+         };
+ 
+         // send ACK if we have a message ID
+         if (msg_id) |id| {
+             self.sendAck(id);
+         }
+     }
+ 
+     fn sendAck(self: *Handler, msg_id: i64) void {
+         const ack_json = std.fmt.bufPrint(&self.ack_buf, "{{\"type\":\"ack\",\"id\":{d}}}", .{msg_id}) catch |err| {
+             std.debug.print("tap: ACK format error: {}\n", .{err});
+             return;
+         };
+         std.debug.print("tap: sending ACK for id={d}\n", .{msg_id});
+         self.client.write(@constCast(ack_json)) catch |err| {
+             std.debug.print("tap: failed to send ACK: {}\n", .{err});
          };
      }
  
···
          std.debug.print("tap connection closed\n", .{});
      }
  };
+ 
+ fn extractMessageId(allocator: Allocator, payload: []const u8) ?i64 {
+     const parsed = json.parseFromSlice(json.Value, allocator, payload, .{}) catch return null;
+     defer parsed.deinit();
+     return zat.json.getInt(parsed.value, "id");
+ }
  
  fn connect(allocator: Allocator) !void {
      const host = getTapHost();
···
      std.debug.print("tap connected!\n", .{});
  
-     var handler = Handler{ .allocator = allocator };
+     var handler = Handler{ .allocator = allocator, .client = &client };
      client.readLoop(&handler) catch |err| {
          std.debug.print("websocket read loop error: {}\n", .{err});
          return err;
···
  /// TAP record envelope - extracted via zat.json.extractAt
  const TapRecord = struct {
      collection: []const u8,
-     action: zat.CommitAction,
+     action: []const u8, // "create", "update", "delete"
      did: []const u8,
      rkey: []const u8,
+ 
+     pub fn isCreate(self: TapRecord) bool {
+         return mem.eql(u8, self.action, "create");
+     }
+     pub fn isUpdate(self: TapRecord) bool {
+         return mem.eql(u8, self.action, "update");
+     }
+     pub fn isDelete(self: TapRecord) bool {
+         return mem.eql(u8, self.action, "delete");
+     }
  };
  
  /// Leaflet publication fields
···
  };
  
  fn processMessage(allocator: Allocator, payload: []const u8) !void {
-     const parsed = json.parseFromSlice(json.Value, allocator, payload, .{}) catch return;
+     const parsed = json.parseFromSlice(json.Value, allocator, payload, .{}) catch {
+         std.debug.print("tap: JSON parse failed, first 100 bytes: {s}\n", .{payload[0..@min(payload.len, 100)]});
+         return;
+     };
      defer parsed.deinit();
  
      // check message type
-     const msg_type = zat.json.getString(parsed.value, "type") orelse return;
+     const msg_type = zat.json.getString(parsed.value, "type") orelse {
+         std.debug.print("tap: no type field in message\n", .{});
+         return;
+     };
+ 
      if (!mem.eql(u8, msg_type, "record")) return;
  
-     // extract record envelope
-     const rec = zat.json.extractAt(TapRecord, allocator, parsed.value, .{"record"}) catch return;
+     // extract record envelope (extractAt ignores extra fields like live, rev, cid)
+     const rec = zat.json.extractAt(TapRecord, allocator, parsed.value, .{"record"}) catch |err| {
+         std.debug.print("tap: failed to extract record: {}\n", .{err});
+         return;
+     };
  
      // validate DID
      const did = zat.Did.parse(rec.did) orelse return;
  
-     // build AT-URI string
-     const uri = try std.fmt.allocPrint(allocator, "at://{s}/{s}/{s}", .{ did.raw, rec.collection, rec.rkey });
-     defer allocator.free(uri);
+     // build AT-URI string (no allocation - uses stack buffer)
+     var uri_buf: [256]u8 = undefined;
+     const uri = zat.AtUri.format(&uri_buf, did.raw, rec.collection, rec.rkey) orelse return;
  
-     switch (rec.action) {
-         .create, .update => {
-             const record_obj = zat.json.getObject(parsed.value, "record.record") orelse return;
+     if (rec.isCreate() or rec.isUpdate()) {
+         const inner_record = zat.json.getObject(parsed.value, "record.record") orelse return;
  
-             if (isDocumentCollection(rec.collection)) {
-                 processDocument(allocator, uri, did.raw, rec.rkey, record_obj, rec.collection) catch |err| {
-                     std.debug.print("document processing error: {}\n", .{err});
-                 };
-             } else if (isPublicationCollection(rec.collection)) {
-                 processPublication(allocator, uri, did.raw, rec.rkey, record_obj) catch |err| {
-                     std.debug.print("publication processing error: {}\n", .{err});
-                 };
-             }
-         },
-         .delete => {
-             if (isDocumentCollection(rec.collection)) {
-                 indexer.deleteDocument(uri);
-                 std.debug.print("deleted document: {s}\n", .{uri});
-             } else if (isPublicationCollection(rec.collection)) {
-                 indexer.deletePublication(uri);
-                 std.debug.print("deleted publication: {s}\n", .{uri});
-             }
-         },
+         if (isDocumentCollection(rec.collection)) {
+             processDocument(allocator, uri, did.raw, rec.rkey, inner_record, rec.collection) catch |err| {
+                 std.debug.print("document processing error: {}\n", .{err});
+             };
+         } else if (isPublicationCollection(rec.collection)) {
+             processPublication(allocator, uri, did.raw, rec.rkey, inner_record) catch |err| {
+                 std.debug.print("publication processing error: {}\n", .{err});
+             };
+         }
+     } else if (rec.isDelete()) {
+         if (isDocumentCollection(rec.collection)) {
+             indexer.deleteDocument(uri);
+             std.debug.print("deleted document: {s}\n", .{uri});
+         } else if (isPublicationCollection(rec.collection)) {
+             indexer.deletePublication(uri);
+             std.debug.print("deleted publication: {s}\n", .{uri});
+         }
      }
  }
···
      doc.tags,
      doc.platformName(),
      doc.source_collection,
+     doc.path,
  );
  std.debug.print("indexed document: {s} [{s}] ({} chars, {} tags)\n", .{ uri, doc.platformName(), doc.content.len, doc.tags.len });
  }
  
- fn processPublication(allocator: Allocator, uri: []const u8, did: []const u8, rkey: []const u8, record: json.ObjectMap) !void {
+ fn processPublication(_: Allocator, uri: []const u8, did: []const u8, rkey: []const u8, record: json.ObjectMap) !void {
      const record_val: json.Value = .{ .object = record };
-     const pub_data = zat.json.extractAt(LeafletPublication, allocator, record_val, .{}) catch return;
+ 
+     // extract required field
+     const name = zat.json.getString(record_val, "name") orelse return;
+     const description = zat.json.getString(record_val, "description");
+ 
+     // base_path: try leaflet's "base_path", then site.standard's "url"
+     // url is full URL like "https://devlog.pckt.blog", we need just the host
+     const base_path = zat.json.getString(record_val, "base_path") orelse
+         stripUrlScheme(zat.json.getString(record_val, "url"));
  
-     try indexer.insertPublication(uri, did, rkey, pub_data.name, pub_data.description, pub_data.base_path);
-     std.debug.print("indexed publication: {s} (base_path: {s})\n", .{ uri, pub_data.base_path orelse "none" });
+     try indexer.insertPublication(uri, did, rkey, name, description, base_path);
+     std.debug.print("indexed publication: {s} (base_path: {s})\n", .{ uri, base_path orelse "none" });
+ }
+ 
+ fn stripUrlScheme(url: ?[]const u8) ?[]const u8 {
+     const u = url orelse return null;
+     if (mem.startsWith(u8, u, "https://")) return u["https://".len..];
+     if (mem.startsWith(u8, u, "http://")) return u["http://".len..];
+     return u;
  }
+226
docs/leaflet-publishing-plan.md
··· 1 + # publishing to leaflet.pub 2 + 3 + ## goal 4 + 5 + publish markdown docs to both: 6 + 1. `site.standard.document` (for search/interop) - already working 7 + 2. `pub.leaflet.document` (for leaflet.pub display) - this plan 8 + 9 + ## the mapping 10 + 11 + ### block types 12 + 13 + | markdown | leaflet block | 14 + |----------|---------------| 15 + | `# heading` | `pub.leaflet.blocks.header` (level 1-6) | 16 + | paragraph | `pub.leaflet.blocks.text` | 17 + | ``` code ``` | `pub.leaflet.blocks.code` | 18 + | `> quote` | `pub.leaflet.blocks.blockquote` | 19 + | `---` | `pub.leaflet.blocks.horizontalRule` | 20 + | `- item` | `pub.leaflet.blocks.unorderedList` | 21 + | `![alt](src)` | `pub.leaflet.blocks.image` (requires blob upload) | 22 + | `[text](url)` (standalone) | `pub.leaflet.blocks.website` | 23 + 24 + ### inline formatting (facets) 25 + 26 + leaflet uses byte-indexed facets for inline formatting within text blocks: 27 + 28 + ```json 29 + { 30 + "$type": "pub.leaflet.blocks.text", 31 + "plaintext": "hello world with bold text", 32 + "facets": [{ 33 + "index": { "byteStart": 17, "byteEnd": 21 }, 34 + "features": [{ "$type": "pub.leaflet.richtext.facet#bold" }] 35 + }] 36 + } 37 + ``` 38 + 39 + | markdown | facet type | 40 + |----------|------------| 41 + | `**bold**` | `pub.leaflet.richtext.facet#bold` | 42 + | `*italic*` | `pub.leaflet.richtext.facet#italic` | 43 + | `` `code` `` | `pub.leaflet.richtext.facet#code` | 44 + | `[text](url)` | `pub.leaflet.richtext.facet#link` | 45 + | `~~strike~~` | `pub.leaflet.richtext.facet#strikethrough` | 46 + 47 + ## record structure 48 + 49 + ```json 50 + { 51 + "$type": "pub.leaflet.document", 52 + "author": "did:plc:...", 53 + "title": "document title", 54 + "description": "optional description", 55 + "publishedAt": "2026-01-06T00:00:00Z", 56 + "publication": "at://did:plc:.../pub.leaflet.publication/rkey", 57 + "tags": ["tag1", "tag2"], 58 + "pages": [{ 59 + "$type": "pub.leaflet.pages.linearDocument", 60 + "id": "page-uuid", 61 + "blocks": [ 62 + { 63 + "$type": "pub.leaflet.pages.linearDocument#block", 64 + "block": { /* one of the block types above */ } 65 + } 66 + ] 67 + }] 68 + } 69 + ``` 70 + 71 + ## implementation plan 72 + 73 + ### phase 1: markdown parser 74 + 75 + add a simple markdown block parser to zat or the publish script: 76 + 77 + ```zig 78 + const BlockType = enum { 79 + heading, 80 + paragraph, 81 + code, 82 + blockquote, 83 + horizontal_rule, 84 + unordered_list, 85 + image, 86 + }; 87 + 88 + const Block = struct { 89 + type: BlockType, 90 + content: []const u8, 91 + level: ?u8 = null, // for headings 92 + language: ?[]const u8 = null, // for code blocks 93 + alt: ?[]const u8 = null, // for images 94 + src: ?[]const u8 = null, // for images 95 + }; 96 + 97 + fn parseMarkdownBlocks(allocator: Allocator, markdown: []const u8) ![]Block 98 + ``` 99 + 100 + parsing approach: 101 + - split on blank lines to get blocks 102 + - identify block type by first characters: 103 + - `#` → heading (count `#` for level) 104 + - ``` → code block (capture until closing ```) 105 + - `>` → blockquote 106 + - `---` → horizontal rule 107 + - `-` or `*` at start → list item 108 + - `![` → image 109 + - else → paragraph 110 + 111 + ### phase 2: inline facet extraction 112 + 113 + for text blocks, extract inline formatting: 114 + 115 + ```zig 116 + const Facet = struct { 117 + byte_start: usize, 118 + byte_end: usize, 119 + feature: FacetFeature, 120 + }; 121 + 122 + const FacetFeature = union(enum) { 123 + bold, 124 + italic, 125
+ code, 126 + link: []const u8, // url 127 + strikethrough, 128 + }; 129 + 130 + fn extractFacets(allocator: Allocator, text: []const u8) !struct { 131 + plaintext: []const u8, 132 + facets: []Facet, 133 + } 134 + ``` 135 + 136 + approach: 137 + - scan for `**`, `*`, `` ` ``, `[`, `~~` 138 + - track byte positions as we strip markers 139 + - build facet list with adjusted indices 140 + 141 + ### phase 3: image blob upload 142 + 143 + images need to be uploaded as blobs before referencing: 144 + 145 + ```zig 146 + fn uploadImageBlob(client: *XrpcClient, allocator: Allocator, image_path: []const u8) !BlobRef 147 + ``` 148 + 149 + for now, could skip images or require them to already be uploaded. 150 + 151 + ### phase 4: json serialization 152 + 153 + build the full `pub.leaflet.document` record: 154 + 155 + ```zig 156 + const LeafletDocument = struct { 157 + @"$type": []const u8 = "pub.leaflet.document", 158 + author: []const u8, 159 + title: []const u8, 160 + description: ?[]const u8 = null, 161 + publishedAt: []const u8, 162 + publication: ?[]const u8 = null, 163 + tags: ?[][]const u8 = null, 164 + pages: []Page, 165 + }; 166 + 167 + const Page = struct { 168 + @"$type": []const u8 = "pub.leaflet.pages.linearDocument", 169 + id: []const u8, 170 + blocks: []BlockWrapper, 171 + }; 172 + ``` 173 + 174 + ### phase 5: integrate into publish-docs.zig 175 + 176 + update the publish script to: 177 + 1. parse markdown into blocks 178 + 2. convert to leaflet structure 179 + 3. publish `pub.leaflet.document` alongside `site.standard.document` 180 + 181 + ```zig 182 + // existing: publish site.standard.document 183 + try putRecord(&client, allocator, session.did, "site.standard.document", tid.str(), doc_record); 184 + 185 + // new: also publish pub.leaflet.document 186 + const leaflet_record = try markdownToLeaflet(allocator, content, title, session.did, pub_uri); 187 + try putRecord(&client, allocator, session.did, "pub.leaflet.document", tid.str(), leaflet_record); 188 + ``` 189 + 190 + ## complexity estimate 191 + 192 + | component | complexity | notes | 193 + |-----------|------------|-------| 194 + | block parsing | medium | regex-free, line-by-line | 195 + | facet extraction | medium | byte index tracking is fiddly | 196 + | image upload | low | already have blob upload in xrpc | 197 + | json serialization | low | std.json handles it | 198 + | integration | low | add to existing publish flow | 199 + 200 + total: ~300-500 lines of zig 201 + 202 + ## open questions 203 + 204 + 1. **publication record**: do we need a `pub.leaflet.publication` too, or just documents? 205 + - leaflet allows standalone documents without publications 206 + - could skip publication for now 207 + 208 + 2. **image handling**: 209 + - option A: skip images initially (just text content) 210 + - option B: require images to be URLs (no blob upload) 211 + - option C: full blob upload support 212 + 213 + 3. **deduplication**: same rkey for both record types? 214 + - pro: easy to correlate 215 + - con: different collections, might not matter 216 + 217 + 4. 
**validation**: leaflet has a validate endpoint 218 + - could call `/api/unstable_validate` to check records before publish 219 + - probably skip for v1 220 + 221 + ## references 222 + 223 + - [pub.leaflet.document schema](/tmp/leaflet/lexicons/pub/leaflet/document.json) 224 + - [leaflet publishToPublication.ts](/tmp/leaflet/actions/publishToPublication.ts) - how leaflet creates records 225 + - [site.standard.document schema](/tmp/standard.site/app/data/lexicons/document.json) 226 + - paul's site: fetches records, doesn't publish them
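the facet extraction in phase 2 is the fiddly part of the plan, so it helps to pin down what `byteStart`/`byteEnd` mean before writing the Zig. a throwaway Python sketch (not the planned implementation) that strips a single `**bold**` span and computes its facet, matching the "hello world with bold text" example above - `bold_facet` is a hypothetical helper:

```python
def bold_facet(markdown: str) -> tuple[str, dict] | None:
    """strip one **bold** marker pair and compute its facet byte range.

    offsets are byte positions in the stripped plaintext, end-exclusive.
    """
    start = markdown.find("**")
    if start == -1:
        return None
    end = markdown.find("**", start + 2)
    if end == -1:
        return None
    inner = markdown[start + 2 : end]
    plaintext = markdown[:start] + inner + markdown[end + 2 :]
    byte_start = len(plaintext[:start].encode("utf-8"))
    byte_end = byte_start + len(inner.encode("utf-8"))
    facet = {
        "index": {"byteStart": byte_start, "byteEnd": byte_end},
        "features": [{"$type": "pub.leaflet.richtext.facet#bold"}],
    }
    return plaintext, facet


plaintext, facet = bold_facet("hello world with **bold** text")
assert plaintext == "hello world with bold text"
assert facet["index"] == {"byteStart": 17, "byteEnd": 21}
```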
+124
docs/search-architecture.md
··· 1 + # search architecture 2 + 3 + current state, rationale, and future options. 4 + 5 + ## current: SQLite FTS5 6 + 7 + we use SQLite's built-in full-text search (FTS5) via Turso. 8 + 9 + ### why FTS5 works for now 10 + 11 + - **scale**: ~3500 documents. FTS5 handles this trivially. 12 + - **latency**: 10-50ms for search queries. fine for our use case. 13 + - **cost**: $0. included with Turso free tier. 14 + - **ops**: zero. no separate service to run. 15 + - **simplicity**: one database for everything (docs, FTS, vectors, cache). 16 + 17 + ### how it works 18 + 19 + ``` 20 + user query: "crypto-casino" 21 + ↓ 22 + buildFtsQuery(): "crypto OR casino*" 23 + ↓ 24 + FTS5 MATCH query with BM25 + recency decay 25 + ↓ 26 + results with snippet() 27 + ``` 28 + 29 + key decisions: 30 + - **OR between terms** for better recall (deliberate, see commit 35ad4b5) 31 + - **prefix match on last word** for type-ahead feel 32 + - **unicode61 tokenizer** splits on non-alphanumeric (we match this in buildFtsQuery) 33 + - **recency decay** boosts recent docs: `ORDER BY rank + (days_old / 30)` 34 + 35 + ### what's coupled to FTS5 36 + 37 + all in `backend/src/search.zig`: 38 + 39 + | component | FTS5-specific | 40 + |-----------|---------------| 41 + | 10 query definitions | `MATCH`, `snippet()`, `ORDER BY rank` | 42 + | `buildFtsQuery()` | constructs FTS5 syntax | 43 + | schema | `documents_fts`, `publications_fts` virtual tables | 44 + 45 + ### what's already decoupled 46 + 47 + - result types (`SearchResultJson`, `Doc`, `Pub`) 48 + - similarity search (uses `vector_distance_cos`, not FTS5) 49 + - caching logic 50 + - HTTP layer (server.zig just calls `search()`) 51 + 52 + ### known limitations 53 + 54 + - **no typo tolerance**: "leafet" won't find "leaflet" 55 + - **no relevance tuning**: can't boost title vs content 56 + - **single writer**: SQLite write lock 57 + - **no horizontal scaling**: single database 58 + 59 + these aren't problems at current scale. 60 + 61 + ## future: if we need to scale 62 + 63 + ### when to consider switching 64 + 65 + - search latency consistently >100ms 66 + - write contention from indexing 67 + - need typo tolerance or better relevance 68 + - millions of documents 69 + 70 + ### recommended: Elasticsearch 71 + 72 + Elasticsearch is the battle-tested choice for production search: 73 + 74 + - proven at massive scale (Wikipedia, GitHub, Stack Overflow) 75 + - rich query DSL, analyzers, aggregations 76 + - typo tolerance via fuzzy matching 77 + - horizontal scaling built-in 78 + - extensive tooling and community 79 + 80 + trade-offs: 81 + - operational complexity (JVM, cluster management) 82 + - resource hungry (~2GB+ RAM minimum) 83 + - cost: $50-500/month depending on scale 84 + 85 + ### alternatives considered 86 + 87 + **Meilisearch/Typesense**: simpler, lighter, great defaults. good for straightforward search but less proven at scale. would work fine for this use case but Elasticsearch has more headroom. 88 + 89 + **Algolia**: fully managed, excellent but expensive. makes sense if you want zero ops. 90 + 91 + **PostgreSQL full-text**: if already on Postgres. not as good as FTS5 or Elasticsearch but one less system. 92 + 93 + ### migration path 94 + 95 + 1. keep Turso as source of truth 96 + 2. add Elasticsearch as search index 97 + 3. sync documents to ES on write (async) 98 + 4. point `/search` at Elasticsearch 99 + 5. keep `/similar` on Turso (vector search) 100 + 101 + the `search()` function would change from SQL queries to ES client calls. 
result types stay the same. HTTP layer unchanged. 102 + 103 + estimated effort: 1-2 days to swap search backend. 104 + 105 + ### vector search scaling 106 + 107 + similarity search currently uses brute-force `vector_distance_cos` with caching. at scale: 108 + 109 + - **Elasticsearch**: has vector search (dense_vector + kNN) 110 + - **dedicated vector DB**: Qdrant, Pinecone, Weaviate 111 + - **pgvector**: if on Postgres 112 + 113 + could consolidate text + vector in Elasticsearch, or keep them separate. 114 + 115 + ## summary 116 + 117 + | scale | recommendation | 118 + |-------|----------------| 119 + | <10k docs | keep FTS5 (current) | 120 + | 10k-100k docs | still probably fine, monitor latency | 121 + | 100k+ docs | consider Elasticsearch | 122 + | millions + sub-ms latency | Elasticsearch cluster + caching layer | 123 + 124 + we're in the "keep FTS5" zone. the code is structured to swap later if needed.
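the `buildFtsQuery()` step is easiest to see with the worked example. a rough Python approximation of the documented transform (OR between terms, prefix match on the last word, splitting on non-alphanumeric the way unicode61 does) - this sketch is ASCII-only; the real implementation is the Zig `buildFtsQuery()` in `backend/src/search.zig`:

```python
import re


def build_fts_query(user_query: str) -> str:
    """approximate the documented transform: "crypto-casino" -> "crypto OR casino*"."""
    # unicode61 treats non-alphanumeric as separators, so split the same way
    terms = [t for t in re.split(r"[^0-9A-Za-z]+", user_query) if t]
    if not terms:
        return ""
    terms[-1] += "*"  # prefix match on the last word for the type-ahead feel
    return " OR ".join(terms)


assert build_fts_query("crypto-casino") == "crypto OR casino*"
assert build_fts_query("cat dog") == "cat OR dog*"
```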
+3 -3
docs/standard-search-planning.md
··· 221 221 - keep existing block parser for `pub.leaflet.*` 222 222 - platform detection from `content.$type` 223 223 224 - ### PR3: TAP subscriber for site.standard.document 224 + ### PR3: tap subscriber for site.standard.document 225 225 - subscribe to `site.standard.document` + `site.standard.publication` 226 226 - route to appropriate extractor 227 227 - starts ingesting pckt.blog content ··· 254 254 2. ~~find and examine offprint records~~ (done - no public content yet) 255 255 3. ~~PR1: database schema~~ (merged) 256 256 4. PR2: generalized content extraction 257 - 5. PR3: TAP subscriber 257 + 5. PR3: tap subscriber 258 258 6. PR4: API platform filter 259 259 7. consider witness cache architecture (see below) 260 260 ··· 275 275 ### current leaflet-search architecture (no witness cache) 276 276 277 277 ``` 278 - Firehose → TAP → Parse & Transform → Store DERIVED data → Discard raw record 278 + Firehose → tap → Parse & Transform → Store DERIVED data → Discard raw record 279 279 281 281 we store:
+215
docs/tap.md
··· 1 + # tap (firehose sync) 2 + 3 + leaflet-search uses [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose. 4 + 5 + ## what is tap? 6 + 7 + tap subscribes to the ATProto firehose, filters for specific collections (e.g., `pub.leaflet.document`), and broadcasts matching events to websocket clients. it also does initial crawling/backfilling of existing records. 8 + 9 + key behavior: **tap backfills historical data when repos are added**. when a repo is added to tracking: 10 + 1. tap fetches the full repo from the account's PDS using `com.atproto.sync.getRepo` 11 + 2. live firehose events during backfill are buffered in memory 12 + 3. historical events (marked `live: false`) are delivered first 13 + 4. after historical events complete, buffered live events are released 14 + 5. subsequent firehose events arrive immediately marked as `live: true` 15 + 16 + tap enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first. 17 + 18 + ## message format 19 + 20 + tap sends JSON messages over websocket. record events look like: 21 + 22 + ```json 23 + { 24 + "type": "record", 25 + "record": { 26 + "live": true, 27 + "did": "did:plc:abc123...", 28 + "rev": "3mbspmpaidl2a", 29 + "collection": "pub.leaflet.document", 30 + "rkey": "3lzyrj6q6gs27", 31 + "action": "create", 32 + "record": { ... }, 33 + "cid": "bafyrei..." 34 + } 35 + } 36 + ``` 37 + 38 + ### field types (important!) 39 + 40 + | field | type | values | notes | 41 + |-------|------|--------|-------| 42 + | type | string | "record", "identity", "account" | message type | 43 + | action | **string** | "create", "update", "delete" | NOT an enum! | 44 + | live | bool | true/false | true = firehose, false = resync | 45 + | collection | string | e.g., "pub.leaflet.document" | lexicon collection | 46 + 47 + ## gotchas 48 + 49 + 1. **action is a string, not an enum** - tap sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison. 50 + 51 + 2. **collection filters apply to output** - `TAP_COLLECTION_FILTERS` controls which records tap sends to clients. records from other collections are fetched but not forwarded. 52 + 53 + 3. **signal collection vs collection filters** - `TAP_SIGNAL_COLLECTION` controls auto-discovery of repos (which repos to track), while `TAP_COLLECTION_FILTERS` controls which records from those repos to output. a repo must either be auto-discovered via signal collection OR manually added via `/repos/add`. 54 + 55 + 4. **silent extraction failures** - if using zat's `extractAt`, enable debug logging to see why parsing fails: 56 + ```zig 57 + pub const std_options = .{ 58 + .log_scope_levels = &.{.{ .scope = .zat, .level = .debug }}, 59 + }; 60 + ``` 61 + this will show messages like: 62 + ``` 63 + debug(zat): extractAt: parse failed for Op at path { "op" }: InvalidEnumTag 64 + ``` 65 + 66 + ## memory and performance tuning 67 + 68 + tap loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine. 
69 + 70 + ### recommended settings for leaflet-search 71 + 72 + ```toml 73 + [[vm]] 74 + memory = '2gb' # 1gb is not enough 75 + 76 + [env] 77 + TAP_RESYNC_PARALLELISM = '1' # only one repo CAR in memory at a time (default: 5) 78 + TAP_FIREHOSE_PARALLELISM = '5' # concurrent event processors (default: 10) 79 + TAP_OUTBOX_CAPACITY = '10000' # event buffer size (default: 100000) 80 + TAP_IDENT_CACHE_SIZE = '10000' # identity cache entries (default: 2000000) 81 + ``` 82 + 83 + ### why these values? 84 + 85 + - **2GB memory**: 1GB causes OOM kills when resyncing large repos 86 + - **resync parallelism 1**: prevents multiple large CARs in memory simultaneously 87 + - **lower firehose/outbox**: we track ~1000 repos, not millions - defaults are overkill 88 + - **smaller ident cache**: we don't need 2M cached identities 89 + 90 + if tap keeps OOM'ing, check logs for large repo resyncs: 91 + ```bash 92 + fly logs -a leaflet-search-tap | grep "parsing repo CAR" | grep -E "size\":[0-9]{8,}" 93 + ``` 94 + 95 + ## quick status check 96 + 97 + from the `tap/` directory: 98 + ```bash 99 + just check 100 + ``` 101 + 102 + shows tap machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts. 103 + 104 + example output: 105 + ``` 106 + === tap status === 107 + app 781417db604d48 23 ewr started ... 108 + 109 + === Recent Indexing Activity === 110 + Last indexed: 2026-01-08 (14 docs) 111 + Today: 2026-01-11 112 + Docs: 3742 | Pubs: 1231 113 + 114 + === Timeline (last 7 days) === 115 + 2026-01-08: 14 docs 116 + 2026-01-07: 29 docs 117 + ... 118 + ``` 119 + 120 + if "Last indexed" is more than a day behind "Today", tap may be down or catching up. 121 + 122 + ## checking catch-up progress 123 + 124 + when tap restarts after downtime, it replays the firehose from its saved cursor. to check progress: 125 + 126 + ```bash 127 + # see current firehose position (look for timestamps in log messages) 128 + fly logs -a leaflet-search-tap | grep -E '"time".*"seq"' | tail -3 129 + ``` 130 + 131 + the `"time"` field in log messages shows how far behind tap is. compare to current time to estimate catch-up. 132 + 133 + catch-up speed varies: 134 + - **~0.3x** when resync queue is full (large repos being fetched) 135 + - **~1x or faster** once resyncs clear 136 + 137 + ## debugging 138 + 139 + ### check tap connection 140 + ```bash 141 + fly logs -a leaflet-search-tap --no-tail | tail -30 142 + ``` 143 + 144 + look for: 145 + - `"connected to firehose"` - successfully connected to bsky relay 146 + - `"websocket connected"` - backend connected to tap 147 + - `"dialing failed"` / `"i/o timeout"` - network issues 148 + 149 + ### check backend is receiving 150 + ```bash 151 + fly logs -a leaflet-search-backend --no-tail | grep -E "(tap|indexed)" 152 + ``` 153 + 154 + look for: 155 + - `tap connected!` - connected to tap 156 + - `tap: msg_type=record` - receiving messages 157 + - `indexed document:` - successfully processing 158 + 159 + ### common issues 160 + 161 + | symptom | cause | fix | 162 + |---------|-------|-----| 163 + | tap machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 | 164 + | `websocket handshake failed: error.Timeout` | tap not running or network issue | restart tap, check regions match | 165 + | `dialing failed: lookup ... 
i/o timeout` | DNS issues reaching bsky relay | restart tap, transient network issue | 166 + | messages received but not indexed | extraction failing (type mismatch) | enable zat debug logging, check field types | 167 + | repo shows `records: 0` after adding | resync failed or collection not in filters | check tap logs for resync errors, verify `TAP_COLLECTION_FILTERS` | 168 + | new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart tap | 169 + | indexing stopped, tap shows "started" | tap catching up from downtime | check firehose position in logs, wait for catch-up | 170 + 171 + ## tap API endpoints 172 + 173 + tap exposes HTTP endpoints for monitoring and control: 174 + 175 + | endpoint | description | 176 + |----------|-------------| 177 + | `/health` | health check | 178 + | `/stats/repo-count` | number of tracked repos | 179 + | `/stats/record-count` | total records processed | 180 + | `/stats/outbox-buffer` | events waiting to be sent | 181 + | `/stats/resync-buffer` | DIDs waiting to be resynced | 182 + | `/stats/cursors` | firehose cursor position | 183 + | `/info/:did` | repo status: `{"did":"...","state":"active","records":N}` | 184 + | `/repos/add` | POST with `{"dids":["did:plc:..."]}` to add repos | 185 + | `/repos/remove` | POST with `{"dids":["did:plc:..."]}` to remove repos | 186 + 187 + example: check repo status 188 + ```bash 189 + fly ssh console -a leaflet-search-tap -C "curl -s localhost:2480/info/did:plc:abc123" 190 + ``` 191 + 192 + example: manually add a repo for backfill 193 + ```bash 194 + fly ssh console -a leaflet-search-tap -C 'curl -X POST -H "Content-Type: application/json" -d "{\"dids\":[\"did:plc:abc123\"]}" localhost:2480/repos/add' 195 + ``` 196 + 197 + ## fly.io deployment 198 + 199 + both tap and backend should be in the same region for internal networking: 200 + 201 + ```bash 202 + # check current regions 203 + fly status -a leaflet-search-tap 204 + fly status -a leaflet-search-backend 205 + 206 + # restart tap if needed 207 + fly machine restart -a leaflet-search-tap <machine-id> 208 + ``` 209 + 210 + note: changing `primary_region` in fly.toml only affects new machines. to move existing machines, clone to new region and destroy old one. 211 + 212 + ## references 213 + 214 + - [tap source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) 215 + - [ATProto firehose docs](https://atproto.com/specs/sync#firehose)
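to make the message format and the "action is a string" gotcha concrete, here is a minimal Python sketch of handling one tap record message (dict handling only, no websocket plumbing; the collection names are the ones this project cares about, and `handle_tap_message` is just an illustrative name):

```python
import json

DOCUMENT_COLLECTIONS = ("pub.leaflet.document", "site.standard.document")


def handle_tap_message(raw: str) -> None:
    msg = json.loads(raw)
    if msg.get("type") != "record":
        return  # identity / account messages are ignored here
    rec = msg["record"]  # envelope: did, collection, rkey, action, live, record, cid
    action = rec["action"]  # plain string ("create" / "update" / "delete"), NOT an enum
    uri = f"at://{rec['did']}/{rec['collection']}/{rec['rkey']}"
    if action in ("create", "update"):
        inner = rec["record"]  # the actual lexicon record
        if rec["collection"] in DOCUMENT_COLLECTIONS:
            print(f"index {uri}: {inner.get('title', '')!r}")
    elif action == "delete":
        print(f"delete {uri}")
```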
+5 -5
mcp/README.md
··· 1 - # leaflet-mcp 1 + # pub search MCP 2 2 3 - MCP server for [Leaflet](https://leaflet.pub) - search decentralized publications on ATProto. 3 + MCP server for [pub search](https://pub-search.waow.tech) - search ATProto publishing platforms (Leaflet, pckt, standard.site). 4 4 5 5 ## usage 6 6 7 7 ### hosted (recommended) 8 8 9 9 ```bash 10 - claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}' 10 + claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}' 11 11 ``` 12 12 13 13 ### local ··· 15 15 run the MCP server locally with `uvx`: 16 16 17 17 ```bash 18 - uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp leaflet-mcp 18 + uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp pub-search 19 19 ``` 20 20 21 21 to add it to claude code as a local stdio server: 22 22 23 23 ```bash 24 - claude mcp add leaflet -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' leaflet-mcp 24 + claude mcp add pub-search -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' pub-search 25 25 ``` 26 26 27 27 ## workflow
+5 -5
mcp/pyproject.toml
··· 1 1 [project] 2 - name = "leaflet-mcp" 2 + name = "pub-search" 3 3 dynamic = ["version"] 4 - description = "MCP server for Leaflet - search decentralized publications on ATProto" 4 + description = "MCP server for searching ATProto publishing platforms (Leaflet, pckt, and more)" 5 5 readme = "README.md" 6 6 authors = [{ name = "zzstoatzz", email = "thrast36@gmail.com" }] 7 7 requires-python = ">=3.10" 8 8 license = "MIT" 9 9 10 - keywords = ["leaflet", "mcp", "atproto", "publications", "search", "fastmcp"] 10 + keywords = ["pub-search", "mcp", "atproto", "publications", "search", "fastmcp", "leaflet", "pckt"] 11 11 12 12 classifiers = [ 13 13 "Development Status :: 3 - Alpha", ··· 27 27 ] 28 28 29 29 [project.scripts] 30 - leaflet-mcp = "leaflet_mcp.server:main" 30 + pub-search = "pub_search.server:main" 31 31 32 32 [build-system] 33 33 requires = ["hatchling", "uv-dynamic-versioning>=0.7.0"] 34 34 build-backend = "hatchling.build" 35 35 36 36 [tool.hatch.build.targets.wheel] 37 - packages = ["src/leaflet_mcp"] 37 + packages = ["src/pub_search"] 38 38 39 39 [tool.hatch.version] 40 40 source = "uv-dynamic-versioning"
-5
mcp/src/leaflet_mcp/__init__.py
··· 1 - """Leaflet MCP server - search decentralized publications on ATProto.""" 2 - 3 - from leaflet_mcp.server import main, mcp 4 - 5 - __all__ = ["main", "mcp"]
-58
mcp/src/leaflet_mcp/_types.py
··· 1 - """Type definitions for Leaflet MCP responses.""" 2 - 3 - from typing import Literal 4 - 5 - from pydantic import BaseModel, computed_field 6 - 7 - 8 - class SearchResult(BaseModel): 9 - """A search result from the Leaflet API.""" 10 - 11 - type: Literal["article", "looseleaf", "publication"] 12 - uri: str 13 - did: str 14 - title: str 15 - snippet: str 16 - createdAt: str = "" 17 - rkey: str 18 - basePath: str = "" 19 - 20 - @computed_field 21 - @property 22 - def url(self) -> str: 23 - """web URL for this document.""" 24 - if self.basePath: 25 - return f"https://{self.basePath}/{self.rkey}" 26 - return "" 27 - 28 - 29 - class Tag(BaseModel): 30 - """A tag with document count.""" 31 - 32 - tag: str 33 - count: int 34 - 35 - 36 - class PopularSearch(BaseModel): 37 - """A popular search query with count.""" 38 - 39 - query: str 40 - count: int 41 - 42 - 43 - class Stats(BaseModel): 44 - """Leaflet index statistics.""" 45 - 46 - documents: int 47 - publications: int 48 - 49 - 50 - class Document(BaseModel): 51 - """Full document content from ATProto.""" 52 - 53 - uri: str 54 - title: str 55 - content: str 56 - createdAt: str = "" 57 - tags: list[str] = [] 58 - publicationUri: str = ""
-21
mcp/src/leaflet_mcp/client.py
··· 1 - """HTTP client for Leaflet search API.""" 2 - 3 - import os 4 - from contextlib import asynccontextmanager 5 - from typing import AsyncIterator 6 - 7 - import httpx 8 - 9 - # configurable via env var, defaults to production 10 - LEAFLET_API_URL = os.getenv("LEAFLET_API_URL", "https://leaflet-search-backend.fly.dev") 11 - 12 - 13 - @asynccontextmanager 14 - async def get_http_client() -> AsyncIterator[httpx.AsyncClient]: 15 - """Get an async HTTP client for Leaflet API requests.""" 16 - async with httpx.AsyncClient( 17 - base_url=LEAFLET_API_URL, 18 - timeout=30.0, 19 - headers={"Accept": "application/json"}, 20 - ) as client: 21 - yield client
-289
mcp/src/leaflet_mcp/server.py
··· 1 - """Leaflet MCP server implementation using fastmcp.""" 2 - 3 - from __future__ import annotations 4 - 5 - from typing import Any 6 - 7 - from fastmcp import FastMCP 8 - 9 - from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag 10 - from leaflet_mcp.client import get_http_client 11 - 12 - mcp = FastMCP("leaflet") 13 - 14 - 15 - # ----------------------------------------------------------------------------- 16 - # prompts 17 - # ----------------------------------------------------------------------------- 18 - 19 - 20 - @mcp.prompt("usage_guide") 21 - def usage_guide() -> str: 22 - """instructions for using leaflet MCP tools.""" 23 - return """\ 24 - # Leaflet MCP server usage guide 25 - 26 - Leaflet is a decentralized publishing platform on ATProto (the protocol behind Bluesky). 27 - This MCP server provides search and discovery tools for Leaflet publications. 28 - 29 - ## core tools 30 - 31 - - `search(query, tag)` - search documents and publications by text or tag 32 - - `get_document(uri)` - get the full content of a document by its AT-URI 33 - - `find_similar(uri)` - find documents similar to a given document 34 - - `get_tags()` - list all available tags with document counts 35 - - `get_stats()` - get index statistics (document/publication counts) 36 - - `get_popular()` - see popular search queries 37 - 38 - ## workflow for research 39 - 40 - 1. use `search("your topic")` to find relevant documents 41 - 2. use `get_document(uri)` to retrieve full content of interesting results 42 - 3. use `find_similar(uri)` to discover related content 43 - 44 - ## result types 45 - 46 - search returns three types of results: 47 - - **publication**: a collection of articles (like a blog or magazine) 48 - - **article**: a document that belongs to a publication 49 - - **looseleaf**: a standalone document not part of a publication 50 - 51 - ## AT-URIs 52 - 53 - documents are identified by AT-URIs like: 54 - `at://did:plc:abc123/pub.leaflet.document/xyz789` 55 - 56 - you can also browse documents on the web at leaflet.pub 57 - """ 58 - 59 - 60 - @mcp.prompt("search_tips") 61 - def search_tips() -> str: 62 - """tips for effective searching.""" 63 - return """\ 64 - # Leaflet search tips 65 - 66 - ## text search 67 - - searches both document titles and content 68 - - uses FTS5 full-text search with prefix matching 69 - - the last word gets prefix matching: "cat dog" matches "cat dogs" 70 - 71 - ## tag filtering 72 - - combine text search with tag filter: `search("python", tag="programming")` 73 - - use `get_tags()` to discover available tags 74 - - tags are only applied to documents, not publications 75 - 76 - ## finding related content 77 - - after finding an interesting document, use `find_similar(uri)` 78 - - similarity is based on semantic embeddings (voyage-3-lite) 79 - - great for exploring related topics 80 - 81 - ## browsing by popularity 82 - - use `get_popular()` to see what others are searching for 83 - - can inspire new research directions 84 - """ 85 - 86 - 87 - # ----------------------------------------------------------------------------- 88 - # tools 89 - # ----------------------------------------------------------------------------- 90 - 91 - 92 - @mcp.tool 93 - async def search( 94 - query: str = "", 95 - tag: str | None = None, 96 - limit: int = 5, 97 - ) -> list[SearchResult]: 98 - """search leaflet documents and publications. 99 - 100 - searches the full text of documents (titles and content) and publications. 
101 - results include a snippet showing where the match was found. 102 - 103 - args: 104 - query: search query (searches titles and content) 105 - tag: optional tag to filter by (only applies to documents) 106 - limit: max results to return (default 5, max 40) 107 - 108 - returns: 109 - list of search results with uri, title, snippet, and metadata 110 - """ 111 - if not query and not tag: 112 - return [] 113 - 114 - params: dict[str, Any] = {} 115 - if query: 116 - params["q"] = query 117 - if tag: 118 - params["tag"] = tag 119 - 120 - async with get_http_client() as client: 121 - response = await client.get("/search", params=params) 122 - response.raise_for_status() 123 - results = response.json() 124 - 125 - # apply client-side limit since API returns up to 40 126 - return [SearchResult(**r) for r in results[:limit]] 127 - 128 - 129 - @mcp.tool 130 - async def get_document(uri: str) -> Document: 131 - """get the full content of a document by its AT-URI. 132 - 133 - fetches the complete document from ATProto, including full text content. 134 - use this after finding documents via search to get the complete text. 135 - 136 - args: 137 - uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...) 138 - 139 - returns: 140 - document with full content, title, tags, and metadata 141 - """ 142 - # use pdsx to fetch the actual record from ATProto 143 - try: 144 - from pdsx._internal.operations import get_record 145 - from pdsx.mcp.client import get_atproto_client 146 - except ImportError as e: 147 - raise RuntimeError( 148 - "pdsx is required for fetching full documents. install with: uv add pdsx" 149 - ) from e 150 - 151 - # extract repo from URI for PDS discovery 152 - # at://did:plc:xxx/collection/rkey 153 - parts = uri.replace("at://", "").split("/") 154 - if len(parts) < 3: 155 - raise ValueError(f"invalid AT-URI: {uri}") 156 - 157 - repo = parts[0] 158 - 159 - async with get_atproto_client(target_repo=repo) as client: 160 - record = await get_record(client, uri) 161 - 162 - value = record.value 163 - # DotDict doesn't have a working .get(), convert to dict first 164 - if hasattr(value, "to_dict") and callable(value.to_dict): 165 - value = value.to_dict() 166 - elif not isinstance(value, dict): 167 - value = dict(value) 168 - 169 - # extract content from leaflet's block structure 170 - # pages[].blocks[].block.plaintext 171 - content_parts = [] 172 - for page in value.get("pages", []): 173 - for block_wrapper in page.get("blocks", []): 174 - block = block_wrapper.get("block", {}) 175 - plaintext = block.get("plaintext", "") 176 - if plaintext: 177 - content_parts.append(plaintext) 178 - 179 - content = "\n\n".join(content_parts) 180 - 181 - return Document( 182 - uri=record.uri, 183 - title=value.get("title", ""), 184 - content=content, 185 - createdAt=value.get("publishedAt", "") or value.get("createdAt", ""), 186 - tags=value.get("tags", []), 187 - publicationUri=value.get("publication", ""), 188 - ) 189 - 190 - 191 - @mcp.tool 192 - async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]: 193 - """find documents similar to a given document. 194 - 195 - uses vector similarity (voyage-3-lite embeddings) to find semantically 196 - related documents. great for discovering related content after finding 197 - an interesting document. 
198 - 199 - args: 200 - uri: the AT-URI of the document to find similar content for 201 - limit: max similar documents to return (default 5) 202 - 203 - returns: 204 - list of similar documents with uri, title, and metadata 205 - """ 206 - async with get_http_client() as client: 207 - response = await client.get("/similar", params={"uri": uri}) 208 - response.raise_for_status() 209 - results = response.json() 210 - 211 - return [SearchResult(**r) for r in results[:limit]] 212 - 213 - 214 - @mcp.tool 215 - async def get_tags() -> list[Tag]: 216 - """list all available tags with document counts. 217 - 218 - returns tags sorted by document count (most popular first). 219 - useful for discovering topics and filtering searches. 220 - 221 - returns: 222 - list of tags with their document counts 223 - """ 224 - async with get_http_client() as client: 225 - response = await client.get("/tags") 226 - response.raise_for_status() 227 - results = response.json() 228 - 229 - return [Tag(**t) for t in results] 230 - 231 - 232 - @mcp.tool 233 - async def get_stats() -> Stats: 234 - """get leaflet index statistics. 235 - 236 - returns: 237 - document and publication counts 238 - """ 239 - async with get_http_client() as client: 240 - response = await client.get("/stats") 241 - response.raise_for_status() 242 - return Stats(**response.json()) 243 - 244 - 245 - @mcp.tool 246 - async def get_popular(limit: int = 5) -> list[PopularSearch]: 247 - """get popular search queries. 248 - 249 - see what others are searching for on leaflet. 250 - can inspire new research directions. 251 - 252 - args: 253 - limit: max queries to return (default 5) 254 - 255 - returns: 256 - list of popular queries with search counts 257 - """ 258 - async with get_http_client() as client: 259 - response = await client.get("/popular") 260 - response.raise_for_status() 261 - results = response.json() 262 - 263 - return [PopularSearch(**p) for p in results[:limit]] 264 - 265 - 266 - # ----------------------------------------------------------------------------- 267 - # resources 268 - # ----------------------------------------------------------------------------- 269 - 270 - 271 - @mcp.resource("leaflet://stats") 272 - async def stats_resource() -> str: 273 - """current leaflet index statistics.""" 274 - stats = await get_stats() 275 - return f"Leaflet index: {stats.documents} documents, {stats.publications} publications" 276 - 277 - 278 - # ----------------------------------------------------------------------------- 279 - # entrypoint 280 - # ----------------------------------------------------------------------------- 281 - 282 - 283 - def main() -> None: 284 - """run the MCP server.""" 285 - mcp.run() 286 - 287 - 288 - if __name__ == "__main__": 289 - main()
+5
mcp/src/pub_search/__init__.py
··· 1 + """MCP server for searching ATProto publishing platforms.""" 2 + 3 + from pub_search.server import main, mcp 4 + 5 + __all__ = ["main", "mcp"]
+58
mcp/src/pub_search/_types.py
··· 1 + """Type definitions for Leaflet MCP responses.""" 2 + 3 + from typing import Literal 4 + 5 + from pydantic import BaseModel, computed_field 6 + 7 + 8 + class SearchResult(BaseModel): 9 + """A search result from the Leaflet API.""" 10 + 11 + type: Literal["article", "looseleaf", "publication"] 12 + uri: str 13 + did: str 14 + title: str 15 + snippet: str 16 + createdAt: str = "" 17 + rkey: str 18 + basePath: str = "" 19 + 20 + @computed_field 21 + @property 22 + def url(self) -> str: 23 + """web URL for this document.""" 24 + if self.basePath: 25 + return f"https://{self.basePath}/{self.rkey}" 26 + return "" 27 + 28 + 29 + class Tag(BaseModel): 30 + """A tag with document count.""" 31 + 32 + tag: str 33 + count: int 34 + 35 + 36 + class PopularSearch(BaseModel): 37 + """A popular search query with count.""" 38 + 39 + query: str 40 + count: int 41 + 42 + 43 + class Stats(BaseModel): 44 + """Leaflet index statistics.""" 45 + 46 + documents: int 47 + publications: int 48 + 49 + 50 + class Document(BaseModel): 51 + """Full document content from ATProto.""" 52 + 53 + uri: str 54 + title: str 55 + content: str 56 + createdAt: str = "" 57 + tags: list[str] = [] 58 + publicationUri: str = ""
+21
mcp/src/pub_search/client.py
··· 1 + """HTTP client for leaflet-search API.""" 2 + 3 + import os 4 + from contextlib import asynccontextmanager 5 + from typing import AsyncIterator 6 + 7 + import httpx 8 + 9 + # configurable via env var, defaults to production 10 + API_URL = os.getenv("LEAFLET_SEARCH_API_URL", "https://leaflet-search-backend.fly.dev") 11 + 12 + 13 + @asynccontextmanager 14 + async def get_http_client() -> AsyncIterator[httpx.AsyncClient]: 15 + """Get an async HTTP client for API requests.""" 16 + async with httpx.AsyncClient( 17 + base_url=API_URL, 18 + timeout=30.0, 19 + headers={"Accept": "application/json"}, 20 + ) as client: 21 + yield client
+288
mcp/src/pub_search/server.py
··· 1 + """MCP server for searching ATProto publishing platforms.""" 2 + 3 + from __future__ import annotations 4 + 5 + from typing import Any 6 + 7 + from fastmcp import FastMCP 8 + 9 + from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag 10 + from pub_search.client import get_http_client 11 + 12 + mcp = FastMCP("pub-search") 13 + 14 + 15 + # ----------------------------------------------------------------------------- 16 + # prompts 17 + # ----------------------------------------------------------------------------- 18 + 19 + 20 + @mcp.prompt("usage_guide") 21 + def usage_guide() -> str: 22 + """instructions for using pub-search MCP tools.""" 23 + return """\ 24 + # pub-search MCP usage guide 25 + 26 + search documents across ATProto publishing platforms including Leaflet, pckt, and others. 27 + 28 + ## core tools 29 + 30 + - `search(query, tag)` - search documents and publications by text or tag 31 + - `get_document(uri)` - get the full content of a document by its AT-URI 32 + - `find_similar(uri)` - find documents similar to a given document 33 + - `get_tags()` - list all available tags with document counts 34 + - `get_stats()` - get index statistics (document/publication counts) 35 + - `get_popular()` - see popular search queries 36 + 37 + ## workflow for research 38 + 39 + 1. use `search("your topic")` to find relevant documents 40 + 2. use `get_document(uri)` to retrieve full content of interesting results 41 + 3. use `find_similar(uri)` to discover related content 42 + 43 + ## result types 44 + 45 + search returns three types of results: 46 + - **publication**: a collection of articles (like a blog or magazine) 47 + - **article**: a document that belongs to a publication 48 + - **looseleaf**: a standalone document not part of a publication 49 + 50 + ## AT-URIs 51 + 52 + documents are identified by AT-URIs like: 53 + `at://did:plc:abc123/pub.leaflet.document/xyz789` 54 + 55 + browse the web UI at pub-search.waow.tech 56 + """ 57 + 58 + 59 + @mcp.prompt("search_tips") 60 + def search_tips() -> str: 61 + """tips for effective searching.""" 62 + return """\ 63 + # search tips 64 + 65 + ## text search 66 + - searches both document titles and content 67 + - uses FTS5 full-text search with prefix matching 68 + - the last word gets prefix matching: "cat dog" matches "cat dogs" 69 + 70 + ## tag filtering 71 + - combine text search with tag filter: `search("python", tag="programming")` 72 + - use `get_tags()` to discover available tags 73 + - tags are only applied to documents, not publications 74 + 75 + ## finding related content 76 + - after finding an interesting document, use `find_similar(uri)` 77 + - similarity is based on semantic embeddings (voyage-3-lite) 78 + - great for exploring related topics 79 + 80 + ## browsing by popularity 81 + - use `get_popular()` to see what others are searching for 82 + - can inspire new research directions 83 + """ 84 + 85 + 86 + # ----------------------------------------------------------------------------- 87 + # tools 88 + # ----------------------------------------------------------------------------- 89 + 90 + 91 + @mcp.tool 92 + async def search( 93 + query: str = "", 94 + tag: str | None = None, 95 + limit: int = 5, 96 + ) -> list[SearchResult]: 97 + """search documents and publications. 98 + 99 + searches the full text of documents (titles and content) and publications. 100 + results include a snippet showing where the match was found. 
101 + 102 + args: 103 + query: search query (searches titles and content) 104 + tag: optional tag to filter by (only applies to documents) 105 + limit: max results to return (default 5, max 40) 106 + 107 + returns: 108 + list of search results with uri, title, snippet, and metadata 109 + """ 110 + if not query and not tag: 111 + return [] 112 + 113 + params: dict[str, Any] = {} 114 + if query: 115 + params["q"] = query 116 + if tag: 117 + params["tag"] = tag 118 + 119 + async with get_http_client() as client: 120 + response = await client.get("/search", params=params) 121 + response.raise_for_status() 122 + results = response.json() 123 + 124 + # apply client-side limit since API returns up to 40 125 + return [SearchResult(**r) for r in results[:limit]] 126 + 127 + 128 + @mcp.tool 129 + async def get_document(uri: str) -> Document: 130 + """get the full content of a document by its AT-URI. 131 + 132 + fetches the complete document from ATProto, including full text content. 133 + use this after finding documents via search to get the complete text. 134 + 135 + args: 136 + uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...) 137 + 138 + returns: 139 + document with full content, title, tags, and metadata 140 + """ 141 + # use pdsx to fetch the actual record from ATProto 142 + try: 143 + from pdsx._internal.operations import get_record 144 + from pdsx.mcp.client import get_atproto_client 145 + except ImportError as e: 146 + raise RuntimeError( 147 + "pdsx is required for fetching full documents. install with: uv add pdsx" 148 + ) from e 149 + 150 + # extract repo from URI for PDS discovery 151 + # at://did:plc:xxx/collection/rkey 152 + parts = uri.replace("at://", "").split("/") 153 + if len(parts) < 3: 154 + raise ValueError(f"invalid AT-URI: {uri}") 155 + 156 + repo = parts[0] 157 + 158 + async with get_atproto_client(target_repo=repo) as client: 159 + record = await get_record(client, uri) 160 + 161 + value = record.value 162 + # DotDict doesn't have a working .get(), convert to dict first 163 + if hasattr(value, "to_dict") and callable(value.to_dict): 164 + value = value.to_dict() 165 + elif not isinstance(value, dict): 166 + value = dict(value) 167 + 168 + # extract content from leaflet's block structure 169 + # pages[].blocks[].block.plaintext 170 + content_parts = [] 171 + for page in value.get("pages", []): 172 + for block_wrapper in page.get("blocks", []): 173 + block = block_wrapper.get("block", {}) 174 + plaintext = block.get("plaintext", "") 175 + if plaintext: 176 + content_parts.append(plaintext) 177 + 178 + content = "\n\n".join(content_parts) 179 + 180 + return Document( 181 + uri=record.uri, 182 + title=value.get("title", ""), 183 + content=content, 184 + createdAt=value.get("publishedAt", "") or value.get("createdAt", ""), 185 + tags=value.get("tags", []), 186 + publicationUri=value.get("publication", ""), 187 + ) 188 + 189 + 190 + @mcp.tool 191 + async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]: 192 + """find documents similar to a given document. 193 + 194 + uses vector similarity (voyage-3-lite embeddings) to find semantically 195 + related documents. great for discovering related content after finding 196 + an interesting document. 
197 + 198 + args: 199 + uri: the AT-URI of the document to find similar content for 200 + limit: max similar documents to return (default 5) 201 + 202 + returns: 203 + list of similar documents with uri, title, and metadata 204 + """ 205 + async with get_http_client() as client: 206 + response = await client.get("/similar", params={"uri": uri}) 207 + response.raise_for_status() 208 + results = response.json() 209 + 210 + return [SearchResult(**r) for r in results[:limit]] 211 + 212 + 213 + @mcp.tool 214 + async def get_tags() -> list[Tag]: 215 + """list all available tags with document counts. 216 + 217 + returns tags sorted by document count (most popular first). 218 + useful for discovering topics and filtering searches. 219 + 220 + returns: 221 + list of tags with their document counts 222 + """ 223 + async with get_http_client() as client: 224 + response = await client.get("/tags") 225 + response.raise_for_status() 226 + results = response.json() 227 + 228 + return [Tag(**t) for t in results] 229 + 230 + 231 + @mcp.tool 232 + async def get_stats() -> Stats: 233 + """get index statistics. 234 + 235 + returns: 236 + document and publication counts 237 + """ 238 + async with get_http_client() as client: 239 + response = await client.get("/stats") 240 + response.raise_for_status() 241 + return Stats(**response.json()) 242 + 243 + 244 + @mcp.tool 245 + async def get_popular(limit: int = 5) -> list[PopularSearch]: 246 + """get popular search queries. 247 + 248 + see what others are searching for. 249 + can inspire new research directions. 250 + 251 + args: 252 + limit: max queries to return (default 5) 253 + 254 + returns: 255 + list of popular queries with search counts 256 + """ 257 + async with get_http_client() as client: 258 + response = await client.get("/popular") 259 + response.raise_for_status() 260 + results = response.json() 261 + 262 + return [PopularSearch(**p) for p in results[:limit]] 263 + 264 + 265 + # ----------------------------------------------------------------------------- 266 + # resources 267 + # ----------------------------------------------------------------------------- 268 + 269 + 270 + @mcp.resource("pub-search://stats") 271 + async def stats_resource() -> str: 272 + """current index statistics.""" 273 + stats = await get_stats() 274 + return f"pub search index: {stats.documents} documents, {stats.publications} publications" 275 + 276 + 277 + # ----------------------------------------------------------------------------- 278 + # entrypoint 279 + # ----------------------------------------------------------------------------- 280 + 281 + 282 + def main() -> None: 283 + """run the MCP server.""" 284 + mcp.run() 285 + 286 + 287 + if __name__ == "__main__": 288 + main()
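the test suite below drives this server over fastmcp's in-memory transport; the same pattern is handy for a quick manual smoke test. a sketch, assuming fastmcp 2.x's `Client` accepts the server instance directly (the tests import `FastMCPTransport` for the same purpose):

```python
import asyncio

from fastmcp.client import Client

from pub_search.server import mcp


async def main() -> None:
    async with Client(mcp) as client:
        tools = await client.list_tools()
        print("tools:", [t.name for t in tools])
        resources = await client.list_resources()
        print("resources:", [str(r.uri) for r in resources])


asyncio.run(main())
```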
+8 -8
mcp/tests/test_mcp.py
··· 1 - """tests for leaflet MCP server.""" 1 + """tests for pub-search MCP server.""" 2 2 3 3 import pytest 4 4 from mcp.types import TextContent ··· 6 6 from fastmcp.client import Client 7 7 from fastmcp.client.transports import FastMCPTransport 8 8 9 - from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag 10 - from leaflet_mcp.server import mcp 9 + from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag 10 + from pub_search.server import mcp 11 11 12 12 13 13 class TestTypes: ··· 93 93 94 94 def test_mcp_server_imports(self): 95 95 """mcp server can be imported without errors.""" 96 - from leaflet_mcp import mcp 96 + from pub_search import mcp 97 97 98 - assert mcp.name == "leaflet" 98 + assert mcp.name == "pub-search" 99 99 100 100 def test_exports(self): 101 101 """all expected exports are available.""" 102 - from leaflet_mcp import main, mcp 102 + from pub_search import main, mcp 103 103 104 104 assert mcp is not None 105 105 assert main is not None ··· 138 138 resources = await client.list_resources() 139 139 140 140 resource_uris = {str(r.uri) for r in resources} 141 - assert "leaflet://stats" in resource_uris 141 + assert "pub-search://stats" in resource_uris 142 142 143 143 async def test_usage_guide_prompt_content(self, client): 144 144 """usage_guide prompt returns helpful content.""" ··· 148 148 assert len(result.messages) > 0 149 149 content = result.messages[0].content 150 150 assert isinstance(content, TextContent) 151 - assert "Leaflet" in content.text 151 + assert "pub-search" in content.text 152 152 assert "search" in content.text 153 153 154 154 async def test_search_tips_prompt_content(self, client):
+32 -32
mcp/uv.lock
··· 691 691 ] 692 692 693 693 [[package]] 694 - name = "leaflet-mcp" 695 - source = { editable = "." } 696 - dependencies = [ 697 - { name = "fastmcp" }, 698 - { name = "httpx" }, 699 - { name = "pdsx" }, 700 - ] 701 - 702 - [package.dev-dependencies] 703 - dev = [ 704 - { name = "pytest" }, 705 - { name = "pytest-asyncio" }, 706 - { name = "pytest-sugar" }, 707 - { name = "ruff" }, 708 - ] 709 - 710 - [package.metadata] 711 - requires-dist = [ 712 - { name = "fastmcp", specifier = ">=2.0" }, 713 - { name = "httpx", specifier = ">=0.28" }, 714 - { name = "pdsx", git = "https://github.com/zzstoatzz/pdsx.git" }, 715 - ] 716 - 717 - [package.metadata.requires-dev] 718 - dev = [ 719 - { name = "pytest", specifier = ">=8.3.0" }, 720 - { name = "pytest-asyncio", specifier = ">=0.25.0" }, 721 - { name = "pytest-sugar" }, 722 - { name = "ruff", specifier = ">=0.12.0" }, 723 - ] 724 - 725 - [[package]] 726 694 name = "libipld" 727 695 version = "3.3.2" 728 696 source = { registry = "https://pypi.org/simple" } ··· 1075 1043 sdist = { url = "https://files.pythonhosted.org/packages/23/53/3edb5d68ecf6b38fcbcc1ad28391117d2a322d9a1a3eff04bfdb184d8c3b/prometheus_client-0.23.1.tar.gz", hash = "sha256:6ae8f9081eaaaf153a2e959d2e6c4f4fb57b12ef76c8c7980202f1e57b48b2ce", size = 80481, upload-time = "2025-09-18T20:47:25.043Z" } 1076 1044 wheels = [ 1077 1045 { url = "https://files.pythonhosted.org/packages/b8/db/14bafcb4af2139e046d03fd00dea7873e48eafe18b7d2797e73d6681f210/prometheus_client-0.23.1-py3-none-any.whl", hash = "sha256:dd1913e6e76b59cfe44e7a4b83e01afc9873c1bdfd2ed8739f1e76aeca115f99", size = 61145, upload-time = "2025-09-18T20:47:23.875Z" }, 1046 + ] 1047 + 1048 + [[package]] 1049 + name = "pub-search" 1050 + source = { editable = "." } 1051 + dependencies = [ 1052 + { name = "fastmcp" }, 1053 + { name = "httpx" }, 1054 + { name = "pdsx" }, 1055 + ] 1056 + 1057 + [package.dev-dependencies] 1058 + dev = [ 1059 + { name = "pytest" }, 1060 + { name = "pytest-asyncio" }, 1061 + { name = "pytest-sugar" }, 1062 + { name = "ruff" }, 1063 + ] 1064 + 1065 + [package.metadata] 1066 + requires-dist = [ 1067 + { name = "fastmcp", specifier = ">=2.0" }, 1068 + { name = "httpx", specifier = ">=0.28" }, 1069 + { name = "pdsx", git = "https://github.com/zzstoatzz/pdsx.git" }, 1070 + ] 1071 + 1072 + [package.metadata.requires-dev] 1073 + dev = [ 1074 + { name = "pytest", specifier = ">=8.3.0" }, 1075 + { name = "pytest-asyncio", specifier = ">=0.25.0" }, 1076 + { name = "pytest-sugar" }, 1077 + { name = "ruff", specifier = ">=0.12.0" }, 1078 1078 ] 1079 1079 1080 1080 [[package]]
+383
scripts/backfill-pds
··· 1 + #!/usr/bin/env -S uv run --script --quiet 2 + # /// script 3 + # requires-python = ">=3.12" 4 + # dependencies = ["httpx", "pydantic-settings"] 5 + # /// 6 + """ 7 + Backfill records directly from a PDS. 8 + 9 + Usage: 10 + ./scripts/backfill-pds did:plc:mkqt76xvfgxuemlwlx6ruc3w 11 + ./scripts/backfill-pds zat.dev 12 + """ 13 + 14 + import argparse 15 + import json 16 + import os 17 + import sys 18 + 19 + import httpx 20 + from pydantic_settings import BaseSettings, SettingsConfigDict 21 + 22 + 23 + class Settings(BaseSettings): 24 + model_config = SettingsConfigDict( 25 + env_file=os.environ.get("ENV_FILE", ".env"), extra="ignore" 26 + ) 27 + 28 + turso_url: str 29 + turso_token: str 30 + 31 + @property 32 + def turso_host(self) -> str: 33 + url = self.turso_url 34 + if url.startswith("libsql://"): 35 + url = url[len("libsql://") :] 36 + return url 37 + 38 + 39 + def resolve_handle(handle: str) -> str: 40 + """Resolve a handle to a DID.""" 41 + resp = httpx.get( 42 + f"https://bsky.social/xrpc/com.atproto.identity.resolveHandle", 43 + params={"handle": handle}, 44 + timeout=30, 45 + ) 46 + resp.raise_for_status() 47 + return resp.json()["did"] 48 + 49 + 50 + def get_pds_endpoint(did: str) -> str: 51 + """Get PDS endpoint from PLC directory.""" 52 + resp = httpx.get(f"https://plc.directory/{did}", timeout=30) 53 + resp.raise_for_status() 54 + data = resp.json() 55 + for service in data.get("service", []): 56 + if service.get("type") == "AtprotoPersonalDataServer": 57 + return service["serviceEndpoint"] 58 + raise ValueError(f"No PDS endpoint found for {did}") 59 + 60 + 61 + def list_records(pds: str, did: str, collection: str) -> list[dict]: 62 + """List all records from a collection.""" 63 + records = [] 64 + cursor = None 65 + while True: 66 + params = {"repo": did, "collection": collection, "limit": 100} 67 + if cursor: 68 + params["cursor"] = cursor 69 + resp = httpx.get( 70 + f"{pds}/xrpc/com.atproto.repo.listRecords", params=params, timeout=30 71 + ) 72 + resp.raise_for_status() 73 + data = resp.json() 74 + records.extend(data.get("records", [])) 75 + cursor = data.get("cursor") 76 + if not cursor: 77 + break 78 + return records 79 + 80 + 81 + def turso_exec(settings: Settings, sql: str, args: list | None = None) -> None: 82 + """Execute a statement against Turso.""" 83 + stmt = {"sql": sql} 84 + if args: 85 + # Handle None values properly - use null type 86 + stmt["args"] = [] 87 + for a in args: 88 + if a is None: 89 + stmt["args"].append({"type": "null"}) 90 + else: 91 + stmt["args"].append({"type": "text", "value": str(a)}) 92 + 93 + response = httpx.post( 94 + f"https://{settings.turso_host}/v2/pipeline", 95 + headers={ 96 + "Authorization": f"Bearer {settings.turso_token}", 97 + "Content-Type": "application/json", 98 + }, 99 + json={"requests": [{"type": "execute", "stmt": stmt}, {"type": "close"}]}, 100 + timeout=30, 101 + ) 102 + if response.status_code != 200: 103 + print(f"Turso error: {response.text}", file=sys.stderr) 104 + response.raise_for_status() 105 + 106 + 107 + def extract_leaflet_blocks(pages: list) -> str: 108 + """Extract text from leaflet pages/blocks structure.""" 109 + texts = [] 110 + for page in pages: 111 + if not isinstance(page, dict): 112 + continue 113 + blocks = page.get("blocks", []) 114 + for wrapper in blocks: 115 + if not isinstance(wrapper, dict): 116 + continue 117 + block = wrapper.get("block", {}) 118 + if not isinstance(block, dict): 119 + continue 120 + # Extract plaintext from text, header, blockquote, code blocks 121 + block_type 
= block.get("$type", "") 122 + if block_type in ( 123 + "pub.leaflet.blocks.text", 124 + "pub.leaflet.blocks.header", 125 + "pub.leaflet.blocks.blockquote", 126 + "pub.leaflet.blocks.code", 127 + ): 128 + plaintext = block.get("plaintext", "") 129 + if plaintext: 130 + texts.append(plaintext) 131 + # Handle lists 132 + elif block_type == "pub.leaflet.blocks.unorderedList": 133 + texts.extend(extract_list_items(block.get("children", []))) 134 + return " ".join(texts) 135 + 136 + 137 + def extract_list_items(children: list) -> list[str]: 138 + """Recursively extract text from list items.""" 139 + texts = [] 140 + for child in children: 141 + if not isinstance(child, dict): 142 + continue 143 + content = child.get("content", {}) 144 + if isinstance(content, dict): 145 + plaintext = content.get("plaintext", "") 146 + if plaintext: 147 + texts.append(plaintext) 148 + # Recurse into nested children 149 + nested = child.get("children", []) 150 + if nested: 151 + texts.extend(extract_list_items(nested)) 152 + return texts 153 + 154 + 155 + def extract_document(record: dict, collection: str) -> dict | None: 156 + """Extract document fields from a record.""" 157 + value = record.get("value", {}) 158 + 159 + # Get title 160 + title = value.get("title") 161 + if not title: 162 + return None 163 + 164 + # Get content - try textContent (site.standard), then leaflet blocks, then content/text 165 + content = value.get("textContent") or "" 166 + if not content: 167 + # Try leaflet-style pages/blocks 168 + pages = value.get("pages", []) 169 + if pages: 170 + content = extract_leaflet_blocks(pages) 171 + if not content: 172 + # Fall back to simple content/text fields 173 + content = value.get("content") or value.get("text") or "" 174 + if isinstance(content, dict): 175 + # Handle richtext format 176 + content = content.get("text", "") 177 + 178 + # Get created_at 179 + created_at = value.get("createdAt", "") 180 + 181 + # Get publication reference - try "publication" (leaflet) then "site" (site.standard) 182 + publication = value.get("publication") or value.get("site") 183 + publication_uri = None 184 + if publication: 185 + if isinstance(publication, dict): 186 + publication_uri = publication.get("uri") 187 + elif isinstance(publication, str): 188 + publication_uri = publication 189 + 190 + # Get URL path (site.standard.document uses "path" field like "/001") 191 + path = value.get("path") 192 + 193 + # Get tags 194 + tags = value.get("tags", []) 195 + if not isinstance(tags, list): 196 + tags = [] 197 + 198 + # Determine platform from collection (site.standard is a lexicon, not a platform) 199 + if collection.startswith("pub.leaflet"): 200 + platform = "leaflet" 201 + elif collection.startswith("blog.pckt"): 202 + platform = "pckt" 203 + else: 204 + # site.standard.* and others - platform will be detected from publication basePath 205 + platform = "unknown" 206 + 207 + return { 208 + "title": title, 209 + "content": content, 210 + "created_at": created_at, 211 + "publication_uri": publication_uri, 212 + "tags": tags, 213 + "platform": platform, 214 + "collection": collection, 215 + "path": path, 216 + } 217 + 218 + 219 + def main(): 220 + parser = argparse.ArgumentParser(description="Backfill records from a PDS") 221 + parser.add_argument("identifier", help="DID or handle to backfill") 222 + parser.add_argument("--dry-run", action="store_true", help="Show what would be done") 223 + args = parser.parse_args() 224 + 225 + try: 226 + settings = Settings() # type: ignore 227 + except Exception as e: 228 + 
print(f"error loading settings: {e}", file=sys.stderr) 229 + print("required env vars: TURSO_URL, TURSO_TOKEN", file=sys.stderr) 230 + sys.exit(1) 231 + 232 + # Resolve identifier to DID 233 + identifier = args.identifier 234 + if identifier.startswith("did:"): 235 + did = identifier 236 + else: 237 + print(f"resolving handle {identifier}...") 238 + did = resolve_handle(identifier) 239 + print(f" -> {did}") 240 + 241 + # Get PDS endpoint 242 + print(f"looking up PDS for {did}...") 243 + pds = get_pds_endpoint(did) 244 + print(f" -> {pds}") 245 + 246 + # Collections to fetch 247 + collections = [ 248 + "pub.leaflet.document", 249 + "pub.leaflet.publication", 250 + "site.standard.document", 251 + "site.standard.publication", 252 + ] 253 + 254 + total_docs = 0 255 + total_pubs = 0 256 + 257 + for collection in collections: 258 + print(f"fetching {collection}...") 259 + try: 260 + records = list_records(pds, did, collection) 261 + except httpx.HTTPStatusError as e: 262 + if e.response.status_code == 400: 263 + print(f" (no records)") 264 + continue 265 + raise 266 + 267 + if not records: 268 + print(f" (no records)") 269 + continue 270 + 271 + print(f" found {len(records)} records") 272 + 273 + for record in records: 274 + uri = record["uri"] 275 + # Parse rkey from URI: at://did/collection/rkey 276 + parts = uri.split("/") 277 + rkey = parts[-1] 278 + 279 + if collection.endswith(".document"): 280 + doc = extract_document(record, collection) 281 + if not doc: 282 + print(f" skip {uri} (no title)") 283 + continue 284 + 285 + if args.dry_run: 286 + print(f" would insert: {doc['title'][:50]}...") 287 + else: 288 + # Insert document 289 + turso_exec( 290 + settings, 291 + """ 292 + INSERT INTO documents (uri, did, rkey, title, content, created_at, publication_uri, platform, source_collection, path) 293 + VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) 
294 + ON CONFLICT(did, rkey) DO UPDATE SET 295 + uri = excluded.uri, 296 + title = excluded.title, 297 + content = excluded.content, 298 + created_at = excluded.created_at, 299 + publication_uri = excluded.publication_uri, 300 + platform = excluded.platform, 301 + source_collection = excluded.source_collection, 302 + path = excluded.path 303 + """, 304 + [uri, did, rkey, doc["title"], doc["content"], doc["created_at"], doc["publication_uri"], doc["platform"], doc["collection"], doc["path"]], 305 + ) 306 + # Insert tags 307 + for tag in doc["tags"]: 308 + turso_exec( 309 + settings, 310 + "INSERT OR IGNORE INTO document_tags (document_uri, tag) VALUES (?, ?)", 311 + [uri, tag], 312 + ) 313 + # Update FTS index (delete then insert, FTS5 doesn't support ON CONFLICT) 314 + turso_exec(settings, "DELETE FROM documents_fts WHERE uri = ?", [uri]) 315 + turso_exec( 316 + settings, 317 + "INSERT INTO documents_fts (uri, title, content) VALUES (?, ?, ?)", 318 + [uri, doc["title"], doc["content"]], 319 + ) 320 + print(f" indexed: {doc['title'][:50]}...") 321 + total_docs += 1 322 + 323 + elif collection.endswith(".publication"): 324 + value = record["value"] 325 + name = value.get("name", "") 326 + description = value.get("description") 327 + # base_path: try leaflet's "base_path", then strip scheme from site.standard's "url" 328 + base_path = value.get("base_path") 329 + if not base_path: 330 + url = value.get("url") 331 + if url: 332 + # Strip https:// or http:// prefix 333 + if url.startswith("https://"): 334 + base_path = url[len("https://"):] 335 + elif url.startswith("http://"): 336 + base_path = url[len("http://"):] 337 + else: 338 + base_path = url 339 + 340 + if args.dry_run: 341 + print(f" would insert pub: {name}") 342 + else: 343 + turso_exec( 344 + settings, 345 + """ 346 + INSERT INTO publications (uri, did, rkey, name, description, base_path) 347 + VALUES (?, ?, ?, ?, ?, ?) 348 + ON CONFLICT(uri) DO UPDATE SET 349 + name = excluded.name, 350 + description = excluded.description, 351 + base_path = excluded.base_path 352 + """, 353 + [uri, did, rkey, name, description, base_path], 354 + ) 355 + print(f" indexed pub: {name}") 356 + total_pubs += 1 357 + 358 + # post-process: detect platform from publication basePath 359 + if not args.dry_run and (total_docs > 0 or total_pubs > 0): 360 + print("detecting platforms from publication basePath...") 361 + turso_exec( 362 + settings, 363 + """ 364 + UPDATE documents SET platform = 'pckt' 365 + WHERE platform IN ('standardsite', 'unknown') 366 + AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%pckt.blog%') 367 + """, 368 + ) 369 + turso_exec( 370 + settings, 371 + """ 372 + UPDATE documents SET platform = 'leaflet' 373 + WHERE platform IN ('standardsite', 'unknown') 374 + AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%leaflet.pub%') 375 + """, 376 + ) 377 + print(" done") 378 + 379 + print(f"\ndone! {total_docs} documents, {total_pubs} publications") 380 + 381 + 382 + if __name__ == "__main__": 383 + main()
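
for reference, the `extract_leaflet_blocks` / `extract_list_items` pair above flattens leaflet's nested pages/blocks structure into plain text. a minimal sketch of the input shape it handles (the record value below is illustrative, not a real record):

```python
# illustrative pub.leaflet.document "pages" value (shape inferred from the
# extraction code above; real records carry more fields)
pages = [
    {
        "blocks": [
            {"block": {"$type": "pub.leaflet.blocks.header", "plaintext": "hello"}},
            {"block": {"$type": "pub.leaflet.blocks.text", "plaintext": "first paragraph"}},
            {
                "block": {
                    "$type": "pub.leaflet.blocks.unorderedList",
                    "children": [
                        {"content": {"plaintext": "item one"}, "children": []},
                        {"content": {"plaintext": "item two"}, "children": []},
                    ],
                }
            },
        ]
    }
]

# extract_leaflet_blocks(pages) -> "hello first paragraph item one item two"
```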
+109
scripts/enumerate-standard-repos
··· 1 + #!/usr/bin/env -S uv run --script --quiet 2 + # /// script 3 + # requires-python = ">=3.12" 4 + # dependencies = ["httpx"] 5 + # /// 6 + """ 7 + Enumerate repos with site.standard.* records and add them to TAP. 8 + 9 + TAP only signals on one collection, so we use this to discover repos 10 + that use site.standard.publication (pckt, etc) and add them to TAP. 11 + 12 + Usage: 13 + ./scripts/enumerate-standard-repos 14 + ./scripts/enumerate-standard-repos --dry-run 15 + """ 16 + 17 + import argparse 18 + import sys 19 + 20 + import httpx 21 + 22 + RELAY_URL = "https://relay1.us-east.bsky.network" 23 + TAP_URL = "http://leaflet-search-tap.internal:2480" # fly internal network 24 + COLLECTION = "site.standard.publication" 25 + 26 + 27 + def enumerate_repos(relay_url: str, collection: str) -> list[str]: 28 + """Enumerate all repos with records in the given collection.""" 29 + dids = [] 30 + cursor = None 31 + 32 + print(f"enumerating repos with {collection}...") 33 + 34 + while True: 35 + params = {"collection": collection, "limit": 1000} 36 + if cursor: 37 + params["cursor"] = cursor 38 + 39 + resp = httpx.get( 40 + f"{relay_url}/xrpc/com.atproto.sync.listReposByCollection", 41 + params=params, 42 + timeout=60, 43 + ) 44 + resp.raise_for_status() 45 + data = resp.json() 46 + 47 + repos = data.get("repos", []) 48 + for repo in repos: 49 + dids.append(repo["did"]) 50 + 51 + if not repos: 52 + break 53 + 54 + cursor = data.get("cursor") 55 + if not cursor: 56 + break 57 + 58 + print(f" found {len(dids)} repos so far...") 59 + 60 + return dids 61 + 62 + 63 + def add_repos_to_tap(tap_url: str, dids: list[str]) -> None: 64 + """Add repos to TAP for syncing.""" 65 + if not dids: 66 + return 67 + 68 + # batch in chunks of 100 69 + batch_size = 100 70 + for i in range(0, len(dids), batch_size): 71 + batch = dids[i:i + batch_size] 72 + resp = httpx.post( 73 + f"{tap_url}/repos/add", 74 + json={"dids": batch}, 75 + timeout=30, 76 + ) 77 + resp.raise_for_status() 78 + print(f" added batch {i // batch_size + 1}: {len(batch)} repos") 79 + 80 + 81 + def main(): 82 + parser = argparse.ArgumentParser(description="Enumerate and add standard.site repos to TAP") 83 + parser.add_argument("--dry-run", action="store_true", help="Show what would be done") 84 + parser.add_argument("--relay-url", default=RELAY_URL, help="Relay URL") 85 + parser.add_argument("--tap-url", default=TAP_URL, help="TAP URL") 86 + args = parser.parse_args() 87 + 88 + dids = enumerate_repos(args.relay_url, COLLECTION) 89 + print(f"found {len(dids)} repos with {COLLECTION}") 90 + 91 + if not dids: 92 + print("no repos to add") 93 + return 94 + 95 + if args.dry_run: 96 + print("dry run - would add these repos to TAP:") 97 + for did in dids[:10]: 98 + print(f" {did}") 99 + if len(dids) > 10: 100 + print(f" ... and {len(dids) - 10} more") 101 + return 102 + 103 + print(f"adding {len(dids)} repos to TAP...") 104 + add_repos_to_tap(args.tap_url, dids) 105 + print("done!") 106 + 107 + 108 + if __name__ == "__main__": 109 + main()
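
before pointing this at TAP, it can help to preview what the relay actually reports. a standalone sketch of one page of the same `com.atproto.sync.listReposByCollection` call the script paginates (endpoint, params, and response shape are taken from the script; only the one-off invocation is new):

```python
# preview one page of repos advertising site.standard.publication
import httpx

resp = httpx.get(
    "https://relay1.us-east.bsky.network/xrpc/com.atproto.sync.listReposByCollection",
    params={"collection": "site.standard.publication", "limit": 10},
    timeout=60,
)
resp.raise_for_status()
page = resp.json()
for repo in page.get("repos", []):
    print(repo["did"])
print("next cursor:", page.get("cursor"))
```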
+86
scripts/rebuild-pub-fts
··· 1 + #!/usr/bin/env -S uv run --script --quiet 2 + # /// script 3 + # requires-python = ">=3.12" 4 + # dependencies = ["httpx", "pydantic-settings"] 5 + # /// 6 + """Rebuild publications_fts with base_path column for subdomain search.""" 7 + import os 8 + import httpx 9 + from pydantic_settings import BaseSettings, SettingsConfigDict 10 + 11 + 12 + class Settings(BaseSettings): 13 + model_config = SettingsConfigDict( 14 + env_file=os.environ.get("ENV_FILE", ".env"), extra="ignore" 15 + ) 16 + turso_url: str 17 + turso_token: str 18 + 19 + @property 20 + def turso_host(self) -> str: 21 + url = self.turso_url 22 + if url.startswith("libsql://"): 23 + url = url[len("libsql://") :] 24 + return url 25 + 26 + 27 + settings = Settings() # type: ignore 28 + 29 + print("Rebuilding publications_fts with base_path column...") 30 + 31 + response = httpx.post( 32 + f"https://{settings.turso_host}/v2/pipeline", 33 + headers={ 34 + "Authorization": f"Bearer {settings.turso_token}", 35 + "Content-Type": "application/json", 36 + }, 37 + json={ 38 + "requests": [ 39 + {"type": "execute", "stmt": {"sql": "DROP TABLE IF EXISTS publications_fts"}}, 40 + { 41 + "type": "execute", 42 + "stmt": { 43 + "sql": """ 44 + CREATE VIRTUAL TABLE publications_fts USING fts5( 45 + uri UNINDEXED, 46 + name, 47 + description, 48 + base_path 49 + ) 50 + """ 51 + }, 52 + }, 53 + { 54 + "type": "execute", 55 + "stmt": { 56 + "sql": """ 57 + INSERT INTO publications_fts (uri, name, description, base_path) 58 + SELECT uri, name, COALESCE(description, ''), COALESCE(base_path, '') 59 + FROM publications 60 + """ 61 + }, 62 + }, 63 + {"type": "execute", "stmt": {"sql": "SELECT COUNT(*) FROM publications_fts"}}, 64 + {"type": "close"}, 65 + ] 66 + }, 67 + timeout=60, 68 + ) 69 + response.raise_for_status() 70 + data = response.json() 71 + 72 + for i, result in enumerate(data["results"][:-1]): # skip close 73 + if result["type"] == "error": 74 + print(f"Step {i} error: {result['error']}") 75 + elif result["type"] == "ok": 76 + if i == 3: # count query 77 + rows = result["response"]["result"].get("rows", []) 78 + if rows: 79 + count = ( 80 + rows[0][0].get("value", rows[0][0]) 81 + if isinstance(rows[0][0], dict) 82 + else rows[0][0] 83 + ) 84 + print(f"Rebuilt with {count} publications") 85 + 86 + print("Done!")
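
after the rebuild, subdomain search can be spot-checked with an FTS5 `MATCH` against the new `base_path` column, reusing the same Turso `/v2/pipeline` request shape the script sends. a minimal sketch, assuming `TURSO_URL` and `TURSO_TOKEN` are set as above (the query term is just an example):

```python
# spot-check the rebuilt publications_fts; the default FTS5 tokenizer splits
# on punctuation, so matching "pckt" should hit *.pckt.blog base paths
import os
import httpx

host = os.environ["TURSO_URL"].removeprefix("libsql://")
resp = httpx.post(
    f"https://{host}/v2/pipeline",
    headers={"Authorization": f"Bearer {os.environ['TURSO_TOKEN']}"},
    json={
        "requests": [
            {
                "type": "execute",
                "stmt": {
                    "sql": "SELECT uri, name, base_path FROM publications_fts WHERE publications_fts MATCH ? LIMIT 5",
                    "args": [{"type": "text", "value": "pckt"}],
                },
            },
            {"type": "close"},
        ]
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["results"][0])
```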
+5 -12
site/dashboard.html
··· 3 3 <head> 4 4 <meta charset="UTF-8"> 5 5 <meta name="viewport" content="width=device-width, initial-scale=1.0"> 6 - <title>leaflet search / stats</title> 6 + <title>pub search / stats</title> 7 7 <link rel="icon" href="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 32 32'><rect x='4' y='18' width='6' height='10' fill='%231B7340'/><rect x='13' y='12' width='6' height='16' fill='%231B7340'/><rect x='22' y='6' width='6' height='22' fill='%231B7340'/></svg>"> 8 8 <link rel="stylesheet" href="dashboard.css"> 9 9 </head> 10 10 <body> 11 11 <div class="container"> 12 - <h1><a href="https://leaflet-search.pages.dev" class="title">leaflet search</a> <span class="dim">/ stats</span></h1> 12 + <h1><a href="https://pub-search.waow.tech" class="title">pub search</a> <span class="dim">/ stats</span></h1> 13 13 14 14 <section> 15 15 <div class="metrics"> ··· 30 30 </section> 31 31 32 32 <section> 33 - <div class="section-title">documents</div> 33 + <div class="section-title">documents by platform</div> 34 34 <div class="chart-box"> 35 - <div class="doc-row"> 36 - <span class="doc-type">articles</span> 37 - <span class="doc-count" id="articles">--</span> 38 - </div> 39 - <div class="doc-row"> 40 - <span class="doc-type">looseleafs</span> 41 - <span class="doc-count" id="looseleafs">--</span> 42 - </div> 35 + <div id="platforms"></div> 43 36 </div> 44 37 </section> 45 38 ··· 63 56 </section> 64 57 65 58 <footer> 66 - <a href="https://leaflet-search.pages.dev">back</a> ยท source on <a href="https://tangled.sh/@zzstoatzz.io/leaflet-search">tangled</a> 59 + <a href="https://pub-search.waow.tech">back</a> ยท source on <a href="https://tangled.sh/@zzstoatzz.io/leaflet-search">tangled</a> 67 60 </footer> 68 61 </div> 69 62
+14 -3
site/dashboard.js
··· 57 57 if (!tags) return; 58 58 59 59 el.innerHTML = tags.slice(0, 20).map(t => 60 - '<a class="tag" href="https://leaflet-search.pages.dev/?tag=' + encodeURIComponent(t.tag) + '">' + 60 + '<a class="tag" href="https://pub-search.waow.tech/?tag=' + encodeURIComponent(t.tag) + '">' + 61 61 escapeHtml(t.tag) + '<span class="n">' + t.count + '</span></a>' 62 62 ).join(''); 63 + } 64 + 65 + function renderPlatforms(platforms) { 66 + const el = document.getElementById('platforms'); 67 + if (!platforms) return; 68 + 69 + platforms.forEach(p => { 70 + const row = document.createElement('div'); 71 + row.className = 'doc-row'; 72 + row.innerHTML = '<span class="doc-type">' + escapeHtml(p.platform) + '</span><span class="doc-count">' + p.count + '</span>'; 73 + el.appendChild(row); 74 + }); 63 75 } 64 76 65 77 function escapeHtml(str) { ··· 83 95 84 96 document.getElementById('searches').textContent = data.searches; 85 97 document.getElementById('publications').textContent = data.publications; 86 - document.getElementById('articles').textContent = data.articles; 87 - document.getElementById('looseleafs').textContent = data.looseleafs; 88 98 99 + renderPlatforms(data.platforms); 89 100 renderTimeline(data.timeline); 90 101 renderPubs(data.topPubs); 91 102 renderTags(data.tags);
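
the dashboard now renders per-platform counts, which means the `/api/dashboard` payload is expected to carry a `platforms` array alongside the existing fields. a quick sanity check of that payload (field names are inferred from the JS above, not from the backend itself):

```python
# print the fields the updated dashboard consumes; field names inferred from
# dashboard.js - adjust if the backend payload differs
import httpx

data = httpx.get(
    "https://leaflet-search-backend.fly.dev/api/dashboard", timeout=30
).json()
print("publications:", data.get("publications"))
print("searches:", data.get("searches"))
for p in data.get("platforms", []):
    print(f"  {p['platform']}: {p['count']}")
```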
+316 -44
site/index.html
··· 4 4 <meta charset="UTF-8"> 5 5 <meta name="viewport" content="width=device-width, initial-scale=1.0"> 6 6 <link rel="icon" type="image/svg+xml" href="/favicon.svg"> 7 - <title>leaflet search</title> 8 - <meta name="description" content="search for leaflet"> 9 - <meta property="og:title" content="leaflet search"> 10 - <meta property="og:description" content="search for leaflet"> 7 + <title>pub search</title> 8 + <meta name="description" content="search atproto publishing platforms"> 9 + <meta property="og:title" content="pub search"> 10 + <meta property="og:description" content="search atproto publishing platforms"> 11 11 <meta property="og:type" content="website"> 12 12 <meta name="twitter:card" content="summary"> 13 - <meta name="twitter:title" content="leaflet search"> 14 - <meta name="twitter:description" content="search for leaflet"> 13 + <meta name="twitter:title" content="pub search"> 14 + <meta name="twitter:description" content="search atproto publishing platforms"> 15 15 <style> 16 16 * { box-sizing: border-box; margin: 0; padding: 0; } 17 17 ··· 75 75 flex: 1; 76 76 padding: 0.5rem; 77 77 font-family: monospace; 78 - font-size: 14px; 78 + font-size: 16px; /* prevents iOS auto-zoom on focus */ 79 79 background: #111; 80 80 border: 1px solid #333; 81 81 color: #ccc; ··· 111 111 .result-title { 112 112 color: #fff; 113 113 margin-bottom: 0.5rem; 114 + /* prevent long titles from breaking layout */ 115 + display: -webkit-box; 116 + -webkit-line-clamp: 2; 117 + -webkit-box-orient: vertical; 118 + overflow: hidden; 119 + word-break: break-word; 114 120 } 115 121 116 122 .result-title a { color: inherit; } ··· 325 331 margin-left: 4px; 326 332 } 327 333 334 + .platform-filter { 335 + margin-bottom: 1rem; 336 + } 337 + 338 + .platform-filter-label { 339 + font-size: 11px; 340 + color: #444; 341 + margin-bottom: 0.5rem; 342 + } 343 + 344 + .platform-filter-list { 345 + display: flex; 346 + gap: 0.5rem; 347 + } 348 + 349 + .platform-option { 350 + font-size: 11px; 351 + padding: 3px 8px; 352 + background: #151515; 353 + border: 1px solid #252525; 354 + border-radius: 3px; 355 + cursor: pointer; 356 + color: #777; 357 + } 358 + 359 + .platform-option:hover { 360 + background: #1a1a1a; 361 + border-color: #333; 362 + color: #aaa; 363 + } 364 + 365 + .platform-option.active { 366 + background: rgba(180, 100, 64, 0.2); 367 + border-color: #d4956a; 368 + color: #d4956a; 369 + } 370 + 328 371 .active-filter { 329 372 display: flex; 330 373 align-items: center; ··· 346 389 .active-filter .clear:hover { 347 390 color: #c44; 348 391 } 392 + 393 + /* mobile improvements */ 394 + @media (max-width: 600px) { 395 + body { 396 + padding: 0.75rem; 397 + font-size: 13px; 398 + } 399 + 400 + .container { 401 + max-width: 100%; 402 + } 403 + 404 + /* ensure minimum 44px touch targets */ 405 + .tag, .platform-option, .suggestion { 406 + min-height: 44px; 407 + display: inline-flex; 408 + align-items: center; 409 + padding: 0.5rem 0.75rem; 410 + } 411 + 412 + button { 413 + min-height: 44px; 414 + padding: 0.5rem 0.75rem; 415 + } 416 + 417 + /* stack search box on very small screens */ 418 + .search-box { 419 + flex-direction: column; 420 + gap: 0.5rem; 421 + } 422 + 423 + .search-box input[type="text"] { 424 + width: 100%; 425 + } 426 + 427 + .search-box button { 428 + width: 100%; 429 + } 430 + 431 + /* result card mobile tweaks */ 432 + .result { 433 + padding: 0.75rem 0; 434 + } 435 + 436 + .result:hover { 437 + margin: 0 -0.75rem; 438 + padding: 0.75rem; 439 + } 440 + 441 + .result-title { 442 + 
font-size: 14px; 443 + line-height: 1.4; 444 + } 445 + 446 + .result-snippet { 447 + font-size: 12px; 448 + line-height: 1.5; 449 + } 450 + 451 + /* badges inline on mobile */ 452 + .entity-type, .platform-badge { 453 + font-size: 9px; 454 + padding: 2px 5px; 455 + margin-right: 6px; 456 + vertical-align: middle; 457 + } 458 + 459 + /* tags wrap better on mobile */ 460 + .tags-list, .platform-filter-list { 461 + gap: 0.5rem; 462 + } 463 + 464 + /* suggestions responsive */ 465 + .suggestions { 466 + line-height: 2; 467 + } 468 + 469 + /* related items more compact */ 470 + .related-item { 471 + max-width: 150px; 472 + font-size: 11px; 473 + padding: 0.5rem; 474 + } 475 + } 476 + 477 + /* ensure touch targets on tablets too */ 478 + @media (hover: none) and (pointer: coarse) { 479 + .tag, .platform-option, .suggestion, .related-item { 480 + min-height: 44px; 481 + display: inline-flex; 482 + align-items: center; 483 + } 484 + } 349 485 </style> 350 486 </head> 351 487 <body> 352 488 <div class="container"> 353 - <h1><a href="/" class="title">leaflet search</a> <span class="by">by <a href="https://bsky.app/profile/zzstoatzz.io" target="_blank">@zzstoatzz.io</a></span> <a href="https://tangled.sh/@zzstoatzz.io/leaflet-search" target="_blank" class="src">[src]</a></h1> 489 + <h1><a href="/" class="title">pub search</a> <span class="by">by <a href="https://bsky.app/profile/zzstoatzz.io" target="_blank">@zzstoatzz.io</a></span> <a href="https://tangled.sh/@zzstoatzz.io/leaflet-search" target="_blank" class="src">[src]</a></h1> 354 490 355 491 <div class="search-box"> 356 492 <input type="text" id="query" placeholder="search content..." autofocus> ··· 363 499 364 500 <div id="tags" class="tags"></div> 365 501 502 + <div id="platform-filter" class="platform-filter"></div> 503 + 366 504 <div id="results" class="results"> 367 505 <div class="empty-state"> 368 - <p>search for <a href="https://leaflet.pub" target="_blank">leaflet.pub</a></p> 506 + <p>search atproto publishing platforms</p> 507 + <p style="font-size:11px;margin-top:0.5rem"><a href="https://leaflet.pub" target="_blank">leaflet</a> ยท <a href="https://pckt.blog" target="_blank">pckt</a> ยท <a href="https://standard.site" target="_blank">standard.site</a></p> 369 508 </div> 370 509 </div> 371 510 ··· 384 523 const tagsDiv = document.getElementById('tags'); 385 524 const activeFilterDiv = document.getElementById('active-filter'); 386 525 const suggestionsDiv = document.getElementById('suggestions'); 526 + const platformFilterDiv = document.getElementById('platform-filter'); 387 527 388 528 let currentTag = null; 529 + let currentPlatform = null; 389 530 let allTags = []; 390 531 let popularSearches = []; 391 532 392 - async function search(query, tag = null) { 393 - if (!query.trim() && !tag) return; 533 + async function search(query, tag = null, platform = null) { 534 + if (!query.trim() && !tag && !platform) return; 394 535 395 536 searchBtn.disabled = true; 396 537 let searchUrl = `${API_URL}/search?q=${encodeURIComponent(query || '')}`; 397 538 if (tag) searchUrl += `&tag=${encodeURIComponent(tag)}`; 539 + if (platform) searchUrl += `&platform=${encodeURIComponent(platform)}`; 398 540 resultsDiv.innerHTML = `<div class="status">searching...</div>`; 399 541 400 542 try { ··· 417 559 if (results.length === 0) { 418 560 resultsDiv.innerHTML = ` 419 561 <div class="empty-state"> 420 - <p>no results${query ? ` for "${escapeHtml(query)}"` : ''}${tag ? ` in #${escapeHtml(tag)}` : ''}</p> 562 + <p>no results${query ? 
` for ${formatQueryForDisplay(query)}` : ''}${tag ? ` in #${escapeHtml(tag)}` : ''}${platform ? ` on ${escapeHtml(platform)}` : ''}</p> 421 563 <p>try different keywords</p> 422 564 </div> 423 565 `; ··· 429 571 430 572 for (const doc of results) { 431 573 const entityType = doc.type || 'article'; 432 - 433 - // build URL based on entity type 434 - let leafletUrl = null; 435 - if (entityType === 'publication') { 436 - // publications link to their base path 437 - leafletUrl = doc.basePath ? `https://${doc.basePath}` : null; 438 - } else { 439 - // articles and looseleafs link to specific document 440 - leafletUrl = doc.basePath && doc.rkey 441 - ? `https://${doc.basePath}/${doc.rkey}` 442 - : (doc.did && doc.rkey ? `https://leaflet.pub/p/${doc.did}/${doc.rkey}` : null); 443 - } 574 + const platform = doc.platform || 'leaflet'; 444 575 576 + // build URL based on entity type and platform 577 + const docUrl = buildDocUrl(doc, entityType, platform); 578 + // only show platform badge for actual platforms, not for lexicon-only records 579 + const platformConfig = PLATFORM_CONFIG[platform]; 580 + const platformBadge = platformConfig 581 + ? `<span class="platform-badge">${escapeHtml(platformConfig.label)}</span>` 582 + : ''; 445 583 const date = doc.createdAt ? new Date(doc.createdAt).toLocaleDateString() : ''; 446 - const platform = doc.platform || 'leaflet'; 447 - const platformBadge = platform !== 'leaflet' ? `<span class="platform-badge">${escapeHtml(platform)}</span>` : ''; 584 + 585 + // platform home URL for meta link 586 + const platformHome = getPlatformHome(platform, doc.basePath); 587 + 448 588 html += ` 449 589 <div class="result"> 450 590 <div class="result-title"> 451 591 <span class="entity-type ${entityType}">${entityType}</span>${platformBadge} 452 - ${leafletUrl 453 - ? `<a href="${leafletUrl}" target="_blank">${escapeHtml(doc.title || 'Untitled')}</a>` 592 + ${docUrl 593 + ? `<a href="${docUrl}" target="_blank">${escapeHtml(doc.title || 'Untitled')}</a>` 454 594 : escapeHtml(doc.title || 'Untitled')} 455 595 </div> 456 596 <div class="result-snippet">${highlightTerms(doc.snippet, query)}</div> 457 597 <div class="result-meta"> 458 - ${date ? `${date} | ` : ''}${doc.basePath 459 - ? `<a href="https://${doc.basePath}" target="_blank">${doc.basePath}</a>` 460 - : `<a href="https://leaflet.pub" target="_blank">leaflet.pub</a>`} 598 + ${date ? `${date} | ` : ''}${platformHome.url 599 + ? 
`<a href="${platformHome.url}" target="_blank">${platformHome.label}</a>` 600 + : platformHome.label} 461 601 </div> 462 602 </div> 463 603 `; ··· 485 625 })[c]); 486 626 } 487 627 628 + // display query without adding redundant quotes 629 + function formatQueryForDisplay(query) { 630 + if (!query) return ''; 631 + const escaped = escapeHtml(query); 632 + // if query is already fully quoted, don't add more quotes 633 + if (query.startsWith('"') && query.endsWith('"')) { 634 + return escaped; 635 + } 636 + return `"${escaped}"`; 637 + } 638 + 639 + // platform-specific URL patterns 640 + // note: some platforms use basePath from publication, which we prefer 641 + // fallback docUrl() is used when basePath is missing 642 + const PLATFORM_CONFIG = { 643 + leaflet: { 644 + home: 'https://leaflet.pub', 645 + label: 'leaflet.pub', 646 + // leaflet uses did/rkey pattern for fallback URLs 647 + docUrl: (did, rkey) => `https://leaflet.pub/p/${did}/${rkey}` 648 + }, 649 + pckt: { 650 + home: 'https://pckt.blog', 651 + label: 'pckt.blog', 652 + // pckt uses blog slugs + path, not did/rkey - needs basePath from publication 653 + docUrl: null 654 + }, 655 + offprint: { 656 + home: 'https://offprint.app', 657 + label: 'offprint.app', 658 + // offprint is in early beta, URL pattern unknown 659 + docUrl: null 660 + }, 661 + }; 662 + 663 + function buildDocUrl(doc, entityType, platform) { 664 + if (entityType === 'publication') { 665 + return doc.basePath ? `https://${doc.basePath}` : null; 666 + } 667 + 668 + // Platform-specific URL patterns: 669 + // 1. Leaflet: basePath + rkey (e.g., https://dad.leaflet.pub/3mburumcnbs2m) 670 + if (platform === 'leaflet' && doc.basePath && doc.rkey) { 671 + return `https://${doc.basePath}/${doc.rkey}`; 672 + } 673 + 674 + // 2. pckt: basePath + path (e.g., https://devlog.pckt.blog/some-slug-abc123) 675 + if (platform === 'pckt' && doc.basePath && doc.path) { 676 + return `https://${doc.basePath}${doc.path}`; 677 + } 678 + 679 + // 3. Other platforms with path: basePath + path 680 + if (doc.basePath && doc.path) { 681 + return `https://${doc.basePath}${doc.path}`; 682 + } 683 + 684 + // 4. Platform-specific fallback URL (e.g., leaflet.pub/p/did/rkey) 685 + const config = PLATFORM_CONFIG[platform]; 686 + if (config?.docUrl && doc.did && doc.rkey) { 687 + return config.docUrl(doc.did, doc.rkey); 688 + } 689 + 690 + // 5. Fallback: pdsls.dev universal viewer (always works for any AT Protocol record) 691 + if (doc.uri) { 692 + return `https://pdsls.dev/${doc.uri}`; 693 + } 694 + 695 + return null; 696 + } 697 + 698 + function getPlatformHome(platform, basePath) { 699 + if (basePath) { 700 + return { url: `https://${basePath}`, label: basePath }; 701 + } 702 + const config = PLATFORM_CONFIG[platform]; 703 + if (config) { 704 + return { url: config.home, label: config.label }; 705 + } 706 + // unknown platform using standard.site lexicon - link to standard.site 707 + return { url: 'https://standard.site', label: 'standard.site' }; 708 + } 709 + 488 710 function highlightTerms(text, query) { 489 711 if (!text || !query) return escapeHtml(text); 490 712 const terms = query.toLowerCase().split(/\s+/).filter(t => t.length > 0); ··· 503 725 const q = queryInput.value.trim(); 504 726 if (q) params.set('q', q); 505 727 if (currentTag) params.set('tag', currentTag); 728 + if (currentPlatform) params.set('platform', currentPlatform); 506 729 const url = params.toString() ? 
`?${params}` : '/'; 507 730 history.pushState(null, '', url); 508 731 } 509 732 510 733 function doSearch() { 511 734 updateUrl(); 512 - search(queryInput.value, currentTag); 735 + search(queryInput.value, currentTag, currentPlatform); 513 736 } 514 737 515 738 function setTag(tag) { 739 + if (currentTag === tag) { 740 + clearTag(); 741 + return; 742 + } 516 743 currentTag = tag; 517 744 renderActiveFilter(); 518 745 renderTags(); ··· 524 751 renderActiveFilter(); 525 752 renderTags(); 526 753 updateUrl(); 527 - if (queryInput.value.trim()) { 528 - search(queryInput.value, null); 754 + if (queryInput.value.trim() || currentPlatform) { 755 + search(queryInput.value, null, currentPlatform); 529 756 } else { 530 757 renderEmptyState(); 531 758 } 532 759 } 533 760 761 + function setPlatform(platform) { 762 + if (currentPlatform === platform) { 763 + clearPlatform(); 764 + return; 765 + } 766 + currentPlatform = platform; 767 + renderActiveFilter(); 768 + renderPlatformFilter(); 769 + doSearch(); 770 + } 771 + 772 + function clearPlatform() { 773 + currentPlatform = null; 774 + renderActiveFilter(); 775 + renderPlatformFilter(); 776 + updateUrl(); 777 + if (queryInput.value.trim() || currentTag) { 778 + search(queryInput.value, currentTag, null); 779 + } else { 780 + renderEmptyState(); 781 + } 782 + } 783 + 784 + function renderPlatformFilter() { 785 + const platforms = [ 786 + { id: 'leaflet', label: 'leaflet' }, 787 + { id: 'pckt', label: 'pckt' }, 788 + ]; 789 + const html = platforms.map(p => ` 790 + <span class="platform-option${currentPlatform === p.id ? ' active' : ''}" onclick="setPlatform('${p.id}')">${p.label}</span> 791 + `).join(''); 792 + platformFilterDiv.innerHTML = `<div class="platform-filter-label">filter by platform:</div><div class="platform-filter-list">${html}</div>`; 793 + } 794 + 534 795 function renderActiveFilter() { 535 - if (!currentTag) { 796 + if (!currentTag && !currentPlatform) { 536 797 activeFilterDiv.innerHTML = ''; 537 798 return; 538 799 } 800 + let parts = []; 801 + if (currentTag) parts.push(`tag: <strong>#${escapeHtml(currentTag)}</strong>`); 802 + if (currentPlatform) parts.push(`platform: <strong>${escapeHtml(currentPlatform)}</strong>`); 803 + const clearActions = []; 804 + if (currentTag) clearActions.push(`<span class="clear" onclick="clearTag()">ร— tag</span>`); 805 + if (currentPlatform) clearActions.push(`<span class="clear" onclick="clearPlatform()">ร— platform</span>`); 539 806 activeFilterDiv.innerHTML = ` 540 807 <div class="active-filter"> 541 - <span>filtering by tag: <strong>#${escapeHtml(currentTag)}</strong> <span style="color:#666;font-size:10px">(documents only)</span></span> 542 - <span class="clear" onclick="clearTag()">ร— clear</span> 808 + <span>filtering by ${parts.join(', ')} <span style="color:#666;font-size:10px">(documents only)</span></span> 809 + ${clearActions.join(' ')} 543 810 </div> 544 811 `; 545 812 } ··· 601 868 function renderEmptyState() { 602 869 resultsDiv.innerHTML = ` 603 870 <div class="empty-state"> 604 - <p>search for <a href="https://leaflet.pub" target="_blank">leaflet.pub</a></p> 871 + <p>search atproto publishing platforms</p> 872 + <p style="font-size:11px;margin-top:0.5rem"><a href="https://leaflet.pub" target="_blank">leaflet</a> ยท <a href="https://pckt.blog" target="_blank">pckt</a> ยท <a href="https://standard.site" target="_blank">standard.site</a></p> 605 873 </div> 606 874 `; 607 875 } ··· 620 888 const params = new URLSearchParams(location.search); 621 889 queryInput.value = params.get('q') || 
''; 622 890 currentTag = params.get('tag') || null; 891 + currentPlatform = params.get('platform') || null; 623 892 renderActiveFilter(); 624 893 renderTags(); 625 - if (queryInput.value || currentTag) search(queryInput.value, currentTag); 894 + renderPlatformFilter(); 895 + if (queryInput.value || currentTag || currentPlatform) search(queryInput.value, currentTag, currentPlatform); 626 896 }); 627 897 628 898 // init 629 899 const initialParams = new URLSearchParams(location.search); 630 900 const initialQuery = initialParams.get('q'); 631 901 const initialTag = initialParams.get('tag'); 902 + const initialPlatform = initialParams.get('platform'); 632 903 if (initialQuery) queryInput.value = initialQuery; 633 904 if (initialTag) currentTag = initialTag; 905 + if (initialPlatform) currentPlatform = initialPlatform; 634 906 renderActiveFilter(); 907 + renderPlatformFilter(); 635 908 636 - if (initialQuery || initialTag) { 637 - search(initialQuery || '', initialTag); 909 + if (initialQuery || initialTag || initialPlatform) { 910 + search(initialQuery || '', initialTag, initialPlatform); 638 911 } 639 912 640 913 async function loadRelated(topResult) { ··· 660 933 if (filtered.length === 0) return; 661 934 662 935 const items = filtered.map(doc => { 663 - const url = doc.basePath && doc.rkey 664 - ? `https://${doc.basePath}/${doc.rkey}` 665 - : (doc.did && doc.rkey ? `https://leaflet.pub/p/${doc.did}/${doc.rkey}` : null); 936 + const platform = doc.platform || 'leaflet'; 937 + const url = buildDocUrl(doc, doc.type || 'article', platform); 666 938 return url 667 939 ? `<a href="${url}" target="_blank" class="related-item">${escapeHtml(doc.title || 'Untitled')}</a>` 668 940 : `<span class="related-item">${escapeHtml(doc.title || 'Untitled')}</span>`;
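
the `buildDocUrl` fallback chain above (leaflet: basePath + rkey; pckt/standard.site: basePath + path; then a per-platform fallback; then the pdsls.dev record viewer) is useful outside the frontend too, e.g. when printing links from scripts. a sketch that mirrors it in Python; the canonical logic stays in `site/index.html`:

```python
# Python mirror of the frontend buildDocUrl() fallback chain (sketch only);
# expects the same field names the search API returns to the frontend
def build_doc_url(doc: dict, entity_type: str = "article") -> str | None:
    platform = doc.get("platform") or "leaflet"
    base_path, rkey, path = doc.get("basePath"), doc.get("rkey"), doc.get("path")

    if entity_type == "publication":
        return f"https://{base_path}" if base_path else None
    if platform == "leaflet" and base_path and rkey:
        return f"https://{base_path}/{rkey}"
    if base_path and path:
        # pckt and other standard.site platforms publish a "path" like "/001"
        return f"https://{base_path}{path}"
    if platform == "leaflet" and doc.get("did") and rkey:
        return f"https://leaflet.pub/p/{doc['did']}/{rkey}"
    if doc.get("uri"):
        # universal fallback: pdsls.dev can render any AT Protocol record
        return f"https://pdsls.dev/{doc['uri']}"
    return None
```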
+32 -40
site/loading.js
··· 82 82 const style = document.createElement('style'); 83 83 style.id = 'loader-styles'; 84 84 style.textContent = ` 85 - /* skeleton shimmer for loading values */ 85 + /* skeleton shimmer - subtle pulse */ 86 86 .loading .metric-value, 87 87 .loading .doc-count, 88 88 .loading .pub-count { 89 - background: linear-gradient(90deg, #1a1a1a 25%, #252525 50%, #1a1a1a 75%); 90 - background-size: 200% 100%; 91 - animation: shimmer 1.5s infinite; 92 - border-radius: 3px; 93 - color: transparent !important; 94 - min-width: 3ch; 95 - display: inline-block; 89 + color: #333 !important; 90 + animation: dim-pulse 2s ease-in-out infinite; 96 91 } 97 92 98 - @keyframes shimmer { 99 - 0% { background-position: 200% 0; } 100 - 100% { background-position: -200% 0; } 93 + @keyframes dim-pulse { 94 + 0%, 100% { opacity: 0.3; } 95 + 50% { opacity: 0.6; } 101 96 } 102 97 103 - /* wake message */ 98 + /* wake message - terminal style, ephemeral */ 104 99 .wake-message { 105 100 position: fixed; 106 - top: 1rem; 107 - right: 1rem; 101 + bottom: 1rem; 102 + left: 1rem; 103 + font-family: monospace; 108 104 font-size: 11px; 109 - color: #666; 110 - background: #111; 111 - border: 1px solid #222; 112 - padding: 6px 12px; 113 - border-radius: 4px; 114 - display: flex; 115 - align-items: center; 116 - gap: 8px; 105 + color: #444; 117 106 z-index: 1000; 118 - animation: fade-in 0.2s ease; 107 + animation: fade-in 0.5s ease; 108 + } 109 + 110 + .wake-message::before { 111 + content: '>'; 112 + margin-right: 6px; 113 + opacity: 0.5; 119 114 } 120 115 121 116 .wake-dot { 122 - width: 6px; 123 - height: 6px; 124 - background: #4ade80; 117 + display: inline-block; 118 + width: 4px; 119 + height: 4px; 120 + background: #555; 125 121 border-radius: 50%; 126 - animation: pulse-dot 1s infinite; 122 + margin-left: 4px; 123 + animation: blink 1s step-end infinite; 127 124 } 128 125 129 - @keyframes pulse-dot { 130 - 0%, 100% { opacity: 0.3; } 131 - 50% { opacity: 1; } 126 + @keyframes blink { 127 + 0%, 100% { opacity: 1; } 128 + 50% { opacity: 0; } 132 129 } 133 130 134 131 @keyframes fade-in { 135 - from { opacity: 0; transform: translateY(-4px); } 136 - to { opacity: 1; transform: translateY(0); } 132 + from { opacity: 0; } 133 + to { opacity: 1; } 137 134 } 138 135 139 136 .wake-message.fade-out { 140 - animation: fade-out 0.3s ease forwards; 137 + animation: fade-out 0.5s ease forwards; 141 138 } 142 139 143 140 @keyframes fade-out { 144 - to { opacity: 0; transform: translateY(-4px); } 141 + to { opacity: 0; } 145 142 } 146 143 147 144 /* loaded transition */ 148 145 .loaded .metric-value, 149 146 .loaded .doc-count, 150 147 .loaded .pub-count { 151 - animation: reveal 0.3s ease; 152 - } 153 - 154 - @keyframes reveal { 155 - from { opacity: 0; } 156 - to { opacity: 1; } 148 + animation: none; 157 149 } 158 150 `; 159 151 document.head.appendChild(style);
+6 -3
tap/fly.toml
··· 1 1 app = 'leaflet-search-tap' 2 - primary_region = 'iad' 2 + primary_region = 'ewr' 3 3 4 4 [build] 5 5 image = 'ghcr.io/bluesky-social/indigo/tap:latest' ··· 9 9 TAP_BIND = ':2480' 10 10 TAP_RELAY_URL = 'https://relay1.us-east.bsky.network' 11 11 TAP_SIGNAL_COLLECTION = 'pub.leaflet.document' 12 - TAP_COLLECTION_FILTERS = 'pub.leaflet.document,pub.leaflet.publication' 13 - TAP_DISABLE_ACKS = 'true' 12 + TAP_COLLECTION_FILTERS = 'pub.leaflet.document,pub.leaflet.publication,site.standard.document,site.standard.publication' 14 13 TAP_LOG_LEVEL = 'info' 14 + TAP_RESYNC_PARALLELISM = '1' 15 + TAP_FIREHOSE_PARALLELISM = '5' 16 + TAP_OUTBOX_CAPACITY = '10000' 17 + TAP_IDENT_CACHE_SIZE = '10000' 15 18 TAP_CURSOR_SAVE_INTERVAL = '5s' 16 19 TAP_REPO_FETCH_TIMEOUT = '600s' 17 20
+36
tap/justfile
··· 1 1 # tap instance for leaflet-search 2 2 3 + # get machine id 4 + _machine_id := `fly status --app leaflet-search-tap --json 2>/dev/null | jq -r '.Machines[0].id'` 5 + 6 + # crank up parallelism for faster catch-up (uses more memory + CPU) 7 + turbo: 8 + @echo "Switching to TURBO mode (4GB, 2 CPUs, higher parallelism)..." 9 + fly machine update {{ _machine_id }} --app leaflet-search-tap \ 10 + --vm-memory 4096 \ 11 + --vm-cpus 2 \ 12 + -e TAP_RESYNC_PARALLELISM=4 \ 13 + -e TAP_FIREHOSE_PARALLELISM=10 \ 14 + --yes 15 + @echo "TURBO mode enabled. Run 'just normal' when caught up." 16 + 17 + # restore normal settings (lower memory, conservative parallelism) 18 + normal: 19 + @echo "Switching to NORMAL mode (2GB, 1 CPU, conservative parallelism)..." 20 + fly machine update {{ _machine_id }} --app leaflet-search-tap \ 21 + --vm-memory 2048 \ 22 + --vm-cpus 1 \ 23 + -e TAP_RESYNC_PARALLELISM=1 \ 24 + -e TAP_FIREHOSE_PARALLELISM=5 \ 25 + --yes 26 + @echo "NORMAL mode restored." 27 + 28 + # check indexing status - shows most recent indexed documents 29 + check: 30 + @echo "=== tap status ===" 31 + @fly status --app leaflet-search-tap 2>/dev/null | grep -E "(STATE|started|stopped)" 32 + @echo "" 33 + @echo "=== Recent Indexing Activity ===" 34 + @curl -s https://leaflet-search-backend.fly.dev/api/dashboard | jq -r '"Last indexed: \(.timeline[0].date) (\(.timeline[0].count) docs)\nToday: '$(date +%Y-%m-%d)'\nDocs: \(.documents) | Pubs: \(.publications)"' 35 + @echo "" 36 + @echo "=== Timeline (last 7 days) ===" 37 + @curl -s https://leaflet-search-backend.fly.dev/api/dashboard | jq -r '.timeline[:7][] | "\(.date): \(.count) docs"' 38 + 3 39 deploy: 4 40 fly deploy --app leaflet-search-tap 5 41